raspberrypilearning · jamdelion · Apr 30, 2026 · Apr 30, 2026 · May 5, 2026 · May 5, 2026
diff --git a/README.md b/README.md
@@ -4,6 +4,10 @@
 
 Note - NTTT will work on Windows, macOS and Linux.
 
+## Documentation
+
+For maintainers, [doc/transformations.md](doc/transformations.md) describes what NTTT changes in `meta.yml` and Markdown files (sections, HTML, formatting, URLs, and related behaviour).
+
 ## Prerequisites
 
 The tool requires having Python 3.7 or newer. 
@@ -61,6 +65,14 @@ pip3 install . --upgrade
 
 ![install nttt](images/install_nttt.png)
 
+You can also install NTTT with `pipx` (the instructions below are for macOS using Homebrew):
+
+```bash
+brew install pipx
+pipx install /path/to/project/nttt
+nttt --help
+```
+
 You can uninstall nttt using:
 
 ```bash
@@ -102,6 +114,28 @@ You can specify different directories for the input and output folder using the
 nttt --input c:\path\to\project\de-DE --output c:\path\to\project\de-DE-tidy
 ```
 
+### Crowdin marker stripping and restoring
+
+NTTT has three processing modes:
+
+- `tidy` (default): restore stripped Markdown markers for non-English locale folders, then run the existing tidy-up transforms.
+- `strip`: remove non-translatable Markdown markers before uploading English source files to Crowdin.
+- `restore`: reinsert stripped Markdown markers into translated files after downloading from Crowdin.
+
+Use `strip` on the English source folder before Crowdin upload:
+
+```bash
+nttt --mode strip -i en -o en -Y on
+```
+
+Use `restore` on a translated locale folder after Crowdin download:
+
+```bash
+nttt --mode restore -i de-DE -e en -o de-DE -Y on
+```
+
+Modern bare markers such as `> [!TASK]` are removed entirely, along with their paired empty `>` line. Modern labelled markers such as `> [!ACCORDION] Where are my voice recordings stored?` keep the label available for translation by becoming `> Where are my voice recordings stored?`; restore reinserts `[!ACCORDION]` before the translated label. Legacy markers such as `--- task ---` and `--- /task ---` are also removed and restored by line alignment against `en/`.
+
 ### Help
 
 To bring up full usage information use the `-h`/`--help` option.

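Reviewer sketch: the strip behaviour described in the README paragraph above (bare markers and their paired empty `>` line removed, labelled markers reduced to their label, legacy markers removed) can be illustrated roughly as follows. The patterns are copied from `nttt/markers.py` in this change; `strip_markers_sketch` is a hypothetical stand-in, not the code in `nttt/strip.py`, and it assumes lines without trailing newlines and ignores fenced code blocks.

```python
import re

# Patterns copied from nttt/markers.py in this change.
MODERN_BARE = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s*$'
)
MODERN_LABELLED = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s+(?P<label>\S.*?)\s*$'
)
LEGACY_BARE = re.compile(r'^\s*---\s+/?[\w-]+\s+---\s*$')
EMPTY_BLOCKQUOTE = re.compile(r'^\s*(?:>\s*)+$')


def strip_markers_sketch(lines):
    """Hypothetical illustration of strip mode; not the real nttt/strip.py."""
    stripped = []
    drop_next_empty_blockquote = False
    for line in lines:
        if drop_next_empty_blockquote:
            drop_next_empty_blockquote = False
            if EMPTY_BLOCKQUOTE.match(line):
                continue  # the empty ">" line paired with a bare marker
        labelled = MODERN_LABELLED.match(line)
        if labelled:
            # Keep only the translatable label behind the blockquote prefix.
            stripped.append(labelled.group("prefix") + labelled.group("label"))
        elif MODERN_BARE.match(line):
            drop_next_empty_blockquote = True
        elif LEGACY_BARE.match(line):
            continue
        else:
            stripped.append(line)
    return stripped


english = [
    "--- task ---",
    "> [!TASK]",
    ">",
    "> Open the editor.",
    "> [!ACCORDION] Where are my voice recordings stored?",
    "--- /task ---",
]
print(strip_markers_sketch(english))
```

Only the two translatable lines survive in this sample, which is the shape a translator would see in Crowdin.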
diff --git a/doc/transformations.md b/doc/transformations.md
@@ -0,0 +1,134 @@
+# NTTT: transformations reference
+
+This document describes what **Nina's Translation Tidy-up Tool (NTTT)** changes on disk, so maintainers know what to expect and where to look in code.
+
+## Scope
+
+- **Inputs:** Files under the chosen **input** directory. The tool collects every `meta.yml` and every `*.md` (see `find_files` in [`nttt/utilities.py`](../nttt/utilities.py)).
+- **English reference:** A parallel tree (default: `INPUT/../en`) used for `meta.yml` sync and optional section-tag revert.
+- **Outputs:** Corresponding paths under the **output** directory (created as needed). After processing, **missing** files/folders can be copied from input and English (`add_missing_entries`).
+
+NTTT does **not** process standalone `.html` files. HTML-related steps run on **HTML inside Markdown**.
+
+---
+
+## High-level pipeline (`fix_md_step`)
+
+For each `.md` file, [`nttt/tidyup.py`](../nttt/tidyup.py) applies, in order:
+
+1. **`restore_tree`** — for non-English locale folders, restore Markdown markers stripped before Crowdin upload.
+2. **`fix_sections`** — normalise `---` section lines (Crowdin quirks).
+3. **`revert_section_translation`** — optional; restore English section tag lines when structure matches.
+4. **`trim_md_tags`** — strip padding inside paired Markdown delimiters (outside ` ``` ` fences).
+5. **`trim_html_tags`** — strip padding inside simple inline HTML tags (outside single `` ` `` spans).
+6. **`trim_formatting_tags`** — normalise `{ … }` attribute blocks after a word (Scratch/Pico-style).
+7. **URL rewrite:** replace `/en/` with `/<language>/` everywhere in the file body.
+
+Steps 1–5 can be skipped via **`-D`/`--Disable`** (see [`nttt/arguments.py`](../nttt/arguments.py)).
+
+`meta.yml` is handled separately by **`fix_meta`** (YAML round-trip, revert non-translatable keys from English). This doc focuses on Markdown/HTML-style transforms.
+
+---
+
+## Crowdin marker strip/restore (`nttt/strip.py`, `nttt/restore.py`)
+
+**Modes:** `--mode strip`, `--mode restore`, and default `--mode tidy`.
+
+| Mode | Behaviour |
+|------|-----------|
+| `strip` | Runs on `en/` before Crowdin upload. Removes structural-only markers and keeps labelled marker text translatable. |
+| `restore` | Runs on a locale folder after Crowdin download. Rebuilds markers from the matching English file. |
+| `tidy` | For non-English locale folders, runs restore first, then the existing tidy transforms. |
+
+**Marker classification (`nttt/markers.py`):**
+
+| Kind | Pattern | Strip output | Restore output |
+|------|---------|--------------|----------------|
+| Modern bare | `> [!TASK]`, `> [!SAVE]`, nested forms like `> > [!HINT]` | Dropped. A following empty blockquote line (`>`, `> >`) is also dropped. | Copied back from `en/`. |
+| Modern labelled | `> [!ACCORDION] Where are my voice recordings stored?` | Rewritten to `> Where are my voice recordings stored?`. | Rewritten to `> [!ACCORDION] <translated label>`. |
+| Legacy bare | `--- task ---`, `--- /task ---`, `--- print-only ---`, `--- feedback ---` | Dropped. | Copied back from `en/`. |
+
+Restore uses line-index alignment against the stripped English file. If the translated file has a different number of lines from the stripped English reference, NTTT logs a warning and leaves that file unchanged for this step.
+
+Lines inside fenced code blocks (delimited by ` ``` `) are not stripped.
+
+## 1. Section markers (`nttt/cleanup_sections.py`)
+
+**Function:** `fix_sections`
+
+| Behaviour | Purpose |
+|-----------|---------|
+| Replace `\---` with `---` | Crowdin sometimes escapes section markers. |
+| Normalise `--` / `---` wrappers around section names | Fix missing dash or inconsistent spacing; target form **`--- <tag> ---`**. Tags allow word chars, digits, hyphens, and certain Unicode space characters inside the name. |
+| Normalise closing sections | **`--- /tag ---`** — removes extra spaces between `/` and the tag name. |
+| Split jammed section lines | Restore newline between adjacent **`--- … ---`** lines when Crowdin merges them (e.g. hints/hint); regex also tolerates some translator edits. |
+| Repair broken collapse/title blocks | Restore **`--- collapse ---`** plus YAML-style **`title:`** block when Crowdin breaks the structure; colons may be ASCII or full-width (`：`). |
+
+**Function:** `revert_section_translation` (requires English `.md`)
+
+- Collects lines matching **`--- <anything> ---`** in translation and English.
+- If **counts match**, replaces each translated section line with the **English** line at the same index (keeps English tag names, e.g. `task` rather than a translated word).
+- If counts differ, logs a **warning** to stderr and leaves the file unchanged for this step.
+
+---
+
+## 2. Markdown delimiters (`nttt/cleanup_markdown.py`)
+
+**Function:** `trim_md_tags`
+
+- Splits content on **` ``` `** (triple backtick). **`apply_to_every_other_part`** runs trimming only on segments **outside** fenced blocks (indices 0, 2, 4, …); fence interiors are untouched.
+- Per line outside fences:
+  - **List lines:** odd number of `*` and line starts with `*` after `lstrip` → only the substring **after the first `*`** is trimmed (preserves the bullet marker).
+  - Otherwise the **whole line** is trimmed.
+- **Trim rule:** regex finds content wrapped in paired **`` ` ``**, one to three **`_`** (`_` … `___`), or one to three **`*`** (`*` … `***`); the inner content is stripped with **`.strip()`** and the delimiters are left unchanged.
+
+Logging can record each replacement (`log_replacement`).
+
+---
+
+## 3. Inline HTML (`nttt/cleanup_html.py`)
+
+**Function:** `trim_html_tags`
+
+- Splits on **single** `` ` ``. Only **even-index** segments are processed; **inline code** segments are preserved.
+- Matches **paired** tags: `<tagName>…</tagName>` where `tagName` is **word characters + digits only** (no hyphenated custom elements in the pattern). Inner HTML is stripped with **`.strip()`**.
+- Does **not** handle attributes on the opening tag, self-closing tags, or arbitrary XML namespaces.
+
+---
+
+## 4. Formatting braces (`nttt/cleanup_formatting.py`)
+
+**Function:** `trim_formatting_tags`
+
+- Single-pass regex over the **entire** file (no code-fence splitting).
+- Targets patterns like **`word { … key = "value" … }`** with flexible Unicode spaces, colons, and quotes (see [`nttt/constants.py`](../nttt/constants.py) `RegexConstants`).
+- **Lowercases** the attribute name and value.
+- Normalises "blank" link targets: values matching **`_` + spaces + `blank`** → **`_blank`**.
+
+---
+
+## 5. Locale URLs (`nttt/tidyup.py`)
+
+After cleanup: **replace every `/en/` with `/<language>/`** in the Markdown file (`language` comes from the resolved CLI args and defaults to the input folder's basename).
+
+---
+
+## Operational notes
+
+- **Confirmation:** Unless **`-Y on`** is passed, the tool lists the files and waits for **`y`** before writing.
+- **Volunteer acknowledgements / missing files:** Separate from Markdown transforms; see `add_volunteer_acknowledgement` and `add_missing_entries` in [`nttt/tidyup.py`](../nttt/tidyup.py).
+- **Logging:** Several modules accept a `logging` object for replacement traces (`nttt_logging`).
+
+---
+
+## Quick code map
+
+| Concern | Module |
+|---------|--------|
+| Orchestration | `nttt/tidyup.py`, `nttt/__init__.py` |
+| CLI / disable flags | `nttt/arguments.py` |
+| Sections | `nttt/cleanup_sections.py` |
+| Markdown emphasis / code delimiters | `nttt/cleanup_markdown.py` |
+| Inline HTML | `nttt/cleanup_html.py` |
+| Brace attributes | `nttt/cleanup_formatting.py` |
+| Split "every other segment" | `nttt/utilities.py` → `apply_to_every_other_part` |
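
Reviewer sketch: the "every other segment" approach in the last row of the code map, combined with the trim rule from section 2, might look roughly like this. The helper names (`apply_outside_fences`, `trim_padding`) and the `PAIRED` regex are assumptions made for illustration; the real pattern and the signature of `apply_to_every_other_part` are not shown in this diff and may differ.

```python
import re

# Assumed pattern: a backtick, or a run of one to three "*" or "_", wrapping
# content and closed by the same delimiter. Not the regex NTTT actually uses.
PAIRED = re.compile(r'(?P<d>\*{1,3}|_{1,3}|`)(?P<inner>[^*_`\n]+?)(?P=d)')


def trim_padding(segment):
    """Strip whitespace just inside paired delimiters, keeping the delimiters."""
    def replace(match):
        inner = match.group("inner").strip()
        if not inner:
            return match.group(0)  # leave whitespace-only content untouched
        return match.group("d") + inner + match.group("d")
    return PAIRED.sub(replace, segment)


def apply_outside_fences(text, transform, fence="```"):
    """Apply transform to segments 0, 2, 4, ... of text split on the fence."""
    parts = text.split(fence)
    for index in range(0, len(parts), 2):
        parts[index] = transform(parts[index])
    return fence.join(parts)


sample = (
    "Click ** Run ** now.\n"
    "```\n"
    "x = ** 2 **\n"
    "```\n"
    "Then press ` Enter `.\n"
)
print(apply_outside_fences(sample, trim_padding))
```

Splitting on the fence means odd-indexed segments are fence interiors, so the padded `** 2 **` inside the code block is left alone while the prose around it is tidied.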
diff --git a/nttt/__init__.py b/nttt/__init__.py
@@ -1,4 +1,7 @@
 from .arguments import parse_command_line, resolve_arguments, check_arguments, show_arguments
+from .constants import ArgumentKeyConstants, Modes
+from .restore import restore_tree
+from .strip import strip_tree
 from .tidyup import tidyup_translations
 from ._version import __version__
 
@@ -7,4 +10,15 @@ def main():
     resolved_arguments = resolve_arguments(command_line_args)
     show_arguments(resolved_arguments)
     if (check_arguments(resolved_arguments)):
-        tidyup_translations(resolved_arguments)
+        mode = resolved_arguments[ArgumentKeyConstants.MODE]
+        if mode == Modes.STRIP:
+            strip_tree(
+                resolved_arguments[ArgumentKeyConstants.INPUT],
+                resolved_arguments[ArgumentKeyConstants.OUTPUT])
+        elif mode == Modes.RESTORE:
+            restore_tree(
+                resolved_arguments[ArgumentKeyConstants.INPUT],
+                resolved_arguments[ArgumentKeyConstants.ENGLISH],
+                resolved_arguments[ArgumentKeyConstants.OUTPUT])
+        else:
+            tidyup_translations(resolved_arguments)
diff --git a/nttt/arguments.py b/nttt/arguments.py
@@ -1,4 +1,4 @@
-from .constants import ArgumentKeyConstants
+from .constants import ArgumentKeyConstants, Modes
 import os
 from pathlib import Path
 from argparse import ArgumentParser
@@ -51,6 +51,11 @@ def parse_command_line(version):
     parser.add_argument("-l", "--language",   help="The language of the content to be tidied up, defaults to basename(INPUT).")
     parser.add_argument("-v", "--volunteers", help="The list of volunteers as a comma separated list, defaults to an empty list.")
     parser.add_argument("-f", "--final",      help="The number of the final step file, defaults to the step file with the highest number.")
+    parser.add_argument("-m", "--mode",       choices=[Modes.TIDY, Modes.STRIP, Modes.RESTORE],
+                                                   help="The processing mode. Options are: tidy (default cleanup), "
+                                                   "strip (remove non-translatable structural markers before Crowdin upload), "
+                                                   "restore (restore stripped structural markers after Crowdin download). "
+                                                   "Default is tidy.")
     parser.add_argument("-D", "--Disable",    help="The risky features to be disabled, separated by commas. "
                                                    "Options are: fix_md (fix common markdown-related issues), "
                                                    "fix_html (fix common issues in HTML-like tags (<kbd>Return</kbd>)), "
@@ -120,6 +125,11 @@ def resolve_arguments(command_line_args):
     else:
         arguments[ArgumentKeyConstants.YES] = "off"
 
+    if hasattr(command_line_args, "mode") and command_line_args.mode:
+        arguments[ArgumentKeyConstants.MODE] = command_line_args.mode
+    else:
+        arguments[ArgumentKeyConstants.MODE] = Modes.TIDY
+
     return arguments
 
 
@@ -138,6 +148,7 @@ def show_arguments(arguments):
     print("Disabled functions - '{}'".format(arguments[ArgumentKeyConstants.DISABLE]))
     print("Logging - '{}'".format(arguments[ArgumentKeyConstants.LOGGING]))
     print("Yes - '{}'".format(arguments[ArgumentKeyConstants.YES]))
+    print("Mode - '{}'".format(arguments[ArgumentKeyConstants.MODE]))
 
 
 def check_folder(folder):

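Reviewer sketch: how the `choices` list and the fallback in `resolve_arguments` interact can be checked in isolation. This is a minimal stand-alone parser, assuming plain strings in place of the `Modes` constants; `build_parser` and `resolve_mode` are hypothetical names, not functions in `nttt/arguments.py`.

```python
from argparse import ArgumentParser

# Stand-ins for Modes.TIDY, Modes.STRIP and Modes.RESTORE.
MODES = ("tidy", "strip", "restore")


def build_parser():
    """Hypothetical parser carrying only the new -m/--mode option."""
    parser = ArgumentParser(prog="nttt-sketch")
    parser.add_argument("-m", "--mode", choices=MODES,
                        help="The processing mode. Default is tidy.")
    return parser


def resolve_mode(namespace):
    """Fall back to tidy when the flag was not given (mode is None)."""
    return namespace.mode if namespace.mode else "tidy"


parser = build_parser()
print(resolve_mode(parser.parse_args([])))
print(resolve_mode(parser.parse_args(["--mode", "strip"])))
print(resolve_mode(parser.parse_args(["-m", "restore"])))
```

Because `choices` is set, argparse itself rejects an unknown mode and exits before `resolve_arguments` runs, so the fallback only needs to cover the case where the flag is omitted.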
diff --git a/nttt/constants.py b/nttt/constants.py
@@ -17,6 +17,13 @@ class ArgumentKeyConstants:
     DISABLE = 'DISABLE'
     LOGGING = 'LOGGING'
     YES = 'YES'
+    MODE = 'MODE'
+
+
+class Modes:
+    TIDY = "tidy"
+    STRIP = "strip"
+    RESTORE = "restore"
 
 
 class RegexConstants:

diff --git a/nttt/markers.py b/nttt/markers.py
@@ -0,0 +1,72 @@
+import re
+
+
+LINE_KIND_BARE_MARKER = "bare"
+LINE_KIND_LABELLED_MARKER = "labelled"
+LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE = "paired_empty_blockquote"
+LINE_KIND_REGULAR = "regular"
+
+
+MODERN_BARE_MARKER_PATTERN = re.compile(
+    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s*$'
+)
+
+MODERN_LABELLED_MARKER_PATTERN = re.compile(
+    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s+(?P<label>\S.*?)\s*$'
+)
+
+LEGACY_BARE_MARKER_PATTERN = re.compile(
+    r'^\s*---\s+/?[\w-]+\s+---\s*$'
+)
+
+EMPTY_BLOCKQUOTE_PATTERN = re.compile(r'^\s*(?:>\s*)+$')
+
+
+def remove_eol(line):
+    return line.rstrip("\r\n")
+
+
+def get_eol(line):
+    if line.endswith("\r\n"):
+        return "\r\n"
+    if line.endswith("\n"):
+        return "\n"
+    if line.endswith("\r"):
+        return "\r"
+    return ""
+
+
+def classify_line(line):
+    line_without_eol = remove_eol(line)
+
+    match = MODERN_LABELLED_MARKER_PATTERN.match(line_without_eol)
+    if match:
+        return LINE_KIND_LABELLED_MARKER, match
+
+    match = MODERN_BARE_MARKER_PATTERN.match(line_without_eol)
+    if match:
+        return LINE_KIND_BARE_MARKER, match
+
+    match = LEGACY_BARE_MARKER_PATTERN.match(line_without_eol)
+    if match:
+        return LINE_KIND_BARE_MARKER, match
+
+    match = EMPTY_BLOCKQUOTE_PATTERN.match(line_without_eol)
+    if match:
+        return LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE, match
+
+    return LINE_KIND_REGULAR, None
+
+
+def is_marker_line(line):
+    line_kind, _ = classify_line(line)
+    return line_kind in (LINE_KIND_BARE_MARKER, LINE_KIND_LABELLED_MARKER)
+
+
+def is_modern_bare_marker_line(line):
+    return MODERN_BARE_MARKER_PATTERN.match(remove_eol(line)) is not None
+
+
+def is_paired_empty_blockquote(line):
+    line_kind, _ = classify_line(line)
+    return line_kind == LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE
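
Reviewer sketch: one way the patterns above could drive the line-aligned restore described in `doc/transformations.md`. `restore_markers_sketch` is a hypothetical illustration, not the code in `nttt/restore.py`; it assumes lines without trailing newlines, ignores fenced code blocks, and returns the translated lines unchanged when the counts do not match.

```python
import re

# Patterns copied from nttt/markers.py in this change.
MODERN_BARE = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s*$'
)
MODERN_LABELLED = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s+(?P<label>\S.*?)\s*$'
)
LEGACY_BARE = re.compile(r'^\s*---\s+/?[\w-]+\s+---\s*$')
EMPTY_BLOCKQUOTE = re.compile(r'^\s*(?:>\s*)+$')
BLOCKQUOTE_PREFIX = re.compile(r'^\s*(?:>\s*)*')  # assumed helper pattern


def restore_markers_sketch(english_lines, translated_lines):
    """Hypothetical restore: rebuild markers by walking the English file."""
    plan = []
    after_bare_marker = False
    for line in english_lines:
        paired_empty = after_bare_marker and EMPTY_BLOCKQUOTE.match(line)
        after_bare_marker = False
        labelled = MODERN_LABELLED.match(line)
        if paired_empty:
            plan.append(("copy", line))
        elif labelled:
            plan.append(("label", labelled))
        elif MODERN_BARE.match(line):
            plan.append(("copy", line))
            after_bare_marker = True
        elif LEGACY_BARE.match(line):
            plan.append(("copy", line))
        else:
            plan.append(("take", None))

    # Lines that survive stripping must line up one-to-one with the translation.
    expected = sum(1 for action, _ in plan if action != "copy")
    if expected != len(translated_lines):
        return list(translated_lines)

    translated = iter(translated_lines)
    restored = []
    for action, payload in plan:
        if action == "copy":
            restored.append(payload)
        elif action == "label":
            label = BLOCKQUOTE_PREFIX.sub("", next(translated), count=1)
            restored.append(
                payload.group("prefix") + "[!" + payload.group("tag") + "] " + label
            )
        else:
            restored.append(next(translated))
    return restored


english = [
    "--- task ---",
    "> [!TASK]",
    ">",
    "> Open the editor.",
    "> [!ACCORDION] Where are my voice recordings stored?",
    "--- /task ---",
]
translated = [
    "> Öffne den Editor.",
    "> Wo werden meine Sprachaufnahmen gespeichert?",
]
print(restore_markers_sketch(english, translated))
```

The count check mirrors the documented behaviour of skipping a file whose translated line count differs from the stripped English reference; the real tool also logs a warning, which this sketch omits.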