diff --git a/README.md b/README.md index 6647096..e8f2214 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,10 @@ Note - NTTT will work on Windows, macOS and Linux. +## Documentation + +For maintainers, [doc/transformations.md](doc/transformations.md) describes what NTTT changes in `meta.yml` and Markdown files (sections, HTML, formatting, URLs, and related behaviour). + ## Prerequisites The tool requires having Python 3.7 or newer. @@ -61,6 +65,14 @@ pip3 install . --upgrade ![install nttt](images/install_nttt.png) +You could also use `pipx` (instructions below for Mac using homebrew): + +```bash +brew install pipx +pipx install /path/to/project/nttt +nttt --help +``` + You can uninstall nttt using: ```bash @@ -102,6 +114,28 @@ You can specify different directories for the input and output folder using the nttt --input c:\path\to\project\de-DE --output c:\path\to\project\de-DE-tidy ``` +### Crowdin marker stripping and restoring + +NTTT has three processing modes: + +- `tidy` (default): restore stripped Markdown markers for non-English locale folders, then run the existing tidy-up transforms. +- `strip`: remove non-translatable Markdown markers before uploading English source files to Crowdin. +- `restore`: reinsert stripped Markdown markers into translated files after downloading from Crowdin. + +Use `strip` on the English source folder before Crowdin upload: + +```bash +nttt --mode strip -i en -o en -Y on +``` + +Use `restore` on a translated locale folder after Crowdin download: + +```bash +nttt --mode restore -i de-DE -e en -o de-DE -Y on +``` + +Modern bare markers such as `> [!TASK]` are removed entirely, along with their paired empty `>` line. Modern labelled markers such as `> [!ACCORDION] Where are my voice recordings stored?` keep the label available for translation by becoming `> Where are my voice recordings stored?`; restore reinserts `[!ACCORDION]` before the translated label. 
Legacy markers such as `--- task ---` and `--- /task ---` are also removed and restored by line alignment against `en/`. + ### Help To bring up full usage information use the `-h`/`--help` option. diff --git a/doc/transformations.md b/doc/transformations.md new file mode 100644 index 0000000..2573823 --- /dev/null +++ b/doc/transformations.md @@ -0,0 +1,134 @@ +# NTTT: transformations reference + +This document describes what **Nina's Translation Tidy-up Tool (NTTT)** changes on disk, so maintainers know what to expect and where to look in code. + +## Scope + +- **Inputs:** Files under the chosen **input** directory. The tool collects every `meta.yml` and every `*.md` (see `find_files` in [`nttt/utilities.py`](../nttt/utilities.py)). +- **English reference:** A parallel tree (default: `INPUT/../en`) used for `meta.yml` sync and optional section-tag revert. +- **Outputs:** Corresponding paths under the **output** directory (created as needed). After processing, **missing** files/folders can be copied from input and English (`add_missing_entries`). + +NTTT does **not** process standalone `.html` files. HTML-related steps run on **HTML inside Markdown**. + +--- + +## High-level pipeline (`fix_md_step`) + +For each `.md` file, [`nttt/tidyup.py`](../nttt/tidyup.py) applies, in order: + +1. **`restore_tree`** — for non-English locale folders, restore Markdown markers stripped before Crowdin upload. +2. **`fix_sections`** — normalise `---` section lines (Crowdin quirks). +3. **`revert_section_translation`** — optional; restore English section tag lines when structure matches. +4. **`trim_md_tags`** — strip padding inside paired Markdown delimiters (outside ` ``` ` fences). +5. **`trim_html_tags`** — strip padding inside simple inline HTML tags (outside single `` ` `` spans). +6. **`trim_formatting_tags`** — normalise `{ … }` attribute blocks after a word (Scratch/Pico-style). +7. **URL rewrite:** replace `/en/` with `/<language>/` everywhere in the file body. 
+ +Steps 1–5 can be skipped via **`--disable`** (see [`nttt/arguments.py`](../nttt/arguments.py)). + +`meta.yml` is handled separately by **`fix_meta`** (YAML round-trip, revert non-translatable keys from English). This doc focuses on Markdown/HTML-style transforms. + +--- + +## Crowdin marker strip/restore (`nttt/strip.py`, `nttt/restore.py`) + +**Modes:** `--mode strip`, `--mode restore`, and default `--mode tidy`. + +| Mode | Behaviour | +|------|-----------| +| `strip` | Runs on `en/` before Crowdin upload. Removes structural-only markers and keeps labelled marker text translatable. | +| `restore` | Runs on a locale folder after Crowdin download. Rebuilds markers from the matching English file. | +| `tidy` | For non-English locale folders, runs restore first, then the existing tidy transforms. | + +**Marker classification (`nttt/markers.py`):** + +| Kind | Pattern | Strip output | Restore output | +|------|---------|--------------|----------------| +| Modern bare | `> [!TASK]`, `> [!SAVE]`, nested forms like `> > [!HINT]` | Dropped. A following empty blockquote line (`>`, `> >`) is also dropped. | Copied back from `en/`. | +| Modern labelled | `> [!ACCORDION] Where are my voice recordings stored?` | Rewritten to `> Where are my voice recordings stored?`. | Rewritten to `> [!ACCORDION] <translated label>`. | +| Legacy bare | `--- task ---`, `--- /task ---`, `--- print-only ---`, `--- feedback ---` | Dropped. | Copied back from `en/`. | + +Restore uses line-index alignment against the stripped English file. If the translated file has a different number of lines from the stripped English reference, NTTT logs a warning and leaves that file unchanged for this step. + +Fenced code blocks split by ` ``` ` are not stripped. + +## 1. Section markers (`nttt/cleanup_sections.py`) + +**Function:** `fix_sections` + +| Behaviour | Purpose | +|-----------|---------| +| Replace `\---` with `---` | Crowdin sometimes escapes section markers. 
| +| Normalise `--` / `---` wrappers around section names | Fix missing dash or inconsistent spacing; target form **`--- <tag> ---`**. Tags allow word chars, digits, hyphens, and certain Unicode space characters inside the name. | +| Normalise closing sections | **`--- /tag ---`** — removes extra spaces between `/` and the tag name. | +| Split jammed section lines | Restore newline between adjacent **`--- … ---`** lines when Crowdin merges them (e.g. hints/hint); regex also tolerates some translator edits. | +| Repair broken collapse/title blocks | Restore **`--- collapse ---`** plus YAML-style **`title:`** block when Crowdin breaks the structure; colons may be ASCII or full-width (`：`). | + +**Function:** `revert_section_translation` (requires English `.md`) + +- Collects lines matching **`--- <tag> ---`** in translation and English. +- If **counts match**, replaces each translated section line with the **English** line at the same index (keeps English tag names, e.g. `task` vs translated word). +- If counts differ, logs a **warning** to stderr and leaves the file unchanged for this step. + +--- + +## 2. Markdown delimiters (`nttt/cleanup_markdown.py`) + +**Function:** `trim_md_tags` + +- Splits content on **` ``` `** (triple backtick). **`apply_to_every_other_part`** runs trimming only on segments **outside** fenced blocks (indices 0, 2, 4, …); fence interiors are untouched. +- Per line outside fences: + - **List lines:** odd number of `*` and line starts with `*` after `lstrip` → only the substring **after the first `*`** is trimmed (preserves the bullet marker). + - Otherwise the **whole line** is trimmed. +- **Trim rule:** regex finds paired **`` ` ``**, **`_` … `___`**, or **`*` … `***`** wrapping content; inner content is **`.strip()`**; delimiters unchanged. + +Logging can record each replacement (`log_replacement`). + +--- + +## 3. Inline HTML (`nttt/cleanup_html.py`) + +**Function:** `trim_html_tags` + +- Splits on **single** `` ` ``. 
Only **even-index** segments are processed; **inline code** segments are preserved. +- Matches **paired** tags: `<tagName>…</tagName>` where `tagName` is **word characters + digits only** (no hyphenated custom elements in the pattern). Inner HTML is **`.strip()`**. +- Does **not** handle attributes on the opening tag, self-closing tags, or arbitrary XML namespaces. + +--- + +## 4. Formatting braces (`nttt/cleanup_formatting.py`) + +**Function:** `trim_formatting_tags` + +- Single-pass regex over the **entire** file (no code-fence splitting). +- Targets patterns like **`word { … key = "value" … }`** with flexible Unicode spaces, colons, and quotes (see [`nttt/constants.py`](../nttt/constants.py) `RegexConstants`). +- **Lowercases** the attribute name and value. +- Normalises "blank" link targets: values matching **`_` + spaces + `blank`** → **`_blank`**. + +--- + +## 5. Locale URLs (`nttt/tidyup.py`) + +After cleanup: **replace every `/en/` with `/<language>/`** in the Markdown file (`language` from resolved CLI args, defaulting from input folder basename). + +--- + +## Operational notes + +- **Confirmation:** Unless **`-Y`**, the tool lists files and waits for **`y`** before writing. +- **Volunteer acknowledgements / missing files:** Separate from Markdown transforms; see `add_volunteer_acknowledgement` and `add_missing_entries` in [`nttt/tidyup.py`](../nttt/tidyup.py). +- **Logging:** Several modules accept a `logging` object for replacement traces (`nttt_logging`). 
+ +--- + +## Quick code map + +| Concern | Module | +|---------|--------| +| Orchestration | `nttt/tidyup.py`, `nttt/__init__.py` | +| CLI / disable flags | `nttt/arguments.py` | +| Sections | `nttt/cleanup_sections.py` | +| Markdown emphasis / code delimiters | `nttt/cleanup_markdown.py` | +| Inline HTML | `nttt/cleanup_html.py` | +| Brace attributes | `nttt/cleanup_formatting.py` | +| Split "every other segment" | `nttt/utilities.py` → `apply_to_every_other_part` | diff --git a/nttt/__init__.py b/nttt/__init__.py index 9976cfb..546ab30 100644 --- a/nttt/__init__.py +++ b/nttt/__init__.py @@ -1,4 +1,7 @@ from .arguments import parse_command_line, resolve_arguments, check_arguments, show_arguments +from .constants import ArgumentKeyConstants, Modes +from .restore import restore_tree +from .strip import strip_tree from .tidyup import tidyup_translations from ._version import __version__ @@ -7,4 +10,15 @@ def main(): resolved_arguments = resolve_arguments(command_line_args) show_arguments(resolved_arguments) if (check_arguments(resolved_arguments)): - tidyup_translations(resolved_arguments) + mode = resolved_arguments[ArgumentKeyConstants.MODE] + if mode == Modes.STRIP: + strip_tree( + resolved_arguments[ArgumentKeyConstants.INPUT], + resolved_arguments[ArgumentKeyConstants.OUTPUT]) + elif mode == Modes.RESTORE: + restore_tree( + resolved_arguments[ArgumentKeyConstants.INPUT], + resolved_arguments[ArgumentKeyConstants.ENGLISH], + resolved_arguments[ArgumentKeyConstants.OUTPUT]) + else: + tidyup_translations(resolved_arguments) diff --git a/nttt/arguments.py b/nttt/arguments.py index 35f76b0..6e2ca09 100644 --- a/nttt/arguments.py +++ b/nttt/arguments.py @@ -1,4 +1,4 @@ -from .constants import ArgumentKeyConstants +from .constants import ArgumentKeyConstants, Modes import os from pathlib import Path from argparse import ArgumentParser @@ -51,6 +51,11 @@ def parse_command_line(version): parser.add_argument("-l", "--language", help="The language of the content to be 
tidied up, defaults to basename(INPUT).") parser.add_argument("-v", "--volunteers", help="The list of volunteers as a comma separated list, defaults to an empty list.") parser.add_argument("-f", "--final", help="The number of the final step file, defaults to the step file with the highest number.") + parser.add_argument("-m", "--mode", choices=[Modes.TIDY, Modes.STRIP, Modes.RESTORE], + help="The processing mode. Options are: tidy (default cleanup), " + "strip (remove non-translatable structural markers before Crowdin upload), " + "restore (restore stripped structural markers after Crowdin download). " + "Default is tidy.") parser.add_argument("-D", "--Disable", help="The risky features to be disabled, separated by commas. " "Options are: fix_md (fix common markdown-related issues), " "fix_html (fix common issues in HTML-like tags (Return)), " @@ -120,6 +125,11 @@ def resolve_arguments(command_line_args): else: arguments[ArgumentKeyConstants.YES] = "off" + if hasattr(command_line_args, "mode") and command_line_args.mode: + arguments[ArgumentKeyConstants.MODE] = command_line_args.mode + else: + arguments[ArgumentKeyConstants.MODE] = Modes.TIDY + return arguments @@ -138,6 +148,7 @@ def show_arguments(arguments): print("Disabled functions - '{}'".format(arguments[ArgumentKeyConstants.DISABLE])) print("Logging - '{}'".format(arguments[ArgumentKeyConstants.LOGGING])) print("Yes - '{}'".format(arguments[ArgumentKeyConstants.YES])) + print("Mode - '{}'".format(arguments[ArgumentKeyConstants.MODE])) def check_folder(folder): diff --git a/nttt/constants.py b/nttt/constants.py index ce14cee..1b08b17 100644 --- a/nttt/constants.py +++ b/nttt/constants.py @@ -17,6 +17,13 @@ class ArgumentKeyConstants: DISABLE = 'DISABLE' LOGGING = 'LOGGING' YES = 'YES' + MODE = 'MODE' + + +class Modes: + TIDY = "tidy" + STRIP = "strip" + RESTORE = "restore" class RegexConstants: diff --git a/nttt/markers.py b/nttt/markers.py new file mode 100644 index 0000000..9fc1fd8 --- /dev/null +++ 
b/nttt/markers.py @@ -0,0 +1,72 @@ +import re + + +LINE_KIND_BARE_MARKER = "bare" +LINE_KIND_LABELLED_MARKER = "labelled" +LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE = "paired_empty_blockquote" +LINE_KIND_REGULAR = "regular" + + +MODERN_BARE_MARKER_PATTERN = re.compile( + r'^(?P\s*(?:>\s*)+)\[!(?P[A-Z][A-Z0-9_-]*)\]\s*$' +) + +MODERN_LABELLED_MARKER_PATTERN = re.compile( + r'^(?P\s*(?:>\s*)+)\[!(?P[A-Z][A-Z0-9_-]*)\]\s+(?P