34 changes: 34 additions & 0 deletions README.md
@@ -4,6 +4,10 @@

Note - NTTT will work on Windows, macOS and Linux.

## Documentation

For maintainers, [doc/transformations.md](doc/transformations.md) describes what NTTT changes in `meta.yml` and Markdown files (sections, HTML, formatting, URLs, and related behaviour).

## Prerequisites

The tool requires having Python 3.7 or newer.
@@ -61,6 +65,14 @@ pip3 install . --upgrade

![install nttt](images/install_nttt.png)

You can also install NTTT with `pipx` (the instructions below are for macOS with Homebrew):

```bash
brew install pipx
pipx install /path/to/project/nttt
nttt --help
```

You can uninstall nttt using:

```bash
@@ -102,6 +114,28 @@ You can specify different directories for the input and output folder using the
nttt --input c:\path\to\project\de-DE --output c:\path\to\project\de-DE-tidy
```

### Crowdin marker stripping and restoring

NTTT has three processing modes:

- `tidy` (default): restore stripped Markdown markers for non-English locale folders, then run the existing tidy-up transforms.
- `strip`: remove non-translatable Markdown markers before uploading English source files to Crowdin.
- `restore`: reinsert stripped Markdown markers into translated files after downloading from Crowdin.

Use `strip` on the English source folder before Crowdin upload:

```bash
nttt --mode strip -i en -o en -Y on
```

Use `restore` on a translated locale folder after Crowdin download:

```bash
nttt --mode restore -i de-DE -e en -o de-DE -Y on
```

Modern bare markers such as `> [!TASK]` are removed entirely, along with their paired empty `>` line. Modern labelled markers such as `> [!ACCORDION] Where are my voice recordings stored?` keep the label available for translation by becoming `> Where are my voice recordings stored?`; restore reinserts `[!ACCORDION]` before the translated label. Legacy markers such as `--- task ---` and `--- /task ---` are also removed and restored by line alignment against `en/`.

### Help

To bring up full usage information use the `-h`/`--help` option.
134 changes: 134 additions & 0 deletions doc/transformations.md
@@ -0,0 +1,134 @@
# NTTT: transformations reference

This document describes what **Nina's Translation Tidy-up Tool (NTTT)** changes on disk, so maintainers know what to expect and where to look in code.

## Scope

- **Inputs:** Files under the chosen **input** directory. The tool collects every `meta.yml` and every `*.md` (see `find_files` in [`nttt/utilities.py`](../nttt/utilities.py)).
- **English reference:** A parallel tree (default: `INPUT/../en`) used for `meta.yml` sync and optional section-tag revert.
- **Outputs:** Corresponding paths under the **output** directory (created as needed). After processing, **missing** files/folders can be copied from input and English (`add_missing_entries`).

NTTT does **not** process standalone `.html` files. HTML-related steps run on **HTML inside Markdown**.

---

## High-level pipeline (`fix_md_step`)

For each `.md` file, [`nttt/tidyup.py`](../nttt/tidyup.py) applies, in order:

1. **`restore_tree`** — for non-English locale folders, restore Markdown markers stripped before Crowdin upload.
2. **`fix_sections`** — normalise `---` section lines (Crowdin quirks).
3. **`revert_section_translation`** — optional; restore English section tag lines when structure matches.
4. **`trim_md_tags`** — strip padding inside paired Markdown delimiters (outside ` ``` ` fences).
5. **`trim_html_tags`** — strip padding inside simple inline HTML tags (outside single `` ` `` spans).
6. **`trim_formatting_tags`** — normalise `{ … }` attribute blocks after a word (Scratch/Pico-style).
7. **URL rewrite:** replace `/en/` with `/<language>/` everywhere in the file body.

Steps 1–5 can be skipped via **`--Disable`** (short form `-D`; see [`nttt/arguments.py`](../nttt/arguments.py)).
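
The ordered, skippable pipeline described above can be pictured with a minimal sketch. It is not the code in `nttt/tidyup.py`: `run_pipeline`, the step labels, and the two toy transforms are hypothetical stand-ins, and the labels are not necessarily the values that `--Disable` accepts.

```python
from typing import Callable, Iterable, Tuple

# A step is a (label, transform) pair; both are illustrative assumptions.
Step = Tuple[str, Callable[[str], str]]


def run_pipeline(content: str, steps: Iterable[Step], disabled: Iterable[str] = ()) -> str:
    """Apply each transform in order, skipping any whose label is disabled."""
    skipped = set(disabled)
    for label, transform in steps:
        if label in skipped:
            continue
        content = transform(content)
    return content


# Toy stand-ins for two of the steps listed above.
example_steps = [
    ("fix_sections", lambda text: text.replace("\\---", "---")),
    ("url_rewrite", lambda text: text.replace("/en/", "/de-DE/")),
]
```

Because each step receives the output of the previous one, the order of the list matters: a later step sees section lines only after the earlier steps have normalised them.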

`meta.yml` is handled separately by **`fix_meta`** (YAML round-trip, revert non-translatable keys from English). This doc focuses on Markdown/HTML-style transforms.

---

## Crowdin marker strip/restore (`nttt/strip.py`, `nttt/restore.py`)

**Modes:** `--mode strip`, `--mode restore`, and default `--mode tidy`.

| Mode | Behaviour |
|------|-----------|
| `strip` | Runs on `en/` before Crowdin upload. Removes structural-only markers and keeps labelled marker text translatable. |
| `restore` | Runs on a locale folder after Crowdin download. Rebuilds markers from the matching English file. |
| `tidy` | For non-English locale folders, runs restore first, then the existing tidy transforms. |

**Marker classification (`nttt/markers.py`):**

| Kind | Pattern | Strip output | Restore output |
|------|---------|--------------|----------------|
| Modern bare | `> [!TASK]`, `> [!SAVE]`, nested forms like `> > [!HINT]` | Dropped. A following empty blockquote line (`>`, `> >`) is also dropped. | Copied back from `en/`. |
| Modern labelled | `> [!ACCORDION] Where are my voice recordings stored?` | Rewritten to `> Where are my voice recordings stored?`. | Rewritten to `> [!ACCORDION] <translated label>`. |
| Legacy bare | `--- task ---`, `--- /task ---`, `--- print-only ---`, `--- feedback ---` | Dropped. | Copied back from `en/`. |
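
As a rough illustration of the strip column in the table above, the following sketch reuses the regular expressions from `nttt/markers.py` on a list of lines without end-of-line characters. The function `strip_markers` is a hypothetical stand-in rather than the implementation in `nttt/strip.py`, and it ignores fenced code blocks for brevity.

```python
import re

PREFIX = r'^(?P<prefix>\s*(?:>\s*)+)'
MARKER = r'\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]'
BARE = re.compile(PREFIX + MARKER + r'\s*$')
LABELLED = re.compile(PREFIX + MARKER + r'\s+(?P<label>\S.*?)\s*$')
LEGACY = re.compile(r'^\s*---\s+/?[\w-]+\s+---\s*$')
EMPTY_QUOTE = re.compile(r'^\s*(?:>\s*)+$')


def strip_markers(lines):
    """Drop bare markers, keep labels, and pass every other line through."""
    stripped = []
    after_modern_bare = False
    for line in lines:
        # An empty blockquote line directly after a modern bare marker is dropped too.
        paired_empty = after_modern_bare and EMPTY_QUOTE.match(line)
        after_modern_bare = False
        if paired_empty or LEGACY.match(line):
            continue
        if BARE.match(line):
            after_modern_bare = True
            continue
        labelled = LABELLED.match(line)
        if labelled:
            # Keep the blockquote prefix and the label, drop the [!TAG] part.
            stripped.append(labelled.group("prefix") + labelled.group("label"))
        else:
            stripped.append(line)
    return stripped
```

Only the label of a labelled marker remains for translators; bare markers leave no trace in the stripped file, which is why restore needs the English reference.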

Restore uses line-index alignment against the stripped English file. If the translated file has a different number of lines from the stripped English reference, NTTT logs a warning and leaves that file unchanged for this step.
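
One way to picture the line-index alignment is the sketch below. It assumes lines without end-of-line characters and an unstripped English file as reference; `restore_markers` is a hypothetical helper, not the function in `nttt/restore.py`, and the fallback for a translated label that lost its `>` prefix is an assumption.

```python
import re
import sys

PREFIX = r'^(?P<prefix>\s*(?:>\s*)+)'
MARKER = r'\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]'
BARE = re.compile(PREFIX + MARKER + r'\s*$')
LABELLED = re.compile(PREFIX + MARKER + r'\s+(?P<label>\S.*?)\s*$')
LEGACY = re.compile(r'^\s*---\s+/?[\w-]+\s+---\s*$')
EMPTY_QUOTE = re.compile(r'^\s*(?:>\s*)+$')
QUOTE_TEXT = re.compile(PREFIX + r'(?P<rest>.*)$')


def restore_markers(english_lines, translated_lines):
    """Rebuild markers by walking the English lines next to the translated ones."""
    plan = []  # one (action, payload) entry per English line
    after_modern_bare = False
    for line in english_lines:
        paired_empty = after_modern_bare and EMPTY_QUOTE.match(line)
        after_modern_bare = False
        labelled = LABELLED.match(line)
        if paired_empty or LEGACY.match(line):
            plan.append(("copy", line))
        elif BARE.match(line):
            plan.append(("copy", line))
            after_modern_bare = True
        elif labelled:
            plan.append(("label", labelled))
        else:
            plan.append(("keep", None))

    # Every non-copied English line must have exactly one translated counterpart.
    expected = sum(1 for action, _ in plan if action != "copy")
    if expected != len(translated_lines):
        print("warning: line count mismatch, file left unchanged", file=sys.stderr)
        return list(translated_lines)

    remaining = iter(translated_lines)
    restored = []
    for action, payload in plan:
        if action == "copy":
            restored.append(payload)
            continue
        line = next(remaining)
        if action == "label":
            marker = "[!" + payload.group("tag") + "] "
            quoted = QUOTE_TEXT.match(line)
            if quoted:
                line = quoted.group("prefix") + marker + quoted.group("rest")
            else:
                # Assumed fallback: reuse the English blockquote prefix.
                line = payload.group("prefix") + marker + line.strip()
        restored.append(line)
    return restored
```

The count check is what makes the approach safe: if a translator adds or removes a line, the indices no longer line up, so the sketch returns the translated lines untouched instead of inserting markers in the wrong place.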

Content inside fenced code blocks (delimited by ` ``` `) is not stripped.

## 1. Section markers (`nttt/cleanup_sections.py`)

**Function:** `fix_sections`

| Behaviour | Purpose |
|-----------|---------|
| Replace `\---` with `---` | Crowdin sometimes escapes section markers. |
| Normalise `--` / `---` wrappers around section names | Fix missing dash or inconsistent spacing; target form **`--- <tag> ---`**. Tags allow word chars, digits, hyphens, and certain Unicode space characters inside the name. |
| Normalise closing sections | **`--- /tag ---`** — removes extra spaces between `/` and the tag name. |
| Split jammed section lines | Restore newline between adjacent **`--- … ---`** lines when Crowdin merges them (e.g. hints/hint); regex also tolerates some translator edits. |
| Repair broken collapse/title blocks | Restore **`--- collapse ---`** plus YAML-style **`title:`** block when Crowdin breaks the structure; colons may be ASCII or full-width (`:`). |

**Function:** `revert_section_translation` (requires English `.md`)

- Collects lines matching **`--- <anything> ---`** in translation and English.
- If **counts match**, replaces each translated section line with the **English** line at the same index (keeps English tag names, e.g. `task` vs translated word).
- If counts differ, logs a **warning** to stderr and leaves the file unchanged for this step.
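
The count-then-replace behaviour can be sketched as follows. This is an illustration under assumptions: `revert_section_lines` and `SECTION_LINE` are hypothetical names, the pattern is a simplification of whatever `nttt/cleanup_sections.py` uses, and lines are passed without end-of-line characters.

```python
import re
import sys

# Simplified stand-in for a "--- <anything> ---" section line.
SECTION_LINE = re.compile(r'^\s*---\s+.+?\s+---\s*$')


def revert_section_lines(translated_lines, english_lines):
    """Replace translated section lines with the English ones, index by index."""
    english_sections = [line for line in english_lines if SECTION_LINE.match(line)]
    translated_positions = [
        index for index, line in enumerate(translated_lines) if SECTION_LINE.match(line)
    ]
    if len(english_sections) != len(translated_positions):
        print("warning: section counts differ, file left unchanged", file=sys.stderr)
        return list(translated_lines)
    result = list(translated_lines)
    for position, english_line in zip(translated_positions, english_sections):
        result[position] = english_line
    return result
```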

---

## 2. Markdown delimiters (`nttt/cleanup_markdown.py`)

**Function:** `trim_md_tags`

- Splits content on **` ``` `** (triple backtick). **`apply_to_every_other_part`** runs trimming only on segments **outside** fenced blocks (indices 0, 2, 4, …); fence interiors are untouched.
- Per line outside fences:
  - **List lines:** odd number of `*` and line starts with `*` after `lstrip` → only the substring **after the first `*`** is trimmed (preserves the bullet marker).
  - Otherwise the **whole line** is trimmed.
- **Trim rule:** regex finds paired **`` ` ``**, **`_` … `___`**, or **`*` … `***`** wrapping content; inner content is **`.strip()`**; delimiters unchanged.

Logging can record each replacement (`log_replacement`).
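
The fence-aware trimming can be approximated with the sketch below. `apply_outside_fences`, `trim_delimited`, and the `DELIMITED` pattern are hypothetical simplifications, not the code in `nttt/cleanup_markdown.py` or `nttt/utilities.py`; in particular, the bullet-line special case described above is left out.

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to keep the example fence-safe

# Simplified pattern: a backtick, or 1-3 underscores or asterisks, around some content.
DELIMITED = re.compile(r'(?P<delim>`|\*{1,3}|_{1,3})(?P<inner>[^`*_\n]+?)(?P=delim)')


def trim_delimited(segment):
    """Strip padding just inside paired delimiters, leaving the delimiters alone."""
    def rewrite(match):
        inner = match.group("inner").strip()
        if not inner:
            return match.group(0)  # nothing but whitespace: leave untouched
        return match.group("delim") + inner + match.group("delim")
    return DELIMITED.sub(rewrite, segment)


def apply_outside_fences(text, transform):
    """Run transform on segments 0, 2, 4, ... so fence interiors stay as they are."""
    parts = text.split(FENCE)
    for index in range(0, len(parts), 2):
        parts[index] = transform(parts[index])
    return FENCE.join(parts)
```

Splitting on the fence string means odd-numbered segments are code-block interiors, so a call such as `apply_outside_fences(text, trim_delimited)` never rewrites code samples.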

---

## 3. Inline HTML (`nttt/cleanup_html.py`)

**Function:** `trim_html_tags`

- Splits on **single** `` ` ``. Only **even-index** segments are processed; **inline code** segments are preserved.
- Matches **paired** tags: `<tagName>…</tagName>` where `tagName` consists of **word characters only** (letters, digits, underscore; no hyphenated custom elements in the pattern). Inner HTML is **`.strip()`**.
- Does **not** handle attributes on the opening tag, self-closing tags, or arbitrary XML namespaces.
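
A minimal sketch of this behaviour, assuming a simplified pattern: `trim_inline_html` and `PAIRED_TAG` are hypothetical names, and the real regular expression in `nttt/cleanup_html.py` may differ.

```python
import re

# Attribute-free paired tag whose name is made of word characters only.
PAIRED_TAG = re.compile(r'<(?P<tag>\w+)>(?P<inner>.*?)</(?P=tag)>')


def trim_inline_html(text):
    """Strip padding inside simple paired tags, skipping inline code spans."""
    parts = text.split("`")
    for index in range(0, len(parts), 2):  # even segments are outside backticks
        parts[index] = PAIRED_TAG.sub(
            lambda match: "<{0}>{1}</{0}>".format(
                match.group("tag"), match.group("inner").strip()
            ),
            parts[index],
        )
    return "`".join(parts)
```

Opening tags with attributes, such as `<a href="…">`, do not match the pattern and are therefore left unchanged, in line with the limitation noted above.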

---

## 4. Formatting braces (`nttt/cleanup_formatting.py`)

**Function:** `trim_formatting_tags`

- Single-pass regex over the **entire** file (no code-fence splitting).
- Targets patterns like **`word { … key = "value" … }`** with flexible Unicode spaces, colons, and quotes (see [`nttt/constants.py`](../nttt/constants.py) `RegexConstants`).
- **Lowercases** the attribute name and value.
- Normalises "blank" link targets: values matching **`_` + spaces + `blank`** → **`_blank`**.
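
To make the lowercasing and `_blank` normalisation concrete, here is a narrow sketch. It handles only ASCII colons and straight quotes, and both the pattern and the compact output spacing are assumptions; the actual expressions live in `RegexConstants` and are more permissive.

```python
import re

# Hypothetical, narrow pattern: {: key = "value" } with optional spaces.
BRACE_ATTRIBUTE = re.compile(r'\{\s*:\s*(?P<key>[\w-]+)\s*=\s*"(?P<value>[^"]*)"\s*\}')
BLANK_TARGET = re.compile(r'_\s+blank')


def normalise_brace_attribute(text):
    """Lowercase the key and value, and collapse '_ blank' into '_blank'."""
    def rewrite(match):
        key = match.group("key").lower()
        value = match.group("value").strip().lower()
        value = BLANK_TARGET.sub("_blank", value)
        # Output spacing is an assumption made for this sketch.
        return '{:' + key + '="' + value + '"}'
    return BRACE_ATTRIBUTE.sub(rewrite, text)
```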

---

## 5. Locale URLs (`nttt/tidyup.py`)

After cleanup: **replace every `/en/` with `/<language>/`** in the Markdown file (`language` from resolved CLI args, defaulting from input folder basename).
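
In effect this is a plain string replacement, as in the sketch below (`localise_urls` is a hypothetical name used only for illustration):

```python
def localise_urls(content, language):
    """Replace every literal '/en/' with '/<language>/'."""
    return content.replace("/en/", "/" + language + "/")
```

Because the replacement is global, any literal `/en/` in the file body is rewritten, whether or not it is part of a URL.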

---

## Operational notes

- **Confirmation:** Unless **`-Y on`** is passed, the tool lists files and waits for **`y`** before writing.
- **Volunteer acknowledgements / missing files:** Separate from Markdown transforms; see `add_volunteer_acknowledgement` and `add_missing_entries` in [`nttt/tidyup.py`](../nttt/tidyup.py).
- **Logging:** Several modules accept a `logging` object for replacement traces (`nttt_logging`).

---

## Quick code map

| Concern | Module |
|---------|--------|
| Orchestration | `nttt/tidyup.py`, `nttt/__init__.py` |
| CLI / disable flags | `nttt/arguments.py` |
| Sections | `nttt/cleanup_sections.py` |
| Markdown emphasis / code delimiters | `nttt/cleanup_markdown.py` |
| Inline HTML | `nttt/cleanup_html.py` |
| Brace attributes | `nttt/cleanup_formatting.py` |
| Split "every other segment" | `nttt/utilities.py` → `apply_to_every_other_part` |
16 changes: 15 additions & 1 deletion nttt/__init__.py
@@ -1,4 +1,7 @@
from .arguments import parse_command_line, resolve_arguments, check_arguments, show_arguments
from .constants import ArgumentKeyConstants, Modes
from .restore import restore_tree
from .strip import strip_tree
from .tidyup import tidyup_translations
from ._version import __version__

@@ -7,4 +10,15 @@ def main():
    resolved_arguments = resolve_arguments(command_line_args)
    show_arguments(resolved_arguments)
    if (check_arguments(resolved_arguments)):
        tidyup_translations(resolved_arguments)
        mode = resolved_arguments[ArgumentKeyConstants.MODE]
        if mode == Modes.STRIP:
            strip_tree(
                resolved_arguments[ArgumentKeyConstants.INPUT],
                resolved_arguments[ArgumentKeyConstants.OUTPUT])
        elif mode == Modes.RESTORE:
            restore_tree(
                resolved_arguments[ArgumentKeyConstants.INPUT],
                resolved_arguments[ArgumentKeyConstants.ENGLISH],
                resolved_arguments[ArgumentKeyConstants.OUTPUT])
        else:
            tidyup_translations(resolved_arguments)
13 changes: 12 additions & 1 deletion nttt/arguments.py
@@ -1,4 +1,4 @@
from .constants import ArgumentKeyConstants
from .constants import ArgumentKeyConstants, Modes
import os
from pathlib import Path
from argparse import ArgumentParser
@@ -51,6 +51,11 @@ def parse_command_line(version):
    parser.add_argument("-l", "--language", help="The language of the content to be tidied up, defaults to basename(INPUT).")
    parser.add_argument("-v", "--volunteers", help="The list of volunteers as a comma separated list, defaults to an empty list.")
    parser.add_argument("-f", "--final", help="The number of the final step file, defaults to the step file with the highest number.")
    parser.add_argument("-m", "--mode", choices=[Modes.TIDY, Modes.STRIP, Modes.RESTORE],
                        help="The processing mode. Options are: tidy (default cleanup), "
                             "strip (remove non-translatable structural markers before Crowdin upload), "
                             "restore (restore stripped structural markers after Crowdin download). "
                             "Default is tidy.")
    parser.add_argument("-D", "--Disable", help="The risky features to be disabled, separated by commas. "
                                                "Options are: fix_md (fix common markdown-related issues), "
                                                "fix_html (fix common issues in HTML-like tags (<kbd>Return</kbd>)), "
@@ -120,6 +125,11 @@ def resolve_arguments(command_line_args):
    else:
        arguments[ArgumentKeyConstants.YES] = "off"

    if hasattr(command_line_args, "mode") and command_line_args.mode:
        arguments[ArgumentKeyConstants.MODE] = command_line_args.mode
    else:
        arguments[ArgumentKeyConstants.MODE] = Modes.TIDY

    return arguments


@@ -138,6 +148,7 @@ def show_arguments(arguments):
    print("Disabled functions - '{}'".format(arguments[ArgumentKeyConstants.DISABLE]))
    print("Logging - '{}'".format(arguments[ArgumentKeyConstants.LOGGING]))
    print("Yes - '{}'".format(arguments[ArgumentKeyConstants.YES]))
    print("Mode - '{}'".format(arguments[ArgumentKeyConstants.MODE]))


def check_folder(folder):
7 changes: 7 additions & 0 deletions nttt/constants.py
@@ -17,6 +17,13 @@ class ArgumentKeyConstants:
    DISABLE = 'DISABLE'
    LOGGING = 'LOGGING'
    YES = 'YES'
    MODE = 'MODE'


class Modes:
    TIDY = "tidy"
    STRIP = "strip"
    RESTORE = "restore"


class RegexConstants:
72 changes: 72 additions & 0 deletions nttt/markers.py
@@ -0,0 +1,72 @@
import re


LINE_KIND_BARE_MARKER = "bare"
LINE_KIND_LABELLED_MARKER = "labelled"
LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE = "paired_empty_blockquote"
LINE_KIND_REGULAR = "regular"


MODERN_BARE_MARKER_PATTERN = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s*$'
)

MODERN_LABELLED_MARKER_PATTERN = re.compile(
    r'^(?P<prefix>\s*(?:>\s*)+)\[!(?P<tag>[A-Z][A-Z0-9_-]*)\]\s+(?P<label>\S.*?)\s*$'
)

LEGACY_BARE_MARKER_PATTERN = re.compile(
    r'^\s*---\s+/?[\w-]+\s+---\s*$'
)

EMPTY_BLOCKQUOTE_PATTERN = re.compile(r'^\s*(?:>\s*)+$')


def remove_eol(line):
    return line.rstrip("\r\n")


def get_eol(line):
    if line.endswith("\r\n"):
        return "\r\n"
    if line.endswith("\n"):
        return "\n"
    if line.endswith("\r"):
        return "\r"
    return ""


def classify_line(line):
    line_without_eol = remove_eol(line)

    match = MODERN_LABELLED_MARKER_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_LABELLED_MARKER, match

    match = MODERN_BARE_MARKER_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_BARE_MARKER, match

    match = LEGACY_BARE_MARKER_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_BARE_MARKER, match

    match = EMPTY_BLOCKQUOTE_PATTERN.match(line_without_eol)
    if match:
        return LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE, match

    return LINE_KIND_REGULAR, None


def is_marker_line(line):
    line_kind, _ = classify_line(line)
    return line_kind in (LINE_KIND_BARE_MARKER, LINE_KIND_LABELLED_MARKER)


def is_modern_bare_marker_line(line):
    return MODERN_BARE_MARKER_PATTERN.match(remove_eol(line)) is not None


def is_paired_empty_blockquote(line):
    line_kind, _ = classify_line(line)
    return line_kind == LINE_KIND_PAIRED_EMPTY_BLOCKQUOTE