---
module: STDJSON
tag: v0.2.0
phase: Phase 2
stable: stable
since: v0.2.0
synopsis: RFC 8259 JSON parser + serialiser
labels:
  - encode
  - lastError
  - parse
  - parseFile
  - type
  - valid
  - valueOf
  - writeFile
errors:
  - U-STDJSON-ENCODE
  - U-STDJSON-PARSE
conformance:
  - tests/conformance/json/
see_also:
  - STDLOG
  - STDSEED
created: 2026-05-05
last_modified: 2026-05-08
revisions: 3
doc_type:
  - REFERENCE
---

# `STDJSON`: RFC 8259 JSON parser + serialiser

Pure-M parser, validator, and serialiser for [RFC 8259] JavaScript Object Notation. Phase 2 module (track L11; target tag `v0.2.0`). Conformance corpus: a curated subset of [JSONTestSuite] vendored at `tests/conformance/json/{y,n,i}/`.

[RFC 8259]: https://datatracker.ietf.org/doc/html/rfc8259
[JSONTestSuite]: https://github.com/nst/JSONTestSuite

## Public API

| Form | Signature | Returns |
|---|---|---|
| Extrinsic | `$$parse^STDJSON(text, .root)` | `1` on success, `0` on failure (caller inspects `$$lastError`). |
| Extrinsic | `$$encode^STDJSON(.root)` | RFC-8259 conformant JSON text. |
| Extrinsic | `$$valid^STDJSON(text)` | `1` iff `text` is a conformant RFC-8259 document. |
| Extrinsic | `$$lastError^STDJSON()` | `""` if last `parse`/`valid`/`parseFile` succeeded, else `"line:col: message"`. |
| Extrinsic | `$$type^STDJSON(.node)` | One of `object`, `array`, `string`, `number`, `true`, `false`, `null`, or `""` if `node` is undefined. |
| Extrinsic | `$$valueOf^STDJSON(.node)` | Scalar value for `s`/`n` leaves (string content or canonical numeric string); `""` for containers and literals. |
| Procedure | `do parseFile^STDJSON(path, .root)` | Streams `path` through `parse`. Same success contract; on failure the partial tree is killed. |
| Procedure | `do writeFile^STDJSON(path, .root)` | Serialises `.root` and writes to `path`. |

`.root` (and any sub-node passed to `$$type` / `$$valueOf`) is a caller-owned local array. `parse` kills `root` before populating it.

## Storage model: sigil-prefixed M tree

Each JSON value occupies one node in the caller's array. The value **at the subscript itself** holds a one-character **type sigil**, optionally followed by `:VALUE` for scalars. Children (object members or array elements) hang at the next subscript level.
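
As a rough illustration of that layout, the sketch below splits a node value into its sigil and scalar payload using only `$EXTRACT`. The labels `split` and `demo` and the variables `sigil` and `payload` are hypothetical names chosen for this example; they are not part of STDJSON, which exposes `$$type` and `$$valueOf` for the same purpose.

```m
 ; Hypothetical helper, NOT part of STDJSON: split one node value into
 ; its sigil and payload, assuming the "sigil[:VALUE]" layout above.
split(node,sigil,payload) ;
 set sigil=$extract(node,1)
 set payload=""
 ; Only scalars carry a payload; it starts after the colon at position 2.
 if $extract(node,2)=":" set payload=$extract(node,3,$length(node))
 quit
 ;
demo ;
 new sigil,payload
 do split("s:Alice",.sigil,.payload)
 write sigil," / ",payload,!   ; s / Alice
 do split("o",.sigil,.payload)
 write sigil," / ",payload,!   ; container: sigil o, empty payload
 quit
```

Because only position 2 is inspected, a string payload that itself contains colons (for example `s:a:b`) comes back whole as `a:b`.
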

| Sigil | JSON type | Storage at this node | Children |
|---|---|---|---|
| `o` | object | sigil only | `node(key)` per member |
| `a` | array | sigil only | `node(i)` for `i = 1..n` |
| `s:` | string | `s:` followed by the unescaped UTF-8 string | none |
| `n:` | number | `n:` followed by the **canonical numeric string** as written in the source (preserves precision) | none |
| `t` | `true` | sigil only | none |
| `f` | `false` | sigil only | none |
| `z` | `null` | sigil only (`z` rather than `n` to avoid colliding with number) | none |

The colon at position 2 of `s:` and `n:` is a fixed delimiter: the value is `$EXTRACT(node, 3, $LENGTH(node))`. Strings whose first byte is `:` round-trip cleanly because only the **first** colon is structural.

### Examples

```m
; -- {"name": "Alice", "age": 30, "kids": null} -----------------------
root         = "o"
root("name") = "s:Alice"
root("age")  = "n:30"
root("kids") = "z"

; -- [1, "two", true, [false]] ---------------------------------------
root      = "a"
root(1)   = "n:1"
root(2)   = "s:two"
root(3)   = "t"
root(4)   = "a"
root(4,1) = "f"

; -- bare scalar at root: 42 -----------------------------------------
root = "n:42"

; -- empty object / empty array --------------------------------------
root = "o"   ; {}: no children
root = "a"   ; []: no children

; -- empty-string key (legal per RFC 8259) ---------------------------
; {"": "value"}
root     = "o"
root("") = "s:value"
```

### Why this shape

- **Subscript-level sigil avoids reserved-key collisions.** A scheme that stored type at `node("@type")` would forbid the JSON key `"@type"`. RFC 8259 places no constraints on key strings, so the storage layer must not either.
- **Container sigil at the parent, not the child.** Lets `$$type` answer in O(1) without scanning children.
- **Numbers as strings preserve full source precision.** A 19-digit integer or a high-precision decimal would lose digits if eagerly coerced to M numeric. The caller does the coercion (`+$$valueOf(...)`) when they want it.
- **`z` for null.** A two-character sigil family (e.g. `nu` for null vs `nu:42` for number) would slow `$$type` and complicate parsing. One free letter (`z`) costs nothing.

## Parsing

`parse` is a single-pass token-based recursive descent. The cursor tracks `(offset, line, col)` for error reporting; `line` and `col` are 1-based.

Tokens: `{ } [ ] , : STRING NUMBER true false null`. Whitespace is the four bytes specified in RFC 8259 §2 (`%x20 %x09 %x0A %x0D`).

### String parsing

The eight two-character escapes `\\ \" \/ \b \f \n \r \t` and the six-character escape `\uXXXX` are honoured. Surrogate pairs (`\uD800–\uDBFF` followed by `\uDC00–\uDFFF`) decode to a single codepoint and are emitted as UTF-8 bytes. Bare control characters (U+0000–U+001F) inside a string are rejected per §7.

### Number parsing

The full RFC 8259 §6 grammar:

```
number = [ minus ] int [ frac ] [ exp ]
int    = zero / ( digit1-9 *DIGIT )   ; no leading zeros except "0"
frac   = decimal-point 1*DIGIT
exp    = e [ sign ] 1*DIGIT
```

The token is captured verbatim from the source and stored as `n:<verbatim>`. No round-trip through a numeric type happens during parse, so `{"x": 1234567890123456789}` survives intact.

### Errors

Parse errors set `$ECODE = ",U-STDJSON-PARSE,"` and stash a human-readable message at `^STDLIB($job,"stdjson","err")`.
Format:

```
line:col: <reason>
```

Reasons are stable strings (used by tests):

| Code-style reason | When |
|---|---|
| `unexpected character '<c>'` | a byte that does not start any token at this position |
| `unterminated string` | EOF inside a `"…"` |
| `bad escape '\<c>'` | escape sequence that is not one of the documented eight or `\u` |
| `bad \u escape` | `\u` not followed by four hex digits |
| `lone surrogate` | high or low surrogate without its mate |
| `unescaped control character` | byte 0x00–0x1F appears inside a string literal |
| `bad number` | digit sequence violates RFC 8259 §6 (leading zero, lone decimal, missing exponent digits, …) |
| `expected ':' after key` | object body has key but no `:` |
| `expected ',' or '}'` | object members not separated by `,` or terminated by `}` |
| `expected ',' or ']'` | array elements not separated by `,` or terminated by `]` |
| `unexpected EOF` | document ends before the value is complete |
| `trailing garbage` | non-whitespace bytes after the top-level value |

`$$lastError` returns the message. Callers of `$$parse` should check for the `0` return value; callers that invoke a procedure with `do` (such as `parseFile`) should check `$ECODE'=""`.

## Implementation-defined behaviour (`i_*` corpus)

RFC 8259 leaves several edge cases to the implementation. STDJSON's choices are documented here so the `i/` conformance loop in `STDJSONTST.m` can assert them explicitly.

| Vector | STDJSON behaviour |
|---|---|
| `i_number_int_overflow_64.json` (20-digit int) | **Accept.** Stored as canonical numeric string; no precision lost. Caller decides how to coerce. |
| `i_number_huge_exp.json` / `i_number_tiny_exp.json` | **Accept.** Same string-storage rationale. |
| `i_object_duplicate_key.json` (`{"a":1,"a":2}`) | **Last wins.** M arrays are intrinsically last-wins on key collision; STDJSON does not detect or warn. |
| `i_string_embedded_null_escape.json` (`"\u0000"`) | **Accept.** Decoded to a single `$CHAR(0)` byte. M strings can hold any byte value 0–255, but downstream consumers (notably `WRITE` to a non-binary device) may behave oddly. |
| `i_string_lone_high_surrogate.json` / `i_string_lone_low_surrogate.json` | **Reject.** RFC 8259 §8.2 notes that the behaviour of software receiving such strings is unpredictable; STDJSON treats lone surrogates as a parse error (`lone surrogate`). |
| `i_string_invalid_utf8.json` | **Accept.** STDJSON does not validate the source byte stream for UTF-8 well-formedness; bytes pass through. (Producers who care should validate at the boundary.) |

## Encoding

`encode` walks the M tree top-down and emits canonical text:

- Objects: `{` `key1` `:` `val1` `,` … `}`. Member order is M's collation order on subscripts (numerics first, then strings in byte order). RFC 8259 §4 leaves member order to the implementation.
- Arrays: `[` `e1` `,` … `]`. Indices `1..n` are walked in order; any gap (e.g. defined `node(1)` and `node(3)` with no `node(2)`) is a programmer error and raises `$ECODE = ",U-STDJSON-ENCODE,"`. The encoder will not silently invent a `null` to fill the gap.
- Strings: re-escape the eight standard escapes; bytes 0x00–0x1F that lack a named escape become `\u00XX`. Bytes 0x7F–0xFF pass through (caller is assumed to have a UTF-8 string).
- Numbers: emitted verbatim from `n:` storage (canonical form preserved from source).
- `true` / `false` / `null`: emitted as the bare literal.

`encode` produces no whitespace. A pretty-printer is out of scope for v0.2.0.

## Examples

```m
; --- parse → tree → encode round-trip --------------------------------
new root,text,out
set text="{""name"":""Alice"",""tags"":[1,""two"",true]}"
if '$$parse^STDJSON(text,.root) write "fail: ",$$lastError^STDJSON(),! quit
write $$type^STDJSON(.root),!               ; object
write $$type^STDJSON(.root("name")),!       ; string
write $$valueOf^STDJSON(.root("name")),!    ; Alice
write $$valueOf^STDJSON(.root("tags",1)),!  ; 1
set out=$$encode^STDJSON(.root)
write out,!                                 ; canonical re-emit

; --- streaming from disk --------------------------------------------
new tree
do parseFile^STDJSON("/tmp/payload.json",.tree)
if $ecode'="" write "parse error: ",$$lastError^STDJSON(),! quit

; --- build a tree by hand and write it ------------------------------
new t
set t="o"
set t("greeting")="s:hello"
set t("count")="n:3"
set t("ok")="t"
do writeFile^STDJSON("/tmp/out.json",.t)
```

## Storage namespace

Process-scoped under `^STDLIB($job,"stdjson",...)`:

| Subscript | Contents |
|---|---|
| `^STDLIB($job,"stdjson","err")` | Last error message (`"line:col: …"`), cleared on successful `parse`/`valid`. |

No cross-process state. No persistent globals.

## Lint suppression

STDJSON's parser uses mixed-case local variable names (`pos`, `line`, `col`, `outBuf`, `seenKey`, …) following modern camelCase style. `M-XINDX-057` is downgraded to `INFO` for this file in `.m-cli.toml` (plan §6.3), the same exemption granted to STDFMT for the same reason.

## See also

- [`STDLOG`](stdlog.md): gains JSON-line output (`emitJsonLine^STDLOG`) once STDJSON is on the load path. (Follow-on track in v0.2.x.)
- [`STDSEED`](stdseed.md): `LOADJSON^STDSEED` add-on unblocks once STDJSON ships (currently raises `U-STDSEED-NOT-IMPLEMENTED`).
- [JSONTestSuite README](../../tests/conformance/json/README.md): vendored corpus contract.
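
Returning to the storage model and the member-order note under Encoding: the sketch below dumps every node of a hand-built tree in the order `$QUERY` visits them, which is M's subscript collation order. The labels `dump` and `demo` are hypothetical and not part of STDJSON. The tree is addressed by name through indirection, so the code relies only on standard M.

```m
 ; Hypothetical helper, NOT part of STDJSON: print each defined node of
 ; the local array whose name is passed as a string (e.g. "t").
dump(name) ;
 new ref
 set ref=name
 for  quit:ref=""  do
 . ; $DATA(...)#2 is 1 when the node itself holds a value (the sigil).
 . if $data(@ref)#2 write ref,"=",@ref,!
 . ; $QUERY returns the next node reference, or "" after the last one.
 . set ref=$query(@ref)
 quit
 ;
demo ;
 new t
 set t="o",t("greeting")="s:hello",t("count")="n:3",t("ok")="t"
 do dump("t")
 quit
```

Running `demo` lists the root first and then the members with `count` before `greeting` before `ok`, regardless of the order in which they were set. Under the collation rule stated in the Encoding section, `encode` would be expected to emit the members in that same order.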