Skip to content

Latest commit

 

History

History
564 lines (473 loc) · 21.1 KB

File metadata and controls

564 lines (473 loc) · 21.1 KB

MinZ Iterator Implementation Status

Updated: 2026-03-02 — v0.19.5 (post register allocator overhaul)

E2E Verified (MZX --console-io)

Full pipeline: MinZ source -> parser -> semantic -> MIR optimizer -> fusion optimizer -> Z80 codegen -> asm peephole -> MZA -> MZX emulator. All 11 tests PASS with hex-verified output.

Test Chain Expected Hex Status
iter_foreach arr.forEach(console_log) ABCDE\n 41424344450a PASS
iter_take arr.iter().take(3).forEach(console_log) ABC\n 4142430a PASS
iter_skip arr.iter().skip(2).forEach(console_log) CDE\n 4344450a PASS
iter_map_foreach arr.iter().map(double).forEach(console_log) doubled bytes 020406080a0a PASS
iter_filter_foreach arr.filter(is_big).forEach(console_log) DE\n 44450a PASS
iter_inline_filter arr.filter(|x| x > 67).forEach(console_log) DE\n 44450a PASS
iter_lambda_map arr.iter().map(|x| x+1).forEach(console_log) BCDEF\n 42434445460a PASS
iter_map_filter_foreach arr.map(double).filter(|x|x>5).forEach(log) 6,8,10,\n 06080a0a PASS
iter_filter_map_foreach arr.filter(|x|x>=67).map(add_one).forEach(log) DEF\n 4445460a PASS
iter_take_map_foreach arr.take(3).map(add_one).forEach(log) BCD\n 4243440a PASS
iter_bare_djnz arr.filter(|x| x > 3).forEach(console_log) \x04\x05\n 04050a PASS

Run: bash examples/e2e_iterators/run_e2e.sh


Performance After Register Allocator Overhaul

Measured via exact T-state counting in the Z80 emulator (E2E harness, regalloc_quality_test.go).

Pattern Before (v0.19.4) After (v0.19.5) Ideal Speedup
forEach(call) ~207T/elem 26T/elem ~43T 7.8x
map(fn).forEach(call) ~280T/elem 36T/elem ~60T ~7.6x
filter(|x|x>N).forEach(call) ~280T/elem 26T/elem ~30T ~10x

The per-element costs are now below the ideal hand-written target from the original plan (64T).

What the overhaul changed (6 phases)

  1. Hint-aware allocation — allocator respects RegHintB (counter), RegHintHL (pointer), RegHintA (element)
  2. Dynamic A/B trackingloadToA()/loadToB() skip loads when the register already has the value
  3. Memory round-trip elimination — POP restores tracking, void returns don't store, INC HL stays tracked
  4. DJNZ with BC save/restore — PUSH BC/POP BC around calls, enabling bare DJNZ
  5. Call-clobber-aware invalidation — per-function ModifiedRegisters bitmask; only clear what callee actually clobbers
  6. Cycle-count regression tests — 3 deterministic tests with thresholds

See Register Allocator Overhaul Report for full before/after assembly and pipeline walkthrough.


What the Compiler Actually Produces (v0.19.5, post-overhaul)

All listings below are real compiler output — generated by mz --dump-mir and mz -o file.a80.


1. forEach(call) — 132T total, 26T/element

MinZ source:

fun console_log(c: u8) -> void {
    asm {
        OUT ($23), A
    }
}

fun main() -> void {
    let arr: [u8; 5] = [65, 66, 67, 68, 69];
    arr.forEach(console_log);
    asm {
        DI
        HALT
    }
}

MIR (--dump-mir):

Function listing_foreach.console_log$u8(c: u8) -> void
  @smc
  Instructions:
      0: asm { OUT ($23), A }
      1: return

Function listing_foreach.main() -> void
  @smc
  Locals:
    r1 = arr: [5]u8
  Instructions:
      0: r2 = array_literal ; Array literal [[65 66 67 68 69]] -> DB directive
      1: r8 = load arr
      2: nop ; DJNZ OPTIMIZED LOOP for array[5]
      3: r9 = 5 ; DJNZ counter = 5
      4: r10 = r8 ; Pointer to array start
      5: djnz_loop_1:
      6: r11 = *r10 ; Load element via pointer
      7: push r10 ; Save array pointer before operations
      8: push r9 ; Save DJNZ counter before operations (BC preserved across CALL)
      9: r12 = call listing_foreach.console_log$u8
     10: r9 = pop ; Restore DJNZ counter after operations
     11: r10 = pop ; Restore array pointer after operations
     12: r10++ ; Advance to next element
     13: djnz r9, djnz_loop_1 ; DJNZ - decrement and loop
     14: asm { DI / HALT }
     15: return

Z80 assembly (loop body only, -o file.a80):

listing_foreach_main_djnz_loop_1_i1:
    ; Load element via pointer
    LD HL, ($F014)    ; Virtual register 10 from memory
    LD A, (HL)
    LD ($F016), A     ; Virtual register 11 to memory
    ; Save array pointer before operations
    ; Register 10 already in HL (tracked)
    PUSH HL       ; Save array pointer before operations
    ; Save DJNZ counter before operations (BC preserved across CALL)
    LD HL, ($F012)    ; Virtual register 9 from memory
    PUSH HL       ; Save DJNZ counter before operations (BC preserved across CALL)
    ; Call listing_foreach.console_log_u8
    ; Register-based parameter passing
    LD A, ($F016)     ; Virtual register 11 from memory
    ; Parameter c in A
    CALL listing_foreach_console_log_u8
    LD ($F018), HL    ; Virtual register 12 to memory
    ; Restore DJNZ counter after operations (BC preserved across CALL)
    POP HL        ; Restore DJNZ counter
    LD ($F012), HL    ; Virtual register 9 to memory
    ; Restore array pointer after operations
    POP HL        ; Restore array pointer
    LD ($F014), HL    ; Virtual register 10 to memory
    ; Advance to next element
    LD HL, ($F014)    ; Virtual register 10 from memory
    INC HL
    LD ($F014), HL    ; Virtual register 10 to memory
    ; DJNZ - decrement and loop
    LD A, ($F012)     ; Virtual register 9 from memory
    LD B, A
    DEC B
    LD A, B
    LD ($F012), A     ; Virtual register 9 to memory
    JR NZ, listing_foreach_main_djnz_loop_1_i1

; console_log — 2 instructions, 0 memory round-trips:
listing_foreach_console_log_u8:
    OUT ($23), A
    RET

2. map(fn).forEach(call) — 184T total, 36T/element

MinZ source:

fun add_one(x: u8) -> u8 {
    return x + 1;
}

fun console_log(c: u8) -> void {
    asm {
        OUT ($23), A
    }
}

fun main() -> void {
    let arr: [u8; 5] = [65, 66, 67, 68, 69];
    arr.map(add_one).forEach(console_log);
    asm {
        DI
        HALT
    }
}

MIR (--dump-mir) — note add_one is fused inline, no OpCall for it:

Function listing_map.add_one$u8(x: u8) -> u8
  @smc
  Instructions:
      0: r2 = param x
      1: r3 = 1
      2: r4 = r2 + r3
      3: return r4

Function listing_map.main() -> void
  @smc
  Locals:
    r1 = arr: [5]u8
  Instructions:
      0: r2 = array_literal ; Array literal [[65 66 67 68 69]] -> DB directive
      1: r8 = load arr
      2: nop ; DJNZ OPTIMIZED LOOP for array[5]
      3: r9 = 5 ; DJNZ counter = 5
      4: r10 = r8 ; Pointer to array start
      5: djnz_loop_1:
      6: r11 = *r10 ; Load element via pointer
      7: push r10 ; Save array pointer before operations
      8: push r9 ; Save DJNZ counter before operations
      9: r1 = r11 ; Fused: param x <- r11
     10: r2 = 1 ; Fused from listing_map.add_one$u8
     11: r3 = r1 + r2 ; Fused from listing_map.add_one$u8
     12: r12 = r3 ; Fused: return value
     13: r13 = call listing_map.console_log$u8
     14: r9 = pop ; Restore DJNZ counter
     15: r10 = pop ; Restore array pointer
     16: r10++ ; Advance to next element
     17: djnz r9, djnz_loop_1
     18: asm { DI / HALT }
     19: return

Z80 assembly (loop body only):

listing_map_main_djnz_loop_1_i1:
    ; Load element via pointer
    LD HL, ($F014)    ; Virtual register 10 from memory
    LD A, (HL)
    LD ($F016), A     ; Virtual register 11 to memory
    ; Save array pointer before operations
    ; Register 10 already in HL (tracked)
    PUSH HL       ; Save array pointer before operations
    ; Save DJNZ counter before operations (BC preserved across CALL)
    LD HL, ($F012)    ; Virtual register 9 from memory
    PUSH HL       ; Save DJNZ counter before operations
    ; Fused from listing_map.add_one_u8
    LD A, 1
    LD ($F004), A     ; Virtual register 2 to memory
    ; Fused from listing_map.add_one_u8
    LD HL, ($F016)    ; Virtual register 11 from memory
    LD D, H
    LD E, L
    LD HL, 1      ; Constant
    ADD HL, DE
    LD ($F006), HL    ; Virtual register 3 to memory
    ; Fused: return value
    LD HL, ($F006)    ; Virtual register 3 from memory
    LD ($F018), HL    ; Virtual register 12 to memory
    ; Call listing_map.console_log_u8
    ; Register-based parameter passing
    LD A, ($F018)     ; Virtual register 12 from memory
    ; Parameter c in A
    CALL listing_map_console_log_u8
    LD ($F01A), HL    ; Virtual register 13 to memory
    ; Restore DJNZ counter after operations
    POP HL
    LD ($F012), HL
    ; Restore array pointer after operations
    POP HL
    LD ($F014), HL
    ; Advance to next element
    LD HL, ($F014)    ; Virtual register 10 from memory
    INC HL
    LD ($F014), HL    ; Virtual register 10 to memory
    ; DJNZ - decrement and loop
    LD A, ($F012)     ; Virtual register 9 from memory
    LD B, A
    DEC B
    LD A, B
    LD ($F012), A     ; Virtual register 9 to memory
    JR NZ, listing_map_main_djnz_loop_1_i1

; add_one — dead code (fused inline), still emitted:
listing_map_add_one_u8:
    LD A, #00      ; Parameter x (gets patched)
    ...
    RET

Note: add_one is fused into the loop (lines 9-12 in MIR, "Fused from" comments in asm). But the original function definition is still emitted as dead code (lines 89-109 in full output). The fusion uses 16-bit ADD HL, DE for what should be INC A — a known u8-as-u16 codegen issue.


3. filter(lambda).forEach(call) — 132T total, 26T/element

MinZ source:

fun console_log(c: u8) -> void {
    asm {
        OUT ($23), A
    }
}

fun main() -> void {
    let arr: [u8; 5] = [65, 66, 67, 68, 69];
    arr.filter(|x| x > 66).forEach(console_log);
    asm {
        DI
        HALT
    }
}

MIR (--dump-mir) — note inline filter at instruction 10:

Function listing_filter.main() -> void
  @smc
  Locals:
    r1 = arr: [5]u8
  Instructions:
      0: r2 = array_literal ; Array literal [[65 66 67 68 69]] -> DB directive
      1: r8 = load arr
      2: nop ; DJNZ OPTIMIZED LOOP for array[5]
      3: r9 = 5 ; DJNZ counter = 5
      4: r10 = r8 ; Pointer to array start
      5: djnz_loop_1:
      6: r11 = *r10 ; Load element via pointer
      7: push r10 ; Save array pointer before operations
      8: push r9 ; Save DJNZ counter before operations
      9: r12 = 67 ; Inline filter constant = 67
     10: jump_if_flag CY -> filter_continue_2 ; Inline filter: CP 67 + JR CY
     11: r13 = call listing_filter.console_log$u8
     12: filter_continue_2:
     13: r9 = pop ; Restore DJNZ counter
     14: r10 = pop ; Restore array pointer
     15: r10++ ; Advance to next element
     16: djnz r9, djnz_loop_1
     17: asm { DI / HALT }
     18: return

Z80 assembly (loop body only):

listing_filter_main_djnz_loop_1_i1:
    ; Load element via pointer
    LD HL, ($F014)    ; Virtual register 10 from memory
    LD A, (HL)
    LD ($F016), A     ; Virtual register 11 to memory
    ; Save array pointer before operations
    ; Register 10 already in HL (tracked)
    PUSH HL       ; Save array pointer before operations
    ; Save DJNZ counter before operations (BC preserved across CALL)
    LD HL, ($F012)    ; Virtual register 9 from memory
    PUSH HL       ; Save DJNZ counter before operations
    ; Inline filter constant = 67
    LD A, 67
    LD ($F018), A     ; Virtual register 12 to memory
    ; Inline filter: CP 67 + JR CY
    LD A, ($F016)     ; Virtual register 11 from memory
    CP 67
    JR C, listing_filter_main_filter_continue_2_i1
    ; Call listing_filter.console_log_u8
    ; Register 11 already in A (tracked)     <-- dynamic tracking!
    ; Parameter c in A
    CALL listing_filter_console_log_u8
    LD ($F01A), HL    ; Virtual register 13 to memory
    ; filter_continue_2:
listing_filter_main_filter_continue_2_i1:
    ; Restore DJNZ counter after operations
    POP HL
    LD ($F012), HL
    ; Restore array pointer after operations
    POP HL
    LD ($F014), HL
    ; Advance to next element
    LD HL, ($F014)    ; Virtual register 10 from memory
    INC HL
    LD ($F014), HL    ; Virtual register 10 to memory
    ; DJNZ - decrement and loop
    LD A, ($F012)     ; Virtual register 9 from memory
    LD B, A
    DEC B
    LD A, B
    LD ($F012), A     ; Virtual register 9 to memory
    JR NZ, listing_filter_main_djnz_loop_1_i1

Key: |x| x > 66 becomes CP 67; JR C, continue (no function call). The CP instruction sets carry if A < 67, which is the inverse of x > 66. After CP, A still holds the element, so the dynamic tracking comment "Register 11 already in A" means no reload is needed for the console_log CALL argument.


What IS genuinely zero-cost

  • take(n) counter folding — LD B, n instead of LD B, array.length
  • skip(n) pointer offset — ADD HL, DE at init, no skip loop
  • Inline filter lambda — CP N + JR C instead of CALL + OR A + JR Z
  • Fusion inlining — eliminates CALL/RET for simple callbacks
  • ADD A, A strength reduction — x * 2 is 4T instead of 8T (SLA A)

Operation Status

Operation Parser Semantic Z80 E2E Notes
forEach(fn) pass pass correct PASS PUSH/POP HL+BC around CALL
take(n) pass pass correct PASS Counter folded to LD B, n
skip(n) pass pass correct PASS Pointer offset + counter adjust
map(fn) pass pass correct PASS Fusion inlines simple fns
filter(fn) pass pass correct PASS Named predicate + forEach
filter(|x| x>N) pass pass correct PASS Inline CP N+1 + JR C
map(|x| x+N) pass pass correct PASS Lambda inlined
map+filter+forEach pass pass correct PASS Multi-stage chain
filter+map+forEach pass pass correct PASS Multi-stage chain
take+map+forEach pass pass correct PASS Multi-stage chain
peek(fn) pass pass compiles untested Same path as forEach
inspect(fn) pass pass compiles untested Alias for peek
takeWhile(fn) pass pass compiles untested Stack hazard — early exit after PUSH
enumerate() pass pass compiles BROKEN B used for both counter and index
reduce(fn) pass pass compiles BROKEN Two LD A — second overwrites first
skipWhile(fn) pass pass N/A N/A Not in DJNZ mode

Known Bugs

CRITICAL: enumerate uses B for counter AND index

    LD B, 5           ; B = loop counter
.loop:
    LD A, B
    INC A
    LD B, A           ; B = index (now 6?!)
    ; ...
    DEC B             ; decrement... what? counter? index?
    JR NZ, .loop      ; near-infinite loop

B cannot serve two purposes. Needs separate index register (C or memory-backed).

CRITICAL: reduce overwrites A with second param

sum_u8_u8:
    LD A, 0           ; SMC: acc (will be patched)
    LD A, 0           ; SMC: x — OVERWRITES acc!
    ADD HL, DE        ; 16-bit add on unloaded registers
    RET

Two TRUE SMC loads both target A. Second wipes first.

takeWhile stack hazard

If predicate fails and loop exits early, PUSH HL has no matching POP. Stack corrupted by 2 bytes per iteration until exit. Needs POP HL before exit jump.

Dead code: inlined functions still emitted

When fusion optimizer inlines a callback (e.g., double), the function definition is still emitted in the output. Wastes bytes but doesn't affect correctness.

Orphan PUSH/POP after fusion

Semantic layer emits PUSH/POP HL based on hasCallOps at IR time. Fusion optimizer removes CALLs after PUSH/POP are already in the instruction stream. Result: PUSH/POP with no CALL between them — pure waste (21T/iteration).


Remaining Optimization Opportunities

Priority 1: Fix enumerate and reduce (Bug G)

  • Separate loop counter from enumerate index (different registers)
  • Fix reduce: two SMC params can't both target A
  • Add E2E tests for both

Priority 2: Strip orphan PUSH/POP after fusion

  • Fusion optimizer should remove PUSH/POP when it eliminates all CALLs
  • Enables true bare DJNZ path (currently unreachable in E2E tests)

Priority 3: Fix takeWhile stack hazard

  • Emit POP HL before early-exit jump

Priority 4: Trust physical allocations fully

  • The allocator assigns registers but codegen still stores to $F0xx defensively
  • When pointer IS in HL, don't store to $F014 after every use
  • When counter IS in B, don't round-trip through $F012

Priority 5: 8-bit arithmetic for u8 values

  • Fused add_one currently uses 16-bit ADD HL, DE instead of INC A
  • Type-aware codegen should use 8-bit paths for u8

Priority 6: Dead code elimination for inlined functions

  • After fusion inlines a function, mark it unused if no other callers
  • Remove from output

Architecture

Compilation Pipeline

Source -> Parser -> Semantic -> MIR -> Optimizer -> Z80 Codegen -> Asm Peephole -> MZA
           |          |          |        |              |
           |          |          |        |              +-- OpDJNZ -> DEC B + JR NZ
           |          |          |        |                  (or bare DJNZ if no CALL)
           |          |          |        +-- Fusion: inline callbacks in DJNZ loops
           |          |          |        +-- DCE, copy prop, constant fold
           |          |          +-- OpLoad, OpCall, OpDJNZ, OpJumpIfFlag
           |          +-- generateDJNZIteration(): counter + pointer + PUSH/POP
           +-- tryConvertIteratorChain(): method chain -> IteratorChainExpr

Key Files

File Role
pkg/ast/iterator.go IteratorChainExpr, IteratorOp (Function + Argument fields)
pkg/parser/participle/convert.go tryConvertIteratorChain() — CallExpr -> IteratorChainExpr
pkg/semantic/iterator.go Type checking, DJNZ/indexed dispatch, lambda extraction
pkg/semantic/iterator_enhanced.go Enhanced DJNZ with take/skip/enumerate/reduce
pkg/codegen/z80.go Z80 codegen + register tracking (~5500 lines)
pkg/codegen/register_allocator.go Hint-aware linear scan allocation (~500 lines)
pkg/codegen/register_allocator_test.go Allocation unit tests
pkg/optimizer/fusion.go Fusion optimizer — DJNZ loop detection + callback inlining
pkg/optimizer/assembly_peephole.go Asm peephole (pattern 48: DEC B + JR NZ -> DJNZ)
pkg/optimizer/dead_code_elimination.go DCE — fixed OpJumpIfFlag/OpPush marking
pkg/z80testing/regalloc_quality_test.go Cycle-count regression tests

Test Suite (90+ tests across 8 layers)

Package Tests What
examples/e2e_iterators/ 11 Full pipeline E2E: compile -> assemble -> emulate -> hex verify
tests/iterator_corpus/ 18 Full pipeline regression — all 18 compile to Z80
pkg/optimizer/ 19 Peephole, DCE, superoptimizer, normalization, fusion (7)
pkg/parser/participle/ 20 Chain conversion, argument routing
pkg/semantic/ 17 DJNZ eligibility, filter/map/lambda IR
pkg/mirvm/ 8 DJNZ execution at MIR level
pkg/codegen/ 5+ Z80 assembly patterns + register allocator tests
pkg/z80testing/ 3 Cycle-count regression tests (deterministic)

Bugs Fixed (Historical)

Version Bug Fix
v0.19.4 HL clobbered by CALL return value PUSH/POP HL around CALL blocks
v0.19.4 TRUE SMC anchor: LD A,(nn) instead of LD A,N Changed to opcode 3E (patchable immediate)
v0.19.4 MIR peephole read inst.Value not inst.Imm Fixed trackConstants() to use inst.Imm
v0.19.4 prepareCallArguments() skipped all SMC Changed to only skip TRUE SMC (UsesTrueSMC)
v0.19.4 Lambda NumParams=0 after extraction Use AddParamWithRegister()
v0.19.4 Inliner assumed params at regs 1,2,... Use fn.Params[i].Reg
v0.19.4 Copy propagation across labels Clear copy map at every OpLabel
v0.19.5 DCE removed filter constant (OpJumpIfFlag) Added OpJumpIfFlag to Phase 1 marking
v0.19.5 BareDJNZ hint + peephole pattern 48 DEC B; JR NZ -> DJNZ when no CALL
v0.19.5 findSymbol non-determinism Prefer shortest match (function entry, not sub-labels)
v0.19.5 Register allocator overhaul (6 phases) Hint-aware, dynamic tracking, call-clobber-aware

References