|
| 1 | +# 🔬 How MyLang Works — Interpreter Internals |
| 2 | + |
| 3 | +This document explains how the interpreter transforms your code into output, step by step. Even if you've never built a language before, you'll understand the full pipeline by the end. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## The Big Picture |
| 8 | + |
| 9 | +When you write `2 + 3.`, the interpreter runs it through **three pipeline stages** (tokenizer, parser, evaluator), backed by a scope system covered below as Stage 4, to produce `5`: |
| 10 | + |
| 11 | +``` |
| 12 | +Source Code  →  [Tokenizer]  →  [Parser]  →  [Evaluator]  →  Output |
| 13 | + "2 + 3."         Tokens         AST tree     Walks tree       "5" |
| 14 | +``` |
| 15 | + |
| 16 | +Let's walk through each stage. |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +## Stage 1: Tokenizer (Lexer) |
| 21 | + |
| 22 | +**File**: `me_doingIt.cpp` → `class Tokenizer` |
| 23 | + |
| 24 | +The tokenizer reads raw text character by character and breaks it into **tokens** — small meaningful pieces. Think of it like breaking a sentence into words. |
| 25 | + |
| 26 | +### Example |
| 27 | + |
| 28 | +Input: `var x = 10 + 3.` |
| 29 | + |
| 30 | +Tokens produced: |
| 31 | + |
| 32 | +``` |
| 33 | +[KeywordVar: "var"] [Identifier: "x"] [Equals: "="] [Number: "10"] |
| 34 | +[Operator: "+"] [Number: "3"] [Dot: "."] [Eof] |
| 35 | +``` |
| 36 | + |
| 37 | +### How It Works |
| 38 | + |
| 39 | +The tokenizer uses a `while` loop that walks through the source string one character at a time: |
| 40 | + |
| 41 | +``` |
| 42 | +Position: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 |
| 43 | +Source:   v  a  r  ␣  x  ␣  =  ␣  1  0  ␣  +  ␣  3  .    (␣ = space) |
| 44 | +``` |
| 45 | + |
| 46 | +For each character it asks: |
| 47 | + |
| 48 | +1. **Is it a digit?** → Keep reading digits to form a `Number` token (`10`) |
| 49 | +2. **Is it a letter?** → Keep reading letters to form a word, then check: |
| 50 | + - Is it a keyword (`var`, `fn`, `if`, `while`, etc.)? → Keyword token |
| 51 | + - Is it a logical operator (`and`, `or`, `not`)? → Operator token |
| 52 | + - Otherwise? → `Identifier` token (a variable/function name) |
| 53 | +3. **Is it a symbol?** → Match single- or double-character operators (`+`, `==`, `&&`, `<=`, etc.) |
| 54 | +4. **Is it `-->`?** → Skip everything until `<--` (comment) |
| 55 | +5. **Whitespace?** → Skip it |
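
The checks above can be sketched as a single loop. This is a minimal illustration, not the actual `Tokenizer` class: the `Token` struct, the string-valued `kind` field, the `"Symbol"` kind, and the short keyword list are all assumptions made for the example, and decimal numbers are left out here because the dot handling is covered in the next subsection. The comment check runs before the symbol check so that `-->` is not mistaken for a `-` operator.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical token shape; the real Tokenizer may use an enum and track positions.
struct Token {
    std::string kind;  // "Keyword", "Identifier", "Number", "Operator", "Equals", "Dot", "Symbol", "Eof"
    std::string text;
};

std::vector<Token> tokenize(const std::string& src) {
    // Assumed subsets, taken from the examples in this document.
    static const std::set<std::string> keywords = {"var", "fn", "if", "while"};
    static const std::set<std::string> logicalWords = {"and", "or", "not"};
    static const std::set<std::string> twoCharOps = {"==", "!=", "<=", ">=", "&&", "||"};
    static const std::string oneCharOps = "+-*/<>!";

    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }                 // whitespace: skip
        if (src.compare(i, 3, "-->") == 0) {                    // comment: skip to "<--"
            const std::size_t end = src.find("<--", i + 3);
            i = (end == std::string::npos) ? src.size() : end + 3;
            continue;
        }
        if (std::isdigit(c)) {                                  // number: keep reading digits
            const std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({"Number", src.substr(start, i - start)});
            continue;
        }
        if (std::isalpha(c)) {                                  // word: keep reading letters
            const std::size_t start = i;
            while (i < src.size() && std::isalpha(static_cast<unsigned char>(src[i]))) ++i;
            const std::string word = src.substr(start, i - start);
            if (keywords.count(word)) out.push_back({"Keyword", word});
            else if (logicalWords.count(word)) out.push_back({"Operator", word});
            else out.push_back({"Identifier", word});
            continue;
        }
        if (c == '.') { out.push_back({"Dot", "."}); ++i; continue; }
        const std::string two = src.substr(i, 2);               // try two characters first
        if (twoCharOps.count(two)) { out.push_back({"Operator", two}); i += 2; continue; }
        if (c == '=') { out.push_back({"Equals", "="}); ++i; continue; }
        const bool isOp = oneCharOps.find(src[i]) != std::string::npos;
        out.push_back({isOp ? "Operator" : "Symbol", std::string(1, src[i])});
        ++i;
    }
    out.push_back({"Eof", ""});
    return out;
}
```

Calling `tokenize("var x = 10 + 3.")` on this sketch yields the same eight tokens as the example earlier in this section, ending with `Dot` and `Eof`.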
| 56 | + |
| 57 | +### Important Detail: The Dot Ambiguity |
| 58 | + |
| 59 | +The character `.` serves **two purposes**: |
| 60 | + |
| 61 | +- **Statement terminator**: `x.` means "end of statement, print x" |
| 62 | +- **Decimal point**: `3.14` is a floating-point number |
| 63 | + |
| 64 | +The tokenizer resolves this by checking: _Is the next character after `.` a digit?_ If yes → it's part of a decimal number. If no → it's a terminator. |
| 65 | + |
| 66 | +``` |
| 67 | +"3.14." → [Number: "3.14"] [Dot: "."] |
| 68 | +"10." → [Number: "10"] [Dot: "."] |
| 69 | +``` |
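
A minimal sketch of that one-character lookahead, assuming a free function named `scanNumber` (a hypothetical name; the real tokenizer may inline this logic). The `.` is consumed only when a digit follows it, so a trailing dot is left in place for the caller to emit as a `Dot` token.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Reads a number starting at src[pos] and advances pos past it.
// A '.' is treated as a decimal point only if the next character is a digit.
std::string scanNumber(const std::string& src, std::size_t& pos) {
    auto digitAt = [&src](std::size_t k) {
        return k < src.size() && std::isdigit(static_cast<unsigned char>(src[k])) != 0;
    };
    const std::size_t start = pos;
    while (digitAt(pos)) ++pos;                       // integer part
    if (pos < src.size() && src[pos] == '.' && digitAt(pos + 1)) {
        ++pos;                                        // decimal point
        while (digitAt(pos)) ++pos;                   // fractional part
    }
    return src.substr(start, pos - start);            // a trailing '.' is not consumed
}
```

For `"3.14."` this returns `"3.14"` and leaves `pos` on the final dot; for `"10."` it returns `"10"` and leaves `pos` on the dot.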
| 70 | + |
| 71 | +--- |
| 72 | + |
| 73 | +## Stage 2: Parser |
| 74 | + |
| 75 | +**File**: `me_doingIt.cpp` → `class Parser` |
| 76 | + |
| 77 | +The parser reads the flat list of tokens and builds a **tree structure** called an **AST** (Abstract Syntax Tree). This tree represents the logical structure of your program. |
| 78 | + |
| 79 | +### What the Parser Produces |
| 80 | + |
| 81 | +For this code: |
| 82 | + |
| 83 | +``` |
| 84 | +if x > 10: |
| 85 | + x. |
| 86 | +; |
| 87 | +``` |
| 88 | + |
| 89 | +The parser creates this tree: |
| 90 | + |
| 91 | +``` |
| 92 | +IfStmt |
| 93 | +├── condition: Expression [x > 10] |
| 94 | +├── then-block: BlockStmt |
| 95 | +│   └── ExprStmt |
| 96 | +│       └── Expression [x] |
| 97 | +└── else-block: (none) |
| 98 | +``` |
| 99 | + |
| 100 | +### Statement Types |
| 101 | + |
| 102 | +The parser knows how to recognize these statement patterns: |
| 103 | + |
| 104 | +| Statement | Pattern | Produced Node | |
| 105 | +| -------------------- | ---------------------------------------- | --------------------------------- | |
| 106 | +| Variable declaration | `var NAME = EXPR.` | `AssignStmt` (isDeclaration=true) | |
| 107 | +| Assignment | `NAME = EXPR.` | `AssignStmt` | |
| 108 | +| Function definition | `fn NAME @(PARAMS): BODY ;` | `FunctionDefStmt` | |
| 109 | +| If/elif/else | `if EXPR: BODY ; [elif...] [else...]` | `IfStmt` | |
| 110 | +| While loop | `while EXPR: BODY ;` | `WhileStmt` | |
| 111 | +| For loop | `for NAME in range(from X to Y): BODY ;` | `ForStmt` | |
| 112 | +| Return | `give(EXPR).` | `ReturnStmt` | |
| 113 | +| Pass | `pass.` | `PassStmt` | |
| 114 | +| Expression | `EXPR.` | `ExprStmt` (prints the result) | |
| 115 | + |
| 116 | +### How Expression Parsing Works: The Shunting-Yard Algorithm |
| 117 | + |
| 118 | +This is the most complex part. Expressions like `2 + 3 * 4` need to respect operator precedence (`*` before `+`). The parser uses the **Shunting-Yard Algorithm** (invented by Edsger Dijkstra) to convert infix notation to **RPN** (Reverse Polish Notation). |
| 119 | + |
| 120 | +#### What is RPN? |
| 121 | + |
| 122 | +Normal math (infix): `2 + 3 * 4` |
| 123 | +RPN (postfix): `2 3 4 * +` |
| 124 | + |
| 125 | +In RPN, operators come **after** their operands. The beauty: **no parentheses needed** and evaluation is trivially simple with a stack. |
| 126 | + |
| 127 | +#### The Algorithm |
| 128 | + |
| 129 | +The algorithm uses two data structures: an **output queue** and an **operator stack**. |
| 130 | + |
| 131 | +``` |
| 132 | +Input tokens: 2 + 3 * 4 |
| 133 | +
|
| 134 | +Step 1: "2" is a number → push to output |
| 135 | + Output: [2] Stack: [] |
| 136 | +
|
| 137 | +Step 2: "+" is an operator → push to stack |
| 138 | + Output: [2] Stack: [+] |
| 139 | +
|
| 140 | +Step 3: "3" is a number → push to output |
| 141 | + Output: [2, 3] Stack: [+] |
| 142 | +
|
| 143 | +Step 4: "*" is an operator → precedence of * (6) > + (5) |
| 144 | + So * goes on top, + stays |
| 145 | + Output: [2, 3] Stack: [+, *] |
| 146 | +
|
| 147 | +Step 5: "4" is a number → push to output |
| 148 | + Output: [2, 3, 4] Stack: [+, *] |
| 149 | +
|
| 150 | +Step 6: End of input → pop all operators to output |
| 151 | + Output: [2, 3, 4, *, +] Stack: [] |
| 152 | +``` |
| 153 | + |
| 154 | +Result RPN: `2 3 4 * +` ✓ |
| 155 | + |
| 156 | +#### How Parentheses Work |
| 157 | + |
| 158 | +`(2 + 3) * 4`: |
| 159 | + |
| 160 | +- `(` → pushed to stack as marker |
| 161 | +- `2 + 3` processed normally |
| 162 | +- `)` → pop operators until `(` is found, removing the marker |
| 163 | +- `*` → normal processing |
| 164 | + |
| 165 | +Result: `2 3 + 4 *` ✓ (addition happens first) |
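
The walkthrough and the parenthesis rules can be sketched together. This is an illustrative version under stated assumptions, not the code in `class Parser`: tokens are plain strings, only the four binary arithmetic operators are handled, all of them are treated as left-associative, and the precedence table is hypothetical apart from the values 5 and 6 used in the steps above.

```cpp
#include <map>
#include <string>
#include <vector>

// Converts an infix token list to RPN with the Shunting-Yard algorithm.
std::vector<std::string> toRpn(const std::vector<std::string>& tokens) {
    // Assumed precedence table; higher binds tighter.
    static const std::map<std::string, int> prec = {{"+", 5}, {"-", 5}, {"*", 6}, {"/", 6}};
    std::vector<std::string> output;  // output queue
    std::vector<std::string> ops;     // operator stack
    for (const std::string& tok : tokens) {
        if (tok == "(") {
            ops.push_back(tok);                       // marker
        } else if (tok == ")") {
            while (!ops.empty() && ops.back() != "(") {
                output.push_back(ops.back());
                ops.pop_back();
            }
            if (!ops.empty()) ops.pop_back();         // discard the "(" marker
        } else if (prec.count(tok)) {
            // Pop operators of equal or higher precedence before pushing this one.
            while (!ops.empty() && ops.back() != "(" && prec.at(ops.back()) >= prec.at(tok)) {
                output.push_back(ops.back());
                ops.pop_back();
            }
            ops.push_back(tok);
        } else {
            output.push_back(tok);                    // operand goes straight to output
        }
    }
    while (!ops.empty()) {                            // end of input: flush the stack
        output.push_back(ops.back());
        ops.pop_back();
    }
    return output;
}
```

The `>=` comparison is what makes `8 - 3 - 2` come out as `8 3 - 2 -`, so subtraction groups from the left.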
| 166 | + |
| 167 | +#### Unary Operators |
| 168 | + |
| 169 | +`-5` is tricky because `-` could be subtraction or negation. The parser checks: was the **previous token** an operator, opening paren, or nothing? If so, it's unary. |
| 170 | + |
| 171 | +Unary `-` is renamed to `~` internally so the evaluator can distinguish: |
| 172 | + |
| 173 | +- `-` with two operands = subtraction |
| 174 | +- `~` with one operand = negation |
| 175 | + |
| 176 | +Unary `!` stays as `!`. |
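
The previous-token check can be sketched as a small pre-pass. The function name `markUnaryMinus` and the string-based tokens are assumptions for illustration; the real parser may make this decision inline while running Shunting-Yard.

```cpp
#include <string>
#include <vector>

// Rewrites every unary "-" as "~", leaving binary "-" untouched.
std::vector<std::string> markUnaryMinus(const std::vector<std::string>& tokens) {
    auto isOperator = [](const std::string& t) {
        return t == "+" || t == "-" || t == "*" || t == "/" || t == "~" || t == "!";
    };
    std::vector<std::string> out;
    for (const std::string& tok : tokens) {
        // Unary position: start of expression, after "(", or after another operator.
        const bool unaryPosition = out.empty() || out.back() == "(" || isOperator(out.back());
        out.push_back((tok == "-" && unaryPosition) ? std::string("~") : tok);
    }
    return out;
}
```

With this sketch, `- 5` becomes `~ 5` and `2 * - 5` becomes `2 * ~ 5`, while `3 - 5` is left unchanged.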
| 177 | + |
| 178 | +### Short-Circuit Evaluation (the Tricky Part) |
| 179 | + |
| 180 | +`&&` and `||` need **lazy evaluation** — the right side shouldn't run if the left side already determines the result. But RPN evaluates everything eagerly! |
| 181 | + |
| 182 | +**Solution**: The parser splits expression parsing into **four layered functions**: |
| 183 | + |
| 184 | +``` |
| 185 | +parseExpression() → calls parseLogicalOr() |
| 186 | +parseLogicalOr() → calls parseLogicalAnd(), handles || |
| 187 | +parseLogicalAnd() → calls parsePrimaryExpr(), handles && |
| 188 | +parsePrimaryExpr() → Shunting-Yard for everything else |
| 189 | +``` |
| 190 | + |
| 191 | +When `||` or `&&` appears **at the top level** (not inside parentheses), the parser **doesn't** put them in the RPN. Instead, it creates a tree node: |
| 192 | + |
| 193 | +``` |
| 194 | +Expression |
| 195 | +├── logicalOp: "||" |
| 196 | +├── lhs: Expression [left side - RPN] |
| 197 | +└── rhs: Expression [right side - RPN] |
| 198 | +``` |
| 199 | + |
| 200 | +The evaluator then checks the LHS first, and **only evaluates RHS if needed**: |
| 201 | + |
| 202 | +```cpp |
| 203 | +if (logicalOp == "&&") { |
| 204 | + double leftVal = lhs->evaluate(scope); |
| 205 | + if (leftVal == 0) return 0.0; // Short-circuit: skip RHS! |
| 206 | + return rhs->evaluate(scope); // Only evaluate if LHS was true |
| 207 | +} |
| 208 | +``` |
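
To see the laziness end to end, here is a self-contained sketch of such a node. It is a simplified stand-in for the real `Expression` class: the `leaf` callable replaces an RPN sequence, the scope argument is omitted, the helper names `makeLeaf` and `makeLogical` are hypothetical, and the `||` branch (returning `1.0` when the left side is non-zero) is an assumption, since only the `&&` branch is shown above.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Simplified expression node: either a logical node (lhs/rhs) or a leaf.
struct Expr {
    std::string logicalOp;           // "&&", "||", or "" for a leaf
    std::unique_ptr<Expr> lhs;
    std::unique_ptr<Expr> rhs;
    std::function<double()> leaf;    // stands in for evaluating an RPN sequence

    double evaluate() const {
        if (logicalOp == "&&") {
            if (lhs->evaluate() == 0) return 0.0;   // short-circuit: RHS never runs
            return rhs->evaluate();
        }
        if (logicalOp == "||") {
            if (lhs->evaluate() != 0) return 1.0;   // short-circuit (assumed result value)
            return rhs->evaluate();
        }
        return leaf();
    }
};

std::unique_ptr<Expr> makeLeaf(std::function<double()> fn) {
    auto e = std::make_unique<Expr>();
    e->leaf = std::move(fn);
    return e;
}

std::unique_ptr<Expr> makeLogical(const std::string& op, std::unique_ptr<Expr> l,
                                  std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->logicalOp = op;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}
```

If the right-hand leaf increments a counter, evaluating `0 && rhs` or `1 || rhs` leaves the counter untouched, which is exactly the behaviour a flat RPN sequence cannot provide.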
| 209 | + |
| 210 | +--- |
| 211 | + |
| 212 | +## Stage 3: Evaluator |
| 213 | + |
| 214 | +**File**: `me_doingIt.cpp` → `Expression::evaluate()` and `*.execute()` methods |
| 215 | + |
| 216 | +### Expression Evaluation (RPN Stack Machine) |
| 217 | + |
| 218 | +Evaluating RPN is beautifully simple. Use a **stack**: |
| 219 | + |
| 220 | +``` |
| 221 | +RPN: 2 3 4 * + |
| 222 | +
|
| 223 | +Step 1: "2" → push Stack: [2] |
| 224 | +Step 2: "3" → push Stack: [2, 3] |
| 225 | +Step 3: "4" → push Stack: [2, 3, 4] |
| 226 | +Step 4: "*" → pop 4 and 3, |
| 227 | + push 3*4=12 Stack: [2, 12] |
| 228 | +Step 5: "+" → pop 12 and 2, |
| 229 | + push 2+12=14 Stack: [14] |
| 230 | +
|
| 231 | +Result: 14 ✓ |
| 232 | +``` |
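
The same steps as a minimal stack machine. This sketch assumes RPN tokens are plain strings and covers only the four binary arithmetic operators plus unary `~`; variables, comparisons, and function calls are left out, and `evalRpn` is a hypothetical name.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Evaluates a sequence of RPN tokens using a value stack.
double evalRpn(const std::vector<std::string>& rpn) {
    std::vector<double> stack;
    auto pop = [&stack]() {
        if (stack.empty()) throw std::runtime_error("malformed RPN: missing operand");
        const double v = stack.back();
        stack.pop_back();
        return v;
    };
    for (const std::string& tok : rpn) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            const double right = pop();   // the top of the stack is the right operand
            const double left = pop();
            if (tok == "+") stack.push_back(left + right);
            else if (tok == "-") stack.push_back(left - right);
            else if (tok == "*") stack.push_back(left * right);
            else stack.push_back(left / right);
        } else if (tok == "~") {
            stack.push_back(-pop());      // unary negation takes one operand
        } else {
            stack.push_back(std::stod(tok));
        }
    }
    if (stack.size() != 1) throw std::runtime_error("malformed RPN: leftover values");
    return stack.back();
}
```

Operand order matters for `-` and `/`: the first value popped is the right-hand side, so `8 3 -` evaluates to `5`, not `-5`.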
| 233 | + |
| 234 | +### Statement Execution |
| 235 | + |
| 236 | +Each AST node has an `execute()` method: |
| 237 | + |
| 238 | +- **ExprStmt**: Evaluates the expression and **prints** the result |
| 239 | +- **AssignStmt**: Evaluates the expression, stores the result in the scope |
| 240 | +- **IfStmt**: Evaluates condition → if non-zero, executes the matching branch's block |
| 241 | +- **WhileStmt**: Evaluates condition → while non-zero, executes body, re-evaluates condition |
| 242 | +- **ForStmt**: Determines range → iterates, setting loop variable in scope for each iteration |
| 243 | +- **FunctionDefStmt**: Stores the function definition in the scope (does not run it yet) |
| 244 | +- **ReturnStmt**: Evaluates expression, throws `ReturnException` with the value |
| 245 | + |
| 246 | +### How `give` (Return) Works |
| 247 | + |
| 248 | +`give(value)` throws a C++ exception (`ReturnException`). This exception **unwinds** through any nested loops, if-blocks, etc., until it's caught by the function call code in the evaluator. This is why `give` correctly exits from inside while loops: |
| 249 | + |
| 250 | +``` |
| 251 | +fn find @(): |
| 252 | +    var i = 0. |
| 253 | +    while i < 100:          ← loop running |
| 254 | +        if i == 42: |
| 255 | +            give(i).        ← throws ReturnException(42) |
| 256 | +        ;                   ← exception flies through if-block |
| 257 | +        i = i + 1. |
| 258 | +    ;                       ← exception flies through while-loop |
| 259 | +;                           ← caught here by function call handler |
| 260 | +``` |
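
The unwinding can be reproduced in a few lines of C++. In this sketch each helper stands in for one statement node's `execute()` method; the helper names and the use of `std::function` for blocks are assumptions, and only `ReturnException` is a name taken from this document. Note that neither the loop nor the if-helper contains any return-handling code.

```cpp
#include <functional>

// Stand-in for the interpreter's exception type.
struct ReturnException {
    double value;
};

// give(value): throws instead of returning normally.
void executeReturn(double value) { throw ReturnException{value}; }

void executeIf(bool condition, const std::function<void()>& thenBlock) {
    if (condition) thenBlock();
}

void executeWhile(const std::function<bool()>& condition, const std::function<void()>& body) {
    while (condition()) body();
}

// Stands in for the function-call handler: the only place the exception is caught.
double callFunction(const std::function<void()>& body) {
    try {
        body();
    } catch (const ReturnException& ret) {
        return ret.value;
    }
    return 0.0;  // no give() was reached
}

// Hand-translated version of the `find` function above.
double runFind() {
    double i = 0;
    return callFunction([&] {
        executeWhile([&] { return i < 100; }, [&] {
            executeIf(i == 42, [&] { executeReturn(i); });
            i = i + 1;
        });
    });
}
```

In this sketch `runFind()` stops at `i == 42`: the exception leaves both the if-block and the loop before `callFunction` turns it back into an ordinary value.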
| 261 | + |
| 262 | +### Function Calls |
| 263 | + |
| 264 | +When the evaluator encounters a function call in an expression: |
| 265 | + |
| 266 | +1. **Pop arguments** from the stack |
| 267 | +2. **Create a new scope** (child of caller's scope, with barrier) |
| 268 | +3. **Define parameters** as local variables in the new scope |
| 269 | +4. **Execute** the function body |
| 270 | +5. **Catch** any `ReturnException` → push the return value onto the stack |
| 271 | +6. If no `give` was used → push `0` (implicit return) |
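
The six steps can be sketched as one function. Everything here other than `ReturnException` is an assumption for illustration: `FunctionDef`, the `Locals` map standing in for the new scope, and a `std::function` body standing in for the function's block. Parent-scope linking and the barrier are left out to keep the focus on the stack handling.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

struct ReturnException {
    double value;
};

using Locals = std::map<std::string, double>;  // stand-in for the function's new scope

struct FunctionDef {
    std::vector<std::string> params;
    std::function<void(Locals&)> body;         // stand-in for executing the body block
};

void callFunction(const FunctionDef& fn, std::vector<double>& stack) {
    if (stack.size() < fn.params.size()) throw std::runtime_error("not enough arguments");
    // 1. Pop arguments; the last argument sits on top of the stack.
    std::vector<double> args(fn.params.size());
    for (std::size_t k = fn.params.size(); k-- > 0;) {
        args[k] = stack.back();
        stack.pop_back();
    }
    // 2-3. Create the new scope and define the parameters in it.
    Locals scope;
    for (std::size_t k = 0; k < fn.params.size(); ++k) scope[fn.params[k]] = args[k];
    // 4-6. Execute the body; a caught ReturnException supplies the result, else 0.
    double result = 0.0;
    try {
        fn.body(scope);
    } catch (const ReturnException& ret) {
        result = ret.value;
    }
    stack.push_back(result);
}
```

Because the result is always pushed back, a call behaves like any other RPN operator from the evaluator's point of view: it consumes its operands and leaves exactly one value.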
| 272 | + |
| 273 | +--- |
| 274 | + |
| 275 | +## Stage 4: Scope System |
| 276 | + |
| 277 | +**File**: `me_doingIt.cpp` → `struct Scope` |
| 278 | + |
| 279 | +The scope system controls **which variables are visible** and **which can be modified**. It's implemented as a **linked list** of scope frames. |
| 280 | + |
| 281 | +### Scope Chain |
| 282 | + |
| 283 | +``` |
| 284 | +Global Scope ← defines: x=10, PI=3.14 |
| 285 | + │ |
| 286 | + ├── Function Scope (barrier=true) ← defines: a=5 (parameter) |
| 287 | + │ │ |
| 288 | + │ └── If-Block Scope (barrier=false) ← defines: temp=1 |
| 289 | + │ |
| 290 | + └── For-Loop Scope (barrier=false) ← defines: i=3 (loop var) |
| 291 | +``` |
| 292 | + |
| 293 | +### The Barrier Mechanism |
| 294 | + |
| 295 | +Each scope has a `barrier` flag: |
| 296 | + |
| 297 | +- **`barrier = false`** (if/else, for, while blocks): The `set()` method **propagates** writes to the parent scope. So `x = 99` inside an if-block modifies the outer `x`. |
| 298 | + |
| 299 | +- **`barrier = true`** (function scopes): The `set()` method **stops** at the barrier. So `x = 99` inside a function throws an error unless `x` was declared inside that function; the write can't reach the outer `x`. |
| 300 | + |
| 301 | +### Variable Lookup (`get`) |
| 302 | + |
| 303 | +When reading variable `x`, the scope walks **up** the chain: |
| 304 | + |
| 305 | +``` |
| 306 | +Current scope → has x? → Yes → return it |
| 307 | +              → No  → check parent → has x? → Yes → return it |
| 308 | +                                   → No  → check parent → ... |
| 309 | +                                           → no parent left → Error! |
| 310 | +``` |
| 311 | + |
| 312 | +There's **no barrier for reading** — functions can always read outer variables. Only writing is blocked. |
| 313 | + |
| 314 | +### Variable Assignment (`set`) |
| 315 | + |
| 316 | +When writing `x = value`: |
| 317 | + |
| 318 | +``` |
| 319 | +Current scope → has x? → Yes → update it |
| 320 | +              → No  → barrier? → Yes → ERROR ("cannot mutate outer scope") |
| 321 | +                               → No  → try parent.set(x, value) |
| 322 | +``` |
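
Both rules fit in a short struct. This is a sketch under assumptions rather than the actual `struct Scope`: values are plain `double`s, the parent link is a raw non-owning pointer, and the `define` method and exact error messages are hypothetical.

```cpp
#include <map>
#include <stdexcept>
#include <string>

struct Scope {
    std::map<std::string, double> vars;
    Scope* parent = nullptr;   // non-owning link to the enclosing scope
    bool barrier = false;      // true for function scopes

    Scope(Scope* parentScope, bool isBarrier) : parent(parentScope), barrier(isBarrier) {}

    // Declares a variable in this scope (what `var` would do).
    void define(const std::string& name, double value) { vars[name] = value; }

    // Reads walk up the whole chain; barriers are ignored.
    double get(const std::string& name) const {
        for (const Scope* s = this; s != nullptr; s = s->parent) {
            const auto it = s->vars.find(name);
            if (it != s->vars.end()) return it->second;
        }
        throw std::runtime_error("undefined variable: " + name);
    }

    // Writes walk up only until a barrier is hit.
    void set(const std::string& name, double value) {
        const auto it = vars.find(name);
        if (it != vars.end()) { it->second = value; return; }
        if (barrier) throw std::runtime_error("cannot mutate outer scope: " + name);
        if (parent == nullptr) throw std::runtime_error("undefined variable: " + name);
        parent->set(name, value);
    }
};
```

With a global `x`, a child scope created with `barrier = false` can update `x` through `set`, while a child created with `barrier = true` can still `get` it but throws on `set`.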
| 323 | + |
| 324 | +--- |
| 325 | + |
| 326 | +## Putting It All Together |
| 327 | + |
| 328 | +Here's the full journey of this program: |
| 329 | + |
| 330 | +``` |
| 331 | +var x = 5. |
| 332 | +fn double @(n): give(n * 2). ; |
| 333 | +double(x). |
| 334 | +``` |
| 335 | + |
| 336 | +### 1. Tokenizer |
| 337 | + |
| 338 | +``` |
| 339 | +[var] [x] [=] [5] [.] [fn] [double] [@] [(] [n] [)] [:] [give] [(] [n] [*] [2] [)] [.] [;] [double] [(] [x] [)] [.] |
| 340 | +``` |
| 341 | + |
| 342 | +### 2. Parser |
| 343 | + |
| 344 | +``` |
| 345 | +Program (BlockStmt) |
| 346 | +├── AssignStmt { name="x", expr=RPN[5], isDeclaration=true } |
| 347 | +├── FunctionDefStmt { name="double", params=["n"], |
| 348 | +│ body=BlockStmt [ |
| 349 | +│ ReturnStmt { expr=RPN[n, 2, *] } |
| 350 | +│ ] |
| 351 | +│ } |
| 352 | +└── ExprStmt { expr=RPN[x, double CALL(1)] } |
| 353 | +``` |
| 354 | + |
| 355 | +### 3. Evaluator |
| 356 | + |
| 357 | +``` |
| 358 | +1. AssignStmt: evaluate RPN[5] → 5, store x=5 in global scope |
| 359 | +2. FunctionDefStmt: store "double" function definition in scope |
| 360 | +3. ExprStmt: evaluate RPN[x, double CALL(1)] |
| 361 | +   a. Push x → stack: [5] |
| 362 | +   b. CALL double with 1 arg |
| 363 | +      - Pop 5 from stack |
| 364 | +      - Create new scope with n=5 |
| 365 | +      - Execute body: evaluate RPN[n, 2, *] |
| 366 | +        - Push n=5, push 2 → stack: [5, 2] |
| 367 | +        - Pop 2, pop 5, push 10 → stack: [10] |
| 368 | +      - ReturnStmt throws ReturnException(10) |
| 369 | +      - Catch → push 10 to stack |
| 370 | +   c. Stack: [10] |
| 371 | +   d. Print: 10 |
| 372 | +``` |
| 373 | + |
| 374 | +**Output**: `10` |
| 375 | + |
| 376 | +--- |
| 377 | + |
| 378 | +## Summary of Key Design Decisions |
| 379 | + |
| 380 | +| Decision | Choice | Why | |
| 381 | +| ------------------------- | --------------------------------------- | ---------------------------------------------------------------------------------------------- | |
| 382 | +| Expression representation | RPN (Reverse Polish Notation) | Simple stack-based evaluation; arithmetic needs no recursive tree walk | |
| 383 | +| Short-circuit `&&`/`\|\|` | Tree nodes wrapping RPN sub-expressions | Can't lazily evaluate inside flat RPN, so logical ops are lifted to tree layer | |
| 384 | +| Scope model | Dynamic scope with barriers | Simple, satisfies "inner functions can read outer vars" while preventing mutation | |
| 385 | +| Return mechanism | C++ exceptions (`ReturnException`) | Cleanly unwinds through nested loops and blocks without adding return-checking code everywhere | |
| 386 | +| Statement terminator | `.` (dot) | Chosen by language designer as a visual alternative to `;` | |
| 387 | +| Function syntax | `fn NAME @(PARAMS): BODY ;` | `@` is a visual separator, `:` and `;` delimit the body | |