
notes: TPU deep dive (bilingual, content complete, ready for review) #2

Open
Random-Liu wants to merge 8 commits into main from claude/notes-tpu-deep-dive

Conversation


@Random-Liu Random-Liu commented May 5, 2026

Summary

Bilingual long-form note notes/tpu-deep-dive.{cn.md, md} distilled from your Gemini conversation in notes/tpu_draft.md. Organized by abstraction layer (hardware → XLA → inference → cluster → system comparison) plus three reference appendices. Each chapter ends with a ↔ GPU subsection.

  • Chinese: 1395 lines, 19 chapters + 3 appendices + writing log
  • English: 1398 lines, 1:1 structural mirror per GEMINI.md §1
  • Site nav updated in mkdocs.yml (both 中文版 and English sections)

Status

  • ✅ Skeleton
  • ✅ Part I — Hardware (single chip, ICI/Torus, OCS+96 fibers, 3D Ring All-Reduce + NUCA, host/PCIe/NUMA, packaging)
  • ✅ Part II — XLA (compilation model, JIT/AOT/bucketing, topology-aware mapping)
  • ✅ Part III — Inference adaptation (vLLM/JetStream/Saxml split, PagedAttention via Pallas, Prefill/Decode + chunked + 1D static flatten, KV/memory hierarchy, Gemini's MoE/spec-decode compromises)
  • ✅ Part IV — Cluster (K8s slice CRD + Kueue + TPU Provisioner → OCS, multi-host slice with LWS/JobSet)
  • ✅ Part V — System comparison (programming model chains, cost/MFU/TCO, hardware weaknesses)
  • ✅ Appendices A (trade-off cheat sheet), B (numbers list), C (TPU↔GPU terminology)
  • ✅ Writing log with 10 intentional cuts + 5 external [补充 — Claude 加] callouts for your review
  • ✅ English mirror complete and committed
  • ✅ mkdocs.yml nav updated
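For reference, the nav change might look something like the sketch below. The section names come from this PR's description; the page titles and exact nesting are assumptions, not the actual file contents:

```yaml
nav:
  - 中文版:
      - 技术笔记:
          - TPU 深入解析: tpu-deep-dive.cn.md   # title assumed
  - English:
      - Tech Notes:
          - TPU Deep Dive: tpu-deep-dive.md     # title assumed
```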

Things specifically asking for your decision

  1. notes/tpu_draft.md is the source you committed; it'll be copied into the build by CI but isn't in nav, so it won't appear on the site. Decide whether to delete after merge or keep as archive.
  2. 5 external additions I tagged [补充 — Claude 加] (ICI 4.5 TB/s number, Microsoft Lumen OCS aside, v4 single-digit MFU sourcing, Mooncake aside, Midjourney/Character.AI/Waymark numbers without dates) — please verify or strip.
  3. 10 intentional cuts listed at bottom of each file's writing log — let me know which to add back.
  4. The user-prompt path docs/tpu-deep-dive.md was switched to notes/tpu-deep-dive.{cn.md, md} to match repo convention. OK?

Test plan

  • Skim chapter structure on GitHub mobile — does the layering match how you'd want to look up things in 6 months?
  • Sanity-check tone on one or two chapters (target: softer than the Claude default, with analogies, not assuming reader is expert)
  • Decide on [补充] callouts and intentional cuts
  • Approve via lgtm / merge comment, or merge directly per CLAUDE.md SOP

CI doesn't run on PRs (only on push to main per .github/workflows/ci.yml), so the published rendering with math/Mermaid will only be visible after merge. Iterate post-merge if needed.
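A minimal sketch of the trigger implied by that description (the real contents of .github/workflows/ci.yml are assumed, not quoted); if PR previews are ever wanted, adding a `pull_request` trigger would be the standard fix:

```yaml
# Sketch only — assumed shape of the existing trigger.
on:
  push:
    branches: [main]
# To build on PRs as well, one would add:
#   pull_request:
#     branches: [main]
```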

claude added 8 commits May 5, 2026 14:13
Set up the chapter structure for notes/tpu-deep-dive.cn.md based on
the agreed 22-topic outline, organized by abstraction layer:
hardware (Part I), XLA compiler/runtime (Part II), inference adaptation
(Part III), cluster orchestration (Part IV), and system-level GPU
comparison (Part V), plus three appendices for trade-off lookup,
parameter list, and GPU-equivalence glossary. Each chapter has a
single-line topic statement and TODO bullets that map directly to
points in the source conversation; no prose has been written yet, so
this commit contains structure only.

Co-authored-by: Claude <noreply@anthropic.com>
Replace skeleton TODOs with full prose for chapters 1-6:
- Single chip: MXU/VPU/SPU systolic array, 2x2 walk-through, VLIW packs
- Inter-chip: ICI and 3D Torus topology
- OCS: MEMS mirror physics, 96-fiber geometry, rack-level slicing
- Collective communication: Ring All-Reduce, dim-partitioned 3D, NUCA
- Host integration: PCIe, multi-host slicing (LWS/JobSet), NUMA
- Advanced packaging: 2.5D/3D, silicon interposer, TSV
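To make the Ring All-Reduce in the collective-communication chapter concrete, here is a minimal single-process simulation (plain Python, no TPU specifics): reduce-scatter followed by all-gather, 2·(N−1) steps, each step moving one 1/N-sized chunk per link. The 3D variant in the note runs this once per torus dimension.

```python
# Minimal Ring All-Reduce simulation: N "nodes" each hold a vector; after
# reduce-scatter + all-gather, every node holds the elementwise sum.
def ring_all_reduce(vectors):
    n = len(vectors)
    k = len(vectors[0])
    assert k % n == 0, "vector length must divide evenly into n chunks"
    size = k // n
    chunks = [list(v) for v in vectors]  # per-node working copy

    # Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) mod n
    # to its ring neighbor, which accumulates it. Afterwards node i holds
    # the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i - s) % n, (i + 1) % n
            for j in range(c * size, (c + 1) * size):
                chunks[dst][j] += chunks[i][j]

    # Phase 2: all-gather. In step s, node i forwards its newest complete
    # chunk (i + 1 - s) mod n; after n - 1 steps every node has the sum.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - s) % n, (i + 1) % n
            for j in range(c * size, (c + 1) * size):
                chunks[dst][j] = chunks[i][j]
    return chunks
```

Each link carries 2·(N−1)·(k/N) elements total, which is why ring bandwidth cost is nearly independent of ring size.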

Each chapter ends with a GPU-comparison subsection. Three
"[补充 — Claude 加]" callouts mark facts added beyond the source for
user review.

Parts II-V and appendices to follow in subsequent commits on this
branch.

Co-authored-by: Claude <noreply@anthropic.com>
Part II (XLA compiler & runtime, chapters 7-9):
- Compilation model: operator fusion, static padding, software
  pipelining, VLIW 5-slot packs (single-chip and multi-chip variants)
- Compilation timing: JIT timeline, bucketing + AOT + persistent
  cache pipeline, why no universal precompiled library
- Topology-aware mapping: TP on short copper rings, DP on long
  optical rings

Part III (inference adaptation, chapters 10-14):
- Software stack split: vLLM (lift-and-shift), JetStream
  (TPU-native), Saxml (JAX legacy)
- PagedAttention adaptation: control-plane/data-plane split with
  XLA pool + vLLM block tables + Pallas custom kernel
- Prefill/Decode coordination: arithmetic intensity gap, static-bus
  continuous batching, chunked prefill, 1D static flatten for mixed
  steps
- KV / memory hierarchy: ICI as native RDMA-bypass, no GDS
  equivalent, KV offload via PCIe to host DDR
- Gemini practical compromises: MoE Capacity Factor for static
  routing, Tree Attention for tensorized speculative decoding
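The control-plane half of the PagedAttention split above can be sketched as a block allocator over a fixed pool: Python owns the block tables, while on TPU the data plane (gathering KV via those tables) would be a Pallas kernel over a statically allocated pool. Everything here is a simplified illustration, not vLLM's actual classes.

```python
class BlockAllocator:
    """Hands out fixed-size KV-cache blocks and keeps per-request block tables."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # indices into the static KV pool
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens written

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:         # current block full, or first token
            if not self.free:
                raise MemoryError("KV pool exhausted; caller must preempt")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1
        return self.tables[req_id]           # block table consumed by the kernel

    def release(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```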

One additional "[补充 — Claude 加]" callout in Chapter 13 about
Mooncake-style separated KV pools (not in source).

Part IV, V, appendices, and English mirror still pending.

Co-authored-by: Claude <noreply@anthropic.com>
Part IV (cluster orchestration, chapters 15-16):
- K8s TPU abstraction: device plugin, topology labels, slice CRD,
  Kueue gang scheduling, TPU Provisioner calling OCS API, OCS-based
  failure self-healing
- Multi-host slice orchestration: LWS/JobSet, SPMD startup mode,
  failure-radius analysis
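For orientation, a multi-host slice served through LeaderWorkerSet might be declared roughly like this. Every name, size, label value, and image below is an illustrative assumption, not taken from the note:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: tpu-serve                 # illustrative name
spec:
  replicas: 1                     # one model replica = one multi-host slice
  leaderWorkerTemplate:
    size: 4                       # hosts per slice (assumed)
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-topology: 4x4   # illustrative topology
        containers:
          - name: server
            image: example/jetstream:latest        # placeholder image
            resources:
              limits:
                google.com/tpu: "4"                # chips per host (assumed)
```

The point of the CRD shape is that the scheduler treats the whole slice as one gang-scheduled unit, which is what the Kueue + TPU Provisioner flow above relies on.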

Part V (system comparison, chapters 17-19):
- Programming model chains: GPU's three-segment (CUDA -> NCCL ->
  IB/RDMA) vs TPU's single-segment (SPMD + ICI as VLIW slot 5),
  with concrete pseudo-instruction streams and the
  intersection/highway/factory analogies
- Cost and efficiency: NVIDIA Tax breakdown, MFU comparison,
  Performance/Watt, real-world TCO cases (Midjourney, Character.AI,
  Waymark)
- TPU hardware weaknesses: MXU granularity penalty, weak SPU and
  speculative-decoding pain, MoE All-to-All congestion on Torus,
  Decode big-batch dead-ends (KV OOM + token dispersion), HBM
  bandwidth/compute imbalance, what each weakness buys
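The MFU comparison in the cost chapter reduces to a one-line formula, sketched here with made-up numbers: MFU is the fraction of a chip's peak FLOP/s actually spent on useful model math, with decoder-transformer FLOPs per token commonly approximated as 2 × parameter count.

```python
def mfu(tokens_per_sec, n_params, peak_flops):
    """Model FLOPs Utilization: achieved useful FLOP/s over peak FLOP/s."""
    achieved = 2 * n_params * tokens_per_sec  # ~2*P FLOPs per decoded token
    return achieved / peak_flops

# Illustrative numbers (not measured): a 70e9-parameter model decoding
# 1000 tok/s on a chip with 900e12 peak FLOP/s.
print(round(mfu(1000, 70e9, 900e12), 3))  # → 0.156
```

Single-digit decode MFU, as flagged for v4 in the note, falls out directly: decode is memory-bound, so achieved FLOP/s sits far below peak.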

Appendices:
- A: Trade-off cheat sheet indexed by design dimension
- B: Numbers and parameters list (all annotated as "from source")
- C: TPU-to-GPU terminology mapping

Writing log:
- 10-item intentional-cuts list for user review
- 5 external "[补充 — Claude 加]" callouts collected with chapter
  references

Chinese version is now content-complete. English mirror to follow.

Co-authored-by: Claude <noreply@anthropic.com>
1:1 structural translation of notes/tpu-deep-dive.cn.md Part I per
GEMINI.md section 1: idiomatic + concise English, every chapter,
section, bullet, and bolded term mirrored from the Chinese source.
Six chapters covering single chip (MXU/VPU/SPU systolic array,
walk-through, VLIW packs), inter-chip ICI + 3D Torus, OCS optical
switching with 96-fiber geometry, 3D-decomposed Ring All-Reduce
with NUCA, host integration including PCIe/NUMA/multi-host slice,
and advanced packaging from Wire Bonding to TSV.

External-knowledge callouts ported verbatim from Chinese for
review consistency.

Parts II-V and appendices to follow.

Co-authored-by: Claude <noreply@anthropic.com>
1:1 structural mirror of Chinese Parts II (XLA: chapters 7-9) and
III (inference adaptation: chapters 10-14). Covers compilation
model and VLIW packs, JIT/AOT/bucketing/persistent-cache pipeline,
topology-aware mapping, vLLM/JetStream/Saxml stack split,
PagedAttention adaptation via control-plane/data-plane split,
Prefill/Decode coordination including chunked prefill and 1D
static flatten, KV/memory hierarchy with ICI/GDS/KV-offload
comparison, and Gemini's MoE Capacity Factor + Tree Attention.

External-knowledge callouts mirror the Chinese version verbatim
for review consistency.

Parts IV, V, and appendices still to come.

Co-authored-by: Claude <noreply@anthropic.com>
Final part of the 1:1 English mirror. Part IV covers K8s TPU
abstractions (device plugin, slice CRD, Kueue gang scheduling,
TPU Provisioner -> OCS, failure self-healing) and multi-host slice
orchestration with LWS/JobSet and SPMD. Part V covers programming
model chains (GPU three-segment vs TPU one-segment), cost and
efficiency (NVIDIA Tax, MFU, Performance/Watt, real-world TCO),
and TPU hardware weaknesses with paired trade-offs. Appendices A-C
mirror the Chinese versions exactly: trade-off cheat sheet,
numbers list, and TPU-to-GPU terminology mapping. The writing log
is also mirrored, listing 10 intentional cuts and 5 external
additions for author review.

This completes the bilingual content. Next step is to update the
mkdocs.yml navigation to surface the new note and prepare for
review/merge.

Co-authored-by: Claude <noreply@anthropic.com>
Add the new bilingual TPU note to both the 中文版 -> 技术笔记 and
English -> Tech Notes sections of mkdocs.yml so it appears in the
published site sidebar after merge to main.

Co-authored-by: Claude <noreply@anthropic.com>
@Random-Liu changed the title from "notes: TPU deep dive (skeleton + drafting in progress)" to "notes: TPU deep dive (bilingual, content complete, ready for review)" on May 6, 2026