notes: TPU deep dive (bilingual, content complete, ready for review) #2

Open · Random-Liu wants to merge 8 commits into main from
Conversation
Set up the chapter structure for notes/tpu-deep-dive.cn.md based on the agreed 22-topic outline, organized by abstraction layer: hardware (Part I), XLA compiler/runtime (Part II), inference adaptation (Part III), cluster orchestration (Part IV), and system-level GPU comparison (Part V), plus three appendices for trade-off lookup, parameter list, and GPU-equivalence glossary.

Each chapter has a single-line topic statement and TODO bullets that map directly to points in the source conversation; no prose has been written yet, so this commit contains structure only.

Co-authored-by: Claude <noreply@anthropic.com>
Replace skeleton TODOs with full prose for chapters 1-6:

- Single chip: MXU/VPU/SPU systolic array, 2x2 walk-through, VLIW packs
- Inter-chip: ICI and 3D Torus topology
- OCS: MEMS mirror physics, 96-fiber geometry, rack-level slicing
- Collective communication: Ring All-Reduce, dim-partitioned 3D, NUCA
- Host integration: PCIe, multi-host slicing (LWS/JobSet), NUMA
- Advanced packaging: 2.5D/3D, silicon interposer, TSV

Each chapter ends with a GPU-comparison subsection. Three "[补充 — Claude 加]" ("supplement, added by Claude") callouts mark facts added beyond the source for user review. Parts II-V and appendices to follow in subsequent commits on this branch.

Co-authored-by: Claude <noreply@anthropic.com>
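As a companion to the chapter-1 walk-through, here is a minimal Python sketch of the systolic-array dataflow, using an output-stationary 2x2 array for brevity (a real MXU is far larger, and TPU's is usually described as weight-stationary). All names are illustrative, not from the note:

```python
# Toy cycle-by-cycle simulation of an output-stationary 2x2 systolic array
# computing C = A @ B. Operands enter skewed at the edges; each PE does one
# multiply-accumulate per cycle and passes operands to its neighbors.
import numpy as np

def systolic_matmul_2x2(A, B):
    n = 2
    C = np.zeros((n, n))             # each PE(i, j) owns one accumulator
    a_reg = np.zeros((n, n))         # A values flowing rightward
    b_reg = np.zeros((n, n))         # B values flowing downward
    for cycle in range(3 * n - 2):   # 4 cycles drain the 2x2 wavefront
        # shift: every PE hands its operands to the right / downward neighbor
        a_reg[:, 1:] = a_reg[:, :-1]
        b_reg[1:, :] = b_reg[:-1, :]
        # inject skewed inputs at the left and top edges
        for i in range(n):
            k = cycle - i            # row/column i is delayed by i cycles
            a_reg[i, 0] = A[i, k] if 0 <= k < n else 0.0
            b_reg[0, i] = B[k, i] if 0 <= k < n else 0.0
        C += a_reg * b_reg           # every PE: one MAC per cycle
    return C

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
assert np.allclose(systolic_matmul_2x2(A, B), A @ B)
```

The skew is the point: no PE ever waits for a full operand matrix, which is why the array sustains one MAC per PE per cycle once the pipeline fills.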
Part II (XLA compiler & runtime, chapters 7-9):

- Compilation model: operator fusion, static padding, software pipelining, VLIW 5-slot packs (single-chip and multi-chip variants)
- Compilation timing: JIT timeline, bucketing + AOT + persistent cache pipeline, why no universal precompiled library
- Topology-aware mapping: TP on short copper rings, DP on long optical rings

Part III (inference adaptation, chapters 10-14):

- Software stack split: vLLM (lift-and-shift), JetStream (TPU-native), Saxml (JAX legacy)
- PagedAttention adaptation: control-plane/data-plane split with XLA pool + vLLM block tables + Pallas custom kernel
- Prefill/Decode coordination: arithmetic intensity gap, static-bus continuous batching, chunked prefill, 1D static flatten for mixed steps
- KV / memory hierarchy: ICI as native RDMA-bypass, no GDS equivalent, KV offload via PCIe to host DDR
- Gemini practical compromises: MoE Capacity Factor for static routing, Tree Attention for tensorized speculative decoding

One additional "[补充 — Claude 加]" callout in Chapter 13 about Mooncake-style separated KV pools (not in source). Parts IV and V, appendices, and the English mirror are still pending.

Co-authored-by: Claude <noreply@anthropic.com>
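A sketch of the bucketing idea from the compilation-timing chapter: pad dynamic sequence lengths up to a few fixed bucket sizes so XLA compiles one program per bucket instead of one per length. The `score` function and bucket sizes are hypothetical placeholders:

```python
# Bucketing sketch: jax.jit retraces only on new (padded) shapes, so the
# number of XLA compilations is bounded by len(BUCKETS), not by the number
# of distinct request lengths.
import jax
import jax.numpy as jnp

BUCKETS = (128, 256, 512, 1024)      # the only shapes XLA ever sees

def pick_bucket(seq_len: int) -> int:
    for b in BUCKETS:
        if seq_len <= b:
            return b
    raise ValueError(f"sequence length {seq_len} exceeds largest bucket")

@jax.jit
def score(tokens, mask):
    # stand-in for a real forward pass: masked mean over valid positions
    h = jnp.tanh(tokens)
    return (h * mask).sum() / mask.sum()

def run(tokens):
    b = pick_bucket(tokens.shape[0])
    pad = b - tokens.shape[0]
    padded = jnp.pad(tokens, (0, pad))               # static shape: (b,)
    mask = jnp.pad(jnp.ones_like(tokens), (0, pad))  # zero out the padding
    return score(padded, mask)

# lengths 100 and 120 share the 128-bucket program; 300 compiles the 512 one
for n in (100, 120, 300):
    run(jnp.arange(n, dtype=jnp.float32))
```

The AOT and persistent-cache stages described in the chapter extend the same idea: pre-warm each bucket before serving, and keep the compiled programs across restarts (JAX exposes a persistent compilation cache for this).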
Part IV (cluster orchestration, chapters 15-16):

- K8s TPU abstraction: device plugin, topology labels, slice CRD, Kueue gang scheduling, TPU Provisioner calling the OCS API, OCS-based failure self-healing
- Multi-host slice orchestration: LWS/JobSet, SPMD startup mode, failure-radius analysis

Part V (system comparison, chapters 17-19):

- Programming model chains: GPU's three-segment chain (CUDA -> NCCL -> IB/RDMA) vs TPU's single-segment chain (SPMD + ICI as VLIW slot 5), with concrete pseudo-instruction streams and the intersection/highway/factory analogies
- Cost and efficiency: NVIDIA Tax breakdown, MFU comparison, Performance/Watt, real-world TCO cases (Midjourney, Character.AI, Waymark)
- TPU hardware weaknesses: MXU granularity penalty, weak SPU and speculative-decoding pain, MoE All-to-All congestion on the Torus, Decode big-batch dead-ends (KV OOM + token dispersion), HBM bandwidth/compute imbalance, what each weakness buys

Appendices:

- A: Trade-off cheat sheet indexed by design dimension
- B: Numbers and parameters list (all annotated as "from source")
- C: TPU-to-GPU terminology mapping

Writing log:

- 10-item intentional-cuts list for user review
- 5 external "[补充 — Claude 加]" callouts collected with chapter references

The Chinese version is now content-complete. English mirror to follow.

Co-authored-by: Claude <noreply@anthropic.com>
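To ground the "single-segment chain" claim from the programming-model chapter: under JAX SPMD, the collective is just another op inside the compiled program, which XLA lowers onto ICI, rather than a call into a separate communication library over a separate fabric. The mesh axis name and toy function below are assumptions, not from the note:

```python
# Minimal JAX SPMD sketch: jax.lax.pmean compiles into the same XLA program
# as the local compute, so there is no CUDA -> NCCL -> IB/RDMA handoff.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(np.array(jax.devices()), axis_names=("dp",))  # e.g. one ring of chips

def _global_mean(x_shard):
    local = x_shard.mean()                        # local compute on each chip
    return jax.lax.pmean(local, axis_name="dp")   # collective: just another op

global_mean = jax.jit(
    shard_map(_global_mean, mesh=mesh, in_specs=P("dp"), out_specs=P())
)

x = jnp.arange(16.0)   # length must divide the device count evenly
print(global_mean(x))  # 7.5 on any ring size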
1:1 structural translation of notes/tpu-deep-dive.cn.md Part I per GEMINI.md section 1: idiomatic, concise English, with every chapter, section, bullet, and bolded term mirrored from the Chinese source. Six chapters covering:

- Single chip: MXU/VPU/SPU systolic array, walk-through, VLIW packs
- Inter-chip ICI + 3D Torus
- OCS optical switching with 96-fiber geometry
- 3D-decomposed Ring All-Reduce with NUCA
- Host integration, including PCIe/NUMA/multi-host slice
- Advanced packaging from Wire Bonding to TSV

External-knowledge callouts are ported verbatim from the Chinese version for review consistency. Parts II-V and appendices to follow.

Co-authored-by: Claude <noreply@anthropic.com>
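For the Ring All-Reduce chapter, a toy reduce-scatter + all-gather over simulated chips. It shows why per-chip traffic is 2(N-1)/N of the data regardless of ring size, which is what lets each torus dimension be reduced as an independent ring. Purely illustrative:

```python
# Toy ring all-reduce: phase 1 (reduce-scatter) leaves each chip with the
# full sum of one chunk; phase 2 (all-gather) circulates finished chunks.
# Each chip sends n-1 chunks per phase, i.e. 2*(n-1)/n of its data total.
import numpy as np

def ring_all_reduce(shards):
    n = len(shards)                                   # n "chips", one array each
    chunks = [list(np.array_split(s.astype(float), n)) for s in shards]
    # reduce-scatter: after n-1 steps, chip i holds the sum of chunk (i+1) % n
    for t in range(n - 1):
        for i in range(n):                            # chip i -> chip i+1
            c = (i - t) % n                           # chunk traveling this step
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]
    # all-gather: pass each completed chunk around the ring, overwriting
    for t in range(n - 1):
        for i in range(n):
            c = (i + 1 - t) % n
            chunks[(i + 1) % n][c] = chunks[i][c]
    return [np.concatenate(ch) for ch in chunks]

data = [np.arange(8.0) + 10 * r for r in range(4)]    # 4 chips, 8 values each
out = ring_all_reduce(data)
assert all(np.allclose(o, sum(data)) for o in out)
```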
1:1 structural mirror of Chinese Parts II (XLA: chapters 7-9) and III (inference adaptation: chapters 10-14). Covers:

- Compilation model and VLIW packs
- JIT/AOT/bucketing/persistent-cache pipeline
- Topology-aware mapping
- vLLM/JetStream/Saxml stack split
- PagedAttention adaptation via control-plane/data-plane split
- Prefill/Decode coordination, including chunked prefill and 1D static flatten
- KV/memory hierarchy with ICI/GDS/KV-offload comparison
- Gemini's MoE Capacity Factor + Tree Attention

External-knowledge callouts mirror the Chinese version verbatim for review consistency. Parts IV, V, and appendices still to come.

Co-authored-by: Claude <noreply@anthropic.com>
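A sketch of the control-plane/data-plane split described for the PagedAttention adaptation, with toy pool sizes: the block tables stay in host-side Python (dynamic, vLLM-style), while the device only ever sees one fixed-shape KV pool, so compiled shapes never change. A real port would do the gather inside a Pallas kernel; everything here is a stand-in:

```python
# Control plane: plain Python dicts and lists on the host (fully dynamic).
# Data plane: one statically shaped pool array; lookups are gathers with a
# padded, fixed-length index vector, so XLA never recompiles.
import jax.numpy as jnp

NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM = 64, 16, 128
kv_pool = jnp.zeros((NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM))  # allocated once

block_tables = {}                   # seq_id -> list of pool block ids
free_blocks = list(range(NUM_BLOCKS))

def append_block(seq_id):
    blk = free_blocks.pop()         # control plane: pure host Python
    block_tables.setdefault(seq_id, []).append(blk)
    return blk

def gather_kv(seq_id, max_blocks=8):
    # pad the table to a fixed length so the gather has a static shape
    # (toy code: assumes len(table) <= max_blocks)
    table = block_tables[seq_id]
    idx = jnp.array(table + [0] * (max_blocks - len(table)))
    return kv_pool[idx]             # data plane: always (8, 16, 128)

append_block("req-0"); append_block("req-0")
print(gather_kv("req-0").shape)     # (8, 16, 128) regardless of table length
```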
Final part of the 1:1 English mirror.

Part IV covers K8s TPU abstractions (device plugin, slice CRD, Kueue gang scheduling, TPU Provisioner -> OCS, failure self-healing) and multi-host slice orchestration with LWS/JobSet and SPMD.

Part V covers programming model chains (GPU three-segment vs TPU one-segment), cost and efficiency (NVIDIA Tax, MFU, Performance/Watt, real-world TCO), and TPU hardware weaknesses with paired trade-offs.

Appendices A-C mirror the Chinese versions exactly: trade-off cheat sheet, numbers list, and TPU-to-GPU terminology mapping. The writing log is also mirrored, listing 10 intentional cuts and 5 external additions for author review.

This completes the bilingual content. Next step is to update the mkdocs.yml navigation to surface the new note and prepare for review/merge.

Co-authored-by: Claude <noreply@anthropic.com>
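For the MFU comparison in the cost chapter, the arithmetic is simple enough to show directly. All numbers below are placeholders, not sourced figures:

```python
# MFU = achieved FLOP/s over peak FLOP/s. The factor 6 is the standard
# training rule of thumb (~6 FLOPs per parameter per token); decode-only
# inference is usually estimated at ~2.
def mfu(tokens_per_sec: float, params: float, peak_flops: float) -> float:
    achieved = 6 * params * tokens_per_sec
    return achieved / peak_flops

# e.g. a 70e9-parameter model at 100 tok/s/chip on a 400 TFLOP/s-peak chip
print(f"MFU = {mfu(100, 70e9, 400e12):.1%}")   # -> MFU = 10.5%
```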
Add the new bilingual TPU note to both the 中文版 -> 技术笔记 (Chinese -> Tech Notes) and English -> Tech Notes sections of mkdocs.yml so it appears in the published site sidebar after the merge to main.

Co-authored-by: Claude <noreply@anthropic.com>
Summary
- Bilingual long-form note notes/tpu-deep-dive.{cn.md, md}, distilled from your Gemini conversation in notes/tpu_draft.md. Organized by abstraction layer (hardware → XLA → inference → cluster → system comparison) plus three reference appendices. Each chapter ends with a ↔ GPU subsection.
- English style follows GEMINI.md §1.
- Registered in mkdocs.yml (both 中文版 and English sections).

Status
- [补充 — Claude 加] callouts for your review
- mkdocs.yml nav updated

Things specifically asking for your decision
- notes/tpu_draft.md is the source you committed; it'll be copied into the build by CI but isn't in nav, so it won't appear on the site. Decide whether to delete it after merge or keep it as an archive.
- Five [补充 — Claude 加] callouts (ICI 4.5 TB/s number, Microsoft Lumen OCS aside, v4 single-digit MFU sourcing, Mooncake aside, Midjourney/Character.AI/Waymark numbers without dates): please verify or strip.
- The path was switched from docs/tpu-deep-dive.md to notes/tpu-deep-dive.{cn.md, md} to match repo convention. OK?

Test plan
- Review the [补充] callouts and the intentional-cuts list
- Approve with an lgtm/merge comment, or merge directly per the CLAUDE.md SOP

Note: CI doesn't run on PRs (only on push to main per .github/workflows/ci.yml), so the published rendering with math/Mermaid will only be visible after merge. Iterate post-merge if needed.