
notes: TPU deep dive (bilingual, content complete, ready for review) #2

Open
Random-Liu wants to merge 8 commits into main from claude/notes-tpu-deep-dive

Conversation


@Random-Liu Random-Liu commented May 5, 2026

Summary

Bilingual long-form note notes/tpu-deep-dive.{cn.md, md} distilled from your Gemini conversation in notes/tpu_draft.md. Organized by abstraction layer (hardware → XLA → inference → cluster → system comparison) plus three reference appendices. Each chapter ends with a ↔ GPU subsection.

  • Chinese: 1395 lines, 19 chapters + 3 appendices + writing log
  • English: 1398 lines, 1:1 structural mirror per GEMINI.md §1
  • Site nav updated in mkdocs.yml (both 中文版 and English sections)

Status

  • ✅ Skeleton
  • ✅ Part I — Hardware (single chip, ICI/Torus, OCS+96 fibers, 3D Ring All-Reduce + NUCA, host/PCIe/NUMA, packaging)
  • ✅ Part II — XLA (compilation model, JIT/AOT/bucketing, topology-aware mapping)
  • ✅ Part III — Inference adaptation (vLLM/JetStream/Saxml split, PagedAttention via Pallas, Prefill/Decode + chunked + 1D static flatten, KV/memory hierarchy, Gemini's MoE/spec-decode compromises)
  • ✅ Part IV — Cluster (K8s slice CRD + Kueue + TPU Provisioner → OCS, multi-host slice with LWS/JobSet)
  • ✅ Part V — System comparison (programming model chains, cost/MFU/TCO, hardware weaknesses)
  • ✅ Appendices A (trade-off cheat sheet), B (numbers list), C (TPU↔GPU terminology)
  • ✅ Writing log with 10 intentional cuts + 5 external [补充 — Claude 加] callouts for your review
  • ✅ English mirror complete and committed
  • ✅ mkdocs.yml nav updated
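For reference, the nav change might look something like the sketch below. The section names come from this PR's description; the page titles and exact nesting are assumptions, not the actual file contents:

```yaml
nav:
  - 中文版:
      - 技术笔记:
          - TPU 深入解析: tpu-deep-dive.cn.md   # title assumed
  - English:
      - Tech Notes:
          - TPU Deep Dive: tpu-deep-dive.md     # title assumed
```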

Things specifically asking for your decision

  1. notes/tpu_draft.md is the source you committed; it'll be copied into the build by CI but isn't in nav, so it won't appear on the site. Decide whether to delete after merge or keep as archive.
  2. 5 external additions I tagged [补充 — Claude 加] (ICI 4.5 TB/s number, Microsoft Lumen OCS aside, v4 single-digit MFU sourcing, Mooncake aside, Midjourney/Character.AI/Waymark numbers without dates) — please verify or strip.
  3. 10 intentional cuts listed at bottom of each file's writing log — let me know which to add back.
  4. The user-prompt path docs/tpu-deep-dive.md was switched to notes/tpu-deep-dive.{cn.md, md} to match repo convention. OK?

Test plan

  • Skim chapter structure on GitHub mobile — does the layering match how you'd want to look up things in 6 months?
  • Sanity-check tone on one or two chapters (target: softer than the Claude default, with analogies, not assuming reader is expert)
  • Decide on [补充] callouts and intentional cuts
  • Approve via lgtm / merge comment, or merge directly per CLAUDE.md SOP

CI doesn't run on PRs (only on push to main per .github/workflows/ci.yml), so the published rendering with math/Mermaid will only be visible after merge. Iterate post-merge if needed.
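A minimal sketch of the trigger implied by that description (the real contents of .github/workflows/ci.yml are assumed, not quoted); if PR previews are ever wanted, adding a `pull_request` trigger would be the standard fix:

```yaml
# Sketch only — assumed shape of the existing trigger.
on:
  push:
    branches: [main]
# To build on PRs as well, one would add:
#   pull_request:
#     branches: [main]
```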

claude added 8 commits May 5, 2026 14:13
Set up the chapter structure for notes/tpu-deep-dive.cn.md based on
the agreed 22-topic outline, organized by abstraction layer:
hardware (Part I), XLA compiler/runtime (Part II), inference adaptation
(Part III), cluster orchestration (Part IV), and system-level GPU
comparison (Part V), plus three appendices for trade-off lookup,
parameter list, and GPU-equivalence glossary. Each chapter has a
single-line topic statement and TODO bullets that map directly to
points in the source conversation; no prose has been written yet, so
this commit contains structure only.

Co-authored-by: Claude <noreply@anthropic.com>
Replace skeleton TODOs with full prose for chapters 1-6:
- Single chip: MXU/VPU/SPU systolic array, 2x2 walk-through, VLIW packs
- Inter-chip: ICI and 3D Torus topology
- OCS: MEMS mirror physics, 96-fiber geometry, rack-level slicing
- Collective communication: Ring All-Reduce, dim-partitioned 3D, NUCA
- Host integration: PCIe, multi-host slicing (LWS/JobSet), NUMA
- Advanced packaging: 2.5D/3D, silicon interposer, TSV
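To make the Ring All-Reduce in the collective-communication chapter concrete, here is a minimal single-process simulation (plain Python, no TPU specifics): reduce-scatter followed by all-gather, 2·(N−1) steps, each step moving one 1/N-sized chunk per link. The 3D variant in the note runs this once per torus dimension.

```python
# Minimal Ring All-Reduce simulation: N "nodes" each hold a vector; after
# reduce-scatter + all-gather, every node holds the elementwise sum.
def ring_all_reduce(vectors):
    n = len(vectors)
    k = len(vectors[0])
    assert k % n == 0, "vector length must divide evenly into n chunks"
    size = k // n
    chunks = [list(v) for v in vectors]  # per-node working copy

    # Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) mod n
    # to its ring neighbor, which accumulates it. Afterwards node i holds
    # the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i - s) % n, (i + 1) % n
            for j in range(c * size, (c + 1) * size):
                chunks[dst][j] += chunks[i][j]

    # Phase 2: all-gather. In step s, node i forwards its newest complete
    # chunk (i + 1 - s) mod n; after n - 1 steps every node has the sum.
    for s in range(n - 1):
        for i in range(n):
            c, dst = (i + 1 - s) % n, (i + 1) % n
            for j in range(c * size, (c + 1) * size):
                chunks[dst][j] = chunks[i][j]
    return chunks
```

Each link carries 2·(N−1)·(k/N) elements total, which is why ring bandwidth cost is nearly independent of ring size.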

Each chapter ends with a GPU-comparison subsection. Three
"[补充 — Claude 加]" callouts mark facts added beyond the source for
user review.

Parts II-V and appendices to follow in subsequent commits on this
branch.

Co-authored-by: Claude <noreply@anthropic.com>
Part II (XLA compiler & runtime, chapters 7-9):
- Compilation model: operator fusion, static padding, software
  pipelining, VLIW 5-slot packs (single-chip and multi-chip variants)
- Compilation timing: JIT timeline, bucketing + AOT + persistent
  cache pipeline, why no universal precompiled library
- Topology-aware mapping: TP on short copper rings, DP on long
  optical rings

Part III (inference adaptation, chapters 10-14):
- Software stack split: vLLM (lift-and-shift), JetStream
  (TPU-native), Saxml (JAX legacy)
- PagedAttention adaptation: control-plane/data-plane split with
  XLA pool + vLLM block tables + Pallas custom kernel
- Prefill/Decode coordination: arithmetic intensity gap, static-bus
  continuous batching, chunked prefill, 1D static flatten for mixed
  steps
- KV / memory hierarchy: ICI as native RDMA-bypass, no GDS
  equivalent, KV offload via PCIe to host DDR
- Gemini practical compromises: MoE Capacity Factor for static
  routing, Tree Attention for tensorized speculative decoding
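The control-plane half of the PagedAttention split above can be sketched as a block allocator over a fixed pool: Python owns the block tables, while on TPU the data plane (gathering KV via those tables) would be a Pallas kernel over a statically allocated pool. Everything here is a simplified illustration, not vLLM's actual classes.

```python
class BlockAllocator:
    """Hands out fixed-size KV-cache blocks and keeps per-request block tables."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # indices into the static KV pool
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens written

    def append_token(self, req_id):
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:         # current block full, or first token
            if not self.free:
                raise MemoryError("KV pool exhausted; caller must preempt")
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1
        return self.tables[req_id]           # block table consumed by the kernel

    def release(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```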

One additional "[补充 — Claude 加]" callout in Chapter 13 about
Mooncake-style separated KV pools (not in source).

Part IV, V, appendices, and English mirror still pending.

Co-authored-by: Claude <noreply@anthropic.com>
Part IV (cluster orchestration, chapters 15-16):
- K8s TPU abstraction: device plugin, topology labels, slice CRD,
  Kueue gang scheduling, TPU Provisioner calling OCS API, OCS-based
  failure self-healing
- Multi-host slice orchestration: LWS/JobSet, SPMD startup mode,
  failure-radius analysis
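For orientation, a multi-host slice served through LeaderWorkerSet might be declared roughly like this. Every name, size, label value, and image below is an illustrative assumption, not taken from the note:

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: tpu-serve                 # illustrative name
spec:
  replicas: 1                     # one model replica = one multi-host slice
  leaderWorkerTemplate:
    size: 4                       # hosts per slice (assumed)
    workerTemplate:
      spec:
        nodeSelector:
          cloud.google.com/gke-tpu-topology: 4x4   # illustrative topology
        containers:
          - name: server
            image: example/jetstream:latest        # placeholder image
            resources:
              limits:
                google.com/tpu: "4"                # chips per host (assumed)
```

The point of the CRD shape is that the scheduler treats the whole slice as one gang-scheduled unit, which is what the Kueue + TPU Provisioner flow above relies on.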

Part V (system comparison, chapters 17-19):
- Programming model chains: GPU's three-segment (CUDA -> NCCL ->
  IB/RDMA) vs TPU's single-segment (SPMD + ICI as VLIW slot 5),
  with concrete pseudo-instruction streams and the
  intersection/highway/factory analogies
- Cost and efficiency: NVIDIA Tax breakdown, MFU comparison,
  Performance/Watt, real-world TCO cases (Midjourney, Character.AI,
  Waymark)
- TPU hardware weaknesses: MXU granularity penalty, weak SPU and
  speculative-decoding pain, MoE All-to-All congestion on Torus,
  Decode big-batch dead-ends (KV OOM + token dispersion), HBM
  bandwidth/compute imbalance, what each weakness buys
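The MFU comparison in the cost chapter reduces to a one-line formula, sketched here with made-up numbers: MFU is the fraction of a chip's peak FLOP/s actually spent on useful model math, with decoder-transformer FLOPs per token commonly approximated as 2 × parameter count.

```python
def mfu(tokens_per_sec, n_params, peak_flops):
    """Model FLOPs Utilization: achieved useful FLOP/s over peak FLOP/s."""
    achieved = 2 * n_params * tokens_per_sec  # ~2*P FLOPs per decoded token
    return achieved / peak_flops

# Illustrative numbers (not measured): a 70e9-parameter model decoding
# 1000 tok/s on a chip with 900e12 peak FLOP/s.
print(round(mfu(1000, 70e9, 900e12), 3))  # → 0.156
```

Single-digit decode MFU, as flagged for v4 in the note, falls out directly: decode is memory-bound, so achieved FLOP/s sits far below peak.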

Appendices:
- A: Trade-off cheat sheet indexed by design dimension
- B: Numbers and parameters list (all annotated as "from source")
- C: TPU-to-GPU terminology mapping

Writing log:
- 10-item intentional-cuts list for user review
- 5 external "[补充 — Claude 加]" callouts collected with chapter
  references

Chinese version is now content-complete. English mirror to follow.

Co-authored-by: Claude <noreply@anthropic.com>
1:1 structural translation of notes/tpu-deep-dive.cn.md Part I per
GEMINI.md section 1: idiomatic + concise English, every chapter,
section, bullet, and bolded term mirrored from the Chinese source.
Six chapters covering single chip (MXU/VPU/SPU systolic array,
walk-through, VLIW packs), inter-chip ICI + 3D Torus, OCS optical
switching with 96-fiber geometry, 3D-decomposed Ring All-Reduce
with NUCA, host integration including PCIe/NUMA/multi-host slice,
and advanced packaging from Wire Bonding to TSV.

External-knowledge callouts ported verbatim from Chinese for
review consistency.

Parts II-V and appendices to follow.

Co-authored-by: Claude <noreply@anthropic.com>
1:1 structural mirror of Chinese Parts II (XLA: chapters 7-9) and
III (inference adaptation: chapters 10-14). Covers compilation
model and VLIW packs, JIT/AOT/bucketing/persistent-cache pipeline,
topology-aware mapping, vLLM/JetStream/Saxml stack split,
PagedAttention adaptation via control-plane/data-plane split,
Prefill/Decode coordination including chunked prefill and 1D
static flatten, KV/memory hierarchy with ICI/GDS/KV-offload
comparison, and Gemini's MoE Capacity Factor + Tree Attention.

External-knowledge callouts mirror the Chinese version verbatim
for review consistency.

Parts IV, V, and appendices still to come.

Co-authored-by: Claude <noreply@anthropic.com>
Final part of the 1:1 English mirror. Part IV covers K8s TPU
abstractions (device plugin, slice CRD, Kueue gang scheduling,
TPU Provisioner -> OCS, failure self-healing) and multi-host slice
orchestration with LWS/JobSet and SPMD. Part V covers programming
model chains (GPU three-segment vs TPU one-segment), cost and
efficiency (NVIDIA Tax, MFU, Performance/Watt, real-world TCO),
and TPU hardware weaknesses with paired trade-offs. Appendices A-C
mirror the Chinese versions exactly: trade-off cheat sheet,
numbers list, and TPU-to-GPU terminology mapping. The writing log
is also mirrored, listing 10 intentional cuts and 5 external
additions for author review.

This completes the bilingual content. Next step is to update the
mkdocs.yml navigation to surface the new note and prepare for
review/merge.

Co-authored-by: Claude <noreply@anthropic.com>
Add the new bilingual TPU note to both the 中文版 -> 技术笔记 and
English -> Tech Notes sections of mkdocs.yml so it appears in the
published site sidebar after merge to main.

Co-authored-by: Claude <noreply@anthropic.com>
@Random-Liu changed the title from "notes: TPU deep dive (skeleton + drafting in progress)" to "notes: TPU deep dive (bilingual, content complete, ready for review)" on May 6, 2026