perf: optimize OneTrans BF16 training on TF2.15 #265

Open
weloMThreads wants to merge 4 commits into MooreThreads:main from weloMThreads:perf/onetrans-bf16-effective-clean

@weloMThreads
Collaborator

Summary

  • Add OneTrans BF16 fast paths and graph rewrites for attention, RMSNorm, softmax, dropout, gather/cast, and SGD update patterns.
  • Restore the minimal TF2.15 MUSA device/runtime registration needed by these kernels.
  • Keep the branch scoped to the performance path; no training script changes are included.
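
To illustrate what a graph rewrite of this kind typically does, here is a minimal pure-Python sketch using RMSNorm as the example. It is not the PR's kernel or its pattern matcher: the function names, the `eps` default, and the exact op chain are assumptions chosen for illustration. The unfused version mirrors a chain of primitive ops that might appear in a graph, and the fused version computes the same result in a single pass, which is the equivalence a rewrite of this kind would need to preserve.

```python
import math

def rmsnorm_unfused(x, gamma, eps=1e-6):
    """RMSNorm as a chain of primitive ops (hypothetical graph pattern)."""
    squared = [v * v for v in x]                  # Square
    mean_sq = sum(squared) / len(squared)         # Mean
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)      # Add eps, then Rsqrt
    normed = [v * inv_rms for v in x]             # Mul by the inverse RMS
    return [n * g for n, g in zip(normed, gamma)] # Mul by the learned scale

def rmsnorm_fused(x, gamma, eps=1e-6):
    """Same math in one pass, standing in for a single fused kernel."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

if __name__ == "__main__":
    x = [3.0, 4.0, -1.5, 0.25]
    gamma = [1.0, 0.5, 2.0, 1.0]
    print(rmsnorm_unfused(x, gamma))
    print(rmsnorm_fused(x, gamma))
```

The sketch runs in Python floats; a real BF16 path would also have to decide where to accumulate in higher precision, which this example does not model.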

Verification

  • Built build/libmusa_plugin.so on CI from perf/onetrans-bf16-effective-clean.
  • Verified TensorFlow 2.15 plugin load exposes /device:MUSA:0 with MUSA_VISIBLE_DEVICES=6.
  • Ran OneTrans BF16 perf test on card 6:
    • Average time per full step: 41.852 ms
    • Measured steps: 40
    • Last measured train_loss: 0.8828
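
For readers who want to reproduce a number like the average step time above, the following is a minimal sketch of a per-step timing loop. It is not the perf test used in this PR: `average_step_ms`, `step_fn`, and the `warmup_steps` default are hypothetical, and the PR does not say whether warmup steps were excluded. Only the 40 measured steps come from the description.

```python
import time

def average_step_ms(step_fn, warmup_steps=5, measured_steps=40):
    """Return the mean wall-clock time per call of step_fn, in milliseconds.

    step_fn is assumed to run one full training step and to block until
    the step has finished (for example by fetching the loss value), so
    that asynchronous device work is included in the measurement.
    """
    # Warmup calls are run but not timed (assumption: one-time costs such
    # as graph tracing or kernel compilation should not skew the average).
    for _ in range(warmup_steps):
        step_fn()

    start = time.perf_counter()
    for _ in range(measured_steps):
        step_fn()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / measured_steps

if __name__ == "__main__":
    # Stand-in workload; replace with a real training step.
    print(f"{average_step_ms(lambda: sum(range(10000))):.3f} ms/step")
```

Timing the whole measured loop once, rather than each step separately, keeps timer overhead out of the result but hides per-step variance.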

@weloMThreads force-pushed the perf/onetrans-bf16-effective-clean branch from 8d1fb60 to eba02cb on May 20, 2026 10:36