perf: optimize OneTrans BF16 training on TF2.15 by weloMThreads · Pull Request #265 · MooreThreads/tensorflow_musa_extension

weloMThreads · 2026-05-20T09:20:34Z

Summary

- Add OneTrans BF16 fast paths and graph rewrites for attention, RMSNorm, softmax, dropout, gather/cast, and SGD update patterns.
- Restore the minimal TF2.15 MUSA device/runtime registration needed by these kernels.
- Keep the branch scoped to the performance path; no training script changes are included.
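
For readers unfamiliar with the patterns being rewritten, here is a plain-Python reference for one of them, RMSNorm. It is only a sketch of the standard formula (scale each element by the reciprocal root mean square, then by a learned weight). It is not the kernel or the graph rewrite from this PR, and the function name and `eps` default are illustrative assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight.

    Illustrative only; `rms_norm` and the eps default are assumptions,
    not names or values taken from the PR.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

When written out op by op in a graph, this becomes a chain of square, mean, add, rsqrt, and multiply nodes, which is the kind of pattern a fused fast path can replace with a single kernel.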

Built build/libmusa_plugin.so on CI from perf/onetrans-bf16-effective-clean.
Verified that loading the plugin under TensorFlow 2.15 exposes /device:MUSA:0 with MUSA_VISIBLE_DEVICES=6.
Ran OneTrans BF16 perf test on card 6:
- Average ms/full step: 41.852
- Measured steps: 40
- Last measured train_loss: 0.8828
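
The validation steps above could be reproduced roughly as follows. This is a minimal sketch, not the script used for the PR. It assumes the plugin can be loaded with `tf.load_library` (the project may ship its own loader), and the helper names and warm-up count are hypothetical. The 40 measured steps, the plugin path, and the `MUSA_VISIBLE_DEVICES=6` setting come from the description.

```python
import os
import time


def musa_device_names(device_names):
    """Filter TensorFlow device names down to those that refer to MUSA devices."""
    return [name for name in device_names if ":MUSA:" in name]


def list_musa_devices(plugin_path="build/libmusa_plugin.so", visible="6"):
    """Load the plugin and return the MUSA logical devices TensorFlow reports.

    Assumption: tf.load_library is the loading mechanism; adjust if the
    project provides a different entry point.
    """
    # Set before the device runtime initializes so the card selection applies.
    os.environ["MUSA_VISIBLE_DEVICES"] = visible
    import tensorflow as tf  # lazy import: the other helpers work without TF

    tf.load_library(plugin_path)
    return musa_device_names(d.name for d in tf.config.list_logical_devices())


def average_step_ms(step_fn, warmup_steps=5, measured_steps=40):
    """Return the mean wall-clock time per call of step_fn, in milliseconds.

    warmup_steps is a hypothetical default; warm-up calls are run but not timed.
    """
    for _ in range(warmup_steps):
        step_fn()
    start = time.perf_counter()
    for _ in range(measured_steps):
        step_fn()
    elapsed_s = time.perf_counter() - start
    return elapsed_s * 1000.0 / measured_steps
```

For the timing to reflect a full step, `step_fn` should block until the device work has finished, for example by reading the loss back to the host. Otherwise asynchronous dispatch can make steps look faster than they are.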

welo516 added 3 commits May 20, 2026 18:36

perf: add OneTrans BF16 fast paths

58fc902

fix: keep MUSA device context available on TF2.15

abe2897

fix: register MUSA devices on TensorFlow 2.15

eba02cb

weloMThreads force-pushed the perf/onetrans-bf16-effective-clean branch from 8d1fb60 to eba02cb May 20, 2026 10:36

ci: rerun against latest main

ccf5e9d