
feat!: multilingual text-to-speech#1134

Open
IgorSwat wants to merge 28 commits into main from @is/multilingual-tts

@IgorSwat
Contributor

@IgorSwat IgorSwat commented May 8, 2026

Description

Introduces major changes to the text-to-speech module based on the Kokoro model, including:

  • Multilingual text-to-speech - a set of complete pipelines & voices for different languages. A complete list of (currently) supported languages can be found below.
  • Improved phonemization & speech quality - using a neural phonemization model as a fallback for the old lexicon-based phonemization significantly improves speech quality, particularly for non-standard, out-of-dictionary words.
  • Timestamp-based audio cutting - an improved postprocessing algorithm that eliminates artifacts introduced by the .pte model, resulting in cleaner, more natural speech.
  • API changes that prepare for voice cloning and custom, fine-tuned versions of the Kokoro model.
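
The lexicon-first, neural-fallback idea from the list above can be sketched roughly as follows. This is only an illustration: `Lexicon`, `NeuralPhonemizer` and `phonemizeWords` are hypothetical names, not the Phonemis or react-native-executorch API, and the neural model is replaced by a stub.

```typescript
// Hypothetical types: none of these names come from Phonemis or react-native-executorch.
type Lexicon = ReadonlyMap<string, string>;
type NeuralPhonemizer = (word: string) => string;

// Look each word up in the lexicon first; call the neural model only for
// words the lexicon does not contain (out-of-dictionary words).
function phonemizeWords(
  words: readonly string[],
  lexicon: Lexicon,
  neuralFallback: NeuralPhonemizer
): string[] {
  return words.map((word) => {
    const known = lexicon.get(word.toLowerCase());
    return known !== undefined ? known : neuralFallback(word);
  });
}

// Toy data for illustration only; the stub just marks words it was asked about.
const toyLexicon: Lexicon = new Map([
  ['man', 'mˈæn'],
  ['trust', 'tɹˈʌst'],
]);
const toyNeural: NeuralPhonemizer = (word) => `<nn:${word}>`;

const phonemes = phonemizeWords(['Man', 'trust', 'Kokoro'], toyLexicon, toyNeural);
```

Here only `Kokoro` reaches the stubbed neural model, since the other two words are found in the toy lexicon.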

Current status of supported languages:

  • 🇺🇸 American English: ✅
  • 🇬🇧 British English: ✅
  • 🇫🇷 French: ✅
  • 🇪🇸 Spanish: ✅
  • 🇵🇹/🇧🇷 Portuguese: ✅
  • 🇮🇹 Italian: ✅
  • 🇵🇱 Polish: ✅
  • 🇩🇪 German: ✅
  • 🇮🇳 Hindi: ✅
  • 🇯🇵 Japanese: ❌ (coming soon)
  • 🇨🇳 Mandarin Chinese: ❌ (coming soon)

Introduces a breaking change?

  • Yes
  • No

There are 2 major breaking changes introduced by this PR:

  • Changed the "synthesis from phonemes" API.

    Old API:

    const audioData = await tts.forwardFromPhonemes({
      phonemes:
        'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
    });
    

    New API:

    const audioData = await tts.forward({
      text:
        'ɐ mˈæn hˌu dˈʌzᵊnt tɹˈʌst hɪmsˈɛlf, kæn nˈɛvəɹ ɹˈiᵊli tɹˈʌst ˈɛniwˌʌn ˈɛls.',
      phonemize: false, // Disables phonemization and treats text as phonemes
    });
    
  • Changed the predefined model and voice setups. Model files and voice/phonemization files are now bundled together, because languages such as Polish and German have fine-tuned model weights.

    Old API:

    const model = useTextToSpeech({
      model: KOKORO_MEDIUM,
      voice: KOKORO_VOICE_AF_HEART,
    });
    

    New API:

    const model = useTextToSpeech(KOKORO_AMERICAN_ENGLISH_FEMALE_HEART);
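
For call sites that still pass phonemes directly, the first breaking change could be bridged with a small wrapper along these lines. This is a sketch, not part of the PR: `ForwardRequest`, `TextToSpeechLike`, `toPhonemeRequest` and `forwardFromPhonemesCompat` are hypothetical names, and the request shape is inferred only from the `forward({ text, phonemize })` call shown above.

```typescript
// Minimal shape of the new API as it appears in this PR description; the real
// types in react-native-executorch may differ (the return type is an assumption).
interface ForwardRequest {
  text: string;
  phonemize?: boolean;
}

interface TextToSpeechLike<TAudio> {
  forward(request: ForwardRequest): Promise<TAudio>;
}

// Build the request that replaces the removed forwardFromPhonemes call:
// the phoneme string goes into `text`, and phonemization is switched off.
function toPhonemeRequest(phonemes: string): ForwardRequest {
  return { text: phonemes, phonemize: false };
}

// Hypothetical compatibility helper for code that still works with phonemes.
function forwardFromPhonemesCompat<TAudio>(
  tts: TextToSpeechLike<TAudio>,
  phonemes: string
): Promise<TAudio> {
  return tts.forward(toPhonemeRequest(phonemes));
}
```

Keeping the request construction in a separate pure function makes the migration easy to unit test without loading a model.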
    

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Play around with the demo speech apps.

Unit tests for RNE-specific code will be added later on.
The Phonemis package has its own wide range of unit tests (see the Phonemis repo).

Screenshots

Related issues

#712

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@IgorSwat IgorSwat requested review from chmjkb and msluszniak May 8, 2026 14:24
@IgorSwat IgorSwat force-pushed the @is/multilingual-tts branch from 8380a2a to eb999a7 Compare May 8, 2026 14:26
@IgorSwat IgorSwat self-assigned this May 8, 2026
@IgorSwat IgorSwat added the feature and improvement labels May 8, 2026
@IgorSwat IgorSwat changed the title from "feat: multilingual text-to-speech" to "feat!: multilingual text-to-speech" May 8, 2026
Member

@msluszniak msluszniak left a comment

You should also update the code samples in the documentation, and the documentation in general. Also address the lint warnings; there are plenty of words that you need to add to the cspell ignore list.

Comment thread packages/react-native-executorch/react-native-executorch.podspec Outdated
@msluszniak
Member

Also, if this PR adds a breaking change, please describe it directly below the "Introduces a breaking change?" section in the PR body.

Comment on lines +66 to +67
? pageY + height + 2
: pageY - Math.min(DROPDOWN_MAX_HEIGHT, models.length * 42) - 2;
Collaborator

a bunch of magic numbers, please explain

Contributor Author

Come on, this is just a frontend of a demo app. Those values are indeed random numbers which make it look good 😅
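
If the literals were to be named, one possible shape is sketched below. Everything here is an assumption made for illustration: the meaning given to 42 and 2, the value of `DROPDOWN_MAX_HEIGHT` (not shown in the diff excerpt), and the `openBelow` flag standing in for the ternary condition that the excerpt cuts off.

```typescript
// All names and values below are illustrative assumptions, not the demo app's code.
const DROPDOWN_MAX_HEIGHT = 200; // assumed cap; the real value is not shown in the diff
const DROPDOWN_ROW_HEIGHT = 42; // assumed meaning of the literal 42
const DROPDOWN_GAP = 2; // assumed meaning of the literal 2

// Compute the top offset of the dropdown, either just below the anchor
// or just above it, mirroring the two branches of the excerpt.
function dropdownTop(
  pageY: number,
  height: number,
  itemCount: number,
  openBelow: boolean
): number {
  const listHeight = Math.min(
    DROPDOWN_MAX_HEIGHT,
    itemCount * DROPDOWN_ROW_HEIGHT
  );
  return openBelow
    ? pageY + height + DROPDOWN_GAP
    : pageY - listHeight - DROPDOWN_GAP;
}
```

The arithmetic is unchanged from the excerpt; only the names are new.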

{ label: '🇮🇹 IM Nicola', value: KOKORO_ITALIAN_MALE_NICOLA },
{ label: '🇵🇹 PF Dora', value: KOKORO_PORTUGUESE_FEMALE_DORA },
{ label: '🇵🇹 PM Santa', value: KOKORO_PORTUGUESE_MALE_SANTA },
{ label: '🇵🇱 PM Mateusz', value: KOKORO_POLISH_MALE_MATEUSZ },
Collaborator

xD

Collaborator

i wonder who the fella is

Comment thread apps/speech/screens/TextToSpeechScreen.tsx Outdated
Comment thread packages/react-native-executorch/android/src/main/cpp/CMakeLists.txt Outdated
Comment on lines +24 to +39
phonemizer_(phonemis::Config{
    .lang = lang,
    .tagger = taggerDataSource.empty()
                  ? std::optional<phonemis::tagger::Config>{}
                  : std::make_optional(phonemis::tagger::Config{
                        .data_filepath = taggerDataSource}),
    .phonemizer =
        phonemis::phonemizer::Config{
            .lang = lang,
            .lexicon_filepath = lexiconSource.empty()
                                    ? std::nullopt
                                    : std::make_optional(lexiconSource),
            .nn_model_filepath =
                neuralModelSource.empty()
                    ? std::nullopt
                    : std::make_optional(neuralModelSource)}}),
Collaborator

this is very hard to read

Contributor Author

@IgorSwat IgorSwat May 18, 2026

I agree, but there are 2 things which prevent me from making it more readable:

  • Each time I try to improve it by just re-formatting indentation, the linter reverts it during committing phase
  • It's hard to simplify the code logic itself, since we always need these "empty path" checks to check what comes from the JS side, and these checks generate most of the complexity of this piece of code.

Member

@msluszniak msluszniak May 18, 2026

the linter reverts it during committing phase

This is because we have clang-format call in pre-commit hook with LLVM default format enforced.

Comment thread packages/react-native-executorch/src/constants/tts/voices.ts
Comment on lines +87 to +89
taggerIdx >= 0 ? (paths[taggerIdx] as string) : '',
lexiconIdx >= 0 ? (paths[lexiconIdx] as string) : '',
neuralModelIdx >= 0 ? (paths[neuralModelIdx] as string) : '',
Collaborator

do we need those assertions?

Contributor Author

Yes, we do. The tagger, lexicon and neural model are all theoretically optional, so we need some conditional to decide whether we should pass an empty string or an existing value.
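
If the question was about the `as string` type assertions rather than the conditionals, one way to keep the "empty string when absent" behavior without them might look like this. `pathOrEmpty` is a hypothetical helper and the file paths are made up for the example; the sketch assumes `paths` may contain `undefined` entries from the type checker's point of view.

```typescript
// Hypothetical helper: returns the path at `index`, or '' when the optional
// resource is absent (negative index) or the slot is undefined.
function pathOrEmpty(
  paths: readonly (string | undefined)[],
  index: number
): string {
  return index >= 0 ? (paths[index] ?? '') : '';
}

// Made-up paths for illustration; index -1 stands for "not provided".
const examplePaths = ['/models/tagger.bin', '/models/lexicon.txt'];
const nativeArgs = [
  pathOrEmpty(examplePaths, 0),
  pathOrEmpty(examplePaths, 1),
  pathOrEmpty(examplePaths, -1),
];
```

The conditional stays, as the reply above requires, but the nullish coalescing replaces the cast, so a missing entry cannot silently reach the native side as `undefined`.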

@msluszniak msluszniak linked an issue May 18, 2026 that may be closed by this pull request
5 tasks
@IgorSwat IgorSwat force-pushed the @is/multilingual-tts branch from a1837c6 to 995a70d Compare May 18, 2026 15:38
@msluszniak msluszniak requested review from chmjkb and msluszniak May 18, 2026 15:47
@IgorSwat IgorSwat force-pushed the @is/multilingual-tts branch from 7ed3558 to a3c38d3 Compare May 19, 2026 08:50
msluszniak and others added 3 commits May 19, 2026 11:31
…e type aliases

TypeDoc emits `export type` declarations under `06-api-reference/type-aliases/`,
not `06-api-reference/interfaces/`. The links in useTextToSpeech.md pointed at
the interfaces/ paths, which never get generated for these names, breaking the
Docusaurus build (`onBrokenLinks: 'throw'`).
@IgorSwat IgorSwat force-pushed the @is/multilingual-tts branch from 10e8e1c to 38340f6 Compare May 19, 2026 11:32
- tests/CMakeLists.txt: build phonemis from source (add_subdirectory)
  and propagate its include dir to rntests_core. The previous IMPORTED
  STATIC pointed at a libphonemis.a that nothing builds.
- FrameTransformTest, ObjectDetectionTest, InstanceSegmentationTest:
  update bbox member access for #1130's BBox refactor
  (.x1/.y1/.x2/.y2 → .p1.x/.p1.y/.p2.x/.p2.y).
- PoseEstimationTest: keypoint type became float in #1130; update the
  static_assert from int32_t to float.
- FrameTransformTest: make the three Right_* tests platform-aware.
  Production inverseRotateBbox/inverseRotatePoints are a no-op on
  Android for Right (front-cam upright portrait); rotateFrameForModel
  rotates CW on Android vs CCW on iOS. Tests now have #if defined(__APPLE__)
  branches matching production.
- SpeechToTextTest: GTEST_SKIP TranscribeReturnsValidChars with a TODO —
  known-failing on this branch, needs separate investigation.
- run_tests.sh: fix two stale Hugging Face URLs (fsmn-vad and
  yolo26n-pose filenames had changed upstream, causing wget to 404 and
  silently abort the script).

Labels

feature: PRs that implement a new feature
improvement: PRs or issues focused on improvements in the current codebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Text to Speech - add new languages support

4 participants