Text

Text & language processing for Elixir.

A toolkit for tokenization, language identification, sentiment analysis, named-entity recognition, word clouds, phonetic encoding, search ranking, and the supporting plumbing — all in pure BEAM, with optional ML backends behind feature flags.

Capabilities

Detection and analysis

Language identification (Text.Language.Classifier.Fasttext) — pure-Elixir port of fastText's lid.176, 176 languages, validated bit-for-bit against the reference. ~100 µs per prediction with EXLA.
Sentiment analysis (Text.Sentiment) — multilingual AFINN lexicons (104 languages + emoji) or XLM-RoBERTa via Bumblebee.
Part-of-speech tagging (Text.POS) — via Bumblebee, English by default.
Named-entity recognition (Text.NER) — via Bumblebee, multilingual (10 high-resource languages).
Readability (Text.Readability) — Flesch, Flesch-Kincaid, Gunning-Fog, SMOG, ARI, Coleman-Liau, LIX, Dale-Chall, Spache.
PII detection and redaction (Text.PII) — emails, phones, IBANs, Luhn-validated credit cards, SSNs, IPv4/IPv6, URLs.
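
To make "Luhn-validated" concrete, here is a minimal, self-contained sketch of the Luhn checksum that credit-card detectors typically apply after a pattern match. `LuhnSketch` is a hypothetical module written for illustration; it is not part of the Text.PII API, and a real detector would also check card length and issuer prefixes.

```elixir
defmodule LuhnSketch do
  @doc "Returns true when the digits in `input` pass the Luhn checksum."
  def valid?(input) when is_binary(input) do
    # Allow the usual visual separators, then insist on digits only.
    cleaned = String.replace(input, ~r/[\s-]/, "")

    if String.match?(cleaned, ~r/\A\d+\z/) do
      sum =
        cleaned
        |> String.graphemes()
        |> Enum.map(&String.to_integer/1)
        |> Enum.reverse()
        |> Enum.with_index()
        |> Enum.map(fn {digit, index} -> luhn_value(digit, index) end)
        |> Enum.sum()

      rem(sum, 10) == 0
    else
      false
    end
  end

  # Every second digit, counting from the right, is doubled;
  # results above 9 have 9 subtracted (same as summing their digits).
  defp luhn_value(digit, index) when rem(index, 2) == 1 do
    doubled = digit * 2
    if doubled > 9, do: doubled - 9, else: doubled
  end

  defp luhn_value(digit, _index), do: digit
end

LuhnSketch.valid?("4111 1111 1111 1111")
#=> true
LuhnSketch.valid?("4111 1111 1111 1112")
#=> false
```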

Cleaning and normalization

Pipeline cleaning (Text.Clean) — HTML/entity strip, mojibake repair, NFC/NFKC normalization, diacritic folding.
Truecasing (Text.Truecase) — sentence-aware case restoration for ALL-CAPS or lowercased text.
Emoji (Text.Emoji) — detect, count, strip, convert to/from :short_name: form, and per-emoji sentiment scoring (Kralj Novak et al. 2015).
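
As an illustration of what diacritic folding involves, the sketch below uses only the Elixir standard library: decompose to NFD, drop combining marks, and recompose to NFC. `FoldSketch` is a hypothetical name, and this is not necessarily how Text.Clean implements the step. Letters with no canonical decomposition (for example `ø`) pass through unchanged.

```elixir
defmodule FoldSketch do
  @doc "Strips combining marks from `text` after canonical decomposition."
  def fold(text) when is_binary(text) do
    text
    # NFD splits "è" into "e" plus U+0300 COMBINING GRAVE ACCENT.
    |> String.normalize(:nfd)
    # \p{Mn} matches nonspacing marks, which includes the combining accents.
    |> String.replace(~r/\p{Mn}/u, "")
    # Recompose whatever remains so the output is in NFC.
    |> String.normalize(:nfc)
  end
end

FoldSketch.fold("Crème brûlée")
#=> "Creme brulee"
```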

Strings

Edit distance (Text.Distance) — Levenshtein, Damerau-Levenshtein, Hamming, Jaro, Jaro-Winkler.
N-gram set similarity (Text.Distance) — Jaccard, Sørensen-Dice, Tanimoto, cosine over character n-grams.
Set similarity (Text.Similarity) — Jaccard, Dice, overlap, cosine over arbitrary token sets.
Phonetic encoding — Soundex, Metaphone, NYSIIS, Cologne (German), DoubleMetaphone, each with match?/2 for direct equality.
Slugification (Text.Slug) — locale-aware Unicode folding with cross-script transliteration.
Segmentation (Text.Segment) — UAX #29 word/sentence boundaries with CLDR abbreviation suppressions.
Syllable counting (Text.Syllable) — vowel-group heuristic with an exceptions table.
Hyphenation (Text.Hyphenation) — Knuth-Liang patterns; bundled en-us, de-1996, fr, es, it, nl, pt, with on-demand fetching for ~80 more from hyph-utf8.
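
For readers unfamiliar with n-gram set similarity, here is a small standalone sketch of Jaccard similarity over character n-grams. `NgramSketch` and its function names are assumptions made for this example; the Text.Distance functions may differ in name, casing rules, and padding behaviour.

```elixir
defmodule NgramSketch do
  @doc "Returns the set of character n-grams of `text` (lowercased, no padding)."
  def ngrams(text, n) when is_binary(text) and n > 0 do
    text
    |> String.downcase()
    |> String.graphemes()
    # Sliding window of size n, step 1; short trailing windows are dropped.
    |> Enum.chunk_every(n, 1, :discard)
    |> MapSet.new(&Enum.join/1)
  end

  @doc "Jaccard similarity: size of the intersection divided by size of the union."
  def jaccard(a, b, n \\ 2) do
    set_a = ngrams(a, n)
    set_b = ngrams(b, n)
    union_size = MapSet.size(MapSet.union(set_a, set_b))

    if union_size == 0 do
      # Both inputs are shorter than n; treat them as identical by convention.
      1.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union_size
    end
  end
end

# "night" and "nacht" share one bigram ("ht") out of seven distinct ones,
# so the similarity is 1/7, roughly 0.143.
NgramSketch.jaccard("night", "nacht")
```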

Spelling and morphology

Spell correction (Text.Spell) — Norvig-style edit-distance suggestions backed by Text.WordFreq.
Lemmatization (Text.Lemma) — dictionary-driven, ~41k bundled English pairs; ~20 more languages auto-downloadable from michmech via mix text.download_lemma_data.
Word frequencies (Text.WordFreq) — bundled top-30,000 frequencies for en, de, fr, es, it, nl, pt with frequency, zipf, rank, top.
English inflection (Text.Inflect.En) — pluralize/2 and singularize/2, modern and classical modes.
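
"Norvig-style" correction generates every string one edit away from the input and keeps the known word with the highest frequency. The sketch below shows that idea in isolation. `SpellSketch`, the ASCII-only alphabet, and the plain map of frequencies are assumptions for the example; it does not call Text.Spell or Text.WordFreq.

```elixir
defmodule SpellSketch do
  # Lowercase ASCII only; a real corrector would use the target language's alphabet.
  @letters Enum.map(?a..?z, fn c -> <<c>> end)

  @doc "All strings one delete, transpose, replace, or insert away from `word`."
  def edits1(word) when is_binary(word) do
    splits = for i <- 0..String.length(word), do: String.split_at(word, i)

    # Generators silently skip splits whose right half is too short for the pattern.
    deletes = for {l, <<_::utf8, rest::binary>>} <- splits, do: l <> rest

    transposes =
      for {l, <<a::utf8, b::utf8, rest::binary>>} <- splits,
          do: l <> <<b::utf8, a::utf8>> <> rest

    replaces =
      for {l, <<_::utf8, rest::binary>>} <- splits, c <- @letters, do: l <> c <> rest

    inserts = for {l, r} <- splits, c <- @letters, do: l <> c <> r

    MapSet.new(deletes ++ transposes ++ replaces ++ inserts)
  end

  @doc "Returns `word` if known, else the most frequent known word one edit away."
  def correct(word, freqs) when is_map(freqs) do
    if Map.has_key?(freqs, word) do
      word
    else
      case Enum.filter(edits1(word), &Map.has_key?(freqs, &1)) do
        [] -> word
        candidates -> Enum.max_by(candidates, &Map.fetch!(freqs, &1))
      end
    end
  end
end

# Made-up counts, for illustration only.
freqs = %{"the" => 100, "then" => 20, "they" => 30}
SpellSketch.correct("teh", freqs)
#=> "the"
```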

Statistics and search

N-grams and word counts (Text.Ngram, Text.Word).
TF-IDF and BM25 (Text.IR) — indexed corpus with scoring and top-K search.
Collocation extraction (Text.Collocation) — bigrams ranked by frequency, PMI, or log-likelihood.
Concordance (Text.KWIC) — keyword-in-context lookup.
Word embeddings (Text.Embedding) — load fastText .vec files, then cosine similarity, nearest neighbours, and analogies.
Word clouds (Text.WordCloud) — multilingual keyword extraction (six scoring backends) plus spiral layout and SVG rendering.
Extractive summarization (Text.Summarize) — TextRank and LexRank over Jaccard sentence graphs.
Stopwords (Text.Stopwords) — bundled lists for ~60 languages from stopwords-iso.
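
To show what BM25 ranking computes, here is a compact sketch over an in-memory list of documents. Everything in it is an assumption for illustration: the `BM25Sketch` module, the regex tokenizer, the parameters k1 = 1.2 and b = 0.75, and the smoothed IDF variant. Text.IR may tokenize, index, and weight differently.

```elixir
defmodule BM25Sketch do
  @k1 1.2
  @b 0.75

  @doc "Lowercases and splits on runs of non-letter, non-digit characters."
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end

  @doc "Returns `{doc_index, score}` pairs sorted by descending BM25 score."
  def rank(docs, query) when is_list(docs) and is_binary(query) do
    tokenized = Enum.map(docs, &tokenize/1)
    n = length(tokenized)
    avgdl = Enum.sum(Enum.map(tokenized, &length/1)) / max(n, 1)

    # Document frequency: in how many documents each term appears.
    df = tokenized |> Enum.flat_map(&Enum.uniq/1) |> Enum.frequencies()
    terms = tokenize(query)

    tokenized
    |> Enum.with_index()
    |> Enum.map(fn {tokens, index} ->
      tf = Enum.frequencies(tokens)
      # Length normalisation; guard against a corpus of empty documents.
      norm = if avgdl == 0, do: 1.0, else: length(tokens) / avgdl

      score =
        Enum.reduce(terms, 0.0, fn term, acc ->
          f = Map.get(tf, term, 0)
          n_t = Map.get(df, term, 0)
          idf = :math.log(1 + (n - n_t + 0.5) / (n_t + 0.5))
          acc + idf * f * (@k1 + 1) / (f + @k1 * (1 - @b + @b * norm))
        end)

      {index, score}
    end)
    |> Enum.sort_by(fn {_index, score} -> score end, :desc)
  end
end

docs = ["the cat sat on the mat", "dogs chase cats", "a cat and another cat"]
[{best, _score} | _rest] = BM25Sketch.rank(docs, "cat")
best
#=> 2
```

The third document wins because it mentions "cat" twice in fewer tokens; the second scores zero because this sketch does no stemming, so "cats" does not match "cat".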

Installation

def deps do
  [
    {:text, "~> 0.5.0"}
  ]
end

For the language identifier, fetch the lid.176.bin model once after install:

mix text.download_lid176

For production environments using the optional Bumblebee-backed modules, mix text.download_models (plural) pre-fetches every external artefact — lid.176.bin plus the default Hugging Face checkpoints — so the first call to each module never hits the network.

A taste

# Sentiment — multilingual, no model download by default.
Text.Sentiment.analyze("J'adore ce livre.", language: :fr).label
#=> :positive

# Language identification — load the fastText model once.
{:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load(
  Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
)

{:ok, "es"} = Text.Language.Classifier.Fasttext.classify("Hola, ¿cómo estás?", model)

# Word cloud → SVG file in four piped steps.
# `text` is any string you have already loaded, such as the contents of an article.
text
|> Text.WordCloud.terms(language: :en)
|> Text.WordCloud.Layout.layout(width: 800, height: 600, rotations: :radial)
|> Text.WordCloud.SVG.render(palette: Color.Palette.tonal("#3b82f6"))
|> then(&File.write!("cloud.svg", &1))

Guides

In-depth walkthroughs with worked examples:

Text classification (language identification) — setup, detect/3, CLDR locale resolution, Hans/Hant disambiguation, performance tuning.
Sentiment analysis — lexicon vs Bumblebee backends, multilingual lexicons, custom lexicons, threshold tuning, production wiring.
Part-of-speech tagging and NER — Bumblebee setup, tag sets, model pre-download, named Nx.Servings for high-QPS workloads.
Keyword-in-context concordance — Text.KWIC.concordance/3, formatting, collocate scans, sense disambiguation patterns.
Word clouds — six scoring backends (YAKE!, frequency, RAKE, TextRank, TF-IDF, KeyBERT), Wordle-style layouts (:radial/:spiral), SVG rendering with Color.Palette integration.

Optional dependencies

The package works without any optional deps. Adding them enables progressively heavier capabilities:

| Dependency | Enables |
| --- | --- |
| `:exla` | Order-of-magnitude faster inference for Fasttext and the Bumblebee-backed modules. Strongly recommended in production. |
| `:bumblebee` | Neural sentiment, POS, NER, and the KeyBERT word-cloud backend. |
| `:localize` | CLDR-canonical locale resolution (`fr-Latn-CA`, `zh-Hans-CN`) and `Localize.LanguageTag` input shapes. |
| `:color` | `Color.Palette.Tonal` and `Theme` palettes for SVG word-cloud rendering. |
| `:text_stemmer` | Snowball stemming (~30 languages) for word-cloud morphological-variant consolidation. |

Calls that need a missing optional dep raise with installation instructions; the rest of the package keeps working.

Every public function that takes a :language (or :locale) accepts an atom (:fr), a string ("fr", "fr-CA", "zh-Hans-CN"), or a Localize.LanguageTag struct (when :localize is loaded). See Text.Language for the normalisation helpers.

Roadmap

Quantized model support for lid.176.ftz (917 KB variant).
Nx.Serving-based batched inference for throughput-bound workloads.
CLDR-tailored segmentation once the unicode_set regex engine matures.

License

Apache 2.0 — see LICENSE.md.
