Text & language processing for Elixir.
A toolkit for tokenization, language identification, sentiment analysis, named-entity recognition, word clouds, phonetic encoding, search ranking, and the supporting plumbing — all in pure BEAM, with optional ML backends behind feature flags.
- Language identification (`Text.Language.Classifier.Fasttext`): pure-Elixir port of fastText's `lid.176`, 176 languages, validated bit-for-bit against the reference. ~100 µs per prediction with EXLA.
- Sentiment analysis (`Text.Sentiment`): multilingual AFINN lexicons (104 languages + emoji) or XLM-RoBERTa via Bumblebee.
- Part-of-speech tagging (`Text.POS`): via Bumblebee, English by default.
- Named-entity recognition (`Text.NER`): via Bumblebee, multilingual (10 high-resource languages).
- Readability (`Text.Readability`): Flesch, Flesch-Kincaid, Gunning-Fog, SMOG, ARI, Coleman-Liau, LIX, Dale-Chall, Spache.
- PII detection and redaction (`Text.PII`): emails, phones, IBANs, Luhn-validated credit cards, SSNs, IPv4/IPv6, URLs.
- Pipeline cleaning (`Text.Clean`): HTML/entity strip, mojibake repair, NFC/NFKC normalization, diacritic folding.
- Truecasing (`Text.Truecase`): sentence-aware case restoration for ALL-CAPS or lowercased text.
- Emoji (`Text.Emoji`): detect, count, strip, convert to/from `:short_name:` form, and per-emoji sentiment scoring (Kralj Novak et al. 2015).
- Edit distance (`Text.Distance`): Levenshtein, Damerau-Levenshtein, Hamming, Jaro, Jaro-Winkler.
- N-gram set similarity (`Text.Distance`): Jaccard, Sørensen-Dice, Tanimoto, cosine over character n-grams.
- Set similarity (`Text.Similarity`): Jaccard, Dice, overlap, cosine over arbitrary token sets.
- Phonetic encoding: `Soundex`, `Metaphone`, `NYSIIS`, `Cologne` (German), `DoubleMetaphone`, each with `match?/2` for direct equality.
- Slugification (`Text.Slug`): locale-aware Unicode folding with cross-script transliteration.
- Segmentation (`Text.Segment`): UAX #29 word/sentence boundaries with CLDR abbreviation suppressions.
- Syllable counting (`Text.Syllable`): vowel-group heuristic with an exceptions table.
- Hyphenation (`Text.Hyphenation`): Knuth-Liang patterns; bundled `en-us`, `de-1996`, `fr`, `es`, `it`, `nl`, `pt`, with on-demand fetching for ~80 more from hyph-utf8.
- Spell correction (`Text.Spell`): Norvig-style edit-distance suggestions backed by `Text.WordFreq`.
- Lemmatization (`Text.Lemma`): dictionary-driven, ~41k bundled English pairs; ~20 more languages auto-downloadable from michmech via `mix text.download_lemma_data`.
- Word frequencies (`Text.WordFreq`): bundled top-30,000 frequencies for `en`, `de`, `fr`, `es`, `it`, `nl`, `pt` with `frequency`, `zipf`, `rank`, `top`.
- English inflection (`Text.Inflect.En`): `pluralize/2` and `singularize/2`, modern and classical modes.
- N-grams and word counts (`Text.Ngram`, `Text.Word`).
- TF-IDF and BM25 (`Text.IR`): indexed corpus with scoring and top-K search.
- Collocation extraction (`Text.Collocation`): bigrams ranked by frequency, PMI, or log-likelihood.
- Concordance (`Text.KWIC`): keyword-in-context lookup.
- Word embeddings (`Text.Embedding`): load fastText `.vec` files, then cosine similarity, nearest neighbours, and analogies.
- Word clouds (`Text.WordCloud`): multilingual keyword extraction (six scoring backends) plus spiral layout and SVG rendering.
- Extractive summarization (`Text.Summarize`): TextRank and LexRank over Jaccard sentence graphs.
- Stopwords (`Text.Stopwords`): bundled lists for ~60 languages from stopwords-iso.
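
To make two of the measures listed above concrete, here is a minimal pure-Elixir sketch of Levenshtein distance and Jaccard similarity over character n-grams. The `DistanceSketch` module and its function names are hypothetical and exist only for illustration; this is not the implementation or the API of `Text.Distance`, which may differ in options, return shapes, and performance.

```elixir
defmodule DistanceSketch do
  # Hypothetical module for illustration; not part of the Text package.

  # Levenshtein distance, computed one row of the edit matrix at a time.
  def levenshtein(a, b) do
    a_chars = String.graphemes(a)
    b_chars = String.graphemes(b)
    # Row 0: distance from the empty prefix of `a` to each prefix of `b`.
    first_row = Enum.to_list(0..length(b_chars))

    a_chars
    |> Enum.with_index(1)
    |> Enum.reduce(first_row, fn {ca, i}, prev_row ->
      next_row(ca, i, b_chars, prev_row)
    end)
    |> List.last()
  end

  # Builds row `i` from the previous row. `left` is the cell just computed,
  # `up` the cell above, and `diag` the cell above and to the left.
  defp next_row(ca, i, b_chars, [diag | rest_prev]) do
    {row_rev, _left, _diag} =
      b_chars
      |> Enum.zip(rest_prev)
      |> Enum.reduce({[i], i, diag}, fn {cb, up}, {acc, left, diag} ->
        cost = if ca == cb, do: 0, else: 1
        cell = Enum.min([left + 1, up + 1, diag + cost])
        {[cell | acc], cell, up}
      end)

    Enum.reverse(row_rev)
  end

  # The set of character n-grams of a string.
  def ngrams(string, n) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> MapSet.new()
  end

  # Jaccard similarity: |A ∩ B| / |A ∪ B| over the two n-gram sets.
  # Two strings with no n-grams at all are treated as identical (1.0).
  def jaccard(a, b, n \\ 2) do
    set_a = ngrams(a, n)
    set_b = ngrams(b, n)
    union = MapSet.size(MapSet.union(set_a, set_b))

    if union == 0 do
      1.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union
    end
  end
end

DistanceSketch.levenshtein("kitten", "sitting")
#=> 3
```

The sketch works on graphemes rather than bytes so that accented characters count as single units, which is the usual choice for user-facing text.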
```elixir
def deps do
  [
    {:text, "~> 0.5.0"}
  ]
end
```

For the language identifier, fetch the `lid.176.bin` model once after install:

```bash
mix text.download_lid176
```

For production environments using the optional Bumblebee-backed modules, `mix text.download_models` (plural) pre-fetches every external artefact (`lid.176.bin` plus the default Hugging Face checkpoints) so the first call to each module never hits the network.

```elixir
# Sentiment: multilingual, no model download by default.
Text.Sentiment.analyze("J'adore ce livre.", language: :fr).label
#=> :positive

# Language identification: load the fastText model once.
{:ok, model} =
  Text.Language.Classifier.Fasttext.ModelLoader.load(
    Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
  )

{:ok, "es"} = Text.Language.Classifier.Fasttext.classify("Hola, ¿cómo estás?", model)

# Word cloud → SVG file in four piped steps.
text
|> Text.WordCloud.terms(language: :en)
|> Text.WordCloud.Layout.layout(width: 800, height: 600, rotations: :radial)
|> Text.WordCloud.SVG.render(palette: Color.Palette.tonal("#3b82f6"))
|> then(&File.write!("cloud.svg", &1))
```

In-depth walkthroughs with worked examples:

- Text classification (language identification): setup, `detect/3`, CLDR locale resolution, Hans/Hant disambiguation, performance tuning.
- Sentiment analysis: lexicon vs Bumblebee backends, multilingual lexicons, custom lexicons, threshold tuning, production wiring.
- Part-of-speech tagging and NER: Bumblebee setup, tag sets, model pre-download, named `Nx.Serving`s for high-QPS workloads.
- Keyword-in-context concordance: `Text.KWIC.concordance/3`, formatting, collocate scans, sense disambiguation patterns.
- Word clouds: six scoring backends (YAKE!, frequency, RAKE, TextRank, TF-IDF, KeyBERT), Wordle-style layouts (`:radial` / `:spiral`), SVG rendering with `Color.Palette` integration.

The package works without any optional deps. Adding them enables progressively heavier capabilities:
| Dep | Enables |
|---|---|
| `:exla` | Order-of-magnitude faster inference for Fasttext and the Bumblebee-backed modules. Strongly recommended in production. |
| `:bumblebee` | Neural sentiment, POS, NER, and the KeyBERT word-cloud backend. |
| `:localize` | CLDR-canonical locale resolution (`fr-Latn-CA`, `zh-Hans-CN`) and `Localize.LanguageTag` input shapes. |
| `:color` | `Color.Palette.Tonal` and `Theme` palettes for SVG word-cloud rendering. |
| `:text_stemmer` | Snowball stemming (~30 languages) for word-cloud morphological-variant consolidation. |

Calls that need a missing optional dep raise with installation instructions; the rest of the package keeps working.
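
Raising when an optional dependency is missing is commonly built on `Code.ensure_loaded?/1`. The following is a minimal sketch of that pattern, assuming a hypothetical `OptionalDepSketch.ensure_dep!/2` helper; the package's actual guard and error wording are not shown in this README and may differ.

```elixir
defmodule OptionalDepSketch do
  # Hypothetical helper for illustration; not part of the Text package's API.
  # Returns :ok when `module` can be loaded, otherwise raises with a hint
  # naming the dependency to add.
  def ensure_dep!(module, dep) when is_atom(module) and is_atom(dep) do
    if Code.ensure_loaded?(module) do
      :ok
    else
      raise RuntimeError,
            "#{inspect(module)} is not available. " <>
              "Add #{inspect(dep)} to deps/0 in mix.exs and run `mix deps.get`."
    end
  end
end

# Enum ships with Elixir, so this check always succeeds.
:ok = OptionalDepSketch.ensure_dep!(Enum, :elixir)
```

Checking at call time rather than at compile time is what lets the rest of a package keep working when a heavy dependency is absent.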
Every public function that takes a `:language` (or `:locale`) accepts an atom (`:fr`), a string (`"fr"`, `"fr-CA"`, `"zh-Hans-CN"`), or a `Localize.LanguageTag` struct (when `:localize` is loaded). See `Text.Language` for the normalisation helpers.
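
As a rough illustration of what accepting several input shapes involves, the sketch below reduces an atom or a string to its primary language subtag. The `LanguageInputSketch` module and `primary_subtag/1` are hypothetical names; the real helpers live in `Text.Language`, and their names and behaviour (including `Localize.LanguageTag` handling, which is omitted here) may differ.

```elixir
defmodule LanguageInputSketch do
  # Hypothetical module for illustration; see Text.Language for the real helpers.

  # Atoms such as :fr are converted to strings first.
  def primary_subtag(language) when is_atom(language) and not is_nil(language) do
    language |> Atom.to_string() |> primary_subtag()
  end

  # Strings such as "fr-CA" or "zh-Hans-CN" keep only the leading subtag.
  def primary_subtag(language) when is_binary(language) do
    language
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end
end

LanguageInputSketch.primary_subtag(:fr)
#=> "fr"
LanguageInputSketch.primary_subtag("zh-Hans-CN")
#=> "zh"
```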
- Quantized model support for `lid.176.ftz` (917 KB variant).
- `Nx.Serving` for batched inference in throughput-bound workloads.
- CLDR-tailored segmentation once the `unicode_set` regex engine matures.

Apache 2.0 — see LICENSE.md.