kipcole9/text

Text

Text & language processing for Elixir.

A toolkit for tokenization, language identification, sentiment analysis, named-entity recognition, word clouds, phonetic encoding, search ranking, and the supporting plumbing — all in pure BEAM, with optional ML backends behind feature flags.

Capabilities

Detection and analysis

  • Language identification (Text.Language.Classifier.Fasttext) — pure-Elixir port of fastText's lid.176, 176 languages, validated bit-for-bit against the reference. ~100 µs per prediction with EXLA.
  • Sentiment analysis (Text.Sentiment) — multilingual AFINN lexicons (104 languages + emoji) or XLM-RoBERTa via Bumblebee.
  • Part-of-speech tagging (Text.POS) — via Bumblebee, English by default.
  • Named-entity recognition (Text.NER) — via Bumblebee, multilingual (10 high-resource languages).
  • Readability (Text.Readability) — Flesch, Flesch-Kincaid, Gunning-Fog, SMOG, ARI, Coleman-Liau, LIX, Dale-Chall, Spache.
  • PII detection and redaction (Text.PII) — emails, phones, IBANs, Luhn-validated credit cards, SSNs, IPv4/IPv6, URLs.
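
As an illustration of what "Luhn-validated" means for credit-card detection, here is a minimal standalone sketch of the Luhn checksum. The module name `LuhnSketch` is hypothetical and this is not the code inside Text.PII; it only shows the check that separates plausible card numbers from arbitrary digit runs.

```elixir
defmodule LuhnSketch do
  # Returns true when the digits in `input` pass the Luhn checksum.
  # Spaces and hyphens are ignored; any other non-digit character fails.
  def valid?(input) when is_binary(input) do
    cleaned = String.replace(input, [" ", "-"], "")

    if String.match?(cleaned, ~r/\A\d+\z/) do
      sum =
        cleaned
        |> String.to_charlist()
        |> Enum.reverse()
        |> Enum.with_index()
        |> Enum.map(fn {char, index} -> luhn_value(char - ?0, index) end)
        |> Enum.sum()

      rem(sum, 10) == 0
    else
      false
    end
  end

  # Every second digit, counting from the right, is doubled;
  # a doubled value above 9 has 9 subtracted (same as summing its digits).
  defp luhn_value(digit, index) when rem(index, 2) == 1 do
    doubled = digit * 2
    if doubled > 9, do: doubled - 9, else: doubled
  end

  defp luhn_value(digit, _index), do: digit
end
```

With this sketch, `LuhnSketch.valid?("4111 1111 1111 1111")` passes, while changing the final digit makes it fail.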

Cleaning and normalization

  • Pipeline cleaning (Text.Clean) — HTML/entity strip, mojibake repair, NFC/NFKC normalization, diacritic folding.
  • Truecasing (Text.Truecase) — sentence-aware case restoration for ALL-CAPS or lowercased text.
  • Emoji (Text.Emoji) — detect, count, strip, convert to/from :short_name: form, and per-emoji sentiment scoring (Kralj Novak et al. 2015).
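
Diacritic folding can be sketched with the Elixir standard library alone. The following is an assumption about the general technique (decompose, drop combining marks, recompose), not a description of how Text.Clean is implemented; `FoldSketch` is a hypothetical name.

```elixir
defmodule FoldSketch do
  # Decompose to NFD so accents become separate combining marks,
  # drop those marks (Unicode category Mn), then recompose to NFC.
  def fold_diacritics(text) when is_binary(text) do
    text
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
    |> String.normalize(:nfc)
  end
end
```

For example, `FoldSketch.fold_diacritics("Crème brûlée")` yields `"Creme brulee"`. Letters that have no canonical decomposition (such as "ø") pass through unchanged under this approach.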

Strings

  • Edit distance (Text.Distance) — Levenshtein, Damerau-Levenshtein, Hamming, Jaro, Jaro-Winkler.
  • N-gram set similarity (Text.Distance) — Jaccard, Sørensen-Dice, Tanimoto, cosine over character n-grams.
  • Set similarity (Text.Similarity) — Jaccard, Dice, overlap, cosine over arbitrary token sets.
  • Phonetic encoding — Soundex, Metaphone, NYSIIS, Cologne (German), DoubleMetaphone, each with match?/2 for direct equality.
  • Slugification (Text.Slug) — locale-aware Unicode folding with cross-script transliteration.
  • Segmentation (Text.Segment) — UAX #29 word/sentence boundaries with CLDR abbreviation suppressions.
  • Syllable counting (Text.Syllable) — vowel-group heuristic with an exceptions table.
  • Hyphenation (Text.Hyphenation) — Knuth-Liang patterns; bundled en-us, de-1996, fr, es, it, nl, pt, with on-demand fetching for ~80 more from hyph-utf8.
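
To make "Jaccard over character n-grams" concrete, here is a small standalone sketch. It does not call Text.Distance, whose function names are not shown in this README; `NgramSketch` and its functions are hypothetical.

```elixir
defmodule NgramSketch do
  # Set of character n-grams (by grapheme) in `text`.
  def ngrams(text, n \\ 2) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> MapSet.new()
  end

  # Jaccard similarity: |A ∩ B| / |A ∪ B| over the two n-gram sets.
  def jaccard(a, b, n \\ 2) do
    set_a = ngrams(a, n)
    set_b = ngrams(b, n)
    union_size = MapSet.size(MapSet.union(set_a, set_b))

    if union_size == 0 do
      # Both inputs are shorter than n; returning 0.0 is a convention chosen here.
      0.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union_size
    end
  end
end
```

"night" and "nacht" share one bigram ("ht") out of seven distinct bigrams, so the sketch scores them at 1/7.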

Spelling and morphology

  • Spell correction (Text.Spell) — Norvig-style edit-distance suggestions backed by Text.WordFreq.
  • Lemmatization (Text.Lemma) — dictionary-driven, ~41k bundled English pairs; ~20 more languages auto-downloadable from michmech via mix text.download_lemma_data.
  • Word frequencies (Text.WordFreq) — bundled top-30,000 frequencies for en, de, fr, es, it, nl, pt with frequency, zipf, rank, top.
  • English inflection (Text.Inflect.En) — pluralize/2 and singularize/2, modern and classical modes.
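
The "Norvig-style" approach mentioned for spell correction can be sketched as follows: generate every string one edit away from the input, keep the ones found in a frequency table, and pick the most frequent. This is a simplified illustration, not Text.Spell itself; `SpellSketch` and the `freqs` map are hypothetical stand-ins for the real module and for Text.WordFreq data.

```elixir
defmodule SpellSketch do
  # Assumes lowercase ASCII words; a real corrector would use the language's alphabet.
  @letters for c <- ?a..?z, do: <<c>>

  # Every string one delete, transpose, replace, or insert away from `word`.
  def edits1(word) do
    splits = for i <- 0..String.length(word), do: String.split_at(word, i)

    deletes = for {l, <<_::utf8, rest::binary>>} <- splits, do: l <> rest

    transposes =
      for {l, <<a::utf8, b::utf8, rest::binary>>} <- splits,
          do: l <> <<b::utf8, a::utf8>> <> rest

    replaces =
      for {l, <<_::utf8, rest::binary>>} <- splits, c <- @letters, do: l <> c <> rest

    inserts = for {l, r} <- splits, c <- @letters, do: l <> c <> r

    MapSet.new(deletes ++ transposes ++ replaces ++ inserts)
  end

  # Returns `word` if known, else the most frequent known word one edit away,
  # else `word` unchanged. `freqs` is a hypothetical map of word => count.
  def correct(word, freqs) do
    if Map.has_key?(freqs, word) do
      word
    else
      case Enum.filter(edits1(word), &Map.has_key?(freqs, &1)) do
        [] -> word
        candidates -> Enum.max_by(candidates, &Map.fetch!(freqs, &1))
      end
    end
  end
end
```

Given `%{"the" => 100, "then" => 10}`, the sketch corrects "teh" to "the" through a single transposition. A fuller version would also consider candidates two edits away.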

Statistics and search

  • N-grams and word counts (Text.Ngram, Text.Word).
  • TF-IDF and BM25 (Text.IR) — indexed corpus with scoring and top-K search.
  • Collocation extraction (Text.Collocation) — bigrams ranked by frequency, PMI, or log-likelihood.
  • Concordance (Text.KWIC) — keyword-in-context lookup.
  • Word embeddings (Text.Embedding) — load fastText .vec files, then cosine similarity, nearest neighbours, and analogies.
  • Word clouds (Text.WordCloud) — multilingual keyword extraction (six scoring backends) plus spiral layout and SVG rendering.
  • Extractive summarization (Text.Summarize) — TextRank and LexRank over Jaccard sentence graphs.
  • Stopwords (Text.Stopwords) — bundled lists for ~60 languages from stopwords-iso.
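
For readers unfamiliar with BM25, the scoring idea can be sketched in a few lines. This standalone example is not the Text.IR API; `Bm25Sketch`, the parameter values k1 = 1.2 and b = 0.75, and the particular IDF variant are assumptions chosen because they are common defaults.

```elixir
defmodule Bm25Sketch do
  @k1 1.2
  @b 0.75

  # `docs` is a list of token lists and `query` is a token list.
  # Returns `{doc_index, score}` pairs, best first.
  # Assumes the corpus contains at least one non-empty document.
  def rank(docs, query) do
    n = length(docs)
    avgdl = Enum.sum(Enum.map(docs, &length/1)) / max(n, 1)

    # Document frequency: how many documents contain each term.
    df =
      docs
      |> Enum.flat_map(&Enum.uniq/1)
      |> Enum.frequencies()

    docs
    |> Enum.with_index()
    |> Enum.map(fn {doc, i} -> {i, score(doc, query, df, n, avgdl)} end)
    |> Enum.sort_by(fn {_i, s} -> s end, :desc)
  end

  defp score(doc, query, df, n, avgdl) do
    tf = Enum.frequencies(doc)
    len = length(doc)

    query
    |> Enum.uniq()
    |> Enum.map(fn term ->
      f = Map.get(tf, term, 0)
      n_t = Map.get(df, term, 0)
      # Smoothed IDF that stays non-negative for very common terms.
      idf = :math.log((n - n_t + 0.5) / (n_t + 0.5) + 1)
      # Term-frequency saturation with document-length normalisation.
      idf * f * (@k1 + 1) / (f + @k1 * (1 - @b + @b * len / avgdl))
    end)
    |> Enum.sum()
  end
end
```

Ranking `[~w(the cat sat), ~w(the dog barked loudly), ~w(cat cat cat)]` against the query `["cat"]` puts the third document first, the first document second, and gives the second document a score of zero.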

Installation

def deps do
  [
    {:text, "~> 0.5.0"}
  ]
end

For the language identifier, fetch the lid.176.bin model once after install:

mix text.download_lid176

For production environments using the optional Bumblebee-backed modules, mix text.download_models (plural) pre-fetches every external artefact — lid.176.bin plus the default Hugging Face checkpoints — so the first call to each module never hits the network.

A taste

# Sentiment — multilingual, no model download by default.
Text.Sentiment.analyze("J'adore ce livre.", language: :fr).label
#=> :positive

# Language identification — load the fastText model once.
{:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load(
  Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
)

{:ok, "es"} = Text.Language.Classifier.Fasttext.classify("Hola, ¿cómo estás?", model)

# Word cloud → SVG file in four piped steps. Here `text` is any input string,
# and Color.Palette comes from the optional :color dependency.
text
|> Text.WordCloud.terms(language: :en)
|> Text.WordCloud.Layout.layout(width: 800, height: 600, rotations: :radial)
|> Text.WordCloud.SVG.render(palette: Color.Palette.tonal("#3b82f6"))
|> then(&File.write!("cloud.svg", &1))
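
Since the model should be loaded only once, one possible way to share it across processes is `:persistent_term`. The sketch below is an assumption about application wiring, not something the package prescribes; `ModelCacheSketch` is a hypothetical helper that accepts any zero-arity loader returning `{:ok, model}`.

```elixir
defmodule ModelCacheSketch do
  # Returns the cached value for `key`, running `loader` only on the first call.
  # If two processes race on the first call, both may load; the result is the same.
  def fetch(key, loader) when is_function(loader, 0) do
    case :persistent_term.get(key, :missing) do
      :missing ->
        {:ok, model} = loader.()
        :persistent_term.put(key, model)
        model

      model ->
        model
    end
  end
end

# Hypothetical usage with the loader shown above:
#
#   model =
#     ModelCacheSketch.fetch({:text, :lid176}, fn ->
#       Text.Language.Classifier.Fasttext.ModelLoader.load(
#         Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
#       )
#     end)
```

`:persistent_term` suits data that is written once and read often, because reads do not copy the term; updating or erasing entries is comparatively expensive.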

Guides

In-depth walkthroughs with worked examples:

  • Text classification (language identification) — setup, detect/3, CLDR locale resolution, Hans/Hant disambiguation, performance tuning.

  • Sentiment analysis — lexicon vs Bumblebee backends, multilingual lexicons, custom lexicons, threshold tuning, production wiring.

  • Part-of-speech tagging and NER — Bumblebee setup, tag sets, model pre-download, named Nx.Servings for high-QPS workloads.

  • Keyword-in-context concordance — Text.KWIC.concordance/3, formatting, collocate scans, sense disambiguation patterns.

  • Word clouds — six scoring backends (YAKE!, frequency, RAKE, TextRank, TF-IDF, KeyBERT), Wordle-style layouts (:radial/:spiral), SVG rendering with Color.Palette integration.

Optional dependencies

The package works without any optional deps. Adding them enables progressively heavier capabilities:

| Dep | Enables |
|---|---|
| :exla | Order-of-magnitude faster inference for Fasttext and the Bumblebee-backed modules. Strongly recommended in production. |
| :bumblebee | Neural sentiment, POS, NER, and the KeyBERT word-cloud backend. |
| :localize | CLDR-canonical locale resolution (fr-Latn-CA, zh-Hans-CN) and Localize.LanguageTag input shapes. |
| :color | Color.Palette.Tonal and Theme palettes for SVG word-cloud rendering. |
| :text_stemmer | Snowball stemming (~30 languages) for word-cloud morphological-variant consolidation. |

Calls that need a missing optional dep raise with installation instructions; the rest of the package keeps working.

Every public function that takes a :language (or :locale) accepts an atom (:fr), a string ("fr", "fr-CA", "zh-Hans-CN"), or a Localize.LanguageTag struct (when :localize is loaded). See Text.Language for the normalisation helpers.
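
To illustrate how the atom and string shapes can collapse to one form, here is a hypothetical helper that reduces either to a lowercase language subtag. It is not the Text.Language API and it ignores Localize.LanguageTag structs; it only shows the kind of normalisation involved.

```elixir
defmodule LanguageSketch do
  # Reduce an atom or a BCP 47-style string to its lowercase language subtag.
  def base_language(lang) when is_atom(lang) and not is_nil(lang) do
    lang |> Atom.to_string() |> base_language()
  end

  def base_language(lang) when is_binary(lang) do
    lang
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end
end
```

Under this sketch `:fr`, `"fr"`, and `"fr-CA"` all reduce to `"fr"`, and `"zh-Hans-CN"` reduces to `"zh"`. A real resolver would keep the script subtag where it matters, for example to tell Hans from Hant.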

Roadmap

  • Quantized model support for lid.176.ftz (917 KB variant).
  • Nx.Serving-based batched inference for throughput-bound workloads.
  • CLDR-tailored segmentation once the unicode_set regex engine matures.

License

Apache 2.0 — see LICENSE.md.
