WIP: Added fighting-words term importance to clustering models by x-tabdeveloping · Pull Request #75 · x-tabdeveloping/turftopic

x-tabdeveloping · 2024-12-10T10:39:12Z

I'm working on adding term importance from this paper.
It's honestly way smarter than tf-idf-based approaches, and doesn't suffer from the smoothing issues of the Bayes method I developed earlier.
It also doesn't have the theoretical weaknesses of Top2Vec, and it produces similar or better quality topics.
I'm considering making it the default in the library.

x-tabdeveloping · 2024-12-10T10:39:28Z

Thanks for sharing the post @KennethEnevoldsen

x-tabdeveloping · 2024-12-10T10:41:02Z

Also, the component values are way more interpretable, since they're basically z-scores.
You can essentially assign significance to descriptive words, which is awesome.
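
To illustrate why the values read as z-scores, here is a minimal sketch of a log-odds ratio with a Dirichlet prior, which is what the "fighting words" name usually refers to. It assumes a dense cluster-by-term count matrix and a uniform prior; the function name `log_odds_z_scores` and the `prior` parameter are hypothetical and not taken from this PR or from turftopic.

```python
import numpy as np


def log_odds_z_scores(counts, prior=0.01):
    """Z-scored log-odds of each term in each cluster versus all other clusters.

    counts: array of shape (n_clusters, n_terms) with raw term counts.
    prior: pseudo-count added to every term (uniform Dirichlet prior, an assumption).
    """
    counts = np.asarray(counts, dtype=float)
    n_clusters, n_terms = counts.shape
    alpha = np.full(n_terms, prior)
    alpha_0 = alpha.sum()
    total = counts.sum(axis=0)
    z = np.zeros_like(counts)
    for i in range(n_clusters):
        y_in = counts[i]            # term counts inside the cluster
        y_out = total - counts[i]   # term counts in every other cluster
        n_in, n_out = y_in.sum(), y_out.sum()
        # Smoothed log-odds of each term inside and outside the cluster
        log_odds_in = np.log((y_in + alpha) / (n_in + alpha_0 - y_in - alpha))
        log_odds_out = np.log((y_out + alpha) / (n_out + alpha_0 - y_out - alpha))
        delta = log_odds_in - log_odds_out
        # Approximate variance of the log-odds difference
        variance = 1.0 / (y_in + alpha) + 1.0 / (y_out + alpha)
        z[i] = delta / np.sqrt(variance)
    return z


# Toy example: two clusters, three terms
toy_counts = np.array([[10, 1, 5], [1, 10, 5]])
toy_z = log_odds_z_scores(toy_counts)
print(toy_z)
```

Under this sketch, a term whose absolute score exceeds roughly 1.96 would be flagged as distinctive at the conventional 5% level, which is the sense in which significance can be assigned to descriptive words.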

KennethEnevoldsen · 2024-12-10T19:41:31Z

Oh, this looks great! Glad to see that you are already tackling it.

What are your thoughts on adapting it to an embedding use case?

x-tabdeveloping · 2025-04-01T11:26:40Z

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

from turftopic.feature_importance import fighting_words  # imported in the draft, not used below
from turftopic.supervised.semantic_lexical import SemanticLexicalAnalysis

# Load the full 20 Newsgroups corpus without headers, footers and quoted replies
ds = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
    #   categories=["alt.atheism", "sci.space"],
)
corpus = ds.data
labels = np.array(ds.target)

# Precomputed document embeddings are read from a local file;
# uncomment the encode() call below to compute them instead.
embeddings = np.load("_emb/20news_all-MiniLM.npy")
trf = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = trf.encode(corpus, show_progress_bar=True)

# Fit the draft supervised model on the documents, their labels and embeddings
model = SemanticLexicalAnalysis(encoder=trf).fit(
    corpus, y=labels, embeddings=embeddings
)

# Plots for two of the labels (the arguments are presumably label indices)
model.plot_semantic_lexical_square(19)

model.plot_residuals(1)

Added fighting-words term importance to clustering models

e4f8fc0

x-tabdeveloping added 2 commits December 18, 2024 14:19

Added semantic-difference based feature importance

d8662f0

Added a first draft of SemanticLexicalAnalysis

3d8d1fd

x-tabdeveloping mentioned this pull request Jun 3, 2025

Added feature importance methods based on cluster differences #102

Merged

x-tabdeveloping wants to merge 3 commits into main from fighting_words
