
Jigsaw Unintended Bias in Toxicity Classification

Kaggle Competition

Getting started (these steps are also included in simple_lstm_baseline.py)

# Download the dataset
kaggle competitions download -c jigsaw-unintended-bias-in-toxicity-classification

# unzip data
mkdir data
unzip test.csv.zip -d data
unzip train.csv.zip -d data

# the extracted files may lack read permission; grant it
chmod +r data/*

# clean up
rm *.zip

Dataset

Submission

For evaluation, test set examples with target >= 0.5 will be considered to be in the positive class (toxic).

Models do not need to predict the additional attributes for this competition; a submission contains only the id and prediction columns:

id,prediction
7000000,0.0
7000001,0.0
etc.
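As a sketch, the format above can be written with Python's csv module (write_submission is a hypothetical helper name, not part of the repo):

```python
import csv

def write_submission(ids, predictions, path="submission.csv"):
    """Write a two-column Kaggle submission file (id, prediction)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "prediction"])
        for example_id, pred in zip(ids, predictions):
            writer.writerow([example_id, pred])
```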

Evaluation

Submetrics

  • Overall AUC: the ROC-AUC for the full evaluation set
  • Bias AUCs:
    • Subgroup AUC
    • BPSN (Background Positive, Subgroup Negative) AUC
    • BNSP (Background Negative, Subgroup Positive) AUC
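A minimal sketch of the three bias AUCs, using a pure-Python rank-based ROC-AUC (function names are illustrative, not taken from the competition's benchmark code):

```python
def roc_auc(labels, scores):
    """ROC-AUC via the Mann-Whitney statistic: the probability that a
    random positive example scores higher than a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def _select(labels, scores, keep):
    idx = [i for i, k in enumerate(keep) if k]
    return [labels[i] for i in idx], [scores[i] for i in idx]

def subgroup_auc(labels, scores, in_subgroup):
    # AUC restricted to examples that mention the identity subgroup
    return roc_auc(*_select(labels, scores, in_subgroup))

def bpsn_auc(labels, scores, in_subgroup):
    # Background Positive, Subgroup Negative: non-toxic subgroup
    # examples scored against toxic background examples
    keep = [(g and y == 0) or (not g and y == 1)
            for y, g in zip(labels, in_subgroup)]
    return roc_auc(*_select(labels, scores, keep))

def bnsp_auc(labels, scores, in_subgroup):
    # Background Negative, Subgroup Positive: toxic subgroup
    # examples scored against non-toxic background examples
    keep = [(g and y == 1) or (not g and y == 0)
            for y, g in zip(labels, in_subgroup)]
    return roc_auc(*_select(labels, scores, keep))
```

A low BPSN AUC indicates the model confuses non-toxic subgroup mentions with toxic background text, which is exactly the unintended bias the competition measures.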

Generalized Mean of Bias AUCs

$$ M_p(m_s) = \left(\frac{1}{N} \sum_{s=1}^{N} m_s^p\right)^\frac{1}{p} $$
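Computed directly from the formula, where each m_s is a per-subgroup bias AUC. The competition used a negative p (reportedly p = -5, which pulls the mean toward the worst-performing subgroup); treat that value as an assumption to verify against the evaluation page.

```python
def generalized_mean(values, p):
    """Power mean M_p of a list of per-subgroup bias AUCs m_s."""
    n = len(values)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)
```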

Final Metric

$$ score = w_0 AUC_{overall} + \sum_{a=1}^{A} w_a M_p(m_{s,a}) $$
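Putting the pieces together as a sketch: the competition reportedly weighted all four terms equally (w = 0.25, with A = 3 bias AUC families: subgroup, BPSN, BNSP), but confirm the weights and p against the official evaluation description.

```python
def final_score(overall_auc, bias_aucs, p=-5, weight=0.25):
    """Combine the overall AUC with the generalized mean of each
    bias-AUC family; bias_aucs is a list of A lists, one list of
    per-subgroup values per family."""
    def power_mean(values):
        return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)
    return weight * overall_auc + sum(weight * power_mean(m) for m in bias_aucs)
```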

Usage

# simple LSTM baseline
python3 simple_lstm_baseline.py

Resources

Popular Kernel

Preprocessing

Model