This document catalogs the datasets used to train our essay scoring models, specifically focusing on dimensional/multi-trait scoring (Task Achievement, Coherence & Cohesion, Vocabulary, and Grammar).
Important
Dataset licensing affects how trained models can be used. We document licensing status for all datasets used in training.
| Dataset | License Type | Commercial Use | Notes |
|---|---|---|---|
| IELTS-WT2-LLaMa3-1k | Unknown (HuggingFace) | Unclear | No explicit license specified on HuggingFace |
| DREsS | Consent form required | Per consent terms | Requires signed consent form for access |
| Write & Improve | Non-commercial only | ❌ No | Used only for calibration/validation, not primary training |
| ASAP++ | Research use | No (research only) | Standard academic research license |
| ELLIPSE | Kaggle competition | Per competition rules | Competition rules apply |
The W&I Corpus has restrictive licensing:
- Non-commercial use only - prohibits commercial products/services
- No derived items without approval - models trained primarily on this data require CUP&A approval
- No redistribution - data cannot be shared publicly
Our Approach: AES-DEBERTA uses W&I only for Stage 2 CEFR calibration and validation; primary model training uses other datasets (IELTS-WT2, DREsS). This use may be defensible under the research/educational terms, and in any case we do not redistribute the corpus or any model primarily derived from it.
These datasets are actively used in the training pipeline for AES-DEBERTA.
Purpose: Dimensional Scoring Initialization (Primary Training Data)
| Attribute | Value |
|---|---|
| Size | ~1,000 essays |
| Labels | 4 Dimensions (TA, CC, Vocab, Grammar) on 0-9 scale |
| Source | HuggingFace (123Harr/IELTS-WT2-LLaMa3-1k) |
| License | Unknown (no explicit license on HuggingFace) |
| Usage | Used by AES-DEBERTA (Stage 1) to learn distinct traits. |
Pros:
- Exact match for our IELTS-style dimensions.
- High-quality synthetic/augmented data.
Note: The license is listed as "unknown" on HuggingFace, so usage should be treated as research/educational only until the licensing is clarified.
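For reference, a minimal sketch of pulling this corpus from the Hub with the `datasets` library; the split name and the dimension column names below are assumptions and should be checked against the dataset card:

```python
# Sketch: loading IELTS-WT2-LLaMa3-1k from the HuggingFace Hub.
# The split name and column names are assumptions -- verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("123Harr/IELTS-WT2-LLaMa3-1k", split="train")  # split name assumed

# Assumed names for the four 0-9 trait columns (TA, CC, Vocab, Grammar).
DIMENSION_COLUMNS = [
    "task_achievement",
    "coherence_cohesion",
    "lexical_resource",
    "grammatical_range",
]

print(ds[0].keys())  # inspect the actual schema before wiring up training
```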
Purpose: Dimensional Scoring Augmentation (Primary Training Data)
| Attribute | Value |
|---|---|
| Size | ~48.9K samples (2.3K human-scored + 6.5K standardized + 40K synthetic) |
| Labels | 3 Dimensions (Content, Organization, Language) on 1-5 scale |
| Mapping | Content → TA, Organization → CC, Language → Vocab/Grammar |
| Source | Official Website |
| License | Consent form required for access (ACL 2025 paper) |
| Usage | Used by AES-DEBERTA (Stage 1) |
Pros:
- Large-scale rubric-based dataset.
- Expert-scored EFL essays.
- Published in ACL 2025.
Note: Access requires submitting a consent form. License terms are in the consent agreement.
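To make the mapping above concrete, here is a minimal sketch of converting DREsS trait scores to our four 0-9 dimensions; the column names and the linear 1-5 → 0-9 rescaling are assumptions, not part of the published dataset:

```python
# Sketch of the DREsS -> AES-DEBERTA label mapping described above.
# Column names and the linear rescaling are assumptions.
def rescale_1_5_to_0_9(score: float) -> float:
    """Linearly map a DREsS 1-5 trait score onto the 0-9 IELTS-style scale."""
    return (score - 1.0) / 4.0 * 9.0

def map_dress_row(row: dict) -> dict:
    """Map DREsS traits (Content, Organization, Language) to our 4 dimensions."""
    content = rescale_1_5_to_0_9(row["content"])
    organization = rescale_1_5_to_0_9(row["organization"])
    language = rescale_1_5_to_0_9(row["language"])
    return {
        "task_achievement": content,         # Content -> TA
        "coherence_cohesion": organization,  # Organization -> CC
        "vocabulary": language,              # Language -> Vocab
        "grammar": language,                 # Language -> Grammar (shared signal)
    }
```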
Purpose: CEFR Calibration & Validation Only
| Attribute | Value |
|---|---|
| Size | ~3,800 training essays |
| Labels | Holistic CEFR (A1-C2) |
| Source | Cambridge University Press & Assessment |
| License | Non-commercial, research use only, no redistribution |
| Usage | Used by AES-DEBERTA (Stage 2 calibration only) |
Warning
W&I license prohibits commercial use and requires approval for derived models. We use it only for calibration (aligning scores to CEFR scale) and validation (testing accuracy), not as primary training data.
These datasets are high-quality candidates for improving model robustness in the future.
Best choice for US K-12 style scoring
| Attribute | Value |
|---|---|
| Size | ~13,000 essays (8 prompts) |
| Dimensions | Content, Organization, Style, Conventions |
| Score Range | 0-6 per dimension |
| Paper | ASAP++ Paper (ACL Anthology) |
| Essays | Kaggle ASAP-AES |
| License | Research use |
Pros: True human rubric scores, widely used in research. Cons: US K-12 prompts differ significantly from IELTS/EFL tasks.
Newest, largest argumentative dataset
| Attribute | Value |
|---|---|
| Size | ~24,000 essays |
| Focus | Argumentative writing (more relevant to IELTS) |
| Download | Kaggle ASAP 2.0 |
Pros: Large size, argumentative focus.
Best for EFL learner specificity
| Attribute | Value |
|---|---|
| Size | ~6,500 essays |
| Dimensions | Cohesion, Syntax, Vocabulary, Phraseology, Grammar, Conventions |
| Authors | English language learners (EFL students) |
| Download | Kaggle ELLIPSE |
Pros: 6 dimensional scores, EFL student authors.
Our current AES-DEBERTA training strategy (scripts/training/train-deberta-aes.py) implements a three-stage process:
Stage 1: Dimensional Pre-Training
- Uses IELTS-WT2 & DREsS (primary training data)
- Objective: Learn to distinguish the scored dimensions (task achievement/content, coherence/organization, vocabulary, and grammar).
- Loss: MSE on dimensional scores (0-9 scale); see the sketch below.
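A minimal sketch of the Stage 1 setup, assuming a DeBERTa encoder with a single 4-way regression head; the model name, pooling choice, and variable names are illustrative rather than the exact contents of `scripts/training/train-deberta-aes.py`:

```python
# Sketch: Stage 1 dimensional regression (illustrative, not the exact
# implementation in scripts/training/train-deberta-aes.py).
import torch
import torch.nn as nn
from transformers import AutoModel

class DimensionalScorer(nn.Module):
    def __init__(self, encoder_name: str = "microsoft/deberta-v3-base", num_dims: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_dims)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]       # first-token pooling
        return self.head(pooled)    # (batch, 4): TA, CC, Vocab, Grammar on the 0-9 scale

# Stage 1 objective: MSE against the 0-9 dimensional labels, e.g.
# loss = nn.functional.mse_loss(model(input_ids, attention_mask), target_scores)  # target_scores: (batch, 4)
```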
Stage 2: CEFR Calibration
- Uses Write & Improve (W&I) (calibration only)
- Objective: Calibrate the overall score to the target CEFR levels (A2-C2).
- Loss: Ordinal regression (CORN) on CEFR levels; see the sketch below.
- Note: W&I is used only for calibration, not primary training.
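A self-contained sketch of a CORN-style ordinal loss over the CEFR levels, assuming the Stage 2 head emits one logit per level boundary; the actual script may instead use an off-the-shelf implementation such as coral-pytorch:

```python
# Sketch: CORN-style ordinal loss for CEFR calibration (illustrative).
import torch
import torch.nn.functional as F

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]  # labels -> ordinal indices 0..5

def corn_loss(logits: torch.Tensor, levels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """CORN loss: num_classes - 1 conditional binary tasks over ordinal levels.

    logits: (batch, num_classes - 1) raw scores from the CEFR head.
    levels: (batch,) integer CEFR indices (0 = A1, ..., 5 = C2).
    """
    loss = torch.zeros((), device=logits.device)
    total = 0
    for k in range(num_classes - 1):
        mask = levels >= k                       # conditional subset: essays at level k or above
        if mask.sum() == 0:
            continue
        target = (levels[mask] > k).float()      # does the essay exceed level k?
        loss = loss + F.binary_cross_entropy_with_logits(
            logits[mask, k], target, reduction="sum"
        )
        total += int(mask.sum())
    return loss / max(total, 1)
```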
Stage 3: End-to-End Fine-Tuning
- Uses IELTS-WT2 & DREsS (primary training data)
- Objective: Balance dimensional accuracy with correct CEFR alignment.
- Loss: Combined weighted loss; see the sketch below.
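Finally, a sketch of how the Stage 3 objective could combine the two losses above; the weights are placeholders, not the values used in `scripts/training/train-deberta-aes.py`:

```python
# Sketch: Stage 3 combined weighted loss (weights are placeholders).
# Reuses corn_loss and CEFR_LEVELS from the Stage 2 sketch above.
import torch
import torch.nn.functional as F

def combined_loss(dim_preds, dim_targets, cefr_logits, cefr_levels,
                  dim_weight: float = 1.0, cefr_weight: float = 0.5) -> torch.Tensor:
    dim_loss = F.mse_loss(dim_preds, dim_targets)  # 0-9 dimensional scores (Stage 1 objective)
    cefr_loss = corn_loss(cefr_logits, cefr_levels, num_classes=len(CEFR_LEVELS))  # Stage 2 objective
    return dim_weight * dim_loss + cefr_weight * cefr_loss
```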