Danni Yang, Sitao Chen, Changyao Tian
If you find our work helpful, please give us a ⭐ or cite our paper. See the InternVL-U technical report appendix for more details.
- [2026/03/06] TextEdit benchmark released.
- [2026/03/06] Evaluation code released.
- [2026/03/06] Leaderboard updated with latest models.
- Precise spatial alignment
- Font and style consistency
- Background preservation
- Layout-constrained reasoning
We introduce TextEdit, a high-quality, multi-scenario benchmark designed to evaluate fine-grained text editing capabilities in image generation models.
TextEdit covers a diverse set of real-world and virtual scenarios, spanning 18 subcategories with a total of 2,148 high-quality source images and manually annotated edited ground-truth images.
To assess model performance from multiple angles, we combine classic OCR and image-fidelity metrics with modern multimodal-LLM-based evaluation across target accuracy, text preservation, scene integrity, local realism, and visual coherence. Together, these two tracks provide a comprehensive assessment.
Our goal is to provide a standardized, realistic, and scalable benchmark for text editing research.
Full Benchmark Results
| Models | # Params | Real: OA | OP | OR | F1 | NED | CLIP | AES | Virtual: OA | OP | OR | F1 | NED | CLIP | AES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation Models | |||||||||||||||
| Qwen-Image-Edit | 20B | 0.75 | 0.68 | 0.66 | 0.67 | 0.71 | 0.75 | 5.72 | 0.78 | 0.75 | 0.73 | 0.74 | 0.75 | 0.81 | 5.21 |
| GPT-Image-1.5 | - | 0.74 | 0.69 | 0.67 | 0.68 | 0.68 | 0.75 | 5.78 | 0.73 | 0.72 | 0.71 | 0.71 | 0.70 | 0.80 | 5.28 |
| Nano Banana Pro | - | 0.77 | 0.72 | 0.70 | 0.71 | 0.72 | 0.75 | 5.79 | 0.80 | 0.78 | 0.77 | 0.78 | 0.78 | 0.81 | 5.28 |
| Unified Models | |||||||||||||||
| Lumina-DiMOO | 8B | 0.22 | 0.23 | 0.19 | 0.20 | 0.19 | 0.69 | 5.53 | 0.22 | 0.25 | 0.21 | 0.22 | 0.20 | 0.72 | 4.76 |
| Ovis-U1 | 2.4B+1.2B | 0.40 | 0.37 | 0.34 | 0.35 | 0.35 | 0.72 | 5.32 | 0.37 | 0.40 | 0.38 | 0.39 | 0.33 | 0.75 | 4.66 |
| BAGEL | 7B+7B | 0.60 | 0.59 | 0.53 | 0.55 | 0.55 | 0.74 | 5.71 | 0.57 | 0.60 | 0.56 | 0.57 | 0.54 | 0.78 | 5.19 |
| InternVL-U (Ours) | 2B+1.7B | 0.77 | 0.73 | 0.70 | 0.71 | 0.72 | 0.75 | 5.70 | 0.79 | 0.77 | 0.75 | 0.75 | 0.77 | 0.80 | 5.12 |
| Models | # Params | Real: TA | TP | SI | LR | VC | Avg | Virtual: TA | TP | SI | LR | VC | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation Models | |||||||||||||
| Qwen-Image-Edit | 20B | 0.92 | 0.82 | 0.75 | 0.57 | 0.80 | 0.77 | 0.57 | 0.79 | 0.92 | 0.80 | 0.77 | 0.77 |
| GPT-Image-1.5 | - | 0.96 | 0.94 | 0.86 | 0.80 | 0.93 | 0.90 | 0.82 | 0.93 | 0.96 | 0.91 | 0.87 | 0.90 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.88 | 0.93 | 0.91 | 0.87 | 0.92 | 0.96 | 0.94 | 0.89 | 0.92 |
| Unified Models | |||||||||||||
| Lumina-DiMOO | 8B | 0.17 | 0.06 | 0.04 | 0.02 | 0.05 | 0.09 | 0.02 | 0.06 | 0.16 | 0.05 | 0.03 | 0.08 |
| Ovis-U1 | 2.4B+1.2B | 0.31 | 0.12 | 0.12 | 0.07 | 0.18 | 0.18 | 0.06 | 0.16 | 0.31 | 0.14 | 0.13 | 0.19 |
| BAGEL | 7B+7B | 0.68 | 0.60 | 0.38 | 0.35 | 0.56 | 0.53 | 0.38 | 0.51 | 0.68 | 0.62 | 0.42 | 0.54 |
| InternVL-U (Ours) | 2B+1.7B | 0.94 | 0.90 | 0.71 | 0.80 | 0.80 | 0.88 | 0.87 | 0.86 | 0.91 | 0.82 | 0.62 | 0.83 |
Mini-set Benchmark Results (500 samples)
| Models | # Params | Real: OA | OP | OR | F1 | NED | CLIP | AES | Virtual: OA | OP | OR | F1 | NED | CLIP | AES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation Models | |||||||||||||||
| Qwen-Image-Edit | 20B | 0.76 | 0.69 | 0.67 | 0.67 | 0.70 | 0.75 | 5.81 | 0.74 | 0.71 | 0.70 | 0.70 | 0.70 | 0.80 | 5.27 |
| GPT-Image-1.5 | - | 0.72 | 0.68 | 0.66 | 0.67 | 0.67 | 0.75 | 5.85 | 0.68 | 0.69 | 0.68 | 0.68 | 0.65 | 0.80 | 5.32 |
| Nano Banana Pro | - | 0.76 | 0.71 | 0.69 | 0.70 | 0.70 | 0.75 | 5.86 | 0.77 | 0.76 | 0.75 | 0.75 | 0.76 | 0.81 | 5.32 |
| Unified Models | |||||||||||||||
| Lumina-DiMOO | 8B | 0.20 | 0.22 | 0.18 | 0.19 | 0.19 | 0.70 | 5.58 | 0.22 | 0.25 | 0.21 | 0.22 | 0.19 | 0.73 | 4.87 |
| Ovis-U1 | 2.4B+1.2B | 0.37 | 0.34 | 0.32 | 0.32 | 0.33 | 0.72 | 5.39 | 0.39 | 0.41 | 0.38 | 0.39 | 0.33 | 0.74 | 4.75 |
| BAGEL | 7B+7B | 0.61 | 0.59 | 0.52 | 0.54 | 0.54 | 0.74 | 5.79 | 0.53 | 0.58 | 0.53 | 0.55 | 0.51 | 0.78 | 5.25 |
| InternVL-U (Ours) | 2B+1.7B | 0.77 | 0.74 | 0.70 | 0.71 | 0.71 | 0.76 | 5.79 | 0.74 | 0.72 | 0.69 | 0.70 | 0.72 | 0.79 | 5.14 |
| Models | # Params | Real: TA | TP | SI | LR | VC | Avg | Virtual: TA | TP | SI | LR | VC | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Generation Models | |||||||||||||
| Qwen-Image-Edit | 20B | 0.93 | 0.85 | 0.77 | 0.55 | 0.78 | 0.80 | 0.60 | 0.82 | 0.91 | 0.81 | 0.74 | 0.76 |
| GPT-Image-1.5 | - | 0.97 | 0.94 | 0.86 | 0.79 | 0.92 | 0.91 | 0.85 | 0.93 | 0.95 | 0.92 | 0.83 | 0.88 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.86 | 0.92 | 0.91 | 0.87 | 0.92 | 0.96 | 0.93 | 0.87 | 0.92 |
| Unified Models | |||||||||||||
| Lumina-DiMOO | 8B | 0.16 | 0.04 | 0.04 | 0.02 | 0.06 | 0.08 | 0.02 | 0.05 | 0.19 | 0.07 | 0.03 | 0.10 |
| Ovis-U1 | 2.4B+1.2B | 0.29 | 0.11 | 0.11 | 0.08 | 0.20 | 0.17 | 0.04 | 0.16 | 0.35 | 0.18 | 0.15 | 0.22 |
| BAGEL | 7B+7B | 0.68 | 0.61 | 0.38 | 0.34 | 0.59 | 0.53 | 0.36 | 0.52 | 0.69 | 0.64 | 0.40 | 0.54 |
| InternVL-U (Ours) | 2B+1.7B | 0.94 | 0.91 | 0.72 | 0.73 | 0.75 | 0.89 | 0.88 | 0.87 | 0.90 | 0.78 | 0.57 | 0.79 |
You can download the images from this page. The TextEdit benchmark data is organized under `data/` by category:
- Virtual (categories `1.x.x`): synthetic/virtual scene images
- Real (categories `2.x`): real-world scene images
Evaluation prompts are provided under eval_prompts/ in two subsets:
| Subset | Directory | Description |
|---|---|---|
| Fullset | `eval_prompts/fullset/` | Complete benchmark with all samples |
| Miniset (500) | `eval_prompts/miniset/` | 500-sample subset uniformly sampled from the fullset |
Each `.jsonl` file contains the per-sample fields `id`, `prompt`, `original_image`, `gt_image`, `source_text`, `target_text`, and `gt_caption`.
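For illustration, a single record might look like the following (only the field names are defined by the benchmark; all values below are hypothetical):

```json
{
  "id": "1.1.1_0001",
  "prompt": "Replace the word 'SALE' on the storefront sign with 'OPEN'.",
  "original_image": "data/1.1.1/1007088003726.0.jpg",
  "gt_image": "data/gt/1.1.1/1007088003726.0.jpg",
  "source_text": "SALE",
  "target_text": "OPEN",
  "gt_caption": "A storefront whose sign reads 'OPEN' above the entrance."
}
```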
Run image-editing inference with your model, then organize the outputs in the folder structure shown below to facilitate evaluation.
```
output/
└── internvl-u/                  # Your Model Name
    ├── 1.1.1                    # Category Name
    │   ├── 1007088003726.0.jpg  # Model Output Images
    │   ├── 1013932004096.0.jpg
    │   └── ...
    ├── 1.1.2
    ├── 1.1.3
    ├── ...
    └── 2.7
```
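If it helps, the snippet below is a minimal sketch of how outputs could be written into this layout; `run_edit` is a placeholder for your own model's inference call, and it assumes each record's `original_image` path ends in `<category>/<image_name>`:

```python
import json
import os

def save_model_outputs(jsonl_path, output_root, model_name, run_edit):
    """Run editing inference and store results in the layout expected by the eval scripts."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # Assumption: original_image looks like ".../<category>/<image_name>"
            category, image_name = sample["original_image"].split("/")[-2:]
            edited = run_edit(sample["original_image"], sample["prompt"])  # returns a PIL.Image
            out_dir = os.path.join(output_root, model_name, category)
            os.makedirs(out_dir, exist_ok=True)
            edited.save(os.path.join(out_dir, image_name))
```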
Classic metrics evaluate text editing quality using OCR-based text accuracy, image-text alignment, and aesthetic quality. All metrics are reported separately for Virtual and Real splits.
| Abbreviation | Metric | Description |
|---|---|---|
| OA | OCR Accuracy | Whether the target text is correctly rendered in the editing region |
| OP | OCR Precision | Precision of text content (target + background) in the generated image |
| OR | OCR Recall | Recall of text content (target + background) in the generated image |
| F1 | OCR F1 | Harmonic mean of OCR Precision and Recall |
| NED | Normalized Edit Distance | ROI-aware normalized edit distance between target and generated text |
| CLIP | CLIPScore | CLIP-based image-text alignment score |
| AES | Aesthetic Score | Predicted aesthetic quality score of the generated image |
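For intuition, the sketch below shows how OCR F1 and a normalized edit distance can be computed from recognized text. It is a simplified illustration only; the provided scripts additionally handle ROI cropping via PaddleOCR and may report the edit-distance metric differently (here it is returned as a similarity, i.e. 1 minus the normalized distance):

```python
def ocr_f1(precision: float, recall: float) -> float:
    """Harmonic mean of OCR precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def normalized_edit_similarity(target: str, generated: str) -> float:
    """1 - Levenshtein(target, generated) / max(len); 1.0 means an exact match."""
    m, n = len(target), len(generated)
    if max(m, n) == 0:
        return 1.0
    dp = list(range(n + 1))  # distances from the empty prefix of `target`
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                                       # delete from target
                        dp[j - 1] + 1,                                   # insert into target
                        prev + (target[i - 1] != generated[j - 1]))      # substitute
            prev = cur
    return 1.0 - dp[n] / max(m, n)
```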
Evaluation scripts are provided separately for fullset and miniset:
- `eval_scripts/classic_metrics_eval_full.sh`: evaluate on the full benchmark
- `eval_scripts/classic_metrics_eval_mini.sh`: evaluate on the 500-sample miniset
Step 1. Modify the configuration in the evaluation script (e.g., `eval_scripts/classic_metrics_eval_full.sh`) according to your project directory:
MODELS="model-a,model-b,model-c" # Comma-separated list of model names to be evaluated
path="your_project_path_here"
CACHE_DIR="$path/TextEdit/checkpoint" # Directory for all model checkpoints (OCR, CLIP, etc.)
BENCHMARK_DIR="$path/TextEdit/eval_prompts/fullset"
GT_ROOT_DIR="$path/TextEdit/data" # Root path for original & GT images
MODEL_OUTPUT_ROOT="$path/TextEdit/output" # Root path for model infer outputs
OUTPUT_DIR="$path/TextEdit/result/classic_fullset" # Evaluation result root path for classic metricNote: All required model checkpoints (PaddleOCR, CLIP, aesthetic model, etc.) should be placed under the
CACHE_DIRdirectory.
Step 2. Run the evaluation shell script to evaluate your model outputs.
```bash
# Fullset evaluation
bash eval_scripts/classic_metrics_eval_full.sh

# Miniset evaluation
bash eval_scripts/classic_metrics_eval_mini.sh
```

Results are saved as `{model_name}.json` under the output directory, containing per-sample scores and aggregated metrics for both Virtual and Real splits.
Our VLM-based evaluation uses Gemini-3-Pro-Preview as an expert judge to score text editing quality across five fine-grained dimensions. The evaluation is a two-step pipeline.
| Abbreviation | Metric | Description |
|---|---|---|
| TA | Text Accuracy | Spelling correctness and completeness of the target text (1–5) |
| TP | Text Preservation | Preservation of non-target background text (1–5) |
| SI | Scene Integrity | Geometric stability of non-edited background areas (1–5) |
| LR | Local Realism | Inpainting quality, edge cleanness, and seamlessness (1–5) |
| VC | Visual Coherence | Style matching (font, lighting, shadow, texture harmony) (1–5) |
| Avg | Weighted Average | Weighted average of all five dimensions (default weights: 0.4 / 0.3 / 0.1 / 0.1 / 0.1) |
All raw scores (1–5) are normalized to 0–1 for reporting. A cutoff mechanism is available: if TA (Q1) < 4, the remaining dimensions are set to 0, reflecting that a failed text edit invalidates the other quality dimensions.
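As a rough sketch of this aggregation (the reference implementation is `eval_pipeline/vlm_metrics_eval_step2.py`; the linear `(score - 1) / 4` normalization used here is an assumption):

```python
def aggregate_vlm_scores(raw_scores, weights=(0.4, 0.3, 0.1, 0.1, 0.1), enable_cutoff=True):
    """Combine the five 1-5 judge scores (TA, TP, SI, LR, VC) into a 0-1 weighted average."""
    # Assumption: 1-5 raw scores map to 0-1 linearly via (s - 1) / 4.
    normalized = [(s - 1) / 4 for s in raw_scores]
    if enable_cutoff and raw_scores[0] < 4:
        # Cutoff: a failed text edit (Q1 < 4) zeroes out the remaining dimensions.
        normalized = [normalized[0], 0.0, 0.0, 0.0, 0.0]
    return sum(w * s for w, s in zip(weights, normalized))

# Example: TA=5, TP=4, SI=5, LR=3, VC=4
# -> 0.4*1.0 + 0.3*0.75 + 0.1*1.0 + 0.1*0.5 + 0.1*0.75 = 0.85
print(aggregate_vlm_scores([5, 4, 5, 3, 4]))
```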
Send (Original Image, GT Image, Edited Image) triplets to the Gemini API for scoring.
Configure and run `eval_scripts/vlm_metrics_eval_step1.sh`:
API_KEY="your_gemini_api_key_here"
BASE_URL="your_gemini_api_base_url_here"
python eval_pipeline/vlm_metrics_eval_step1.py \
--input_data_dir <your_path>/TextEdit/eval_prompts/fullset \
--model_output_root <your_path>/TextEdit/output \
--gt_data_root <your_path>/TextEdit/data \
--output_base_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
--model_name "gemini-3-pro-preview" \
--models "model-a,model-b,model-c" \
--api_key "$API_KEY" \
--base_url "$BASE_URL" \
--num_workers 64Per-model .jsonl answer files are saved under the output_base_dir.
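For intuition, here is a minimal sketch of what a single step-1 scoring request can look like, assuming the endpoint is OpenAI-compatible (which the `BASE_URL` option suggests); the actual prompt template and request format used by `eval_pipeline/vlm_metrics_eval_step1.py` may differ:

```python
# Hedged sketch: sends an (original, GT, edited) triplet plus the judging prompt
# to an OpenAI-compatible endpoint and returns the raw model reply.
import base64
from openai import OpenAI

client = OpenAI(api_key="your_gemini_api_key_here", base_url="your_gemini_api_base_url_here")

def to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def score_triplet(original: str, gt: str, edited: str, scoring_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gemini-3-pro-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": scoring_prompt},
                {"type": "image_url", "image_url": {"url": to_data_url(original)}},
                {"type": "image_url", "image_url": {"url": to_data_url(gt)}},
                {"type": "image_url", "image_url": {"url": to_data_url(edited)}},
            ],
        }],
    )
    return response.choices[0].message.content
```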
Aggregate the per-sample Gemini responses into a final report.
Configure and run `eval_scripts/vlm_metrics_eval_step2.sh`:
```bash
# Fullset report
python eval_pipeline/vlm_metrics_eval_step2.py \
    --answer_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
    --output_file <your_path>/TextEdit/result/gemini_report_fullset.json \
    --weights 0.4 0.3 0.1 0.1 0.1 \
    --enable_cutoff

# Miniset report
python eval_pipeline/vlm_metrics_eval_step2.py \
    --answer_dir <your_path>/TextEdit/result/vlm_gemini_mini_answers \
    --output_file <your_path>/TextEdit/result/gemini_report_miniset.json \
    --weights 0.4 0.3 0.1 0.1 0.1 \
    --enable_cutoff
```

Key parameters:
- `--weights`: weights for Q1–Q5 (default: `0.4 0.3 0.1 0.1 0.1`).
- `--enable_cutoff`: enable the cutoff mechanism; if Q1 < 4, Q2–Q5 are set to 0.
The output includes a JSON report, a CSV table, and a Markdown-formatted leaderboard printed to the console.
If you find our TextEdit Bench useful, please cite our InternVL-U technical report using the BibTeX entry below.
```bibtex
@article{tian2026internvlu,
  title={InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing},
  author={Tian, Changyao and Yang, Danni and Chen, Guanzhou and Cui, Erfei and Wang, Zhaokai and Duan, Yuchen and Yin, Penghao and Chen, Sitao and Yang, Ganlin and Liu, Mingxin and Zhu, Zirun and Fan, Ziqian and Gu, Leyao and Wang, Haomin and Wei, Qi and Yin, Jinhui and Yang, Xue and Zhong, Zhihang and Qin, Qi and Xin, Yi and Fu, Bin and Liu, Yihao and Ge, Jiaye and Guo, Qipeng and Luo, Gen and Li, Hongsheng and Qiao, Yu and Chen, Kai and Zhang, Hongjie},
  year={2026},
  eprint={2603.09877},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.09877}
}
```
