docs/en/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md (new file, 136 additions)
---
title: MMDDatasetEvaluator
createTime: 2025/04/04 19:46
permalink: /en/api/operators/text_sft/eval/mmddatasetevaluator/
---

## 📘 Overview

The `MMDDatasetEvaluator` operator measures the distribution discrepancy between two datasets using Maximum Mean Discrepancy (MMD). It embeds text into a high-dimensional space and computes a kernel-based distance to quantify the distribution shift between the evaluation dataset and a reference dataset; a smaller MMD score indicates that the two distributions are closer.
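For intuition, the biased MMD² statistic with an RBF kernel can be sketched in a few lines of NumPy. This is an illustrative sketch of the statistic only, not the operator's internal implementation; the sample sizes and dimensions below are arbitrary.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    # Pairwise RBF kernel: k(u, v) = exp(-||u - v||^2 / (2 * sigma^2)).
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_biased(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    # Biased MMD^2 estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    return float(
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))
y_same = rng.normal(size=(200, 8))              # same underlying distribution
y_shifted = rng.normal(loc=2.0, size=(200, 8))  # mean-shifted distribution
print(mmd_biased(x, y_same))     # small: distributions match
print(mmd_biased(x, y_shifted))  # larger: distributions differ
```

The shifted pair yields a visibly larger score than the matched pair, which is exactly the "smaller is closer" reading of the operator's output.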

## `__init__`

```python
def __init__(
    self,
    ref_frame: DataFlowStorage,
    *,
    ref_max_sample_num: int = 5000,
    ref_shuffle_seed: int = 42,
    ref_instruction_key: str = "input",
    ref_output_key: str = "output",
    kernel_type: Literal["RBF"] = "RBF",
    bias: bool = True,
    rbf_sigma: float = 1.0,
    embedding_type: Literal["vllm", "sentence_transformers"] = "sentence_transformers",
    embedding_model_name: str | None = None,
    st_device: str = "cuda",
    st_batch_size: int = 32,
    st_normalize_embeddings: bool = True,
    vllm_max_num_seqs: int = 128,
    vllm_gpu_memory_utilization: float = 0.9,
    vllm_tensor_parallel_size: int = 1,
    vllm_pipeline_parallel_size: int = 1,
    vllm_truncate_max_length: int = 40960,
    cache_type: Literal["redis", "none"] = "none",
    redis_url: str = "redis://127.0.0.1:6379",
    max_concurrent_requests: int = 50,
    redis_db: int = 0,
    cache_model_id: str | None = None,
)
```

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **ref_frame** | DataFlowStorage | Required | The reference dataset used as the distribution baseline. |
| **ref_max_sample_num** | int | `5000` | Maximum number of samples to draw from the reference dataset. |
| **ref_shuffle_seed** | int | `42` | Random seed for sampling the reference dataset. |
| **ref_instruction_key** | str | `'input'` | Column name for the instruction field in the reference dataset. |
| **ref_output_key** | str | `'output'` | Column name for the output field in the reference dataset. |
| **kernel_type** | str | `'RBF'` | Kernel function type; currently only `'RBF'` is supported. |
| **bias** | bool | `True` | Whether to use bias in the MMD computation. |
| **rbf_sigma** | float | `1.0` | Bandwidth parameter for the RBF kernel. |
| **embedding_type** | str | `'sentence_transformers'` | Embedding backend to use; either `'sentence_transformers'` or `'vllm'`. **Note:** when using `'vllm'`, you need to install `distflow[vllm]` first. |
| **embedding_model_name** | str \| None | `None` | Name of the embedding model. Must be provided; the `None` default is not usable. |
| **st_device** | str | `'cuda'` | Device for SentenceTransformers (e.g., `'cuda'`, `'cpu'`). |
| **st_batch_size** | int | `32` | Batch size for SentenceTransformers inference. |
| **st_normalize_embeddings** | bool | `True` | Whether to normalize embeddings from SentenceTransformers. |
| **vllm_max_num_seqs** | int | `128` | Maximum number of sequences for vLLM. |
| **vllm_gpu_memory_utilization** | float | `0.9` | GPU memory utilization ratio for vLLM. |
| **vllm_tensor_parallel_size** | int | `1` | Tensor parallel size for vLLM. |
| **vllm_pipeline_parallel_size** | int | `1` | Pipeline parallel size for vLLM. |
| **vllm_truncate_max_length** | int | `40960` | Maximum truncation length for vLLM inputs. |
| **cache_type** | str | `'none'` | Cache type for embeddings; either `'redis'` or `'none'`. |
| **redis_url** | str | `'redis://127.0.0.1:6379'` | Redis connection URL when `cache_type='redis'`. |
| **max_concurrent_requests** | int | `50` | Maximum concurrent requests to Redis. |
| **redis_db** | int | `0` | Redis database index. |
| **cache_model_id** | str | `None` | Model identifier used for the Redis cache key. |
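The `rbf_sigma` bandwidth controls how quickly kernel similarity decays with embedding distance, which in turn scales the MMD contrast between datasets. A standalone sketch of the RBF kernel value (independent of the operator's code):

```python
import math

def rbf_value(sq_dist: float, sigma: float) -> float:
    # RBF kernel: k = exp(-d^2 / (2 * sigma^2)) for a squared distance d^2.
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# A larger bandwidth flattens the kernel: pairs at the same distance look
# more similar, which tends to shrink the contrast the MMD score can show.
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: k={rbf_value(4.0, sigma):.4f}")
```

If scores from different runs are compared, they should be computed with the same `rbf_sigma`, since the bandwidth changes the score's scale.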

## `run`

```python
def run(
    self,
    storage: DataFlowStorage,
    input_instruction_key: str,
    input_output_key: str,
    max_sample_num: int | None = None,
    shuffle_seed: int | None = None,
) -> tuple[float, dict[str, Any]]
```

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **storage** | DataFlowStorage | Required | The DataFlowStorage instance containing the evaluation dataset. |
| **input_instruction_key** | str | Required | Column name for the instruction field in the evaluation dataset. |
| **input_output_key** | str | Required | Column name for the output field in the evaluation dataset. |
| **max_sample_num** | int | `None` | Maximum samples from the evaluation dataset; falls back to `ref_max_sample_num` if not set. |
| **shuffle_seed** | int | `None` | Random seed for sampling the evaluation dataset; falls back to `ref_shuffle_seed` if not set. |

## 🧠 Example Usage

```python
from dataflow.operators.text_sft.eval import MMDDatasetEvaluator
from dataflow.utils.storage import FileStorage

# Prepare reference and evaluation storages
ref_storage = FileStorage(first_entry_file_name="reference_data.jsonl")
eval_storage = FileStorage(first_entry_file_name="eval_data.jsonl")

# Initialize the evaluator
evaluator = MMDDatasetEvaluator(
    ref_frame=ref_storage.step(),
    ref_instruction_key="instruction",
    ref_output_key="output",
    embedding_type="sentence_transformers",
    embedding_model_name="BAAI/bge-large-zh",
    st_device="cuda",
    st_batch_size=32,
)

# Run evaluation
mmd_score, mmd_meta = evaluator.run(
    eval_storage.step(),
    input_instruction_key="instruction",
    input_output_key="output",
)
print(f"MMD Score: {mmd_score}, Meta: {mmd_meta}")
```

#### 🧾 Default Output Format

| Field | Type | Description |
| :--- | :--- | :--- |
| **MMDScore** | float | The computed MMD distance (smaller is closer). |
| **MMDMeta** | dict | Metadata dictionary containing computation details. |

**Example Output:**
```json
{
  "MMDScore": 0.00342,
  "MMDMeta": {
    "num_src_samples": 5000,
    "num_tgt_samples": 5000
  }
}
```
docs/zh/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md (new file, 142 additions)
---
title: MMDDatasetEvaluator
createTime: 2025/04/04 19:46
permalink: /zh/api/operators/text_sft/eval/mmddatasetevaluator/
---

## 📘 Overview

[MMDDatasetEvaluator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/text_sft/eval/mmd_dataset_evaluator.py) is a dataset evaluation operator based on Maximum Mean Discrepancy (MMD). It embeds text into a high-dimensional space and computes a kernel-based distance to quantify the distribution shift between the evaluation dataset and a reference dataset. A smaller MMD value indicates that the two distributions are closer.
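For intuition, the biased MMD² statistic with an RBF kernel can be sketched in a few lines of NumPy. This is an illustrative sketch of the statistic only, not the operator's internal implementation; the sample sizes and dimensions below are arbitrary.

```python
import numpy as np

def rbf_kernel(a: np.ndarray, b: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    # Pairwise RBF kernel: k(u, v) = exp(-||u - v||^2 / (2 * sigma^2)).
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_biased(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    # Biased MMD^2 estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    return float(
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))
y_same = rng.normal(size=(200, 8))              # same underlying distribution
y_shifted = rng.normal(loc=2.0, size=(200, 8))  # mean-shifted distribution
print(mmd_biased(x, y_same))     # small: distributions match
print(mmd_biased(x, y_shifted))  # larger: distributions differ
```

The shifted pair yields a visibly larger score than the matched pair, which is exactly the "smaller is closer" reading of the operator's output.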

## `__init__`

```python
def __init__(
    self,
    ref_frame: DataFlowStorage,
    *,
    ref_max_sample_num: int = 5000,
    ref_shuffle_seed: int = 42,
    ref_instruction_key: str = "input",
    ref_output_key: str = "output",
    kernel_type: Literal["RBF"] = "RBF",
    bias: bool = True,
    rbf_sigma: float = 1.0,
    embedding_type: Literal["vllm", "sentence_transformers"] = "sentence_transformers",
    embedding_model_name: str | None = None,
    st_device: str = "cuda",
    st_batch_size: int = 32,
    st_normalize_embeddings: bool = True,
    vllm_max_num_seqs: int = 128,
    vllm_gpu_memory_utilization: float = 0.9,
    vllm_tensor_parallel_size: int = 1,
    vllm_pipeline_parallel_size: int = 1,
    vllm_truncate_max_length: int = 40960,
    cache_type: Literal["redis", "none"] = "none",
    redis_url: str = "redis://127.0.0.1:6379",
    max_concurrent_requests: int = 50,
    redis_db: int = 0,
    cache_model_id: str | None = None,
)
```

### `__init__` Parameters

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **ref_frame** | DataFlowStorage | Required | The reference dataset (DataFlowStorage) used as the distribution baseline. |
| **ref_max_sample_num** | int | `5000` | Maximum number of samples to draw from the reference dataset. |
| **ref_shuffle_seed** | int | `42` | Random seed for sampling the reference dataset. |
| **ref_instruction_key** | str | `'input'` | Column name for the instruction field in the reference dataset. |
| **ref_output_key** | str | `'output'` | Column name for the output field in the reference dataset. |
| **kernel_type** | str | `'RBF'` | Kernel function type; currently only `'RBF'` is supported. |
| **bias** | bool | `True` | Whether to use bias in the MMD computation. |
| **rbf_sigma** | float | `1.0` | Bandwidth parameter for the RBF kernel. |
| **embedding_type** | str | `'sentence_transformers'` | Embedding backend; either `'sentence_transformers'` or `'vllm'`. **Note:** when using `'vllm'`, install `distflow[vllm]` first. |
| **embedding_model_name** | str \| None | `None` | Name of the embedding model. Must be provided; the `None` default is not usable. |
| **st_device** | str | `'cuda'` | Device for SentenceTransformers (e.g., `'cuda'`, `'cpu'`). |
| **st_batch_size** | int | `32` | Batch size for SentenceTransformers inference. |
| **st_normalize_embeddings** | bool | `True` | Whether to normalize embeddings from SentenceTransformers. |
| **vllm_max_num_seqs** | int | `128` | Maximum number of sequences for vLLM. |
| **vllm_gpu_memory_utilization** | float | `0.9` | GPU memory utilization ratio for vLLM. |
| **vllm_tensor_parallel_size** | int | `1` | Tensor parallel size for vLLM. |
| **vllm_pipeline_parallel_size** | int | `1` | Pipeline parallel size for vLLM. |
| **vllm_truncate_max_length** | int | `40960` | Maximum truncation length for vLLM inputs. |
| **cache_type** | str | `'none'` | Embedding cache type; either `'redis'` or `'none'`. |
| **redis_url** | str | `'redis://127.0.0.1:6379'` | Redis connection URL when `cache_type='redis'`. |
| **max_concurrent_requests** | int | `50` | Maximum concurrent requests to the Redis cache. |
| **redis_db** | int | `0` | Redis database index. |
| **cache_model_id** | str | `None` | Model identifier used for the Redis cache key. |
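The `rbf_sigma` bandwidth controls how quickly kernel similarity decays with embedding distance, which in turn scales the MMD contrast between datasets. A standalone sketch of the RBF kernel value (independent of the operator's code):

```python
import math

def rbf_value(sq_dist: float, sigma: float) -> float:
    # RBF kernel: k = exp(-d^2 / (2 * sigma^2)) for a squared distance d^2.
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# A larger bandwidth flattens the kernel: pairs at the same distance look
# more similar, which tends to shrink the contrast the MMD score can show.
for sigma in (0.5, 1.0, 2.0):
    print(f"sigma={sigma}: k={rbf_value(4.0, sigma):.4f}")
```

If scores from different runs are compared, they should be computed with the same `rbf_sigma`, since the bandwidth changes the score's scale.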

## `run`

```python
def run(
    self,
    storage: DataFlowStorage,
    input_instruction_key: str,
    input_output_key: str,
    max_sample_num: int | None = None,
    shuffle_seed: int | None = None,
) -> tuple[float, dict[str, Any]]
```

Executes the operator's main logic: it reads samples from the evaluation dataset, computes the MMD distance between them and the reference dataset, and returns the distance together with metadata.

#### Parameters

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| **storage** | DataFlowStorage | Required | The DataFlowStorage instance containing the evaluation dataset. |
| **input_instruction_key** | str | Required | Column name for the instruction field in the evaluation dataset. |
| **input_output_key** | str | Required | Column name for the output field in the evaluation dataset. |
| **max_sample_num** | int | `None` | Maximum samples from the evaluation dataset; falls back to `ref_max_sample_num` if not set. |
| **shuffle_seed** | int | `None` | Random seed for sampling the evaluation dataset; falls back to `ref_shuffle_seed` if not set. |

## 🧠 Example Usage

```python
from dataflow.operators.text_sft.eval import MMDDatasetEvaluator
from dataflow.utils.storage import FileStorage

# Prepare reference and evaluation storages
ref_storage = FileStorage(first_entry_file_name="reference_data.jsonl")
eval_storage = FileStorage(first_entry_file_name="eval_data.jsonl")

# Initialize the evaluator
evaluator = MMDDatasetEvaluator(
    ref_frame=ref_storage.step(),
    ref_instruction_key="instruction",
    ref_output_key="output",
    embedding_type="sentence_transformers",
    embedding_model_name="BAAI/bge-large-zh",
    st_device="cuda",
    st_batch_size=32,
)

# Run evaluation
mmd_score, mmd_meta = evaluator.run(
    eval_storage.step(),
    input_instruction_key="instruction",
    input_output_key="output",
)
print(f"MMD Score: {mmd_score}, Meta: {mmd_meta}")
```

#### 🧾 Default Output Format

| Field | Type | Description |
| :--- | :--- | :--- |
| **MMDScore** | float | The computed MMD distance (smaller means closer distributions). |
| **MMDMeta** | dict | Metadata dictionary containing computation details. |

**Example Output:**
```json
{
  "MMDScore": 0.00342,
  "MMDMeta": {
    "num_src_samples": 5000,
    "num_tgt_samples": 5000
  }
}
```