diff --git a/docs/en/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md b/docs/en/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md new file mode 100644 index 0000000000..7e4073158a --- /dev/null +++ b/docs/en/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md @@ -0,0 +1,136 @@ +--- +title: MMDDatasetEvaluator +createTime: 2025/04/04 19:46 +permalink: /en/api/operators/text_sft/eval/mmddatasetevaluator/ +--- + +## 📘 Overview + +The `MMDDatasetEvaluator` is an operator that evaluates the distribution discrepancy between two datasets using the Maximum Mean Discrepancy (MMD) method. It embeds text into a high-dimensional space and computes the kernel-based distance to quantify the distribution shift between the evaluation dataset and a reference dataset. A smaller MMD score indicates that the two distributions are closer. + +## `__init__` + +```python +def __init__( + self, + ref_frame: DataFlowStorage, + *, + ref_max_sample_num: int = 5000, + ref_shuffle_seed: int = 42, + ref_instruction_key: str = "input", + ref_output_key: str = "output", + kernel_type: Literal["RBF"] = "RBF", + bias: bool = True, + rbf_sigma: float = 1.0, + embedding_type: Literal["vllm", "sentence_transformers"] = "sentence_transformers", + embedding_model_name: str | None = None, + st_device: str = "cuda", + st_batch_size: int = 32, + st_normalize_embeddings: bool = True, + vllm_max_num_seqs: int = 128, + vllm_gpu_memory_utilization: float = 0.9, + vllm_tensor_parallel_size: int = 1, + vllm_pipeline_parallel_size: int = 1, + vllm_truncate_max_length: int = 40960, + cache_type: Literal["redis", "none"] = "none", + redis_url: str = "redis://127.0.0.1:6379", + max_concurrent_requests: int = 50, + redis_db: int = 0, + cache_model_id: str | None = None, +) +``` + +| Parameter | Type | Default | Description | +| :--- | :--- | :--- | :--- | +| **ref_frame** | DataFlowStorage | Required | The reference dataset used as the distribution baseline. | +| **ref_max_sample_num** | int | `5000` | Maximum number of samples to draw from the reference dataset. | +| **ref_shuffle_seed** | int | `42` | Random seed for sampling the reference dataset. | +| **ref_instruction_key** | str | `'input'` | Column name for the instruction field in the reference dataset. | +| **ref_output_key** | str | `'output'` | Column name for the output field in the reference dataset. | +| **kernel_type** | str | `'RBF'` | Kernel function type; currently only `'RBF'` is supported. | +| **bias** | bool | `True` | Whether to use bias in the MMD computation. | +| **rbf_sigma** | float | `1.0` | Bandwidth parameter for the RBF kernel. | +| **embedding_type** | str | `'sentence_transformers'` | Embedding backend to use; either `'sentence_transformers'` or `'vllm'`. **Note:** when using `'vllm'`, you need to install `distflow[vllm]` first. | +| **embedding_model_name** | str | Required | Name of the embedding model (required). | +| **st_device** | str | `'cuda'` | Device for SentenceTransformers (e.g., `'cuda'`, `'cpu'`). | +| **st_batch_size** | int | `32` | Batch size for SentenceTransformers inference. | +| **st_normalize_embeddings** | bool | `True` | Whether to normalize embeddings from SentenceTransformers. | +| **vllm_max_num_seqs** | int | `128` | Maximum number of sequences for vLLM. | +| **vllm_gpu_memory_utilization** | float | `0.9` | GPU memory utilization ratio for vLLM. | +| **vllm_tensor_parallel_size** | int | `1` | Tensor parallel size for vLLM. | +| **vllm_pipeline_parallel_size** | int | `1` | Pipeline parallel size for vLLM. | +| **vllm_truncate_max_length** | int | `40960` | Maximum truncation length for vLLM inputs. | +| **cache_type** | str | `'none'` | Cache type for embeddings; either `'redis'` or `'none'`. | +| **redis_url** | str | `'redis://127.0.0.1:6379'` | Redis connection URL when `cache_type='redis'`. | +| **max_concurrent_requests** | int | `50` | Maximum concurrent requests to Redis. | +| **redis_db** | int | `0` | Redis database index. | +| **cache_model_id** | str | `None` | Model identifier used for the Redis cache key. | + +## `run` + +```python +def run( + self, + storage: DataFlowStorage, + input_instruction_key: str, + input_output_key: str, + max_sample_num: int | None = None, + shuffle_seed: int | None = None, +) -> tuple[float, dict[str, Any]] +``` + +| Parameter | Type | Default | Description | +| :--- | :--- | :--- | :--- | +| **storage** | DataFlowStorage | Required | The DataFlowStorage instance containing the evaluation dataset. | +| **input_instruction_key** | str | Required | Column name for the instruction field in the evaluation dataset. | +| **input_output_key** | str | Required | Column name for the output field in the evaluation dataset. | +| **max_sample_num** | int | `None` | Maximum samples from the evaluation dataset; falls back to `ref_max_sample_num` if not set. | +| **shuffle_seed** | int | `None` | Random seed for sampling the evaluation dataset; falls back to `ref_shuffle_seed` if not set. | + +## 🧠 Example Usage + +```python +from dataflow.operators.text_sft.eval import MMDDatasetEvaluator +from dataflow.utils.storage import FileStorage + +# Prepare reference and evaluation storages +ref_storage = FileStorage(first_entry_file_name="reference_data.jsonl") +eval_storage = FileStorage(first_entry_file_name="eval_data.jsonl") + +# Initialize the evaluator +evaluator = MMDDatasetEvaluator( + ref_frame=ref_storage.step(), + ref_instruction_key="instruction", + ref_output_key="output", + embedding_type="sentence_transformers", + embedding_model_name="BAAI/bge-large-zh", + st_device="cuda", + st_batch_size=32, +) + +# Run evaluation +mmd_score, mmd_meta = evaluator.run( + eval_storage.step(), + input_instruction_key="instruction", + input_output_key="output", +) +print(f"MMD Score: {mmd_score}, Meta: {mmd_meta}") +``` + +#### 🧾 Default Output Format + +| Field | Type | Description | +| :--- | :--- | :--- | +| **MMDScore** | float | The computed MMD distance (smaller is closer). | +| **MMDMeta** | dict | Metadata dictionary containing computation details. | + +**Example Output:** +```json +{ + "MMDScore": 0.00342, + "MMDMeta": { + "num_src_samples": 5000, + "num_tgt_samples": 5000 + } +} +``` diff --git a/docs/zh/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md b/docs/zh/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md new file mode 100644 index 0000000000..4446190afa --- /dev/null +++ b/docs/zh/notes/api/operators/text_sft/eval/MMDDatasetEvaluator.md @@ -0,0 +1,142 @@ +--- +title: MMDDatasetEvaluator +createTime: 2025/04/04 19:46 +permalink: /zh/api/operators/text_sft/eval/mmddatasetevaluator/ +--- + +## 📘 概述 + +[MMDDatasetEvaluator](https://github.com/OpenDCAI/DataFlow/blob/main/dataflow/operators/text_sft/eval/mmd_dataset_evaluator.py) 是一个基于最大均值差异(Maximum Mean Discrepancy, MMD)的数据集评估算子。它通过将文本嵌入到高维空间,并计算核函数距离,来量化评估数据集与参考数据集之间的分布偏移程度。MMD 值越小,表示两个数据集的分布越接近。 + +## `__init__` + +```python +def __init__( + self, + ref_frame: DataFlowStorage, + *, + ref_max_sample_num: int = 5000, + ref_shuffle_seed: int = 42, + ref_instruction_key: str = "input", + ref_output_key: str = "output", + kernel_type: Literal["RBF"] = "RBF", + bias: bool = True, + rbf_sigma: float = 1.0, + embedding_type: Literal["vllm", "sentence_transformers"] = "sentence_transformers", + embedding_model_name: str | None = None, + st_device: str = "cuda", + st_batch_size: int = 32, + st_normalize_embeddings: bool = True, + vllm_max_num_seqs: int = 128, + vllm_gpu_memory_utilization: float = 0.9, + vllm_tensor_parallel_size: int = 1, + vllm_pipeline_parallel_size: int = 1, + vllm_truncate_max_length: int = 40960, + cache_type: Literal["redis", "none"] = "none", + redis_url: str = "redis://127.0.0.1:6379", + max_concurrent_requests: int = 50, + redis_db: int = 0, + cache_model_id: str | None = None, +) +``` + +### init 参数说明 + +| 参数名 | 类型 | 默认值 | 说明 | +| :--- | :--- | :--- | :--- | +| **ref_frame** | DataFlowStorage | 必需 | 参考数据集(DataFlowStorage),作为分布比较的基准。 | +| **ref_max_sample_num** | int | `5000` | 从参考数据集中采样的最大样本数。 | +| **ref_shuffle_seed** | int | `42` | 参考数据集采样的随机种子。 | +| **ref_instruction_key** | str | `'input'` | 参考数据集中指令字段的列名。 | +| **ref_output_key** | str | `'output'` | 参考数据集中输出字段的列名。 | +| **kernel_type** | str | `'RBF'` | 核函数类型,当前仅支持 `'RBF'`。 | +| **bias** | bool | `True` | 是否在 MMD 计算中使用偏置项。 | +| **rbf_sigma** | float | `1.0` | RBF 核函数的带宽参数。 | +| **embedding_type** | str | `'sentence_transformers'` | 嵌入模型后端,可选 `'sentence_transformers'` 或 `'vllm'`。**注意:** 使用 `'vllm'` 时需先安装 `distflow[vllm]`。 | +| **embedding_model_name** | str | 必需 | 嵌入模型名称(必填)。 | +| **st_device** | str | `'cuda'` | SentenceTransformers 的运行设备(如 `'cuda'`、`'cpu'`)。 | +| **st_batch_size** | int | `32` | SentenceTransformers 的推理批次大小。 | +| **st_normalize_embeddings** | bool | `True` | 是否对 SentenceTransformers 生成的嵌入向量进行归一化。 | +| **vllm_max_num_seqs** | int | `128` | vLLM 的最大序列数。 | +| **vllm_gpu_memory_utilization** | float | `0.9` | vLLM 的 GPU 显存利用率。 | +| **vllm_tensor_parallel_size** | int | `1` | vLLM 的张量并行大小。 | +| **vllm_pipeline_parallel_size** | int | `1` | vLLM 的流水线并行大小。 | +| **vllm_truncate_max_length** | int | `40960` | vLLM 输入的最大截断长度。 | +| **cache_type** | str | `'none'` | 嵌入缓存类型,可选 `'redis'` 或 `'none'`。 | +| **redis_url** | str | `'redis://127.0.0.1:6379'` | 使用 Redis 缓存时的连接地址。 | +| **max_concurrent_requests** | int | `50` | 对 Redis 缓存的最大并发请求数。 | +| **redis_db** | int | `0` | Redis 数据库索引。 | +| **cache_model_id** | str | `None` | Redis 缓存键中使用的模型标识符。 | + +## `run` + +```python +def run( + self, + storage: DataFlowStorage, + input_instruction_key: str, + input_output_key: str, + max_sample_num: int | None = None, + shuffle_seed: int | None = None, +) -> tuple[float, dict[str, Any]] +``` + +执行算子主逻辑,从评估数据集中读取样本,计算其与参考数据集之间的 MMD 距离,并返回距离值和元数据。 + +#### 参数 + +| 名称 | 类型 | 默认值 | 说明 | +| :--- | :--- | :--- | :--- | +| **storage** | DataFlowStorage | 必需 | 包含评估数据集的数据流存储实例。 | +| **input_instruction_key** | str | 必需 | 评估数据集中指令字段的列名。 | +| **input_output_key** | str | 必需 | 评估数据集中输出字段的列名。 | +| **max_sample_num** | int | `None` | 评估数据集的最大采样数;未设置时默认使用 `ref_max_sample_num`。 | +| **shuffle_seed** | int | `None` | 评估数据集采样的随机种子;未设置时默认使用 `ref_shuffle_seed`。 | + +## 🧠 示例用法 + +```python +from dataflow.operators.text_sft.eval import MMDDatasetEvaluator +from dataflow.utils.storage import FileStorage + +# 准备参考数据集和评估数据集 +ref_storage = FileStorage(first_entry_file_name="reference_data.jsonl") +eval_storage = FileStorage(first_entry_file_name="eval_data.jsonl") + +# 初始化评估器 +evaluator = MMDDatasetEvaluator( + ref_frame=ref_storage.step(), + ref_instruction_key="instruction", + ref_output_key="output", + embedding_type="sentence_transformers", + embedding_model_name="BAAI/bge-large-zh", + st_device="cuda", + st_batch_size=32, +) + +# 执行评估 +mmd_score, mmd_meta = evaluator.run( + eval_storage.step(), + input_instruction_key="instruction", + input_output_key="output", +) +print(f"MMD Score: {mmd_score}, Meta: {mmd_meta}") +``` + +#### 🧾 默认输出格式 + +| 字段 | 类型 | 说明 | +| :--- | :--- | :--- | +| **MMDScore** | float | 计算得到的 MMD 距离值(越小表示分布越接近)。 | +| **MMDMeta** | dict | 包含计算细节信息的元数据字典。 | + +**示例输出:** +```json +{ + "MMDScore": 0.00342, + "MMDMeta": { + "num_src_samples": 5000, + "num_tgt_samples": 5000 + } +} +```