fix: Enhance synthesis task management and export functionality #465

Merged
Dallas98 merged 10 commits into main from synth
Apr 1, 2026

Conversation

Collaborator

@Dallas98 Dallas98 commented Apr 1, 2026

No description provided.

            raise SynthesisExportError("No file instances to export")

        output_dir = self._output_path or self._get_default_output_dir()
        os.makedirs(output_dir, exist_ok=True)

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix

AI 3 days ago

In general terms, we need to ensure that any user-controlled path is validated and constrained before being used with filesystem APIs. In this case, we should ensure that output_path (if provided) resolves to a subdirectory of a server-controlled base directory, after normalizing the path. If it does not, we should reject it with a clear error. We should also avoid letting the user force absolute paths.

The best fix here is to add a helper method in SynthesisDatasetExporter that computes a safe output directory: it should decide on a base export root (for example, a fixed directory under the application’s working directory), normalize the combination of that root and any user-supplied output_path, and then verify that the resulting path stays under the root. Then export_data should call this helper instead of using self._output_path directly. This maintains the existing behavior (allowing per-request subdirectories) but prevents directory traversal and arbitrary absolute paths.

Concretely, in runtime/datamate-python/app/module/generation/service/export_service.py:

  1. Add a private method, e.g. _get_safe_output_dir, to SynthesisDatasetExporter:

    • If self._output_path is falsy: return _get_default_output_dir() (existing behavior).
    • Otherwise:
      • Choose a base root, for example base_root = os.path.abspath(self._get_default_output_dir()). This keeps everything under whatever default export root the app already uses.
      • Normalize the user-specified relative path: safe_rel = os.path.normpath(self._output_path).
      • Reject absolute paths or paths that start with os.pardir (..) after normalization, by raising SynthesisExportError.
      • Compose the final path: candidate = os.path.abspath(os.path.join(base_root, safe_rel)).
      • Verify that candidate starts with base_root + os.sep or equals base_root. If not, raise SynthesisExportError.
      • Return candidate.
  2. Update export_data so that line 151 uses output_dir = self._get_safe_output_dir() instead of self._output_path or self._get_default_output_dir().

No new imports are needed beyond os, which is already imported. All changes are confined to SynthesisDatasetExporter in export_service.py.
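
The steps listed above can be sketched as a standalone function, which makes them easy to try on sample inputs outside the class. This is a minimal sketch, not the merged code: `resolve_under_root` and `ExportPathError` are hypothetical names (the latter stands in for `SynthesisExportError`), and the root directory is passed in as a parameter instead of coming from `_get_default_output_dir()`.

```python
import os


class ExportPathError(ValueError):
    """Hypothetical stand-in for SynthesisExportError."""


def resolve_under_root(base_root: str, user_path: str) -> str:
    """Return an absolute path for user_path that stays inside base_root."""
    base_root = os.path.abspath(base_root)
    if not user_path:
        return base_root

    # Collapse "a/../b", "./a", and duplicate separators (purely lexical).
    rel = os.path.normpath(user_path)

    # Reject absolute paths and anything that still climbs out after normalization.
    if os.path.isabs(rel) or rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ExportPathError("Invalid output path")

    candidate = os.path.abspath(os.path.join(base_root, rel))

    # Defensive re-check; the trailing separator keeps a sibling such as
    # "<root>_old" from passing as a child of "<root>".
    if candidate != base_root and not candidate.startswith(base_root + os.sep):
        raise ExportPathError("Invalid output path")
    return candidate
```

With a root of `/srv/exports`, an input such as `task-1/out` resolves to a subdirectory of the root, while `..`, `a/../../x`, and `/etc` raise `ExportPathError`. As in the suggested patch, the prefix check does not handle a root that is the filesystem root itself.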


Suggested changeset 1
runtime/datamate-python/app/module/generation/service/export_service.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/runtime/datamate-python/app/module/generation/service/export_service.py b/runtime/datamate-python/app/module/generation/service/export_service.py
--- a/runtime/datamate-python/app/module/generation/service/export_service.py
+++ b/runtime/datamate-python/app/module/generation/service/export_service.py
@@ -46,6 +46,37 @@
         self._format = format if format in self.SUPPORTED_FORMATS else self.DEFAULT_FORMAT
         self._output_path = output_path
 
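+    def _get_safe_output_dir(self) -> str:
+        """
+        Return a safe export directory.
+
+        If output_path is not explicitly set, fall back to the default export directory.
+        If output_path is set, treat it as a path relative to the default export directory,
+        then normalize it and check its bounds to prevent directory traversal or writes to arbitrary locations.
+        """
+        # Default root (the application's existing default export directory)
+        base_root = os.path.abspath(self._get_default_output_dir())
+
+        # No user-specified output path: use the default export directory directly
+        if not self._output_path:
+            return base_root
+
+        # Normalize the user-supplied path to collapse ".." and similar segments
+        user_path = os.path.normpath(self._output_path)
+
+        # Reject absolute paths and relative paths that start with the parent directory
+        if os.path.isabs(user_path) or user_path.startswith(os.pardir + os.sep) or user_path == os.pardir:
+            raise SynthesisExportError("Invalid output path")
+
+        # Treat the user path as a subdirectory of base_root
+        candidate = os.path.abspath(os.path.join(base_root, user_path))
+
+        # Re-check that the resulting path is still under base_root
+        if not (candidate == base_root or candidate.startswith(base_root + os.sep)):
+            raise SynthesisExportError("Invalid output path")
+
+        return candidate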
+
     async def export_task_to_dataset(
         self,
         task_id: str,
@@ -148,7 +179,7 @@
         if not file_instances:
             raise SynthesisExportError("No file instances to export")
 
-        output_dir = self._output_path or self._get_default_output_dir()
+        output_dir = self._get_safe_output_dir()
         os.makedirs(output_dir, exist_ok=True)
 
         file_paths: List[str] = []
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
    def _write_jsonl(self, path: str, records: Iterable[dict], format: Optional[str] = None) -> None:
        """Write a JSONL file."""
        fmt = format or self._format
        os.makedirs(os.path.dirname(path), exist_ok=True)

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a user-provided value.

Copilot Autofix

AI 3 days ago

In general, to fix uncontrolled path usage you should (a) restrict writes to a known safe base directory and/or (b) normalize and validate any user-provided path segments before using them with filesystem APIs. For paths that may include subdirectories, the standard pattern is: choose a fixed base_dir, compute full_path = os.path.normpath(os.path.join(base_dir, user_path)), then ensure full_path is still under base_dir before creating directories or opening files.
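
The join, normalize, and verify pattern described above can be illustrated with a small check. This is a sketch under assumptions: `is_within` and `naive_is_within` are hypothetical helper names, and the second exists only to show why a bare string-prefix comparison is weaker than comparing whole path components.

```python
import os


def is_within(base_dir: str, user_path: str) -> bool:
    """True if base_dir/user_path, after lexical normalization, stays under base_dir."""
    base = os.path.abspath(base_dir)
    full_path = os.path.normpath(os.path.join(base, user_path))
    try:
        # commonpath compares whole components, so ".../exports_old"
        # is not mistaken for a child of ".../exports".
        return os.path.commonpath([base, full_path]) == base
    except ValueError:
        # Raised, for example, for paths on different Windows drives.
        return False


def naive_is_within(base_dir: str, full_path: str) -> bool:
    """Plain string-prefix check, shown only to illustrate its pitfall."""
    return full_path.startswith(base_dir)
```

For a base of `/data/exports`, `is_within` accepts `run1/a.jsonl` and rejects both `../exports_old/x` and `/etc/passwd`. The naive check accepts `/data/exports_old/x` because the strings share a prefix, which is why a prefix check needs a trailing separator or a component-wise comparison.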

For this codebase, the minimal, non‑breaking fix is:

  1. Introduce a helper on SynthesisDatasetExporter that turns self._output_path (which may be user-controlled) into a validated, absolute directory under a server-controlled root. We can:

    • Pick a safe root directory, e.g. the system temp directory returned by tempfile.gettempdir() (consistent with _get_default_output_dir), and create a fixed subfolder such as "synthesis_exports".
    • If self._output_path is provided, treat it as a subdirectory or file name under that root, not as an absolute path. Normalize with os.path.normpath and check that the resulting resolved directory starts with the chosen root (prefix check on absolute paths).
    • If the check fails, raise SynthesisExportError.
  2. Use this helper both where the directory is created (export_data) and where _write_jsonl constructs parent directories:

    • In export_data, replace output_dir = self._output_path or self._get_default_output_dir() with logic that computes a safe output_dir using the helper.
    • In _write_jsonl, derive the parent directory with os.path.dirname(path) and still call os.makedirs(..., exist_ok=True) but only after path has been constructed with the validated output_dir.
  3. Keep the external behaviour similar: callers can still pass output_path, but it will now be interpreted safely (as a subpath under the export root) instead of an arbitrary filesystem location.

Concretely, all changes are limited to runtime/datamate-python/app/module/generation/service/export_service.py:

  • Add a new private method _get_safe_output_dir near _get_default_output_dir that:

    • Imports tempfile locally (like _get_default_output_dir).
    • Defines base_root = os.path.join(tempfile.gettempdir(), "synthesis_exports").
    • If self._output_path is falsy, returns base_root.
    • Otherwise, builds candidate = os.path.join(base_root, self._output_path), normalizes to normalized = os.path.normpath(candidate), and ensures os.path.commonpath([normalized, base_root]) == base_root. If not, raise SynthesisExportError.
    • Returns normalized.
  • Modify export_data so output_dir is assigned by calling self._get_safe_output_dir() and then os.makedirs(output_dir, exist_ok=True) remains.

No changes are needed in generation_api.py or to imports beyond using existing os and a local tempfile import inside _get_safe_output_dir (similar to _get_default_output_dir).
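
To see how the helper described above behaves on concrete inputs, it can be sketched as a free function. This is an illustration only: `safe_export_dir` and its `root_name` parameter are hypothetical, and a plain `ValueError` stands in for `SynthesisExportError`.

```python
import os
import tempfile


def safe_export_dir(output_path=None, root_name="synthesis_exports"):
    """Free-function sketch of the helper described above (names are illustrative)."""
    # Controlled root under the system temp directory.
    base_root = os.path.join(tempfile.gettempdir(), root_name)
    if not output_path:
        return base_root

    # If output_path is absolute, os.path.join discards base_root entirely;
    # the commonpath comparison below still rejects that case.
    normalized = os.path.normpath(os.path.join(base_root, output_path))
    try:
        common = os.path.commonpath([normalized, base_root])
    except ValueError:
        raise ValueError("Invalid output path") from None
    if common != base_root:
        raise ValueError("Output path is outside of allowed export directory")
    return normalized
```

Calling `safe_export_dir("task/a")` returns a path below the export root, while `../other`, `a/../../b`, and `/etc` raise `ValueError`. Both `os.path.normpath` and `os.path.commonpath` operate on strings only, so this sketch does not account for symbolic links inside the export root.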


Suggested changeset 1
runtime/datamate-python/app/module/generation/service/export_service.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/runtime/datamate-python/app/module/generation/service/export_service.py b/runtime/datamate-python/app/module/generation/service/export_service.py
--- a/runtime/datamate-python/app/module/generation/service/export_service.py
+++ b/runtime/datamate-python/app/module/generation/service/export_service.py
@@ -148,7 +148,8 @@
         if not file_instances:
             raise SynthesisExportError("No file instances to export")
 
-        output_dir = self._output_path or self._get_default_output_dir()
+        # Use a controlled, validated output directory to prevent directory traversal or arbitrary-path writes
+        output_dir = self._get_safe_output_dir()
         os.makedirs(output_dir, exist_ok=True)
 
         file_paths: List[str] = []
@@ -265,6 +266,38 @@
         import tempfile
         return tempfile.gettempdir()
 
+    def _get_safe_output_dir(self) -> str:
+        """
+        Return the validated export output directory, ensuring it lies under a controlled root.
+
+        - If output_path is not set, use a fixed subdirectory of the system temp directory.
+        - If output_path is set, treat it as a relative subpath of that root, then normalize it and check its bounds.
+        """
+        import tempfile
+
+        # Controlled export root, e.g. /tmp/synthesis_exports
+        base_root = os.path.join(tempfile.gettempdir(), "synthesis_exports")
+
+        # No output_path given: use the controlled root directly
+        if not self._output_path:
+            return base_root
+
+        # Treat the user-supplied output_path as relative to base_root
+        candidate = os.path.join(base_root, self._output_path)
+        normalized = os.path.normpath(candidate)
+
+        # Prevent directory traversal: the normalized path must still be under base_root
+        try:
+            common = os.path.commonpath([normalized, base_root])
+        except ValueError:
+            # Treat cases such as different drives as invalid paths
+            raise SynthesisExportError("Invalid output path")
+
+        if common != base_root:
+            raise SynthesisExportError("Output path is outside of allowed export directory")
+
+        return normalized
+
     @staticmethod
     def _ensure_dataset_path(dataset: Dataset) -> str:
         """确保数据集路径存在"""
EOF
@Dallas98 Dallas98 merged commit 9dd8d69 into main Apr 1, 2026
10 of 11 checks passed
@Dallas98 Dallas98 deleted the synth branch April 2, 2026 02:41