| 1 | +# Rule Identity |
| 2 | +id: PYTHON-LANG-SEC-110 |
| 3 | +name: Unsafe Tarfile Extraction Detected |
| 4 | +short_description: tarfile.extractall() and tarfile.extract() are vulnerable to path traversal attacks (Zip Slip). Malicious tar archives can write files outside the intended directory using entries with ../ in their names. |
| 5 | + |
| 6 | +# Classification |
| 7 | +severity: HIGH |
| 8 | +category: lang |
| 9 | +language: python |
| 10 | +ruleset: python/lang/PYTHON-LANG-SEC-110 |
| 11 | + |
| 12 | +# Vulnerability Details |
| 13 | +cwe: |
| 14 | + - id: CWE-22 |
| 15 | + name: Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal') |
| 16 | + url: https://cwe.mitre.org/data/definitions/22.html |
| 17 | + |
| 18 | +owasp: |
| 19 | + - id: A01:2021 |
| 20 | + name: Broken Access Control |
| 21 | + url: https://owasp.org/Top10/A01_2021-Broken_Access_Control/ |
| 22 | + |
| 23 | +tags: |
| 24 | + - python |
| 25 | + - tarfile |
| 26 | + - path-traversal |
| 27 | + - zip-slip |
| 28 | + - file-extraction |
| 29 | + - CWE-22 |
| 30 | + - OWASP-A01 |
| 31 | + |
| 32 | +# Author |
| 33 | +author: |
| 34 | + name: Shivasurya |
| 35 | + url: https://x.com/sshivasurya |
| 36 | + |
| 37 | +# Content |
| 38 | +description: | |
| 39 | + Before Python 3.14, the tarfile module does not validate member paths during extraction by |
| 40 | + default. A crafted tar archive can contain entries named "../../etc/crontab" or absolute paths |
| 41 | + such as "/etc/passwd", causing extractall() or extract() to write files outside the intended |
| 42 | + destination directory. The same class of bug is widely known as "Zip Slip"; despite the name, |
| 43 | + it affects tar archives too. |
| 44 | + |
| 45 | + CVE-2007-4559 is the canonical vulnerability in Python's tarfile module. It went unfixed for |
| 46 | + over 15 years because changing the default would break backward compatibility. Python 3.12 added |
| 47 | + the filter parameter (PEP 706) for safe extraction modes, and Python 3.14 makes "data" the default. |
| 48 | + |
| 49 | + This vulnerability has been exploited in real-world supply chain attacks. GHSA-fhff-qmm8-h2fp |
| 50 | + (mlflow) demonstrates how tar path traversal in ML pipeline tools can lead to arbitrary file |
| 51 | + write on the server, enabling remote code execution. |
| 52 | + |
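The path arithmetic behind the description can be sketched without touching any archive. The helper name `escapes_destination` is hypothetical and not part of the rule or of tarfile; it only shows why the two example member names end up outside the destination once they are joined and resolved.

```python
import os


def escapes_destination(dest: str, member_name: str) -> bool:
    """Return True if member_name, joined onto dest, resolves outside dest.

    Hypothetical helper for illustration; nothing is read from or written to disk
    beyond resolving symlinks that already exist on the path.
    """
    dest_real = os.path.realpath(dest)
    # os.path.join discards dest_real entirely when member_name is absolute,
    # which is why "/etc/passwd" escapes just like a "../" sequence does.
    target = os.path.realpath(os.path.join(dest_real, member_name))
    return os.path.commonpath([dest_real, target]) != dest_real
```

Under this sketch, `escapes_destination("/srv/extract-demo", "../../etc/crontab")` and `escapes_destination("/srv/extract-demo", "/etc/passwd")` both report an escape, while a name such as `"data/file.txt"` does not.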
| 53 | +message: "tarfile.extractall() or tarfile.extract() detected. Tar archives can contain path traversal entries. Use filter='data' or validate members before extraction." |
| 54 | + |
| 55 | +security_implications: |
| 56 | + - title: Arbitrary File Write via Path Traversal |
| 57 | + description: | |
| 58 | + A malicious tar archive can contain members with relative paths containing "../" |
| 59 | + sequences. When extracted with extractall(), these members are written outside the |
| 60 | + intended destination directory. An attacker can overwrite any file writable by the |
| 61 | + application process, including configuration files, crontabs, SSH authorized_keys, |
| 62 | + or application code. |
| 63 | + - title: Remote Code Execution via File Overwrite |
| 64 | + description: | |
| 65 | + By overwriting executable files, Python modules, or cron jobs, an attacker who can |
| 66 | + supply a malicious tar archive to an application can achieve remote code execution. |
| 67 | + In CI/CD and ML pipeline contexts, this is especially dangerous as archives are |
| 68 | + often processed automatically. |
| 69 | + - title: Symlink Following in Archives |
| 70 | + description: | |
| 71 | + Tar archives can contain symbolic links. A crafted archive can include a symlink |
| 72 | + pointing outside the destination directory, followed by a regular file entry that |
| 73 | + writes through the symlink. This two-step attack bypasses simple path validation |
| 74 | + that only checks regular file paths. |
| 75 | + - title: Supply Chain Attacks via Package Archives |
| 76 | + description: | |
| 77 | + Libraries that download and extract tar archives (ML model registries, package |
| 78 | + managers, data pipelines) are common targets. The mlflow vulnerability |
| 79 | + (GHSA-fhff-qmm8-h2fp) allowed arbitrary file write through crafted model artifacts, |
| 80 | + demonstrating the supply chain risk. |
| 81 | + |
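The symlink scenario above suggests that link targets need the same scrutiny as member names. The following is a minimal sketch of such a check, assuming symlink targets are resolved relative to the directory that holds the link and hard link targets relative to the archive root. The function name `unsafe_link_members` is hypothetical.

```python
import os
import tarfile


def unsafe_link_members(tar: tarfile.TarFile, dest: str) -> list[str]:
    """Return names of link members whose target would fall outside dest.

    Hypothetical helper for illustration; it inspects metadata only and
    extracts nothing.
    """
    dest_real = os.path.realpath(dest)
    flagged = []
    for member in tar.getmembers():
        if member.issym():
            # Symlink targets are relative to the directory containing the link.
            base = os.path.dirname(os.path.join(dest_real, member.name))
        elif member.islnk():
            # Hard link targets name another member, relative to the archive root.
            base = dest_real
        else:
            continue
        target = os.path.normpath(os.path.join(base, member.linkname))
        if os.path.commonpath([dest_real, target]) != dest_real:
            flagged.append(member.name)
    return flagged
```

A caller could refuse the whole archive when the returned list is non-empty. This does not replace checking the member names themselves.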
| 82 | +secure_example: | |
| 83 | +  import tarfile |
| 84 | +  import os |
| 85 | + |
| 86 | +  # INSECURE: extractall without validation |
| 87 | +  # tar.extractall(path=dest) |
| 88 | + |
| 89 | +  # SECURE (Python 3.12+): Use filter='data' for safe extraction |
| 90 | +  def safe_extract_filtered(archive_path: str, dest: str) -> None: |
| 91 | +      with tarfile.open(archive_path) as tar: |
| 92 | +          tar.extractall(path=dest, filter="data") |
| 93 | + |
| 94 | +  # SECURE (all versions): Validate each member before extraction |
| 95 | +  def safe_extract_validated(archive_path: str, dest: str) -> None: |
| 96 | +      dest = os.path.realpath(dest) |
| 97 | +      with tarfile.open(archive_path) as tar: |
| 98 | +          for member in tar.getmembers(): |
| 99 | +              member_path = os.path.realpath(os.path.join(dest, member.name)) |
| 100 | +              if os.path.commonpath([dest, member_path]) != dest: |
| 101 | +                  raise ValueError(f"Path traversal detected: {member.name}") |
| 102 | +              if member.issym() or member.islnk(): |
| 103 | +                  raise ValueError(f"Link in archive: {member.name}") |
| 104 | +          tar.extractall(path=dest, members=tar.getmembers()) |
| 105 | + |
| 106 | +  # SECURE: Read file contents without extracting to disk |
| 107 | +  def read_archive_member(archive_path: str, member_name: str) -> bytes: |
| 108 | +      with tarfile.open(archive_path) as tar: |
| 109 | +          f = tar.extractfile(member_name) |
| 110 | +          if f is None: |
| 111 | +              raise ValueError("Not a regular file") |
| 112 | +          return f.read() |
| 113 | + |
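Because the filter parameter is missing on older interpreters, code that must run on several Python versions may want to check for it explicitly instead of failing with a TypeError at the call site. A minimal sketch, assuming that the presence of `tarfile.data_filter` is an adequate feature test; the function name `extract_with_data_filter` is hypothetical.

```python
import tarfile


def extract_with_data_filter(archive_path: str, dest: str) -> None:
    """Extract archive_path into dest with filter='data', or refuse to run.

    Hypothetical wrapper for illustration: it fails closed when the running
    interpreter has no extraction filters instead of extracting unfiltered.
    """
    if not hasattr(tarfile, "data_filter"):
        raise RuntimeError(
            "tarfile extraction filters are unavailable; validate members manually"
        )
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=dest, filter="data")
```

With the default errorlevel, a member that the filter rejects raises a `tarfile.FilterError` subclass, so callers can catch that exception to report a hostile archive.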
| 114 | +recommendations: |
| 115 | + - Use the filter='data' parameter on extractall() (Python 3.12+) to reject path traversal entries, strip leading slashes from absolute paths, and refuse links that point outside the destination. |
| 116 | + - For Python versions before 3.12, validate each member path with os.path.realpath() and ensure it stays within the destination directory. |
| 117 | + - Reject tar members that are symbolic links or hard links pointing outside the extraction directory. |
| 118 | + - Use extractfile() instead of extract() when you only need to read file contents without writing to disk. |
| 119 | + - Consider using the 'tarfile.data_filter' or a custom filter function to enforce extraction policies. |
| 120 | + |
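The last recommendation mentions custom filter functions. A filter is a callable that receives the `TarInfo` member and the destination path, and returns a `TarInfo` to extract or `None` to skip the member. The sketch below layers a per-member size cap on top of `tarfile.data_filter`. The name `size_limited_data_filter` and the 50 MiB limit are hypothetical choices for illustration, and the code assumes an interpreter that provides `tarfile.data_filter`.

```python
import tarfile
from typing import Optional

# Hypothetical policy value; pick a limit that fits the application.
MAX_MEMBER_BYTES = 50 * 1024 * 1024


def size_limited_data_filter(
    member: tarfile.TarInfo, dest_path: str
) -> Optional[tarfile.TarInfo]:
    """Skip oversized regular files, then defer to the built-in 'data' filter."""
    if member.isfile() and member.size > MAX_MEMBER_BYTES:
        return None  # returning None tells tarfile to skip this member
    # data_filter raises a FilterError subclass for unsafe members.
    return tarfile.data_filter(member, dest_path)
```

It would be passed as `tar.extractall(path=dest, filter=size_limited_data_filter)`.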
| 121 | +detection_scope: | |
| 122 | + This rule detects calls to tarfile.extractall() and tarfile.extract() on tarfile objects. |
| 123 | + All call sites are flagged because the safety depends on whether the archive source is |
| 124 | + trusted and whether proper member validation is performed. The filter='data' parameter |
| 125 | + (Python 3.12+) is the recommended mitigation but cannot be statically verified in all cases. |
| 126 | + |
| 127 | +# Compliance |
| 128 | +compliance: |
| 129 | + - standard: CWE Top 25 |
| 130 | + requirement: "CWE-22 - Improper Limitation of a Pathname to a Restricted Directory in the MITRE CWE Top 25" |
| 131 | + - standard: OWASP Top 10 |
| 132 | + requirement: "A01:2021 - Broken Access Control" |
| 133 | + - standard: NIST SP 800-53 |
| 134 | + requirement: "SI-10: Information Input Validation" |
| 135 | + - standard: PCI DSS v4.0 |
| 136 | + requirement: "Requirement 6.2.4 - Protect against injection attacks including path traversal" |
| 137 | + |
| 138 | +# References |
| 139 | +references: |
| 140 | + - title: "CVE-2007-4559: Python tarfile directory traversal" |
| 141 | + url: https://nvd.nist.gov/vuln/detail/CVE-2007-4559 |
| 142 | + - title: "GHSA-fhff-qmm8-h2fp: mlflow tar path traversal" |
| 143 | + url: https://github.com/advisories/GHSA-fhff-qmm8-h2fp |
| 144 | + - title: "PEP 706 - Filter for tarfile.extractall" |
| 145 | + url: https://peps.python.org/pep-0706/ |
| 146 | + - title: "CWE-22: Path Traversal" |
| 147 | + url: https://cwe.mitre.org/data/definitions/22.html |
| 148 | + - title: "Zip Slip Vulnerability (Snyk Research)" |
| 149 | + url: https://security.snyk.io/research/zip-slip-vulnerability |
| 150 | + |
| 151 | +# FAQ |
| 152 | +faq: |
| 153 | + - question: Why wasn't CVE-2007-4559 fixed in Python's tarfile module? |
| 154 | + answer: | |
| 155 | + The Python core developers considered fixing this a backward-compatibility break. |
| 156 | + Many legitimate use cases depend on extracting archives with relative paths or |
| 157 | + symlinks. Instead, Python 3.12 implemented PEP 706, which adds the filter parameter |
| 158 | + to extract() and extractall() so users can opt into safe extraction behavior. Python |
| 159 | + 3.14 makes the 'data' filter the default. |
| 160 | + |
| 161 | + - question: Is extractall(members=validated_list) sufficient? |
| 162 | + answer: | |
| 163 | + Passing a validated members list helps, but you must also check for symlink-based |
| 164 | + attacks where a symlink is followed by a file that writes through it. Both the |
| 165 | + member paths and the link targets need validation. The filter='data' approach in |
| 166 | + Python 3.12+ handles these cases automatically. |
| 167 | + |
| 168 | + - question: How does filter='data' protect against path traversal? |
| 169 | + answer: | |
| 170 | + The 'data' filter strips leading slashes from member names and refuses members whose |
| 171 | + resolved path falls outside the destination, links (symbolic or hard) that point to |
| 172 | + absolute paths or outside the destination, and device files and pipes. Links that stay |
| 173 | + inside the destination are still extracted. This is the recommended approach for Python 3.12+. |
| 174 | + |
| 175 | + - question: Is tarfile.extractfile() safe? |
| 176 | + answer: | |
| 177 | + Yes, extractfile() returns a file-like object for reading the member contents |
| 178 | + without writing anything to disk. It is safe to use with untrusted archives when |
| 179 | + you need to read file contents. However, it returns a reader only for regular files |
| 180 | + and links; for directories and other member types it returns None. |
| 181 | + |
| 182 | + - question: What about shutil.unpack_archive() with tar files? |
| 183 | + answer: | |
| 184 | + shutil.unpack_archive() calls tarfile.extractall() internally and is equally |
| 185 | + vulnerable to path traversal. The same mitigations apply. In Python 3.12+, |
| 186 | + shutil.unpack_archive() also accepts the filter parameter. |
| 187 | + |
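For the shutil case in the last FAQ entry, a minimal sketch of passing the filter through `shutil.unpack_archive()` might look as follows. The wrapper name `unpack_tar_safely` is hypothetical, and the sketch assumes the archive is a tar format that shutil recognizes from the file extension, since the filter argument is not accepted for zip files.

```python
import shutil


def unpack_tar_safely(archive_path: str, dest: str) -> None:
    """Unpack a tar archive via shutil with the 'data' extraction filter.

    Hypothetical wrapper for illustration; requires an interpreter whose
    shutil.unpack_archive() accepts the keyword-only filter argument.
    """
    shutil.unpack_archive(archive_path, dest, filter="data")
```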
| 188 | +# Similar Rules |
| 189 | +similar_rules: |
| 190 | + - PYTHON-LANG-SEC-060 |
| 191 | + |
| 192 | +# Test Files |
| 193 | +tests: |
| 194 | + positive: tests/positive/ |
| 195 | + negative: tests/negative/ |