Commit 5a3bc5b

feat: implement nested zips (#12)
Implements reproducibility for nested zip files. Additionally, we add tests for nested zip files. Finally, the test harness is overhauled to make writing verification tests easier.
1 parent 819613d commit 5a3bc5b

7 files changed

+468
-97
lines changed

README.md

Lines changed: 19 additions & 19 deletions
@@ -1,47 +1,47 @@
 # package-python-function
-Python command-line (CLI) tool to package a Python function for deploying to AWS Lambda, and possibly other
-cloud platforms.
+Python command-line (CLI) tool to package a Python function for deploying to AWS Lambda, and possibly other cloud platforms.

-This tool builds a ZIP file from a virtual environment with all depedencies installed that are to be included in the final deployment asset. If the content is larger than AWS Lambda's maximum unzipped package size of 250 MiB,
-then this tool will employ the ZIP-inside-ZIP (nested-ZIP) workaround. This allows deploying Lambdas with large
-dependency packages, especially those with native code compiled extensions like Pandas, PyArrow, etc.
+This tool builds a ZIP file from a virtual environment with all dependencies installed that are to be included in the final deployment asset. If the content is larger than AWS Lambda's maximum unzipped package size of 250 MiB, this tool will then employ the ZIP-inside-ZIP (nested-ZIP) workaround. This allows deploying Lambdas with large dependency packages, especially those with native code compiled extensions like Pandas, PyArrow, etc. The ZIP files are generated [reproducibly](#a-note-on-reproducibility), ensuring that the same source will always generate a ZIP file with the same hash.

-This technique was originally pioneered by [serverless-python-requirements](https://github.com/serverless/serverless-python-requirements), which is a NodeJS (JavaScript) plugin for the [Serverless Framework](https://github.com/serverless/serverless). The technique has been improved here to not require any special imports in your entrypoint source file. That is, no changes are needed to your source code to leverage the nested ZIP deployment.
+This technique was originally pioneered by [serverless-python-requirements](https://github.com/serverless/serverless-python-requirements), which is a NodeJS (JavaScript) plugin for the [Serverless Framework](https://github.com/serverless/serverless). The technique has been improved here to not require any special imports in your entrypoint source file. That is, no changes are needed to your source code to leverage the nested ZIP deployment.

-The motivation for this Python tool is to achieve the same results as serverless-python-requirements but with a
-purely Python tool. This can simplify and speed up developer and CI/CD workflows.
+The motivation for this Python tool is to achieve the same results as [serverless-python-requirements](https://www.serverless.com/plugins/serverless-python-requirements) but with a purely Python tool. This can simplify and speed up developer and CI/CD workflows.

-One important thing that this tool does not do is build the target virtual environment and install all of the
-dependencies. You must first generate that with a tool like [Poetry](https://github.com/python-poetry/poetry) and the [poetry-plugin-bundle](https://github.com/python-poetry/poetry-plugin-bundle).
+One important thing that this tool does not do is build the target virtual environment and install all of the dependencies. You must first generate that with a tool like [Poetry](https://github.com/python-poetry/poetry) and the [poetry-plugin-bundle](https://github.com/python-poetry/poetry-plugin-bundle).

 ## Example command sequence
-```
+```shell
 poetry bundle venv .build/.venv --without dev
 package-python-function .build/.venv --output-dir .build/lambda
 ```

-The output will be a .zip file with the same name as your project from your pyproject.toml file (with dashes replaced
+The output will be a .zip file with the same name as your project from your `pyproject.toml` file (with dashes replaced
 with underscores).

 ## Installation
 Use [pipx](https://github.com/pypa/pipx) to install:

-```
+```shell
 pipx install package-python-function
 ```

 ## Usage / Arguments
-`package-python-function venv_dir [--project PROJECT] [--output-dir OUTPUT_DIR] [--output OUTPUT]`
-
-- `venv_dir` [Required]: The path to the virtual environment to package.
+```shell
+package-python-function venv_dir [--project PROJECT] [--output-dir OUTPUT_DIR] [--output OUTPUT]
+```

-- `--project` [Optional]: Path to the pyproject.toml file. Omit to use the pyproject.toml file in the current working directory.
+- `venv_dir` [Required]: The path to the virtual environment to package.
+- `--project` [Optional]: Path to the `pyproject.toml` file. Omit to use the `pyproject.toml` file in the current working directory.

 One of the following must be specified:
 - `--output`: The full output path of the final zip file.
+- `--output-dir`: The output directory for the final zip file. The name of the zip file will be based on the project's
+name in the `pyproject.toml` file (with dashes replaced with underscores).

-- `--output-dir`: The output directory for the final zip file. The name of the zip file will be based on the project's
-name in the pyproject.toml file (with dashes replaced with underscores).
+## A Note on Reproducibility

+The generated ZIP files adhere to the [reproducible builds](https://reproducible-builds.org/docs/archives/) guidelines. This means that file permissions and timestamps are normalized inside the ZIP, so that the ZIP has a deterministic hash. By default, the date is set to `1980-01-01`.

+Additionally, the tool respects the standardized `$SOURCE_DATE_EPOCH` [environment variable](https://reproducible-builds.org/docs/source-date-epoch/), which allows you to set that date as needed.

+One important caveat is that ZIP files do not support files with timestamps earlier than `1980-01-01` inside them, due to MS-DOS compatibility. Therefore, the tool will raise a `SourceDateEpochError` if `$SOURCE_DATE_EPOCH` is below `315532800`.
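
Note on the `315532800` floor mentioned in the README caveat: the value is simply `1980-01-01 00:00:00` UTC expressed as seconds since the Unix epoch. The following is a minimal sketch of that conversion and check, not the tool's code. The helper name `epoch_to_zip_date_time` and the constant `ZIP_MIN_EPOCH` are hypothetical, and the sketch raises a plain `ValueError` rather than the tool's `SourceDateEpochError`.

```python
import time

# 1980-01-01 00:00:00 UTC, the earliest date the MS-DOS format used by ZIP can hold.
ZIP_MIN_EPOCH = 315_532_800


def epoch_to_zip_date_time(epoch: int) -> tuple:
    """Convert seconds since the Unix epoch to a (Y, M, D, h, m, s) tuple in UTC.

    Hypothetical helper for illustration only; it raises ValueError for
    values that fall before 1980.
    """
    dt = tuple(time.gmtime(epoch)[:6])
    if dt[0] < 1980:
        raise ValueError(f"epoch {epoch} is before {ZIP_MIN_EPOCH} (1980-01-01 UTC)")
    return dt


print(epoch_to_zip_date_time(ZIP_MIN_EPOCH))
```

A value one second lower (`315532799`) falls on `1979-12-31` and is rejected by the year check.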
Lines changed: 10 additions & 36 deletions
@@ -1,23 +1,18 @@
 from __future__ import annotations

 import logging
-import os
 import shutil
-import time
-import zipfile
 from pathlib import Path
 from tempfile import NamedTemporaryFile
-from typing import TYPE_CHECKING
+from zipfile import ZIP_DEFLATED, ZIP_STORED

 from .python_project import PythonProject
-
-if TYPE_CHECKING:
-    from typing import Tuple
+from .reproducible_zipfile import ZipFile

 logger = logging.getLogger(__name__)

 class Packager:
-    AWS_LAMBDA_MAX_UNZIP_SIZE = 262144000
+    AWS_LAMBDA_MAX_UNZIP_SIZE = 262_144_000

     def __init__(self, venv_path: Path, project_path: Path, output_dir: Path, output_file: Path | None):
         self.project = PythonProject(project_path)
@@ -46,35 +41,14 @@ def package(self) -> None:
     def zip_all_dependencies(self, target_path: Path) -> None:
         logger.info(f"Zipping to {target_path}...")

-        def date_time() -> Tuple[int, int, int, int, int, int]:
-            """Returns date_time value used to force overwrite on all ZipInfo objects. Defaults to
-            1980-01-01 00:00:00. You can set this with the environment variable SOURCE_DATE_EPOCH as an
-            integer value representing seconds since Epoch.
-            """
-            source_date_epoch = os.environ.get("SOURCE_DATE_EPOCH", None)
-            if source_date_epoch is not None:
-                return time.gmtime(int(source_date_epoch))[:6]
-            return (1980, 1, 1, 0, 0, 0)
-
-        with zipfile.ZipFile(target_path, "w", zipfile.ZIP_DEFLATED) as zip_file:
-
+        with ZipFile(target_path, "w", ZIP_DEFLATED) as zip_file:
             def zip_dir(path: Path) -> None:
                 for item in path.iterdir():
                     if item.is_dir():
                         zip_dir(item)
                     else:
-                        zinfo = zipfile.ZipInfo.from_file(
-                            item, item.relative_to(self.input_path)
-                        )
-                        zinfo.date_time = date_time()
-                        zinfo.external_attr = 0o644 << 16
-                        zinfo.compress_type = zipfile.ZIP_DEFLATED
                         self._uncompressed_bytes += item.stat().st_size
-                        with (
-                            open(item, "rb") as src,
-                            zip_file.open(zinfo, "w") as dest,
-                        ):
-                            shutil.copyfileobj(src, dest, 1024 * 8)
+                        zip_file.write_reproducibly(item, item.relative_to(self.input_path))

             zip_dir(self.input_path)

@@ -96,15 +70,15 @@ def zip_dir(path: Path) -> None:
     def generate_nested_zip(self, inner_zip_path: Path) -> None:
         logger.info(f"Generating nested-zip and __init__.py loader using entrypoint package '{self.project.entrypoint_package_name}'...")

-        with zipfile.ZipFile(self.output_file, 'w') as outer_zip_file:
+        with ZipFile(self.output_file, 'w') as outer_zip_file:
             entrypoint_dir = Path(self.project.entrypoint_package_name)
-            outer_zip_file.write(
+            outer_zip_file.write_reproducibly(
                 inner_zip_path,
                 arcname=str(entrypoint_dir / ".dependencies.zip"),
-                compresslevel=zipfile.ZIP_STORED
+                compresslevel=ZIP_STORED
             )
-            outer_zip_file.writestr(
+            outer_zip_file.writestr_reproducibly(
                 str(entrypoint_dir / "__init__.py"),
                 Path(__file__).parent.joinpath("nested_zip_loader.py").read_text(),
-                compresslevel=zipfile.ZIP_DEFLATED
+                compresslevel=ZIP_DEFLATED
             )
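
Note on `AWS_LAMBDA_MAX_UNZIP_SIZE`: the constant `262_144_000` is 250 MiB in bytes, and the packager accumulates `item.stat().st_size` into `_uncompressed_bytes` while zipping. The hunks above do not show how that total is compared against the limit, so the following is only a sketch of one way such a check could look. The function names `uncompressed_size` and `needs_nested_zip`, and the strict `>` comparison, are assumptions made for illustration.

```python
from pathlib import Path

# 250 MiB expressed in bytes; equals the 262_144_000 constant in the diff.
AWS_LAMBDA_MAX_UNZIP_SIZE = 250 * 1024 * 1024


def uncompressed_size(root: Path) -> int:
    """Sum the on-disk size of every regular file below root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def needs_nested_zip(root: Path, limit: int = AWS_LAMBDA_MAX_UNZIP_SIZE) -> bool:
    """Return True when the content exceeds the limit (strict '>' is an assumption)."""
    return uncompressed_size(root) > limit
```

Passing the limit as a parameter keeps the check easy to exercise with small fixtures instead of 250 MiB of test data.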
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+from __future__ import annotations
+
+import os
+import shutil
+import time
+import zipfile
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from os import PathLike
+    from pathlib import Path
+    from typing import Optional, Tuple, Union
+
+DEFAULT_DATE_TIME = (1980, 1, 1, 0, 0, 0)
+DEFAULT_DIR_MODE = 0o755
+DEFAULT_FILE_MODE = 0o644
+
+class SourceDateEpochError(Exception):
+    """Raise when there are issues with $SOURCE_DATE_EPOCH"""
+
+def date_time() -> Tuple[int, int, int, int, int, int]:
+    """Returns date_time value used to force overwrite on all ZipInfo objects. Defaults to
+    1980-01-01 00:00:00. You can set this with the environment variable SOURCE_DATE_EPOCH as an
+    integer value representing seconds since Epoch.
+    """
+    source_date_epoch = os.environ.get("SOURCE_DATE_EPOCH", None)
+    if source_date_epoch is not None:
+        dt = time.gmtime(int(source_date_epoch))[:6]
+        if dt[0] < 1980:
+            raise SourceDateEpochError(
+                "$SOURCE_DATE_EPOCH must be >= 315532800, since ZIP files need MS-DOS date/time format, which can be 1/1/1980, at minimum."
+            )
+        return dt
+    return DEFAULT_DATE_TIME
+
+class ZipFile(zipfile.ZipFile):
+    def write_reproducibly(
+        self,
+        filename: PathLike,
+        arcname: Optional[Union[Path, str]] = None,
+        compress_type: Optional[int] = None,
+        compresslevel: Optional[int] = None,
+    ):
+        if not self.fp:
+            raise ValueError("Attempt to write to ZIP archive that was already closed")
+        if self._writing:
+            raise ValueError("Can't write to ZIP archive while an open writing handle exists")
+
+        zinfo = zipfile.ZipInfo.from_file(filename, arcname, strict_timestamps=self._strict_timestamps)
+        zinfo.date_time = date_time()
+        if zinfo.is_dir():
+            zinfo.external_attr = (0o40000 | DEFAULT_DIR_MODE) << 16
+            zinfo.external_attr |= 0x10  # MS-DOS directory flag
+        else:
+            zinfo.external_attr = DEFAULT_FILE_MODE << 16
+
+        if zinfo.is_dir():
+            zinfo.compress_size = 0
+            zinfo.CRC = 0
+            self.mkdir(zinfo)
+        else:
+            if compress_type is not None:
+                zinfo.compress_type = compress_type
+            else:
+                zinfo.compress_type = self.compression
+
+            if compresslevel is not None:
+                zinfo._compresslevel = compresslevel
+            else:
+                zinfo._compresslevel = self.compresslevel
+
+            with open(filename, "rb") as src, self.open(zinfo, "w") as dest:
+                shutil.copyfileobj(src, dest, 1024 * 8)
+
+    def writestr_reproducibly(
+        self,
+        zinfo_or_arcname: Union[str, zipfile.ZipInfo],
+        data: Union[str, bytes],
+        compress_type: Optional[int] = None,
+        compresslevel: Optional[int] = None,
+    ):
+        if isinstance(data, str):
+            data = data.encode("utf-8")
+
+        if not isinstance(zinfo_or_arcname, zipfile.ZipInfo):
+            zinfo = zipfile.ZipInfo(filename=zinfo_or_arcname, date_time=date_time())
+            zinfo.compress_type = self.compression
+            zinfo._compresslevel = self.compresslevel
+            if zinfo.is_dir():
+                zinfo.external_attr = (0o40000 | DEFAULT_DIR_MODE) << 16
+                zinfo.external_attr |= 0x10  # MS-DOS directory flag
+            else:
+                zinfo.external_attr = DEFAULT_FILE_MODE << 16
+        else:
+            zinfo = zinfo_or_arcname
+
+        zinfo.file_size = len(data)
+        if compress_type is not None:
+            zinfo.compress_type = compress_type
+
+        if compresslevel is not None:
+            zinfo._compresslevel = compresslevel
+
+        if not self.fp:
+            raise ValueError("Attempt to write to ZIP archive that was already closed")
+        if self._writing:
+            raise ValueError("Can't write to ZIP archive while an open writing handle exists.")
+
+        with self._lock:
+            with self.open(zinfo, mode="w") as dest:
+                dest.write(data)
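
Note on what the fixed metadata buys: the idea behind `write_reproducibly` and `writestr_reproducibly` can be demonstrated with the standard library alone. The sketch below is not the module above. It is a standalone illustration, with hypothetical names `build_archive`, `FIXED_DATE_TIME` and `FILE_MODE`, that pins the timestamp and permission bits on each `ZipInfo` and compares the hashes of two builds. Sorting the entry names is an addition of this sketch, since entry order also affects the archive bytes.

```python
from __future__ import annotations

import hashlib
import io
import zipfile

FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)  # same default date as the module above
FILE_MODE = 0o644


def build_archive(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP whose bytes depend only on names and contents."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(files):  # fixed entry order (assumption of this sketch)
            zinfo = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE_TIME)
            zinfo.external_attr = FILE_MODE << 16  # Unix permission bits
            zinfo.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(zinfo, files[name])
    return buffer.getvalue()


first = build_archive({"b.py": b"print('b')\n", "a.py": b"print('a')\n"})
second = build_archive({"a.py": b"print('a')\n", "b.py": b"print('b')\n"})

# Same inputs, different insertion order: the digests should match.
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

Without the pinned `date_time`, `ZipInfo.from_file` would record each file's modification time, so two builds of identical sources could hash differently.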

poetry.lock

Lines changed: 86 additions & 3 deletions
Some generated files are not rendered by default.
