
Commit 5c030a4

Added Apache Arrow formats (fixes #1)
1 parent f863738 commit 5c030a4

10 files changed: 314 additions & 207 deletions

.github/workflows/python-package.yml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ jobs:
       fail-fast: false
       matrix:
         os: [ubuntu-latest]
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13"]
+        python-version: ["3.10", "3.11", "3.12", "3.13"]
     steps:
     - uses: actions/checkout@v3
     - name: Set up Python ${{ matrix.python-version }}

docs/pages/index.md

Lines changed: 38 additions & 32 deletions
@@ -1,32 +1,38 @@
-# Introduction
-
-DSFF (DataSet File Format) is a tiny library relying on [`openpyxl`](https://pypi.org/project/openpyxl) that allows storing a dataset with its features for use with machine learning in an XLSX file whose structure is enforced. It is intended to make it easy to store, edit and exchange a dataset.
-
-It is used with the [Packing Box](https://github.com/packing-box/docker-packing-box) to export datasets in a convenient format.
-
------
-
-## Setup
-
-This library is available on [PyPI](https://pypi.python.org/pypi/dsff/) and can simply be installed using Pip:
-
-```sh
-pip install --user dsff
-```
-
------
-
-## Format
-
-DSFF is straightforward and contains only the minimum needed for storing a dataset.
-
-The following document properties of the XLSX format are used:
-
-- `title`: holds the name of the dataset
-- `description`: holds a serialized dictionary of the dataset's metadata
-
-An XLSX workbook formatted as a DSFF has two and only two worksheets:
-
-1. `data`: the matrix of the whole dataset (including headers), possibly containing samples' metadata but mostly the feature values
-2. `features`: the name-description pairs of each feature used in `data` (including two headers: `name` and `description`)
-
+# Introduction
+
+DSFF (DataSet File Format) is a tiny library relying on [`openpyxl`](https://pypi.org/project/openpyxl) that allows storing a dataset with its features for use with machine learning in an XLSX file whose structure is enforced. It is intended to make it easy to store, edit and exchange a dataset.
+
+It is used with the [Packing Box](https://github.com/packing-box/docker-packing-box) to export datasets in a convenient format.
+
+-----
+
+## Setup
+
+This library is available on [PyPI](https://pypi.python.org/pypi/dsff/) and can simply be installed using Pip:
+
+```sh
+pip install --user dsff
+```
+
+If you want to use the additional [Apache Arrow](https://arrow.apache.org/docs/index.html) formats, you can install [`pyarrow`](https://arrow.apache.org/docs/python/index.html) with the following command:
+
+```sh
+pip install --user dsff[extra]
+```
+
+-----
+
+## Format
+
+DSFF is straightforward and contains only the minimum needed for storing a dataset.
+
+The following document properties of the XLSX format are used:
+
+- `title`: holds the name of the dataset
+- `description`: holds a serialized dictionary of the dataset's metadata
+
+An XLSX workbook formatted as a DSFF has two and only two worksheets:
+
+1. `data`: the matrix of the whole dataset (including headers), possibly containing samples' metadata but mostly the feature values
+2. `features`: the name-description pairs of each feature used in `data` (including two headers: `name` and `description`)
+
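A note on the `description` property above: the exact serialization DSFF uses is not shown in this diff, but a minimal sketch of the "serialized dictionary" convention, assuming a `repr()`/`ast.literal_eval()` round-trip (`literal_eval` does appear in the new `pa.py` further down; the metadata keys here are made up for illustration):

```python
from ast import literal_eval

# Hypothetical illustration: dataset metadata is a plain dict stored as its
# repr() string in the XLSX 'description' document property, and parsed back
# with ast.literal_eval, which safely evaluates Python literals only.
metadata = {"name": "my-dataset", "samples": 250, "packed": True}
serialized = repr(metadata)          # the string that would be stored
restored = literal_eval(serialized)  # what a reader would recover
assert restored == metadata
```

`literal_eval` is preferred over `eval` here because it rejects anything that is not a literal, so a tampered `description` property cannot execute code.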

docs/pages/usage.md

Lines changed: 24 additions & 0 deletions
@@ -82,3 +82,27 @@ Converting from other formats to DSFF | Converting from DSFF to other formats
     f.to_dataset()  # creates ./[dsff-title] with data.csv, features.json and metadata.json
 ```
 
+**Creating a Feather dataset from a DSFF**
+
+```python
+>>> import dsff
+>>> with dsff.DSFF("/path/to/my-dataset.feather") as f:
+        f.to_feather()  # creates ./my-dataset.feather
+```
+
+**Creating an ORC dataset from a DSFF**
+
+```python
+>>> import dsff
+>>> with dsff.DSFF("/path/to/my-dataset.orc") as f:
+        f.to_orc()  # creates ./my-dataset.orc
+```
+
+**Creating a Parquet dataset from a DSFF**
+
+```python
+>>> import dsff
+>>> with dsff.DSFF("/path/to/my-dataset.parquet") as f:
+        f.to_parquet()  # creates ./my-dataset.parquet
+```
+

pyproject.toml

Lines changed: 6 additions & 1 deletion
@@ -16,7 +16,7 @@ authors = [
 description = "DataSet File Format (DSFF)"
 license = {file = "LICENSE"}
 keywords = ["python", "programming", "dataset-file-format", "dsff"]
-requires-python = ">=3.8,<4"
+requires-python = ">=3.10,<4"
 classifiers = [
     "Development Status :: 5 - Production/Stable",
     "Environment :: Console",
@@ -33,6 +33,11 @@ dependencies = [
 ]
 dynamic = ["version"]
 
+[project.optional-dependencies]
+extra = [
+    "pyarrow",
+]
+
 [project.readme]
 file = "README.md"
 content-type = "text/markdown"
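The new `[project.optional-dependencies]` table makes `pyarrow` an opt-in extra (`pip install dsff[extra]`), so the library has to cope with it being absent at runtime. A minimal sketch of one common guard pattern, not necessarily dsff's actual code (`require_pyarrow` is a hypothetical helper):

```python
from importlib.util import find_spec

# Detect whether the optional 'pyarrow' extra is installed, without importing
# it (find_spec only searches for the module, it does not load it).
HAS_PYARROW = find_spec("pyarrow") is not None

def require_pyarrow():
    # Hypothetical helper: fail with an actionable hint instead of letting a
    # bare ImportError surface deep inside a to_parquet()/from_orc() call.
    if not HAS_PYARROW:
        raise ImportError("Apache Arrow formats require pyarrow; "
                          "install it with: pip install dsff[extra]")
```

Checking once at import time keeps the error message consistent across all three Arrow-backed formats.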

pytest.ini

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[pytest]
+pythonpath = src

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1 +1,2 @@
 openpyxl
+pyarrow

src/dsff/VERSION.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-1.1.0
+1.2.0

src/dsff/formats/pa.py

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# -*- coding: UTF-8 -*-
+from .__common__ import *
+
+
+__all__ = []
+
+
+def _nowrite(m):
+    raise NotImplementedError(f"none of {m}.write_table and {m}.write_{m} is implemented")
+
+
+for module in ["feather", "orc", "parquet"]:
+    __all__ += [f"from_{module}", f"to_{module}"]
+    def gen_func(m):
+        def from_(dsff, path=None, exclude=DEFAULT_EXCL):
+            dataset = globals()[m].read_table(path)
+            dsff.write(data=[dataset.schema.names] + [list(r.values()) for r in dataset.to_pylist()],
+                       metadata=literal_eval(dataset.schema.metadata.pop(b'__metadata__', b"{}").decode()),
+                       features={k.decode(): v.decode() for k, v in dataset.schema.metadata.items()})
+        from_.__name__ = f"from_{m}"
+        def to_(dsff, path=None, text=False):
+            with (BytesIO() if text else open(path, 'wb+')) as f:
+                getattr(globals()[m], "write_table", getattr(globals()[m], f"write_{m}", _nowrite))(dsff._to_table(), f)
+            if text:
+                return f.getvalue()
+        to_.__name__ = f"to_{m}"
+        return from_, to_
+    globals()[f'from_{module}'], globals()[f'to_{module}'] = gen_func(module)
+

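The `pa.py` above generates each `from_*`/`to_*` pair inside a loop; routing the loop variable through the `gen_func(m)` parameter is what binds each generated function to its own format name. A stdlib-only sketch of the late-binding pitfall this avoids, and the factory fix the commit uses (`make_getters_*` are hypothetical names for illustration):

```python
def make_getters_buggy(names):
    # Pitfall: a closure captures the *variable* 'name', not its value at
    # append time, so after the loop every function sees the last name.
    funcs = []
    for name in names:
        funcs.append(lambda: name)
    return funcs

def make_getters_fixed(names):
    # Fix (the gen_func approach in pa.py): pass the loop variable through a
    # factory parameter so each closure gets its own independent binding.
    def gen(n):
        return lambda: n
    return [gen(name) for name in names]

buggy = make_getters_buggy(["feather", "orc", "parquet"])
fixed = make_getters_fixed(["feather", "orc", "parquet"])
assert [f() for f in buggy] == ["parquet", "parquet", "parquet"]
assert [f() for f in fixed] == ["feather", "orc", "parquet"]
```

Setting `__name__` on each generated function, as the commit does, also keeps tracebacks and introspection readable (`from_orc` instead of `from_`).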