|
1 | | -# Introduction |
2 | | - |
3 | | -DSFF (DataSet File Format) is a tiny library relying on [`openpyxl`](https://pypi.org/project/openpyxl) that allows to store a dataset with its features for use with machine learning in an XSLX file whose structure is enforced. It is intended to make easy to store, edit and exchange a dataset. |
4 | | - |
5 | | -It is used with the [Packing Box](https://github.com/packing-box/docker-packing-box) to export datasets in a convenient format. |
6 | | - |
7 | | ------ |
8 | | - |
9 | | -## Setup |
10 | | - |
11 | | -This library is available on [PyPi](https://pypi.python.org/pypi/dsff/) and can be simply installed using Pip: |
12 | | - |
13 | | -```sh |
14 | | -pip install --user dsff |
15 | | -``` |
16 | | - |
17 | | ------ |
18 | | - |
19 | | -## Format |
20 | | - |
21 | | -DSFF is straightforward and contains only the minimum for storing a dataset. |
22 | | - |
23 | | -The following document properties of the XSLX format are used: |
24 | | - |
25 | | -- `title`: this holds the name of the dataset |
26 | | -- `description`: this holds a serialized dictionary of the metadata from the dataset |
27 | | - |
28 | | -An XSLX workbook format as a DSFF has two and only two worksheets: |
29 | | - |
30 | | -1. `data`: the matrix of the whole dataset (including headers), eventually containing samples' metadata but mostly the feature values |
31 | | -2. `features`: the name-description pairs of each feature used in `data` (including two headers: `name` and `description`) |
32 | | - |
| 1 | +# Introduction |
| 2 | + |
| 3 | +DSFF (DataSet File Format) is a tiny library built on [`openpyxl`](https://pypi.org/project/openpyxl) that stores a machine learning dataset, together with its features, in an XLSX file whose structure is enforced. It is intended to make a dataset easy to store, edit and exchange. |
| 4 | + |
| 5 | +It is used with the [Packing Box](https://github.com/packing-box/docker-packing-box) to export datasets in a convenient format. |
| 6 | + |
| 7 | +----- |
| 8 | + |
| 9 | +## Setup |
| 10 | + |
| 11 | +This library is available on [PyPI](https://pypi.python.org/pypi/dsff/) and can be installed using pip: |
| 12 | + |
| 13 | +```sh |
| 14 | +pip install --user dsff |
| 15 | +``` |
| 16 | + |
| 17 | +If you want to use additional [Apache Arrow](https://arrow.apache.org/docs/index.html) formats, you can install [`pyarrow`](https://arrow.apache.org/docs/python/index.html) with the following command: |
| 18 | + |
| 19 | +```sh |
| 20 | +pip install --user dsff[extra] |
| 21 | +``` |
| 22 | + |
| 23 | +----- |
| 24 | + |
| 25 | +## Format |
| 26 | + |
| 27 | +DSFF is straightforward and contains only the minimum for storing a dataset. |
| 28 | + |
| 29 | +The following document properties of the XLSX format are used: |
| 30 | + |
| 31 | +- `title`: this holds the name of the dataset |
| 32 | +- `description`: this holds a serialized dictionary of the metadata from the dataset |
| 33 | + |
| 34 | +An XLSX workbook formatted as DSFF has exactly two worksheets: |
| 35 | + |
| 36 | +1. `data`: the matrix of the whole dataset (including headers), possibly containing samples' metadata but mostly the feature values |
| 37 | +2. `features`: the name-description pairs of each feature used in `data` (including two headers: `name` and `description`) |
| 38 | + |
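
The layout described in the Format section can be illustrated with a small, library-free sketch. It does not use the `dsff` API or `openpyxl`; it only models the two document properties and the two worksheets as plain Python structures and checks them. The `workbook` dictionary, the `check_layout` helper, the sample column names and the use of JSON for the serialized metadata are all assumptions made for illustration.

```python
import json

# Hypothetical in-memory model of a DSFF workbook; none of these names
# come from the dsff library itself.
workbook = {
    "properties": {
        "title": "example-dataset",
        # Assumption: JSON stands in for whatever serialization dsff uses.
        "description": json.dumps({"source": "demo", "samples": 2}),
    },
    "sheets": {
        # Header row first, then one row per sample (metadata + feature values).
        "data": [
            ["hash", "entropy", "size", "label"],
            ["abc123", 6.2, 1024, "packed"],
            ["def456", 4.1, 2048, "not-packed"],
        ],
        # Two headers, then one name-description pair per feature.
        "features": [
            ["name", "description"],
            ["entropy", "Shannon entropy of the file"],
            ["size", "File size in bytes"],
        ],
    },
}


def check_layout(wb):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    if set(wb["sheets"]) != {"data", "features"}:
        problems.append("expected exactly the worksheets 'data' and 'features'")
        return problems
    if wb["sheets"]["features"][0] != ["name", "description"]:
        problems.append("'features' must start with the headers 'name' and 'description'")
    data_headers = set(wb["sheets"]["data"][0])
    for name, _ in wb["sheets"]["features"][1:]:
        if name not in data_headers:
            problems.append(f"feature {name!r} has no column in 'data'")
    return problems


print(check_layout(workbook))
```

A check of this kind only mirrors the rules stated above (two worksheets, the `name` and `description` headers, every described feature present in `data`); the actual enforcement performed by the library may differ.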