ATM, deserializing an `osekit.public_api.dataset.Dataset` completely deserializes all analysis datasets, meaning that every audio file is touched to read its full metadata.
On large datasets, this can waste a significant amount of time when the audio doesn't actually need to be used.
Here's what I was thinking of to avoid such behaviour:
- Add a parameter to `Dataset.from_json()` to avoid deserializing the analysis datasets (which could then be done later, on request)
- Avoid systematically instantiating the `AudioFile` in the `AudioData._make_file()` method (which implies opening the file to read the full metadata): instead, only store the path and begin timestamp, and actually instantiate the `AudioFile` when needed.
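To make the idea concrete, here is a minimal sketch of both points. All names below (`AudioFile`, `AudioData`, `Dataset.from_json()`, the `link_analyses` flag) are illustrative stand-ins, not the actual osekit implementation:

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from pathlib import Path


class AudioFile:
    """Stand-in for osekit's AudioFile: construction is the expensive part,
    since the real class opens the file and reads its full metadata."""

    opened = 0  # counts how many files were actually touched

    def __init__(self, path: Path, begin: float):
        AudioFile.opened += 1
        self.path = path
        self.begin = begin
        # self.metadata = read_metadata(path)  # expensive I/O in the real class


@dataclass
class AudioData:
    """Eagerly stores only the path and begin timestamp; the AudioFile
    is instantiated on first access rather than in _make_file()."""

    path: Path
    begin: float
    _file: AudioFile | None = field(default=None, repr=False)

    @property
    def file(self) -> AudioFile:
        if self._file is None:  # deferred instantiation
            self._file = AudioFile(self.path, self.begin)
        return self._file


@dataclass
class Dataset:
    analyses: dict  # raw JSON kept around for on-request deserialization
    audio: list[AudioData] = field(default_factory=list)

    @classmethod
    def from_json(cls, payload: str, link_analyses: bool = True) -> "Dataset":
        raw = json.loads(payload)
        ds = cls(analyses=raw.get("analyses", {}))
        if link_analyses:  # opt out to skip building AudioData eagerly
            ds.audio = [
                AudioData(Path(a["path"]), a["begin"])
                for a in raw.get("files", [])
            ]
        return ds


payload = json.dumps(
    {"analyses": {"spectro_1": {}}, "files": [{"path": "a.wav", "begin": 0.0}]}
)
ds = Dataset.from_json(payload)
print(AudioFile.opened)        # 0: deserialization touched no audio file
print(ds.audio[0].file.path)   # first access triggers the actual instantiation
print(AudioFile.opened)        # 1
```

With this layout, deserialization only costs a JSON parse; the per-file metadata reads are paid lazily, and only for the files that are actually accessed.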
@ElodieENSTA: IIRC, you only need to access the metadata of the `Dataset` (e.g. the names of the analyses etc.) in the import section, right?
At which stage of the process do you need the full analysis dataset to be deserialized? Only once, on import? Or each time the campaign is opened by a user?