# Missing data handling, workflow flexibility, and data validation

## The problem
- There is currently no standardised approach for how functions should handle
missing data. What should a function return if its input DataFrame is None,
empty, or has no rows?
- This is both an internal problem, when DataFrames are passed between
functions, and a problem when incoming data from users might be missing
tables.
- The problem also extends to what happens when only some columns are missing
from a DataFrame, or when some values within a column are NaN.
## A proposed solution

### General idea
- One solution is a schema-based approach.
- Under the schema-based approach, missing data would always be represented
by a DataFrame with no rows but a full set of expected columns, conforming to
the expected schema for that DataFrame.
- This could also be thought of as an "all columns, no rows" approach.
- Any function would always be expected to return DataFrames with all the
columns defined in the schema.
- A key benefit of the schema approach is that many functions should not
require modification, as most DataFrame operations gracefully handle zero rows
as long as all the expected columns are present. Never allowing None values or
DataFrames with missing columns should also mean fewer checks like
`if df.empty` or `if df is None`, which add clutter and cognitive load when
reading code (see the sketch after this list).
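
As a minimal illustration of the "no modification needed" claim, the sketch
below runs some typical pandas operations on an "all columns, no rows"
DataFrame. The table and column names are hypothetical, not taken from
ISPyPSA:

```python
import pandas as pd

# Hypothetical generators table with the full expected schema but no rows,
# i.e. the "all columns, no rows" representation of missing data.
generators = pd.DataFrame(
    {
        "generator_name": pd.Series(dtype="string"),
        "fuel_type": pd.Series(dtype="string"),
        "capacity_mw": pd.Series(dtype="Float64"),
    }
)

# Typical DataFrame operations work without any None/empty special-casing;
# each simply produces another empty DataFrame with the expected columns.
coal_only = generators[generators["fuel_type"] == "coal"]
by_fuel = generators.groupby("fuel_type", as_index=False)["capacity_mw"].sum()
renamed = generators.rename(columns={"capacity_mw": "capacity"})

print(len(coal_only), len(by_fuel), len(renamed))  # 0 0 0
```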
### Implementation details

#### Schema
- A schema for a workflow step (e.g. the translator) would define all of its
input tables.
- The schema for a table would define all expected columns and the allowed
data types within each column. As an option, we could also define allowed
values, for example, only fuel types in a predefined set; this could help
protect users from typos or poorly defined input sets (see the sketch after
this list).
- Some tables and columns would be optional and others compulsory.
- The ISPyPSA inputs and Translator inputs are clear candidates for schemas;
it is unclear whether a schema for the parsed IASR workbook tables makes
sense.
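
If a validation library such as pandera were adopted (one possible choice,
not settled in this issue), a table schema might look roughly like the
following. The table, column names, and allowed fuel types are purely
illustrative:

```python
import pandera as pa

# Illustrative schema for a hypothetical generators table; dtypes and the
# allowed fuel-type set are assumptions for the example, not ISPyPSA's.
generators_schema = pa.DataFrameSchema(
    columns={
        "generator_name": pa.Column(str),  # compulsory
        "fuel_type": pa.Column(
            str,
            # Allowed-value rule: protects against typos in input data.
            checks=pa.Check.isin(["coal", "gas", "solar", "wind", "hydro"]),
        ),
        "capacity_mw": pa.Column(float, checks=pa.Check.ge(0)),
        # Optional column: may be absent from simplified input datasets.
        "retirement_year": pa.Column("Int64", required=False, nullable=True),
    },
)

# validated = generators_schema.validate(generators_df)  # raises on violations
```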
#### Schema enforcement
- Input datasets with missing compulsory tables or columns, or which break
datatype or allowed-value rules, would raise an error that informs the user
of the specific dataset validation problem.
- Where datasets are missing optional tables or columns, these would be added
at schema enforcement time (see the sketch after this list). This is a
critical step of the schema approach: it ensures that missing input data is
always represented as a DataFrame with no rows or a column of NaNs, while
allowing users to provide simplified datasets with only the tables or columns
of interest to their use case.
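
A minimal sketch of what enforcement could look like for a single table,
assuming schemas are represented as simple dicts mapping column names to
pandas dtypes. The helper name and schema representation are hypothetical,
and dtype/allowed-value checks are omitted for brevity:

```python
import pandas as pd

def enforce_table_schema(
    df: pd.DataFrame | None,
    compulsory: dict[str, str],
    optional: dict[str, str],
    table_name: str,
) -> pd.DataFrame:
    # A missing table becomes an all-columns, no-rows DataFrame.
    if df is None:
        all_columns = {**compulsory, **optional}
        df = pd.DataFrame({c: pd.Series(dtype=t) for c, t in all_columns.items()})
    # Missing compulsory columns are a user-facing validation error.
    missing = set(compulsory) - set(df.columns)
    if missing:
        raise ValueError(
            f"Table '{table_name}' is missing compulsory columns: {sorted(missing)}"
        )
    # Missing optional columns are added as columns of NA values, so
    # downstream functions never need to check for their presence.
    for col, dtype in optional.items():
        if col not in df.columns:
            df[col] = pd.Series([pd.NA] * len(df), dtype=dtype)
    return df

# Example usage with pandas nullable dtypes:
gens = enforce_table_schema(
    pd.DataFrame({"generator_name": ["Bayswater"]}),
    compulsory={"generator_name": "string"},
    optional={"retirement_year": "Int64"},
    table_name="generators",
)
print(gens.columns.tolist())  # ['generator_name', 'retirement_year']
```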
#### Testing
- All functions should be tested to ensure they handle DataFrames with no
rows gracefully and in a manner consistent with the modelling method.
- All functions should be tested to ensure they handle optional columns
containing NaN values gracefully and in a manner consistent with the
modelling method.
- Testing of functions with multiple input tables should include the various
combinations of missing data, e.g. table A empty but table B not (see the
sketch after this list). The test then defines the expected behaviour; if
this behaviour is not obvious it should be documented in the public-facing
docs.
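
A sketch of how such combination tests might look with pytest. The function
under test and the table columns are stand-ins, not real ISPyPSA code:

```python
import pandas as pd
import pytest

# Stand-in for a real ISPyPSA function that joins two input tables.
def attach_fuel_costs(generators: pd.DataFrame, fuel_costs: pd.DataFrame) -> pd.DataFrame:
    return generators.merge(fuel_costs, on="fuel_type", how="left")

def make_generators(rows):
    return pd.DataFrame(rows, columns=["generator_name", "fuel_type"])

def make_fuel_costs(rows):
    return pd.DataFrame(rows, columns=["fuel_type", "cost"])

# Each case pins down the expected behaviour for one combination of
# missing ("all columns, no rows") and populated input tables.
@pytest.mark.parametrize(
    "gen_rows, cost_rows, expected_rows",
    [
        ([], [], 0),                       # both tables empty
        ([], [("coal", 4.0)], 0),          # generators empty
        ([("Bayswater", "coal")], [], 1),  # fuel costs empty -> cost is NaN
    ],
)
def test_missing_input_combinations(gen_rows, cost_rows, expected_rows):
    result = attach_fuel_costs(make_generators(gen_rows), make_fuel_costs(cost_rows))
    assert len(result) == expected_rows
    # The full output schema is present regardless of which inputs were empty.
    assert {"generator_name", "fuel_type", "cost"} <= set(result.columns)
```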
## Outcome
This refactor, while significant, achieves several desired outcomes:
- Introduces data validation.
- Provides a framework that allows us to safely introduce method flexibility
and handling of simplified input datasets. Although, we could just implement
the low-hanging-fruit flexibility to begin with.
- Provides assurance that ISPyPSA will gracefully handle missing data, giving
a more robust user experience. For example, if a user deletes all ECAA
generators to do a greenfield optimisation, we would be confident that
ISPyPSA will handle this.
- Schemas and definitions of expected behaviour would form the basis for
comprehensive table documentation.
## Implementation plan
A staged implementation should be quite achievable. For instance:
- Implementation on a single ISPyPSA table.
- Review and merge (to get early feedback and an easy review of a small PR).
- Implementation of, say, 4 tables.
- Review and merge (check that the approach is generalising well).
- Complete the implementation for the remaining ISPyPSA tables.
- Review and merge.
- Extend the implementation to the PyPSA tables.
- Review and merge.