
Draft data validation, missing values, and input flexibility approach #5

@nick-gorman

Missing data handling, workflow flexibility, and data validation

The problem

  • There is currently no standardised approach for how functions
    should handle missing data. What should a function return if its
    input DataFrame is None, empty, or has no rows? (A sketch of this
    ambiguity is given after this list.)
  • This is both an internal problem, when DataFrames are passed between
    functions, and an external problem, when incoming data from users might
    be missing tables.
  • The problem also extends to what happens when only some columns are
    missing from a DataFrame, or when some values within a column are NaN.
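
A minimal sketch of the ambiguity, using hypothetical function and column names (calculate_fuel_costs, summarise_fuel_costs, fuel_price, heat_rate are illustrative only): without a convention, each author picks a different return value for the no-data case, and every downstream function repeats the same defensive checks.

```python
import pandas as pd


def calculate_fuel_costs(generators: pd.DataFrame | None) -> pd.DataFrame | None:
    # Should this return None, an empty DataFrame, or raise? Currently undefined.
    if generators is None or generators.empty:
        return None  # one author's choice ...
    return generators.assign(
        fuel_cost=generators["fuel_price"] * generators["heat_rate"]
    )


def summarise_fuel_costs(fuel_costs: pd.DataFrame | None) -> pd.DataFrame:
    # ... which forces the next function to repeat the same defensive checks.
    if fuel_costs is None or fuel_costs.empty:
        return pd.DataFrame()
    return fuel_costs.groupby("fuel_type", as_index=False)["fuel_cost"].sum()
```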

A proposed solution

General idea

  • One solution is a schema-based approach.
  • Under the schema-based approach, missing data would always be represented
    by a DataFrame with no rows but a full set of expected columns, conforming
    to the expected schema for that DataFrame.
  • This could also be thought of as an all-columns, no-rows approach.
  • Any function would always be expected to return DataFrames with all the
    expected columns defined in the schema.
  • A key benefit of the schema approach is that many functions should not
    require modification, as most DataFrame operations gracefully handle zero
    rows as long as all the expected columns are present (see the sketch after
    this list). Never allowing None values or DataFrames with missing columns
    means fewer checks like if df.empty or if df is None, which add clutter
    and cognitive load when reading code.
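
To illustrate the benefit, a minimal sketch (hypothetical table and column names) showing that standard pandas operations simply propagate an empty result when a zero-row DataFrame carries the full set of expected columns, so no `is None` or `.empty` guards are needed:

```python
import pandas as pd

# Missing data represented as zero rows but the full expected set of columns.
generators = pd.DataFrame(
    {
        "generator": pd.Series(dtype="object"),
        "fuel_type": pd.Series(dtype="object"),
        "capacity_mw": pd.Series(dtype="float64"),
    }
)

fuel_prices = pd.DataFrame({"fuel_type": ["coal", "gas"], "price": [4.0, 12.0]})

# No `if generators is None` or `if generators.empty` guards required:
merged = generators.merge(fuel_prices, on="fuel_type", how="left")
summary = merged.groupby("fuel_type", as_index=False)["capacity_mw"].sum()
print(summary)  # an empty DataFrame that still has the expected columns
```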

Implementation details

Schema

  • A schema for a workflow step (e.g. the translator) would define all input
    tables.
  • The schema for a table would define all expected columns and the allowed
    data types within each column. As an option, we could also define allowed
    values, for example only fuel types in a predefined set; this could help
    protect users from typos or poorly defined input sets.
  • Some tables and columns would be optional and others compulsory.
  • The ISPyPSA inputs and Translator inputs are clear candidates for schemas;
    it is unclear if a schema for the parsed IASR workbook tables makes sense.
    A sketch of what a table schema might look like is given after this list.
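
As a concrete starting point, a minimal sketch of what a table schema could look like, using hypothetical table and column names; the actual representation (dataclasses as below, plain dicts, or an existing validation library such as pandera) is an open design choice.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnSchema:
    dtype: str                           # expected pandas dtype, e.g. "float64"
    required: bool = True                # compulsory vs optional column
    allowed_values: set | None = None    # optional whitelist, e.g. fuel types


@dataclass
class TableSchema:
    name: str
    required: bool = True                # compulsory vs optional table
    columns: dict[str, ColumnSchema] = field(default_factory=dict)


# Hypothetical schema for a translator input table.
GENERATORS_SCHEMA = TableSchema(
    name="generators",
    columns={
        "generator": ColumnSchema("object"),
        "fuel_type": ColumnSchema(
            "object", allowed_values={"coal", "gas", "solar", "wind"}
        ),
        "capacity_mw": ColumnSchema("float64"),
        "retirement_year": ColumnSchema("float64", required=False),
    },
)
```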

Schema enforcement

  • Input datasets with missing compulsory tables or columns, or which break
    data type or allowed-value rules, would raise an error that informs the
    user of the specific validation problem.
  • Where datasets are missing optional tables or columns, these would be added
    at schema enforcement time. This is the critical step of the schema
    approach: it ensures that missing input data is always represented as a
    DataFrame with no rows or a column of NaNs, while allowing users to provide
    simplified datasets with only the tables or columns relevant to their use
    case. A sketch of enforcement following these rules is given after this
    list.
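
A sketch of schema enforcement following the rules above, built on the hypothetical TableSchema/ColumnSchema classes from the previous sketch (data type checking and coercion omitted for brevity): compulsory problems raise, optional gaps are filled so downstream code always sees the full set of expected columns.

```python
import pandas as pd


def enforce_table_schema(df: pd.DataFrame | None, schema: TableSchema) -> pd.DataFrame:
    """Validate df against schema, filling optional gaps and raising on compulsory ones."""
    if df is None:
        if schema.required:
            raise ValueError(f"Compulsory table '{schema.name}' is missing.")
        # A missing optional table becomes all expected columns, no rows.
        return pd.DataFrame(columns=list(schema.columns))

    df = df.copy()
    for name, col in schema.columns.items():
        if name not in df.columns:
            if col.required:
                raise ValueError(
                    f"Table '{schema.name}' is missing compulsory column '{name}'."
                )
            # A missing optional column becomes a column of NaNs.
            df[name] = pd.NA
        elif col.allowed_values is not None:
            unexpected = set(df[name].dropna()) - col.allowed_values
            if unexpected:
                raise ValueError(
                    f"Column '{name}' in table '{schema.name}' contains "
                    f"unexpected values: {sorted(unexpected)}."
                )
    return df
```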

Testing

  • All functions should be tested to ensure they handle DataFrames with
    no rows gracefully and in a manner consistent with the modelling method.
  • All functions should be tested to ensure they handle optional columns
    containing NaN values gracefully and in a manner consistent with the
    modelling method.
  • Testing of functions with multiple input tables should include the various
    combinations of missing data, i.e. table A empty but table B not. The test
    then defines the expected behaviour; if this behaviour is not obvious, it
    should be documented in the public-facing docs. A sketch of such a test is
    given after this list.
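
A sketch (pytest, with a hypothetical workflow step and table names; a trivial stand-in implementation is included only so the example runs) of testing the combinations of missing input data described above:

```python
import pandas as pd
import pytest


def translate_components(generators: pd.DataFrame, storage: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a real workflow step, included only to make the sketch runnable.
    parts = [
        generators.rename(columns={"generator": "name", "capacity_mw": "p_nom"})
        .assign(component_type="generator"),
        storage.rename(columns={"storage_unit": "name", "power_mw": "p_nom"})
        .assign(component_type="storage"),
    ]
    return pd.concat(parts)[["name", "p_nom", "component_type"]]


@pytest.mark.parametrize(
    "generators_rows, storage_rows",
    [
        # Table A empty, table B populated; and the reverse; and both empty.
        ({"generator": [], "capacity_mw": []},
         {"storage_unit": ["unit_a"], "power_mw": [100.0]}),
        ({"generator": ["gen_a"], "capacity_mw": [500.0]},
         {"storage_unit": [], "power_mw": []}),
        ({"generator": [], "capacity_mw": []},
         {"storage_unit": [], "power_mw": []}),
    ],
)
def test_translate_components_handles_missing_rows(generators_rows, storage_rows):
    result = translate_components(pd.DataFrame(generators_rows), pd.DataFrame(storage_rows))
    # The test pins down the expected behaviour: the full set of output columns
    # is always present, even when some inputs contain no rows.
    assert list(result.columns) == ["name", "p_nom", "component_type"]
```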

Outcome

This refactor, while significant, achieves several desired outcomes:

  • Introduces data validation.
  • Provides a framework that allows us to safely introduce method
    flexibility and handling of simplified input datasets. Although,
    we could just implement the low-hanging-fruit flexibility to begin
    with.
  • Provides assurance that ISPyPSA will gracefully handle missing data,
    giving a more robust user experience. For example, if a user deletes
    all ECAA generators to do a greenfield optimisation, we would be
    confident that ISPyPSA will handle this.
  • Schemas and the definition of expected behaviour would form the basis
    for comprehensive table documentation.

Implementation plan

A staged implementation should be quite achievable. For instance:

  1. Implementation on a single ISPyPSA table.
  2. Review and merge (to get early feedback and an easy review of a small PR).
  3. Implementation on, say, 4 more tables.
  4. Review and merge (check that the approach is generalising well).
  5. Complete the implementation across the remaining ISPyPSA tables.
  6. Review and merge.
  7. Extend the implementation to the PyPSA tables.
  8. Review and merge.
