This repository provides a sample solution and benchmark for students who need to build a regression model to predict housing prices in the UE "Creating Machine Learning Models in Practice". The code was created as part of an educational TU Wien project. The original dataset is the well-known Ames Housing Prices Dataset by Dean De Cock. The version used in this code was converted from txt to csv and is stored in DBRepo, pre-split into training, validation and testing datasets.
Original dataset and documentation:
- https://jse.amstat.org/v19n3/decock/AmesHousing.txt
- https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

DBRepo dataset splits:
- https://handle.test.datacite.org/10.82556/1jqa-zp46
- https://handle.test.datacite.org/10.82556/7cm6-bt62
- https://handle.test.datacite.org/10.82556/20e7-a615

This project implements a linear regression model to predict housing prices using the Ames Housing dataset. The script performs the following:
- Data Retrieval: Downloads datasets via the dbrepo Python package
- Preprocessing: Numeric + categorical imputation, one-hot encoding
- Training: Trains a LinearRegression model on the training set
- Evaluation: Validates and tests the model; computes RMSE
- Visualization: Plots absolute error and root squared error
- Upload: Saves model and plots, and uploads them to TUWRD via the Invenio API
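
The preprocessing, training and evaluation steps listed above could look roughly like the following sketch. It is not taken from main.py: the column names, the tiny synthetic data, the imputation strategies (median and most frequent) and the helper names build_model and rmse are assumptions chosen for illustration, built on scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_model(numeric_cols, categorical_cols):
    """Impute numeric and categorical features, one-hot encode, then regress."""
    numeric = SimpleImputer(strategy="median")  # assumed strategy
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # assumed strategy
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Tiny synthetic stand-in for the Ames splits (hypothetical columns and values).
train = pd.DataFrame({
    "area": [1000.0, 1500.0, np.nan, 2000.0, 1200.0, 1800.0],
    "zone": ["A", "B", "A", np.nan, "B", "A"],
    "price": [100000.0, 160000.0, 120000.0, 210000.0, 130000.0, 185000.0],
})
valid = pd.DataFrame({
    "area": [1100.0, 1700.0],
    "zone": ["A", "C"],  # "C" never appears in the training data
    "price": [110000.0, 175000.0],
})

features = ["area", "zone"]
model = build_model(["area"], ["zone"])
model.fit(train[features], train["price"])

pred = model.predict(valid[features])
abs_error = np.abs(valid["price"].to_numpy() - pred)  # per-sample absolute error
print("validation RMSE:", rmse(valid["price"], pred))
```

Wrapping the preprocessing and the regressor in one pipeline means the imputation values and the one-hot categories are learned from the training split only and then reused unchanged for validation and testing.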
The code comes as a Python script 'main.py' or as a notebook 'housing_prices_dataset_model.ipynb'.
For both, you first need to fill in some variables:
- DBRepo credentials in line 47: client = RestClient(endpoint="https://test.dbrepo.tuwien.ac.at", username="username", password="password")
- Invenio auth token in line 140: auth_token = "Invenio Token"

Additionally, if you do not wish to download the data from DBRepo, you can use the local datasets in the data folder by commenting and uncommenting the relevant lines at line 50.
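
If you would rather not type the DBRepo credentials and the Invenio token directly into the source, one option is to read them from environment variables. The sketch below is only an illustration and is not part of main.py: the function load_credentials and the variable names DBREPO_USERNAME, DBREPO_PASSWORD and INVENIO_TOKEN are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Collect the values the script needs from environment variables.

    The variable names are hypothetical and not defined by this repository.
    """
    required = ("DBREPO_USERNAME", "DBREPO_PASSWORD", "INVENIO_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "username": env["DBREPO_USERNAME"],
        "password": env["DBREPO_PASSWORD"],
        "auth_token": env["INVENIO_TOKEN"],
    }


# Demonstration with an explicit mapping instead of the real environment.
creds = load_credentials({
    "DBREPO_USERNAME": "username",
    "DBREPO_PASSWORD": "password",
    "INVENIO_TOKEN": "Invenio Token",
})

# The values could then replace the placeholders in the script, for example:
# client = RestClient(endpoint="https://test.dbrepo.tuwien.ac.at",
#                     username=creds["username"], password=creds["password"])
# auth_token = creds["auth_token"]
```

Calling load_credentials() without arguments reads from the real environment and raises a RuntimeError naming any variable that is not set.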
Install all dependencies and start the script with:

    pip install -r requirements.txt
    python main.py

To use the notebook instead, install and start Jupyter:

    pip install jupyter
    jupyter notebook

The MIT license attached only pertains to the source code itself, not to the produced model ames_housing_prices_model.pkl, the two graphs scatter-absolute_error and scatter_root_squared_error, or the attached dataset splits in the data folder.
These are all derivative works of the original Ames Housing dataset and do not have a license. They can, however, be reused in an educational context; see the page where the dataset was originally published: https://jse.amstat.org/jse_users.htm
Therefore, the same restriction applies to the datasets, the graphs and the model.