Math Autocorrector: Accompanying Code and Data for the Paper "Automated Feedback Generation for Undergraduate Mathematics: Development and Evaluation of an AI Teaching Assistant"
The present repository contains code and supplementary data for the paper "Automated Feedback Generation for Undergraduate Mathematics: Development and Evaluation of an AI Teaching Assistant".
- Clone the Repository: Clone this repository to your local machine.
- Python Environment: Create a Python virtualenv with Python 3.10 or higher, and use
  pip install -r requirements.txt
  to install the requirements.
- Activate the Environment: If the environment has been set up using pip/pyenv, activate it with
  pyenv activate <environment-name>
- Environment Variables: Use a dotenv file to set OPENAI_API_KEY in your environment. You have to bring your own API key.
- Running the Program: Start the program by running math_tutor.py.
- User Interface: By default, the program launches a Gradio interface in your browser for submitting homework assignments. This interface can be used to test the system on freely chosen individual questions.
- Providing Inputs:
  - Enter homework text in the textbox.
  - Upload an image of the homework.
- Receiving Feedback: The program processes inputs and displays feedback in the output section.
By default, the program will launch a Gradio interface. The following command-line arguments adjust its behavior:
- --temperature (float): Sets the temperature for response generation (default: 0.0). This parameter matters mainly when a non-reasoning language model is used, which may be a reasonable choice for mathematically easier tasks.
- --config (str): Path to the configuration file, relative to the configs directory. If not specified, configs/systematic5.0.json is used, because this workflow performed best on the advanced test questions in the testing reported in the paper.
Example Usage:
python math_tutor.py --temperature 0.5 --config path/to/config.json
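As a rough illustration of how the two flags above could be parsed, here is a minimal sketch using argparse. It is not taken from math_tutor.py. The names `build_parser` and `resolve_config` are hypothetical, and the assumption that the config path is simply joined onto the configs directory comes only from the description above.

```python
import argparse
from pathlib import Path

CONFIG_ROOT = Path("configs")          # directory named in this README
DEFAULT_CONFIG = "systematic5.0.json"  # default named in this README


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI flags")
    parser.add_argument("--temperature", type=float, default=0.0,
                        help="Temperature for response generation")
    parser.add_argument("--config", type=str, default=DEFAULT_CONFIG,
                        help="Config file path, relative to the configs directory")
    return parser


def resolve_config(args: argparse.Namespace) -> Path:
    """Join the relative config path onto the configs directory (assumed behavior)."""
    return CONFIG_ROOT / args.config


# Parse an explicit argument list so the sketch runs without a real command line.
example_args = build_parser().parse_args(["--temperature", "0.5"])
print(example_args.temperature, resolve_config(example_args))
```

With no --config given, the sketch falls back to the default file inside the configs directory.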
To use the program in batch processing mode, use the following argument:
- --problem_dir (str): Path to the directory containing a .json file of questions, relative to the test directory.
This will make requests to the OpenAI API asynchronously by creating one thread per test case.
Batch processing mode, more precisely as invoked by the dedicated shell scripts described below, is used for the bulk of the evaluation work in our paper.
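The "one thread per test case" pattern mentioned above can be sketched as follows. This is an illustration only: `grade_case` is a hypothetical placeholder that stands in for the real API request, and the shape of a test case (a dict with `id` and `question`) is an assumption.

```python
import threading


def grade_case(case: dict) -> dict:
    """Hypothetical stand-in for one API request; returns a dummy result."""
    return {"id": case["id"], "feedback": f"feedback for {case['question']}"}


def run_batch(cases: list) -> list:
    """Start one thread per test case and collect results in input order."""
    results = [None] * len(cases)

    def worker(index: int, case: dict) -> None:
        # Each thread writes only to its own slot, so no lock is needed.
        results[index] = grade_case(case)

    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(cases)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


demo_cases = [{"id": i, "question": f"question {i}"} for i in range(3)]
print(run_batch(demo_cases))
```

Because the work is dominated by waiting on network responses, plain threads are usually sufficient here. For very large question files, a bounded pool would limit the number of simultaneous requests.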
Adjust the program's behavior via the config file, which sets parameters such as the instructions for the various processing steps and the task directives. Different configurations can be used for different tasks and are selected at startup via the --config command line parameter, as indicated in the previous section. This can drastically change behavior: for example, the program could be changed from an evaluator to a problem solver just by changing the configuration file. Various pre-set configuration files are provided in the configs directory.
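To make the idea concrete, the sketch below shows how swapping a configuration could turn the same code from an evaluator into a problem solver. The key names `task_directive` and `step_instructions` and the function `build_prompt` are illustrative assumptions; they are not the schema of the files in the configs directory.

```python
import json

# Hypothetical config contents; the key names are assumptions for illustration.
EVALUATOR_CONFIG = json.loads("""
{
  "task_directive": "Grade the student's solution and explain each deduction.",
  "step_instructions": [
    "Transcribe the submission.",
    "Compare it with a reference solution."
  ]
}
""")

# Same processing steps, different directive: the program now solves instead of grading.
SOLVER_CONFIG = {**EVALUATOR_CONFIG,
                 "task_directive": "Solve the problem step by step."}


def build_prompt(config: dict, homework_text: str) -> str:
    """Assemble a prompt from the directive and the numbered step instructions."""
    steps = "\n".join(f"{i}. {step}"
                      for i, step in enumerate(config["step_instructions"], start=1))
    return f"{config['task_directive']}\n{steps}\n\nSubmission:\n{homework_text}"


print(build_prompt(SOLVER_CONFIG, "Show that the sum of two even numbers is even."))
```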
Multiple scripts are provided for running comprehensive evaluations across different test datasets, models, and configurations.
Note that running a grid evaluation on a range of models and configurations can be expensive, and that cost tracking is much less useful in that setting than cost prediction. When running a large grid search or using our evaluation methodology on a large dataset, we recommend running a small-scale test first and retrieving the actual costs from the OpenAI dashboard. Thereafter, predict your costs to see if they fit your budget.
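The extrapolation suggested above is simple arithmetic. A minimal sketch follows, assuming that cost scales linearly with the number of test cases and that every configuration costs about as much as the one used in the pilot. The function name, the model names, and the numbers are all hypothetical.

```python
def predict_grid_cost(pilot_costs: dict, pilot_cases: int, total_cases: int,
                      n_configs: int, safety_factor: float = 1.0) -> float:
    """Extrapolate per-model pilot costs (USD, read off the dashboard) to a full grid.

    pilot_costs maps each model name to the cost of running pilot_cases
    test cases under a single configuration.
    """
    if pilot_cases <= 0:
        raise ValueError("pilot_cases must be positive")
    total = 0.0
    for cost in pilot_costs.values():
        per_case = cost / pilot_cases            # observed cost per test case
        total += per_case * total_cases * n_configs
    return total * safety_factor


# Made-up pilot figures for two hypothetical models.
estimate = predict_grid_cost({"model-a": 0.50, "model-b": 2.00},
                             pilot_cases=10, total_cases=200, n_configs=3)
print(f"Predicted grid cost: about {estimate:.2f} USD")
```

A safety factor above 1.0 leaves headroom for configurations or questions that produce longer responses than those in the pilot.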
Runs evaluations on the main test dataset with regrading across multiple models and configurations.
# Run
./run_regrading_comparison.sh
# Show help
./run_regrading_comparison.sh --help
Runs evaluations on the advanced test dataset with multiple models and configurations in parallel.
# Run
./run_advanced_test_comparison.sh
# Show help
./run_advanced_test_comparison.sh --help
Runs evaluations on the advanced test dataset with hints across multiple models and configurations. Supports parallel execution.
# Run with default settings
python run_advanced_test_with_hints_comparison.py
# Run with custom number of parallel workers
python run_advanced_test_with_hints_comparison.py --max-workers 8
# Show help
python run_advanced_test_with_hints_comparison.py --help
Runs evaluations on the ghosts_9jan dataset across multiple models and configurations. The dataset must be obtained separately from the GitHub repository of Frieder et al. and placed into the directory indicated in the code. Supports parallel execution.
# Run with default settings
python run_ghosts_comparison.py
# Run with custom number of parallel workers
python run_ghosts_comparison.py --max-workers 8
# Show help
python run_ghosts_comparison.py --help
Core evaluation script that computes statistical comparisons between AI-generated grades and ground truth grades. Calculates Pearson correlation, Kendall tau, Spearman rank correlation, percent agreement, and other metrics.
# Evaluate results from a directory
python evaluator.py --input_directory tests/test/output/gpt-5 --output_file results/gpt-5_eval.json
# Skip bootstrap confidence intervals (faster)
python evaluator.py --input_directory <dir> --output_file <file> --no_bootstrap
This script is called automatically by the evaluation comparison scripts, but can also be used independently.
Helper script to aggregate evaluation statistics from per-file results. Computes weighted averages of correlation metrics across all files for a model.
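The weighted averaging can be sketched as below. The weighting scheme is an assumption: each file is weighted by its number of graded items, stored under the hypothetical key `n`. The actual script may weight differently.

```python
def weighted_average(per_file_stats: list, stat_name: str) -> float:
    """Average one statistic across files, weighting each file by its item count."""
    total_weight = sum(entry["n"] for entry in per_file_stats)
    if total_weight == 0:
        raise ValueError("no graded items to aggregate")
    weighted_sum = sum(entry[stat_name] * entry["n"] for entry in per_file_stats)
    return weighted_sum / total_weight


# Made-up per-file results for illustration only.
files = [
    {"n": 10, "correlation": 0.8},
    {"n": 30, "correlation": 0.6},
]
print(weighted_average(files, "correlation"))
```

Weighting by item count prevents a small file with an extreme correlation from dominating the model-level figure.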
# Extract a specific statistic from evaluation results
python aggregate_eval_stats.py <json_file> <stat_name>
# Example: Get Pearson correlation
python aggregate_eval_stats.py results/evaluations/gpt-5_comparison.json correlation
# Available statistics: correlation, kendall_tau, spearman, percent_agreement, percent_close_match
Tracks token usage locally when TOKEN_USAGE_USERNAME is set in .env. Automatically logs API usage to CSV files in the usage/ directory during program execution.
# View cumulative token usage statistics
python token_usage.py
The script automatically tracks:
- Prompt tokens, completion tokens, and reasoning tokens (for reasoning models)
- Cost per API call based on model pricing
- Session-level cost tracking
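The bookkeeping listed above can be illustrated with a short sketch. It does not reproduce token_usage.py. The prices are placeholders rather than real pricing, the CSV columns are hypothetical, and reasoning tokens are assumed to be already included in the completion token count, so they are logged but not billed a second time.

```python
import csv
import io
from dataclasses import dataclass

# Placeholder prices in USD per one million tokens; NOT real pricing.
PRICES = {"example-model": {"prompt": 1.00, "completion": 4.00}}


@dataclass
class Usage:
    model: str
    prompt_tokens: int
    completion_tokens: int     # assumed to include any reasoning tokens
    reasoning_tokens: int = 0  # logged for information only


def call_cost(usage: Usage) -> float:
    """Cost of a single API call under the placeholder price table."""
    price = PRICES[usage.model]
    return (usage.prompt_tokens * price["prompt"]
            + usage.completion_tokens * price["completion"]) / 1_000_000


def log_usage(calls: list, handle) -> float:
    """Write one CSV row per call and return the session-level total cost."""
    writer = csv.writer(handle)
    writer.writerow(["model", "prompt_tokens", "completion_tokens",
                     "reasoning_tokens", "cost_usd"])
    session_total = 0.0
    for usage in calls:
        cost = call_cost(usage)
        session_total += cost
        writer.writerow([usage.model, usage.prompt_tokens,
                         usage.completion_tokens, usage.reasoning_tokens,
                         f"{cost:.6f}"])
    return session_total


buffer = io.StringIO()  # stands in for a CSV file in the usage/ directory
total = log_usage([Usage("example-model", 1000, 500, 200)], buffer)
print(buffer.getvalue(), f"session total: {total:.6f} USD")
```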
The program creates temporary files under the cache and usage subdirectories, which may be freely deleted.