This runbook provides step-by-step operational procedures for managing the data pipeline, including routine operations, troubleshooting, and recovery procedures.
- Python 3.7 or higher
- Read access to data directories
- Write access to logs directory
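The prerequisites above can be checked programmatically before any run. A minimal sketch, assuming the `data/` and `logs/` directory layout used throughout this runbook; adjust the paths for your deployment.

```python
# Hypothetical pre-flight check for the prerequisites listed above.
import os
import sys

def check_prerequisites(data_dir="data", logs_dir="logs"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python 3.7+ required, found %s" % sys.version.split()[0])
    if not os.access(data_dir, os.R_OK):
        problems.append(f"no read access to {data_dir}/")
    if not os.access(logs_dir, os.W_OK):
        problems.append(f"no write access to {logs_dir}/")
    return problems

if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"WARNING: {problem}")
```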
```bash
# Check recent logs
tail -n 50 logs/migration/migration.log

# Check for errors
grep ERROR logs/migration/migration.log | tail -n 20

# Check processing state
python -m json.tool logs/migration/.migration_state.json
```

```bash
# Copy new files from raw to validated
python scripts/migration/migrate.py \
    --config scripts/migration/config.json

# Check results
echo "Check logs/migration/ for detailed results"
```

```bash
# Run validation on a directory
python scripts/validation/validate_funding.py \
    --directory data/transformed/validated \
    --schema config/schemas/funding_data_v1.json
```

Scenario: New funding data has been added to the raw zone
Steps:

- Verify data location:

  ```bash
  ls -lh data/raw/funding_sources/[source_name]/
  ```

- Run migration:

  ```bash
  python scripts/migration/migrate.py \
      --source data/raw/funding_sources/[source_name] \
      --dest data/transformed/validated/[source_name]
  ```

- Check logs:

  ```bash
  tail -f logs/migration/migration.log
  ```

- Verify output:

  ```bash
  ls -lh data/transformed/validated/[source_name]/
  ```
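Beyond eyeballing `ls` output, the output-verification step can be scripted. A sketch that reports source files with no counterpart in the destination; the directory names are placeholders, not paths guaranteed by the migration script.

```python
# Hypothetical post-migration check: confirm every source file has a
# counterpart in the destination directory.
from pathlib import Path

def missing_in_dest(source_dir, dest_dir):
    """Return source filenames that do not appear in the destination."""
    source_names = {p.name for p in Path(source_dir).glob("*") if p.is_file()}
    dest_names = {p.name for p in Path(dest_dir).glob("*") if p.is_file()}
    return sorted(source_names - dest_names)

if __name__ == "__main__":
    missing = missing_in_dest(
        "data/raw/funding_sources/example_source",   # placeholder source
        "data/transformed/validated/example_source", # placeholder destination
    )
    print(f"{len(missing)} file(s) not yet migrated")
```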
Scenario: Validation rules changed, need to reprocess

Steps:

- Clear processed state:

  ```bash
  # Backup current state
  cp logs/migration/.migration_state.json \
     logs/migration/.migration_state.json.bak

  # Clear processing state (this example clears ALL entries;
  # delete individual keys instead to be selective)
  python -c "
  import json
  with open('logs/migration/.migration_state.json', 'r') as f:
      state = json.load(f)
  state['processed_files'] = {}
  with open('logs/migration/.migration_state.json', 'w') as f:
      json.dump(state, f, indent=2)
  "
  ```

- Run migration again:

  ```bash
  python scripts/migration/migrate.py
  ```

- Verify reprocessing:

  ```bash
  grep "Processing file" logs/migration/migration.log | tail -n 20
  ```
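When only a few files need reprocessing, clearing the entire state forces everything through again. A more surgical sketch, assuming the `processed_files` key shown in the inline script above maps filenames to state entries:

```python
# Hypothetical helper: drop only the named entries from processed_files
# so just those files are picked up on the next migration run.
import json

def clear_from_state(state_path, filenames):
    """Remove the given filenames from processed_files; return the new state."""
    with open(state_path, "r") as f:
        state = json.load(f)
    for name in filenames:
        state.get("processed_files", {}).pop(name, None)
    with open(state_path, "w") as f:
        json.dump(state, f, indent=2)
    return state
```

Back up `.migration_state.json` first, exactly as in the step above, before editing it in place.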
Scenario: Need to archive superseded data

Steps:

- Identify files to archive:

  ```bash
  find data/raw/funding_sources -name "*2023*" -type f
  ```

- Move to archive:

  ```bash
  mkdir -p data/raw/_archive/2023
  mv data/raw/funding_sources/*/2023-* data/raw/_archive/2023/
  ```

- Document archival:

  ```bash
  echo "Archived 2023 data on $(date)" >> data/raw/_archive/ARCHIVE_LOG.txt
  ```
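The three steps above can be combined into one scripted operation so the move and the log entry never drift apart. A sketch; the pattern and paths are examples, and the function names are hypothetical:

```python
# Hypothetical scripted archival: move matching files and record the
# action in the archive log in a single pass.
import shutil
from datetime import date
from pathlib import Path

def archive_files(source_root, archive_dir, pattern, log_path):
    """Move files under source_root matching pattern into archive_dir."""
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(source_root).rglob(pattern)):
        if path.is_file():
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    with open(log_path, "a") as log:
        log.write(f"Archived {len(moved)} file(s) matching {pattern} on {date.today()}\n")
    return moved
```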
Scenario: Log files growing too large

Steps:

- Check log sizes:

  ```bash
  du -sh logs/*
  ```

- Archive old logs:

  ```bash
  tar -czf logs_archive_$(date +%Y%m%d).tar.gz logs/
  mv logs_archive_*.tar.gz ~/backups/
  ```

- Logs will auto-rotate, but can be manually trimmed if needed:

  ```bash
  # Keep only the last 100 lines of each log
  for log in logs/*/*.log; do
      tail -n 100 "$log" > "$log.tmp"
      mv "$log.tmp" "$log"
  done
  ```
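The same trimming can be done from Python, which makes it easy to reuse from other pipeline tooling. A sketch of the bash loop above as a helper function:

```python
# Hypothetical Python equivalent of the trim loop above: keep only the
# last `keep` lines of a log file.
from pathlib import Path

def trim_log(path, keep=100):
    """Truncate a log to its last `keep` lines; return the original line count."""
    lines = Path(path).read_text().splitlines(keepends=True)
    Path(path).write_text("".join(lines[-keep:]))
    return len(lines)

if __name__ == "__main__":
    for log in Path("logs").glob("*/*.log"):
        print(f"{log}: {trim_log(log)} lines before trim")
```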
Symptoms: Migration exits with validation errors

Diagnosis:

```bash
# Check validation logs
grep "Validation failed" logs/migration/migration.log

# Get details on the failed file
grep -A 10 "ERROR.*validation" logs/migration/migration.log
```

Resolution:

- Examine the failing file:

  ```bash
  python -m json.tool [failing_file]
  ```

- Check schema requirements:

  ```bash
  cat config/schemas/funding_data_v1.json
  ```

- Options:
  - Fix the source data in the raw zone (create a new version)
  - Update the schema if requirements changed
  - Add a data cleaning step
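Before digging into the full schema, a quick structural check often pinpoints the failure. A minimal sketch; the field names are illustrative stand-ins, not the actual `funding_data_v1` schema:

```python
# Hypothetical first-pass check: report which required fields a record
# is missing, before running full schema validation.
def missing_required_fields(record, required=("source", "amount", "date")):
    """Return the required field names absent from the record."""
    return [field for field in required if field not in record]

if __name__ == "__main__":
    example = {"source": "grants_db", "amount": 5000}  # illustrative record
    print("missing:", missing_required_fields(example))
```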
Symptoms: Migration completes but files are not in the destination

Diagnosis:

```bash
# Check if already processed
grep [filename] logs/migration/.migration_state.json

# Check for skip messages
grep "already processed" logs/migration/migration.log
```

Resolution:

- Files may already have been processed (idempotent behavior)
- If you need to reprocess, clear them from the state:

  ```bash
  # Edit .migration_state.json to remove the file's entry
  ```
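Instead of grepping the state file, the same diagnosis can be done with a small read-only helper, assuming the `processed_files` key shown earlier in this runbook:

```python
# Hypothetical read-only check: is this file already recorded in the
# migration state (and therefore skipped on the next run)?
import json

def is_processed(state_path, filename):
    """Return True if the filename appears in processed_files."""
    with open(state_path) as f:
        state = json.load(f)
    return filename in state.get("processed_files", {})
```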
Symptoms: Migration fails with "Insufficient disk space"

Diagnosis:

```bash
df -h
du -sh data/*
```

Resolution:

- Clean up old backups:

  ```bash
  rm -rf data/_backups/[old_date]
  ```

- Archive and compress old logs:

  ```bash
  tar -czf old_logs.tar.gz logs/
  rm -rf logs/*.log.*
  ```

- Move rarely accessed data to external storage if needed
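Disk pressure is easier to catch before a run fails than after. A sketch of a guard that mirrors the >20% free-space item in the pre-migration checklist later in this runbook:

```python
# Hypothetical free-space guard using the standard library.
import shutil

def free_space_fraction(path="."):
    """Return free space on the filesystem containing path, as a 0..1 fraction."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

if __name__ == "__main__":
    frac = free_space_fraction(".")
    print(f"Free space: {frac:.0%}")
    if frac < 0.20:
        print("ALERT: less than 20% disk free; clean up before migrating")
```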
Symptoms: "Checksum mismatch" errors in logs

Diagnosis:

```bash
grep "Checksum mismatch" logs/migration/migration.log
```

Resolution:

- This indicates file corruption during copy
- Check disk health:

  ```bash
  # On Linux
  sudo smartctl -a /dev/sda
  ```

- Retry the operation:

  ```bash
  # Clear state and rerun
  python scripts/migration/migrate.py
  ```

- If the problem persists, investigate the storage system
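To confirm a mismatch independently of the migration logs, hash the source and the copy yourself. A sketch using SHA-256 (the migration script's own checksum algorithm may differ):

```python
# Hypothetical manual verification: compare hashes of a source file and
# its copy to confirm or rule out corruption during the copy.
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(source, dest):
    return sha256_of(source) == sha256_of(dest)
```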
Symptoms: "Permission denied" errors

Diagnosis:

```bash
ls -la data/raw/
ls -la data/transformed/
```

Resolution:

```bash
# Fix permissions
chmod -R u+rw data/
chmod -R u+rw logs/

# If using a shared system, check group permissions
```

If transformed data is corrupted:
- Stop all processing
- Remove corrupted data:

  ```bash
  rm -rf data/transformed/validated/*
  rm -rf data/transformed/cleaned/*
  ```

- Clear processing state:

  ```bash
  rm logs/migration/.migration_state.json
  ```

- Reprocess from raw:

  ```bash
  python scripts/migration/migrate.py
  ```

- Verify checksums in logs
If raw data is corrupted:
- Raw data should never be corrupted (immutable)
- If it is, restore from source system
- Reingest from original source
- Document incident
Steps:

- Check for backups:

  ```bash
  ls -lh data/_backups/
  ```

- Restore from backup:

  ```bash
  cp -r data/_backups/[timestamp]/* data/transformed/validated/
  ```

- Restore processing state:

  ```bash
  cp logs/migration/.migration_state.json.bak \
     logs/migration/.migration_state.json
  ```

- Investigate the failure before retrying
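When several backups exist, picking the right `[timestamp]` by hand is error-prone. A sketch that selects the newest backup, assuming backup directory names are timestamps that sort chronologically (e.g. `20251001_1200`):

```python
# Hypothetical helper: find the most recent backup directory by name.
from pathlib import Path

def latest_backup(backup_root="data/_backups"):
    """Return the Path of the newest backup directory, or None if empty."""
    root = Path(backup_root)
    backups = sorted(p for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
    return backups[-1] if backups else None
```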
Daily Checks:
- Number of files processed
- Validation failure rate
- Processing time per file
- Disk space utilization
Weekly Checks:
- Log file sizes
- Backup integrity
- Schema version currency
- Processing performance trends
Create a monitoring script (example):

```bash
#!/bin/bash
# monitor_pipeline.sh

# Check for recent errors
ERROR_COUNT=$(grep -c ERROR logs/migration/migration.log)
if [ "$ERROR_COUNT" -gt 10 ]; then
    echo "ALERT: $ERROR_COUNT errors in migration log"
fi

# Check disk space
DISK_USAGE=$(df -h /home | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 80 ]; then
    echo "ALERT: Disk usage at ${DISK_USAGE}%"
fi

# Check if migration ran recently
LAST_RUN=$(stat -c %Y logs/migration/migration.log)
NOW=$(date +%s)
HOURS_SINCE=$(( (NOW - LAST_RUN) / 3600 ))
if [ "$HOURS_SINCE" -gt 24 ]; then
    echo "ALERT: No migration in $HOURS_SINCE hours"
fi
```

Before running migrations on production data:
- Backups are current
- Sufficient disk space (>20% free)
- No other migrations running
- Schema files are current
- Test with sample data first
- Review recent error logs
After completing migration:
- Check log file for errors
- Verify file counts match expectations
- Spot-check sample files
- Verify checksums logged correctly
- Confirm disk space still adequate
- Update documentation if needed
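The "spot-check sample files" item above can be automated. A sketch that pulls a random sample of output files and confirms each at least parses as JSON; the glob pattern and sample size are assumptions to tune:

```python
# Hypothetical post-migration spot check: sample output files and report
# any that fail to parse as JSON.
import json
import random
from pathlib import Path

def spot_check(directory, sample_size=5):
    """Return filenames from a random sample that are not valid JSON."""
    files = list(Path(directory).glob("*.json"))
    sample = random.sample(files, min(sample_size, len(files)))
    bad = []
    for path in sample:
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError:
            bad.append(path.name)
    return bad
```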
For issues not covered in this runbook:

- Check logs in the `logs/` directory
- Review documentation in `docs/`
- Examine source code in `scripts/`
- Open an issue on the GitHub repository
- v1.0 (2025-10): Initial runbook with migration procedures