This project uses descriptive file names so anyone can understand what each file does without opening it.
Format: action_what_purpose.py
Examples:
- `collect_google_places_api_data.py` (not `google_places_collector.py`)
- `knn_missing_data_imputation.py` (not `comprehensive_imputation.py`)
- `deduplicate_standardize_data.py` (not `data_cleaner.py`)
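The convention can be enforced mechanically. A minimal sketch of such a check, using a hypothetical regex (the exact pattern is an assumption, not part of the project):

```python
import re

# Hypothetical checker for the action_what_purpose.py convention:
# lowercase words separated by underscores, at least an action plus a subject.
NAMING_PATTERN = re.compile(r"^[a-z]+(_[a-z0-9]+)+\.py$")

def follows_convention(filename: str) -> bool:
    """Return True if filename matches the action_what_purpose.py pattern."""
    return bool(NAMING_PATTERN.match(filename))
```

For example, `follows_convention("collect_google_places_api_data.py")` passes, while a vague single-word name like `models.py` fails.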
The project follows a streamlined pipeline with only essential files.
| File Name | What It Does |
|---|---|
| `run_automated_data_collection_pipeline.py` | **Master orchestration script** - Runs the complete data collection, cleaning, and imputation pipeline |
| File Name | What It Does |
|---|---|
| `collect_google_places_api_data.py` | Collects clinic data from the Google Places API → writes to the Neon database |
| `collect_yelp_fusion_api_data.py` | Collects clinic data from the Yelp Fusion API → writes to the Neon database |
| `collect_google_trends_search_data.py` | Collects search demand trends from Google Trends → identifies which services are trending |
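The collectors' core job is flattening API responses into database-ready rows. A sketch of the parsing step a Places collector might use; the field names come from the Google Places Text Search response, but the row shape here is an illustrative assumption, not the project's actual schema:

```python
# Sketch of the response-parsing step a Places collector might use.
# Field names below come from the Google Places Text Search API response;
# the output row shape is an assumption, not the project's real schema.

def parse_places_response(payload: dict) -> list[dict]:
    """Flatten a Places API JSON payload into rows ready for insertion."""
    rows = []
    for result in payload.get("results", []):
        rows.append({
            "place_id": result.get("place_id"),
            "name": result.get("name"),
            "address": result.get("formatted_address"),
            "rating": result.get("rating"),          # may be absent for new listings
            "review_count": result.get("user_ratings_total"),
        })
    return rows
```

Keeping the parsing pure (no network calls inside) makes it easy to unit-test against canned payloads.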
| File Name | What It Does |
|---|---|
| `calculate_combined_metrics.py` | Calculates combined ratings (Google + Yelp), data sources, and quality scores |
| `deduplicate_standardize_data.py` | Finds duplicate clinics, merges them, and standardizes formats |
| `duplicate_clinic_detector_merger.py` | Core fuzzy-matching algorithm for duplicate detection (used by the deduplicate script) |
| `knn_missing_data_imputation.py` | Fills all missing data (ZIP codes, clinic types, ratings) using K-NN and multi-strategy imputation |
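The geographic part of the imputation is worth a small illustration. A minimal, pure-Python sketch of K-NN ZIP imputation (the real script may use a different distance metric and data shape; everything here is an assumption):

```python
import math
from collections import Counter

def impute_zip_knn(missing, known, k=3):
    """Assign the most common ZIP among the k geographically nearest known clinics.

    missing: (lat, lon) of the clinic lacking a ZIP
    known:   list of (lat, lon, zip_code) tuples where the ZIP is present
    """
    def dist(a, b):
        # Euclidean distance in degrees is an adequate proxy at city scale
        return math.hypot(a[0] - b[0], a[1] - b[1])

    nearest = sorted(known, key=lambda row: dist(missing, row[:2]))[:k]
    votes = Counter(row[2] for row in nearest)
    return votes.most_common(1)[0][0]
```

The majority vote makes the imputation robust to a single mislabeled neighbor, which matters when source data comes from two APIs with different address quality.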
| File Name | What It Does |
|---|---|
| `sqlalchemy_database_models.py` | SQLAlchemy ORM models defining the database schema (Clinic, Review, SearchTrend, etc.) |
| `initialize_create_database_tables.py` | Creates database tables and manages connections to Neon PostgreSQL |
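For orientation, a minimal sketch of the kind of declarative model these files define. The table and column names here are illustrative assumptions, not the project's actual schema:

```python
# Minimal sketch of an ORM model like those in sqlalchemy_database_models.py.
# Table and column names are illustrative assumptions.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Clinic(Base):
    __tablename__ = "clinics"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    address = Column(String)
    zip_code = Column(String(10))
    combined_rating = Column(Float)

# initialize_create_database_tables.py would point this at Neon PostgreSQL;
# an in-memory SQLite engine works for a local smoke test.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
```

Because the schema lives in one module, both the table-creation script and every collector share a single source of truth for column names and types.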
| File Name | Purpose |
|---|---|
| `README.md` | Complete project overview with business context, technical details, and pipeline flow |
| `CLAUDE.md` | Detailed technical documentation with architecture decisions and development log |
| `POWERBI_NEON_CONNECTION.md` | Step-by-step guide for connecting Power BI to Neon PostgreSQL |
| `FILE_NAMING_GUIDE.md` | This file: explains the naming convention and file structure |
```
┌─────────────────────────────────────────────────────────────────┐
│ run_automated_data_collection_pipeline.py (Orchestrator)        │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 1-3: Data Collection                                       │
│ • collect_google_places_api_data.py → Neon Database             │
│ • collect_yelp_fusion_api_data.py → Neon Database               │
│ • collect_google_trends_search_data.py → Neon Database          │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 4: Data Enrichment                                         │
│ • calculate_combined_metrics.py                                 │
│   - Combined ratings, data sources, quality scores              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 5: Data Cleaning & Deduplication                           │
│ • deduplicate_standardize_data.py                               │
│   - Uses: duplicate_clinic_detector_merger.py                   │
│   - Merges duplicates, standardizes formats                     │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ STEP 6: Missing Data Imputation                                 │
│ • knn_missing_data_imputation.py                                │
│   - ZIP codes (K-NN geographic)                                 │
│   - Clinic types (name/category inference)                      │
│   - Ratings (cross-platform proxy)                              │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Neon PostgreSQL Database (100% Complete Data)                   │
│ • Clinics, Reviews, SearchTrends tables populated               │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Power BI Dashboards (Direct Connection)                         │
│ • Market demand analysis using search trend data                │
└─────────────────────────────────────────────────────────────────┘
```
**Before:** Files like `view_data.py`, `quickstart.py`, and `models.py` required opening them to understand their purpose.

**After:** Files like `collect_google_places_api_data.py`, `knn_missing_data_imputation.py`, and `sqlalchemy_database_models.py` are self-documenting.

- ✅ **Clarity:** File names describe exactly what each script does
- ✅ **Onboarding:** New developers understand the structure immediately
- ✅ **Focused:** Only essential pipeline files, no clutter
- ✅ **Maintenance:** Easy to find and modify specific functionality
- ✅ **Searchability:** Descriptive names make finding files trivial
**Action prefixes:**
- `collect_` - Fetches data from external APIs
- `calculate_` - Computes derived metrics
- `deduplicate_` - Removes duplicate records
- `knn_` - Uses the K-Nearest Neighbors algorithm
- `run_` - Orchestrates multiple steps
- `initialize_` - Sets up initial state

**Subject components:**
- `_google_places_api_data` - Google Places API source
- `_yelp_fusion_api_data` - Yelp Fusion API source
- `_database_` - Database operations
- `_combined_metrics` - Aggregated calculations
- `_missing_data_` - Null/empty values
- `_clinic_` - Clinic records

**Technology markers:**
- `_knn_` - K-Nearest Neighbors algorithm
- `_sqlalchemy_` - SQLAlchemy ORM
- `_neon_` / `_postgresql_` - Neon PostgreSQL
Need to...

- Run the complete pipeline? → `python3 run_automated_data_collection_pipeline.py --full`
- Collect only Google Places data? → `python3 run_automated_data_collection_pipeline.py --google`
- Collect only Yelp data? → `python3 run_automated_data_collection_pipeline.py --yelp`
- Collect only Google Trends data? → `python3 run_automated_data_collection_pipeline.py --trends`
- Clean existing data? → `python3 run_automated_data_collection_pipeline.py --clean-only`
- Modify the database schema? → Edit `src/database/sqlalchemy_database_models.py`
- Change imputation logic? → Edit `src/utils/knn_missing_data_imputation.py`
- Adjust duplicate detection? → Edit `src/utils/duplicate_clinic_detector_merger.py`
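To make the duplicate-detection step concrete: a minimal sketch of fuzzy name matching using only the standard library. The real `duplicate_clinic_detector_merger.py` may use a different algorithm and threshold, so treat both as assumptions:

```python
# Sketch of fuzzy clinic-name matching for duplicate detection.
# The 0.85 threshold and the use of difflib are illustrative assumptions.
from difflib import SequenceMatcher

def likely_duplicates(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Flag two clinic names as probable duplicates when their normalized
    similarity ratio meets the threshold."""
    a, b = name_a.lower().strip(), name_b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Lowercasing and stripping whitespace first prevents trivial formatting differences between the Google and Yelp records from masking true duplicates.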
```
chicago-clinic-intelligence-system/
├── run_automated_data_collection_pipeline.py  ← START HERE
│
├── src/
│   ├── collectors/                 ← Data collection from APIs
│   │   ├── collect_google_places_api_data.py
│   │   ├── collect_yelp_fusion_api_data.py
│   │   └── collect_google_trends_search_data.py
│   │
│   ├── utils/                      ← Data processing & cleaning
│   │   ├── calculate_combined_metrics.py
│   │   ├── deduplicate_standardize_data.py
│   │   ├── duplicate_clinic_detector_merger.py
│   │   └── knn_missing_data_imputation.py
│   │
│   └── database/                   ← Database models & connections
│       ├── sqlalchemy_database_models.py
│       └── initialize_create_database_tables.py
│
├── data/                           ← Data storage (SQLite backup)
├── config/                         ← Configuration settings
├── docs/                           ← Additional documentation
│
└── Documentation Files
    ├── README.md
    ├── CLAUDE.md
    ├── POWERBI_NEON_CONNECTION.md
    ├── FILE_NAMING_GUIDE.md (this file)
    └── PROJECT_WALKTHROUGH_GUIDE.md
```
The following files were deleted as they were not essential to the core pipeline flow:
**One-Time/Testing Scripts:**
- `migrate_database_sqlite_to_postgresql.py` - One-time migration (already completed)
- `test_database_connection.py` - Testing utility
- `demo_quick_data_collection.py` - Demo script
**Viewing/Export Tools:**
- `inspect_database_records.py` - Database viewing (not part of the pipeline)
- `export_clean_data_for_powerbi.py` - CSV export (a direct Neon connection is used instead)
**Legacy/Replaced Scripts:**
- `zipcode_knn_geographic_imputation.py` - Replaced by the comprehensive imputation script
- `scheduler.py` - Not used in the current pipeline
**Obsolete Connectors:**
- `powerbi_direct_database_connector_script.py` - SQLite connector (now using Neon)
**Redundant Documentation:**
- `PROJECT_README.md` - Superseded by `README.md`
- `IMPLEMENTATION_SUMMARY.md` - Info consolidated into `README.md`
**Note:** `collect_google_trends_search_data.py` was temporarily removed but has been restored, as it provides critical search demand data for service-opportunity analysis.
**Last Updated:** 2026-01-26
**Maintained By:** Sourabh Rodagi