This project analyzes consumer buying behaviour across Mumbai's socioeconomic spectrum - examining whether people buy branded products for quality (smart spending) or for social status. The dataset covers 500 respondents across 5 income classes and 5 Mumbai regions, built from a Google Forms survey and synthetically augmented to ensure class-level representation.
The full pipeline covers data collection, preprocessing in Python, PostgreSQL schema design, and 22 structured SQL queries across 4 analysis areas.
| Tool | Purpose |
|---|---|
| Python (pandas) | Data cleaning, encoding, dummy variable creation |
| PostgreSQL 17 | Relational schema, data storage, analysis queries |
| pgAdmin 4 | Query execution and data import |
| Power BI | Dashboard and visualization (in progress) |
| Attribute | Detail |
|---|---|
| Total Respondents | 500 |
| Total Columns | 34 |
| Survey Tool | Google Forms |
| Geography | Mumbai — 5 regions |
| Income Classes | Poor | Lower Middle | Upper Middle | Rich |
| Source | Google Forms survey — original responses supplemented with simulated entries to ensure class-level representation across all 5 income segments |
Regions covered: South Mumbai, Western Suburbs, Central Suburbs, Navi Mumbai, Thane
Privacy Notice: The
Preprocessing steps:
- Encoded all categorical columns to numeric
- Fixed encoding bug in
branded_is_social_statuscolumn (0–3 → 1–4) - Derived
class_labelfrommonthly_income - Exploded two multi-select columns (
Branded products preference,Influential factors) into 11 binary dummy columns usingpandas.get_dummies() - Final dataset: 500 rows × 34 columns, fully numeric and query-ready
spend-smart-sql/
│
├── README.md
├── schema.sql -- CREATE TABLE script
├── analysis.sql -- All 22 analysis queries with findings
└── Dataset_for_Spend_Smart_final.csv -- Final cleaned dataset
- Dataset is nearly 50-50 gender split (Male 49.8%, Female 46.2%)
- Central Suburbs has the most respondents (26%) - dominated by Poor class
- Rich class is concentrated in South Mumbai (64 out of 91 South Mumbai respondents)
- Brand affinity rises with income: Rich (4.07) → Upper Middle (3.68) → Lower Middle (3.13) → Poor (3.08)
- South Mumbai scores highest brand affinity (4.05), Navi Mumbai lowest (3.26)
- Income and brand affinity have a 0.53 Pearson correlation - moderate and statistically meaningful
- Biggest insight: Poor class scores higher on brand influence awareness (3.63) than Lower Middle (3.49) — they see brand culture around them but cannot participate in it
- Rich class: quality belief (4.18) vs VFM belief (2.36) → gap of +1.83 - strongly brand-biased
- Poor class: quality belief (2.93) vs VFM belief (3.83) → gap of -0.91 - trust unbranded for value
- The gap flips sign exactly at the class boundary between Upper Middle and Lower Middle
- Overall: Smart Spender 36.2% | Combination 25.8% | Status-oriented 21.8% | Depends 16.2%
- 42.4% of Rich respondents are status-oriented vs only 5.7% of Lower Middle
- Status-oriented buyers have highest brand affinity (4.01) and lowest price consciousness (2.77)
- Smart Spenders are the exact opposite — brand affinity 3.18, price consciousness 3.84
- 56.9% of all status-oriented buyers live in South Mumbai or Western Suburbs
- Quality (356) and Price (328) dominate overall — Mumbai consumers are fundamentally rational
- Top factor by class: Rich → Brand name (169) | Poor → Price (164) | LMC → Price (83) | UMC → Quality (41)
- Rich class cites both Price AND Brand name together most (42 respondents, 36.5%) — they evaluate cost even while being brand-driven
- Price consciousness vs status belief gap nearly disappears for Rich (+0.03) — they are equally price and status aware
- Electronics is the most universally branded category - even Poor class at 65.9%
- Groceries is least brand-loyal across all classes - Rich only 26.1%, Poor just 7.3%
- Clothing separates classes most sharply - Rich 82.1% vs Poor 28.5%
- Personal care is surprisingly high for Poor (63.7%) - trust in branded health/hygiene products
GROUP BYwithCOUNT,SUM,AVG,ROUND- Window functions -
RANK() OVER (PARTITION BY ...),SUM() OVER () CASE WHENfor conditional aggregation and label decoding- CTEs (
WITH ... AS) for multi-step analysis CORR()for Pearson correlation coefficientUNION ALLfor unpivoting factor columns- Percentage calculations within and across groups
| # | Area | Queries |
|---|---|---|
| 1 | Demographics & Distribution | Q1 – Q4 |
| 2 | Brand Affinity Analysis | Q5 – Q8 |
| 3 | Buying Behaviour Analysis | Q9 – Q12 |
| 4 | Purchase Factor Analysis | Q13 – Q17 |
| 5 | Segmentation & Correlation Analysis | Q18 – Q22 |
- Data collection and preprocessing
- PostgreSQL schema design and data import
- 22 SQL analysis queries completed
- Power BI dashboard (in progress)
Rishi Agrawal B.Tech CSE (Data Science) — Shri Ramdeobaba College of Engineering and Management, Nagpur GitHub | LinkedIn