A hybrid recommendation engine combining Content-Based Filtering and Collaborative Filtering techniques to deliver personalized movie suggestions using the IMDB Movies Dataset.
- Overview
- Dataset
- Project Structure
- Methodology
- EDA Highlights
- Recommendation Engine
- Tech Stack
- Getting Started
- Results
CineMatch is a hybrid movie recommender system built on the IMDB Movies Dataset. It combines:
- Content-Based Filtering — recommends movies similar to a given title based on genres, directors, and cast using TF-IDF vectorization and cosine similarity.
- User-Based Collaborative Filtering — leverages user rating patterns to surface movies liked by similar users.
The hybrid approach offsets the weaknesses each method has on its own: content-only systems tend to recommend only titles that closely resemble the input, while collaborative systems struggle with cold-start cases, where a new movie or user has too few ratings to compare.
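To make the collaborative side concrete, here is a minimal sketch of user-based filtering. It is an illustration under stated assumptions, not the notebook's code: the ratings matrix, the user ids, and the `recommend_for_user` helper are all hypothetical, and a rating of 0 is taken to mean "not rated".

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: rows are users, columns are movies, 0 = not rated.
ratings = pd.DataFrame(
    [[5, 4, 0, 0],
     [5, 5, 4, 2],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=["u1", "u2", "u3", "u4"],
    columns=["Avatar", "Aliens", "Titanic", "Amelie"],
    dtype=float,
)

def recommend_for_user(user, ratings, top_n=2):
    """Score unseen movies by the similarity-weighted ratings of other users."""
    sim = pd.DataFrame(
        cosine_similarity(ratings.values),
        index=ratings.index,
        columns=ratings.index,
    )
    weights = sim.loc[user].drop(user)      # similarity to every other user
    others = ratings.drop(index=user)
    rated = (others > 0).astype(float)      # 1 where a user rated the movie
    # Weighted average over the users who actually rated each movie.
    numer = others.mul(weights, axis=0).sum()
    denom = rated.mul(weights, axis=0).sum().replace(0, np.nan)
    scores = (numer / denom).dropna()
    unseen = ratings.loc[user] == 0         # only suggest movies the user has not rated
    return scores[unseen[scores.index]].sort_values(ascending=False).head(top_n)

print(recommend_for_user("u1", ratings))
```

Because `u1` rates most like `u2`, the ranking leans on `u2`'s ratings for the movies `u1` has not seen.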
| Property | Detail |
|---|---|
| Source | Kaggle — IMDB Movie Ratings Dataset |
| File | movie_data.csv |
| Feature | Description |
|---|---|
| `director_name` | Director of the movie |
| `genres` | Pipe-separated genre tags (e.g. `Action\|Drama`) |
| `actor_1_name` | Primary actor |
| `actor_2_name` | Secondary actor |
| `actor_3_name` | Tertiary actor |
| `movie_title` | Official title |
| `imdb_score` | Rating on a 1–10 scale |
| `num_voted_users` | Number of IMDB user votes |
| `title_year` | Year of release |
| `duration` | Runtime in minutes |
| `language` | Primary language |
| `country` | Country of production |
├── Hybrid-Recommender-System.ipynb # Main notebook
├── movie_data.csv # Dataset
└── README.md
- Separated numeric and categorical features
- Imputed missing numeric values with median
- Imputed missing categorical values with `"Unknown"`
- Stripped invisible characters and whitespace from movie titles
- Validated data ranges and categorical consistency
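The cleaning steps above can be sketched as follows. This is one plausible way to do it in pandas, not the notebook's actual code: the three-row frame stands in for `movie_data.csv`, and treating the non-breaking space (`\xa0`) as the "invisible character" is an assumption.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for movie_data.csv so the sketch runs on its own.
imdb = pd.DataFrame({
    "movie_title": ["Avatar\xa0", "  Titanic ", "Aliens"],
    "director_name": ["James Cameron", None, "James Cameron"],
    "duration": [178.0, np.nan, 137.0],
    "imdb_score": [7.9, 7.7, np.nan],
})

# Separate numeric and categorical columns by dtype.
numeric_cols = imdb.select_dtypes(include="number").columns
categorical_cols = imdb.select_dtypes(exclude="number").columns

# Median for numeric gaps, the literal "Unknown" for categorical gaps.
imdb[numeric_cols] = imdb[numeric_cols].fillna(imdb[numeric_cols].median())
imdb[categorical_cols] = imdb[categorical_cols].fillna("Unknown")

# Replace non-breaking spaces (assumed culprit), then trim surrounding whitespace.
imdb["movie_title"] = (
    imdb["movie_title"].str.replace("\xa0", " ", regex=False).str.strip()
)

# Simple range check in the spirit of the validation step.
assert imdb["imdb_score"].between(1, 10).all()
```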
- Distribution of IMDB scores
- Average score and duration trends over time
- Genre popularity and average ratings by genre
- Most active actors by movie count
- Top 10 most-voted movies
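As an illustration of the genre analysis, the sketch below splits the pipe-separated `genres` column so each movie counts once per genre, then aggregates count and mean score. The four sample rows and the `genre_stats` name are hypothetical; only the column names come from the dataset table.

```python
import pandas as pd

# Hypothetical mini-sample using the column names from the dataset table.
imdb = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "genres": ["Action|Drama", "Drama", "Comedy|Drama", "Action|Comedy"],
    "imdb_score": [7.0, 8.0, 6.0, 5.0],
})

# One row per (movie, genre) pair; a single-character pattern is split literally.
by_genre = imdb.assign(genre=imdb["genres"].str.split("|")).explode("genre")

# Popularity (movie count) and average rating for each genre.
genre_stats = (
    by_genre.groupby("genre")["imdb_score"]
    .agg(movie_count="count", avg_score="mean")
    .sort_values("movie_count", ascending=False)
)
print(genre_stats)
```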
Combined text-based features into a single `features` column:

    imdb['features'] = (
        imdb['genres'] + " " +
        imdb['director_name'] + " " +
        imdb['actor_1_name'] + " " +
        imdb['actor_2_name'] + " " +
        imdb['actor_3_name']
    )

- Applied TF-IDF vectorization (`max_features=3000`, bigrams, `min_df=2`)
- Computed a cosine similarity matrix across all movies
- Retrieved top-N most similar movies for any given title
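The vectorization and similarity steps can be sketched with scikit-learn as shown here. The four-movie catalogue is a hypothetical sample, and the actor columns are left out to keep it short; the vectorizer settings mirror the ones listed above, with "bigrams" read as `ngram_range=(1, 2)`, which is an assumption about the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalogue; the real notebook reads movie_data.csv.
imdb = pd.DataFrame({
    "movie_title": ["Avatar", "Aliens", "Titanic", "Heat"],
    "genres": ["Action|Adventure|Sci-Fi", "Action|Sci-Fi", "Drama|Romance", "Crime|Drama"],
    "director_name": ["James Cameron", "James Cameron", "James Cameron", "Michael Mann"],
})
imdb["features"] = imdb["genres"] + " " + imdb["director_name"]

# Unigrams plus bigrams, vocabulary capped at 3000 terms,
# and terms kept only if they occur in at least two movies.
tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1, 2), min_df=2)
tfidf_matrix = tfidf.fit_transform(imdb["features"])

# Square matrix: cosine_sim[i, j] is the similarity between movies i and j.
cosine_sim = cosine_similarity(tfidf_matrix)
print(pd.DataFrame(cosine_sim, index=imdb["movie_title"], columns=imdb["movie_title"]).round(2))
```

The default tokenizer treats `|` as a separator, so the pipe-delimited genres need no extra preprocessing in this sketch.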
- 🎭 Drama is the most common genre, followed by Comedy and Thriller
- ⭐ IMDB scores are left-skewed: most movies cluster between 6.0 and 7.5, with a longer tail toward low scores
- 🎬 Robert De Niro and Nicolas Cage lead in movie appearances
- 📈 Average movie ratings slightly declined post-2000
- 🔗 Engagement metrics (votes, reviews) are strongly correlated with each other but only moderately with IMDB score — popularity ≠ quality
    get_recommendations("Avatar")

Output:

    Recommended movies similar to Avatar:
    1. Aliens
    2. Titanic
    3. Guardians of the Galaxy
    4. Star Trek Into Darkness
    ...
The function:
- Looks up the movie index
- Retrieves its cosine similarity scores against all other movies
- Returns the top 10 most similar titles (excluding itself)
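A minimal sketch of those three steps is shown below. It is not the notebook's implementation: the catalogue rows, the `title_to_idx` lookup, and the `top_n` parameter are assumptions made for illustration, and the sketch assumes movie titles are unique.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical four-movie catalogue with a pre-built "features" column.
imdb = pd.DataFrame({
    "movie_title": ["Avatar", "Aliens", "Titanic", "Heat"],
    "features": [
        "Action Adventure Sci-Fi James Cameron Sam Worthington Sigourney Weaver",
        "Action Sci-Fi James Cameron Sigourney Weaver Michael Biehn",
        "Drama Romance James Cameron Leonardo DiCaprio Kate Winslet",
        "Crime Drama Michael Mann Al Pacino Robert De Niro",
    ],
})

cosine_sim = cosine_similarity(TfidfVectorizer().fit_transform(imdb["features"]))

# Map each title to its row position for the lookup step (titles assumed unique).
title_to_idx = pd.Series(imdb.index, index=imdb["movie_title"])

def get_recommendations(title, top_n=10):
    """Return up to top_n titles most similar to `title`, excluding itself."""
    if title not in title_to_idx:
        raise KeyError(f"Unknown title: {title!r}")
    idx = title_to_idx[title]
    # Similarity of the chosen movie against every movie, labelled by title.
    scores = pd.Series(cosine_sim[idx], index=imdb["movie_title"])
    return scores.drop(title).sort_values(ascending=False).head(top_n).index.tolist()

print(get_recommendations("Avatar", top_n=3))
```

Raising `KeyError` for an unknown title is a design choice in this sketch; a notebook might prefer to print a message or suggest close matches instead.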
| Library | Purpose |
|---|---|
| `pandas` | Data loading, cleaning, transformation |
| `scikit-learn` | TF-IDF vectorization, cosine similarity |
| `matplotlib` | Plotting |
| `seaborn` | Statistical visualizations |
| Method | Strength |
|---|---|
| Content-Based | Strong at finding stylistically similar films (same director, cast, genre) |
| Collaborative | Captures user taste patterns beyond surface metadata |
| Hybrid | Balances both — broader coverage, better cold-start handling |
Have suggestions or want to collaborate? Feel free to open an issue or reach out!
Built with ❤️ using Python & IMDB data