Repository for final project allocation and submission for MM62201 : Introduction to Programming offered in Autumn 2025 at IIT Kharagpur taught by Prof Subhamoy Mandal. This repository might be updated with new projects and/or changes to existing projects. Please check back regularly.
Final projects are approved by Prof Subhamoy Mandal.
- Final Projects for MM62201 : Introduction to Programming
- Project 1 : Medical Transcription Analysis
- Project 2 : Agriculture Crop Production Analysis
- Project 3 : Medical Image Visualization and Analysis
- Project 4 : Impact of Soil Quality on Crop Growth Analysis
- Project 5 : Plant Disease Analysis Using Leaf Image Data
- Project 6 : PCOS Data Analysis and Visualization
- Project 7 : Mushroom Edibility Classification Using Data Analysis and Visualization
- Project 8 : Predicting Depression Risk and Recovery Using Clinical and Mindfulness Data
- Project 9 : Monitoring Glucose, Heart Rate, and Activity Patterns in Personalized Nutrition
- The project is to be done in groups of 2 students. The students are expected to work collaboratively.
- The choice of programming language is Python.
- Each group will be assigned a mentor TA who will be responsible for guiding the group throughout the project.
- Meetings with the mentor TA will be scheduled at the beginning of the project and at regular intervals.
- Each student will be evaluated based on their contribution to the project. Make sure you are contributing equally to the project.
- Code plagiarism will not be tolerated. Any submission found to be plagiarized will be awarded a zero grade.
- Late submissions will not be accepted.
- Pre-requisite: For all of the projects mentioned below, go through the dataset thoroughly before attempting to solve them.
- The final project evaluation is based on the following criteria:
  - Continuous Evaluation (CE): 40%
  - Code Quality and Documentation: 20%
  - Final Submission and Report: 40%
- Continuous Evaluation (CE): 40%. The CE will be based on the following criteria:
  - Your participation in the weekly meetings with your mentor TA.
  - Your weekly progress and updates on the project.
- Code Quality and Documentation: 20%. This will be based on the following criteria:
  - Code Quality: 10% (based on the code quality and readability)
  - Documentation: 10% (based on the documentation of the code and the project)
- Final Submission and Report: 40%. This will be based on the following criteria:
  - Final Submission: 20% (based on the final submission of the project)
  - Final Report: 20% (based on the final report of the project)
- CE will be evaluated only if you have attended at least 75% of the weekly meetings with your mentor TA.
- Fork the github.com/manalir66/Final-Projects-for-MM62201 repository.
- Clone the forked repository to your local machine using the following command: `git clone https://github.com/{your_username}/Final-Projects-for-MM62201`
- Your projects are in the `submissions` directory. You can find the project description in the README.md file of the respective project directory.
- Work on the project, make regular commits to your local repository, and push them to your forked repository.
- Your mentor TA will review your code and provide feedback.
- You have to submit the following:
- Final Code: The final code of your project in the respective project directory.
  - Code should be highly readable and well documented.
  - Try to write efficient and clean code and avoid unnecessary code.
- Final Report: The final report of your project in the respective project directory. The report should be a markdown file named `report.md` and should contain the following:
  - Introduction: A brief introduction of the project.
  - Data: A brief description of the data used in the project.
  - Questions & Answers: The questions and their respective answers. Also include the code snippets used to answer the questions and who solved each question.
  - References: The references used in the project.
- Submission of the final project will be done via GitHub Pull Requests.
- Once you are done with the project, create a Pull Request to the `main` branch of the github.com/manalir66/Final-Projects-for-MM62201 repository.
- We will review your pull request and provide feedback. You can make changes to your code and update the pull request. If accepted, your project will be merged into the `main` branch of the github.com/manalir66/Final-Projects-for-MM62201 repository.
- That's it! Congratulations!! You have successfully submitted your final project.
The deadline for the final project submission is **28th November 2025, 23:59 IST**.
| Students | Project | Mentor TA |
|---|---|---|
| | Project 1 : Medical Transcription Analysis | |
| | Project 2 : Agriculture Crop Production Analysis | |
| | Project 3 : Medical Image Visualization and Analysis | |
| | Project 4 : Impact of Soil Quality on Crop Growth Analysis | |
| | Project 5 : Plant Disease Analysis Using Leaf Image Data | |
| | Project 6 : PCOS Data Analysis and Visualization | |
| | Project 7 : Mushroom Edibility Classification Using Data Analysis and Visualization | |
| | Project 8 : Predicting Depression Risk and Recovery Using Clinical and Mindfulness Data | |
| | Project 9 : Exploring Glucose, Heart Rate, and Activity Patterns in Personalized Nutrition | |
- The project aims to analyse the medical transcription dataset. The dataset is located at `data/medical_transcriptions/mtsamples.csv`.
- The dataset is a csv file. CSV stands for Comma Separated Values. It is a simple file format used to store tabular data, such as a spreadsheet or database. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
- The dataset contains the following fields:
  - `description`: Short brief of the interaction between the patient and the doctor.
  - `medical_specialty`: Medical specialty of the issue discussed in the transcription.
  - `sample_name`: Medical samples used for the diagnosis.
  - `transcription`: Full transcription of the interaction between the patient and the doctor.
  - `keywords`: Keywords of the transcription.
- The project can be divided into sub-areas as follows:
- Data Preprocessing
  - Write functions to read the csv file. Suggestion: use the `pandas` library.
  - This dataset needs a bit of pre-processing. The `medical_specialty` field contains multiple values. You need to split the values and create a list of values. For example, if the `medical_specialty` field contains `Orthopedics, Neurology`, then you need to split it into `['Orthopedics', 'Neurology']`.
  - The `keywords` field contains multiple values. You need to split the values and transform it into a list of values. For example, if the `keywords` field contains `'pain, headache, migraine'`, then you need to split it into `['pain', 'headache', 'migraine']`.
  - Look into the dataset and find out if there are any other fields that need to be pre-processed.
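The splitting described above can be sketched as follows. This is a minimal example: it uses a toy DataFrame in place of the real `mtsamples.csv`, and assumes the column names listed earlier.

```python
import pandas as pd

def split_multivalue(df, column):
    """Split a comma-separated string column into lists of stripped values."""
    return df[column].fillna("").apply(
        lambda s: [v.strip() for v in s.split(",") if v.strip()]
    )

# Toy rows standing in for data/medical_transcriptions/mtsamples.csv
df = pd.DataFrame({
    "medical_specialty": ["Orthopedics, Neurology", "Cardiology"],
    "keywords": ["pain, headache, migraine", None],
})
df["medical_specialty"] = split_multivalue(df, "medical_specialty")
df["keywords"] = split_multivalue(df, "keywords")
print(df["medical_specialty"].iloc[0])  # ['Orthopedics', 'Neurology']
print(df["keywords"].iloc[1])           # []
```

Note that `fillna("")` turns missing entries into empty lists rather than raising on `None`.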
- Data Analysis
  - In this part, prepare a set of at least 10 questions and answer them using the dataset.
  - Some example questions to get you started:
    - What is the most common medical specialty?
    - What is the most common medical sample?
    - What is the most common keyword?
    - What is the average length of the transcription?
    - What is the average length of the description?
    - What is the average length of the keywords?
    - And so on... Get creative and come up with your own questions.
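Questions like these usually reduce to one or two pandas operations. A sketch on toy data (the column names come from the field list above):

```python
import pandas as pd

# Toy rows standing in for the real dataset
df = pd.DataFrame({
    "medical_specialty": ["Neurology", "Orthopedics", "Neurology"],
    "transcription": ["short note", "a somewhat longer note", "note"],
})

# Most common medical specialty
most_common = df["medical_specialty"].value_counts().idxmax()

# Average transcription length (in characters)
avg_len = df["transcription"].str.len().mean()

print(most_common, avg_len)  # Neurology 12.0
```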
- Data Visualization
  - In this part you can make use of the `matplotlib` and `seaborn` libraries to visualize the answers to the questions you asked in the previous part.
  - Everyone likes to see results in the form of graphs and charts, so make sure you visualize the answers to the questions you asked in the previous part.
- This project aims to analyse the crop production data from 2006 to 2011 from all the states of India. The dataset is located in the `data/crop_production/` directory.
- The data directory contains 5 csv files. Go through the data files and understand the data.
- Different data files contain different types of data. For example, `datafile_1.csv` contains the following fields:
  - `Crop`: Name of the crop
  - `State`: Name of the state
  - `Cost of Cultivation (/Hectare) A2+FL`: Cost of cultivation per hectare
  - `Cost of Cultivation (/Hectare) C2`: Cost of cultivation per hectare
  - `Cost of Production (/Quintal) C2`: Cost of production per quintal
  - `Yield (Quintal/ Hectare)`: Yield per hectare
- `datafile_2.csv` contains the following fields:
  - `Crop`: Name of the crop
  - `Production (YYYY - YY)`: Production of the crop between two consecutive years
  - `Area (YYYY - YY)`: Area of the crop between two consecutive years
  - `Yield (YYYY - YY)`: Yield of the crop between two consecutive years
- Go through the data files and understand the data. You can use the `pandas` library to read the csv files and perform analysis on the data.
- The data files are not clean. You need to clean the data before you start analysing it.
- The project can be divided into the following parts:
- Data Processing
  - Write functions for reading the data files.
  - Once you have read the data files, you need to clean the data. You can use the `pandas` library to clean the data.
  - Only keep the data which is relevant to the analysis and drop the rest of the data.
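A minimal cleaning sketch; the toy frame mirrors the `datafile_1.csv` fields described above, but the real files may need different column selections.

```python
import pandas as pd

def clean_crop_data(df, keep_columns):
    """Keep only the columns relevant to the analysis and drop incomplete rows."""
    return df[keep_columns].dropna().reset_index(drop=True)

# Toy frame mimicking datafile_1.csv
raw = pd.DataFrame({
    "Crop": ["Rice", "Wheat", None],
    "State": ["West Bengal", "Uttar Pradesh", "Maharashtra"],
    "Yield (Quintal/ Hectare)": [40.5, None, 30.0],
})
clean = clean_crop_data(raw, ["Crop", "Yield (Quintal/ Hectare)"])
print(len(clean))  # 1
```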
- Data Analysis
  - In this part, you need to prepare a set of questions and answer them using the data provided.
  - Answer at least 15 questions using the data provided.
  - A few example questions to get you started are as follows:
    - Which crop has the highest production in the country?
    - What are the major states where rice is grown?
    - What is the average cost of cultivation of rice in the country?
    - What are the seasons in which Sunflower is grown? (data available in `datafile_5.csv`)
    - What is the average crop duration for Paddy, Wheat and Maize?
  - You can come up with your own questions and answer them using the data provided.
- Data Visualization
  - Visualize the data using the `matplotlib` or `seaborn` library.
  - Visualizing the data will help you understand the data better and answer the questions.
- The project aims to read, visualize and analyze medical images. The dataset is located in the `data/medical_images/` directory.
- The dataset contains medical images of MRI and CT scans for different anatomical parts of the body. It also contains the segmentation masks for the images.
- The dataset has:
  - Hippocampus MRI images and segmentation masks
  - Heart MRI images and segmentation masks
  - Prostate MRI images and segmentation masks
  - Abdomen CT images and segmentation masks
- These scans are used to diagnose the diseases of the body. The segmentation masks are used to identify the different parts of the body in the images.
- Scans are in NIFTI format. NIFTI is a standard format for storing medical images.
- All the scans are 3D volumes. Each 3D volume is a stack of 2D images. Each 2D image is called a slice.
- Your first task is to read the images and visualize them. You can use the `nibabel` library to read the images.
- Visualizing the images is important to understand the data. You can use the `matplotlib` library to visualize the images. Visualization can be done in multiple ways:
  - Visualize the slices of the images and the segmentation masks.
  - Visualize the 3D volumes of the images and the segmentation masks.
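Reading and slicing a volume might look like the sketch below. The file name in the comment is hypothetical, the toy array stands in for a real scan, and the plotting lines are left as comments since they need an actual file.

```python
import numpy as np

def load_volume(path):
    """Load a NIFTI scan into a numpy array (requires the nibabel package)."""
    import nibabel as nib
    return nib.load(path).get_fdata()

def middle_slice(volume, axis=2):
    """Return the central 2D slice of a 3D volume along the given axis."""
    return np.take(volume, volume.shape[axis] // 2, axis=axis)

# Toy volume standing in for a real scan from data/medical_images/
volume = np.zeros((4, 4, 6))
print(middle_slice(volume).shape)  # (4, 4)

# With a real file you might do (hypothetical file name):
# vol = load_volume("data/medical_images/hippocampus_001.nii.gz")
# import matplotlib.pyplot as plt
# plt.imshow(middle_slice(vol), cmap="gray"); plt.show()
```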
- The next task is to analyze the images. You can use the `numpy` library to analyze the images. The analysis part is open ended.
- You can perform simple statistical analysis on the images. You can also perform more complex analysis like image segmentation and image classification.
- Statistical analysis may include the following:
  - Calculate the mean, median, standard deviation, minimum and maximum for the whole image and the segmented image.
  - Now compare the statistics of the segmented image with the whole image. What do you observe?
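Comparing whole-image statistics against the masked region comes down to boolean indexing in numpy; a small sketch on synthetic data:

```python
import numpy as np

def region_stats(volume, mask):
    """Basic statistics for the whole volume and for the masked region only."""
    region = volume[mask > 0]
    return {
        "whole_mean": float(volume.mean()),
        "region_mean": float(region.mean()),
        "region_std": float(region.std()),
    }

# Toy data: a bright 'structure' inside an otherwise dark volume
volume = np.zeros((4, 4, 4))
mask = np.zeros_like(volume)
volume[1:3, 1:3, 1:3] = 100.0
mask[1:3, 1:3, 1:3] = 1
stats = region_stats(volume, mask)
print(stats["whole_mean"], stats["region_mean"])  # 12.5 100.0
```

The gap between the whole-volume mean and the region mean is exactly the kind of difference the comparison above asks you to look for.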
- Complex analysis may include the following:
  - Perform image segmentation on the images. You can use the `scikit-image` library to perform image segmentation.
  - Perform image classification on the images. You can use the `scikit-learn` library to perform image classification.
  - You can also perform image registration on the images. You can use the `SimpleITK` library to perform image registration.
- Try statistical analysis first and then move on to more complex analysis. Although we do not expect you to perform complex analysis, you can try it if you want to.
- Remember, the analysis part is open ended. You can come up with your own analysis ideas and implement them.
- This project aims to explore how soil quality (nutrients, pH) and weather conditions (temperature, humidity, rainfall) influence different crops. The dataset is located in the `data/crop_growth/` directory.
- By analyzing this data, you will identify patterns and insights that can help to classify the type of crop grown in different regions based on these soil and weather factors.
- Go through the data files and understand the data. You can use the `pandas` library to read the csv files and perform analysis on the data.
- The project can be divided into the following parts:
- Data Preprocessing
  - Handling Missing Values: a) Check for missing values in soil nutrient data (nitrogen, phosphorus, potassium), pH, temperature, humidity, and rainfall. b) Use imputation techniques like mean or median imputation for continuous variables such as nutrient levels, pH, and weather data.
  - Feature Scaling: a) Normalize the soil nutrient levels, pH, temperature, humidity, and rainfall to bring them onto a comparable scale. This is important for analysis and visualization purposes.
  - Encoding Categorical Data: a) Convert the categorical feature crop label into numerical form using one-hot encoding or label encoding, so that it can be included in the analysis.
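All three steps can be done with pandas alone, as sketched below; scikit-learn's `SimpleImputer`, `MinMaxScaler`, and `OneHotEncoder` provide the same operations. The column names here are illustrative, not the actual dataset schema.

```python
import pandas as pd

# Toy frame; real column names in data/crop_growth/ may differ
df = pd.DataFrame({
    "nitrogen": [90.0, None, 60.0],
    "ph": [6.5, 7.0, None],
    "label": ["rice", "maize", "rice"],
})

continuous = ["nitrogen", "ph"]

# a) Mean imputation for missing continuous values
for col in continuous:
    df[col] = df[col].fillna(df[col].mean())

# b) Min-max scaling onto [0, 1]
for col in continuous:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)

# c) One-hot encoding of the crop label
df = pd.get_dummies(df, columns=["label"])
print(sorted(df.columns))  # ['label_maize', 'label_rice', 'nitrogen', 'ph']
```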
- Data Analysis
  - Descriptive Statistics: a) Compute summary statistics (mean, median, min, max) for all continuous features (soil nutrients, pH, temperature, humidity, rainfall) to understand the distribution of these variables for different crop types.
  - Correlation Analysis: a) Examine the correlation between features (soil nutrients, pH, temperature, humidity, rainfall) to understand how soil and weather conditions relate to each other. Use correlation matrices to visualize these relationships.
- Data Visualization
  - Feature Distributions: a) Use histograms or box plots to visualize the distribution of key features (nitrogen, phosphorus, potassium, pH, temperature, humidity, rainfall) across different crop types. This will help in understanding which factors are more prevalent for certain crops.
  - Pair Plots: a) Generate pair plots to visualize relationships between different features for each crop type. Pair plots can show how soil nutrients and weather conditions are distributed across crops and how they might interact with each other.
  - Heatmap for Correlation: a) Create a heatmap to show the correlation between features (soil nutrients, pH, temperature, humidity, and rainfall). This will highlight strong correlations between variables that may influence crop type classification.
  - Bar Charts for Crop Type Distribution: a) Use bar charts to visualize the distribution of the crop type across different ranges of soil and weather features. For example, you can create bar charts showing the count of each crop type at various pH levels or nitrogen concentrations.
  - Scatter Plots: a) Plot scatter plots with soil nutrients (e.g., nitrogen vs. phosphorus) on the axes, color-coded by crop type, to visually assess which crops prefer specific soil nutrient combinations.
- You can use the `pandas` library for handling missing data, imputation, encoding, and descriptive statistics.
- For scaling features and encoding categorical data, you can use the `scikit-learn` library.
- The `seaborn` and `matplotlib` libraries are used for creating visualizations like pair plots, heatmaps, bar charts, scatter plots, and histograms.
- You can try to develop a classification model to predict crop types based on soil and weather features and evaluate its performance. Although we do not expect you to build the classification model, you can try it if you want to.
- The aim of this project is to identify plant diseases from images of leaves using image processing and machine learning techniques. The dataset is located in the `data/plant_disease/` directory.
- By analyzing the key visual differences between healthy and diseased plant leaves, the project aims to automate disease detection, which can help farmers and agricultural experts identify problems early and take action to improve crop health.
- This project can be divided into the following parts:
- Image Processing
  - Image Resizing: Resize all images to a uniform dimension (e.g., 128x128 or 256x256 pixels) to ensure consistent input size across the dataset. This step ensures that the images are compatible with deep learning models, which require fixed input dimensions. You can use `OpenCV` or `PIL` (Python Imaging Library) to resize the images.
  - Normalization: Scale pixel values from a range of 0-255 to 0-1. This ensures that the features (pixel values) are on a comparable scale, which aids in faster and better model convergence during training.
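Normalization is a one-liner in numpy; resizing with Pillow is shown only as a comment since it needs an actual image file (the path below is hypothetical):

```python
import numpy as np

def normalize(image):
    """Scale uint8 pixel values from [0, 255] to floats in [0, 1]."""
    return image.astype(np.float32) / 255.0

# Resizing with Pillow would look like (hypothetical file name):
# from PIL import Image
# img = Image.open("data/plant_disease/sample_leaf.jpg").resize((128, 128))
# arr = normalize(np.asarray(img))

img = np.array([[0, 255], [128, 64]], dtype=np.uint8)
arr = normalize(img)
print(arr.min(), arr.max())  # 0.0 1.0
```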
- Data Analysis
  - Statistical Analysis: Perform descriptive statistical analysis of the dataset to explore the distribution of images across different disease categories. This includes computing the number of samples per disease class and assessing whether there are any class imbalances (i.e., some diseases having significantly more samples than others).
  - Pixel Intensity Analysis: Examine the pixel intensity values of the images to identify key differences between healthy and diseased leaves. This analysis involves comparing pixel distributions (e.g., histograms of pixel values) for each class to see if certain patterns (such as darker or lighter regions) are characteristic of diseases.
  - You can use `pandas` to analyze class distributions and check for imbalances.
- Data Visualization
  - Image Grid Visualization: Create grids of sample images for each disease class to visually assess the variations within and across the classes. This helps in identifying visual patterns, such as texture or color differences, that may be indicative of disease. To create image grids, you can use the `matplotlib` library.
  - Dimensionality Reduction (PCA or t-SNE): Apply dimensionality reduction techniques such as PCA (Principal Component Analysis) or t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce the high-dimensional image data to 2D or 3D space. This allows for visualization of the relationships between images and how different diseases cluster together. You can use the `scikit-learn` library to perform PCA or t-SNE for feature reduction.
- Remember, the analysis part is open-ended. You can come up with your own analysis ideas and implement them.
- You can try to develop a deep learning model for plant disease detection and evaluate its performance. Although we do not expect you to build the detection model, you can try it if you want to.
- The aim of this project is to analyze data related to Polycystic Ovary Syndrome (PCOS), a hormonal disorder affecting women of reproductive age.
- The analysis will explore the relationship between various clinical, demographic, and lifestyle factors with the occurrence of PCOS.
- This project also aims to identify patterns that may help in early detection and understanding of contributing factors for PCOS.
- The dataset is located in the `data/pcos_data/` directory. The dataset may include clinical and demographic features such as:
  - Clinical Features: BMI, blood pressure, FSH/LH ratio, menstrual cycle length, etc.
  - Demographic Features: Age, marital status.
  - Lifestyle Factors: Fast food (Yes/No), regular exercise habits.
  - Labels: PCOS (Yes/No).
- This project is divided into the following parts:
- Data Preprocessing
  - Handling Missing Values: Identify and handle any missing or NaN values in the dataset using imputation or removal methods. You can use the `pandas` and `scikit-learn` libraries.
  - Data Balancing: If there is a class imbalance (e.g., significantly more non-PCOS cases), apply techniques like SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes using the `imblearn.over_sampling` module.
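As a baseline for comparison, the sketch below balances classes by naive random oversampling with pandas; SMOTE (via `imblearn.over_sampling.SMOTE().fit_resample(X, y)`) instead synthesizes new minority samples rather than duplicating existing ones. The column names here are illustrative.

```python
import pandas as pd

def oversample_minority(df, label_col):
    """Naive random oversampling of the minority class (a baseline;
    SMOTE synthesizes new samples instead of duplicating rows)."""
    counts = df[label_col].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    need = counts[majority] - counts[minority]
    extra = df[df[label_col] == minority].sample(need, replace=True, random_state=0)
    return pd.concat([df, extra], ignore_index=True)

# Toy imbalanced frame (real label: PCOS Yes/No)
df = pd.DataFrame({
    "bmi": [20, 22, 30, 31, 33, 35],
    "pcos": ["Y", "Y", "N", "N", "N", "N"],
})
balanced = oversample_minority(df, "pcos")
print(balanced["pcos"].value_counts().tolist())  # [4, 4]
```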
- Data Analysis
  - Descriptive Analysis: Compute summary statistics (mean, median, standard deviation) for clinical features such as BMI and blood glucose.
  - Correlation Analysis: Assess relationships between features and their influence on PCOS. Use correlation matrices to visualize these relationships. You can use the `seaborn` library for visualization.
  - Statistical Testing: Use hypothesis testing (e.g., Chi-Square Test, ANOVA) to examine the association between categorical variables (e.g., Pregnant) and the presence of PCOS. The `scipy.stats` module can be used for this testing.
- Data Visualization
  - Feature Distributions: Visualize the distribution of key features like BMI and blood pressure across PCOS and non-PCOS groups using histograms or box plots.
  - Clustering Analysis: Use clustering techniques like KMeans or hierarchical clustering to identify patterns or groups within the dataset. You can use `sklearn.cluster` for KMeans clustering.
  - Feature Importance Visualization: Use a machine learning model like a Decision Tree or Random Forest to identify the most important features contributing to PCOS and visualize them using bar plots, using the `scikit-learn` library.
- Remember, the analysis part is open-ended. You can come up with your own analysis ideas and implement them.
- The goal of this project is to analyze a dataset containing descriptions of various mushroom species, focusing on the identification of features that distinguish between edible and poisonous mushrooms. Find the dataset at - https://archive.ics.uci.edu/dataset/73/mushroom
- The project will involve handling missing values, performing descriptive and correlation analyses, and generating insightful visualizations to understand which characteristics are most associated with edibility or toxicity.
- This project is divided into the following parts:
- Data Preprocessing:
  - Handling Missing Values: Identify and treat missing values by using imputation techniques such as mode imputation, given the categorical nature of the dataset.
  - Encoding Categorical Data: Convert all categorical variables to numerical format (e.g., one-hot encoding or label encoding) to enable analysis, using the `scikit-learn` library.
  - Balancing Classes: If the dataset shows an imbalance in edible vs. poisonous mushrooms, use resampling techniques like SMOTE (Synthetic Minority Over-sampling Technique) to balance classes using the `imblearn` library.
- Data Analysis:
  - Descriptive Statistics: Calculate summary statistics to identify common features associated with edible and poisonous mushrooms.
  - Correlation Analysis: Compute correlations among categorical features to identify potential relationships or patterns with mushroom edibility.
  - Chi-Square Tests: Perform chi-square tests on categorical variables to determine which features are statistically significant in distinguishing between poisonous and edible mushrooms, using the `scipy.stats` library.
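For intuition, here is the chi-square statistic computed by hand on a toy contingency table; in practice `scipy.stats.chi2_contingency` also returns the p-value and degrees of freedom. The counts below are made up, not taken from the mushroom dataset.

```python
import numpy as np

def chi_square_stat(table):
    """Chi-square statistic for a contingency table of observed counts."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()  # counts expected under independence
    return float(((table - expected) ** 2 / expected).sum())

# Toy odor-vs-edibility counts (illustrative numbers only)
table = [[30, 10],   # e.g. odor present: edible, poisonous
         [10, 30]]   # odor absent:  edible, poisonous
print(chi_square_stat(table))  # 20.0
```

A larger statistic means the observed counts deviate more from what independence would predict, i.e. the feature is more informative about edibility.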
- Data Visualization:
  - Feature Distributions: Create bar plots and box plots to illustrate the distribution of individual features (e.g., cap shape, odor, gill size) across edible and poisonous mushrooms, using the `matplotlib` library.
  - Correlation Matrix Heatmap: Display a heatmap of correlations between features to visually identify relationships among attributes and potential indicators of edibility.
  - Pair Plot: Use pair plots to see how combinations of features vary between edible and poisonous mushrooms, which can reveal unique patterns.
  - Decision Tree Visualization: If a decision tree model is used, visualize the tree to illustrate decision-making criteria and the most relevant features associated with edibility.
- Remember, the analysis part is open-ended. You can come up with your own analysis ideas and implement them.
- Data Loading and Initial Exploration: Load the Excel file containing the mental health dataset and conduct an initial exploration to understand its structure and contents.
- The dataset is located in the `data/mental_health/` directory.
- The dataset includes features such as:
  - Demographics: age, sex
  - Clinical Factors: condition (specific disease), condition type (disease group), baseline BDI-II depression score, identifier of the hospital.
  - Mindfulness Therapy: number of sessions started, number of sessions completed
  - Health Outcomes: BDI-II depression score at 12 weeks, BDI-II depression score at 24 weeks
- Understand the dataset structure, data types, and basic statistics for each feature, and give a brief summary report with initial findings on data types.
- The project is divided into the following parts:
- Data Preprocessing and Cleaning: Write functions to read the csv file. Suggestion: use the `pandas` library. The dataset may contain missing values, NaN values, or invalid entries. Use suitable imputation techniques for data cleaning and filling missing entries.
- Data Analysis: This includes visualization (e.g., histograms, scatter plots, box plots) and calculating correlations between features. You may compute basic statistical measures (mean, median, etc.) for each feature. If any field contains multiple values, split the values and transform them into a list of values.
  - In this part, you need to prepare a set of questions and answer them using the data provided.
  - Answer at least 15 questions using the data provided.
  - A few example questions to get you started are as follows:
    - How do baseline BDI-II scores vary across disease groups (`condition_type`)?
    - Are there noticeable differences in baseline depression levels between hospitals (`hospital_id`)?
    - What is the average reduction in depression score for participants who completed all sessions versus those who did not?
    - Are there differences in 12-week outcomes by sex, age group, or disease group?
    - Which factors (demographic, clinical, or therapy-related) most strongly predict short-term depression improvement?
    - What proportion of patients experience relapse or worsening after the 12-week period?
  - You can come up with your own questions and answer them using the data provided.
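The completers-versus-non-completers question above reduces to a `groupby`. A sketch on toy rows; the column names are assumptions, so check them against the actual file.

```python
import pandas as pd

# Toy rows; real column names in data/mental_health/ may differ
df = pd.DataFrame({
    "bdi_baseline": [30, 28, 25, 33],
    "bdi_12w": [18, 20, 24, 30],
    "sessions_started": [8, 8, 8, 8],
    "sessions_completed": [8, 8, 3, 2],
})
df["reduction"] = df["bdi_baseline"] - df["bdi_12w"]
df["completed_all"] = df["sessions_completed"] == df["sessions_started"]
means = df.groupby("completed_all")["reduction"].mean()
print(means.tolist())  # [2.0, 10.0]  (non-completers, completers)
```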
- Data Visualization: Make sure you portray all your data analysis using plots, pie charts, bar charts, and heatmaps, as suitable.
  - Use histograms (age) and bar charts or pie charts (sex distribution).
  - Visualize how many patients started vs. completed mindfulness sessions (e.g., stacked bar or pie chart).
  - Use paired line plots or boxplots to compare baseline vs. 12-week BDI-II scores.
  - Use scatterplots (sessions completed vs. BDI change).
  - Use line plots (BDI over time) or paired dot plots (12 vs. 24 weeks).
- The analysis part is open-ended. You can come up with your own analysis ideas and implement them.
- The goal of this project is to explore how physiological and lifestyle factors relate to glucose regulation in 10 participants.
- The dataset is located in the `data/nutrition/` directory.
- You will work with:
  - `CGMacros-00XX.csv`: Time-series data per participant including glucose (Libre GL), heart rate (HR), calories burned, and METs.
  - `bio.csv`: Baseline characteristics and lab measurements including BMI, A1c, fasting glucose, insulin, cholesterol, and fingerstick glucose readings.
- Data Preprocessing and Cleaning: Write functions to read the csv files. Suggestion: use the `pandas` library. The dataset may contain missing values, NaN values, or invalid entries. Use suitable imputation techniques for data cleaning and filling missing entries. Understand the association between the datasets. Convert timestamps to datetime objects.
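Timestamp conversion with pandas might look like this; the column names `Timestamp` and `Libre GL` are assumptions based on the description above, so verify them against the actual CGMacros files.

```python
import pandas as pd

# Toy CGM rows; the real files are data/nutrition/CGMacros-00XX.csv
df = pd.DataFrame({
    "Timestamp": ["2024-01-01 08:00", "2024-01-01 08:15"],
    "Libre GL": [95.0, 110.0],
})

# Parse strings into datetime objects so time-based grouping works
df["Timestamp"] = pd.to_datetime(df["Timestamp"])
df["hour"] = df["Timestamp"].dt.hour
print(df["hour"].tolist())  # [8, 8]
```

Once the column is a datetime, questions like "which hour of the day has the highest average glucose" become a simple `groupby("hour")`.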
- Data Analysis: This includes visualization (e.g., histograms, scatter plots, box plots) and calculating correlations between features. You may compute basic statistical measures (mean, median, etc.) for each feature.
  - In this part, you need to prepare a set of questions and answer them using the data provided.
  - Answer at least 15 questions using the data provided. Some examples:
    - How does Libre GL glucose vary over time for each participant?
    - How do heart rate, calories, and METs vary during the day for each participant?
    - Compare average glucose (Libre GL) between participants.
    - Does higher heart rate associate with lower or higher glucose spikes?
    - Identify the participant with the highest glucose variability.
    - Visualize daily activity vs. daily glucose averages for each participant.
    - Compute average daily metrics (mean HR, mean glucose, calories burned) and plot across participants.
    - What time of day tends to show the highest average glucose across participants?
    - Compare male vs. female participants on fasting glucose, A1c, and average Libre GL.
- Data Visualization:
  - Use line plots to analyze time-series data (glucose values from Libre GL varying over time). You can plot a single metric or multiple metrics such as heart rate and calorie values.
  - Bar charts can be used for comparing average glucose between participants.
  - Scatterplots can be used to showcase peak glucose levels against higher heart rates.
- The analysis is open-ended and you can frame your own questions. Use suitable plots to visualize your data.
Main Goal: The overall goal of these projects is to give you an opportunity to explore, understand, and work with domain-specific (i.e., agriculture/medical) data.
- Python Documentation
- [Class Code Materials]: Mailed after each lab class
- Introduction to Computation and Programming Using Python
- Elements of Programming Interviews in Python
- Python Libraries: Please explore these Python libraries in depth to identify their unique features, capabilities, and functionalities (mostly to identify the right kind of visualization tools).