🤖 ArXiv Paper Classification using CNN and Glove Embeddings

👉 CSE 587: Deep Learning for NLP - Midterm Project (Spring 2025)

👉 Contributors: Sinjoy Saha, Xin Dong

👉 Read the full report here.

🔍Table of Contents

🔍Table of Contents
Dataset Curation
- Dataset Distribution
  - OOD Test Data
Models
Results
Discussion
- Impact of Embedding Size on Model Performance
Conclusion

Dataset Curation

We curate the top-5 classes from arXiv CS papers.

Category	Papers	Percentage
cs.CV (Computer Vision)	112,317	34.76%
cs.CL (Computation and Language)	59,447	18.4%
cs.LG (Machine Learning)	99,416	30.77%
cs.RO (Robotics)	27,966	8.65%
cs.CR (Cryptography and Security)	24,961	7.72%

Dataset Distribution

OOD Test Data

Models

Model Architecture

Layer Type	Output Shape	Glove50CNN Parameters	Glove100CNN Parameters	Glove200CNN Parameters
Embedding	[512, 256, 50]	20,000,000	40,000,000	80,000,000
Conv1d (In → Out, kernel=3)	[512, 128, 256]	19,328	38,528	76,928
MaxPool1d (kernel=2)	[512, 128, 128]	0	0	0
Conv1d (128 → 128, kernel=3)	[512, 128, 128]	49,280	49,280	49,280
MaxPool1d (kernel=2)	[512, 128, 64]	0	0	0
Conv1d (128 → 128, kernel=3)	[512, 128, 64]	49,280	49,280	49,280
MaxPool1d (kernel=2)	[512, 128, 32]	0	0	0
Fully Connected (Linear)	[512, 5]	20,485	20,485	20,485

Model Parameters

Parameter Type	Glove50CNN	Glove100CNN	Glove200CNN
Total Parameters	20,138,373	40,157,573	80,195,973
Trainable Parameters	138,373	157,573	195,973
Non-trainable Parameters	20,000,000	40,000,000	80,000,000

Resource Size (MB)

Resource	Glove50CNN	Glove100CNN	Glove200CNN
Input Size for one batch	0.50	0.50	0.50
Forward/Backward Pass	386.02	436.02	536.02
Parameters Size	76.82	153.19	305.92
Estimated Total Size	463.34	589.71	842.44

Results

Training and validation loss curves for GloveCNN models with embedding dimensions 50, 100 and 200.

Training and validation accuracy curves for GloveCNN models with embedding dimensions 50, 100 and 200.

Confusion Matrix for Glove50CNN for 2015-2020 and 2023

Performance of Glove50CNN (2015-2020)

Class	Precision	Recall	F1-score	Count
cs.CL	0.91	0.97	0.94	4396
cs.CR	0.98	0.86	0.91	1450
cs.CV	0.93	0.96	0.94	10316
cs.LG	0.67	0.61	0.63	4038
cs.RO	0.95	0.80	0.87	2495
Accuracy			0.92
M Avg	0.89	0.84	0.86
W Avg	0.92	0.92	0.92

Performance of Glove50CNN (2023)

Class	Precision	Recall	F1-score	Count
cs.CL	0.80	0.96	0.87	4396
cs.CR	0.92	0.79	0.85	1450
cs.CV	0.82	0.95	0.88	10316
cs.LG	0.89	0.49	0.63	4038
cs.RO	0.95	0.78	0.86	2495
Accuracy			0.84
M Avg	0.88	0.79	0.82
W Avg	0.85	0.84	0.83

Discussion

Impact of Embedding Size on Model Performance

Conclusion

Objective: Evaluated GloVe-CNN models with embedding sizes of 50, 100, and 200 for research paper classification.

Findings:

Larger embeddings generally improve accuracy and F1-scores.
GloVe200CNN performed best, particularly in macro and weighted-average F1 scores.
Diminishing returns observed from 100 to 200 dimensions.

Performance on 2023 Data:

All models showed performance degradation.
cs.LG had the biggest decline due to evolving research trends.
GloVe200CNN was most stable, while others struggled with underrepresented categories.

Key Insights:

Well-represented categories (cs.CL, cs.CR, cs.CV) were more robust.
Class distribution impacts model performance over time.

Future Work:

Train domain-specific word embeddings.
Explore hybrid CNN-RNN architectures.
Integrate large language models (LLMs) for better accuracy and granularity.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
data		data
images		images
src		src
CSE_587_Midterm_Project_Report.pdf		CSE_587_Midterm_Project_Report.pdf
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🤖 ArXiv Paper Classification using CNN and Glove Embeddings

🔍Table of Contents

Dataset Curation

Dataset Distribution

OOD Test Data

Models

Model Architecture

Model Parameters

Resource Size (MB)

Results

Training and validation loss curves for GloveCNN models with embedding dimensions 50, 100 and 200.

Training and validation accuracy curves for GloveCNN models with embedding dimensions 50, 100 and 200.

Confusion Matrix for Glove50CNN for 2015-2020 and 2023

Performance of Glove50CNN (2015-2020)

Performance of Glove50CNN (2023)

Discussion

Impact of Embedding Size on Model Performance

Conclusion

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🤖 ArXiv Paper Classification using CNN and Glove Embeddings

🔍Table of Contents

Dataset Curation

Dataset Distribution

OOD Test Data

Models

Model Architecture

Model Parameters

Resource Size (MB)

Results

Training and validation loss curves for GloveCNN models with embedding dimensions 50, 100 and 200.

Training and validation accuracy curves for GloveCNN models with embedding dimensions 50, 100 and 200.

Confusion Matrix for Glove50CNN for 2015-2020 and 2023

Performance of Glove50CNN (2015-2020)

Performance of Glove50CNN (2023)

Discussion

Impact of Embedding Size on Model Performance

Conclusion

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages