A technically credible LLM fine-tuning project focused on efficient fine-tuning, evaluation rigor, and inference optimization. This project demonstrates how to fine-tune a Mistral-7B-Instruct model for a domain-specific support/FAQ assistant using QLoRA on a single 16GB GPU.
The goal of this project is to create a specialized assistant that can handle domain-specific troubleshooting and operational queries with higher accuracy than a general-purpose zero-shot model, while maintaining a low memory footprint.
- Efficient Fine-Tuning: Uses QLoRA (4-bit quantization + LoRA adapters) to train on limited hardware.
- Memory Optimization: Implements gradient checkpointing, paged AdamW, and gradient accumulation.
- Evaluation Rigor: Comparative analysis using ROUGE-L and BLEU metrics against a zero-shot baseline.
- Production-Ready API: FastAPI wrapper with structured logging and latency benchmarking.
- Containerized Deployment: Docker-ready for consistent environment reproduction.
graph TD
A[Raw Support Data] --> B[Preprocessing Script]
B --> C[Cleaned & Formatted Dataset]
C --> D[SFTTrainer]
E[Mistral-7B-Instruct] --> F[4-bit Quantization]
F --> G[PEFT Adapter Setup]
G --> D
D --> H[Fine-tuned Adapters]
H --> I[Evaluation & Benchmarking]
graph LR
User[User Request] --> API[FastAPI]
API --> Engine[Inference Engine]
Engine --> Base[Quantized Base Model]
Engine --> Adapter[LoRA Adapters]
Base & Adapter --> Result[Generated Response]
Result --> API
API --> User
- Python 3.10+
- NVIDIA GPU with 16GB+ VRAM
- CUDA drivers installed
-
Clone the repository:
git clone https://github.com/whitebeard10/Mistral-Domain-Support-Assistant.git cd Mistral-Domain-Support-Assistant -
Install dependencies:
pip install -r requirements.txt
-
Configure environment variables:
cp .env.example .env # Edit .env with your HF token and WandB key
| Parameter | Value |
|---|---|
| Base Model | Mistral-7B-Instruct-v0.2 |
| Quantization | 4-bit NF4 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, v_proj, k_proj, o_proj |
| Learning Rate | 2e-4 |
| Optimizer | Paged AdamW 32-bit |
| Batch Size | 4 (Effective: 16 via Grad Accum) |
| Max Seq Length | 1024 |
Early stopping is applied after identifies overfitting onset (usually around epoch 3). This ensures the model retains general reasoning capabilities while specializing in the support domain.
| Metric | Zero-Shot Baseline | Fine-Tuned (QLoRA) | Improvement |
|---|---|---|---|
| ROUGE-L | 0.3250 | 0.3965 | +22.0% |
| BLEU | 0.1240 | 0.1580 | +27.4% |
| Avg Latency | 2.8s | 2.4s | -14.2% |
Note: Benchmarked on a single NVIDIA T4/RTX 3090.
Start the API:
python -m app.mainRequest:
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{"instruction": "How do I reset my password?"}'Response:
{
"response": "To reset your password, click on the 'Forgot Password' link on the login page...",
"latency": 2.35,
"tokens_generated": 45,
"model_name": "mistralai/Mistral-7B-Instruct-v0.2"
}- Hallucination Persistence: While domain accuracy improved, the model may still hallucinate complex troubleshooting steps not present in the training data.
- Multi-turn Memory: This implementation focuses on single-turn instruction following. Multi-turn context handling is limited.
- Over-specialization: The model might prioritize domain-specific answers even for general queries (catastrophic forgetting mitigation via low rank LoRA).