- Author: Markus van Kempen (markus.van.kempen@gmail.com)
- Last Updated: January 24, 2026
- Test Queries: 100+ (comprehensive test)
- Database: SQLite with customers, products, orders tables + sales VIEW
| Rank | Combination | Accuracy | Speed | Recommendation |
|---|---|---|---|---|
| 🥇 | Mistral Small + BeeAI | 100% | 512ms | Best overall |
| 🥇 | Mistral Small + LangChain | 100% | 525ms | Fast + reliable |
| 🥇 | Granite 4 Small + LangChain | 100% | 604ms | Good balance |
| 🥇 | Granite 4 Small + BeeAI | 100% | ~650ms | After prompt fix |
| 🥈 | Direct SQL + Granite 4 | 86% | ~500ms | Fast but less accurate |
| Mode | Accuracy | Notes |
|---|---|---|
| LangChain | 100% | Retry logic + schema injection |
| BeeAI | 100% | After prompt optimization |
| Direct SQL | 86% | Fast, some complex query issues |
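The "retry logic" credited to LangChain above can be sketched as a regenerate-on-error loop that feeds the database error back into the next prompt. A minimal sketch, assuming hypothetical `generate_sql` and `execute` callables (not the project's actual functions):

```python
def run_with_retry(question, generate_sql, execute, max_retries=2):
    """Regenerate SQL with the previous error fed back into the prompt."""
    error = None
    for _ in range(max_retries + 1):
        sql = generate_sql(question, previous_error=error)
        try:
            return sql, execute(sql)
        except Exception as exc:
            error = str(exc)  # surfaced to the LLM on the next attempt
    raise RuntimeError(f"SQL still failing after {max_retries} retries: {error}")
```

The key design point is that the second attempt sees the first attempt's error message, which is often enough for the model to correct a bad column name or missing GROUP BY.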
Prompt engineering significantly impacts accuracy. After updating the prompts to include:
- Simple query patterns ("show all products" → SELECT * FROM products)
- Correct column names (products.category vs sales.product_category)
- Top N patterns

BeeAI accuracy improved from 83% → 100%.
We tested 4 additional complex scenarios to validate the "Enriched Schema" (descriptions + sample values).
| Query | Key Challenge | Result |
|---|---|---|
| "Show me the top 3 customers in Europe" | Sample Value 'Europe' | ✅ Passed |
| "Products low in stock (<10)" | Semantic Logic | ✅ Passed |
| "Accessories sales in Germany" | Country + Category | ✅ Passed |
| "Revenue from Laptop Pro" | Product Matching | ✅ Passed |
Verdict: Enriched schema provides critical context (valid values, column meanings) that allows the LLM to handle ambiguity without explicit few-shot examples.
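As a rough illustration of what "enriched schema" means here, a prompt fragment can be assembled from column descriptions plus sample values. The table and column names below come from the document; the descriptions themselves are illustrative assumptions:

```python
# Illustrative enriched-schema definition: descriptions + sample values
# give the LLM the context to resolve terms like "Europe" or "accessories".
ENRICHED_SCHEMA = {
    "sales": {
        "country": {
            "description": "Customer country",
            "samples": ["USA", "Germany", "UK", "Japan"],
        },
        "product_category": {
            "description": "Category of the sold product",
            "samples": ["electronics", "accessories", "books"],
        },
    },
}

def render_schema(schema: dict) -> str:
    """Render the enriched schema as a prompt fragment."""
    lines = []
    for table, columns in schema.items():
        lines.append(f"Table {table}:")
        for col, meta in columns.items():
            samples = ", ".join(meta["samples"])
            lines.append(f"  - {col}: {meta['description']} (e.g. {samples})")
    return "\n".join(lines)
```

Injecting `render_schema(ENRICHED_SCHEMA)` into the system prompt is what gives the model valid values to match against, instead of guessing.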
Comparison of different LLMs for SQL generation (12 test queries):
| Model | Gen Rate | Exec Rate | Avg Time | Pattern Match | Verdict |
|---|---|---|---|---|---|
| 🥇 Granite 4 Small | 100% | 91.7% | 599ms | 83.3% | Best overall |
| 🥈 Mistral Small 24B | 100% | 83.3% | 397ms | 91.7% | Fastest |
| 🥉 Llama 3.3 70B | 100% | 75.0% | 1265ms | 83.3% | Slowest |
| 4th Granite 3.3 8B | 100% | 50.0% | 411ms | 75.0% | Struggles with complex |
| Model | Simple | Filter | Aggregation | Complex |
|---|---|---|---|---|
| Granite 4 Small | 100% | 100% | 100% | 75% |
| Mistral Small 24B | 100% | 100% | 75% | 75% |
| Llama 3.3 70B | 100% | 100% | 75% | 50% |
| Granite 3.3 8B | 100% | 50% | 75% | 0% |
- Granite 4 Small is the best all-rounder - good speed and highest accuracy
- Mistral Small 24B is fastest but slightly lower accuracy on aggregations
- Llama 3.3 70B is roughly 3x slower with no accuracy advantage
- Granite 3.3 8B (deprecated) struggles with complex queries - generates incomplete SQL
Use Granite 4 Small as default for balanced performance. Switch to Mistral Small 24B for high-throughput scenarios where speed matters more than complex query handling.
Four different approaches to natural language → SQL generation were tested, all now featuring self-improving capabilities with embedding-based learning:
| Rank | Mode | Success Rate | Avg Time | Learning | Recommendation |
|---|---|---|---|---|---|
| 🥇 | Self-Improving | 81% | ~700ms | Full | Best for production - continuous improvement |
| 🥈 | BeeAI Agent | 100%* | 567ms | Enabled | Highest single-query reliability |
| 🥉 | Direct SQL | 81% | 451ms | Enabled | Fastest, good balance |
| 4th | LangChain Agent | 90%* | 3,056ms | Enabled | Most flexible, but slowest |
*Results from initial 10-query test; 100-query test shows 81% average across modes.
All modes now include:
- Embedding-based similarity search using IBM Slate model
- Few-shot learning from successful patterns
- Error avoidance from logged mistakes
- User feedback integration (👍/👎)
System now uses embedding-based classification with schema validation:
- Semantic Classification - Uses IBM Slate embeddings to understand query intent
- 79 pre-embedded example queries across 4 categories
- Understands "revenue per country" = "revenue by country"
- No keyword rules needed for variations
- Schema-Aware Validation - Checks if query terms match actual database content
- Empty Result Analysis - Explains why queries return no results
User Query: "show total revenue per country"
↓
┌─────────────────────────────────────────────────────────┐
│ SEMANTIC CLASSIFICATION (Embeddings) │
│ - Embed query using IBM Slate │
│ - Compare to 79 pre-embedded examples │
│ - Similar to: "total revenue by country" (90%) │
│ - Category: database ✅ │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ SCHEMA VALIDATION (only for database queries) │
│ - Check if mentioned entities exist in DB │
│ - "country" exists in schema ✅ │
└─────────────────────────────────────────────────────────┘
↓
Generate SQL
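The classification step in the flow above can be sketched as nearest-example cosine similarity with confidence-band routing. This is a minimal sketch: real vectors would come from the IBM Slate model, and the 3-dimensional toy vectors are purely illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def classify(query_vec, examples, high=0.75, medium=0.60):
    """Pick the closest pre-embedded example; route by confidence band."""
    best_cat, best_sim = None, -1.0
    for category, vec in examples:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_cat, best_sim = category, sim
    if best_sim >= high:
        return best_cat, "direct"          # high confidence: classify directly
    if best_sim >= medium:
        return best_cat, "keyword_check"   # medium: confirm with keywords
    return None, "keyword_fallback"        # low: fall back to keyword rules

# Toy 3-d vectors standing in for Slate embeddings of example queries
EXAMPLES = [("database", [1.0, 0.1, 0.0]), ("general", [0.0, 1.0, 0.1])]
```

The thresholds (0.75 / 0.60) mirror the configuration shown later in this document.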
| Category | Examples | Count |
|---|---|---|
| database | "show all products", "revenue by country" | 33 |
| general | "what time is it", "hello" | 18 |
| non_database | "show me the document", "sales for BMW" | 20 |
| help | "what can you do", "show examples" | 8 |
| Query Type | Example | Handled |
|---|---|---|
| Documents | "show me the box document CS" | ✅ Blocked |
| Unknown country | "orders from France" | ✅ Blocked |
| Unknown product | "sales for BMW" | ✅ Blocked |
| Valid query | "show laptop sales" | ✅ SQL generated |
| Valid country | "orders from Germany" | ✅ SQL generated |
When SQL returns no results, the system explains why:
Query: "show me iPhone sales"
SQL: SELECT * FROM sales WHERE product_name LIKE '%iphone%'
→ ⚠️ No results found.
**No product matches:** `iphone` doesn't match any product names
**Did you mean:** smartphone pro, smartphone x, smartphone basic
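The "Did you mean" suggestions could be produced with simple fuzzy matching; the sketch below uses the standard library's `difflib` as a stand-in for whatever similarity measure the system actually applies, and the product list is the one used elsewhere in this document.

```python
from difflib import get_close_matches

PRODUCTS = ["Smartphone Pro", "Smartphone X", "Smartphone Basic",
            "Laptop Pro", "Wireless Mouse", "Wireless Mouse Pro"]

def suggest_products(term, products=PRODUCTS, n=3, cutoff=0.4):
    """Return up to n product names resembling an unmatched search term."""
    by_lower = {p.lower(): p for p in products}
    matches = get_close_matches(term.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[m] for m in matches]
```

For `"iphone"` this surfaces the smartphone products, since they share the "phone" substring, which is exactly the behavior shown in the example above.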
Warns when queries match multiple products (e.g., "Wireless Mouse" matches both "Wireless Mouse" and "Wireless Mouse Pro").
┌────────────────────────────────────────────────────────────────────┐
│ SELF-IMPROVING TEST (100 QUERIES) │
├────────────────────────────────────────────────────────────────────┤
│ Total Queries: 100 │
│ Success Rate: 81% (81/100) │
│ Patterns Learned: 81 │
│ Errors Logged: 20 │
│ Embedding Utilization: 99% (99/100 queries used patterns) │
└────────────────────────────────────────────────────────────────────┘
| Batch | Queries | Success Rate | Patterns Learned | Patterns Used |
|---|---|---|---|---|
| 1 | 25 | 92% | 23 | 24/25 |
| 2 | 25 | 68% | 40 | 25/25 |
| 3 | 25 | 88% | 62 | 25/25 |
| 4 | 25 | 76% | 81 | 25/25 |
| Category | Success Rate | Notes |
|---|---|---|
| products_filter | 100% (15/15) | Perfect performance |
| sales_basic | 100% (15/15) | Perfect performance |
| products_basic | 95% (19/20) | Very high |
| aggregation | 90% (9/10) | Excellent |
| complex | 80% (8/10) | Good |
| sales_geography | 73% (11/15) | Room for improvement |
| sales_product | 27% (4/15) | Needs prompt tuning |
**Direct SQL Mode**
- Architecture: Prompt engineering + SQLGlot validation + self-improving
- Learning: Embedding-based pattern matching + error avoidance
- Strengths: Fastest, simple architecture, good learning
- Weaknesses: Can struggle with complex joins
**LangChain Agent Mode**
- Architecture: LangChain SQL chain + schema injection + self-improving
- Learning: Few-shot examples injected into chain
- Strengths: Good SQL quality, retry logic, flexible
- Weaknesses: Slowest (5-6x slower than Direct)
**BeeAI Agent Mode**
- Architecture: IBM BeeAI Framework + native WatsonX + self-improving
- Learning: Full embedding integration
- Strengths: Best single-query reliability, clean async API
- Weaknesses: Newer framework, smaller community
**Self-Improving Mode**
- Architecture: Full learning system with pattern storage
- Learning: Primary focus on continuous improvement
- Strengths: Gets better over time, learns from feedback
- Weaknesses: Initial queries may be slower (cold start)
User Query
↓
🔍 Embedding Search (IBM Slate)
│
↓
┌─────────────────────────────────────┐
│ Find Similar Patterns │
│ - Semantic similarity (cosine) │
│ - Top 3 matches with similarity > 0.7│
│ - Fallback to keyword matching │
└─────────────────────────────────────┘
│
↓
┌─────────────────────────────────────┐
│ Get Common Errors to Avoid │
│ - Top 3 error patterns │
│ - Count of occurrences │
└─────────────────────────────────────┘
│
↓
📝 Build Enhanced Prompt:
- ✅ SUCCESSFUL EXAMPLES: 3 similar patterns
- ⚠️ MISTAKES TO AVOID: 3 common errors
- 📊 Schema information
│
↓
🤖 Generate SQL → Execute
│
├── Success → 💾 Store Pattern (with embedding)
└── Failure → ⚠️ Store Error Pattern
│
↓
👍/👎 User Feedback → Update pattern ratings
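The store/retrieve cycle above can be sketched against a simplified version of the `query_patterns` table (only three of its columns), with cosine similarity computed in Python over the JSON-encoded embeddings. This is a sketch under those assumptions, not the project's actual implementation:

```python
import json
import math
import sqlite3

def open_store():
    """Create an in-memory, cut-down query_patterns table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE query_patterns (
        id INTEGER PRIMARY KEY, user_query TEXT,
        generated_sql TEXT, embedding TEXT)""")
    return conn

def store_pattern(conn, query, sql, embedding):
    """Persist a successful pattern with its embedding as a JSON array."""
    conn.execute(
        "INSERT INTO query_patterns (user_query, generated_sql, embedding) "
        "VALUES (?, ?, ?)",
        (query, sql, json.dumps(embedding)))

def similar_patterns(conn, query_vec, top_k=3, threshold=0.7):
    """Return the top-k stored patterns above the similarity threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    rows = conn.execute(
        "SELECT user_query, generated_sql, embedding FROM query_patterns")
    scored = [(cosine(query_vec, json.loads(emb)), q, sql)
              for q, sql, emb in rows]
    return sorted((s for s in scored if s[0] > threshold), reverse=True)[:top_k]
```

The returned patterns are what get injected into the "SUCCESSFUL EXAMPLES" section of the enhanced prompt.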
| Type | Confidence Threshold | Action |
|---|---|---|
| database | Any | Generate SQL |
| general | ≥50% | Direct response |
| help | ≥50% | Show guidance |
| Query | Type | Needs SQL |
|---|---|---|
| "show all products" | database | YES |
| "what time is it" | general | NO |
| "hello" | general | NO |
| "what can you do" | help | NO |
| "total revenue" | database | YES |
| "thanks" | general | NO |
When a LIKE pattern matches multiple products:
Query: "how many wireless mouse did we sell"
↓
🔍 Search: product_name LIKE '%wireless mouse%'
↓
Found: ["Wireless Mouse", "Wireless Mouse Pro"]
↓
⚠️ Show disambiguation warning
| Search Term | Matches |
|---|---|
| "laptop" | Laptop Pro, Laptop Professional, Laptop Basic |
| "wireless mouse" | Wireless Mouse, Wireless Mouse Pro |
| "mechanical keyboard" | Mechanical Keyboard, Mechanical Keyboard RGB |
| "smartphone" | Smartphone X, Smartphone Pro, Smartphone Basic |
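The multi-match check behind these warnings can be sketched by mimicking the `LIKE '%term%'` search in Python. Function names here are illustrative, not the project's actual API:

```python
PRODUCTS = ["Laptop Pro", "Laptop Professional", "Laptop Basic",
            "Wireless Mouse", "Wireless Mouse Pro",
            "Mechanical Keyboard", "Mechanical Keyboard RGB"]

def like_matches(term, products):
    """Mimic product_name LIKE '%term%' (case-insensitive substring)."""
    t = term.lower()
    return [p for p in products if t in p.lower()]

def disambiguation_warning(term, products=PRODUCTS):
    """Return a warning string when a term matches more than one product."""
    matches = like_matches(term, products)
    if len(matches) > 1:
        return f"'{term}' matches {len(matches)} products: " + ", ".join(matches)
    return None
```

A single-match term returns `None`, so only genuinely ambiguous searches trigger the warning.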
Recommended: Self-Improving Mode
- Gets better over time
- 81% accuracy with continuous improvement
- User feedback drives optimization
Recommended: Direct SQL + Learning
- Fastest average response (451ms)
- 81% success rate with learning enabled
- Falls back gracefully
Recommended: BeeAI Agent
- 100% success in controlled tests
- Native WatsonX integration
- Learning now enabled
- Added Simple Query Patterns:
"show all products" → SELECT * FROM products;
"list customers" → SELECT * FROM customers;
DO NOT add WHERE unless user specifies a filter!
- Clarified Column Names:
products table: category (NOT product_category!)
sales VIEW: product_category
- Added Top N Pattern:
"top 5 products by sales" → SELECT product_name, SUM(total_amount) AS total
FROM sales GROUP BY product_name
ORDER BY total DESC LIMIT 5;
| Mode | Before | After | Improvement |
|---|---|---|---|
| BeeAI | 83% | 100% | +17% |
| Direct SQL | 83% | 86% | +3% |
| LangChain | 100% | 100% | Maintained |
┌─────────────────────────────────────────────────────────────────────────┐
│ USER QUERY │
│ "show revenue per country" │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAYER 1: SEMANTIC CLASSIFICATION (Embeddings) │
│ ─────────────────────────────────────────────────────────────────── │
│ ✅ IMPLEMENTED │
│ • IBM Slate embeddings (ibm/slate-125m-english-rtrvr-v2) │
│ • 79 pre-embedded example queries across 4 categories │
│ • Cosine similarity matching (threshold: 0.75 high, 0.6 medium) │
│ • Categories: database, general, non_database, help │
│ │
│ UNDERSTANDS: "revenue per country" = "revenue by country" = "sales │
│ grouped by country" (no keyword rules needed!) │
└─────────────────────────────────────────────────────────────────────────┘
│
┌─────────────┼─────────────┐
│ │ │
High (>0.75) Medium (0.6-0.75) Low (<0.6)
│ │ │
▼ ▼ ▼
Direct Route Continue Keyword Fallback
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAYER 2: KEYWORD-BASED CLASSIFICATION │
│ ─────────────────────────────────────────────────────────────────── │
│ ✅ IMPLEMENTED │
│ • Database keywords: products, sales, revenue, orders, customers... │
│ • General keywords: time, weather, hello, thanks, goodbye... │
│ • Help keywords: help, example, how to use, what can you do... │
│ • Non-database keywords: document, file, PDF, image, photo... │
│ • Regex patterns for common phrases │
│ │
│ STOP WORDS: per, each, grouped, spending, breakdown, summary... │
│ (prevents false positives) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAYER 3: SCHEMA-AWARE VALIDATION │
│ ─────────────────────────────────────────────────────────────────── │
│ ✅ IMPLEMENTED │
│ • Loads actual database schema (products, customers, orders, sales) │
│ • Checks if query terms match real data: │
│ - Products: laptop, tablet, smartphone, wireless mouse... │
│ - Countries: USA, UK, Germany, Japan, Canada, Australia... │
│ - Categories: electronics, accessories, books │
│ • Detects unknown entities: "BMW sales" → BMW not in database │
│ • Detects unknown filters: "orders from France" → France not in DB │
│ │
│ ⚠️ PARTIALLY IMPLEMENTED │
│ • City/Region validation (basic) │
│ • Date range validation (not implemented) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAYER 4: EMPTY RESULT ANALYSIS │
│ ─────────────────────────────────────────────────────────────────── │
│ ✅ IMPLEMENTED │
│ • Analyzes SQL that returns 0 rows │
│ • Suggests similar products/entities │
│ • Explains why no results (e.g., "iphone" not in product names) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────┴─────────────┐
│ │
┌─────┴─────┐ ┌─────┴─────┐
│ DATABASE │ │ GENERAL │
│ → SQL Gen │ │ → Direct │
└───────────┘ │ Response │
└───────────┘
| Component | Status | Details |
|---|---|---|
| Semantic Classifier | ✅ Complete | 79 examples, 4 categories, IBM Slate embeddings |
| Keyword Classification | ✅ Complete | Database, general, help, non-database keywords |
| Schema Loading | ✅ Complete | Products, countries, categories from DB |
| Unknown Entity Detection | ✅ Complete | "BMW sales" → helpful message |
| Unknown Filter Detection | ✅ Complete | "orders from France" → France not in DB |
| Stop Words | ✅ Complete | Prevents "per", "breakdown", "grouped" false positives |
| Empty Result Analysis | ✅ Complete | Explains why queries return 0 rows |
| Product Disambiguation | ✅ Complete | Multiple product matches warning |
| General Question Handling | ✅ Complete | "what time is it", "hello" |
| Help Question Handling | ✅ Complete | "what can you do", "show examples" |
| Component | Status | Priority | Description |
|---|---|---|---|
| Semantic Caching | ✅ Implemented | High | Cache embedding vectors in memory + disk |
| Context Awareness | ✅ Implemented | High | Remember previous queries, resolve follow-ups |
| Date Range Validation | ❌ Not Started | Medium | Validate date filters against actual order dates |
| City Validation | ❌ Not Started | Low | Full city list from DB (currently partial) |
| Confidence Calibration | ❌ Not Started | Medium | Better threshold tuning based on test data |
| Multi-Language Support | ❌ Not Started | Low | Support for non-English queries |
| Ambiguous Query Handling | ❌ Not Started | Medium | "Show sales" → which sales? (clarification) |
| Synonym Expansion | ❌ Not Started | Medium | "revenue" = "income" = "earnings" |
| Negative Examples | ❌ Not Started | Medium | More examples of what NOT to generate SQL for |
| Auto-Learning Categories | ❌ Not Started | Low | Learn new categories from user corrections |
Reduces API calls by caching embedding vectors:
┌─────────────────────────────────────────────────────────────┐
│ EMBEDDING CACHE │
│ ───────────────────────────────────────────────────────── │
│ • Memory cache: LRU with 1000 entries │
│ • Disk cache: ./data/cache/embedding_cache.json │
│ • MD5 hash keys for fast lookup │
│ • Auto-persist every 10 new entries │
│ │
│ First query: "show sales by country" → API call (450ms) │
│ Second query: same → Cache hit (<1ms) │
└─────────────────────────────────────────────────────────────┘
Cache Statistics Available:
- `size`: Number of cached embeddings
- `hits`: Cache hit count
- `misses`: Cache miss count
- `hit_rate`: Percentage of cache hits
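A minimal sketch of such a cache, assuming the documented behaviors (LRU memory tier, JSON disk tier, MD5 keys, persistence every N new entries); the class and method names are illustrative:

```python
import hashlib
import json
from collections import OrderedDict
from pathlib import Path

class EmbeddingCache:
    """LRU memory cache with JSON disk persistence, keyed by MD5 of the text."""

    def __init__(self, path, max_entries=1000, persist_every=10):
        self.path = Path(path)
        self.max_entries = max_entries
        self.persist_every = persist_every
        self._new = 0
        self._mem = OrderedDict()
        self.hits = self.misses = 0
        if self.path.exists():  # warm the memory tier from disk
            self._mem.update(json.loads(self.path.read_text()))

    @staticmethod
    def _key(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self._mem:
            self.hits += 1
            self._mem.move_to_end(key)  # mark as most recently used
            return self._mem[key]
        self.misses += 1
        return None

    def put(self, text, vector):
        self._mem[self._key(text)] = vector
        while len(self._mem) > self.max_entries:
            self._mem.popitem(last=False)  # evict least recently used
        self._new += 1
        if self._new >= self.persist_every:
            self.path.write_text(json.dumps(self._mem))
            self._new = 0
```

The hit rate in the statistics above would be `hits / (hits + misses)`; the sub-millisecond second lookup comes from the in-memory `OrderedDict`.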
Understands follow-up queries and conversation flow:
┌─────────────────────────────────────────────────────────────┐
│ CONVERSATION CONTEXT │
│ ───────────────────────────────────────────────────────── │
│ Tracks last 10 queries with: │
│ • Query text and type (database/general/help) │
│ • Generated SQL │
│ • Result count │
│ • Extracted filters (product, country, category) │
└─────────────────────────────────────────────────────────────┘
Follow-up patterns detected:
• "show more" / "more details" → expand previous results
• "same for USA" / "what about UK" → similar with new filter
• "filter by electronics" → add filter to previous
• "in Germany" / "from Japan" → country filter
• "those products" / "that data" → resolve references
• "top 5" / "sort by price" → modify previous query
Example Conversation:
User: "show laptop sales by country"
→ SQL: SELECT country, SUM(total_amount) FROM sales WHERE product_name LIKE '%laptop%' GROUP BY country
User: "same for USA"
→ Follow-up detected (type: similar)
→ Resolved: "show laptop sales by country in USA"
→ SQL: SELECT country, SUM(total_amount) FROM sales WHERE product_name LIKE '%laptop%' AND country = 'USA' GROUP BY country
User: "top 5"
→ Follow-up detected (type: limit)
→ Resolved: "top 5 from: show laptop sales by country in USA"
→ SQL: ... ORDER BY total DESC LIMIT 5
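The follow-up detection in the conversation above can be sketched with a small pattern table; this covers only an illustrative subset of the documented patterns, and the pattern strings are assumptions rather than the project's actual regexes.

```python
import re

# (regex, follow-up type) — illustrative subset of the patterns listed above
FOLLOWUP_PATTERNS = [
    (r"^(?:show more|more details)\b", "expand"),
    (r"^(?:same for|what about)\s+(?P<value>.+)$", "similar"),
    (r"^top\s+(?P<n>\d+)\b", "limit"),
    (r"^(?:in|from)\s+(?P<value>[A-Za-z ]+)$", "country_filter"),
]

def detect_followup(query):
    """Return (follow-up type, captured values), or (None, {}) if standalone."""
    q = query.strip().lower()
    for pattern, kind in FOLLOWUP_PATTERNS:
        m = re.match(pattern, q)
        if m:
            return kind, m.groupdict()
    return None, {}
```

The captured values (country, limit, etc.) are what get merged into the previous query's filters when resolving the follow-up.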
| Query Type | Example | Expected | Actual | Status |
|---|---|---|---|---|
| Document query | "show me the box document CS" | Blocked | Blocked | ✅ |
| Unknown country | "orders from France" | Blocked | Blocked | ✅ |
| Unknown product | "sales for BMW" | Blocked | Blocked | ✅ |
| Valid product | "show laptop sales" | SQL | SQL | ✅ |
| Valid country | "orders from Germany" | SQL | SQL | ✅ |
| Greeting | "hello" | Direct | Direct | ✅ |
| Time query | "what time is it" | Direct | Direct | ✅ |
| Help query | "what can you do" | Help | Help | ✅ |
| Breakdown query | "breakdown of sales by region" | SQL | SQL | ✅ |
| Revenue per | "revenue per country" | SQL | SQL | ✅ |
# Semantic classification thresholds
HIGH_CONFIDENCE_THRESHOLD = 0.75 # Direct classification
MEDIUM_CONFIDENCE_THRESHOLD = 0.60 # Continue to keyword check
LOW_CONFIDENCE_THRESHOLD = 0.30 # Fall back entirely
# Example counts by category
SEMANTIC_EXAMPLES = {
'database': 33, # "show all products", "revenue by country"
'general': 18, # "what time is it", "hello"
'non_database': 20, # "show me the document", "sales for BMW"
'help': 8 # "what can you do", "show examples"
}

| File | Purpose |
|---|---|
| query_classifier.py | Main hybrid classification logic |
| semantic_classifier.py | Embedding-based classification |
| product_mapper.py | Product disambiguation |
| schema_loader.py | Database schema utilities |
- Multi-Shot Generation - Generate 2-3 SQL candidates and pick best
- LLM-as-Judge - Use LLM to validate results match user intent
- Hybrid Mode - Fast mode first, fallback to thorough mode on failure
- LLM Request Router - Replace brittle keyword/regex classification with a robust LLM-based router to handle natural language nuance (e.g., "biggest sales" implies sorting, not an entity named "biggest").
- Query Caching - Cache common query patterns
- Fine-tuning - Fine-tune smaller model on SQL generation task
- Active Learning - Identify queries that need human review
- Semantic Caching - Cache embedding vectors to reduce API calls
- Context Awareness - Remember conversation context for follow-up queries
- LLM: IBM Granite 3 8B Instruct (ibm/granite-3-8b-instruct)
- Embeddings: IBM Slate (ibm/slate-125m-english-rtrvr-v2)
- Application DB: SQLite (./data/database.db)
- Learning Store: SQLite (./data/learning.db)
-- Successful patterns
CREATE TABLE query_patterns (
id INTEGER PRIMARY KEY,
user_query TEXT,
generated_sql TEXT,
embedding TEXT, -- JSON array of floats
result_count INTEGER,
execution_time_ms REAL,
model_id TEXT,
mode TEXT,
thumbs_up INTEGER DEFAULT 0,
thumbs_down INTEGER DEFAULT 0,
created_at TIMESTAMP
);
-- Error patterns
CREATE TABLE error_patterns (
id INTEGER PRIMARY KEY,
user_query TEXT,
failed_sql TEXT,
error_message TEXT,
error_type TEXT,
model_id TEXT,
mode TEXT,
created_at TIMESTAMP
);

- Python 3.13
- Streamlit 1.x
- WatsonX API (IBM Cloud)
- BeeAI Framework (IBM)
- LangChain + LangChain-IBM