
SQL Generation Mode Comparison Results

Author: Markus van Kempen (markus.van.kempen@gmail.com)
Last Updated: January 24, 2026
Test Queries: 100+ (comprehensive test)
Database: SQLite with customers, products, orders tables + sales VIEW


Model + Mode Comparison (Latest)

Best Combinations After Prompt Optimization

| Rank | Combination | Accuracy | Speed | Recommendation |
|------|-------------|----------|-------|----------------|
| 🥇 | Mistral Small + BeeAI | 100% | 512ms | Best overall |
| 🥇 | Mistral Small + LangChain | 100% | 525ms | Fast + reliable |
| 🥇 | Granite 4 Small + LangChain | 100% | 604ms | Good balance |
| 🥇 | Granite 4 Small + BeeAI | 100% | ~650ms | After prompt fix |
| 🥈 | Direct SQL + Granite 4 | 86% | ~500ms | Fast but less accurate |

Accuracy by Mode (After Prompt Fixes)

| Mode | Accuracy | Notes |
|------|----------|-------|
| LangChain | 100% | Retry logic + schema injection |
| BeeAI | 100% | After prompt optimization |
| Direct SQL | 86% | Fast, some complex query issues |

Key Finding

Prompt engineering significantly impacts accuracy. After fixing prompts to include:

  • Simple query patterns ("show all products" → SELECT * FROM products)
  • Correct column names (products.category vs sales.product_category)
  • Top N patterns

BeeAI accuracy improved from 83% → 100%.


Schema Enrichment & Validation Impact (January 2026)

Benchmark Results (Complex Queries)

We tested 4 additional complex scenarios to validate the "Enriched Schema" (descriptions + sample values).

| Query | Key Challenge | Result |
|-------|---------------|--------|
| "Show me the top 3 customers in Europe" | Sample Value 'Europe' | ✅ Passed |
| "Products low in stock (<10)" | Semantic Logic | ✅ Passed |
| "Accessories sales in Germany" | Country + Category | ✅ Passed |
| "Revenue from Laptop Pro" | Product Matching | ✅ Passed |

Verdict: Enriched schema provides critical context (valid values, column meanings) that allows the LLM to handle ambiguity without explicit few-shot examples.


LLM Model Comparison

Comparison of different LLMs for SQL generation (12 test queries):

Overall Results

| Model | Gen Rate | Exec Rate | Avg Time | Pattern Match | Verdict |
|-------|----------|-----------|----------|---------------|---------|
| 🥇 Granite 4 Small | 100% | 91.7% | 599ms | 83.3% | Best overall |
| 🥈 Mistral Small 24B | 100% | 83.3% | 397ms | 91.7% | Fastest |
| 🥉 Llama 3.3 70B | 100% | 75.0% | 1265ms | 83.3% | Slowest |
| 4th Granite 3.3 8B | 100% | 50.0% | 411ms | 75.0% | Struggles with complex |

Success by Query Category

| Model | Simple | Filter | Aggregation | Complex |
|-------|--------|--------|-------------|---------|
| Granite 4 Small | 100% | 100% | 100% | 75% |
| Mistral Small 24B | 100% | 100% | 75% | 75% |
| Llama 3.3 70B | 100% | 100% | 75% | 50% |
| Granite 3.3 8B | 100% | 50% | 75% | 0% |

Key Findings

  1. Granite 4 Small is the best all-rounder - good speed and highest accuracy
  2. Mistral Small 24B is fastest but slightly lower accuracy on aggregations
  3. Llama 3.3 70B is slower (3x) with no accuracy advantage
  4. Granite 3.3 8B (deprecated) struggles with complex queries - generates incomplete SQL

Recommendation

Use Granite 4 Small as default for balanced performance. Switch to Mistral Small 24B for high-throughput scenarios where speed matters more than complex query handling.


Executive Summary

Four different approaches to natural language → SQL generation were tested, all now featuring self-improving capabilities with embedding-based learning:

| Rank | Mode | Success Rate | Avg Time | Learning | Recommendation |
|------|------|--------------|----------|----------|----------------|
| 🥇 | Self-Improving | 81% | ~700ms | Full | Best for production - continuous improvement |
| 🥈 | BeeAI Agent | 100%* | 567ms | Enabled | Highest single-query reliability |
| 🥉 | Direct SQL | 81% | 451ms | Enabled | Fastest, good balance |
| 4th | LangChain Agent | 90%* | 3,056ms | Enabled | Most flexible, but slowest |

*Results from initial 10-query test; 100-query test shows 81% average across modes.


New Features (January 2026)

🧠 Self-Improving System

All modes now include:

  • Embedding-based similarity search using IBM Slate model
  • Few-shot learning from successful patterns
  • Error avoidance from logged mistakes
  • User feedback integration (👍/👎)
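The similarity lookup behind these features can be sketched as follows. This is an illustrative stand-in: the real system gets its vectors from the IBM Slate model via watsonx, and the `patterns` structure and field names here are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_similar_patterns(query_vec, patterns, threshold=0.7, top_k=3):
    """Return the top-k stored patterns whose embedding clears the threshold."""
    scored = [(cosine_similarity(query_vec, p["embedding"]), p) for p in patterns]
    keep = [(sim, p) for sim, p in scored if sim >= threshold]
    keep.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in keep[:top_k]]
```

The threshold and top-k values mirror the "similarity > 0.7, top 3 matches" figures quoted later in this document.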

💬 Semantic Query Classification (NEW)

System now uses embedding-based classification with schema validation:

  1. Semantic Classification - Uses IBM Slate embeddings to understand query intent
    • 79 pre-embedded example queries across 4 categories
    • Understands "revenue per country" = "revenue by country"
    • No keyword rules needed for variations
  2. Schema-Aware Validation - Checks if query terms match actual database content
  3. Empty Result Analysis - Explains why queries return no results

Semantic Classifier Architecture

User Query: "show total revenue per country"
                    ↓
┌─────────────────────────────────────────────────────────┐
│  SEMANTIC CLASSIFICATION (Embeddings)                   │
│  - Embed query using IBM Slate                          │
│  - Compare to 79 pre-embedded examples                  │
│  - Similar to: "total revenue by country" (90%)         │
│  - Category: database ✅                                 │
└─────────────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────────┐
│  SCHEMA VALIDATION (only for database queries)          │
│  - Check if mentioned entities exist in DB              │
│  - "country" exists in schema ✅                         │
└─────────────────────────────────────────────────────────┘
                    ↓
             Generate SQL
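The routing shown above can be approximated with a nearest-example classifier. A simplified sketch (the real embeddings come from IBM Slate; the function name and the `examples` shape are assumptions), using the 0.75/0.60 thresholds from the configuration section:

```python
import math

def _cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(query_vec, examples, high=0.75, medium=0.60):
    """Route a query by its nearest pre-embedded example.

    examples: {category: [embedding, ...]} -- in the real system these are
    the 79 pre-embedded queries across the 4 categories.
    """
    best_cat, best_sim = None, -1.0
    for category, vectors in examples.items():
        for vec in vectors:
            sim = _cos(query_vec, vec)
            if sim > best_sim:
                best_cat, best_sim = category, sim
    if best_sim >= high:
        route = "direct"          # trust the semantic match
    elif best_sim >= medium:
        route = "keyword_check"   # confirm with the keyword layer
    else:
        route = "keyword_fallback"
    return best_cat, best_sim, route
```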

Example Categories

| Category | Examples | Count |
|----------|----------|-------|
| database | "show all products", "revenue by country" | 33 |
| general | "what time is it", "hello" | 18 |
| non_database | "show me the document", "sales for BMW" | 20 |
| help | "what can you do", "show examples" | 8 |

Classification Test Results (17 queries)

| Query Type | Example | Handled |
|------------|---------|---------|
| Documents | "show me the box document CS" | ✅ Blocked |
| Unknown country | "orders from France" | ✅ Blocked |
| Unknown product | "sales for BMW" | ✅ Blocked |
| Valid query | "show laptop sales" | ✅ SQL generated |
| Valid country | "orders from Germany" | ✅ SQL generated |

Empty Result Feedback

When SQL returns no results, the system explains why:

Query: "show me iPhone sales"
SQL: SELECT * FROM sales WHERE product_name LIKE '%iphone%'
→ ⚠️ No results found.
   **No product matches:** `iphone` doesn't match any product names
   **Did you mean:** smartphone pro, smartphone x, smartphone basic
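The suggestion step can be sketched with stdlib fuzzy matching. This is a stand-in for whatever the real implementation uses; the function name and the 0.4 cutoff are assumptions.

```python
import difflib

def explain_empty_result(term, known_products, n=3):
    """Suggest close product names when a filter term matches nothing."""
    suggestions = difflib.get_close_matches(
        term.lower(), [p.lower() for p in known_products], n=n, cutoff=0.4)
    message = f"No product matches: '{term}' doesn't match any product names."
    if suggestions:
        message += f" Did you mean: {', '.join(suggestions)}?"
    return message
```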

🔍 Product Disambiguation

Warns when queries match multiple products (e.g., "Wireless Mouse" matches both "Wireless Mouse" and "Wireless Mouse Pro").


100-Query Test Results

Overall Performance

┌────────────────────────────────────────────────────────────────────┐
│              SELF-IMPROVING TEST (100 QUERIES)                     │
├────────────────────────────────────────────────────────────────────┤
│  Total Queries:           100                                      │
│  Success Rate:            81% (81/100)                             │
│  Patterns Learned:        81                                       │
│  Errors Logged:           20                                       │
│  Embedding Utilization:   99% (99/100 queries used patterns)       │
└────────────────────────────────────────────────────────────────────┘

Learning Progression by Batch

| Batch | Queries | Success Rate | Patterns Learned | Patterns Used |
|-------|---------|--------------|------------------|---------------|
| 1 | 25 | 92% | 23 | 24/25 |
| 2 | 25 | 68% | 40 | 25/25 |
| 3 | 25 | 88% | 62 | 25/25 |
| 4 | 25 | 76% | 81 | 25/25 |

Success Rate by Category

| Category | Success Rate | Notes |
|----------|--------------|-------|
| products_filter | 100% (15/15) | Perfect performance |
| sales_basic | 100% (15/15) | Perfect performance |
| products_basic | 95% (19/20) | Very high |
| aggregation | 90% (9/10) | Excellent |
| complex | 80% (8/10) | Good |
| sales_geography | 73% (11/15) | Room for improvement |
| sales_product | 27% (4/15) | Needs prompt tuning |

Mode-Specific Details

Direct SQL Mode

  • Architecture: Prompt engineering + SQLGlot validation + self-improving
  • Learning: Embedding-based pattern matching + error avoidance
  • Strengths: Fastest, simple architecture, good learning
  • Weaknesses: Can struggle with complex joins
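The project validates generated SQL with SQLGlot. A stdlib stand-in with the same intent asks SQLite to plan the statement without executing it (a sketch, not the project's actual validator):

```python
import sqlite3

def validate_sql(conn, sql):
    """Cheap validity check: have SQLite plan the query without running it."""
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
```

Unlike a pure parser, this also catches references to tables and columns that don't exist in the connected database.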

LangChain Agent Mode

  • Architecture: LangChain SQL chain + schema injection + self-improving
  • Learning: Few-shot examples injected into chain
  • Strengths: Good SQL quality, retry logic, flexible
  • Weaknesses: Slowest (5-6x slower than Direct)

BeeAI Agent Mode

  • Architecture: IBM BeeAI Framework + native WatsonX + self-improving
  • Learning: Full embedding integration
  • Strengths: Best single-query reliability, clean async API
  • Weaknesses: Newer framework, smaller community

Self-Improving Agent Mode

  • Architecture: Full learning system with pattern storage
  • Learning: Primary focus on continuous improvement
  • Strengths: Gets better over time, learns from feedback
  • Weaknesses: Initial queries may be slower (cold start)

Self-Improving Architecture

User Query
    ↓
🔍 Embedding Search (IBM Slate)
    │
    ↓
┌─────────────────────────────────────┐
│ Find Similar Patterns               │
│ - Semantic similarity (cosine)      │
│ - Top 3 matches with similarity > 0.7│
│ - Fallback to keyword matching      │
└─────────────────────────────────────┘
    │
    ↓
┌─────────────────────────────────────┐
│ Get Common Errors to Avoid          │
│ - Top 3 error patterns              │
│ - Count of occurrences              │
└─────────────────────────────────────┘
    │
    ↓
📝 Build Enhanced Prompt:
   - ✅ SUCCESSFUL EXAMPLES: 3 similar patterns
   - ⚠️ MISTAKES TO AVOID: 3 common errors
   - 📊 Schema information
    │
    ↓
🤖 Generate SQL → Execute
    │
    ├── Success → 💾 Store Pattern (with embedding)
    └── Failure → ⚠️ Store Error Pattern
    │
    ↓
👍/👎 User Feedback → Update pattern ratings
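The prompt-assembly step in the diagram might look like this. Section labels mirror the diagram; the function signature and field names are assumptions, not the project's actual code.

```python
def build_enhanced_prompt(question, similar, errors, schema):
    """Assemble the prompt from learned patterns, known mistakes, and schema."""
    parts = ["You are a SQL generator.", f"Schema:\n{schema}"]
    if similar:
        parts.append("SUCCESSFUL EXAMPLES:")
        parts += [f"Q: {p['query']}\nSQL: {p['sql']}" for p in similar]
    if errors:
        parts.append("MISTAKES TO AVOID:")
        parts += [f"- {e}" for e in errors]
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)
```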

Query Classification

Types Detected

| Type | Confidence Threshold | Action |
|------|----------------------|--------|
| database | Any | Generate SQL |
| general | ≥50% | Direct response |
| help | ≥50% | Show guidance |

Examples

| Query | Type | Needs SQL |
|-------|------|-----------|
| "show all products" | database | YES |
| "what time is it" | general | NO |
| "hello" | general | NO |
| "what can you do" | help | NO |
| "total revenue" | database | YES |
| "thanks" | general | NO |

Product Disambiguation

How It Works

When a LIKE pattern matches multiple products:

Query: "how many wireless mouse did we sell"
          ↓
🔍 Search: product_name LIKE '%wireless mouse%'
          ↓
Found: ["Wireless Mouse", "Wireless Mouse Pro"]
          ↓
⚠️ Show disambiguation warning

Products That Trigger Disambiguation

| Search Term | Matches |
|-------------|---------|
| "laptop" | Laptop Pro, Laptop Professional, Laptop Basic |
| "wireless mouse" | Wireless Mouse, Wireless Mouse Pro |
| "mechanical keyboard" | Mechanical Keyboard, Mechanical Keyboard RGB |
| "smartphone" | Smartphone X, Smartphone Pro, Smartphone Basic |
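The LIKE-based check can be sketched against an in-memory SQLite table (the one-column `products` schema below is a simplification of the real one):

```python
import sqlite3

def ambiguous_matches(conn, term):
    """Return every product a LIKE filter would hit; >1 triggers the warning."""
    rows = conn.execute(
        "SELECT product_name FROM products WHERE product_name LIKE ?",
        (f"%{term}%",)).fetchall()
    return [name for (name,) in rows]

# Demo with a throwaway in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("Wireless Mouse",), ("Wireless Mouse Pro",), ("Laptop Pro",)])
matches = ambiguous_matches(conn, "wireless mouse")
if len(matches) > 1:
    print(f"Ambiguous: {matches}")  # warn before generating SQL
```

Note that SQLite's LIKE is case-insensitive for ASCII by default, which is why "wireless mouse" matches "Wireless Mouse".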

Recommendations

For Production Use

Recommended: Self-Improving Mode

  • Gets better over time
  • 81% accuracy with continuous improvement
  • User feedback drives optimization

For Speed-Critical Applications

Recommended: Direct SQL + Learning

  • Fastest average response (451ms)
  • 81% success rate with learning enabled
  • Falls back gracefully

For Highest Single-Query Reliability

Recommended: BeeAI Agent

  • 100% success in controlled tests
  • Native WatsonX integration
  • Learning now enabled

Prompt Optimization (January 24, 2026)

Changes Applied to All Modes

  1. Added Simple Query Patterns:
     "show all products" → SELECT * FROM products;
     "list customers" → SELECT * FROM customers;
     DO NOT add WHERE unless the user specifies a filter!
  2. Clarified Column Names:
     products table: category (NOT product_category!)
     sales VIEW: product_category
  3. Added Top N Pattern:
     "top 5 products by sales" → SELECT product_name, SUM(total_amount) AS total
                                 FROM sales GROUP BY product_name
                                 ORDER BY total DESC LIMIT 5;

Impact of Prompt Fixes

| Mode | Before | After | Improvement |
|------|--------|-------|-------------|
| BeeAI | 83% | 100% | +17% |
| Direct SQL | 83% | 86% | +3% |
| LangChain | 100% | 100% | Maintained |

Hybrid Query Classification - Implementation Status

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                         USER QUERY                                       │
│                    "show revenue per country"                            │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 1: SEMANTIC CLASSIFICATION (Embeddings)                          │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • IBM Slate embeddings (ibm/slate-125m-english-rtrvr-v2)               │
│  • 79 pre-embedded example queries across 4 categories                  │
│  • Cosine similarity matching (threshold: 0.75 high, 0.6 medium)        │
│  • Categories: database, general, non_database, help                    │
│                                                                          │
│  UNDERSTANDS: "revenue per country" = "revenue by country" = "sales    │
│               grouped by country" (no keyword rules needed!)             │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                    ┌─────────────┼─────────────┐
                    │             │             │
              High (>0.75)   Medium (0.6-0.75)  Low (<0.6)
                    │             │             │
                    ▼             ▼             ▼
              Direct Route    Continue      Keyword Fallback
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 2: KEYWORD-BASED CLASSIFICATION                                   │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • Database keywords: products, sales, revenue, orders, customers...    │
│  • General keywords: time, weather, hello, thanks, goodbye...           │
│  • Help keywords: help, example, how to use, what can you do...         │
│  • Non-database keywords: document, file, PDF, image, photo...          │
│  • Regex patterns for common phrases                                     │
│                                                                          │
│  STOP WORDS: per, each, grouped, spending, breakdown, summary...        │
│  (prevents false positives)                                              │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 3: SCHEMA-AWARE VALIDATION                                        │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • Loads actual database schema (products, customers, orders, sales)    │
│  • Checks if query terms match real data:                               │
│    - Products: laptop, tablet, smartphone, wireless mouse...            │
│    - Countries: USA, UK, Germany, Japan, Canada, Australia...           │
│    - Categories: electronics, accessories, books                        │
│  • Detects unknown entities: "BMW sales" → BMW not in database          │
│  • Detects unknown filters: "orders from France" → France not in DB     │
│                                                                          │
│  ⚠️ PARTIALLY IMPLEMENTED                                                │
│  • City/Region validation (basic)                                        │
│  • Date range validation (not implemented)                               │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 4: EMPTY RESULT ANALYSIS                                          │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • Analyzes SQL that returns 0 rows                                      │
│  • Suggests similar products/entities                                    │
│  • Explains why no results (e.g., "iphone" not in product names)        │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
                    ┌─────────────┴─────────────┐
                    │                           │
              ┌─────┴─────┐               ┌─────┴─────┐
              │ DATABASE  │               │  GENERAL  │
              │ → SQL Gen │               │ → Direct  │
              └───────────┘               │  Response │
                                          └───────────┘

What's Implemented ✅

| Component | Status | Details |
|-----------|--------|---------|
| Semantic Classifier | ✅ Complete | 79 examples, 4 categories, IBM Slate embeddings |
| Keyword Classification | ✅ Complete | Database, general, help, non-database keywords |
| Schema Loading | ✅ Complete | Products, countries, categories from DB |
| Unknown Entity Detection | ✅ Complete | "BMW sales" → helpful message |
| Unknown Filter Detection | ✅ Complete | "orders from France" → France not in DB |
| Stop Words | ✅ Complete | Prevents "per", "breakdown", "grouped" false positives |
| Empty Result Analysis | ✅ Complete | Explains why queries return 0 rows |
| Product Disambiguation | ✅ Complete | Multiple product matches warning |
| General Question Handling | ✅ Complete | "what time is it", "hello" |
| Help Question Handling | ✅ Complete | "what can you do", "show examples" |

What's Missing / Future Improvements 🔮

| Component | Status | Priority | Description |
|-----------|--------|----------|-------------|
| Semantic Caching | ✅ Implemented | High | Cache embedding vectors in memory + disk |
| Context Awareness | ✅ Implemented | High | Remember previous queries, resolve follow-ups |
| Date Range Validation | ❌ Not Started | Medium | Validate date filters against actual order dates |
| City Validation | ⚠️ Basic | Low | Full city list from DB (currently partial) |
| Confidence Calibration | ⚠️ Basic | Medium | Better threshold tuning based on test data |
| Multi-Language Support | ❌ Not Started | Low | Support for non-English queries |
| Ambiguous Query Handling | ⚠️ Basic | Medium | "Show sales" → which sales? (clarification) |
| Synonym Expansion | ❌ Not Started | Medium | "revenue" = "income" = "earnings" |
| Negative Examples | ⚠️ Basic | Medium | More examples of what NOT to generate SQL for |
| Auto-Learning Categories | ❌ Not Started | Low | Learn new categories from user corrections |

NEW: Semantic Caching ✅

Reduces API calls by caching embedding vectors:

┌─────────────────────────────────────────────────────────────┐
│  EMBEDDING CACHE                                             │
│  ─────────────────────────────────────────────────────────  │
│  • Memory cache: LRU with 1000 entries                      │
│  • Disk cache: ./data/cache/embedding_cache.json            │
│  • MD5 hash keys for fast lookup                            │
│  • Auto-persist every 10 new entries                        │
│                                                              │
│  First query: "show sales by country" → API call (450ms)    │
│  Second query: same → Cache hit (<1ms)                      │
└─────────────────────────────────────────────────────────────┘

Cache Statistics Available:

  • size: Number of cached embeddings
  • hits: Cache hit count
  • misses: Cache miss count
  • hit_rate: Percentage of cache hits
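A minimal in-memory version of that cache (disk persistence omitted; the class and method names are illustrative, not the project's actual API):

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """LRU cache of embedding vectors keyed by MD5 of the normalized query."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._cache = OrderedDict()
        self.hits = self.misses = 0

    def _key(self, text):
        return hashlib.md5(text.strip().lower().encode()).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self._cache:
            self.hits += 1
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        self.misses += 1
        return None

    def put(self, text, vector):
        key = self._key(text)
        self._cache[key] = vector
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict least-recently used

    def stats(self):
        total = self.hits + self.misses
        return {"size": len(self._cache), "hits": self.hits,
                "misses": self.misses,
                "hit_rate": self.hits / total if total else 0.0}
```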

NEW: Context Awareness ✅

Understands follow-up queries and conversation flow:

┌─────────────────────────────────────────────────────────────┐
│  CONVERSATION CONTEXT                                        │
│  ─────────────────────────────────────────────────────────  │
│  Tracks last 10 queries with:                                │
│  • Query text and type (database/general/help)              │
│  • Generated SQL                                             │
│  • Result count                                              │
│  • Extracted filters (product, country, category)           │
└─────────────────────────────────────────────────────────────┘

Follow-up patterns detected:
• "show more" / "more details"     → expand previous results
• "same for USA" / "what about UK" → similar with new filter
• "filter by electronics"          → add filter to previous
• "in Germany" / "from Japan"      → country filter
• "those products" / "that data"   → resolve references
• "top 5" / "sort by price"        → modify previous query

Example Conversation:

User: "show laptop sales by country"
→ SQL: SELECT country, SUM(total_amount) FROM sales WHERE product_name LIKE '%laptop%' GROUP BY country

User: "same for USA"
→ Follow-up detected (type: similar)
→ Resolved: "show laptop sales by country in USA"
→ SQL: SELECT country, SUM(total_amount) FROM sales WHERE product_name LIKE '%laptop%' AND country = 'USA' GROUP BY country

User: "top 5"
→ Follow-up detected (type: limit)
→ Resolved: "top 5 from: show laptop sales by country in USA"
→ SQL: ... ORDER BY total DESC LIMIT 5
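The follow-up detection above can be approximated with a small pattern table. The regexes and type names are illustrative, not the project's actual rules:

```python
import re

# Order matters: the first pattern that fires wins
FOLLOWUP_PATTERNS = [
    ("similar", re.compile(r"\b(?:same for|what about)\s+(\w+)", re.I)),
    ("limit",   re.compile(r"\btop\s+(\d+)\b", re.I)),
    ("expand",  re.compile(r"\b(?:show more|more details)\b", re.I)),
    ("filter",  re.compile(r"\bfilter by\s+(\w+)", re.I)),
]

def detect_followup(query):
    """Return (type, captured value) for the first matching pattern, else (None, None)."""
    for ftype, pattern in FOLLOWUP_PATTERNS:
        match = pattern.search(query)
        if match:
            value = match.group(1) if match.groups() else None
            return ftype, value
    return None, None
```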

Test Results (17 queries)

| Query Type | Example | Expected | Actual |
|------------|---------|----------|--------|
| Document query | "show me the box document CS" | Blocked | Blocked |
| Unknown country | "orders from France" | Blocked | Blocked |
| Unknown product | "sales for BMW" | Blocked | Blocked |
| Valid product | "show laptop sales" | SQL | SQL |
| Valid country | "orders from Germany" | SQL | SQL |
| Greeting | "hello" | Direct | Direct |
| Time query | "what time is it" | Direct | Direct |
| Help query | "what can you do" | Help | Help |
| Breakdown query | "breakdown of sales by region" | SQL | SQL |
| Revenue per | "revenue per country" | SQL | SQL |

Configuration

# Semantic classification thresholds
HIGH_CONFIDENCE_THRESHOLD = 0.75   # Direct classification
MEDIUM_CONFIDENCE_THRESHOLD = 0.60  # Continue to keyword check
LOW_CONFIDENCE_THRESHOLD = 0.30     # Fall back entirely

# Example counts by category
SEMANTIC_EXAMPLES = {
    'database': 33,      # "show all products", "revenue by country"
    'general': 18,       # "what time is it", "hello"
    'non_database': 20,  # "show me the document", "sales for BMW"
    'help': 8            # "what can you do", "show examples"
}

Files Involved

| File | Purpose |
|------|---------|
| query_classifier.py | Main hybrid classification logic |
| semantic_classifier.py | Embedding-based classification |
| product_mapper.py | Product disambiguation |
| schema_loader.py | Database schema utilities |

Future Improvements

  1. Multi-Shot Generation - Generate 2-3 SQL candidates and pick best
  2. LLM-as-Judge - Use LLM to validate results match user intent
  3. Hybrid Mode - Fast mode first, fallback to thorough mode on failure
  4. LLM Request Router - Replace brittle keyword/regex classification with a robust LLM-based router to handle natural language nuance (e.g., "biggest sales" implies sorting, not an entity named "biggest").
  5. Query Caching - Cache common query patterns
  6. Fine-tuning - Fine-tune smaller model on SQL generation task
  7. Active Learning - Identify queries that need human review
  8. Semantic Caching - Cache embedding vectors to reduce API calls
  9. Context Awareness - Remember conversation context for follow-up queries

Technical Details

Models Used

  • LLM: IBM Granite 3 8B Instruct (ibm/granite-3-8b-instruct)
  • Embeddings: IBM Slate (ibm/slate-125m-english-rtrvr-v2)

Databases

  • Application DB: SQLite (./data/database.db)
  • Learning Store: SQLite (./data/learning.db)

Learning Store Schema

-- Successful patterns
CREATE TABLE query_patterns (
    id INTEGER PRIMARY KEY,
    user_query TEXT,
    generated_sql TEXT,
    embedding TEXT,  -- JSON array of floats
    result_count INTEGER,
    execution_time_ms REAL,
    model_id TEXT,
    mode TEXT,
    thumbs_up INTEGER DEFAULT 0,
    thumbs_down INTEGER DEFAULT 0,
    created_at TIMESTAMP
);

-- Error patterns
CREATE TABLE error_patterns (
    id INTEGER PRIMARY KEY,
    user_query TEXT,
    failed_sql TEXT,
    error_message TEXT,
    error_type TEXT,
    model_id TEXT,
    mode TEXT,
    created_at TIMESTAMP
);
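Given that schema, storing a learned pattern is a straightforward insert with the embedding serialized to a JSON array (a sketch; the helper name and argument order are assumptions):

```python
import json
import sqlite3

def store_pattern(conn, user_query, sql, embedding, result_count,
                  time_ms, model_id, mode):
    """Persist a successful pattern; the embedding is stored as a JSON array."""
    conn.execute(
        """INSERT INTO query_patterns
           (user_query, generated_sql, embedding, result_count,
            execution_time_ms, model_id, mode, created_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, datetime('now'))""",
        (user_query, sql, json.dumps(embedding), result_count,
         time_ms, model_id, mode))
    conn.commit()
```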

Environment

  • Python 3.13
  • Streamlit 1.x
  • WatsonX API (IBM Cloud)
  • BeeAI Framework (IBM)
  • LangChain + LangChain-IBM