
SQL Generation Mode Comparison Results

Author: Markus van Kempen (markus.van.kempen@gmail.com)
Last Updated: January 24, 2026
Test Queries: 100+ (comprehensive test)
Database: SQLite with customers, products, orders tables + sales VIEW


Model + Mode Comparison (Latest)

Best Combinations After Prompt Optimization

| Rank | Combination | Accuracy | Speed | Recommendation |
|------|-------------|----------|-------|----------------|
| 🥇 | Mistral Small + BeeAI | 100% | 512ms | Best overall |
| 🥇 | Mistral Small + LangChain | 100% | 525ms | Fast + reliable |
| 🥇 | Granite 4 Small + LangChain | 100% | 604ms | Good balance |
| 🥇 | Granite 4 Small + BeeAI | 100% | ~650ms | After prompt fix |
| 🥈 | Direct SQL + Granite 4 | 86% | ~500ms | Fast but less accurate |

Accuracy by Mode (After Prompt Fixes)

| Mode | Accuracy | Notes |
|------|----------|-------|
| LangChain | 100% | Retry logic + schema injection |
| BeeAI | 100% | After prompt optimization |
| Direct SQL | 86% | Fast, some complex query issues |

Key Finding

Prompt engineering significantly impacts accuracy. After fixing prompts to include:

  • Simple query patterns ("show all products" → SELECT * FROM products)
  • Correct column names (products.category vs sales.product_category)
  • Top N patterns

BeeAI accuracy improved from 83% → 100%.


Schema Enrichment & Validation Impact (January 2026)

Benchmark Results (Complex Queries)

We tested 4 additional complex scenarios to validate the "Enriched Schema" (descriptions + sample values).

| Query | Key Challenge | Result |
|-------|---------------|--------|
| "Show me the top 3 customers in Europe" | Sample Value 'Europe' | ✅ Passed |
| "Products low in stock (<10)" | Semantic Logic | ✅ Passed |
| "Accessories sales in Germany" | Country + Category | ✅ Passed |
| "Revenue from Laptop Pro" | Product Matching | ✅ Passed |

Verdict: Enriched schema provides critical context (valid values, column meanings) that allows the LLM to handle ambiguity without explicit few-shot examples.


LLM Model Comparison

Comparison of different LLMs for SQL generation (12 test queries):

Overall Results

| Model | Gen Rate | Exec Rate | Avg Time | Pattern Match | Verdict |
|-------|----------|-----------|----------|---------------|---------|
| 🥇 Granite 4 Small | 100% | 91.7% | 599ms | 83.3% | Best overall |
| 🥈 Mistral Small 24B | 100% | 83.3% | 397ms | 91.7% | Fastest |
| 🥉 Llama 3.3 70B | 100% | 75.0% | 1265ms | 83.3% | Slowest |
| 4th Granite 3.3 8B | 100% | 50.0% | 411ms | 75.0% | Struggles with complex |

Success by Query Category

| Model | Simple | Filter | Aggregation | Complex |
|-------|--------|--------|-------------|---------|
| Granite 4 Small | 100% | 100% | 100% | 75% |
| Mistral Small 24B | 100% | 100% | 75% | 75% |
| Llama 3.3 70B | 100% | 100% | 75% | 50% |
| Granite 3.3 8B | 100% | 50% | 75% | 0% |

Key Findings

  1. Granite 4 Small is the best all-rounder - good speed and highest accuracy
  2. Mistral Small 24B is fastest but slightly lower accuracy on aggregations
  3. Llama 3.3 70B is slower (3x) with no accuracy advantage
  4. Granite 3.3 8B (deprecated) struggles with complex queries - generates incomplete SQL

Recommendation

Use Granite 4 Small as default for balanced performance. Switch to Mistral Small 24B for high-throughput scenarios where speed matters more than complex query handling.


Executive Summary

Four different approaches to natural language → SQL generation were tested, all now featuring self-improving capabilities with embedding-based learning:

| Rank | Mode | Success Rate | Avg Time | Learning | Recommendation |
|------|------|--------------|----------|----------|----------------|
| 🥇 | Self-Improving | 81% | ~700ms | Full | Best for production - continuous improvement |
| 🥈 | BeeAI Agent | 100%* | 567ms | Enabled | Highest single-query reliability |
| 🥉 | Direct SQL | 81% | 451ms | Enabled | Fastest, good balance |
| 4th | LangChain Agent | 90%* | 3,056ms | Enabled | Most flexible, but slowest |

*Results from initial 10-query test; 100-query test shows 81% average across modes.


New Features (January 2026)

🧠 Self-Improving System

All modes now include:

  • Embedding-based similarity search using IBM Slate model
  • Few-shot learning from successful patterns
  • Error avoidance from logged mistakes
  • User feedback integration (👍/👎)
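The similarity lookup behind these features can be sketched as follows. This is an illustrative stand-in: the real system gets its vectors from the IBM Slate model via watsonx, and the `patterns` structure and field names here are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_similar_patterns(query_vec, patterns, threshold=0.7, top_k=3):
    """Return the top-k stored patterns whose embedding clears the threshold."""
    scored = [(cosine_similarity(query_vec, p["embedding"]), p) for p in patterns]
    keep = [(sim, p) for sim, p in scored if sim >= threshold]
    keep.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in keep[:top_k]]
```

The threshold and top-k values mirror the "similarity > 0.7, top 3 matches" figures quoted later in this document.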

💬 Semantic Query Classification (NEW)

System now uses embedding-based classification with schema validation:

  1. Semantic Classification - Uses IBM Slate embeddings to understand query intent
    • 79 pre-embedded example queries across 4 categories
    • Understands "revenue per country" = "revenue by country"
    • No keyword rules needed for variations
  2. Schema-Aware Validation - Checks if query terms match actual database content
  3. Empty Result Analysis - Explains why queries return no results

Semantic Classifier Architecture

User Query: "show total revenue per country"
                    ↓
┌─────────────────────────────────────────────────────────┐
│  SEMANTIC CLASSIFICATION (Embeddings)                   │
│  - Embed query using IBM Slate                          │
│  - Compare to 79 pre-embedded examples                  │
│  - Similar to: "total revenue by country" (90%)         │
│  - Category: database ✅                                 │
└─────────────────────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────────┐
│  SCHEMA VALIDATION (only for database queries)          │
│  - Check if mentioned entities exist in DB              │
│  - "country" exists in schema ✅                         │
└─────────────────────────────────────────────────────────┘
                    ↓
             Generate SQL
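The routing shown above can be approximated with a nearest-example classifier. A simplified sketch (the real embeddings come from IBM Slate; the function name and the `examples` shape are assumptions), using the 0.75/0.60 thresholds from the configuration section:

```python
import math

def _cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(query_vec, examples, high=0.75, medium=0.60):
    """Route a query by its nearest pre-embedded example.

    examples: {category: [embedding, ...]} -- in the real system these are
    the 79 pre-embedded queries across the 4 categories.
    """
    best_cat, best_sim = None, -1.0
    for category, vectors in examples.items():
        for vec in vectors:
            sim = _cos(query_vec, vec)
            if sim > best_sim:
                best_cat, best_sim = category, sim
    if best_sim >= high:
        route = "direct"          # trust the semantic match
    elif best_sim >= medium:
        route = "keyword_check"   # confirm with the keyword layer
    else:
        route = "keyword_fallback"
    return best_cat, best_sim, route
```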

Example Categories

| Category | Examples | Count |
|----------|----------|-------|
| database | "show all products", "revenue by country" | 33 |
| general | "what time is it", "hello" | 18 |
| non_database | "show me the document", "sales for BMW" | 20 |
| help | "what can you do", "show examples" | 8 |

Classification Test Results (17 queries)

| Query Type | Example | Handled |
|------------|---------|---------|
| Documents | "show me the box document CS" | ✅ Blocked |
| Unknown country | "orders from France" | ✅ Blocked |
| Unknown product | "sales for BMW" | ✅ Blocked |
| Valid query | "show laptop sales" | ✅ SQL generated |
| Valid country | "orders from Germany" | ✅ SQL generated |

Empty Result Feedback

When SQL returns no results, the system explains why:

Query: "show me iPhone sales"
SQL: SELECT * FROM sales WHERE product_name LIKE '%iphone%'
→ ⚠️ No results found.
   **No product matches:** `iphone` doesn't match any product names
   **Did you mean:** smartphone pro, smartphone x, smartphone basic
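The suggestion step can be sketched with stdlib fuzzy matching. This is a stand-in for whatever the real implementation uses; the function name and the 0.4 cutoff are assumptions.

```python
import difflib

def explain_empty_result(term, known_products, n=3):
    """Suggest close product names when a filter term matches nothing."""
    suggestions = difflib.get_close_matches(
        term.lower(), [p.lower() for p in known_products], n=n, cutoff=0.4)
    message = f"No product matches: '{term}' doesn't match any product names."
    if suggestions:
        message += f" Did you mean: {', '.join(suggestions)}?"
    return message
```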

🔍 Product Disambiguation

Warns when queries match multiple products (e.g., "Wireless Mouse" matches both "Wireless Mouse" and "Wireless Mouse Pro").


100-Query Test Results

Overall Performance

┌────────────────────────────────────────────────────────────────────┐
│              SELF-IMPROVING TEST (100 QUERIES)                     │
├────────────────────────────────────────────────────────────────────┤
│  Total Queries:           100                                      │
│  Success Rate:            81% (81/100)                             │
│  Patterns Learned:        81                                       │
│  Errors Logged:           20                                       │
│  Embedding Utilization:   99% (99/100 queries used patterns)       │
└────────────────────────────────────────────────────────────────────┘

Learning Progression by Batch

| Batch | Queries | Success Rate | Patterns Learned | Patterns Used |
|-------|---------|--------------|------------------|---------------|
| 1 | 25 | 92% | 23 | 24/25 |
| 2 | 25 | 68% | 40 | 25/25 |
| 3 | 25 | 88% | 62 | 25/25 |
| 4 | 25 | 76% | 81 | 25/25 |

Success Rate by Category

| Category | Success Rate | Notes |
|----------|--------------|-------|
| products_filter | 100% (15/15) | Perfect performance |
| sales_basic | 100% (15/15) | Perfect performance |
| products_basic | 95% (19/20) | Very high |
| aggregation | 90% (9/10) | Excellent |
| complex | 80% (8/10) | Good |
| sales_geography | 73% (11/15) | Room for improvement |
| sales_product | 27% (4/15) | Needs prompt tuning |

Mode-Specific Details

Direct SQL Mode

  • Architecture: Prompt engineering + SQLGlot validation + self-improving
  • Learning: Embedding-based pattern matching + error avoidance
  • Strengths: Fastest, simple architecture, good learning
  • Weaknesses: Can struggle with complex joins
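The project validates generated SQL with SQLGlot. A stdlib stand-in with the same intent asks SQLite to plan the statement without executing it (a sketch, not the project's actual validator):

```python
import sqlite3

def validate_sql(conn, sql):
    """Cheap validity check: have SQLite plan the query without running it."""
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
```

Unlike a pure parser, this also catches references to tables and columns that don't exist in the connected database.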

LangChain Agent Mode

  • Architecture: LangChain SQL chain + schema injection + self-improving
  • Learning: Few-shot examples injected into chain
  • Strengths: Good SQL quality, retry logic, flexible
  • Weaknesses: Slowest (5-6x slower than Direct)

BeeAI Agent Mode

  • Architecture: IBM BeeAI Framework + native WatsonX + self-improving
  • Learning: Full embedding integration
  • Strengths: Best single-query reliability, clean async API
  • Weaknesses: Newer framework, smaller community

Self-Improving Agent Mode

  • Architecture: Full learning system with pattern storage
  • Learning: Primary focus on continuous improvement
  • Strengths: Gets better over time, learns from feedback
  • Weaknesses: Initial queries may be slower (cold start)

Self-Improving Architecture

User Query
    ↓
🔍 Embedding Search (IBM Slate)
    │
    ↓
┌─────────────────────────────────────┐
│ Find Similar Patterns               │
│ - Semantic similarity (cosine)      │
│ - Top 3 matches with similarity > 0.7│
│ - Fallback to keyword matching      │
└─────────────────────────────────────┘
    │
    ↓
┌─────────────────────────────────────┐
│ Get Common Errors to Avoid          │
│ - Top 3 error patterns              │
│ - Count of occurrences              │
└─────────────────────────────────────┘
    │
    ↓
📝 Build Enhanced Prompt:
   - ✅ SUCCESSFUL EXAMPLES: 3 similar patterns
   - ⚠️ MISTAKES TO AVOID: 3 common errors
   - 📊 Schema information
    │
    ↓
🤖 Generate SQL → Execute
    │
    ├── Success → 💾 Store Pattern (with embedding)
    └── Failure → ⚠️ Store Error Pattern
    │
    ↓
👍/👎 User Feedback → Update pattern ratings
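The prompt-assembly step in the diagram might look like this. Section labels mirror the diagram; the function signature and field names are assumptions, not the project's actual code.

```python
def build_enhanced_prompt(question, similar, errors, schema):
    """Assemble the prompt from learned patterns, known mistakes, and schema."""
    parts = ["You are a SQL generator.", f"Schema:\n{schema}"]
    if similar:
        parts.append("SUCCESSFUL EXAMPLES:")
        parts += [f"Q: {p['query']}\nSQL: {p['sql']}" for p in similar]
    if errors:
        parts.append("MISTAKES TO AVOID:")
        parts += [f"- {e}" for e in errors]
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)
```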

Query Classification

Types Detected

| Type | Confidence Threshold | Action |
|------|----------------------|--------|
| database | Any | Generate SQL |
| general | ≥50% | Direct response |
| help | ≥50% | Show guidance |

Examples

| Query | Type | Needs SQL |
|-------|------|-----------|
| "show all products" | database | YES |
| "what time is it" | general | NO |
| "hello" | general | NO |
| "what can you do" | help | NO |
| "total revenue" | database | YES |
| "thanks" | general | NO |

Product Disambiguation

How It Works

When a LIKE pattern matches multiple products:

Query: "how many wireless mouse did we sell"
          ↓
🔍 Search: product_name LIKE '%wireless mouse%'
          ↓
Found: ["Wireless Mouse", "Wireless Mouse Pro"]
          ↓
⚠️ Show disambiguation warning

Products That Trigger Disambiguation

| Search Term | Matches |
|-------------|---------|
| "laptop" | Laptop Pro, Laptop Professional, Laptop Basic |
| "wireless mouse" | Wireless Mouse, Wireless Mouse Pro |
| "mechanical keyboard" | Mechanical Keyboard, Mechanical Keyboard RGB |
| "smartphone" | Smartphone X, Smartphone Pro, Smartphone Basic |
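The LIKE-based check can be sketched against an in-memory SQLite table (the one-column `products` schema below is a simplification of the real one):

```python
import sqlite3

def ambiguous_matches(conn, term):
    """Return every product a LIKE filter would hit; >1 triggers the warning."""
    rows = conn.execute(
        "SELECT product_name FROM products WHERE product_name LIKE ?",
        (f"%{term}%",)).fetchall()
    return [name for (name,) in rows]

# Demo with a throwaway in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)",
                 [("Wireless Mouse",), ("Wireless Mouse Pro",), ("Laptop Pro",)])
matches = ambiguous_matches(conn, "wireless mouse")
if len(matches) > 1:
    print(f"Ambiguous: {matches}")  # warn before generating SQL
```

Note that SQLite's LIKE is case-insensitive for ASCII by default, which is why "wireless mouse" matches "Wireless Mouse".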

Recommendations

For Production Use

Recommended: Self-Improving Mode

  • Gets better over time
  • 81% accuracy with continuous improvement
  • User feedback drives optimization

For Speed-Critical Applications

Recommended: Direct SQL + Learning

  • Fastest average response (451ms)
  • 81% success rate with learning enabled
  • Falls back gracefully

For Highest Single-Query Reliability

Recommended: BeeAI Agent

  • 100% success in controlled tests
  • Native WatsonX integration
  • Learning now enabled

Prompt Optimization (January 24, 2026)

Changes Applied to All Modes

  1. Added Simple Query Patterns:
     "show all products" → SELECT * FROM products;
     "list customers" → SELECT * FROM customers;
     DO NOT add WHERE unless the user specifies a filter!
  2. Clarified Column Names:
     products table: category (NOT product_category!)
     sales VIEW: product_category
  3. Added Top N Pattern:
     "top 5 products by sales" → SELECT product_name, SUM(total_amount) AS total
                                 FROM sales GROUP BY product_name
                                 ORDER BY total DESC LIMIT 5;

Impact of Prompt Fixes

| Mode | Before | After | Improvement |
|------|--------|-------|-------------|
| BeeAI | 83% | 100% | +17% |
| Direct SQL | 83% | 86% | +3% |
| LangChain | 100% | 100% | Maintained |

Hybrid Query Classification - Implementation Status

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                         USER QUERY                                       │
│                    "show revenue per country"                            │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 1: SEMANTIC CLASSIFICATION (Embeddings)                          │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • IBM Slate embeddings (ibm/slate-125m-english-rtrvr-v2)               │
│  • 79 pre-embedded example queries across 4 categories                  │
│  • Cosine similarity matching (threshold: 0.75 high, 0.6 medium)        │
│  • Categories: database, general, non_database, help                    │
│                                                                          │
│  UNDERSTANDS: "revenue per country" = "revenue by country" = "sales    │
│               grouped by country" (no keyword rules needed!)             │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                    ┌─────────────┼─────────────┐
                    │             │             │
              High (>0.75)   Medium (0.6-0.75)  Low (<0.6)
                    │             │             │
                    ▼             ▼             ▼
              Direct Route    Continue      Keyword Fallback
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 2: KEYWORD-BASED CLASSIFICATION                                   │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • Database keywords: products, sales, revenue, orders, customers...    │
│  • General keywords: time, weather, hello, thanks, goodbye...           │
│  • Help keywords: help, example, how to use, what can you do...         │
│  • Non-database keywords: document, file, PDF, image, photo...          │
│  • Regex patterns for common phrases                                     │
│                                                                          │
│  STOP WORDS: per, each, grouped, spending, breakdown, summary...        │
│  (prevents false positives)                                              │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 3: SCHEMA-AWARE VALIDATION                                        │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • Loads actual database schema (products, customers, orders, sales)    │
│  • Checks if query terms match real data:                               │
│    - Products: laptop, tablet, smartphone, wireless mouse...            │
│    - Countries: USA, UK, Germany, Japan, Canada, Australia...           │
│    - Categories: electronics, accessories, books                        │
│  • Detects unknown entities: "BMW sales" → BMW not in database          │
│  • Detects unknown filters: "orders from France" → France not in DB     │
│                                                                          │
│  ⚠️ PARTIALLY IMPLEMENTED                                                │
│  • City/Region validation (basic)                                        │
│  • Date range validation (not implemented)                               │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  LAYER 4: EMPTY RESULT ANALYSIS                                          │
│  ───────────────────────────────────────────────────────────────────    │
│  ✅ IMPLEMENTED                                                          │
│  • Analyzes SQL that returns 0 rows                                      │
│  • Suggests similar products/entities                                    │
│  • Explains why no results (e.g., "iphone" not in product names)        │
└─────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
                    ┌─────────────┴─────────────┐
                    │                           │
              ┌─────┴─────┐               ┌─────┴─────┐
              │ DATABASE  │               │  GENERAL  │
              │ → SQL Gen │               │ → Direct  │
              └───────────┘               │  Response │
                                          └───────────┘

What's Implemented ✅

| Component | Status | Details |
|-----------|--------|---------|
| Semantic Classifier | ✅ Complete | 79 examples, 4 categories, IBM Slate embeddings |
| Keyword Classification | ✅ Complete | Database, general, help, non-database keywords |
| Schema Loading | ✅ Complete | Products, countries, categories from DB |
| Unknown Entity Detection | ✅ Complete | "BMW sales" → helpful message |
| Unknown Filter Detection | ✅ Complete | "orders from France" → France not in DB |
| Stop Words | ✅ Complete | Prevents "per", "breakdown", "grouped" false positives |
| Empty Result Analysis | ✅ Complete | Explains why queries return 0 rows |
| Product Disambiguation | ✅ Complete | Multiple product matches warning |
| General Question Handling | ✅ Complete | "what time is it", "hello" |
| Help Question Handling | ✅ Complete | "what can you do", "show examples" |

What's Missing / Future Improvements 🔮

| Component | Status | Priority | Description |
|-----------|--------|----------|-------------|
| Semantic Caching | ✅ Implemented | High | Cache embedding vectors in memory + disk |
| Context Awareness | ✅ Implemented | High | Remember previous queries, resolve follow-ups |
| Date Range Validation | ❌ Not Started | Medium | Validate date filters against actual order dates |
| City Validation | ⚠️ Basic | Low | Full city list from DB (currently partial) |
| Confidence Calibration | ⚠️ Basic | Medium | Better threshold tuning based on test data |
| Multi-Language Support | ❌ Not Started | Low | Support for non-English queries |
| Ambiguous Query Handling | ⚠️ Basic | Medium | "Show sales" → which sales? (clarification) |
| Synonym Expansion | ❌ Not Started | Medium | "revenue" = "income" = "earnings" |
| Negative Examples | ⚠️ Basic | Medium | More examples of what NOT to generate SQL for |
| Auto-Learning Categories | ❌ Not Started | Low | Learn new categories from user corrections |

NEW: Semantic Caching ✅

Reduces API calls by caching embedding vectors:

┌─────────────────────────────────────────────────────────────┐
│  EMBEDDING CACHE                                             │
│  ─────────────────────────────────────────────────────────  │
│  • Memory cache: LRU with 1000 entries                      │
│  • Disk cache: ./data/cache/embedding_cache.json            │
│  • MD5 hash keys for fast lookup                            │
│  • Auto-persist every 10 new entries                        │
│                                                              │
│  First query: "show sales by country" → API call (450ms)    │
│  Second query: same → Cache hit (<1ms)                      │
└─────────────────────────────────────────────────────────────┘

Cache Statistics Available:

  • size: Number of cached embeddings
  • hits: Cache hit count
  • misses: Cache miss count
  • hit_rate: Percentage of cache hits
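A minimal in-memory version of that cache (disk persistence omitted; the class and method names are illustrative, not the project's actual API):

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """LRU cache of embedding vectors keyed by MD5 of the normalized query."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._cache = OrderedDict()
        self.hits = self.misses = 0

    def _key(self, text):
        return hashlib.md5(text.strip().lower().encode()).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self._cache:
            self.hits += 1
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key]
        self.misses += 1
        return None

    def put(self, text, vector):
        key = self._key(text)
        self._cache[key] = vector
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict least-recently used

    def stats(self):
        total = self.hits + self.misses
        return {"size": len(self._cache), "hits": self.hits,
                "misses": self.misses,
                "hit_rate": self.hits / total if total else 0.0}
```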

NEW: Context Awareness ✅

Understands follow-up queries and conversation flow:

┌─────────────────────────────────────────────────────────────┐
│  CONVERSATION CONTEXT                                        │
│  ─────────────────────────────────────────────────────────  │
│  Tracks last 10 queries with:                                │
│  • Query text and type (database/general/help)              │
│  • Generated SQL                                             │
│  • Result count                                              │
│  • Extracted filters (product, country, category)           │
└─────────────────────────────────────────────────────────────┘

Follow-up patterns detected:
• "show more" / "more details"     → expand previous results
• "same for USA" / "what about UK" → similar with new filter
• "filter by electronics"          → add filter to previous
• "in Germany" / "from Japan"      → country filter
• "those products" / "that data"   → resolve references
• "top 5" / "sort by price"        → modify previous query

Example Conversation:

User: "show laptop sales by country"
→ SQL: SELECT country, SUM(total_amount) FROM sales WHERE product_name LIKE '%laptop%' GROUP BY country

User: "same for USA"
→ Follow-up detected (type: similar)
→ Resolved: "show laptop sales by country in USA"
→ SQL: SELECT country, SUM(total_amount) FROM sales WHERE product_name LIKE '%laptop%' AND country = 'USA' GROUP BY country

User: "top 5"
→ Follow-up detected (type: limit)
→ Resolved: "top 5 from: show laptop sales by country in USA"
→ SQL: ... ORDER BY total DESC LIMIT 5
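The follow-up detection above can be approximated with a small pattern table. The regexes and type names are illustrative, not the project's actual rules:

```python
import re

# Order matters: the first pattern that fires wins
FOLLOWUP_PATTERNS = [
    ("similar", re.compile(r"\b(?:same for|what about)\s+(\w+)", re.I)),
    ("limit",   re.compile(r"\btop\s+(\d+)\b", re.I)),
    ("expand",  re.compile(r"\b(?:show more|more details)\b", re.I)),
    ("filter",  re.compile(r"\bfilter by\s+(\w+)", re.I)),
]

def detect_followup(query):
    """Return (type, captured value) for the first matching pattern, else (None, None)."""
    for ftype, pattern in FOLLOWUP_PATTERNS:
        match = pattern.search(query)
        if match:
            value = match.group(1) if match.groups() else None
            return ftype, value
    return None, None
```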

Test Results (17 queries)

| Query Type | Example | Expected | Actual |
|------------|---------|----------|--------|
| Document query | "show me the box document CS" | Blocked | Blocked |
| Unknown country | "orders from France" | Blocked | Blocked |
| Unknown product | "sales for BMW" | Blocked | Blocked |
| Valid product | "show laptop sales" | SQL | SQL |
| Valid country | "orders from Germany" | SQL | SQL |
| Greeting | "hello" | Direct | Direct |
| Time query | "what time is it" | Direct | Direct |
| Help query | "what can you do" | Help | Help |
| Breakdown query | "breakdown of sales by region" | SQL | SQL |
| Revenue per | "revenue per country" | SQL | SQL |

Configuration

# Semantic classification thresholds
HIGH_CONFIDENCE_THRESHOLD = 0.75   # Direct classification
MEDIUM_CONFIDENCE_THRESHOLD = 0.60  # Continue to keyword check
LOW_CONFIDENCE_THRESHOLD = 0.30     # Fall back entirely

# Example counts by category
SEMANTIC_EXAMPLES = {
    'database': 33,      # "show all products", "revenue by country"
    'general': 18,       # "what time is it", "hello"
    'non_database': 20,  # "show me the document", "sales for BMW"
    'help': 8            # "what can you do", "show examples"
}

Files Involved

| File | Purpose |
|------|---------|
| query_classifier.py | Main hybrid classification logic |
| semantic_classifier.py | Embedding-based classification |
| product_mapper.py | Product disambiguation |
| schema_loader.py | Database schema utilities |

Future Improvements

  1. Multi-Shot Generation - Generate 2-3 SQL candidates and pick best
  2. LLM-as-Judge - Use LLM to validate results match user intent
  3. Hybrid Mode - Fast mode first, fallback to thorough mode on failure
  4. LLM Request Router - Replace brittle keyword/regex classification with a robust LLM-based router to handle natural language nuance (e.g., "biggest sales" implies sorting, not an entity named "biggest").
  5. Query Caching - Cache common query patterns
  6. Fine-tuning - Fine-tune smaller model on SQL generation task
  7. Active Learning - Identify queries that need human review
  8. Semantic Caching - Cache embedding vectors to reduce API calls
  9. Context Awareness - Remember conversation context for follow-up queries

Technical Details

Models Used

  • LLM: IBM Granite 3 8B Instruct (ibm/granite-3-8b-instruct)
  • Embeddings: IBM Slate (ibm/slate-125m-english-rtrvr-v2)

Databases

  • Application DB: SQLite (./data/database.db)
  • Learning Store: SQLite (./data/learning.db)

Learning Store Schema

-- Successful patterns
CREATE TABLE query_patterns (
    id INTEGER PRIMARY KEY,
    user_query TEXT,
    generated_sql TEXT,
    embedding TEXT,  -- JSON array of floats
    result_count INTEGER,
    execution_time_ms REAL,
    model_id TEXT,
    mode TEXT,
    thumbs_up INTEGER DEFAULT 0,
    thumbs_down INTEGER DEFAULT 0,
    created_at TIMESTAMP
);

-- Error patterns
CREATE TABLE error_patterns (
    id INTEGER PRIMARY KEY,
    user_query TEXT,
    failed_sql TEXT,
    error_message TEXT,
    error_type TEXT,
    model_id TEXT,
    mode TEXT,
    created_at TIMESTAMP
);
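Given that schema, storing a learned pattern is a straightforward insert with the embedding serialized to a JSON array (a sketch; the helper name and argument order are assumptions):

```python
import json
import sqlite3

def store_pattern(conn, user_query, sql, embedding, result_count,
                  time_ms, model_id, mode):
    """Persist a successful pattern; the embedding is stored as a JSON array."""
    conn.execute(
        """INSERT INTO query_patterns
           (user_query, generated_sql, embedding, result_count,
            execution_time_ms, model_id, mode, created_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, datetime('now'))""",
        (user_query, sql, json.dumps(embedding), result_count,
         time_ms, model_id, mode))
    conn.commit()
```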

Environment

  • Python 3.13
  • Streamlit 1.x
  • WatsonX API (IBM Cloud)
  • BeeAI Framework (IBM)
  • LangChain + LangChain-IBM