Commit 6715468

Feature/conceptual search (#123)
* boilerplate api
* refactored project base
* updated gitignore to include intugle project base folder
* column graph built, ruff linting
* integrated retriever
* settings networkx configs added
* added table graph kg initialization
* agent integration testing
* needs testing
* added notebook compatibility for create graphs, implemented retriever for data product builder
* added langgraph for dev
* added force recreate for qdrant
* added project id
* removed domain, removed cost tracking
* corrected tool outputs
* data product planner APIs
* storing plan in memory
* add etl model mapping
* added query cleaning
* add conceptual search api to data product
* removed unused import in dp
* conceptual search test cases and quickstart
* added tavily key conditional handling, added set project base function
* conditional fix in prompt
* added docs and notebook
* upgraded version to 1.1.0, updated notebook docs
1 parent 29f963e commit 6715468

52 files changed

Lines changed: 3086 additions & 73 deletions


README.md

Lines changed: 3 additions & 1 deletion

@@ -42,6 +42,8 @@ Intugle’s GenAI-powered open-source Python library builds a semantic data model
 * **Semantic Data Model -** Transform raw, fragmented datasets into an intelligent semantic graph that captures entities, relationships, and context — the foundation for connected intelligence.
 * **Business Glossary & Semantic Search:** Auto-generate a business glossary and enable search that understands meaning, not just keywords — making data more accessible across technical and business users.
 * **Data Products -** Instantly generate SQL and reusable data products enriched with context, eliminating manual pipelines and accelerating data-to-insight.
+* **Conceptual Search -** Generate data product plans from natural language queries, bridging the gap between business questions and executable data product definitions. Learn more in the [documentation](https://intugle.github.io/data-tools/docs/core-concepts/data-product/conceptual-search).
+

 ## Getting Started

@@ -107,7 +109,7 @@ For a detailed, hands-on introduction to the project, please see our quickstart
 | **Native Snowflake with Cortex Analyst [ Tech Manufacturing ]** | [`quickstart_native_snowflake.ipynb`](notebooks/quickstart_native_snowflake.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_native_snowflake.ipynb) |
 | **Native Databricks with AI/BI Genie [ Tech Manufacturing ]** | [`quickstart_native_databricks.ipynb`](notebooks/quickstart_native_databricks.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_native_databricks.ipynb) |
 | **Streamlit App** | [`quickstart_streamlit.ipynb`](notebooks/quickstart_streamlit.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_streamlit.ipynb) |
-
+| **Conceptual Search** | [`quickstart_conceptual_search.ipynb`](notebooks/quickstart_conceptual_search.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_conceptual_search.ipynb) |

 These datasets will take you through the following steps:

 * **Generate Semantic Model** → The unified layer that transforms fragmented datasets, creating the foundation for connected intelligence.
docsite/docs/core-concepts/data-product/conceptual-search.md

Lines changed: 119 additions & 0 deletions

@@ -0,0 +1,119 @@
---
sidebar_position: 8
title: Conceptual Search
---

# Conceptual Search

:::info Experimental Feature
Conceptual Search is an experimental feature. The API and functionality may change in future releases.
:::

Conceptual Search is an AI-powered feature that allows you to generate a data product plan from a natural language query. It bridges the gap between a high-level business question and a concrete, executable data product definition.

## Overview

At its core, Conceptual Search uses a two-stage process orchestrated by AI agents and knowledge graphs:

1. **Knowledge Graphs**: The system builds knowledge graphs for both database tables and columns. Nodes represent tables/columns, and edges connect conceptually related items based on semantic similarity and shared concepts extracted by an LLM.
2. **Graph-Based Retrievers**: When you search, the system uses a hybrid approach of vector search and graph traversal to find relevant tables and columns, even if they are not direct keyword matches.
3. **AI Agents**: The process is managed by two LangChain/LangGraph agents: a `DataProductPlannerAgent` and a `DataProductBuilderAgent`.
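The hybrid retrieval described in point 2 can be sketched in miniature. This is an illustrative toy, not Intugle's implementation: the embeddings are hand-made three-dimensional vectors, the concept graph is a plain dict, and the expansion is a single hop, whereas the real retriever runs vector search in Qdrant over LLM-generated embeddings and traverses the built knowledge graph.

```python
import math

# Toy column embeddings (the real system stores LLM embeddings in Qdrant).
embeddings = {
    "customers.customer_id": [1.0, 0.0, 0.1],
    "orders.customer_id":    [0.9, 0.1, 0.1],
    "orders.amount":         [0.1, 1.0, 0.0],
    "orders.order_date":     [0.0, 0.2, 1.0],
}

# Toy concept graph: edges link conceptually related columns.
graph = {
    "customers.customer_id": {"orders.customer_id"},
    "orders.customer_id": {"customers.customer_id", "orders.amount"},
    "orders.amount": {"orders.customer_id"},
    "orders.order_date": set(),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_retrieve(query_vec, top_k=1):
    """Vector search for seed nodes, then one-hop graph expansion."""
    ranked = sorted(embeddings, key=lambda n: cosine(query_vec, embeddings[n]), reverse=True)
    seeds = ranked[:top_k]
    expanded = set(seeds)
    for node in seeds:
        expanded |= graph[node]  # pull in conceptual neighbours
    return seeds, sorted(expanded)

seeds, results = hybrid_retrieve([0.15, 0.95, 0.05])  # query resembling "purchase amount"
print(seeds)    # best vector match
print(results)  # match plus graph neighbours
```

Note how `orders.customer_id` is a weak vector match for the query yet still appears in the results: it is pulled in through a concept edge, which is exactly why graph expansion finds columns that keyword or pure vector search would miss.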
## The Two-Stage Workflow

### Stage 1: Planning

The goal of this stage is to convert a vague user request (e.g., "customer churn metrics") into a structured `DataProductPlan`, which is a well-defined list of dimensions and measures.

1. **Input**: A natural language query.
2. **Agent's Task**: The `DataProductPlannerAgent` uses its tools to find relevant database tables and existing data products.
3. **Output**: The agent produces a `DataProductPlan` object, which can be reviewed and modified by the user.

:::tip User Validation is Key
The `DataProductPlan` generated by the AI is a starting point. It is crucial to review and validate this plan to ensure it aligns with your business requirements before proceeding to the building stage.
:::

### Stage 2: Building

This stage takes the abstract `DataProductPlan` and maps each attribute to a specific, physical database column, defining its logic (e.g., aggregation for measures).

1. **Input**: The `DataProductPlan` from Stage 1.
2. **Agent's Task**: The `DataProductBuilderAgent` iterates through each attribute in the plan, using the graph-based column retriever to find the most relevant physical column.
3. **Output**: The collected mappings are assembled into a final `ETLModel`, which is a complete, machine-readable definition of the data product, ready to be used to generate a SQL query.
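The building loop above can be sketched as follows. Everything here is a stand-in for illustration: the attribute dicts, the candidate scores, and `build_etl_mappings` are hypothetical shapes, and the real `DataProductBuilderAgent` uses an LLM together with the graph retriever rather than a plain best-score pick.

```python
# Sketch of the Stage 2 loop: each plan attribute is mapped to the
# best-scoring physical column, and measures get an aggregation.
plan_attributes = [
    {"name": "customer name", "classification": "Dimension"},
    {"name": "total purchase amount", "classification": "Measure"},
]

# Candidate columns a (hypothetical) column retriever returned, with scores.
candidates = {
    "customer name": [("customers.name", 0.92), ("customers.id", 0.40)],
    "total purchase amount": [("orders.amount", 0.88), ("orders.tax", 0.35)],
}

def build_etl_mappings(attributes, candidates):
    mappings = []
    for attr in attributes:
        # Pick the highest-scoring physical column for this attribute.
        column, _score = max(candidates[attr["name"]], key=lambda c: c[1])
        mapping = {"attribute": attr["name"], "column": column}
        if attr["classification"] == "Measure":
            mapping["aggregation"] = "sum"  # measures need an aggregation
        mappings.append(mapping)
    return mappings

etl_mappings = build_etl_mappings(plan_attributes, candidates)
print(etl_mappings)
```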
## Usage Example

```python
from intugle import DataProduct

dp = DataProduct()

# 1. Generate a plan from a natural language query
plan = await dp.plan(query="top 10 customers by their total purchase amount")

# 2. Review and modify the plan
print("Original Plan:")
plan.display()

plan.rename_attribute("total purchase amount", "Total Spend")
plan.disable_attribute("customer address")  # Assuming this was in the plan

print("\nModified Plan:")
plan.display()

# 3. Create the ETL model from the modified plan
etl_model = await dp.create_etl_model_from_plan(plan)

# 4. Build the data product
result_dataset = dp.build(etl=etl_model)

# 5. Access the results
print(result_dataset.to_df())
```
## Modifying the Data Product Plan

The `DataProductPlan` object is not just a static output; it's an interactive object that you can modify to refine the AI's suggestions. This allows you to correct any misunderstandings or add your own domain knowledge to the plan.

Here are the available methods to modify the plan:

| Method | Description | Example |
| --- | --- | --- |
| `rename_attribute(old, new)` | Renames an existing attribute. | `plan.rename_attribute('Customer ID', 'Client Identifier')` |
| `set_attribute_description(name, desc)` | Updates the description of an attribute. | `plan.set_attribute_description('Client Identifier', 'The unique ID for each client')` |
| `set_attribute_classification(name, class)` | Changes the classification to 'Dimension' or 'Measure'. | `plan.set_attribute_classification('Total Sales', 'Measure')` |
| `disable_attribute(name)` | Deactivates an attribute so it won't be included in the final data product. | `plan.disable_attribute('Customer Address')` |
| `enable_attribute(name)` | Reactivates a previously disabled attribute. | `plan.enable_attribute('Customer Address')` |
| `to_df()` | Returns the final plan as a pandas DataFrame with only active attributes. | `final_plan_df = plan.to_df()` |
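To make the rename/disable/enable semantics concrete, here is a toy stand-in class. This is not Intugle's `DataProductPlan`, only an illustration of how the methods in the table behave; `active_attributes()` plays the role of `to_df()` without requiring pandas.

```python
class ToyPlan:
    """Illustrative stand-in for DataProductPlan (not the real class)."""

    def __init__(self, attributes):
        # each attribute: name -> {"classification": ..., "active": True}
        self.attributes = {name: {"classification": cls, "active": True}
                           for name, cls in attributes}

    def rename_attribute(self, old, new):
        self.attributes[new] = self.attributes.pop(old)

    def set_attribute_classification(self, name, classification):
        self.attributes[name]["classification"] = classification

    def disable_attribute(self, name):
        self.attributes[name]["active"] = False

    def enable_attribute(self, name):
        self.attributes[name]["active"] = True

    def active_attributes(self):
        # analogous to to_df(): only active attributes survive
        return [a for a, meta in self.attributes.items() if meta["active"]]

plan = ToyPlan([("customer id", "Dimension"),
                ("total purchase amount", "Measure"),
                ("customer address", "Dimension")])
plan.rename_attribute("total purchase amount", "Total Spend")
plan.disable_attribute("customer address")
print(plan.active_attributes())  # ['customer id', 'Total Spend']
```

Disabling is non-destructive: the attribute stays in the plan and a later `enable_attribute` call brings it back into the final output.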
## Qdrant Server Requirement

Conceptual Search utilizes [Qdrant](https://qdrant.tech/) as its vector database for efficient retrieval of relevant tables and columns. Therefore, a running Qdrant instance is required.

You can easily set up a Qdrant server using Docker:

```bash
docker run -d -p 6333:6333 -p 6334:6334 \
  -v qdrant_storage:/qdrant/storage:z \
  --name qdrant qdrant/qdrant
```

After starting the Qdrant server, you need to configure its URL and API key (if authorization is used) in your environment variables:

```bash
export QDRANT_URL="http://localhost:6333"
export QDRANT_API_KEY="your-qdrant-api-key"  # if authorization is used
```
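Before building graphs you may want to confirm the instance is reachable. A minimal stdlib check is sketched below; it assumes Qdrant's `/healthz` liveness endpoint and the `QDRANT_URL` variable shown above, and returns `False` rather than raising when no server answers.

```python
import os
import urllib.error
import urllib.request

def qdrant_is_up(url=None, timeout=2.0):
    """Return True if a Qdrant server answers on its HTTP port."""
    base = url or os.environ.get("QDRANT_URL", "http://localhost:6333")
    try:
        # /healthz is Qdrant's liveness endpoint (assumed here).
        with urllib.request.urlopen(base.rstrip("/") + "/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(qdrant_is_up())
```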
## Enhancing Performance with Tavily Web Search

For better performance and more contextually aware data product plans, it is recommended to use the Tavily web search tool. This allows the planning agent to research industry best practices and common metrics related to your query.

To enable this feature, get an API key from [Tavily](https://tavily.com/) and set it as an environment variable:

```bash
export TAVILY_API_KEY="your-tavily-api-key"
```
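The commit notes that the Tavily key is handled conditionally, so the planner simply runs without web search when no key is present. A sketch of that pattern follows; the function and the tool placeholder are hypothetical, not Intugle's API.

```python
import os

def maybe_enable_web_search(tools):
    """Append a web-search tool only when a Tavily key is configured (pattern sketch)."""
    if os.environ.get("TAVILY_API_KEY"):
        tools.append("tavily_web_search")  # stand-in for the real tool object
    return tools

os.environ.pop("TAVILY_API_KEY", None)
print(maybe_enable_web_search([]))  # [] when no key is set

os.environ["TAVILY_API_KEY"] = "your-tavily-api-key"
print(maybe_enable_web_search([]))  # ['tavily_web_search']
```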

docsite/docs/core-concepts/data-product/index.md

Lines changed: 1 addition & 0 deletions

@@ -93,3 +93,4 @@ The `DataProduct` class provides a powerful way to query your connected data with
 * **[Aggregations](./aggregations.md)**: Understand how to perform grouping and aggregation functions like `COUNT` and `SUM`.
 * **[Joins](./joins.md)**: Discover how the builder automatically handles joins between tables.
 * **[Advanced Examples](./advanced-examples.md)**: See how to combine these concepts to build complex data products.
+* **[Conceptual Search](./conceptual-search.md)**: Learn how to generate data products from natural language queries.

docsite/docs/core-concepts/semantic-intelligence/semantic-search.md

Lines changed: 1 addition & 1 deletion

@@ -131,7 +131,7 @@ from intugle.semantic_search import SemanticSearch

 # This assumes your project's .yml files are in the default location.
 # You can also specify the path to your models directory:
-# search_client = SemanticSearch(project_base="/path/to/your/models")
+# search_client = SemanticSearch(models_dir_path="/path/to/your/models")
 search_client = SemanticSearch()

 # 1. Initialize the search index.

docsite/docs/examples.md

Lines changed: 1 addition & 0 deletions

@@ -13,6 +13,7 @@ For a detailed, hands-on introduction to the project, please see our quickstart
 | **Tech Manufacturing** | [`quickstart_tech_manufacturing.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_tech_manufacturing.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_tech_manufacturing.ipynb) |
 | **FMCG** | [`quickstart_fmcg.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_fmcg.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_fmcg.ipynb) |
 | **Sports Media** | [`quickstart_sports_media.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_sports_media.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_sports_media.ipynb) |
+| **Conceptual Search** | [`quickstart_conceptual_search.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_conceptual_search.ipynb) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_conceptual_search.ipynb) |
 | **Databricks Unity Catalog [Health Care]** | [`quickstart_healthcare_databricks.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_healthcare_databricks.ipynb) | Databricks Notebook Only |
 | **Snowflake Horizon Catalog [ FMCG ]** | [`quickstart_fmcg_snowflake.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_fmcg_snowflake.ipynb) | Snowflake Notebook Only |
 | **Native Snowflake with Cortex Analyst [ Tech Manufacturing ]** | [`quickstart_native_snowflake.ipynb`](https://github.com/Intugle/data-tools/blob/main/notebooks/quickstart_native_snowflake.ipynb) | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Intugle/data-tools/blob/main/notebooks/quickstart_native_snowflake.ipynb) |
