---
sidebar_position: 8
title: Conceptual Search
---

# Conceptual Search

:::info Experimental Feature
Conceptual Search is an experimental feature. The API and functionality may change in future releases.
:::

Conceptual Search is an AI-powered feature that lets you generate a data product plan from a natural language query. It bridges the gap between a high-level business question and a concrete, executable data product definition.

## Overview

At its core, Conceptual Search uses a two-stage process orchestrated by AI agents and knowledge graphs:

1. **Knowledge Graphs**: The system builds knowledge graphs for both database tables and columns. Nodes represent tables/columns, and edges connect conceptually related items based on semantic similarity and shared concepts extracted by an LLM.
2. **Graph-Based Retrievers**: When you search, the system uses a hybrid approach of vector search and graph traversal to find relevant tables and columns, even if they are not direct keyword matches.
3. **AI Agents**: The process is managed by two LangChain/LangGraph agents: a `DataProductPlannerAgent` and a `DataProductBuilderAgent`.
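As a rough illustration of the hybrid retrieval idea above, here is a dependency-free Python toy that combines vector similarity with one-hop graph expansion. The embeddings, the concept graph, and the `hybrid_search` function are all hypothetical stand-ins, not the actual intugle retrievers:

```python
# Toy sketch only: hybrid retrieval = vector ranking + graph expansion.
# All data and names here are illustrative assumptions.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy embeddings for tables, and a concept graph linking related tables.
embeddings = {
    "customers": [0.9, 0.1],
    "orders": [0.7, 0.6],
    "web_logs": [0.1, 0.9],
}
graph = {"customers": ["orders"], "orders": ["customers"], "web_logs": []}

def hybrid_search(query_vec, top_k=2):
    # Stage 1: rank tables by vector similarity to the query embedding.
    ranked = sorted(embeddings, key=lambda t: cosine(query_vec, embeddings[t]), reverse=True)
    seeds = ranked[:top_k]
    # Stage 2: expand with graph neighbors, so conceptually related tables
    # surface even when they are not direct semantic matches.
    results = list(seeds)
    for table in seeds:
        for neighbor in graph[table]:
            if neighbor not in results:
                results.append(neighbor)
    return results

print(hybrid_search([0.8, 0.3], top_k=1))  # "customers" seeds, "orders" joins via the graph
```

The graph expansion is what distinguishes this from plain vector search: `orders` is returned even though only `customers` was a top vector match.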

## The Two-Stage Workflow

### Stage 1: Planning

The goal of this stage is to convert a vague user request (e.g., "customer churn metrics") into a structured `DataProductPlan`, which is a well-defined list of dimensions and measures.

1. **Input**: A natural language query.
2. **Agent's Task**: The `DataProductPlannerAgent` uses its tools to find relevant database tables and existing data products.
3. **Output**: The agent produces a `DataProductPlan` object, which can be reviewed and modified by the user.
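To make the plan's shape concrete, here is a hypothetical sketch of what a `DataProductPlan` boils down to. The classes and field names are illustrative assumptions, not the real intugle types:

```python
# Hypothetical shape of a DataProductPlan: a list of named attributes,
# each classified as a Dimension or a Measure. Not the actual intugle classes.
from dataclasses import dataclass, field

@dataclass
class PlanAttribute:
    name: str
    classification: str  # "Dimension" or "Measure"
    description: str = ""
    active: bool = True

@dataclass
class DataProductPlan:
    query: str
    attributes: list = field(default_factory=list)

# e.g. for the query "customer churn metrics":
plan = DataProductPlan(
    query="customer churn metrics",
    attributes=[
        PlanAttribute("customer id", "Dimension", "Unique customer identifier"),
        PlanAttribute("churn rate", "Measure", "Share of customers lost in a period"),
    ],
)
print([(a.name, a.classification) for a in plan.attributes])
```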

:::tip User Validation is Key
The `DataProductPlan` generated by the AI is a starting point. It is crucial to review and validate this plan to ensure it aligns with your business requirements before proceeding to the building stage.
:::

### Stage 2: Building

This stage takes the abstract `DataProductPlan` and maps each attribute to a specific, physical database column, defining its logic (e.g., aggregation for measures).

1. **Input**: The `DataProductPlan` from Stage 1.
2. **Agent's Task**: The `DataProductBuilderAgent` iterates through each attribute in the plan, using the graph-based column retriever to find the most relevant physical column.
3. **Output**: The collected mappings are assembled into a final `ETLModel`, which is a complete, machine-readable definition of the data product, ready to be used to generate a SQL query.
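The mapping step can be sketched with a toy example. The `catalog`, the keyword-overlap `best_column` scorer, and the output dict are all illustrative stand-ins for the graph-based column retriever and the real `ETLModel`:

```python
# Illustrative only: a toy version of the builder's attribute-to-column
# mapping. A plain word-overlap score stands in for the graph-based retriever.
catalog = {
    "crm.customers.customer_id": "unique id of the customer",
    "sales.orders.order_total": "total purchase amount of an order",
}

def best_column(attribute_name):
    # Score each physical column by word overlap with the attribute name.
    words = set(attribute_name.lower().split())
    return max(catalog, key=lambda col: len(words & set(catalog[col].split())))

plan = [
    {"name": "customer id", "classification": "Dimension"},
    {"name": "total purchase amount", "classification": "Measure"},
]

etl_model = {
    "fields": [
        {
            "name": attr["name"],
            "column": best_column(attr["name"]),
            # Measures get an aggregation; dimensions are grouped on.
            "agg": "sum" if attr["classification"] == "Measure" else None,
        }
        for attr in plan
    ]
}
print(etl_model["fields"])
```

The result is a fully resolved, machine-readable structure: every abstract attribute now points at one physical column, with aggregation logic attached where needed.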

## Usage Example

```python
from intugle import DataProduct

dp = DataProduct()

# 1. Generate a plan from a natural language query
plan = await dp.plan(query="top 10 customers by their total purchase amount")

# 2. Review and modify the plan
print("Original Plan:")
plan.display()

plan.rename_attribute("total purchase amount", "Total Spend")
plan.disable_attribute("customer address")  # Assuming this was in the plan

print("\nModified Plan:")
plan.display()

# 3. Create the ETL model from the modified plan
etl_model = await dp.create_etl_model_from_plan(plan)

# 4. Build the data product
result_dataset = dp.build(etl=etl_model)

# 5. Access the results
print(result_dataset.to_df())
```

## Modifying the Data Product Plan

The `DataProductPlan` object is not just a static output; it is an interactive object you can modify to refine the AI's suggestions. This allows you to correct any misunderstandings or add your own domain knowledge to the plan.

The following methods are available to modify the plan:

| Method | Description | Example |
| ------ | ----------- | ------- |
| `rename_attribute(old, new)` | Renames an existing attribute. | `plan.rename_attribute('Customer ID', 'Client Identifier')` |
| `set_attribute_description(name, desc)` | Updates the description of an attribute. | `plan.set_attribute_description('Client Identifier', 'The unique ID for each client')` |
| `set_attribute_classification(name, classification)` | Changes the classification to 'Dimension' or 'Measure'. | `plan.set_attribute_classification('Total Sales', 'Measure')` |
| `disable_attribute(name)` | Deactivates an attribute so it won't be included in the final data product. | `plan.disable_attribute('Customer Address')` |
| `enable_attribute(name)` | Reactivates a previously disabled attribute. | `plan.enable_attribute('Customer Address')` |
| `to_df()` | Returns the final plan as a pandas DataFrame with only active attributes. | `final_plan_df = plan.to_df()` |
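The semantics of these methods can be illustrated with a dependency-free mock. `MockPlan` is a hypothetical stand-in for the real class, and its `to_df()` returns a list of dicts instead of a pandas DataFrame to keep the sketch self-contained:

```python
# Hypothetical mock sketching the plan-editing semantics; not the real class.
class MockPlan:
    def __init__(self, attributes):
        # attributes: {name: {"description": ..., "classification": ..., "active": ...}}
        self.attributes = attributes

    def rename_attribute(self, old, new):
        self.attributes[new] = self.attributes.pop(old)

    def set_attribute_description(self, name, desc):
        self.attributes[name]["description"] = desc

    def set_attribute_classification(self, name, classification):
        self.attributes[name]["classification"] = classification

    def disable_attribute(self, name):
        self.attributes[name]["active"] = False

    def enable_attribute(self, name):
        self.attributes[name]["active"] = True

    def to_df(self):
        # Only active attributes make it into the final plan.
        return [{"name": n, **meta} for n, meta in self.attributes.items() if meta["active"]]

plan = MockPlan({
    "Customer ID": {"description": "", "classification": "Dimension", "active": True},
    "Customer Address": {"description": "", "classification": "Dimension", "active": True},
    "Total Sales": {"description": "", "classification": "Dimension", "active": True},
})
plan.rename_attribute("Customer ID", "Client Identifier")
plan.set_attribute_classification("Total Sales", "Measure")
plan.disable_attribute("Customer Address")
print([row["name"] for row in plan.to_df()])
```

Note that disabling is reversible: `enable_attribute` restores the attribute, so nothing is lost by pruning the plan aggressively during review.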

## Qdrant Server Requirement

Conceptual Search uses [Qdrant](https://qdrant.tech/) as its vector database for efficient retrieval of relevant tables and columns, so a running Qdrant instance is required.

You can set up a Qdrant server using Docker:

```bash
docker run -d -p 6333:6333 -p 6334:6334 \
    -v qdrant_storage:/qdrant/storage:z \
    --name qdrant qdrant/qdrant
```

After starting the Qdrant server, configure its URL and API key (if authorization is used) in your environment variables:

```bash
export QDRANT_URL="http://localhost:6333"
export QDRANT_API_KEY="your-qdrant-api-key" # if authorization is used
```
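As a minimal sketch of how an application might pick up these settings, here is a hypothetical `qdrant_settings` helper; the environment variable names match the exports above, and the localhost default matches the Docker setup:

```python
# Hypothetical helper, not part of intugle: read Qdrant settings from the
# environment, defaulting to the local Docker instance shown above.
import os

def qdrant_settings():
    return {
        "url": os.environ.get("QDRANT_URL", "http://localhost:6333"),
        # The API key is optional when the server runs without authorization.
        "api_key": os.environ.get("QDRANT_API_KEY"),
    }

print(qdrant_settings()["url"])
```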

## Enhancing Performance with Tavily Web Search

For better performance and more contextually aware data product plans, it is recommended to enable the Tavily web search tool. This allows the planning agent to research industry best practices and common metrics related to your query.

To enable this feature, get an API key from [Tavily](https://tavily.com/) and set it as an environment variable:

```bash
export TAVILY_API_KEY="your-tavily-api-key"
```