Demonstration: How to integrate AI in Microsoft Fabric

Costa Rica

GitHub brown9804

Last updated: 2025-07-16


Fabric's OneLake datastore provides a unified data storage solution that supports different data formats and sources. This simplifies data access and management, enabling efficient data preparation and model training.

Overview

Microsoft Fabric is a comprehensive data analytics platform that brings together various data services to provide an end-to-end solution for data engineering, data science, data warehousing, real-time analytics, and business intelligence. It's designed to simplify the process of working with data and to enable organizations to gain insights more efficiently.

Capabilities Enabled by LLMs:

  • Document Summarization: LLMs can process and summarize large documents, making it easier to extract key information.
  • Question Answering: Users can perform Q&A tasks on PDF documents, allowing for interactive data exploration.
  • Embedding Generation: LLMs can generate embeddings for document chunks, which can be stored in a vector store for efficient search and retrieval.
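
To make the embedding-and-retrieval capability concrete, here is a minimal sketch in plain Python. The `toy_embed` function is a hypothetical stand-in (a bag-of-words count, not a real LLM embedding), and the in-memory list stands in for a vector store; in practice both would presumably be replaced by an Azure OpenAI embedding deployment and a proper vector index.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 7) -> List[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, store: List[Tuple[str, Counter]], top_k: int = 1) -> List[str]:
    """Return the `top_k` chunks most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


document = (
    "OneLake stores data in a unified lake. "
    "SynapseML scales machine learning pipelines on Spark. "
    "LangChain builds applications around language models."
)

# The "vector store" is just a list of (chunk, embedding) pairs.
store = [(chunk, toy_embed(chunk)) for chunk in chunk_text(document)]
best = search("machine learning pipelines on Spark", store)
print(best[0])
```

The shape of the flow (chunk, embed, store, rank by similarity) is the part that carries over; the quality of the results depends entirely on the embedding model that replaces `toy_embed`.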

Demo

Tools in practice:

| Tool | Description |
| --- | --- |
| LangChain | LangChain is a framework for developing applications powered by language models. It can be used with Azure OpenAI to build applications that require natural language understanding and generation. Use Case: Creating complex applications that involve multiple steps or stages of processing, such as preprocessing text data, applying a language model, and postprocessing the results. |
| SynapseML | SynapseML is an open-source library that simplifies the creation of massively scalable machine learning pipelines. It integrates with Azure OpenAI to provide distributed computing capabilities, allowing you to apply large language models at scale. Use Case: Applying powerful language models to massive amounts of data, enabling scenarios like batch processing of text data or large-scale text analytics. |

Set Up Your Environment

  1. Register the Resource Provider: Ensure that the Microsoft.Fabric resource provider is registered in your Azure subscription.
  2. Create a Microsoft Fabric Resource:

    • Navigate to the Azure Portal.

    • Create a new resource of type Microsoft Fabric.

    • Choose the appropriate subscription, resource group, capacity name, region, size, and administrator.

  3. Enable Fabric Capacity in Power BI:

    • Go to the Power BI workspace.

    • Select the Fabric capacity license and the Fabric resource created in Azure.

  4. Pause Fabric Compute When Not in Use: To save costs, remember to pause the Fabric compute in Azure when you're not using it.


Install Required Libraries

  1. Access Microsoft Fabric:

    • Open your web browser and navigate to the Microsoft Fabric portal.
    • Sign in with your Azure credentials.
  2. Select Your Workspace: From the Microsoft Fabric home page, select the workspace where you want to configure SynapseML.

  3. Create a New Cluster:

    • Within the Data Science component, you should find options to create a new cluster.

    • Follow the prompts to configure and create your cluster, specifying the details such as cluster name, region, node size, and node count.

  4. Install SynapseML on Your Cluster: Configure your cluster to include the SynapseML package, then confirm that it is available by running the following in a notebook cell:

    %pip show synapseml
    
  5. Install LangChain and Other Dependencies:

    You can use %pip install to install the necessary packages:

    %pip install openai langchain_community langchain_openai

    Alternatively, add the packages through the environment configuration, or upload a .yml file that lists your dependencies. For example:

    dependencies:
      - pip:
          - synapseml==1.0.8
          - langchain==0.3.4
          - langchain_community==0.3.4
          - openai==1.53.0
          - langchain-openai==0.2.4
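
After installation, it can help to confirm that the versions in the session match the pins. The following is a minimal sketch using only the Python standard library; the `PINNED` dictionary is copied by hand from the dependency list above (with canonical package names), and `check_versions` is a hypothetical helper, not part of Fabric or SynapseML.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the dependency list above.
PINNED = {
    "synapseml": "1.0.8",
    "langchain": "0.3.4",
    "langchain-community": "0.3.4",
    "openai": "1.53.0",
    "langchain-openai": "0.2.4",
}


def check_versions(pinned: Dict[str, str]) -> Dict[str, Tuple[Optional[str], bool]]:
    """Return {package: (installed version or None, whether it matches the pin)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this session
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_versions(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: installed={installed} pinned={PINNED[name]} [{status}]")
```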

Configure Azure OpenAI Service

Note

Click here to see the full notebook

  1. Set Up API Keys: Ensure you have the API key and endpoint URL for your deployed model. Set these as environment variables:

    import os
    
    # Set the API version for the Azure OpenAI service
    os.environ["OPENAI_API_VERSION"] = "2023-08-01-preview"
    
    # Set the base URL for the Azure OpenAI service
    os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource-name.openai.azure.com"
    
    # Set the API key for Azure OpenAI
    os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
  2. Initialize Azure OpenAI Class: Create an instance of the AzureChatOpenAI class using the environment variables set above, plus the name of your model deployment.

    from langchain_openai import AzureChatOpenAI
    
    # Set the API base URL
    api_base = os.environ["AZURE_OPENAI_ENDPOINT"]
    
    # Create an instance of the Azure OpenAI Class
    llm = AzureChatOpenAI(
       azure_endpoint=api_base,
       azure_deployment="your_deployment_name",  # name of your model deployment in Azure OpenAI
       openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
       temperature=0.7,
       verbose=True,
       top_p=0.9
    )
  3. Call the Deployed Model: Use the Azure OpenAI service to generate text or perform other language model tasks. Here's an example of generating a response based on a prompt:

    # Define a prompt
    messages = [
       (
           "system",
           "You are a helpful assistant that translates English to French. Translate the user sentence.",
       ),
       ("human", "Hi, how are you?"),
    ]
    
    # Generate a response from the Azure OpenAI service using the invoke method
    ai_msg = llm.invoke(messages)
    
    # Print the text of the response
    print(ai_msg.content)

Make sure to replace "your-api-key" and "https://your-resource-name.openai.azure.com" with the API key and endpoint URL of your Azure OpenAI instance. Any deployment or model name you pass must likewise match a deployment in that instance. This example demonstrates how to configure and use an existing Azure OpenAI instance in Microsoft Fabric.
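
Because forgotten placeholders tend to surface only as an authentication error on the first model call, a small pre-flight check can be useful. This is a minimal sketch using only the standard library; `find_config_problems` is a hypothetical helper, and the placeholder strings it looks for are the ones used in the example above.

```python
import os
from typing import List, Mapping
from urllib.parse import urlparse

REQUIRED_VARS = ("OPENAI_API_VERSION", "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")


def find_config_problems(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable problems; an empty list means the settings look usable."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name, "").strip():
            problems.append(f"{name} is not set")

    endpoint = env.get("AZURE_OPENAI_ENDPOINT", "").strip()
    if endpoint:
        parsed = urlparse(endpoint)
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append("AZURE_OPENAI_ENDPOINT is not an https URL")
        elif "your-resource-name" in parsed.netloc:
            problems.append("AZURE_OPENAI_ENDPOINT still contains the placeholder resource name")

    if env.get("AZURE_OPENAI_API_KEY", "").strip() == "your-api-key":
        problems.append("AZURE_OPENAI_API_KEY still contains the placeholder value")
    return problems


# Check the current session before creating the model client.
for problem in find_config_problems():
    print("Configuration problem:", problem)
```

The check only inspects the shape of the values; it does not contact Azure, so a well-formed but wrong key still passes.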

Basic Usage of LangChain Transformer

Note

E.g: Automate the process of generating definitions for technology terms using a language model. The LangChain Transformer sets up a template for the text you want to create, links this template to a language model, and then processes your data to produce the desired output.

LangChain Transformer automates generating and transforming text data with language models, which makes it easier to integrate AI capabilities into your data workflows. It works in four stages:

  • Prompt Creation: Start by defining a template for the kind of text you want to generate or analyze. For example, you might create a prompt that asks the model to define a specific technology term.
  • Chain Setup: Then set up a chain that links this prompt to a language model. This chain is responsible for sending the prompt to the model and receiving the generated response.
  • Transformer Configuration: The LangChain Transformer is configured to use this chain. It specifies how the input data (like a list of technology names) should be processed and what kind of output (like definitions) should be produced.
  • Data Processing: Finally, apply this setup to a dataset, e.g., a list of technology names in a DataFrame, and the transformer uses the language model to generate a definition for each technology.

The steps below implement these stages:

  1. Create a Prompt Template: Define a prompt template for generating definitions.

    from langchain.prompts import PromptTemplate
    
    copy_prompt = PromptTemplate(
        input_variables=["technology"],
        template="Define the following word: {technology}",
    )
  2. Set Up an LLMChain: Create an LLMChain with the defined prompt template.

    from langchain.chains import LLMChain
    
    chain = LLMChain(llm=llm, prompt=copy_prompt)
  3. Configure LangChain Transformer: Set up the LangChain transformer to execute the processing chain.

    # Import path for SynapseML 1.0.x; older releases used synapse.ml.cognitive.langchain
    from synapse.ml.services.langchain import LangchainTransformer
    
    openai_api_key = os.environ["AZURE_OPENAI_API_KEY"]
    
    transformer = (
       LangchainTransformer()
       .setInputCol("technology")
       .setOutputCol("definition")
       .setChain(chain)
       .setSubscriptionKey(openai_api_key)
       .setUrl(api_base)
    )
  4. Create a Test DataFrame: Construct a DataFrame with technology names.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    
    # Initialize Spark session
    spark = SparkSession.builder.appName("example").getOrCreate()
    
    # Construct a DataFrame with technology names
    df = spark.createDataFrame(
       [
           (0, "docker"), (1, "spark"), (2, "python")
       ],
       ["label", "technology"]
    )
    
    # Define a simple UDF to transform the technology column
    def transform_technology(tech):
       return tech.upper()
    
    # Register the UDF
    transform_udf = udf(transform_technology, StringType())
    
    # Apply the UDF to the DataFrame
    transformed_df = df.withColumn("transformed_technology", transform_udf(df["technology"]))
    
    # Show the transformed DataFrame
    transformed_df.show()
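
The DataFrame example above stops before the language model is involved. As a rough illustration of what the configured transformer does for each row, here is a minimal plain-Python sketch: it fills the prompt template from the input column and writes the model's answer to the output column. `fake_llm` is a hypothetical stand-in that only echoes the prompt, and `add_definitions` is an illustrative helper, not SynapseML's implementation.

```python
from typing import Callable, Dict, List

# Same template text as the PromptTemplate defined earlier.
TEMPLATE = "Define the following word: {technology}"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model call; it only echoes the prompt."""
    return f"[model output for: {prompt}]"


def add_definitions(
    rows: List[Dict[str, object]],
    llm: Callable[[str], str] = fake_llm,
    input_col: str = "technology",
    output_col: str = "definition",
) -> List[Dict[str, object]]:
    """Return new rows with `output_col` filled from a prompt built on `input_col`."""
    result = []
    for row in rows:
        prompt = TEMPLATE.format(technology=row[input_col])
        result.append({**row, output_col: llm(prompt)})  # original row is left untouched
    return result


rows = [
    {"label": 0, "technology": "docker"},
    {"label": 1, "technology": "spark"},
    {"label": 2, "technology": "python"},
]

for row in add_definitions(rows):
    print(row)
```

In the notebook itself, the corresponding step would presumably be `transformer.transform(df)`, since LangchainTransformer follows the Spark ML transformer interface; that call issues real requests to the Azure OpenAI deployment for every row.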

Using LangChain for Large Scale Literature Review

Note

E.g: Automating the extraction and summarization of academic papers with a script for an agent that uses LangChain to extract content from an online PDF and generate a prompt based on that content. In programming and artificial intelligence, an agent is a software entity that performs tasks autonomously: it can interact with its environment, make decisions, and execute actions based on predefined rules or learned behavior.

  1. Define Functions for Content Extraction and Prompt Generation: Extract content from PDFs linked in arXiv papers and generate prompts for extracting specific information.

    from langchain_community.document_loaders import OnlinePDFLoader
    
    def paper_content_extraction(inputs: dict) -> dict:
        arxiv_link = inputs["arxiv_link"]
        loader = OnlinePDFLoader(arxiv_link)
        pages = loader.load_and_split()
        # Use the first two chunks; slicing avoids an IndexError when only one chunk is returned
        return {"paper_content": "".join(page.page_content for page in pages[:2])}
    
    def prompt_generation(inputs: dict) -> dict:
        output = inputs["Output"]
        prompt = (
            "find the paper title, author, summary in the paper description below, output them. "
            "After that, Use websearch to find out 3 recent papers of the first author in the author section below "
            "(first author is the first name separated by comma) and list the paper titles in bullet points: "
            "<Paper Description Start>\n" + output + "<Paper Description End>."
        )
        return {"prompt": prompt}
  2. Create a Sequential Chain for Information Extraction: Set up a chain to extract structured information from an arXiv link.

    from langchain.chains import TransformChain, SimpleSequentialChain
    
    paper_content_extraction_chain = TransformChain(
        input_variables=["arxiv_link"],
        output_variables=["paper_content"],
        transform=paper_content_extraction,
        verbose=False,
    )
    
    paper_summarizer_template = """
    You are a paper summarizer, given the paper content, it is your job to summarize the paper into a short summary, 
    and extract authors and paper title from the paper content.
    """

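The transform functions in this section share a simple contract: each takes a dictionary and returns a dictionary of new keys. The following plain-Python sketch shows how such steps can be run in sequence, under the assumption that each step may read any key produced before it. It is an illustration of the data flow only, not LangChain's chain implementation; the three step functions are hypothetical stand-ins that avoid any download or model call, and the link is a placeholder.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict[str, str]], Dict[str, str]]


def run_steps(steps: List[Step], inputs: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order; each sees all keys produced so far and adds its own."""
    state = dict(inputs)
    for step in steps:
        state.update(step(state))
    return state


def fake_extraction(inputs: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for paper_content_extraction: no PDF is downloaded.
    return {"paper_content": f"(text of {inputs['arxiv_link']})"}


def fake_summarizer(inputs: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for the LLM summarizer; it writes the "Output" key
    # that prompt_generation reads in the code above.
    return {"Output": "Summary of " + inputs["paper_content"]}


def build_prompt(inputs: Dict[str, str]) -> Dict[str, str]:
    # Shortened version of prompt_generation, keeping only the delimiters.
    return {
        "prompt": "<Paper Description Start>\n" + inputs["Output"] + "<Paper Description End>."
    }


result = run_steps(
    [fake_extraction, fake_summarizer, build_prompt],
    {"arxiv_link": "https://example.org/paper.pdf"},  # placeholder link
)
print(result["prompt"])
```

Keeping every intermediate key in one dictionary makes it easy to inspect what each stage produced when a prompt comes out malformed.
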
Machine Learning Integration with Microsoft Fabric

  1. Train and Register Machine Learning Models: Use Microsoft Fabric's native integration with the MLflow framework to log trained machine learning models together with their hyperparameters and evaluation metrics.

    import mlflow
    from mlflow.models import infer_signature
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    
    # Generate synthetic regression data
    X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)
    
    # Model parameters
    params = {"n_estimators": 3, "random_state": 42}
    
    # Model tags for MLflow
    model_tags = {
       "project_name": "grocery-forecasting",
       "store_dept": "produce",
       "team": "stores-ml",
       "project_quarter": "Q3-2023"
    }
    
    # Log MLflow entities
    with mlflow.start_run() as run:
       # Train the model
       model = RandomForestRegressor(**params).fit(X, y)
    
       # Infer the model signature
       signature = infer_signature(X, model.predict(X))
    
       # Log parameters and the model
       mlflow.log_params(params)
       mlflow.sklearn.log_model(model, artifact_path="sklearn-model", signature=signature)
    
       # Register the model with tags
       model_uri = f"runs:/{run.info.run_id}/sklearn-model"
       model_version = mlflow.register_model(model_uri, "RandomForestRegressionModel", tags=model_tags)
    
       # Output model registration details
       print(f"Model Name: {model_version.name}")
       print(f"Model Version: {model_version.version}")
  2. Compare and Filter Machine Learning Models: Use MLflow to search among multiple models saved within the workspace.

    from pprint import pprint
    from mlflow.tracking import MlflowClient
    
    client = MlflowClient()
    for rm in client.search_registered_models():
       pprint(dict(rm), indent=4)
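
Once the registered models have been listed, they can be narrowed down by the tags attached at registration time. The sketch below filters plain dictionaries on the client side; the record shape (a `name` and a `tags` mapping) is an assumption about what `dict(rm)` contains, and every entry except the first is hypothetical sample data. MLflow's `search_registered_models` also accepts a `filter_string` argument for server-side filtering, which may be preferable when the workspace holds many models.

```python
from typing import Dict, List


def filter_by_tags(models: List[dict], required: Dict[str, str]) -> List[dict]:
    """Keep only the models whose tags contain every required key/value pair."""
    return [
        model
        for model in models
        if all(model.get("tags", {}).get(key) == value for key, value in required.items())
    ]


# Sample records; only the first mirrors the model registered above.
models = [
    {"name": "RandomForestRegressionModel", "tags": {"team": "stores-ml", "store_dept": "produce"}},
    {"name": "BaselineModel", "tags": {"team": "stores-ml", "store_dept": "bakery"}},
    {"name": "UntaggedModel", "tags": {}},
]

for model in filter_by_tags(models, {"store_dept": "produce"}):
    print(model["name"])
```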