The integration of generative AI with traditional data management and analytics opens new opportunities to provide value and insight that were difficult to achieve in the past.

One technique for leveraging generative AI in data analysis is using it to summarize content. Summarization is an ideal use of generative models like OpenAI GPT.

In this post we'll walk through a solution that uses Azure OpenAI Service to summarize customer product reviews logged on a website, giving managers and data analysts a product-level sense of how customers feel about each product.

Video Tutorial Available

This tutorial-oriented post is also available in video format on YouTube via the embedded video below. The text walk-through continues after the video.

YouTube video version of this post

Creating an OpenAI Service

The first step in the process is to create an Azure OpenAI Service instance. The instance establishes a specific API endpoint for our solution, acts as a container for model deployments, holds the authorization keys, and serves as the anchor for GPT token billing.

An Azure OpenAI Service Instance

GPT Model Deployment

Within the Azure OpenAI Service instance, we can deploy models. For this solution, I've created a deployment of the gpt-35-turbo-16k model. Deploying a model creates an endpoint that specifies which model version we'll use and defines the token quota (the maximum rate at which requests can be made).

OpenAI Services Deployment

In the Fabric solution, we'll need the deployment name (gpt-35-turbo-16k), the endpoint of the Azure OpenAI Service instance, and one of the API keys that authorize applications to use the service.

Azure Key Vault

As a best practice, store keys in Azure Key Vault. Key Vault provides the underlying keys to requesting processes based on authorization granted through Microsoft Entra ID. The key is copied from the screen above and added to a Key Vault; later, we'll fetch it from the vault as the Fabric notebook runs.

💡
In the case of Fabric, a user can be granted authorization to use a secret key within a notebook without ever knowing the actual value of the key.

If you're not sure how to use Key Vault, refer to my Azure Key Vault cheat sheet!

Let's Code that Notebook!

Now that we have an OpenAI Services deployment ready, and have put the authorization key in the key vault, we're ready to start building the solution in Fabric!

Review the Input Data

In this solution, we start with a Delta table in a Fabric Lakehouse containing the text of every customer review. Each review is keyed by its associated product id and product name, but our users would like a high-level overview of the commonalities across reviews. We can use a large language model such as GPT to easily create this type of summary!

Original Review Text
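To ground the walkthrough, here's a hypothetical cell that previews the input table. The table name web_reviews appears later in this post; the column names product_id, product_title, and review_text are assumptions based on the screenshot.

# Preview the raw reviews (column names are illustrative)
df = spark.sql("SELECT product_id, product_title, review_text FROM web_reviews LIMIT 5")
display(df)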

Author the Notebook

In this solution, I've created a notebook called Generative_AI_Summarization in the Synapse Data Science workload and connected it to the Lakehouse containing the original reviews.

Fetch the OpenAI Services Key

We need the OpenAI Services key within the notebook, so as a first step we'll gain access to the key and store it in a variable accessible to the notebook's Spark session.

Fetch the OpenAI Services Key
  • In lines 2-3, we import the PyTridentTokenLibrary provided by Fabric, which gives the notebook access to Azure Key Vault.
  • In line 12 we call the get_secret_with_token method to fetch the secret from the vault and assign it to the variable openai_services_key.
💡
The user or service principal running the notebook must be granted read access to the Azure Key Vault in Entra ID.
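For reference, a minimal sketch of that cell might look like the following; the vault and secret names are placeholders.

from trident_token_library_wrapper import PyTridentTokenLibrary as tl

key_vault_name = "my-key-vault"        # placeholder: your Key Vault name
secret_name = "openai-services-key"    # placeholder: the secret holding the API key

# Acquire a Key Vault-scoped access token for the identity running the
# notebook, then use it to read the secret
access_token = tl.get_access_token("keyvault")
openai_services_key = tl.get_secret_with_token(
    f"https://{key_vault_name}.vault.azure.net/", secret_name, access_token)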

Even though the key is stored in a session variable, it's still protected from unauthorized disclosure or accidental storage in a Git repository. The notebook knows the return value of get_secret_with_token is sensitive information and won't display it in cell output.

Redacted Azure Key Vault Information

Initialize the OpenAI SDK

Next, we'll import the openai SDK and provide it with authorization information. This is the point where we use the deployment information and API key we created in Azure OpenAI Service in the previous steps.

Initializing Azure OpenAI Services
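If the notebook uses the pre-1.0 openai package (current when gpt-35-turbo-16k shipped), the initialization might look roughly like this; the endpoint and API version shown are placeholders.

import openai

openai.api_type = "azure"
openai.api_base = "https://my-resource.openai.azure.com/"  # placeholder: your service endpoint
openai.api_version = "2023-05-15"                          # placeholder: a supported API version
openai.api_key = openai_services_key                       # fetched from Key Vault above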

Generate Summaries

Now for the exciting part: calling OpenAI GPT to generate summaries!

The code below does the following:

  1. Uses Spark SQL to generate a list of products contained in a table web_reviews that have user reviews available (see the table content above).
  2. For each product, fetches the text of all available reviews.
  3. Creates a GPT prompt asking the large language model to provide a single summary that encapsulates the consensus of all reviews.

Here's the code, with a description of key elements below:

GPT Summarization Code
  • Lines 4-8 define the structure of the output row for a product review summary.
  • Line 11 creates an empty array. Each product review summary will be added to this array, matching the structure defined in the schema variable.
  • Line 14 fetches a list of products that have reviews available as a Pandas DataFrame.
  • Line 17 begins an iteration over the product list.
  • Lines 21-27 build the prompt as a string. This is the prompt that will be sent to the LLM for each product.

The prompt sent to the LLM will be in the following format (text abbreviated with ellipses for clarity):

[
    { "role": "system", "content": "Assistant is a large..." },
    { "role": "user", "content": "This product has been...\n"
                                 "Please very briefly summarize these reviews...\n"
                                 "I had it for one month and it...\n"
                                 "I'm reviewing the 25 variant here. The casing is sure to be...\n" }
]
  • Line 29 makes the actual call to the OpenAI GPT 3.5 Turbo deployment created in Azure, and waits for a response from the LLM.
  • Line 37 appends the review summary to the output array, along with the original product id and product title.
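Reconstructed as a sketch, the loop might look like the following. The table name web_reviews and the deployment name gpt-35-turbo-16k come from this post; the column names, prompt wording, and variable names are assumptions.

from pyspark.sql.types import StructType, StructField, StringType

# Structure of an output row (field names are illustrative)
schema = StructType([
    StructField("product_id", StringType()),
    StructField("product_title", StringType()),
    StructField("review_summary", StringType()),
])

summaries = []  # each product's summary row is appended here

# 1. Products that have reviews available, as a pandas DataFrame
products = spark.sql(
    "SELECT DISTINCT product_id, product_title FROM web_reviews").toPandas()

for _, product in products.iterrows():
    # 2. Fetch the text of all reviews for this product
    pid = product["product_id"]
    reviews = spark.sql(
        f"SELECT review_text FROM web_reviews WHERE product_id = '{pid}'").toPandas()
    review_text = "\n".join(reviews["review_text"])

    # 3. Build the prompt: a system message plus one user message containing
    #    the request and every review for the product
    messages = [
        {"role": "system",
         "content": "Assistant is a large language model that summarizes customer product reviews."},
        {"role": "user",
         "content": "This product has been reviewed by several customers.\n"
                    "Please very briefly summarize these reviews:\n" + review_text},
    ]

    # Call the gpt-35-turbo-16k deployment and wait for the response
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo-16k", messages=messages)

    summaries.append((pid, product["product_title"],
                      response.choices[0].message.content))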

Create a new Spark DataFrame

After all review summaries are generated, the array is transformed into a Spark DataFrame that can be further analyzed or stored.
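Assuming the summaries array and schema from the sketch above, this step can be a single call:

# Convert the accumulated rows into a Spark DataFrame
summary_df = spark.createDataFrame(summaries, schema)
display(summary_df)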

The review summaries generated by OpenAI Services

Write the Results to a new Delta Table

Finally, we'll write the results to a new Delta Table.

Saving review summaries to a Delta Table
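In code, the write might look like this; the table name review_summaries is a placeholder.

# Persist the summaries as a Delta table in the Lakehouse
summary_df.write.format("delta").mode("overwrite").saveAsTable("review_summaries")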

Now we can use Spark SQL (or other methods) to read data from the Delta table and use the summaries in other solutions, such as Power BI reports!

Fetching the review summaries with SQL
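For example, a quick query against the assumed review_summaries table:

# Read the summaries back for use in downstream reports
display(spark.sql("SELECT product_title, review_summary FROM review_summaries"))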

Code Available

The Jupyter notebook used in this post is available on GitHub using this link.