Get started with semantic caching policies

This page applies to Apigee and Apigee hybrid.

This page describes how to configure and use the Apigee semantic caching policies to enable intelligent response reuse based on semantic similarity. Using these policies in your Apigee API proxy can minimize redundant backend API calls, reduce latency, and lower operational costs.

Before you begin

Before you begin, make sure to complete the following tasks:

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Compute Engine, AI Platform, and Cloud Storage APIs.

    Enable the APIs

  5. Set up and configure the Vertex AI Text embeddings API and Vector Search within your Google Cloud project.
  6. Confirm that you have a Comprehensive environment available in your Apigee instance. Semantic caching policies can only be deployed in Comprehensive environments.

Required roles

To get the permissions that you need to create and use the semantic caching policies, ask your administrator to grant you the AI Platform User (roles/aiplatform.user) IAM role on the service account you use to deploy Apigee proxies. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Set environment variables

In the Google Cloud project that contains your Apigee instance, use the following command to set environment variables:

export PROJECT_ID=PROJECT_ID
export REGION=REGION
export RUNTIME_HOSTNAME=RUNTIME_HOSTNAME

Where:

  • PROJECT_ID is the ID of the project with your Apigee instance.
  • REGION is the Google Cloud region of your Apigee instance.
  • RUNTIME_HOSTNAME is the hostname of your Apigee runtime.

To confirm that the environment variables are set correctly, run the following command and review the output:

echo $PROJECT_ID $REGION $RUNTIME_HOSTNAME

Set the project

Set the Google Cloud project in your development environment:

    gcloud auth login
    gcloud config set project $PROJECT_ID
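
To confirm that the correct project is active, you can print the active project value:

    gcloud config get-value project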

Overview

The semantic caching policies are designed to help Apigee users who proxy LLM models serve responses to identical or semantically similar prompts efficiently, minimizing backend API calls and reducing resource consumption.

The SemanticCacheLookup and SemanticCachePopulate policies are attached to the request and response flows, respectively, of an Apigee API proxy. When the proxy receives a request, the SemanticCacheLookup policy extracts the user prompt from the request and converts the prompt into a numerical representation using the Text embeddings API. A semantic similarity search is performed using Vector Search to find similar prompts. If a similar prompt data point is found, a cache lookup is performed. If cached data is found, the cached response is returned to the client.

If the similarity search does not return a similar previous prompt, the LLM model generates content in response to the user prompt and the Apigee cache is populated with the response. A feedback loop is created to update the Vector Search index entries in preparation for future requests.
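
To make this flow concrete, the following optional curl sketch shows the kind of embeddings request the lookup policy issues on your behalf. You do not need to run it for this tutorial, and the request the policy actually constructs may differ; it assumes the text-embedding-004 model used later in this guide, which returns a 768-dimensional embedding (matching the index dimensions configured below):

curl "https://$REGION-aiplatform.googleapis.com/v1/projects/$PROJECT_ID/locations/$REGION/publishers/google/models/text-embedding-004:predict" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  --data '{"instances": [{"content": "Why is the sky blue?"}]}'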

The following sections describe the steps required to create and configure the semantic caching policies:

  1. Configure a service account for the Vector Search index.
  2. Create and deploy a Vector Search index.
  3. Create an API proxy to enable semantic caching.
  4. Configure the semantic caching policies.
  5. Test the semantic caching policies.

Configure a service account for the Vector Search index

To configure a service account for the Vector Search index, complete the following steps:

  1. Create a service account using the following command:
    gcloud iam service-accounts create SERVICE_ACCOUNT_NAME \
      --description="DESCRIPTION" \
      --display-name="SERVICE_ACCOUNT_DISPLAY_NAME"

    Where:

    • SERVICE_ACCOUNT_NAME is the name of the service account.
    • DESCRIPTION is a description of the service account.
    • SERVICE_ACCOUNT_DISPLAY_NAME is the display name of the service account.

    For example:

    gcloud iam service-accounts create ai-client \
      --description="semantic cache client" \
      --display-name="ai-client"
  2. Grant the service account the AI Platform User role using the following command:
    gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/aiplatform.user"

    Where SERVICE_ACCOUNT_NAME is the name of the service account created in the previous step.

  3. Assign the IAM Service Account User role to the service account using the following command:
    gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/iam.serviceAccountUser"

    Where SERVICE_ACCOUNT_NAME is the name of the service account created in the previous step. A sketch for verifying both role bindings follows these steps.
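
The following verification sketch uses standard gcloud IAM policy filtering; substitute your service account name. The output should list both roles/aiplatform.user and roles/iam.serviceAccountUser:

gcloud projects get-iam-policy $PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:SERVICE_ACCOUNT_NAME@$PROJECT_ID.iam.gserviceaccount.com" \
  --format="table(bindings.role)"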

Create and deploy a Vector Search index

To create and deploy a Vector Search index:

  1. Create a Vector Search index that allows streaming updates:
    ACCESS_TOKEN=$(gcloud auth print-access-token) && curl --location --request POST \
      "https://$REGION-aiplatform.googleapis.com/v1/projects/$PROJECT_ID/locations/$REGION/indexes" \
        --header "Authorization: Bearer $ACCESS_TOKEN" \
        --header 'Content-Type: application/json' \
        --data-raw \
        '{
          "displayName": "semantic-cache-index",
          "description": "semantic-cache-index",
          "metadata": {
            "config": {
              "dimensions": "768",
              "approximateNeighborsCount": 150,
              "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
              "featureNormType": "NONE",
              "algorithmConfig": {
                "treeAhConfig": {
                  "leafNodeEmbeddingCount": "10000",
                  "fractionLeafNodesToSearch": 0.05
                }
              },
              "shardSize": "SHARD_SIZE_MEDIUM"
            }
          },
          "indexUpdateMethod": "STREAM_UPDATE"
        }'

    The $REGION environment variable, set in a previous step, defines the region where the Vector Search index is deployed. We recommend using the same region as your Apigee instance.

    When this operation completes, you should see a response similar to the following:

    {
      "name": "projects/976063410430/locations/us-west1/indexes/5695338290484346880/operations/9084564741162008576",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateIndexOperationMetadata",
        "genericMetadata": {
          "createTime": "2025-04-25T18:45:27.996136Z",
          "updateTime": "2025-04-25T18:45:27.996136Z"
        }
      }
    }

    For more information on creating Vector Search indexes, see Create an index.

  2. Create an IndexEndpoint using the following command:
    gcloud ai index-endpoints create \
      --display-name=semantic-cache-index-endpoint \
      --public-endpoint-enabled \
      --region=$REGION \
      --project=$PROJECT_ID

    This step may take several minutes to complete. When it completes, you should see a response similar to the following:

    Waiting for operation [8278420407862689792]...done.
      Created Vertex AI index endpoint: projects/976063410430/locations/us-west1/indexEndpoints/7953875911424606208.

    For more information on creating an IndexEndpoint, see Create an IndexEndpoint.

  3. Deploy the index to the endpoint using the following command:
    INDEX_ENDPOINT_ID=$(gcloud ai index-endpoints list \
      --project=$PROJECT_ID \
      --region=$REGION \
      --format="json" | jq -c -r \
      '.[] | select(.displayName=="semantic-cache-index-endpoint") | .name | split("/") | .[5]' \
      ) && INDEX_ID=$(gcloud ai indexes list \
      --project=$PROJECT_ID \
      --region=$REGION \
      --format="json" | jq -c -r \
      '.[] | select(.displayName=="semantic-cache-index") | .name | split("/") | .[5]' \
      ) && gcloud ai index-endpoints deploy-index \
      $INDEX_ENDPOINT_ID \
      --deployed-index-id=semantic_cache \
      --display-name=semantic-cache \
      --index=$INDEX_ID \
      --region=$REGION \
      --project=$PROJECT_ID

Initial deployment of an index to an endpoint can take between 20 and 30 minutes to complete. To check the status of the operation, use the following command:

gcloud ai operations describe OPERATION_ID \
  --project=$PROJECT_ID \
  --region=$REGION

Where OPERATION_ID is the ID of the long-running operation returned by the deploy-index command.

Confirm that the index deployed:

gcloud ai operations describe OPERATION_ID \
  --index-endpoint=$INDEX_ENDPOINT_ID \
  --region=$REGION \
  --project=$PROJECT_ID

The command output should include done: true.
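
You can also inspect the endpoint directly; once deployment finishes, the deployedIndexes field in the output should include the semantic_cache deployed index:

gcloud ai index-endpoints describe $INDEX_ENDPOINT_ID \
  --region=$REGION \
  --project=$PROJECT_ID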

Create an API proxy to enable semantic caching

In this step, you create a new API proxy using the Proxy with Semantic Cache template, if you have not already done so.

Before creating the API proxy, set the following environment variable:

export PUBLIC_DOMAIN_NAME=$(gcloud ai index-endpoints describe $INDEX_ENDPOINT_ID --region=$REGION --project=$PROJECT_ID | grep "publicEndpointDomainName" | awk '{print $2}')
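
To confirm that the variable was captured, print it and check that the output is a non-empty domain name:

echo $PUBLIC_DOMAIN_NAME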

To create a proxy for use with semantic caching:

  1. Go to the API proxies page in the Google Cloud console.

    Go to API proxies

  2. Click + Create to open the Create API proxy pane.
  3. In the Proxy template box, select Proxy with Semantic Cache.
  4. Enter the following details:
    • Proxy name: Enter the name of the proxy.
    • Description: (Optional) Enter a description of the proxy.
    • Target (Existing API): Enter the URL of the backend service that the proxy calls. This is the LLM model endpoint that is used to generate content.

      For this tutorial, the Target (Existing API) can be set to:

      REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/publishers/google/models/gemini-2.0-flash-001:generateContent
  5. Enter the following Semantic Cache URLs:
    • Generate Embeddings URL: This Vertex AI service converts text input into a numerical form for semantic analysis.

      For this tutorial, this URL can be set to the following:

      REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/publishers/google/models/text-embedding-004:predict
    • Query Nearest Neighbor URL: This Vertex AI service searches for similar text input from previous requests in the Vector Search index to avoid reprocessing.

      For this tutorial, this URL can be set to the following:

      PUBLIC_DOMAIN_NAME/v1/projects/PROJECT_ID/locations/REGION/indexEndpoints/INDEX_ENDPOINT_ID:findNeighbors
    • Upsert index URL: This Vertex AI service updates the index with new or modified entries.

      For this tutorial, this URL can be set to the following:

      REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/indexes/INDEX_ID:upsertDatapoints
  6. Click Next.
  7. Click Create.

The API proxy's XML configuration can be viewed in the Develop tab. SemanticCacheLookup and SemanticCachePopulate policies containing default values are already attached to the proxy request and response flows.

Configure the semantic caching policies

You can view the XML configuration of each policy by clicking on the policy name in the Detail view of the API proxy's Develop tab. Edits to the policy XML can be made directly in the Code view of the Develop tab.

Edit the policies:

  • SemanticCacheLookup policy:
    • Remove the <UserPromptSource> element to use the default value.
    • Update the <DeployedIndexId> element to use the value semantic_cache.
    • Configure the semantic similarity <Threshold> value to determine when two prompts are considered a match. The default is 0.9, but you can adjust this value based on your application's sensitivity. The higher the value, the more similar two prompts must be to count as a cache hit. For this tutorial, we recommend setting this value to 0.95. (An illustrative sketch of the edited policy follows this list.)
    • Click Save.
  • SemanticCachePopulate policy:
    • Set the <TTLInSeconds> element to specify the number of seconds until the cached entry expires. The default value is 60 seconds. Note that Apigee ignores any cache-control headers it receives from the LLM model.
    • Click Save.
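
For orientation, the following is a minimal, illustrative sketch of the SemanticCacheLookup policy after these edits. It is not the complete policy: the template-generated XML in your proxy contains additional elements and may nest these differently, so treat your proxy's Code view as the source of truth. The policy name shown is hypothetical:

<SemanticCacheLookup name="SC-Lookup">
  <!-- <UserPromptSource> removed so the default prompt extraction applies -->
  <DeployedIndexId>semantic_cache</DeployedIndexId>
  <Threshold>0.95</Threshold>
</SemanticCacheLookup>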

Add Google authentication to the API proxy

You must also add Google authentication to the API proxy's target endpoint to enable proxy calls to the target.

To add the Google access token:

  1. In the Develop tab, click default under the Target endpoints folder. The Code view displays the XML configuration of the <TargetEndpoint> element.
  2. Edit the XML to add the following configuration under <HTTPTargetConnection> (a sketch of the resulting target endpoint follows these steps):
    <Authentication>
      <GoogleAccessToken>
        <Scopes>
          <Scope>https://www.googleapis.com/auth/cloud-platform</Scope>
        </Scopes>
      </GoogleAccessToken>
    </Authentication>
  3. Click Save.
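
After the edit, the target endpoint configuration should look similar to the following sketch. The <URL> value reflects the target entered when creating the proxy, and other template-generated elements are omitted here:

<TargetEndpoint name="default">
  <HTTPTargetConnection>
    <URL>https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/publishers/google/models/gemini-2.0-flash-001:generateContent</URL>
    <Authentication>
      <GoogleAccessToken>
        <Scopes>
          <Scope>https://www.googleapis.com/auth/cloud-platform</Scope>
        </Scopes>
      </GoogleAccessToken>
    </Authentication>
  </HTTPTargetConnection>
</TargetEndpoint>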

Deploy the API proxy

To deploy the API proxy:

  1. Click Deploy to open the Deploy API proxy pane.
  2. The Revision field should be set to 1. If not, click 1 to select it.
  3. In the Environment list, select the environment where you want to deploy the proxy. The environment must be a Comprehensive environment.
  4. Enter the Service account you created in an earlier step.
  5. Click Deploy.

Test the semantic caching policies

To test the semantic caching policies:

  1. Send a request to the proxy using the following command:
    curl https://$RUNTIME_HOSTNAME/PROXY_NAME -H 'Content-Type: application/json' --data '{
      "contents": [
          {
              "role": "user",
              "parts": [
                  {
                      "text": "Why is the sky blue?"
                  }
              ]
          }
      ]
    }'

    Where PROXY_NAME is the basepath of the API proxy you deployed in the previous step.

  2. Repeat the API call, substituting the prompt string with the following semantically similar prompt strings (a loop that automates these steps is sketched after this list):
    • Why is the sky blue?
    • What makes the sky blue?
    • Why is the sky blue colored?
    • Can you explain why the sky is blue?
    • The sky is blue, why is that?
  3. Compare the response time for each call once a similar prompt has been cached.
  4. To verify that your calls are being served from cache, check the response headers. There should be a Cached-Content: true header attached.
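
The following sketch automates these steps: it discards each response body, prints the response headers, and reports the total request time so you can compare cached and uncached calls. As before, PROXY_NAME is your proxy's basepath:

for PROMPT in "Why is the sky blue?" "What makes the sky blue?" "Can you explain why the sky is blue?"; do
  echo "Prompt: $PROMPT"
  # -o /dev/null discards the body; -D - prints response headers; -w prints timing
  curl -s -o /dev/null -D - -w "time_total: %{time_total}s\n" \
    "https://$RUNTIME_HOSTNAME/PROXY_NAME" \
    -H 'Content-Type: application/json' \
    --data "{\"contents\": [{\"role\": \"user\", \"parts\": [{\"text\": \"$PROMPT\"}]}]}" \
    | grep -iE "^cached-content|^time_total"
done

On the first call you should see no Cached-Content header; subsequent, semantically similar prompts should return Cached-Content: true with a noticeably lower time_total.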

Best practices

We recommend incorporating the following best practices into your API management program when using the semantic caching policies:

  • Prevent caching of sensitive data with Model Armor.

    To prevent caching of sensitive data, we recommend using Model Armor for content filtering. Model Armor can flag responses as non-cacheable if it detects sensitive information. For more information, see the Model Armor overview.

  • Manage data freshness with Vertex AI data point invalidation and time-to-live (TTL).

    We recommend implementing appropriate data point invalidation strategies to ensure that cached responses are up-to-date and reflect the latest information from your backend systems. To learn more, see Update and rebuild an active index.

    You can also adjust the TTL for cached responses based on the data's volatility and frequency of updates. For more information on using TTL in the SemanticCachePopulate policy, see <TTLInSeconds>.

  • Use predefined caching strategies to ensure the most accurate response data.

    We recommend implementing predefined caching strategies similar to the following (a configuration sketch follows this list):

    • Generic AI responses: Configure a long TTL (for example, one hour) for non-user-specific responses.
    • User-specific responses: Do not implement caching, or set a short TTL (for example, five minutes) for responses that contain user-specific information.
    • Time-sensitive responses: Configure a short TTL (for example, five minutes) for responses that require real-time or frequent updates.
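
As an illustration, a long-TTL configuration for generic responses might set <TTLInSeconds> as in the following fragment; other policy elements are omitted, and the policy name is hypothetical. Substitute a short value such as 300 for user-specific or time-sensitive responses:

<SemanticCachePopulate name="SC-Populate">
  <!-- Generic AI responses: cache for one hour -->
  <TTLInSeconds>3600</TTLInSeconds>
</SemanticCachePopulate>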

Limitations

The following limitations apply to the semantic caching policies:

  • The maximum cacheable text size is 256 KB. For more information, see Cache value size on the Apigee Limits page.
  • Apigee ignores any cache-control headers it receives from the LLM model.
  • If the cache is not invalidated properly, or if the semantic similarity algorithm is not sufficiently accurate to differentiate between inputs with very similar meanings, the response may return outdated or incorrect information.
  • The Vector Search feature is not supported in all regions. For a list of supported regions, see the Feature availability section of the Vertex AI Locations page. If your Apigee organization is in an unsupported region, you must create index endpoints in a different region than your Apigee organization.
  • The semantic caching policies are not supported for use with API proxies that use EventFlows for continuous response streaming of server-sent events (SSE).
  • The JsonPath function within <UserPromptSource> does not support the ignoreUnresolvedVariables functionality. By default, null or empty values are ignored during message template application.