We’re excited to announce the Fastly AI Accelerator, a transparent API proxy for AI platforms designed to reduce latency and lower costs through semantic caching! We’re starting with OpenAI, but more platforms are in the works.
Semantic caching, like regular caching, uses a previous response to serve a current request. The difference is in how we decide which response to use. In regular caching, the request must match a URL, token, or hash before a previous response can be reused. With semantic caching, we leverage advances in AI embedding models to reuse responses from previous requests that are "semantically" similar to the incoming prompt.
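To make the idea concrete, here is a minimal sketch of how a semantic cache can work, not Fastly's internal implementation: each prompt is embedded as a vector, and a stored response is served when a new prompt's embedding is close enough to one seen before. The embedding function, similarity threshold, and in-memory store below are purely illustrative assumptions.

Python

# Illustrative sketch of semantic caching -- not Fastly's implementation.
# The embed_fn, threshold, and in-memory store are assumptions for demonstration.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn      # e.g. a call to an embedding model
        self.threshold = threshold    # hypothetical similarity cutoff
        self.entries = []             # list of (embedding, cached_response)

    def lookup(self, prompt):
        query = self.embed_fn(prompt)
        for embedding, response in self.entries:
            if cosine_similarity(query, embedding) >= self.threshold:
                return response       # semantic cache hit
        return None                   # miss: generate with the LLM, then store()

    def store(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))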
In our testing, a semantic cache hit on the Fastly AI Accelerator is frequently 10x faster than generating a corresponding OpenAI response from scratch. As an added bonus, you save the prompt and chat response token costs from OpenAI. OpenAI itself recommends a semantic cache in front of its API as a best practice for developers. This can add up to substantial savings, even on the latest, more affordable models.
Frequently Asked Questions
What is the Fastly AI Accelerator?
Fastly AI Accelerator is a semantic caching solution for OpenAI APIs designed to reduce latency and API usage bills. See fastly.com/ai for more information.
What LLM providers and APIs are supported?
We cache the OpenAI Chat Completions API. Other OpenAI API calls can be accelerated through Fastly’s network but will not be cached.
What is the pricing for the AI Accelerator?
There is no charge for this product while it is in beta. Future pricing is still under research and will be developed in partnership with early testers to build a predictable and scalable model. No surprises!
Where can I see service usage metrics?
Per-customer service metrics are available on the AI service page, covering total requests, cached requests, and tokens saved:
- Total Requests - the total requests handled by AI Accelerator
- Cached Requests - the number of requests served from cache
- Tokens Saved - the number of tokens saved by using AI Accelerator. It represents the total tokens associated with the cached responses.
How do I get started?
- If you don’t have a Fastly account, create an account at fastly.com/ai
- Opt-in to the AI Accelerator beta
- Create a read-only Fastly API key
- In your API client, replace the base URL with https://ai-accelerator.fastly.com/v1 and send your Fastly API key in the Fastly-Key header, as shown in the following Python and JavaScript examples for your existing AI application:
Python

from openai import OpenAI

client = OpenAI(
    # The OpenAI API key is read from the OPENAI_API_KEY environment variable by default
    # Set the API endpoint to point to the AI Accelerator
    base_url="https://ai-accelerator.fastly.com/v1",
    # Set default headers to authenticate to your AI Accelerator service
    default_headers={
        "Fastly-Key": "<FASTLY-KEY>",
    },
)
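Once the client is configured, requests go through the accelerator unchanged. For example (the model and prompt below are just placeholders):

# Example request through the configured client; model and prompt are placeholders.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How did Aragorn become a ranger?"}],
)
print(completion.choices[0].message.content)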
JavaScript

import OpenAI from "openai";

const openai = new OpenAI({
  // The OpenAI API key is still used to authenticate to OpenAI
  apiKey: process.env.OPENAI_API_KEY,
  // Set the API endpoint to point to the AI Accelerator
  baseURL: "https://ai-accelerator.fastly.com/v1",
  // Set default headers to authenticate to your AI Accelerator service
  defaultHeaders: {
    "Fastly-Key": "<FASTLY-KEY>",
  },
});
- (Optional) Send a "Hello World" test request via:
curl --request POST \
  --url https://ai-accelerator.fastly.com/v1/chat/completions \
  --header 'Accept-Encoding: gzip, deflate' \
  --header 'Authorization: Bearer <OpenAI Key>' \
  --header 'Content-Type: application/json' \
  --header 'Fastly-Key: <Fastly Key>' \
  --data '{
    "model": "gpt-4o",
    "messages": [{
      "role": "system",
      "content": "You are a friendly assistant, skilled in helping Tolkien aficionados learn more about the history of Middle Earth."
    },
    {
      "role": "user",
      "content": "How did Aragorn become a ranger? Be detailed!"
    }],
    "temperature": 0.7,
    "stream": false
  }'
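To see the cache at work, you can send the same prompt twice and compare latencies; on a semantic cache hit, the second call should return noticeably faster. The timing helper below is just a quick sketch, assuming the client from the Python example above:

Python

# Quick latency check: an identical second request should hit the semantic cache.
# Assumes `client` is configured as in the Python example above.
import time

def timed_request():
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "How did Aragorn become a ranger? Be detailed!"}],
        temperature=0.7,
    )
    return time.perf_counter() - start

first = timed_request()   # likely a cache miss: full OpenAI generation
second = timed_request()  # likely a cache hit: served from the accelerator
print(f"first: {first:.2f}s, second: {second:.2f}s")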
Have questions or need help?
Start a conversation in this forum, and our team will dive in to help! Happy creating!