Which is the fastest LLM? A comprehensive benchmark.

May 10, 2024

We compared the speed of OpenAI, Anthropic and Cohere models to see how much time each one takes to provide an answer. Here is a summary:

GPT 4 is 5-10 times slower than GPT 3.5. For most tasks, GPT 3.5 is powerful enough while being much more responsive.
GPT 3.5’s biggest limitation is its small 16K context window. Among models with larger context windows, Cohere’s Command R has latency comparable to GPT 3.5 while supporting a context window of up to 128K tokens.
Anthropic’s Claude Haiku is the cheapest and among the fastest models.
Both OpenAI and Cohere allow deploying their models on Azure AI Studio. Cohere’s models are generally slower on Azure, while results for OpenAI’s models are mixed: GPT 3.5 Turbo is faster on Azure, but GPT 4 Turbo is faster on OpenAI.

The table below shows the evaluated models sorted by speed. Speed and Elo scores are closely related: higher-rated models tend to be slower, and small improvements in Elo come at a large cost in response time. API response times are for summarizing a text of approximately 5000 tokens.

TL;DR

To assess the performance of different large language models (LLMs), we conducted tests focusing on response times for summarizing texts of varying lengths. Using a REST API to interface with each model, we averaged the time each model took to return a summary for texts of three sizes: small (~1400 tokens), medium (~5000 tokens) and large (~14000 tokens).
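The measurement described above can be sketched roughly as follows. This is a minimal illustration, not the harness we actually ran: the function names (`average_latency`, `benchmark`, `fake_model`), the prompt wording and the number of runs are all assumptions, and the real REST call to each provider is replaced by a stand-in that just sleeps.

```python
import time
from statistics import mean
from typing import Callable, Dict


def average_latency(call_model: Callable[[str], str], prompt: str, runs: int = 3) -> float:
    """Return the mean wall-clock time in seconds over `runs` calls."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)  # blocks until the full response is received
        timings.append(time.perf_counter() - start)
    return mean(timings)


def benchmark(call_model: Callable[[str], str], texts: Dict[str, str], runs: int = 3) -> Dict[str, float]:
    """Time a summarization prompt for each text size (e.g. small, medium, large)."""
    results = {}
    for size, text in texts.items():
        # Hypothetical prompt wording; the original prompt is not given.
        prompt = "Summarize the following text:\n\n" + text
        results[size] = average_latency(call_model, prompt, runs)
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a provider's REST API call (assumption, for illustration only)."""
    time.sleep(0.01)
    return "summary"


if __name__ == "__main__":
    sample_texts = {"small": "short text", "medium": "longer text", "large": "longest text"}
    for size, seconds in benchmark(fake_model, sample_texts).items():
        print(f"{size}: {seconds:.3f} s")
```

To time a real model, `fake_model` would be swapped for a function that sends the prompt to the provider's endpoint and returns the response text.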

We also used the Chatbot Arena Leaderboard, an online platform comparing the conversational quality of various LLMs based on human evaluation, to gauge the models' overall effectiveness. By combining both response time data and Chatbot Arena rankings, we can analyze the balance between speed and result quality.

GPT 3.5 Turbo is the fastest model, with average response times between 2 and 3 seconds for all text sizes, so it should be your preferred model if the prompt text is relatively small. However, for some use cases, such as text summarization or RAG with a large number of context documents, the 16K context window may be a limiting factor. In addition, the model is ranked last on the chatbot leaderboard among the models tested, so it may have lower result quality than the others. Cohere’s Command R offers very similar response times to GPT 3.5 but comes with an 8x larger context window of 128K tokens and is ranked much higher on the chatbot leaderboard (#14 vs. #25), so it is likely to deliver higher result quality. Anthropic’s Claude Haiku also offers response times comparable to GPT 3.5 for the small and medium text sizes (around 10-15% slower) but is much slower for the large text size (around 48% slower), despite its large 200K context window.

Among the more intelligent models, which are tuned for comprehension of difficult texts, complex reasoning and problem solving, Command R+ is the fastest. It delivers average response times between 5 and 8 seconds, which is up to 60% faster than GPT 4 and up to 70% faster than Claude Opus. However, Command R+ may have lower result quality than both of these models, as it is ranked lower on the chatbot leaderboard. GPT 4 Turbo, currently the top-ranked model, is up to 36% faster than Claude Opus.
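The "% slower" and "% faster" figures quoted here are relative differences between average response times. The exact formula is not spelled out, so the sketch below is an assumption: "slower" is measured against the faster baseline, and "faster" as the reduction relative to the slower model. The function names are illustrative and the numbers in the usage lines are made-up placeholders, not benchmark results.

```python
def percent_slower(model_seconds: float, baseline_seconds: float) -> float:
    """How much longer `model` takes, as a percentage of the baseline's time."""
    return 100 * (model_seconds - baseline_seconds) / baseline_seconds


def percent_faster(model_seconds: float, other_seconds: float) -> float:
    """How much response time `model` saves, as a percentage of the other model's time."""
    return 100 * (other_seconds - model_seconds) / other_seconds


if __name__ == "__main__":
    # Placeholder timings in seconds, chosen only to make the arithmetic easy to follow.
    print(percent_slower(3.0, 2.0))   # 3 s vs. a 2 s baseline -> 50.0
    print(percent_faster(4.0, 10.0))  # 4 s vs. 10 s -> 60.0
```

Note that the two measures are not symmetric: in this convention, a model that is 50% slower than a baseline means the baseline is about 33% faster than that model.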

Comparing cloud services, we observe noticeable differences between GPT models on OpenAI and Azure. GPT 4 Turbo hosted on OpenAI delivers faster response times across all text sizes than on Azure. On the other hand, GPT 3.5 Turbo on Azure is consistently quicker than its OpenAI counterpart. Cohere's Command R and R+ also perform differently on Cohere's platform versus Azure, with Cohere's native hosting generally offering lower response times for small- and medium-sized tasks. However, as input sizes increase, the differences become less pronounced. These variations highlight the importance of choosing the right platform based on model type and specific task requirements.

Future and other LLMs:

We tested models from OpenAI and Cohere, both of which are available on the vendor's own cloud and on Azure AI. We compared these with Anthropic’s latest models hosted on Anthropic's own cloud. We focused only on the large foundational models with high Elo scores on Chatbot Arena.

This initial study did not exhaust all possible options, and we plan to add the following in the future:

Evaluate Groq, which hosts the open models Llama 3 and Mixtral and promises significantly higher speed.
Add Google Gemini models to the mix.
Compare open models, such as Llama 3 and the Mistral models, across various hosting providers.
Test other hosting providers, such as Amazon Bedrock, which also provides Anthropic’s models.

We would love to hear your feedback and what you would like to see. Please reach us at contact AT workorb DOT com.