The Groq LPU™ Inference Engine achieved impressive results in a recent large language model (LLM) inference benchmark conducted by Anyscale, a developer innovator and friendly competitor in the LLM inference space. The benchmark, the LLMPerf Leaderboard, covers a selection of LLM inference providers, and the analysis evaluates performance, reliability, and efficiency as measured by output token throughput and time to first token (TTFT).
According to the benchmark, Groq’s LPU™ Inference Engine outperformed all other cloud-based inference providers, with output token throughput up to 18x higher. The LPU™ Inference Engine averaged 185 output tokens per second (tokens/s), while the other providers ranged from 3x to 18x slower. The engine used in the benchmark supports the FP16 and FP8 data types and ran the entire Llama 2 70B model as provided by Meta AI, without any sparsity.
The LLMPerf Leaderboard uses prompts of 550 input tokens and responses of 150 output tokens. The first metric, output token throughput (i.e., output speed), is determined by dividing the number of output tokens by the overall end-to-end time, which includes input token processing time and overall network latency. For a full list of caveats and disclaimers for this benchmark, please refer to the documentation here.
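To make the metric concrete, here is a minimal sketch of the computation; the `output_token_throughput` helper is hypothetical, and the 0.81-second round trip is an illustrative number chosen only so the arithmetic lands near the ~185 tokens/s figure reported above.

```python
import time

def output_token_throughput(num_output_tokens: int, start: float, end: float) -> float:
    # Leaderboard definition: output tokens divided by total end-to-end
    # time, which includes input-token processing and network latency.
    return num_output_tokens / (end - start)

# Illustrative numbers only: 150 output tokens over a 0.81 s round trip.
start = time.perf_counter()
end = start + 0.81  # stand-in for a measured completion time
print(f"{output_token_throughput(150, start, end):.0f} tokens/s")  # -> 185 tokens/s
```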
Time to first token (TTFT) is how long the LLM takes to return the first token. TTFT is especially important for streaming applications that require low latency, such as chatbots. Groq’s LPU™ Inference Engine achieved a TTFT of 0.22 seconds, indicating low latency and high reliability. The deterministic design of the LPU™ Inference Engine provides consistent response times, resulting in high repeatability and less effort spent designing around potential latency issues or slow responses.
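As a rough illustration of how TTFT can be measured, the sketch below times a streaming request. The `start_stream` callable and the `fake_stream` generator are hypothetical stand-ins for a real provider’s streaming client, and the 0.22-second delay simply echoes the benchmark figure quoted above.

```python
import time
from typing import Callable, Iterator

def measure_ttft(start_stream: Callable[[], Iterator[str]]) -> float:
    # `start_stream` is a hypothetical callable that sends the request
    # and returns an iterator over streamed output tokens; any
    # provider's streaming client can be wrapped this way.
    t0 = time.perf_counter()
    stream = start_stream()   # send the request
    next(stream)              # block until the first token arrives
    return time.perf_counter() - t0

# Illustrative stand-in for a provider stream (not a real API call).
def fake_stream() -> Iterator[str]:
    time.sleep(0.22)          # simulate the 0.22 s TTFT quoted above
    yield "Hello"
    yield " world"

print(f"TTFT: {measure_ttft(fake_stream):.2f} s")  # -> TTFT: 0.22 s
```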
The LPU™ Inference Engine is a hardware accelerator designed for machine learning and artificial intelligence applications. It is built on a deterministic architecture that provides consistent response times and high throughput, making it suitable for large-scale language models and other data-intensive workloads.
The LLM inference benchmark results demonstrate the high performance and low latency of Groq’s LPU™ Inference Engine, making it an attractive solution for machine learning and artificial intelligence applications. The results also highlight the importance of evaluating performance, reliability, and efficiency when selecting an LLM inference provider.
In addition to these benchmark results, Groq has announced that it is working on benchmarking for Llama 2 7B and plans to mix things up with a Mixture of Experts model beyond that. The company is committed to providing high-performance, low-latency solutions for machine learning and artificial intelligence applications.
In conclusion, the Groq LPU™ Inference Engine achieved impressive results on the LLMPerf Leaderboard, a benchmark for evaluating LLM inference providers. The results reaffirm the engine’s high performance and low latency, and the value of weighing performance, reliability, and efficiency when choosing a provider. Groq’s commitment to high-performance, low-latency solutions for machine learning and artificial intelligence applications positions it as a leader in the LLM inference space.
Output Tokens Throughput (tokens/s)
Output token throughput is measured as the average number of output tokens returned per second. Results are collected by sending 150 requests to each LLM inference provider and computing the mean output token throughput across those 150 requests. A higher value indicates that the provider returns output tokens faster.
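A minimal sketch of that aggregation step follows; the `per_request` samples are fabricated for illustration and simply cluster around the ~185 tokens/s figure reported earlier.

```python
from statistics import mean

def mean_output_throughput(per_request: list[float]) -> float:
    # One throughput sample (tokens/s) per request; the leaderboard
    # methodology uses 150 requests per provider.
    return mean(per_request)

# Fabricated samples for illustration, clustered around 185 tokens/s.
samples = [185.0 + (i % 5 - 2) * 0.5 for i in range(150)]
print(f"{mean_output_throughput(samples):.1f} tokens/s")  # -> 185.0 tokens/s
```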
70B Models [per-provider results charts omitted]
Time to First Token (seconds)
For streaming applications, TTFT measures how long the LLM takes to return the first token.