AI performance startup Arthur introduces an open-source tool for evaluating LLMs

Arthur developed Bench so that organizations can evaluate how different LLMs perform in real-world scenarios and make informed, data-driven decisions.

Heaptalk, Jakarta — AI performance startup Arthur unveiled Bench, its latest tool for evaluating large language models (LLMs) (08/17). Bench is an open-source evaluation tool for comparing LLMs, prompts, and hyperparameters for generative text models.

Bench lets organizations assess how different LLMs perform in real-world scenarios. As a result, they can make informed, data-driven decisions when integrating the latest AI technologies into their operations.

“With Bench, we’ve created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes,” said Adam Wenchel, co-founder and CEO of Arthur.

According to Wenchel, research from the Generative Assessment Project (GAP) shows that performance differences between LLMs can be highly nuanced. GAP is an Arthur research initiative that assesses the strengths and weaknesses of language models from industry leaders such as OpenAI, Anthropic, and Meta.

Determining the most suitable LLM for corporate applications

Bench can help businesses in three ways: model selection & validation, budget & privacy optimization, and translation of academic benchmarks into real-world performance. For model selection & validation, the tool compares the available LLM options using consistent metrics, so businesses can determine the most suitable LLM for their application.
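As a rough illustration of comparing models with one consistent metric, the sketch below scores each model's answers against the same reference answers and ranks the models by mean score. It does not use Bench's own API: the token-overlap F1 metric, the function names, the model names, and the sample answers are all assumptions made up for this example.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between a candidate answer and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_score(outputs: list[str], references: list[str]) -> float:
    """Average the same metric over every prompt in the test set."""
    return sum(token_f1(o, r) for o, r in zip(outputs, references)) / len(references)


# Hypothetical reference answers and model outputs for the same two prompts.
references = [
    "Paris is the capital of France",
    "Water boils at 100 degrees Celsius",
]
outputs_by_model = {
    "model_a": ["Paris is the capital of France", "Water boils at 100 degrees Celsius"],
    "model_b": ["The capital is Paris", "It boils at 90 degrees"],
}

scores = {name: mean_score(outs, references) for name, outs in outputs_by_model.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Because every model is scored on the same prompts with the same metric, the resulting numbers are directly comparable; in practice the metric would be chosen to match the application.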

Budget & privacy optimization helps businesses choose an affordable AI model that can still perform the required tasks. According to Arthur, a higher price does not necessarily indicate the LLM that best fits a company’s needs, since not all applications require the most advanced or expensive AI model. In some cases, a less expensive model can perform the required task just as well.
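One simple way to act on this idea is to set a minimum acceptable quality score and then pick the cheapest model that clears it. The sketch below is only an illustration under that assumption: the `Candidate` class, the model names, the scores, and the prices are invented for the example and do not come from Arthur or any provider.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    score: float               # mean benchmark score in [0, 1] (made-up values)
    cost_per_1k_tokens: float  # price in USD (made-up values)


def cheapest_adequate(candidates: list[Candidate], min_score: float) -> Optional[Candidate]:
    """Return the lowest-cost model whose score meets the threshold, if any."""
    adequate = [c for c in candidates if c.score >= min_score]
    if not adequate:
        return None
    return min(adequate, key=lambda c: c.cost_per_1k_tokens)


# Hypothetical evaluation results for three models.
candidates = [
    Candidate("large_model", score=0.92, cost_per_1k_tokens=0.06),
    Candidate("medium_model", score=0.90, cost_per_1k_tokens=0.002),
    Candidate("small_model", score=0.71, cost_per_1k_tokens=0.0005),
]

choice = cheapest_adequate(candidates, min_score=0.85)
print(choice.name if choice else "no model meets the threshold")
```

With these invented numbers, the mid-priced model is selected: it clears the quality bar at a fraction of the cost of the largest model, while the cheapest model falls short.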

By translating academic benchmarks into real-world performance, companies can quantitatively test and compare different models and evaluate them accurately and consistently. Companies can also configure custom benchmarks that focus on what matters most to their specific business and customers.

As an open-source tool, Bench will gain new metrics and other valuable features as the project and its community grow. Bench is available on GitHub and can be run locally or in the cloud.

Founded in 2019, Arthur has secured over $60M in funding from several firms, including Acrew, Greycroft, Index Ventures, BAM Elevate, Work-Bench, and Plexo Capital. In May 2023, the New York City-based startup launched Shield, a firewall tool that protects organizations against risks and security issues with applied LLMs.