Salesforce Announces the First LLM Benchmark for CRM

An LLM benchmark for CRM? Salesforce may be the first to offer a glimpse of what that could look like.

Salesforce has announced the world’s first LLM benchmark for CRM to help businesses evaluate the rapidly growing number of large language models (LLMs) for use in their customer relationship management (CRM) systems.

The new benchmark is a comprehensive evaluation framework that measures the performance of LLMs against four key measures: accuracy, cost, speed, and trust and safety. It has been designed specifically to evaluate common sales and service use cases, including prospecting, lead nurturing, and sales opportunity and service case summaries. The benchmark also includes a public leaderboard to help professionals decide which LLM best fits their CRM needs. Salesforce will continue to add new use case scenarios to the benchmark and to extend its evaluation, which will soon cover fine-tuned LLMs.

“As AI continues to evolve, enterprise leaders are saying it’s important to find the right mix of performance, accuracy, responsibility, and cost to unlock the full potential of generative AI to drive business growth,” said Silvio Savarese, EVP & Chief Scientist, Salesforce AI Research. “Salesforce’s new LLM Benchmark for CRM is a significant step forward in the way businesses assess their AI strategy within the industry. It not only provides clarity on next-generation AI deployment but also can accelerate time to value for CRM-specific use cases. Our commitment is to continuously evolve this benchmark to keep pace with technological advancements, ensuring it remains relevant and valuable.”

Why it matters

Existing LLM benchmarks have been limited to academic and consumer use cases, with very little business relevance. They also lack adequate expert human evaluations and fail to address accuracy, speed, cost, and trust considerations. These deficiencies have left CRM customers lacking a reliable way to gauge the effectiveness of generative AI-powered CRM solutions. Without a clear sense of how LLMs perform across those metrics for specific use cases, businesses are left to make decisions in the dark.

Dive deeper

Developed by Salesforce AI Research, the benchmark is distinctive in two ways: it uses real-world CRM data, and it relies on expert human evaluations by practitioners. This enables businesses to make more strategic decisions about how to incorporate generative AI into their CRM systems, with specific attention to:

Accuracy: This metric comprises four subcategories: factuality, completeness, conciseness, and instruction-following. The more accurate the predictions or recommendations, the more valuable the results are to teams across the organisation, and the better the actions they can take to improve customer experience. Even when a model is accurate enough for a use case, the other metrics still need to be considered. When it is not accurate enough, techniques like prompt engineering and fine-tuning can improve it.
Cost: This metric is the estimated operational cost, which varies by CRM use case. It is categorised as high, medium, or low, based on percentiles. Customers can evaluate the cost-effectiveness of different LLMs to ensure they align with their budget and resource allocation strategies.
Speed: This metric assesses the LLM’s responsiveness and efficiency in processing and delivering information. Faster response times enhance the user experience, reduce wait times for customers, and enable sales and service teams to address inquiries and issues promptly.
Trust and Safety: This metric measures the LLM’s capability to protect sensitive customer data, adhere to data privacy regulations, secure information, and refrain from bias and toxicity for CRM use cases. By assessing the reliability of LLMs for CRM, the benchmark gives organisations transparency regarding trust and safety.
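
As a rough illustration of how a percentile-based cost rating could work, the sketch below sorts models into low, medium, and high tiers. The article does not say which percentiles Salesforce uses, so the cut points here (roughly the 33rd and 67th percentiles) are an assumption, and the model names and cost figures are hypothetical.

```python
from statistics import quantiles


def cost_tiers(costs: dict[str, float]) -> dict[str, str]:
    """Label each model's cost as 'low', 'medium', or 'high'.

    Assumption: tiers are split at the tertiles of the observed costs.
    The benchmark's actual percentile thresholds are not stated.
    """
    # quantiles(..., n=3) returns the two cut points that divide the
    # data into three equal-probability groups.
    low_cut, high_cut = quantiles(list(costs.values()), n=3, method="inclusive")

    tiers = {}
    for model, cost in costs.items():
        if cost <= low_cut:
            tiers[model] = "low"
        elif cost <= high_cut:
            tiers[model] = "medium"
        else:
            tiers[model] = "high"
    return tiers


# Hypothetical estimated cost per use case, in arbitrary units.
example_costs = {
    "model_a": 1.0,
    "model_b": 2.0,
    "model_c": 3.0,
    "model_d": 4.0,
    "model_e": 5.0,
    "model_f": 6.0,
}

for model, tier in cost_tiers(example_costs).items():
    print(f"{model}: {tier}")
```

Because the tiers are relative to the set of models being compared, a model's rating can change as models are added or removed, which is one reason to read the rating alongside the other three metrics.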

Organisations can use this benchmark to compare LLMs, identify the best solution, and make more informed decisions that will deliver customer success and propel their business forward.

And, with Salesforce’s Einstein 1 Platform, customers can choose from existing LLMs or bring their own models to meet their unique business needs. By selecting models for their CRM use cases using the benchmark, businesses can deploy more effective and efficient generative AI solutions.

“Business organisations are looking to utilise AI to drive growth, cut costs, and deliver personalised customer experiences, not to plan a kid’s birthday party or summarise Othello,” said Clara Shih, CEO of Salesforce AI. “Our customers have been asking for a purpose-built way to evaluate and select from among the proliferation of new AI models, and we are thrilled to introduce the world’s first LLM benchmark for CRM to help them navigate the complex landscape of models. This benchmark is not just a measure; it’s a comprehensive, dynamically evolving framework that empowers companies to make informed decisions, balancing accuracy, cost, speed, and trust.”

(This article was adapted from a press release)