Every AI model release inevitably includes charts touting how it outperforms competitors on this or that benchmark or evaluation metric.
However, these benchmarks usually test general capabilities. For organizations that want to use models and large language model-based agents, it is difficult to assess how well an agent or model actually understands their specific needs.
Model repository Hugging Face has released YourBench, an open-source tool that lets developers and enterprises create their own benchmarks to test model performance against their internal data.
Sumuk Shashidhar, part of the evaluations research team at Hugging Face, announced YourBench on X. He said the feature offers "custom benchmarking and synthetic data generation from any document. It's a big step toward improving how model evaluations work."
He added that Hugging Face knows "in many use cases, what really matters is how well a model performs your specific task. YourBench lets you evaluate models on what matters to you."
Creating custom evaluations
Hugging Face said in its paper that YourBench works by replicating subsets of the Massive Multitask Language Understanding (MMLU) benchmark "using minimal source text, at a total inference cost of $15, while perfectly retaining relative model performance rankings."
Organizations must preprocess their documents before YourBench can work. This involves three stages:
- Document ingestion to "normalize" file formats.
- Semantic chunking to break documents down to meet context window limits and focus the model's attention.
- Document summarization.
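The three stages above can be sketched in plain Python. This is an illustrative toy, not YourBench's actual API: the function names, the character-budget chunking heuristic, and the first-sentence summarizer are all stand-ins (a real pipeline would call an LLM for summarization).

```python
# Toy sketch of the three preprocessing stages. All names and
# heuristics here are assumptions for illustration only.

def ingest(raw: str) -> str:
    """Stage 1: ingestion -- 'normalize' a document into plain text
    by collapsing stray whitespace."""
    return " ".join(raw.split())

def chunk(text: str, max_chars: int = 200) -> list[str]:
    """Stage 2: semantic chunking -- split text into pieces that fit
    a context-window budget, breaking on sentence boundaries."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

def summarize(chunks: list[str]) -> str:
    """Stage 3: summarization. A real pipeline would prompt an LLM;
    here we just keep the first sentence of each chunk."""
    return " ".join(c.split(".")[0] + "." for c in chunks)

document = ingest(
    "Quarterly revenue rose 12%.   Churn fell to 3%. "
    "The new pricing tier launched in March. "
    "Support tickets dropped after the docs rewrite."
)
chunks = chunk(document, max_chars=80)
summary = summarize(chunks)
```

The key design point the stages imply is that chunking happens before any model call, so every downstream question can be traced back to a bounded span of source text.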
Next comes the question-and-answer generation process, which creates questions from the information in the documents. This is where users bring in their LLM of choice to see which one answers the questions best.
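The Q&A step can be mocked in a few lines to show the shape of the comparison: generate question/reference pairs from chunks, then score each candidate model on them. The `generate_qa` templates and the exact-match scorer are simplifying assumptions; YourBench's real generation and grading use LLMs.

```python
# Mocked sketch of Q&A generation and model comparison.
# None of this is YourBench's actual API.

def generate_qa(chunks):
    """Turn each chunk into a (question, reference_answer) pair.
    A real pipeline would prompt an LLM to write these."""
    return [(f"What does the document say here: '{c[:30]}'?", c)
            for c in chunks]

def score_model(answer_fn, qa_pairs):
    """Fraction of questions answered correctly (exact match
    against the reference, for simplicity)."""
    correct = sum(answer_fn(q) == ref for q, ref in qa_pairs)
    return correct / len(qa_pairs)

chunks = ["Revenue grew 12% in Q3.", "Churn fell to 3%."]
qa = generate_qa(chunks)

# Two mocked "models": one always right, one always wrong.
lookup = {q: ref for q, ref in qa}
scores = {
    "model_a": score_model(lambda q: lookup[q], qa),
    "model_b": score_model(lambda q: "I don't know.", qa),
}
best = max(scores, key=scores.get)
```

Running this ranks `model_a` first, which is the whole point of the tool: the ranking comes from your documents, not from a public leaderboard.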
Hugging Face tested YourBench with DeepSeek V3 and R1, Alibaba's Qwen models including the reasoning model Qwen QwQ, Mistral Large 2411 and Mistral 3.1 Small, Llama 3.1 and Llama 3.3, Gemini 2.0 Flash, Gemini 2.0 Flash Lite and Gemma 3, GPT-4o, GPT-4o mini and o3-mini, and Claude 3.7 Sonnet and Claude 3.5 Haiku.
Hugging Face also ran a cost analysis of the models and found that Qwen and Gemini 2.0 Flash "produced enormous value at very low costs," Shashidhar said.
Compute limitations
However, creating custom LLM benchmarks from an organization's files comes at a cost. YourBench demands a lot of compute, and Shashidhar said on X that the company is "adding capacity" as fast as it can.
Hugging Face runs several GPUs and partners with companies like Google to use their cloud services for inference tasks. VentureBeat has reached out to Hugging Face for comment.
Benchmarks are not perfect
Benchmarks and other evaluation methods give users a sense of how well a model performs, but they do not perfectly capture how models will work day to day.
Some have even expressed skepticism that benchmarks reveal models' limitations, arguing they can lead to false conclusions about their safety and performance. One study also warned that benchmarking agents can be "misleading."
However, enterprises cannot avoid evaluating models now that there are so many options on the market, and technology leaders must justify the rising cost of using AI models. This has led to different methods for testing model performance and reliability.
Google DeepMind introduced FACTS Grounding, which tests a model's ability to generate factually accurate responses based on information from documents. Researchers from Yale and Tsinghua University developed self-invoking code benchmarks to guide enterprises in choosing which coding LLMs work for them.