LLM Benchmark Design Specialist

Design rigorous, task-specific benchmarks for evaluating large language models. Build evaluation suites that measure reasoning, factuality, instruction-following, and domain capability.

Evaluating a large language model is far more complex than running it through a set of trivia questions and counting correct answers. Meaningful benchmark design requires careful thinking about what capabilities matter for a given use case, how to construct test items that genuinely discriminate between model quality levels, and how to avoid the dataset contamination and overfitting problems that plague many published benchmarks. This AI assistant helps researchers, ML engineers, and evaluation teams build benchmarks that actually measure what they claim to measure.
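As a rough illustration of what a contamination check can look like, the sketch below measures how many of a test item's word n-grams also appear in a reference document. It is a minimal sketch, not a definitive method: the function names, whitespace tokenization, and default n-gram length of 8 are all assumptions chosen for illustration, and real pipelines typically add normalization and scale to large corpora.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in corpus_doc.

    Returns 0.0 when the item is shorter than n words. The default n is an
    illustrative assumption, not a recommended threshold.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


if __name__ == "__main__":
    item = "the quick brown fox jumps"
    doc = "he said the quick brown fox jumps over the fence"
    print(overlap_ratio(item, doc, n=3))
```

A high ratio suggests the item may have leaked into training data and deserves manual review; a low ratio does not prove the item is clean, since paraphrased copies are not caught by exact n-gram matching.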

The LLM Benchmark Design Specialist helps you design end-to-end evaluation suites for large language models across a wide range of capability dimensions: factual accuracy, multi-step reasoning, instruction following, long-context comprehension, code generation, mathematical reasoning, tool use, and domain-specific knowledge. It generates task taxonomies, prompt construction guidelines, scoring rubrics, strategies for negative cases and adversarial items, and contamination mitigation approaches. It also advises on the statistical side of benchmark design: sample size, difficulty distribution, inter-rater reliability for human evaluation components, and variance reduction strategies.
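To make the sample-size point concrete, here is a minimal sketch assuming each benchmark item is scored pass/fail and items are independent. It computes a Wilson score interval for an observed accuracy and the number of items needed for a target margin of error under the normal approximation. The function names and the 95% default (z = 1.96) are illustrative assumptions, not part of the assistant's output.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a pass/fail accuracy."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def items_needed(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Items required so the normal-approximation margin of error is at most `margin`.

    p = 0.5 is the worst case (largest variance) and is used as the default.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    low, high = wilson_interval(80, 100)
    print(f"80/100 correct: 95% interval roughly {low:.3f} to {high:.3f}")
    print("Items for a 5-point margin:", items_needed(0.05))
```

With 100 items, an observed 80% accuracy is compatible with a true accuracy anywhere from the low 70s to the mid 80s, which is why small benchmarks often cannot separate models whose scores differ by only a few points.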

This assistant is particularly useful for AI research teams building internal capability evaluations, companies developing model cards and transparency documentation, and organizations benchmarking third-party models for procurement decisions. It draws on published evaluation frameworks such as MMLU, BIG-Bench, HELM, and MT-Bench to inform benchmark design, while helping you build evaluations tailored to your specific use case rather than copying generic frameworks.

Expect outputs such as structured benchmark specification documents, task type definitions, prompt template frameworks, scoring criteria, and methodological guidance for running evaluations reproducibly. The assistant also helps you anticipate and honestly document benchmark limitations, which is increasingly important for credible model evaluation reporting.
