LLM Evals
Apr 6, 2025 · 4 min read · updated Apr 6, 2025
Notable Benchmarks

Some notable benchmarks in language modeling:

- MMLU: 57 tasks spanning elementary math, US history, computer science, law, and more.
- EleutherAI Eval: Unified framework to test models in zero/few-shot settings on 200 tasks from various evals, including MMLU.
- HELM: Evaluates LLMs across domains; tasks include Q&A, information retrieval, summarization, text classification, etc.
- AlpacaEval: Measures how often a strong LLM judge (e.g., GPT-4) prefers the output of one model over that of a reference model; a minimal sketch of this win-rate computation follows the list. ...
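
To make the AlpacaEval-style metric concrete, here is a minimal sketch of turning per-prompt judge preferences into a win rate. The function name, the label values, and the tie handling are illustrative assumptions, not AlpacaEval's actual implementation.

```python
def win_rate(judgements: list[str]) -> float:
    """Fraction of prompts where the judge preferred the candidate model's
    output over the reference model's output.

    `judgements` holds one label per prompt: "candidate", "reference",
    or "tie" (hypothetical labels, as if emitted by an LLM judge).
    """
    wins = sum(1 for j in judgements if j == "candidate")
    ties = sum(1 for j in judgements if j == "tie")
    # Counting a tie as half a win is one common convention (an assumption here).
    return (wins + 0.5 * ties) / len(judgements)


# Hypothetical judge output over five prompts.
example = ["candidate", "reference", "candidate", "tie", "candidate"]
print(win_rate(example))  # 0.7
```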