LLM Evals
Apr 6, 2025 · 4 min read · updated Apr 6, 2025
Notable Benchmarks

Some notable benchmarks in language modeling:

- MMLU: 57 tasks spanning elementary math, US history, computer science, law, and more.
- EleutherAI Eval: Unified framework to test models in zero/few-shot settings on 200 tasks from various evals, including MMLU.
- HELM: Evaluates LLMs across domains; tasks include Q&A, information retrieval, summarization, text classification, etc.
- AlpacaEval: Measures how often a strong LLM judge (e.g., GPT-4) prefers the output of one model over that of a reference model; a minimal sketch of this win-rate computation follows the list. ...
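
To make the AlpacaEval-style metric concrete, here is a minimal sketch of turning per-prompt judge preferences into a win rate. The function name, the label values, and the tie handling are illustrative assumptions, not AlpacaEval's actual implementation.

```python
def win_rate(judgements: list[str]) -> float:
    """Fraction of prompts where the judge preferred the candidate model's
    output over the reference model's output.

    `judgements` holds one label per prompt: "candidate", "reference",
    or "tie" (hypothetical labels, as if emitted by an LLM judge).
    """
    wins = sum(1 for j in judgements if j == "candidate")
    ties = sum(1 for j in judgements if j == "tie")
    # Counting a tie as half a win is one common convention (an assumption here).
    return (wins + 0.5 * ties) / len(judgements)


# Hypothetical judge output over five prompts.
example = ["candidate", "reference", "candidate", "tie", "candidate"]
print(win_rate(example))  # 0.7
```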