MMLU Benchmark Explained: How AI Models Like ChatGPT Are Measured
- Samul Black
- Aug 14
- 15 min read
Artificial intelligence has made remarkable progress in recent years, but measuring how “smart” these models really are is not as simple as running a spelling test. Enter the MMLU Benchmark—short for Massive Multitask Language Understanding. This widely used evaluation method tests AI models across 57 diverse subjects, from math and history to law and medicine, simulating the variety of questions a human might face in real-world scenarios. By assessing not just factual recall but reasoning ability across multiple disciplines, MMLU has become a gold standard for comparing the performance of large language models like ChatGPT, GPT-4, Claude, and Gemini. In this guide, we’ll break down what MMLU is, how it works, and why it plays such a crucial role in determining the true capabilities of today’s AI systems.

What is the MMLU Benchmark?
The MMLU Benchmark, or Massive Multitask Language Understanding, is a standardized evaluation designed to measure how well large language models (LLMs) can answer questions across a broad range of academic and professional domains. Unlike simple accuracy tests that focus on one narrow skill, MMLU spans 57 distinct subjects, including:
Science – physics, chemistry, biology, and more
Mathematics – from basic algebra to advanced statistics
History – world history, regional studies, and historical analysis
Law – legal principles, case interpretations, and terminology
Medicine – clinical knowledge, diagnoses, and treatment reasoning
Computer Science – algorithms, programming concepts, and theory
What makes MMLU unique is its breadth and difficulty. The questions are based on real-world knowledge—often at an advanced level—requiring models to apply reasoning rather than just memorized facts.
Key characteristics that set MMLU apart:
Covers multiple domains in a single evaluation, mimicking real-world unpredictability
Questions often require multi-step reasoning, not just fact recall
Designed to test general intelligence rather than niche expertise
The benchmark’s primary goal is to simulate the diverse and unpredictable nature of human questioning. This makes it a more reliable indicator of general intelligence in AI systems than task-specific benchmarks, which may overestimate a model’s abilities in broader contexts. As a result, MMLU has become a go-to test for researchers and companies when comparing models like GPT-4, Claude, Gemini, and LLaMA.
How MMLU Works: Structure and Scoring
The MMLU benchmark evaluates AI models by presenting them with multiple-choice questions across 57 subjects. Each subject is drawn from academic exams, professional certification materials, and real-world knowledge databases, ensuring both authenticity and difficulty.
1. Structure of the Test
Question Format – Every question has four answer options (A, B, C, D), with only one correct choice.
Subject Coverage – The benchmark is split into categories such as STEM, humanities, social sciences, and professional fields.
Difficulty Levels – Questions range from high school level to graduate-level or expert knowledge.
No Prior Context – Models receive each question independently, without extra training material specific to the test.
Example question:
Which data structure allows insertion and deletion from both ends in constant time?
A. Stack
B. Queue
C. Deque
D. Linked List
Correct answer: C. Deque
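To make the format concrete, here is a minimal sketch of how a single MMLU-style item can be represented and graded in Python. The field names are illustrative, not taken from the official dataset files:

```python
# A minimal sketch of one MMLU-style item and how it is graded.
# The field names are illustrative, not taken from the official dataset files.
mmlu_item = {
    "question": ("Which data structure allows insertion and deletion "
                 "from both ends in constant time?"),
    "choices": ["Stack", "Queue", "Deque", "Linked List"],
    "answer": "C",  # letter of the correct choice
}

def grade(item, model_choice):
    """Return True if the model picked the correct answer letter."""
    return model_choice.strip().upper() == item["answer"]

print(grade(mmlu_item, "C"))  # True
print(grade(mmlu_item, "a"))  # False
```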
2. Scoring Method
Accuracy Percentage – The primary metric is the percentage of correct answers across all questions.
Per-Category Performance – Researchers also analyze performance per subject to identify strengths and weaknesses.
Few-Shot and Zero-Shot Modes –
Zero-shot: The model answers with no examples provided.
Few-shot: The model is given a few example Q&A pairs before answering.
Comparison Across Models – Scores are often used in leaderboards to benchmark AI systems side-by-side.
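The difference between zero-shot and few-shot evaluation comes down to how the prompt is built. The sketch below shows one way to construct both; the exact template varies between evaluation harnesses, and the example Q&A pair is made up for illustration:

```python
# Sketch of zero-shot vs. few-shot prompt construction for one MMLU question.
# The exact template varies between evaluation harnesses; this is illustrative.

def format_question(question, choices):
    """Render a question and its four options, ending with an answer cue."""
    lines = [question] + [f"{letter}. {text}" for letter, text in zip("ABCD", choices)]
    return "\n".join(lines) + "\nAnswer:"

question = ("Which data structure allows insertion and deletion "
            "from both ends in constant time?")
choices = ["Stack", "Queue", "Deque", "Linked List"]

# Zero-shot: the model sees only the test question.
zero_shot = format_question(question, choices)

# Few-shot: solved example Q&A pairs are prepended before the test question.
# (This example pair is made up for illustration.)
examples = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
few_shot = "\n\n".join(
    format_question(q, c) + f" {a}" for q, c, a in examples
) + "\n\n" + zero_shot

print(zero_shot)
```

In few-shot mode the prepended examples show the model the expected answer format, which typically improves accuracy over zero-shot.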
3. Why This Matters
MMLU’s scoring reflects how well a model can:
Apply reasoning instead of just recalling memorized text
Generalize across different knowledge domains
Handle complex, nuanced questions similar to real-world challenges
This standardized format allows researchers, developers, and companies to make fair, data-driven comparisons between AI models like GPT-4, Claude, Gemini, and open-source alternatives.
Exploring the MMLU Dataset in Python and What It Reveals
The MMLU dataset is more than just a list of exam questions—it’s a structured benchmark designed to test AI models across 57 subjects ranging from STEM and humanities to social sciences and professional fields. In this section, we’ll load and explore the dataset directly in Python to understand its structure, question formats, and subject distribution. Each entry contains a multiple-choice question with four possible answers, often adapted from real academic or professional sources, and tagged with its subject category. By examining the dataset programmatically, we can see how diverse the topics are, assess the difficulty level of the questions, and prepare the data for evaluation tasks. This hands-on approach not only reveals the dataset’s richness but also sets the stage for analyzing AI performance across different knowledge domains.
Downloading the MMLU Dataset
Before we can explore the MMLU dataset in Python, the first step is to download the dataset archive from the official source. The dataset is hosted by Berkeley and contains all 57 subjects in a structured format. Using a simple wget command, we can fetch the dataset directly to our working environment, preparing it for extraction and analysis. Once downloaded, we can inspect the folder structure, load the questions, and begin exploring the content programmatically.
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
Output:
--2025-08-13 11:31:48-- https://people.eecs.berkeley.edu/~hendrycks/data.tar
Resolving people.eecs.berkeley.edu (people.eecs.berkeley.edu)... 128.32.244.190
Connecting to people.eecs.berkeley.edu (people.eecs.berkeley.edu)|128.32.244.190|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 166184960 (158M) [application/x-tar]
Saving to: ‘data.tar’
data.tar 100%[===================>] 158.49M 156MB/s in 1.0s
2025-08-13 11:31:49 (156 MB/s) - ‘data.tar’ saved [166184960/166184960]
Extracting the MMLU Dataset
After downloading the dataset, the next step is to extract its contents. The MMLU archive (data.tar) contains all 57 subjects organized into folders, each holding CSV files with questions and answers. Using the tar command, we can unpack the archive and access the full dataset structure, making it ready for analysis and exploration in Python.
# Extract the data.tar archive.
tar xvf data.tar
Output:
data/test/prehistory_test.csv
data/dev/
data/dev/professional_accounting_dev.csv
data/dev/clinical_knowledge_dev.csv
data/dev/college_medicine_dev.csv
data/dev/college_mathematics_dev.csv
data/dev/high_school_european_history_dev.csv
data/dev/logical_fallacies_dev.csv
data/dev/anatomy_dev.csv
data/dev/human_aging_dev.csv
data/dev/international_law_dev.csv
data/dev/high_school_chemistry_dev.csv
data/dev/formal_logic_dev.csv
data/dev/public_relations_dev.csv
data/dev/nutrition_dev.csv
data/dev/high_school_geography_dev.csv
data/dev/high_school_government_and_politics_dev.csv
data/dev/high_school_macroeconomics_dev.csv
data/dev/marketing_dev.csv
data/dev/business_ethics_dev.csv
data/dev/high_school_computer_science_dev.csv
data/dev/college_biology_dev.csv
data/dev/college_physics_dev.csv
data/dev/us_foreign_policy_dev.csv
data/dev/philosophy_dev.csv
data/dev/virology_dev.csv
data/dev/professional_medicine_dev.csv
data/dev/abstract_algebra_dev.csv
data/dev/machine_learning_dev.csv
data/dev/sociology_dev.csv
data/dev/elementary_mathematics_dev.csv
data/dev/management_dev.csv
data/dev/medical_genetics_dev.csv
data/dev/moral_disputes_dev.csv
data/dev/high_school_biology_dev.csv
data/dev/moral_scenarios_dev.csv
data/dev/security_studies_dev.csv
data/dev/prehistory_dev.csv
data/dev/high_school_mathematics_dev.csv
data/dev/global_facts_dev.csv
data/dev/high_school_statistics_dev.csv
data/dev/college_computer_science_dev.csv
data/dev/high_school_world_history_dev.csv
data/dev/human_sexuality_dev.csv
data/dev/econometrics_dev.csv
data/dev/high_school_us_history_dev.csv
data/dev/professional_psychology_dev.csv
data/dev/computer_security_dev.csv
data/dev/world_religions_dev.csv
data/dev/electrical_engineering_dev.csv
data/dev/jurisprudence_dev.csv
data/dev/high_school_microeconomics_dev.csv
data/dev/college_chemistry_dev.csv
data/dev/professional_law_dev.csv
data/dev/astronomy_dev.csv
data/dev/miscellaneous_dev.csv
data/dev/conceptual_physics_dev.csv
data/dev/high_school_psychology_dev.csv
data/dev/high_school_physics_dev.csv
Loading and Previewing the MMLU Dataset in Python
Once the MMLU dataset is extracted, the next step is to load the CSV files and inspect their contents. Using Python’s os and pandas libraries, we can programmatically iterate through each CSV file, print its filename clearly, and display the first few rows to get a quick sense of the data. This is particularly helpful in Colab, where visual clarity improves understanding and debugging.
import os
import pandas as pd
from IPython.display import display

# Directory containing the dev-split CSV files
data_dir = "/content/data/dev"

for file_name in os.listdir(data_dir):
    if file_name.endswith(".csv"):
        print("\n" + "=" * 50)
        print(f"📄 Filename: {file_name}")
        print("=" * 50 + "\n")
        # Load the CSV (MMLU files ship without a header row)
        df = pd.read_csv(os.path.join(data_dir, file_name), header=None)
        # Display the first 5 rows nicely
        display(df.head())
Output (first few files shown):
==================================================
📄 Filename: professional_medicine_dev.csv
=================================================
0 | 1 | 2 | 3 | 4 | 5 | |
0 | A 42-year-old man comes to the office for preo... | Labetalol | A loading dose of potassium chloride | Nifedipine | Phenoxybenzamine | D |
1 | A 36-year-old male presents to the office with... | left-on-left sacral torsion | left-on-right sacral torsion | right unilateral sacral flexion | right-on-right sacral torsion | D |
2 | A previously healthy 32-year-old woman comes t... | Dopamine | Glutamate | Norepinephrine | Serotonin | D |
3 | A 44-year-old man comes to the office because ... | Allergic rhinitis | Epstein-Barr virus | Mycoplasma pneumoniae | Rhinovirus | D |
4 | A 22-year-old male marathon runner presents to... | anterior scalene | latissimus dorsi | pectoralis minor | quadratus lumborum | C |
==================================================
📄 Filename: us_foreign_policy_dev.csv
==================================================
0 | 1 | 2 | 3 | 4 | 5 | |
0 | How did the 2008 financial crisis affect Ameri... | It damaged support for the US model of politic... | It created anger at the United States for exag... | It increased support for American global leade... | It reduced global use of the US dollar | A |
1 | How did NSC-68 change U.S. strategy? | It globalized containment. | It militarized containment. | It called for the development of the hydrogen ... | All of the above | D |
2 | The realm of policy decisions concerned primar... | terrorism policy. | economic policy. | foreign policy. | international policy. | C |
3 | How do Defensive Realism and Offensive Realism... | Defensive realists place greater emphasis on t... | Defensive realists place less emphasis on geog... | Offensive realists give more priority to the n... | Defensive realists believe states are security... | D |
4 | How did Donald Trump attack globalization in t... | Globalization had made men like him too rich | Globalization only benefited certain American ... | Liberal elites had encouraged globalization, w... | Globalization encouraged damaging trade wars | C |
==================================================
📄 Filename: international_law_dev.csv
==================================================
0 | 1 | 2 | 3 | 4 | 5 | |
0 | What types of force does Article 2(4) of the U... | Article 2(4) encompasses only armed force | Article 2(4) encompasses all types of force, i... | Article 2(4) encompasses all interference in t... | Article 2(4) encompasses force directed only a... | A |
1 | What is the judge ad hoc? | If a party to a contentious case before the IC... | Judge ad hoc is the member of the bench of the... | Judge ad hoc is a surrogate judge, in case a j... | Judge ad hoc is the judge that each party will... | A |
2 | Would a reservation to the definition of tortu... | This is an acceptable reservation if the reser... | This is an unacceptable reservation because it... | This is an unacceptable reservation because th... | This is an acceptable reservation because unde... | B |
3 | When 'consent' can serve as a circumstance pre... | Consent can serve as a circumstance precluding... | Consent can never serve as a circumstance prec... | Consent can serve as a circumstance precluding... | Consent can always serve as a circumstance pre... | C |
4 | How the consent to be bound of a State may be ... | The consent of a State to be bound is expresse... | The consent of a state to be bound by a treaty... | The consent of a State to be bound is expresse... | The consent of a State to be bound is expresse... | B |
==================================================
📄 Filename: electrical_engineering_dev.csv
==================================================
0 | 1 | 2 | 3 | 4 | 5 | |
0 | In an SR latch built from NOR gates, which con... | S=0, R=0 | S=0, R=1 | S=1, R=0 | S=1, R=1 | D |
1 | In a 2 pole lap winding dc machine , the resis... | 200Ω | 100Ω | 50Ω | 10Ω | C |
2 | The coil of a moving coil meter has 100 turns,... | 1 mA. | 2 mA. | 3 mA. | 4 mA. | B |
3 | Two long parallel conductors carry 100 A. If t... | 100 N. | 0.1 N. | 1 N. | 0.01 N. | B |
4 | A point pole has a strength of 4π * 10^-4 webe... | 15 N. | 20 N. | 7.5 N. | 3.75 N. | A |
==================================================
📄 Filename: high_school_microeconomics_dev.csv
==================================================
0 | 1 | 2 | 3 | 4 | 5 | |
0 | In a competitive labor market for housepainter... | An effective minimum wage imposed on this labo... | An increase in the price of gallons of paint. | An increase in the construction of new houses. | An increase in the price of mechanical painter... | C |
1 | If the government subsidizes producers in a pe... | the demand for the product will increase | the demand for the product will decrease | the consumer surplus will increase | the consumer surplus will decrease | C |
2 | The concentration ratio for a monopoly is | 0 | 5 | 10 | 100 | D |
3 | Which of the following is true of a price floor? | The price floor shifts the demand curve to the... | An effective floor creates a shortage of the g... | The price floor shifts the supply curve of the... | To be an effective floor, it must be set above... | D |
4 | Which of the following is necessarily a charac... | Free entry into and exit from the market | A few large producers | One producer of a good with no close substitutes | A homogenous product | B |
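Building on the preview above, we can also summarize the dataset programmatically. The helper below (a sketch; `count_questions` is a name introduced here, not part of the dataset) counts the questions in each subject CSV of a split directory, assuming the same `/content/data` layout:

```python
import os
import pandas as pd

def count_questions(split_dir):
    """Count the questions in each subject CSV of one MMLU split directory.

    The MMLU CSVs have no header row: question, four answer choices,
    and the correct answer letter.
    """
    counts = {}
    for file_name in sorted(os.listdir(split_dir)):
        if file_name.endswith(".csv"):
            # e.g. "professional_accounting_dev.csv" -> "professional_accounting"
            subject = file_name.rsplit("_", 1)[0]
            df = pd.read_csv(os.path.join(split_dir, file_name), header=None)
            counts[subject] = len(df)
    return pd.Series(counts, name="num_questions")

# On the Colab layout used above:
# summary = count_questions("/content/data/dev")
# print(len(summary), "subjects,", summary.sum(), "questions in the dev split")
```

The same helper works on the `test` split, which is where the bulk of the evaluation questions live; the `dev` split holds the few-shot examples.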
Cloning the HuggingFace Mistral Model
For hands-on evaluation or inference with large language models, you can clone the Mistral-7B-v0.1 model directly from HuggingFace. Since this model is large, downloading it may take a significant amount of time, depending on your internet connection. Before cloning, we need to ensure Git LFS (Large File Storage) is installed, as HuggingFace uses it to manage the model files efficiently.
# Clone the Hugging Face model repository
# (the download is large and may take a while)
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
Output:
Git LFS initialized.
Cloning into 'Mistral-7B-v0.1'...
remote: Enumerating objects: 87, done.
remote: Counting objects: 100% (83/83), done.
remote: Compressing objects: 100% (82/82), done.
remote: Total 87 (delta 43), reused 0 (delta 0), pack-reused 4
Unpacking objects: 100% (87/87), 473.17 KiB | 3.20 MiB/s, done.
Filtering content: 100% (5/5), 3.46 GiB | 10.58 MiB/s, done.
Encountered 4 file(s) that may not have been copied correctly on Windows:
model-00002-of-00002.safetensors
pytorch_model-00002-of-00002.bin
pytorch_model-00001-of-00002.bin
model-00001-of-00002.safetensors
Installing Required Python Packages
Before running large language models like Mistral-7B, certain Python packages are required to handle model loading, GPU acceleration, and efficient inference. One such essential package is accelerate, which provides utilities for distributed training and optimized model execution across CPUs and GPUs. Installing it ensures that your Colab or local setup can efficiently work with large models.
# python package installation
pip install -q accelerate
Running the Evaluation Script on the MMLU Dataset
With the Mistral-7B model cloned and the dataset ready, the next step is to run the evaluation script provided in the LLM Model Evaluation repository. This script tests the model’s performance on the MMLU dataset, measuring its accuracy across multiple subjects and saving the results for analysis. By specifying the dataset location, model path, and output directory, you can systematically evaluate large language models without writing custom code from scratch.
# Run the evaluation script on the MMLU dataset
!python3 /content/llm_model_evaluation/evaluation_hf_testing.py \
    --category_type mmlu \
    --model /content/Mistral-7B-v0.1 \
    --data_dir "/content/data/" \
    --save_dir "/content/result"
Output:
Average accuracy 0.280 - abstract_algebra
Average accuracy 0.622 - anatomy
Average accuracy 0.658 - astronomy
Average accuracy 0.570 - business_ethics
Average accuracy 0.698 - clinical_knowledge
Average accuracy 0.729 - college_biology
Average accuracy 0.500 - college_chemistry
Average accuracy 0.520 - college_computer_science
Average accuracy 0.410 - college_mathematics
Average accuracy 0.647 - college_medicine
Average accuracy 0.392 - college_physics
Average accuracy 0.770 - computer_security
Average accuracy 0.570 - conceptual_physics
Average accuracy 0.491 - econometrics
Average accuracy 0.579 - electrical_engineering
Average accuracy 0.376 - elementary_mathematics
Average accuracy 0.413 - formal_logic
Average accuracy 0.360 - global_facts
Average accuracy 0.768 - high_school_biology
Average accuracy 0.532 - high_school_chemistry
Average accuracy 0.680 - high_school_computer_science
Average accuracy 0.794 - high_school_european_history
Average accuracy 0.768 - high_school_geography
Average accuracy 0.865 - high_school_government_and_politics
Average accuracy 0.662 - high_school_macroeconomics
Average accuracy 0.337 - high_school_mathematics
Average accuracy 0.664 - high_school_microeconomics
Average accuracy 0.318 - high_school_physics
Average accuracy 0.824 - high_school_psychology
Average accuracy 0.574 - high_school_statistics
Average accuracy 0.789 - high_school_us_history
Average accuracy 0.776 - high_school_world_history
Average accuracy 0.700 - human_aging
Average accuracy 0.786 - human_sexuality
Average accuracy 0.777 - international_law
Average accuracy 0.778 - jurisprudence
Average accuracy 0.791 - logical_fallacies
Average accuracy 0.491 - machine_learning
Average accuracy 0.825 - management
Average accuracy 0.880 - marketing
Average accuracy 0.740 - medical_genetics
Average accuracy 0.816 - miscellaneous
Average accuracy 0.711 - moral_disputes
Average accuracy 0.322 - moral_scenarios
Average accuracy 0.755 - nutrition
Average accuracy 0.695 - philosophy
Average accuracy 0.728 - prehistory
Average accuracy 0.489 - professional_accounting
Average accuracy 0.449 - professional_law
Average accuracy 0.688 - professional_medicine
Average accuracy 0.680 - professional_psychology
Average accuracy 0.673 - public_relations
Average accuracy 0.727 - security_studies
Average accuracy 0.831 - sociology
Average accuracy 0.860 - us_foreign_policy
Average accuracy 0.548 - virology
Average accuracy 0.830 - world_religions
Average accuracy 0.400 - math
Average accuracy 0.683 - health
Average accuracy 0.503 - physics
Average accuracy 0.796 - business
Average accuracy 0.756 - biology
Average accuracy 0.521 - chemistry
Average accuracy 0.612 - computer science
Average accuracy 0.636 - economics
Average accuracy 0.579 - engineering
Average accuracy 0.533 - philosophy
Average accuracy 0.698 - other
Average accuracy 0.766 - history
Average accuracy 0.768 - geography
Average accuracy 0.779 - politics
Average accuracy 0.748 - psychology
Average accuracy 0.813 - culture
Average accuracy 0.492 - law
Average accuracy 0.525 - STEM
Average accuracy 0.564 - humanities
Average accuracy 0.736 - social sciences
Average accuracy 0.704 - other (business, health, misc.)
Average accuracy: 0.625
Analyzing Mistral-7B Performance on the MMLU Dataset
After running the evaluation script, we can see how the Mistral-7B model performed across the 57 subjects in the MMLU dataset. The script outputs the average accuracy per subject, giving a clear picture of strengths and weaknesses. For instance:
The model performs best in areas like marketing (0.880), high school government and politics (0.865), and sociology (0.831).
It shows moderate performance in subjects such as clinical knowledge (0.698), college biology (0.729), and psychology (0.748–0.824).
Certain domains, particularly abstract algebra (0.280), moral scenarios (0.322), and high school physics (0.318), remain challenging, highlighting gaps in mathematical reasoning and advanced physics.
The overall average accuracy across all subjects is approximately 0.625, indicating that while the model is strong in humanities, social sciences, and some professional domains, there is still significant room for improvement in STEM subjects and highly technical areas.
Insights from the results
The model is better at textual reasoning and social knowledge than purely computational or symbolic tasks.
Subjects with practical, real-world context, like management, marketing, and human sexuality, tend to have higher scores.
Difficulties in math-intensive subjects suggest that numeric reasoning and multi-step problem solving remain areas for further research.
Visualizing these results using bar charts or heatmaps can make it easier to spot patterns and compare performance across domains. For example, grouping subjects by STEM, humanities, and professional fields can reveal the model’s strengths in each category.
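As one concrete option, the sketch below plots the category-level averages reported by the evaluation script above (the values are copied from its output for Mistral-7B-v0.1; the output filename is chosen here for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs outside notebooks
import matplotlib.pyplot as plt

# Bar chart of the category-level averages printed by the evaluation script
# above (values copied from its output for Mistral-7B-v0.1).
categories = ["STEM", "humanities", "social sciences", "other"]
accuracies = [0.525, 0.564, 0.736, 0.704]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, accuracies)
ax.set_ylim(0, 1)
ax.set_ylabel("Average accuracy")
ax.set_title("Mistral-7B-v0.1 on MMLU by category")
for i, v in enumerate(accuracies):
    ax.annotate(f"{v:.3f}", (i, v), ha="center", va="bottom")
fig.tight_layout()
fig.savefig("mmlu_categories.png")  # filename chosen here for illustration
```

The gap between social sciences and STEM stands out immediately in the chart, matching the per-subject analysis above.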
Conclusion
The MMLU benchmark offers a comprehensive way to evaluate the general knowledge and reasoning abilities of large language models. By spanning 57 subjects across STEM, humanities, social sciences, and professional domains, it tests models in both breadth and depth, highlighting their strengths and weaknesses. Through hands-on exploration—from downloading and inspecting the dataset in Python to running the Mistral-7B evaluation script—we gained a clear view of how a state-of-the-art model performs across different knowledge areas.
The results reveal that models like Mistral-7B excel in humanities, social sciences, and practical domains, while subjects requiring advanced mathematics, physics, or symbolic reasoning remain challenging. This underscores the importance of targeted training and fine-tuning to improve performance in technical and reasoning-heavy tasks.
Overall, MMLU not only serves as a benchmark for comparing AI models but also provides actionable insights for researchers and developers to enhance model capabilities. By combining dataset exploration, programmatic analysis, and evaluation workflows, we can better understand the evolving intelligence of AI and guide future improvements in large language models.