GitHub - EQ-bench/EQ-Bench: A benchmark for emotional intelligence in large language models
A benchmark that tests language models on emotional intelligence by having them predict emotional intensity in complex social scenarios—measuring the subjective, human-like capability that traditional benchmarks miss.
TLDR
• Tests LLMs on emotional intelligence using 171 questions about predicting emotional states in dialogue scenarios (0-10 scale)
• V2.4 adds Creative Writing benchmark judged by Claude 3.5 Sonnet using weighted criteria, plus "Judgemark" to test models' ability to judge creative writing
• Evolved from 60 to 171 questions to reduce score variance from perturbations (temperature, quantization, prompt format) that don't reflect true performance changes
• Supports multiple inference engines (transformers, llama.cpp, oobabooga) and includes German language support
• Provides standardized measurement for a capability critical to AI assistants but rarely benchmarked systematically
In Detail
EQ-Bench measures language models on emotional intelligence—their ability to understand and predict emotional states in complex social scenarios. Unlike traditional benchmarks that test factual knowledge or logical reasoning, this benchmark asks models to predict emotional intensity on a 0-10 scale across 171 dialogue scenarios. Because these predictions are inherently subjective, the benchmark probes what a model actually understands about human emotions rather than rewarding simple pattern matching.
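The prediction-and-compare setup can be sketched in a few lines. This is an illustrative simplification, not EQ-Bench's exact scoring formula: it assumes each question asks for intensity ratings of several named emotions, scores a question by how far the model's 0-10 ratings land from the reference ratings, and scales the final result to 0-100.

```python
def score_question(predicted, reference):
    """Score one question by comparing predicted emotion intensities
    (0-10) against reference ratings.

    `predicted` and `reference` map emotion names to 0-10 floats.
    NOTE: a simplified sketch, not the exact EQ-Bench formula.
    """
    # Sum absolute differences across the emotions in the question.
    total_diff = sum(abs(predicted[e] - reference[e]) for e in reference)
    # A perfect match scores 10; larger total error scores less.
    return max(0.0, 10.0 - total_diff)


def benchmark_score(per_question_scores):
    """Scale the per-question scores to a 0-100 benchmark result."""
    return 100.0 * sum(per_question_scores) / (10.0 * len(per_question_scores))
```

For example, a model that rates anger 7 and joy 1 against references of 8 and 0 accumulates an error of 2 and scores 8.0 on that question; averaging such scores across all 171 questions yields the headline 0-100 result.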
The benchmark has evolved significantly to address variance issues. Version 2 expanded from 60 to 171 questions after discovering that minor perturbations (temperature, quantization, prompt format, system messages) caused score variance beyond what the actual performance change warranted. The expanded test set makes scores more stable and representative of true model capability. Version 2.4 introduces Creative Writing v2, which uses Claude 3.5 Sonnet to judge model outputs against 24 prompts using weighted criteria, and "Judgemark," which tests a model's ability to judge creative writing from 20 test models.
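The "weighted criteria" judging in Creative Writing v2 amounts to combining per-criterion judge ratings into a single score. A minimal sketch of that aggregation step, with hypothetical criterion names and weights (the real benchmark defines its own criteria set and weighting):

```python
def weighted_judge_score(criterion_scores, weights):
    """Combine per-criterion judge ratings (0-10) into one score.

    `criterion_scores` maps criterion name -> judge's 0-10 rating;
    `weights` maps criterion name -> relative importance.
    NOTE: criteria and weights here are hypothetical illustrations.
    """
    total = sum(criterion_scores[c] * w for c, w in weights.items())
    return total / sum(weights.values())  # weighted average, still 0-10


# Hypothetical example: coherence counts double.
scores = {"coherence": 8.0, "imagery": 7.0, "originality": 6.0}
weights = {"coherence": 2.0, "imagery": 1.0, "originality": 1.0}
final = weighted_judge_score(scores, weights)
```

"Judgemark" inverts this pipeline: instead of scoring the model's writing, it asks the model to produce these criterion ratings for outputs from 20 test models, then measures how well its judgments hold up.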
The practical implications are significant for anyone building AI assistants or chatbots. Emotional intelligence is critical for natural human interaction, yet it's rarely measured systematically. EQ-Bench provides a standardized way to compare models on this dimension, with results showing it correlates well with other benchmarks while capturing something distinct. The benchmark supports multiple inference engines (transformers, llama.cpp, oobabooga) and includes features like multi-language support and result uploading to Firebase for community comparison.