GitHub - EQ-bench/EQ-Bench: A benchmark for emotional intelligence in large language models
A benchmark that tests language models on emotional intelligence by having them predict emotional intensity in complex social scenarios—measuring the subjective, human-like capability that traditional benchmarks miss.
TLDR
• Tests LLMs on emotional intelligence using 171 questions about predicting emotional states in dialogue scenarios (0-10 scale)
• V2.4 adds Creative Writing benchmark judged by Claude 3.5 Sonnet using weighted criteria, plus "Judgemark" to test models' ability to judge creative writing
• Evolved from 60 to 171 questions to reduce score variance from perturbations (temperature, quantization, prompt format) that don't reflect true performance changes
• Supports multiple inference engines (transformers, llama.cpp, oobabooga) and includes German language support
• Provides standardized measurement for a capability critical to AI assistants but rarely benchmarked systematically
In Detail
EQ-Bench measures language models on emotional intelligence—their ability to understand and predict emotional states in complex social scenarios. Unlike traditional benchmarks that test factual knowledge or logical reasoning, this benchmark asks models to predict emotional intensity on a 0-10 scale across 171 dialogue scenarios. Because these predictions are inherently subjective, the benchmark probes what a model actually understands about human emotions rather than rewarding simple pattern matching.
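The prediction-and-compare setup can be sketched in a few lines. This is an illustrative simplification, not EQ-Bench's exact scoring formula: it assumes each question asks for intensity ratings of several named emotions, scores a question by how far the model's 0-10 ratings land from the reference ratings, and scales the final result to 0-100.

```python
def score_question(predicted, reference):
    """Score one question by comparing predicted emotion intensities
    (0-10) against reference ratings.

    `predicted` and `reference` map emotion names to 0-10 floats.
    NOTE: a simplified sketch, not the exact EQ-Bench formula.
    """
    # Sum absolute differences across the emotions in the question.
    total_diff = sum(abs(predicted[e] - reference[e]) for e in reference)
    # A perfect match scores 10; larger total error scores less.
    return max(0.0, 10.0 - total_diff)


def benchmark_score(per_question_scores):
    """Scale the per-question scores to a 0-100 benchmark result."""
    return 100.0 * sum(per_question_scores) / (10.0 * len(per_question_scores))
```

For example, a model that rates anger 7 and joy 1 against references of 8 and 0 accumulates an error of 2 and scores 8.0 on that question; averaging such scores across all 171 questions yields the headline 0-100 result.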
The benchmark has evolved significantly to address variance issues. Version 2 expanded from 60 to 171 questions after discovering that minor perturbations (temperature, quantization, prompt format, system messages) caused score variance beyond what the actual performance change warranted. The expanded test set makes scores more stable and representative of true model capability. Version 2.4 introduces Creative Writing v2, which uses Claude 3.5 Sonnet to judge model outputs against 24 prompts using weighted criteria, and "Judgemark," which tests a model's ability to judge creative writing from 20 test models.
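The "weighted criteria" judging in Creative Writing v2 amounts to combining per-criterion judge ratings into a single score. A minimal sketch of that aggregation step, with hypothetical criterion names and weights (the real benchmark defines its own criteria set and weighting):

```python
def weighted_judge_score(criterion_scores, weights):
    """Combine per-criterion judge ratings (0-10) into one score.

    `criterion_scores` maps criterion name -> judge's 0-10 rating;
    `weights` maps criterion name -> relative importance.
    NOTE: criteria and weights here are hypothetical illustrations.
    """
    total = sum(criterion_scores[c] * w for c, w in weights.items())
    return total / sum(weights.values())  # weighted average, still 0-10


# Hypothetical example: coherence counts double.
scores = {"coherence": 8.0, "imagery": 7.0, "originality": 6.0}
weights = {"coherence": 2.0, "imagery": 1.0, "originality": 1.0}
final = weighted_judge_score(scores, weights)
```

"Judgemark" inverts this pipeline: instead of scoring the model's writing, it asks the model to produce these criterion ratings for outputs from 20 test models, then measures how well its judgments hold up.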
The practical implications are significant for anyone building AI assistants or chatbots. Emotional intelligence is critical for natural human interaction, yet it's rarely measured systematically. EQ-Bench provides a standardized way to compare models on this dimension, with results showing it correlates well with other benchmarks while capturing something distinct. The benchmark supports multiple inference engines (transformers, llama.cpp, oobabooga) and includes features like multi-language support and result uploading to Firebase for community comparison.