Evaluation Datasets

DoctorBench provides medical LLM and multimodal evaluation datasets, designed for both public users and medical professionals, to assess how well models perform in real-world use across a range of scenarios.
DoctorBench-LLM: Medical LLM Evaluation Dataset
Evaluates medical large language models across the full clinical workflow. Core tasks include in-depth symptom analysis, personalized diagnosis and treatment planning, multimodal report inference, and medical safety guardrail monitoring. Together, these tasks systematically assess how fine-grained a model's reasoning is and how accurate its decisions are when it handles real patients' chief complaints and complex medical data.
For public users

7 task categories
Evaluates symptom triage, disease education, treatment guidance, report interpretation, chronic care, healthcare navigation, and mental health support across 5,400 samples.
For healthcare professionals

7 task categories
Evaluates clinical decision support, medical knowledge lookup, research assistance, clinical documentation, medical education, quality management, and ethical and legal reasoning across 3,490 samples.