DoctorBench Overview

DoctorBench is a comprehensive benchmark for evaluating the clinical practice capabilities of medical large models, jointly developed by WiseDiag and other authoritative organizations.

Unlike traditional exam-style or academic NLP benchmarks, DoctorBench centers on clinical practice. It builds a full-scenario medical AI evaluation benchmark covering accuracy and safety, aiming to systematically assess large language models in real-world clinical settings and support the healthy development of medical AI.

DoctorBench’s core philosophy: evaluate not only a model’s “knowledge reserves,” but also its “real-world applicability” and “medical safety.”

Why DoctorBench Is Needed

Current medical large-model evaluations face three major limitations:

  • Limitations of exam-style evaluation — Traditional multiple-choice and fill-in-the-blank tests cannot reflect a model’s overall performance in real clinical settings, nor can they assess clinical reasoning and decision-making capabilities.
  • Gap between academic NLP tasks and practice — Medical NER, relation extraction, and similar tasks are far removed from clinical use. High scores on these tasks do not necessarily translate into clinical usability.
  • Language and scenario gaps — Chinese medical scenarios, local clinical standards, and overseas benchmark settings differ significantly, calling for a benchmark tailored to clinical practice in China.

DoctorBench is designed to address these gaps.

Platform Features

FeatureDescription
Three evaluation typesLarge language models (LLM), vision-language models (VLM), and Clinical Tasks (Agent)
14+ core scenariosCovers the complete care journey for both the general public and medical professionals
10+ evaluation dimensionsCore, general, and specialized dimensions for multidimensional assessment
Veto mechanismCritical dimensions such as accuracy and safety can trigger one-vote vetoes
6,000+ gold-standard data itemsBuilt and reviewed by hundreds of medical experts for high signal-to-noise quality
API-based evaluationSubmit models through APIs and run the full evaluation process automatically
Jointly developed by authoritative institutionsDeveloped with leading medical schools and healthcare institutions

Evaluation Dimensions

DoctorBench uses a multi-layer evaluation system with 2 core + 3 general + 5 specialized dimensions, comprehensively assessing models from medical accuracy and safety to interaction quality and reasoning.

NameCategoryDescription
Medical factual accuracy (accuracy)Core 1Scientific accuracy and clinical reliability of medical facts, diagnostic evidence, drug effects, and related information
Medical information completeness (accuracy)Core 1Whether key information is covered and important medical elements are not omitted, such as contraindications or adverse reactions
Safety and risk control (safety)Core 2Whether medical risks are identified and communicated, and whether recommendations that could harm patient safety are avoided
Ethics and compliance (safety)Core 2Whether responses follow medical ethics, respect privacy, and support informed consent
Instruction responsiveness (interaction quality)General 1Whether the model accurately understands user intent and directly answers the specific question
Language clarity (interaction quality)General 1Whether expression is clear and understandable, terminology is appropriate, and logic is coherent
Information priorityGeneral 2Whether the most important information is identified, emphasized, and organized properly
Proactive inquiry and information gatheringGeneral 3Whether the model asks for key history or symptom details when information is insufficient
Evidence and citationSpecialized 1Whether authoritative medical literature, guidelines, or research evidence is cited
Explainable reasoningSpecialized 2Whether the reasoning process is clear and conclusions are explainable
ActionabilitySpecialized 3Whether recommendations are concrete, actionable, and include clear steps or dosage when appropriate
Individualized adaptationSpecialized 4Whether individual differences such as age, comorbidities, and allergies are considered
Emotional supportSpecialized 5Whether empathy and emotional support are provided in appropriate scenarios

Veto Mechanism

DoctorBench includes a veto mechanism during evaluation: if a model has severe issues in core dimensions such as medical factual accuracy or safety and risk control—for example, giving advice that could endanger a patient’s life or seriously violating medical ethics—the veto is triggered and directly affects the overall rating.

This mechanism ensures that evaluation results reflect real usability and reliability in medical scenarios, rather than relying only on aggregate scores.

LLM Evaluation Scenarios

DoctorBench covers 14 core application scenarios, divided into two groups: for the general public (ToC) and for medical professionals (ToB/ToD).

LLM evaluation architecture

General Public (7 Scenarios)

ScenarioData VolumeDescription
Symptom query and analysis1,200Helps users analyze possible causes, severity, and next steps for symptoms
Disease and health education980Provides accurate and understandable medical knowledge while distinguishing education from diagnosis or treatment advice
Treatment plans and medication guidance850Provides evidence-based reference information for diagnosed conditions
Examination/lab report interpretation720Helps interpret lab reports, imaging reports, and the meaning of indicators
Daily health and chronic disease management680Supports chronic disease monitoring, management, and lifestyle adjustment
Healthcare resources and policy navigation450Covers care pathways, insurance policies, department selection, and appointment methods
Mental health and emotional support520Provides basic mental-health knowledge, empathy, and recommendations to seek professional help

Medical Professionals (7 Scenarios)

ScenarioData VolumeDescription
Clinical decision support650Supports differential diagnosis, treatment selection, and medication dosage calculation
Medical knowledge query890Covers disease definitions, diagnostic criteria, drug information, and the latest guidelines
Research and literature assistance420Supports literature search, study design, and statistical methodology
Medical documentation generation and standardization380Assists with medical record summaries, discharge notes, consultation opinions, and similar documents
Medical education and examinations560Supports concept review, question analysis, and clinical thinking training
Healthcare quality and standards management310Covers quality metrics, standardized processes, and adverse-event analysis
Medical ethics and legal consultation280Provides principle-based guidance for informed consent, privacy protection, and medical disputes

The LLM evaluation contains approximately 8,890 gold-standard data items.

Multimodal Evaluation

Multimodal evaluation architecture

The multimodal evaluation targets vision-language models (VLMs) and assesses their ability to process medical images, tables, and other multimodal information. It uses 9 evaluation dimensions.

TaskData VolumeModalitiesDescription
Modality recognition and perception420Images, textRecognizes and understands medical images such as X-rays, CT scans, and pathology slides
Cross-modal association and reasoning380Images, text, tablesAssociates and analyzes images, lab results, history, and other multimodal information
Medical record and report generation350Images, textGenerates standardized medical reports from images and examination results
Medical knowledge QA480Images, textAnswers medical questions in image-text scenarios
Diagnosis and decision support320Images, text, tablesProvides differential diagnosis and treatment decision support based on multimodal input

The multimodal evaluation contains approximately 1,950 data items.

Clinical Tasks (Agent) Evaluation

Clinical Tasks (Agent) evaluation architecture

The Clinical Tasks (Agent) evaluation targets medical agents with tool-use capabilities, assessing clinical task execution, tool use, multi-turn consultation decisions, and closed-loop workflows in simulated care environments.

TaskData VolumeCore CapabilityDescription
Intelligent triage and guidance380Multi-turn dialogue, decision reasoningPerforms initial triage and department recommendations based on the patient’s chief complaint
Multi-turn consultation and history taking450Structured information collection, clinical logicSimulates clinical interviews and systematically collects medical history
Medical tool use and examination planning320Tool use, evidence-based decisionsAppropriately orders tests, queries drug information, and references guidelines
Treatment planning and execution350Plan development, safety checksCreates personalized treatment plans including medication and contraindication checks
Full emergency workflow handling280Urgent decisions, safety controlSupports rapid emergency assessment, decision-making, and multidisciplinary coordination
Chronic disease management and follow-up300Long-term tracking, plan adjustmentSupports regular monitoring, plan adjustment, and health education

The Clinical Tasks (Agent) evaluation contains approximately 2,080 data items.

Evaluation Datasets

DoctorBench’s evaluation data is built and reviewed by hundreds of leading medical experts in China, ensuring high signal-to-noise quality and clinical relevance.

Evaluation TypeData VolumeScenarios/TasksEvaluation Dimensions
Large language models (LLM)~8,89014 scenarios13 dimensions
Vision-language models (VLM)~1,9505 tasks9 dimensions
Clinical Tasks (Agent)~2,0806 tasks6 capability categories
Total~12,92025 scenarios/tasks

The data covers many clinical departments, including internal medicine, surgery, obstetrics and gynecology, pediatrics, emergency medicine, dermatology, psychiatry, and more.

Quick Start

Submit your model to DoctorBench in three steps:

Step 1: Register and log in — Register with your phone number, log in to DoctorBench, and provide a valid contact email.

Step 2: Configure the model API — On the “Submit Evaluation” page, choose an evaluation type and configure your model API information (provider, base_url, api_key, model_name).

Step 3: Submit the evaluation — Fill in model metadata such as display name, parameter size, and developer, then confirm and submit. After evaluation finishes, view the result under “Submissions.”

API Integration Configuration

Choose an Evaluation Type

DoctorBench supports three evaluation types:

TypeDescriptionEstimated Duration
Large language model (LLM)Pure text dialogue capability evaluationAbout 6–8 hours
Vision-language model (VLM)Joint image-text understanding evaluationAbout 3–5 hours
Clinical Tasks (Agent)Clinical task execution, tool use, and multi-turn decision evaluationAbout 2–3 hours

API Configuration Parameters

ParameterRequiredDescription
ProviderYesAPI provider. Select custom to connect any OpenAI-compatible endpoint
Base URLYesBase address of the API service, such as https://api.openai.com/v1
API KeyYesYour API key, used only when the evaluation calls your model
Model NameYesModel identifier, such as gpt-4 or qwen-max

OpenAI-compatible endpoints are currently supported. When custom is selected, you can provide a custom base_url for any API service compatible with the OpenAI format.

Connection Test

Before submission, the platform automatically tests your API configuration to verify:

  • Whether the API endpoint is reachable
  • Whether the API key is valid
  • Whether the model can respond normally

If the connection test fails, check the following common causes:

  • Whether the Base URL format is correct, including the full path such as /v1
  • Whether the API key has expired or has insufficient quota
  • Whether the target API endpoint is reachable from the network

Submission Workflow

The complete submission workflow is as follows:

  1. Log in — Log in with a registered phone number
  2. Open the submission page — Click “Submit Evaluation” in the top navigation or on the homepage
  3. Select an evaluation type — Choose from LLM, VLM, or Clinical Tasks (Agent)
  4. Configure API information — Fill in Provider, Base URL, API Key, and Model Name
  5. Test the connection — The system automatically validates API availability
  6. Fill in model metadata — Include display name, parameter size such as 7B or 70B, developer/organization, release date, and more
  7. Confirm submission — Review the information and submit the evaluation task
  8. Check status — View evaluation progress under “Submissions”

After evaluation is complete, you can view the status and results in the submission records. If an evaluation fails, the page displays the specific error so that you can correct the configuration and resubmit.

Common Failure Causes and Fixes

Failure TypePossible CauseSuggested Fix
Connection test failedAPI endpoint unreachable or key invalidCheck Base URL and API Key
Evaluation timeoutModel response is too slow or network is unstableConfirm the model service is running normally, then retry
Configuration errorParameter format is incorrectCheck Model Name spelling and Base URL format

View Submissions

After logging in, you can view submission records in either of the following ways:

  • Click the avatar in the upper-right corner and select “Submissions” from the dropdown menu
  • Visit the /submissions page directly

The submissions page shows all of your evaluation submissions, including:

  • Evaluation status: queued, running, completed, or failed
  • Model information: model name and evaluation type
  • Submission time
  • Evaluation result for completed submissions: click to view the detailed report

Leaderboards

DoctorBench provides three leaderboards, corresponding to the three evaluation types:

LeaderboardEvaluation TypePrimary Sorting Metric
Medical LLMLarge language model (LLM)Overall score
Medical Imaging (Multimodal) LeaderboardVision-language model (VLM)Overall score
Clinical Tasks (Agent) LeaderboardClinical Tasks (Agent)Overall score

Leaderboard Rules

  • Sorting: descending by overall score (overall_score)
  • Deduplication: only the latest submission result is kept for the same model
  • Public condition: only submissions that choose to make evaluation results public are displayed
  • Data source: dynamically aggregated from completed submissions and updated in real time

Displayed Information

Each leaderboard record shows:

  • Ranking, with trophy/medal icons for the top three
  • Model name
  • Developer/organization
  • Parameter size
  • Overall score
  • Scores for each evaluation dimension or task category, sortable by column

Interpreting Evaluation Reports

After each evaluation completes, the system generates a detailed report. Report structure differs by evaluation type.

LLM Evaluation Report

The LLM report displays results for 14 scenarios grouped by two audiences:

  • General users (7 scenarios): symptom query, health education, medication guidance, report interpretation, chronic disease management, resource navigation, and mental-health support
  • Medical professionals (7 scenarios): clinical decisions, medical knowledge, research assistance, documentation generation, medical education, quality management, and ethics/legal guidance

Each scenario includes:

  • Overall score (average_total_score)
  • Scores for 13 evaluation dimensions
  • Missing or unevaluated dimensions displayed as “N/A”

The report also includes top-level statistics: model name, average total score, median total score, and standard deviation.

Multimodal Evaluation Report

The multimodal report is organized by 5 task categories:

  • Modality recognition and perception
  • Cross-modal association and reasoning
  • Medical record and report generation
  • Medical knowledge QA
  • Diagnosis and decision support

Each category includes an average score. Negative scores are highlighted in red.

Clinical Tasks (Agent) Evaluation Report

The Clinical Tasks (Agent) report is organized by 6 task categories:

  • Intelligent triage and guidance
  • Multi-turn consultation and history taking
  • Medical tool use and examination planning
  • Treatment planning and execution
  • Full emergency workflow handling
  • Chronic disease management and follow-up

Each category includes an average score. Negative scores are highlighted in red.

Score Calculation

DoctorBench uses weighted scoring to calculate the overall score:

  • Dimensions have different maximum scores to reflect their importance
  • Core dimensions such as accuracy and safety have higher weights
  • The overall score is the weighted sum of dimension scores

Special rules:

  • When the veto mechanism is triggered, severe accuracy or safety issues directly affect the overall rating
  • Zero scores are still displayed as “0.00” and are not hidden
  • Missing or unevaluable dimensions are shown as “N/A” and excluded from weighted calculation

API Key Security

DoctorBench places high importance on API key security:

  • Scope of use: API keys are used only during evaluation calls to access your model API
  • Masked storage: secret fields in submitted configuration snapshots are masked and not stored in plaintext
  • No plaintext display: no API response or page display contains plaintext API keys
  • Evaluation isolation: evaluation tasks of the same type run serially to avoid configuration overlap

We recommend creating a dedicated API key for DoctorBench evaluations and rotating it after evaluation if needed.

Data Privacy

  • Evaluation data: datasets used for evaluation are maintained by the DoctorBench team and contain no real patient information
  • Model responses: responses generated by your model during evaluation are used only for scoring
  • Metadata: model metadata such as name, parameter size, and developer is displayed on leaderboards when you choose to make results public
  • Contact information: phone number and email provided during registration are used only for account management and necessary communication

FAQ

How long does evaluation take?

Evaluation duration depends on evaluation type and model API response speed. Reference durations: about 6–8 hours for LLMs, 3–5 hours for VLMs, and 2–3 hours for Clinical Tasks (Agent). You can check evaluation status under “Submissions.”

Which API providers are supported?

OpenAI-compatible endpoints are currently supported. When custom is selected, you can provide a custom base_url for any OpenAI-compatible API service. This means most mainstream large-model services can be connected.

How should I interpret evaluation results?

Evaluation results include dimension scores and overall ranking. After completion, you can view the detailed report under “Submissions,” including scenario/task-level scores and dimension analysis. You can also compare your model against others on the “Leaderboards” page.

Is the API key stored securely?

Yes. The platform uses your API key only during evaluation calls. Secret fields in stored snapshots are masked and are not returned or displayed in plaintext. We recommend creating a dedicated API key for evaluation.

Are multimodal model evaluations supported?

Yes. DoctorBench provides a complete VLM evaluation covering joint understanding and reasoning over medical images, tables, and other modalities, including modality recognition, cross-modal reasoning, report generation, knowledge QA, and diagnosis/decision support.

What if an evaluation fails?

When an evaluation fails, “Submissions” displays the specific error message. Common causes include API connection timeout, invalid key, or abnormal model response. You can correct the configuration based on the message and resubmit.

Is evaluation charged?

DoctorBench does not charge for evaluations on the platform. However, evaluation calls your model API, and any API usage fees are charged by your API provider.

Where does the evaluation data come from?

DoctorBench data is built and reviewed by hundreds of leading medical experts in China and covers many clinical departments. It contains no real patient information, ensuring high signal-to-noise quality and clinical relevance.

How can evaluation results appear on the leaderboard?

When submitting an evaluation, you can choose whether to make results public. If public is selected, completed results are automatically displayed on the corresponding leaderboard. For the same model, only the latest submission result is kept.

Can I submit multiple models at the same time?

Yes. You can submit multiple evaluation tasks. To ensure evaluation quality and resource allocation, tasks of the same type are executed serially.

How do I contact technical support?

Please provide a valid contact email when submitting an evaluation. If technical issues arise or communication is needed, we will contact you through that email. You can also view evaluation status and details on the submissions page.