Model Evaluation Leaderboards
Benchmark scores for leading medical foundation models on clinical reasoning tasks. Compare the columns to see which model leads on each benchmark.
| Model | Access | Parameters | PubMedQA (%) | MedQA-USMLE (%) | MedMCQA (%) |
|---|---|---|---|---|---|
| Med-PaLM 2, Google | 🔒 Closed API | Unknown | 81.8 | 86.5 | 72.3 |
| GPT-4, OpenAI | 🔒 Closed API | Unknown | 80.4 | 81.4 | 73.0 |
| Clinical Llama-3 (8B), Open Source | 🔓 Open Weights | 8B | 78.2 | 74.5 | 68.9 |
| MedAlpaca (13B), Open Source | 🔓 Open Weights | 13B | 76.5 | 60.2 | 58.7 |
| BioGPT-Large, Microsoft | 🔓 Open Weights | 1.5B | 81.0 | 50.5 | 45.2 |
| ClinicalBERT, MIT | 🔓 Open Weights | 110M | 65.0 | 45.3 | 42.1 |
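
Ranking the models by a single benchmark can be reproduced offline with a few lines of Python. The sketch below is only an illustration: the `ROWS` layout, the `BENCHMARKS` mapping, and the `rank_by` helper are hypothetical names introduced here, and the scores are copied from the table above.

```python
# Scores copied from the leaderboard table:
# (model, PubMedQA %, MedQA-USMLE %, MedMCQA %).
ROWS = [
    ("Med-PaLM 2", 81.8, 86.5, 72.3),
    ("GPT-4", 80.4, 81.4, 73.0),
    ("Clinical Llama-3 (8B)", 78.2, 74.5, 68.9),
    ("MedAlpaca (13B)", 76.5, 60.2, 58.7),
    ("BioGPT-Large", 81.0, 50.5, 45.2),
    ("ClinicalBERT", 65.0, 45.3, 42.1),
]

# Hypothetical mapping from benchmark name to its position in each row.
BENCHMARKS = {"PubMedQA": 1, "MedQA-USMLE": 2, "MedMCQA": 3}


def rank_by(benchmark):
    """Return (model, score) pairs sorted from highest to lowest score."""
    col = BENCHMARKS[benchmark]
    pairs = ((row[0], row[col]) for row in ROWS)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Print the top-scoring model for each benchmark.
    for name in BENCHMARKS:
        top_model, top_score = rank_by(name)[0]
        print(f"{name}: {top_model} ({top_score})")
```

Ranked this way, the leader depends on the benchmark: Med-PaLM 2 is first on PubMedQA and MedQA-USMLE, while GPT-4 is first on MedMCQA.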