Model Evaluation Leaderboards

Accuracy of leading medical foundation models on three clinical reasoning benchmarks: PubMedQA, MedQA-USMLE, and MedMCQA. Sort by benchmark to find the highest-scoring model on each.

Model	Provider	Access	Parameters	PubMedQA (%)	MedQA-USMLE (%)	MedMCQA (%)
Med-PaLM 2	Google	🔒 Closed API	Unknown	81.8	86.5	72.3
GPT-4	OpenAI	🔒 Closed API	Unknown	80.4	81.4	73.0
Clinical Llama-3 (8B)	Open Source	🔓 Open Weights	8B	78.2	74.5	68.9
MedAlpaca (13B)	Open Source	🔓 Open Weights	13B	76.5	60.2	58.7
BioGPT-Large	Microsoft	🔓 Open Weights	1.5B	81.0	50.5	45.2
ClinicalBERT	MIT	🔓 Open Weights	110M	65.0	45.3	42.1
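
Outside the interactive page, the per-benchmark sort can be reproduced in a few lines of Python. The following is a minimal sketch, assuming the scores are copied by hand from the table above; the names `ROWS`, `rank_by`, and `best_per_benchmark` are illustrative and not part of any leaderboard API.

```python
# Benchmark columns, in the same order as the leaderboard table.
BENCHMARKS = ("PubMedQA", "MedQA-USMLE", "MedMCQA")

# Rows copied from the table: (model, PubMedQA %, MedQA-USMLE %, MedMCQA %).
ROWS = [
    ("Med-PaLM 2", 81.8, 86.5, 72.3),
    ("GPT-4", 80.4, 81.4, 73.0),
    ("Clinical Llama-3 (8B)", 78.2, 74.5, 68.9),
    ("MedAlpaca (13B)", 76.5, 60.2, 58.7),
    ("BioGPT-Large", 81.0, 50.5, 45.2),
    ("ClinicalBERT", 65.0, 45.3, 42.1),
]


def rank_by(benchmark):
    """Return all rows sorted by the given benchmark, highest score first."""
    col = BENCHMARKS.index(benchmark) + 1  # +1 skips the model-name column
    return sorted(ROWS, key=lambda row: row[col], reverse=True)


def best_per_benchmark():
    """Map each benchmark name to its top-scoring model."""
    return {b: rank_by(b)[0][0] for b in BENCHMARKS}


if __name__ == "__main__":
    for name, model in best_per_benchmark().items():
        print(f"{name}: {model}")
```

On these numbers no single model leads every column: Med-PaLM 2 is first on PubMedQA and MedQA-USMLE, while GPT-4 is first on MedMCQA. BioGPT-Large ranks second on PubMedQA but fifth on the other two benchmarks, which is why the ranking depends on the column chosen.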