
Datasets

Discover high-quality, open-source medical datasets for training and benchmarking.

Dataset Infrastructure

500-AI-Agents-Projects

A comprehensive, open-source curated collection of over 500 AI agent use cases and projects across various industries including healthcare. Developers can use this directory to...

Intended Use Cases:
Not specified.
Dataset Infrastructure

AI-Healthcare-System

AI-powered healthcare platform combining Machine Learning for multi-disease prediction (Diabetes, Heart, Liver, Kidney, Lungs) with Generative AI for intelligent medical assistance and lab report analysis....

Intended Use Cases:
Not specified.
Dataset Clinical Text

AI-Powered-Healthcare-Intelligence-Network

The AI-Powered Healthcare Intelligence Network is an AI-driven system offering disease prediction, drug recommendations, heart disease risk assessment, and an AI medical chatbot. Using ML,...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

AI-Projects-for-Healthcare

This repository includes artificial intelligence, machine learning, data science, and computer vision projects related to healthcare, solving core medical data engineering challenges. It processes clinical...

Intended Use Cases:
Not specified.
Dataset Infrastructure

AI-for-Healthcare-Project-using-NVIDIA-Jetson-Nano-2GB-Developer-kit

This project uses deep learning to detect various deadly diseases. It can detect 1) lung cancer, 2) COVID-19, 3) tuberculosis, and 4) pneumonia. It uses...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

AI-for-healthcare

The impact of artificial intelligence on improving healthcare facilities is increasing significantly. This repository provides implementations of different deep learning and machine learning techniques used...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

Advanced AI Curriculum

An educational data science repository containing advanced capstone projects focused on healthcare machine learning. It includes implementations of Convolutional Neural Networks for pneumonia detection from...

Intended Use Cases:
Not specified.
Dataset Genomics / EHR / Wearables

All of Us Research Program

The All of Us Research Program is a historic effort by the National Institutes of Health (NIH) to collect and study data from one million...

Intended Use Cases:
Not specified.
Dataset Infrastructure

Artificial Intelligence Deep Learning Machine Learning Tutorials

An extensive compendium of machine learning and deep learning tutorials, including specific applications in medicine and healthcare. This repository provides foundational Python notebooks that developers...

Intended Use Cases:
Not specified.
Dataset Infrastructure

Awesome AI Agents for Healthcare

A curated collection of research papers, projects, and resources related to the application of Agentic AI in healthcare. It provides a comprehensive ecosystem mapping of...

Intended Use Cases:
Not specified.
Dataset Radiology / MRI

BraTS (Brain Tumor Segmentation)

The Brain Tumor Segmentation (BraTS) dataset is a cornerstone resource for medical image analysis, focusing on the evaluation of state-of-the-art methods for the segmentation of...

Intended Use Cases:
Training 3D computer vision models for automated segmentation of brain tumors (gliomas) in multiparametric MRI.
Dataset Clinical / Demographics

Cervical Cancer Risk Classification

This dataset contains demographic information, habits, and historic medical records for a set of patients at the Hospital Universitario de Caracas. Key Capabilities Risk Factor...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

CheXpert

CheXpert is a large public dataset for chest radiograph interpretation, consisting of 224,316 chest radiographs of 65,240 patients collected from Stanford Hospital. Key Features Scale:...

Intended Use Cases:
Not specified.
Dataset Clinical Text

ClinGen

[ACL 2024 Findings] This is the code for our paper “Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models”, solving core...

Intended Use Cases:
Not specified.
Dataset Genomics

ClinVar

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. It is hosted by the...

Intended Use Cases:
Not specified.
Dataset Clinical Text

DCAN

Dilated Convolutional Attention Network for Medical Code Assignment from Clinical Text (ClinicalNLP workshop at EMNLP 2020), solving core medical data engineering challenges. It processes clinical...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

DeepCareX

DeepCareX is an AI-powered healthcare system leveraging machine learning models for intelligent health insights, solving core medical data engineering challenges. It processes clinical data inputs...

Intended Use Cases:
Not specified.
Dataset Genomics

Ensembl

Ensembl is a comprehensive, integrated genomic database and genome browser project managed by the European Bioinformatics Institute (EMBL-EBI). It provides automated annotation of the human...

Intended Use Cases:
Not specified.
Dataset Infrastructure

Facial-Expression-Recognition-FER-for-Mental-Health-Detection-

Facial Expression Recognition (FER) for Mental Health Detection applies AI models like Swin Transformer, CNN, and ViT for detecting emotions linked to anxiety, depression, PTSD,...

Intended Use Cases:
Not specified.
Dataset Clinical Text

ForteHealth

The project is in the incubation stage and still under development. ForteHealth is a flexible and powerful ML workflow builder for biomedical and clinical scenarios....

Intended Use Cases:
Not specified.
Dataset Clinical Text

GraphCare

[ICLR’24] Enhancing Healthcare Predictions with Personalized Knowledge Graphs, solving core medical data engineering challenges. It processes clinical data inputs natively through its core repository architectures....

Intended Use Cases:
Not specified.
Dataset Clinical Text

HealifyAI--LLM-based-Healthcare-System

Leverages multiple machine learning algorithms and an LLM to provide in-depth answers to medical queries and predict conditions/diseases based on patient symptoms, solving...

Intended Use Cases:
Not specified.
Dataset Dermatology / Imaging

ISIC 2019

The International Skin Imaging Collaboration (ISIC) 2019 challenge dataset contains thousands of dermoscopic images of skin lesions for the automated diagnosis of melanoma from dermoscopic...

Intended Use Cases:
Automated classification of dermoscopic images among 9 different diagnostic categories including melanoma.
Dataset Radiology / CT Scan

LIDC-IDRI

The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. Key...

Intended Use Cases:
Training, testing, and validation of computer-aided diagnosis (CAD) methods for lung cancer detection.
Dataset Clinical Text

Lightweight-Clinical-Transformers

This project develops compact transformer models tailored for clinical text analysis, balancing efficiency and performance for healthcare NLP tasks, solving core medical data engineering challenges....

Intended Use Cases:
Not specified.
Dataset Medical Imaging

LungGuardianAI-pneumonia_detection_from_chest_xrays_with_transfer_learning

State-of-the-art system for pneumonia detection from chest X-rays using EfficientNetV2 + FPN + Faster R-CNN. Features Focal Loss, Weighted Box Fusion, Mosaic Augmentation & StratifiedGroupKFold. Built...

Intended Use Cases:
Not specified.
Dataset Imaging / Text

MIMIC-CXR

MIMIC-CXR (Medical Information Mart for Intensive Care, Chest X-Ray) is a large publicly available dataset of chest radiographs with free-text radiology reports. Key Capabilities Multimodal...

Intended Use Cases:
Not specified.
Dataset EHR / Clinical Text

MIMIC-III

MIMIC-III (Medical Information Mart for Intensive Care) is a large, freely-available database comprising de-identified health-related data associated with over forty thousand patients who stayed in...

Intended Use Cases:
Epidemiological studies, machine learning research, and clinical decision support systems development using ICU patient data.
Dataset EHR Tabular

MIMIC-IV

MIMIC-IV (Medical Information Mart for Intensive Care) is a large, freely-available database comprising de-identified health-related data associated with over forty thousand patients who stayed in...

Intended Use Cases:
Not specified.
Dataset Clinical Text

MedChat-AI

MedChat AI is an open-source large language model (LLM) specifically designed for healthcare chat applications. It aims to provide accurate, reliable, and context-aware responses to...

Intended Use Cases:
Not specified.
Dataset Clinical Text / NLP

MedNLI

MedNLI is a dataset annotated by doctors for Natural Language Inference (NLI) in the clinical domain. It is derived from the clinical notes in MIMIC-III....

Intended Use Cases:
Natural Language Inference (NLI) in the clinical domain; determining if a clinical hypothesis is true, false, or undetermined given a clinical premise.
Dataset Clinical Text

MedTagger

MedTagger is a lightweight clinical NLP system built upon Apache UIMA, solving core medical data engineering challenges. It processes clinical data inputs natively through...

Intended Use Cases:
Not specified.
Dataset Clinical Text

MediBeng-Whisper-Tiny

MediBeng Whisper Tiny improves doctor-patient transcription by training the Whisper Tiny model to translate mixed Bengali-English speech into English, making it easier for analysis, record-keeping,...

Intended Use Cases:
Not specified.
Dataset Infrastructure

MediScan

MediScan: AI-powered bone fracture detection system achieving 99.8% accuracy through deep learning. Features real-time X-ray analysis, transparent Grad-CAM visualizations, and clinical integration tools. Built with...

Intended Use Cases:
Not specified.
Dataset Clinical Text

Medical-AGI

LLM-powered AI multi-agent platform that coordinates global to individual health by scaling each layer of healthcare, solving core medical data engineering challenges. It...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

Medical-Healthcare-3D-Imaging-AI

Medical healthcare AI: robotic surgery, automated brain tumour segmentation, skin cancer lesion detection and segmentation (melanoma recognition), lung cancer detection (chest CT scan)...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

MedicalModelLibrary

A collection of AI models tailored for healthcare applications, solving core medical data engineering challenges. It processes clinical data inputs natively through its core repository...

Intended Use Cases:
Not specified.
Dataset Clinical Text

MultiCaRe_Dataset

Open-source multimodal dataset: 98K+ clinical cases & 139K+ medical images from PubMed Central, solving core medical data engineering challenges. It processes clinical data inputs natively...

Intended Use Cases:
Not specified.
Dataset Radiology / Imaging

NIH Chest X-ray Dataset

The NIH Chest X-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients with text-mined disease labels from the associated radiological reports. Key Features...

Intended Use Cases:
Training algorithms to detect 14 common thoracic pathologies from frontal chest X-rays.
Dataset Infrastructure

NeuroCloak

Real-time AI governance platform implementing a Cognitive Digital Twin framework. Monitors AI decisions, traces reasoning steps, detects bias, and generates cognitive audit reports across healthcare,...

Intended Use Cases:
Not specified.
Dataset Radiology / MRI

OASIS (Open Access Series of Imaging Studies)

The Open Access Series of Imaging Studies (OASIS) is a project aimed at making MRI data sets of the brain freely available to the scientific...

Intended Use Cases:
Neuroimaging analysis, developing biomarkers for Alzheimer's disease, and training volumetric segmentation models.
Dataset Genomics / Multi-omics

Open Targets

Open Targets is a comprehensive, open-source platform that integrates human genetics and genomics data for systematic drug target identification and prioritization. Key Capabilities Target-Disease Associations:...

Intended Use Cases:
Drug target identification and prioritization, integrating genetics, omics, and chemical data to find safe and effective therapeutic targets.
Dataset Clinical Text

PhysicianBench

The benchmark tasks and evaluation harness for “PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments”, solving core medical data engineering challenges. It processes clinical data...

Intended Use Cases:
Not specified.
Dataset Physiological Signals / Multi-modal

PhysioNet

PhysioNet is a massive repository of freely-available medical research data, managed by the MIT Laboratory for Computational Physiology. It hosts collections of recorded physiologic signals...

Intended Use Cases:
Not specified.
Dataset EHR / Tabular

PhysioNet 2012 Mortality Prediction

The PhysioNet/Computing in Cardiology Challenge 2012 dataset focuses on predicting the mortality of ICU patients. It contains data for 12,000 ICU stays, including general descriptors...

Intended Use Cases:
Predicting in-hospital mortality of ICU patients using multivariate time series data.
Dataset Medical Imaging

Prior-Authorization-Multi-Agent-Solution-Accelerator

Payer-side AI-assisted prior authorization review using Microsoft Agent Framework with four Foundry Hosted Agents (Compliance, Clinical, Coverage, Synthesis). Gate-based decision rubric, MCP healthcare data access,...

Intended Use Cases:
Not specified.
Dataset Text

SMM4H (Social Media Mining for Health)

The Social Media Mining for Health (SMM4H) dataset is a collection of annotated tweets designed for natural language processing tasks in healthcare, particularly pharmacovigilance and...

Intended Use Cases:
Not specified.
Dataset Clinical Text

Symptom-Based-Disease-Prediction-Chatbot-Using-NLP

A cutting-edge AI-powered health diagnosis chatbot that leverages machine learning to interpret symptoms and predict potential medical conditions. Designed to improve healthcare accessibility, this chatbot...

Intended Use Cases:
Not specified.
Dataset Simulator

Synthea

Synthea is an open-source synthetic patient population simulator. It generates realistic, but not real, patient data and medical records across a patient’s entire lifecycle and...

Intended Use Cases:
Not specified.
Dataset Genomics / Proteomics

TREAT-AD

The TREAT-AD (Target Enablement to Accelerate Therapy Development for Alzheimer’s Disease) dataset is an open-science platform providing extensive multi-omics data, target enabling resources, and bioinformatics...

Intended Use Cases:
Not specified.
Dataset Genomics / Transcriptomics / Imaging

The Cancer Genome Atlas (TCGA)

The Cancer Genome Atlas (TCGA) is a landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer...

Intended Use Cases:
Not specified.
Dataset Genomics / EHR

UK Biobank

The UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants. Key Capabilities...

Intended Use Cases:
Large-scale epidemiological studies, genome-wide association studies (GWAS), and discovery of disease biomarkers.
Dataset Medical Imaging

UniMedVL

Official implementation of “UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis”, a unified medical vision-language model that integrates multimodal understanding and generation capabilities....

Intended Use Cases:
Not specified.
Dataset Medical Imaging

awesome-healthcare-ai

A curated list of awesome open source healthcare tools, algorithms, datasets and research papers, solving core medical data engineering challenges. It processes clinical data inputs...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

awesome-healthcare-datasets

Healthcare and biomedical datasets for AI/ML, solving core medical data engineering challenges. It processes clinical data inputs natively through its core repository architectures. Developers can...

Intended Use Cases:
Not specified.
Dataset Medical Imaging

awesome-healthmetrics

A curated list of awesome resources at the intersection of healthcare and AI, solving core medical data engineering challenges. It processes clinical data inputs natively...

Intended Use Cases:
Not specified.
Dataset Clinical Text

chi-bench

Χ-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? solving core medical data engineering challenges. It processes clinical data inputs natively through its core...

Intended Use Cases:
Not specified.
Dataset Clinical Text

cnlp_transformers

Transformers for Clinical NLP, solving core medical data engineering challenges. It processes clinical data inputs natively through its core repository architectures. Developers can implement and...

Intended Use Cases:
Not specified.
Dataset Clinical Text

ctakes

Apache cTAKES is a Natural Language Processing (NLP) platform for clinical text, solving core medical data engineering challenges. It processes clinical data inputs natively through...

Intended Use Cases:
Not specified.
Dataset Clinical Text

dutch-medical-concepts

Instructions and code to create a table of UMLS, SNOMED, or HPO concepts containing Dutch medical names, usable in named entity recognition and linking...

Intended Use Cases:
Not specified.
Dataset EHR / Time-series

eICU-CRD

The eICU Collaborative Research Database (eICU-CRD) is a large multi-center critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for...

Intended Use Cases:
Predictive modeling, federated learning across different hospitals, and clinical decision support algorithms.
Dataset Clinical Text

ehrsql-2024

Clinical NLP Shared Task @ NAACL’24, solving core medical data engineering challenges. It processes clinical data inputs natively through its core repository architectures. Developers can...

Intended Use Cases:
Not specified.
Dataset Clinical Text

ensemble

Ensembles of NLP Tools for Data Element Extraction from Clinical Notes, solving core medical data engineering challenges. It processes clinical data inputs natively through its...

Intended Use Cases:
Not specified.
Dataset Infrastructure

evaluating-eeg-representations

Resources for the paper titled “Evaluating Latent Space Robustness and Uncertainty of EEG-ML Models under Realistic Distribution Shifts” (accepted at NeurIPS 2022), solving core medical...

Intended Use Cases:
Not specified.
Dataset Genomics

gnomAD

The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and...

Intended Use Cases:
Estimating allele frequencies for rare diseases, identifying loss-of-function intolerance in genes, and establishing a baseline for healthy human genetic variation.
Dataset Medical Imaging

healthcare-agents

Portable prompt and SKILL.md pack with 51 specialist AI agents for US healthcare administration workflows, solving core medical data engineering challenges. It processes clinical data...

Intended Use Cases:
Not specified.
Dataset Clinical Text

healthcare-ai-model-evaluator

Healthcare AI Model Evaluator (HAIME) empowers healthcare organizations to independently evaluate and customize AI solutions, addressing challenges of transparency, clinical relevance, and real-world impact. By...

Intended Use Cases:
Not specified.
Dataset Clinical Text

m3

🏥🤖 Query MIMIC-IV medical data using natural language through Model Context Protocol (MCP). Transform healthcare research with AI-powered database interactions - supports both local MIMIC-IV...

Intended Use Cases:
Not specified.