Tutorial

How to Train an EHR Mortality Prediction Model with PhysioNet 2012

Difficulty: Advanced | Time: 25 min read

Introduction

In this cookbook, we will cover how to download the PhysioNet 2012 dataset, preprocess multivariate clinical time-series data, and train an LSTM to predict in-hospital mortality.

Architecture Overview


graph LR
    Raw(Raw Text Files) -->|Parse & Pivot| TS(Time-Series Matrix)
    TS -->|Impute NaNs| Clean(Clean Matrix)
    Clean -->|Pad Sequences| Tensor(PyTorch Tensors)
    Tensor -->|Batch Training| LSTM(LSTM Model)
    LSTM -->|Sigmoid| Pred(Mortality Risk 0-1)

Prerequisites

Python 3.10+
NumPy, Pandas, Scikit-learn, and PyTorch installed
Access to the PhysioNet 2012 dataset download page

Step 1: Download and Parse the Data

First, download the PhysioNet 2012 dataset. The data comes as raw text files, one per patient. Each file is a comma-separated table of Time, Parameter, and Value rows holding timestamped vital signs and lab results from the first 48 hours of an ICU stay.

wget https://physionet.org/files/challenge-2012/1.0.0/set-a.tar.gz
tar -xvzf set-a.tar.gz
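
To make the parsing half of this step concrete, the sketch below reads a tiny inline sample in the same long format (Time, Parameter, Value) and builds a label lookup from an outcomes table. The sample rows are made up for illustration, and the outcomes column names (`RecordID`, `In-hospital_death`) are assumptions based on the challenge's outcomes file, so check them against your download. The helper names `load_labels` and `read_record_id` are hypothetical.

```python
import io
import pandas as pd

# Made-up rows in the long format used by each patient file.
# Static descriptors (RecordID, Age, ...) are recorded at time 00:00.
SAMPLE_PATIENT = """Time,Parameter,Value
00:00,RecordID,132539
00:00,Age,54
00:07,HR,73
00:37,HR,77
01:37,Temp,35.1
"""

# Made-up outcomes rows; the column names are assumed, not verified here.
SAMPLE_OUTCOMES = """RecordID,In-hospital_death
132539,0
132540,1
"""

def load_labels(outcomes_file):
    """Map RecordID -> in-hospital death flag (0 or 1)."""
    outcomes = pd.read_csv(outcomes_file)
    return dict(zip(outcomes["RecordID"], outcomes["In-hospital_death"]))

def read_record_id(patient_file):
    """Return the RecordID stored inside a patient file as an int."""
    df = pd.read_csv(patient_file)
    row = df.loc[df["Parameter"] == "RecordID", "Value"]
    return int(row.iloc[0])

labels = load_labels(io.StringIO(SAMPLE_OUTCOMES))
record_id = read_record_id(io.StringIO(SAMPLE_PATIENT))
print(record_id, labels[record_id])
```

With real data, both helpers would receive file paths instead of `io.StringIO` objects, and the lookup pairs each parsed patient tensor with its training label.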

Step 2: Time-Series Imputation and Padding

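EHR data is notoriously sparse and patients have variable-length stays. We impute missing values by forward-filling, then pad the sequences to a common length for batching in PyTorch. After hourly resampling, a 48-hour record yields roughly 48 time steps, and pad_sequence pads each batch to the length of its longest sequence rather than to a preset maximum.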
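
As a toy illustration of the imputation rule (the values below are invented), forward-filling carries the last observed reading into later empty hours, while a gap before the first observation has nothing to carry forward and falls back to the constant fill:

```python
import numpy as np
import pandas as pd

# Invented hourly heart-rate readings with gaps at hours 0, 2 and 3
hr = pd.Series([np.nan, 80.0, np.nan, np.nan, 90.0], name="HR")

filled = hr.ffill()          # hours 2 and 3 inherit the reading from hour 1
imputed = filled.fillna(0)   # hour 0 has no earlier reading, so it becomes 0

print(imputed.tolist())  # [0.0, 80.0, 80.0, 80.0, 90.0]
```

Note that a fill value of 0 is only neutral when features have been standardized; on raw values, a heart rate of 0 is an extreme reading, which is one reason to consider filling with feature means instead.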

import pandas as pd
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence

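def parse_patient_file(filepath, feature_names):
    """Return a (hours, len(feature_names)) float tensor for one patient file."""
    df = pd.read_csv(filepath, header=0)  # columns: Time, Parameter, Value
    # Pivot long-format rows into a Time x Parameter matrix. pivot_table averages
    # repeated (Time, Parameter) readings, which would make df.pivot raise.
    df = df.pivot_table(index='Time', columns='Parameter', values='Value', aggfunc='mean')

    # Times are 'HH:MM' strings; append seconds so pandas parses them as timedeltas
    df.index = pd.to_timedelta(df.index + ':00')
    # Resample to hourly bins to standardize the time steps
    df = df.resample('1h').mean()

    # Use the same ordered columns for every patient. Parameters that were never
    # measured become all-NaN columns; unlisted static descriptors are dropped.
    df = df.reindex(columns=feature_names)

    # Forward fill then fill remaining NaNs with 0 (or feature means)
    df = df.ffill().fillna(0)
    return torch.tensor(df.values, dtype=torch.float32)

# Example: Process a list of files and pad them
# feature_names is a fixed, ordered list of the time-series variables to keep
# sequences = [parse_patient_file(f, feature_names) for f in file_paths]
# padded_batch = pad_sequence(sequences, batch_first=True, padding_value=0.0)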
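
Once batched, padded steps are indistinguishable from real zeros, so it can help to keep each sequence's true length alongside the padded tensor. A minimal sketch with random stand-in tensors follows; the shapes and the `lengths` and `mask` names are illustrative, not part of the original pipeline.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Stand-ins for three parsed patients: 48, 30 and 12 hourly steps, 37 features each
sequences = [torch.randn(48, 37), torch.randn(30, 37), torch.randn(12, 37)]

# True number of time steps per patient, recorded before padding
lengths = torch.tensor([seq.shape[0] for seq in sequences])

# Pads every sequence with zeros up to the longest one in the batch
padded_batch = pad_sequence(sequences, batch_first=True, padding_value=0.0)

# Boolean mask: True where a time step holds real data, False where it is padding
mask = torch.arange(padded_batch.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)

print(padded_batch.shape)  # torch.Size([3, 48, 37])
print(mask.sum(dim=1))     # tensor([48, 30, 12])
```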

Step 3: Training an LSTM in PyTorch

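Because the data consists of sequences, recurrent neural networks such as LSTMs are a natural baseline: they read one time step at a time and summarize the whole stay in a final hidden state, which a linear layer then maps to a risk score.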

import torch.nn as nn

class MortalityLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, 
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x shape: (batch, seq_len, features)
        out, (hn, cn) = self.lstm(x)
        # Take the hidden state of the last layer for the last time step
        last_hidden = hn[-1, :, :]
        out = self.fc(last_hidden)
        return self.sigmoid(out)
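
One caveat with padded batches: the LSTM above keeps stepping through the zero padding, so `hn` reflects the padded end of the sequence rather than the patient's last real hour. A possible variant, sketched here under the assumption that the true sequence lengths are available as a 1-D tensor, packs the batch with `pack_padded_sequence` so the final hidden state is taken at each patient's last real step. The class name `PackedMortalityLSTM` and the `lengths` argument are illustrative additions, not part of the original model.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedMortalityLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers=2, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x, lengths):
        # x: (batch, seq_len, features); lengths: (batch,) true step counts
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (hn, _) = self.lstm(packed)
        # hn[-1] is the top layer's state at each sequence's last real step
        return torch.sigmoid(self.fc(hn[-1]))

model = PackedMortalityLSTM(input_dim=37, hidden_dim=64)
model.eval()  # disable dropout so the comparison below is deterministic

x = torch.zeros(2, 48, 37)
x[0] = torch.randn(48, 37)
x[1, :12] = torch.randn(12, 37)  # second patient: 12 real steps, rest is padding
lengths = torch.tensor([48, 12])

with torch.no_grad():
    risk = model(x, lengths)
    # Running the short patient alone, without any padding
    alone = model(x[1:2, :12], torch.tensor([12]))

print(risk.shape)  # torch.Size([2, 1])
print(torch.allclose(risk[1], alone[0], atol=1e-5))
```

The final check compares the short patient's risk inside the padded batch against the same patient run alone; with packing, the two should agree up to floating-point noise.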

Step 4: The Training Loop and Evaluation

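Because mortality data is highly imbalanced (most patients survive), accuracy is a misleading metric: a model that predicts survival for every patient scores high accuracy while identifying no deaths. We therefore evaluate using AUROC and AUPRC, which measure how well the predicted risks rank patients who died above those who survived.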
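
A small numeric check makes the gap between these metrics visible. The labels below are invented (10 deaths among 100 patients) and the "model" assigns the same risk to everyone, so it has no discriminative ability at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

# Invented labels: 10 deaths among 100 patients
y_true = np.array([1] * 10 + [0] * 90)

# A useless model that assigns the same low risk to everyone
constant_risk = np.full(100, 0.1)
hard_preds = (constant_risk >= 0.5).astype(int)  # thresholded: always "survives"

accuracy = accuracy_score(y_true, hard_preds)
auroc = roc_auc_score(y_true, constant_risk)
auprc = average_precision_score(y_true, constant_risk)

print(f"accuracy={accuracy:.2f} auroc={auroc:.2f} auprc={auprc:.2f}")
# accuracy=0.90 auroc=0.50 auprc=0.10
```

Accuracy looks strong at 0.90, while AUROC sits at the chance level of 0.50 and AUPRC equals the 10% positive rate, which is the baseline a real model has to beat.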

from sklearn.metrics import roc_auc_score, average_precision_score
import torch.optim as optim

model = MortalityLSTM(input_dim=37, hidden_dim=64)
criterion = nn.BCELoss() # Binary Cross Entropy
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop (dataloader and val_loader are assumed to yield (features, labels) batches)
for epoch in range(10):
    model.train()
    for x_batch, y_batch in dataloader:
        optimizer.zero_grad()
        # squeeze(-1) drops only the output dim, so a batch of size 1 stays 1-D
        preds = model(x_batch).squeeze(-1)
        loss = criterion(preds, y_batch.float())  # BCELoss expects float targets
        loss.backward()
        optimizer.step()

    # Evaluation (Validation Set)
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for x_val, y_val in val_loader:
            preds = model(x_val).squeeze(-1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(y_val.cpu().numpy())
            
    auroc = roc_auc_score(all_labels, all_preds)
    auprc = average_precision_score(all_labels, all_preds)
    print(f"Epoch {epoch} | AUROC: {auroc:.3f} | AUPRC: {auprc:.3f}")
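
The loop above assumes that `dataloader` and `val_loader` already exist. One way to build them, sketched here with random stand-in tensors in place of real parsed patients, is a `TensorDataset` over the padded features and float labels. The 80/20 split, the batch size, and the variable names are illustrative choices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)

# Stand-ins for the padded features and the 0/1 mortality labels
num_patients, seq_len, num_features = 100, 48, 37
features = torch.randn(num_patients, seq_len, num_features)
labels = (torch.rand(num_patients) < 0.15).float()  # float targets for BCELoss

dataset = TensorDataset(features, labels)

# Hold out 20% of patients for validation
val_size = num_patients // 5
train_set, val_set = random_split(dataset, [num_patients - val_size, val_size])

dataloader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)

x_batch, y_batch = next(iter(dataloader))
print(x_batch.shape, y_batch.shape)  # torch.Size([32, 48, 37]) torch.Size([32])
```

With real data, a stratified split (for example scikit-learn's `train_test_split` with its `stratify` argument) keeps the death rate similar in the training and validation sets.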

Conclusion

Training an EHR predictive model requires significant data wrangling, but an LSTM built on the PhysioNet 2012 dataset is a solid baseline for benchmarking clinical machine learning architectures and a first step toward models intended for real ICU settings.