Tutorial

Deploying a Clinical LLM API with Docker and vLLM

Difficulty: Intermediate Time: 12 min read

Introduction

This cookbook shows how to deploy an open-source clinical large language model (such as Clinical Llama-3 or Gemma-2-9B-IT) locally, using Docker and vLLM for high-throughput inference.

Architecture Overview


graph TD
    Client(Clinical App / Python Script) -->|REST /v1/chat/completions| Nginx(Reverse Proxy/Auth)
    Nginx -->|Port 8000| vLLM(vLLM Docker Container)
    vLLM -->|Loads Model| HF(HuggingFace Cache)
    vLLM -->|Tensor Parallelism| GPUs(NVIDIA GPUs)

Prerequisites

An NVIDIA GPU with at least 24GB of VRAM (e.g., RTX 3090/4090 or A10G)
Docker and NVIDIA Container Toolkit installed
A Hugging Face access token (if using gated models)
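To check the GPU prerequisite before going further, the sketch below parses nvidia-smi output and reports whether any GPU has enough memory. It assumes nvidia-smi is on the PATH. The helper names and the 22,000 MiB default threshold are illustrative choices, not part of the original setup; the threshold sits below 24 GiB because some nominally 24GB cards report a little less usable memory.

```python
import subprocess


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Turn one memory.total value (in MiB) per line into a list of ints."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def has_enough_vram(totals_mib: list[int], min_mib: int = 22_000) -> bool:
    """True if at least one GPU reports min_mib or more of total memory."""
    return any(total >= min_mib for total in totals_mib)


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


# Usage on a GPU host:
#   print(has_enough_vram(query_vram_mib()))
```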

Step 1: Write the Docker Compose File

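We use vLLM because it provides an OpenAI-compatible API server out of the box and uses PagedAttention to manage the KV cache efficiently, which raises throughput under concurrent load. Below is a docker-compose.yml file to start from; pin the image tag and replace the placeholder API key before using it in production.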

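version: '3.8'

services:
  vllm-server:
    # Pin a specific version tag instead of 'latest' for reproducible deployments
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    # Share the host IPC namespace so PyTorch has enough shared memory
    # (needed in particular for tensor-parallel inference)
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # Change to 'all' for multi-GPU
              capabilities: [gpu]
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    # Flags passed to the vLLM OpenAI-compatible server.
    # --api-key: placeholder value; replace it with a real secret.
    # --enforce-eager: disables CUDA graph capture, which saves GPU memory
    #   at some cost in throughput; remove it if memory allows.
    command: >
      --model OpenPHR/clinical-llama-3-8b
      --dtype bfloat16
      --api-key my_secure_token
      --max-model-len 4096
      --enforce-eager
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      # Optional: mount a local model directory if completely offline
      # - /data/models/clinical-llama-3:/models/clinical-llama-3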

Step 2: Multi-GPU and Quantization (Optional)

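If you are deploying a larger model (like Llama-3-70B) across multiple GPUs, vLLM supports tensor parallelism. Add --tensor-parallel-size 4 (for 4 GPUs) to the command and raise the GPU count in the Compose file to match. If you only have a single 24GB GPU, consider serving a pre-quantized checkpoint (AWQ or GPTQ); for an AWQ checkpoint, add --quantization awq. The flag tells vLLM how to load weights that are already quantized; it does not quantize a full-precision model at load time.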
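A rough way to decide between a single GPU, quantization, and tensor parallelism is to estimate the memory the weights alone need: parameter count times bytes per parameter. The sketch below uses nominal parameter counts (8B and 70B) and ignores the KV cache, activations, and the small overhead of quantization scales, so treat the numbers as lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal sizes; real checkpoints differ slightly.
scenarios = [
    ("8B, bfloat16", 8e9, 16),
    ("70B, bfloat16", 70e9, 16),
    ("70B, 4-bit (e.g. AWQ)", 70e9, 4),
]

for name, params, bits in scenarios:
    print(f"{name}: ~{weight_memory_gb(params, bits):.0f} GB of weights")
```

Under these assumptions an 8B model in bfloat16 needs about 16 GB for weights and fits on one 24GB card, while a 70B model needs about 140 GB in bfloat16 and about 35 GB at 4 bits, which still exceeds a single 24GB GPU.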

Step 3: Start the Server

Ensure your Hugging Face token is exported in your environment, then start the container.

export HF_TOKEN="hf_your_token_here"
docker compose up -d
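The container reports as started well before the model has finished loading, so it can help to wait for the server before sending requests. The sketch below polls the server's /health endpoint until it answers with HTTP 200. It assumes the port mapping from the Compose file above and that /health is reachable without the API key; the function names, timeout, and polling interval are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_ok(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or reset: the server is not up yet
        return False


def wait_until_ready(
    probe: Callable[[], bool] = health_ok,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe repeatedly until it succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage after starting the container:
#   if not wait_until_ready():
#       raise SystemExit("vLLM did not become healthy in time")
```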

Step 4: Query the Local API

You can now query your locally hosted clinical LLM with the standard OpenAI Python client by pointing base_url at the local server. Requests go to your own host rather than a third-party API, which helps keep PHI inside your network. Streaming is recommended when generating long clinical notes.

import os

import openai

# The API key must match the --api-key value passed to vLLM.
# VLLM_API_KEY is an optional override; otherwise the tutorial's placeholder is used.
client = openai.OpenAI(
    api_key=os.environ.get("VLLM_API_KEY", "my_secure_token"),
    base_url="http://localhost:8000/v1",
)

# Example: Zero-shot clinical summarization
response = client.chat.completions.create(
    model="OpenPHR/clinical-llama-3-8b",
    messages=[
        {"role": "system", "content": "You are an expert medical AI assistant. Summarize the following clinical note."},
        {"role": "user", "content": "Patient is a 54yo male presenting with sudden onset dyspnea and diaphoresis. EKG shows ST elevation in leads II, III, aVF. Troponin elevated."},
    ],
    temperature=0.1,  # Low temperature for factual medical tasks
    stream=True,  # Stream tokens for faster UI response
)

# Process the stream
print("Summary: ", end="")
for chunk in response:
    # Some chunks carry no choices or an empty delta; skip those
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # Finish the line once the stream ends
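For clients that cannot use the openai package, the same endpoint can be called over plain HTTP. The sketch below builds the POST request for /v1/chat/completions with only the standard library, assuming the placeholder API key, model name, and port from the Compose file above. The helper names are illustrative, and the request is non-streaming to keep the example short.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    api_key: str = "my_secure_token",
    base_url: str = "http://localhost:8000/v1",
    model: str = "OpenPHR/clinical-llama-3-8b",
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # Must match --api-key
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage once the server is running:
#   print(send(build_chat_request("Summarize: 54yo male, inferior STEMI.")))
```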

Troubleshooting Tips

CUDA Out of Memory (OOM): Reduce --max-model-len, or lower --gpu-memory-utilization (the default is 0.90; try 0.85 if the server crashes).
Slow Startup: The model is downloaded on the first run, which takes time. The Hugging Face cache volume lets later restarts skip the download, although the weights still have to be loaded onto the GPU.
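To see why reducing --max-model-len relieves memory pressure, it helps to estimate the KV cache a single full-length sequence needs. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head) with the published Llama-3-8B shape: 32 layers, 8 KV heads, head dimension 128. That the served clinical model shares this architecture is an assumption, and vLLM's actual allocation is block-based, so the result is an estimate.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3-8B shape
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for max_len in (4096, 8192):
    mib = per_token * max_len / 2**20
    print(f"max-model-len {max_len}: ~{mib:.0f} MiB of KV cache per full sequence")
```

With these numbers each token costs 128 KiB, so one 4096-token sequence needs about 512 MiB and doubling the context length doubles that figure for every concurrent sequence.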

Conclusion

By leveraging vLLM and Docker, you can securely host and scale large clinical language models inside your hospital firewall while maintaining compatibility with standard API interfaces.