Tutorial

Deploying a Clinical LLM API with Docker and vLLM

Difficulty: Intermediate | Time: 12 min read

Introduction

In this cookbook, we deploy an open-source large language model for clinical use (such as a clinical fine-tune of Llama-3, or the general-purpose Gemma-2-9B-IT) on local hardware, using Docker to package the server and vLLM for high-throughput inference.

Prerequisites

  • An NVIDIA GPU with at least 24GB of VRAM (e.g., RTX 3090/4090 or A10G)
  • Docker and NVIDIA Container Toolkit installed
  • Hugging Face access token (required only for gated models)
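
The 24GB figure can be sanity-checked with some quick arithmetic. The sketch below estimates the memory taken by the weights alone, assuming a nominal 8 billion parameters for an 8B model (the exact count differs slightly) and 2 bytes per parameter in bfloat16. The helper name `weight_memory_gb` is hypothetical, and the estimate ignores the KV cache, activations, and CUDA overhead, which all need room on top of it.

```python
# Back-of-the-envelope VRAM estimate for the model weights only.
# Assumption: a nominal 8e9 parameters; the real count differs slightly.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the memory needed for the weights, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


weights_gb = weight_memory_gb(8e9, "bfloat16")
headroom_gb = 24 - weights_gb  # what a nominal 24 GB card has left over
print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

Under these assumptions the weights take roughly two thirds of a 24GB card, and the remainder is what vLLM can use for the KV cache and runtime overhead.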

Step 1: Write the Docker Compose File

We use vLLM because it ships an OpenAI-compatible API server out of the box and uses PagedAttention, which stores the KV cache in fixed-size blocks to reduce wasted GPU memory and raise throughput.

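version: '3.8'

services:
  vllm-server:
    # Consider pinning a specific release tag instead of `latest` for reproducible deployments.
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    # vLLM's Docker instructions recommend host IPC (or a larger shm size) for PyTorch shared memory.
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            # Reserve one NVIDIA GPU for the container.
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      # Passed through from the HF_TOKEN variable exported in your shell.
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    # Replace my_secure_token with a strong secret; clients must send the same key.
    command: --model OpenPHR/clinical-llama-3-8b --dtype bfloat16 --api-key my_secure_token
    volumes:
      # Cache downloaded weights on the host so restarts do not re-download them.
      - ~/.cache/huggingface:/root/.cache/huggingface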

Step 2: Start the Server

Export your Hugging Face token in your shell so that Compose can pass it to the container, then start the service in the background.

export HF_TOKEN="hf_your_token_here"
# Compose V2 syntax; with the legacy standalone binary, run `docker-compose up -d` instead.
docker compose up -d
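
The container needs time to download and load the weights before it accepts requests, so a client that connects right after startup will likely fail. A minimal sketch of a readiness poll follows. It assumes the vLLM server exposes a `/health` route on port 8000 that answers without the API key. The helper names `http_ok` and `wait_until_ready`, and the timeout values, are illustrative choices rather than part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, or timed out: server is not up yet.
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until `probe` succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Usage (needs the running container):
# wait_until_ready("http://localhost:8000/health")
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without a live server. If the poll times out, `docker compose logs vllm-server` usually shows whether the model download or the GPU allocation is the problem.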

Step 3: Query the Local API

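You can now query your locally hosted clinical LLM with the same client code you would use for the OpenAI API. Inference runs entirely on your own hardware, so prompts containing PHI are never sent to a third-party service; the only outbound traffic is the one-time model download at first startup.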

import openai

client = openai.OpenAI(
    api_key="my_secure_token",
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="OpenPHR/clinical-llama-3-8b",
    messages=[
        {"role": "system", "content": "You are an expert medical AI assistant."},
        {"role": "user", "content": "What are the common side effects of lisinopril?"}
    ]
)

print(response.choices[0].message.content)
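
To make "OpenAI-compatible" concrete, the sketch below builds the equivalent raw HTTP request with only the standard library. It assumes the server follows the usual OpenAI convention of a POST to `/v1/chat/completions` with the key in an `Authorization: Bearer` header. The helper name `build_chat_request` is hypothetical, and the actual send is left commented out because it needs the running server.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, api_key: str, model: str, messages: list
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={
            # Must match the value passed to --api-key on the server.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "my_secure_token",
    "OpenPHR/clinical-llama-3-8b",
    [{"role": "user", "content": "What are the common side effects of lisinopril?"}],
)
print(req.get_method(), req.full_url)

# To actually send it (needs the running server):
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Seeing the request in this form helps when debugging from tools that cannot use the `openai` package, since any HTTP client that sets the same header and JSON body should work.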

Conclusion

By combining vLLM and Docker, you can host clinical language models inside your hospital firewall, keep PHI on your own hardware, and stay compatible with standard OpenAI-style API clients.