Deploying a Clinical LLM API with Docker and vLLM
Introduction
This cookbook shows how to deploy an open-source Large Language Model for clinical use (such as Clinical Llama-3, or a general-purpose instruction-tuned model like Gemma-2-9B-IT) locally, using Docker and vLLM for high-throughput inference.
Prerequisites
- An NVIDIA GPU with at least 24GB of VRAM (e.g., RTX 3090/4090 or A10G)
- Docker and the NVIDIA Container Toolkit installed
- A Hugging Face access token (if using gated models)
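As a rough sanity check on the 24GB figure, model weights occupy about (parameter count × bytes per parameter). The sketch below does that arithmetic. The function name `weight_memory_gib` and the 8-billion parameter count are illustrative assumptions, and the estimate deliberately leaves out the KV cache and activations, which also need GPU memory.

```python
# Back-of-the-envelope estimate of GPU memory used by model weights only.
# Assumption: KV cache, activations, and CUDA overhead are NOT included.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the approximate weight footprint in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    # Hypothetical example: an 8-billion-parameter model served in bfloat16.
    print(f"{weight_memory_gib(8e9):.1f} GiB")  # about 14.9 GiB
```

Under these assumptions an 8B model in bfloat16 leaves several gigabytes of a 24GB card for the KV cache, while the same model in float32 would not fit.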
Step 1: Write the Docker Compose File
We use vLLM because it ships an OpenAI-compatible API server out of the box and uses PagedAttention, which stores the attention key-value cache in fixed-size blocks to cut wasted GPU memory and raise throughput.
version: '3.8'
services:
  vllm-server:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    # Reserve one NVIDIA GPU for the container.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: --model OpenPHR/clinical-llama-3-8b --dtype bfloat16 --api-key my_secure_token
    # Reuse the host's Hugging Face cache so weights are downloaded only once.
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
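The `my_secure_token` value in the compose file is a placeholder. One way to produce a hard-to-guess replacement is Python's standard `secrets` module. The helper name `new_api_key` and the 32-byte length are assumptions for illustration, not something vLLM requires.

```python
import secrets


def new_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string usable as a bearer token."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Use the printed value after --api-key and again in the client code.
    print(new_api_key())
```

Whatever value you choose, the server and every client must use the same one.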
Step 2: Start the Server
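Export your Hugging Face token in the current shell, then start the container in the background. On the first start, vLLM downloads the model weights, so the API can take several minutes to become available.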
export HF_TOKEN="hf_your_token_here"
docker compose up -d
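Because the container returns control before the model has finished loading, it can help to wait for the API before sending requests. The following is a minimal sketch that polls the OpenAI-compatible `/v1/models` endpoint using only the standard library. The helper names `server_is_up` and `wait_until_ready`, the retry count, and the delay are assumptions chosen for illustration.

```python
import time
import urllib.request


def server_is_up(base_url: str, api_key: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors
        # (urllib.error.URLError is a subclass of OSError): not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 5.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Example usage (assumes the compose file from Step 1 is running):
# ready = wait_until_ready(
#     lambda: server_is_up("http://localhost:8000/v1", "my_secure_token")
# )
# print("server ready" if ready else "server did not come up in time")
```

Passing the probe as a callable keeps the retry loop independent of the HTTP details, so the same loop works with any other readiness check.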
Step 3: Query the Local API
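You can now query your locally hosted clinical LLM with the standard OpenAI Python client by pointing base_url at the local server. Once the model weights are cached, inference requests stay on your own hardware, which keeps PHI away from third-party services. The server in this example speaks plain HTTP, so add TLS and network access controls before exposing it beyond localhost.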
import openai

# Point the OpenAI client at the local vLLM server.
client = openai.OpenAI(
    api_key="my_secure_token",  # must match the --api-key given to vLLM
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="OpenPHR/clinical-llama-3-8b",  # must match the --model given to vLLM
    messages=[
        {"role": "system", "content": "You are an expert medical AI assistant."},
        {"role": "user", "content": "What are the common side effects of lisinopril?"},
    ],
)

print(response.choices[0].message.content)
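To make the "OpenAI-compatible" part concrete, the same call can be written as a plain HTTP request without the `openai` package. This is a sketch under the assumption that the server follows the usual chat completions request and response shape. The helper names `build_chat_request` and `extract_reply` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, messages: list):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a chat completions JSON response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Example usage (assumes the server from Step 2 is running):
# request = build_chat_request(
#     "http://localhost:8000/v1",
#     "my_secure_token",
#     "OpenPHR/clinical-llama-3-8b",
#     [{"role": "user", "content": "What are the common side effects of lisinopril?"}],
# )
# with urllib.request.urlopen(request) as response:
#     print(extract_reply(response.read()))
```

Seeing the raw request is mainly useful for debugging authentication or proxy problems. For application code, the client library shown above is the more convenient choice.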
Conclusion
With vLLM and Docker, you can host and scale large clinical language models inside your hospital firewall while serving them through an OpenAI-compatible API, so existing client code needs little more than a new base URL.