Tutorial

Fine-Tuning Clinical Llama-3 with PEFT

Difficulty: Intermediate
Time: 15 min read

Introduction

In this cookbook, we walk through fine-tuning the Clinical Llama-3 model on your own proprietary hospital QA data using Parameter-Efficient Fine-Tuning (PEFT). We use QLoRA, which trains small LoRA adapters on top of a frozen, 4-bit-quantized base model, so the 8B-parameter model can be trained on a single GPU with 16GB of VRAM.

Architecture Overview


graph TD
    Data(Hospital QA Dataset) --> Tokenizer(Llama-3 Tokenizer)
    Tokenizer --> Base(Frozen Llama-3 8B in 4-bit)
    Base --> LoRA(Trainable LoRA Adapters)
    LoRA --> SFT(SFTTrainer)
    SFT --> Weights(Saved Adapter Weights)

Prerequisites

A GPU with at least 16GB of VRAM (e.g., T4, A10G, RTX 4080)
Python 3.10+
Hugging Face Transformers, PEFT, TRL, Accelerate, Datasets, and bitsandbytes installed

Step 1: Install Dependencies

pip install transformers peft accelerate datasets bitsandbytes trl
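
Before moving on, it can help to confirm which of these packages are actually visible in the current environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed by the pip command above.
REQUIRED = ["transformers", "peft", "accelerate", "datasets", "bitsandbytes", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:>14}: {ver or 'NOT INSTALLED'}")
```

Any package reported as missing should be installed before continuing with Step 2.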

Step 2: Load the Model with 4-bit Quantization

To fit an 8B model on a single GPU, we will load it in 4-bit precision using the NF4 format.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "OpenPHR/clinical-llama-3-8b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 support (e.g., T4)
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set pad token to eos token for Llama-3
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
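
To see why 4-bit loading matters, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes a round 8 billion parameters and ignores activations, optimizer state, the LoRA adapters, and quantization constants, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8e9  # assumed round figure for an 8B model

for label, bits in [("float16/bfloat16", 16), ("8-bit", 8), ("4-bit (NF4)", 4)]:
    print(f"{label:>17}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 16-bit weights nearly fill a 16GB card on their own, while the 4-bit weights leave most of the memory for activations and the optimizer.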

Step 3: Apply LoRA Adapters

Next, we configure the LoRA adapters. Targeting all linear layers (the attention and MLP projections), rather than only the Q and V projections, generally gives better results on Llama-3, at the cost of more trainable parameters.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training (freezes the base weights and enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# Output should show ~0.2% to 1% of parameters are trainable
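
As a sanity check on that percentage, the adapter size can be estimated by hand: LoRA adds two matrices of shapes (r, d_in) and (d_out, r) to each targeted linear layer, which is r * (d_in + d_out) parameters. The sketch below assumes the checkpoint keeps the standard Llama-3 8B shapes (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128) and about 8.03 billion base parameters; verify these against the model's config.json before relying on the numbers.

```python
R = 16                 # LoRA rank from the config above
LORA_ALPHA = 32        # LoRA alpha from the config above
HIDDEN = 4096          # assumed hidden size (standard Llama-3 8B)
INTERMEDIATE = 14336   # assumed MLP size
KV_DIM = 8 * 128       # assumed key/value heads * head dimension
NUM_LAYERS = 32        # assumed number of decoder layers
BASE_PARAMS = 8.03e9   # assumed approximate base parameter count

# (d_in, d_out) of each targeted projection within one decoder layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter: A is (r, d_in) and B is (d_out, r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in SHAPES.values())
total = per_layer * NUM_LAYERS
fraction = total / (BASE_PARAMS + total)

print(f"trainable LoRA parameters: {total:,}")
print(f"share of all parameters:   {fraction:.2%}")
print(f"LoRA scaling (alpha / r):  {LORA_ALPHA / R}")
```

With these assumed shapes the adapters come to about 42 million parameters, roughly 0.5% of the model, which sits inside the range quoted above.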

Step 4: Supervised Fine-Tuning (SFT)

We use the SFTTrainer from the TRL library, which simplifies training on instruction-completion pairs.
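
If your records are stored as separate question and answer fields, they first need to be flattened into the single `text` field that the training code reads. Below is a minimal sketch that writes JSON Lines, one of the layouts the `json` dataset loader accepts. The helper names `format_example` and `write_jsonl` and the sample record are illustrative only; the prompt template mirrors the "Patient asks: ... Doctor says: ..." format used in this tutorial.

```python
import json
import os
import tempfile


def format_example(question: str, answer: str) -> dict:
    """Join one QA pair into the single-string format used for training."""
    return {"text": f"Patient asks: {question.strip()} Doctor says: {answer.strip()}"}


def write_jsonl(pairs, path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            record = format_example(question, answer)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Placeholder record for illustration only, not real clinical data.
sample_pairs = [
    ("How often should I take this medication?",
     "Follow the schedule on your prescription label."),
]

# Written to a temporary directory here; in practice, write to hospital_qa.json.
out_path = os.path.join(tempfile.mkdtemp(), "hospital_qa.json")
print(write_jsonl(sample_pairs, out_path), "record(s) written to", out_path)
```

Keeping the template identical between training and inference matters, because the model learns to continue text that ends in "Doctor says:".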

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load your custom hospital QA dataset
# Expected format: {"text": "Patient asks: ... Doctor says: ..."}
dataset = load_dataset("json", data_files="hospital_qa.json", split="train")

training_args = TrainingArguments(
    output_dir="./clinical-llama-3-adapters",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=10,
    learning_rate=2e-4,
    max_grad_norm=0.3,
    max_steps=500,
    warmup_ratio=0.03,
    fp16=True,  # on Ampere or newer GPUs, set bf16=True instead of fp16=True
)

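# The model was already wrapped by get_peft_model in Step 3, so peft_config is not passed again.
# Note: dataset_text_field, max_seq_length, and tokenizer are accepted here by older TRL
# releases; newer releases move these options to SFTConfig and processing_class.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
)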

# Start training
trainer.train()
trainer.model.save_pretrained("final_adapters")
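
The training arguments above determine how much data the run actually sees. The arithmetic below is a sketch that assumes a single GPU and that every example fills the full 512-token window, so the token count is an upper bound.

```python
import math

PER_DEVICE_BATCH = 4    # per_device_train_batch_size
GRAD_ACCUM_STEPS = 4    # gradient_accumulation_steps
MAX_STEPS = 500         # max_steps
WARMUP_RATIO = 0.03     # warmup_ratio
MAX_SEQ_LENGTH = 512    # max_seq_length
NUM_GPUS = 1            # assumption: single-GPU training

# Sequences contributing to each optimizer update
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_GPUS
# Total sequences consumed over the whole run
sequences_seen = effective_batch * MAX_STEPS
# Optimizer steps spent ramping the learning rate up
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)
# Upper bound on tokens processed, if every sequence is full length
max_tokens = sequences_seen * MAX_SEQ_LENGTH

print(f"effective batch size: {effective_batch}")
print(f"sequences seen:       {sequences_seen:,}")
print(f"warmup steps:         {warmup_steps}")
print(f"token upper bound:    {max_tokens:,}")
```

If your dataset has fewer than 8,000 examples, the run makes more than one pass over it; if it has many more, consider raising `max_steps` so that most of the data is used.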

Step 5: Inference with Fine-Tuned Model

After training, you can load the base model and apply your newly trained adapters for inference.

from peft import PeftModel

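# In float16 the 8B base model needs roughly 16GB for the weights alone. On a 16GB GPU,
# pass quantization_config=bnb_config (from Step 2) instead of torch_dtype.
base_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)
ft_model = PeftModel.from_pretrained(base_model, "final_adapters")
ft_model.eval()  # disable dropout for inference

# Use the same prompt template as the training data, and move inputs to the model's device
prompt = "Patient asks: What are the side effects of Lisinopril? Doctor says:"
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
outputs = ft_model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))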
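
For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so the decoded string printed above starts with the prompt itself. A small post-processing helper can isolate the answer. This is a sketch: `extract_answer` is a hypothetical helper, and the example string is a stand-in for decoded model output, not a real generation.

```python
def extract_answer(decoded: str, prompt: str) -> str:
    """Return only the generated continuation from a decoded prompt+completion string."""
    if not decoded.startswith(prompt):
        # Prompt not found verbatim; return the text unchanged rather than guess.
        return decoded.strip()
    continuation = decoded[len(prompt):]
    # The model may keep going and invent a new patient turn; cut there if so.
    return continuation.split("Patient asks:", 1)[0].strip()


prompt = "Patient asks: What are the side effects of Lisinopril? Doctor says:"
decoded = prompt + " <model answer here> Patient asks: <next question>"
print(extract_answer(decoded, prompt))
```

Cutting at the next "Patient asks:" marker is a design choice tied to this tutorial's prompt template; a different template would need a different stop marker.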

Conclusion

By combining QLoRA with the SFTTrainer, you can fine-tune an 8B-parameter open-source clinical model on a single local GPU, so your hospital's proprietary data never has to leave your own infrastructure.