
Fine-Tuning Clinical Llama-3 with PEFT

Difficulty: Intermediate | Time: 15 min read

Introduction

In this cookbook, we walk through fine-tuning the Clinical Llama-3 model on your own proprietary hospital QA data using LoRA (Low-Rank Adaptation), as implemented in the Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library.

Prerequisites

  • A GPU with at least 16 GB of VRAM (e.g., T4, A10G)
  • Python 3.10+
  • Hugging Face Transformers and PEFT (installed in Step 1)
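The 16 GB requirement can be sanity-checked with a rough estimate of how much memory the weights alone occupy at different precisions. The sketch below assumes roughly 8 billion parameters (taken from the "8b" in the model name) and deliberately ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: ~8 billion parameters, inferred from the "8b" in the model name.
N_PARAMS = 8e9

for label, bits in [("float16/bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Under these assumptions, 16-bit weights come to about 14.9 GiB, which leaves almost no headroom on a 16 GB card, while 4-bit weights come to about 3.7 GiB.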

Step 1: Install Dependencies

pip install transformers peft accelerate datasets bitsandbytes

Step 2: Load the Model with 4-bit Quantization

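To fit the 8B-parameter model on a single 16 GB GPU, we load its weights in 4-bit NF4 precision with double quantization, while computation runs in bfloat16. Note that bfloat16 is hardware-accelerated only on Ampere or newer GPUs (such as the A10G); on a T4, torch.float16 is usually the safer choice for bnb_4bit_compute_dtype.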

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "OpenPHR/clinical-llama-3-8b"

# 4-bit NF4 quantization; matrix multiplications are computed in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets Accelerate place the quantized weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
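Since the prerequisites list both a T4 and an A10G, the compute dtype is worth choosing per GPU rather than hard-coding. The following is a minimal sketch, not part of the original recipe: the helper names `compute_dtype_name` and `pick_compute_dtype` are hypothetical, and the rule assumes that native bfloat16 support starts at CUDA compute capability 8.0 (Ampere), which covers the A10G (8.6) but not the T4 (7.5).

```python
def compute_dtype_name(cc_major: int) -> str:
    """Map a CUDA compute-capability major version to a dtype name.

    Assumption: GPUs with major version >= 8 (Ampere or newer) have native
    bfloat16 support; older GPUs fall back to float16.
    """
    return "bfloat16" if cc_major >= 8 else "float16"


def pick_compute_dtype():
    """Return the torch dtype to pass as bnb_4bit_compute_dtype on this machine."""
    import torch  # imported lazily so the pure helper above works without torch

    major, _minor = torch.cuda.get_device_capability()
    return getattr(torch, compute_dtype_name(major))
```

With this in place, `bnb_4bit_compute_dtype=pick_compute_dtype()` could replace the hard-coded `torch.bfloat16` in the configuration above.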

Step 3: Apply LoRA Adapters

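We configure the LoRA adapters to target the query and value projections (q_proj and v_proj) in each attention layer. With r=16 and lora_alpha=32, the low-rank update is scaled by lora_alpha / r = 2.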

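from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # updates are scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM",
)

# Wrap the quantized base model; only the LoRA weights remain trainable
model = get_peft_model(model, config)
model.print_trainable_parameters()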
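To get a feel for the number that `print_trainable_parameters()` reports, the adapter size can be worked out by hand. This sketch assumes that Clinical Llama-3 keeps the dimensions of the public Llama-3-8B architecture (hidden size 4096, 32 layers, and 8 key/value heads of dimension 128, so v_proj maps 4096 to 1024). If the model differs, the total will differ too.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per adapted linear layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumption: dimensions of the public Llama-3-8B architecture.
HIDDEN = 4096    # model (hidden) dimension; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 1024    # 8 key/value heads x 128 dims; v_proj maps HIDDEN -> KV_DIM
N_LAYERS = 32
R = 16           # LoRA rank from the config above

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * N_LAYERS
print(f"{total:,}")  # 6,815,744
```

Under these assumptions the adapters hold about 6.8 million trainable parameters, well under 0.1% of the 8B base model.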

Conclusion

You can now train the model with the Hugging Face `Trainer` API. Because only the LoRA adapter weights are updated while the 4-bit base model stays frozen, gradients and optimizer state are kept only for the adapters, so training needs far less memory and compute than full fine-tuning.
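As a starting point, the training step might look like the sketch below. Everything specific in it is an assumption rather than part of the recipe above: the `format_example` prompt template, the record schema (a list of dicts with "question" and "answer" keys), the output directory name, and the hyperparameters are all placeholders to adapt to your own data and hardware.

```python
def format_example(question: str, answer: str) -> str:
    """Join one QA pair into a single training string (hypothetical template)."""
    return f"### Question:\n{question.strip()}\n\n### Answer:\n{answer.strip()}"


def train_lora(model, tokenizer, records, output_dir="clinical-llama-3-lora"):
    """Fine-tune a PEFT-wrapped model on a list of {"question", "answer"} dicts."""
    # Imported lazily so format_example can be used without the training stack.
    from datasets import Dataset
    from transformers import (
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    # Reuse EOS for padding if the tokenizer ships without a pad token.
    # Caveat: the collator masks pad positions in the labels, so EOS positions
    # are masked as well when pad == EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    def tokenize(batch):
        texts = [
            format_example(q, a) + tokenizer.eos_token
            for q, a in zip(batch["question"], batch["answer"])
        ]
        return tokenizer(texts, truncation=True, max_length=512)

    raw = Dataset.from_list(records)
    dataset = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # effective batch size of 16 per device
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=10,
        bf16=True,  # use fp16=True instead on GPUs without bfloat16 support
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        # mlm=False gives causal-LM labels copied from input_ids
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
    model.save_pretrained(output_dir)  # for a PEFT model, saves only the adapter
    return trainer
```

Calling `train_lora(model, tokenizer, records)` with the `model` and `tokenizer` from the earlier steps would run one epoch and write the adapter weights to the output directory.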