Fine-tuning Qwen2.5-VL for Medical Image Analysis

Vision-Language Models (VLMs) have emerged as powerful tools for understanding and generating content that involves both images and text. While these models demonstrate impressive capabilities on general-domain tasks, specialized fields like medical imaging require domain-specific adaptation to achieve optimal performance.
In this comprehensive guide, we'll walk through the process of fine-tuning the Qwen2.5-VL (Vision-Language) model for medical imaging analysis. Qwen2.5-VL is a powerful multimodal model that can process both images and text, making it ideal for medical applications where visual analysis and textual explanations are equally important.
By the end of this guide, you'll have a comprehensive understanding of how to adapt large vision-language models for specialized medical imaging tasks, even with limited computational resources.
Before diving into fine-tuning, let's explore the architecture of the Qwen2.5-VL model and why it's suitable for medical imaging tasks.
Qwen2.5-VL is a multimodal model from the Qwen family developed by Alibaba Cloud. The 7B version we're using contains approximately 7 billion parameters and combines three main components: a vision transformer encoder that processes images, a lightweight merger that projects visual features into the language model's embedding space, and the Qwen2.5 language decoder that generates text.
Qwen2.5-VL offers several advantages for medical imaging applications: strong fine-grained visual understanding, robust instruction following for question answering over images, support for flexible input resolutions, and openly available weights that can be fine-tuned and deployed locally.
Why Qwen2.5-VL for Medical Imaging? While there are dedicated medical imaging models, adapting general-purpose VLMs like Qwen2.5-VL offers the advantage of retaining broader world knowledge while gaining domain-specific capabilities. This enables more comprehensive analysis and explanation of medical images in context.
The first step in our fine-tuning journey is to set up the proper environment with all necessary dependencies. This includes installing the latest versions of transformer libraries, quantization tools, and utilities specific to the Qwen2.5-VL model.
!pip install -U -q git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git datasets bitsandbytes peft qwen-vl-utils wandb accelerate
# Tested with transformers==4.47.0.dev0, trl==0.12.0.dev0, datasets==3.0.2, bitsandbytes==0.44.1, peft==0.13.2, qwen-vl-utils==0.0.8, wandb==0.18.5, accelerate==1.0.1
!pip install -q torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
from huggingface_hub import notebook_login
notebook_login()
Each component in our installation serves a specific purpose:
- transformers and trl (installed from source for the latest Qwen2.5-VL and supervised fine-tuning support)
- datasets for loading and manipulating training data
- bitsandbytes for 4-bit quantization
- peft for parameter-efficient fine-tuning with LoRA
- qwen-vl-utils for preparing image inputs for Qwen VL models
- wandb for experiment tracking
- accelerate for device placement and mixed-precision training
We pin PyTorch 2.4.1 with CUDA 12.1 support to ensure compatibility with modern GPUs and with the library versions tested above.
To fine-tune our model for medical applications, we need appropriate medical imaging datasets. In this section, we'll download and prepare a specialized dataset containing medical images paired with reports.
data_url="https://www.kaggle.com/datasets/ahmedsta/qa-vlm-med"
import os
os.makedirs('/root/.kaggle', exist_ok=True) # Make the .kaggle directory
os.rename('kaggle.json', '/root/.kaggle/kaggle.json') # Move the file to the correct path
os.chmod('/root/.kaggle/kaggle.json', 0o600) # Set the correct file permissions (note the octal literal)
!kaggle datasets download ahmedsta/report-ray
!unzip report-ray.zip
Medical images often come in various sizes and formats. To ensure consistency, we resize all images to a standard dimension (320x320 pixels) that's appropriate for our model's input requirements:
import json
from tqdm import tqdm
from PIL import Image
with open("/content/Report_json.json", "r", encoding="utf-8") as file:
    med = json.load(file)

def resize(image_path):
    try:
        if os.path.exists(image_path):
            image = Image.open(image_path)
            resized_image = image.resize((320, 320))
            resized_image.save(image_path)
    except Exception as e:
        print(f"Error processing image: {e}")

med_folder = "/content/Report_Med/Report_Med"
for e in tqdm(med):
    for i in range(len(e[1]['content']) - 1):
        image = f"{med_folder}/{e[1]['content'][i]['image']}"
        resize(image)
        e[1]['content'][i]['image'] = image

print(f"length of med: {len(med)}")
We divide our dataset into training and evaluation sets to properly measure our model's performance:
import random

# Shuffle once, then take non-overlapping 90/10 and 80/20 splits
random.shuffle(med)
split = int(len(med) * 0.9)
Data_model = med[:split]   # 90% for modeling
Data_test = med[split:]    # 10% held out for final testing

# Creating train/eval split
train_split = int(len(Data_model) * 0.8)
train_data = Data_model[:train_split]   # 80% of Data_model for training
eval_data = Data_model[train_split:]    # 20% of Data_model for evaluation
Dataset Structure: The medical dataset consists of radiology images paired with expert reports. Each example contains an image path and corresponding medical interpretation or analysis. This paired data is ideal for teaching the model to associate visual patterns in medical images with appropriate clinical descriptions.
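To make the structure concrete, here is a sketch of what one training sample looks like in the Qwen chat-message format that the preprocessing loop above assumes (index 1 holds the user turn with the image entries). The report text and file name are invented placeholders, not actual dataset contents:

```python
# Illustrative sample in Qwen2.5-VL message format (contents are placeholders)
sample = [
    {"role": "system",
     "content": [{"type": "text", "text": "You are a radiology assistant."}]},
    {"role": "user",
     "content": [
         {"type": "image", "image": "/content/Report_Med/Report_Med/img_0001.png"},
         {"type": "text", "text": "Describe the findings in this chest X-ray."},
     ]},
    {"role": "assistant",
     "content": [{"type": "text", "text": "The lungs are clear. No acute abnormality."}]},
]

# The resize loop above walks sample[1]['content'] to find and rewrite image paths
image_entries = [c for c in sample[1]["content"] if c.get("type") == "image"]
```

The final text entry in the user turn is skipped by the loop's `len(...) - 1` bound, since only image entries need their paths rewritten.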
Before proceeding with fine-tuning, it's important to load and test the base model to understand its initial capabilities on medical images. This gives us a baseline for comparison after fine-tuning.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)
To test the model, we need a function that can process our medical images and generate descriptive text:
from qwen_vl_utils import process_vision_info
def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(
        sample[1:2],  # Use the sample without the system message
        tokenize=False,
        add_generation_prompt=True,
    )

    # Process the visual input from the sample
    image_inputs, _ = process_vision_info(sample)

    # Prepare the inputs for the model
    model_inputs = processor(
        text=[text_input],
        images=image_inputs,
        return_tensors="pt",
    ).to(device)  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )

    return output_text[0]  # Return the first decoded output text
Working with large multimodal models requires careful memory management, especially when fine-tuning. We implement a utility function to clear memory between different stages:
import gc
import time
import torch

def clear_memory():
    # Delete variables if they exist in the current global scope
    for name in ('inputs', 'model', 'processor', 'trainer', 'peft_model', 'bnb_config'):
        if name in globals():
            del globals()[name]
    time.sleep(2)

    # Garbage collection and clearing CUDA memory
    gc.collect()
    time.sleep(2)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    time.sleep(2)
    gc.collect()
    time.sleep(2)

    print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

clear_memory()
GPU Memory Challenge: The Qwen2.5-VL-7B model with 7 billion parameters requires significant GPU memory, especially when handling both images and text. Without quantization, the weights alone need roughly 28GB of VRAM in full precision (FP32) or around 14GB in half precision (BF16), before accounting for activations, gradients, and optimizer state. The memory management strategy is crucial for successfully fine-tuning on consumer-grade hardware.
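The arithmetic behind these numbers is a simple back-of-the-envelope estimate (weights only, ignoring activations and optimizer state):

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Estimate the memory needed to store model weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

N = 7e9  # ~7 billion parameters
print(f"FP32: {weight_memory_gib(N, 4):.1f} GiB")    # ~26.1 GiB
print(f"BF16: {weight_memory_gib(N, 2):.1f} GiB")    # ~13.0 GiB
print(f"INT4: {weight_memory_gib(N, 0.5):.1f} GiB")  # ~3.3 GiB
```

Real-world usage is higher than the INT4 figure because activations, the KV cache, and quantization metadata also consume VRAM.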
To fine-tune our large vision-language model efficiently, we need to implement memory optimization techniques like quantization and parameter-efficient fine-tuning methods.
import torch
from transformers import BitsAndBytesConfig
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
# BitsAndBytesConfig int-4 config
# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
processor = AutoProcessor.from_pretrained(model_id)
Quantization reduces the precision of model weights from 16- or 32-bit floating point down to just 4 bits, significantly decreasing memory requirements. Our quantization configuration uses:
- load_in_4bit=True to store the base model weights in 4-bit precision
- bnb_4bit_quant_type="nf4" for the NormalFloat4 data type, designed for normally distributed weights
- bnb_4bit_use_double_quant=True to also quantize the quantization constants for extra savings
- bnb_4bit_compute_dtype=torch.bfloat16 so matrix multiplications run in BF16
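To build intuition for what 4-bit quantization does, here is a toy sketch of the simpler symmetric absmax scheme: weights are rescaled so the largest magnitude maps to the int4 range [-8, 7], stored as integers, and dequantized at compute time. (NF4 refines this idea with levels spaced for normally distributed weights rather than evenly; this demo is illustrative, not bitsandbytes' actual implementation.)

```python
def quantize_int4_absmax(weights):
    # Symmetric absmax quantization: map weights into 4-bit integers [-8, 7]
    scale = max(abs(w) for w in weights) / 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate weights from the stored integers
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07]
q, s = quantize_int4_absmax(w)
w_hat = dequantize(q, s)
print(q)      # [2, -7, 5, 1]
print(w_hat)  # approximate reconstruction of w
```

Each weight now occupies 4 bits instead of 16 or 32, at the cost of a small reconstruction error visible in `w_hat`.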
Memory Reduction: 4-bit quantization reduces the weight footprint of the Qwen2.5-VL-7B model from approximately 14GB in BF16 (about 28GB in FP32) to around 4GB, making it possible to fine-tune on GPUs with limited VRAM. This is particularly important for vision-language models that need to process both large images and text sequences simultaneously.
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that adds small trainable matrices to the model while keeping most weights frozen. When combined with quantization, this approach is known as QLoRA:
from peft import LoraConfig, get_peft_model

# Configure LoRA
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Apply PEFT model adaptation
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()
Our LoRA configuration is optimized for vision-language models with the following parameters:
- r=8: the rank of the low-rank update matrices; higher ranks add capacity at the cost of more trainable parameters
- lora_alpha=16: the scaling factor applied to the LoRA updates (effective scale alpha/r = 2)
- lora_dropout=0.05: light dropout on the LoRA layers for regularization
- target_modules=["q_proj", "v_proj"]: only the attention query and value projections receive adapters
- task_type="CAUSAL_LM": the adapters are configured for causal language modeling
Vision-language models require special handling for processing batches that contain both images and text. We implement a custom data collator function:
from qwen_vl_utils import process_vision_info
# Create a data collator to encode text and image pairs
def collate_fn(examples):
    # Get the texts and images, and apply the chat template
    texts = [processor.apply_chat_template(example, tokenize=False) for example in examples]

    image_inputs = []
    for example in examples:
        if isinstance(example, dict) and "image" in example:
            try:
                image_inputs.append(process_vision_info([example])[0])
            except KeyError:
                # Handle an example missing the 'content' key
                print(f"Warning: Example missing 'content' key: {example}")
                image_inputs.append(None)  # or your preferred default
        else:
            image_inputs.append(process_vision_info(example)[0])

    # Tokenize the texts and process the images
    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)

    # The labels are the input_ids; mask the padding tokens in the loss computation
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # Mask padding tokens

    # Mask image token IDs in the labels
    image_tokens = [processor.tokenizer.convert_tokens_to_ids(processor.image_token)]
    for image_token_id in image_tokens:
        labels[labels == image_token_id] = -100

    batch["labels"] = labels
    return batch
Our custom collator performs several key functions for handling multimodal data: it renders each conversation into a prompt string with the chat template, extracts and preprocesses the images with process_vision_info, pads every sequence in the batch to a uniform length, and builds the labels tensor used for loss computation.
Handling Special Tokens: The collator sets labels for padding tokens and image tokens to -100, which tells the model to ignore these positions when computing the loss. This is crucial for vision-language models because we don't want the model to predict the raw image tokens—only the text that should come after or around the images.
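The masking logic can be illustrated without the full processor. In PyTorch, -100 is the default ignore_index of the cross-entropy loss, so any position labeled -100 contributes nothing to the gradient. The token IDs below are made-up placeholders; the real values come from the tokenizer:

```python
PAD_ID, IMAGE_ID = 0, 151655  # illustrative IDs, not the actual tokenizer values

def mask_labels(input_ids, pad_id=PAD_ID, image_id=IMAGE_ID):
    """Copy input_ids to labels, masking padding and image tokens with -100."""
    return [tok if tok not in (pad_id, image_id) else -100 for tok in input_ids]

# Image placeholders at the start and trailing padding are masked;
# the text tokens in between are kept as prediction targets.
labels = mask_labels([151655, 151655, 9906, 11, 1917, 0, 0])
print(labels)  # [-100, -100, 9906, 11, 1917, -100, -100]
```

The collator above does the same thing with tensor indexing (`labels[labels == token_id] = -100`) instead of a Python loop.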
With all components in place, we can now set up the SFTTrainer and begin the fine-tuning process:
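The trainer below references a training_args object whose definition is not shown in this guide. A plausible configuration might look like the following — every value here is illustrative, not the author's actual setting, and should be tuned to your hardware and dataset:

```python
from trl import SFTConfig

# Illustrative hyperparameters; adjust to your GPU memory and dataset size
training_args = SFTConfig(
    output_dir="qwen2.5-7b-med",         # where checkpoints and the adapter are saved
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size of 8
    learning_rate=2e-4,
    gradient_checkpointing=True,          # trade compute for memory
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="epoch",
    report_to="wandb",
    dataset_kwargs={"skip_prepare_dataset": True},  # we supply our own collator
    remove_unused_columns=False,          # keep image fields for the collator
)
```

The last two settings matter for vision-language training: they stop the trainer from pre-tokenizing the dataset and from dropping the image columns our collator needs.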
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=eval_data,
    data_collator=collate_fn,
    peft_config=peft_config,
    # tokenizer=processor.tokenizer,
)

trainer.train()
trainer.save_model(training_args.output_dir)
During training, the SFTTrainer will: apply our collator to build multimodal batches, run forward passes through the quantized base model, compute the loss only on unmasked text tokens, update just the LoRA adapter weights, and periodically evaluate on the held-out eval set.
Training Efficiency: With our QLoRA setup, we're training only a small fraction of a percent of the model's weights — a few million trainable parameters out of roughly 7 billion — dramatically reducing memory requirements while still achieving significant adaptation to the medical domain. The exact count is reported by print_trainable_parameters() above.
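You can sanity-check the count reported by print_trainable_parameters() by hand. Assuming Qwen2.5-7B's published dimensions (hidden size 3584, 28 layers, and 512-dimensional value projections due to grouped-query attention — verify these against the model config, as they are assumptions here), r=8 adapters on q_proj and v_proj give:

```python
def lora_params(r, in_dim, out_dim):
    # Each adapted linear layer gains two matrices: A (r x in_dim) and B (out_dim x r)
    return r * in_dim + out_dim * r

hidden, kv_dim, layers, r = 3584, 512, 28, 8  # assumed Qwen2.5-7B dimensions
per_layer = lora_params(r, hidden, hidden) + lora_params(r, hidden, kv_dim)
total = per_layer * layers
print(f"{total:,} trainable parameters (~{100 * total / 7e9:.3f}% of 7B)")
```

Under these assumed dimensions the estimate comes out to a few million parameters, consistent with the tiny-fraction figure above.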
After training, we can clear memory and load our fine-tuned model to evaluate its performance:
clear_memory()
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Instantiate the base model with the correct configuration
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_id)

# Attach the fine-tuned LoRA adapter
adapter_path = "StaAhmed/qwen2.5-7b-med"
model.load_adapter(adapter_path)

# Generate a report for a held-out test example
output = generate_text_from_sample(model, processor, Data_test[0])
output
The fine-tuned model should now demonstrate improved capabilities specific to medical imaging:
Model Integration: One key advantage of our approach is that the fine-tuned adapter is only about 7MB in size, compared to the full model which is several gigabytes. This small adapter can be easily distributed and applied to the base model, making it practical to deploy specialized medical capabilities without duplicating the entire model.