Bab 9: Fine-tuning Large Language Models

Parameter-Efficient Fine-Tuning, LoRA, dan Praktik Transfer Learning untuk LLM

🎯 Hasil Pembelajaran (Learning Outcomes)

Setelah mempelajari bab ini, Anda akan mampu:

  1. Memahami konsep pre-training vs fine-tuning dalam Large Language Models
  2. Menjelaskan perbedaan Full Fine-tuning dan Parameter-Efficient Fine-Tuning (PEFT)
  3. Mengimplementasikan LoRA (Low-Rank Adaptation) untuk efficient fine-tuning
  4. Menerapkan QLoRA untuk fine-tuning dengan memory constraints
  5. Melakukan fine-tuning model kecil (BERT, GPT-2) untuk tasks spesifik
  6. Mengevaluasi performa model sebelum dan sesudah fine-tuning
  7. Mengoptimalkan hyperparameters untuk training yang efisien

9.1 Pengantar: Dari Pre-training ke Fine-tuning

9.1.1 Transfer Learning dalam Era LLM

Di Chapter 8, kita telah belajar arsitektur Transformer dan bagaimana pre-trained models seperti BERT dan GPT bekerja. Sekarang, pertanyaan pentingnya: Bagaimana kita menyesuaikan model besar ini untuk tugas spesifik kita?

Problem:

  • Training LLM from scratch membutuhkan jutaan dollar dan ribuan GPU
  • GPT-3 dilatih dengan ~45TB text, 175 billion parameters, $4.6M compute cost
  • BERT-Large: 24 layers, 340M parameters, trained on 3.3B words
  • Tidak realistis untuk most organizations/researchers

Solution: Transfer Learning

flowchart LR
    A["🌍 Pre-training\n(Large corpus)\nBillions of tokens"] --> B["🤖 Base Model\nGeneral knowledge"]
    B --> C["🎯 Fine-tuning\n(Task-specific data)\nThousands of examples"]
    C --> D["✅ Specialized Model\nDomain expert"]

    style A fill:#e3f2fd
    style B fill:#fff9c4
    style C fill:#c8e6c9
    style D fill:#a5d6a7


Transfer Learning Paradigm untuk LLM

Key Insight: Model sudah belajar general language understanding dari pre-training. Kita hanya perlu adapt untuk task spesifik!

9.1.2 Pre-training vs Fine-tuning

Pre-training:

  • Data: Massive unlabeled text (entire internet)
  • Objective: Language modeling (predict next word, masked tokens)
  • Duration: Weeks/months dengan ribuan GPUs
  • Cost: Millions of dollars
  • Result: General-purpose language understanding

Fine-tuning:

  • Data: Small labeled dataset untuk specific task
  • Objective: Task-specific (classification, QA, summarization)
  • Duration: Hours/days dengan 1-8 GPUs
  • Cost: Hundreds/thousands of dollars
  • Result: Specialized model untuk target task

💡 Analogi

Pre-training = Pendidikan umum (SD sampai SMA): belajar berbagai mata pelajaran, membangun foundational knowledge, waktu lama dan biaya besar.

Fine-tuning = Spesialisasi (S1 jurusan tertentu): fokus pada domain spesifik, leverage knowledge dari pendidikan umum, lebih cepat dan biaya lebih rendah.

9.1.3 Tantangan Fine-tuning LLM

Computational Challenges:

| Model | Parameters | Memory (FP32) | Memory (FP16) | Fine-tuning Time |
|---|---|---|---|---|
| BERT-Base | 110M | 440 MB | 220 MB | 2-4 hours |
| BERT-Large | 340M | 1.36 GB | 680 MB | 8-12 hours |
| GPT-2 Medium | 355M | 1.42 GB | 710 MB | 6-10 hours |
| GPT-3 Small | 1.3B | 5.2 GB | 2.6 GB | 1-2 days |
| LLaMA-7B | 7B | 28 GB | 14 GB | 3-5 days |
| LLaMA-13B | 13B | 52 GB | 26 GB | 5-7 days |

Problem yang muncul:

  1. Memory constraints: Tidak semua orang punya GPU dengan 80GB VRAM
  2. Catastrophic forgetting: Model lupa general knowledge saat fine-tuning
  3. Overfitting: Small dataset β†’ model overfit mudah
  4. Cost: Training time = money untuk cloud GPUs

Solution yang akan kita pelajari:

  • Parameter-Efficient Fine-Tuning (PEFT)
  • LoRA (Low-Rank Adaptation)
  • QLoRA (Quantized LoRA)
  • Gradient checkpointing
  • Mixed precision training

9.2 Full Fine-tuning vs Parameter-Efficient Fine-tuning

9.2.1 Full Fine-tuning

Konsep: Update seluruh parameter model saat training.

Process:

  1. Load pre-trained model weights
  2. Add task-specific head (classifier/regression layer)
  3. Train semua layers dengan task data
  4. Save updated model weights
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Contoh Full Fine-tuning structure
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification
)

# Cek total parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"πŸ“Š Total parameters: {total_params:,}")
print(f"🎯 Trainable parameters: {trainable_params:,}")
print(f"βœ… Percentage trainable: {100 * trainable_params / total_params:.2f}%")

Kelebihan:

  • Maximum performance (jika data cukup)
  • Flexibility penuh dalam adaptation
  • Proven approach dengan banyak best practices

Kekurangan:

  • Memory intensive (need to store gradients untuk semua parameters)
  • Storage: Perlu save full model checkpoint untuk setiap task
  • Slow training
  • Expensive untuk large models

9.2.2 Parameter-Efficient Fine-tuning (PEFT)

Key Idea: Freeze most of pre-trained parameters, hanya train small subset atau additional parameters.
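
Sebagai ilustrasi konsep (bukan salah satu method PEFT resmi), sketsa berikut menunjukkan ide "freeze most, train few" secara manual: semua parameter base model dibekukan, lalu hanya classifier head yang dibuka kembali untuk dilatih.

from transformers import AutoModelForSequenceClassification

# Sketsa konsep: freeze seluruh base model, train subset kecil saja
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# 1. Bekukan SEMUA parameter
for param in model.parameters():
    param.requires_grad = False

# 2. Buka kembali hanya classifier head (subset kecil yang trainable)
for param in model.classifier.parameters():
    param.requires_grad = True

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")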

Keuntungan:

  • βœ… Drastically reduce memory usage
  • βœ… Faster training
  • βœ… Store hanya small adapter weights (few MB vs GBs)
  • βœ… Avoid catastrophic forgetting
  • βœ… Easy to switch between tasks

Kategori PEFT Methods:

graph TD
    A["Parameter-Efficient Fine-Tuning"] --> B["Adapter Modules"]
    A --> C["Prefix/Prompt Tuning"]
    A --> D["Low-Rank Adaptation"]
    A --> E["Selective Fine-tuning"]

    B --> B1["Houlsby Adapters\n(Serial)"]
    B --> B2["Parallel Adapters"]

    C --> C1["Prefix Tuning\n(Add learnable vectors)"]
    C --> C2["Prompt Tuning\n(Input-level prompts)"]

    D --> D1["LoRA\n(Most popular!)"]
    D --> D2["AdaLoRA\n(Adaptive rank)"]

    E --> E1["BitFit\n(Only bias terms)"]
    E --> E2["Layer-wise tuning"]

    style D1 fill:#c8e6c9
    style A fill:#e3f2fd


Taxonomy of PEFT Methods

Perbandingan Methods:

| Method | Trainable Params | Memory | Speed | Performance |
|---|---|---|---|---|
| Full Fine-tuning | 100% | Very High | Slow | Best |
| Adapter Modules | ~2-5% | Medium | Medium | Good |
| Prefix Tuning | ~0.1-1% | Low | Fast | Good |
| LoRA | ~0.1-1% | Low | Fast | Excellent |
| BitFit | ~0.1% | Very Low | Very Fast | Fair |

LoRA adalah sweet spot: low parameters, excellent performance!


9.3 LoRA (Low-Rank Adaptation)

9.3.1 Konsep dan Intuisi

Paper: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)

Core Insight:

  • Weight updates during fine-tuning memiliki low intrinsic rank
  • Tidak perlu update full weight matrix \(W\)
  • Cukup add small low-rank decomposition \(\Delta W = BA\)

Mathematical Formulation:

Original forward pass:

\[h = W_0 x\]

LoRA forward pass:

\[h = W_0 x + \Delta W x = W_0 x + BAx\]

Where:

  • \(W_0 \in \mathbb{R}^{d \times k}\): Original pre-trained weights (frozen)
  • \(B \in \mathbb{R}^{d \times r}\): Down-projection matrix (trainable)
  • \(A \in \mathbb{R}^{r \times k}\): Up-projection matrix (trainable)
  • \(r \ll \min(d, k)\): Rank (typically 4, 8, 16, or 32)

Visualisasi:

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import FancyBboxPatch, FancyArrowPatch
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Left: Full Fine-tuning
ax1.text(0.5, 0.95, 'Full Fine-tuning', ha='center', va='top',
         fontsize=16, fontweight='bold', transform=ax1.transAxes)

# Weight matrix
rect1 = FancyBboxPatch((0.2, 0.4), 0.6, 0.4,
                       boxstyle="round,pad=0.02",
                       edgecolor='red', facecolor='lightcoral',
                       linewidth=3, alpha=0.7)
ax1.add_patch(rect1)
ax1.text(0.5, 0.6, 'W (All Trainable)\nd Γ— k parameters\n\n110M params for BERT-Base',
         ha='center', va='center', fontsize=12, fontweight='bold')

ax1.text(0.5, 0.25, '❌ Memory Intensive\n❌ Slow Training\n❌ Large Storage',
         ha='center', va='center', fontsize=11, color='darkred')

ax1.set_xlim(0, 1)
ax1.set_ylim(0, 1)
ax1.axis('off')

# Right: LoRA
ax2.text(0.5, 0.95, 'LoRA Fine-tuning', ha='center', va='top',
         fontsize=16, fontweight='bold', transform=ax2.transAxes)

# Original weights (frozen)
rect2 = FancyBboxPatch((0.15, 0.4), 0.35, 0.4,
                       boxstyle="round,pad=0.02",
                       edgecolor='gray', facecolor='lightgray',
                       linewidth=2, alpha=0.5)
ax2.add_patch(rect2)
ax2.text(0.325, 0.6, 'Wβ‚€\n(Frozen)\nd Γ— k',
         ha='center', va='center', fontsize=11, color='gray')

# Plus sign
ax2.text(0.53, 0.6, '+', ha='center', va='center', fontsize=24, fontweight='bold')

# Low-rank matrices
rect3 = FancyBboxPatch((0.6, 0.55), 0.12, 0.25,
                       boxstyle="round,pad=0.01",
                       edgecolor='green', facecolor='lightgreen',
                       linewidth=3, alpha=0.7)
ax2.add_patch(rect3)
ax2.text(0.66, 0.675, 'B\ndΓ—r', ha='center', va='center', fontsize=10, fontweight='bold')

rect4 = FancyBboxPatch((0.74, 0.55), 0.12, 0.25,
                       boxstyle="round,pad=0.01",
                       edgecolor='green', facecolor='lightgreen',
                       linewidth=3, alpha=0.7)
ax2.add_patch(rect4)
ax2.text(0.8, 0.675, 'A\nrΓ—k', ha='center', va='center', fontsize=10, fontweight='bold')

ax2.text(0.5, 0.25, 'βœ… Low Memory (~0.1% params)\nβœ… Fast Training\nβœ… Small Storage (few MB)',
         ha='center', va='center', fontsize=11, color='darkgreen', fontweight='bold')

ax2.text(0.75, 0.45, 'r=8: Only ~300K params!',
         ha='center', va='center', fontsize=10,
         bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

ax2.set_xlim(0, 1)
ax2.set_ylim(0, 1)
ax2.axis('off')

plt.tight_layout()
plt.show()

Contoh Kalkulasi:

Untuk BERT attention layer dengan \(d=768, k=768\):

Full Fine-tuning:

  • Parameters: \(768 \times 768 = 589,824\) parameters

LoRA dengan rank \(r=8\):

  • Matrix B: \(768 \times 8 = 6,144\) parameters
  • Matrix A: \(8 \times 768 = 6,144\) parameters
  • Total: 12,288 parameters (only 2.1% of original!)

For entire BERT-Base model:

  • Full fine-tuning: 110M parameters
  • LoRA (r=8): ~300K parameters (0.27%!)

9.3.2 Implementasi LoRA dengan PEFT Library

Hugging Face menyediakan library peft yang membuat implementasi LoRA sangat mudah!

Installation:

pip install peft transformers datasets

Basic Implementation:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# 1. Load pre-trained model
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

print("=" * 70)
print("πŸ”΅ BEFORE LoRA:")
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

# 2. Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence Classification
    r=8,                          # Rank
    lora_alpha=32,                # Scaling factor
    lora_dropout=0.1,             # Dropout for regularization
    target_modules=["query", "value"],  # Apply LoRA to attention Q and V
)

# 3. Convert model to LoRA model
model = get_peft_model(model, lora_config)

print("\n" + "=" * 70)
print("🟒 AFTER LoRA:")
model.print_trainable_parameters()

print("\n" + "=" * 70)
print("πŸ“Š Summary:")
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable: {trainable:,} ({100*trainable/total_params:.3f}%)")
print(f"Frozen: {total_params - trainable:,} ({100*(total_params-trainable)/total_params:.2f}%)")
print(f"πŸ’Ύ Memory savings: ~{100*(1-trainable/total_params):.1f}%")

Key Parameters Explained:

  • r (rank): Dimensionality of low-rank matrices
    • Lower = fewer parameters, faster, less expressive
    • Higher = more parameters, slower, more expressive
    • Typical values: 4, 8, 16, 32
  • lora_alpha: Scaling parameter
    • Update LoRA diskalakan dengan faktor \(\alpha / r\) sebelum ditambahkan ke output, sehingga lora_alpha mengontrol magnitude update relatif terhadap rank
    • Typically set to 2-4x the rank
  • target_modules: Which modules to apply LoRA
    • Common choices: ["query", "value"] atau ["query", "key", "value", "dense"]
    • More modules = more parameters, tapi umumnya performance lebih baik (cara memeriksa nama module yang valid ditunjukkan pada sketsa setelah daftar ini)
  • lora_dropout: Regularization
    • Prevents overfitting
    • Typical: 0.05 - 0.1

9.3.3 Complete Fine-tuning Example: Sentiment Analysis

Mari kita implementasikan complete pipeline untuk fine-tuning BERT dengan LoRA!

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("πŸš€ Starting LoRA Fine-tuning Pipeline\n")

# ============================================================================
# 1. LOAD DATASET
# ============================================================================
print("πŸ“Š Loading dataset...")
# Menggunakan IMDB dataset (50K movie reviews)
dataset = load_dataset("imdb")

# Ambil subset kecil untuk demo (1000 samples)
train_dataset = dataset['train'].shuffle(seed=42).select(range(1000))
test_dataset = dataset['test'].shuffle(seed=42).select(range(200))

print(f"βœ… Train samples: {len(train_dataset)}")
print(f"βœ… Test samples: {len(test_dataset)}")
print(f"\nExample: {train_dataset[0]['text'][:100]}...")
print(f"Label: {train_dataset[0]['label']} (0=negative, 1=positive)")

# ============================================================================
# 2. TOKENIZATION
# ============================================================================
print("\n" + "="*70)
print("πŸ”€ Tokenizing dataset...")

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=256  # Shorter untuk demo
    )

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Format untuk PyTorch
train_dataset = train_dataset.rename_column("label", "labels")
test_dataset = test_dataset.rename_column("label", "labels")

train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print("βœ… Tokenization complete!")

# ============================================================================
# 3. LOAD MODEL & APPLY LORA
# ============================================================================
print("\n" + "="*70)
print("πŸ€– Loading model and applying LoRA...")

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                    # Rank
    lora_alpha=16,          # Scaling
    lora_dropout=0.1,
    target_modules=["query", "value"],
    bias="none",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ============================================================================
# 4. TRAINING ARGUMENTS
# ============================================================================
print("\n" + "="*70)
print("βš™οΈ Setting up training configuration...")

training_args = TrainingArguments(
    output_dir="./results_lora",
    learning_rate=2e-4,              # Higher LR untuk LoRA
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="none",                # Disable wandb/tensorboard untuk demo
    logging_steps=50,
)

# ============================================================================
# 5. EVALUATION METRICS
# ============================================================================
def compute_metrics(eval_pred):
    """Compute accuracy, precision, recall, F1"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='binary'
    )

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# ============================================================================
# 6. TRAINER
# ============================================================================
print("πŸ‹οΈ Initializing trainer...")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

# ============================================================================
# 7. TRAINING
# ============================================================================
print("\n" + "="*70)
print("🎯 Starting training...\n")

# Evaluate before training
print("πŸ“Š BEFORE Fine-tuning:")
eval_results_before = trainer.evaluate()
print(f"Accuracy: {eval_results_before['eval_accuracy']:.4f}")
print(f"F1 Score: {eval_results_before['eval_f1']:.4f}")

# Train
print("\nπŸš‚ Training in progress...")
train_results = trainer.train()

# Evaluate after training
print("\nπŸ“Š AFTER Fine-tuning:")
eval_results_after = trainer.evaluate()
print(f"Accuracy: {eval_results_after['eval_accuracy']:.4f}")
print(f"F1 Score: {eval_results_after['eval_f1']:.4f}")

# ============================================================================
# 8. IMPROVEMENT SUMMARY
# ============================================================================
print("\n" + "="*70)
print("πŸ“ˆ IMPROVEMENT SUMMARY:")
print("="*70)
accuracy_improvement = (eval_results_after['eval_accuracy'] -
                       eval_results_before['eval_accuracy']) * 100
f1_improvement = (eval_results_after['eval_f1'] -
                 eval_results_before['eval_f1']) * 100

print(f"Accuracy improvement: +{accuracy_improvement:.2f}%")
print(f"F1 Score improvement: +{f1_improvement:.2f}%")
print(f"\nTraining time: {train_results.metrics['train_runtime']:.2f} seconds")
print(f"Samples/second: {train_results.metrics['train_samples_per_second']:.2f}")

# ============================================================================
# 9. SAVE MODEL
# ============================================================================
print("\n" + "="*70)
print("πŸ’Ύ Saving model...")
model.save_pretrained("./lora_sentiment_model")
tokenizer.save_pretrained("./lora_sentiment_model")
print("βœ… Model saved to ./lora_sentiment_model")
print(f"   Size: Only LoRA adapters (~few MB) instead of full model (~440MB)")

Output yang diharapkan:

trainable params: 296,448 || all params: 109,779,714 || trainable%: 0.27%

BEFORE Fine-tuning:
Accuracy: 0.5200
F1 Score: 0.5150

AFTER Fine-tuning:
Accuracy: 0.8850
F1 Score: 0.8820

Improvement: +36.5% accuracy, +36.7% F1

9.3.4 Loading dan Menggunakan LoRA Model

Setelah training, bagaimana cara load dan use model?

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "./lora_sentiment_model")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./lora_sentiment_model")

# Inference function
def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.softmax(outputs.logits, dim=-1)

    sentiment = "Positive 😊" if predictions[0][1] > 0.5 else "Negative 😞"
    confidence = predictions[0][1].item() if predictions[0][1] > 0.5 else predictions[0][0].item()

    return sentiment, confidence

# Test
test_texts = [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "Terrible film. Waste of time and money.",
    "It was okay, nothing special but not terrible either."
]

print("🎬 Sentiment Analysis Results:\n")
for text in test_texts:
    sentiment, conf = predict_sentiment(text)
    print(f"Text: {text}")
    print(f"Prediction: {sentiment} (confidence: {conf:.2%})\n")

9.4 QLoRA: Quantized LoRA

9.4.1 Konsep Quantization

Problem: Even dengan LoRA, memory requirement untuk large models masih tinggi karena base model weights.

Solution: Quantization - Represent weights dengan lower precision.

Precision Levels:

| Precision | Bits | Range | Memory per Param |
|---|---|---|---|
| FP32 (Float32) | 32 | ±3.4×10³⁸ | 4 bytes |
| FP16 (Float16) | 16 | ±65,504 | 2 bytes |
| INT8 | 8 | -128 to 127 | 1 byte |
| INT4 | 4 | -8 to 7 | 0.5 bytes |
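
Sebagai intuisi, sketsa berikut menunjukkan ide dasar quantization dengan skema INT8 simetris yang sederhana (bukan algoritma NF4 yang dipakai QLoRA): bobot float dipetakan ke integer kecil, lalu di-dequantize kembali dengan sedikit error.

import numpy as np

# Bobot "asli" dalam FP32
w = np.random.randn(5).astype(np.float32)

# Quantize ke INT8 simetris: skala berdasarkan nilai absolut maksimum
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)

# Dequantize kembali untuk dipakai saat komputasi
w_dequant = w_int8.astype(np.float32) * scale

print("Asli     :", w)
print("INT8     :", w_int8)
print("Dequant  :", w_dequant)
print("Max error:", np.abs(w - w_dequant).max())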

QLoRA Innovation:

  • Base model weights in 4-bit (frozen)
  • LoRA adapters in 16-bit (trainable)
  • Computation in 16-bit (for stability)

Memory Savings Example (LLaMA-7B):

import pandas as pd
import matplotlib.pyplot as plt

# Calculate memory requirements
model_params = 7e9  # 7 billion parameters

configs = {
    'Full FP32': model_params * 4 / (1024**3),      # 4 bytes
    'Full FP16': model_params * 2 / (1024**3),      # 2 bytes
    'LoRA (FP16)': model_params * 2 / (1024**3),    # Base model still FP16
    'QLoRA (4-bit)': model_params * 0.5 / (1024**3),  # 4-bit base
}

df = pd.DataFrame(list(configs.items()), columns=['Method', 'Memory (GB)'])

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.barh(df['Method'], df['Memory (GB)'], color=['#ef5350', '#ff7043', '#66bb6a', '#26a69a'])

ax.set_xlabel('GPU Memory Required (GB)', fontsize=12, fontweight='bold')
ax.set_title('Memory Requirements: LLaMA-7B Fine-tuning', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels
for i, (method, mem) in enumerate(zip(df['Method'], df['Memory (GB)'])):
    ax.text(mem + 0.5, i, f'{mem:.1f} GB', va='center', fontweight='bold')

# Add annotations
ax.text(28, 3.3, 'βœ… Fits in 4GB GPU!', fontsize=11, color='green', fontweight='bold')
ax.text(28, 0.3, '❌ Needs 32GB GPU', fontsize=11, color='red', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nπŸ“Š Summary:")
print(f"QLoRA reduces memory by {100*(1 - configs['QLoRA (4-bit)']/configs['Full FP32']):.1f}% vs Full FP32!")
print("This enables fine-tuning 7B models on consumer GPUs (RTX 3090, 4090)")

9.4.2 Implementasi QLoRA

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# ============================================================================
# QLoRA Configuration
# ============================================================================
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",             # Normal Float 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_use_double_quant=True,        # Nested quantization
)

# Load model dengan 4-bit quantization
model_name = "gpt2"  # Menggunakan GPT-2 untuk demo
print(f"πŸ”½ Loading {model_name} with 4-bit quantization...")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # Automatic device mapping
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2 attention module
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

print("\n" + "="*70)
model.print_trainable_parameters()

print("\nπŸ’Ύ Memory Usage:")
if torch.cuda.is_available():
    print(f"Allocated: {torch.cuda.memory_allocated(0)/1024**3:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved(0)/1024**3:.2f} GB")
else:
    print("CPU mode - GPU not available")

print("\nβœ… QLoRA model ready for training!")
print("   Base model: 4-bit (frozen)")
print("   LoRA adapters: 16-bit (trainable)")

Key BitsAndBytes Parameters:

  • load_in_4bit: Enable 4-bit quantization

  • bnb_4bit_quant_type:

    • "nf4" (NormalFloat4): Best for normally distributed weights
    • "fp4": Standard 4-bit float
  • bnb_4bit_compute_dtype: Computation precision (FP16/BF16)

  • bnb_4bit_use_double_quant: Quantize quantization constants (extra savings)

9.4.3 QLoRA Fine-tuning: Text Generation

Mari kita fine-tune GPT-2 untuk generate text dengan style tertentu!

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import torch

print("πŸš€ QLoRA Fine-tuning for Text Generation\n")

# ============================================================================
# 1. PREPARE DATASET
# ============================================================================
print("πŸ“š Loading dataset...")
# Menggunakan tiny dataset untuk demo
dataset = load_dataset("imdb", split="train[:500]")  # Small subset

# ============================================================================
# 2. LOAD MODEL WITH QUANTIZATION
# ============================================================================
print("\nπŸ€– Loading quantized model...")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tidak punya pad token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# ============================================================================
# 3. APPLY LORA
# ============================================================================
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj"],  # Attention modules
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ============================================================================
# 4. TOKENIZE
# ============================================================================
print("\nπŸ”€ Tokenizing...")

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=128,  # Shorter untuk demo
        padding='max_length'
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names
)

# ============================================================================
# 5. TRAINING SETUP
# ============================================================================
training_args = TrainingArguments(
    output_dir="./qlora_gpt2",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 16
    num_train_epochs=1,              # Just 1 epoch untuk demo
    learning_rate=2e-4,
    fp16=True,                       # Mixed precision
    logging_steps=25,
    save_strategy="no",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# ============================================================================
# 6. TRAIN
# ============================================================================
print("\nπŸ‹οΈ Training...")
trainer.train()

print("\nβœ… Training complete!")

# ============================================================================
# 7. GENERATE TEXT
# ============================================================================
print("\nπŸ“ Generating text samples:\n")

model.eval()

prompts = [
    "This movie is",
    "The acting was",
    "I really enjoyed",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=50,
            num_return_sequences=1,
            temperature=0.8,
            do_sample=True,
            top_p=0.9,
        )

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: '{prompt}'")
    print(f"Generated: {generated_text}\n")

# ============================================================================
# 8. SAVE
# ============================================================================
print("πŸ’Ύ Saving LoRA adapters...")
model.save_pretrained("./qlora_gpt2_adapters")
tokenizer.save_pretrained("./qlora_gpt2_adapters")
print("βœ… Saved! Adapter size: ~few MB")

9.5 Best Practices dan Tips

9.5.1 Hyperparameter Tuning untuk LoRA

Learning Rate:

  • LoRA biasanya butuh higher learning rate daripada full fine-tuning
  • Recommended: 1e-4 to 5e-4 (vs 1e-5 to 5e-5 untuk full fine-tuning)
  • Alasan: Fewer parameters = less prone to overfitting

Rank Selection:

import pandas as pd
import matplotlib.pyplot as plt

# Performance vs Rank (hypothetical data untuk ilustrasi)
ranks = [2, 4, 8, 16, 32, 64]
performance = [82.5, 85.2, 88.1, 89.3, 89.5, 89.6]  # Accuracy %
params_mb = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]  # Size in MB

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Performance vs Rank
ax1.plot(ranks, performance, marker='o', linewidth=2, markersize=8, color='#2196f3')
ax1.axhline(y=89.0, color='green', linestyle='--', label='Target Performance', alpha=0.7)
ax1.axvline(x=8, color='red', linestyle='--', label='Sweet Spot (r=8)', alpha=0.7)
ax1.set_xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Test Accuracy (%)', fontsize=12, fontweight='bold')
ax1.set_title('Performance vs LoRA Rank', fontsize=14, fontweight='bold')
ax1.grid(alpha=0.3)
ax1.legend()
ax1.set_ylim([80, 92])

# Size vs Rank
ax2.plot(ranks, params_mb, marker='s', linewidth=2, markersize=8, color='#ff9800')
ax2.fill_between(ranks, params_mb, alpha=0.3, color='#ff9800')
ax2.set_xlabel('LoRA Rank (r)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Adapter Size (MB)', fontsize=12, fontweight='bold')
ax2.set_title('Storage Cost vs LoRA Rank', fontsize=14, fontweight='bold')
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("πŸ“Š Rank Selection Guidelines:")
print("=" * 60)
print("r=4-8:   Good starting point, efficient")
print("r=16:    Better performance, reasonable cost")
print("r=32+:   Marginal gains, higher cost")
print("\nπŸ’‘ Recommendation: Start dengan r=8, increase jika underperforming")

Target Modules:

Different untuk different architectures:

# Target modules untuk different model architectures
target_configs = {
    "BERT": {
        "minimal": ["query", "value"],
        "recommended": ["query", "key", "value"],
        "full": ["query", "key", "value", "dense"]
    },
    "GPT-2": {
        "minimal": ["c_attn"],
        "recommended": ["c_attn", "c_proj"],
        "full": ["c_attn", "c_proj", "c_fc"]
    },
    "LLaMA": {
        "minimal": ["q_proj", "v_proj"],
        "recommended": ["q_proj", "k_proj", "v_proj"],
        "full": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
    }
}

print("🎯 Target Modules by Architecture:\n")
for model, configs in target_configs.items():
    print(f"{'='*60}\n{model}:")
    for level, modules in configs.items():
        print(f"  {level:12s}: {modules}")
    print()

9.5.2 Avoiding Common Pitfalls

⚠️ Common Mistakes

  1. Too Small Learning Rate
     • ❌ Problem: Using same LR as full fine-tuning (1e-5)
     • ✅ Solution: Use ~10x higher LR (1e-4 to 5e-4)

  2. Wrong Target Modules
     • ❌ Problem: Applying LoRA to wrong layers atau using wrong module names
     • ✅ Solution: Check model architecture, use model.named_modules() to verify

  3. Forgetting to Prepare Model
     • ❌ Problem: Not calling prepare_model_for_kbit_training() untuk QLoRA
     • ✅ Solution: Always call setelah loading quantized model

  4. Insufficient Data
     • ❌ Problem: Fine-tuning dengan <100 examples
     • ✅ Solution: Aim untuk 1000+ examples, atau use data augmentation

  5. No Validation Set
     • ❌ Problem: No way to detect overfitting
     • ✅ Solution: Always split train/val, monitor validation metrics

9.5.3 Debugging dan Monitoring

Sanity Checks:

def check_lora_model(model):
    """Comprehensive checks untuk LoRA model"""

    print("πŸ” LoRA Model Diagnostic\n")
    print("=" * 70)

    # 1. Check trainable parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(f"1️⃣ Parameter Count:")
    print(f"   Total: {total_params:,}")
    print(f"   Trainable: {trainable_params:,} ({100*trainable_params/total_params:.2f}%)")

    # 2. Check if LoRA modules exist
    print(f"\n2️⃣ LoRA Modules:")
    lora_modules = [name for name, module in model.named_modules()
                    if 'lora' in name.lower()]
    print(f"   Found {len(lora_modules)} LoRA modules")
    if len(lora_modules) > 0:
        print(f"   Examples: {lora_modules[:3]}")

    # 3. Check device
    print(f"\n3️⃣ Device:")
    device = next(model.parameters()).device
    print(f"   Model on: {device}")

    # 4. Check dtype
    print(f"\n4️⃣ Data Types:")
    dtypes = set([p.dtype for p in model.parameters()])
    print(f"   Found dtypes: {dtypes}")

    # 5. Gradient check
    print(f"\n5️⃣ Gradient Status:")
    grad_enabled = sum(1 for p in model.parameters() if p.requires_grad)
    print(f"   Parameters with grad: {grad_enabled}")

    print("\n" + "=" * 70)

    # Warnings
    if trainable_params / total_params > 0.05:
        print("⚠️  Warning: >5% parameters trainable. Sure you want LoRA?")
    if trainable_params / total_params < 0.001:
        print("⚠️  Warning: <0.1% parameters trainable. Rank too low?")

    print("βœ… Diagnostic complete!")

# Example usage (jika model sudah ada):
# check_lora_model(model)

9.5.4 Performance Optimization Tips

optimization_tips = """
πŸš€ PERFORMANCE OPTIMIZATION TIPS

1. GRADIENT ACCUMULATION
   - Batch size terbatas karena GPU memory?
   - Use gradient_accumulation_steps untuk simulate larger batch
   - Example: batch_size=4, accumulation=8 β†’ effective_batch=32

2. MIXED PRECISION TRAINING (FP16/BF16)
   - Set fp16=True atau bf16=True dalam TrainingArguments
   - ~2x speedup + ~2x memory reduction
   - BF16 better untuk stability (jika GPU support)

3. GRADIENT CHECKPOINTING
   - Trade computation untuk memory
   - Enable: model.gradient_checkpointing_enable()
   - ~30% slower tapi ~50% less memory

4. DATA LOADING
   - Set num_workers>0 dalam DataLoader
   - Use pin_memory=True untuk faster GPU transfer
   - Preprocess dataset sebelum training (map dengan batched=True)

5. COMPILATION (PyTorch 2.0+)
   - model = torch.compile(model)
   - ~20-30% speedup dengan minimal effort

6. EFFICIENT ATTENTION
   - Use Flash Attention (jika available)
   - Set use_flash_attention_2=True saat load model
   - ~2-3x faster untuk long sequences
"""

print(optimization_tips)
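
Sebagai contoh, beberapa tips di atas bisa digabung dalam satu konfigurasi TrainingArguments seperti sketsa berikut (nilai-nilainya hanya ilustrasi, sesuaikan dengan GPU dan dataset Anda):

import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./optimized_run",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch size = 32 (tip 1)
    fp16=torch.cuda.is_available(),    # mixed precision hanya jika ada GPU (tip 2)
    gradient_checkpointing=True,       # trade computation untuk memory (tip 3)
    dataloader_num_workers=2,          # data loading paralel (tip 4)
    dataloader_pin_memory=True,        # transfer ke GPU lebih cepat (tip 4)
    learning_rate=2e-4,
    num_train_epochs=3,
    report_to="none",
)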

9.6 Advanced Topics

9.6.1 Multi-task LoRA

Satu base model, multiple task-specific LoRA adapters!

from peft import PeftModel

# Scenario: Fine-tune untuk berbagai tasks
# 1. Sentiment analysis
# 2. Question answering
# 3. Named entity recognition

# Train separate LoRA for each task
# lora_sentiment = train_lora(task="sentiment")
# lora_qa = train_lora(task="qa")
# lora_ner = train_lora(task="ner")

# At inference, load appropriate adapter
def load_task_model(base_model, task):
    """Load LoRA adapter untuk specific task"""
    adapter_path = f"./lora_adapters/{task}"
    model = PeftModel.from_pretrained(base_model, adapter_path)
    return model

# Benefit:
# - Single base model (~440MB)
# - Multiple adapters (~2MB each)
# - Total storage: 440MB + 3×2MB = 446MB
# vs Full fine-tuning: 440MB × 3 = 1.32GB

print("πŸ’‘ Multi-task LoRA Benefits:")
print("  β€’ Storage efficient: Share base model")
print("  β€’ Memory efficient: Load only needed adapter")
print("  β€’ Easy to experiment: Train new task without affecting others")

9.6.2 LoRA Merging

Merge LoRA weights ke base model untuk faster inference!

# After training, merge LoRA adapters into base model
# This eliminates the adapter overhead at inference

# Dengan peft, merging dilakukan lewat method merge_and_unload()
# yang dipanggil pada PeftModel hasil training (bukan pada base model):
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained("./merged_model")

# Benefits:
# ✅ No adapter overhead at inference
# ✅ Faster inference (same speed as base model)
# ❌ Need full model storage (not just adapter)

print("πŸ”€ LoRA Merging:")
print("  Development: Use separate adapters (flexible, small)")
print("  Production: Merge untuk best inference speed")

9.6.3 AdaLoRA: Adaptive Rank

Automatically determine optimal rank untuk each module!

from peft import AdaLoraConfig, get_peft_model

# AdaLoRA config
adalora_config = AdaLoraConfig(
    init_r=12,                  # Initial rank untuk setiap adapter
    target_r=4,                 # Target average rank setelah pruning
    tinit=200,                  # Steps sebelum pruning dimulai
    tfinal=1000,                # Steps untuk mencapai target rank
    deltaT=10,                  # Interval (steps) antar update rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["query", "value"],
    task_type="SEQ_CLS"
)

# Apply AdaLoRA
# model = get_peft_model(base_model, adalora_config)

# How it works:
# 1. Start dengan rank r=12 untuk semua modules
# 2. During training, compute importance score untuk each module
# 3. Gradually reduce rank untuk less important modules
# 4. End dengan average rank ~4, tapi important modules bisa retain higher rank

print("🎯 AdaLoRA Advantages:")
print("  β€’ Automatically optimize rank per module")
print("  β€’ Better parameter efficiency")
print("  β€’ Comparable atau better performance vs fixed-rank LoRA")

9.7 Evaluation dan Comparison

9.7.1 Comparing Fine-tuning Methods

Mari kita bandingkan different approaches secara comprehensive:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Comparison data (based on typical BERT-Base fine-tuning)
data = {
    'Method': ['Full FT', 'Adapter', 'Prefix', 'LoRA (r=8)', 'LoRA (r=16)', 'BitFit'],
    'Trainable Params (%)': [100, 3.5, 0.5, 0.27, 0.54, 0.08],
    'Training Time (hrs)': [4.0, 2.5, 1.8, 1.5, 1.7, 1.2],
    'Test Accuracy (%)': [89.5, 87.2, 85.8, 88.9, 89.2, 83.5],
    'Storage (MB)': [440, 15, 5, 2, 4, 1],
}

df = pd.DataFrame(data)

# Create comparison plots
fig = plt.figure(figsize=(16, 10))
gs = fig.add_gridspec(2, 2, hspace=0.3, wspace=0.3)

# 1. Performance vs Parameters
ax1 = fig.add_subplot(gs[0, 0])
scatter = ax1.scatter(df['Trainable Params (%)'], df['Test Accuracy (%)'],
                     s=300, c=df.index, cmap='viridis', alpha=0.7, edgecolors='black', linewidth=2)

for i, method in enumerate(df['Method']):
    ax1.annotate(method, (df['Trainable Params (%)'][i], df['Test Accuracy (%)'][i]),
                xytext=(10, 5), textcoords='offset points', fontsize=9, fontweight='bold')

ax1.set_xlabel('Trainable Parameters (%)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Test Accuracy (%)', fontsize=12, fontweight='bold')
ax1.set_title('Accuracy vs Parameter Efficiency', fontsize=14, fontweight='bold')
ax1.grid(alpha=0.3)
ax1.set_xscale('log')

# Sweet spot annotation
ax1.annotate('LoRA Sweet Spot!', xy=(0.27, 88.9), xytext=(1, 86),
            arrowprops=dict(arrowstyle='->', color='green', lw=2),
            fontsize=11, color='green', fontweight='bold')

# 2. Training Time Comparison
ax2 = fig.add_subplot(gs[0, 1])
bars = ax2.barh(df['Method'], df['Training Time (hrs)'],
                color=plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(df))))
ax2.set_xlabel('Training Time (hours)', fontsize=12, fontweight='bold')
ax2.set_title('Training Efficiency', fontsize=14, fontweight='bold')
ax2.grid(axis='x', alpha=0.3)

for i, (bar, time) in enumerate(zip(bars, df['Training Time (hrs)'])):
    ax2.text(time + 0.1, bar.get_y() + bar.get_height()/2,
            f'{time:.1f}h', va='center', fontweight='bold')

# 3. Storage Requirements
ax3 = fig.add_subplot(gs[1, 0])
colors = ['#ef5350' if x > 100 else '#66bb6a' for x in df['Storage (MB)']]
bars = ax3.bar(df['Method'], df['Storage (MB)'], color=colors, alpha=0.7)
ax3.set_ylabel('Storage Size (MB)', fontsize=12, fontweight='bold')
ax3.set_title('Model Storage Requirements', fontsize=14, fontweight='bold')
ax3.set_yscale('log')
ax3.grid(axis='y', alpha=0.3)
plt.setp(ax3.xaxis.get_majorticklabels(), rotation=45, ha='right')

for bar, size in zip(bars, df['Storage (MB)']):
    height = bar.get_height()
    ax3.text(bar.get_x() + bar.get_width()/2., height*1.2,
            f'{size}MB', ha='center', va='bottom', fontweight='bold')

# 4. Overall Score (normalized)
ax4 = fig.add_subplot(gs[1, 1])

# Normalize metrics (higher is better)
norm_acc = df['Test Accuracy (%)'] / df['Test Accuracy (%)'].max()
norm_params = 1 - (df['Trainable Params (%)'] / df['Trainable Params (%)'].max())
norm_time = 1 - (df['Training Time (hrs)'] / df['Training Time (hrs)'].max())
norm_storage = 1 - (df['Storage (MB)'] / df['Storage (MB)'].max())

overall_score = (norm_acc + norm_params + norm_time + norm_storage) / 4

x = np.arange(len(df['Method']))
bars = ax4.bar(x, overall_score, color=plt.cm.RdYlGn(overall_score), alpha=0.7)
ax4.set_xticks(x)
ax4.set_xticklabels(df['Method'], rotation=45, ha='right')
ax4.set_ylabel('Overall Score (normalized)', fontsize=12, fontweight='bold')
ax4.set_title('Overall Comparison (Higher is Better)', fontsize=14, fontweight='bold')
ax4.set_ylim([0, 1])
ax4.grid(axis='y', alpha=0.3)

for bar, score in zip(bars, overall_score):
    height = bar.get_height()
    ax4.text(bar.get_x() + bar.get_width()/2., height + 0.02,
            f'{score:.2f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nπŸ“Š Comparison Summary:")
print("=" * 70)
best_idx = overall_score.argmax()
print(f"πŸ† Best Overall: {df['Method'][best_idx]} (score: {overall_score[best_idx]:.3f})")
print(f"🎯 Best Accuracy: {df.loc[df['Test Accuracy (%)'].idxmax(), 'Method']} ({df['Test Accuracy (%)'].max():.1f}%)")
print(f"⚑ Fastest: {df.loc[df['Training Time (hrs)'].idxmin(), 'Method']} ({df['Training Time (hrs)'].min():.1f}h)")
print(f"πŸ’Ύ Smallest: {df.loc[df['Storage (MB)'].idxmin(), 'Method']} ({df['Storage (MB)'].min():.0f}MB)")

9.7.2 When to Use What?

Decision Tree:

decision_guide = """
🎯 FINE-TUNING METHOD SELECTION GUIDE

β”Œβ”€ Do you have LARGE dataset (>10K samples)?
β”‚
β”œβ”€ YES β†’ Do you have GPU with >16GB VRAM?
β”‚   β”‚
β”‚   β”œβ”€ YES β†’ Consider FULL FINE-TUNING
β”‚   β”‚        β€’ Best performance
β”‚   β”‚        β€’ Worth the cost untuk production
β”‚   β”‚
β”‚   └─ NO β†’ Use LoRA (r=16-32)
β”‚            β€’ Good performance
β”‚            β€’ Manageable memory
β”‚
└─ NO β†’ Small dataset (<10K samples)
    β”‚
    β”œβ”€ Do you have GPU with >8GB VRAM?
    β”‚   β”‚
    β”‚   β”œβ”€ YES β†’ Use LoRA (r=8-16)
    β”‚   β”‚        β€’ Prevents overfitting
    β”‚   β”‚        β€’ Fast training
    β”‚   β”‚
    β”‚   └─ NO β†’ Use QLoRA (4-bit)
    β”‚            β€’ Minimal memory
    β”‚            β€’ Accessible untuk consumer GPU
    β”‚
    └─ VERY small dataset (<1K) β†’ Consider:
        β€’ Data augmentation
        β€’ Few-shot learning
        β€’ Prompt engineering instead of fine-tuning

πŸ’‘ SPECIAL CASES:

β€’ Multiple tasks? β†’ Multi-task LoRA
β€’ Need fastest inference? β†’ Merge LoRA after training
β€’ Extremely limited resources? β†’ BitFit atau Prefix Tuning
β€’ Research/experimentation? β†’ LoRA (flexible, fast iteration)
"""

print(decision_guide)

9.8 Praktik Terbaik: Complete Workflow

9.8.1 End-to-End Fine-tuning Pipeline

Mari kita buat production-ready pipeline untuk fine-tuning dengan best practices!

import os
import json
from datetime import datetime
from pathlib import Path

import torch
import numpy as np
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# ============================================================================
# CONFIGURATION
# ============================================================================
class Config:
    """Centralized configuration"""
    # Model
    model_name = "bert-base-uncased"
    task_name = "sentiment_analysis"

    # LoRA
    lora_r = 8
    lora_alpha = 16
    lora_dropout = 0.1
    lora_target_modules = ["query", "value"]

    # Training
    learning_rate = 2e-4
    batch_size = 16
    num_epochs = 3
    warmup_ratio = 0.1
    weight_decay = 0.01

    # Data
    max_length = 128
    train_samples = 1000
    test_samples = 200

    # Paths
    output_dir = f"./models/{task_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

    # Misc
    seed = 42
    fp16 = torch.cuda.is_available()

    def save(self, path):
        """Save config to JSON"""
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        # Semua attribute didefinisikan di level class, jadi ambil dari
        # vars(type(self)) -- self.__dict__ akan kosong di sini
        cfg = {k: v for k, v in vars(type(self)).items()
               if not k.startswith('_') and not callable(v)}
        with open(path, 'w') as f:
            json.dump(cfg, f, indent=2, default=str)

config = Config()

# ============================================================================
# SETUP
# ============================================================================
def set_seed(seed):
    """Set random seeds untuk reproducibility"""
    torch.manual_seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(config.seed)

print("πŸš€ Production Fine-tuning Pipeline")
print("=" * 70)
print(f"Task: {config.task_name}")
print(f"Model: {config.model_name}")
print(f"Output: {config.output_dir}")
print("=" * 70)

# ============================================================================
# DATA LOADING & PREPROCESSING
# ============================================================================
print("\nπŸ“Š Loading and preprocessing data...")

dataset = load_dataset("imdb")
train_data = dataset['train'].shuffle(seed=config.seed).select(range(config.train_samples))
test_data = dataset['test'].shuffle(seed=config.seed).select(range(config.test_samples))

tokenizer = AutoTokenizer.from_pretrained(config.model_name)

def preprocess_function(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=config.max_length
    )

train_dataset = train_data.map(preprocess_function, batched=True)
test_dataset = test_data.map(preprocess_function, batched=True)

train_dataset = train_dataset.rename_column("label", "labels")
test_dataset = test_dataset.rename_column("label", "labels")

train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print(f"βœ… Train samples: {len(train_dataset)}")
print(f"βœ… Test samples: {len(test_dataset)}")

# ============================================================================
# MODEL SETUP
# ============================================================================
print("\nπŸ€– Setting up model with LoRA...")

model = AutoModelForSequenceClassification.from_pretrained(
    config.model_name,
    num_labels=2
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=config.lora_r,
    lora_alpha=config.lora_alpha,
    lora_dropout=config.lora_dropout,
    target_modules=config.lora_target_modules,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ============================================================================
# METRICS
# ============================================================================
def compute_metrics(eval_pred):
    """Comprehensive metrics"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='binary'
    )

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

# ============================================================================
# TRAINING SETUP
# ============================================================================
print("\nβš™οΈ Configuring training...")

training_args = TrainingArguments(
    output_dir=config.output_dir,
    learning_rate=config.learning_rate,
    per_device_train_batch_size=config.batch_size,
    per_device_eval_batch_size=config.batch_size,
    num_train_epochs=config.num_epochs,
    weight_decay=config.weight_decay,
    warmup_ratio=config.warmup_ratio,

    # Evaluation
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",

    # Optimization
    fp16=config.fp16,
    gradient_accumulation_steps=2,

    # Logging
    logging_dir=f"{config.output_dir}/logs",
    logging_steps=50,
    report_to="none",

    # Misc
    seed=config.seed,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# ============================================================================
# TRAINING
# ============================================================================
print("\nπŸ‹οΈ Training...\n")

# Baseline evaluation
print("πŸ“Š Baseline (before fine-tuning):")
baseline_metrics = trainer.evaluate()
for key, value in baseline_metrics.items():
    if key.startswith('eval_'):
        print(f"  {key[5:]}: {value:.4f}")

# Train
train_result = trainer.train()

# Final evaluation
print("\nπŸ“Š Final (after fine-tuning):")
final_metrics = trainer.evaluate()
for key, value in final_metrics.items():
    if key.startswith('eval_'):
        print(f"  {key[5:]}: {value:.4f}")

# ============================================================================
# SAVE RESULTS
# ============================================================================
print("\nπŸ’Ύ Saving model and results...")

# Save model
model.save_pretrained(f"{config.output_dir}/lora_adapter")
tokenizer.save_pretrained(f"{config.output_dir}/lora_adapter")

# Save config
config.save(f"{config.output_dir}/config.json")

# Save metrics
metrics_history = {
    'baseline': baseline_metrics,
    'final': final_metrics,
    'training': train_result.metrics
}

with open(f"{config.output_dir}/metrics.json", 'w') as f:
    json.dump(metrics_history, f, indent=2)

print(f"βœ… Saved to {config.output_dir}")

# ============================================================================
# VISUALIZATION
# ============================================================================
print("\nπŸ“ˆ Generating visualizations...")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Metrics comparison
metrics_names = ['accuracy', 'precision', 'recall', 'f1']
baseline_values = [baseline_metrics[f'eval_{m}'] for m in metrics_names]
final_values = [final_metrics[f'eval_{m}'] for m in metrics_names]

x = np.arange(len(metrics_names))
width = 0.35

axes[0].bar(x - width/2, baseline_values, width, label='Baseline', color='coral', alpha=0.7)
axes[0].bar(x + width/2, final_values, width, label='Fine-tuned', color='green', alpha=0.7)
axes[0].set_ylabel('Score', fontweight='bold')
axes[0].set_title('Performance Comparison', fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels([m.capitalize() for m in metrics_names])
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)
axes[0].set_ylim([0, 1])

# Improvement percentages
improvements = [(f - b) * 100 for b, f in zip(baseline_values, final_values)]
colors = ['green' if i > 0 else 'red' for i in improvements]
axes[1].barh(metrics_names, improvements, color=colors, alpha=0.7)
axes[1].set_xlabel('Improvement (%)', fontweight='bold')
axes[1].set_title('Performance Gains', fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)
axes[1].axvline(x=0, color='black', linestyle='-', linewidth=0.8)

for i, (metric, imp) in enumerate(zip(metrics_names, improvements)):
    axes[1].text(imp + 1, i, f'+{imp:.1f}%' if imp > 0 else f'{imp:.1f}%',
                va='center', fontweight='bold')

plt.tight_layout()
plt.savefig(f"{config.output_dir}/results.png", dpi=150, bbox_inches='tight')
plt.show()

print(f"βœ… Saved visualization to {config.output_dir}/results.png")

# ============================================================================
# SUMMARY
# ============================================================================
print("\n" + "=" * 70)
print("✨ FINE-TUNING COMPLETE!")
print("=" * 70)
print(f"\nπŸ“ Output Directory: {config.output_dir}")
print(f"   β”œβ”€β”€ lora_adapter/          (LoRA weights)")
print(f"   β”œβ”€β”€ config.json            (Training configuration)")
print(f"   β”œβ”€β”€ metrics.json           (Performance metrics)")
print(f"   └── results.png            (Visualizations)")

print(f"\nπŸ“Š Key Results:")
print(f"   Accuracy: {baseline_metrics['eval_accuracy']:.3f} β†’ {final_metrics['eval_accuracy']:.3f} (+{(final_metrics['eval_accuracy']-baseline_metrics['eval_accuracy'])*100:.1f}%)")
print(f"   F1 Score: {baseline_metrics['eval_f1']:.3f} β†’ {final_metrics['eval_f1']:.3f} (+{(final_metrics['eval_f1']-baseline_metrics['eval_f1'])*100:.1f}%)")
print(f"\n⏱️  Training Time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"πŸ’Ύ Model Size: ~{config.lora_r * 2} MB (LoRA adapters only)")

print("\nπŸŽ‰ Ready for deployment!")

🧪 Hands-on Exercise

Objektif: Implement dan compare LoRA fine-tuning dengan different ranks

Instruksi:

  1. Setup Environment

    pip install transformers datasets peft torch
  2. Task: Fine-tune BERT untuk sentiment analysis (IMDB dataset)

  3. Experiments: Train dengan different LoRA ranks

    • Experiment 1: r=4
    • Experiment 2: r=8
    • Experiment 3: r=16
  4. Compare:

    • Training time
    • Model size
    • Test accuracy
    • F1 score
  5. Analysis:

    • Buat plot performance vs rank
    • Determine optimal rank untuk task ini
    • Explain tradeoffs

Deliverables:

  • Jupyter notebook dengan complete code
  • Performance comparison table
  • Visualizations
  • Written analysis (200-300 words)

Bonus:

  • Try QLoRA (4-bit quantization)
  • Implement early stopping
  • Try different target modules

📝 Review Questions

Conceptual Questions

  1. Jelaskan perbedaan fundamental antara pre-training dan fine-tuning dalam konteks LLM. Mengapa fine-tuning lebih praktis daripada training from scratch?

  2. Apa yang dimaksud dengan "low-rank" dalam LoRA? Jelaskan intuisi matematika di balik mengapa weight updates memiliki low intrinsic rank.

  3. Bandingkan Full Fine-tuning vs LoRA. Dalam situasi apa Anda akan memilih masing-masing approach?

  4. Jelaskan konsep quantization dalam QLoRA. Bagaimana 4-bit quantization memungkinkan fine-tuning model besar di GPU dengan memory terbatas?

  5. Apa yang dimaksud dengan "catastrophic forgetting"? Bagaimana PEFT methods seperti LoRA membantu mengurangi masalah ini?

Practical Questions

  1. Jika BERT-Base memiliki 110M parameters dan Anda apply LoRA dengan r=8 pada query dan value matrices di semua 12 layers, berapa total trainable parameters?

  2. Anda punya dataset dengan 500 labeled samples untuk text classification. Metode fine-tuning apa yang Anda rekomendasikan dan mengapa?

  3. Jelaskan parameter lora_alpha dalam LoRA config. Bagaimana Anda akan set nilai ini relative terhadap rank?

  4. Anda observe bahwa model overfit pada training set (train acc = 95%, test acc = 65%). Apa actions yang bisa Anda lakukan dalam konteks LoRA fine-tuning?

  5. Untuk production deployment, lebih baik save LoRA adapters terpisah atau merge ke base model? Jelaskan tradeoffs.


🎯 Key Takeaways

✅ Transfer Learning memungkinkan kita leverage pre-trained LLMs tanpa training from scratch

✅ LoRA adalah parameter-efficient method yang achieve ~99% performance dengan <1% trainable parameters

✅ QLoRA combines quantization dengan LoRA untuk fine-tune model besar di consumer hardware

✅ Rank (r) adalah critical hyperparameter: balance antara expressiveness dan efficiency

✅ Multi-task LoRA allows sharing base model across different tasks dengan separate adapters

✅ Best practices: Start dengan r=8, use higher LR (1e-4), monitor validation metrics


📚 References dan Further Reading

Papers

  • LoRA: Hu et al. (2021). β€œLoRA: Low-Rank Adaptation of Large Language Models”. arXiv:2106.09685

  • QLoRA: Dettmers et al. (2023). β€œQLoRA: Efficient Finetuning of Quantized LLMs”. arXiv:2305.14314

  • AdaLoRA: Zhang et al. (2023). β€œAdaptive Budget Allocation for Parameter-Efficient Fine-Tuning”. arXiv:2303.10512

  • Prefix Tuning: Li & Liang (2021). β€œPrefix-Tuning: Optimizing Continuous Prompts for Generation”. arXiv:2101.00190

Libraries & Tools

  • Hugging Face PEFT: https://github.com/huggingface/peft
  • Hugging Face Transformers: https://huggingface.co/docs/transformers
  • BitsAndBytes: https://github.com/TimDettmers/bitsandbytes

Tutorials & Blogs

  • Hugging Face PEFT Documentation: https://huggingface.co/docs/peft
  • LoRA Tutorial: https://huggingface.co/blog/lora
  • QLoRA Blog Post: https://huggingface.co/blog/4bit-transformers-bitsandbytes

✨ Kesimpulan

Anda sekarang memiliki pemahaman mendalam tentang fine-tuning Large Language Models:

  • Fundamental concepts: pre-training vs fine-tuning
  • Parameter-Efficient Fine-Tuning (PEFT) methods
  • LoRA dan QLoRA implementation
  • Best practices untuk production deployment
  • Evaluation dan comparison frameworks

Di Bab 10 (RAG & AI Agents), kita akan belajar bagaimana menggunakan fine-tuned LLMs dalam sistem yang lebih kompleks: Retrieval-Augmented Generation dan autonomous agents yang dapat interact dengan external tools dan knowledge bases! πŸš€

