Lab 8: Fine-tuning BERT for Sentiment Analysis

Transfer Learning dengan Hugging Face Transformers & Pre-trained Models

Author

Machine Learning - Data Science for Cybersecurity

Published

December 15, 2025

19 Introduction

19.1 Learning Objectives

After completing this lab, you should be able to:

  1. Understand the concept of transfer learning with pre-trained transformers
  2. Use the Hugging Face Transformers library
  3. Perform tokenization with the BERT tokenizer
  4. Fine-tune BERT for sentiment analysis
  5. Evaluate the model with a variety of metrics
  6. Implement an inference pipeline for production
  7. Optimize the model for deployment
  8. Visualize attention weights for interpretability

19.2 Lab Overview

This lab focuses on fine-tuning BERT for sentiment analysis using the Hugging Face Transformers library.

19.2.1 Dataset: IMDB Movie Reviews

Characteristics:

  • Domain: Movie reviews
  • Task: Binary sentiment classification (positive/negative)
  • Size: 50,000 reviews (25k train, 25k test)
  • Features: Text reviews
  • Labels: 0 (negative), 1 (positive)
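The label encoding above can be captured in a small mapping (the string names are our own convention for readability; the dataset itself stores only the integers):

```python
# Label encoding used throughout this lab: 0 = negative, 1 = positive
ID2LABEL = {0: "negative", 1: "positive"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```

Keeping both directions of the mapping in one place avoids hard-coding magic numbers later in the lab.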

19.2.2 Lab Structure

graph TD
    A[Setup & Installation] --> B[Data Exploration]
    B --> C[Tokenization]
    C --> D[Model Loading]
    D --> E[Fine-tuning]
    E --> F[Evaluation]
    F --> G[Inference]
    G --> H[Optimization]

    style A fill:#e6f3ff
    style B fill:#ffe6e6
    style C fill:#ffffcc
    style D fill:#ccffcc
    style E fill:#e6ccff
    style F fill:#ffcccc
    style G fill:#ccffe6
    style H fill:#ffccff

19.3 Environment Setup

19.3.1 Install Dependencies

# Install required packages
import subprocess
import sys

packages = [
    'transformers',
    'datasets',
    'evaluate',
    'accelerate',
    'sentencepiece',
    'sacremoses'
]

for package in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

print("✓ All packages installed successfully!")

19.3.2 Import Libraries

# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Transformers
import transformers  # module import so the version can be printed below
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)

# Datasets (load_metric was removed from recent `datasets` releases and is unused here)
from datasets import load_dataset

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Utilities
from tqdm.auto import tqdm
import json
from datetime import datetime

print("✓ Imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

19.3.3 Setup Directories

# Create directories
dirs = {
    'data': Path('data'),
    'models': Path('models'),
    'checkpoints': Path('checkpoints'),
    'figures': Path('figures'),
    'logs': Path('logs'),
    'predictions': Path('predictions'),
    'cache': Path('cache')
}

for name, path in dirs.items():
    path.mkdir(exist_ok=True, parents=True)
    print(f"✓ {name}: {path}")

19.3.4 Configure Device

def setup_device():
    """Setup computation device (GPU/CPU)"""
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f"✓ Using GPU: {torch.cuda.get_device_name(0)}")
        print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    else:
        device = torch.device('cpu')
        print("⚠ GPU not available. Using CPU")

    return device

device = setup_device()

19.3.5 Global Configuration

# Hyperparameters
CONFIG = {
    'model_name': 'bert-base-uncased',
    'max_length': 256,
    'batch_size': 16,
    'learning_rate': 2e-5,
    'num_epochs': 3,
    'warmup_steps': 500,
    'weight_decay': 0.01,
    'seed': 42,
    'num_labels': 2,
    'train_subset': None,  # None for the full dataset, or an int for a subset
    'eval_subset': None
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

# Set random seeds
torch.manual_seed(CONFIG['seed'])
np.random.seed(CONFIG['seed'])
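The seeding above can be wrapped in one helper that also covers Python's built-in `random` module and, when PyTorch is installed, its CPU and CUDA generators (the helper name `set_all_seeds` is our own choice):

```python
import random

import numpy as np

def set_all_seeds(seed: int) -> None:
    """Seed Python, NumPy, and (if installed) PyTorch for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op on machines without a GPU
    except ImportError:
        pass
```

Calling `set_all_seeds(CONFIG['seed'])` once at the top of a run keeps all three sources of randomness in sync.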

20 Part 1: Data Loading & Exploration

20.1 Load IMDB Dataset

def load_imdb_data():
    """Load IMDB dataset from Hugging Face"""
    print("Loading IMDB dataset...")

    # Load dataset
    dataset = load_dataset('imdb', cache_dir=dirs['cache'])

    print(f"\n✓ Dataset loaded!")
    print(f"  Train samples: {len(dataset['train']):,}")
    print(f"  Test samples: {len(dataset['test']):,}")

    # Inspect structure
    print(f"\nDataset structure:")
    print(dataset)

    # Show sample
    print(f"\nSample review:")
    sample = dataset['train'][0]
    print(f"  Text: {sample['text'][:200]}...")
    print(f"  Label: {sample['label']} ({'Positive' if sample['label'] == 1 else 'Negative'})")

    return dataset

# Load data
dataset = load_imdb_data()

20.2 Data Exploration

20.2.1 Basic Statistics

def explore_dataset(dataset):
    """Explore dataset statistics"""

    print("=" * 80)
    print("DATASET EXPLORATION")
    print("=" * 80)

    # Label distribution
    train_labels = [sample['label'] for sample in dataset['train']]
    test_labels = [sample['label'] for sample in dataset['test']]

    print("\n1. LABEL DISTRIBUTION:")
    print(f"   Train - Negative: {train_labels.count(0):,} ({train_labels.count(0)/len(train_labels)*100:.1f}%)")
    print(f"   Train - Positive: {train_labels.count(1):,} ({train_labels.count(1)/len(train_labels)*100:.1f}%)")
    print(f"   Test  - Negative: {test_labels.count(0):,} ({test_labels.count(0)/len(test_labels)*100:.1f}%)")
    print(f"   Test  - Positive: {test_labels.count(1):,} ({test_labels.count(1)/len(test_labels)*100:.1f}%)")

    # Text length distribution
    train_lengths = [len(sample['text'].split()) for sample in dataset['train']]
    test_lengths = [len(sample['text'].split()) for sample in dataset['test']]

    print("\n2. TEXT LENGTH STATISTICS (words):")
    print(f"   Train - Mean: {np.mean(train_lengths):.1f}, Median: {np.median(train_lengths):.1f}")
    print(f"   Train - Min: {np.min(train_lengths)}, Max: {np.max(train_lengths)}")
    print(f"   Test  - Mean: {np.mean(test_lengths):.1f}, Median: {np.median(test_lengths):.1f}")
    print(f"   Test  - Min: {np.min(test_lengths)}, Max: {np.max(test_lengths)}")

    print("=" * 80)

    return train_lengths, test_lengths

train_lengths, test_lengths = explore_dataset(dataset)

20.2.2 Visualization

def visualize_data_distribution(dataset, train_lengths, test_lengths):
    """Visualize data distribution"""

    fig, axes = plt.subplots(2, 2, figsize=(16, 12))

    # 1. Label distribution
    train_labels = [sample['label'] for sample in dataset['train']]
    test_labels = [sample['label'] for sample in dataset['test']]

    label_counts_train = [train_labels.count(0), train_labels.count(1)]
    label_counts_test = [test_labels.count(0), test_labels.count(1)]

    x = np.arange(2)
    width = 0.35

    axes[0, 0].bar(x - width/2, label_counts_train, width, label='Train', alpha=0.8)
    axes[0, 0].bar(x + width/2, label_counts_test, width, label='Test', alpha=0.8)
    axes[0, 0].set_xlabel('Label', fontsize=12, fontweight='bold')
    axes[0, 0].set_ylabel('Count', fontsize=12, fontweight='bold')
    axes[0, 0].set_title('Label Distribution', fontsize=14, fontweight='bold')
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels(['Negative (0)', 'Positive (1)'])
    axes[0, 0].legend(fontsize=11)
    axes[0, 0].grid(axis='y', alpha=0.3)

    # 2. Text length distribution
    axes[0, 1].hist(train_lengths, bins=50, alpha=0.7, label='Train', color='steelblue', edgecolor='black')
    axes[0, 1].hist(test_lengths, bins=50, alpha=0.7, label='Test', color='coral', edgecolor='black')
    axes[0, 1].set_xlabel('Number of Words', fontsize=12, fontweight='bold')
    axes[0, 1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
    axes[0, 1].set_title('Text Length Distribution', fontsize=14, fontweight='bold')
    axes[0, 1].legend(fontsize=11)
    axes[0, 1].grid(alpha=0.3)

    # 3. Box plot
    axes[1, 0].boxplot([train_lengths, test_lengths], labels=['Train', 'Test'])
    axes[1, 0].set_ylabel('Number of Words', fontsize=12, fontweight='bold')
    axes[1, 0].set_title('Text Length Box Plot', fontsize=14, fontweight='bold')
    axes[1, 0].grid(axis='y', alpha=0.3)

    # 4. Cumulative distribution
    train_sorted = np.sort(train_lengths)
    test_sorted = np.sort(test_lengths)
    train_cumsum = np.arange(1, len(train_sorted) + 1) / len(train_sorted)
    test_cumsum = np.arange(1, len(test_sorted) + 1) / len(test_sorted)

    axes[1, 1].plot(train_sorted, train_cumsum, label='Train', linewidth=2)
    axes[1, 1].plot(test_sorted, test_cumsum, label='Test', linewidth=2)
    axes[1, 1].axhline(0.95, color='red', linestyle='--', label='95th percentile', alpha=0.7)
    axes[1, 1].set_xlabel('Number of Words', fontsize=12, fontweight='bold')
    axes[1, 1].set_ylabel('Cumulative Probability', fontsize=12, fontweight='bold')
    axes[1, 1].set_title('Cumulative Distribution', fontsize=14, fontweight='bold')
    axes[1, 1].legend(fontsize=11)
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'data_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()

visualize_data_distribution(dataset, train_lengths, test_lengths)

20.2.3 Sample Reviews

def show_sample_reviews(dataset, num_samples=3):
    """Display sample reviews"""

    print("\n" + "=" * 80)
    print("SAMPLE REVIEWS")
    print("=" * 80)

    for sentiment in [0, 1]:
        sentiment_name = "NEGATIVE" if sentiment == 0 else "POSITIVE"
        print(f"\n{sentiment_name} REVIEWS:")

        # Filter by sentiment
        samples = [s for s in dataset['train'] if s['label'] == sentiment][:num_samples]

        for i, sample in enumerate(samples, 1):
            text = sample['text']
            # Truncate for display
            if len(text) > 300:
                text = text[:300] + "..."

            print(f"\n  [{i}] {text}")

    print("=" * 80)

show_sample_reviews(dataset, num_samples=2)

21 Part 2: Tokenization & Data Preprocessing

21.1 BERT Tokenizer

21.1.1 Load Tokenizer

def load_tokenizer(model_name):
    """Load pre-trained tokenizer"""
    print(f"Loading tokenizer: {model_name}...")

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        cache_dir=dirs['cache']
    )

    print(f"✓ Tokenizer loaded!")
    print(f"  Vocabulary size: {tokenizer.vocab_size:,}")
    print(f"  Model max length: {tokenizer.model_max_length:,}")

    # Special tokens
    print(f"\n  Special tokens:")
    print(f"    PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
    print(f"    CLS: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
    print(f"    SEP: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
    print(f"    UNK: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")
    print(f"    MASK: {tokenizer.mask_token} (ID: {tokenizer.mask_token_id})")

    return tokenizer

tokenizer = load_tokenizer(CONFIG['model_name'])

21.1.2 Tokenization Examples

def demonstrate_tokenization(tokenizer):
    """Demonstrate tokenization process"""

    print("\n" + "=" * 80)
    print("TOKENIZATION DEMONSTRATION")
    print("=" * 80)

    examples = [
        "This movie is great!",
        "Terrible acting and boring plot.",
        "Transformers are revolutionizing NLP."
    ]

    for i, text in enumerate(examples, 1):
        print(f"\n{i}. Original text:")
        print(f"   '{text}'")

        # Tokenize
        tokens = tokenizer.tokenize(text)
        print(f"\n   Tokens: {tokens}")

        # Convert to IDs
        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        print(f"   Token IDs: {token_ids}")

        # Full encoding
        encoding = tokenizer(
            text,
            add_special_tokens=True,
            return_tensors='pt'
        )
        print(f"\n   Input IDs: {encoding['input_ids'][0].tolist()}")
        print(f"   Attention mask: {encoding['attention_mask'][0].tolist()}")

        # Decode
        decoded = tokenizer.decode(encoding['input_ids'][0])
        print(f"   Decoded: '{decoded}'")

    print("=" * 80)

demonstrate_tokenization(tokenizer)

21.1.3 Tokenize Dataset

def tokenize_dataset(dataset, tokenizer, max_length):
    """
    Tokenize entire dataset

    Parameters:
        dataset: HuggingFace dataset
        tokenizer: Pre-trained tokenizer
        max_length: Maximum sequence length

    Returns:
        tokenized_dataset: Tokenized dataset
    """
    print(f"Tokenizing dataset (max_length={max_length})...")

    def tokenize_function(examples):
        """Tokenize batch of examples"""
        return tokenizer(
            examples['text'],
            padding='max_length',
            truncation=True,
            max_length=max_length,
            return_tensors=None
        )

    # Tokenize
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        desc="Tokenizing"
    )

    print(f"✓ Tokenization complete!")
    print(f"\nTokenized dataset:")
    print(tokenized_dataset)

    # Show sample
    sample = tokenized_dataset['train'][0]
    print(f"\nSample tokenized review:")
    print(f"  Input IDs length: {len(sample['input_ids'])}")
    print(f"  Attention mask length: {len(sample['attention_mask'])}")
    print(f"  Label: {sample['label']}")

    return tokenized_dataset

tokenized_dataset = tokenize_dataset(dataset, tokenizer, CONFIG['max_length'])

21.1.4 Analyze Tokenization

def analyze_tokenization(tokenized_dataset):
    """Analyze tokenization statistics"""

    print("\n" + "=" * 80)
    print("TOKENIZATION ANALYSIS")
    print("=" * 80)

    # Calculate effective lengths (excluding padding), vectorized over each split
    train_eff_lengths = np.array(tokenized_dataset['train']['attention_mask']).sum(axis=1)
    test_eff_lengths = np.array(tokenized_dataset['test']['attention_mask']).sum(axis=1)

    print(f"\n1. EFFECTIVE TOKEN LENGTHS:")
    print(f"   Train - Mean: {np.mean(train_eff_lengths):.1f}, Median: {np.median(train_eff_lengths):.1f}")
    print(f"   Train - Min: {np.min(train_eff_lengths)}, Max: {np.max(train_eff_lengths)}")
    print(f"   Test  - Mean: {np.mean(test_eff_lengths):.1f}, Median: {np.median(test_eff_lengths):.1f}")

    # Truncation analysis: sequences that fill max_length were almost certainly truncated
    max_len = CONFIG['max_length']
    train_truncated = sum(1 for l in train_eff_lengths if l == max_len)
    test_truncated = sum(1 for l in test_eff_lengths if l == max_len)

    print(f"\n2. TRUNCATION:")
    print(f"   Train samples truncated: {train_truncated:,} ({train_truncated/len(train_eff_lengths)*100:.2f}%)")
    print(f"   Test samples truncated: {test_truncated:,} ({test_truncated/len(test_eff_lengths)*100:.2f}%)")

    # Padding analysis
    total_tokens_train = len(train_eff_lengths) * max_len
    actual_tokens_train = sum(train_eff_lengths)
    padding_ratio_train = (total_tokens_train - actual_tokens_train) / total_tokens_train

    print(f"\n3. PADDING:")
    print(f"   Train padding ratio: {padding_ratio_train*100:.2f}%")

    print("=" * 80)

    return train_eff_lengths, test_eff_lengths

train_eff_lengths, test_eff_lengths = analyze_tokenization(tokenized_dataset)
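All three statistics above reduce to simple operations on the attention-mask matrix. A toy example with three sequences padded to `max_length=6` (numbers made up for illustration):

```python
import numpy as np

# Toy attention masks: 1 = real token, 0 = padding (max_length = 6)
masks = np.array([
    [1, 1, 1, 0, 0, 0],  # effective length 3
    [1, 1, 1, 1, 1, 1],  # effective length 6 -> hit max_length, likely truncated
    [1, 1, 1, 1, 0, 0],  # effective length 4
])

eff_lengths = masks.sum(axis=1)                          # tokens per sequence
num_truncated = int((eff_lengths == masks.shape[1]).sum())  # sequences at the cap
padding_ratio = 1 - eff_lengths.sum() / masks.size       # wasted (padded) cells
```

Here 13 of 18 cells are real tokens, so the padding ratio is 5/18 ≈ 28%.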

22 Part 3: Model Loading & Configuration

22.1 Load Pre-trained BERT

def load_pretrained_bert(model_name, num_labels):
    """Load pre-trained BERT for sequence classification"""

    print(f"Loading pre-trained model: {model_name}...")

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        cache_dir=dirs['cache']
    )

    # Move to device
    model = model.to(device)

    print(f"✓ Model loaded!")
    print(f"  Total parameters: {model.num_parameters():,}")

    # Count trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"  Trainable parameters: {trainable_params:,}")

    # Model architecture
    print(f"\n  Model architecture:")
    print(f"    BERT encoder: {model.config.num_hidden_layers} layers")
    print(f"    Hidden size: {model.config.hidden_size}")
    print(f"    Attention heads: {model.config.num_attention_heads}")
    print(f"    Intermediate size: {model.config.intermediate_size}")

    return model

model = load_pretrained_bert(CONFIG['model_name'], CONFIG['num_labels'])

22.2 Model Summary

def print_model_summary(model):
    """Print detailed model summary"""

    print("\n" + "=" * 80)
    print("MODEL ARCHITECTURE")
    print("=" * 80)

    # Layer-wise parameters
    print("\nLayer-wise parameters:")
    total_params = 0

    for name, param in model.named_parameters():
        param_count = param.numel()
        total_params += param_count

        # Only print major layers
        if 'layer' in name or 'classifier' in name or 'embeddings' in name:
            print(f"  {name:60s}: {param_count:>12,}")

    print(f"\n  {'Total':60s}: {total_params:>12,}")
    print("=" * 80)

print_model_summary(model)

23 Part 4: Training Configuration

23.1 Training Arguments

def create_training_arguments():
    """Create training configuration"""

    training_args = TrainingArguments(
        # Output
        output_dir=str(dirs['checkpoints']),

        # Training hyperparameters
        num_train_epochs=CONFIG['num_epochs'],
        per_device_train_batch_size=CONFIG['batch_size'],
        per_device_eval_batch_size=CONFIG['batch_size'] * 2,
        learning_rate=CONFIG['learning_rate'],
        weight_decay=CONFIG['weight_decay'],
        warmup_steps=CONFIG['warmup_steps'],

        # Evaluation & logging
        evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
        save_strategy="epoch",
        logging_dir=str(dirs['logs']),
        logging_steps=100,
        logging_first_step=True,

        # Model saving
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        save_total_limit=3,

        # Hardware
        fp16=torch.cuda.is_available(),  # Mixed precision if GPU available
        dataloader_num_workers=4,

        # Misc
        seed=CONFIG['seed'],
        report_to="none"  # Disable W&B, TensorBoard
    )

    print("Training configuration:")
    print(f"  Epochs: {training_args.num_train_epochs}")
    print(f"  Batch size: {training_args.per_device_train_batch_size}")
    print(f"  Learning rate: {training_args.learning_rate}")
    print(f"  Warmup steps: {training_args.warmup_steps}")
    print(f"  Weight decay: {training_args.weight_decay}")
    print(f"  Mixed precision (fp16): {training_args.fp16}")

    return training_args

training_args = create_training_arguments()

23.2 Evaluation Metrics

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics

    Parameters:
        eval_pred: Tuple of (predictions, labels)

    Returns:
        metrics: Dictionary of metrics
    """
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    # Accuracy
    accuracy = accuracy_score(labels, predictions)

    # Precision, Recall, F1
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels,
        predictions,
        average='binary'
    )

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

print("✓ Metrics function defined")
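On a toy batch, the same logic as `compute_metrics` can be checked by hand with plain NumPy (the logits and labels below are made up for illustration):

```python
import numpy as np

# Toy logits for 4 samples over 2 classes, plus true labels
logits = np.array([[2.0, 0.5], [0.1, 1.2], [1.5, 1.4], [0.3, 0.9]])
labels = np.array([0, 1, 1, 1])

preds = np.argmax(logits, axis=1)  # predicted class per row: [0, 1, 0, 1]
tp = int(((preds == 1) & (labels == 1)).sum())
fp = int(((preds == 1) & (labels == 0)).sum())
fn = int(((preds == 0) & (labels == 1)).sum())

accuracy = float((preds == labels).mean())          # 3 of 4 correct
precision = tp / (tp + fp)                          # 2 / 2
recall = tp / (tp + fn)                             # 2 / 3
f1 = 2 * precision * recall / (precision + recall)
```

Working through one small batch like this is a quick sanity check that the `average='binary'` metrics behave as expected before training starts.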

23.3 Data Collator

from transformers import DataCollatorWithPadding

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

print("✓ Data collator created")
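Conceptually, `DataCollatorWithPadding` pads each batch only to its longest member (dynamic padding) rather than to a global `max_length`. A simplified plain-Python sketch of that idea (not the library's actual implementation):

```python
def pad_batch(sequences, pad_id=0):
    """Pad a ragged batch of token-ID lists to the longest sequence in the batch."""
    batch_max = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (batch_max - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (batch_max - len(seq)) for seq in sequences]
    return input_ids, attention_mask
```

Note that because `tokenize_function` above already pads everything to `max_length`, the collator has little left to do in this lab; tokenizing with `padding=False` instead would let dynamic padding save compute on short reviews.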

24 Part 5: Fine-tuning

24.1 Create Trainer

def create_trainer(model, training_args, train_dataset, eval_dataset, tokenizer):
    """Create Hugging Face Trainer"""

    # Subset for faster training (if configured)
    if CONFIG['train_subset']:
        train_dataset = train_dataset.shuffle(seed=CONFIG['seed']).select(range(CONFIG['train_subset']))
        print(f"Using subset of training data: {len(train_dataset):,} samples")

    if CONFIG['eval_subset']:
        eval_dataset = eval_dataset.shuffle(seed=CONFIG['seed']).select(range(CONFIG['eval_subset']))
        print(f"Using subset of evaluation data: {len(eval_dataset):,} samples")

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    print("✓ Trainer created!")
    print(f"  Training samples: {len(train_dataset):,}")
    print(f"  Evaluation samples: {len(eval_dataset):,}")

    return trainer

trainer = create_trainer(
    model,
    training_args,
    tokenized_dataset['train'],
    tokenized_dataset['test'],
    tokenizer
)

24.2 Train Model

def train_model(trainer):
    """Execute training"""

    print("\n" + "=" * 80)
    print("STARTING TRAINING")
    print("=" * 80)

    # Start training
    train_result = trainer.train()

    # Training summary
    print("\n" + "=" * 80)
    print("TRAINING COMPLETE")
    print("=" * 80)

    metrics = train_result.metrics
    print(f"\nTraining metrics:")
    for key, value in metrics.items():
        print(f"  {key}: {value}")

    # Save model
    trainer.save_model(str(dirs['models'] / 'final_model'))
    print(f"\n✓ Model saved to: {dirs['models'] / 'final_model'}")

    return train_result

# Execute training
train_result = train_model(trainer)

24.3 Training History Visualization

def plot_training_history(trainer):
    """Plot training metrics"""

    # Extract history
    log_history = trainer.state.log_history

    # Separate train and eval logs
    train_logs = [log for log in log_history if 'loss' in log and 'eval_loss' not in log]
    eval_logs = [log for log in log_history if 'eval_loss' in log]

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Loss plot
    if train_logs:
        train_steps = [log['step'] for log in train_logs]
        train_loss = [log['loss'] for log in train_logs]
        axes[0].plot(train_steps, train_loss, label='Training Loss', linewidth=2)

    if eval_logs and train_logs:
        eval_epochs = [log['epoch'] for log in eval_logs]
        eval_loss = [log['eval_loss'] for log in eval_logs]
        # Approximate steps from epochs using the last logged training step
        steps_per_epoch = train_steps[-1] / eval_epochs[-1]
        eval_steps = [e * steps_per_epoch for e in eval_epochs]
        axes[0].plot(eval_steps, eval_loss, label='Validation Loss', linewidth=2, marker='o', markersize=8)

    axes[0].set_xlabel('Training Steps', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Loss', fontsize=12, fontweight='bold')
    axes[0].set_title('Training & Validation Loss', fontsize=14, fontweight='bold')
    axes[0].legend(fontsize=11)
    axes[0].grid(alpha=0.3)

    # Metrics plot
    if eval_logs:
        eval_epochs = [log['epoch'] for log in eval_logs]
        eval_accuracy = [log.get('eval_accuracy', 0) for log in eval_logs]
        eval_f1 = [log.get('eval_f1', 0) for log in eval_logs]

        axes[1].plot(eval_epochs, eval_accuracy, label='Accuracy', linewidth=2, marker='o')
        axes[1].plot(eval_epochs, eval_f1, label='F1 Score', linewidth=2, marker='s')
        axes[1].set_xlabel('Epoch', fontsize=12, fontweight='bold')
        axes[1].set_ylabel('Score', fontsize=12, fontweight='bold')
        axes[1].set_title('Validation Metrics', fontsize=14, fontweight='bold')
        axes[1].legend(fontsize=11)
        axes[1].grid(alpha=0.3)
        axes[1].set_ylim([0, 1])

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'training_history.png', dpi=300, bbox_inches='tight')
    plt.show()

plot_training_history(trainer)

25 Part 6: Evaluation

25.1 Evaluate on Test Set

def evaluate_model(trainer):
    """Evaluate model on test set"""

    print("\n" + "=" * 80)
    print("EVALUATION ON TEST SET")
    print("=" * 80)

    # Evaluate
    eval_results = trainer.evaluate()

    print(f"\nTest set metrics:")
    for key, value in eval_results.items():
        if 'eval_' in key:
            metric_name = key.replace('eval_', '').upper()
            print(f"  {metric_name:15s}: {value:.4f}")

    return eval_results

eval_results = evaluate_model(trainer)

25.2 Predictions & Confusion Matrix

def analyze_predictions(trainer, dataset):
    """Analyze model predictions"""

    print("\nGenerating predictions...")

    # Get predictions
    predictions_output = trainer.predict(dataset)
    predictions = np.argmax(predictions_output.predictions, axis=1)
    true_labels = predictions_output.label_ids

    # Confusion matrix
    from sklearn.metrics import confusion_matrix, classification_report

    cm = confusion_matrix(true_labels, predictions)

    print("\nConfusion Matrix:")
    print(cm)

    # Classification report
    print("\nClassification Report:")
    print(classification_report(
        true_labels,
        predictions,
        target_names=['Negative', 'Positive']
    ))

    # Visualize confusion matrix
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Heatmap
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    axes[0].set_xlabel('Predicted', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('True', fontsize=12, fontweight='bold')
    axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')

    # Normalized confusion matrix
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='Greens', ax=axes[1],
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    axes[1].set_xlabel('Predicted', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('True', fontsize=12, fontweight='bold')
    axes[1].set_title('Normalized Confusion Matrix', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'confusion_matrix.png', dpi=300, bbox_inches='tight')
    plt.show()

    return predictions, true_labels

predictions, true_labels = analyze_predictions(trainer, tokenized_dataset['test'])

25.3 Error Analysis

def error_analysis(dataset, predictions, true_labels, tokenizer, num_examples=5):
    """Analyze misclassified examples"""

    print("\n" + "=" * 80)
    print("ERROR ANALYSIS")
    print("=" * 80)

    # Find misclassified
    misclassified_indices = np.where(predictions != true_labels)[0]

    print(f"\nTotal misclassified: {len(misclassified_indices):,} / {len(predictions):,}")
    print(f"Error rate: {len(misclassified_indices)/len(predictions)*100:.2f}%")

    # Show examples
    print(f"\nSample misclassified reviews:")

    for i, idx in enumerate(misclassified_indices[:num_examples], 1):
        text = dataset[int(idx)]['text']
        true_label = true_labels[idx]
        pred_label = predictions[idx]

        # Truncate text
        if len(text) > 300:
            text = text[:300] + "..."

        print(f"\n[{i}] True: {'Positive' if true_label == 1 else 'Negative'} | "
              f"Predicted: {'Positive' if pred_label == 1 else 'Negative'}")
        print(f"    {text}")

    print("=" * 80)

error_analysis(tokenized_dataset['test'], predictions, true_labels, tokenizer)

26 Part 7: Inference & Deployment

26.1 Create Inference Pipeline

def create_inference_pipeline(model_path, tokenizer):
    """Create pipeline for inference"""

    print(f"Creating inference pipeline from: {model_path}")

    # Load model
    classifier = pipeline(
        "sentiment-analysis",
        model=str(model_path),
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1
    )

    print("✓ Pipeline created!")

    return classifier

# Create pipeline
inference_pipeline = create_inference_pipeline(
    dirs['models'] / 'final_model',
    tokenizer
)

26.2 Test Inference

def test_inference(classifier, test_texts):
    """Test inference on custom texts"""

    print("\n" + "=" * 80)
    print("INFERENCE TESTING")
    print("=" * 80)

    for i, text in enumerate(test_texts, 1):
        result = classifier(text)[0]
        label = result['label']
        score = result['score']

        sentiment = "Positive" if label == 'LABEL_1' else "Negative"

        print(f"\n[{i}] Text:")
        print(f"    '{text}'")
        print(f"    Prediction: {sentiment} (confidence: {score:.4f})")

    print("=" * 80)

# Custom test cases
test_texts = [
    "This movie is absolutely fantastic! Best film I've seen this year!",
    "Terrible acting, boring plot. Complete waste of time.",
    "It was okay, nothing special but not terrible either.",
    "Amazing cinematography and brilliant performances throughout.",
    "I fell asleep halfway through. Very disappointing.",
    "The special effects were mind-blowing!",
    "Not recommended. Poor script and direction."
]

test_inference(inference_pipeline, test_texts)
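Because `num_labels` was set without `id2label`, the pipeline reports generic names like `LABEL_1`. A small mapping (our own convention, matching the check inside `test_inference`) makes results readable; alternatively, setting `id2label`/`label2id` on the model config before saving makes the pipeline print real names directly:

```python
RAW2NAME = {"LABEL_0": "Negative", "LABEL_1": "Positive"}

def pretty(result: dict) -> str:
    """Format a pipeline result like {'label': 'LABEL_1', 'score': 0.98} for display."""
    name = RAW2NAME.get(result["label"], result["label"])
    return f"{name} ({result['score']:.2%})"
```

Unknown labels fall through unchanged, so the helper stays safe if the model config already carries human-readable names.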

26.3 Batch Inference

def batch_inference(classifier, texts, batch_size=32):
    """Perform batch inference"""

    print(f"\nRunning batch inference on {len(texts)} samples...")

    import time
    start_time = time.time()

    results = classifier(texts, batch_size=batch_size)

    elapsed_time = time.time() - start_time

    print(f"✓ Inference complete!")
    print(f"  Time: {elapsed_time:.2f}s")
    print(f"  Throughput: {len(texts)/elapsed_time:.1f} samples/sec")

    return results

# Test batch inference
sample_texts = [dataset['test'][i]['text'] for i in range(100)]
batch_results = batch_inference(inference_pipeline, sample_texts, batch_size=16)

26.4 Save Predictions

def save_predictions(texts, predictions, output_path):
    """Save predictions to file"""

    results_df = pd.DataFrame({
        'text': texts,
        'label': [p['label'] for p in predictions],
        'score': [p['score'] for p in predictions]
    })

    results_df.to_csv(output_path, index=False)
    print(f"✓ Predictions saved to: {output_path}")

    return results_df

# Save sample predictions
predictions_df = save_predictions(
    sample_texts,
    batch_results,
    dirs['predictions'] / 'sample_predictions.csv'
)

print("\nSample predictions:")
print(predictions_df.head(10))

27 Part 8: Model Optimization

27.1 Model Quantization

def quantize_model(model):
    """Apply dynamic quantization"""

    print("Applying dynamic quantization...")

    # Quantize
    quantized_model = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},
        dtype=torch.qint8
    )

    print("✓ Quantization complete!")

    # Compare sizes
    def get_model_size(model):
        torch.save(model.state_dict(), "temp.pth")
        size = Path("temp.pth").stat().st_size / (1024 * 1024)
        Path("temp.pth").unlink()
        return size

    original_size = get_model_size(model)
    quantized_size = get_model_size(quantized_model)

    print(f"\n  Original model: {original_size:.2f} MB")
    print(f"  Quantized model: {quantized_size:.2f} MB")
    print(f"  Reduction: {(1 - quantized_size/original_size)*100:.1f}%")

    return quantized_model

# Note: Quantization may not work on all BERT models
# Uncomment to test:
# quantized_model = quantize_model(model)

27.2 Model Export (ONNX)

def export_to_onnx(model, tokenizer, output_path, max_length=256):
    """Export model to ONNX format"""

    print(f"Exporting model to ONNX format...")

    # Dummy input
    dummy_text = "This is a sample text for export."
    inputs = tokenizer(
        dummy_text,
        padding='max_length',
        max_length=max_length,
        truncation=True,
        return_tensors='pt'
    )

    # Move to CPU and switch to eval mode before tracing
    model_cpu = model.cpu()
    model_cpu.eval()
    inputs_cpu = {k: v.cpu() for k, v in inputs.items()}

    # Export
    torch.onnx.export(
        model_cpu,
        (inputs_cpu['input_ids'], inputs_cpu['attention_mask']),
        output_path,
        input_names=['input_ids', 'attention_mask'],
        output_names=['logits'],
        dynamic_axes={
            'input_ids': {0: 'batch', 1: 'sequence'},
            'attention_mask': {0: 'batch', 1: 'sequence'},
            'logits': {0: 'batch'}
        },
        opset_version=11
    )

    print(f"✓ Model exported to: {output_path}")

    # Check size
    size = Path(output_path).stat().st_size / (1024 * 1024)
    print(f"  ONNX model size: {size:.2f} MB")

# Export model
# export_to_onnx(model, tokenizer, dirs['models'] / 'model.onnx')

28 Part 9: Interpretability

28.1 Attention Visualization

def visualize_attention(text, model, tokenizer, layer=11, head=0):
    """Visualize attention weights"""

    print(f"\nVisualizing attention for text:")
    print(f"  '{text}'")

    # Tokenize
    inputs = tokenizer(text, return_tensors='pt').to(device)

    # Get attention weights
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Extract attention from specific layer and head
    attention = outputs.attentions[layer][0, head].cpu().numpy()

    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Plot
    fig, ax = plt.subplots(figsize=(12, 10))

    im = ax.imshow(attention, cmap='YlOrRd')

    # Set ticks
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticklabels(tokens)

    # Colorbar
    plt.colorbar(im, ax=ax)

    ax.set_xlabel('Key', fontsize=12, fontweight='bold')
    ax.set_ylabel('Query', fontsize=12, fontweight='bold')
    ax.set_title(f'Attention Weights (Layer {layer}, Head {head})', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.savefig(dirs['figures'] / f'attention_layer{layer}_head{head}.png', dpi=300, bbox_inches='tight')
    plt.show()

# Visualize attention for sample text
sample_text = "This movie is absolutely amazing and wonderful!"
visualize_attention(sample_text, model, tokenizer, layer=11, head=0)
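A note on reading the heatmap: each row is a softmax distribution over keys, so every row sums to 1, and a bright cell means that query token attends strongly to that key token. This can be checked on a toy scaled dot-product attention matrix (made-up dimensions, independent of the BERT model):

```python
import torch

torch.manual_seed(0)

# Toy attention for 5 "tokens" with head dimension 8
d = 8
q = torch.randn(5, d)
k = torch.randn(5, d)

scores = q @ k.T / d**0.5                # (5, 5) raw similarity scores
weights = torch.softmax(scores, dim=-1)  # each row: a distribution over keys

print(weights.sum(dim=-1))  # every row sums to 1
```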

28.2 Feature Importance

def analyze_important_words(text, model, tokenizer):
    """Analyze important words using gradient-based attribution"""

    print(f"\nAnalyzing important words for:")
    print(f"  '{text}'")

    # Tokenize
    inputs = tokenizer(text, return_tensors='pt').to(device)

    # Get word embeddings and keep their gradient (retain_grad is needed
    # because the embedding output is a non-leaf tensor)
    word_embeddings = model.bert.embeddings.word_embeddings(inputs['input_ids'])
    word_embeddings.retain_grad()

    # Forward pass using the embeddings directly, so gradients flow back to them
    # (position and token-type embeddings are still added inside the model)
    outputs = model(
        inputs_embeds=word_embeddings,
        attention_mask=inputs['attention_mask']
    )
    logits = outputs.logits

    # Get predicted class
    predicted_class = torch.argmax(logits, dim=1).item()

    # Backward pass
    model.zero_grad()
    logits[0, predicted_class].backward()

    # Get gradients
    gradients = word_embeddings.grad.cpu().numpy()[0]

    # Calculate importance (L2 norm of gradients)
    importance = np.linalg.norm(gradients, axis=1)

    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Plot
    fig, ax = plt.subplots(figsize=(14, 6))

    colors = ['green' if imp > np.median(importance) else 'gray' for imp in importance]
    ax.barh(range(len(tokens)), importance, color=colors, alpha=0.7)

    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_xlabel('Importance Score', fontsize=12, fontweight='bold')
    ax.set_title('Word Importance (Gradient-based)', fontsize=14, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'word_importance.png', dpi=300, bbox_inches='tight')
    plt.show()

    # Print top words
    top_indices = np.argsort(importance)[::-1][:5]
    print("\nTop 5 important words:")
    for i, idx in enumerate(top_indices, 1):
        print(f"  {i}. {tokens[idx]:15s}: {importance[idx]:.4f}")

analyze_important_words(sample_text, model, tokenizer)
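The `retain_grad` trick above (keeping gradients on a non-leaf embedding tensor) can be isolated in a tiny self-contained example, with made-up vocabulary and dimensions:

```python
import torch

torch.manual_seed(0)

# Toy model: embedding lookup followed by a linear classifier
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
clf = torch.nn.Linear(4, 2)

token_ids = torch.tensor([[1, 3, 5]])  # one "sentence" of 3 tokens
vectors = emb(token_ids)               # (1, 3, 4), a non-leaf tensor
vectors.retain_grad()                  # keep .grad after backward

logits = clf(vectors.mean(dim=1))      # (1, 2)
predicted = logits.argmax(dim=1).item()
logits[0, predicted].backward()

# One importance score per token: L2 norm of its embedding gradient
importance = vectors.grad[0].norm(dim=1)
print(importance)
```

Without `retain_grad`, PyTorch discards gradients on intermediate tensors after backward, and `vectors.grad` would be `None`.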

29 Conclusion

29.1 Summary

In this lab, we have:

  1. ✓ Loaded and explored the IMDB dataset
  2. ✓ Performed tokenization with the BERT tokenizer
  3. ✓ Fine-tuned pre-trained BERT for sentiment analysis
  4. ✓ Evaluated the model with multiple metrics
  5. ✓ Built an inference pipeline for production
  6. ✓ Optimized the model for deployment
  7. ✓ Visualized attention and word importance

29.2 Key Takeaways

  • Transfer learning is very powerful for NLP tasks
  • Hugging Face Transformers makes working with pre-trained models easy
  • Fine-tuning can reach strong results with relatively little task-specific data
  • Proper tokenization is critical for performance
  • Model optimization is important for production deployment

29.3 Next Steps

  1. Experiment with other models (RoBERTa, DistilBERT, ALBERT)
  2. Try multi-class classification tasks
  3. Implement custom datasets
  4. Deploy the model with FastAPI/Flask
  5. Explore other NLP tasks (NER, QA, summarization)
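The deployment step can be sketched with Flask. This is a minimal illustration, not a production setup: the `classify` function below is a placeholder keyword heuristic standing in for a call to the fine-tuned `inference_pipeline`:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(text):
    # Placeholder: in a real deployment, call the fine-tuned pipeline here,
    # e.g. inference_pipeline(text)[0]
    positive = "good" in text.lower() or "great" in text.lower()
    return {"label": "POSITIVE" if positive else "NEGATIVE", "score": 0.5}

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(force=True)
    return jsonify(classify(data["text"]))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

A client would then POST JSON like `{"text": "a good movie"}` to `/predict` and receive the label and score back.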

🎓 Congratulations! You have completed Lab 8!