Lab 8: Fine-tuning BERT for Sentiment Analysis

Transfer Learning dengan Hugging Face Transformers & Pre-trained Models

Author

Machine Learning - Data Science for Cybersecurity

Published

December 15, 2025

19 Introduction

19.1 Learning Objectives

After completing this lab, you should be able to:

  1. Understand the concept of transfer learning with pre-trained transformers
  2. Use the Hugging Face Transformers library
  3. Perform tokenization with the BERT tokenizer
  4. Fine-tune BERT for sentiment analysis
  5. Evaluate the model with a variety of metrics
  6. Implement an inference pipeline for production
  7. Optimize the model for deployment
  8. Visualize attention weights for interpretability

19.2 Lab Overview

This lab focuses on fine-tuning BERT for sentiment analysis using the Hugging Face Transformers library.

19.2.1 Dataset: IMDB Movie Reviews

Characteristics:

  • Domain: Movie reviews
  • Task: Binary sentiment classification (positive/negative)
  • Size: 50,000 reviews (25k train, 25k test)
  • Features: Text reviews
  • Labels: 0 (negative), 1 (positive)
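The label encoding above can be captured in a small mapping (the string names are our own convention for readability; the dataset itself stores only the integers):

```python
# Label encoding used throughout this lab: 0 = negative, 1 = positive
ID2LABEL = {0: "negative", 1: "positive"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
```

Keeping both directions of the mapping in one place avoids hard-coding magic numbers later in the lab.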

19.2.2 Lab Structure

graph TD
    A[Setup & Installation] --> B[Data Exploration]
    B --> C[Tokenization]
    C --> D[Model Loading]
    D --> E[Fine-tuning]
    E --> F[Evaluation]
    F --> G[Inference]
    G --> H[Optimization]

    style A fill:#e6f3ff
    style B fill:#ffe6e6
    style C fill:#ffffcc
    style D fill:#ccffcc
    style E fill:#e6ccff
    style F fill:#ffcccc
    style G fill:#ccffe6
    style H fill:#ffccff

19.3 Environment Setup

19.3.1 Install Dependencies

# Install required packages
import subprocess
import sys

packages = [
    'transformers',
    'datasets',
    'evaluate',
    'accelerate',
    'sentencepiece',
    'sacremoses'
]

for package in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

print("✓ All packages installed successfully!")

19.3.2 Import Libraries

# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Transformers
import transformers  # module import so the version can be printed below
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline
)

# Datasets (load_metric was removed from recent `datasets` releases and is unused here)
from datasets import load_dataset

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Utilities
from tqdm.auto import tqdm
import json
from datetime import datetime

print("✓ Imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

19.3.3 Setup Directories

# Create directories
dirs = {
    'data': Path('data'),
    'models': Path('models'),
    'checkpoints': Path('checkpoints'),
    'figures': Path('figures'),
    'logs': Path('logs'),
    'predictions': Path('predictions'),
    'cache': Path('cache')
}

for name, path in dirs.items():
    path.mkdir(exist_ok=True, parents=True)
    print(f"✓ {name}: {path}")

19.3.4 Configure Device

def setup_device():
    """Setup computation device (GPU/CPU)"""
    if torch.cuda.is_available():
        device = torch.device('cuda')
        print(f"✓ Using GPU: {torch.cuda.get_device_name(0)}")
        print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    else:
        device = torch.device('cpu')
        print("⚠ GPU not available. Using CPU")

    return device

device = setup_device()

19.3.5 Global Configuration

# Hyperparameters
CONFIG = {
    'model_name': 'bert-base-uncased',
    'max_length': 256,
    'batch_size': 16,
    'learning_rate': 2e-5,
    'num_epochs': 3,
    'warmup_steps': 500,
    'weight_decay': 0.01,
    'seed': 42,
    'num_labels': 2,
    'train_subset': None,  # None for the full dataset, or an int for a subset
    'eval_subset': None
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

# Set random seeds
torch.manual_seed(CONFIG['seed'])
np.random.seed(CONFIG['seed'])
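The seeding above can be wrapped in one helper that also covers Python's built-in `random` module and, when PyTorch is installed, its CPU and CUDA generators (the helper name `set_all_seeds` is our own choice):

```python
import random

import numpy as np

def set_all_seeds(seed: int) -> None:
    """Seed Python, NumPy, and (if installed) PyTorch for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op on machines without a GPU
    except ImportError:
        pass
```

Calling `set_all_seeds(CONFIG['seed'])` once at the top of a run keeps all three sources of randomness in sync.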

20 Part 1: Data Loading & Exploration

20.1 Load IMDB Dataset

def load_imdb_data():
    """Load IMDB dataset from Hugging Face"""
    print("Loading IMDB dataset...")

    # Load dataset
    dataset = load_dataset('imdb', cache_dir=dirs['cache'])

    print(f"\n✓ Dataset loaded!")
    print(f"  Train samples: {len(dataset['train']):,}")
    print(f"  Test samples: {len(dataset['test']):,}")

    # Inspect structure
    print(f"\nDataset structure:")
    print(dataset)

    # Show sample
    print(f"\nSample review:")
    sample = dataset['train'][0]
    print(f"  Text: {sample['text'][:200]}...")
    print(f"  Label: {sample['label']} ({'Positive' if sample['label'] == 1 else 'Negative'})")

    return dataset

# Load data
dataset = load_imdb_data()

20.2 Data Exploration

20.2.1 Basic Statistics

def explore_dataset(dataset):
    """Explore dataset statistics"""

    print("=" * 80)
    print("DATASET EXPLORATION")
    print("=" * 80)

    # Label distribution
    train_labels = [sample['label'] for sample in dataset['train']]
    test_labels = [sample['label'] for sample in dataset['test']]

    print("\n1. LABEL DISTRIBUTION:")
    print(f"   Train - Negative: {train_labels.count(0):,} ({train_labels.count(0)/len(train_labels)*100:.1f}%)")
    print(f"   Train - Positive: {train_labels.count(1):,} ({train_labels.count(1)/len(train_labels)*100:.1f}%)")
    print(f"   Test  - Negative: {test_labels.count(0):,} ({test_labels.count(0)/len(test_labels)*100:.1f}%)")
    print(f"   Test  - Positive: {test_labels.count(1):,} ({test_labels.count(1)/len(test_labels)*100:.1f}%)")

    # Text length distribution
    train_lengths = [len(sample['text'].split()) for sample in dataset['train']]
    test_lengths = [len(sample['text'].split()) for sample in dataset['test']]

    print("\n2. TEXT LENGTH STATISTICS (words):")
    print(f"   Train - Mean: {np.mean(train_lengths):.1f}, Median: {np.median(train_lengths):.1f}")
    print(f"   Train - Min: {np.min(train_lengths)}, Max: {np.max(train_lengths)}")
    print(f"   Test  - Mean: {np.mean(test_lengths):.1f}, Median: {np.median(test_lengths):.1f}")
    print(f"   Test  - Min: {np.min(test_lengths)}, Max: {np.max(test_lengths)}")

    print("=" * 80)

    return train_lengths, test_lengths

train_lengths, test_lengths = explore_dataset(dataset)

20.2.2 Visualization

def visualize_data_distribution(dataset, train_lengths, test_lengths):
    """Visualize data distribution"""

    fig, axes = plt.subplots(2, 2, figsize=(16, 12))

    # 1. Label distribution
    train_labels = [sample['label'] for sample in dataset['train']]
    test_labels = [sample['label'] for sample in dataset['test']]

    label_counts_train = [train_labels.count(0), train_labels.count(1)]
    label_counts_test = [test_labels.count(0), test_labels.count(1)]

    x = np.arange(2)
    width = 0.35

    axes[0, 0].bar(x - width/2, label_counts_train, width, label='Train', alpha=0.8)
    axes[0, 0].bar(x + width/2, label_counts_test, width, label='Test', alpha=0.8)
    axes[0, 0].set_xlabel('Label', fontsize=12, fontweight='bold')
    axes[0, 0].set_ylabel('Count', fontsize=12, fontweight='bold')
    axes[0, 0].set_title('Label Distribution', fontsize=14, fontweight='bold')
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels(['Negative (0)', 'Positive (1)'])
    axes[0, 0].legend(fontsize=11)
    axes[0, 0].grid(axis='y', alpha=0.3)

    # 2. Text length distribution
    axes[0, 1].hist(train_lengths, bins=50, alpha=0.7, label='Train', color='steelblue', edgecolor='black')
    axes[0, 1].hist(test_lengths, bins=50, alpha=0.7, label='Test', color='coral', edgecolor='black')
    axes[0, 1].set_xlabel('Number of Words', fontsize=12, fontweight='bold')
    axes[0, 1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
    axes[0, 1].set_title('Text Length Distribution', fontsize=14, fontweight='bold')
    axes[0, 1].legend(fontsize=11)
    axes[0, 1].grid(alpha=0.3)

    # 3. Box plot
    axes[1, 0].boxplot([train_lengths, test_lengths], labels=['Train', 'Test'])
    axes[1, 0].set_ylabel('Number of Words', fontsize=12, fontweight='bold')
    axes[1, 0].set_title('Text Length Box Plot', fontsize=14, fontweight='bold')
    axes[1, 0].grid(axis='y', alpha=0.3)

    # 4. Cumulative distribution
    train_sorted = np.sort(train_lengths)
    test_sorted = np.sort(test_lengths)
    train_cumsum = np.arange(1, len(train_sorted) + 1) / len(train_sorted)
    test_cumsum = np.arange(1, len(test_sorted) + 1) / len(test_sorted)

    axes[1, 1].plot(train_sorted, train_cumsum, label='Train', linewidth=2)
    axes[1, 1].plot(test_sorted, test_cumsum, label='Test', linewidth=2)
    axes[1, 1].axhline(0.95, color='red', linestyle='--', label='95th percentile', alpha=0.7)
    axes[1, 1].set_xlabel('Number of Words', fontsize=12, fontweight='bold')
    axes[1, 1].set_ylabel('Cumulative Probability', fontsize=12, fontweight='bold')
    axes[1, 1].set_title('Cumulative Distribution', fontsize=14, fontweight='bold')
    axes[1, 1].legend(fontsize=11)
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'data_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()

visualize_data_distribution(dataset, train_lengths, test_lengths)

20.2.3 Sample Reviews

def show_sample_reviews(dataset, num_samples=3):
    """Display sample reviews"""

    print("\n" + "=" * 80)
    print("SAMPLE REVIEWS")
    print("=" * 80)

    for sentiment in [0, 1]:
        sentiment_name = "NEGATIVE" if sentiment == 0 else "POSITIVE"
        print(f"\n{sentiment_name} REVIEWS:")

        # Filter by sentiment
        samples = [s for s in dataset['train'] if s['label'] == sentiment][:num_samples]

        for i, sample in enumerate(samples, 1):
            text = sample['text']
            # Truncate for display
            if len(text) > 300:
                text = text[:300] + "..."

            print(f"\n  [{i}] {text}")

    print("=" * 80)

show_sample_reviews(dataset, num_samples=2)

21 Part 2: Tokenization & Data Preprocessing

21.1 BERT Tokenizer

21.1.1 Load Tokenizer

def load_tokenizer(model_name):
    """Load pre-trained tokenizer"""
    print(f"Loading tokenizer: {model_name}...")

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        cache_dir=dirs['cache']
    )

    print(f"✓ Tokenizer loaded!")
    print(f"  Vocabulary size: {tokenizer.vocab_size:,}")
    print(f"  Model max length: {tokenizer.model_max_length:,}")

    # Special tokens
    print(f"\n  Special tokens:")
    print(f"    PAD: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
    print(f"    CLS: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
    print(f"    SEP: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
    print(f"    UNK: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")
    print(f"    MASK: {tokenizer.mask_token} (ID: {tokenizer.mask_token_id})")

    return tokenizer

tokenizer = load_tokenizer(CONFIG['model_name'])

21.1.2 Tokenization Examples

def demonstrate_tokenization(tokenizer):
    """Demonstrate tokenization process"""

    print("\n" + "=" * 80)
    print("TOKENIZATION DEMONSTRATION")
    print("=" * 80)

    examples = [
        "This movie is great!",
        "Terrible acting and boring plot.",
        "Transformers are revolutionizing NLP."
    ]

    for i, text in enumerate(examples, 1):
        print(f"\n{i}. Original text:")
        print(f"   '{text}'")

        # Tokenize
        tokens = tokenizer.tokenize(text)
        print(f"\n   Tokens: {tokens}")

        # Convert to IDs
        token_ids = tokenizer.convert_tokens_to_ids(tokens)
        print(f"   Token IDs: {token_ids}")

        # Full encoding
        encoding = tokenizer(
            text,
            add_special_tokens=True,
            return_tensors='pt'
        )
        print(f"\n   Input IDs: {encoding['input_ids'][0].tolist()}")
        print(f"   Attention mask: {encoding['attention_mask'][0].tolist()}")

        # Decode
        decoded = tokenizer.decode(encoding['input_ids'][0])
        print(f"   Decoded: '{decoded}'")

    print("=" * 80)

demonstrate_tokenization(tokenizer)

21.1.3 Tokenize Dataset

def tokenize_dataset(dataset, tokenizer, max_length):
    """
    Tokenize entire dataset

    Parameters:
        dataset: HuggingFace dataset
        tokenizer: Pre-trained tokenizer
        max_length: Maximum sequence length

    Returns:
        tokenized_dataset: Tokenized dataset
    """
    print(f"Tokenizing dataset (max_length={max_length})...")

    def tokenize_function(examples):
        """Tokenize batch of examples"""
        return tokenizer(
            examples['text'],
            padding='max_length',
            truncation=True,
            max_length=max_length,
            return_tensors=None
        )

    # Tokenize
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        desc="Tokenizing"
    )

    print(f"✓ Tokenization complete!")
    print(f"\nTokenized dataset:")
    print(tokenized_dataset)

    # Show sample
    sample = tokenized_dataset['train'][0]
    print(f"\nSample tokenized review:")
    print(f"  Input IDs length: {len(sample['input_ids'])}")
    print(f"  Attention mask length: {len(sample['attention_mask'])}")
    print(f"  Label: {sample['label']}")

    return tokenized_dataset

tokenized_dataset = tokenize_dataset(dataset, tokenizer, CONFIG['max_length'])

21.1.4 Analyze Tokenization

def analyze_tokenization(tokenized_dataset):
    """Analyze tokenization statistics"""

    print("\n" + "=" * 80)
    print("TOKENIZATION ANALYSIS")
    print("=" * 80)

    # Calculate effective lengths (excluding padding), vectorized over each split
    train_eff_lengths = np.array(tokenized_dataset['train']['attention_mask']).sum(axis=1)
    test_eff_lengths = np.array(tokenized_dataset['test']['attention_mask']).sum(axis=1)

    print(f"\n1. EFFECTIVE TOKEN LENGTHS:")
    print(f"   Train - Mean: {np.mean(train_eff_lengths):.1f}, Median: {np.median(train_eff_lengths):.1f}")
    print(f"   Train - Min: {np.min(train_eff_lengths)}, Max: {np.max(train_eff_lengths)}")
    print(f"   Test  - Mean: {np.mean(test_eff_lengths):.1f}, Median: {np.median(test_eff_lengths):.1f}")

    # Truncation analysis: sequences that fill max_length were almost certainly truncated
    max_len = CONFIG['max_length']
    train_truncated = sum(1 for l in train_eff_lengths if l == max_len)
    test_truncated = sum(1 for l in test_eff_lengths if l == max_len)

    print(f"\n2. TRUNCATION:")
    print(f"   Train samples truncated: {train_truncated:,} ({train_truncated/len(train_eff_lengths)*100:.2f}%)")
    print(f"   Test samples truncated: {test_truncated:,} ({test_truncated/len(test_eff_lengths)*100:.2f}%)")

    # Padding analysis
    total_tokens_train = len(train_eff_lengths) * max_len
    actual_tokens_train = sum(train_eff_lengths)
    padding_ratio_train = (total_tokens_train - actual_tokens_train) / total_tokens_train

    print(f"\n3. PADDING:")
    print(f"   Train padding ratio: {padding_ratio_train*100:.2f}%")

    print("=" * 80)

    return train_eff_lengths, test_eff_lengths

train_eff_lengths, test_eff_lengths = analyze_tokenization(tokenized_dataset)
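All three statistics above reduce to simple operations on the attention-mask matrix. A toy example with three sequences padded to `max_length=6` (numbers made up for illustration):

```python
import numpy as np

# Toy attention masks: 1 = real token, 0 = padding (max_length = 6)
masks = np.array([
    [1, 1, 1, 0, 0, 0],  # effective length 3
    [1, 1, 1, 1, 1, 1],  # effective length 6 -> hit max_length, likely truncated
    [1, 1, 1, 1, 0, 0],  # effective length 4
])

eff_lengths = masks.sum(axis=1)                          # tokens per sequence
num_truncated = int((eff_lengths == masks.shape[1]).sum())  # sequences at the cap
padding_ratio = 1 - eff_lengths.sum() / masks.size       # wasted (padded) cells
```

Here 13 of 18 cells are real tokens, so the padding ratio is 5/18 ≈ 28%.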

22 Part 3: Model Loading & Configuration

22.1 Load Pre-trained BERT

def load_pretrained_bert(model_name, num_labels):
    """Load pre-trained BERT for sequence classification"""

    print(f"Loading pre-trained model: {model_name}...")

    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        cache_dir=dirs['cache']
    )

    # Move to device
    model = model.to(device)

    print(f"✓ Model loaded!")
    print(f"  Total parameters: {model.num_parameters():,}")

    # Count trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"  Trainable parameters: {trainable_params:,}")

    # Model architecture
    print(f"\n  Model architecture:")
    print(f"    BERT encoder: {model.config.num_hidden_layers} layers")
    print(f"    Hidden size: {model.config.hidden_size}")
    print(f"    Attention heads: {model.config.num_attention_heads}")
    print(f"    Intermediate size: {model.config.intermediate_size}")

    return model

model = load_pretrained_bert(CONFIG['model_name'], CONFIG['num_labels'])

22.2 Model Summary

def print_model_summary(model):
    """Print detailed model summary"""

    print("\n" + "=" * 80)
    print("MODEL ARCHITECTURE")
    print("=" * 80)

    # Layer-wise parameters
    print("\nLayer-wise parameters:")
    total_params = 0

    for name, param in model.named_parameters():
        param_count = param.numel()
        total_params += param_count

        # Only print major layers
        if 'layer' in name or 'classifier' in name or 'embeddings' in name:
            print(f"  {name:60s}: {param_count:>12,}")

    print(f"\n  {'Total':60s}: {total_params:>12,}")
    print("=" * 80)

print_model_summary(model)

23 Part 4: Training Configuration

23.1 Training Arguments

def create_training_arguments():
    """Create training configuration"""

    training_args = TrainingArguments(
        # Output
        output_dir=str(dirs['checkpoints']),

        # Training hyperparameters
        num_train_epochs=CONFIG['num_epochs'],
        per_device_train_batch_size=CONFIG['batch_size'],
        per_device_eval_batch_size=CONFIG['batch_size'] * 2,
        learning_rate=CONFIG['learning_rate'],
        weight_decay=CONFIG['weight_decay'],
        warmup_steps=CONFIG['warmup_steps'],

        # Evaluation & logging
        evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
        save_strategy="epoch",
        logging_dir=str(dirs['logs']),
        logging_steps=100,
        logging_first_step=True,

        # Model saving
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        save_total_limit=3,

        # Hardware
        fp16=torch.cuda.is_available(),  # Mixed precision if GPU available
        dataloader_num_workers=4,

        # Misc
        seed=CONFIG['seed'],
        report_to="none"  # Disable W&B, TensorBoard
    )

    print("Training configuration:")
    print(f"  Epochs: {training_args.num_train_epochs}")
    print(f"  Batch size: {training_args.per_device_train_batch_size}")
    print(f"  Learning rate: {training_args.learning_rate}")
    print(f"  Warmup steps: {training_args.warmup_steps}")
    print(f"  Weight decay: {training_args.weight_decay}")
    print(f"  Mixed precision (fp16): {training_args.fp16}")

    return training_args

training_args = create_training_arguments()

23.2 Evaluation Metrics

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics

    Parameters:
        eval_pred: Tuple of (predictions, labels)

    Returns:
        metrics: Dictionary of metrics
    """
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    # Accuracy
    accuracy = accuracy_score(labels, predictions)

    # Precision, Recall, F1
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels,
        predictions,
        average='binary'
    )

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

print("✓ Metrics function defined")
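On a toy batch, the same logic as `compute_metrics` can be checked by hand with plain NumPy (the logits and labels below are made up for illustration):

```python
import numpy as np

# Toy logits for 4 samples over 2 classes, plus true labels
logits = np.array([[2.0, 0.5], [0.1, 1.2], [1.5, 1.4], [0.3, 0.9]])
labels = np.array([0, 1, 1, 1])

preds = np.argmax(logits, axis=1)  # predicted class per row: [0, 1, 0, 1]
tp = int(((preds == 1) & (labels == 1)).sum())
fp = int(((preds == 1) & (labels == 0)).sum())
fn = int(((preds == 0) & (labels == 1)).sum())

accuracy = float((preds == labels).mean())          # 3 of 4 correct
precision = tp / (tp + fp)                          # 2 / 2
recall = tp / (tp + fn)                             # 2 / 3
f1 = 2 * precision * recall / (precision + recall)
```

Working through one small batch like this is a quick sanity check that the `average='binary'` metrics behave as expected before training starts.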

23.3 Data Collator

from transformers import DataCollatorWithPadding

# Create data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

print("✓ Data collator created")
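Conceptually, `DataCollatorWithPadding` pads each batch only to its longest member (dynamic padding) rather than to a global `max_length`. A simplified plain-Python sketch of that idea (not the library's actual implementation):

```python
def pad_batch(sequences, pad_id=0):
    """Pad a ragged batch of token-ID lists to the longest sequence in the batch."""
    batch_max = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (batch_max - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (batch_max - len(seq)) for seq in sequences]
    return input_ids, attention_mask
```

Note that because `tokenize_function` above already pads everything to `max_length`, the collator has little left to do in this lab; tokenizing with `padding=False` instead would let dynamic padding save compute on short reviews.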

24 Part 5: Fine-tuning

24.1 Create Trainer

def create_trainer(model, training_args, train_dataset, eval_dataset, tokenizer):
    """Create Hugging Face Trainer"""

    # Subset for faster training (if configured)
    if CONFIG['train_subset']:
        train_dataset = train_dataset.shuffle(seed=CONFIG['seed']).select(range(CONFIG['train_subset']))
        print(f"Using subset of training data: {len(train_dataset):,} samples")

    if CONFIG['eval_subset']:
        eval_dataset = eval_dataset.shuffle(seed=CONFIG['seed']).select(range(CONFIG['eval_subset']))
        print(f"Using subset of evaluation data: {len(eval_dataset):,} samples")

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    print("✓ Trainer created!")
    print(f"  Training samples: {len(train_dataset):,}")
    print(f"  Evaluation samples: {len(eval_dataset):,}")

    return trainer

trainer = create_trainer(
    model,
    training_args,
    tokenized_dataset['train'],
    tokenized_dataset['test'],
    tokenizer
)

24.2 Train Model

def train_model(trainer):
    """Execute training"""

    print("\n" + "=" * 80)
    print("STARTING TRAINING")
    print("=" * 80)

    # Start training
    train_result = trainer.train()

    # Training summary
    print("\n" + "=" * 80)
    print("TRAINING COMPLETE")
    print("=" * 80)

    metrics = train_result.metrics
    print(f"\nTraining metrics:")
    for key, value in metrics.items():
        print(f"  {key}: {value}")

    # Save model
    trainer.save_model(str(dirs['models'] / 'final_model'))
    print(f"\n✓ Model saved to: {dirs['models'] / 'final_model'}")

    return train_result

# Execute training
train_result = train_model(trainer)

24.3 Training History Visualization

def plot_training_history(trainer):
    """Plot training metrics"""

    # Extract history
    log_history = trainer.state.log_history

    # Separate train and eval logs
    train_logs = [log for log in log_history if 'loss' in log and 'eval_loss' not in log]
    eval_logs = [log for log in log_history if 'eval_loss' in log]

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Loss plot
    if train_logs:
        train_steps = [log['step'] for log in train_logs]
        train_loss = [log['loss'] for log in train_logs]
        axes[0].plot(train_steps, train_loss, label='Training Loss', linewidth=2)

    if eval_logs and train_logs:
        eval_epochs = [log['epoch'] for log in eval_logs]
        eval_loss = [log['eval_loss'] for log in eval_logs]
        # Approximate steps from epochs using the last logged training step
        steps_per_epoch = train_steps[-1] / eval_epochs[-1]
        eval_steps = [e * steps_per_epoch for e in eval_epochs]
        axes[0].plot(eval_steps, eval_loss, label='Validation Loss', linewidth=2, marker='o', markersize=8)

    axes[0].set_xlabel('Training Steps', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Loss', fontsize=12, fontweight='bold')
    axes[0].set_title('Training & Validation Loss', fontsize=14, fontweight='bold')
    axes[0].legend(fontsize=11)
    axes[0].grid(alpha=0.3)

    # Metrics plot
    if eval_logs:
        eval_epochs = [log['epoch'] for log in eval_logs]
        eval_accuracy = [log.get('eval_accuracy', 0) for log in eval_logs]
        eval_f1 = [log.get('eval_f1', 0) for log in eval_logs]

        axes[1].plot(eval_epochs, eval_accuracy, label='Accuracy', linewidth=2, marker='o')
        axes[1].plot(eval_epochs, eval_f1, label='F1 Score', linewidth=2, marker='s')
        axes[1].set_xlabel('Epoch', fontsize=12, fontweight='bold')
        axes[1].set_ylabel('Score', fontsize=12, fontweight='bold')
        axes[1].set_title('Validation Metrics', fontsize=14, fontweight='bold')
        axes[1].legend(fontsize=11)
        axes[1].grid(alpha=0.3)
        axes[1].set_ylim([0, 1])

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'training_history.png', dpi=300, bbox_inches='tight')
    plt.show()

plot_training_history(trainer)

25 Part 6: Evaluation

25.1 Evaluate on Test Set

def evaluate_model(trainer):
    """Evaluate model on test set"""

    print("\n" + "=" * 80)
    print("EVALUATION ON TEST SET")
    print("=" * 80)

    # Evaluate
    eval_results = trainer.evaluate()

    print(f"\nTest set metrics:")
    for key, value in eval_results.items():
        if 'eval_' in key:
            metric_name = key.replace('eval_', '').upper()
            print(f"  {metric_name:15s}: {value:.4f}")

    return eval_results

eval_results = evaluate_model(trainer)

25.2 Predictions & Confusion Matrix

def analyze_predictions(trainer, dataset):
    """Analyze model predictions"""

    print("\nGenerating predictions...")

    # Get predictions
    predictions_output = trainer.predict(dataset)
    predictions = np.argmax(predictions_output.predictions, axis=1)
    true_labels = predictions_output.label_ids

    # Confusion matrix
    from sklearn.metrics import confusion_matrix, classification_report

    cm = confusion_matrix(true_labels, predictions)

    print("\nConfusion Matrix:")
    print(cm)

    # Classification report
    print("\nClassification Report:")
    print(classification_report(
        true_labels,
        predictions,
        target_names=['Negative', 'Positive']
    ))

    # Visualize confusion matrix
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))

    # Heatmap
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    axes[0].set_xlabel('Predicted', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('True', fontsize=12, fontweight='bold')
    axes[0].set_title('Confusion Matrix', fontsize=14, fontweight='bold')

    # Normalized confusion matrix
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='Greens', ax=axes[1],
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    axes[1].set_xlabel('Predicted', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('True', fontsize=12, fontweight='bold')
    axes[1].set_title('Normalized Confusion Matrix', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'confusion_matrix.png', dpi=300, bbox_inches='tight')
    plt.show()

    return predictions, true_labels

predictions, true_labels = analyze_predictions(trainer, tokenized_dataset['test'])

25.3 Error Analysis

def error_analysis(dataset, predictions, true_labels, tokenizer, num_examples=5):
    """Analyze misclassified examples"""

    print("\n" + "=" * 80)
    print("ERROR ANALYSIS")
    print("=" * 80)

    # Find misclassified
    misclassified_indices = np.where(predictions != true_labels)[0]

    print(f"\nTotal misclassified: {len(misclassified_indices):,} / {len(predictions):,}")
    print(f"Error rate: {len(misclassified_indices)/len(predictions)*100:.2f}%")

    # Show examples
    print(f"\nSample misclassified reviews:")

    for i, idx in enumerate(misclassified_indices[:num_examples], 1):
        text = dataset[int(idx)]['text']
        true_label = true_labels[idx]
        pred_label = predictions[idx]

        # Truncate text
        if len(text) > 300:
            text = text[:300] + "..."

        print(f"\n[{i}] True: {'Positive' if true_label == 1 else 'Negative'} | "
              f"Predicted: {'Positive' if pred_label == 1 else 'Negative'}")
        print(f"    {text}")

    print("=" * 80)

error_analysis(tokenized_dataset['test'], predictions, true_labels, tokenizer)

26 Part 7: Inference & Deployment

26.1 Create Inference Pipeline

def create_inference_pipeline(model_path, tokenizer):
    """Create pipeline for inference"""

    print(f"Creating inference pipeline from: {model_path}")

    # Load model
    classifier = pipeline(
        "sentiment-analysis",
        model=str(model_path),
        tokenizer=tokenizer,
        device=0 if torch.cuda.is_available() else -1
    )

    print("✓ Pipeline created!")

    return classifier

# Create pipeline
inference_pipeline = create_inference_pipeline(
    dirs['models'] / 'final_model',
    tokenizer
)

26.2 Test Inference

def test_inference(classifier, test_texts):
    """Test inference on custom texts"""

    print("\n" + "=" * 80)
    print("INFERENCE TESTING")
    print("=" * 80)

    for i, text in enumerate(test_texts, 1):
        result = classifier(text)[0]
        label = result['label']
        score = result['score']

        sentiment = "Positive" if label == 'LABEL_1' else "Negative"

        print(f"\n[{i}] Text:")
        print(f"    '{text}'")
        print(f"    Prediction: {sentiment} (confidence: {score:.4f})")

    print("=" * 80)

# Custom test cases
test_texts = [
    "This movie is absolutely fantastic! Best film I've seen this year!",
    "Terrible acting, boring plot. Complete waste of time.",
    "It was okay, nothing special but not terrible either.",
    "Amazing cinematography and brilliant performances throughout.",
    "I fell asleep halfway through. Very disappointing.",
    "The special effects were mind-blowing!",
    "Not recommended. Poor script and direction."
]

test_inference(inference_pipeline, test_texts)
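Because `num_labels` was set without `id2label`, the pipeline reports generic names like `LABEL_1`. A small mapping (our own convention, matching the check inside `test_inference`) makes results readable; alternatively, setting `id2label`/`label2id` on the model config before saving makes the pipeline print real names directly:

```python
RAW2NAME = {"LABEL_0": "Negative", "LABEL_1": "Positive"}

def pretty(result: dict) -> str:
    """Format a pipeline result like {'label': 'LABEL_1', 'score': 0.98} for display."""
    name = RAW2NAME.get(result["label"], result["label"])
    return f"{name} ({result['score']:.2%})"
```

Unknown labels fall through unchanged, so the helper stays safe if the model config already carries human-readable names.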

26.3 Batch Inference

def batch_inference(classifier, texts, batch_size=32):
    """Perform batch inference"""

    print(f"\nRunning batch inference on {len(texts)} samples...")

    import time
    start_time = time.time()

    results = classifier(texts, batch_size=batch_size)

    elapsed_time = time.time() - start_time

    print(f"✓ Inference complete!")
    print(f"  Time: {elapsed_time:.2f}s")
    print(f"  Throughput: {len(texts)/elapsed_time:.1f} samples/sec")

    return results

# Test batch inference
sample_texts = [dataset['test'][i]['text'] for i in range(100)]
batch_results = batch_inference(inference_pipeline, sample_texts, batch_size=16)

26.4 Save Predictions

def save_predictions(texts, predictions, output_path):
    """Save predictions to file"""

    results_df = pd.DataFrame({
        'text': texts,
        'label': [p['label'] for p in predictions],
        'score': [p['score'] for p in predictions]
    })

    results_df.to_csv(output_path, index=False)
    print(f"✓ Predictions saved to: {output_path}")

    return results_df

# Save sample predictions
predictions_df = save_predictions(
    sample_texts,
    batch_results,
    dirs['predictions'] / 'sample_predictions.csv'
)

print("\nSample predictions:")
print(predictions_df.head(10))

27 Part 8: Model Optimization

27.1 Model Quantization

def quantize_model(model):
    """Apply dynamic quantization"""

    print("Applying dynamic quantization...")

    # Quantize
    quantized_model = torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},
        dtype=torch.qint8
    )

    print("✓ Quantization complete!")

    # Compare sizes
    def get_model_size(model):
        torch.save(model.state_dict(), "temp.pth")
        size = Path("temp.pth").stat().st_size / (1024 * 1024)
        Path("temp.pth").unlink()
        return size

    original_size = get_model_size(model)
    quantized_size = get_model_size(quantized_model)

    print(f"\n  Original model: {original_size:.2f} MB")
    print(f"  Quantized model: {quantized_size:.2f} MB")
    print(f"  Reduction: {(1 - quantized_size/original_size)*100:.1f}%")

    return quantized_model

# Note: Quantization may not work on all BERT models
# Uncomment to test:
# quantized_model = quantize_model(model)

27.2 Model Export (ONNX)

def export_to_onnx(model, tokenizer, output_path, max_length=256):
    """Export model to ONNX format"""

    print(f"Exporting model to ONNX format...")

    # Dummy input
    dummy_text = "This is a sample text for export."
    inputs = tokenizer(
        dummy_text,
        padding='max_length',
        max_length=max_length,
        truncation=True,
        return_tensors='pt'
    )

    # Move to CPU and switch to eval mode before tracing
    model_cpu = model.cpu()
    model_cpu.eval()
    inputs_cpu = {k: v.cpu() for k, v in inputs.items()}

    # Export
    torch.onnx.export(
        model_cpu,
        (inputs_cpu['input_ids'], inputs_cpu['attention_mask']),
        output_path,
        input_names=['input_ids', 'attention_mask'],
        output_names=['logits'],
        dynamic_axes={
            'input_ids': {0: 'batch', 1: 'sequence'},
            'attention_mask': {0: 'batch', 1: 'sequence'},
            'logits': {0: 'batch'}
        },
        opset_version=11
    )

    print(f"✓ Model exported to: {output_path}")

    # Check size
    size = Path(output_path).stat().st_size / (1024 * 1024)
    print(f"  ONNX model size: {size:.2f} MB")

# Export model
# export_to_onnx(model, tokenizer, dirs['models'] / 'model.onnx')

28 Part 9: Interpretability

28.1 Attention Visualization

def visualize_attention(text, model, tokenizer, layer=11, head=0):
    """Visualize attention weights"""

    print(f"\nVisualizing attention for text:")
    print(f"  '{text}'")

    # Tokenize
    inputs = tokenizer(text, return_tensors='pt').to(device)

    # Get attention weights
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # Extract attention from specific layer and head
    attention = outputs.attentions[layer][0, head].cpu().numpy()

    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Plot
    fig, ax = plt.subplots(figsize=(12, 10))

    im = ax.imshow(attention, cmap='YlOrRd')

    # Set ticks
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticklabels(tokens)

    # Colorbar
    plt.colorbar(im, ax=ax)

    ax.set_xlabel('Key', fontsize=12, fontweight='bold')
    ax.set_ylabel('Query', fontsize=12, fontweight='bold')
    ax.set_title(f'Attention Weights (Layer {layer}, Head {head})', fontsize=14, fontweight='bold')

    plt.tight_layout()
    plt.savefig(dirs['figures'] / f'attention_layer{layer}_head{head}.png', dpi=300, bbox_inches='tight')
    plt.show()

# Visualize attention for sample text
sample_text = "This movie is absolutely amazing and wonderful!"
visualize_attention(sample_text, model, tokenizer, layer=11, head=0)
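A note on reading the heatmap: each row is a softmax distribution over keys, so every row sums to 1, and a bright cell means that query token attends strongly to that key token. This can be checked on a toy scaled dot-product attention matrix (made-up dimensions, independent of the BERT model):

```python
import torch

torch.manual_seed(0)

# Toy attention for 5 "tokens" with head dimension 8
d = 8
q = torch.randn(5, d)
k = torch.randn(5, d)

scores = q @ k.T / d**0.5                # (5, 5) raw similarity scores
weights = torch.softmax(scores, dim=-1)  # each row: a distribution over keys

print(weights.sum(dim=-1))  # every row sums to 1
```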

28.2 Feature Importance

def analyze_important_words(text, model, tokenizer):
    """Analyze important words using gradient-based attribution"""

    print(f"\nAnalyzing important words for:")
    print(f"  '{text}'")

    # Tokenize
    inputs = tokenizer(text, return_tensors='pt').to(device)

    # Get word embeddings and keep their gradient (retain_grad is needed
    # because the embedding output is a non-leaf tensor)
    word_embeddings = model.bert.embeddings.word_embeddings(inputs['input_ids'])
    word_embeddings.retain_grad()

    # Forward pass using the embeddings directly, so gradients flow back to them
    # (position and token-type embeddings are still added inside the model)
    outputs = model(
        inputs_embeds=word_embeddings,
        attention_mask=inputs['attention_mask']
    )
    logits = outputs.logits

    # Get predicted class
    predicted_class = torch.argmax(logits, dim=1).item()

    # Backward pass
    model.zero_grad()
    logits[0, predicted_class].backward()

    # Get gradients
    gradients = word_embeddings.grad.cpu().numpy()[0]

    # Calculate importance (L2 norm of gradients)
    importance = np.linalg.norm(gradients, axis=1)

    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Plot
    fig, ax = plt.subplots(figsize=(14, 6))

    colors = ['green' if imp > np.median(importance) else 'gray' for imp in importance]
    ax.barh(range(len(tokens)), importance, color=colors, alpha=0.7)

    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_xlabel('Importance Score', fontsize=12, fontweight='bold')
    ax.set_title('Word Importance (Gradient-based)', fontsize=14, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'word_importance.png', dpi=300, bbox_inches='tight')
    plt.show()

    # Print top words
    top_indices = np.argsort(importance)[::-1][:5]
    print("\nTop 5 important words:")
    for i, idx in enumerate(top_indices, 1):
        print(f"  {i}. {tokens[idx]:15s}: {importance[idx]:.4f}")

analyze_important_words(sample_text, model, tokenizer)
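The `retain_grad` trick above (keeping gradients on a non-leaf embedding tensor) can be isolated in a tiny self-contained example, with made-up vocabulary and dimensions:

```python
import torch

torch.manual_seed(0)

# Toy model: embedding lookup followed by a linear classifier
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
clf = torch.nn.Linear(4, 2)

token_ids = torch.tensor([[1, 3, 5]])  # one "sentence" of 3 tokens
vectors = emb(token_ids)               # (1, 3, 4), a non-leaf tensor
vectors.retain_grad()                  # keep .grad after backward

logits = clf(vectors.mean(dim=1))      # (1, 2)
predicted = logits.argmax(dim=1).item()
logits[0, predicted].backward()

# One importance score per token: L2 norm of its embedding gradient
importance = vectors.grad[0].norm(dim=1)
print(importance)
```

Without `retain_grad`, PyTorch discards gradients on intermediate tensors after backward, and `vectors.grad` would be `None`.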

29 Conclusion

29.1 Summary

In this lab, we have:

  1. ✓ Loaded and explored the IMDB dataset
  2. ✓ Performed tokenization with the BERT tokenizer
  3. ✓ Fine-tuned pre-trained BERT for sentiment analysis
  4. ✓ Evaluated the model with multiple metrics
  5. ✓ Built an inference pipeline for production
  6. ✓ Optimized the model for deployment
  7. ✓ Visualized attention and word importance

29.2 Key Takeaways

  • Transfer learning is very powerful for NLP tasks
  • Hugging Face Transformers makes working with pre-trained models easy
  • Fine-tuning can reach strong results with relatively little task-specific data
  • Proper tokenization is critical for performance
  • Model optimization is important for production deployment

29.3 Next Steps

  1. Experiment with other models (RoBERTa, DistilBERT, ALBERT)
  2. Try multi-class classification tasks
  3. Implement custom datasets
  4. Deploy the model with FastAPI/Flask
  5. Explore other NLP tasks (NER, QA, summarization)
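The deployment step can be sketched with Flask. This is a minimal illustration, not a production setup: the `classify` function below is a placeholder keyword heuristic standing in for a call to the fine-tuned `inference_pipeline`:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify(text):
    # Placeholder: in a real deployment, call the fine-tuned pipeline here,
    # e.g. inference_pipeline(text)[0]
    positive = "good" in text.lower() or "great" in text.lower()
    return {"label": "POSITIVE" if positive else "NEGATIVE", "score": 0.5}

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(force=True)
    return jsonify(classify(data["text"]))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

A client would then POST JSON like `{"text": "a good movie"}` to `/predict` and receive the label and score back.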

🎓 Congratulations! You have completed Lab 8!