Chapter 8: Transformers & the Attention Mechanism

Modern NLP with Self-Attention, BERT, GPT & Transfer Learning

🎯 Learning Outcomes

After studying this chapter, you will be able to:

  1. Understand the attention mechanism and self-attention
  2. Identify the Transformer architecture and its components
  3. Implement pre-trained transformers (BERT, GPT) with Hugging Face
  4. Apply transfer learning to NLP tasks
  5. Fine-tune transformer models for specific domains
  6. Use tokenizers and pipelines for inference
  7. Evaluate transformer model performance across a variety of tasks

8.1 From RNNs to Transformers: The NLP Revolution

8.1.1 Limitations of RNNs and LSTMs

The sequential-processing problem:

In Chapter 7 we studied RNNs and LSTMs, which are powerful for sequential data. However, they have fundamental limitations:

1. Sequential computation (no parallelism):

  • The input must be processed timestep by timestep
  • GPU parallelism cannot be fully leveraged
  • Training is very slow for long sequences

2. Long-range dependencies (still difficult):

  • Although LSTMs improve on simple RNNs
  • The hidden state remains an information bottleneck
  • Gradients can still vanish for very long sequences (>1000 tokens)

3. Fixed context window:

  • The encoder compresses all information into a single vector
  • Information is lost for long documents
  • Early tokens may be "forgotten"

💡 Example problem

Sentence: "The cat, which was sleeping peacefully on the soft cushion all afternoon, was hungry."

  • An RNN must remember "cat" from the start of the sentence all the way to "was"
  • Distance = 12 words
  • An LSTM can handle this, but what about 100+ words? Hard!
  • Solution: the attention mechanism

8.1.2 The Attention Revolution

The attention mechanism was introduced in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015).

Key insight: there is no need to compress all the information into a single vector. The model can selectively focus on the relevant parts!
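The idea can be sketched numerically. Below is a minimal NumPy sketch of Bahdanau-style additive attention, where one decoder state scores every encoder state and the weighted sum of encoder states forms the context vector. The matrix shapes and small dimensions here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """Bahdanau-style additive attention (toy sketch).

    decoder_state: (d,)      current decoder hidden state
    encoder_states: (T, d)   all encoder hidden states
    Returns the context vector (d,) and attention weights (T,).
    """
    # Score each encoder state against the decoder state
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v  # (T,)
    # Normalize scores into a probability distribution (stable softmax)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector = weighted sum of encoder states
    context = weights @ encoder_states  # (d,)
    return context, weights

T, d, d_attn = 6, 8, 4
encoder_states = rng.normal(size=(T, d))
decoder_state = rng.normal(size=d)
W1 = rng.normal(size=(d, d_attn))
W2 = rng.normal(size=(d, d_attn))
v = rng.normal(size=d_attn)

context, weights = additive_attention(decoder_state, encoder_states, W1, W2, v)
print(weights)  # one weight per encoder position, summing to 1
```

The decoder no longer depends on a single compressed vector: at every step it recomputes a fresh context from all encoder states.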

Evolution Timeline:

timeline
    title Evolution of NLP Architectures
    2014 : Seq2Seq with RNN
         : Encoder-Decoder architecture
    2015 : Attention Mechanism
         : Bahdanau & Luong Attention
    2017 : Transformer Architecture
         : "Attention is All You Need"
         : Self-attention & Multi-head attention
    2018 : Pre-trained Language Models
         : BERT (Bidirectional)
         : GPT (Autoregressive)
    2019-2020 : Large Language Models
         : GPT-2, GPT-3
         : RoBERTa, ALBERT, T5
    2022-2023 : Foundation Models Era
         : ChatGPT, GPT-4
         : LLaMA, PaLM, Claude
    2024 : Multimodal Transformers
         : Vision-Language models
         : Code generation models


8.1.3 Why Did Transformers Win?

Advantages of Transformers:

  1. Parallelization: all tokens are processed simultaneously
  2. Long-range dependencies: direct connections via attention
  3. Scalability: scales to billions of parameters
  4. Transfer learning: the pre-training + fine-tuning paradigm
  5. Versatility: works for NLP, vision, audio, and multimodal data

Impact on industry:

import matplotlib.pyplot as plt
import numpy as np

# Visualize the impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Model size growth
years = [2017, 2018, 2019, 2020, 2021, 2022, 2023]
models = ['Transformer\nBase', 'BERT\nBase', 'GPT-2', 'GPT-3', 'PaLM', 'GPT-3.5', 'GPT-4']
params = [65e6, 110e6, 1.5e9, 175e9, 540e9, 175e9, 1000e9]  # Estimated parameters

ax1.bar(range(len(models)), np.array(params)/1e9, color='steelblue', alpha=0.7)
ax1.set_xticks(range(len(models)))
ax1.set_xticklabels(models, rotation=45, ha='right')
ax1.set_ylabel('Parameters (Billions)', fontsize=12, fontweight='bold')
ax1.set_title('Growth of Transformer Models', fontsize=14, fontweight='bold')
ax1.set_yscale('log')
ax1.grid(axis='y', alpha=0.3)

# Performance on benchmarks
benchmarks = ['SQuAD\n(QA)', 'GLUE\n(General)', 'SuperGLUE\n(Hard)', 'MMLU\n(Knowledge)']
rnn_scores = [65, 70, 45, 30]
transformer_scores = [93, 90, 89, 85]

x = np.arange(len(benchmarks))
width = 0.35

ax2.bar(x - width/2, rnn_scores, width, label='RNN/LSTM (2016)', color='coral', alpha=0.7)
ax2.bar(x + width/2, transformer_scores, width, label='Transformers (2023)', color='green', alpha=0.7)
ax2.set_ylabel('Score (%)', fontsize=12, fontweight='bold')
ax2.set_title('Performance Comparison on Benchmarks', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(benchmarks)
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3)
ax2.set_ylim([0, 100])

plt.tight_layout()
plt.show()

8.2 The Attention Mechanism: Fundamental Concepts

8.2.1 The Intuition Behind Attention

A human analogy:

When reading the sentence "The quick brown fox jumps over the lazy dog":

  • To understand "jumps", we focus on "fox" (the subject) and "over" (the direction)
  • Not all words are equally important
  • Attention weights determine the relevance of each word

Mathematical formulation:

Attention is a weighted sum of values, where the weights are determined by a compatibility function.

Scaled dot-product attention:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where:

  • Q (Query): "what I am looking for"
  • K (Key): "what I have"
  • V (Value): "the actual content"
  • \(d_k\): dimension of the keys (used for scaling)

8.2.2 Attention Step-by-Step

Example: Simple attention calculation

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def attention_mechanism(Q, K, V):
    """
    Compute attention mechanism

    Parameters:
        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)

    Returns:
        output: Attention output (seq_len_q, d_v)
        attention_weights: Attention weights (seq_len_q, seq_len_k)
    """
    # Step 1: Compute scores (dot product)
    d_k = K.shape[1]
    scores = np.matmul(Q, K.T)  # (seq_len_q, seq_len_k)

    # Step 2: Scale
    scaled_scores = scores / np.sqrt(d_k)

    # Step 3: Softmax (subtract the row max for numerical stability)
    scaled_scores -= scaled_scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scaled_scores) / np.exp(scaled_scores).sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = np.matmul(attention_weights, V)

    return output, attention_weights

# Example: Simple sentence
# "The cat sat"
# Embedding dimension = 4 (simplified)

np.random.seed(42)
seq_len = 3
d_model = 4

# Simulate word embeddings
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)

# Compute attention
output, weights = attention_mechanism(Q, K, V)

# Visualize attention weights
words = ['The', 'cat', 'sat']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Attention heatmap
sns.heatmap(weights, annot=True, fmt='.3f', cmap='YlOrRd',
            xticklabels=words, yticklabels=words,
            cbar_kws={'label': 'Attention Weight'}, ax=ax1)
ax1.set_xlabel('Key (Input)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Query (Output)', fontsize=12, fontweight='bold')
ax1.set_title('Attention Weights Heatmap', fontsize=14, fontweight='bold')

# Attention flow
for i, query_word in enumerate(words):
    ax2.barh(range(len(words)), weights[i], alpha=0.7,
             label=f'Query: {query_word}')

ax2.set_yticks(range(len(words)))
ax2.set_yticklabels(words)
ax2.set_xlabel('Attention Weight', fontsize=12, fontweight='bold')
ax2.set_ylabel('Key Words', fontsize=12, fontweight='bold')
ax2.set_title('Attention Distribution per Query', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("Attention Output Shape:", output.shape)
print("\nAttention Weights:")
print(weights)

8.2.3 Self-Attention

Self-attention is the special case in which Q, K, and V all come from the same input.

Purpose: the sequence can "attend to itself" to capture relationships between its own words.

Implementation:

class SelfAttention:
    """
    Self-Attention mechanism (NumPy implementation)
    """
    def __init__(self, d_model, d_k=None):
        """
        Parameters:
            d_model: Embedding dimension
            d_k: Key/Query dimension (default: d_model)
        """
        self.d_model = d_model
        self.d_k = d_k if d_k is not None else d_model

        # Weight matrices (randomly initialized)
        self.W_q = np.random.randn(d_model, self.d_k) * 0.01
        self.W_k = np.random.randn(d_model, self.d_k) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01

    def forward(self, X):
        """
        Forward pass

        Parameters:
            X: Input (batch, seq_len, d_model)

        Returns:
            output: Self-attention output
            attention_weights: Attention weights
        """
        # Linear projections
        Q = np.matmul(X, self.W_q)  # (batch, seq_len, d_k)
        K = np.matmul(X, self.W_k)  # (batch, seq_len, d_k)
        V = np.matmul(X, self.W_v)  # (batch, seq_len, d_model)

        # Scaled dot-product attention
        scores = np.matmul(Q, K.transpose(0, 2, 1))  # (batch, seq_len, seq_len)
        scaled_scores = scores / np.sqrt(self.d_k)

        # Softmax (subtract the row max for numerical stability)
        scaled_scores -= scaled_scores.max(axis=-1, keepdims=True)
        attention_weights = np.exp(scaled_scores) / np.exp(scaled_scores).sum(axis=-1, keepdims=True)

        # Weighted sum
        output = np.matmul(attention_weights, V)  # (batch, seq_len, d_model)

        return output, attention_weights

# Test self-attention
batch_size = 2
seq_len = 5
d_model = 8

X = np.random.randn(batch_size, seq_len, d_model)

self_attn = SelfAttention(d_model=d_model)
output, weights = self_attn.forward(X)

print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

# Visualize attention for first sample
plt.figure(figsize=(8, 6))
sns.heatmap(weights[0], annot=True, fmt='.2f', cmap='Blues',
            cbar_kws={'label': 'Attention Weight'})
plt.xlabel('Key Position', fontsize=12, fontweight='bold')
plt.ylabel('Query Position', fontsize=12, fontweight='bold')
plt.title('Self-Attention Weights (Sample 1)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

8.2.4 Multi-Head Attention

The problem with a single attention head: the model can only focus on one "aspect" or "relationship" at a time.

The solution: multi-head attention, i.e. multiple attention mechanisms running in parallel!

Intuition:

  • Head 1: focuses on syntactic relationships
  • Head 2: focuses on semantic relationships
  • Head 3: focuses on positional dependencies
  • etc.

Architecture:

graph TD
    X[Input X] --> L1[Linear Q1]
    X --> L2[Linear K1]
    X --> L3[Linear V1]
    X --> L4[Linear Q2]
    X --> L5[Linear K2]
    X --> L6[Linear V2]
    X --> L7[Linear Qh]
    X --> L8[Linear Kh]
    X --> L9[Linear Vh]

    L1 --> A1[Attention<br/>Head 1]
    L2 --> A1
    L3 --> A1

    L4 --> A2[Attention<br/>Head 2]
    L5 --> A2
    L6 --> A2

    L7 --> Ah[Attention<br/>Head h]
    L8 --> Ah
    L9 --> Ah

    A1 --> C[Concat]
    A2 --> C
    Ah --> C

    C --> LO[Linear<br/>Output]
    LO --> O[Output]

    style X fill:#e6f3ff
    style A1 fill:#ffe6e6
    style A2 fill:#ffe6e6
    style Ah fill:#ffe6e6
    style C fill:#ffffcc
    style O fill:#ccffcc


Mathematical Formulation:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

Where each head:

\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]

Implementation:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class MultiHeadAttention(layers.Layer):
    """
    Multi-Head Attention Layer (Keras)
    """
    def __init__(self, d_model, num_heads, name="multi_head_attention"):
        super(MultiHeadAttention, self).__init__(name=name)

        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        # Linear layers for Q, K, V
        self.wq = layers.Dense(d_model, name='query')
        self.wk = layers.Dense(d_model, name='key')
        self.wv = layers.Dense(d_model, name='value')

        # Output linear layer
        self.dense = layers.Dense(d_model, name='output')

    def split_heads(self, x, batch_size):
        """Split last dimension into (num_heads, depth)"""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        """Calculate attention weights"""
        # Compute Q K^T
        matmul_qk = tf.matmul(q, k, transpose_b=True)

        # Scale
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        # Mask (optional)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        # Softmax
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

        # Output
        output = tf.matmul(attention_weights, v)

        return output, attention_weights

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]

        # Linear projections
        q = self.wq(q)  # (batch, seq_len_q, d_model)
        k = self.wk(k)  # (batch, seq_len_k, d_model)
        v = self.wv(v)  # (batch, seq_len_v, d_model)

        # Split into heads
        q = self.split_heads(q, batch_size)  # (batch, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        # scaled_attention: (batch, num_heads, seq_len_q, depth)

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))

        # Final linear
        output = self.dense(concat_attention)

        return output, attention_weights

# Test multi-head attention
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

# Dummy input
query = tf.random.normal((batch_size, seq_len, d_model))
key = tf.random.normal((batch_size, seq_len, d_model))
value = tf.random.normal((batch_size, seq_len, d_model))

output, weights = mha(query, key, value)

print(f"Input shape: {query.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Number of parameters: {mha.count_params():,}")

8.3 The Transformer Architecture

8.3.1 The Complete Transformer Model

"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer architecture, which is based entirely on attention, with no recurrence or convolution.

High-level architecture:

graph TB
    subgraph "ENCODER (Left)"
        IE[Input<br/>Embedding] --> IPE[+ Positional<br/>Encoding]
        IPE --> EN1[Encoder<br/>Layer 1]
        EN1 --> EN2[Encoder<br/>Layer 2]
        EN2 --> ENn[Encoder<br/>Layer N]
    end

    subgraph "DECODER (Right)"
        OE[Output<br/>Embedding] --> OPE[+ Positional<br/>Encoding]
        OPE --> DN1[Decoder<br/>Layer 1]
        DN1 --> DN2[Decoder<br/>Layer 2]
        DN2 --> DNn[Decoder<br/>Layer N]
    end

    ENn -.->|Context| DN1
    ENn -.->|Context| DN2
    ENn -.->|Context| DNn

    DNn --> L[Linear]
    L --> S[Softmax]
    S --> OUT[Output<br/>Probabilities]

    style IE fill:#e6f3ff
    style OE fill:#e6f3ff
    style EN1 fill:#ffe6e6
    style EN2 fill:#ffe6e6
    style ENn fill:#ffe6e6
    style DN1 fill:#ffffcc
    style DN2 fill:#ffffcc
    style DNn fill:#ffffcc
    style OUT fill:#ccffcc


8.3.2 Encoder Layer Details

Each encoder layer consists of 2 sub-layers:

  1. Multi-head self-attention
  2. A position-wise feed-forward network

Both use residual connections followed by layer normalization.

Encoder layer architecture:

class EncoderLayer(layers.Layer):
    """
    Single Transformer Encoder Layer
    """
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1, name="encoder_layer"):
        """
        Parameters:
            d_model: Model dimension
            num_heads: Number of attention heads
            dff: Dimension of feed-forward network
            dropout_rate: Dropout rate
        """
        super(EncoderLayer, self).__init__(name=name)

        # Multi-head attention
        self.mha = MultiHeadAttention(d_model, num_heads)

        # Feed-forward network
        self.ffn = keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])

        # Layer normalization
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)

        # Dropout
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        """Forward pass"""

        # Multi-head attention
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual + LayerNorm

        # Feed-forward
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual + LayerNorm

        return out2

# Test encoder layer
encoder_layer = EncoderLayer(d_model=512, num_heads=8, dff=2048)

x = tf.random.normal((batch_size, seq_len, 512))
output = encoder_layer(x, training=False)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Encoder layer parameters: {encoder_layer.count_params():,}")

8.3.3 Positional Encoding

The problem: the attention mechanism has no notion of order or position! "Cat chases dog" vs. "Dog chases cat" produce the same attention weights!

The solution: positional encoding, which adds position information to the embeddings.

Sinusoidal positional encoding:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\] \[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]

Where:

  • \(pos\): Position dalam sequence
  • \(i\): Dimension index
  • \(d_{model}\): Model dimension

Implementation:

def get_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encoding

    Parameters:
        seq_len: Sequence length
        d_model: Model dimension

    Returns:
        pos_encoding: (1, seq_len, d_model)
    """
    # Create position indices
    positions = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)

    # Create dimension indices
    dimensions = np.arange(d_model)[np.newaxis, :]  # (1, d_model)

    # Calculate angle rates
    angle_rates = 1 / np.power(10000, (2 * (dimensions // 2)) / d_model)

    # Calculate angles
    angles = positions * angle_rates  # (seq_len, d_model)

    # Apply sin to even indices, cos to odd indices
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])

    # Add batch dimension
    pos_encoding = angles[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

# Generate and visualize
seq_len = 100
d_model = 128

pos_encoding = get_positional_encoding(seq_len, d_model)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap
im = axes[0].imshow(pos_encoding[0], cmap='RdBu', aspect='auto')
axes[0].set_xlabel('Dimension', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Position', fontsize=12, fontweight='bold')
axes[0].set_title('Positional Encoding Heatmap', fontsize=14, fontweight='bold')
plt.colorbar(im, ax=axes[0])

# Selected dimensions
axes[1].plot(pos_encoding[0, :, 4], label='dim 4 (sin)')
axes[1].plot(pos_encoding[0, :, 5], label='dim 5 (cos)')
axes[1].plot(pos_encoding[0, :, 32], label='dim 32 (sin)', linestyle='--')
axes[1].plot(pos_encoding[0, :, 33], label='dim 33 (cos)', linestyle='--')
axes[1].set_xlabel('Position', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Encoding Value', fontsize=12, fontweight='bold')
axes[1].set_title('Positional Encoding Patterns', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Positional encoding shape: {pos_encoding.shape}")

8.3.4 Complete Transformer Encoder

class TransformerEncoder(layers.Layer):
    """
    Transformer Encoder (Stack of N encoder layers)
    """
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position_encoding, dropout_rate=0.1,
                 name="transformer_encoder"):
        super(TransformerEncoder, self).__init__(name=name)

        self.d_model = d_model
        self.num_layers = num_layers

        # Embedding layer
        self.embedding = layers.Embedding(input_vocab_size, d_model)

        # Positional encoding
        self.pos_encoding = get_positional_encoding(maximum_position_encoding, d_model)

        # Stack of encoder layers
        self.enc_layers = [
            EncoderLayer(d_model, num_heads, dff, dropout_rate, name=f'encoder_layer_{i}')
            for i in range(num_layers)
        ]

        self.dropout = layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        seq_len = tf.shape(x)[1]

        # Embedding + positional encoding
        x = self.embedding(x)  # (batch, seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))  # Scaling
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        # Pass through encoder layers
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training, mask)

        return x

# Build complete encoder
encoder = TransformerEncoder(
    num_layers=6,
    d_model=512,
    num_heads=8,
    dff=2048,
    input_vocab_size=10000,
    maximum_position_encoding=1000,
    dropout_rate=0.1
)

# Test
sample_input = tf.random.uniform((2, 10), dtype=tf.int32, maxval=10000)
encoder_output = encoder(sample_input, training=False)

print(f"Input shape: {sample_input.shape}")
print(f"Encoder output shape: {encoder_output.shape}")
print(f"Total encoder parameters: {encoder.count_params():,}")

8.4 Pre-trained Transformers: BERT & GPT

8.4.1 The Transfer Learning Paradigm

Traditional ML: train from scratch for every task. Modern NLP: pre-train on a massive corpus, then fine-tune!

Two-stage process:

  1. Pre-training: learn general language understanding
    • Corpus: Wikipedia, books, web text (billions of words)
    • Task: self-supervised (masked-token prediction, next-word prediction)
    • Duration: days to weeks on hundreds of GPUs
  2. Fine-tuning: adapt to a specific task
    • Corpus: task-specific data (thousands of examples)
    • Task: supervised (classification, QA, NER, etc.)
    • Duration: minutes to hours on a single GPU

Benefits:

  • No need for millions of labeled examples
  • Leverages knowledge from massive pre-training
  • State-of-the-art results with less data
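In Keras terms, the fine-tuning stage often freezes most of the pre-trained weights and trains only a small task head, at least initially. Below is a minimal sketch using a stand-in base network (the layer sizes and names are illustrative, not from any real checkpoint):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Stand-in for a pre-trained encoder (in practice, loaded from a checkpoint)
base = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(32,)),
    layers.Dense(64, activation='relu'),
], name='pretrained_base')

base.trainable = False  # fine-tuning stage: freeze the pre-trained weights

# Attach a new, randomly initialized task head
inputs = tf.keras.Input(shape=(32,))
features = base(inputs)
outputs = layers.Dense(2, activation='softmax', name='task_head')(features)
model = Model(inputs, outputs)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Only the task head's parameters (64*2 weights + 2 biases) remain trainable
n_trainable = sum(int(tf.size(w)) for w in model.trainable_weights)
print(n_trainable)
```

After the head has converged, one can optionally unfreeze the base (`base.trainable = True`) and continue training with a much lower learning rate.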

8.4.2 BERT: Bidirectional Encoder Representations from Transformers

BERT (Devlin et al., 2018) was a breakthrough pre-trained model.

Key ideas:

  1. Bidirectional: attends to left and right context simultaneously
  2. Masked Language Modeling (MLM): predict masked tokens
  3. Next Sentence Prediction (NSP): understand sentence relationships

Architecture:

graph TB
    T[Text Input] --> TOK[Tokenization]
    TOK --> EMB[Token + Segment + Position<br/>Embeddings]
    EMB --> E1[Encoder 1]
    E1 --> E2[Encoder 2]
    E2 --> E3[...]
    E3 --> EN[Encoder N<br/>12 or 24 layers]

    EN --> CLS[CLS Token<br/>Sentence representation]
    EN --> TOK_OUT[Token Outputs<br/>Contextualized embeddings]

    CLS --> TASK1[Classification<br/>Sentiment, NER, etc.]
    TOK_OUT --> TASK2[Token tasks<br/>QA, NER]

    style T fill:#e6f3ff
    style EMB fill:#ffe6e6
    style EN fill:#ffffcc
    style CLS fill:#ccffcc
    style TOK_OUT fill:#ccffcc


BERT Variants:

Model        Layers   Hidden size   Heads   Parameters
BERT-Base    12       768           12      110M
BERT-Large   24       1024          16      340M
RoBERTa      24       1024          16      355M
ALBERT       12       768           12      12M
DistilBERT   6        768           12      66M
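BERT's masked language modeling objective can be illustrated with a toy corruption step: roughly 15% of the input tokens are selected, replaced by a [MASK] token, and the model is trained to recover the originals. (The sketch below omits BERT's exact 80/10/10 replacement mix for brevity; the function and vocabulary are made up for illustration.)

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for a toy MLM objective.

    Returns the corrupted tokens and a dict mapping each masked
    position to the original token the model must predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok            # prediction target at this position
            corrupted.append('[MASK]')  # what the model actually sees
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the soft cushion all afternoon".split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

Because the rest of the sentence stays visible on both sides of each [MASK], the model is forced to use bidirectional context, which is exactly what distinguishes BERT's pre-training from left-to-right language modeling.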

8.4.3 GPT: Generative Pre-trained Transformer

GPT (Radford et al., 2018) uses an autoregressive approach.

Key differences from BERT:

  1. Unidirectional: left-to-right only (decoder architecture)
  2. Causal language modeling: predict the next token
  3. Generation focus: designed for text generation

GPT evolution:

  • GPT (2018): 117M parameters
  • GPT-2 (2019): 1.5B parameters
  • GPT-3 (2020): 175B parameters
  • ChatGPT (2022): GPT-3.5 + RLHF
  • GPT-4 (2023): multimodal; parameter count undisclosed (estimates exceed 1T)
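The left-to-right constraint is enforced with a causal mask: each position may attend only to itself and to earlier positions. A minimal NumPy sketch of masking raw attention scores:

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then softmax row-wise.

    scores: (T, T) matrix of unnormalized attention scores.
    """
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -1e9, scores)             # block attention to future tokens
    # Stable softmax over each row
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = causal_attention_weights(scores)
print(np.round(weights, 3))  # the upper triangle is (numerically) zero
```

The first token can attend only to itself, so its row is all weight on position 0; the last token sees the full left context. This is the same masking hook already provided by the `mask` argument in the `MultiHeadAttention` layer above.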

8.4.4 BERT vs GPT: When to Use What?

Comparison:

Aspect         BERT                       GPT
Architecture   Encoder only               Decoder only
Direction      Bidirectional              Left-to-right
Pre-training   Masked LM                  Next-token prediction
Best for       Understanding tasks        Generation tasks
Tasks          Classification, NER, QA    Text generation, completion
Context        Full sentence              Left context only

Use Cases:

# Visualize use cases
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# BERT use cases
bert_tasks = ['Sentiment\nAnalysis', 'Named Entity\nRecognition', 'Question\nAnswering',
              'Text\nClassification', 'Similarity\nMatching']
bert_scores = [95, 93, 88, 94, 90]

ax1.barh(bert_tasks, bert_scores, color='#4285F4', alpha=0.8)
ax1.set_xlabel('Typical Performance (%)', fontsize=12, fontweight='bold')
ax1.set_title('BERT: Understanding Tasks', fontsize=14, fontweight='bold')
ax1.set_xlim([0, 100])
ax1.grid(axis='x', alpha=0.3)

# GPT use cases
gpt_tasks = ['Story\nGeneration', 'Code\nCompletion', 'Summarization',
             'Translation', 'Conversation']
gpt_scores = [92, 89, 85, 83, 95]

ax2.barh(gpt_tasks, gpt_scores, color='#34A853', alpha=0.8)
ax2.set_xlabel('Typical Performance (%)', fontsize=12, fontweight='bold')
ax2.set_title('GPT: Generation Tasks', fontsize=14, fontweight='bold')
ax2.set_xlim([0, 100])
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

8.5 The Hugging Face Transformers Library

8.5.1 Introduction to Hugging Face

Hugging Face Transformers is the most popular library for working with transformer models.

Features:

  • 100,000+ pre-trained models
  • Supports PyTorch, TensorFlow, and JAX
  • Easy-to-use APIs
  • Active community
  • A Hub for sharing models

Installation:

pip install transformers
pip install datasets
pip install evaluate

8.5.2 Loading Pre-trained Models

Basic Usage:

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

# Load BERT tokenizer and model
model_name = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Tokenizer vocab size: {tokenizer.vocab_size:,}")
print(f"Model parameters: {model.num_parameters():,}")

# Tokenize text
text = "Transformers are amazing for NLP tasks!"
tokens = tokenizer(text, return_tensors='pt')

print(f"\nOriginal text: {text}")
print(f"Tokens: {tokens['input_ids']}")
print(f"Decoded: {tokenizer.decode(tokens['input_ids'][0])}")

# Forward pass
outputs = model(**tokens)
last_hidden_state = outputs.last_hidden_state

print(f"\nOutput shape: {last_hidden_state.shape}")
print(f"  [batch_size, sequence_length, hidden_size]")

8.5.3 Pipelines for Common Tasks

Hugging Face pipelines provide a high-level API for inference.

from transformers import pipeline

# 1. Sentiment Analysis
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love using transformers for NLP!")
print("Sentiment:", result)

# 2. Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
result = ner("Elon Musk founded SpaceX in 2002 in California")
print("\nNER:", result)

# 3. Question Answering
qa = pipeline("question-answering")
context = "Transformers were introduced in 2017 by Google researchers"
question = "When were transformers introduced?"
result = qa(question=question, context=context)
print("\nQA:", result)

# 4. Text Generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50, num_return_sequences=1)
print("\nGeneration:", result[0]['generated_text'])

# 5. Summarization
summarizer = pipeline("summarization")
long_text = """
Transformers have revolutionized natural language processing.
They use attention mechanisms to process entire sequences in parallel,
making them much faster than RNNs. Pre-trained models like BERT and GPT
have achieved state-of-the-art results on numerous benchmarks.
"""
result = summarizer(long_text, max_length=50, min_length=20)
print("\nSummary:", result[0]['summary_text'])

8.5.4 Tokenization Deep Dive

Tokenizer types:

  1. Word-based: split on spaces (simple, but a large vocabulary)
  2. Character-based: individual characters (small vocabulary, long sequences)
  3. Subword: the best of both worlds (BPE, WordPiece, SentencePiece)

WordPiece tokenization (used by BERT):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example sentences
texts = [
    "Machine learning is awesome",
    "Transformers use self-attention",
    "Supercalifragilisticexpialidocious"
]

for text in texts:
    # Tokenize
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    print(f"\nText: {text}")
    print(f"Tokens: {tokens}")
    print(f"IDs: {token_ids}")

# Encoding with special tokens
encoding = tokenizer(
    "Hello world",
    add_special_tokens=True,
    max_length=10,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

print("\nEncoding keys:", encoding.keys())
print("Input IDs:", encoding['input_ids'])
print("Attention mask:", encoding['attention_mask'])
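What padding='max_length', truncation, and the attention mask produce can be sketched in plain Python (the token IDs below are only illustrative):

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad variable-length ID lists to a common length and build
    the matching attention masks (1 = real token, 0 = padding)."""
    max_length = max_length or max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                    # truncation
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)  # padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 102]])
print(ids)   # [[101, 7592, 102, 0], [101, 7592, 2088, 102]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The mask tells self-attention which positions are padding to ignore, so padded batches give the same representations for real tokens as unpadded inputs.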

8.6 Fine-tuning BERT for Sentiment Analysis

8.6.1 Dataset Preparation

from datasets import load_dataset
from transformers import AutoTokenizer

# Load IMDB dataset
dataset = load_dataset("imdb")

print(f"Train examples: {len(dataset['train']):,}")
print(f"Test examples: {len(dataset['test']):,}")

# Inspect sample
sample = dataset['train'][0]
print(f"\nSample review:")
print(f"Text: {sample['text'][:200]}...")
print(f"Label: {sample['label']}")  # 0=negative, 1=positive

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    """Tokenize texts"""
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512
    )

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

print(f"\nTokenized dataset:")
print(tokenized_datasets)

8.6.2 Model Configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate  # datasets.load_metric is deprecated; use the evaluate library

# Load pre-trained BERT for classification
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification
)

print(f"Model parameters: {model.num_parameters():,}")

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Metrics
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Compute accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
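To make the metric step concrete, here is the same logits → argmax → accuracy computation on hand-made toy values (the logits are invented for illustration):

```python
import numpy as np

# Toy logits for 4 examples and 2 classes, plus the true labels
logits = np.array([[2.0, -1.0],
                   [0.1,  0.9],
                   [1.5,  1.6],
                   [3.0,  0.2]])
labels = np.array([0, 1, 0, 0])

predictions = np.argmax(logits, axis=-1)         # predicted class per row
accuracy = float((predictions == labels).mean()) # fraction of correct rows
print(predictions.tolist(), accuracy)  # [0, 1, 1, 0] 0.75
```

Row 3 is a near-tie (1.5 vs 1.6) that argmax resolves to class 1, costing one of the four predictions, hence 75% accuracy.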

8.6.3 Training

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'].shuffle(seed=42).select(range(1000)),  # Small subset for demo
    eval_dataset=tokenized_datasets['test'].select(range(200)),
    compute_metrics=compute_metrics,
)

# Train
print("Starting training...")
trainer.train()

# Evaluate
print("\nEvaluating on test set...")
eval_results = trainer.evaluate()
print(f"Test accuracy: {eval_results['eval_accuracy']:.4f}")

8.6.4 Inference

from transformers import pipeline

# Create classifier pipeline
classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer
)

# Test predictions
test_texts = [
    "This movie was absolutely fantastic! Best film of the year!",
    "Terrible waste of time. Worst movie ever.",
    "It was okay, nothing special.",
    "Amazing acting and great story. Highly recommended!"
]

print("\nPredictions:")
for text in test_texts:
    result = classifier(text)[0]
    label = "Positive" if result['label'] == 'LABEL_1' else "Negative"
    score = result['score']
    print(f"\nText: {text}")
    print(f"Prediction: {label} (confidence: {score:.4f})")

8.7 Advanced Topics

8.7.1 Model Optimization

Techniques for production:

  1. Distillation: Compress large models

    • DistilBERT: 40% smaller, 60% faster, 97% performance
  2. Quantization: Reduce precision

    • FP32 → INT8: 4x smaller, faster inference
  3. Pruning: Remove unnecessary connections

  4. ONNX Runtime: Optimized inference engine

# Quantization example
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

print(f"Original model size: {model.num_parameters():,} parameters")
print(f"Quantized model: ~4x smaller in memory")
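The FP32 → INT8 idea can be illustrated numerically with a toy symmetric per-tensor quantizer; this is a sketch of the general principle, not a reproduction of what quantize_dynamic does internally:

```python
import numpy as np

# Toy FP32 weights
w = np.array([-0.8, -0.1, 0.0, 0.3, 0.9], dtype=np.float32)

# Symmetric quantization: map the largest magnitude onto the int8 range
scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to measure the approximation error
w_deq = q.astype(np.float32) * scale

print(q.tolist())                      # int8 codes: 1 byte each vs 4 for FP32
print(float(np.abs(w - w_deq).max()))  # small rounding error
```

Each weight now takes 1 byte instead of 4 (the ~4x memory saving), at the cost of a rounding error bounded by half the scale.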

8.7.2 Custom Tasks

Fine-tuning for a custom domain:

# Example: Cybersecurity threat detection
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch  # required for the Dataset class and label tensors below

# Custom dataset
texts = [
    "Detected SQL injection attempt in login form",
    "Normal user login activity",
    "Suspicious port scanning detected",
    "Regular API request"
]
labels = [1, 0, 1, 0]  # 1=threat, 0=normal

# Tokenize
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Create dataset
class ThreatDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

dataset = ThreatDataset(encodings, labels)

# Fine-tune
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# ... training code ...

8.8 Rangkuman & Best Practices

8.8.1 Key Takeaways

📚 Chapter Summary

1. Attention Mechanism:

  • Selective focus on relevant parts
  • Solves long-range dependencies
  • Self-attention dan multi-head attention

2. Transformer Architecture:

  • Encoder-decoder structure
  • Parallel processing (no sequential bottleneck)
  • Positional encoding for order information

3. Pre-trained Models:

  • BERT: Bidirectional, understanding tasks
  • GPT: Autoregressive, generation tasks
  • Transfer learning paradigm

4. Hugging Face:

  • Comprehensive transformers library
  • Easy access to pre-trained models
  • Pipelines for quick inference

5. Fine-tuning:

  • Adapt pre-trained models to specific tasks
  • Requires less data than training from scratch
  • State-of-the-art results

8.8.2 Best Practices

Model Selection:

  • Use BERT for classification, NER, QA
  • Use GPT for generation, completion
  • Consider the model size vs. performance trade-off

Fine-tuning:

  • Start with a small learning rate (2e-5 to 5e-5)
  • Use gradient accumulation for large effective batch sizes
  • Monitor validation loss for early stopping

Production:

  • Optimize with quantization/distillation
  • Cache tokenizer outputs
  • Batch predictions for throughput
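The batching tip can be sketched with a small generic helper (Hugging Face pipelines can also batch internally via their batch_size argument; this just shows the underlying idea):

```python
def batched(items, batch_size):
    """Yield fixed-size batches from a list; the last batch may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"log line {i}" for i in range(10)]
sizes = [len(b) for b in batched(texts, 4)]
print(sizes)  # [4, 4, 2]
```

Processing batches instead of single texts amortizes tokenization and forward-pass overhead, which matters far more on GPU where per-call latency dominates.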

8.9 Soal Latihan

Review Questions

  1. Explain the fundamental difference between RNNs and Transformers in processing sequences.

  2. What is self-attention? How does it work mathematically?

  3. Why is positional encoding needed in the Transformer?

  4. Compare BERT and GPT:

    • Architecture (encoder vs decoder)
    • Pre-training objective
    • Use cases
  5. Explain the concept of multi-head attention. What are its advantages over single-head attention?

  6. What is transfer learning in the context of NLP? Explain pre-training and fine-tuning.

  7. List 5 tasks suited to BERT and 5 suited to GPT.

  8. How does WordPiece tokenization work? What are its advantages?

  9. Explain the trade-off between model size and performance.

  10. What are attention weights and how are they interpreted?

Coding Exercises

Exercise 1: Implement Scaled Dot-Product Attention from scratch (NumPy)

Exercise 2: Fine-tune BERT for custom text classification

Exercise 3: Compare different transformer models (BERT, RoBERTa, DistilBERT) on the same task

Exercise 4: Implement text generation with GPT-2

Exercise 5: Visualize attention weights to interpret the model


🎓 Congratulations! You have completed Chapter 8 - Transformers & Attention!

In Lab 8, we will fine-tune BERT for sentiment analysis using Hugging Face Transformers! 🚀