Chapter 8: Transformers & the Attention Mechanism

Modern NLP with Self-Attention, BERT, GPT & Transfer Learning

🎯 Learning Outcomes

After studying this chapter, you will be able to:

  1. Understand the attention mechanism and self-attention
  2. Identify the Transformer architecture and its components
  3. Implement pre-trained transformers (BERT, GPT) with Hugging Face
  4. Apply transfer learning to NLP tasks
  5. Fine-tune transformer models for specific domains
  6. Use tokenizers and pipelines for inference
  7. Evaluate transformer model performance across a variety of tasks

8.1 From RNNs to Transformers: The NLP Revolution

8.1.1 Limitations of RNNs and LSTMs

The sequential-processing problem:

In Chapter 7 we studied RNNs and LSTMs, which are powerful for sequential data. However, they have fundamental limitations:

1. Sequential computation (no parallelism):

  • The input must be processed timestep by timestep
  • GPU parallelism cannot be fully leveraged
  • Training is very slow for long sequences

2. Long-range dependencies (still difficult):

  • Although LSTMs improve on simple RNNs
  • The hidden state remains an information bottleneck
  • Gradients can still vanish for very long sequences (>1000 tokens)

3. Fixed context window:

  • The encoder compresses all information into a single vector
  • Information is lost for long documents
  • Early tokens may be "forgotten"

💡 Example problem

Sentence: "The cat, which was sleeping peacefully on the soft cushion all afternoon, was hungry."

  • An RNN must remember "cat" from the start of the sentence all the way to "was"
  • Distance = 12 words
  • An LSTM can handle this, but what about 100+ words? Hard!
  • Solution: the attention mechanism

8.1.2 The Attention Revolution

The attention mechanism was introduced in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015).

Key insight: there is no need to compress all the information into a single vector. The model can selectively focus on the relevant parts!
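The idea can be sketched numerically. Below is a minimal NumPy sketch of Bahdanau-style additive attention, where one decoder state scores every encoder state and the weighted sum of encoder states forms the context vector. The matrix shapes and small dimensions here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """Bahdanau-style additive attention (toy sketch).

    decoder_state: (d,)      current decoder hidden state
    encoder_states: (T, d)   all encoder hidden states
    Returns the context vector (d,) and attention weights (T,).
    """
    # Score each encoder state against the decoder state
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v  # (T,)
    # Normalize scores into a probability distribution (stable softmax)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector = weighted sum of encoder states
    context = weights @ encoder_states  # (d,)
    return context, weights

T, d, d_attn = 6, 8, 4
encoder_states = rng.normal(size=(T, d))
decoder_state = rng.normal(size=d)
W1 = rng.normal(size=(d, d_attn))
W2 = rng.normal(size=(d, d_attn))
v = rng.normal(size=d_attn)

context, weights = additive_attention(decoder_state, encoder_states, W1, W2, v)
print(weights)  # one weight per encoder position, summing to 1
```

The decoder no longer depends on a single compressed vector: at every step it recomputes a fresh context from all encoder states.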

Evolution Timeline:

timeline
    title Evolution of NLP Architectures
    2014 : Seq2Seq with RNN
         : Encoder-Decoder architecture
    2015 : Attention Mechanism
         : Bahdanau & Luong Attention
    2017 : Transformer Architecture
         : "Attention is All You Need"
         : Self-attention & Multi-head attention
    2018 : Pre-trained Language Models
         : BERT (Bidirectional)
         : GPT (Autoregressive)
    2019-2020 : Large Language Models
         : GPT-2, GPT-3
         : RoBERTa, ALBERT, T5
    2022-2023 : Foundation Models Era
         : ChatGPT, GPT-4
         : LLaMA, PaLM, Claude
    2024 : Multimodal Transformers
         : Vision-Language models
         : Code generation models


8.1.3 Why Did Transformers Win?

Advantages of Transformers:

  1. Parallelization: all tokens are processed simultaneously
  2. Long-range dependencies: direct connections via attention
  3. Scalability: scales to billions of parameters
  4. Transfer learning: the pre-training + fine-tuning paradigm
  5. Versatility: works for NLP, vision, audio, and multimodal data

Impact on industry:

import matplotlib.pyplot as plt
import numpy as np

# Visualize the impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Model size growth
years = [2017, 2018, 2019, 2020, 2021, 2022, 2023]
models = ['Transformer\nBase', 'BERT\nBase', 'GPT-2', 'GPT-3', 'PaLM', 'GPT-3.5', 'GPT-4']
params = [65e6, 110e6, 1.5e9, 175e9, 540e9, 175e9, 1000e9]  # Estimated parameters

ax1.bar(range(len(models)), np.array(params)/1e9, color='steelblue', alpha=0.7)
ax1.set_xticks(range(len(models)))
ax1.set_xticklabels(models, rotation=45, ha='right')
ax1.set_ylabel('Parameters (Billions)', fontsize=12, fontweight='bold')
ax1.set_title('Growth of Transformer Models', fontsize=14, fontweight='bold')
ax1.set_yscale('log')
ax1.grid(axis='y', alpha=0.3)

# Performance on benchmarks
benchmarks = ['SQuAD\n(QA)', 'GLUE\n(General)', 'SuperGLUE\n(Hard)', 'MMLU\n(Knowledge)']
rnn_scores = [65, 70, 45, 30]
transformer_scores = [93, 90, 89, 85]

x = np.arange(len(benchmarks))
width = 0.35

ax2.bar(x - width/2, rnn_scores, width, label='RNN/LSTM (2016)', color='coral', alpha=0.7)
ax2.bar(x + width/2, transformer_scores, width, label='Transformers (2023)', color='green', alpha=0.7)
ax2.set_ylabel('Score (%)', fontsize=12, fontweight='bold')
ax2.set_title('Performance Comparison on Benchmarks', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(benchmarks)
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3)
ax2.set_ylim([0, 100])

plt.tight_layout()
plt.show()

8.2 The Attention Mechanism: Fundamental Concepts

8.2.1 The Intuition Behind Attention

A human analogy:

When reading the sentence "The quick brown fox jumps over the lazy dog":

  • To understand "jumps", we focus on "fox" (the subject) and "over" (the direction)
  • Not all words are equally important
  • Attention weights determine the relevance of each word

Mathematical formulation:

Attention is a weighted sum of values, where the weights are determined by a compatibility function.

Scaled dot-product attention:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where:

  • Q (Query): "what I am looking for"
  • K (Key): "what I have"
  • V (Value): "the actual content"
  • \(d_k\): dimension of the keys (used for scaling)

8.2.2 Attention Step-by-Step

Example: Simple attention calculation

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def attention_mechanism(Q, K, V):
    """
    Compute attention mechanism

    Parameters:
        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)

    Returns:
        output: Attention output (seq_len_q, d_v)
        attention_weights: Attention weights (seq_len_q, seq_len_k)
    """
    # Step 1: Compute scores (dot product)
    d_k = K.shape[1]
    scores = np.matmul(Q, K.T)  # (seq_len_q, seq_len_k)

    # Step 2: Scale
    scaled_scores = scores / np.sqrt(d_k)

    # Step 3: Softmax (subtract the row max for numerical stability)
    scaled_scores -= scaled_scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scaled_scores) / np.exp(scaled_scores).sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = np.matmul(attention_weights, V)

    return output, attention_weights

# Example: Simple sentence
# "The cat sat"
# Embedding dimension = 4 (simplified)

np.random.seed(42)
seq_len = 3
d_model = 4

# Simulate word embeddings
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)

# Compute attention
output, weights = attention_mechanism(Q, K, V)

# Visualize attention weights
words = ['The', 'cat', 'sat']

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Attention heatmap
sns.heatmap(weights, annot=True, fmt='.3f', cmap='YlOrRd',
            xticklabels=words, yticklabels=words,
            cbar_kws={'label': 'Attention Weight'}, ax=ax1)
ax1.set_xlabel('Key (Input)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Query (Output)', fontsize=12, fontweight='bold')
ax1.set_title('Attention Weights Heatmap', fontsize=14, fontweight='bold')

# Attention flow
for i, query_word in enumerate(words):
    ax2.barh(range(len(words)), weights[i], alpha=0.7,
             label=f'Query: {query_word}')

ax2.set_yticks(range(len(words)))
ax2.set_yticklabels(words)
ax2.set_xlabel('Attention Weight', fontsize=12, fontweight='bold')
ax2.set_ylabel('Key Words', fontsize=12, fontweight='bold')
ax2.set_title('Attention Distribution per Query', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("Attention Output Shape:", output.shape)
print("\nAttention Weights:")
print(weights)

8.2.3 Self-Attention

Self-attention is the special case in which Q, K, and V all come from the same input.

Purpose: the sequence can "attend to itself" to capture relationships between its own words.

Implementation:

class SelfAttention:
    """
    Self-Attention mechanism (NumPy implementation)
    """
    def __init__(self, d_model, d_k=None):
        """
        Parameters:
            d_model: Embedding dimension
            d_k: Key/Query dimension (default: d_model)
        """
        self.d_model = d_model
        self.d_k = d_k if d_k is not None else d_model

        # Weight matrices (randomly initialized)
        self.W_q = np.random.randn(d_model, self.d_k) * 0.01
        self.W_k = np.random.randn(d_model, self.d_k) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01

    def forward(self, X):
        """
        Forward pass

        Parameters:
            X: Input (batch, seq_len, d_model)

        Returns:
            output: Self-attention output
            attention_weights: Attention weights
        """
        # Linear projections
        Q = np.matmul(X, self.W_q)  # (batch, seq_len, d_k)
        K = np.matmul(X, self.W_k)  # (batch, seq_len, d_k)
        V = np.matmul(X, self.W_v)  # (batch, seq_len, d_model)

        # Scaled dot-product attention
        scores = np.matmul(Q, K.transpose(0, 2, 1))  # (batch, seq_len, seq_len)
        scaled_scores = scores / np.sqrt(self.d_k)

        # Softmax (subtract the row max for numerical stability)
        scaled_scores -= scaled_scores.max(axis=-1, keepdims=True)
        attention_weights = np.exp(scaled_scores) / np.exp(scaled_scores).sum(axis=-1, keepdims=True)

        # Weighted sum
        output = np.matmul(attention_weights, V)  # (batch, seq_len, d_model)

        return output, attention_weights

# Test self-attention
batch_size = 2
seq_len = 5
d_model = 8

X = np.random.randn(batch_size, seq_len, d_model)

self_attn = SelfAttention(d_model=d_model)
output, weights = self_attn.forward(X)

print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

# Visualize attention for first sample
plt.figure(figsize=(8, 6))
sns.heatmap(weights[0], annot=True, fmt='.2f', cmap='Blues',
            cbar_kws={'label': 'Attention Weight'})
plt.xlabel('Key Position', fontsize=12, fontweight='bold')
plt.ylabel('Query Position', fontsize=12, fontweight='bold')
plt.title('Self-Attention Weights (Sample 1)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

8.2.4 Multi-Head Attention

The problem with a single attention head: the model can only focus on one "aspect" or "relationship" at a time.

The solution: multi-head attention, i.e. multiple attention mechanisms running in parallel!

Intuition:

  • Head 1: focuses on syntactic relationships
  • Head 2: focuses on semantic relationships
  • Head 3: focuses on positional dependencies
  • etc.

Architecture:

graph TD
    X[Input X] --> L1[Linear Q1]
    X --> L2[Linear K1]
    X --> L3[Linear V1]
    X --> L4[Linear Q2]
    X --> L5[Linear K2]
    X --> L6[Linear V2]
    X --> L7[Linear Qh]
    X --> L8[Linear Kh]
    X --> L9[Linear Vh]

    L1 --> A1[Attention<br/>Head 1]
    L2 --> A1
    L3 --> A1

    L4 --> A2[Attention<br/>Head 2]
    L5 --> A2
    L6 --> A2

    L7 --> Ah[Attention<br/>Head h]
    L8 --> Ah
    L9 --> Ah

    A1 --> C[Concat]
    A2 --> C
    Ah --> C

    C --> LO[Linear<br/>Output]
    LO --> O[Output]

    style X fill:#e6f3ff
    style A1 fill:#ffe6e6
    style A2 fill:#ffe6e6
    style Ah fill:#ffe6e6
    style C fill:#ffffcc
    style O fill:#ccffcc


Mathematical Formulation:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

Where each head:

\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]

Implementation:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class MultiHeadAttention(layers.Layer):
    """
    Multi-Head Attention Layer (Keras)
    """
    def __init__(self, d_model, num_heads, name="multi_head_attention"):
        super(MultiHeadAttention, self).__init__(name=name)

        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        # Linear layers for Q, K, V
        self.wq = layers.Dense(d_model, name='query')
        self.wk = layers.Dense(d_model, name='key')
        self.wv = layers.Dense(d_model, name='value')

        # Output linear layer
        self.dense = layers.Dense(d_model, name='output')

    def split_heads(self, x, batch_size):
        """Split last dimension into (num_heads, depth)"""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        """Calculate attention weights"""
        # Compute Q K^T
        matmul_qk = tf.matmul(q, k, transpose_b=True)

        # Scale
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        # Mask (optional)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        # Softmax
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

        # Output
        output = tf.matmul(attention_weights, v)

        return output, attention_weights

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]

        # Linear projections
        q = self.wq(q)  # (batch, seq_len_q, d_model)
        k = self.wk(k)  # (batch, seq_len_k, d_model)
        v = self.wv(v)  # (batch, seq_len_v, d_model)

        # Split into heads
        q = self.split_heads(q, batch_size)  # (batch, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        # scaled_attention: (batch, num_heads, seq_len_q, depth)

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))

        # Final linear
        output = self.dense(concat_attention)

        return output, attention_weights

# Test multi-head attention
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

# Dummy input
query = tf.random.normal((batch_size, seq_len, d_model))
key = tf.random.normal((batch_size, seq_len, d_model))
value = tf.random.normal((batch_size, seq_len, d_model))

output, weights = mha(query, key, value)

print(f"Input shape: {query.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Number of parameters: {mha.count_params():,}")

8.3 The Transformer Architecture

8.3.1 The Complete Transformer Model

"Attention Is All You Need" (Vaswani et al., 2017) introduced the Transformer architecture, which is based entirely on attention, with no recurrence or convolution.

High-level architecture:

graph TB
    subgraph "ENCODER (Left)"
        IE[Input<br/>Embedding] --> IPE[+ Positional<br/>Encoding]
        IPE --> EN1[Encoder<br/>Layer 1]
        EN1 --> EN2[Encoder<br/>Layer 2]
        EN2 --> ENn[Encoder<br/>Layer N]
    end

    subgraph "DECODER (Right)"
        OE[Output<br/>Embedding] --> OPE[+ Positional<br/>Encoding]
        OPE --> DN1[Decoder<br/>Layer 1]
        DN1 --> DN2[Decoder<br/>Layer 2]
        DN2 --> DNn[Decoder<br/>Layer N]
    end

    ENn -.->|Context| DN1
    ENn -.->|Context| DN2
    ENn -.->|Context| DNn

    DNn --> L[Linear]
    L --> S[Softmax]
    S --> OUT[Output<br/>Probabilities]

    style IE fill:#e6f3ff
    style OE fill:#e6f3ff
    style EN1 fill:#ffe6e6
    style EN2 fill:#ffe6e6
    style ENn fill:#ffe6e6
    style DN1 fill:#ffffcc
    style DN2 fill:#ffffcc
    style DNn fill:#ffffcc
    style OUT fill:#ccffcc


8.3.2 Encoder Layer Details

Each encoder layer consists of 2 sub-layers:

  1. Multi-head self-attention
  2. A position-wise feed-forward network

Both use residual connections followed by layer normalization.

Encoder layer architecture:

class EncoderLayer(layers.Layer):
    """
    Single Transformer Encoder Layer
    """
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1, name="encoder_layer"):
        """
        Parameters:
            d_model: Model dimension
            num_heads: Number of attention heads
            dff: Dimension of feed-forward network
            dropout_rate: Dropout rate
        """
        super(EncoderLayer, self).__init__(name=name)

        # Multi-head attention
        self.mha = MultiHeadAttention(d_model, num_heads)

        # Feed-forward network
        self.ffn = keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])

        # Layer normalization
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)

        # Dropout
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        """Forward pass"""

        # Multi-head attention
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual + LayerNorm

        # Feed-forward
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual + LayerNorm

        return out2

# Test encoder layer
encoder_layer = EncoderLayer(d_model=512, num_heads=8, dff=2048)

x = tf.random.normal((batch_size, seq_len, 512))
output = encoder_layer(x, training=False)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Encoder layer parameters: {encoder_layer.count_params():,}")

8.3.3 Positional Encoding

The problem: the attention mechanism has no notion of order or position! "Cat chases dog" vs. "Dog chases cat" produce the same attention weights!

The solution: positional encoding, which adds position information to the embeddings.

Sinusoidal positional encoding:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\] \[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]

Where:

  • \(pos\): Position dalam sequence
  • \(i\): Dimension index
  • \(d_{model}\): Model dimension

Implementation:

def get_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encoding

    Parameters:
        seq_len: Sequence length
        d_model: Model dimension

    Returns:
        pos_encoding: (1, seq_len, d_model)
    """
    # Create position indices
    positions = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)

    # Create dimension indices
    dimensions = np.arange(d_model)[np.newaxis, :]  # (1, d_model)

    # Calculate angle rates
    angle_rates = 1 / np.power(10000, (2 * (dimensions // 2)) / d_model)

    # Calculate angles
    angles = positions * angle_rates  # (seq_len, d_model)

    # Apply sin to even indices, cos to odd indices
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])

    # Add batch dimension
    pos_encoding = angles[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

# Generate and visualize
seq_len = 100
d_model = 128

pos_encoding = get_positional_encoding(seq_len, d_model)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap
im = axes[0].imshow(pos_encoding[0], cmap='RdBu', aspect='auto')
axes[0].set_xlabel('Dimension', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Position', fontsize=12, fontweight='bold')
axes[0].set_title('Positional Encoding Heatmap', fontsize=14, fontweight='bold')
plt.colorbar(im, ax=axes[0])

# Selected dimensions
axes[1].plot(pos_encoding[0, :, 4], label='dim 4 (sin)')
axes[1].plot(pos_encoding[0, :, 5], label='dim 5 (cos)')
axes[1].plot(pos_encoding[0, :, 32], label='dim 32 (sin)', linestyle='--')
axes[1].plot(pos_encoding[0, :, 33], label='dim 33 (cos)', linestyle='--')
axes[1].set_xlabel('Position', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Encoding Value', fontsize=12, fontweight='bold')
axes[1].set_title('Positional Encoding Patterns', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Positional encoding shape: {pos_encoding.shape}")

8.3.4 Complete Transformer Encoder

class TransformerEncoder(layers.Layer):
    """
    Transformer Encoder (Stack of N encoder layers)
    """
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position_encoding, dropout_rate=0.1,
                 name="transformer_encoder"):
        super(TransformerEncoder, self).__init__(name=name)

        self.d_model = d_model
        self.num_layers = num_layers

        # Embedding layer
        self.embedding = layers.Embedding(input_vocab_size, d_model)

        # Positional encoding
        self.pos_encoding = get_positional_encoding(maximum_position_encoding, d_model)

        # Stack of encoder layers
        self.enc_layers = [
            EncoderLayer(d_model, num_heads, dff, dropout_rate, name=f'encoder_layer_{i}')
            for i in range(num_layers)
        ]

        self.dropout = layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        seq_len = tf.shape(x)[1]

        # Embedding + positional encoding
        x = self.embedding(x)  # (batch, seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))  # Scaling
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        # Pass through encoder layers
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training, mask)

        return x

# Build complete encoder
encoder = TransformerEncoder(
    num_layers=6,
    d_model=512,
    num_heads=8,
    dff=2048,
    input_vocab_size=10000,
    maximum_position_encoding=1000,
    dropout_rate=0.1
)

# Test
sample_input = tf.random.uniform((2, 10), dtype=tf.int32, maxval=10000)
encoder_output = encoder(sample_input, training=False)

print(f"Input shape: {sample_input.shape}")
print(f"Encoder output shape: {encoder_output.shape}")
print(f"Total encoder parameters: {encoder.count_params():,}")

8.4 Pre-trained Transformers: BERT & GPT

8.4.1 The Transfer Learning Paradigm

Traditional ML: train from scratch for every task. Modern NLP: pre-train on a massive corpus, then fine-tune!

Two-stage process:

  1. Pre-training: learn general language understanding
    • Corpus: Wikipedia, books, web text (billions of words)
    • Task: self-supervised (masked-token prediction, next-word prediction)
    • Duration: days to weeks on hundreds of GPUs
  2. Fine-tuning: adapt to a specific task
    • Corpus: task-specific data (thousands of examples)
    • Task: supervised (classification, QA, NER, etc.)
    • Duration: minutes to hours on a single GPU

Benefits:

  • No need for millions of labeled examples
  • Leverages knowledge from massive pre-training
  • State-of-the-art results with less data
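In Keras terms, the fine-tuning stage often freezes most of the pre-trained weights and trains only a small task head, at least initially. Below is a minimal sketch using a stand-in base network (the layer sizes and names are illustrative, not from any real checkpoint):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Stand-in for a pre-trained encoder (in practice, loaded from a checkpoint)
base = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(32,)),
    layers.Dense(64, activation='relu'),
], name='pretrained_base')

base.trainable = False  # fine-tuning stage: freeze the pre-trained weights

# Attach a new, randomly initialized task head
inputs = tf.keras.Input(shape=(32,))
features = base(inputs)
outputs = layers.Dense(2, activation='softmax', name='task_head')(features)
model = Model(inputs, outputs)

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Only the task head's parameters (64*2 weights + 2 biases) remain trainable
n_trainable = sum(int(tf.size(w)) for w in model.trainable_weights)
print(n_trainable)
```

After the head has converged, one can optionally unfreeze the base (`base.trainable = True`) and continue training with a much lower learning rate.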

8.4.2 BERT: Bidirectional Encoder Representations from Transformers

BERT (Devlin et al., 2018) was a breakthrough pre-trained model.

Key ideas:

  1. Bidirectional: attends to left and right context simultaneously
  2. Masked Language Modeling (MLM): predict masked tokens
  3. Next Sentence Prediction (NSP): understand sentence relationships

Architecture:

graph TB
    T[Text Input] --> TOK[Tokenization]
    TOK --> EMB[Token + Segment + Position<br/>Embeddings]
    EMB --> E1[Encoder 1]
    E1 --> E2[Encoder 2]
    E2 --> E3[...]
    E3 --> EN[Encoder N<br/>12 or 24 layers]

    EN --> CLS[CLS Token<br/>Sentence representation]
    EN --> TOK_OUT[Token Outputs<br/>Contextualized embeddings]

    CLS --> TASK1[Classification<br/>Sentiment, NER, etc.]
    TOK_OUT --> TASK2[Token tasks<br/>QA, NER]

    style T fill:#e6f3ff
    style EMB fill:#ffe6e6
    style EN fill:#ffffcc
    style CLS fill:#ccffcc
    style TOK_OUT fill:#ccffcc


BERT Variants:

Model        Layers   Hidden size   Heads   Parameters
BERT-Base    12       768           12      110M
BERT-Large   24       1024          16      340M
RoBERTa      24       1024          16      355M
ALBERT       12       768           12      12M
DistilBERT   6        768           12      66M
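BERT's masked language modeling objective can be illustrated with a toy corruption step: roughly 15% of the input tokens are selected, replaced by a [MASK] token, and the model is trained to recover the originals. (The sketch below omits BERT's exact 80/10/10 replacement mix for brevity; the function and vocabulary are made up for illustration.)

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for a toy MLM objective.

    Returns the corrupted tokens and a dict mapping each masked
    position to the original token the model must predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok            # prediction target at this position
            corrupted.append('[MASK]')  # what the model actually sees
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the soft cushion all afternoon".split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

Because the rest of the sentence stays visible on both sides of each [MASK], the model is forced to use bidirectional context, which is exactly what distinguishes BERT's pre-training from left-to-right language modeling.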

8.4.3 GPT: Generative Pre-trained Transformer

GPT (Radford et al., 2018) uses an autoregressive approach.

Key differences from BERT:

  1. Unidirectional: left-to-right only (decoder architecture)
  2. Causal language modeling: predict the next token
  3. Generation focus: designed for text generation

GPT evolution:

  • GPT (2018): 117M parameters
  • GPT-2 (2019): 1.5B parameters
  • GPT-3 (2020): 175B parameters
  • ChatGPT (2022): GPT-3.5 + RLHF
  • GPT-4 (2023): multimodal; parameter count undisclosed (estimates exceed 1T)
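The left-to-right constraint is enforced with a causal mask: each position may attend only to itself and to earlier positions. A minimal NumPy sketch of masking raw attention scores:

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then softmax row-wise.

    scores: (T, T) matrix of unnormalized attention scores.
    """
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -1e9, scores)             # block attention to future tokens
    # Stable softmax over each row
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = causal_attention_weights(scores)
print(np.round(weights, 3))  # the upper triangle is (numerically) zero
```

The first token can attend only to itself, so its row is all weight on position 0; the last token sees the full left context. This is the same masking hook already provided by the `mask` argument in the `MultiHeadAttention` layer above.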

8.4.4 BERT vs GPT: When to Use What?

Comparison:

Aspect         BERT                       GPT
Architecture   Encoder only               Decoder only
Direction      Bidirectional              Left-to-right
Pre-training   Masked LM                  Next-token prediction
Best for       Understanding tasks        Generation tasks
Tasks          Classification, NER, QA    Text generation, completion
Context        Full sentence              Left context only

Use Cases:

# Visualize use cases
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# BERT use cases
bert_tasks = ['Sentiment\nAnalysis', 'Named Entity\nRecognition', 'Question\nAnswering',
              'Text\nClassification', 'Similarity\nMatching']
bert_scores = [95, 93, 88, 94, 90]

ax1.barh(bert_tasks, bert_scores, color='#4285F4', alpha=0.8)
ax1.set_xlabel('Typical Performance (%)', fontsize=12, fontweight='bold')
ax1.set_title('BERT: Understanding Tasks', fontsize=14, fontweight='bold')
ax1.set_xlim([0, 100])
ax1.grid(axis='x', alpha=0.3)

# GPT use cases
gpt_tasks = ['Story\nGeneration', 'Code\nCompletion', 'Summarization',
             'Translation', 'Conversation']
gpt_scores = [92, 89, 85, 83, 95]

ax2.barh(gpt_tasks, gpt_scores, color='#34A853', alpha=0.8)
ax2.set_xlabel('Typical Performance (%)', fontsize=12, fontweight='bold')
ax2.set_title('GPT: Generation Tasks', fontsize=14, fontweight='bold')
ax2.set_xlim([0, 100])
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

8.5 The Hugging Face Transformers Library

8.5.1 Introduction to Hugging Face

Hugging Face Transformers is the most popular library for working with transformer models.

Features:

  • 100,000+ pre-trained models
  • Supports PyTorch, TensorFlow, and JAX
  • Easy-to-use APIs
  • Active community
  • A Hub for sharing models

Installation:

pip install transformers
pip install datasets
pip install evaluate

8.5.2 Loading Pre-trained Models

Basic Usage:

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

# Load BERT tokenizer and model
model_name = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Tokenizer vocab size: {tokenizer.vocab_size:,}")
print(f"Model parameters: {model.num_parameters():,}")

# Tokenize text
text = "Transformers are amazing for NLP tasks!"
tokens = tokenizer(text, return_tensors='pt')

print(f"\nOriginal text: {text}")
print(f"Tokens: {tokens['input_ids']}")
print(f"Decoded: {tokenizer.decode(tokens['input_ids'][0])}")

# Forward pass
outputs = model(**tokens)
last_hidden_state = outputs.last_hidden_state

print(f"\nOutput shape: {last_hidden_state.shape}")
print(f"  [batch_size, sequence_length, hidden_size]")

8.5.3 Pipelines for Common Tasks

Hugging Face pipelines provide a high-level API for inference.

from transformers import pipeline

# 1. Sentiment Analysis
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love using transformers for NLP!")
print("Sentiment:", result)

# 2. Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
result = ner("Elon Musk founded SpaceX in 2002 in California")
print("\nNER:", result)

# 3. Question Answering
qa = pipeline("question-answering")
context = "Transformers were introduced in 2017 by Google researchers"
question = "When were transformers introduced?"
result = qa(question=question, context=context)
print("\nQA:", result)

# 4. Text Generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50, num_return_sequences=1)
print("\nGeneration:", result[0]['generated_text'])

# 5. Summarization
summarizer = pipeline("summarization")
long_text = """
Transformers have revolutionized natural language processing.
They use attention mechanisms to process entire sequences in parallel,
making them much faster than RNNs. Pre-trained models like BERT and GPT
have achieved state-of-the-art results on numerous benchmarks.
"""
result = summarizer(long_text, max_length=50, min_length=20)
print("\nSummary:", result[0]['summary_text'])

8.5.4 Tokenization Deep Dive

Tokenizer types:

  1. Word-based: split on spaces (simple, but a large vocabulary)
  2. Character-based: individual characters (small vocabulary, long sequences)
  3. Subword: the best of both worlds (BPE, WordPiece, SentencePiece)

WordPiece tokenization (used by BERT):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example sentences
texts = [
    "Machine learning is awesome",
    "Transformers use self-attention",
    "Supercalifragilisticexpialidocious"
]

for text in texts:
    # Tokenize
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)

    print(f"\nText: {text}")
    print(f"Tokens: {tokens}")
    print(f"IDs: {token_ids}")

# Encoding with special tokens
encoding = tokenizer(
    "Hello world",
    add_special_tokens=True,
    max_length=10,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

print("\nEncoding keys:", encoding.keys())
print("Input IDs:", encoding['input_ids'])
print("Attention mask:", encoding['attention_mask'])
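What padding='max_length', truncation, and the attention mask produce can be sketched in plain Python (the token IDs below are only illustrative):

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad variable-length ID lists to a common length and build
    the matching attention masks (1 = real token, 0 = padding)."""
    max_length = max_length or max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                    # truncation
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)  # padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 102]])
print(ids)   # [[101, 7592, 102, 0], [101, 7592, 2088, 102]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The mask tells self-attention which positions are padding to ignore, so padded batches give the same representations for real tokens as unpadded inputs.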

8.6 Fine-tuning BERT for Sentiment Analysis

8.6.1 Dataset Preparation

from datasets import load_dataset
from transformers import AutoTokenizer

# Load IMDB dataset
dataset = load_dataset("imdb")

print(f"Train examples: {len(dataset['train']):,}")
print(f"Test examples: {len(dataset['test']):,}")

# Inspect sample
sample = dataset['train'][0]
print(f"\nSample review:")
print(f"Text: {sample['text'][:200]}...")
print(f"Label: {sample['label']}")  # 0=negative, 1=positive

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    """Tokenize texts"""
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512
    )

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

print(f"\nTokenized dataset:")
print(tokenized_datasets)

8.6.2 Model Configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate  # datasets.load_metric is deprecated; use the evaluate library

# Load pre-trained BERT for classification
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification
)

print(f"Model parameters: {model.num_parameters():,}")

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Metrics
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Compute accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
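To make the metric step concrete, here is the same logits → argmax → accuracy computation on hand-made toy values (the logits are invented for illustration):

```python
import numpy as np

# Toy logits for 4 examples and 2 classes, plus the true labels
logits = np.array([[2.0, -1.0],
                   [0.1,  0.9],
                   [1.5,  1.6],
                   [3.0,  0.2]])
labels = np.array([0, 1, 0, 0])

predictions = np.argmax(logits, axis=-1)         # predicted class per row
accuracy = float((predictions == labels).mean()) # fraction of correct rows
print(predictions.tolist(), accuracy)  # [0, 1, 1, 0] 0.75
```

Row 3 is a near-tie (1.5 vs 1.6) that argmax resolves to class 1, costing one of the four predictions, hence 75% accuracy.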

8.6.3 Training

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'].shuffle(seed=42).select(range(1000)),  # Small subset for demo
    eval_dataset=tokenized_datasets['test'].select(range(200)),
    compute_metrics=compute_metrics,
)

# Train
print("Starting training...")
trainer.train()

# Evaluate
print("\nEvaluating on test set...")
eval_results = trainer.evaluate()
print(f"Test accuracy: {eval_results['eval_accuracy']:.4f}")

8.6.4 Inference

from transformers import pipeline

# Create classifier pipeline
classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer
)

# Test predictions
test_texts = [
    "This movie was absolutely fantastic! Best film of the year!",
    "Terrible waste of time. Worst movie ever.",
    "It was okay, nothing special.",
    "Amazing acting and great story. Highly recommended!"
]

print("\nPredictions:")
for text in test_texts:
    result = classifier(text)[0]
    label = "Positive" if result['label'] == 'LABEL_1' else "Negative"
    score = result['score']
    print(f"\nText: {text}")
    print(f"Prediction: {label} (confidence: {score:.4f})")

8.7 Advanced Topics

8.7.1 Model Optimization

Techniques for production:

  1. Distillation: Compress large models

    • DistilBERT: 40% smaller, 60% faster, 97% performance
  2. Quantization: Reduce precision

    • FP32 → INT8: 4x smaller, faster inference
  3. Pruning: Remove unnecessary connections

  4. ONNX Runtime: Optimized inference engine

# Quantization example
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

print(f"Original model size: {model.num_parameters():,} parameters")
print(f"Quantized model: ~4x smaller in memory")
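The FP32 → INT8 idea can be illustrated numerically with a toy symmetric per-tensor quantizer; this is a sketch of the general principle, not a reproduction of what quantize_dynamic does internally:

```python
import numpy as np

# Toy FP32 weights
w = np.array([-0.8, -0.1, 0.0, 0.3, 0.9], dtype=np.float32)

# Symmetric quantization: map the largest magnitude onto the int8 range
scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to measure the approximation error
w_deq = q.astype(np.float32) * scale

print(q.tolist())                      # int8 codes: 1 byte each vs 4 for FP32
print(float(np.abs(w - w_deq).max()))  # small rounding error
```

Each weight now takes 1 byte instead of 4 (the ~4x memory saving), at the cost of a rounding error bounded by half the scale.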

8.7.2 Custom Tasks

Fine-tuning for a custom domain:

# Example: Cybersecurity threat detection
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch  # required for the Dataset class and label tensors below

# Custom dataset
texts = [
    "Detected SQL injection attempt in login form",
    "Normal user login activity",
    "Suspicious port scanning detected",
    "Regular API request"
]
labels = [1, 0, 1, 0]  # 1=threat, 0=normal

# Tokenize
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Create dataset
class ThreatDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

dataset = ThreatDataset(encodings, labels)

# Fine-tune
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# ... training code ...

8.8 Rangkuman & Best Practices

8.8.1 Key Takeaways

📚 Chapter Summary

1. Attention Mechanism:

  • Selective focus on relevant parts
  • Solves long-range dependencies
  • Self-attention dan multi-head attention

2. Transformer Architecture:

  • Encoder-decoder structure
  • Parallel processing (no sequential bottleneck)
  • Positional encoding for order information

3. Pre-trained Models:

  • BERT: Bidirectional, understanding tasks
  • GPT: Autoregressive, generation tasks
  • Transfer learning paradigm

4. Hugging Face:

  • Comprehensive transformers library
  • Easy access to pre-trained models
  • Pipelines for quick inference

5. Fine-tuning:

  • Adapt pre-trained models to specific tasks
  • Requires less data than training from scratch
  • State-of-the-art results

8.8.2 Best Practices

Model Selection:

  • Use BERT for classification, NER, QA
  • Use GPT for generation, completion
  • Consider the model size vs. performance trade-off

Fine-tuning:

  • Start with a small learning rate (2e-5 to 5e-5)
  • Use gradient accumulation for large effective batch sizes
  • Monitor validation loss for early stopping

Production:

  • Optimize with quantization/distillation
  • Cache tokenizer outputs
  • Batch predictions for throughput
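The batching tip can be sketched with a small generic helper (Hugging Face pipelines can also batch internally via their batch_size argument; this just shows the underlying idea):

```python
def batched(items, batch_size):
    """Yield fixed-size batches from a list; the last batch may be smaller."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"log line {i}" for i in range(10)]
sizes = [len(b) for b in batched(texts, 4)]
print(sizes)  # [4, 4, 2]
```

Processing batches instead of single texts amortizes tokenization and forward-pass overhead, which matters far more on GPU where per-call latency dominates.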

8.9 Soal Latihan

Review Questions

  1. Explain the fundamental difference between RNNs and Transformers in processing sequences.

  2. What is self-attention? How does it work mathematically?

  3. Why is positional encoding needed in the Transformer?

  4. Compare BERT and GPT:

    • Architecture (encoder vs decoder)
    • Pre-training objective
    • Use cases
  5. Explain the concept of multi-head attention. What are its advantages over single-head attention?

  6. What is transfer learning in the context of NLP? Explain pre-training and fine-tuning.

  7. List 5 tasks suited to BERT and 5 suited to GPT.

  8. How does WordPiece tokenization work? What are its advantages?

  9. Explain the trade-off between model size and performance.

  10. What are attention weights and how are they interpreted?

Coding Exercises

Exercise 1: Implement Scaled Dot-Product Attention from scratch (NumPy)

Exercise 2: Fine-tune BERT for custom text classification

Exercise 3: Compare different transformer models (BERT, RoBERTa, DistilBERT) on the same task

Exercise 4: Implement text generation with GPT-2

Exercise 5: Visualize attention weights to interpret the model


🎓 Congratulations! You have completed Chapter 8 - Transformers & Attention!

In Lab 8, we will fine-tune BERT for sentiment analysis using Hugging Face Transformers! 🚀