---
title: "Chapter 8: Transformers & Attention Mechanism"
subtitle: "Modern NLP with Self-Attention, BERT, GPT & Transfer Learning"
number-sections: false
---
# Chapter 8: Transformers & Attention Mechanism {#sec-chapter-08}
::: {.callout-note}
## 🎯 Learning Outcomes
After studying this chapter, you will be able to:
1. **Understand** the attention mechanism and self-attention
2. **Identify** the Transformer architecture and its components
3. **Implement** pre-trained transformers (BERT, GPT) with Hugging Face
4. **Apply** transfer learning to NLP tasks
5. **Fine-tune** transformer models for specific domains
6. **Use** tokenizers and pipelines for inference
7. **Evaluate** transformer model performance on a variety of tasks
:::
## 8.1 From RNNs to Transformers: The NLP Revolution
### 8.1.1 Limitations of RNNs and LSTMs
**The Sequential Processing Problem:**
In Chapter 7 we studied RNNs and LSTMs, which are powerful for sequential data. However, they have **fundamental limitations**:
**1. Sequential computation** (no parallelism):
- The sequence must be processed timestep by timestep
- GPU parallelism cannot be fully exploited
- Training is very slow for long sequences
**2. Long-range dependencies** (still difficult):
- LSTMs improve on simple RNNs, but
- the hidden state remains an information bottleneck
- gradients can still vanish for very long sequences (>1000 tokens)
**3. Fixed context window**:
- The encoder compresses all information into a single vector
- Information is lost for long documents
- Early tokens may be "forgotten"
::: {.callout-tip}
## 💡 Example Problem
Sentence: "The cat, which was sleeping peacefully on the soft cushion all afternoon, **was** hungry."
- An RNN must remember "cat" from the start of the sentence all the way to "was"
- Distance = 12 words
- An LSTM can handle this, but for 100+ words? Difficult!
- **Solution**: the attention mechanism
:::
### 8.1.2 The Attention Revolution
The **attention mechanism** was introduced in the paper "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015).
**Key insight**: there is no need to compress all information into a single vector. The model can **selectively focus** on the relevant parts!
**Evolution Timeline:**
```{mermaid}
timeline
    title Evolution of NLP Architectures
    2014 : Seq2Seq with RNN
         : Encoder-Decoder architecture
    2015 : Attention Mechanism
         : Bahdanau & Luong Attention
    2017 : Transformer Architecture
         : "Attention is All You Need"
         : Self-attention & Multi-head attention
    2018 : Pre-trained Language Models
         : BERT (Bidirectional)
         : GPT (Autoregressive)
    2019-2020 : Large Language Models
         : GPT-2, GPT-3
         : RoBERTa, ALBERT, T5
    2022-2023 : Foundation Models Era
         : ChatGPT, GPT-4
         : LLaMA, PaLM, Claude
    2024 : Multimodal Transformers
         : Vision-Language models
         : Code generation models
```
### 8.1.3 Why Did Transformers Win?
**Advantages of Transformers:**
1. **Parallelization**: All tokens are processed simultaneously
2. **Long-range dependencies**: Direct connections via attention
3. **Scalability**: Can scale to billions of parameters
4. **Transfer learning**: The pre-training + fine-tuning paradigm
5. **Versatility**: Works for NLP, vision, audio, and multimodal data
**Industry Impact:**
```{python}
#| echo: true
#| code-fold: false
import matplotlib.pyplot as plt
import numpy as np
# Visualize the impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Model size growth
years = [2017, 2018, 2019, 2020, 2021, 2022, 2023]
models = ['Transformer\nBase', 'BERT\nBase', 'GPT-2', 'GPT-3', 'PaLM', 'GPT-3.5', 'GPT-4']
params = [65e6, 110e6, 1.5e9, 175e9, 540e9, 175e9, 1000e9]  # Estimated parameters

ax1.bar(range(len(models)), np.array(params) / 1e9, color='steelblue', alpha=0.7)
ax1.set_xticks(range(len(models)))
ax1.set_xticklabels(models, rotation=45, ha='right')
ax1.set_ylabel('Parameters (Billions)', fontsize=12, fontweight='bold')
ax1.set_title('Growth of Transformer Models', fontsize=14, fontweight='bold')
ax1.set_yscale('log')
ax1.grid(axis='y', alpha=0.3)

# Performance on benchmarks
benchmarks = ['SQuAD\n(QA)', 'GLUE\n(General)', 'SuperGLUE\n(Hard)', 'MMLU\n(Knowledge)']
rnn_scores = [65, 70, 45, 30]
transformer_scores = [93, 90, 89, 85]

x = np.arange(len(benchmarks))
width = 0.35

ax2.bar(x - width/2, rnn_scores, width, label='RNN/LSTM (2016)', color='coral', alpha=0.7)
ax2.bar(x + width/2, transformer_scores, width, label='Transformers (2023)', color='green', alpha=0.7)
ax2.set_ylabel('Score (%)', fontsize=12, fontweight='bold')
ax2.set_title('Performance Comparison on Benchmarks', fontsize=14, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(benchmarks)
ax2.legend(fontsize=11)
ax2.grid(axis='y', alpha=0.3)
ax2.set_ylim([0, 100])
plt.tight_layout()
plt.show()
```
## 8.2 The Attention Mechanism: Fundamental Concepts
### 8.2.1 The Intuition Behind Attention
**A Human Analogy:**
When reading the sentence "The quick brown fox jumps over the lazy dog":
- To understand "jumps", we focus on "fox" (the subject) and "over" (the direction)
- Not all words are equally important
- **Attention weights** determine the relevance of each word
**Mathematical Formulation:**
Attention is a weighted sum of values, where the weights are determined by a compatibility function between queries and keys.
**Scaled Dot-Product Attention:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- **Q** (Query): "What I'm looking for"
- **K** (Key): "What I have"
- **V** (Value): "The actual content"
- $d_k$: Dimension of the keys (used for scaling)
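Why divide by $\sqrt{d_k}$? For random vectors with unit-variance components, the variance of a dot product grows linearly with $d_k$, which pushes the softmax into saturated regions. A quick NumPy check (an illustrative sketch, not from the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Variance of q·k for unit-variance random vectors grows like d_k,
# so dividing the scores by sqrt(d_k) keeps their scale roughly constant.
for d_k in [4, 64, 512]:
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    scores = (q * k).sum(axis=1)      # 10,000 sample dot products
    scaled = scores / np.sqrt(d_k)    # the Transformer's scaling
    print(f"d_k={d_k:4d}  var(q.k)~{scores.var():7.1f}  var(scaled)~{scaled.var():.2f}")
```

The unscaled variance tracks $d_k$, while the scaled variance stays near 1, keeping the softmax in a well-conditioned regime.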
### 8.2.2 Attention Step-by-Step
**Example**: Simple attention calculation
```{python}
#| echo: true
#| code-fold: false
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def attention_mechanism(Q, K, V):
    """
    Compute scaled dot-product attention.

    Parameters:
        Q: Query matrix (seq_len_q, d_k)
        K: Key matrix (seq_len_k, d_k)
        V: Value matrix (seq_len_k, d_v)
    Returns:
        output: Attention output (seq_len_q, d_v)
        attention_weights: Attention weights (seq_len_q, seq_len_k)
    """
    # Step 1: Compute scores (dot product)
    d_k = K.shape[1]
    scores = np.matmul(Q, K.T)  # (seq_len_q, seq_len_k)

    # Step 2: Scale
    scaled_scores = scores / np.sqrt(d_k)

    # Step 3: Softmax (subtract the row max for numerical stability)
    exp_scores = np.exp(scaled_scores - scaled_scores.max(axis=-1, keepdims=True))
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 4: Weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights

# Example: the simple sentence "The cat sat"
# with embedding dimension 4 (simplified)
np.random.seed(42)
seq_len = 3
d_model = 4

# Simulate word embeddings
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)

# Compute attention
output, weights = attention_mechanism(Q, K, V)

# Visualize attention weights
words = ['The', 'cat', 'sat']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Attention heatmap
sns.heatmap(weights, annot=True, fmt='.3f', cmap='YlOrRd',
            xticklabels=words, yticklabels=words,
            cbar_kws={'label': 'Attention Weight'}, ax=ax1)
ax1.set_xlabel('Key (Input)', fontsize=12, fontweight='bold')
ax1.set_ylabel('Query (Output)', fontsize=12, fontweight='bold')
ax1.set_title('Attention Weights Heatmap', fontsize=14, fontweight='bold')

# Attention distribution per query
for i, query_word in enumerate(words):
    ax2.barh(range(len(words)), weights[i], alpha=0.7,
             label=f'Query: {query_word}')
ax2.set_yticks(range(len(words)))
ax2.set_yticklabels(words)
ax2.set_xlabel('Attention Weight', fontsize=12, fontweight='bold')
ax2.set_ylabel('Key Words', fontsize=12, fontweight='bold')
ax2.set_title('Attention Distribution per Query', fontsize=14, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("Attention Output Shape:", output.shape)
print("\nAttention Weights:")
print(weights)
```
### 8.2.3 Self-Attention
**Self-attention** is the special case in which Q, K, and V all come from the **same input**.
**Purpose**: the sequence can "attend to itself" to capture relationships between words.
**Implementation:**
```{python}
#| echo: true
#| code-fold: false
class SelfAttention:
    """
    Self-attention mechanism (NumPy implementation).
    """
    def __init__(self, d_model, d_k=None):
        """
        Parameters:
            d_model: Embedding dimension
            d_k: Key/Query dimension (default: d_model)
        """
        self.d_model = d_model
        self.d_k = d_k if d_k is not None else d_model

        # Weight matrices (randomly initialized)
        self.W_q = np.random.randn(d_model, self.d_k) * 0.01
        self.W_k = np.random.randn(d_model, self.d_k) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01

    def forward(self, X):
        """
        Forward pass.

        Parameters:
            X: Input (batch, seq_len, d_model)
        Returns:
            output: Self-attention output
            attention_weights: Attention weights
        """
        # Linear projections
        Q = np.matmul(X, self.W_q)  # (batch, seq_len, d_k)
        K = np.matmul(X, self.W_k)  # (batch, seq_len, d_k)
        V = np.matmul(X, self.W_v)  # (batch, seq_len, d_model)

        # Scaled dot-product attention
        scores = np.matmul(Q, K.transpose(0, 2, 1))  # (batch, seq_len, seq_len)
        scaled_scores = scores / np.sqrt(self.d_k)

        # Softmax (subtract the row max for numerical stability)
        exp_scores = np.exp(scaled_scores - scaled_scores.max(axis=-1, keepdims=True))
        attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

        # Weighted sum
        output = np.matmul(attention_weights, V)  # (batch, seq_len, d_model)
        return output, attention_weights

# Test self-attention
batch_size = 2
seq_len = 5
d_model = 8

X = np.random.randn(batch_size, seq_len, d_model)
self_attn = SelfAttention(d_model=d_model)
output, weights = self_attn.forward(X)

print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

# Visualize attention for the first sample
plt.figure(figsize=(8, 6))
sns.heatmap(weights[0], annot=True, fmt='.2f', cmap='Blues',
            cbar_kws={'label': 'Attention Weight'})
plt.xlabel('Key Position', fontsize=12, fontweight='bold')
plt.ylabel('Query Position', fontsize=12, fontweight='bold')
plt.title('Self-Attention Weights (Sample 1)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
```
### 8.2.4 Multi-Head Attention
**The problem with single attention**: the model can only focus on one "aspect" or "relationship" at a time.
**Solution**: **multi-head attention**, i.e. multiple attention mechanisms running in parallel!
**Intuition**:
- Head 1: focuses on syntactic relationships
- Head 2: focuses on semantic relationships
- Head 3: focuses on positional dependencies
- etc.
**Architecture:**
```{mermaid}
graph TD
X[Input X] --> L1[Linear Q1]
X --> L2[Linear K1]
X --> L3[Linear V1]
X --> L4[Linear Q2]
X --> L5[Linear K2]
X --> L6[Linear V2]
X --> L7[Linear Qh]
X --> L8[Linear Kh]
X --> L9[Linear Vh]
L1 --> A1[Attention<br/>Head 1]
L2 --> A1
L3 --> A1
L4 --> A2[Attention<br/>Head 2]
L5 --> A2
L6 --> A2
L7 --> Ah[Attention<br/>Head h]
L8 --> Ah
L9 --> Ah
A1 --> C[Concat]
A2 --> C
Ah --> C
C --> LO[Linear<br/>Output]
LO --> O[Output]
style X fill:#e6f3ff
style A1 fill:#ffe6e6
style A2 fill:#ffe6e6
style Ah fill:#ffe6e6
style C fill:#ffffcc
style O fill:#ccffcc
```
**Mathematical Formulation:**
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
Where each head:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
**Implementation:**
```{python}
#| echo: true
#| code-fold: false
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class MultiHeadAttention(layers.Layer):
    """
    Multi-Head Attention layer (Keras).
    """
    def __init__(self, d_model, num_heads, name="multi_head_attention"):
        super().__init__(name=name)
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        # Linear layers for Q, K, V
        self.wq = layers.Dense(d_model, name='query')
        self.wk = layers.Dense(d_model, name='key')
        self.wv = layers.Dense(d_model, name='value')

        # Output linear layer
        self.dense = layers.Dense(d_model, name='output')

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)."""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        """Calculate attention weights."""
        # Q · K^T
        matmul_qk = tf.matmul(q, k, transpose_b=True)

        # Scale
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        # Mask (optional)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        # Softmax
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

        # Output
        output = tf.matmul(attention_weights, v)
        return output, attention_weights

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]

        # Linear projections
        q = self.wq(q)  # (batch, seq_len_q, d_model)
        k = self.wk(k)  # (batch, seq_len_k, d_model)
        v = self.wv(v)  # (batch, seq_len_v, d_model)

        # Split into heads
        q = self.split_heads(q, batch_size)  # (batch, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        # scaled_attention: (batch, num_heads, seq_len_q, depth)

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))

        # Final linear layer
        output = self.dense(concat_attention)
        return output, attention_weights

# Test multi-head attention
d_model = 512
num_heads = 8
seq_len = 10
batch_size = 2

mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)

# Dummy input
query = tf.random.normal((batch_size, seq_len, d_model))
key = tf.random.normal((batch_size, seq_len, d_model))
value = tf.random.normal((batch_size, seq_len, d_model))

output, weights = mha(query, key, value)
print(f"Input shape: {query.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"Number of parameters: {mha.count_params():,}")
```
## 8.3 Transformer Architecture
### 8.3.1 Complete Transformer Model
**"Attention is All You Need"** (Vaswani et al., 2017) introduced the Transformer architecture, which is based entirely on attention, with no recurrence or convolution.
**High-Level Architecture:**
```{mermaid}
graph TB
subgraph "ENCODER (Left)"
IE[Input<br/>Embedding] --> IPE[+ Positional<br/>Encoding]
IPE --> EN1[Encoder<br/>Layer 1]
EN1 --> EN2[Encoder<br/>Layer 2]
EN2 --> ENn[Encoder<br/>Layer N]
end
subgraph "DECODER (Right)"
OE[Output<br/>Embedding] --> OPE[+ Positional<br/>Encoding]
OPE --> DN1[Decoder<br/>Layer 1]
DN1 --> DN2[Decoder<br/>Layer 2]
DN2 --> DNn[Decoder<br/>Layer N]
end
ENn -.->|Context| DN1
ENn -.->|Context| DN2
ENn -.->|Context| DNn
DNn --> L[Linear]
L --> S[Softmax]
S --> OUT[Output<br/>Probabilities]
style IE fill:#e6f3ff
style OE fill:#e6f3ff
style EN1 fill:#ffe6e6
style EN2 fill:#ffe6e6
style ENn fill:#ffe6e6
style DN1 fill:#ffffcc
style DN2 fill:#ffffcc
style DNn fill:#ffffcc
style OUT fill:#ccffcc
```
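In the decoder, the self-attention sub-layer must not see future tokens. A minimal NumPy sketch of the look-ahead (causal) mask, using the same convention as the `scaled_dot_product_attention` method above (1 marks a blocked position, added to the logits as `mask * -1e9`):

```python
import numpy as np

def create_look_ahead_mask(size):
    """Upper-triangular mask: query i may attend to positions <= i only."""
    # 1 above the diagonal = "blocked", 0 elsewhere = "visible"
    return np.triu(np.ones((size, size)), k=1)

mask = create_look_ahead_mask(4)
print(mask)

# Applying the mask drives blocked logits toward -inf,
# so their softmax weight becomes ~0.
logits = np.zeros((4, 4)) + mask * -1e9
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0])  # the first query attends only to the first token
```

This mask is what makes decoder self-attention autoregressive during training, even though all positions are computed in parallel.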
### 8.3.2 Encoder Layer Details
Each encoder layer consists of two sub-layers:
1. **Multi-head self-attention**
2. **A position-wise feed-forward network**
Both use **residual connections** followed by **layer normalization**.
**Encoder Layer Architecture:**
```{python}
#| echo: true
#| code-fold: false
class EncoderLayer(layers.Layer):
    """
    A single Transformer encoder layer.
    """
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1, name="encoder_layer"):
        """
        Parameters:
            d_model: Model dimension
            num_heads: Number of attention heads
            dff: Dimension of the feed-forward network
            dropout_rate: Dropout rate
        """
        super().__init__(name=name)

        # Multi-head attention
        self.mha = MultiHeadAttention(d_model, num_heads)

        # Feed-forward network
        self.ffn = keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])

        # Layer normalization
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)

        # Dropout
        self.dropout1 = layers.Dropout(dropout_rate)
        self.dropout2 = layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        """Forward pass."""
        # Multi-head self-attention
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Residual + LayerNorm

        # Feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Residual + LayerNorm

        return out2

# Test the encoder layer
encoder_layer = EncoderLayer(d_model=512, num_heads=8, dff=2048)
x = tf.random.normal((batch_size, seq_len, 512))
output = encoder_layer(x, training=False)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Encoder layer parameters: {encoder_layer.count_params():,}")
```
### 8.3.3 Positional Encoding
**Problem**: the attention mechanism has no notion of order or position!
- "Cat chases dog" vs "Dog chases cat" → the same attention weights!
**Solution**: **positional encoding** adds position information to the embeddings.
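The claim above can be checked directly: self-attention is permutation-equivariant, so without positional information, reordering the words merely reorders the output rows. A small NumPy sketch (illustrative, using identity projections so Q = K = V = X):

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    # Identity projections keep the sketch minimal
    return softmax(X @ X.T / np.sqrt(X.shape[1])) @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))   # 3 "words", d_model = 4

perm = [2, 0, 1]                  # reorder the words
out = self_attention(X)
out_perm = self_attention(X[perm])

# The output rows are permuted exactly the same way:
# word order carries no signal on its own.
print(np.allclose(out[perm], out_perm))  # True
```

This is why the sinusoidal encodings below must be added to the embeddings before attention is applied.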
**Sinusoidal Positional Encoding:**
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Where:
- $pos$: Position in the sequence
- $i$: Dimension index
- $d_{model}$: Model dimension
**Implementation:**
```{python}
#| echo: true
#| code-fold: false
def get_positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encodings.

    Parameters:
        seq_len: Sequence length
        d_model: Model dimension
    Returns:
        pos_encoding: (1, seq_len, d_model)
    """
    # Position indices
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)

    # Dimension indices
    dimensions = np.arange(d_model)[np.newaxis, :]  # (1, d_model)

    # Angle rates
    angle_rates = 1 / np.power(10000, (2 * (dimensions // 2)) / d_model)

    # Angles
    angles = positions * angle_rates                # (seq_len, d_model)

    # Apply sin to even indices, cos to odd indices
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])

    # Add a batch dimension
    pos_encoding = angles[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)

# Generate and visualize
seq_len = 100
d_model = 128
pos_encoding = get_positional_encoding(seq_len, d_model)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap
im = axes[0].imshow(pos_encoding[0], cmap='RdBu', aspect='auto')
axes[0].set_xlabel('Dimension', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Position', fontsize=12, fontweight='bold')
axes[0].set_title('Positional Encoding Heatmap', fontsize=14, fontweight='bold')
plt.colorbar(im, ax=axes[0])

# Selected dimensions
axes[1].plot(pos_encoding[0, :, 4], label='dim 4 (sin)')
axes[1].plot(pos_encoding[0, :, 5], label='dim 5 (cos)')
axes[1].plot(pos_encoding[0, :, 32], label='dim 32 (sin)', linestyle='--')
axes[1].plot(pos_encoding[0, :, 33], label='dim 33 (cos)', linestyle='--')
axes[1].set_xlabel('Position', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Encoding Value', fontsize=12, fontweight='bold')
axes[1].set_title('Positional Encoding Patterns', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Positional encoding shape: {pos_encoding.shape}")
```
### 8.3.4 Complete Transformer Encoder
```{python}
#| echo: true
#| code-fold: false
class TransformerEncoder(layers.Layer):
    """
    Transformer encoder (a stack of N encoder layers).
    """
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position_encoding, dropout_rate=0.1,
                 name="transformer_encoder"):
        super().__init__(name=name)

        self.d_model = d_model
        self.num_layers = num_layers

        # Embedding layer
        self.embedding = layers.Embedding(input_vocab_size, d_model)

        # Positional encoding
        self.pos_encoding = get_positional_encoding(maximum_position_encoding, d_model)

        # Stack of encoder layers
        self.enc_layers = [
            EncoderLayer(d_model, num_heads, dff, dropout_rate, name=f'encoder_layer_{i}')
            for i in range(num_layers)
        ]

        self.dropout = layers.Dropout(dropout_rate)

    def call(self, x, training, mask=None):
        seq_len = tf.shape(x)[1]

        # Embedding + positional encoding
        x = self.embedding(x)  # (batch, seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))  # Scaling
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)

        # Pass through the encoder layers
        for enc_layer in self.enc_layers:
            x = enc_layer(x, training, mask)

        return x

# Build the complete encoder
encoder = TransformerEncoder(
    num_layers=6,
    d_model=512,
    num_heads=8,
    dff=2048,
    input_vocab_size=10000,
    maximum_position_encoding=1000,
    dropout_rate=0.1
)

# Test
sample_input = tf.random.uniform((2, 10), dtype=tf.int32, maxval=10000)
encoder_output = encoder(sample_input, training=False)

print(f"Input shape: {sample_input.shape}")
print(f"Encoder output shape: {encoder_output.shape}")
print(f"Total encoder parameters: {encoder.count_params():,}")
```
## 8.4 Pre-trained Transformers: BERT & GPT
### 8.4.1 Transfer Learning Paradigm
**Traditional ML**: train from scratch for every task.
**Modern NLP**: pre-train on a massive corpus, then fine-tune!
**Two-Stage Process:**
1. **Pre-training**: Learn general language understanding
   - Corpus: Wikipedia, books, web text (billions of words)
   - Task: Self-supervised (mask prediction, next-word prediction)
   - Duration: Days or weeks on hundreds of GPUs
2. **Fine-tuning**: Adapt to a specific task
   - Corpus: Task-specific data (thousands of examples)
   - Task: Supervised (classification, QA, NER, etc.)
   - Duration: Minutes or hours on a single GPU
**Benefits:**
- No need for millions of labeled examples
- Leverages knowledge from massive pre-training
- State-of-the-art results with less data
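The two stages can be made concrete with a toy NumPy sketch (purely illustrative, not the actual BERT procedure): a "pre-trained" feature extractor is frozen, and only a small task head is trained on the downstream data.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Pre-trained backbone": a frozen random projection standing in for BERT
W_backbone = rng.standard_normal((16, 8)) * 0.1   # never updated below
features = lambda X: np.tanh(X @ W_backbone)      # (n, 8) representations

# Downstream task: binary classification with a small labeled set
X = rng.standard_normal((200, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Fine-tune only the head: logistic regression on the frozen features
w, b = np.zeros(8), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(features(X) @ w + b)))  # sigmoid predictions
    w -= 0.5 * (features(X).T @ (p - y) / len(y)) # gradient of the log loss
    b -= 0.5 * (p - y).mean()

acc = ((1 / (1 + np.exp(-(features(X) @ w + b))) > 0.5) == y).mean()
print(f"Head-only fine-tuning accuracy: {acc:.2f}")
```

In real fine-tuning the backbone is usually updated too (with a small learning rate); freezing it, as here, is the cheaper "feature extraction" variant.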
### 8.4.2 BERT: Bidirectional Encoder Representations
**BERT** (Devlin et al., 2018) was a breakthrough pre-trained model.
**Key Ideas:**
1. **Bidirectional**: Processes left and right context simultaneously
2. **Masked Language Model (MLM)**: Predicts masked tokens
3. **Next Sentence Prediction (NSP)**: Learns sentence relationships
**Architecture:**
```{mermaid}
graph TB
T[Text Input] --> TOK[Tokenization]
TOK --> EMB[Token + Segment + Position<br/>Embeddings]
EMB --> E1[Encoder 1]
E1 --> E2[Encoder 2]
E2 --> E3[...]
E3 --> EN[Encoder N<br/>12 or 24 layers]
EN --> CLS[CLS Token<br/>Sentence representation]
EN --> TOK_OUT[Token Outputs<br/>Contextualized embeddings]
CLS --> TASK1[Classification<br/>Sentiment, NER, etc.]
TOK_OUT --> TASK2[Token tasks<br/>QA, NER]
style T fill:#e6f3ff
style EMB fill:#ffe6e6
style EN fill:#ffffcc
style CLS fill:#ccffcc
style TOK_OUT fill:#ccffcc
```
**BERT Variants:**
| Model | Layers | Hidden Size | Heads | Parameters |
|-------|--------|-------------|-------|------------|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
| RoBERTa | 24 | 1024 | 16 | 355M |
| ALBERT | 12 | 768 | 12 | 12M |
| DistilBERT | 6 | 768 | 12 | 66M |
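BERT's MLM objective selects 15% of the tokens for prediction; of those, 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. A NumPy sketch of this masking recipe (illustrative; real implementations operate on WordPiece IDs and special tokens):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlm_mask(token_ids, vocab_size, mask_id, select_prob=0.15):
    """Return (inputs, labels) for BERT-style masked language modeling."""
    token_ids = np.asarray(token_ids)
    inputs = token_ids.copy()
    labels = np.full_like(token_ids, -100)      # -100 = ignored by the loss

    selected = rng.random(token_ids.shape) < select_prob
    labels[selected] = token_ids[selected]      # predict the original token

    roll = rng.random(token_ids.shape)
    inputs[selected & (roll < 0.8)] = mask_id   # 80% -> [MASK]
    replace = selected & (roll >= 0.8) & (roll < 0.9)
    inputs[replace] = rng.integers(0, vocab_size, token_ids.shape)[replace]  # 10% -> random
    # remaining 10%: input token kept unchanged
    return inputs, labels

tokens = rng.integers(5, 1000, size=1000)       # dummy token IDs
inputs, labels = mlm_mask(tokens, vocab_size=1000, mask_id=4)
print(f"Selected for prediction: {(labels != -100).mean():.2%}")
print(f"Replaced by [MASK]:      {(inputs == 4).mean():.2%}")
```

The random-replacement and keep-unchanged cases force the model to build useful representations for every input token, not only for the `[MASK]` symbol.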
### 8.4.3 GPT: Generative Pre-trained Transformer
**GPT** (Radford et al., 2018) uses an **autoregressive** approach.
**Key Differences from BERT:**
1. **Unidirectional**: Left-to-right only (decoder architecture)
2. **Causal language modeling**: Predicts the next token
3. **Generation focus**: Designed for text generation
**GPT Evolution:**
- **GPT** (2018): 117M parameters
- **GPT-2** (2019): 1.5B parameters
- **GPT-3** (2020): 175B parameters
- **ChatGPT** (2022): GPT-3.5 + RLHF
- **GPT-4** (2023): Multimodal, >1T parameters (estimated)
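Causal language modeling trains on shifted sequences (inputs `tokens[:-1]`, targets `tokens[1:]`) and generates one token at a time by feeding each prediction back in. A toy sketch with a hand-written bigram table standing in for the neural network (illustrative only):

```python
import numpy as np

# Toy "language model": P(next | current) as a bigram table over 4 tokens
vocab = ['<s>', 'the', 'cat', 'sat']
P = np.array([
    [0.0, 1.0, 0.0, 0.0],   # after <s>   -> "the"
    [0.0, 0.0, 1.0, 0.0],   # after "the" -> "cat"
    [0.0, 0.0, 0.0, 1.0],   # after "cat" -> "sat"
    [0.0, 1.0, 0.0, 0.0],   # after "sat" -> "the"
])

# Teacher forcing: training pairs are (token_t, token_{t+1})
seq = [0, 1, 2, 3]
pairs = list(zip(seq[:-1], seq[1:]))
print("Training pairs:", [(vocab[a], vocab[b]) for a, b in pairs])

# Greedy autoregressive decoding: feed each prediction back in
token, generated = 0, []
for _ in range(3):
    token = int(np.argmax(P[token]))  # pick the most likely next token
    generated.append(vocab[token])
print("Generated:", ' '.join(generated))  # the cat sat
```

GPT replaces the bigram table with a masked Transformer decoder conditioned on the entire left context, but the generation loop is the same.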
### 8.4.4 BERT vs GPT: When to Use What?
**Comparison:**
| Aspect | BERT | GPT |
|--------|------|-----|
| **Architecture** | Encoder only | Decoder only |
| **Direction** | Bidirectional | Left-to-right |
| **Pre-training** | Masked LM | Next token prediction |
| **Best for** | Understanding tasks | Generation tasks |
| **Tasks** | Classification, NER, QA | Text generation, completion |
| **Context** | Full sentence | Left context only |
**Use Cases:**
```{python}
#| echo: true
#| code-fold: false
# Visualize typical use cases
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# BERT use cases
bert_tasks = ['Sentiment\nAnalysis', 'Named Entity\nRecognition', 'Question\nAnswering',
              'Text\nClassification', 'Similarity\nMatching']
bert_scores = [95, 93, 88, 94, 90]

ax1.barh(bert_tasks, bert_scores, color='#4285F4', alpha=0.8)
ax1.set_xlabel('Typical Performance (%)', fontsize=12, fontweight='bold')
ax1.set_title('BERT: Understanding Tasks', fontsize=14, fontweight='bold')
ax1.set_xlim([0, 100])
ax1.grid(axis='x', alpha=0.3)

# GPT use cases
gpt_tasks = ['Story\nGeneration', 'Code\nCompletion', 'Summarization',
             'Translation', 'Conversation']
gpt_scores = [92, 89, 85, 83, 95]

ax2.barh(gpt_tasks, gpt_scores, color='#34A853', alpha=0.8)
ax2.set_xlabel('Typical Performance (%)', fontsize=12, fontweight='bold')
ax2.set_title('GPT: Generation Tasks', fontsize=14, fontweight='bold')
ax2.set_xlim([0, 100])
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()
```
## 8.5 Hugging Face Transformers Library
### 8.5.1 Introduction to Hugging Face
**Hugging Face** provides the most popular library for working with transformers.
**Features:**
- 100,000+ pre-trained models
- Supports PyTorch, TensorFlow, and JAX
- Easy-to-use APIs
- An active community
- A hub for sharing models
**Installation:**
**Installation:**
```bash
pip install transformers
pip install datasets
pip install evaluate
```
### 8.5.2 Loading Pre-trained Models
**Basic Usage:**
```{python}
#| echo: true
#| code-fold: false
#| eval: false
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

# Load the BERT tokenizer and model
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Tokenizer vocab size: {tokenizer.vocab_size:,}")
print(f"Model parameters: {model.num_parameters():,}")

# Tokenize text
text = "Transformers are amazing for NLP tasks!"
tokens = tokenizer(text, return_tensors='pt')
print(f"\nOriginal text: {text}")
print(f"Tokens: {tokens['input_ids']}")
print(f"Decoded: {tokenizer.decode(tokens['input_ids'][0])}")

# Forward pass
outputs = model(**tokens)
last_hidden_state = outputs.last_hidden_state
print(f"\nOutput shape: {last_hidden_state.shape}")
print("  [batch_size, sequence_length, hidden_size]")
```
### 8.5.3 Pipelines for Common Tasks
**Hugging Face pipelines** provide a high-level API for inference.
```{python}
#| echo: true
#| code-fold: false
#| eval: false
from transformers import pipeline

# 1. Sentiment Analysis
sentiment_analyzer = pipeline("sentiment-analysis")
result = sentiment_analyzer("I love using transformers for NLP!")
print("Sentiment:", result)

# 2. Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
result = ner("Elon Musk founded SpaceX in 2002 in California")
print("\nNER:", result)

# 3. Question Answering
qa = pipeline("question-answering")
context = "Transformers were introduced in 2017 by Google researchers"
question = "When were transformers introduced?"
result = qa(question=question, context=context)
print("\nQA:", result)

# 4. Text Generation
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=50, num_return_sequences=1)
print("\nGeneration:", result[0]['generated_text'])

# 5. Summarization
summarizer = pipeline("summarization")
long_text = """
Transformers have revolutionized natural language processing.
They use attention mechanisms to process entire sequences in parallel,
making them much faster than RNNs. Pre-trained models like BERT and GPT
have achieved state-of-the-art results on numerous benchmarks.
"""
result = summarizer(long_text, max_length=50, min_length=20)
print("\nSummary:", result[0]['summary_text'])
```
### 8.5.4 Tokenization Deep Dive
**Tokenizer Types:**
1. **Word-based**: Splits on spaces (simple, but a large vocabulary)
2. **Character-based**: Individual characters (small vocabulary, but long sequences)
3. **Subword**: The best of both worlds (BPE, WordPiece, SentencePiece)
**WordPiece Tokenization** (used by BERT):
```{python}
#| echo: true
#| code-fold: false
#| eval: false
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Example sentences
texts = [
    "Machine learning is awesome",
    "Transformers use self-attention",
    "Supercalifragilisticexpialidocious"
]

for text in texts:
    # Tokenize
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"\nText: {text}")
    print(f"Tokens: {tokens}")
    print(f"IDs: {token_ids}")

# Encoding with special tokens
encoding = tokenizer(
    "Hello world",
    add_special_tokens=True,
    max_length=10,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)

print("\nEncoding keys:", encoding.keys())
print("Input IDs:", encoding['input_ids'])
print("Attention mask:", encoding['attention_mask'])
```
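The idea behind WordPiece inference can be illustrated with a toy greedy longest-match tokenizer (a simplified sketch with a hand-picked vocabulary; the real algorithm also learns the vocabulary from data):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization (toy sketch)."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest substring first; continuation pieces get "##"
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else '##' + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            return ['[UNK]']  # no vocabulary piece matched
    return tokens

vocab = {'trans', '##form', '##ers', 'play', '##ing', '##ed'}
print(wordpiece_tokenize('transformers', vocab))  # ['trans', '##form', '##ers']
print(wordpiece_tokenize('playing', vocab))       # ['play', '##ing']
```

This is why rare words like "Supercalifragilisticexpialidocious" above split into many `##` pieces: the tokenizer keeps taking the longest prefix it knows.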
## 8.6 Fine-tuning BERT for Sentiment Analysis
### 8.6.1 Dataset Preparation
```{python}
#| echo: true
#| code-fold: false
#| eval: false
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the IMDB dataset
dataset = load_dataset("imdb")
print(f"Train examples: {len(dataset['train']):,}")
print(f"Test examples: {len(dataset['test']):,}")

# Inspect a sample
sample = dataset['train'][0]
print("\nSample review:")
print(f"Text: {sample['text'][:200]}...")
print(f"Label: {sample['label']}")  # 0 = negative, 1 = positive

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    """Tokenize a batch of texts."""
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512
    )

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print("\nTokenized dataset:")
print(tokenized_datasets)
```
### 8.6.2 Model Configuration
```{python}
#| echo: true
#| code-fold: false
#| eval: false
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

# Load pre-trained BERT for classification
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification
)
print(f"Model parameters: {model.num_parameters():,}")

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Metrics (the `evaluate` library replaces the deprecated datasets.load_metric)
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Compute accuracy."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```
### 8.6.3 Training
```{python}
#| echo: true
#| code-fold: false
#| eval: false
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'].shuffle(seed=42).select(range(1000)),  # Small subset for demo
    eval_dataset=tokenized_datasets['test'].select(range(200)),
    compute_metrics=compute_metrics,
)

# Train
print("Starting training...")
trainer.train()

# Evaluate
print("\nEvaluating on the test set...")
eval_results = trainer.evaluate()
print(f"Test accuracy: {eval_results['eval_accuracy']:.4f}")
```
### 8.6.4 Inference
```{python}
#| echo: true
#| code-fold: false
#| eval: false
from transformers import pipeline

# Create classifier pipeline
classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer
)

# Test predictions
test_texts = [
    "This movie was absolutely fantastic! Best film of the year!",
    "Terrible waste of time. Worst movie ever.",
    "It was okay, nothing special.",
    "Amazing acting and great story. Highly recommended!"
]

print("\nPredictions:")
for text in test_texts:
    result = classifier(text)[0]
    label = "Positive" if result['label'] == 'LABEL_1' else "Negative"
    score = result['score']
    print(f"\nText: {text}")
    print(f"Prediction: {label} (confidence: {score:.4f})")
```
## 8.7 Advanced Topics
### 8.7.1 Model Optimization
**Techniques untuk production:**
1. **Distillation**: Compress large models
- DistilBERT: 40% smaller, 60% faster, 97% performance
2. **Quantization**: Reduce precision
- FP32 → INT8: 4x smaller, faster inference
3. **Pruning**: Remove unnecessary connections
4. **ONNX Runtime**: Optimized inference engine
```{python}
#| echo: true
#| code-fold: false
#| eval: false
# Quantization example
from transformers import AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')

# Dynamic quantization: convert Linear layers to INT8
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)

print(f"Original model size: {model.num_parameters():,} parameters")
print(f"Quantized model: ~4x smaller in memory")
```
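Penghematan memori dari kuantisasi FP32 → INT8 dapat diperkirakan secara kasar dengan aritmetika sederhana (sketsa ilustratif; angka 110 juta parameter adalah perkiraan umum untuk `bert-base-uncased`):

```python
def model_memory_mb(num_params: int, bytes_per_param: int) -> float:
    """Estimasi memori bobot model dalam MB."""
    return num_params * bytes_per_param / (1024 ** 2)

num_params = 110_000_000  # perkiraan jumlah parameter bert-base

fp32_mb = model_memory_mb(num_params, 4)  # FP32 = 4 bytes per parameter
int8_mb = model_memory_mb(num_params, 1)  # INT8 = 1 byte per parameter

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, rasio: {fp32_mb / int8_mb:.0f}x")
```

Rasio 4x ini hanya berlaku untuk bobot yang memang dikuantisasi; pada dynamic quantization, hanya layer `Linear` yang dikonversi, sehingga penghematan aktual sedikit lebih kecil.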
### 8.7.2 Custom Tasks
**Fine-tuning untuk custom domain:**
```{python}
#| echo: true
#| code-fold: false
#| eval: false
# Example: Cybersecurity threat detection
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

# Custom dataset
texts = [
    "Detected SQL injection attempt in login form",
    "Normal user login activity",
    "Suspicious port scanning detected",
    "Regular API request"
]
labels = [1, 0, 1, 0]  # 1=threat, 0=normal

# Tokenize
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Create dataset
class ThreatDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

dataset = ThreatDataset(encodings, labels)

# Fine-tune
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# ... training code ...
```
## 8.8 Rangkuman & Best Practices
### 8.8.1 Key Takeaways
::: {.callout-note}
## 📚 Chapter Summary
**1. Attention Mechanism:**
- Selective focus pada relevant parts
- Solusi untuk long-range dependencies
- Self-attention dan multi-head attention
**2. Transformer Architecture:**
- Encoder-decoder structure
- Parallel processing (no sequential bottleneck)
- Positional encoding untuk order information
**3. Pre-trained Models:**
- **BERT**: Bidirectional, understanding tasks
- **GPT**: Autoregressive, generation tasks
- Transfer learning paradigm
**4. Hugging Face:**
- Comprehensive transformers library
- Easy access ke pre-trained models
- Pipelines untuk quick inference
**5. Fine-tuning:**
- Adapt pre-trained models ke specific tasks
- Requires less data than training from scratch
- State-of-the-art results
:::
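Sebagai rekap poin self-attention di atas, berikut sketsa minimal scaled dot-product attention dalam NumPy (implementasi ilustratif untuk satu head tanpa masking, bukan kode produksi):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity scores
    # softmax per baris (dikurangi max untuk stabilitas numerik)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Contoh kecil: 3 token, dimensi 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # (3, 4)
print(weights.sum(axis=-1))  # setiap baris berjumlah ~1.0
```

Matriks `weights` inilah yang divisualisasikan ketika kita menginterpretasi "token mana memperhatikan token mana".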
### 8.8.2 Best Practices
**Model Selection:**
- Use BERT untuk classification, NER, QA
- Use GPT untuk generation, completion
- Consider model size vs performance trade-off
**Fine-tuning:**
- Start dengan small learning rate (2e-5 to 5e-5)
- Use gradient accumulation untuk large batches
- Monitor validation loss untuk early stopping
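Logika gradient accumulation pada poin di atas dapat disketsa dalam Python murni (tanpa framework; `grad` dan nama variabel lain hanya placeholder ilustratif untuk `loss.backward()` dan `optimizer.step()` pada framework nyata):

```python
# Akumulasi gradien selama N micro-batch sebelum satu optimizer step.
# Efektif batch size = micro_batch_size * accumulation_steps.
accumulation_steps = 4
micro_batches = list(range(12))  # 12 micro-batch dummy

optimizer_steps = 0
accumulated = 0.0
for i, batch in enumerate(micro_batches):
    grad = 1.0  # placeholder untuk gradien hasil backward() pada micro-batch
    accumulated += grad / accumulation_steps  # normalisasi agar setara rata-rata batch besar
    if (i + 1) % accumulation_steps == 0:
        # di framework nyata: optimizer.step(); optimizer.zero_grad()
        optimizer_steps += 1
        accumulated = 0.0

print(optimizer_steps)  # 3 optimizer step untuk 12 micro-batch
```

Pada Hugging Face `Trainer`, pola yang sama diaktifkan lewat parameter `gradient_accumulation_steps` di `TrainingArguments`.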
**Production:**
- Optimize dengan quantization/distillation
- Cache tokenizer outputs
- Batch predictions untuk throughput
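Untuk poin batching di atas, pola dasarnya adalah memotong daftar input menjadi batch sebelum dikirim ke model (sketsa utilitas sederhana; pipeline Hugging Face sendiri juga menerima list input dan parameter `batch_size` secara langsung):

```python
def batched(items, batch_size):
    """Bagi list input menjadi batch berukuran batch_size."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = [f"review {i}" for i in range(10)]
batches = list(batched(texts, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Mengirim batch alih-alih satu teks per panggilan mengurangi overhead per-request dan memanfaatkan paralelisme GPU, yang menjadi keunggulan utama Transformer dibanding RNN.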
## 8.9 Soal Latihan
### Review Questions
1. Jelaskan perbedaan fundamental antara RNN dan Transformer dalam memproses sequences.
2. Apa itu self-attention? Bagaimana cara kerjanya secara matematis?
3. Mengapa positional encoding diperlukan dalam Transformer?
4. Bandingkan BERT dan GPT:
- Arsitektur (encoder vs decoder)
- Pre-training objective
- Use cases
5. Jelaskan konsep multi-head attention. Apa keuntungannya dibanding single-head?
6. Apa itu transfer learning dalam context NLP? Jelaskan pre-training dan fine-tuning.
7. Sebutkan 5 tasks yang cocok untuk BERT dan 5 untuk GPT.
8. Bagaimana WordPiece tokenization bekerja? Apa advantagenya?
9. Jelaskan trade-off antara model size dan performance.
10. Apa itu attention weights dan bagaimana interpretasinya?
### Coding Exercises
**Exercise 1**: Implement Scaled Dot-Product Attention dari scratch (NumPy)
**Exercise 2**: Fine-tune BERT untuk custom text classification
**Exercise 3**: Compare different transformer models (BERT, RoBERTa, DistilBERT) pada same task
**Exercise 4**: Implement text generation dengan GPT-2
**Exercise 5**: Visualize attention weights untuk interpreting model
---
**🎓 Selamat! Anda telah menyelesaikan Chapter 8 - Transformers & Attention!**
**Di Lab 8, kita akan fine-tune BERT untuk sentiment analysis dengan Hugging Face Transformers! 🚀**