---
title: "Chapter 7: Recurrent Neural Networks, LSTM & Sequence Modeling"
subtitle: "Deep Learning for Sequential Data: Time Series, NLP & Sequential Prediction"
number-sections: false
---
# Chapter 7: Recurrent Neural Networks, LSTM & Sequence Modeling {#sec-chapter-07}
::: {.callout-note}
## 🎯 Learning Outcomes
After studying this chapter, you will be able to:
1. **Understand** the RNN architecture and how it processes sequential data
2. **Identify** the vanishing gradient problem and its solutions (LSTM, GRU)
3. **Implement** RNN, LSTM, and GRU models for time series forecasting
4. **Apply** bidirectional and stacked RNNs for better performance
5. **Use** sequence-to-sequence models for NLP applications
6. **Evaluate** sequential model performance with appropriate metrics
:::
## 7.1 Introduction to Sequential Data and RNNs
### 7.1.1 Why Does Sequential Data Need a Special Architecture?
**The problem with feedforward networks for sequential data:**
In Chapters 5 and 6 we studied MLPs and CNNs, which work with **fixed-size inputs**. However, much real-world data is **sequential**, with temporal dependencies:
**Examples of sequential data:**
- **Time series**: stock prices, temperature, energy consumption
- **Text**: sentences, documents, conversations
- **Audio**: speech, music, environmental sound
- **Video**: frame sequences
- **DNA sequences**: genomic data
**Limitations of feedforward networks:**
1. **No memory**: each input is processed independently
2. **Fixed input size**: variable-length sequences cannot be handled
3. **No temporal relationships**: order information is lost
4. **Parameter explosion**: each timestep would need its own parameters
::: {.callout-tip}
## 💡 RNN Intuition
RNNs address these problems with:
- **Hidden state (memory)**: stores information from previous timesteps
- **Parameter sharing**: the same weights are reused at every timestep
- **Variable-length sequences**: sequences of different lengths can be processed
- **Temporal modeling**: dependencies between timesteps are captured
**Analogy**: like reading a book - you understand each sentence based on the words that came before it!
:::
### 7.1.2 Evolution of Sequential Models
**Pre-RNN Era:**
- Hidden Markov Models (HMM)
- Autoregressive models (AR, ARMA, ARIMA)
- Manual feature engineering
- **Limitation**: Cannot learn long-term dependencies
**RNN Era (1990s-2010s):**
- **Simple RNN (1986)**: First recurrent architecture
- **LSTM (1997)**: Long Short-Term Memory - solved vanishing gradient
- **GRU (2014)**: Gated Recurrent Unit - simplified LSTM
- **Bidirectional RNN (1997)**: Process sequence forward & backward
**Modern Era (2014-now):**
- **Attention mechanisms (2014)**: additive attention for seq2seq models
- **Transformers (2017)**: "Attention Is All You Need" - self-attention replaced RNNs for NLP
- **BERT, GPT (2018-now)**: large language models
- **But**: RNNs remain relevant for time series, small datasets, and interpretability
::: {.callout-note}
## 📊 RNN Applications Today
**Industry Applications:**
- **Finance**: Stock prediction, algorithmic trading
- **Energy**: Load forecasting, demand prediction
- **Healthcare**: Patient monitoring, disease progression
- **Manufacturing**: Predictive maintenance, quality control
- **NLP**: Machine translation, text generation, sentiment analysis
- **Speech**: Speech recognition, text-to-speech
:::
### 7.1.3 Types of Sequential Problems
RNNs can handle various types of sequential problems:
```{mermaid}
graph LR
A[One-to-One<br/>Traditional NN<br/>Image Classification] --> B[One-to-Many<br/>Image Captioning<br/>Music Generation]
B --> C[Many-to-One<br/>Sentiment Analysis<br/>Video Classification]
C --> D[Many-to-Many<br/>same length<br/>Video Frame Labeling]
D --> E[Many-to-Many<br/>diff length<br/>Machine Translation]
style A fill:#ffcccc
style B fill:#ffe6cc
style C fill:#ffffcc
style D fill:#ccffcc
style E fill:#ccccff
```
**Detailed Explanation:**
1. **One-to-One**: Standard feedforward NN
- Input: Single vector
- Output: Single vector
- Example: Image classification
2. **One-to-Many**: Sequence generation
- Input: Single vector (or an initial condition)
- Output: Sequence
- Example: Image captioning, music generation
3. **Many-to-One**: Sequence classification
- Input: Sequence
- Output: Single vector
- Example: Sentiment analysis, time series classification
4. **Many-to-Many (same length)**: Synchronized sequence
- Input: Sequence
- Output: Sequence (same length)
- Example: Video frame labeling, POS tagging
5. **Many-to-Many (different length)**: Sequence-to-sequence
- Input: Sequence
- Output: Sequence (different length)
- Example: Machine translation, text summarization
## 7.2 Simple RNN: Fundamentals
### 7.2.1 RNN Architecture Basics
A **recurrent neural network** processes sequences using a **hidden state** that is updated at every timestep.
**Mathematical Formulation:**
For each timestep $t$:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
Where:
- $x_t$: input at timestep $t$
- $h_t$: hidden state at timestep $t$
- $y_t$: output at timestep $t$
- $W_{hh}$: hidden-to-hidden weight matrix
- $W_{xh}$: input-to-hidden weight matrix
- $W_{hy}$: hidden-to-output weight matrix
- $b_h$, $b_y$: bias terms
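The two equations above translate almost line for line into NumPy. The following sketch (dimensions and random weights are illustrative choices, not from the text) runs the recurrence over a toy sequence; note that the same three weight matrices are reused at every timestep (parameter sharing):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, T = 3, 4, 2, 5  # toy sizes

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(x_seq):
    """x_seq: (T, input_dim) -> outputs (T, output_dim)."""
    h = np.zeros(hidden_dim)                        # h_0
    outputs = []
    for x_t in x_seq:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)    # h_t
        outputs.append(W_hy @ h + b_y)              # y_t
    return np.stack(outputs)

x_seq = rng.normal(size=(T, input_dim))
y_seq = rnn_forward(x_seq)
print(y_seq.shape)  # (5, 2): one output per timestep (many-to-many);
                    # y_seq[-1] alone would be the many-to-one case
```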
**RNN Visualization:**
```{python}
#| echo: true
#| code-fold: false
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import FancyArrowPatch

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Folded representation
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('RNN: Folded Representation', fontsize=14, fontweight='bold')

# Input
input_rect = patches.Rectangle((1, 1), 2, 1, linewidth=2, edgecolor='blue', facecolor='lightblue')
ax.add_patch(input_rect)
ax.text(2, 1.5, r'$x_t$', ha='center', va='center', fontsize=12, fontweight='bold')

# Hidden state (RNN cell)
rnn_rect = patches.Rectangle((1, 4), 2, 2, linewidth=3, edgecolor='red', facecolor='lightyellow')
ax.add_patch(rnn_rect)
ax.text(2, 5, 'RNN', ha='center', va='center', fontsize=12, fontweight='bold')

# Output
output_rect = patches.Rectangle((1, 8), 2, 1, linewidth=2, edgecolor='green', facecolor='lightgreen')
ax.add_patch(output_rect)
ax.text(2, 8.5, r'$y_t$', ha='center', va='center', fontsize=12, fontweight='bold')

# Arrows
ax.arrow(2, 2.2, 0, 1.5, head_width=0.2, head_length=0.2, fc='blue', ec='blue', linewidth=2)
ax.arrow(2, 6.2, 0, 1.5, head_width=0.2, head_length=0.2, fc='green', ec='green', linewidth=2)

# Recurrent connection
arrow_loop = FancyArrowPatch((3.2, 5.5), (3.2, 4.5),
                             arrowstyle='->', mutation_scale=20, linewidth=2.5,
                             color='red', connectionstyle="arc3,rad=1.5")
ax.add_patch(arrow_loop)
ax.text(5, 5, r'$h_{t-1}$', fontsize=11, color='red', fontweight='bold')

# Unfolded representation
ax = axes[1]
ax.set_xlim(0, 16)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('RNN: Unfolded Representation (Through Time)', fontsize=14, fontweight='bold')

timesteps = [2, 6, 10, 14]
for i, t in enumerate(timesteps):
    # Input
    input_rect = patches.Rectangle((t - 0.5, 1), 1, 0.8, linewidth=2, edgecolor='blue', facecolor='lightblue')
    ax.add_patch(input_rect)
    ax.text(t, 1.4, f'$x_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')
    # Hidden state
    rnn_rect = patches.Rectangle((t - 0.5, 4), 1, 1.5, linewidth=2.5, edgecolor='red', facecolor='lightyellow')
    ax.add_patch(rnn_rect)
    ax.text(t, 4.75, f'$h_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')
    # Output
    output_rect = patches.Rectangle((t - 0.5, 8), 1, 0.8, linewidth=2, edgecolor='green', facecolor='lightgreen')
    ax.add_patch(output_rect)
    ax.text(t, 8.4, f'$y_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')
    # Vertical arrows
    ax.arrow(t, 2, 0, 1.8, head_width=0.15, head_length=0.15, fc='blue', ec='blue', linewidth=1.5)
    ax.arrow(t, 5.7, 0, 2, head_width=0.15, head_length=0.15, fc='green', ec='green', linewidth=1.5)
    # Horizontal arrows (recurrent connections)
    if i < len(timesteps) - 1:
        ax.arrow(t + 0.6, 4.75, 3, 0, head_width=0.2, head_length=0.2, fc='red', ec='red', linewidth=2)

# Time axis
ax.text(8, 0.3, 'Time →', ha='center', fontsize=12, fontweight='bold', style='italic')

plt.tight_layout()
plt.show()
```
### 7.2.2 Simple RNN Implementation
**Keras Implementation:**
```{python}
#| echo: true
#| code-fold: false
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Simple RNN for many-to-one classification
def build_simple_rnn_classifier(sequence_length=10, input_dim=1,
                                hidden_units=32, num_classes=2):
    """
    Simple RNN for sequence classification.

    Parameters:
        sequence_length: length of the input sequence
        input_dim: feature dimension at each timestep
        hidden_units: number of units in the RNN layer
        num_classes: number of classes

    Returns:
        model: compiled Keras model
    """
    model = keras.Sequential([
        # SimpleRNN layer
        layers.SimpleRNN(
            units=hidden_units,
            activation='tanh',
            return_sequences=False,  # Many-to-one: keep only the last output
            input_shape=(sequence_length, input_dim),
            name='simple_rnn'
        ),
        # Dense layer for classification
        layers.Dense(num_classes, activation='softmax', name='output')
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Build model
simple_rnn = build_simple_rnn_classifier()
simple_rnn.summary()
```
**PyTorch Implementation:**
```{python}
#| echo: true
#| code-fold: false
import torch
import torch.nn as nn

class SimpleRNNClassifier(nn.Module):
    """
    Simple RNN for sequence classification (PyTorch).
    """
    def __init__(self, input_dim=1, hidden_dim=32, num_layers=1, num_classes=2):
        super(SimpleRNNClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        # RNN layer
        self.rnn = nn.RNN(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            nonlinearity='tanh'
        )
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        # RNN forward pass
        out, hn = self.rnn(x, h0)
        # out shape: (batch, seq_len, hidden_dim)
        # hn shape:  (num_layers, batch, hidden_dim)
        # Take the output from the last timestep
        out = out[:, -1, :]  # (batch, hidden_dim)
        # Fully connected layer
        out = self.fc(out)  # (batch, num_classes)
        return out

# Instantiate model
pytorch_rnn = SimpleRNNClassifier(input_dim=1, hidden_dim=32, num_classes=2)
print(pytorch_rnn)

# Test forward pass
dummy_input = torch.randn(4, 10, 1)  # (batch=4, seq_len=10, features=1)
output = pytorch_rnn(dummy_input)
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")
```
### 7.2.3 The Vanishing Gradient Problem
The main weakness of the simple RNN is the **vanishing gradient** problem.
During backpropagation through time (BPTT), gradients must propagate through many timesteps. Because this involves repeated multiplication by Jacobians whose norm is typically less than 1, the gradients shrink toward zero (vanish).
**Mathematical Explanation:**
The gradient with respect to an early timestep is:
$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$
If $\left\|\frac{\partial h_t}{\partial h_{t-1}}\right\| < 1$, the gradient decays exponentially!
**Consequences:**
- Long-term dependencies cannot be learned
- Early timesteps receive almost no gradient signal
- The model only learns short-term patterns
**Visualizing the Vanishing Gradient:**
```{python}
#| echo: true
#| code-fold: false
# Demonstration of the vanishing gradient
def demonstrate_vanishing_gradient():
    """
    Simulate how gradients vanish over timesteps.
    """
    timesteps = 50
    # Simulate gradient flow with different weight values
    gradients_small_w = []
    gradients_good_w = []
    gradients_large_w = []
    initial_gradient = 1.0
    for w in [0.9, 1.0, 1.1]:
        gradient = initial_gradient
        gradient_history = [gradient]
        for t in range(timesteps):
            gradient = gradient * w  # Simplified gradient flow
            gradient_history.append(gradient)
        if w == 0.9:
            gradients_small_w = gradient_history
        elif w == 1.0:
            gradients_good_w = gradient_history
        else:
            gradients_large_w = gradient_history

    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))
    # Linear scale
    ax1.plot(gradients_small_w, label='W = 0.9 (Vanishing)', linewidth=2.5, color='red')
    ax1.plot(gradients_good_w, label='W = 1.0 (Stable)', linewidth=2.5, color='green', linestyle='--')
    ax1.plot(gradients_large_w, label='W = 1.1 (Exploding)', linewidth=2.5, color='blue')
    ax1.set_xlabel('Timesteps (backward)', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
    ax1.set_title('Gradient Flow (Linear Scale)', fontsize=14, fontweight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(alpha=0.3)
    # Log scale
    ax2.semilogy(np.abs(gradients_small_w), label='W = 0.9 (Vanishing)', linewidth=2.5, color='red')
    ax2.semilogy(np.abs(gradients_good_w), label='W = 1.0 (Stable)', linewidth=2.5, color='green', linestyle='--')
    ax2.semilogy(np.abs(gradients_large_w), label='W = 1.1 (Exploding)', linewidth=2.5, color='blue')
    ax2.set_xlabel('Timesteps (backward)', fontsize=12, fontweight='bold')
    ax2.set_ylabel('Gradient Magnitude (log scale)', fontsize=12, fontweight='bold')
    ax2.set_title('Gradient Flow (Log Scale)', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=11)
    ax2.grid(alpha=0.3, which='both')
    plt.tight_layout()
    plt.show()

    print("Gradient after 50 timesteps:")
    print(f"  W=0.9 (vanishing): {gradients_small_w[-1]:.2e}")
    print(f"  W=1.0 (stable):    {gradients_good_w[-1]:.2e}")
    print(f"  W=1.1 (exploding): {gradients_large_w[-1]:.2e}")

demonstrate_vanishing_gradient()
```
::: {.callout-warning}
## ⚠️ Vanishing vs Exploding Gradients
**Vanishing gradients** (more common):
- Gradients → 0
- Long-term dependencies cannot be learned
- **Solution**: LSTM, GRU, skip connections
**Exploding gradients** (less common):
- Gradients → ∞
- Training becomes unstable, producing NaN values
- **Solution**: gradient clipping, careful initialization
:::
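To make the gradient-clipping remedy concrete, here is a minimal NumPy sketch of clipping by global norm; in practice you would rely on framework built-ins such as `torch.nn.utils.clip_grad_norm_` or the `clipnorm` argument of Keras optimizers.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads, global_norm

# Two parameter tensors with large ("exploding") gradients
grads = [np.full((2, 2), 10.0), np.full(3, 10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)  # ~26.46 -> 1.0
```

Clipping rescales all gradients by a common factor, so the update direction is preserved; only its magnitude is bounded.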
## 7.3 LSTM: Long Short-Term Memory
### 7.3.1 LSTM Architecture
**LSTM** was designed specifically to overcome the vanishing gradient problem through **gating mechanisms**.
**Key Components:**
1. **Cell state** ($C_t$): long-term memory highway
2. **Hidden state** ($h_t$): short-term memory (the output)
3. **Forget gate** ($f_t$): decides what to discard from the cell state
4. **Input gate** ($i_t$): decides what new information to store
5. **Output gate** ($o_t$): decides what to output
**LSTM Equations:**
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$ Forget gate
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$ Input gate
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$ Candidate values
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$ Cell-state update
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$ Output gate
$$h_t = o_t \odot \tanh(C_t)$$ Hidden state
Where:
- $\sigma$: Sigmoid function (output 0-1, acts as gate)
- $\odot$: Element-wise multiplication
- $\tanh$: Hyperbolic tangent (output -1 to 1)
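The gate equations can be transcribed directly into a single-step NumPy function. The sketch below uses toy dimensions and random weights for illustration (real implementations fuse the four matrix multiplications into one for speed):

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 4          # toy sizes
concat_dim = hidden_dim + input_dim   # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_f, W_i, W_C, W_o = (rng.normal(scale=0.1, size=(hidden_dim, concat_dim)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(hidden_dim)

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate values
    C_t = f_t * C_prev + i_t * C_tilde     # additive cell-state update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)               # hidden state
    return h_t, C_t

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, C = lstm_step(rng.normal(size=input_dim), h, C)
print(h.shape, C.shape)  # (4,) (4,)
```

Note how the cell state is only ever scaled by $f_t$ and added to, never pushed through a squashing nonlinearity between steps; this is the "highway" that preserves gradients.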
**LSTM Cell Visualization:**
```{python}
#| echo: true
#| code-fold: false
# LSTM cell diagram
fig, ax = plt.subplots(figsize=(16, 10))
ax.set_xlim(0, 16)
ax.set_ylim(0, 12)
ax.axis('off')
ax.set_title('LSTM Cell Architecture', fontsize=16, fontweight='bold', pad=20)

# Cell state (top highway)
ax.plot([1, 15], [10, 10], 'k-', linewidth=4, label='Cell State ($C_t$)')
ax.text(0.3, 10, r'$C_{t-1}$', fontsize=12, fontweight='bold', va='center')
ax.text(15.3, 10, r'$C_t$', fontsize=12, fontweight='bold', va='center')

# Forget gate
forget_rect = patches.Rectangle((3, 8), 1.5, 1.5, linewidth=2, edgecolor='red',
                                facecolor='lightcoral', alpha=0.7)
ax.add_patch(forget_rect)
ax.text(3.75, 8.75, r'$f_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(3.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# Forget gate operation
ax.plot([3.75, 3.75], [9.5, 10], 'r-', linewidth=2)
ax.plot([5, 5], [10, 10], 'ro', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(5, 10.6, '×', fontsize=14, fontweight='bold', ha='center')

# Input gate
input_rect = patches.Rectangle((7, 8), 1.5, 1.5, linewidth=2, edgecolor='blue',
                               facecolor='lightblue', alpha=0.7)
ax.add_patch(input_rect)
ax.text(7.75, 8.75, r'$i_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(7.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# Candidate values
candidate_rect = patches.Rectangle((9.5, 8), 1.5, 1.5, linewidth=2, edgecolor='purple',
                                   facecolor='plum', alpha=0.7)
ax.add_patch(candidate_rect)
ax.text(10.25, 8.75, r'$\tilde{C}_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(10.25, 7.3, r'$\tanh$', ha='center', fontsize=10)

# Combine input gate and candidate
ax.plot([7.75, 7.75], [9.5, 10.5], 'b-', linewidth=2)
ax.plot([10.25, 10.25], [9.5, 10.5], 'purple', linewidth=2)
ax.plot([9, 9], [10.5, 10.5], 'g-', linewidth=2)
ax.plot([9, 9], [10, 10], 'go', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(9, 10.6, '×', fontsize=14, fontweight='bold', ha='center')

# Add to cell state
ax.plot([11, 11], [10, 10], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(11, 10.6, '+', fontsize=14, fontweight='bold', ha='center')

# Output gate
output_rect = patches.Rectangle((13, 8), 1.5, 1.5, linewidth=2, edgecolor='green',
                                facecolor='lightgreen', alpha=0.7)
ax.add_patch(output_rect)
ax.text(13.75, 8.75, r'$o_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(13.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# tanh applied to the cell state for the output
tanh_rect = patches.Rectangle((12.5, 5.5), 1, 1, linewidth=2, edgecolor='orange',
                              facecolor='wheat', alpha=0.7)
ax.add_patch(tanh_rect)
ax.text(13, 6, r'$\tanh$', ha='center', va='center', fontsize=10, fontweight='bold')

# Output combination
ax.plot([13, 13], [6.5, 7], 'orange', linewidth=2)
ax.plot([13.75, 13.75], [9.5, 7], 'g-', linewidth=2)
ax.plot([13.4, 13.4], [7, 7], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(13.4, 7.6, '×', fontsize=14, fontweight='bold', ha='center')

# Hidden state output
ax.arrow(13.4, 6.5, 0, -2.5, head_width=0.2, head_length=0.2, fc='green', ec='green', linewidth=2.5)
ax.text(13.4, 3.5, r'$h_t$', ha='center', fontsize=12, fontweight='bold')

# Inputs
for x_pos in [3.75, 7.75, 10.25, 13.75]:
    ax.arrow(x_pos, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.text(8, 1, r'$[h_{t-1}, x_t]$', ha='center', fontsize=12, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='black', linewidth=2))

# Legend
legend_elements = [
    patches.Patch(facecolor='lightcoral', edgecolor='red', label='Forget Gate'),
    patches.Patch(facecolor='lightblue', edgecolor='blue', label='Input Gate'),
    patches.Patch(facecolor='plum', edgecolor='purple', label='Candidate'),
    patches.Patch(facecolor='lightgreen', edgecolor='green', label='Output Gate'),
]
ax.legend(handles=legend_elements, loc='upper left', fontsize=11)
plt.tight_layout()
plt.show()
```
### 7.3.2 LSTM Implementation
**Keras LSTM:**
```{python}
#| echo: true
#| code-fold: false
def build_lstm_model(sequence_length=50, input_dim=1,
                     lstm_units=64, dense_units=32, output_dim=1):
    """
    LSTM model for time series forecasting.

    Parameters:
        sequence_length: lookback window size
        input_dim: number of features per timestep
        lstm_units: LSTM hidden units
        dense_units: dense layer units
        output_dim: prediction horizon

    Returns:
        model: compiled Keras model
    """
    model = keras.Sequential([
        # LSTM layer
        layers.LSTM(
            units=lstm_units,
            activation='tanh',
            recurrent_activation='sigmoid',
            return_sequences=False,  # Return the last output only
            input_shape=(sequence_length, input_dim),
            name='lstm_layer'
        ),
        # Dropout for regularization
        layers.Dropout(0.2, name='dropout'),
        # Dense layers
        layers.Dense(dense_units, activation='relu', name='dense_1'),
        layers.Dense(output_dim, activation='linear', name='output')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    return model

# Build LSTM model
lstm_model = build_lstm_model(sequence_length=50, lstm_units=64)
lstm_model.summary()
```
**PyTorch LSTM:**
```{python}
#| echo: true
#| code-fold: false
class LSTMForecaster(nn.Module):
    """
    LSTM model for time series forecasting (PyTorch).
    """
    def __init__(self, input_dim=1, hidden_dim=64, num_layers=1,
                 dense_dim=32, output_dim=1, dropout=0.2):
        super(LSTMForecaster, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        # Dropout
        self.dropout = nn.Dropout(dropout)
        # Fully connected layers
        self.fc1 = nn.Linear(hidden_dim, dense_dim)
        self.fc2 = nn.Linear(dense_dim, output_dim)
        # Activation
        self.relu = nn.ReLU()

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        # LSTM forward pass
        out, (hn, cn) = self.lstm(x, (h0, c0))
        # out: (batch, seq_len, hidden_dim)
        # Take the last timestep's output
        out = out[:, -1, :]  # (batch, hidden_dim)
        # Dropout
        out = self.dropout(out)
        # Fully connected layers
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Instantiate PyTorch LSTM
pytorch_lstm = LSTMForecaster(input_dim=1, hidden_dim=64, num_layers=2)
print(pytorch_lstm)

# Test
test_input = torch.randn(8, 50, 1)  # (batch=8, seq=50, features=1)
test_output = pytorch_lstm(test_input)
print(f"\nInput shape: {test_input.shape}")
print(f"Output shape: {test_output.shape}")
```
### 7.3.3 How LSTM Solves Vanishing Gradient
**LSTM's solution: an additive cell-state update.**
Key insight: the cell state $C_t$ is updated **additively**, not multiplicatively!
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
**Gradient Flow:**
$$\frac{\partial C_t}{\partial C_{t-1}} \approx f_t$$
Ignoring the dependence of the gates themselves on the past state, the forget gate $f_t$ can stay close to 1, letting gradients flow backward with almost no decay!
**Comparison:**
```{python}
#| echo: true
#| code-fold: false
# Comparison: Simple RNN vs LSTM gradient flow
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
timesteps = np.arange(0, 51)

# Simple RNN: multiplicative gradient
rnn_gradient = 0.95 ** timesteps
# LSTM: controlled by the forget gate (can stay close to 1)
lstm_gradient_forget_high = 0.99 ** timesteps
lstm_gradient_forget_medium = 0.95 ** timesteps
lstm_gradient_forget_low = 0.90 ** timesteps

# Linear scale
ax1.plot(timesteps, rnn_gradient, 'r-', linewidth=3, label='Simple RNN (W=0.95)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_high, 'g-', linewidth=3, label='LSTM (forget=0.99)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_medium, 'b--', linewidth=2.5, label='LSTM (forget=0.95)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_low, 'purple', linewidth=2, label='LSTM (forget=0.90)', linestyle=':', alpha=0.8)
ax1.set_xlabel('Timesteps', fontsize=12, fontweight='bold')
ax1.set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
ax1.set_title('Gradient Flow Comparison (Linear)', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(alpha=0.3)

# Log scale
ax2.semilogy(timesteps, rnn_gradient, 'r-', linewidth=3, label='Simple RNN (W=0.95)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_high, 'g-', linewidth=3, label='LSTM (forget=0.99)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_medium, 'b--', linewidth=2.5, label='LSTM (forget=0.95)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_low, 'purple', linewidth=2, label='LSTM (forget=0.90)', linestyle=':', alpha=0.8)
ax2.set_xlabel('Timesteps', fontsize=12, fontweight='bold')
ax2.set_ylabel('Gradient Magnitude (log)', fontsize=12, fontweight='bold')
ax2.set_title('Gradient Flow Comparison (Log Scale)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(alpha=0.3, which='both')

plt.tight_layout()
plt.show()

print("Gradient after 50 timesteps:")
print(f"  Simple RNN:    {rnn_gradient[-1]:.6f}")
print(f"  LSTM (f=0.99): {lstm_gradient_forget_high[-1]:.6f}")
print(f"  LSTM (f=0.95): {lstm_gradient_forget_medium[-1]:.6f}")
print(f"  LSTM (f=0.90): {lstm_gradient_forget_low[-1]:.6f}")
```
::: {.callout-tip}
## 💡 Why LSTM Works
1. **Cell-state highway**: a direct path for information flow, with no repeated transformations
2. **Gating mechanisms**: learned control over when to remember or forget
3. **Additive updates**: gradients are not multiplied repeatedly
4. **Flexible memory**: long-term dependencies (100+ timesteps) can be learned
**Result**: an LSTM can learn dependencies spanning hundreds of timesteps, whereas a simple RNN typically manages only about 10!
:::
## 7.4 GRU: Gated Recurrent Unit
### 7.4.1 GRU Architecture
The **GRU** is a simplified version of the LSTM with **fewer parameters** but **comparable performance**.
**Key differences from LSTM:**
- **2 gates** instead of 3 (reset gate, update gate)
- **No separate cell state** - only a hidden state
- **Fewer parameters** - faster training
- **Simpler architecture** - easier to understand
**GRU Equations:**
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$ Update gate
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$ Reset gate
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$$ Candidate hidden state
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$ Final hidden state
**Component Functions:**
- **Update gate** ($z_t$): controls how much of the state is replaced with new content ($1 - z_t$ is the fraction of the old state that is kept)
- **Reset gate** ($r_t$): controls how much of the past state to ignore when computing the candidate
- **Candidate state** ($\tilde{h}_t$): new memory content
- **Final state** ($h_t$): an interpolation between the old state and the candidate
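The four GRU equations map onto a short single-step NumPy function. The sketch below uses toy sizes and random weights for illustration; bias terms, which the equations above omit, are omitted here as well:

```python
import numpy as np

rng = np.random.default_rng(2)
input_dim, hidden_dim = 3, 4          # toy sizes
concat_dim = hidden_dim + input_dim   # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden_dim, concat_dim)) for _ in range(3))

def gru_step(x_t, h_prev):
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))             # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))             # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde                      # interpolation

h = np.zeros(hidden_dim)
h = gru_step(rng.normal(size=input_dim), h)
print(h.shape)  # (4,)
```

Compare this with the LSTM step: the GRU has three weight matrices instead of four, and the update gate plays the combined role of the LSTM's forget and input gates.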
**GRU Visualization:**
```{python}
#| echo: true
#| code-fold: false
fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('GRU Cell Architecture', fontsize=16, fontweight='bold', pad=20)

# Reset gate
reset_rect = patches.Rectangle((2, 7), 1.5, 1.5, linewidth=2, edgecolor='red',
                               facecolor='lightcoral', alpha=0.7)
ax.add_patch(reset_rect)
ax.text(2.75, 7.75, r'$r_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(2.75, 6.3, r'$\sigma$', ha='center', fontsize=10)

# Update gate
update_rect = patches.Rectangle((5.5, 7), 1.5, 1.5, linewidth=2, edgecolor='blue',
                                facecolor='lightblue', alpha=0.7)
ax.add_patch(update_rect)
ax.text(6.25, 7.75, r'$z_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(6.25, 6.3, r'$\sigma$', ha='center', fontsize=10)

# Candidate hidden state
candidate_rect = patches.Rectangle((9, 7), 1.5, 1.5, linewidth=2, edgecolor='purple',
                                   facecolor='plum', alpha=0.7)
ax.add_patch(candidate_rect)
ax.text(9.75, 7.75, r'$\tilde{h}_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(9.75, 6.3, r'$\tanh$', ha='center', fontsize=10)

# Reset operation
ax.plot([2.75, 2.75], [8.5, 9.5], 'r-', linewidth=2)
ax.plot([2.75, 8], [9.5, 9.5], 'r-', linewidth=2)
ax.plot([8, 8], [9.5, 9], 'r-', linewidth=2)
ax.plot([8, 8], [9, 9], 'ro', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(8, 9.5, '×', fontsize=13, fontweight='bold', ha='center', va='bottom')

# Previous hidden state path
ax.plot([0.5, 11.5], [9, 9], 'k-', linewidth=3, alpha=0.5)
ax.text(0, 9, r'$h_{t-1}$', fontsize=11, fontweight='bold', va='center')

# Update gate paths
ax.plot([6.25, 6.25], [8.5, 5], 'b-', linewidth=2)
ax.plot([6.25, 11.5], [5, 5], 'b-', linewidth=2)

# 1 - z_t path
ax.plot([4, 4], [5, 5], 'b--', linewidth=2)
ax.plot([4, 11.5], [3, 3], 'b--', linewidth=2)
ax.text(3.5, 5, r'$1-z_t$', fontsize=10, ha='right', color='blue', fontweight='bold')

# Candidate combination
ax.plot([9.75, 9.75], [8.5, 5], 'purple', linewidth=2)
ax.plot([11.5, 11.5], [5, 5], 'go', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 5.5, '×', fontsize=13, fontweight='bold', ha='center', color='purple')

# Old hidden state path
ax.plot([11.5, 11.5], [9, 3], 'k--', linewidth=2, alpha=0.5)
ax.plot([11.5, 11.5], [3, 3], 'go', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 2.5, '×', fontsize=13, fontweight='bold', ha='center')

# Final combination
ax.plot([11.5, 11.5], [4.5, 1.5], 'g-', linewidth=2.5)
ax.plot([11.5, 11.5], [2.5, 1.5], 'g-', linewidth=2.5)
ax.plot([11.5, 11.5], [1.5, 1.5], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 1, '+', fontsize=14, fontweight='bold', ha='center')

# Output
ax.arrow(11.5, 0.8, 0, -0.3, head_width=0.2, head_length=0.1, fc='green', ec='green', linewidth=2.5)
ax.text(11.5, 0, r'$h_t$', ha='center', fontsize=12, fontweight='bold')

# Inputs
for x_pos in [2.75, 6.25, 9.75]:
    ax.arrow(x_pos, 2.5, 0, 4, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.text(6.25, 1.5, r'$[h_{t-1}, x_t]$', ha='center', fontsize=11, fontweight='bold',
        bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='black', linewidth=2))

plt.tight_layout()
plt.show()
```
### 7.4.2 GRU Implementation
**Keras GRU:**
```{python}
#| echo: true
#| code-fold: false
def build_gru_model(sequence_length=50, input_dim=1,
                    gru_units=64, dense_units=32, output_dim=1):
    """
    GRU model for time series forecasting.
    """
    model = keras.Sequential([
        # GRU layer
        layers.GRU(
            units=gru_units,
            activation='tanh',
            recurrent_activation='sigmoid',
            return_sequences=False,
            input_shape=(sequence_length, input_dim),
            name='gru_layer'
        ),
        # Dropout
        layers.Dropout(0.2, name='dropout'),
        # Dense layers
        layers.Dense(dense_units, activation='relu', name='dense_1'),
        layers.Dense(output_dim, activation='linear', name='output')
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    return model

# Build GRU model
gru_model = build_gru_model()
gru_model.summary()
```
**PyTorch GRU:**
```{python}
#| echo: true
#| code-fold: false
class GRUForecaster(nn.Module):
    """
    GRU model for time series forecasting (PyTorch).
    """
    def __init__(self, input_dim=1, hidden_dim=64, num_layers=1,
                 dense_dim=32, output_dim=1, dropout=0.2):
        super(GRUForecaster, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        # GRU layer
        self.gru = nn.GRU(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )
        # Dropout
        self.dropout = nn.Dropout(dropout)
        # Fully connected layers
        self.fc1 = nn.Linear(hidden_dim, dense_dim)
        self.fc2 = nn.Linear(dense_dim, output_dim)
        # Activation
        self.relu = nn.ReLU()

    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        # GRU forward pass
        out, hn = self.gru(x, h0)
        # Take the last timestep
        out = out[:, -1, :]
        # Dropout
        out = self.dropout(out)
        # FC layers
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Instantiate
pytorch_gru = GRUForecaster(hidden_dim=64, num_layers=2)
print(pytorch_gru)
```
### 7.4.3 LSTM vs GRU: When to Use What?
**Comparison Table:**
| Aspect | LSTM | GRU |
|--------|------|-----|
| **Parameters** | More (3 gates + cell candidate) | Fewer (2 gates + candidate) |
| **Training Speed** | Slower | Faster |
| **Memory** | Higher | Lower |
| **Performance** | Slightly better on complex tasks | Comparable on most tasks |
| **Long-term Dependencies** | Excellent | Very good |
| **Overfitting Risk** | Higher (more params) | Lower |
| **When to Use** | Large datasets, complex patterns | Smaller datasets, faster training needed |
**Practical Guidelines:**
```{python}
#| echo: true
#| code-fold: false
# Parameter comparison
def compare_parameters():
    """
    Compare parameter counts: LSTM vs GRU
    """
    seq_len, input_dim, hidden_dim = 50, 1, 64
    # Build models
    lstm_model = keras.Sequential([
        layers.LSTM(hidden_dim, input_shape=(seq_len, input_dim)),
        layers.Dense(1)
    ])
    gru_model = keras.Sequential([
        layers.GRU(hidden_dim, input_shape=(seq_len, input_dim)),
        layers.Dense(1)
    ])
    lstm_params = lstm_model.count_params()
    gru_params = gru_model.count_params()
    print("Parameter Comparison:")
    print(f" LSTM parameters: {lstm_params:,}")
    print(f" GRU parameters: {gru_params:,}")
    print(f" Difference: {lstm_params - gru_params:,} ({(lstm_params - gru_params)/gru_params*100:.1f}% more)")
    print(f"\n LSTM is {lstm_params/gru_params:.2f}x larger than GRU")

compare_parameters()
```
::: {.callout-tip}
## 🎯 Rule of Thumb
**Use LSTM when:**
- You have **large datasets** (millions of samples)
- Task requires **very long-term dependencies** (100+ timesteps)
- Model interpretability less important
- Computational resources abundant
**Use GRU when:**
- **Smaller datasets** or limited computational resources
- Need **faster training/inference**
- Medium-term dependencies (10-100 timesteps)
- Want **simpler model** with fewer hyperparameters
**In practice**: Try both! GRU often performs similarly with less complexity.
:::
## 7.5 Advanced RNN Architectures
### 7.5.1 Bidirectional RNNs
**Concept**: Process the sequence **forward AND backward** to capture context from both directions.
**Use Cases:**
- Sentiment analysis (requires full-sentence context)
- Named Entity Recognition
- Speech recognition
- Not suitable for real-time forecasting (requires future data)
**Architecture:**
```{mermaid}
%%| fig-cap: "Arsitektur Bidirectional LSTM - menggabungkan informasi dari forward dan backward pass"
%%| label: fig-bidirectional-lstm
flowchart LR
subgraph Forward["Forward Pass"]
direction LR
X1[x1] --> F1[h1_fwd]
X2[x2] --> F2[h2_fwd]
X3[x3] --> F3[h3_fwd]
F1 --> F2
F2 --> F3
end
subgraph Backward["Backward Pass"]
direction RL
X1B[x1] --> B1[h1_bwd]
X2B[x2] --> B2[h2_bwd]
X3B[x3] --> B3[h3_bwd]
B3 --> B2
B2 --> B1
end
F1 --> C1[Concat]
B1 --> C1
F2 --> C2[Concat]
B2 --> C2
F3 --> C3[Concat]
B3 --> C3
C1 --> Y1[y1]
C2 --> Y2[y2]
C3 --> Y3[y3]
style Forward fill:#e6f3ff,stroke:#333,stroke-width:2px
style Backward fill:#ffe6e6,stroke:#333,stroke-width:2px
style C1 fill:#fff4e6,stroke:#333
style C2 fill:#fff4e6,stroke:#333
style C3 fill:#fff4e6,stroke:#333
```
**Implementation:**
```{python}
#| echo: true
#| code-fold: false
def build_bidirectional_lstm(sequence_length=50, input_dim=1,
                             lstm_units=64, num_classes=3):
    """
    Bidirectional LSTM for sequence classification
    """
    model = keras.Sequential([
        # Bidirectional LSTM
        layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True),
            input_shape=(sequence_length, input_dim),
            name='bidirectional_lstm_1'
        ),
        # Second bidirectional layer
        layers.Bidirectional(
            layers.LSTM(lstm_units // 2),
            name='bidirectional_lstm_2'
        ),
        # Dropout
        layers.Dropout(0.3),
        # Output
        layers.Dense(num_classes, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Build model
bi_lstm = build_bidirectional_lstm()
bi_lstm.summary()

# Note: the output of a bidirectional layer is a concatenation.
# If the forward LSTM has 64 units, the backward LSTM also has 64,
# so the output shape is (batch, 64 + 64) = (batch, 128).
```
### 7.5.2 Stacked/Deep RNNs
**Concept**: Stack multiple RNN layers to learn **hierarchical representations**.
**Benefits:**
- Learn more complex patterns
- Better feature extraction
- Hierarchical temporal abstractions
**Caution:**
- More parameters = more data needed
- Risk of overfitting
- Harder to train
**Implementation:**
```{python}
#| echo: true
#| code-fold: false
def build_stacked_lstm(sequence_length=50, input_dim=1,
                       lstm_layers=[128, 64, 32], output_dim=1):
    """
    Stacked LSTM with multiple layers

    Parameters:
        lstm_layers: List of units per layer [layer1_units, layer2_units, ...]
    """
    model = keras.Sequential(name='Stacked_LSTM')
    # First LSTM layer (must return sequences)
    model.add(layers.LSTM(
        lstm_layers[0],
        return_sequences=True,
        input_shape=(sequence_length, input_dim),
        name='lstm_1'
    ))
    model.add(layers.Dropout(0.2, name='dropout_1'))
    # Middle layers (return sequences for all except the last)
    for i, units in enumerate(lstm_layers[1:-1], start=2):
        model.add(layers.LSTM(
            units,
            return_sequences=True,
            name=f'lstm_{i}'
        ))
        model.add(layers.Dropout(0.2, name=f'dropout_{i}'))
    # Last LSTM layer (return_sequences=False)
    model.add(layers.LSTM(
        lstm_layers[-1],
        return_sequences=False,
        name=f'lstm_{len(lstm_layers)}'
    ))
    model.add(layers.Dropout(0.2, name=f'dropout_{len(lstm_layers)}'))
    # Output layer
    model.add(layers.Dense(output_dim, activation='linear', name='output'))
    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )
    return model

# Build 3-layer stacked LSTM
stacked_lstm = build_stacked_lstm(lstm_layers=[128, 64, 32])
stacked_lstm.summary()
```
### 7.5.3 Encoder-Decoder (Seq2Seq)
**Concept**: An architecture for **sequence-to-sequence** tasks with variable input/output lengths.
**Components:**
1. **Encoder**: Processes the input sequence → context vector
2. **Decoder**: Generates the output sequence from the context vector
**Use Cases:**
- Machine translation
- Text summarization
- Question answering
- Image captioning
**Architecture:**
```{mermaid}
%%| fig-cap: "Arsitektur Encoder-Decoder (Seq2Seq) - Encoder mengompres input menjadi context vector, Decoder menghasilkan output sequence"
%%| label: fig-seq2seq-architecture
flowchart LR
subgraph Encoder["Encoder"]
direction LR
X1[x1] --> E1[LSTM 1]
X2[x2] --> E2[LSTM 2]
X3[x3] --> E3[LSTM 3]
E1 --> E2
E2 --> E3
end
E3 ==> C[Context Vector]
subgraph Decoder["Decoder"]
direction LR
C ==> D1[LSTM 1]
D1 --> D2[LSTM 2]
D2 --> D3[LSTM 3]
D1 --> Y1[y1]
D2 --> Y2[y2]
D3 --> Y3[y3]
end
style Encoder fill:#e6f3ff,stroke:#333,stroke-width:2px
style C fill:#ffffcc,stroke:#f90,stroke-width:3px
style Decoder fill:#ffe6e6,stroke:#333,stroke-width:2px
style Y1 fill:#d4edda,stroke:#333
style Y2 fill:#d4edda,stroke:#333
style Y3 fill:#d4edda,stroke:#333
```
**Implementation:**
```{python}
#| echo: true
#| code-fold: false
def build_seq2seq_model(encoder_seq_len=10, decoder_seq_len=10,
                        input_dim=1, output_dim=1, latent_dim=64):
    """
    Simple Seq2Seq model

    Parameters:
        encoder_seq_len: Input sequence length
        decoder_seq_len: Output sequence length
        latent_dim: Hidden dimension
    """
    # Encoder
    encoder_inputs = layers.Input(shape=(encoder_seq_len, input_dim), name='encoder_input')
    encoder_lstm = layers.LSTM(latent_dim, return_state=True, name='encoder_lstm')
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
    encoder_states = [state_h, state_c]  # Context vector
    # Decoder
    decoder_inputs = layers.Input(shape=(decoder_seq_len, output_dim), name='decoder_input')
    decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True, name='decoder_lstm')
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = layers.Dense(output_dim, activation='linear', name='decoder_dense')
    decoder_outputs = decoder_dense(decoder_outputs)
    # Model
    model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs, name='Seq2Seq')
    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )
    return model

# Build seq2seq
seq2seq = build_seq2seq_model()
seq2seq.summary()
```
## 7.6 Time Series Forecasting with RNN
### 7.6.1 Problem Formulation
**Time Series Forecasting Task:**
Given historical data $x_1, x_2, ..., x_t$, predict future values $x_{t+1}, x_{t+2}, ..., x_{t+h}$
**Approaches:**
1. **One-step ahead**: Predict only $x_{t+1}$
2. **Multi-step ahead**: Predict the sequence $[x_{t+1}, \dots, x_{t+h}]$
3. **Recursive**: Use predictions as input for the next prediction
4. **Direct**: Train a separate model for each horizon
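The recursive strategy above can be sketched without any framework: a one-step model is applied repeatedly, and each prediction is fed back into the input window. Below is a minimal NumPy sketch; `one_step_model` is a hypothetical stand-in for any trained one-step predictor (here simply the mean of the window), not part of any library API.

```python
import numpy as np

def recursive_forecast(history, one_step_model, lookback=50, horizon=10):
    """Recursive multi-step forecast: feed each prediction back as input."""
    window = list(history[-lookback:])        # last `lookback` observations
    preds = []
    for _ in range(horizon):
        x = np.array(window[-lookback:])      # current input window
        y_hat = one_step_model(x)             # predict one step ahead
        preds.append(y_hat)
        window.append(y_hat)                  # reuse the prediction as input
    return np.array(preds)

# Hypothetical one-step "model": predicts the mean of the input window
naive_model = lambda x: float(np.mean(x))

history = np.sin(np.linspace(0, 10, 200))
forecast = recursive_forecast(history, naive_model, lookback=50, horizon=24)
print(forecast.shape)  # (24,)
```

Note the design trade-off this makes visible: errors compound, since every prediction beyond the first is conditioned on earlier predictions rather than observations.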
**Data Preparation:**
```{python}
#| echo: true
#| code-fold: false
def create_sequences(data, lookback=50, horizon=1):
    """
    Create input-output sequences for time series forecasting

    Parameters:
        data: 1D array of time series data
        lookback: Number of past timesteps to use as input
        horizon: Number of future timesteps to predict

    Returns:
        X: Input sequences (samples, lookback, features)
        y: Target values (samples,) or (samples, horizon)
    """
    X, y = [], []
    for i in range(len(data) - lookback - horizon + 1):
        # Input: [i : i+lookback]
        X.append(data[i : i + lookback])
        # Target: [i+lookback : i+lookback+horizon]
        if horizon == 1:
            y.append(data[i + lookback])
        else:
            y.append(data[i + lookback : i + lookback + horizon])
    X = np.array(X)
    y = np.array(y)
    # Reshape X to (samples, lookback, 1) for a univariate series
    if len(X.shape) == 2:
        X = X.reshape((X.shape[0], X.shape[1], 1))
    return X, y

# Example
np.random.seed(42)
sample_data = np.sin(np.linspace(0, 100, 1000)) + np.random.normal(0, 0.1, 1000)
X, y = create_sequences(sample_data, lookback=50, horizon=1)
print(f"Input shape: {X.shape}")   # (samples, 50, 1)
print(f"Output shape: {y.shape}")  # (samples,)

# Visualize sequences
fig, axes = plt.subplots(2, 1, figsize=(15, 8))
# Plot full time series
axes[0].plot(sample_data, linewidth=1.5, alpha=0.7)
axes[0].set_title('Full Time Series', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Time', fontsize=11)
axes[0].set_ylabel('Value', fontsize=11)
axes[0].grid(alpha=0.3)
# Plot one sequence example
example_idx = 100
input_seq = X[example_idx].flatten()
target_val = y[example_idx]
axes[1].plot(range(len(input_seq)), input_seq, 'b-', linewidth=2, label='Input Sequence (lookback=50)')
axes[1].plot(len(input_seq), target_val, 'ro', markersize=10, label='Target (t+1)', zorder=3)
axes[1].axvline(len(input_seq) - 1, color='gray', linestyle='--', alpha=0.5)
axes[1].set_title(f'Example Sequence #{example_idx}', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Timestep', fontsize=11)
axes[1].set_ylabel('Value', fontsize=11)
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
```
### 7.6.2 Feature Engineering for Time Series
**Important Features:**
1. **Lag features**: Past values
2. **Rolling statistics**: Moving average, std
3. **Time-based features**: Hour, day, month, seasonality
4. **Difference features**: First/second differences
```{python}
#| echo: true
#| code-fold: false
import pandas as pd

def engineer_time_series_features(data, datetime_index=None):
    """
    Create time series features

    Parameters:
        data: 1D array or pandas Series
        datetime_index: DatetimeIndex (optional)

    Returns:
        DataFrame with engineered features
    """
    if isinstance(data, np.ndarray):
        data = pd.Series(data)
    df = pd.DataFrame({'value': data})
    # Lag features
    for lag in [1, 2, 3, 7, 14]:
        df[f'lag_{lag}'] = df['value'].shift(lag)
    # Rolling statistics
    for window in [7, 14, 30]:
        df[f'rolling_mean_{window}'] = df['value'].rolling(window=window).mean()
        df[f'rolling_std_{window}'] = df['value'].rolling(window=window).std()
        df[f'rolling_min_{window}'] = df['value'].rolling(window=window).min()
        df[f'rolling_max_{window}'] = df['value'].rolling(window=window).max()
    # Difference features
    df['diff_1'] = df['value'].diff(1)
    df['diff_2'] = df['value'].diff(2)
    # Time-based features (if a datetime index is provided)
    if datetime_index is not None:
        df.index = datetime_index
        df['hour'] = df.index.hour
        df['day_of_week'] = df.index.dayofweek
        df['day_of_month'] = df.index.day
        df['month'] = df.index.month
        df['quarter'] = df.index.quarter
        # Cyclical encoding for periodic features
        df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
        df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
        df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
        df['day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)
    # Drop NaN rows
    df = df.dropna()
    return df

# Example
dates = pd.date_range('2023-01-01', periods=len(sample_data), freq='h')
features_df = engineer_time_series_features(sample_data, dates)
print("Engineered Features:")
print(features_df.head(10))
print(f"\nTotal features created: {len(features_df.columns)}")
```
### 7.6.3 Evaluation Metrics for Time Series
**Common Metrics:**
```{python}
#| echo: true
#| code-fold: false
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def evaluate_forecast(y_true, y_pred):
    """
    Comprehensive evaluation metrics for forecasting

    Returns:
        Dictionary of metrics
    """
    # Regression metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # Percentage error (absolute denominator avoids sign issues near zero)
    mape = np.mean(np.abs((y_true - y_pred) / (np.abs(y_true) + 1e-8))) * 100
    # Direction accuracy (% of correctly predicted trend directions)
    y_true_diff = np.diff(y_true.flatten())
    y_pred_diff = np.diff(y_pred.flatten())
    direction_accuracy = np.mean((y_true_diff * y_pred_diff) > 0) * 100
    metrics = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'MAPE (%)': mape,
        'Direction Accuracy (%)': direction_accuracy
    }
    return metrics

# Example evaluation
y_true = sample_data[100:200]
y_pred = sample_data[100:200] + np.random.normal(0, 0.1, 100)  # Simulated predictions
metrics = evaluate_forecast(y_true, y_pred)
print("Forecast Evaluation Metrics:")
print("=" * 50)
for metric, value in metrics.items():
    print(f"{metric:25s}: {value:10.4f}")
```
::: {.callout-note}
## 📊 Which Metric to Use?
**MSE/RMSE**:
- Penalize large errors heavily
- Good when outliers are critical
- RMSE has the same unit as the target (MSE is in squared units)
**MAE**:
- More robust to outliers
- Interpretable (average error)
- Less sensitive to extreme values
**MAPE**:
- Percentage-based, scale-independent
- Good for comparing different datasets
- Problematic when y_true is close to zero
**Direction Accuracy**:
- Important for trading strategies
- Measures trend prediction
- Binary metric (up/down)
**R²**:
- Goodness of fit
- 1.0 = perfect, 0.0 = baseline
- Can be negative for bad models
:::
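As a workaround for MAPE's near-zero problem noted above, symmetric MAPE (sMAPE) divides by the average magnitude of the actual and predicted values instead. A minimal NumPy sketch of one common sMAPE variant (several definitions exist in the literature):

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric MAPE: bounded in [0, 200]%, stable near zero."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2 + eps
    return np.mean(np.abs(y_true - y_pred) / denom) * 100

y_true = np.array([0.01, 2.0, 3.0])  # a near-zero value would inflate MAPE
y_pred = np.array([0.02, 2.2, 2.7])
print(f"sMAPE: {smape(y_true, y_pred):.2f}%")
```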
## 7.7 RNN for NLP Applications
### 7.7.1 Text Preprocessing for RNN
**Steps:**
1. **Tokenization**: Split text into tokens
2. **Vocabulary building**: Create word-to-index mapping
3. **Sequence padding**: Make all sequences same length
4. **Embedding**: Convert words to dense vectors
```{python}
#| echo: true
#| code-fold: false
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample text data
texts = [
    "I love machine learning",
    "Deep learning is amazing",
    "RNN are great for sequences",
    "LSTM solves vanishing gradient problem",
    "NLP with deep learning is powerful"
]

# Tokenization
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)

# Convert to sequences
sequences = tokenizer.texts_to_sequences(texts)
print("Original texts:")
for i, text in enumerate(texts):
    print(f" {i+1}. {text}")
print("\nTokenized sequences:")
for i, seq in enumerate(sequences):
    print(f" {i+1}. {seq}")

# Vocabulary
word_index = tokenizer.word_index
print(f"\nVocabulary size: {len(word_index)}")
print("\nWord to index mapping (first 10):")
for word, idx in list(word_index.items())[:10]:
    print(f" '{word}': {idx}")

# Padding
max_length = 10
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')
print(f"\nPadded sequences (max_length={max_length}):")
for i, seq in enumerate(padded_sequences):
    print(f" {i+1}. {seq}")
```
### 7.7.2 Sentiment Analysis with LSTM
```{python}
#| echo: true
#| code-fold: false
def build_sentiment_classifier(vocab_size=10000, embedding_dim=64,
                               max_length=100, lstm_units=64):
    """
    LSTM for sentiment classification
    """
    model = keras.Sequential([
        # Embedding layer
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            input_length=max_length,
            name='embedding'
        ),
        # Spatial dropout for the embedding
        layers.SpatialDropout1D(0.2),
        # Bidirectional LSTM
        layers.Bidirectional(
            layers.LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2),
            name='bidirectional_lstm'
        ),
        # Dense layers
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.3),
        # Output layer (binary classification)
        layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

# Build model
sentiment_model = build_sentiment_classifier()
sentiment_model.summary()
```
### 7.7.3 Text Generation with RNN
**Character-level Language Model:**
```{python}
#| echo: true
#| code-fold: false
def build_text_generator(vocab_size=100, embedding_dim=256,
                         rnn_units=512, sequence_length=100):
    """
    Character-level text generation model
    """
    model = keras.Sequential([
        # Embedding
        layers.Embedding(vocab_size, embedding_dim, input_length=sequence_length),
        # Stacked LSTM
        layers.LSTM(rnn_units, return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(rnn_units),
        layers.Dropout(0.2),
        # Output layer (predict the next character)
        layers.Dense(vocab_size, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

text_gen = build_text_generator()
text_gen.summary()
```
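At inference time, a character-level model samples the next character from its softmax output; "temperature" rescales the distribution to trade coherence for diversity. A minimal, framework-free NumPy sketch (`probs` is a hypothetical stand-in for the model's softmax output, not a library API):

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Re-weight a probability distribution and sample one index.

    temperature < 1: sharper (more conservative) sampling
    temperature > 1: flatter (more diverse) sampling
    """
    rng = rng or np.random.default_rng(0)
    logits = np.log(np.asarray(probs, float) + 1e-12) / temperature
    exp = np.exp(logits - np.max(logits))  # numerically stable softmax
    p = exp / exp.sum()
    return int(rng.choice(len(p), p=p))

probs = np.array([0.1, 0.6, 0.3])  # hypothetical model output over 3 characters
idx = sample_with_temperature(probs, temperature=0.5)
print(idx)
```

A very low temperature approaches greedy (argmax) decoding; a high temperature approaches uniform sampling.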
## 7.8 Best Practices and Tips
### 7.8.1 Hyperparameter Tuning
**Key Hyperparameters:**
1. **Hidden units**: Start with 32-128
2. **Number of layers**: 1-3 layers is usually enough
3. **Dropout rate**: 0.2-0.5
4. **Learning rate**: 0.001 (Adam) or 0.01 (SGD)
5. **Batch size**: 32-128 for time series
6. **Sequence length**: Depends on the span of the temporal dependency
```{python}
#| echo: true
#| code-fold: false
# Hyperparameter search space
hyperparams = {
    'lstm_units': [32, 64, 128, 256],
    'num_layers': [1, 2, 3],
    'dropout_rate': [0.0, 0.2, 0.3, 0.5],
    'learning_rate': [0.0001, 0.001, 0.01],
    'batch_size': [32, 64, 128],
    'sequence_length': [24, 48, 72, 96]
}
print("Hyperparameter Search Space:")
for param, values in hyperparams.items():
    print(f" {param:20s}: {values}")
```
### 7.8.2 Training Tips
::: {.callout-tip}
## 🎯 Training Best Practices
**1. Gradient Clipping** (prevent exploding gradients):
```python
optimizer = keras.optimizers.Adam(clipnorm=1.0)
```
**2. Early Stopping** (prevent overfitting):
```python
early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
```
**3. Learning Rate Scheduling**:
```python
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-7
)
```
**4. Batch Normalization** (for deeper networks):
```python
layers.BatchNormalization()
```
**5. Teacher Forcing** (for seq2seq):
- Use the ground truth as the decoder input during training
- Use model predictions during inference
:::
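Teacher forcing amounts to shifting the target sequence one step to the right, so the decoder sees the ground-truth previous value at every timestep. A minimal NumPy sketch of this data preparation (the start-of-sequence value `sos` is an assumed convention here, not part of any specific API):

```python
import numpy as np

def make_teacher_forcing_pairs(targets, sos=0.0):
    """Build decoder inputs/outputs for teacher forcing.

    decoder_input[t]  = targets[t-1]  (ground truth, shifted right; SOS at t=0)
    decoder_output[t] = targets[t]    (what the decoder should predict)
    """
    targets = np.asarray(targets, float)
    decoder_input = np.concatenate([[sos], targets[:-1]])
    return decoder_input, targets

dec_in, dec_out = make_teacher_forcing_pairs([1.0, 2.0, 3.0], sos=0.0)
print(dec_in)   # [0. 1. 2.]
print(dec_out)  # [1. 2. 3.]
```

The same pairs feed the two-input seq2seq model from Section 7.5.3: `decoder_input` goes into `decoder_inputs`, `decoder_output` is the training target.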
### 7.8.3 Common Pitfalls
**❌ Common Mistakes:**
1. **Data leakage**: Using future data to train
2. **Not normalizing**: RNNs are sensitive to input scale
3. **Too many parameters**: Overfitting on small datasets
4. **Ignoring stationarity**: Non-stationary series often need differencing or detrending first
5. **Wrong sequence direction**: Make sure the temporal order is correct
6. **Batch size too small**: Unstable training
7. **No validation set**: Cannot detect overfitting
**✅ Solutions:**
```{python}
#| echo: true
#| code-fold: false
# 1. Proper train/val/test split for time series
def time_series_split(data, train_ratio=0.7, val_ratio=0.15):
    """
    Time series split (no shuffling!)
    """
    n = len(data)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    train = data[:train_end]
    val = data[train_end:val_end]
    test = data[val_end:]
    return train, val, test

# 2. Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def normalize_data(train, val, test, method='standard'):
    """
    Normalize data using train statistics
    """
    if method == 'standard':
        scaler = StandardScaler()
    else:
        scaler = MinMaxScaler()
    # Fit ONLY on training data
    train_scaled = scaler.fit_transform(train.reshape(-1, 1))
    val_scaled = scaler.transform(val.reshape(-1, 1))
    test_scaled = scaler.transform(test.reshape(-1, 1))
    return train_scaled, val_scaled, test_scaled, scaler

# Example
train, val, test = time_series_split(sample_data)
train_norm, val_norm, test_norm, scaler = normalize_data(train, val, test)
print(f"Train size: {len(train)} ({len(train)/len(sample_data)*100:.1f}%)")
print(f"Val size: {len(val)} ({len(val)/len(sample_data)*100:.1f}%)")
print(f"Test size: {len(test)} ({len(test)/len(sample_data)*100:.1f}%)")
```
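One related step the pitfalls above imply but do not show: predictions made in normalized space must be mapped back to the original scale using the same train-set statistics. A framework-free sketch of standard scaling and its inverse (the helper names are illustrative, not from sklearn):

```python
import numpy as np

def fit_standard(train):
    """Compute scaling statistics from the TRAINING split only."""
    return float(np.mean(train)), float(np.std(train))

def scale(x, mu, sigma):
    return (np.asarray(x, float) - mu) / sigma

def inverse_scale(x_scaled, mu, sigma):
    """Map normalized predictions back to the original scale."""
    return np.asarray(x_scaled, float) * sigma + mu

train = np.array([10.0, 12.0, 14.0, 16.0])
mu, sigma = fit_standard(train)
test = np.array([18.0, 20.0])
test_scaled = scale(test, mu, sigma)           # the model works in this space
recovered = inverse_scale(test_scaled, mu, sigma)
print(recovered)  # [18. 20.]
```

With sklearn, the equivalent step is `scaler.inverse_transform(predictions)` using the scaler fitted on the training split.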
## 7.9 Summary & Conclusions
### 7.9.1 Key Takeaways
::: {.callout-note}
## 📚 Chapter Summary
**1. RNN Fundamentals:**
- RNNs process sequential data with a hidden state (memory)
- Parameters are shared across timesteps
- Can handle variable-length sequences
**2. Vanishing Gradient Problem:**
- Simple RNNs struggle to learn long-term dependencies
- Gradients decay exponentially during backpropagation
- Solution: LSTM and GRU with gating mechanisms
**3. LSTM:**
- 3 gates (forget, input, output) + cell state
- The cell state acts as an information highway
- Additive updates prevent gradient vanishing
- Best for very long-term dependencies
**4. GRU:**
- Simplified LSTM with 2 gates (update, reset)
- Fewer parameters, faster training
- Comparable performance for most tasks
- A good default choice for many applications
**5. Advanced Architectures:**
- **Bidirectional**: Process forward + backward
- **Stacked**: Multiple layers for hierarchical learning
- **Seq2Seq**: Encoder-decoder for variable-length I/O
**6. Applications:**
- **Time Series**: Energy forecasting, stock prediction
- **NLP**: Sentiment analysis, text generation, translation
- **Speech**: Recognition, synthesis
- **Video**: Frame prediction, action recognition
**7. Best Practices:**
- Normalize input data
- Use gradient clipping
- Proper train/val/test split (no shuffling for time series!)
- Start simple, add complexity gradually
- Monitor for overfitting
:::
### 7.9.2 When to Use What?
```{mermaid}
%%| fig-cap: "Decision tree untuk memilih arsitektur RNN yang tepat berdasarkan karakteristik masalah dan data"
%%| label: fig-rnn-decision-tree
flowchart TD
A["Sequential Problem?"] -->|Yes| B{"Long-term<br/>Dependencies?"}
A -->|No| Z["Use Feedforward NN<br/>or CNN"]
B -->|"Yes, >100 steps"| C["Use LSTM"]
B -->|"Medium, 10-100 steps"| D["Use GRU"]
B -->|"Short, <10 steps"| E["Simple RNN OK"]
C --> F{"Large Dataset?"}
D --> F
E --> F
F -->|"Yes, millions"| G["Deep/Stacked RNN"]
F -->|"No, thousands"| H["Shallow RNN<br/>1-2 layers"]
G --> I{"Need both<br/>directions?"}
H --> I
I -->|Yes| J["Bidirectional"]
I -->|No| K["Unidirectional"]
style A fill:#ffcccc,stroke:#333,stroke-width:2px
style B fill:#fff3cd,stroke:#333,stroke-width:2px
style C fill:#ccffcc,stroke:#333,stroke-width:2px
style D fill:#ccffcc,stroke:#333,stroke-width:2px
style E fill:#ffffcc,stroke:#333,stroke-width:2px
style F fill:#fff3cd,stroke:#333,stroke-width:2px
style I fill:#fff3cd,stroke:#333,stroke-width:2px
style J fill:#ccccff,stroke:#333,stroke-width:2px
style K fill:#ccccff,stroke:#333,stroke-width:2px
style Z fill:#e2e3e5,stroke:#333,stroke-width:2px
```
### 7.9.3 Looking Forward
**Limitations of RNNs:**
- Sequential processing (cannot parallelize)
- Still struggle with very long sequences (>1000 steps)
- Computationally expensive
- Hard to capture global dependencies
**Modern Alternatives:**
- **Transformers**: Attention mechanisms, fully parallelizable
- **Temporal Convolutional Networks (TCN)**: CNNs for sequences
- **State Space Models (SSM)**: Linear-time alternatives
**When RNNs Are Still Relevant:**
- Small datasets (transformers need more data)
- Online/streaming prediction
- Resource-constrained environments
- Interpretability requirements
- Classic time series problems
## 7.10 Exercises
### Review Questions
1. Explain why feedforward neural networks are not suitable for sequential data. Give three main reasons.
2. What is the vanishing gradient problem? Why does it occur in a Simple RNN? Give a mathematical explanation.
3. Compare LSTM and GRU:
   - Architectural differences
   - Number of parameters
   - When to use each
   - Trade-offs
4. What is the function of each gate in an LSTM:
   - Forget gate
   - Input gate
   - Output gate
   Give a concrete example of when each gate would be "open" or "closed".
5. Explain the differences between:
   - `return_sequences=True` vs `return_sequences=False`
   - Unidirectional vs Bidirectional RNN
   - Stateful vs Stateless RNN
6. For time series forecasting, explain:
   - One-step vs multi-step forecasting
   - Recursive vs direct multi-step prediction
   - The advantages and disadvantages of each
7. Why is normalization important for RNNs? What happens if you skip it?
8. Explain the sequence-to-sequence (Seq2Seq) architecture. Give three real-world applications.
9. What is the role of dropout in RNNs? Where should dropout be applied?
10. Compare evaluation metrics for time series:
    - MSE vs MAE
    - MAPE vs RMSE
    - When to use each
### Coding Exercises
**Exercise 1**: Implement a Simple RNN from scratch (NumPy)
```python
# Task: implement the Simple RNN forward pass without deep learning libraries
# Input: a sequence with shape (batch, timesteps, features)
# Output: the hidden state for every timestep
```
**Exercise 2**: LSTM for Stock Price Prediction
```python
# Dataset: Historical stock prices (Yahoo Finance)
# Task: Predict next day close price
# Requirements:
# - Feature engineering (moving averages, RSI, etc.)
# - LSTM model with at least 2 layers
# - Proper train/val/test split
# - Evaluate with multiple metrics
```
**Exercise 3**: Sentiment Analysis with Bidirectional LSTM
```python
# Dataset: IMDB reviews
# Task: Binary sentiment classification
# Requirements:
# - Text preprocessing (tokenization, padding)
# - Embedding layer
# - Bidirectional LSTM
# - Compare with a unidirectional model
```
**Exercise 4**: Text Generation (Character-level)
```python
# Dataset: Shakespeare texts or any other corpus
# Task: Generate new text character-by-character
# Requirements:
# - Character-level tokenization
# - Stacked LSTM
# - Temperature sampling
# - Generate coherent sentences
```
**Exercise 5**: Energy Consumption Forecasting
```python
# Dataset: Household energy consumption (hourly)
# Task: Predict next 24 hours consumption
# Requirements:
# - Multi-step forecasting
# - Compare Simple RNN, LSTM, GRU
# - Feature engineering (time-based features)
# - Visualization of predictions
```
---
**Happy learning! In Lab 7, we will implement an LSTM for energy forecasting hands-on! 🚀**