Chapter 7: Recurrent Neural Networks, LSTM & Sequence Modeling

Deep Learning for Sequential Data: Time Series, NLP & Sequential Prediction


🎯 Learning Outcomes

After studying this chapter, you will be able to:

  1. Understand the RNN architecture and how it processes sequential data
  2. Identify the vanishing gradient problem and its solutions (LSTM, GRU)
  3. Implement RNNs, LSTMs, and GRUs for time series forecasting
  4. Apply bidirectional and stacked RNNs for better performance
  5. Use sequence-to-sequence models for NLP applications
  6. Evaluate sequential model performance with appropriate metrics

7.1 Introduction to Sequential Data and RNNs

7.1.1 Why Does Sequential Data Require a Special Architecture?

The Problem with Feedforward Networks for Sequential Data:

In Chapters 5 and 6, we studied MLPs and CNNs, which operate on fixed-size inputs. However, much real-world data is sequential, with temporal dependencies:

Examples of Sequential Data:

  • Time series: Stock prices, temperature, energy consumption
  • Text: Sentences, documents, conversations
  • Audio: Speech, music, sound
  • Video: Frame sequences
  • DNA sequences: Genomic data

Problems with Feedforward NNs:

  1. No memory: Each input is processed independently
  2. Fixed input size: Cannot handle variable-length sequences
  3. No temporal relationships: Order/sequence information is lost
  4. Parameter explosion: Each timestep would need its own parameters
💡 RNN Intuition

RNNs address these problems with:

  • Hidden state (memory): Stores information from previous timesteps
  • Parameter sharing: The same weights are used at every timestep
  • Variable-length sequences: Sequences of different lengths can be processed
  • Temporal modeling: Captures dependencies between timesteps

Analogy: Like reading a book - you understand a sentence based on the words that came before it!

7.1.2 Evolution of Sequential Models

Era Pre-RNN:

  • Hidden Markov Models (HMM)
  • Autoregressive models (AR, ARMA, ARIMA)
  • Manual feature engineering
  • Limitation: Cannot learn long-term dependencies

RNN Era (1986-2010s):

  • Simple RNN (1986): First recurrent architecture
  • LSTM (1997): Long Short-Term Memory - solved vanishing gradient
  • GRU (2014): Gated Recurrent Unit - simplified LSTM
  • Bidirectional RNN (1997): Process sequence forward & backward

Modern Era (2014-now):

  • Attention Mechanisms (2014): Bahdanau attention for neural machine translation
  • Transformers (2017): "Attention Is All You Need" - replaced RNNs for NLP
  • BERT, GPT (2018-now): Large language models
  • But: RNNs remain relevant for time series, small datasets, and interpretability
📊 RNN Applications Today

Industry Applications:

  • Finance: Stock prediction, algorithmic trading
  • Energy: Load forecasting, demand prediction
  • Healthcare: Patient monitoring, disease progression
  • Manufacturing: Predictive maintenance, quality control
  • NLP: Machine translation, text generation, sentiment analysis
  • Speech: Speech recognition, text-to-speech

7.1.3 Types of Sequential Problems

RNNs can handle several types of sequential problems:

graph LR
    A[One-to-One<br/>Traditional NN<br/>Image Classification] --> B[One-to-Many<br/>Image Captioning<br/>Music Generation]
    B --> C[Many-to-One<br/>Sentiment Analysis<br/>Video Classification]
    C --> D[Many-to-Many<br/>same length<br/>Video Frame Labeling]
    D --> E[Many-to-Many<br/>diff length<br/>Machine Translation]

    style A fill:#ffcccc
    style B fill:#ffe6cc
    style C fill:#ffffcc
    style D fill:#ccffcc
    style E fill:#ccccff

Detailed Explanation:

  1. One-to-One: Standard feedforward NN
    • Input: Single vector
    • Output: Single vector
    • Example: Image classification
  2. One-to-Many: Sequence generation
    • Input: Single vector (or an initial condition)
    • Output: Sequence
    • Example: Image captioning, music generation
  3. Many-to-One: Sequence classification
    • Input: Sequence
    • Output: Single vector
    • Example: Sentiment analysis, time series classification
  4. Many-to-Many (same length): Synchronized sequence
    • Input: Sequence
    • Output: Sequence (same length)
    • Example: Video frame labeling, POS tagging
  5. Many-to-Many (different length): Sequence-to-sequence
    • Input: Sequence
    • Output: Sequence (different length)
    • Example: Machine translation, text summarization
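In Keras, the many-to-one and many-to-many (same length) variants differ mainly in the `return_sequences` flag. A minimal sketch (layer sizes here are illustrative, not taken from a specific example in this chapter):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Many-to-one: return_sequences=False -> only the final hidden state is emitted
many_to_one = keras.Sequential([
    layers.Input(shape=(10, 3)),
    layers.SimpleRNN(16, return_sequences=False),
    layers.Dense(1),
])

# Many-to-many (same length): return_sequences=True -> one output per timestep
many_to_many = keras.Sequential([
    layers.Input(shape=(10, 3)),
    layers.SimpleRNN(16, return_sequences=True),
    layers.TimeDistributed(layers.Dense(1)),
])

x = np.random.randn(4, 10, 3).astype("float32")
print(many_to_one(x).shape)   # (4, 1)
print(many_to_many(x).shape)  # (4, 10, 1)
```

One-to-many and variable-length many-to-many setups need extra machinery (repeated inputs or encoder-decoder pairs), which Section 7.5-style seq2seq models cover.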

7.2 Simple RNN: Fundamentals

7.2.1 RNN Architecture Basics

A Recurrent Neural Network processes sequences with a hidden state that is updated at every timestep.

Mathematical Formulation:

For each timestep \(t\):

\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\]
\[y_t = W_{hy} h_t + b_y\]

Where:

  • \(x_t\): Input at timestep \(t\)
  • \(h_t\): Hidden state at timestep \(t\)
  • \(y_t\): Output at timestep \(t\)
  • \(W_{hh}\): Hidden-to-hidden weight matrix
  • \(W_{xh}\): Input-to-hidden weight matrix
  • \(W_{hy}\): Hidden-to-output weight matrix
  • \(b_h\), \(b_y\): Bias terms
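The two equations above translate directly into NumPy. This sketch (the helper name `rnn_forward` is introduced here for illustration) makes the parameter sharing explicit: the same weight matrices are reused at every timestep.

```python
import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, W_hy, b_h, b_y):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a whole sequence."""
    h = np.zeros(W_hh.shape[0])                      # h_0 = 0
    outputs = []
    for x_t in x_seq:                                # one timestep at a time
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)     # shared weights every step
        outputs.append(W_hy @ h + b_y)               # y_t = W_hy h_t + b_y
    return np.array(outputs), h

rng = np.random.default_rng(0)
hidden, inp, out = 4, 3, 2                           # illustrative sizes
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_xh = rng.normal(size=(hidden, inp)) * 0.1
W_hy = rng.normal(size=(out, hidden)) * 0.1
b_h, b_y = np.zeros(hidden), np.zeros(out)

x_seq = rng.normal(size=(10, inp))                   # sequence of 10 timesteps
ys, h_last = rnn_forward(x_seq, W_hh, W_xh, W_hy, b_h, b_y)
print(ys.shape, h_last.shape)                        # (10, 2) (4,)
```

Note that `tanh` keeps every hidden-state component in \([-1, 1]\), regardless of sequence length.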

RNN Visualization:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Folded representation
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('RNN: Folded Representation', fontsize=14, fontweight='bold')

# Input
input_rect = patches.Rectangle((1, 1), 2, 1, linewidth=2, edgecolor='blue', facecolor='lightblue')
ax.add_patch(input_rect)
ax.text(2, 1.5, r'$x_t$', ha='center', va='center', fontsize=12, fontweight='bold')

# Hidden state (RNN cell)
rnn_rect = patches.Rectangle((1, 4), 2, 2, linewidth=3, edgecolor='red', facecolor='lightyellow')
ax.add_patch(rnn_rect)
ax.text(2, 5, 'RNN', ha='center', va='center', fontsize=12, fontweight='bold')

# Output
output_rect = patches.Rectangle((1, 8), 2, 1, linewidth=2, edgecolor='green', facecolor='lightgreen')
ax.add_patch(output_rect)
ax.text(2, 8.5, r'$y_t$', ha='center', va='center', fontsize=12, fontweight='bold')

# Arrows
ax.arrow(2, 2.2, 0, 1.5, head_width=0.2, head_length=0.2, fc='blue', ec='blue', linewidth=2)
ax.arrow(2, 6.2, 0, 1.5, head_width=0.2, head_length=0.2, fc='green', ec='green', linewidth=2)

# Recurrent connection
from matplotlib.patches import FancyArrowPatch
arrow_loop = FancyArrowPatch((3.2, 5.5), (3.2, 4.5),
                             arrowstyle='->', mutation_scale=20, linewidth=2.5,
                             color='red', connectionstyle="arc3,rad=1.5")
ax.add_patch(arrow_loop)
ax.text(5, 5, r'$h_{t-1}$', fontsize=11, color='red', fontweight='bold')

# Unfolded representation
ax = axes[1]
ax.set_xlim(0, 16)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('RNN: Unfolded Representation (Through Time)', fontsize=14, fontweight='bold')

timesteps = [2, 6, 10, 14]
for i, t in enumerate(timesteps):
    # Input
    input_rect = patches.Rectangle((t-0.5, 1), 1, 0.8, linewidth=2, edgecolor='blue', facecolor='lightblue')
    ax.add_patch(input_rect)
    ax.text(t, 1.4, f'$x_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')

    # Hidden state
    rnn_rect = patches.Rectangle((t-0.5, 4), 1, 1.5, linewidth=2.5, edgecolor='red', facecolor='lightyellow')
    ax.add_patch(rnn_rect)
    ax.text(t, 4.75, f'$h_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')

    # Output
    output_rect = patches.Rectangle((t-0.5, 8), 1, 0.8, linewidth=2, edgecolor='green', facecolor='lightgreen')
    ax.add_patch(output_rect)
    ax.text(t, 8.4, f'$y_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')

    # Vertical arrows
    ax.arrow(t, 2, 0, 1.8, head_width=0.15, head_length=0.15, fc='blue', ec='blue', linewidth=1.5)
    ax.arrow(t, 5.7, 0, 2, head_width=0.15, head_length=0.15, fc='green', ec='green', linewidth=1.5)

    # Horizontal arrows (recurrent connections)
    if i < len(timesteps) - 1:
        ax.arrow(t+0.6, 4.75, 3, 0, head_width=0.2, head_length=0.2, fc='red', ec='red', linewidth=2)

# Time axis
ax.text(8, 0.3, 'Time →', ha='center', fontsize=12, fontweight='bold', style='italic')

plt.tight_layout()
plt.show()

7.2.2 Simple RNN Implementation

Keras Implementation:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Simple RNN for many-to-one classification
def build_simple_rnn_classifier(sequence_length=10, input_dim=1,
                                hidden_units=32, num_classes=2):
    """
    Simple RNN for sequence classification

    Parameters:
        sequence_length: Length of the input sequence
        input_dim: Feature dimension at each timestep
        hidden_units: Number of units in the RNN layer
        num_classes: Number of classes for classification

    Returns:
        model: Compiled Keras model
    """
    model = keras.Sequential([
        # SimpleRNN layer
        layers.SimpleRNN(
            units=hidden_units,
            activation='tanh',
            return_sequences=False,  # Many-to-one: only the last output
            input_shape=(sequence_length, input_dim),
            name='simple_rnn'
        ),

        # Dense layer for classification
        layers.Dense(num_classes, activation='softmax', name='output')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build model
simple_rnn = build_simple_rnn_classifier()
simple_rnn.summary()
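As a quick sanity check, such a classifier can be trained end-to-end on synthetic data. This is only a smoke test, not a benchmark; the task (deciding whether a sequence's mean is positive) is invented here for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic task (illustrative): is the sequence's mean positive?
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 10, 1)).astype("float32")
y = (X.mean(axis=(1, 2)) > 0).astype("int64")

model = keras.Sequential([
    layers.Input(shape=(10, 1)),
    layers.SimpleRNN(32, activation="tanh"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print(f"final training accuracy: {history.history['accuracy'][-1]:.3f}")
```

A few epochs on 256 samples are enough to confirm the shapes and the training loop work; real experiments would use train/validation splits.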

PyTorch Implementation:

import torch
import torch.nn as nn

class SimpleRNNClassifier(nn.Module):
    """
    Simple RNN for sequence classification (PyTorch)
    """
    def __init__(self, input_dim=1, hidden_dim=32, num_layers=1, num_classes=2):
        super(SimpleRNNClassifier, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # RNN layer
        self.rnn = nn.RNN(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            nonlinearity='tanh'
        )

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)

        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        # RNN forward pass
        out, hn = self.rnn(x, h0)
        # out shape: (batch, seq_len, hidden_dim)
        # hn shape: (num_layers, batch, hidden_dim)

        # Take output from last timestep
        out = out[:, -1, :]  # (batch, hidden_dim)

        # Fully connected layer
        out = self.fc(out)  # (batch, num_classes)

        return out

# Instantiate model
pytorch_rnn = SimpleRNNClassifier(input_dim=1, hidden_dim=32, num_classes=2)
print(pytorch_rnn)

# Test forward pass
dummy_input = torch.randn(4, 10, 1)  # (batch=4, seq_len=10, features=1)
output = pytorch_rnn(dummy_input)
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")

7.2.3 The Vanishing Gradient Problem

The main problem with Simple RNNs: the Vanishing Gradient

During backpropagation through time (BPTT), gradients must propagate through many timesteps. Because of repeated multiplication by weight matrices with norm < 1, the gradients shrink exponentially (vanish).

Mathematical Explanation:

The gradient with respect to an early timestep:

\[\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}\]

If \(\left|\frac{\partial h_t}{\partial h_{t-1}}\right| < 1\), the gradient decays exponentially!

Consequences:

  • Cannot learn long-term dependencies
  • Early timesteps receive too little gradient signal
  • The model learns only short-term patterns

Visualizing the Vanishing Gradient:

# Demonstrate the vanishing gradient
def demonstrate_vanishing_gradient():
    """
    Simulate how gradients vanish over timesteps
    """
    timesteps = 50

    # Simulate gradient flow with different weight values
    gradients_small_w = []
    gradients_good_w = []
    gradients_large_w = []

    initial_gradient = 1.0

    for w in [0.9, 1.0, 1.1]:
        gradient = initial_gradient
        gradient_history = [gradient]

        for t in range(timesteps):
            gradient = gradient * w  # Simplified gradient flow
            gradient_history.append(gradient)

        if w == 0.9:
            gradients_small_w = gradient_history
        elif w == 1.0:
            gradients_good_w = gradient_history
        else:
            gradients_large_w = gradient_history

    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

    # Linear scale
    ax1.plot(gradients_small_w, label='W = 0.9 (Vanishing)', linewidth=2.5, color='red')
    ax1.plot(gradients_good_w, label='W = 1.0 (Stable)', linewidth=2.5, color='green', linestyle='--')
    ax1.plot(gradients_large_w, label='W = 1.1 (Exploding)', linewidth=2.5, color='blue')
    ax1.set_xlabel('Timesteps (backward)', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
    ax1.set_title('Gradient Flow (Linear Scale)', fontsize=14, fontweight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(alpha=0.3)

    # Log scale
    ax2.semilogy(np.abs(gradients_small_w), label='W = 0.9 (Vanishing)', linewidth=2.5, color='red')
    ax2.semilogy(np.abs(gradients_good_w), label='W = 1.0 (Stable)', linewidth=2.5, color='green', linestyle='--')
    ax2.semilogy(np.abs(gradients_large_w), label='W = 1.1 (Exploding)', linewidth=2.5, color='blue')
    ax2.set_xlabel('Timesteps (backward)', fontsize=12, fontweight='bold')
    ax2.set_ylabel('Gradient Magnitude (log scale)', fontsize=12, fontweight='bold')
    ax2.set_title('Gradient Flow (Log Scale)', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=11)
    ax2.grid(alpha=0.3, which='both')

    plt.tight_layout()
    plt.show()

    print("Gradient after 50 timesteps:")
    print(f"  W=0.9 (vanishing): {gradients_small_w[-1]:.2e}")
    print(f"  W=1.0 (stable):    {gradients_good_w[-1]:.2e}")
    print(f"  W=1.1 (exploding): {gradients_large_w[-1]:.2e}")

demonstrate_vanishing_gradient()
⚠️ Vanishing vs Exploding Gradients

Vanishing Gradient (more common):

  • Gradients → 0
  • Cannot learn long-term dependencies
  • Solution: LSTM, GRU, skip connections

Exploding Gradient (less common):

  • Gradients → ∞
  • Training unstable, NaN values
  • Solution: Gradient clipping, careful initialization
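Gradient clipping, the standard fix for exploding gradients, is a one-liner in both frameworks. A small PyTorch sketch (the toy linear model exists only to produce large gradients):

```python
import torch
import torch.nn as nn

# Toy setup whose large activations produce deliberately large gradients
model = nn.Linear(4, 1)
loss = model(torch.randn(8, 4) * 100).sum()
loss.backward()

norm_before = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()]))

# Rescale so the global gradient norm is at most 1.0 before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

norm_after = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()]))
print(f"global grad norm: {norm_before.item():.2f} -> {norm_after.item():.2f}")
```

In Keras the equivalent is passing `clipnorm=1.0` (or `clipvalue`) to the optimizer, e.g. `keras.optimizers.Adam(clipnorm=1.0)`.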

7.3 LSTM: Long Short-Term Memory

7.3.1 LSTM Architecture

LSTM was designed specifically to overcome the vanishing gradient problem using gating mechanisms.

Key Components:

  1. Cell State (\(C_t\)): Long-term memory highway
  2. Hidden State (\(h_t\)): Short-term memory (output)
  3. Forget Gate (\(f_t\)): Decides what to discard from the cell state
  4. Input Gate (\(i_t\)): Decides what new information to store
  5. Output Gate (\(o_t\)): Decides what to output

LSTM Equations:

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\] (forget gate)
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] (input gate)
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\] (candidate values)
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\] (cell-state update)
\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\] (output gate)
\[h_t = o_t \odot \tanh(C_t)\] (hidden state)

Where:

  • \(\sigma\): Sigmoid function (outputs 0-1, acts as a gate)
  • \(\odot\): Element-wise multiplication
  • \(\tanh\): Hyperbolic tangent (outputs -1 to 1)
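The six equations translate almost line for line into NumPy. In this sketch the four gate pre-activations are packed into a single weight matrix `W`, a common implementation trick; `lstm_step` is a name introduced here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W maps [h_{t-1}, x_t] to all four gate pre-activations."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # shape (4H,)
    f = sigmoid(z[0:H])                         # forget gate
    i = sigmoid(z[H:2*H])                       # input gate
    C_tilde = np.tanh(z[2*H:3*H])               # candidate values
    o = sigmoid(z[3*H:4*H])                     # output gate
    C = f * C_prev + i * C_tilde                # additive cell-state update
    h = o * np.tanh(C)                          # new hidden state
    return h, C

rng = np.random.default_rng(1)
H, D = 5, 3                                     # illustrative sizes
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)

h, C = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(20, D)):            # run 20 timesteps
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape, C.shape)
```

Because \(h_t = o_t \odot \tanh(C_t)\), the hidden state always stays in \([-1, 1]\), while the cell state \(C_t\) is unbounded and can accumulate long-term information.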

LSTM Cell Visualization:

# Complex LSTM visualization
fig, ax = plt.subplots(figsize=(16, 10))
ax.set_xlim(0, 16)
ax.set_ylim(0, 12)
ax.axis('off')
ax.set_title('LSTM Cell Architecture', fontsize=16, fontweight='bold', pad=20)

# Cell state (top highway)
ax.plot([1, 15], [10, 10], 'k-', linewidth=4, label='Cell State ($C_t$)')
ax.text(0.3, 10, r'$C_{t-1}$', fontsize=12, fontweight='bold', va='center')
ax.text(15.3, 10, r'$C_t$', fontsize=12, fontweight='bold', va='center')

# Forget gate
forget_rect = patches.Rectangle((3, 8), 1.5, 1.5, linewidth=2, edgecolor='red',
                                facecolor='lightcoral', alpha=0.7)
ax.add_patch(forget_rect)
ax.text(3.75, 8.75, r'$f_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(3.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# Forget gate operation
ax.plot([3.75, 3.75], [9.5, 10], 'r-', linewidth=2)
ax.plot([5, 5], [10, 10], 'ro', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(5, 10.6, '×', fontsize=14, fontweight='bold', ha='center')

# Input gate
input_rect = patches.Rectangle((7, 8), 1.5, 1.5, linewidth=2, edgecolor='blue',
                               facecolor='lightblue', alpha=0.7)
ax.add_patch(input_rect)
ax.text(7.75, 8.75, r'$i_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(7.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# Candidate values
candidate_rect = patches.Rectangle((9.5, 8), 1.5, 1.5, linewidth=2, edgecolor='purple',
                                  facecolor='plum', alpha=0.7)
ax.add_patch(candidate_rect)
ax.text(10.25, 8.75, r'$\tilde{C}_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(10.25, 7.3, r'$\tanh$', ha='center', fontsize=10)

# Combine input and candidate
ax.plot([7.75, 7.75], [9.5, 10.5], 'b-', linewidth=2)
ax.plot([10.25, 10.25], [9.5, 10.5], 'purple', linewidth=2)
ax.plot([9, 9], [10.5, 10.5], 'g-', linewidth=2)
ax.plot([9, 9], [10, 10], 'go', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(9, 10.6, '×', fontsize=14, fontweight='bold', ha='center')

# Add to cell state
ax.plot([11, 11], [10, 10], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(11, 10.6, '+', fontsize=14, fontweight='bold', ha='center')

# Output gate
output_rect = patches.Rectangle((13, 8), 1.5, 1.5, linewidth=2, edgecolor='green',
                                facecolor='lightgreen', alpha=0.7)
ax.add_patch(output_rect)
ax.text(13.75, 8.75, r'$o_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(13.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# tanh for output
tanh_rect = patches.Rectangle((12.5, 5.5), 1, 1, linewidth=2, edgecolor='orange',
                              facecolor='wheat', alpha=0.7)
ax.add_patch(tanh_rect)
ax.text(13, 6, r'$\tanh$', ha='center', va='center', fontsize=10, fontweight='bold')

# Output combination
ax.plot([13, 13], [6.5, 7], 'orange', linewidth=2)
ax.plot([13.75, 13.75], [9.5, 7], 'g-', linewidth=2)
ax.plot([13.4, 13.4], [7, 7], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(13.4, 7.6, '×', fontsize=14, fontweight='bold', ha='center')

# Hidden state output
ax.arrow(13.4, 6.5, 0, -2.5, head_width=0.2, head_length=0.2, fc='green', ec='green', linewidth=2.5)
ax.text(13.4, 3.5, r'$h_t$', ha='center', fontsize=12, fontweight='bold')

# Inputs
ax.arrow(3.75, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(7.75, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(10.25, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(13.75, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)

ax.text(8, 1, r'$[h_{t-1}, x_t]$', ha='center', fontsize=12, fontweight='bold',
       bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='black', linewidth=2))

# Legend
legend_elements = [
    patches.Patch(facecolor='lightcoral', edgecolor='red', label='Forget Gate'),
    patches.Patch(facecolor='lightblue', edgecolor='blue', label='Input Gate'),
    patches.Patch(facecolor='plum', edgecolor='purple', label='Candidate'),
    patches.Patch(facecolor='lightgreen', edgecolor='green', label='Output Gate'),
]
ax.legend(handles=legend_elements, loc='upper left', fontsize=11)

plt.tight_layout()
plt.show()

7.3.2 LSTM Implementation

Keras LSTM:

def build_lstm_model(sequence_length=50, input_dim=1,
                     lstm_units=64, dense_units=32, output_dim=1):
    """
    LSTM model for time series forecasting

    Parameters:
        sequence_length: Lookback window size
        input_dim: Number of features per timestep
        lstm_units: LSTM hidden units
        dense_units: Dense layer units
        output_dim: Prediction horizon

    Returns:
        model: Compiled Keras model
    """
    model = keras.Sequential([
        # LSTM layer
        layers.LSTM(
            units=lstm_units,
            activation='tanh',
            recurrent_activation='sigmoid',
            return_sequences=False,  # Return last output only
            input_shape=(sequence_length, input_dim),
            name='lstm_layer'
        ),

        # Dropout untuk regularization
        layers.Dropout(0.2, name='dropout'),

        # Dense layers
        layers.Dense(dense_units, activation='relu', name='dense_1'),
        layers.Dense(output_dim, activation='linear', name='output')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )

    return model

# Build LSTM model
lstm_model = build_lstm_model(sequence_length=50, lstm_units=64)
lstm_model.summary()
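Before this model can be fit, a raw series must be windowed into `(samples, sequence_length, features)` arrays matching the "lookback window" parameter. A minimal sketch, with `create_windows` as an illustrative helper (not part of Keras):

```python
import numpy as np

def create_windows(series, lookback=50, horizon=1):
    """Slice a 1-D series into (X, y): X is a lookback window, y the next value(s)."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + horizon])
    X = np.array(X)[..., np.newaxis]   # (samples, lookback, 1) for the LSTM
    y = np.array(y)                    # (samples, horizon)
    return X, y

series = np.sin(np.linspace(0, 20 * np.pi, 1000))   # synthetic signal
X, y = create_windows(series, lookback=50, horizon=1)
print(X.shape, y.shape)   # (950, 50, 1) (950, 1)
```

These arrays can be passed straight to `model.fit(X, y, ...)`; for real series, split chronologically (no shuffling across the train/test boundary).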

PyTorch LSTM:

class LSTMForecaster(nn.Module):
    """
    LSTM model for time series forecasting (PyTorch)
    """
    def __init__(self, input_dim=1, hidden_dim=64, num_layers=1,
                 dense_dim=32, output_dim=1, dropout=0.2):
        super(LSTMForecaster, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Dropout
        self.dropout = nn.Dropout(dropout)

        # Fully connected layers
        self.fc1 = nn.Linear(hidden_dim, dense_dim)
        self.fc2 = nn.Linear(dense_dim, output_dim)

        # Activation
        self.relu = nn.ReLU()

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)

        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        # LSTM forward pass
        out, (hn, cn) = self.lstm(x, (h0, c0))
        # out: (batch, seq_len, hidden_dim)

        # Take last timestep output
        out = out[:, -1, :]  # (batch, hidden_dim)

        # Dropout
        out = self.dropout(out)

        # Fully connected layers
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)

        return out

# Instantiate PyTorch LSTM
pytorch_lstm = LSTMForecaster(input_dim=1, hidden_dim=64, num_layers=2)
print(pytorch_lstm)

# Test
test_input = torch.randn(8, 50, 1)  # (batch=8, seq=50, features=1)
test_output = pytorch_lstm(test_input)
print(f"\nInput shape: {test_input.shape}")
print(f"Output shape: {test_output.shape}")
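A training loop for such a forecaster follows the usual PyTorch pattern: zero gradients, forward pass, loss, backward pass, optimizer step. A minimal sketch on a toy target (predicting the mean of the window); `TinyLSTM` is a scaled-down stand-in for the model above:

```python
import torch
import torch.nn as nn

class TinyLSTM(nn.Module):
    """Scaled-down forecaster mirroring the LSTMForecaster structure."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        out, _ = self.lstm(x)            # default zero initial (h0, c0)
        return self.fc(out[:, -1, :])    # last timestep -> prediction

torch.manual_seed(0)
model = TinyLSTM()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(64, 50, 1)               # synthetic windows
y = X.mean(dim=1)                        # toy target: mean of each window

losses = []
for epoch in range(20):                  # full-batch for brevity
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice one would iterate over mini-batches from a `DataLoader` and track a held-out validation loss; the skeleton stays the same.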

7.3.3 How LSTM Solves Vanishing Gradient

LSTM's Solution: Additive Cell State Update

Key insight: the cell state \(C_t\) is updated additively, not multiplicatively!

\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

Gradient Flow:

\[\frac{\partial C_t}{\partial C_{t-1}} = f_t\]

The forget gate \(f_t\) can stay close to 1, allowing gradients to flow without decay!

Comparison:

# Comparison: Simple RNN vs LSTM gradient flow
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

timesteps = np.arange(0, 51)

# Simple RNN: multiplicative gradient
rnn_gradient = 0.95 ** timesteps

# LSTM: controlled by forget gate (closer to 1)
lstm_gradient_forget_high = 0.99 ** timesteps
lstm_gradient_forget_medium = 0.95 ** timesteps
lstm_gradient_forget_low = 0.90 ** timesteps

# Linear scale
ax1.plot(timesteps, rnn_gradient, 'r-', linewidth=3, label='Simple RNN (W=0.95)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_high, 'g-', linewidth=3, label='LSTM (forget=0.99)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_medium, 'b--', linewidth=2.5, label='LSTM (forget=0.95)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_low, 'purple', linewidth=2, label='LSTM (forget=0.90)', linestyle=':', alpha=0.8)
ax1.set_xlabel('Timesteps', fontsize=12, fontweight='bold')
ax1.set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
ax1.set_title('Gradient Flow Comparison (Linear)', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(alpha=0.3)

# Log scale
ax2.semilogy(timesteps, rnn_gradient, 'r-', linewidth=3, label='Simple RNN (W=0.95)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_high, 'g-', linewidth=3, label='LSTM (forget=0.99)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_medium, 'b--', linewidth=2.5, label='LSTM (forget=0.95)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_low, 'purple', linewidth=2, label='LSTM (forget=0.90)', linestyle=':', alpha=0.8)
ax2.set_xlabel('Timesteps', fontsize=12, fontweight='bold')
ax2.set_ylabel('Gradient Magnitude (log)', fontsize=12, fontweight='bold')
ax2.set_title('Gradient Flow Comparison (Log Scale)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(alpha=0.3, which='both')

plt.tight_layout()
plt.show()

print("Gradient after 50 timesteps:")
print(f"  Simple RNN:      {rnn_gradient[-1]:.6f}")
print(f"  LSTM (f=0.99):   {lstm_gradient_forget_high[-1]:.6f}")
print(f"  LSTM (f=0.95):   {lstm_gradient_forget_medium[-1]:.6f}")
print(f"  LSTM (f=0.90):   {lstm_gradient_forget_low[-1]:.6f}")
💡 Why LSTM Works
  1. Cell State Highway: A direct path for information flow without repeated transformations
  2. Gating Mechanisms: Learned control over when to remember and when to forget
  3. Additive Updates: Gradients are not repeatedly multiplied by the same weights
  4. Flexible Memory: Can learn long-term dependencies (100+ timesteps)

Result: LSTMs can learn dependencies spanning hundreds of timesteps, while Simple RNNs manage only ~10!

7.4 GRU: Gated Recurrent Unit

7.4.1 GRU Architecture

The GRU is a simplified version of the LSTM with fewer parameters but comparable performance.

Key Differences from LSTM:

  • 2 gates instead of 3 (reset gate, update gate)
  • No separate cell state - only a hidden state
  • Fewer parameters - faster training
  • Simpler architecture - easier to understand

GRU Equations:

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] (update gate)
\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] (reset gate)
\[\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])\] (candidate hidden state)
\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\] (final hidden state)

Component Functions:

  • Update gate (\(z_t\)): Decides how much past info to keep
  • Reset gate (\(r_t\)): Decides how much past info to forget
  • Candidate state (\(\tilde{h}_t\)): New memory content
  • Final state (\(h_t\)): Combination of old and new
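These four equations can be checked directly in NumPy. A sketch with biases omitted for brevity; `gru_step` is a name introduced here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the equations above (biases omitted)."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ concat)                                   # update gate
    r = sigmoid(W_r @ concat)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate
    return (1 - z) * h_prev + z * h_tilde                       # interpolate old/new

rng = np.random.default_rng(2)
H, D = 5, 3                                  # illustrative sizes
W_z = rng.normal(size=(H, H + D)) * 0.1
W_r = rng.normal(size=(H, H + D)) * 0.1
W_h = rng.normal(size=(H, H + D)) * 0.1

h = np.zeros(H)
for x_t in rng.normal(size=(20, D)):         # run 20 timesteps
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h.shape)
```

Note how \(h_t\) is a convex combination of the old state and the candidate: with \(z_t \approx 0\) the state is copied through unchanged, which is the GRU's analogue of the LSTM's cell-state highway.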

GRU Visualization:

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('GRU Cell Architecture', fontsize=16, fontweight='bold', pad=20)

# Reset gate
reset_rect = patches.Rectangle((2, 7), 1.5, 1.5, linewidth=2, edgecolor='red',
                               facecolor='lightcoral', alpha=0.7)
ax.add_patch(reset_rect)
ax.text(2.75, 7.75, r'$r_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(2.75, 6.3, r'$\sigma$', ha='center', fontsize=10)

# Update gate
update_rect = patches.Rectangle((5.5, 7), 1.5, 1.5, linewidth=2, edgecolor='blue',
                                facecolor='lightblue', alpha=0.7)
ax.add_patch(update_rect)
ax.text(6.25, 7.75, r'$z_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(6.25, 6.3, r'$\sigma$', ha='center', fontsize=10)

# Candidate hidden state
candidate_rect = patches.Rectangle((9, 7), 1.5, 1.5, linewidth=2, edgecolor='purple',
                                  facecolor='plum', alpha=0.7)
ax.add_patch(candidate_rect)
ax.text(9.75, 7.75, r'$\tilde{h}_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(9.75, 6.3, r'$\tanh$', ha='center', fontsize=10)

# Reset operation
ax.plot([2.75, 2.75], [8.5, 9.5], 'r-', linewidth=2)
ax.plot([2.75, 8], [9.5, 9.5], 'r-', linewidth=2)
ax.plot([8, 8], [9.5, 9], 'r-', linewidth=2)
ax.plot([8, 8], [9, 9], 'ro', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(8, 9.5, '×', fontsize=13, fontweight='bold', ha='center', va='bottom')

# Previous hidden state path
ax.plot([0.5, 11.5], [9, 9], 'k-', linewidth=3, alpha=0.5)
ax.text(0, 9, r'$h_{t-1}$', fontsize=11, fontweight='bold', va='center')

# Update gate paths
ax.plot([6.25, 6.25], [8.5, 5], 'b-', linewidth=2)
ax.plot([6.25, 11.5], [5, 5], 'b-', linewidth=2)

# 1 - z_t path
ax.plot([4, 4], [5, 5], 'b--', linewidth=2)
ax.plot([4, 11.5], [3, 3], 'b--', linewidth=2)
ax.text(3.5, 5, r'$1-z_t$', fontsize=10, ha='right', color='blue', fontweight='bold')

# Candidate combination
ax.plot([9.75, 9.75], [8.5, 5], 'purple', linewidth=2)
ax.plot([11.5, 11.5], [5, 5], 'go', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 5.5, '×', fontsize=13, fontweight='bold', ha='center', color='purple')

# Old hidden state path
ax.plot([11.5, 11.5], [9, 3], 'k--', linewidth=2, alpha=0.5)
ax.plot([11.5, 11.5], [3, 3], 'go', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 2.5, '×', fontsize=13, fontweight='bold', ha='center')

# Final combination
ax.plot([11.5, 11.5], [4.5, 1.5], 'g-', linewidth=2.5)
ax.plot([11.5, 11.5], [2.5, 1.5], 'g-', linewidth=2.5)
ax.plot([11.5, 11.5], [1.5, 1.5], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 1, '+', fontsize=14, fontweight='bold', ha='center')

# Output
ax.arrow(11.5, 0.8, 0, -0.3, head_width=0.2, head_length=0.1, fc='green', ec='green', linewidth=2.5)
ax.text(11.5, 0, r'$h_t$', ha='center', fontsize=12, fontweight='bold')

# Inputs
ax.arrow(2.75, 2.5, 0, 4, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(6.25, 2.5, 0, 4, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(9.75, 2.5, 0, 4, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)

ax.text(6.25, 1.5, r'$[h_{t-1}, x_t]$', ha='center', fontsize=11, fontweight='bold',
       bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='black', linewidth=2))

plt.tight_layout()
plt.show()

7.4.2 GRU Implementation

Keras GRU:

def build_gru_model(sequence_length=50, input_dim=1,
                    gru_units=64, dense_units=32, output_dim=1):
    """
    GRU model for time series forecasting
    """
    model = keras.Sequential([
        # GRU layer
        layers.GRU(
            units=gru_units,
            activation='tanh',
            recurrent_activation='sigmoid',
            return_sequences=False,
            input_shape=(sequence_length, input_dim),
            name='gru_layer'
        ),

        # Dropout
        layers.Dropout(0.2, name='dropout'),

        # Dense layers
        layers.Dense(dense_units, activation='relu', name='dense_1'),
        layers.Dense(output_dim, activation='linear', name='output')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )

    return model

# Build GRU model
gru_model = build_gru_model()
gru_model.summary()

PyTorch GRU:

class GRUForecaster(nn.Module):
    """
    GRU model for time series forecasting (PyTorch)
    """
    def __init__(self, input_dim=1, hidden_dim=64, num_layers=1,
                 dense_dim=32, output_dim=1, dropout=0.2):
        super(GRUForecaster, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # GRU layer
        self.gru = nn.GRU(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Dropout
        self.dropout = nn.Dropout(dropout)

        # Fully connected layers
        self.fc1 = nn.Linear(hidden_dim, dense_dim)
        self.fc2 = nn.Linear(dense_dim, output_dim)

        # Activation
        self.relu = nn.ReLU()

    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        # GRU forward pass
        out, hn = self.gru(x, h0)

        # Take last timestep
        out = out[:, -1, :]

        # Dropout
        out = self.dropout(out)

        # FC layers
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)

        return out

# Instantiate
pytorch_gru = GRUForecaster(hidden_dim=64, num_layers=2)
print(pytorch_gru)

7.4.3 LSTM vs GRU: When to Use What?

Comparison Table:

| Aspect | LSTM | GRU |
|---|---|---|
| Parameters | More (3 gates + cell state, 4 set bobot) | Fewer (2 gates, 3 set bobot) |
| Training speed | Slower | Faster |
| Memory | Higher | Lower |
| Performance | Slightly better on complex tasks | Comparable on most tasks |
| Long-term dependencies | Excellent | Very good |
| Overfitting risk | Higher (more params) | Lower |
| When to use | Large datasets, complex patterns | Smaller datasets, faster training needed |

Practical Guidelines:

# Parameter comparison
def compare_parameters():
    """
    Compare parameter counts: LSTM vs GRU
    """
    seq_len, input_dim, hidden_dim = 50, 1, 64

    # Build models
    lstm_model = keras.Sequential([
        layers.LSTM(hidden_dim, input_shape=(seq_len, input_dim)),
        layers.Dense(1)
    ])

    gru_model = keras.Sequential([
        layers.GRU(hidden_dim, input_shape=(seq_len, input_dim)),
        layers.Dense(1)
    ])

    lstm_params = lstm_model.count_params()
    gru_params = gru_model.count_params()

    print("Parameter Comparison:")
    print(f"  LSTM parameters: {lstm_params:,}")
    print(f"  GRU parameters:  {gru_params:,}")
    print(f"  Difference:      {lstm_params - gru_params:,} ({(lstm_params-gru_params)/gru_params*100:.1f}% more)")
    print(f"\n  LSTM is {lstm_params/gru_params:.2f}x larger than GRU")

compare_parameters()
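Selisih parameter di atas juga bisa dicek secara analitik tanpa membangun model. Sketsa berikut menghitung jumlah parameter dari rumus standar; asumsi: implementasi Keras/TF2 default (`use_bias=True`, dan GRU dengan `reset_after=True` sehingga memakai dua vektor bias per set bobot).

```python
def lstm_param_count(input_dim, hidden_dim):
    # LSTM: 4 set bobot (forget, input, output, candidate),
    # masing-masing W (h x input_dim), U (h x h), dan bias (h)
    return 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)

def gru_param_count(input_dim, hidden_dim):
    # GRU: 3 set bobot (update, reset, candidate); default Keras
    # reset_after=True memakai 2 vektor bias per set
    return 3 * (hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

print(lstm_param_count(1, 64))  # 16896
print(gru_param_count(1, 64))   # 12864
```

Angka ini seharusnya cocok dengan output `compare_parameters()` di atas untuk konfigurasi yang sama.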

🎯 Rule of Thumb

Use LSTM when:

  • You have large datasets (millions of samples)
  • Task requires very long-term dependencies (100+ timesteps)
  • Model interpretability less important
  • Computational resources abundant

Use GRU when:

  • Smaller datasets or limited computational resources
  • Need faster training/inference
  • Medium-term dependencies (10-100 timesteps)
  • Want simpler model with fewer hyperparameters

In practice: Try both! GRU often performs similarly with less complexity.

7.5 Advanced RNN Architectures

7.5.1 Bidirectional RNNs

Konsep: Process sequence forward AND backward untuk mendapatkan context dari kedua arah.

Use Cases:

  • Sentiment analysis (membutuhkan full sentence context)
  • Named Entity Recognition
  • Speech recognition
  • Tidak cocok untuk real-time forecasting (butuh future data)
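Secara konseptual, bidirectional RNN hanyalah dua RNN biasa yang berbagi input: satu membaca sequence maju, satu membaca mundur, lalu hidden state keduanya di-concatenate per timestep. Sketsa NumPy minimal berikut (simple RNN cell dengan bobot acak, murni sebagai ilustrasi) memperlihatkan mekanismenya:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 1, 4

# Bobot terpisah untuk arah maju (f) dan mundur (b) - asumsi: inisialisasi acak
Wf, Uf = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))
Wb, Ub = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))

def rnn_pass(x_seq, W, U):
    """Forward pass simple RNN: h_t = tanh(W x_t + U h_{t-1})."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in x_seq:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return states

x = rng.normal(size=(seq_len, input_dim))
h_fwd = rnn_pass(x, Wf, Uf)              # proses x1 → xT
h_bwd = rnn_pass(x[::-1], Wb, Ub)[::-1]  # proses xT → x1, lalu disejajarkan ulang

# Concatenate per timestep: hasilnya (seq_len, 2*hidden_dim)
h_bi = np.stack([np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)])
print(h_bi.shape)  # (5, 8)
```

Inilah alasan output bidirectional layer berdimensi dua kali unit per arah.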

Architecture:

flowchart LR
    subgraph Forward["Forward Pass"]
        direction LR
        X1[x1] --> F1[h1_fwd]
        X2[x2] --> F2[h2_fwd]
        X3[x3] --> F3[h3_fwd]
        F1 --> F2
        F2 --> F3
    end

    subgraph Backward["Backward Pass"]
        direction RL
        X1B[x1] --> B1[h1_bwd]
        X2B[x2] --> B2[h2_bwd]
        X3B[x3] --> B3[h3_bwd]
        B3 --> B2
        B2 --> B1
    end

    F1 --> C1[Concat]
    B1 --> C1
    F2 --> C2[Concat]
    B2 --> C2
    F3 --> C3[Concat]
    B3 --> C3

    C1 --> Y1[y1]
    C2 --> Y2[y2]
    C3 --> Y3[y3]

    style Forward fill:#e6f3ff,stroke:#333,stroke-width:2px
    style Backward fill:#ffe6e6,stroke:#333,stroke-width:2px
    style C1 fill:#fff4e6,stroke:#333
    style C2 fill:#fff4e6,stroke:#333
    style C3 fill:#fff4e6,stroke:#333
Figure 7.1: Arsitektur Bidirectional LSTM - menggabungkan informasi dari forward dan backward pass

Implementation:

def build_bidirectional_lstm(sequence_length=50, input_dim=1,
                             lstm_units=64, num_classes=3):
    """
    Bidirectional LSTM untuk sequence classification
    """
    model = keras.Sequential([
        # Bidirectional LSTM
        layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True),
            input_shape=(sequence_length, input_dim),
            name='bidirectional_lstm_1'
        ),

        # Second bidirectional layer
        layers.Bidirectional(
            layers.LSTM(lstm_units // 2),
            name='bidirectional_lstm_2'
        ),

        # Dropout
        layers.Dropout(0.3),

        # Output
        layers.Dense(num_classes, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build model
bi_lstm = build_bidirectional_lstm()
bi_lstm.summary()

# Note: Output bidirectional layer = concatenation forward + backward.
# Layer pertama: 64 units/arah, return_sequences=True → (batch, seq_len, 128)
# Layer kedua: 32 units/arah (lstm_units // 2) → output (batch, 64)

7.5.2 Stacked/Deep RNNs

Konsep: Stack multiple RNN layers untuk learn hierarchical representations.

Benefits:

  • Learn more complex patterns
  • Better feature extraction
  • Hierarchical temporal abstractions

Caution:

  • More parameters = more data needed
  • Risk of overfitting
  • Harder to train

Implementation:

def build_stacked_lstm(sequence_length=50, input_dim=1,
                      lstm_layers=[128, 64, 32], output_dim=1):
    """
    Stacked LSTM dengan multiple layers

    Parameters:
        lstm_layers: List of units per layer (minimal 2 layer) [layer1_units, layer2_units, ...]
    """
    model = keras.Sequential(name='Stacked_LSTM')

    # First LSTM layer (must return sequences)
    model.add(layers.LSTM(
        lstm_layers[0],
        return_sequences=True,
        input_shape=(sequence_length, input_dim),
        name='lstm_1'
    ))
    model.add(layers.Dropout(0.2, name='dropout_1'))

    # Middle layers (return sequences for all except last)
    for i, units in enumerate(lstm_layers[1:-1], start=2):
        model.add(layers.LSTM(
            units,
            return_sequences=True,
            name=f'lstm_{i}'
        ))
        model.add(layers.Dropout(0.2, name=f'dropout_{i}'))

    # Last LSTM layer (return_sequences=False)
    model.add(layers.LSTM(
        lstm_layers[-1],
        return_sequences=False,
        name=f'lstm_{len(lstm_layers)}'
    ))
    model.add(layers.Dropout(0.2, name=f'dropout_{len(lstm_layers)}'))

    # Output layer
    model.add(layers.Dense(output_dim, activation='linear', name='output'))

    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )

    return model

# Build 3-layer stacked LSTM
stacked_lstm = build_stacked_lstm(lstm_layers=[128, 64, 32])
stacked_lstm.summary()

7.5.3 Encoder-Decoder (Seq2Seq)

Konsep: Architecture untuk sequence-to-sequence tasks dengan variable input/output lengths.

Components:

  1. Encoder: Process input sequence → context vector
  2. Decoder: Generate output sequence dari context vector

Use Cases:

  • Machine translation
  • Text summarization
  • Question answering
  • Image captioning

Architecture:

flowchart LR
    subgraph Encoder["Encoder"]
        direction LR
        X1[x1] --> E1[LSTM 1]
        X2[x2] --> E2[LSTM 2]
        X3[x3] --> E3[LSTM 3]
        E1 --> E2
        E2 --> E3
    end

    E3 ==> C[Context Vector]

    subgraph Decoder["Decoder"]
        direction LR
        C ==> D1[LSTM 1]
        D1 --> D2[LSTM 2]
        D2 --> D3[LSTM 3]
        D1 --> Y1[y1]
        D2 --> Y2[y2]
        D3 --> Y3[y3]
    end

    style Encoder fill:#e6f3ff,stroke:#333,stroke-width:2px
    style C fill:#ffffcc,stroke:#f90,stroke-width:3px
    style Decoder fill:#ffe6e6,stroke:#333,stroke-width:2px
    style Y1 fill:#d4edda,stroke:#333
    style Y2 fill:#d4edda,stroke:#333
    style Y3 fill:#d4edda,stroke:#333
Figure 7.2: Arsitektur Encoder-Decoder (Seq2Seq) - Encoder mengompres input menjadi context vector, Decoder menghasilkan output sequence

Implementation:

def build_seq2seq_model(encoder_seq_len=10, decoder_seq_len=10,
                       input_dim=1, output_dim=1, latent_dim=64):
    """
    Simple Seq2Seq model

    Parameters:
        encoder_seq_len: Input sequence length
        decoder_seq_len: Output sequence length
        latent_dim: Hidden dimension
    """
    # Encoder
    encoder_inputs = layers.Input(shape=(encoder_seq_len, input_dim), name='encoder_input')
    encoder_lstm = layers.LSTM(latent_dim, return_state=True, name='encoder_lstm')
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
    encoder_states = [state_h, state_c]  # Context vector

    # Decoder
    decoder_inputs = layers.Input(shape=(decoder_seq_len, output_dim), name='decoder_input')
    decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True, name='decoder_lstm')
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = layers.Dense(output_dim, activation='linear', name='decoder_dense')
    decoder_outputs = decoder_dense(decoder_outputs)

    # Model
    model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs, name='Seq2Seq')

    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )

    return model

# Build seq2seq
seq2seq = build_seq2seq_model()
seq2seq.summary()

7.6 Time Series Forecasting dengan RNN

7.6.1 Problem Formulation

Time Series Forecasting Task:

Given historical data \(x_1, x_2, ..., x_t\), predict future values \(x_{t+1}, x_{t+2}, ..., x_{t+h}\)

Approaches:

  1. One-step ahead: Predict \(x_{t+1}\) saja
  2. Multi-step ahead: Predict sequence \([x_{t+1}, ..., x_{t+h}]\)
  3. Recursive: Use predictions as input untuk next prediction
  4. Direct: Separate model untuk setiap horizon
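Pendekatan recursive (poin 3) dapat disketsakan seperti berikut. Asumsi: `predict_fn` mewakili model one-step apa pun; di sini diganti fungsi dummy (rata-rata window) agar contoh bisa dijalankan mandiri, tetapi model LSTM/GRU terlatih dipakai dengan pola yang sama.

```python
import numpy as np

def recursive_forecast(predict_fn, history, horizon):
    """Multi-step forecasting: prediksi dipakai kembali sebagai input berikutnya."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        next_val = predict_fn(np.array(window))
        preds.append(next_val)
        window = window[1:] + [next_val]  # geser window, masukkan prediksi terbaru
    return preds

# Dummy one-step "model": rata-rata window (hanya untuk ilustrasi)
mean_model = lambda w: float(w.mean())

history = [1.0, 2.0, 3.0, 4.0]
print(recursive_forecast(mean_model, history, horizon=3))
```

Perhatikan konsekuensinya: error pada langkah awal ikut terbawa ke langkah berikutnya (error accumulation), kelemahan utama pendekatan recursive dibanding direct.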

Data Preparation:

def create_sequences(data, lookback=50, horizon=1):
    """
    Create input-output sequences untuk time series forecasting

    Parameters:
        data: 1D array time series data
        lookback: Number of past timesteps to use as input
        horizon: Number of future timesteps to predict

    Returns:
        X: Input sequences (samples, lookback, features)
        y: Target values (samples, horizon)
    """
    X, y = [], []

    for i in range(len(data) - lookback - horizon + 1):
        # Input: [i : i+lookback]
        X.append(data[i : i + lookback])

        # Target: [i+lookback : i+lookback+horizon]
        if horizon == 1:
            y.append(data[i + lookback])
        else:
            y.append(data[i + lookback : i + lookback + horizon])

    X = np.array(X)
    y = np.array(y)

    # Reshape X to (samples, lookback, 1) untuk univariate
    if len(X.shape) == 2:
        X = X.reshape((X.shape[0], X.shape[1], 1))

    return X, y

# Example
np.random.seed(42)
sample_data = np.sin(np.linspace(0, 100, 1000)) + np.random.normal(0, 0.1, 1000)

X, y = create_sequences(sample_data, lookback=50, horizon=1)
print(f"Input shape: {X.shape}")   # (samples, 50, 1)
print(f"Output shape: {y.shape}")  # (samples, 1)

# Visualize sequences
fig, axes = plt.subplots(2, 1, figsize=(15, 8))

# Plot full time series
axes[0].plot(sample_data, linewidth=1.5, alpha=0.7)
axes[0].set_title('Full Time Series', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Time', fontsize=11)
axes[0].set_ylabel('Value', fontsize=11)
axes[0].grid(alpha=0.3)

# Plot one sequence example
example_idx = 100
input_seq = X[example_idx].flatten()
target_val = y[example_idx]

axes[1].plot(range(len(input_seq)), input_seq, 'b-', linewidth=2, label='Input Sequence (lookback=50)')
axes[1].plot(len(input_seq), target_val, 'ro', markersize=10, label='Target (t+1)', zorder=3)
axes[1].axvline(len(input_seq)-1, color='gray', linestyle='--', alpha=0.5)
axes[1].set_title(f'Example Sequence #{example_idx}', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Timestep', fontsize=11)
axes[1].set_ylabel('Value', fontsize=11)
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

7.6.2 Feature Engineering for Time Series

Important Features:

  1. Lag features: Past values
  2. Rolling statistics: Moving average, std
  3. Time-based features: Hour, day, month, seasonality
  4. Difference features: First/second differences

import pandas as pd

def engineer_time_series_features(data, datetime_index=None):
    """
    Create time series features

    Parameters:
        data: 1D array or pandas Series
        datetime_index: DatetimeIndex (optional)

    Returns:
        DataFrame with engineered features
    """
    if isinstance(data, np.ndarray):
        data = pd.Series(data)

    df = pd.DataFrame({'value': data})

    # Lag features
    for lag in [1, 2, 3, 7, 14]:
        df[f'lag_{lag}'] = df['value'].shift(lag)

    # Rolling statistics
    for window in [7, 14, 30]:
        df[f'rolling_mean_{window}'] = df['value'].rolling(window=window).mean()
        df[f'rolling_std_{window}'] = df['value'].rolling(window=window).std()
        df[f'rolling_min_{window}'] = df['value'].rolling(window=window).min()
        df[f'rolling_max_{window}'] = df['value'].rolling(window=window).max()

    # Difference features
    df['diff_1'] = df['value'].diff(1)
    df['diff_2'] = df['value'].diff(2)

    # Time-based features (if datetime index provided)
    if datetime_index is not None:
        df.index = datetime_index
        df['hour'] = df.index.hour
        df['day_of_week'] = df.index.dayofweek
        df['day_of_month'] = df.index.day
        df['month'] = df.index.month
        df['quarter'] = df.index.quarter

        # Cyclical encoding untuk periodic features
        df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
        df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
        df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
        df['day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

    # Drop NaN rows
    df = df.dropna()

    return df

# Example
dates = pd.date_range('2023-01-01', periods=len(sample_data), freq='H')
features_df = engineer_time_series_features(sample_data, dates)

print("Engineered Features:")
print(features_df.head(10))
print(f"\nTotal features created: {len(features_df.columns)}")
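Mengapa cyclical encoding penting: jam 23 dan jam 0 bersebelahan secara waktu, tetapi berjarak 23 pada skala mentah. Cek kecil berikut menunjukkan bahwa pada representasi sin/cos keduanya memang berdekatan:

```python
import numpy as np

def hour_encoding(h):
    """Encode jam (0-23) sebagai titik pada lingkaran satuan."""
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

d_23_0 = np.linalg.norm(hour_encoding(23) - hour_encoding(0))
d_12_0 = np.linalg.norm(hour_encoding(12) - hour_encoding(0))
print(f"jarak(23, 0) = {d_23_0:.3f}")  # kecil: jam bertetangga
print(f"jarak(12, 0) = {d_12_0:.3f}")  # besar: jam berseberangan
```

Tanpa encoding ini, model memperlakukan transisi 23 → 0 sebagai lompatan besar, padahal hanya satu jam.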

7.6.3 Evaluation Metrics untuk Time Series

Common Metrics:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_forecast(y_true, y_pred):
    """
    Comprehensive evaluation metrics untuk forecasting

    Returns:
        Dictionary of metrics
    """
    # Regression metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    # Percentage errors
    mape = np.mean(np.abs((y_true - y_pred) / (np.abs(y_true) + 1e-8))) * 100

    # Direction accuracy (berapa % trend direction benar)
    y_true_diff = np.diff(y_true.flatten())
    y_pred_diff = np.diff(y_pred.flatten())
    direction_accuracy = np.mean((y_true_diff * y_pred_diff) > 0) * 100

    metrics = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'MAPE (%)': mape,
        'Direction Accuracy (%)': direction_accuracy
    }

    return metrics

# Example evaluation
y_true = sample_data[100:200]
y_pred = sample_data[100:200] + np.random.normal(0, 0.1, 100)  # Simulated predictions

metrics = evaluate_forecast(y_true, y_pred)

print("Forecast Evaluation Metrics:")
print("=" * 50)
for metric, value in metrics.items():
    print(f"{metric:25s}: {value:10.4f}")

📊 Which Metric to Use?

MSE/RMSE:

  • Penalize large errors heavily
  • Good when outliers are critical
  • Same unit as target

MAE:

  • More robust to outliers
  • Interpretable (average error)
  • Less sensitive to extreme values

MAPE:

  • Percentage-based, scale-independent
  • Good untuk comparing different datasets
  • Problem jika y_true close to zero

Direction Accuracy:

  • Important for trading strategies
  • Measures trend prediction
  • Binary metric (up/down)

R²:

  • Goodness of fit
  • 1.0 = perfect, 0.0 = baseline
  • Can be negative for bad models
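Apa pun metrik yang dipilih, selalu bandingkan model dengan baseline naif (persistence: prediksi = nilai terakhir yang diamati). Jika RNN tidak mengalahkan baseline ini, modelnya belum berguna. Sketsa singkat:

```python
import numpy as np

def persistence_baseline(series):
    """RMSE baseline naif: y_hat[t] = y[t-1]."""
    series = np.asarray(series, dtype=np.float64)
    y_true = series[1:]
    y_pred = series[:-1]  # prediksi = nilai sebelumnya
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

series = np.sin(np.linspace(0, 10, 200))
print(f"Persistence RMSE: {persistence_baseline(series):.4f}")
# RMSE model RNN Anda seharusnya berada di bawah angka ini
```

Baseline seperti ini murah dihitung dan sering mengejutkan: pada data yang sangat smooth, persistence sulit dikalahkan.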

7.7 RNN untuk NLP Applications

7.7.1 Text Preprocessing untuk RNN

Steps:

  1. Tokenization: Split text into tokens
  2. Vocabulary building: Create word-to-index mapping
  3. Sequence padding: Make all sequences same length
  4. Embedding: Convert words to dense vectors

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample text data
texts = [
    "I love machine learning",
    "Deep learning is amazing",
    "RNN are great for sequences",
    "LSTM solves vanishing gradient problem",
    "NLP with deep learning is powerful"
]

# Tokenization
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)

# Convert to sequences
sequences = tokenizer.texts_to_sequences(texts)

print("Original texts:")
for i, text in enumerate(texts):
    print(f"  {i+1}. {text}")

print("\nTokenized sequences:")
for i, seq in enumerate(sequences):
    print(f"  {i+1}. {seq}")

# Vocabulary
word_index = tokenizer.word_index
print(f"\nVocabulary size: {len(word_index)}")
print("\nWord to index mapping (first 10):")
for word, idx in list(word_index.items())[:10]:
    print(f"  '{word}': {idx}")

# Padding
max_length = 10
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')

print(f"\nPadded sequences (max_length={max_length}):")
for i, seq in enumerate(padded_sequences):
    print(f"  {i+1}. {seq}")

7.7.2 Sentiment Analysis dengan LSTM

def build_sentiment_classifier(vocab_size=10000, embedding_dim=64,
                               max_length=100, lstm_units=64):
    """
    LSTM untuk sentiment classification
    """
    model = keras.Sequential([
        # Embedding layer
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            input_length=max_length,
            name='embedding'
        ),

        # Spatial dropout untuk embedding
        layers.SpatialDropout1D(0.2),

        # Bidirectional LSTM
        layers.Bidirectional(
            layers.LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2),
            name='bidirectional_lstm'
        ),

        # Dense layers
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.3),

        # Output layer (binary classification)
        layers.Dense(1, activation='sigmoid')
    ])

    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build model
sentiment_model = build_sentiment_classifier()
sentiment_model.summary()

7.7.3 Text Generation dengan RNN

Character-level Language Model:

def build_text_generator(vocab_size=100, embedding_dim=256,
                        rnn_units=512, sequence_length=100):
    """
    Character-level text generation model
    """
    model = keras.Sequential([
        # Embedding
        layers.Embedding(vocab_size, embedding_dim, input_length=sequence_length),

        # Stacked LSTM
        layers.LSTM(rnn_units, return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(rnn_units),
        layers.Dropout(0.2),

        # Output layer (predict next character)
        layers.Dense(vocab_size, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

text_gen = build_text_generator()
text_gen.summary()
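Saat inference, karakter berikutnya biasanya di-sample dari distribusi softmax dengan temperature: temperature rendah → output konservatif (selalu karakter paling mungkin), tinggi → lebih beragam/acak. Sketsa NumPy berikut (terpisah dari model Keras di atas) mengilustrasikan mekanisme sampling-nya:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample satu indeks dari logits setelah scaling temperature."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax yang stabil numerik
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])
rng = np.random.default_rng(42)
samples = [sample_with_temperature(logits, temperature=0.5, rng=rng) for _ in range(100)]
print(np.bincount(samples, minlength=3))  # indeks 0 dominan pada temperature rendah
```

Coba naikkan `temperature` ke 2.0: distribusi sampel menjadi jauh lebih merata - inilah knob "kreativitas" pada text generation.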

7.8 Best Practices dan Tips

7.8.1 Hyperparameter Tuning

Key Hyperparameters:

  1. Hidden units: Start dengan 32-128
  2. Number of layers: 1-3 layers biasanya cukup
  3. Dropout rate: 0.2-0.5
  4. Learning rate: 0.001 (Adam) atau 0.01 (SGD)
  5. Batch size: 32-128 untuk time series
  6. Sequence length: Tergantung pada temporal dependency

# Hyperparameter search space
hyperparams = {
    'lstm_units': [32, 64, 128, 256],
    'num_layers': [1, 2, 3],
    'dropout_rate': [0.0, 0.2, 0.3, 0.5],
    'learning_rate': [0.0001, 0.001, 0.01],
    'batch_size': [32, 64, 128],
    'sequence_length': [24, 48, 72, 96]
}

print("Hyperparameter Search Space:")
for param, values in hyperparams.items():
    print(f"  {param:20s}: {values}")

7.8.2 Training Tips

🎯 Training Best Practices

1. Gradient Clipping (prevent exploding gradients):

optimizer = keras.optimizers.Adam(clipnorm=1.0)

2. Early Stopping (prevent overfitting):

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

3. Learning Rate Scheduling:

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-7
)

4. Batch Normalization (untuk deeper networks):

layers.BatchNormalization()

5. Teacher Forcing (untuk seq2seq):

  • Use ground truth sebagai decoder input during training
  • Use model predictions during inference
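Secara konkret, teacher forcing berarti input decoder adalah target yang digeser satu langkah (diawali start token), bukan prediksi model sendiri. Sketsa penyiapan datanya (nilai start token di sini hanyalah asumsi ilustrasi):

```python
import numpy as np

START_TOKEN = 0.0  # asumsi: penanda awal sequence, pilih sesuai data Anda

def make_teacher_forcing_inputs(targets):
    """decoder_input[t] = target[t-1]; decoder_input[0] = START_TOKEN."""
    targets = np.asarray(targets, dtype=np.float64)
    decoder_input = np.concatenate([[START_TOKEN], targets[:-1]])
    return decoder_input, targets

dec_in, dec_out = make_teacher_forcing_inputs([1.5, 2.0, 2.5])
print(dec_in)   # target digeser satu langkah: 0.0, 1.5, 2.0
print(dec_out)  # 1.5, 2.0, 2.5
# Saat inference: decoder_input[t] diisi prediksi model pada langkah t-1
```

Pasangan `dec_in`/`dec_out` inilah yang diberikan ke input kedua model Seq2Seq di Subbab 7.5.3 selama training.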

7.8.3 Common Pitfalls

❌ Common Mistakes:

  1. Data leakage: Using future data untuk train
  2. Not normalizing: RNN sensitive terhadap scale
  3. Too many parameters: Overfitting pada small datasets
  4. Ignoring stationarity: Time series should be stationary
  5. Wrong sequence direction: Pastikan temporal order benar
  6. Batch size too small: Unstable training
  7. No validation set: Cannot detect overfitting

✅ Solutions:

# 1. Proper train/val/test split untuk time series
def time_series_split(data, train_ratio=0.7, val_ratio=0.15):
    """
    Time series split (no shuffling!)
    """
    n = len(data)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))

    train = data[:train_end]
    val = data[train_end:val_end]
    test = data[val_end:]

    return train, val, test

# 2. Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def normalize_data(train, val, test, method='standard'):
    """
    Normalize data using train statistics
    """
    if method == 'standard':
        scaler = StandardScaler()
    else:
        scaler = MinMaxScaler()

    # Fit ONLY on training data
    train_scaled = scaler.fit_transform(train.reshape(-1, 1))
    val_scaled = scaler.transform(val.reshape(-1, 1))
    test_scaled = scaler.transform(test.reshape(-1, 1))

    return train_scaled, val_scaled, test_scaled, scaler

# Example
train, val, test = time_series_split(sample_data)
train_norm, val_norm, test_norm, scaler = normalize_data(train, val, test)

print(f"Train size: {len(train)} ({len(train)/len(sample_data)*100:.1f}%)")
print(f"Val size:   {len(val)} ({len(val)/len(sample_data)*100:.1f}%)")
print(f"Test size:  {len(test)} ({len(test)/len(sample_data)*100:.1f}%)")

7.9 Rangkuman & Kesimpulan

7.9.1 Key Takeaways

📚 Chapter Summary

1. RNN Fundamentals:

  • RNN memproses sequential data dengan hidden state (memory)
  • Parameter sharing across timesteps
  • Dapat handle variable-length sequences

2. Vanishing Gradient Problem:

  • Simple RNN sulit learn long-term dependencies
  • Gradients exponentially decay saat backpropagation
  • Solution: LSTM, GRU dengan gating mechanisms

3. LSTM:

  • 3 gates (forget, input, output) + cell state
  • Cell state acts as information highway
  • Additive updates prevent gradient vanishing
  • Best untuk very long-term dependencies

4. GRU:

  • Simplified LSTM dengan 2 gates (update, reset)
  • Fewer parameters, faster training
  • Comparable performance untuk most tasks
  • Good default choice untuk many applications

5. Advanced Architectures:

  • Bidirectional: Process forward + backward
  • Stacked: Multiple layers untuk hierarchical learning
  • Seq2Seq: Encoder-decoder untuk variable length I/O

6. Applications:

  • Time Series: Energy forecasting, stock prediction
  • NLP: Sentiment analysis, text generation, translation
  • Speech: Recognition, synthesis
  • Video: Frame prediction, action recognition

7. Best Practices:

  • Normalize input data
  • Use gradient clipping
  • Proper train/val/test split (no shuffling untuk time series!)
  • Start simple, add complexity gradually
  • Monitor for overfitting

7.9.2 When to Use What?

flowchart TD
    A["Sequential Problem?"] -->|Yes| B{"Long-term<br/>Dependencies?"}
    A -->|No| Z["Use Feedforward NN<br/>or CNN"]

    B -->|"Yes, >100 steps"| C["Use LSTM"]
    B -->|"Medium, 10-100 steps"| D["Use GRU"]
    B -->|"Short, <10 steps"| E["Simple RNN OK"]

    C --> F{"Large Dataset?"}
    D --> F
    E --> F

    F -->|"Yes, millions"| G["Deep/Stacked RNN"]
    F -->|"No, thousands"| H["Shallow RNN<br/>1-2 layers"]

    G --> I{"Need both<br/>directions?"}
    H --> I

    I -->|Yes| J["Bidirectional"]
    I -->|No| K["Unidirectional"]

    style A fill:#ffcccc,stroke:#333,stroke-width:2px
    style B fill:#fff3cd,stroke:#333,stroke-width:2px
    style C fill:#ccffcc,stroke:#333,stroke-width:2px
    style D fill:#ccffcc,stroke:#333,stroke-width:2px
    style E fill:#ffffcc,stroke:#333,stroke-width:2px
    style F fill:#fff3cd,stroke:#333,stroke-width:2px
    style I fill:#fff3cd,stroke:#333,stroke-width:2px
    style J fill:#ccccff,stroke:#333,stroke-width:2px
    style K fill:#ccccff,stroke:#333,stroke-width:2px
    style Z fill:#e2e3e5,stroke:#333,stroke-width:2px
Figure 7.3: Decision tree untuk memilih arsitektur RNN yang tepat berdasarkan karakteristik masalah dan data

7.9.3 Looking Forward

Limitations of RNNs:

  • Sequential processing (cannot parallelize)
  • Still struggle dengan very long sequences (>1000 steps)
  • Computationally expensive
  • Hard to capture global dependencies

Modern Alternatives:

  • Transformers: Attention mechanisms, fully parallelizable
  • Temporal Convolutional Networks (TCN): CNN untuk sequences
  • State Space Models (SSM): Linear-time alternatives

When RNN Still Relevant:

  • Small datasets (transformers need more data)
  • Online/streaming prediction
  • Resource-constrained environments
  • Interpretability requirements
  • Classic time series problems

7.10 Soal Latihan

Review Questions

  1. Jelaskan mengapa feedforward neural networks tidak cocok untuk sequential data. Sebutkan 3 alasan utama.

  2. Apa itu vanishing gradient problem? Mengapa ini terjadi pada Simple RNN? Berikan penjelasan matematis.

  3. Bandingkan LSTM dan GRU:

    • Perbedaan arsitektur
    • Jumlah parameters
    • Kapan menggunakan masing-masing
    • Trade-offs
  4. Apa fungsi dari setiap gate dalam LSTM:

    • Forget gate
    • Input gate
    • Output gate

    Berikan contoh konkret kapan setiap gate akan “terbuka” atau “tertutup”.
  5. Jelaskan perbedaan antara:

    • return_sequences=True vs return_sequences=False
    • Unidirectional vs Bidirectional RNN
    • Stateful vs Stateless RNN
  6. Untuk time series forecasting, jelaskan:

    • One-step vs multi-step forecasting
    • Recursive vs direct multi-step prediction
    • Kelebihan dan kekurangan masing-masing
  7. Mengapa normalization penting untuk RNN? Apa yang terjadi jika tidak melakukan normalization?

  8. Jelaskan sequence-to-sequence (Seq2Seq) architecture. Berikan 3 aplikasi nyata.

  9. Apa peran dropout dalam RNN? Di mana sebaiknya dropout diterapkan?

  10. Bandingkan evaluation metrics untuk time series:

    • MSE vs MAE
    • MAPE vs RMSE
    • Kapan menggunakan masing-masing

Coding Exercises

Exercise 1: Implementasi Simple RNN dari scratch (NumPy)

# Tugas: Implementasi forward pass Simple RNN tanpa library
# Input: sequence dengan shape (batch, timesteps, features)
# Output: hidden states untuk setiap timestep

Exercise 2: LSTM untuk Stock Price Prediction

# Dataset: Historical stock prices (Yahoo Finance)
# Task: Predict next day close price
# Requirements:
#   - Feature engineering (moving averages, RSI, etc.)
#   - LSTM model dengan minimal 2 layers
#   - Proper train/val/test split
#   - Evaluate dengan multiple metrics

Exercise 3: Sentiment Analysis dengan Bidirectional LSTM

# Dataset: IMDB reviews
# Task: Binary sentiment classification
# Requirements:
#   - Text preprocessing (tokenization, padding)
#   - Embedding layer
#   - Bidirectional LSTM
#   - Compare dengan unidirectional

Exercise 4: Text Generation (Character-level)

# Dataset: Shakespeare texts atau any corpus
# Task: Generate new text character-by-character
# Requirements:
#   - Character-level tokenization
#   - Stacked LSTM
#   - Temperature sampling
#   - Generate coherent sentences

Exercise 5: Energy Consumption Forecasting

# Dataset: Household energy consumption (hourly)
# Task: Predict next 24 hours consumption
# Requirements:
#   - Multi-step forecasting
#   - Compare Simple RNN, LSTM, GRU
#   - Feature engineering (time-based features)
#   - Visualization of predictions

Selamat belajar! Di Lab 7, kita akan mengimplementasikan LSTM untuk Energy Forecasting secara hands-on! 🚀