Chapter 7: Recurrent Neural Networks, LSTM & Sequence Modeling

Deep Learning for Sequential Data: Time Series, NLP & Sequential Prediction


🎯 Learning Outcomes

After studying this chapter, you will be able to:

  1. Understand the RNN architecture and how it processes sequential data
  2. Identify the vanishing gradient problem and its solutions (LSTM, GRU)
  3. Implement RNNs, LSTMs, and GRUs for time series forecasting
  4. Apply bidirectional and stacked RNNs for better performance
  5. Use sequence-to-sequence models for NLP applications
  6. Evaluate sequential model performance with appropriate metrics

7.1 Introduction to Sequential Data and RNNs

7.1.1 Why Does Sequential Data Require a Special Architecture?

The Problem with Feedforward Networks for Sequential Data:

In Chapters 5 and 6, we studied MLPs and CNNs, which operate on fixed-size inputs. However, much real-world data is sequential, with temporal dependencies:

Examples of Sequential Data:

  • Time series: Stock prices, temperature, energy consumption
  • Text: Sentences, documents, conversations
  • Audio: Speech, music, sound
  • Video: Frame sequences
  • DNA sequences: Genomic data

Problems with Feedforward NNs:

  1. No memory: Each input is processed independently
  2. Fixed input size: Cannot handle variable-length sequences
  3. No temporal relationships: Order/sequence information is lost
  4. Parameter explosion: Each timestep would need its own parameters
💡 RNN Intuition

RNNs address these problems with:

  • Hidden state (memory): Stores information from previous timesteps
  • Parameter sharing: The same weights are used at every timestep
  • Variable-length sequences: Sequences of different lengths can be processed
  • Temporal modeling: Captures dependencies between timesteps

Analogy: Like reading a book - you understand a sentence based on the words that came before it!

7.1.2 Evolution of Sequential Models

Era Pre-RNN:

  • Hidden Markov Models (HMM)
  • Autoregressive models (AR, ARMA, ARIMA)
  • Manual feature engineering
  • Limitation: Cannot learn long-term dependencies

RNN Era (1986-2010s):

  • Simple RNN (1986): First recurrent architecture
  • LSTM (1997): Long Short-Term Memory - solved vanishing gradient
  • GRU (2014): Gated Recurrent Unit - simplified LSTM
  • Bidirectional RNN (1997): Process sequence forward & backward

Modern Era (2014-now):

  • Attention Mechanisms (2014): Bahdanau attention for neural machine translation
  • Transformers (2017): "Attention Is All You Need" - replaced RNNs for NLP
  • BERT, GPT (2018-now): Large language models
  • But: RNNs remain relevant for time series, small datasets, and interpretability
📊 RNN Applications Today

Industry Applications:

  • Finance: Stock prediction, algorithmic trading
  • Energy: Load forecasting, demand prediction
  • Healthcare: Patient monitoring, disease progression
  • Manufacturing: Predictive maintenance, quality control
  • NLP: Machine translation, text generation, sentiment analysis
  • Speech: Speech recognition, text-to-speech

7.1.3 Types of Sequential Problems

RNNs can handle several types of sequential problems:

graph LR
    A[One-to-One<br/>Traditional NN<br/>Image Classification] --> B[One-to-Many<br/>Image Captioning<br/>Music Generation]
    B --> C[Many-to-One<br/>Sentiment Analysis<br/>Video Classification]
    C --> D[Many-to-Many<br/>same length<br/>Video Frame Labeling]
    D --> E[Many-to-Many<br/>diff length<br/>Machine Translation]

    style A fill:#ffcccc
    style B fill:#ffe6cc
    style C fill:#ffffcc
    style D fill:#ccffcc
    style E fill:#ccccff

Detailed Explanation:

  1. One-to-One: Standard feedforward NN
    • Input: Single vector
    • Output: Single vector
    • Example: Image classification
  2. One-to-Many: Sequence generation
    • Input: Single vector (or an initial condition)
    • Output: Sequence
    • Example: Image captioning, music generation
  3. Many-to-One: Sequence classification
    • Input: Sequence
    • Output: Single vector
    • Example: Sentiment analysis, time series classification
  4. Many-to-Many (same length): Synchronized sequence
    • Input: Sequence
    • Output: Sequence (same length)
    • Example: Video frame labeling, POS tagging
  5. Many-to-Many (different length): Sequence-to-sequence
    • Input: Sequence
    • Output: Sequence (different length)
    • Example: Machine translation, text summarization
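In Keras, the many-to-one and many-to-many (same length) variants differ mainly in the `return_sequences` flag. A minimal sketch (layer sizes here are illustrative, not taken from a specific example in this chapter):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Many-to-one: return_sequences=False -> only the final hidden state is emitted
many_to_one = keras.Sequential([
    layers.Input(shape=(10, 3)),
    layers.SimpleRNN(16, return_sequences=False),
    layers.Dense(1),
])

# Many-to-many (same length): return_sequences=True -> one output per timestep
many_to_many = keras.Sequential([
    layers.Input(shape=(10, 3)),
    layers.SimpleRNN(16, return_sequences=True),
    layers.TimeDistributed(layers.Dense(1)),
])

x = np.random.randn(4, 10, 3).astype("float32")
print(many_to_one(x).shape)   # (4, 1)
print(many_to_many(x).shape)  # (4, 10, 1)
```

One-to-many and variable-length many-to-many setups need extra machinery (repeated inputs or encoder-decoder pairs), which Section 7.5-style seq2seq models cover.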

7.2 Simple RNN: Fundamentals

7.2.1 RNN Architecture Basics

A Recurrent Neural Network processes sequences with a hidden state that is updated at every timestep.

Mathematical Formulation:

For each timestep \(t\):

\[h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)\]
\[y_t = W_{hy} h_t + b_y\]

Where:

  • \(x_t\): Input at timestep \(t\)
  • \(h_t\): Hidden state at timestep \(t\)
  • \(y_t\): Output at timestep \(t\)
  • \(W_{hh}\): Hidden-to-hidden weight matrix
  • \(W_{xh}\): Input-to-hidden weight matrix
  • \(W_{hy}\): Hidden-to-output weight matrix
  • \(b_h\), \(b_y\): Bias terms
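The two equations above translate directly into NumPy. This sketch (the helper name `rnn_forward` is introduced here for illustration) makes the parameter sharing explicit: the same weight matrices are reused at every timestep.

```python
import numpy as np

def rnn_forward(x_seq, W_hh, W_xh, W_hy, b_h, b_y):
    """Run h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a whole sequence."""
    h = np.zeros(W_hh.shape[0])                      # h_0 = 0
    outputs = []
    for x_t in x_seq:                                # one timestep at a time
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)     # shared weights every step
        outputs.append(W_hy @ h + b_y)               # y_t = W_hy h_t + b_y
    return np.array(outputs), h

rng = np.random.default_rng(0)
hidden, inp, out = 4, 3, 2                           # illustrative sizes
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_xh = rng.normal(size=(hidden, inp)) * 0.1
W_hy = rng.normal(size=(out, hidden)) * 0.1
b_h, b_y = np.zeros(hidden), np.zeros(out)

x_seq = rng.normal(size=(10, inp))                   # sequence of 10 timesteps
ys, h_last = rnn_forward(x_seq, W_hh, W_xh, W_hy, b_h, b_y)
print(ys.shape, h_last.shape)                        # (10, 2) (4,)
```

Note that `tanh` keeps every hidden-state component in \([-1, 1]\), regardless of sequence length.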

RNN Visualization:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Folded representation
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('RNN: Folded Representation', fontsize=14, fontweight='bold')

# Input
input_rect = patches.Rectangle((1, 1), 2, 1, linewidth=2, edgecolor='blue', facecolor='lightblue')
ax.add_patch(input_rect)
ax.text(2, 1.5, r'$x_t$', ha='center', va='center', fontsize=12, fontweight='bold')

# Hidden state (RNN cell)
rnn_rect = patches.Rectangle((1, 4), 2, 2, linewidth=3, edgecolor='red', facecolor='lightyellow')
ax.add_patch(rnn_rect)
ax.text(2, 5, 'RNN', ha='center', va='center', fontsize=12, fontweight='bold')

# Output
output_rect = patches.Rectangle((1, 8), 2, 1, linewidth=2, edgecolor='green', facecolor='lightgreen')
ax.add_patch(output_rect)
ax.text(2, 8.5, r'$y_t$', ha='center', va='center', fontsize=12, fontweight='bold')

# Arrows
ax.arrow(2, 2.2, 0, 1.5, head_width=0.2, head_length=0.2, fc='blue', ec='blue', linewidth=2)
ax.arrow(2, 6.2, 0, 1.5, head_width=0.2, head_length=0.2, fc='green', ec='green', linewidth=2)

# Recurrent connection
from matplotlib.patches import FancyArrowPatch
arrow_loop = FancyArrowPatch((3.2, 5.5), (3.2, 4.5),
                             arrowstyle='->', mutation_scale=20, linewidth=2.5,
                             color='red', connectionstyle="arc3,rad=1.5")
ax.add_patch(arrow_loop)
ax.text(5, 5, r'$h_{t-1}$', fontsize=11, color='red', fontweight='bold')

# Unfolded representation
ax = axes[1]
ax.set_xlim(0, 16)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('RNN: Unfolded Representation (Through Time)', fontsize=14, fontweight='bold')

timesteps = [2, 6, 10, 14]
for i, t in enumerate(timesteps):
    # Input
    input_rect = patches.Rectangle((t-0.5, 1), 1, 0.8, linewidth=2, edgecolor='blue', facecolor='lightblue')
    ax.add_patch(input_rect)
    ax.text(t, 1.4, f'$x_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')

    # Hidden state
    rnn_rect = patches.Rectangle((t-0.5, 4), 1, 1.5, linewidth=2.5, edgecolor='red', facecolor='lightyellow')
    ax.add_patch(rnn_rect)
    ax.text(t, 4.75, f'$h_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')

    # Output
    output_rect = patches.Rectangle((t-0.5, 8), 1, 0.8, linewidth=2, edgecolor='green', facecolor='lightgreen')
    ax.add_patch(output_rect)
    ax.text(t, 8.4, f'$y_{i}$', ha='center', va='center', fontsize=10, fontweight='bold')

    # Vertical arrows
    ax.arrow(t, 2, 0, 1.8, head_width=0.15, head_length=0.15, fc='blue', ec='blue', linewidth=1.5)
    ax.arrow(t, 5.7, 0, 2, head_width=0.15, head_length=0.15, fc='green', ec='green', linewidth=1.5)

    # Horizontal arrows (recurrent connections)
    if i < len(timesteps) - 1:
        ax.arrow(t+0.6, 4.75, 3, 0, head_width=0.2, head_length=0.2, fc='red', ec='red', linewidth=2)

# Time axis
ax.text(8, 0.3, 'Time →', ha='center', fontsize=12, fontweight='bold', style='italic')

plt.tight_layout()
plt.show()

7.2.2 Simple RNN Implementation

Keras Implementation:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Simple RNN for many-to-one classification
def build_simple_rnn_classifier(sequence_length=10, input_dim=1,
                                hidden_units=32, num_classes=2):
    """
    Simple RNN for sequence classification

    Parameters:
        sequence_length: Length of the input sequence
        input_dim: Feature dimension at each timestep
        hidden_units: Number of units in the RNN layer
        num_classes: Number of classes for classification

    Returns:
        model: Compiled Keras model
    """
    model = keras.Sequential([
        # SimpleRNN layer
        layers.SimpleRNN(
            units=hidden_units,
            activation='tanh',
            return_sequences=False,  # Many-to-one: only the last output
            input_shape=(sequence_length, input_dim),
            name='simple_rnn'
        ),

        # Dense layer for classification
        layers.Dense(num_classes, activation='softmax', name='output')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build model
simple_rnn = build_simple_rnn_classifier()
simple_rnn.summary()
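As a quick sanity check, such a classifier can be trained end-to-end on synthetic data. This is only a smoke test, not a benchmark; the task (deciding whether a sequence's mean is positive) is invented here for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic task (illustrative): is the sequence's mean positive?
rng = np.random.default_rng(42)
X = rng.normal(size=(256, 10, 1)).astype("float32")
y = (X.mean(axis=(1, 2)) > 0).astype("int64")

model = keras.Sequential([
    layers.Input(shape=(10, 1)),
    layers.SimpleRNN(32, activation="tanh"),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print(f"final training accuracy: {history.history['accuracy'][-1]:.3f}")
```

A few epochs on 256 samples are enough to confirm the shapes and the training loop work; real experiments would use train/validation splits.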

PyTorch Implementation:

import torch
import torch.nn as nn

class SimpleRNNClassifier(nn.Module):
    """
    Simple RNN for sequence classification (PyTorch)
    """
    def __init__(self, input_dim=1, hidden_dim=32, num_layers=1, num_classes=2):
        super(SimpleRNNClassifier, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # RNN layer
        self.rnn = nn.RNN(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            nonlinearity='tanh'
        )

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)

        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        # RNN forward pass
        out, hn = self.rnn(x, h0)
        # out shape: (batch, seq_len, hidden_dim)
        # hn shape: (num_layers, batch, hidden_dim)

        # Take output from last timestep
        out = out[:, -1, :]  # (batch, hidden_dim)

        # Fully connected layer
        out = self.fc(out)  # (batch, num_classes)

        return out

# Instantiate model
pytorch_rnn = SimpleRNNClassifier(input_dim=1, hidden_dim=32, num_classes=2)
print(pytorch_rnn)

# Test forward pass
dummy_input = torch.randn(4, 10, 1)  # (batch=4, seq_len=10, features=1)
output = pytorch_rnn(dummy_input)
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape}")

7.2.3 The Vanishing Gradient Problem

The main problem with Simple RNNs: the Vanishing Gradient

During backpropagation through time (BPTT), gradients must propagate through many timesteps. Because of repeated multiplication by weight matrices with norm < 1, the gradients shrink exponentially (vanish).

Mathematical Explanation:

The gradient with respect to an early timestep:

\[\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}\]

If \(\left|\frac{\partial h_t}{\partial h_{t-1}}\right| < 1\), the gradient decays exponentially!

Consequences:

  • Cannot learn long-term dependencies
  • Early timesteps receive too little gradient signal
  • The model learns only short-term patterns

Visualizing the Vanishing Gradient:

# Demonstrate the vanishing gradient
def demonstrate_vanishing_gradient():
    """
    Simulate how gradients vanish over timesteps
    """
    timesteps = 50

    # Simulate gradient flow with different weight values
    gradients_small_w = []
    gradients_good_w = []
    gradients_large_w = []

    initial_gradient = 1.0

    for w in [0.9, 1.0, 1.1]:
        gradient = initial_gradient
        gradient_history = [gradient]

        for t in range(timesteps):
            gradient = gradient * w  # Simplified gradient flow
            gradient_history.append(gradient)

        if w == 0.9:
            gradients_small_w = gradient_history
        elif w == 1.0:
            gradients_good_w = gradient_history
        else:
            gradients_large_w = gradient_history

    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

    # Linear scale
    ax1.plot(gradients_small_w, label='W = 0.9 (Vanishing)', linewidth=2.5, color='red')
    ax1.plot(gradients_good_w, label='W = 1.0 (Stable)', linewidth=2.5, color='green', linestyle='--')
    ax1.plot(gradients_large_w, label='W = 1.1 (Exploding)', linewidth=2.5, color='blue')
    ax1.set_xlabel('Timesteps (backward)', fontsize=12, fontweight='bold')
    ax1.set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
    ax1.set_title('Gradient Flow (Linear Scale)', fontsize=14, fontweight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(alpha=0.3)

    # Log scale
    ax2.semilogy(np.abs(gradients_small_w), label='W = 0.9 (Vanishing)', linewidth=2.5, color='red')
    ax2.semilogy(np.abs(gradients_good_w), label='W = 1.0 (Stable)', linewidth=2.5, color='green', linestyle='--')
    ax2.semilogy(np.abs(gradients_large_w), label='W = 1.1 (Exploding)', linewidth=2.5, color='blue')
    ax2.set_xlabel('Timesteps (backward)', fontsize=12, fontweight='bold')
    ax2.set_ylabel('Gradient Magnitude (log scale)', fontsize=12, fontweight='bold')
    ax2.set_title('Gradient Flow (Log Scale)', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=11)
    ax2.grid(alpha=0.3, which='both')

    plt.tight_layout()
    plt.show()

    print("Gradient after 50 timesteps:")
    print(f"  W=0.9 (vanishing): {gradients_small_w[-1]:.2e}")
    print(f"  W=1.0 (stable):    {gradients_good_w[-1]:.2e}")
    print(f"  W=1.1 (exploding): {gradients_large_w[-1]:.2e}")

demonstrate_vanishing_gradient()
⚠️ Vanishing vs Exploding Gradients

Vanishing Gradient (more common):

  • Gradients → 0
  • Cannot learn long-term dependencies
  • Solution: LSTM, GRU, skip connections

Exploding Gradient (less common):

  • Gradients → ∞
  • Training unstable, NaN values
  • Solution: Gradient clipping, careful initialization
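Gradient clipping, the standard fix for exploding gradients, is a one-liner in both frameworks. A small PyTorch sketch (the toy linear model exists only to produce large gradients):

```python
import torch
import torch.nn as nn

# Toy setup whose large activations produce deliberately large gradients
model = nn.Linear(4, 1)
loss = model(torch.randn(8, 4) * 100).sum()
loss.backward()

norm_before = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()]))

# Rescale so the global gradient norm is at most 1.0 before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

norm_after = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters()]))
print(f"global grad norm: {norm_before.item():.2f} -> {norm_after.item():.2f}")
```

In Keras the equivalent is passing `clipnorm=1.0` (or `clipvalue`) to the optimizer, e.g. `keras.optimizers.Adam(clipnorm=1.0)`.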

7.3 LSTM: Long Short-Term Memory

7.3.1 LSTM Architecture

LSTM was designed specifically to overcome the vanishing gradient problem using gating mechanisms.

Key Components:

  1. Cell State (\(C_t\)): Long-term memory highway
  2. Hidden State (\(h_t\)): Short-term memory (output)
  3. Forget Gate (\(f_t\)): Decides what to discard from the cell state
  4. Input Gate (\(i_t\)): Decides what new information to store
  5. Output Gate (\(o_t\)): Decides what to output

LSTM Equations:

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\] (forget gate)
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] (input gate)
\[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\] (candidate values)
\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\] (cell-state update)
\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\] (output gate)
\[h_t = o_t \odot \tanh(C_t)\] (hidden state)

Where:

  • \(\sigma\): Sigmoid function (outputs 0-1, acts as a gate)
  • \(\odot\): Element-wise multiplication
  • \(\tanh\): Hyperbolic tangent (outputs -1 to 1)
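The six equations translate almost line for line into NumPy. In this sketch the four gate pre-activations are packed into a single weight matrix `W`, a common implementation trick; `lstm_step` is a name introduced here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W maps [h_{t-1}, x_t] to all four gate pre-activations."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # shape (4H,)
    f = sigmoid(z[0:H])                         # forget gate
    i = sigmoid(z[H:2*H])                       # input gate
    C_tilde = np.tanh(z[2*H:3*H])               # candidate values
    o = sigmoid(z[3*H:4*H])                     # output gate
    C = f * C_prev + i * C_tilde                # additive cell-state update
    h = o * np.tanh(C)                          # new hidden state
    return h, C

rng = np.random.default_rng(1)
H, D = 5, 3                                     # illustrative sizes
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)

h, C = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(20, D)):            # run 20 timesteps
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape, C.shape)
```

Because \(h_t = o_t \odot \tanh(C_t)\), the hidden state always stays in \([-1, 1]\), while the cell state \(C_t\) is unbounded and can accumulate long-term information.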

LSTM Cell Visualization:

# Complex LSTM visualization
fig, ax = plt.subplots(figsize=(16, 10))
ax.set_xlim(0, 16)
ax.set_ylim(0, 12)
ax.axis('off')
ax.set_title('LSTM Cell Architecture', fontsize=16, fontweight='bold', pad=20)

# Cell state (top highway)
ax.plot([1, 15], [10, 10], 'k-', linewidth=4, label='Cell State ($C_t$)')
ax.text(0.3, 10, r'$C_{t-1}$', fontsize=12, fontweight='bold', va='center')
ax.text(15.3, 10, r'$C_t$', fontsize=12, fontweight='bold', va='center')

# Forget gate
forget_rect = patches.Rectangle((3, 8), 1.5, 1.5, linewidth=2, edgecolor='red',
                                facecolor='lightcoral', alpha=0.7)
ax.add_patch(forget_rect)
ax.text(3.75, 8.75, r'$f_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(3.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# Forget gate operation
ax.plot([3.75, 3.75], [9.5, 10], 'r-', linewidth=2)
ax.plot([5, 5], [10, 10], 'ro', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(5, 10.6, '×', fontsize=14, fontweight='bold', ha='center')

# Input gate
input_rect = patches.Rectangle((7, 8), 1.5, 1.5, linewidth=2, edgecolor='blue',
                               facecolor='lightblue', alpha=0.7)
ax.add_patch(input_rect)
ax.text(7.75, 8.75, r'$i_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(7.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# Candidate values
candidate_rect = patches.Rectangle((9.5, 8), 1.5, 1.5, linewidth=2, edgecolor='purple',
                                  facecolor='plum', alpha=0.7)
ax.add_patch(candidate_rect)
ax.text(10.25, 8.75, r'$\tilde{C}_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(10.25, 7.3, r'$\tanh$', ha='center', fontsize=10)

# Combine input and candidate
ax.plot([7.75, 7.75], [9.5, 10.5], 'b-', linewidth=2)
ax.plot([10.25, 10.25], [9.5, 10.5], 'purple', linewidth=2)
ax.plot([9, 9], [10.5, 10.5], 'g-', linewidth=2)
ax.plot([9, 9], [10, 10], 'go', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(9, 10.6, '×', fontsize=14, fontweight='bold', ha='center')

# Add to cell state
ax.plot([11, 11], [10, 10], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(11, 10.6, '+', fontsize=14, fontweight='bold', ha='center')

# Output gate
output_rect = patches.Rectangle((13, 8), 1.5, 1.5, linewidth=2, edgecolor='green',
                                facecolor='lightgreen', alpha=0.7)
ax.add_patch(output_rect)
ax.text(13.75, 8.75, r'$o_t$', ha='center', va='center', fontsize=11, fontweight='bold')
ax.text(13.75, 7.3, r'$\sigma$', ha='center', fontsize=10)

# tanh for output
tanh_rect = patches.Rectangle((12.5, 5.5), 1, 1, linewidth=2, edgecolor='orange',
                              facecolor='wheat', alpha=0.7)
ax.add_patch(tanh_rect)
ax.text(13, 6, r'$\tanh$', ha='center', va='center', fontsize=10, fontweight='bold')

# Output combination
ax.plot([13, 13], [6.5, 7], 'orange', linewidth=2)
ax.plot([13.75, 13.75], [9.5, 7], 'g-', linewidth=2)
ax.plot([13.4, 13.4], [7, 7], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(13.4, 7.6, '×', fontsize=14, fontweight='bold', ha='center')

# Hidden state output
ax.arrow(13.4, 6.5, 0, -2.5, head_width=0.2, head_length=0.2, fc='green', ec='green', linewidth=2.5)
ax.text(13.4, 3.5, r'$h_t$', ha='center', fontsize=12, fontweight='bold')

# Inputs
ax.arrow(3.75, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(7.75, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(10.25, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(13.75, 2, 0, 5.5, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)

ax.text(8, 1, r'$[h_{t-1}, x_t]$', ha='center', fontsize=12, fontweight='bold',
       bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='black', linewidth=2))

# Legend
legend_elements = [
    patches.Patch(facecolor='lightcoral', edgecolor='red', label='Forget Gate'),
    patches.Patch(facecolor='lightblue', edgecolor='blue', label='Input Gate'),
    patches.Patch(facecolor='plum', edgecolor='purple', label='Candidate'),
    patches.Patch(facecolor='lightgreen', edgecolor='green', label='Output Gate'),
]
ax.legend(handles=legend_elements, loc='upper left', fontsize=11)

plt.tight_layout()
plt.show()

7.3.2 LSTM Implementation

Keras LSTM:

def build_lstm_model(sequence_length=50, input_dim=1,
                     lstm_units=64, dense_units=32, output_dim=1):
    """
    LSTM model for time series forecasting

    Parameters:
        sequence_length: Lookback window size
        input_dim: Number of features per timestep
        lstm_units: LSTM hidden units
        dense_units: Dense layer units
        output_dim: Prediction horizon

    Returns:
        model: Compiled Keras model
    """
    model = keras.Sequential([
        # LSTM layer
        layers.LSTM(
            units=lstm_units,
            activation='tanh',
            recurrent_activation='sigmoid',
            return_sequences=False,  # Return last output only
            input_shape=(sequence_length, input_dim),
            name='lstm_layer'
        ),

        # Dropout untuk regularization
        layers.Dropout(0.2, name='dropout'),

        # Dense layers
        layers.Dense(dense_units, activation='relu', name='dense_1'),
        layers.Dense(output_dim, activation='linear', name='output')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )

    return model

# Build LSTM model
lstm_model = build_lstm_model(sequence_length=50, lstm_units=64)
lstm_model.summary()
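Before this model can be fit, a raw series must be windowed into `(samples, sequence_length, features)` arrays matching the "lookback window" parameter. A minimal sketch, with `create_windows` as an illustrative helper (not part of Keras):

```python
import numpy as np

def create_windows(series, lookback=50, horizon=1):
    """Slice a 1-D series into (X, y): X is a lookback window, y the next value(s)."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + horizon])
    X = np.array(X)[..., np.newaxis]   # (samples, lookback, 1) for the LSTM
    y = np.array(y)                    # (samples, horizon)
    return X, y

series = np.sin(np.linspace(0, 20 * np.pi, 1000))   # synthetic signal
X, y = create_windows(series, lookback=50, horizon=1)
print(X.shape, y.shape)   # (950, 50, 1) (950, 1)
```

These arrays can be passed straight to `model.fit(X, y, ...)`; for real series, split chronologically (no shuffling across the train/test boundary).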

PyTorch LSTM:

class LSTMForecaster(nn.Module):
    """
    LSTM model for time series forecasting (PyTorch)
    """
    def __init__(self, input_dim=1, hidden_dim=64, num_layers=1,
                 dense_dim=32, output_dim=1, dropout=0.2):
        super(LSTMForecaster, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Dropout
        self.dropout = nn.Dropout(dropout)

        # Fully connected layers
        self.fc1 = nn.Linear(hidden_dim, dense_dim)
        self.fc2 = nn.Linear(dense_dim, output_dim)

        # Activation
        self.relu = nn.ReLU()

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)

        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        # LSTM forward pass
        out, (hn, cn) = self.lstm(x, (h0, c0))
        # out: (batch, seq_len, hidden_dim)

        # Take last timestep output
        out = out[:, -1, :]  # (batch, hidden_dim)

        # Dropout
        out = self.dropout(out)

        # Fully connected layers
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)

        return out

# Instantiate PyTorch LSTM
pytorch_lstm = LSTMForecaster(input_dim=1, hidden_dim=64, num_layers=2)
print(pytorch_lstm)

# Test
test_input = torch.randn(8, 50, 1)  # (batch=8, seq=50, features=1)
test_output = pytorch_lstm(test_input)
print(f"\nInput shape: {test_input.shape}")
print(f"Output shape: {test_output.shape}")
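A training loop for such a forecaster follows the usual PyTorch pattern: zero gradients, forward pass, loss, backward pass, optimizer step. A minimal sketch on a toy target (predicting the mean of the window); `TinyLSTM` is a scaled-down stand-in for the model above:

```python
import torch
import torch.nn as nn

class TinyLSTM(nn.Module):
    """Scaled-down forecaster mirroring the LSTMForecaster structure."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=16, batch_first=True)
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        out, _ = self.lstm(x)            # default zero initial (h0, c0)
        return self.fc(out[:, -1, :])    # last timestep -> prediction

torch.manual_seed(0)
model = TinyLSTM()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(64, 50, 1)               # synthetic windows
y = X.mean(dim=1)                        # toy target: mean of each window

losses = []
for epoch in range(20):                  # full-batch for brevity
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice one would iterate over mini-batches from a `DataLoader` and track a held-out validation loss; the skeleton stays the same.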

7.3.3 How LSTM Solves Vanishing Gradient

LSTM's Solution: Additive Cell State Update

Key insight: the cell state \(C_t\) is updated additively, not multiplicatively!

\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

Gradient Flow:

\[\frac{\partial C_t}{\partial C_{t-1}} = f_t\]

The forget gate \(f_t\) can stay close to 1, allowing gradients to flow without decay!

Comparison:

# Comparison: Simple RNN vs LSTM gradient flow
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

timesteps = np.arange(0, 51)

# Simple RNN: multiplicative gradient
rnn_gradient = 0.95 ** timesteps

# LSTM: controlled by forget gate (closer to 1)
lstm_gradient_forget_high = 0.99 ** timesteps
lstm_gradient_forget_medium = 0.95 ** timesteps
lstm_gradient_forget_low = 0.90 ** timesteps

# Linear scale
ax1.plot(timesteps, rnn_gradient, 'r-', linewidth=3, label='Simple RNN (W=0.95)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_high, 'g-', linewidth=3, label='LSTM (forget=0.99)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_medium, 'b--', linewidth=2.5, label='LSTM (forget=0.95)', alpha=0.8)
ax1.plot(timesteps, lstm_gradient_forget_low, 'purple', linewidth=2, label='LSTM (forget=0.90)', linestyle=':', alpha=0.8)
ax1.set_xlabel('Timesteps', fontsize=12, fontweight='bold')
ax1.set_ylabel('Gradient Magnitude', fontsize=12, fontweight='bold')
ax1.set_title('Gradient Flow Comparison (Linear)', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(alpha=0.3)

# Log scale
ax2.semilogy(timesteps, rnn_gradient, 'r-', linewidth=3, label='Simple RNN (W=0.95)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_high, 'g-', linewidth=3, label='LSTM (forget=0.99)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_medium, 'b--', linewidth=2.5, label='LSTM (forget=0.95)', alpha=0.8)
ax2.semilogy(timesteps, lstm_gradient_forget_low, 'purple', linewidth=2, label='LSTM (forget=0.90)', linestyle=':', alpha=0.8)
ax2.set_xlabel('Timesteps', fontsize=12, fontweight='bold')
ax2.set_ylabel('Gradient Magnitude (log)', fontsize=12, fontweight='bold')
ax2.set_title('Gradient Flow Comparison (Log Scale)', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(alpha=0.3, which='both')

plt.tight_layout()
plt.show()

print("Gradient after 50 timesteps:")
print(f"  Simple RNN:      {rnn_gradient[-1]:.6f}")
print(f"  LSTM (f=0.99):   {lstm_gradient_forget_high[-1]:.6f}")
print(f"  LSTM (f=0.95):   {lstm_gradient_forget_medium[-1]:.6f}")
print(f"  LSTM (f=0.90):   {lstm_gradient_forget_low[-1]:.6f}")
💡 Why LSTM Works
  1. Cell State Highway: A direct path for information flow without repeated transformations
  2. Gating Mechanisms: Learned control over when to remember and when to forget
  3. Additive Updates: Gradients are not repeatedly multiplied by the same weights
  4. Flexible Memory: Can learn long-term dependencies (100+ timesteps)

Result: LSTMs can learn dependencies spanning hundreds of timesteps, while Simple RNNs manage only ~10!

7.4 GRU: Gated Recurrent Unit

7.4.1 GRU Architecture

The GRU is a simplified version of the LSTM with fewer parameters but comparable performance.

Key Differences from LSTM:

  • 2 gates instead of 3 (reset gate, update gate)
  • No separate cell state - only a hidden state
  • Fewer parameters - faster training
  • Simpler architecture - easier to understand

GRU Equations:

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] (update gate)
\[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] (reset gate)
\[\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])\] (candidate hidden state)
\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\] (final hidden state)

Component Functions:

  • Update gate (\(z_t\)): Decides how much past info to keep
  • Reset gate (\(r_t\)): Decides how much past info to forget
  • Candidate state (\(\tilde{h}_t\)): New memory content
  • Final state (\(h_t\)): Combination of old and new
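These four equations can be checked directly in NumPy. A sketch with biases omitted for brevity; `gru_step` is a name introduced here for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the equations above (biases omitted)."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ concat)                                   # update gate
    r = sigmoid(W_r @ concat)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate
    return (1 - z) * h_prev + z * h_tilde                       # interpolate old/new

rng = np.random.default_rng(2)
H, D = 5, 3                                  # illustrative sizes
W_z = rng.normal(size=(H, H + D)) * 0.1
W_r = rng.normal(size=(H, H + D)) * 0.1
W_h = rng.normal(size=(H, H + D)) * 0.1

h = np.zeros(H)
for x_t in rng.normal(size=(20, D)):         # run 20 timesteps
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h.shape)
```

Note how \(h_t\) is a convex combination of the old state and the candidate: with \(z_t \approx 0\) the state is copied through unchanged, which is the GRU's analogue of the LSTM's cell-state highway.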

GRU Visualization:

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('GRU Cell Architecture', fontsize=16, fontweight='bold', pad=20)

# Reset gate
reset_rect = patches.Rectangle((2, 7), 1.5, 1.5, linewidth=2, edgecolor='red',
                               facecolor='lightcoral', alpha=0.7)
ax.add_patch(reset_rect)
ax.text(2.75, 7.75, r'$r_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(2.75, 6.3, r'$\sigma$', ha='center', fontsize=10)

# Update gate
update_rect = patches.Rectangle((5.5, 7), 1.5, 1.5, linewidth=2, edgecolor='blue',
                                facecolor='lightblue', alpha=0.7)
ax.add_patch(update_rect)
ax.text(6.25, 7.75, r'$z_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(6.25, 6.3, r'$\sigma$', ha='center', fontsize=10)

# Candidate hidden state
candidate_rect = patches.Rectangle((9, 7), 1.5, 1.5, linewidth=2, edgecolor='purple',
                                  facecolor='plum', alpha=0.7)
ax.add_patch(candidate_rect)
ax.text(9.75, 7.75, r'$\tilde{h}_t$', ha='center', va='center', fontsize=12, fontweight='bold')
ax.text(9.75, 6.3, r'$\tanh$', ha='center', fontsize=10)

# Reset operation
ax.plot([2.75, 2.75], [8.5, 9.5], 'r-', linewidth=2)
ax.plot([2.75, 8], [9.5, 9.5], 'r-', linewidth=2)
ax.plot([8, 8], [9.5, 9], 'r-', linewidth=2)
ax.plot([8, 8], [9, 9], 'ro', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(8, 9.5, '×', fontsize=13, fontweight='bold', ha='center', va='bottom')

# Previous hidden state path
ax.plot([0.5, 11.5], [9, 9], 'k-', linewidth=3, alpha=0.5)
ax.text(0, 9, r'$h_{t-1}$', fontsize=11, fontweight='bold', va='center')

# Update gate paths
ax.plot([6.25, 6.25], [8.5, 5], 'b-', linewidth=2)
ax.plot([6.25, 11.5], [5, 5], 'b-', linewidth=2)

# 1 - z_t path
ax.plot([4, 4], [5, 5], 'b--', linewidth=2)
ax.plot([4, 11.5], [3, 3], 'b--', linewidth=2)
ax.text(3.5, 5, r'$1-z_t$', fontsize=10, ha='right', color='blue', fontweight='bold')

# Candidate combination
ax.plot([9.75, 9.75], [8.5, 5], 'purple', linewidth=2)
ax.plot([11.5, 11.5], [5, 5], 'go', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 5.5, '×', fontsize=13, fontweight='bold', ha='center', color='purple')

# Old hidden state path
ax.plot([11.5, 11.5], [9, 3], 'k--', linewidth=2, alpha=0.5)
ax.plot([11.5, 11.5], [3, 3], 'go', markersize=10, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 2.5, '×', fontsize=13, fontweight='bold', ha='center')

# Final combination
ax.plot([11.5, 11.5], [4.5, 1.5], 'g-', linewidth=2.5)
ax.plot([11.5, 11.5], [2.5, 1.5], 'g-', linewidth=2.5)
ax.plot([11.5, 11.5], [1.5, 1.5], 'ko', markersize=12, markerfacecolor='white', markeredgewidth=2)
ax.text(11.5, 1, '+', fontsize=14, fontweight='bold', ha='center')

# Output
ax.arrow(11.5, 0.8, 0, -0.3, head_width=0.2, head_length=0.1, fc='green', ec='green', linewidth=2.5)
ax.text(11.5, 0, r'$h_t$', ha='center', fontsize=12, fontweight='bold')

# Inputs
ax.arrow(2.75, 2.5, 0, 4, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(6.25, 2.5, 0, 4, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)
ax.arrow(9.75, 2.5, 0, 4, head_width=0.2, head_length=0.2, fc='black', ec='black', linewidth=2)

ax.text(6.25, 1.5, r'$[h_{t-1}, x_t]$', ha='center', fontsize=11, fontweight='bold',
       bbox=dict(boxstyle='round', facecolor='lightyellow', edgecolor='black', linewidth=2))

plt.tight_layout()
plt.show()

7.4.2 GRU Implementation

Keras GRU:

def build_gru_model(sequence_length=50, input_dim=1,
                    gru_units=64, dense_units=32, output_dim=1):
    """
    GRU model for time series forecasting
    """
    model = keras.Sequential([
        # GRU layer
        layers.GRU(
            units=gru_units,
            activation='tanh',
            recurrent_activation='sigmoid',
            return_sequences=False,
            input_shape=(sequence_length, input_dim),
            name='gru_layer'
        ),

        # Dropout
        layers.Dropout(0.2, name='dropout'),

        # Dense layers
        layers.Dense(dense_units, activation='relu', name='dense_1'),
        layers.Dense(output_dim, activation='linear', name='output')
    ])

    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )

    return model

# Build GRU model
gru_model = build_gru_model()
gru_model.summary()

PyTorch GRU:

class GRUForecaster(nn.Module):
    """
    GRU model for time series forecasting (PyTorch)
    """
    def __init__(self, input_dim=1, hidden_dim=64, num_layers=1,
                 dense_dim=32, output_dim=1, dropout=0.2):
        super(GRUForecaster, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # GRU layer
        self.gru = nn.GRU(
            input_size=input_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Dropout
        self.dropout = nn.Dropout(dropout)

        # Fully connected layers
        self.fc1 = nn.Linear(hidden_dim, dense_dim)
        self.fc2 = nn.Linear(dense_dim, output_dim)

        # Activation
        self.relu = nn.ReLU()

    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)

        # GRU forward pass
        out, hn = self.gru(x, h0)

        # Take last timestep
        out = out[:, -1, :]

        # Dropout
        out = self.dropout(out)

        # FC layers
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)

        return out

# Instantiate
pytorch_gru = GRUForecaster(hidden_dim=64, num_layers=2)
print(pytorch_gru)

7.4.3 LSTM vs GRU: When to Use What?

Comparison Table:

| Aspect | LSTM | GRU |
|---|---|---|
| Parameters | More (3 gates + cell state, 4 set bobot) | Fewer (2 gates, 3 set bobot) |
| Training speed | Slower | Faster |
| Memory | Higher | Lower |
| Performance | Slightly better on complex tasks | Comparable on most tasks |
| Long-term dependencies | Excellent | Very good |
| Overfitting risk | Higher (more params) | Lower |
| When to use | Large datasets, complex patterns | Smaller datasets, faster training needed |

Practical Guidelines:

# Parameter comparison
def compare_parameters():
    """
    Compare parameter counts: LSTM vs GRU
    """
    seq_len, input_dim, hidden_dim = 50, 1, 64

    # Build models
    lstm_model = keras.Sequential([
        layers.LSTM(hidden_dim, input_shape=(seq_len, input_dim)),
        layers.Dense(1)
    ])

    gru_model = keras.Sequential([
        layers.GRU(hidden_dim, input_shape=(seq_len, input_dim)),
        layers.Dense(1)
    ])

    lstm_params = lstm_model.count_params()
    gru_params = gru_model.count_params()

    print("Parameter Comparison:")
    print(f"  LSTM parameters: {lstm_params:,}")
    print(f"  GRU parameters:  {gru_params:,}")
    print(f"  Difference:      {lstm_params - gru_params:,} ({(lstm_params-gru_params)/gru_params*100:.1f}% more)")
    print(f"\n  LSTM is {lstm_params/gru_params:.2f}x larger than GRU")

compare_parameters()
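Selisih parameter di atas juga bisa dicek secara analitik tanpa membangun model. Sketsa berikut menghitung jumlah parameter dari rumus standar; asumsi: implementasi Keras/TF2 default (`use_bias=True`, dan GRU dengan `reset_after=True` sehingga memakai dua vektor bias per set bobot).

```python
def lstm_param_count(input_dim, hidden_dim):
    # LSTM: 4 set bobot (forget, input, output, candidate),
    # masing-masing W (h x input_dim), U (h x h), dan bias (h)
    return 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim)

def gru_param_count(input_dim, hidden_dim):
    # GRU: 3 set bobot (update, reset, candidate); default Keras
    # reset_after=True memakai 2 vektor bias per set
    return 3 * (hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

print(lstm_param_count(1, 64))  # 16896
print(gru_param_count(1, 64))   # 12864
```

Angka ini seharusnya cocok dengan output `compare_parameters()` di atas untuk konfigurasi yang sama.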

🎯 Rule of Thumb

Use LSTM when:

  • You have large datasets (millions of samples)
  • Task requires very long-term dependencies (100+ timesteps)
  • Model interpretability less important
  • Computational resources abundant

Use GRU when:

  • Smaller datasets or limited computational resources
  • Need faster training/inference
  • Medium-term dependencies (10-100 timesteps)
  • Want simpler model with fewer hyperparameters

In practice: Try both! GRU often performs similarly with less complexity.

7.5 Advanced RNN Architectures

7.5.1 Bidirectional RNNs

Konsep: Process sequence forward AND backward untuk mendapatkan context dari kedua arah.

Use Cases:

  • Sentiment analysis (membutuhkan full sentence context)
  • Named Entity Recognition
  • Speech recognition
  • Tidak cocok untuk real-time forecasting (butuh future data)
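Secara konseptual, bidirectional RNN hanyalah dua RNN biasa yang berbagi input: satu membaca sequence maju, satu membaca mundur, lalu hidden state keduanya di-concatenate per timestep. Sketsa NumPy minimal berikut (simple RNN cell dengan bobot acak, murni sebagai ilustrasi) memperlihatkan mekanismenya:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 1, 4

# Bobot terpisah untuk arah maju (f) dan mundur (b) - asumsi: inisialisasi acak
Wf, Uf = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))
Wb, Ub = rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim))

def rnn_pass(x_seq, W, U):
    """Forward pass simple RNN: h_t = tanh(W x_t + U h_{t-1})."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in x_seq:
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return states

x = rng.normal(size=(seq_len, input_dim))
h_fwd = rnn_pass(x, Wf, Uf)              # proses x1 → xT
h_bwd = rnn_pass(x[::-1], Wb, Ub)[::-1]  # proses xT → x1, lalu disejajarkan ulang

# Concatenate per timestep: hasilnya (seq_len, 2*hidden_dim)
h_bi = np.stack([np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)])
print(h_bi.shape)  # (5, 8)
```

Inilah alasan output bidirectional layer berdimensi dua kali unit per arah.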

Architecture:

flowchart LR
    subgraph Forward["Forward Pass"]
        direction LR
        X1[x1] --> F1[h1_fwd]
        X2[x2] --> F2[h2_fwd]
        X3[x3] --> F3[h3_fwd]
        F1 --> F2
        F2 --> F3
    end

    subgraph Backward["Backward Pass"]
        direction RL
        X1B[x1] --> B1[h1_bwd]
        X2B[x2] --> B2[h2_bwd]
        X3B[x3] --> B3[h3_bwd]
        B3 --> B2
        B2 --> B1
    end

    F1 --> C1[Concat]
    B1 --> C1
    F2 --> C2[Concat]
    B2 --> C2
    F3 --> C3[Concat]
    B3 --> C3

    C1 --> Y1[y1]
    C2 --> Y2[y2]
    C3 --> Y3[y3]

    style Forward fill:#e6f3ff,stroke:#333,stroke-width:2px
    style Backward fill:#ffe6e6,stroke:#333,stroke-width:2px
    style C1 fill:#fff4e6,stroke:#333
    style C2 fill:#fff4e6,stroke:#333
    style C3 fill:#fff4e6,stroke:#333
Figure 7.1: Arsitektur Bidirectional LSTM - menggabungkan informasi dari forward dan backward pass

Implementation:

def build_bidirectional_lstm(sequence_length=50, input_dim=1,
                             lstm_units=64, num_classes=3):
    """
    Bidirectional LSTM untuk sequence classification
    """
    model = keras.Sequential([
        # Bidirectional LSTM
        layers.Bidirectional(
            layers.LSTM(lstm_units, return_sequences=True),
            input_shape=(sequence_length, input_dim),
            name='bidirectional_lstm_1'
        ),

        # Second bidirectional layer
        layers.Bidirectional(
            layers.LSTM(lstm_units // 2),
            name='bidirectional_lstm_2'
        ),

        # Dropout
        layers.Dropout(0.3),

        # Output
        layers.Dense(num_classes, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build model
bi_lstm = build_bidirectional_lstm()
bi_lstm.summary()

# Note: Output bidirectional layer = concatenation forward + backward.
# Layer pertama: 64 units/arah, return_sequences=True → (batch, seq_len, 128)
# Layer kedua: 32 units/arah (lstm_units // 2) → output (batch, 64)

7.5.2 Stacked/Deep RNNs

Konsep: Stack multiple RNN layers untuk learn hierarchical representations.

Benefits:

  • Learn more complex patterns
  • Better feature extraction
  • Hierarchical temporal abstractions

Caution:

  • More parameters = more data needed
  • Risk of overfitting
  • Harder to train

Implementation:

def build_stacked_lstm(sequence_length=50, input_dim=1,
                      lstm_layers=[128, 64, 32], output_dim=1):
    """
    Stacked LSTM dengan multiple layers

    Parameters:
        lstm_layers: List of units per layer (minimal 2 layer) [layer1_units, layer2_units, ...]
    """
    model = keras.Sequential(name='Stacked_LSTM')

    # First LSTM layer (must return sequences)
    model.add(layers.LSTM(
        lstm_layers[0],
        return_sequences=True,
        input_shape=(sequence_length, input_dim),
        name='lstm_1'
    ))
    model.add(layers.Dropout(0.2, name='dropout_1'))

    # Middle layers (return sequences for all except last)
    for i, units in enumerate(lstm_layers[1:-1], start=2):
        model.add(layers.LSTM(
            units,
            return_sequences=True,
            name=f'lstm_{i}'
        ))
        model.add(layers.Dropout(0.2, name=f'dropout_{i}'))

    # Last LSTM layer (return_sequences=False)
    model.add(layers.LSTM(
        lstm_layers[-1],
        return_sequences=False,
        name=f'lstm_{len(lstm_layers)}'
    ))
    model.add(layers.Dropout(0.2, name=f'dropout_{len(lstm_layers)}'))

    # Output layer
    model.add(layers.Dense(output_dim, activation='linear', name='output'))

    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )

    return model

# Build 3-layer stacked LSTM
stacked_lstm = build_stacked_lstm(lstm_layers=[128, 64, 32])
stacked_lstm.summary()

7.5.3 Encoder-Decoder (Seq2Seq)

Konsep: Architecture untuk sequence-to-sequence tasks dengan variable input/output lengths.

Components:

  1. Encoder: Process input sequence → context vector
  2. Decoder: Generate output sequence dari context vector

Use Cases:

  • Machine translation
  • Text summarization
  • Question answering
  • Image captioning

Architecture:

flowchart LR
    subgraph Encoder["Encoder"]
        direction LR
        X1[x1] --> E1[LSTM 1]
        X2[x2] --> E2[LSTM 2]
        X3[x3] --> E3[LSTM 3]
        E1 --> E2
        E2 --> E3
    end

    E3 ==> C[Context Vector]

    subgraph Decoder["Decoder"]
        direction LR
        C ==> D1[LSTM 1]
        D1 --> D2[LSTM 2]
        D2 --> D3[LSTM 3]
        D1 --> Y1[y1]
        D2 --> Y2[y2]
        D3 --> Y3[y3]
    end

    style Encoder fill:#e6f3ff,stroke:#333,stroke-width:2px
    style C fill:#ffffcc,stroke:#f90,stroke-width:3px
    style Decoder fill:#ffe6e6,stroke:#333,stroke-width:2px
    style Y1 fill:#d4edda,stroke:#333
    style Y2 fill:#d4edda,stroke:#333
    style Y3 fill:#d4edda,stroke:#333
Figure 7.2: Arsitektur Encoder-Decoder (Seq2Seq) - Encoder mengompres input menjadi context vector, Decoder menghasilkan output sequence

Implementation:

def build_seq2seq_model(encoder_seq_len=10, decoder_seq_len=10,
                       input_dim=1, output_dim=1, latent_dim=64):
    """
    Simple Seq2Seq model

    Parameters:
        encoder_seq_len: Input sequence length
        decoder_seq_len: Output sequence length
        latent_dim: Hidden dimension
    """
    # Encoder
    encoder_inputs = layers.Input(shape=(encoder_seq_len, input_dim), name='encoder_input')
    encoder_lstm = layers.LSTM(latent_dim, return_state=True, name='encoder_lstm')
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
    encoder_states = [state_h, state_c]  # Context vector

    # Decoder
    decoder_inputs = layers.Input(shape=(decoder_seq_len, output_dim), name='decoder_input')
    decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True, name='decoder_lstm')
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = layers.Dense(output_dim, activation='linear', name='decoder_dense')
    decoder_outputs = decoder_dense(decoder_outputs)

    # Model
    model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs, name='Seq2Seq')

    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )

    return model

# Build seq2seq
seq2seq = build_seq2seq_model()
seq2seq.summary()

7.6 Time Series Forecasting dengan RNN

7.6.1 Problem Formulation

Time Series Forecasting Task:

Given historical data \(x_1, x_2, ..., x_t\), predict future values \(x_{t+1}, x_{t+2}, ..., x_{t+h}\)

Approaches:

  1. One-step ahead: Predict \(x_{t+1}\) saja
  2. Multi-step ahead: Predict sequence \([x_{t+1}, ..., x_{t+h}]\)
  3. Recursive: Use predictions as input untuk next prediction
  4. Direct: Separate model untuk setiap horizon
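Pendekatan recursive (poin 3) dapat disketsakan seperti berikut. Asumsi: `predict_fn` mewakili model one-step apa pun; di sini diganti fungsi dummy (rata-rata window) agar contoh bisa dijalankan mandiri, tetapi model LSTM/GRU terlatih dipakai dengan pola yang sama.

```python
import numpy as np

def recursive_forecast(predict_fn, history, horizon):
    """Multi-step forecasting: prediksi dipakai kembali sebagai input berikutnya."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        next_val = predict_fn(np.array(window))
        preds.append(next_val)
        window = window[1:] + [next_val]  # geser window, masukkan prediksi terbaru
    return preds

# Dummy one-step "model": rata-rata window (hanya untuk ilustrasi)
mean_model = lambda w: float(w.mean())

history = [1.0, 2.0, 3.0, 4.0]
print(recursive_forecast(mean_model, history, horizon=3))
```

Perhatikan konsekuensinya: error pada langkah awal ikut terbawa ke langkah berikutnya (error accumulation), kelemahan utama pendekatan recursive dibanding direct.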

Data Preparation:

def create_sequences(data, lookback=50, horizon=1):
    """
    Create input-output sequences untuk time series forecasting

    Parameters:
        data: 1D array time series data
        lookback: Number of past timesteps to use as input
        horizon: Number of future timesteps to predict

    Returns:
        X: Input sequences (samples, lookback, features)
        y: Target values (samples, horizon)
    """
    X, y = [], []

    for i in range(len(data) - lookback - horizon + 1):
        # Input: [i : i+lookback]
        X.append(data[i : i + lookback])

        # Target: [i+lookback : i+lookback+horizon]
        if horizon == 1:
            y.append(data[i + lookback])
        else:
            y.append(data[i + lookback : i + lookback + horizon])

    X = np.array(X)
    y = np.array(y)

    # Reshape X to (samples, lookback, 1) untuk univariate
    if len(X.shape) == 2:
        X = X.reshape((X.shape[0], X.shape[1], 1))

    return X, y

# Example
np.random.seed(42)
sample_data = np.sin(np.linspace(0, 100, 1000)) + np.random.normal(0, 0.1, 1000)

X, y = create_sequences(sample_data, lookback=50, horizon=1)
print(f"Input shape: {X.shape}")   # (samples, 50, 1)
print(f"Output shape: {y.shape}")  # (samples, 1)

# Visualize sequences
fig, axes = plt.subplots(2, 1, figsize=(15, 8))

# Plot full time series
axes[0].plot(sample_data, linewidth=1.5, alpha=0.7)
axes[0].set_title('Full Time Series', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Time', fontsize=11)
axes[0].set_ylabel('Value', fontsize=11)
axes[0].grid(alpha=0.3)

# Plot one sequence example
example_idx = 100
input_seq = X[example_idx].flatten()
target_val = y[example_idx]

axes[1].plot(range(len(input_seq)), input_seq, 'b-', linewidth=2, label='Input Sequence (lookback=50)')
axes[1].plot(len(input_seq), target_val, 'ro', markersize=10, label='Target (t+1)', zorder=3)
axes[1].axvline(len(input_seq)-1, color='gray', linestyle='--', alpha=0.5)
axes[1].set_title(f'Example Sequence #{example_idx}', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Timestep', fontsize=11)
axes[1].set_ylabel('Value', fontsize=11)
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

7.6.2 Feature Engineering for Time Series

Important Features:

  1. Lag features: Past values
  2. Rolling statistics: Moving average, std
  3. Time-based features: Hour, day, month, seasonality
  4. Difference features: First/second differences

import pandas as pd

def engineer_time_series_features(data, datetime_index=None):
    """
    Create time series features

    Parameters:
        data: 1D array or pandas Series
        datetime_index: DatetimeIndex (optional)

    Returns:
        DataFrame with engineered features
    """
    if isinstance(data, np.ndarray):
        data = pd.Series(data)

    df = pd.DataFrame({'value': data})

    # Lag features
    for lag in [1, 2, 3, 7, 14]:
        df[f'lag_{lag}'] = df['value'].shift(lag)

    # Rolling statistics
    for window in [7, 14, 30]:
        df[f'rolling_mean_{window}'] = df['value'].rolling(window=window).mean()
        df[f'rolling_std_{window}'] = df['value'].rolling(window=window).std()
        df[f'rolling_min_{window}'] = df['value'].rolling(window=window).min()
        df[f'rolling_max_{window}'] = df['value'].rolling(window=window).max()

    # Difference features
    df['diff_1'] = df['value'].diff(1)
    df['diff_2'] = df['value'].diff(2)

    # Time-based features (if datetime index provided)
    if datetime_index is not None:
        df.index = datetime_index
        df['hour'] = df.index.hour
        df['day_of_week'] = df.index.dayofweek
        df['day_of_month'] = df.index.day
        df['month'] = df.index.month
        df['quarter'] = df.index.quarter

        # Cyclical encoding untuk periodic features
        df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
        df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
        df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
        df['day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

    # Drop NaN rows
    df = df.dropna()

    return df

# Example
dates = pd.date_range('2023-01-01', periods=len(sample_data), freq='H')
features_df = engineer_time_series_features(sample_data, dates)

print("Engineered Features:")
print(features_df.head(10))
print(f"\nTotal features created: {len(features_df.columns)}")
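Mengapa cyclical encoding penting: jam 23 dan jam 0 bersebelahan secara waktu, tetapi berjarak 23 pada skala mentah. Cek kecil berikut menunjukkan bahwa pada representasi sin/cos keduanya memang berdekatan:

```python
import numpy as np

def hour_encoding(h):
    """Encode jam (0-23) sebagai titik pada lingkaran satuan."""
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

d_23_0 = np.linalg.norm(hour_encoding(23) - hour_encoding(0))
d_12_0 = np.linalg.norm(hour_encoding(12) - hour_encoding(0))
print(f"jarak(23, 0) = {d_23_0:.3f}")  # kecil: jam bertetangga
print(f"jarak(12, 0) = {d_12_0:.3f}")  # besar: jam berseberangan
```

Tanpa encoding ini, model memperlakukan transisi 23 → 0 sebagai lompatan besar, padahal hanya satu jam.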

7.6.3 Evaluation Metrics untuk Time Series

Common Metrics:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_forecast(y_true, y_pred):
    """
    Comprehensive evaluation metrics untuk forecasting

    Returns:
        Dictionary of metrics
    """
    # Regression metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    # Percentage errors
    mape = np.mean(np.abs((y_true - y_pred) / (np.abs(y_true) + 1e-8))) * 100

    # Direction accuracy (berapa % trend direction benar)
    y_true_diff = np.diff(y_true.flatten())
    y_pred_diff = np.diff(y_pred.flatten())
    direction_accuracy = np.mean((y_true_diff * y_pred_diff) > 0) * 100

    metrics = {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2,
        'MAPE (%)': mape,
        'Direction Accuracy (%)': direction_accuracy
    }

    return metrics

# Example evaluation
y_true = sample_data[100:200]
y_pred = sample_data[100:200] + np.random.normal(0, 0.1, 100)  # Simulated predictions

metrics = evaluate_forecast(y_true, y_pred)

print("Forecast Evaluation Metrics:")
print("=" * 50)
for metric, value in metrics.items():
    print(f"{metric:25s}: {value:10.4f}")

📊 Which Metric to Use?

MSE/RMSE:

  • Penalize large errors heavily
  • Good when outliers are critical
  • Same unit as target

MAE:

  • More robust to outliers
  • Interpretable (average error)
  • Less sensitive to extreme values

MAPE:

  • Percentage-based, scale-independent
  • Good untuk comparing different datasets
  • Problem jika y_true close to zero

Direction Accuracy:

  • Important for trading strategies
  • Measures trend prediction
  • Binary metric (up/down)

R²:

  • Goodness of fit
  • 1.0 = perfect, 0.0 = baseline
  • Can be negative for bad models
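Apa pun metrik yang dipilih, selalu bandingkan model dengan baseline naif (persistence: prediksi = nilai terakhir yang diamati). Jika RNN tidak mengalahkan baseline ini, modelnya belum berguna. Sketsa singkat:

```python
import numpy as np

def persistence_baseline(series):
    """RMSE baseline naif: y_hat[t] = y[t-1]."""
    series = np.asarray(series, dtype=np.float64)
    y_true = series[1:]
    y_pred = series[:-1]  # prediksi = nilai sebelumnya
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

series = np.sin(np.linspace(0, 10, 200))
print(f"Persistence RMSE: {persistence_baseline(series):.4f}")
# RMSE model RNN Anda seharusnya berada di bawah angka ini
```

Baseline seperti ini murah dihitung dan sering mengejutkan: pada data yang sangat smooth, persistence sulit dikalahkan.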

7.7 RNN untuk NLP Applications

7.7.1 Text Preprocessing untuk RNN

Steps:

  1. Tokenization: Split text into tokens
  2. Vocabulary building: Create word-to-index mapping
  3. Sequence padding: Make all sequences same length
  4. Embedding: Convert words to dense vectors

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample text data
texts = [
    "I love machine learning",
    "Deep learning is amazing",
    "RNN are great for sequences",
    "LSTM solves vanishing gradient problem",
    "NLP with deep learning is powerful"
]

# Tokenization
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)

# Convert to sequences
sequences = tokenizer.texts_to_sequences(texts)

print("Original texts:")
for i, text in enumerate(texts):
    print(f"  {i+1}. {text}")

print("\nTokenized sequences:")
for i, seq in enumerate(sequences):
    print(f"  {i+1}. {seq}")

# Vocabulary
word_index = tokenizer.word_index
print(f"\nVocabulary size: {len(word_index)}")
print("\nWord to index mapping (first 10):")
for word, idx in list(word_index.items())[:10]:
    print(f"  '{word}': {idx}")

# Padding
max_length = 10
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')

print(f"\nPadded sequences (max_length={max_length}):")
for i, seq in enumerate(padded_sequences):
    print(f"  {i+1}. {seq}")

7.7.2 Sentiment Analysis dengan LSTM

def build_sentiment_classifier(vocab_size=10000, embedding_dim=64,
                               max_length=100, lstm_units=64):
    """
    LSTM untuk sentiment classification
    """
    model = keras.Sequential([
        # Embedding layer
        layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            input_length=max_length,
            name='embedding'
        ),

        # Spatial dropout untuk embedding
        layers.SpatialDropout1D(0.2),

        # Bidirectional LSTM
        layers.Bidirectional(
            layers.LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2),
            name='bidirectional_lstm'
        ),

        # Dense layers
        layers.Dense(32, activation='relu'),
        layers.Dropout(0.3),

        # Output layer (binary classification)
        layers.Dense(1, activation='sigmoid')
    ])

    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model

# Build model
sentiment_model = build_sentiment_classifier()
sentiment_model.summary()

7.7.3 Text Generation dengan RNN

Character-level Language Model:

def build_text_generator(vocab_size=100, embedding_dim=256,
                        rnn_units=512, sequence_length=100):
    """
    Character-level text generation model
    """
    model = keras.Sequential([
        # Embedding
        layers.Embedding(vocab_size, embedding_dim, input_length=sequence_length),

        # Stacked LSTM
        layers.LSTM(rnn_units, return_sequences=True),
        layers.Dropout(0.2),
        layers.LSTM(rnn_units),
        layers.Dropout(0.2),

        # Output layer (predict next character)
        layers.Dense(vocab_size, activation='softmax')
    ])

    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

text_gen = build_text_generator()
text_gen.summary()
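Saat inference, karakter berikutnya biasanya di-sample dari distribusi softmax dengan temperature: temperature rendah → output konservatif (selalu karakter paling mungkin), tinggi → lebih beragam/acak. Sketsa NumPy berikut (terpisah dari model Keras di atas) mengilustrasikan mekanisme sampling-nya:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample satu indeks dari logits setelah scaling temperature."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax yang stabil numerik
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])
rng = np.random.default_rng(42)
samples = [sample_with_temperature(logits, temperature=0.5, rng=rng) for _ in range(100)]
print(np.bincount(samples, minlength=3))  # indeks 0 dominan pada temperature rendah
```

Coba naikkan `temperature` ke 2.0: distribusi sampel menjadi jauh lebih merata - inilah knob "kreativitas" pada text generation.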

7.8 Best Practices dan Tips

7.8.1 Hyperparameter Tuning

Key Hyperparameters:

  1. Hidden units: Start dengan 32-128
  2. Number of layers: 1-3 layers biasanya cukup
  3. Dropout rate: 0.2-0.5
  4. Learning rate: 0.001 (Adam) atau 0.01 (SGD)
  5. Batch size: 32-128 untuk time series
  6. Sequence length: Tergantung pada temporal dependency

# Hyperparameter search space
hyperparams = {
    'lstm_units': [32, 64, 128, 256],
    'num_layers': [1, 2, 3],
    'dropout_rate': [0.0, 0.2, 0.3, 0.5],
    'learning_rate': [0.0001, 0.001, 0.01],
    'batch_size': [32, 64, 128],
    'sequence_length': [24, 48, 72, 96]
}

print("Hyperparameter Search Space:")
for param, values in hyperparams.items():
    print(f"  {param:20s}: {values}")

7.8.2 Training Tips

🎯 Training Best Practices

1. Gradient Clipping (prevent exploding gradients):

optimizer = keras.optimizers.Adam(clipnorm=1.0)

2. Early Stopping (prevent overfitting):

early_stop = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

3. Learning Rate Scheduling:

reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    min_lr=1e-7
)

4. Batch Normalization (untuk deeper networks):

layers.BatchNormalization()

5. Teacher Forcing (untuk seq2seq):

  • Use ground truth sebagai decoder input during training
  • Use model predictions during inference
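Secara konkret, teacher forcing berarti input decoder adalah target yang digeser satu langkah (diawali start token), bukan prediksi model sendiri. Sketsa penyiapan datanya (nilai start token di sini hanyalah asumsi ilustrasi):

```python
import numpy as np

START_TOKEN = 0.0  # asumsi: penanda awal sequence, pilih sesuai data Anda

def make_teacher_forcing_inputs(targets):
    """decoder_input[t] = target[t-1]; decoder_input[0] = START_TOKEN."""
    targets = np.asarray(targets, dtype=np.float64)
    decoder_input = np.concatenate([[START_TOKEN], targets[:-1]])
    return decoder_input, targets

dec_in, dec_out = make_teacher_forcing_inputs([1.5, 2.0, 2.5])
print(dec_in)   # target digeser satu langkah: 0.0, 1.5, 2.0
print(dec_out)  # 1.5, 2.0, 2.5
# Saat inference: decoder_input[t] diisi prediksi model pada langkah t-1
```

Pasangan `dec_in`/`dec_out` inilah yang diberikan ke input kedua model Seq2Seq di Subbab 7.5.3 selama training.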

7.8.3 Common Pitfalls

❌ Common Mistakes:

  1. Data leakage: Using future data untuk train
  2. Not normalizing: RNN sensitive terhadap scale
  3. Too many parameters: Overfitting pada small datasets
  4. Ignoring stationarity: Time series should be stationary
  5. Wrong sequence direction: Pastikan temporal order benar
  6. Batch size too small: Unstable training
  7. No validation set: Cannot detect overfitting

✅ Solutions:

# 1. Proper train/val/test split untuk time series
def time_series_split(data, train_ratio=0.7, val_ratio=0.15):
    """
    Time series split (no shuffling!)
    """
    n = len(data)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))

    train = data[:train_end]
    val = data[train_end:val_end]
    test = data[val_end:]

    return train, val, test

# 2. Normalization
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def normalize_data(train, val, test, method='standard'):
    """
    Normalize data using train statistics
    """
    if method == 'standard':
        scaler = StandardScaler()
    else:
        scaler = MinMaxScaler()

    # Fit ONLY on training data
    train_scaled = scaler.fit_transform(train.reshape(-1, 1))
    val_scaled = scaler.transform(val.reshape(-1, 1))
    test_scaled = scaler.transform(test.reshape(-1, 1))

    return train_scaled, val_scaled, test_scaled, scaler

# Example
train, val, test = time_series_split(sample_data)
train_norm, val_norm, test_norm, scaler = normalize_data(train, val, test)

print(f"Train size: {len(train)} ({len(train)/len(sample_data)*100:.1f}%)")
print(f"Val size:   {len(val)} ({len(val)/len(sample_data)*100:.1f}%)")
print(f"Test size:  {len(test)} ({len(test)/len(sample_data)*100:.1f}%)")

7.9 Rangkuman & Kesimpulan

7.9.1 Key Takeaways

📚 Chapter Summary

1. RNN Fundamentals:

  • RNN memproses sequential data dengan hidden state (memory)
  • Parameter sharing across timesteps
  • Dapat handle variable-length sequences

2. Vanishing Gradient Problem:

  • Simple RNN sulit learn long-term dependencies
  • Gradients exponentially decay saat backpropagation
  • Solution: LSTM, GRU dengan gating mechanisms

3. LSTM:

  • 3 gates (forget, input, output) + cell state
  • Cell state acts as information highway
  • Additive updates prevent gradient vanishing
  • Best untuk very long-term dependencies

4. GRU:

  • Simplified LSTM dengan 2 gates (update, reset)
  • Fewer parameters, faster training
  • Comparable performance untuk most tasks
  • Good default choice untuk many applications

5. Advanced Architectures:

  • Bidirectional: Process forward + backward
  • Stacked: Multiple layers untuk hierarchical learning
  • Seq2Seq: Encoder-decoder untuk variable length I/O

6. Applications:

  • Time Series: Energy forecasting, stock prediction
  • NLP: Sentiment analysis, text generation, translation
  • Speech: Recognition, synthesis
  • Video: Frame prediction, action recognition

7. Best Practices:

  • Normalize input data
  • Use gradient clipping
  • Proper train/val/test split (no shuffling untuk time series!)
  • Start simple, add complexity gradually
  • Monitor for overfitting

7.9.2 When to Use What?

flowchart TD
    A["Sequential Problem?"] -->|Yes| B{"Long-term<br/>Dependencies?"}
    A -->|No| Z["Use Feedforward NN<br/>or CNN"]

    B -->|"Yes, >100 steps"| C["Use LSTM"]
    B -->|"Medium, 10-100 steps"| D["Use GRU"]
    B -->|"Short, <10 steps"| E["Simple RNN OK"]

    C --> F{"Large Dataset?"}
    D --> F
    E --> F

    F -->|"Yes, millions"| G["Deep/Stacked RNN"]
    F -->|"No, thousands"| H["Shallow RNN<br/>1-2 layers"]

    G --> I{"Need both<br/>directions?"}
    H --> I

    I -->|Yes| J["Bidirectional"]
    I -->|No| K["Unidirectional"]

    style A fill:#ffcccc,stroke:#333,stroke-width:2px
    style B fill:#fff3cd,stroke:#333,stroke-width:2px
    style C fill:#ccffcc,stroke:#333,stroke-width:2px
    style D fill:#ccffcc,stroke:#333,stroke-width:2px
    style E fill:#ffffcc,stroke:#333,stroke-width:2px
    style F fill:#fff3cd,stroke:#333,stroke-width:2px
    style I fill:#fff3cd,stroke:#333,stroke-width:2px
    style J fill:#ccccff,stroke:#333,stroke-width:2px
    style K fill:#ccccff,stroke:#333,stroke-width:2px
    style Z fill:#e2e3e5,stroke:#333,stroke-width:2px
Figure 7.3: Decision tree untuk memilih arsitektur RNN yang tepat berdasarkan karakteristik masalah dan data

7.9.3 Looking Forward

Limitations of RNNs:

  • Sequential processing (cannot parallelize)
  • Still struggle dengan very long sequences (>1000 steps)
  • Computationally expensive
  • Hard to capture global dependencies

Modern Alternatives:

  • Transformers: Attention mechanisms, fully parallelizable
  • Temporal Convolutional Networks (TCN): CNN untuk sequences
  • State Space Models (SSM): Linear-time alternatives

When RNN Still Relevant:

  • Small datasets (transformers need more data)
  • Online/streaming prediction
  • Resource-constrained environments
  • Interpretability requirements
  • Classic time series problems

7.10 Soal Latihan

Review Questions

  1. Jelaskan mengapa feedforward neural networks tidak cocok untuk sequential data. Sebutkan 3 alasan utama.

  2. Apa itu vanishing gradient problem? Mengapa ini terjadi pada Simple RNN? Berikan penjelasan matematis.

  3. Bandingkan LSTM dan GRU:

    • Perbedaan arsitektur
    • Jumlah parameters
    • Kapan menggunakan masing-masing
    • Trade-offs
  4. Apa fungsi dari setiap gate dalam LSTM:

    • Forget gate
    • Input gate
    • Output gate

    Berikan contoh konkret kapan setiap gate akan “terbuka” atau “tertutup”.
  5. Jelaskan perbedaan antara:

    • return_sequences=True vs return_sequences=False
    • Unidirectional vs Bidirectional RNN
    • Stateful vs Stateless RNN
  6. Untuk time series forecasting, jelaskan:

    • One-step vs multi-step forecasting
    • Recursive vs direct multi-step prediction
    • Kelebihan dan kekurangan masing-masing
  7. Mengapa normalization penting untuk RNN? Apa yang terjadi jika tidak melakukan normalization?

  8. Jelaskan sequence-to-sequence (Seq2Seq) architecture. Berikan 3 aplikasi nyata.

  9. Apa peran dropout dalam RNN? Di mana sebaiknya dropout diterapkan?

  10. Bandingkan evaluation metrics untuk time series:

    • MSE vs MAE
    • MAPE vs RMSE
    • Kapan menggunakan masing-masing

Coding Exercises

Exercise 1: Implementasi Simple RNN dari scratch (NumPy)

# Tugas: Implementasi forward pass Simple RNN tanpa library
# Input: sequence dengan shape (batch, timesteps, features)
# Output: hidden states untuk setiap timestep

Exercise 2: LSTM untuk Stock Price Prediction

# Dataset: Historical stock prices (Yahoo Finance)
# Task: Predict next day close price
# Requirements:
#   - Feature engineering (moving averages, RSI, etc.)
#   - LSTM model dengan minimal 2 layers
#   - Proper train/val/test split
#   - Evaluate dengan multiple metrics

Exercise 3: Sentiment Analysis dengan Bidirectional LSTM

# Dataset: IMDB reviews
# Task: Binary sentiment classification
# Requirements:
#   - Text preprocessing (tokenization, padding)
#   - Embedding layer
#   - Bidirectional LSTM
#   - Compare dengan unidirectional

Exercise 4: Text Generation (Character-level)

# Dataset: Shakespeare texts atau any corpus
# Task: Generate new text character-by-character
# Requirements:
#   - Character-level tokenization
#   - Stacked LSTM
#   - Temperature sampling
#   - Generate coherent sentences

Exercise 5: Energy Consumption Forecasting

# Dataset: Household energy consumption (hourly)
# Task: Predict next 24 hours consumption
# Requirements:
#   - Multi-step forecasting
#   - Compare Simple RNN, LSTM, GRU
#   - Feature engineering (time-based features)
#   - Visualization of predictions

Selamat belajar! Di Lab 7, kita akan mengimplementasikan LSTM untuk Energy Forecasting secara hands-on! 🚀