Lab 7: Time Series Forecasting with LSTM/RNN

Household Energy Consumption Prediction Using Recurrent Neural Networks

Author

Machine Learning - Data Science for Cybersecurity

Published

December 15, 2025

18 Introduction

18.1 Learning Objectives

After completing this lab, you should be able to:

  1. Understand time series forecasting with deep learning
  2. Preprocess time series data (windowing, normalization)
  3. Build LSTM/GRU models for forecasting
  4. Implement sequence models with Keras and PyTorch
  5. Apply advanced techniques such as the attention mechanism
  6. Evaluate forecasts with time series metrics (MAE, RMSE, MAPE; see the sketch after this list)
  7. Visualize predictions vs. actual values
  8. Tune models for the best performance

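For reference (objective 6), here is a minimal sketch of the three forecasting metrics. These helper functions are illustrative only and are not part of the lab's pipeline code:

import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average error magnitude, in the target's units."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: like MAE but penalizes large errors more."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred, eps=1e-8):
    """Mean Absolute Percentage Error: scale-free, but unstable when y_true is near zero."""
    return np.mean(np.abs((y_true - y_pred) / (y_true + eps))) * 100
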
18.2 Lab Overview

In this lab you will work with the Household Energy Consumption dataset, which contains household electricity usage recorded by IoT sensors.

18.2.1 The Energy Consumption Dataset

Dataset characteristics:

  • Domain: IoT smart-home energy monitoring
  • Frequency: one reading per minute (60 readings per hour)
  • Time span: 47 months (Dec 2006 - Nov 2010)
  • Total records: ~2 million observations (47 months × ~30 days × 1,440 readings/day ≈ 2 million)
  • Features: 9 variables (7 numerical + 2 datetime)

Features in the dataset:

  1. 🕐 Date & Time - observation timestamp
  2. Global_active_power - total active power consumption (kilowatts)
  3. Global_reactive_power - reactive power (kilowatts)
  4. 🔌 Voltage - voltage (volts)
  5. ⚙️ Global_intensity - current intensity (amperes)
  6. 🏠 Sub_metering_1 - kitchen (watt-hours)
  7. 🏠 Sub_metering_2 - laundry room (watt-hours)
  8. 🏠 Sub_metering_3 - water heater & air conditioner (watt-hours)

18.2.2 Approaches Covered

In this lab we will explore several approaches:

graph TD
    A[Energy Dataset] --> B[Part 1: Data Exploration]
    B --> C[Part 2: LSTM from Scratch]
    B --> D[Part 3: Advanced RNN]
    C --> C1[Simple LSTM]
    C --> C2[Stacked LSTM]
    C --> C3[Bidirectional LSTM]
    D --> D1[GRU Networks]
    D --> D2[Attention Mechanism]
    D1 --> E[Part 4: PyTorch Implementation]
    D2 --> E
    E --> E1[PyTorch LSTM]
    E --> E2[Custom Training Loop]
    E --> E3[Model Comparison]
    E1 --> F[Final Evaluation]
    E2 --> F
    E3 --> F

18.3 Environment Setup

18.3.1 Import Libraries

# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Import TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.callbacks import (
    ModelCheckpoint, EarlyStopping,
    ReduceLROnPlateau, TensorBoard
)

# Import PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset

# Import scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Date/time utilities
from datetime import datetime, timedelta

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
torch.manual_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"Keras version: {keras.__version__}")
print(f"GPU available (TF): {tf.config.list_physical_devices('GPU')}")
print(f"GPU available (PyTorch): {torch.cuda.is_available()}")

18.3.2 GPU Configuration (Optional)

# Check and configure the GPU if available
def setup_gpu():
    """Set up GPUs for more efficient training"""

    # TensorFlow GPU
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            print(f"✓ TensorFlow: {len(gpus)} GPU ditemukan dan dikonfigurasi")
        except RuntimeError as e:
            print(f"✗ TensorFlow GPU configuration error: {e}")
    else:
        print("⚠ TensorFlow: No GPU found. Using CPU")

    # PyTorch GPU
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print(f"✓ PyTorch: Using GPU ({torch.cuda.get_device_name(0)})")
    else:
        device = torch.device("cpu")
        print("⚠ PyTorch: Using CPU")

    return device

device = setup_gpu()

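A minimal sketch of how the returned device is used later; the model and tensors here are hypothetical, the point is only that parameters and inputs must live on the same device:

# Hypothetical example: move a model and a batch to the selected device
example_model = nn.Linear(4, 1).to(device)     # parameters now on GPU (or CPU)
example_batch = torch.randn(8, 4).to(device)   # inputs must be on the same device
example_output = example_model(example_batch)  # forward pass runs on `device`
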
18.3.3 Directory Setup

# Create directories for saving models and results
dirs = {
    'data': Path('data'),
    'models': Path('models'),
    'checkpoints': Path('checkpoints'),
    'figures': Path('figures'),
    'logs': Path('logs'),
    'predictions': Path('predictions')
}

for name, path in dirs.items():
    path.mkdir(exist_ok=True, parents=True)
    print(f"✓ Directory created: {path}")

18.3.4 Global Constants

# Forecasting constants
SEQUENCE_LENGTH = 24  # Use the last 24 time steps (24 hours at the hourly resolution used below)
FORECAST_HORIZON = 6  # Predict the next 6 time steps (6 hours)
BATCH_SIZE = 64
EPOCHS = 50
LEARNING_RATE = 0.001

# Features to use as model inputs
FEATURE_COLUMNS = [
    'Global_active_power',
    'Global_reactive_power',
    'Voltage',
    'Global_intensity',
    'Sub_metering_1',
    'Sub_metering_2',
    'Sub_metering_3'
]

TARGET_COLUMN = 'Global_active_power'

print("Konfigurasi Forecasting:")
print(f"  Sequence length: {SEQUENCE_LENGTH} timesteps")
print(f"  Forecast horizon: {FORECAST_HORIZON} timesteps")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Training epochs: {EPOCHS}")
print(f"  Features: {len(FEATURE_COLUMNS)}")
print(f"  Target: {TARGET_COLUMN}")

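To make the windowing arithmetic concrete: with a 24-step input window and a 6-step horizon, a series of length N yields N - 24 - 6 + 1 overlapping samples. A toy sketch, not part of the pipeline:

# Toy illustration of sliding-window counts
toy = np.arange(100)  # pretend: 100 hourly readings
n_windows = len(toy) - SEQUENCE_LENGTH - FORECAST_HORIZON + 1
print(n_windows)  # 100 - 24 - 6 + 1 = 71
first_X = toy[0:SEQUENCE_LENGTH]                                   # input: steps 0..23
first_y = toy[SEQUENCE_LENGTH:SEQUENCE_LENGTH + FORECAST_HORIZON]  # target: steps 24..29
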
19 Part 1: Data Loading dan Exploration

19.1 Load Dataset

def load_energy_data(data_path='data/household_power_consumption.txt'):
    """
    Load Household Energy Consumption dataset

    Returns:
        df: pandas DataFrame
    """
    print("Loading energy consumption dataset...")

    try:
        # Try to load from the local file (dates in this dataset are day-first: dd/mm/yyyy)
        df = pd.read_csv(
            data_path,
            sep=';',
            parse_dates={'datetime': ['Date', 'Time']},
            dayfirst=True,
            low_memory=False,
            na_values=['?', '']
        )
    except FileNotFoundError:
        print("⚠ Local file not found. Downloading from UCI repository...")
        # Download from UCI Machine Learning Repository
        url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip"

        import requests
        import zipfile
        import io

        response = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(response.content))
        z.extractall('data/')

        # Load after extraction
        df = pd.read_csv(
            'data/household_power_consumption.txt',
            sep=';',
            parse_dates={'datetime': ['Date', 'Time']},
            dayfirst=True,
            low_memory=False,
            na_values=['?', '']
        )

    # Set datetime as index
    df.set_index('datetime', inplace=True)

    print(f"✓ Dataset loaded successfully!")
    print(f"  Total records: {len(df):,}")
    print(f"  Date range: {df.index.min()} to {df.index.max()}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Shape: {df.shape}")

    return df

# Load data
df_raw = load_energy_data()

19.2 Exploratory Data Analysis

19.2.1 Informasi Dataset

def display_dataset_info(df):
    """Tampilkan informasi lengkap tentang dataset"""

    print("=" * 70)
    print("ENERGY CONSUMPTION DATASET INFORMATION")
    print("=" * 70)

    # Basic info
    print(f"\n1. DIMENSI DATA:")
    print(f"   Records: {len(df):,}")
    print(f"   Features: {len(df.columns)}")
    print(f"   Time span: {(df.index.max() - df.index.min()).days} days")
    print(f"   Frequency: {pd.infer_freq(df.index[:1000])}")

    # Memory usage
    memory_mb = df.memory_usage(deep=True).sum() / (1024**2)
    print(f"\n2. MEMORY USAGE:")
    print(f"   Total: {memory_mb:.2f} MB")

    # Data types
    print(f"\n3. DATA TYPES:")
    for col in df.columns:
        print(f"   {col}: {df[col].dtype}")

    # Missing values
    print(f"\n4. MISSING VALUES:")
    missing = df.isnull().sum()
    for col in df.columns:
        pct = (missing[col] / len(df)) * 100
        print(f"   {col}: {missing[col]:,} ({pct:.2f}%)")

    # Summary statistics
    print(f"\n5. SUMMARY STATISTICS:")
    print(df.describe())

    print("=" * 70)

display_dataset_info(df_raw)

19.2.2 Visualisasi Time Series

def plot_time_series(df, save_path=None):
    """Plot time series untuk semua variabel"""

    fig, axes = plt.subplots(len(df.columns), 1, figsize=(15, len(df.columns)*3))

    if len(df.columns) == 1:
        axes = [axes]

    # Subsample for visualization (one week)
    sample = df['2007-02-01':'2007-02-07']

    for i, col in enumerate(df.columns):
        axes[i].plot(sample.index, sample[col], linewidth=0.8, alpha=0.8)
        axes[i].set_title(f'{col} - 1 Week Sample', fontsize=12, fontweight='bold')
        axes[i].set_xlabel('Time')
        axes[i].set_ylabel(col)
        axes[i].grid(alpha=0.3)

    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
        print(f"✓ Figure saved to: {save_path}")

    plt.show()

plot_time_series(df_raw, save_path=dirs['figures'] / 'time_series_overview.png')

19.2.3 Analisis Pola Temporal

def analyze_temporal_patterns(df, target_col='Global_active_power'):
    """Analisis pola harian, mingguan, dan bulanan"""

    # Prepare data
    data = df[target_col].dropna()

    # Extract temporal features
    df_temp = pd.DataFrame({
        'value': data.values,
        'hour': data.index.hour,
        'day': data.index.dayofweek,
        'month': data.index.month
    })

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Hourly pattern
    hourly_mean = df_temp.groupby('hour')['value'].mean()
    axes[0].plot(hourly_mean.index, hourly_mean.values, marker='o', linewidth=2)
    axes[0].set_xlabel('Hour of Day', fontsize=12, fontweight='bold')
    axes[0].set_ylabel('Mean Power (kW)', fontsize=12, fontweight='bold')
    axes[0].set_title('Daily Pattern', fontsize=14, fontweight='bold')
    axes[0].grid(alpha=0.3)
    axes[0].set_xticks(range(0, 24, 2))

    # Weekly pattern
    day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    weekly_mean = df_temp.groupby('day')['value'].mean()
    axes[1].bar(range(7), weekly_mean.values, color='steelblue', alpha=0.7)
    axes[1].set_xlabel('Day of Week', fontsize=12, fontweight='bold')
    axes[1].set_ylabel('Mean Power (kW)', fontsize=12, fontweight='bold')
    axes[1].set_title('Weekly Pattern', fontsize=14, fontweight='bold')
    axes[1].set_xticks(range(7))
    axes[1].set_xticklabels(day_names)
    axes[1].grid(axis='y', alpha=0.3)

    # Monthly pattern
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    monthly_mean = df_temp.groupby('month')['value'].mean()
    axes[2].plot(monthly_mean.index, monthly_mean.values, marker='o', linewidth=2)
    axes[2].set_xlabel('Month', fontsize=12, fontweight='bold')
    axes[2].set_ylabel('Mean Power (kW)', fontsize=12, fontweight='bold')
    axes[2].set_title('Seasonal Pattern', fontsize=14, fontweight='bold')
    axes[2].grid(alpha=0.3)
    axes[2].set_xticks(range(1, 13))
    axes[2].set_xticklabels(month_names, rotation=45)

    plt.tight_layout()
    plt.savefig(dirs['figures'] / 'temporal_patterns.png', dpi=300, bbox_inches='tight')
    plt.show()

    # Print insights
    print("\n" + "=" * 70)
    print("TEMPORAL PATTERN ANALYSIS")
    print("=" * 70)
    print(f"\nPeak consumption hour: {hourly_mean.idxmax()}:00 ({hourly_mean.max():.2f} kW)")
    print(f"Lowest consumption hour: {hourly_mean.idxmin()}:00 ({hourly_mean.min():.2f} kW)")
    print(f"Peak consumption day: {day_names[weekly_mean.idxmax()]}")
    print(f"Peak consumption month: {month_names[monthly_mean.idxmax()-1]}")
    print("=" * 70)

analyze_temporal_patterns(df_raw)

19.3 Data Preprocessing

19.3.1 Handle Missing Values

def handle_missing_values(df, method='interpolate'):
    """
    Handle missing values in a time series

    Parameters:
        df: DataFrame with missing values
        method: 'interpolate', 'forward_fill', or 'drop'

    Returns:
        df_clean: DataFrame without missing values
    """
    print(f"Handling missing values using '{method}' method...")

    df_clean = df.copy()

    # Print missing values before
    missing_before = df_clean.isnull().sum().sum()
    print(f"  Missing values before: {missing_before:,}")

    if method == 'interpolate':
        # Time-weighted interpolation (requires a DatetimeIndex)
        df_clean = df_clean.interpolate(method='time', limit_direction='both')

    elif method == 'forward_fill':
        # Forward fill (carry the previous value), then backfill any leading gaps
        df_clean = df_clean.ffill().bfill()

    elif method == 'drop':
        # Drop rows with missing values
        df_clean = df_clean.dropna()

    else:
        raise ValueError(f"Unknown method: {method}")

    # Print missing values after
    missing_after = df_clean.isnull().sum().sum()
    print(f"  Missing values after: {missing_after:,}")
    print(f"  Records retained: {len(df_clean):,} ({len(df_clean)/len(df)*100:.2f}%)")
    print(f"✓ Missing values handled!")

    return df_clean

# Handle missing values
df_clean = handle_missing_values(df_raw, method='interpolate')

19.3.2 Resample Data

def resample_data(df, freq='1h', agg_method='mean'):
    """
    Resample the time series to a lower frequency

    Parameters:
        df: time series DataFrame
        freq: target frequency ('1h', '30min', '1D', etc.)
        agg_method: aggregation method ('mean', 'sum', 'min', 'max')

    Returns:
        df_resampled: the resampled DataFrame
    """
    print(f"Resampling data to {freq} frequency using {agg_method}...")

    df_resampled = df.resample(freq).agg(agg_method)

    print(f"✓ Resampling complete!")
    print(f"  Original records: {len(df):,}")
    print(f"  Resampled records: {len(df_resampled):,}")
    print(f"  Reduction: {(1 - len(df_resampled)/len(df))*100:.2f}%")

    return df_resampled

# Resample to hourly
df_hourly = resample_data(df_clean, freq='1h', agg_method='mean')

19.3.3 Normalization

def normalize_data(df, method='minmax'):
    """
    Normalize features for training

    Parameters:
        df: DataFrame to normalize
        method: 'minmax' or 'standard'

    Returns:
        df_normalized, scaler
    """
    print(f"Normalizing data using '{method}' method...")

    if method == 'minmax':
        scaler = MinMaxScaler()
    elif method == 'standard':
        scaler = StandardScaler()
    else:
        raise ValueError(f"Unknown method: {method}")

    # Fit and transform
    df_normalized = pd.DataFrame(
        scaler.fit_transform(df),
        index=df.index,
        columns=df.columns
    )

    print(f"✓ Normalization complete!")
    print(f"  Method: {method}")
    print(f"  Shape: {df_normalized.shape}")
    print(f"  Range: [{df_normalized.values.min():.3f}, {df_normalized.values.max():.3f}]")

    return df_normalized, scaler

# Normalize data
df_normalized, scaler = normalize_data(df_hourly[FEATURE_COLUMNS], method='minmax')

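Note that fitting the scaler on the full series lets statistics from the future test period leak into the training data. For this lab the simplification is acceptable, but a stricter variant fits the scaler on the training portion only; a sketch, assuming the same 70% chronological split used later:

# Leakage-free variant: fit on the first 70% of the series, transform everything
train_end = int(len(df_hourly) * 0.7)
scaler_strict = MinMaxScaler().fit(df_hourly[FEATURE_COLUMNS].iloc[:train_end])
df_strict = pd.DataFrame(
    scaler_strict.transform(df_hourly[FEATURE_COLUMNS]),  # uses train statistics only
    index=df_hourly.index,
    columns=FEATURE_COLUMNS
)
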
19.3.4 Create Sequences for LSTM

def create_sequences(data, seq_length, forecast_horizon, target_column=None):
    """
    Create sequences for LSTM training

    Parameters:
        data: DataFrame or array
        seq_length: Length of input sequence
        forecast_horizon: Number of steps to forecast
        target_column: Column name for target (if DataFrame)

    Returns:
        X, y: Arrays of sequences
    """
    print(f"Creating sequences...")
    print(f"  Sequence length: {seq_length}")
    print(f"  Forecast horizon: {forecast_horizon}")

    # Convert to numpy if DataFrame
    if isinstance(data, pd.DataFrame):
        if target_column:
            target_idx = data.columns.get_loc(target_column)
        else:
            target_idx = 0
        data_array = data.values
    else:
        data_array = data
        target_idx = 0

    X, y = [], []

    for i in range(len(data_array) - seq_length - forecast_horizon + 1):
        # Input sequence
        X.append(data_array[i:i+seq_length])

        # Target sequence (only target variable)
        y.append(data_array[i+seq_length:i+seq_length+forecast_horizon, target_idx])

    X = np.array(X)
    y = np.array(y)

    print(f"✓ Sequences created!")
    print(f"  X shape: {X.shape}")
    print(f"  y shape: {y.shape}")
    print(f"  Total sequences: {len(X):,}")

    return X, y

# Create sequences
X, y = create_sequences(
    df_normalized,
    seq_length=SEQUENCE_LENGTH,
    forecast_horizon=FORECAST_HORIZON,
    target_column=TARGET_COLUMN
)

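Keras also ships a utility that builds such windows lazily, which avoids materializing every sequence in memory. The sketch below shows the one-step-ahead variant (multi-step horizons, as used in this lab, require mapping the targets yourself); it yields a batched tf.data.Dataset rather than NumPy arrays:

# Memory-efficient alternative: one-step-ahead windows as a tf.data.Dataset
values = df_normalized.values
target_idx = df_normalized.columns.get_loc(TARGET_COLUMN)
ds = tf.keras.utils.timeseries_dataset_from_array(
    data=values[:-1],                              # windows of SEQUENCE_LENGTH inputs
    targets=values[SEQUENCE_LENGTH:, target_idx],  # value immediately after each window
    sequence_length=SEQUENCE_LENGTH,
    batch_size=BATCH_SIZE,
    shuffle=False                                  # preserve temporal order
)
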
19.3.5 Train-Validation-Test Split

def split_time_series(X, y, train_ratio=0.7, val_ratio=0.15):
    """
    Split time series data maintaining temporal order

    Parameters:
        X: Input sequences
        y: Target sequences
        train_ratio: Proportion for training
        val_ratio: Proportion for validation

    Returns:
        X_train, X_val, X_test, y_train, y_val, y_test
    """
    print(f"Splitting data (train: {train_ratio}, val: {val_ratio}, test: {1-train_ratio-val_ratio})...")

    n = len(X)
    train_size = int(n * train_ratio)
    val_size = int(n * val_ratio)

    # Split maintaining temporal order
    X_train = X[:train_size]
    y_train = y[:train_size]

    X_val = X[train_size:train_size+val_size]
    y_val = y[train_size:train_size+val_size]

    X_test = X[train_size+val_size:]
    y_test = y[train_size+val_size:]

    print(f"✓ Split complete!")
    print(f"  Training set: {len(X_train):,} sequences")
    print(f"  Validation set: {len(X_val):,} sequences")
    print(f"  Test set: {len(X_test):,} sequences")

    return X_train, X_val, X_test, y_train, y_val, y_test

# Split data
X_train, X_val, X_test, y_train, y_val, y_test = split_time_series(X, y)

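A single chronological split keeps things simple; for more robust model selection, scikit-learn's TimeSeriesSplit provides several expanding training windows, each validated on the period that follows it. A sketch:

# Expanding-window cross-validation (each fold respects temporal order)
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"Fold {fold}: train={len(train_idx):,}, val={len(val_idx):,}")
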
20 Part 2: LSTM from Scratch (Keras)

20.1 Simple LSTM Model

def build_simple_lstm(input_shape, output_shape, units=64):
    """
    Build simple LSTM model

    Architecture:
        LSTM(64) -> Dropout -> Dense(output_shape)

    Parameters:
        input_shape: (seq_length, n_features)
        output_shape: forecast_horizon
        units: Number of LSTM units

    Returns:
        model: Keras model
    """
    model = models.Sequential(name='SimpleLSTM')

    # Explicit Input layer (preferred in Keras 3 over passing input_shape to the first layer)
    model.add(layers.Input(shape=input_shape))

    # LSTM layer
    model.add(layers.LSTM(units, name='lstm'))
    model.add(layers.Dropout(0.2, name='dropout'))

    # Output layer
    model.add(layers.Dense(output_shape, name='output'))

    return model

# Build model
simple_lstm = build_simple_lstm(
    input_shape=(SEQUENCE_LENGTH, len(FEATURE_COLUMNS)),
    output_shape=FORECAST_HORIZON,
    units=64
)
simple_lstm.summary()

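The Keras layer hides the recurrence it performs. For reference, here is a single LSTM time step in plain NumPy; this is a didactic sketch with randomly initialized weights, not the trained layer (the i/f/g/o gate ordering is one common convention):

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*units, n_features), U: (4*units, units), b: (4*units,)."""
    z = W @ x_t + U @ h_prev + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell state
    c_t = f * c_prev + i * g                      # forget old memory, add new
    h_t = o * np.tanh(c_t)                        # new hidden state
    return h_t, c_t

# Toy usage: one step over the first timestep of the first training sequence
units, n_features = 64, len(FEATURE_COLUMNS)
rng = np.random.default_rng(42)
W = rng.normal(size=(4 * units, n_features))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)
h, c = lstm_step(X_train[0, 0], np.zeros(units), np.zeros(units), W, U, b)
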
20.2 Compile and Train

def compile_model(model, learning_rate=0.001):
    """Compile model with optimizer and loss"""

    optimizer = optimizers.Adam(learning_rate=learning_rate)

    model.compile(
        optimizer=optimizer,
        loss='mse',
        metrics=['mae', 'mse']
    )

    print(f"✓ Model compiled!")
    print(f"  Optimizer: Adam (lr={learning_rate})")
    print(f"  Loss: MSE")
    print(f"  Metrics: MAE, MSE")

compile_model(simple_lstm, learning_rate=LEARNING_RATE)

20.3 Training

def create_callbacks(model_name, monitor='val_loss', patience=10):
    """Create callbacks for training"""

    callbacks = [
        ModelCheckpoint(
            filepath=str(dirs['checkpoints'] / f'{model_name}_best.h5'),
            monitor=monitor,
            mode='min',
            save_best_only=True,
            verbose=1
        ),
        EarlyStopping(
            monitor=monitor,
            mode='min',
            patience=patience,
            restore_best_weights=True,
            verbose=1
        ),
        ReduceLROnPlateau(
            monitor=monitor,
            mode='min',
            factor=0.5,
            patience=5,
            min_lr=1e-7,
            verbose=1
        )
    ]

    print(f"✓ Created {len(callbacks)} callbacks")
    return callbacks

callbacks_simple = create_callbacks('simple_lstm', patience=15)

# Train model
history_simple = simple_lstm.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(X_val, y_val),
    callbacks=callbacks_simple,
    verbose=1
)

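After training, the loss curves are worth a quick look to spot over- or under-fitting; a minimal sketch using the History object returned by fit():

# Plot training vs. validation loss for the simple LSTM
plt.figure(figsize=(8, 4))
plt.plot(history_simple.history['loss'], label='train loss (MSE)')
plt.plot(history_simple.history['val_loss'], label='val loss (MSE)')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(dirs['figures'] / 'simple_lstm_loss.png', dpi=300, bbox_inches='tight')
plt.show()
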
The lab continues in the following parts with stacked LSTM, bidirectional LSTM, GRU models, the attention mechanism, PyTorch implementations, evaluation metrics (MAE, RMSE, MAPE), and prediction-vs-actual visualizations.