import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from matplotlib.colors import ListedColormap
# Generate non-linear data
np.random.seed(42)
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Classical ML: Logistic Regression (linear decision boundary)
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_score = lr_model.score(X_test, y_test)
# Deep Learning: MLP (non-linear decision boundary)
mlp_model = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=42)
mlp_model.fit(X_train, y_train)
mlp_score = mlp_model.score(X_test, y_test)
# Visualization
def plot_decision_boundary(model, X, y, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['#FF0000', '#0000FF']),
                edgecolors='k', s=50)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plt.sca(axes[0])
plot_decision_boundary(lr_model, X_test, y_test,
f'Logistic Regression\nAccuracy: {lr_score:.3f}')
plt.sca(axes[1])
plot_decision_boundary(mlp_model, X_test, y_test,
f'Neural Network (MLP)\nAccuracy: {mlp_score:.3f}')
plt.tight_layout()
plt.show()
print(f"Linear Model (Logistic Regression): {lr_score:.4f}")
print(f"Non-linear Model (MLP): {mlp_score:.4f}")
print(f"Improvement: {(mlp_score - lr_score) * 100:.2f}%")
Chapter 5: Multilayer Perceptron (MLP) Fundamentals
Foundations of Deep Learning and Neural Networks
After studying this chapter, you will be able to:
- Understand neural network architecture and the fundamental concepts of deep learning
- Implement MLPs using the Keras and PyTorch frameworks
- Apply optimization techniques (learning rate, batch size, epochs) for effective training
- Use regularization techniques (dropout, L2) to prevent overfitting
- Evaluate and visualize the training process of deep learning models
- Compare deep learning with classical machine learning approaches
5.1 Introduction: The Evolution from Classical ML to Deep Learning
5.1.1 Why Neural Networks?
After studying classical machine learning in Phase 2 (Chapters 1-4), you may be asking: why do we need deep learning? The answer lies in the limitations of classical ML and the unique capabilities of neural networks.
Limitations of Classical ML:
| Aspect | Classical ML | Deep Learning |
|---|---|---|
| Feature engineering | Manual, requires domain expertise | Automatic, hierarchical feature learning |
| Data complexity | Struggles with high-dimensional data | Excels on image, text, audio |
| Scalability | Performance plateaus on large datasets | Improves with more data |
| Representation | Shallow, single-layer features | Deep, hierarchical abstractions |
| Transfer learning | Limited | Highly effective |
Imagine recognizing someone's face:
- Layer 1: detects edges (lines, contours)
- Layer 2: detects parts (eyes, nose, mouth)
- Layer 3: detects patterns (facial arrangement)
- Layer 4: identifies the individual
Neural networks learn hierarchical representations like this automatically from data!
5.1.2 Deep Learning Success Stories
Deep learning has achieved breakthroughs across many domains:
Computer Vision:
- ImageNet (2012): AlexNet reduced the error rate from 26% to 15%
- Real-time object detection (YOLO, Faster R-CNN)
- Medical imaging: cancer detection with specialist-level accuracy
Natural Language Processing:
- Machine translation: Google Translate neural MT
- Large Language Models: GPT-4, Claude, Gemini
- Sentiment analysis, text generation
Speech & Audio:
- Speech recognition: Google Assistant, Siri
- Natural-sounding text-to-speech synthesis
- Music generation (Jukebox, MusicGen)
Game Playing:
- AlphaGo: defeated Lee Sedol (Go champion)
- AlphaZero: mastered Chess, Go, and Shogi from scratch
- OpenAI Five: championship-level Dota 2
Science & Research:
- AlphaFold: protein structure prediction
- Accelerated drug discovery
- Climate modeling
5.1.3 Deep Learning vs Classical ML: When to Use Which?
Use classical ML when:
- The dataset is small (<10,000 samples)
- Features are already well defined
- Interpretability is critical
- Computational resources are limited
- A quick baseline is needed
Use deep learning when:
- The dataset is large (>100,000 samples)
- The data is complex (images, text, audio)
- Feature engineering is difficult or unclear
- Computational resources (GPU) are available
- State-of-the-art performance is required
Insight: Neural networks can learn complex non-linear decision boundaries, while linear models are limited to linear separation.
5.2 Neural Network Architecture
5.2.1 The Perceptron: Fundamental Building Block
The perceptron is the basic computational unit of a neural network, inspired by the biological neuron.
flowchart LR
X1["x1 (Input 1)"] -->|w1| S["Σ (Weighted Sum)"]
X2["x2 (Input 2)"] -->|w2| S
X3["x3 (Input 3)"] -->|w3| S
B["b (Bias)"] -->|1| S
S --> A["σ (Activation)"]
A --> Y["y-hat (Output)"]
style S fill:#ffeb3b
style A fill:#4caf50
style Y fill:#2196f3
Perceptron Mathematics:
\[ \begin{aligned} z &= \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b \\ \hat{y} &= \sigma(z) = \sigma(\mathbf{w}^T \mathbf{x} + b) \end{aligned} \]
Where:
- \(\mathbf{x} = [x_1, x_2, ..., x_n]\): input features
- \(\mathbf{w} = [w_1, w_2, ..., w_n]\): weights
- \(b\): bias (intercept)
- \(\sigma\): activation function
- \(z\): pre-activation value
- \(\hat{y}\): output prediction
Key Components:
- Weights (w): learned parameters that control connection strength
- Bias (b): an offset that allows the function to shift
- Activation function: the non-linearity that enables learning complex patterns
import numpy as np
import matplotlib.pyplot as plt
# Perceptron implementation from scratch
class SimplePerceptron:
    def __init__(self, n_inputs):
        # Initialize weights and bias randomly
        self.weights = np.random.randn(n_inputs)
        self.bias = np.random.randn()

    def sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-z))

    def forward(self, x):
        """Forward pass: compute the output"""
        z = np.dot(x, self.weights) + self.bias
        return self.sigmoid(z)

    def predict(self, X):
        """Predict for multiple samples"""
        return np.array([self.forward(x) for x in X])
# Demo: the XOR problem (a classic non-linearly separable problem)
# A single perceptron CANNOT solve XOR
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0]) # XOR outputs
perceptron = SimplePerceptron(n_inputs=2)
predictions = perceptron.predict(X_xor)
print("XOR Problem with a Single Perceptron:")
print("Input (x1, x2) | True Label | Prediction | Correct?")
print("-" * 55)
for i in range(len(X_xor)):
    pred_label = 1 if predictions[i] >= 0.5 else 0
    correct = "✓" if pred_label == y_xor[i] else "✗"
    print(f"{X_xor[i]} | {y_xor[i]:^10d} | {predictions[i]:^10.3f} | {correct:^8s}")
print("\n⚠️ A single perceptron fails to solve the XOR problem!")
print("This motivated the development of the MLP (multiple layers).")
Limitations of a Single Perceptron:
A single perceptron can only learn linearly separable patterns. The XOR problem is the classic example of this limitation, which motivated the development of the Multi-Layer Perceptron (MLP).
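As a quick sketch of what the added layers buy us, the same XOR data can be fit with scikit-learn's `MLPClassifier` (the class already imported at the top of this chapter); the hyperparameters here (one hidden layer of 8 units, `lbfgs` solver) are illustrative choices, not prescribed values:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR inputs and labels
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# One hidden layer is enough to make XOR learnable;
# lbfgs is reliable on tiny datasets like this one.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=0)
clf.fit(X_xor, y_xor)
print(clf.predict(X_xor))
```

Unlike the single perceptron above, the hidden layer lets the network carve the plane into the two diagonal regions XOR requires.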
5.2.2 Multilayer Perceptron (MLP) Architecture
The MLP overcomes the limitations of the single perceptron by adding hidden layers.
graph LR
subgraph Input["Input Layer"]
X1["x₁"]
X2["x₂"]
X3["x₃"]
end
subgraph Hidden1["Hidden Layer 1"]
H11["h₁₁"]
H12["h₁₂"]
H13["h₁₃"]
H14["h₁₄"]
end
subgraph Hidden2["Hidden Layer 2"]
H21["h₂₁"]
H22["h₂₂"]
H23["h₂₃"]
end
subgraph Output["Output Layer"]
Y1["ŷ₁"]
Y2["ŷ₂"]
end
X1 --> H11 & H12 & H13 & H14
X2 --> H11 & H12 & H13 & H14
X3 --> H11 & H12 & H13 & H14
H11 --> H21 & H22 & H23
H12 --> H21 & H22 & H23
H13 --> H21 & H22 & H23
H14 --> H21 & H22 & H23
H21 --> Y1 & Y2
H22 --> Y1 & Y2
H23 --> Y1 & Y2
style Input fill:#e3f2fd
style Hidden1 fill:#fff3e0
style Hidden2 fill:#fff3e0
style Output fill:#e8f5e9
Terminology:
- Input layer: receives the features (no computation)
- Hidden layer(s): intermediate layers that learn representations
- Output layer: produces the predictions
- Depth: the number of hidden layers (deep = many layers)
- Width: the number of neurons per layer
Mathematical Notation:
For layer \(l\):
\[ \begin{aligned} \mathbf{z}^{[l]} &= \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]} \\ \mathbf{a}^{[l]} &= \sigma^{[l]}(\mathbf{z}^{[l]}) \end{aligned} \]
Where:
- \(\mathbf{W}^{[l]}\): weight matrix for layer \(l\)
- \(\mathbf{b}^{[l]}\): bias vector for layer \(l\)
- \(\mathbf{a}^{[l]}\): activations (output) of layer \(l\)
- \(\mathbf{a}^{[0]} = \mathbf{x}\): input features
- \(\sigma^{[l]}\): activation function for layer \(l\)
Matrix Sizes:
If layer \(l\) has \(n^{[l]}\) neurons and layer \(l-1\) has \(n^{[l-1]}\) neurons:
- \(\mathbf{W}^{[l]}\): shape \((n^{[l]}, n^{[l-1]})\)
- \(\mathbf{b}^{[l]}\): shape \((n^{[l]}, 1)\)
- \(\mathbf{a}^{[l]}\): shape \((n^{[l]}, 1)\) for a single sample
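These shape rules can be checked directly in NumPy. A minimal sketch for a tiny [3, 4, 2] network (the layer sizes are arbitrary example values):

```python
import numpy as np

# W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1).
layer_sizes = [3, 4, 2]
W1 = np.random.randn(layer_sizes[1], layer_sizes[0])  # (4, 3)
b1 = np.zeros((layer_sizes[1], 1))                    # (4, 1)
W2 = np.random.randn(layer_sizes[2], layer_sizes[1])  # (2, 4)
b2 = np.zeros((layer_sizes[2], 1))                    # (2, 1)

x = np.ones((layer_sizes[0], 1))   # a single sample as a column vector
a1 = np.maximum(0, W1 @ x + b1)    # ReLU hidden activations, shape (4, 1)
a2 = W2 @ a1 + b2                  # output pre-activations, shape (2, 1)
print(a1.shape, a2.shape)
```

Each matrix product `W @ a` maps a column vector of the previous layer's width to one of the current layer's width, which is exactly what the shape table above states.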
5.2.3 Activation Functions
Activation functions provide the non-linearity that is essential for learning complex patterns.
Comparison of Activation Functions:
| Function | Formula | Range | Use Case | Pros | Cons |
|---|---|---|---|---|---|
| Sigmoid | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | (0, 1) | Output layer (binary) | Smooth, probabilistic | Vanishing gradient |
| Tanh | \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\) | (-1, 1) | Hidden layers (older) | Zero-centered | Vanishing gradient |
| ReLU | \(\text{ReLU}(z) = \max(0, z)\) | [0, ∞) | Hidden layers (default) | Fast, no vanishing gradient | Dying ReLU |
| Leaky ReLU | \(\text{LReLU}(z) = \max(0.01z, z)\) | (-∞, ∞) | Hidden layers | Fixes dying ReLU | Hyperparameter α |
| Softmax | \(\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\) | (0, 1), sum = 1 | Output (multi-class) | Probabilities | Output layer only |
import numpy as np
import matplotlib.pyplot as plt
# Implementations of the activation functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
# Visualization
z = np.linspace(-5, 5, 200)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Sigmoid
axes[0, 0].plot(z, sigmoid(z), linewidth=3, color='#2196F3')
axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axhline(y=1, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('z')
axes[0, 0].set_ylabel('σ(z)')
axes[0, 0].text(-4, 0.9, 'Range: (0, 1)', fontsize=11, bbox=dict(boxstyle='round', facecolor='wheat'))
# Tanh
axes[0, 1].plot(z, tanh(z), linewidth=3, color='#4CAF50')
axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axhline(y=1, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axhline(y=-1, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_title('Tanh: tanh(z)', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('z')
axes[0, 1].set_ylabel('tanh(z)')
axes[0, 1].text(-4, 0.8, 'Range: (-1, 1)', fontsize=11, bbox=dict(boxstyle='round', facecolor='lightgreen'))
# ReLU
axes[1, 0].plot(z, relu(z), linewidth=3, color='#FF9800')
axes[1, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_title('ReLU: max(0, z)', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('z')
axes[1, 0].set_ylabel('ReLU(z)')
axes[1, 0].text(-4, 4, 'Range: [0, ∞)', fontsize=11, bbox=dict(boxstyle='round', facecolor='#FFE0B2'))
# Leaky ReLU
axes[1, 1].plot(z, leaky_relu(z), linewidth=3, color='#9C27B0')
axes[1, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_title('Leaky ReLU: max(0.01z, z)', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('z')
axes[1, 1].set_ylabel('Leaky ReLU(z)')
axes[1, 1].text(-4, 4, 'Range: (-∞, ∞)', fontsize=11, bbox=dict(boxstyle='round', facecolor='#E1BEE7'))
plt.tight_layout()
plt.show()
print("Activation Function Characteristics:")
print("-" * 70)
print("Sigmoid   : Smooth, outputs (0,1), ⚠️ vanishing gradient problem")
print("Tanh      : Zero-centered, outputs (-1,1), ⚠️ vanishing gradient")
print("ReLU      : Fast, sparse activation, ⚠️ dying ReLU (neurons die)")
print("Leaky ReLU: Solves dying ReLU, allows small negative values")
print("\n✅ Recommendation: use ReLU for hidden layers (default choice)")
Hidden Layers:
- Default: ReLU (fast, effective, sparse)
- Alternative: Leaky ReLU (avoids dying neurons)
- Avoid: Sigmoid/Tanh (vanishing gradients in deep networks)
Output Layer:
- Binary classification: Sigmoid (outputs a probability in 0-1)
- Multi-class classification: Softmax (probabilities sum to 1)
- Regression: Linear (no activation; outputs real values)
5.2.4 Forward Propagation
Forward propagation is the process of computing the output from the input, layer by layer.
import numpy as np
class MLPFromScratch:
    """MLP implementation from scratch for educational purposes"""

    def __init__(self, layer_sizes):
        """
        Initialize the MLP with the specified architecture
        Args:
            layer_sizes: List of integers, e.g., [3, 4, 4, 2]
                Input: 3 neurons
                Hidden1: 4 neurons
                Hidden2: 4 neurons
                Output: 2 neurons
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.parameters = {}
        # Initialize weights and biases
        np.random.seed(42)
        for l in range(1, self.num_layers):
            # He initialization for ReLU
            self.parameters[f'W{l}'] = np.random.randn(
                layer_sizes[l], layer_sizes[l-1]
            ) * np.sqrt(2.0 / layer_sizes[l-1])
            self.parameters[f'b{l}'] = np.zeros((layer_sizes[l], 1))

    def relu(self, Z):
        """ReLU activation"""
        return np.maximum(0, Z)

    def softmax(self, Z):
        """Softmax activation (for the output layer)"""
        exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # numerical stability
        return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

    def forward_propagation(self, X):
        """
        Forward pass through the network
        Args:
            X: Input data (n_features, m_samples)
        Returns:
            AL: Output activations
            cache: Dictionary containing Z and A for each layer
        """
        cache = {'A0': X}
        A = X
        # Hidden layers with ReLU
        for l in range(1, self.num_layers - 1):
            Z = self.parameters[f'W{l}'] @ A + self.parameters[f'b{l}']
            A = self.relu(Z)
            cache[f'Z{l}'] = Z
            cache[f'A{l}'] = A
        # Output layer with softmax
        l = self.num_layers - 1
        Z = self.parameters[f'W{l}'] @ A + self.parameters[f'b{l}']
        A = self.softmax(Z)
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
        return A, cache

    def predict(self, X):
        """Predict class labels"""
        A, _ = self.forward_propagation(X)
        return np.argmax(A, axis=0)
# Demo: solving the XOR problem with an MLP
print("Demo: MLP Solving the XOR Problem")
print("=" * 60)
# XOR data
X_xor = np.array([[0, 0, 1, 1],
[0, 1, 0, 1]]) # (2 features, 4 samples)
y_xor = np.array([0, 1, 1, 0]) # XOR labels
# Create MLP: 2 inputs -> 4 hidden -> 2 outputs
mlp = MLPFromScratch(layer_sizes=[2, 4, 2])
# Forward pass
output, cache = mlp.forward_propagation(X_xor)
print(f"\nNetwork Architecture: {mlp.layer_sizes}")
print(f"Input  : {mlp.layer_sizes[0]} neurons")
print(f"Hidden : {mlp.layer_sizes[1]} neurons (ReLU)")
print(f"Output : {mlp.layer_sizes[2]} neurons (Softmax)")
print("\nForward Propagation Results:")
print("-" * 60)
for i in range(X_xor.shape[1]):
    print(f"Input: {X_xor[:, i]} | Output probs: {output[:, i]} | Prediction: {np.argmax(output[:, i])}")
print("\n⚠️ Note: these are RANDOMLY INITIALIZED weights (no training yet)")
print("After training with backpropagation, the MLP will solve XOR!")
print("\nWe will study the training process in the next section.")
The Forward Propagation Steps:
Input Layer: \(\mathbf{a}^{[0]} = \mathbf{x}\)
Hidden Layer 1:
- \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]}\)
- \(\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]})\)
Hidden Layer 2:
- \(\mathbf{z}^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + \mathbf{b}^{[2]}\)
- \(\mathbf{a}^{[2]} = \text{ReLU}(\mathbf{z}^{[2]})\)
Output Layer:
- \(\mathbf{z}^{[3]} = \mathbf{W}^{[3]} \mathbf{a}^{[2]} + \mathbf{b}^{[3]}\)
- \(\mathbf{a}^{[3]} = \text{Softmax}(\mathbf{z}^{[3]})\)
5.2.5 Network Topology Considerations
How many hidden layers?
| Network Type | Hidden Layers | Use Case |
|---|---|---|
| Shallow | 1 | Simple patterns, XOR, small data |
| Medium | 2-3 | Most practical problems |
| Deep | 4+ | Complex hierarchical features (images, text) |
How many neurons per layer?
Rules of Thumb:
- Start with neurons = mean(input_size, output_size)
- Hidden layers > input layer (but not excessively large)
- Pyramid shape: gradually decreasing widths (e.g., 128 → 64 → 32)
- The same width for all hidden layers is also fine (e.g., 64 → 64 → 64)
Experimentation matters! There is no perfect formula; it depends on:
- Data complexity
- Number of samples
- Overfitting/underfitting
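One practical way to compare candidate topologies is by their parameter count, since each dense layer contributes n[l] × n[l-1] weights plus n[l] biases. A small sketch (the example architectures are illustrative):

```python
def count_params(layer_sizes):
    """Total trainable parameters of a fully connected MLP:
    each layer l adds n[l] * n[l-1] weights plus n[l] biases."""
    return sum(layer_sizes[l] * layer_sizes[l - 1] + layer_sizes[l]
               for l in range(1, len(layer_sizes)))

print(count_params([2, 4, 2]))          # tiny XOR-style network: 22
print(count_params([784, 128, 64, 10])) # MNIST-style MLP: 109386
```

Wider or deeper networks grow this count quickly, which is one reason large hidden layers need more data (and more regularization) to avoid overfitting.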
5.3 Training Neural Networks
5.3.1 Loss Functions
The loss function measures how far the predictions are from the true values.
Binary Classification: Binary Cross-Entropy
\[ \mathcal{L}(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right] \]
Multi-class Classification: Categorical Cross-Entropy
\[ \mathcal{L}(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)}) \]
Regression: Mean Squared Error (MSE)
\[ \mathcal{L}(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2 \]
import numpy as np
import matplotlib.pyplot as plt
def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy loss"""
    epsilon = 1e-15  # for numerical stability
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Categorical cross-entropy loss"""
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def mean_squared_error(y_true, y_pred):
    """Mean squared error loss"""
    return np.mean((y_true - y_pred) ** 2)
# Visualization: Binary Cross-Entropy vs MSE
y_true_binary = 1 # True label = 1
y_pred_range = np.linspace(0.01, 0.99, 100)
bce_loss = [-np.log(yp) for yp in y_pred_range]
mse_loss = [(1 - yp)**2 for yp in y_pred_range]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Binary Cross-Entropy
axes[0].plot(y_pred_range, bce_loss, linewidth=3, color='#E91E63')
axes[0].axvline(x=1.0, color='g', linestyle='--', label='Perfect prediction (y=1)', alpha=0.7)
axes[0].set_xlabel('Predicted Probability', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Binary Cross-Entropy Loss\n(True label = 1)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# MSE
axes[1].plot(y_pred_range, mse_loss, linewidth=3, color='#3F51B5')
axes[1].axvline(x=1.0, color='g', linestyle='--', label='Perfect prediction (y=1)', alpha=0.7)
axes[1].set_xlabel('Predicted Probability', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Mean Squared Error Loss\n(True label = 1)', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Loss Function Comparison:")
print("-" * 70)
print("Binary Cross-Entropy:")
print("  ✅ Suited to probability outputs (sigmoid)")
print("  ✅ Asymmetric punishment (penalizes confident wrong predictions heavily)")
print("  ✅ Better gradient flow for classification")
print("\nMean Squared Error:")
print("  ✅ Suited to regression")
print("  ⚠️ Suboptimal for classification (symmetric punishment)")
Classification:
- Binary: Binary Cross-Entropy (with Sigmoid activation)
- Multi-class: Categorical Cross-Entropy (with Softmax activation)
Regression:
- MSE: default choice, sensitive to outliers
- MAE: robust to outliers
- Huber: a combination of MSE and MAE
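Since Huber is only named above, here is a minimal NumPy sketch showing how it blends the two: quadratic like MSE for small errors, linear like MAE beyond a threshold `delta` (the threshold value is a conventional default, not prescribed by this chapter):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: 0.5*err^2 for |err| <= delta (MSE-like),
    delta*(|err| - 0.5*delta) otherwise (MAE-like, robust to outliers)."""
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))

print(huber_loss(np.array([1.0]), np.array([0.5])))  # small error: quadratic branch
print(huber_loss(np.array([3.0]), np.array([0.0])))  # large error: linear branch
```

Because the penalty grows only linearly for large errors, a single outlier pulls the fit far less than under MSE.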
5.3.2 The Backpropagation Algorithm
Backpropagation is the algorithm for computing the gradients of the loss function with respect to all parameters.
Intuition: backpropagation uses the chain rule from calculus to compute gradients layer by layer, from the output back to the input.
graph RL
L["Loss L"] -->|"∂L/∂a³"| A3["Output a³"]
A3 -->|"∂a³/∂z³"| Z3["z³ = W³a² + b³"]
Z3 -->|"∂z³/∂W³, ∂z³/∂b³"| W3["W³, b³"]
Z3 -->|"∂z³/∂a²"| A2["Hidden a²"]
A2 -->|"∂a²/∂z²"| Z2["z² = W²a¹ + b²"]
Z2 -->|"∂z²/∂W², ∂z²/∂b²"| W2["W², b²"]
Z2 -->|"∂z²/∂a¹"| A1["Hidden a¹"]
A1 -->|"∂a¹/∂z¹"| Z1["z¹ = W¹x + b¹"]
Z1 -->|"∂z¹/∂W¹, ∂z¹/∂b¹"| W1["W¹, b¹"]
style L fill:#f44336
style W3 fill:#4caf50
style W2 fill:#4caf50
style W1 fill:#4caf50
Backpropagation Mathematics (Simplified):
For the output layer \(L\):
\[ \delta^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[L]}} = \mathbf{a}^{[L]} - \mathbf{y} \]
For hidden layer \(l\):
\[ \delta^{[l]} = (\mathbf{W}^{[l+1]})^T \delta^{[l+1]} \odot \sigma'(\mathbf{z}^{[l]}) \]
Gradients for the parameters:
\[ \begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} &= \frac{1}{m} \delta^{[l]} (\mathbf{a}^{[l-1]})^T \\ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} &= \frac{1}{m} \sum_{i} \delta^{[l](i)} \end{aligned} \]
You do not need to implement backpropagation by hand! Modern frameworks (TensorFlow, PyTorch) use automatic differentiation to compute gradients automatically.
Understanding backpropagation is still important for: 1. Debugging training issues 2. Designing custom architectures 3. Understanding vanishing/exploding gradients
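To make the chain rule concrete, consider the smallest possible case: one sigmoid neuron with binary cross-entropy, where the chain rule collapses to dL/dw = (a − y)·x (the same (a − y) term as in the output-layer formula above). A gradient check against a numerical derivative confirms it; the parameter values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b, x, y = 0.5, -0.3, 1.2, 1.0
a = sigmoid(w * x + b)
analytic = (a - y) * x  # chain rule: dL/da * da/dz * dz/dw = (a - y) * x

# Central-difference numerical gradient as a sanity check
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numeric)
```

This kind of gradient check is also a standard debugging tool when implementing custom layers.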
5.3.3 Gradient Descent Variants
Gradient descent updates the parameters in the direction that reduces the loss.
Basic Update Rule:
\[ \mathbf{W} := \mathbf{W} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}} \]
Where \(\alpha\) is the learning rate.
Variants:
| Method | Update | Characteristics |
|---|---|---|
| Batch GD | Uses all the data | Accurate but slow, memory intensive |
| Stochastic GD | Uses 1 sample | Fast but noisy, poor convergence |
| Mini-batch GD | Uses a batch (32-512) | Balance of speed and stability |
| SGD + Momentum | \(v := \beta v + (1-\beta) \nabla\) | Smoother, faster convergence |
| RMSprop | Adaptive learning rate per parameter | Good for RNNs |
| Adam | Momentum + RMSprop | Default choice: adaptive, robust |
import numpy as np
import matplotlib.pyplot as plt
# Simulate optimization with several optimizers
def optimize_path(optimizer_func, start_point, learning_rate, iterations):
    """Simulate an optimization path"""
    path = [start_point]
    point = np.array(start_point, dtype=float)
    for _ in range(iterations):
        # Simple 2D function: f(x,y) = x² + 4y²
        gradient = np.array([2*point[0], 8*point[1]])
        point = optimizer_func(point, gradient, learning_rate)
        path.append(point.copy())
    return np.array(path)

def sgd(point, gradient, lr):
    """Standard SGD"""
    return point - lr * gradient

def sgd_momentum(point, gradient, lr, momentum=0.9):
    """SGD with momentum"""
    if not hasattr(sgd_momentum, 'velocity') or sgd_momentum.velocity is None:
        sgd_momentum.velocity = np.zeros_like(point)
    sgd_momentum.velocity = momentum * sgd_momentum.velocity + lr * gradient
    return point - sgd_momentum.velocity

def adam_optimizer(point, gradient, lr, beta1=0.9, beta2=0.999):
    """Adam optimizer (simplified)"""
    if not hasattr(adam_optimizer, 'm') or adam_optimizer.m is None:
        adam_optimizer.m = np.zeros_like(point)
        adam_optimizer.v = np.zeros_like(point)
        adam_optimizer.t = 0
    adam_optimizer.t += 1
    adam_optimizer.m = beta1 * adam_optimizer.m + (1 - beta1) * gradient
    adam_optimizer.v = beta2 * adam_optimizer.v + (1 - beta2) * (gradient ** 2)
    m_hat = adam_optimizer.m / (1 - beta1 ** adam_optimizer.t)
    v_hat = adam_optimizer.v / (1 - beta2 ** adam_optimizer.t)
    return point - lr * m_hat / (np.sqrt(v_hat) + 1e-8)
# Visualize the convergence paths
start = [2.0, 2.0]
iterations = 50
# Reset the momentum/Adam state
sgd_momentum.velocity = None
adam_optimizer.m = None
adam_optimizer.v = None
adam_optimizer.t = None
path_sgd = optimize_path(sgd, start, learning_rate=0.1, iterations=iterations)
# Momentum path (velocity already reset above)
path_momentum = optimize_path(
lambda p, g, lr: sgd_momentum(p, g, lr, momentum=0.9),
start, learning_rate=0.1, iterations=iterations
)
# Adam path (state already reset above)
path_adam = optimize_path(
lambda p, g, lr: adam_optimizer(p, g, lr),
start, learning_rate=0.1, iterations=iterations
)
# Plot
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
# Contour plot of loss landscape
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + 4*Y**2 # Loss function
contour = ax.contour(X, Y, Z, levels=20, alpha=0.3)
ax.clabel(contour, inline=True, fontsize=8)
# Plot paths
ax.plot(path_sgd[:, 0], path_sgd[:, 1], 'o-', label='SGD', linewidth=2, markersize=4)
ax.plot(path_momentum[:, 0], path_momentum[:, 1], 's-', label='SGD + Momentum', linewidth=2, markersize=4)
ax.plot(path_adam[:, 0], path_adam[:, 1], '^-', label='Adam', linewidth=2, markersize=4)
# Mark start and optimum
ax.plot(*start, 'r*', markersize=20, label='Start')
ax.plot(0, 0, 'g*', markersize=20, label='Optimum')
ax.set_xlabel('Parameter 1', fontsize=12)
ax.set_ylabel('Parameter 2', fontsize=12)
ax.set_title('Optimizer Convergence Paths', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Optimizer Comparison:")
print("-" * 70)
print(f"SGD          : {iterations} iterations, final point: {path_sgd[-1]}")
print(f"SGD+Momentum : {iterations} iterations, final point: {path_momentum[-1]}")
print(f"Adam         : {iterations} iterations, final point: {path_adam[-1]}")
print("\n✅ Adam converges most smoothly and efficiently!")
Default Recommendations:
- Adam: best all-around choice, adaptive learning rates
- SGD + Momentum: good for very large datasets, simple
- RMSprop: an alternative for RNN/LSTM
Hyperparameters:
- Learning rate: start with 0.001 (Adam) or 0.01 (SGD)
- Beta1 (momentum): 0.9 (default)
- Beta2 (RMSprop): 0.999 (default)
5.3.4 Learning Rate and Scheduling
The learning rate is the most important hyperparameter in deep learning.
Impact of the Learning Rate:
- Too small: training is very slow and can get stuck in local minima
- Too large: unstable, divergence, the loss explodes
- Just right: smooth convergence, optimal final performance
import numpy as np
import matplotlib.pyplot as plt
# Simulate training with several learning rates
def simulate_training(learning_rate, iterations=100):
    """Simulate a simple loss curve"""
    np.random.seed(42)
    loss = 10.0
    losses = []
    for i in range(iterations):
        # Simulated gradient (decreasing over time)
        gradient = 5.0 * np.exp(-i/30) + np.random.normal(0, 0.5)
        loss = max(0.01, loss - learning_rate * gradient)
        losses.append(loss)
    return losses
# A range of learning rates
lrs = [0.001, 0.01, 0.1, 0.5, 1.0]
colors = ['#4CAF50', '#2196F3', '#FF9800', '#F44336', '#9C27B0']
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
for lr, color in zip(lrs, colors):
    losses = simulate_training(lr, iterations=100)
    ax.plot(losses, linewidth=2.5, label=f'LR = {lr}', color=color)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Impact of Learning Rate on Training', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 12])
plt.tight_layout()
plt.show()
print("Learning Rate Guidelines:")
print("-" * 70)
print("0.001 - 0.01 : Safe default for the Adam optimizer")
print("0.01 - 0.1   : For SGD with momentum")
print("0.1 - 1.0    : Too large for most cases, unstable")
print("\n💡 Tip: use a learning rate scheduler for adaptive adjustment!")
Learning Rate Scheduling Strategies:
Step Decay: reduce the LR by a factor every N epochs
LR = LR₀ × γ^(epoch / step_size)
Exponential Decay: smooth exponential decrease
LR = LR₀ × e^(-λ × epoch)
Cosine Annealing: smooth decrease following a cosine curve
LR = LR_min + 0.5 × (LR_max - LR_min) × (1 + cos(π × epoch / max_epochs))
ReduceLROnPlateau: reduce when the validation loss plateaus
- Monitor a validation metric
- Reduce the LR if it does not improve for N epochs
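The first three schedules above are simple enough to sketch directly; the function names and the interpretation of `epoch / step_size` as an integer step count are illustrative assumptions, not any framework's API:

```python
import math

def step_decay(lr0, gamma, step_size, epoch):
    """Multiply the LR by gamma once every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

def exp_decay(lr0, lam, epoch):
    """Smooth exponential decrease with rate lam."""
    return lr0 * math.exp(-lam * epoch)

def cosine_anneal(lr_min, lr_max, epoch, max_epochs):
    """Cosine curve from lr_max (epoch 0) down to lr_min (max_epochs)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / max_epochs))

print(step_decay(0.1, 0.5, 10, 25))      # two decay steps: 0.025
print(cosine_anneal(0.0, 1.0, 50, 100))  # halfway down the cosine curve
```

In practice one rarely writes these by hand; Keras and PyTorch ship equivalent schedulers, but the formulas they implement are exactly the ones above.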
5.3.5 Batch Size Considerations
The batch size affects:
- Memory usage: larger batch = more memory
- Training speed: larger batch = fewer updates per epoch
- Generalization: smaller batches often generalize better
- Gradient stability: larger batch = more stable gradients
Common Batch Sizes:
- Small: 16-32 (better generalization, less memory)
- Medium: 32-128 (balanced)
- Large: 256-512+ (faster on GPU, may need LR adjustment)
When increasing the batch size, consider:
- Increasing the learning rate proportionally (the linear scaling rule)
- Using warmup: gradually increase the LR at the start of training
Rule of thumb: new_LR = base_LR × (new_batch_size / base_batch_size)
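The rule of thumb above is a one-liner in code; this helper is purely illustrative:

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: scale the LR in proportion to the batch size."""
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.1, 256, 1024))  # 4x the batch size -> 4x the learning rate
```

Treat the result as a starting point only; with very large batches, warmup and a short tuning sweep are still advisable.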
5.4 Regularization Techniques
Regularization prevents overfitting by constraining model complexity.
5.4.1 Overfitting in Neural Networks
Signs of Overfitting:
- The training loss keeps decreasing while the validation loss increases
- A large gap between training and validation accuracy
- The model performs poorly on unseen data
import numpy as np
import matplotlib.pyplot as plt
# Simulate an overfitting scenario
np.random.seed(42)
epochs = np.arange(1, 101)
# Training loss: monotonically decreasing
train_loss = 2.0 * np.exp(-epochs/15) + 0.05
# Validation loss: decreases then increases (overfitting)
val_loss = 2.0 * np.exp(-epochs/20) + 0.1 + 0.003 * (epochs - 30)**2 * (epochs > 30)
# Training accuracy: monotonically increasing
train_acc = 1.0 - 1.0 * np.exp(-epochs/12)
# Validation accuracy: increases then plateaus/decreases
val_acc = 1.0 - 1.0 * np.exp(-epochs/15) - 0.002 * (epochs - 40) * (epochs > 40)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Loss plot
ax1.plot(epochs, train_loss, linewidth=3, label='Training Loss', color='#2196F3')
ax1.plot(epochs, val_loss, linewidth=3, label='Validation Loss', color='#F44336')
ax1.axvline(x=30, color='g', linestyle='--', alpha=0.7, label='Optimal Stopping Point')
ax1.fill_between(epochs, 0, 3, where=(epochs > 30), alpha=0.2, color='red', label='Overfitting Zone')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Overfitting: Loss Curves', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 2.5])
# Accuracy plot
ax2.plot(epochs, train_acc, linewidth=3, label='Training Accuracy', color='#2196F3')
ax2.plot(epochs, val_acc, linewidth=3, label='Validation Accuracy', color='#F44336')
ax2.axvline(x=30, color='g', linestyle='--', alpha=0.7, label='Optimal Stopping Point')
ax2.fill_between(epochs, 0, 1, where=(epochs > 30), alpha=0.2, color='red', label='Overfitting Zone')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy', fontsize=12)
ax2.set_title('Overfitting: Accuracy Curves', fontsize=13, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 1.05])
plt.tight_layout()
plt.show()
print("Overfitting Detection:")
print("-" * 70)
print("✅ Training loss keeps falling while validation loss rises → OVERFIT")
print("✅ Large gap between training and validation metrics → OVERFIT")
print("✅ Model perfect on training data but poor on test data → OVERFIT")
print("\n💡 Solution: regularization techniques!")
5.4.2 Dropout Technique
Dropout is a regularization technique that randomly drops neurons during training.
How It Works:
- At each training step, randomly set a fraction p of neurons to zero
- Scale the remaining neurons by 1/(1-p) to maintain the expected output
- At inference (testing), use all neurons (no dropout)
Effects of Dropout:
- Prevents neurons from becoming too specialized (co-adaptation)
- Ensemble effect: trains many "sub-networks"
- Forces the network to learn robust features
graph LR
subgraph Normal["Normal Forward Pass"]
I1["Input"] --> H1["Hidden"]
H1 --> H2["Hidden"]
H2 --> O1["Output"]
end
subgraph Dropout["With Dropout (p=0.5)"]
I2["Input"] --> H3["Hidden ✓"]
I2 -.->|dropped| H4["Hidden ✗"]
H3 --> H5["Hidden ✓"]
H3 -.->|dropped| H6["Hidden ✗"]
H5 --> O2["Output"]
end
style H4 fill:#ffcdd2,stroke:#f44336
style H6 fill:#ffcdd2,stroke:#f44336
style H3 fill:#c8e6c9,stroke:#4caf50
style H5 fill:#c8e6c9,stroke:#4caf50
import numpy as np
import matplotlib.pyplot as plt
class DropoutLayer:
"""Dropout layer implemented from scratch"""
def __init__(self, dropout_rate=0.5):
"""
Args:
dropout_rate: Fraction of neurons to drop (0.0 - 1.0)
"""
self.dropout_rate = dropout_rate
self.mask = None
def forward(self, X, training=True):
"""
Forward pass with dropout
Args:
X: Input activations
training: If True, apply dropout; if False, no dropout
"""
if training:
# Generate random mask
self.mask = np.random.binomial(1, 1 - self.dropout_rate, size=X.shape)
# Apply the mask and scale
return X * self.mask / (1 - self.dropout_rate)
else:
# No dropout during inference
return X
def backward(self, dout):
"""Backward pass: propagate gradients only through active neurons"""
return dout * self.mask / (1 - self.dropout_rate)
# Demo: visualize the dropout effect
np.random.seed(42)
layer_size = 20
input_activations = np.random.randn(layer_size)
dropout_rates = [0.0, 0.2, 0.5, 0.8]
fig, axes = plt.subplots(1, len(dropout_rates), figsize=(16, 4))
for idx, dropout_rate in enumerate(dropout_rates):
dropout = DropoutLayer(dropout_rate=dropout_rate)
output = dropout.forward(input_activations.copy(), training=True)
# Visualize
ax = axes[idx]
neurons_active = (dropout.mask > 0)
colors = ['#4CAF50' if active else '#F44336' for active in neurons_active]
ax.bar(range(layer_size), np.abs(input_activations), color='lightgray', alpha=0.5, label='Original')
ax.bar(range(layer_size), np.abs(output), color=colors, alpha=0.8, label='After Dropout')
ax.set_title(f'Dropout Rate = {dropout_rate}\n({int((1-dropout_rate)*100)}% neurons active)',
fontsize=12, fontweight='bold')
ax.set_xlabel('Neuron Index')
ax.set_ylabel('Activation')
ax.set_ylim([0, 4])
if idx == 0:
ax.legend()
plt.tight_layout()
plt.show()
print("Dropout Guidelines:")
print("-" * 70)
print("Dropout Rate 0.0 : No dropout (no regularization)")
print("Dropout Rate 0.2 : Light regularization")
print("Dropout Rate 0.5 : Standard choice (recommended)")
print("Dropout Rate 0.8 : Heavy regularization (might underfit)")
print("\n✅ Typical: 0.2-0.5 for hidden layers, 0.1-0.2 for the input layer")
Where to apply:
- Hidden layers (after activation)
- Input layer (lower rate: 0.1-0.2)
- Avoid: Output layer
Dropout rates:
- Default: 0.5 for hidden layers
- Input layer: 0.1-0.2 (lower rate)
- Deep networks: 0.2-0.4 (lower rates for deeper networks)
When NOT to use Dropout:
- Convolutional layers (use other regularization)
- Small datasets (it might cause underfitting)
- When Batch Normalization is already used
5.4.3 L1/L2 Regularization
L2 Regularization (Weight Decay):
Add a penalty term to the loss function:
\[ \mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{original}} + \frac{\lambda}{2m} \sum_{l} \|\mathbf{W}^{[l]}\|_F^2 \]
Effect:
- Penalizes large weights
- Weights shrink toward zero (but not exactly zero)
- Prevents overfitting by forcing a simpler model
L1 Regularization:
\[ \mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{original}} + \frac{\lambda}{m} \sum_{l} \|\mathbf{W}^{[l]}\|_1 \]
Effect:
- Promotes sparsity (many weights become exactly zero)
- Feature selection effect
- Less common in deep learning
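To make the weight-decay effect concrete, here is a small NumPy sketch of one gradient step under the L2 penalty from the formula above (the weights, λ, and learning rate are illustrative values):

```python
import numpy as np

def sgd_step_l2(W, data_grad, lr, lam, m):
    """SGD step on L_reg = L + (lam/(2m))*||W||^2.
    The penalty's gradient is (lam/m)*W, so it pulls every weight toward zero."""
    return W - lr * (data_grad + (lam / m) * W)

W = np.array([2.0, -3.0])
# With the data gradient set to zero, the penalty alone shrinks W
# multiplicatively by a factor (1 - lr*lam/m) = 0.95 here
W_new = sgd_step_l2(W, np.zeros(2), lr=0.1, lam=0.5, m=1)
```

This multiplicative shrinkage is why L2 is called "weight decay": absent any data signal, weights decay geometrically toward zero but never reach it exactly, in contrast to L1's sparsity.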
import numpy as np
import matplotlib.pyplot as plt
# Visualize the effect of L2 regularization
np.random.seed(42)
# Generate weight distributions
weights_no_reg = np.random.randn(1000) * 2.0
weights_l2_light = weights_no_reg * 0.7
weights_l2_heavy = weights_no_reg * 0.4
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# No regularization
axes[0].hist(weights_no_reg, bins=50, color='#F44336', alpha=0.7, edgecolor='black')
axes[0].axvline(x=0, color='k', linestyle='--', linewidth=2)
axes[0].set_title('No Regularization\nλ = 0', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Weight Value')
axes[0].set_ylabel('Frequency')
axes[0].set_xlim([-6, 6])
# Light L2
axes[1].hist(weights_l2_light, bins=50, color='#FF9800', alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='k', linestyle='--', linewidth=2)
axes[1].set_title('Light L2 Regularization\nλ = 0.01', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Weight Value')
axes[1].set_ylabel('Frequency')
axes[1].set_xlim([-6, 6])
# Heavy L2
axes[2].hist(weights_l2_heavy, bins=50, color='#4CAF50', alpha=0.7, edgecolor='black')
axes[2].axvline(x=0, color='k', linestyle='--', linewidth=2)
axes[2].set_title('Heavy L2 Regularization\nλ = 0.1', fontsize=13, fontweight='bold')
axes[2].set_xlabel('Weight Value')
axes[2].set_ylabel('Frequency')
axes[2].set_xlim([-6, 6])
plt.tight_layout()
plt.show()
print("L2 Regularization Effect:")
print("-" * 70)
print("✅ Weights shrink toward zero (weight decay)")
print("✅ Prevents extreme weight values")
print("✅ Model becomes more robust, less sensitive to individual features")
print("\nTypical λ values:")
print("  Small:  0.0001 - 0.001")
print("  Medium: 0.01 - 0.1")
print("  Large:  0.1+")
5.4.4 Early Stopping
Early stopping: stop training when validation performance stops improving.
Algorithm:
- Monitor the validation loss every epoch
- Track the best validation loss
- If it does not improve for N epochs (patience), stop
- Return the model with the best validation performance
import numpy as np
import matplotlib.pyplot as plt
# Simulate early stopping
np.random.seed(42)
epochs = np.arange(1, 101)
# Training loss: monotonically decreasing
train_loss = 2.0 * np.exp(-epochs/15) + 0.05 + np.random.normal(0, 0.02, len(epochs))
# Validation loss: decreases then increases
val_loss = 2.0 * np.exp(-epochs/20) + 0.1 + 0.003 * (epochs - 30)**2 * (epochs > 30)
val_loss += np.random.normal(0, 0.05, len(epochs))
# Early stopping logic
patience = 10
best_val_loss = float('inf')
best_epoch = 0
patience_counter = 0
for epoch in range(len(epochs)):
if val_loss[epoch] < best_val_loss:
best_val_loss = val_loss[epoch]
best_epoch = epoch
patience_counter = 0
else:
patience_counter += 1
if patience_counter >= patience:
stop_epoch = epoch
break
else:
stop_epoch = len(epochs) - 1
# Visualization
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(epochs, train_loss, linewidth=2.5, label='Training Loss', color='#2196F3')
ax.plot(epochs, val_loss, linewidth=2.5, label='Validation Loss', color='#F44336')
ax.axvline(x=best_epoch+1, color='g', linestyle='--', linewidth=2,
label=f'Best Model (epoch {best_epoch+1})')
ax.axvline(x=stop_epoch+1, color='r', linestyle='--', linewidth=2,
label=f'Early Stop (epoch {stop_epoch+1})')
ax.scatter([best_epoch+1], [val_loss[best_epoch]], color='g', s=200, zorder=5, marker='*')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title(f'Early Stopping (Patience = {patience})', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 2.5])
plt.tight_layout()
plt.show()
print("Early Stopping Summary:")
print("-" * 70)
print(f"Best validation loss: {best_val_loss:.4f} at epoch {best_epoch+1}")
print(f"Training stopped at epoch: {stop_epoch+1}")
print(f"Epochs saved by stopping early: {len(epochs) - (stop_epoch + 1)}")
print(f"Patience: {patience} epochs")
print("\n✅ Early stopping prevents wasted compute and overfitting!")
Patience:
- Small datasets: 5-10 epochs
- Large datasets: 10-20 epochs
- Very large: 20-50 epochs
What to monitor:
- Primary: Validation loss
- Alternative: Validation accuracy/F1 (for imbalanced data)
Bonus:
- Save a checkpoint at each new best epoch
- Restore the best weights after training
5.4.5 Batch Normalization Basics
Batch Normalization normalizes the activations at each layer to stabilize training.
Benefits:
- Faster convergence
- Higher learning rates can be used
- Less sensitive to initialization
- Acts as regularization (similar to dropout)
Formula:
\[ \begin{aligned} \mu_B &= \frac{1}{m} \sum_{i=1}^{m} x_i \\ \sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \\ \hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\ y_i &= \gamma \hat{x}_i + \beta \end{aligned} \]
where \(\gamma\) and \(\beta\) are learnable parameters.
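The four equations above map almost line-for-line onto NumPy; a minimal sketch of the training-mode forward pass (ignoring the running statistics a real layer keeps for inference; the input values are illustrative):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm forward pass: normalize each feature over the mini-batch,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # batch of 3 samples, 2 features
out = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
# Each feature column of `out` now has (near) zero mean and unit variance
```

With γ = 1 and β = 0 the layer is a pure normalizer; during training the network learns γ and β, so it can undo the normalization wherever that helps.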
Batch Normalization is especially effective for:
- Deep networks (>5 layers)
- Convolutional Neural Networks
- Large batch sizes
Trade-off:
- Extra computation
- Behavior differs between training and inference
- Less effective for small batch sizes (<16)
Alternatives:
- Layer Normalization (for RNNs/Transformers)
- Group Normalization (for small batches)
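For contrast with batch normalization, a minimal sketch of Layer Normalization, which normalizes across the features within each sample (axis 1) rather than across the batch, so it works regardless of batch size (the input values are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: normalize each sample over its own features."""
    mu = x.mean(axis=1, keepdims=True)   # per-sample mean
    var = x.var(axis=1, keepdims=True)   # per-sample variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
out = layer_norm(x, gamma=np.ones(3), beta=np.zeros(3))
# Each row of `out` now has (near) zero mean and unit variance,
# even though the two samples have very different scales
```

Because the statistics are computed per sample, the behavior is identical at training and inference time, which is one reason Transformers prefer it.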
5.5 Implementation with Keras
Keras is a high-level API on top of TensorFlow that is very user-friendly for building neural networks.
5.5.1 Keras Sequential API
The Sequential API suits a linear stack of layers (the most common architecture).
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")
# Load dataset: Binary classification
data = load_breast_cancer()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Preprocessing: Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(f"\nDataset: Breast Cancer (Binary Classification)")
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]}")
print(f"Classes: {np.unique(y_train)}")
# Build an MLP with the Sequential API
model = models.Sequential([
# Input layer (implicit)
layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],), name='hidden1'),
layers.Dropout(0.3),
layers.Dense(32, activation='relu', name='hidden2'),
layers.Dropout(0.3),
layers.Dense(16, activation='relu', name='hidden3'),
# Output layer
layers.Dense(1, activation='sigmoid', name='output')
], name='MLP_Classifier')
# Model summary
print("\n" + "="*70)
print("MODEL ARCHITECTURE")
print("="*70)
model.summary()
# Compile model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()]
)
# Training with early stopping and a model checkpoint
early_stop = keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=15,
restore_best_weights=True,
verbose=1
)
checkpoint = keras.callbacks.ModelCheckpoint(
'best_model.keras',
monitor='val_loss',
save_best_only=True,
verbose=0
)
# Train model
print("\n" + "="*70)
print("TRAINING PROCESS")
print("="*70)
history = model.fit(
X_train, y_train,
validation_split=0.2,
epochs=100,
batch_size=32,
callbacks=[early_stop, checkpoint],
verbose=0  # Set to 1 to see training progress
)
print(f"Training completed: {len(history.history['loss'])} epochs")
# Evaluation
train_results = model.evaluate(X_train, y_train, verbose=0)
test_results = model.evaluate(X_test, y_test, verbose=0)
print("\n" + "="*70)
print("EVALUATION RESULTS")
print("="*70)
print(f"Training - Loss: {train_results[0]:.4f}, Accuracy: {train_results[1]:.4f}")
print(f"Test - Loss: {test_results[0]:.4f}, Accuracy: {test_results[1]:.4f}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss plot
axes[0].plot(history.history['loss'], linewidth=2, label='Training Loss')
axes[0].plot(history.history['val_loss'], linewidth=2, label='Validation Loss')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Training History: Loss', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Accuracy plot
axes[1].plot(history.history['accuracy'], linewidth=2, label='Training Accuracy')
axes[1].plot(history.history['val_accuracy'], linewidth=2, label='Validation Accuracy')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Training History: Accuracy', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Key Points of Keras Sequential:
- Layers stacked linearly: Sequential([layer1, layer2, ...])
- Input shape: specify it in the first layer only
- Activation: can be inline (activation='relu') or a separate layer
- Compile: specify the optimizer, loss, and metrics
- Fit: train with model.fit()
- Callbacks: early stopping, checkpointing, LR scheduling
5.5.2 Keras Functional API
The Functional API is more flexible and suits complex architectures (multi-input, multi-output, skip connections).
from tensorflow import keras
from tensorflow.keras import layers, Model
# Build a complex architecture with the Functional API
def build_functional_mlp(input_dim, num_classes):
"""
Build an MLP with skip connections (residual-like)
"""
# Input layer
inputs = layers.Input(shape=(input_dim,), name='input')
# First branch
x1 = layers.Dense(64, activation='relu', name='branch1_dense1')(inputs)
x1 = layers.Dropout(0.3)(x1)
x1 = layers.Dense(32, activation='relu', name='branch1_dense2')(x1)
# Second branch (shorter path)
x2 = layers.Dense(32, activation='relu', name='branch2_dense')(inputs)
# Merge branches (skip connection)
merged = layers.Add(name='merge')([x1, x2])
merged = layers.Activation('relu')(merged)
# Final layers
x = layers.Dense(16, activation='relu', name='final_dense')(merged)
x = layers.Dropout(0.2)(x)
# Output layer
outputs = layers.Dense(num_classes, activation='softmax', name='output')(x)
# Create model
model = Model(inputs=inputs, outputs=outputs, name='Functional_MLP')
return model
# Create model
functional_model = build_functional_mlp(input_dim=30, num_classes=2)
# Model summary
print("="*70)
print("FUNCTIONAL API MODEL")
print("="*70)
functional_model.summary()
# Visualize the architecture
keras.utils.plot_model(
functional_model,
to_file='functional_mlp_architecture.png',
show_shapes=True,
show_layer_names=True,
rankdir='TB',
dpi=96
)
print("\n✅ Model architecture saved to: functional_mlp_architecture.png")
print("\n💡 The Functional API enables:")
print("   - Multiple inputs/outputs")
print("   - Skip connections (ResNet-style)")
print("   - Shared layers")
print("   - Complex graph topologies")
Use Sequential when:
- Linear stack of layers
- Simple feed-forward architecture
- Quick prototyping
Use Functional when:
- Multi-input or multi-output models
- Skip connections (ResNet, DenseNet)
- Shared layers
- Complex architectures
5.5.3 Model Saving and Loading
# Save entire model (architecture + weights + optimizer state)
model.save('complete_model.keras')
print("✅ Complete model saved: complete_model.keras")
# Save only the weights
model.save_weights('model_weights.weights.h5')
print("✅ Weights saved: model_weights.weights.h5")
# Load the complete model
loaded_model = keras.models.load_model('complete_model.keras')
print("✅ Model loaded successfully")
# Verify the loaded model
test_loss, test_acc = loaded_model.evaluate(X_test, y_test, verbose=0)
print(f"Loaded model accuracy: {test_acc:.4f}")
# Export for production (TensorFlow SavedModel format)
model.export('exported_model')
print("✅ Model exported for production: exported_model/")
5.6 Implementation with PyTorch
PyTorch offers lower-level control and is very popular for research.
5.6.1 PyTorch nn.Module
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
print(f"PyTorch version: {torch.__version__}")
# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
# Load dataset: Multi-class classification
iris = load_iris()
X, y = iris.data, iris.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train).to(device)
y_train_tensor = torch.LongTensor(y_train).to(device)
X_test_tensor = torch.FloatTensor(X_test).to(device)
y_test_tensor = torch.LongTensor(y_test).to(device)
# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
print(f"\nDataset: Iris (Multi-class Classification)")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X_train.shape[1]}")
print(f"Classes: {len(np.unique(y_train))}")
# Define the MLP model with PyTorch
class MLPClassifier(nn.Module):
"""Custom MLP classifier using PyTorch"""
def __init__(self, input_dim, hidden_dims, num_classes, dropout_rate=0.3):
"""
Args:
input_dim: Number of input features
hidden_dims: List of hidden layer sizes, e.g., [64, 32, 16]
num_classes: Number of output classes
dropout_rate: Dropout probability
"""
super(MLPClassifier, self).__init__()
# Build layers dynamically
layers = []
prev_dim = input_dim
for hidden_dim in hidden_dims:
layers.append(nn.Linear(prev_dim, hidden_dim))
layers.append(nn.ReLU())
layers.append(nn.Dropout(dropout_rate))
prev_dim = hidden_dim
# Output layer
layers.append(nn.Linear(prev_dim, num_classes))
# Create sequential module
self.network = nn.Sequential(*layers)
def forward(self, x):
"""Forward pass"""
return self.network(x)
# Create model
model = MLPClassifier(
input_dim=X_train.shape[1],
hidden_dims=[32, 16, 8],
num_classes=len(np.unique(y_train)),
dropout_rate=0.2
).to(device)
print("\n" + "="*70)
print("PYTORCH MODEL ARCHITECTURE")
print("="*70)
print(model)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
def train_epoch(model, loader, criterion, optimizer, device):
"""Train for one epoch"""
model.train()
total_loss = 0
correct = 0
total = 0
for batch_X, batch_y in loader:
# Forward pass
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Statistics
total_loss += loss.item()
_, predicted = torch.max(outputs.data, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
avg_loss = total_loss / len(loader)
accuracy = correct / total
return avg_loss, accuracy
def evaluate(model, X, y, criterion, device):
"""Evaluate model"""
model.eval()
with torch.no_grad():
outputs = model(X)
loss = criterion(outputs, y)
_, predicted = torch.max(outputs.data, 1)
accuracy = (predicted == y).sum().item() / y.size(0)
return loss.item(), accuracy
# Training
print("\n" + "="*70)
print("TRAINING PROCESS")
print("="*70)
epochs = 100
train_losses = []
train_accs = []
test_losses = []
test_accs = []
for epoch in range(epochs):
train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
test_loss, test_acc = evaluate(model, X_test_tensor, y_test_tensor, criterion, device)
train_losses.append(train_loss)
train_accs.append(train_acc)
test_losses.append(test_loss)
test_accs.append(test_acc)
if (epoch + 1) % 20 == 0:
print(f"Epoch [{epoch+1}/{epochs}] - "
f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} | "
f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.4f}")
print("\n" + "="*70)
print("FINAL RESULTS")
print("="*70)
print(f"Final Train Accuracy: {train_accs[-1]:.4f}")
print(f"Final Test Accuracy: {test_accs[-1]:.4f}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss
axes[0].plot(train_losses, linewidth=2, label='Training Loss')
axes[0].plot(test_losses, linewidth=2, label='Test Loss')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('PyTorch Training: Loss', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Accuracy
axes[1].plot(train_accs, linewidth=2, label='Training Accuracy')
axes[1].plot(test_accs, linewidth=2, label='Test Accuracy')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('PyTorch Training: Accuracy', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
5.6.2 PyTorch Model Saving and Loading
# Save model weights
torch.save(model.state_dict(), 'pytorch_model_weights.pth')
print("✅ Model weights saved: pytorch_model_weights.pth")
# Save complete checkpoint (model + optimizer state)
checkpoint = {
'epoch': epochs,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'train_loss': train_losses[-1],
'test_loss': test_losses[-1]
}
torch.save(checkpoint, 'pytorch_checkpoint.pth')
print("✅ Complete checkpoint saved: pytorch_checkpoint.pth")
# Load model weights
loaded_model = MLPClassifier(
input_dim=X_train.shape[1],
hidden_dims=[32, 16, 8],
num_classes=len(np.unique(y_train)),
dropout_rate=0.2
).to(device)
loaded_model.load_state_dict(torch.load('pytorch_model_weights.pth', weights_only=True))
loaded_model.eval()
print("✅ Model weights loaded successfully")
# Verify
test_loss, test_acc = evaluate(loaded_model, X_test_tensor, y_test_tensor, criterion, device)
print(f"Loaded model test accuracy: {test_acc:.4f}")
5.6.3 Keras vs PyTorch: Quick Comparison
| Aspect | Keras | PyTorch |
|---|---|---|
| Ease of Use | Very easy, high-level | Moderate, more verbose |
| Flexibility | Limited (functional API helps) | Very flexible |
| Debugging | Harder (compiled graphs) | Easier (eager execution) |
| Research | Less common | Very popular |
| Production | TF ecosystem (TFLite, TF Serving) | TorchServe, ONNX |
| Community | Large (TF ecosystem) | Large (research-focused) |
| Learning Curve | Gentle | Steeper |
Recommendation:
- Beginners: start with Keras (easier)
- Research: PyTorch (more control)
- Production (mobile/edge): Keras/TensorFlow
- Production (server): Either works
5.7 Practical Example: Bank Marketing Campaign
Let's apply everything we have learned to a real-world problem.
Problem: predict whether a customer will subscribe to a term deposit based on marketing campaign data.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks
# Simulate bank marketing dataset (simplified version)
np.random.seed(42)
n_samples = 5000
data = {
'age': np.random.randint(18, 70, n_samples),
'balance': np.random.randint(-5000, 50000, n_samples),
'duration': np.random.randint(0, 3000, n_samples),
'campaign': np.random.randint(1, 50, n_samples),
'previous': np.random.randint(0, 40, n_samples),
'job': np.random.choice(['admin', 'technician', 'services', 'management'], n_samples),
'education': np.random.choice(['primary', 'secondary', 'tertiary'], n_samples),
'subscribed': np.random.choice([0, 1], n_samples, p=[0.85, 0.15]) # Imbalanced
}
df = pd.DataFrame(data)
print("="*70)
print("BANK MARKETING DATASET")
print("="*70)
print(df.head(10))
print(f"\nDataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['subscribed'].value_counts())
print(f"Class imbalance ratio: {df['subscribed'].value_counts()[0] / df['subscribed'].value_counts()[1]:.2f}:1")
# Preprocessing
# Encode categorical variables
le_job = LabelEncoder()
le_education = LabelEncoder()
df['job_encoded'] = le_job.fit_transform(df['job'])
df['education_encoded'] = le_education.fit_transform(df['education'])
# Select features
feature_cols = ['age', 'balance', 'duration', 'campaign', 'previous',
'job_encoded', 'education_encoded']
X = df[feature_cols].values
y = df['subscribed'].values
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Build MLP model
model = models.Sequential([
layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
layers.BatchNormalization(),
layers.Dropout(0.4),
layers.Dense(64, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(0.3),
layers.Dense(32, activation='relu'),
layers.Dropout(0.2),
layers.Dense(1, activation='sigmoid')
], name='Bank_Marketing_MLP')
# Compile
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall(), keras.metrics.AUC()]
)
# Callbacks
early_stop = callbacks.EarlyStopping(
monitor='val_auc',
patience=20,
restore_best_weights=True,
mode='max',
verbose=1
)
lr_schedule = callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=10,
verbose=1
)
# Training
print("\n" + "="*70)
print("TRAINING")
print("="*70)
history = model.fit(
X_train, y_train,
validation_split=0.2,
epochs=100,
batch_size=64,
callbacks=[early_stop, lr_schedule],
verbose=0
)
print(f"Training completed: {len(history.history['loss'])} epochs")
# Evaluation
y_pred_proba = model.predict(X_test, verbose=0).ravel()
y_pred = (y_pred_proba >= 0.5).astype(int)
print("\n" + "="*70)
print("EVALUATION RESULTS")
print("="*70)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Subscribed', 'Subscribed']))
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_pred_proba):.4f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('True')
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, linewidth=3, label=f'AUC = {roc_auc_score(y_test, y_pred_proba):.3f}')
axes[0, 1].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
axes[0, 1].set_xlabel('False Positive Rate', fontsize=12)
axes[0, 1].set_ylabel('True Positive Rate', fontsize=12)
axes[0, 1].set_title('ROC Curve', fontsize=13, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Training History: Loss
axes[1, 0].plot(history.history['loss'], linewidth=2, label='Training Loss')
axes[1, 0].plot(history.history['val_loss'], linewidth=2, label='Validation Loss')
axes[1, 0].set_xlabel('Epoch', fontsize=12)
axes[1, 0].set_ylabel('Loss', fontsize=12)
axes[1, 0].set_title('Training History: Loss', fontsize=13, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Training History: AUC
axes[1, 1].plot(history.history['auc'], linewidth=2, label='Training AUC')
axes[1, 1].plot(history.history['val_auc'], linewidth=2, label='Validation AUC')
axes[1, 1].set_xlabel('Epoch', fontsize=12)
axes[1, 1].set_ylabel('AUC', fontsize=12)
axes[1, 1].set_title('Training History: AUC', fontsize=13, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\n✅ Complete MLP pipeline for a real-world classification problem!")
5.8 Review and Exercises
5.8.1 Review Questions
Fundamental Concepts:
- What is the main difference between deep learning and classical machine learning?
- Why can a single perceptron not solve the XOR problem?
- Explain the role of activation functions in neural networks!
Architecture:
- How do you choose the number of hidden layers and neurons?
- When should you use ReLU vs. sigmoid activation?
- What is the trade-off between deep and wide networks?
Training:
- Explain intuitively how backpropagation works!
- Why does the Adam optimizer often outperform standard SGD?
- What is the impact of a learning rate that is too large or too small?
Regularization:
- How does dropout prevent overfitting?
- When should you use L2 regularization vs. dropout?
- What are the advantages of early stopping over training to convergence?
Implementation:
- When should you use the Keras Sequential vs. Functional API?
- What are the main differences between Keras and PyTorch?
- How do you handle class imbalance in neural networks?
5.8.2 Coding Exercises
Exercise 1: XOR Problem
Implement and train an MLP to solve the XOR problem using Keras. The network must reach 100% accuracy.
# Your solution here
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
# TODO: Build, compile, and train the model
# TODO: Achieve 100% accuracy on the XOR problem
Exercise 2: Hyperparameter Tuning
Experiment with various hyperparameters on the Iris dataset:
- Number of hidden layers (1, 2, 3)
- Neurons per layer (8, 16, 32, 64)
- Learning rate (0.0001, 0.001, 0.01)
- Dropout rate (0.0, 0.2, 0.5)
Plot the results and determine the optimal configuration.
Exercise 3: Custom Loss Function
Implement a custom loss function in PyTorch for Focal Loss (which handles class imbalance):
\[ \text{FL}(p_t) = -(1-p_t)^\gamma \log(p_t) \]
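As a sanity check of the formula, here is a minimal NumPy version for the binary case; the PyTorch exercise would wrap the same computation in an `nn.Module` using `torch` operations:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss. p: predicted P(class=1), y: true labels in {0, 1}."""
    p_t = np.where(y == 1, p, 1 - p)           # probability of the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.99, 0.6, 0.1])
y = np.array([1, 1, 1])
print(focal_loss(p, y))  # confident correct predictions contribute almost nothing
```

Setting `gamma=0` recovers ordinary cross-entropy; larger `gamma` down-weights easy examples more aggressively, which is why focal loss helps on imbalanced data.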
Exercise 4: Transfer Learning Simulation
- Train an MLP on a subset of the dataset (30% of the data)
- Freeze some of the layers and fine-tune on the remaining data
- Compare against training from scratch
5.8.3 Case Study: Credit Card Fraud Detection
Dataset: Simulated credit card transactions (imbalanced: 99.8% normal, 0.2% fraud)
Task:
- Preprocess the data (scaling, handling the imbalance)
- Build an MLP classifier with an appropriate architecture
- Use class weights or oversampling
- Optimize for precision and recall (not just accuracy!)
- Visualize the results and analyze the errors
Deliverables:
- A trained model with precision > 0.90 and recall > 0.80
- Learning curves
- Confusion matrix
- ROC curve and AUC score
- Analysis of false positives and false negatives
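The class-weight step in the task list can be sketched in plain NumPy (scikit-learn's `compute_class_weight('balanced', ...)` implements the same rule); the resulting dict is in the shape Keras expects for the `class_weight` argument of `fit`:

```python
import numpy as np

def balanced_class_weights(y):
    """'Balanced' weighting: n_samples / (n_classes * count_per_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Simulated imbalance matching the case study: 99.8% normal, 0.2% fraud
y = np.array([0] * 998 + [1] * 2)
print(balanced_class_weights(y))  # the fraud class gets a far larger weight
```

With 998 normal and 2 fraud samples this gives the fraud class a weight of 250 versus roughly 0.5 for the normal class, so each fraud example contributes correspondingly more to the loss.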
5.9 Key Takeaways
Fundamental Concepts:
- Deep learning uses hierarchical feature learning
- MLPs consist of input, hidden, and output layers
- Activation functions provide non-linearity
- Forward propagation: input → output
- Backpropagation: computes the gradients used for learning
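The forward pass summarized above, sketched for a single example through one ReLU hidden layer and a sigmoid output (the weights here are random placeholders, not trained values):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(42)
x = rng.random(4)                           # 4 input features

W1, b1 = rng.random((8, 4)), np.zeros(8)    # hidden layer: 8 neurons
W2, b2 = rng.random((1, 8)), np.zeros(1)    # output layer: 1 neuron

h = relu(W1 @ x + b1)                       # hidden activations (non-linearity)
y_hat = 1 / (1 + np.exp(-(W2 @ h + b2)))    # sigmoid squashes output into (0, 1)
print(y_hat)
```

Backpropagation then runs the same graph in reverse, applying the chain rule layer by layer to get gradients with respect to `W1`, `b1`, `W2`, and `b2`.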
Training Best Practices:
- Optimizer: Adam (default choice)
- Learning Rate: 0.001 (Adam), 0.01 (SGD)
- Batch Size: 32-128 (balanced)
- Epochs: Use early stopping
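Assuming scikit-learn (used earlier in this chapter), these defaults map roughly onto one `MLPClassifier` configuration; in Keras the same choices would instead go into `compile` (optimizer, learning rate) and `fit` (batch size, an `EarlyStopping` callback):

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)

clf = MLPClassifier(
    hidden_layer_sizes=(32, 32),
    solver='adam',             # Adam as the default optimizer choice
    learning_rate_init=0.001,  # typical Adam learning rate
    batch_size=64,             # within the balanced 32-128 range
    early_stopping=True,       # holds out 10% of the data as a validation set
    n_iter_no_change=10,       # patience: stop after 10 epochs without improvement
    max_iter=1000,
    random_state=42,
)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.3f}")
```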
Regularization:
- Dropout: 0.3-0.5 for hidden layers
- L2: Ξ» = 0.0001-0.01
- Early Stopping: Patience 10-20 epochs
- Batch Normalization: For deep networks
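Inverted dropout, the variant modern frameworks implement, can be sketched in a few lines; frameworks apply it automatically during training only:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, rate=0.5, training=True):
    """Inverted dropout: zero each activation with probability `rate`,
    rescaling survivors by 1/(1-rate) so the expected activation is unchanged."""
    if not training:
        return a                         # identity at inference time
    mask = rng.random(a.shape) >= rate   # keep each unit with probability 1 - rate
    return a * mask / (1.0 - rate)

a = np.ones((4, 8))
h = dropout(a, rate=0.5)  # entries are either 0.0 or 2.0
```

Because each forward pass samples a fresh mask, no single neuron can be relied upon, which forces redundant representations and reduces overfitting.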
Implementation:
- Keras: User-friendly, quick prototyping
- PyTorch: Flexible, research-oriented
- Sequential API: Linear architectures
- Functional API: Complex architectures
Common Pitfalls:
- Forgetting to normalize input data
- Using sigmoid for hidden layers (use ReLU!)
- Learning rate too large (divergence)
- Not using validation set (overfitting)
- Ignoring class imbalance
5.10 What's Next?
After mastering MLP fundamentals, you are ready for more advanced deep learning architectures:
Chapter 6: Convolutional Neural Networks (CNN)
- Image classification and computer vision
- Convolutional layers and feature maps
- Transfer learning with pretrained models
- Data augmentation techniques
Chapter 7: Recurrent Neural Networks (RNN/LSTM)
- Sequence modeling (time series, text)
- LSTM and GRU architectures
- Bidirectional RNNs
- Attention mechanisms
Chapter 8: Transformers
- Self-attention mechanism
- BERT, GPT architectures
- Fine-tuning pretrained models
- Modern NLP applications
Continue Learning:
- Experiment with different datasets
- Read research papers (start with survey papers)
- Participate in Kaggle competitions
- Build real-world projects
References and Further Reading
Books:
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Chollet, F. (2021). Deep Learning with Python (2nd ed.). Manning.
- Zhang, A., et al. (2023). Dive into Deep Learning. (Free online: d2l.ai)
Papers:
- Rumelhart, D. E., et al. (1986). "Learning representations by back-propagating errors."
- Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization."
- Srivastava, N., et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting."
Online Resources:
- TensorFlow Tutorials: https://www.tensorflow.org/tutorials
- PyTorch Tutorials: https://pytorch.org/tutorials/
- Fast.ai Course: https://course.fast.ai/
- Stanford CS231n: http://cs231n.stanford.edu/
Practice Platforms:
- Kaggle: https://www.kaggle.com/
- Google Colab: https://colab.research.google.com/
- Papers with Code: https://paperswithcode.com/
Deep learning is an iterative process:
1. Start simple (baseline model)
2. Analyze errors and bottlenecks
3. Iterate with improvements
4. Experiment systematically
5. Document what works and what doesn't
"The best model is the one you can build, understand, and improve."
Good luck and happy deep learning!