import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from matplotlib.colors import ListedColormap
# Generate non-linear data
np.random.seed(42)
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Classical ML: Logistic Regression (linear decision boundary)
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_score = lr_model.score(X_test, y_test)
# Deep Learning: MLP (non-linear decision boundary)
mlp_model = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=42)
mlp_model.fit(X_train, y_train)
mlp_score = mlp_model.score(X_test, y_test)
# Visualization
def plot_decision_boundary(model, X, y, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=ListedColormap(['#FF0000', '#0000FF']),
                edgecolors='k', s=50)
    plt.title(title, fontsize=14, fontweight='bold')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plt.sca(axes[0])
plot_decision_boundary(lr_model, X_test, y_test,
f'Logistic Regression\nAccuracy: {lr_score:.3f}')
plt.sca(axes[1])
plot_decision_boundary(mlp_model, X_test, y_test,
f'Neural Network (MLP)\nAccuracy: {mlp_score:.3f}')
plt.tight_layout()
plt.show()
print(f"Linear Model (Logistic Regression): {lr_score:.4f}")
print(f"Non-linear Model (MLP): {mlp_score:.4f}")
print(f"Improvement: {(mlp_score - lr_score) * 100:.2f}%")
Chapter 5: Multilayer Perceptron (MLP) Fundamentals
Foundations of Deep Learning and Neural Networks
After studying this chapter, you will be able to:
- Understand neural network architecture and the fundamental concepts of deep learning
- Implement MLPs using the Keras and PyTorch frameworks
- Apply optimization techniques (learning rate, batch size, epochs) for effective training
- Use regularization techniques (dropout, L2) to prevent overfitting
- Evaluate and visualize the training process of deep learning models
- Compare deep learning with classical machine learning approaches
5.1 Introduction: The Evolution from Classical ML to Deep Learning
5.1.1 Why Neural Networks?
After studying classical machine learning in Phase 2 (Chapters 1-4), you may be asking: why do we need deep learning? The answer lies in the limitations of classical ML and the unique capabilities of neural networks.
Limitations of Classical ML:
| Aspect | Classical ML | Deep Learning |
|---|---|---|
| Feature engineering | Manual, requires domain expertise | Automatic, hierarchical feature learning |
| Data complexity | Struggles with high-dimensional data | Excels on image, text, audio |
| Scalability | Performance plateaus on large datasets | Improves with more data |
| Representation | Shallow, single-layer features | Deep, hierarchical abstractions |
| Transfer learning | Limited | Highly effective |
Imagine recognizing someone's face:
- Layer 1: detects edges (lines, contours)
- Layer 2: detects parts (eyes, nose, mouth)
- Layer 3: detects patterns (facial arrangement)
- Layer 4: identifies the individual
Neural networks learn hierarchical representations like this automatically from data!
5.1.2 Deep Learning Success Stories
Deep learning has achieved breakthroughs across many domains:
Computer Vision:
- ImageNet (2012): AlexNet reduced the error rate from 26% to 15%
- Real-time object detection (YOLO, Faster R-CNN)
- Medical imaging: cancer detection with specialist-level accuracy
Natural Language Processing:
- Machine translation: Google Translate neural MT
- Large Language Models: GPT-4, Claude, Gemini
- Sentiment analysis, text generation
Speech & Audio:
- Speech recognition: Google Assistant, Siri
- Natural-sounding text-to-speech synthesis
- Music generation (Jukebox, MusicGen)
Game Playing:
- AlphaGo: defeated Lee Sedol (Go champion)
- AlphaZero: mastered Chess, Go, and Shogi from scratch
- OpenAI Five: championship-level Dota 2
Science & Research:
- AlphaFold: protein structure prediction
- Accelerated drug discovery
- Climate modeling
5.1.3 Deep Learning vs Classical ML: When to Use Which?
Use classical ML when:
- The dataset is small (<10,000 samples)
- Features are already well defined
- Interpretability is critical
- Computational resources are limited
- A quick baseline is needed
Use deep learning when:
- The dataset is large (>100,000 samples)
- The data is complex (images, text, audio)
- Feature engineering is difficult or unclear
- Computational resources (GPU) are available
- State-of-the-art performance is required
Insight: Neural networks can learn complex non-linear decision boundaries, while linear models are limited to linear separation.
5.2 Neural Network Architecture
5.2.1 The Perceptron: Fundamental Building Block
The perceptron is the basic computational unit of a neural network, inspired by the biological neuron.
flowchart LR
X1["x1 (Input 1)"] -->|w1| S["Σ (Weighted Sum)"]
X2["x2 (Input 2)"] -->|w2| S
X3["x3 (Input 3)"] -->|w3| S
B["b (Bias)"] -->|1| S
S --> A["σ (Activation)"]
A --> Y["y-hat (Output)"]
style S fill:#ffeb3b
style A fill:#4caf50
style Y fill:#2196f3
Perceptron Mathematics:
\[ \begin{aligned} z &= \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b \\ \hat{y} &= \sigma(z) = \sigma(\mathbf{w}^T \mathbf{x} + b) \end{aligned} \]
Where:
- \(\mathbf{x} = [x_1, x_2, ..., x_n]\): input features
- \(\mathbf{w} = [w_1, w_2, ..., w_n]\): weights
- \(b\): bias (intercept)
- \(\sigma\): activation function
- \(z\): pre-activation value
- \(\hat{y}\): output prediction
Key Components:
- Weights (w): learned parameters that control connection strength
- Bias (b): an offset that allows the function to shift
- Activation function: the non-linearity that enables learning complex patterns
import numpy as np
import matplotlib.pyplot as plt
# Perceptron implementation from scratch
class SimplePerceptron:
    def __init__(self, n_inputs):
        # Initialize weights and bias randomly
        self.weights = np.random.randn(n_inputs)
        self.bias = np.random.randn()

    def sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-z))

    def forward(self, x):
        """Forward pass: compute the output"""
        z = np.dot(x, self.weights) + self.bias
        return self.sigmoid(z)

    def predict(self, X):
        """Predict for multiple samples"""
        return np.array([self.forward(x) for x in X])
# Demo: the XOR problem (a classic non-linearly separable problem)
# A single perceptron CANNOT solve XOR
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0]) # XOR outputs
perceptron = SimplePerceptron(n_inputs=2)
predictions = perceptron.predict(X_xor)
print("XOR Problem with a Single Perceptron:")
print("Input (x1, x2) | True Label | Prediction | Correct?")
print("-" * 55)
for i in range(len(X_xor)):
    pred_label = 1 if predictions[i] >= 0.5 else 0
    correct = "✓" if pred_label == y_xor[i] else "✗"
    print(f"{X_xor[i]} | {y_xor[i]:^10d} | {predictions[i]:^10.3f} | {correct:^8s}")
print("\n⚠️ A single perceptron fails to solve the XOR problem!")
print("This motivated the development of the MLP (multiple layers).")
Limitations of a Single Perceptron:
A single perceptron can only learn linearly separable patterns. The XOR problem is the classic example of this limitation, which motivated the development of the Multi-Layer Perceptron (MLP).
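As a quick sketch of what the added layers buy us, the same XOR data can be fit with scikit-learn's `MLPClassifier` (the class already imported at the top of this chapter); the hyperparameters here (one hidden layer of 8 units, `lbfgs` solver) are illustrative choices, not prescribed values:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR inputs and labels
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# One hidden layer is enough to make XOR learnable;
# lbfgs is reliable on tiny datasets like this one.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=0)
clf.fit(X_xor, y_xor)
print(clf.predict(X_xor))
```

Unlike the single perceptron above, the hidden layer lets the network carve the plane into the two diagonal regions XOR requires.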
5.2.2 Multilayer Perceptron (MLP) Architecture
The MLP overcomes the limitations of the single perceptron by adding hidden layers.
graph LR
subgraph Input["Input Layer"]
X1["x₁"]
X2["x₂"]
X3["x₃"]
end
subgraph Hidden1["Hidden Layer 1"]
H11["h₁₁"]
H12["h₁₂"]
H13["h₁₃"]
H14["h₁₄"]
end
subgraph Hidden2["Hidden Layer 2"]
H21["h₂₁"]
H22["h₂₂"]
H23["h₂₃"]
end
subgraph Output["Output Layer"]
Y1["ŷ₁"]
Y2["ŷ₂"]
end
X1 --> H11 & H12 & H13 & H14
X2 --> H11 & H12 & H13 & H14
X3 --> H11 & H12 & H13 & H14
H11 --> H21 & H22 & H23
H12 --> H21 & H22 & H23
H13 --> H21 & H22 & H23
H14 --> H21 & H22 & H23
H21 --> Y1 & Y2
H22 --> Y1 & Y2
H23 --> Y1 & Y2
style Input fill:#e3f2fd
style Hidden1 fill:#fff3e0
style Hidden2 fill:#fff3e0
style Output fill:#e8f5e9
Terminology:
- Input layer: receives the features (no computation)
- Hidden layer(s): intermediate layers that learn representations
- Output layer: produces the predictions
- Depth: the number of hidden layers (deep = many layers)
- Width: the number of neurons per layer
Mathematical Notation:
For layer \(l\):
\[ \begin{aligned} \mathbf{z}^{[l]} &= \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]} \\ \mathbf{a}^{[l]} &= \sigma^{[l]}(\mathbf{z}^{[l]}) \end{aligned} \]
Where:
- \(\mathbf{W}^{[l]}\): weight matrix for layer \(l\)
- \(\mathbf{b}^{[l]}\): bias vector for layer \(l\)
- \(\mathbf{a}^{[l]}\): activations (output) of layer \(l\)
- \(\mathbf{a}^{[0]} = \mathbf{x}\): input features
- \(\sigma^{[l]}\): activation function for layer \(l\)
Matrix Sizes:
If layer \(l\) has \(n^{[l]}\) neurons and layer \(l-1\) has \(n^{[l-1]}\) neurons:
- \(\mathbf{W}^{[l]}\): shape \((n^{[l]}, n^{[l-1]})\)
- \(\mathbf{b}^{[l]}\): shape \((n^{[l]}, 1)\)
- \(\mathbf{a}^{[l]}\): shape \((n^{[l]}, 1)\) for a single sample
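These shape rules can be checked directly in NumPy. A minimal sketch for a tiny [3, 4, 2] network (the layer sizes are arbitrary example values):

```python
import numpy as np

# W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1).
layer_sizes = [3, 4, 2]
W1 = np.random.randn(layer_sizes[1], layer_sizes[0])  # (4, 3)
b1 = np.zeros((layer_sizes[1], 1))                    # (4, 1)
W2 = np.random.randn(layer_sizes[2], layer_sizes[1])  # (2, 4)
b2 = np.zeros((layer_sizes[2], 1))                    # (2, 1)

x = np.ones((layer_sizes[0], 1))   # a single sample as a column vector
a1 = np.maximum(0, W1 @ x + b1)    # ReLU hidden activations, shape (4, 1)
a2 = W2 @ a1 + b2                  # output pre-activations, shape (2, 1)
print(a1.shape, a2.shape)
```

Each matrix product `W @ a` maps a column vector of the previous layer's width to one of the current layer's width, which is exactly what the shape table above states.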
5.2.3 Activation Functions
Activation functions provide the non-linearity that is essential for learning complex patterns.
Comparison of Activation Functions:
| Function | Formula | Range | Use Case | Pros | Cons |
|---|---|---|---|---|---|
| Sigmoid | \(\sigma(z) = \frac{1}{1+e^{-z}}\) | (0, 1) | Output layer (binary) | Smooth, probabilistic | Vanishing gradient |
| Tanh | \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\) | (-1, 1) | Hidden layers (older) | Zero-centered | Vanishing gradient |
| ReLU | \(\text{ReLU}(z) = \max(0, z)\) | [0, ∞) | Hidden layers (default) | Fast, no vanishing gradient | Dying ReLU |
| Leaky ReLU | \(\text{LReLU}(z) = \max(0.01z, z)\) | (-∞, ∞) | Hidden layers | Fixes dying ReLU | Hyperparameter α |
| Softmax | \(\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\) | (0, 1), sum = 1 | Output (multi-class) | Probabilities | Output layer only |
import numpy as np
import matplotlib.pyplot as plt
# Implementations of the activation functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
# Visualization
z = np.linspace(-5, 5, 200)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Sigmoid
axes[0, 0].plot(z, sigmoid(z), linewidth=3, color='#2196F3')
axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axhline(y=1, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('z')
axes[0, 0].set_ylabel('σ(z)')
axes[0, 0].text(-4, 0.9, 'Range: (0, 1)', fontsize=11, bbox=dict(boxstyle='round', facecolor='wheat'))
# Tanh
axes[0, 1].plot(z, tanh(z), linewidth=3, color='#4CAF50')
axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axhline(y=1, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axhline(y=-1, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_title('Tanh: tanh(z)', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('z')
axes[0, 1].set_ylabel('tanh(z)')
axes[0, 1].text(-4, 0.8, 'Range: (-1, 1)', fontsize=11, bbox=dict(boxstyle='round', facecolor='lightgreen'))
# ReLU
axes[1, 0].plot(z, relu(z), linewidth=3, color='#FF9800')
axes[1, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_title('ReLU: max(0, z)', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('z')
axes[1, 0].set_ylabel('ReLU(z)')
axes[1, 0].text(-4, 4, 'Range: [0, ∞)', fontsize=11, bbox=dict(boxstyle='round', facecolor='#FFE0B2'))
# Leaky ReLU
axes[1, 1].plot(z, leaky_relu(z), linewidth=3, color='#9C27B0')
axes[1, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_title('Leaky ReLU: max(0.01z, z)', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('z')
axes[1, 1].set_ylabel('Leaky ReLU(z)')
axes[1, 1].text(-4, 4, 'Range: (-∞, ∞)', fontsize=11, bbox=dict(boxstyle='round', facecolor='#E1BEE7'))
plt.tight_layout()
plt.show()
print("Activation Function Characteristics:")
print("-" * 70)
print("Sigmoid   : Smooth, outputs (0,1), ⚠️ vanishing gradient problem")
print("Tanh      : Zero-centered, outputs (-1,1), ⚠️ vanishing gradient")
print("ReLU      : Fast, sparse activation, ⚠️ dying ReLU (neurons die)")
print("Leaky ReLU: Solves dying ReLU, allows small negative values")
print("\n✅ Recommendation: use ReLU for hidden layers (default choice)")
Hidden Layers:
- Default: ReLU (fast, effective, sparse)
- Alternative: Leaky ReLU (avoids dying neurons)
- Avoid: Sigmoid/Tanh (vanishing gradients in deep networks)
Output Layer:
- Binary classification: Sigmoid (outputs a probability in 0-1)
- Multi-class classification: Softmax (probabilities sum to 1)
- Regression: Linear (no activation; outputs real values)
5.2.4 Forward Propagation
Forward propagation is the process of computing the output from the input, layer by layer.
import numpy as np
class MLPFromScratch:
    """MLP implementation from scratch for educational purposes"""

    def __init__(self, layer_sizes):
        """
        Initialize the MLP with the specified architecture
        Args:
            layer_sizes: List of integers, e.g., [3, 4, 4, 2]
                Input: 3 neurons
                Hidden1: 4 neurons
                Hidden2: 4 neurons
                Output: 2 neurons
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)
        self.parameters = {}
        # Initialize weights and biases
        np.random.seed(42)
        for l in range(1, self.num_layers):
            # He initialization for ReLU
            self.parameters[f'W{l}'] = np.random.randn(
                layer_sizes[l], layer_sizes[l-1]
            ) * np.sqrt(2.0 / layer_sizes[l-1])
            self.parameters[f'b{l}'] = np.zeros((layer_sizes[l], 1))

    def relu(self, Z):
        """ReLU activation"""
        return np.maximum(0, Z)

    def softmax(self, Z):
        """Softmax activation (for the output layer)"""
        exp_Z = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # numerical stability
        return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

    def forward_propagation(self, X):
        """
        Forward pass through the network
        Args:
            X: Input data (n_features, m_samples)
        Returns:
            AL: Output activations
            cache: Dictionary containing Z and A for each layer
        """
        cache = {'A0': X}
        A = X
        # Hidden layers with ReLU
        for l in range(1, self.num_layers - 1):
            Z = self.parameters[f'W{l}'] @ A + self.parameters[f'b{l}']
            A = self.relu(Z)
            cache[f'Z{l}'] = Z
            cache[f'A{l}'] = A
        # Output layer with softmax
        l = self.num_layers - 1
        Z = self.parameters[f'W{l}'] @ A + self.parameters[f'b{l}']
        A = self.softmax(Z)
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
        return A, cache

    def predict(self, X):
        """Predict class labels"""
        A, _ = self.forward_propagation(X)
        return np.argmax(A, axis=0)
# Demo: solving the XOR problem with an MLP
print("Demo: MLP Solving the XOR Problem")
print("=" * 60)
# XOR data
X_xor = np.array([[0, 0, 1, 1],
[0, 1, 0, 1]]) # (2 features, 4 samples)
y_xor = np.array([0, 1, 1, 0]) # XOR labels
# Create MLP: 2 inputs -> 4 hidden -> 2 outputs
mlp = MLPFromScratch(layer_sizes=[2, 4, 2])
# Forward pass
output, cache = mlp.forward_propagation(X_xor)
print(f"\nNetwork Architecture: {mlp.layer_sizes}")
print(f"Input  : {mlp.layer_sizes[0]} neurons")
print(f"Hidden : {mlp.layer_sizes[1]} neurons (ReLU)")
print(f"Output : {mlp.layer_sizes[2]} neurons (Softmax)")
print("\nForward Propagation Results:")
print("-" * 60)
for i in range(X_xor.shape[1]):
    print(f"Input: {X_xor[:, i]} | Output probs: {output[:, i]} | Prediction: {np.argmax(output[:, i])}")
print("\n⚠️ Note: these are RANDOMLY INITIALIZED weights (no training yet)")
print("After training with backpropagation, the MLP will solve XOR!")
print("\nWe will study the training process in the next section.")
The Forward Propagation Steps:
Input Layer: \(\mathbf{a}^{[0]} = \mathbf{x}\)
Hidden Layer 1:
- \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]}\)
- \(\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]})\)
Hidden Layer 2:
- \(\mathbf{z}^{[2]} = \mathbf{W}^{[2]} \mathbf{a}^{[1]} + \mathbf{b}^{[2]}\)
- \(\mathbf{a}^{[2]} = \text{ReLU}(\mathbf{z}^{[2]})\)
Output Layer:
- \(\mathbf{z}^{[3]} = \mathbf{W}^{[3]} \mathbf{a}^{[2]} + \mathbf{b}^{[3]}\)
- \(\mathbf{a}^{[3]} = \text{Softmax}(\mathbf{z}^{[3]})\)
5.2.5 Network Topology Considerations
How many hidden layers?
| Network Type | Hidden Layers | Use Case |
|---|---|---|
| Shallow | 1 | Simple patterns, XOR, small data |
| Medium | 2-3 | Most practical problems |
| Deep | 4+ | Complex hierarchical features (images, text) |
How many neurons per layer?
Rules of Thumb:
- Start with neurons = mean(input_size, output_size)
- Hidden layers > input layer (but not excessively large)
- Pyramid shape: gradually decreasing widths (e.g., 128 → 64 → 32)
- The same width for all hidden layers is also fine (e.g., 64 → 64 → 64)
Experimentation matters! There is no perfect formula; it depends on:
- Data complexity
- Number of samples
- Overfitting/underfitting
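One practical way to compare candidate topologies is by their parameter count, since each dense layer contributes n[l] × n[l-1] weights plus n[l] biases. A small sketch (the example architectures are illustrative):

```python
def count_params(layer_sizes):
    """Total trainable parameters of a fully connected MLP:
    each layer l adds n[l] * n[l-1] weights plus n[l] biases."""
    return sum(layer_sizes[l] * layer_sizes[l - 1] + layer_sizes[l]
               for l in range(1, len(layer_sizes)))

print(count_params([2, 4, 2]))          # tiny XOR-style network: 22
print(count_params([784, 128, 64, 10])) # MNIST-style MLP: 109386
```

Wider or deeper networks grow this count quickly, which is one reason large hidden layers need more data (and more regularization) to avoid overfitting.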
5.3 Training Neural Networks
5.3.1 Loss Functions
The loss function measures how far the predictions are from the true values.
Binary Classification: Binary Cross-Entropy
\[ \mathcal{L}(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right] \]
Multi-class Classification: Categorical Cross-Entropy
\[ \mathcal{L}(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)}) \]
Regression: Mean Squared Error (MSE)
\[ \mathcal{L}(\mathbf{w}) = \frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2 \]
import numpy as np
import matplotlib.pyplot as plt
def binary_cross_entropy(y_true, y_pred):
    """Binary cross-entropy loss"""
    epsilon = 1e-15  # for numerical stability
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Categorical cross-entropy loss"""
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def mean_squared_error(y_true, y_pred):
    """Mean squared error loss"""
    return np.mean((y_true - y_pred) ** 2)
# Visualization: Binary Cross-Entropy vs MSE
y_true_binary = 1 # True label = 1
y_pred_range = np.linspace(0.01, 0.99, 100)
bce_loss = [-np.log(yp) for yp in y_pred_range]
mse_loss = [(1 - yp)**2 for yp in y_pred_range]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Binary Cross-Entropy
axes[0].plot(y_pred_range, bce_loss, linewidth=3, color='#E91E63')
axes[0].axvline(x=1.0, color='g', linestyle='--', label='Perfect prediction (y=1)', alpha=0.7)
axes[0].set_xlabel('Predicted Probability', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Binary Cross-Entropy Loss\n(True label = 1)', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# MSE
axes[1].plot(y_pred_range, mse_loss, linewidth=3, color='#3F51B5')
axes[1].axvline(x=1.0, color='g', linestyle='--', label='Perfect prediction (y=1)', alpha=0.7)
axes[1].set_xlabel('Predicted Probability', fontsize=12)
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Mean Squared Error Loss\n(True label = 1)', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Loss Function Comparison:")
print("-" * 70)
print("Binary Cross-Entropy:")
print("  ✅ Suited to probability outputs (sigmoid)")
print("  ✅ Asymmetric punishment (penalizes confident wrong predictions heavily)")
print("  ✅ Better gradient flow for classification")
print("\nMean Squared Error:")
print("  ✅ Suited to regression")
print("  ⚠️ Suboptimal for classification (symmetric punishment)")
Classification:
- Binary: Binary Cross-Entropy (with Sigmoid activation)
- Multi-class: Categorical Cross-Entropy (with Softmax activation)
Regression:
- MSE: default choice, sensitive to outliers
- MAE: robust to outliers
- Huber: a combination of MSE and MAE
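Since Huber is only named above, here is a minimal NumPy sketch showing how it blends the two: quadratic like MSE for small errors, linear like MAE beyond a threshold `delta` (the threshold value is a conventional default, not prescribed by this chapter):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: 0.5*err^2 for |err| <= delta (MSE-like),
    delta*(|err| - 0.5*delta) otherwise (MAE-like, robust to outliers)."""
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small,
                            0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))

print(huber_loss(np.array([1.0]), np.array([0.5])))  # small error: quadratic branch
print(huber_loss(np.array([3.0]), np.array([0.0])))  # large error: linear branch
```

Because the penalty grows only linearly for large errors, a single outlier pulls the fit far less than under MSE.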
5.3.2 The Backpropagation Algorithm
Backpropagation is the algorithm for computing the gradients of the loss function with respect to all parameters.
Intuition: backpropagation uses the chain rule from calculus to compute gradients layer by layer, from the output back to the input.
graph RL
L["Loss L"] -->|"∂L/∂a³"| A3["Output a³"]
A3 -->|"∂a³/∂z³"| Z3["z³ = W³a² + b³"]
Z3 -->|"∂z³/∂W³, ∂z³/∂b³"| W3["W³, b³"]
Z3 -->|"∂z³/∂a²"| A2["Hidden a²"]
A2 -->|"∂a²/∂z²"| Z2["z² = W²a¹ + b²"]
Z2 -->|"∂z²/∂W², ∂z²/∂b²"| W2["W², b²"]
Z2 -->|"∂z²/∂a¹"| A1["Hidden a¹"]
A1 -->|"∂a¹/∂z¹"| Z1["z¹ = W¹x + b¹"]
Z1 -->|"∂z¹/∂W¹, ∂z¹/∂b¹"| W1["W¹, b¹"]
style L fill:#f44336
style W3 fill:#4caf50
style W2 fill:#4caf50
style W1 fill:#4caf50
Backpropagation Mathematics (Simplified):
For the output layer \(L\):
\[ \delta^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[L]}} = \mathbf{a}^{[L]} - \mathbf{y} \]
For hidden layer \(l\):
\[ \delta^{[l]} = (\mathbf{W}^{[l+1]})^T \delta^{[l+1]} \odot \sigma'(\mathbf{z}^{[l]}) \]
Gradients for the parameters:
\[ \begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} &= \frac{1}{m} \delta^{[l]} (\mathbf{a}^{[l-1]})^T \\ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} &= \frac{1}{m} \sum_{i} \delta^{[l](i)} \end{aligned} \]
You do not need to implement backpropagation by hand! Modern frameworks (TensorFlow, PyTorch) use automatic differentiation to compute gradients automatically.
Understanding backpropagation is still important for: 1. Debugging training issues 2. Designing custom architectures 3. Understanding vanishing/exploding gradients
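To make the chain rule concrete, consider the smallest possible case: one sigmoid neuron with binary cross-entropy, where the chain rule collapses to dL/dw = (a − y)·x (the same (a − y) term as in the output-layer formula above). A gradient check against a numerical derivative confirms it; the parameter values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b, x, y = 0.5, -0.3, 1.2, 1.0
a = sigmoid(w * x + b)
analytic = (a - y) * x  # chain rule: dL/da * da/dz * dz/dw = (a - y) * x

# Central-difference numerical gradient as a sanity check
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numeric)
```

This kind of gradient check is also a standard debugging tool when implementing custom layers.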
5.3.3 Gradient Descent Variants
Gradient descent updates the parameters in the direction that reduces the loss.
Basic Update Rule:
\[ \mathbf{W} := \mathbf{W} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}} \]
Where \(\alpha\) is the learning rate.
Variants:
| Method | Update | Characteristics |
|---|---|---|
| Batch GD | Uses all the data | Accurate but slow, memory intensive |
| Stochastic GD | Uses 1 sample | Fast but noisy, poor convergence |
| Mini-batch GD | Uses a batch (32-512) | Balance of speed and stability |
| SGD + Momentum | \(v := \beta v + (1-\beta) \nabla\) | Smoother, faster convergence |
| RMSprop | Adaptive learning rate per parameter | Good for RNNs |
| Adam | Momentum + RMSprop | Default choice: adaptive, robust |
import numpy as np
import matplotlib.pyplot as plt
# Simulate optimization with several optimizers
def optimize_path(optimizer_func, start_point, learning_rate, iterations):
    """Simulate an optimization path"""
    path = [start_point]
    point = np.array(start_point, dtype=float)
    for _ in range(iterations):
        # Simple 2D function: f(x,y) = x² + 4y²
        gradient = np.array([2*point[0], 8*point[1]])
        point = optimizer_func(point, gradient, learning_rate)
        path.append(point.copy())
    return np.array(path)

def sgd(point, gradient, lr):
    """Standard SGD"""
    return point - lr * gradient

def sgd_momentum(point, gradient, lr, momentum=0.9):
    """SGD with momentum"""
    if not hasattr(sgd_momentum, 'velocity') or sgd_momentum.velocity is None:
        sgd_momentum.velocity = np.zeros_like(point)
    sgd_momentum.velocity = momentum * sgd_momentum.velocity + lr * gradient
    return point - sgd_momentum.velocity

def adam_optimizer(point, gradient, lr, beta1=0.9, beta2=0.999):
    """Adam optimizer (simplified)"""
    if not hasattr(adam_optimizer, 'm') or adam_optimizer.m is None:
        adam_optimizer.m = np.zeros_like(point)
        adam_optimizer.v = np.zeros_like(point)
        adam_optimizer.t = 0
    adam_optimizer.t += 1
    adam_optimizer.m = beta1 * adam_optimizer.m + (1 - beta1) * gradient
    adam_optimizer.v = beta2 * adam_optimizer.v + (1 - beta2) * (gradient ** 2)
    m_hat = adam_optimizer.m / (1 - beta1 ** adam_optimizer.t)
    v_hat = adam_optimizer.v / (1 - beta2 ** adam_optimizer.t)
    return point - lr * m_hat / (np.sqrt(v_hat) + 1e-8)
# Visualize the convergence paths
start = [2.0, 2.0]
iterations = 50
# Reset the momentum/Adam state
sgd_momentum.velocity = None
adam_optimizer.m = None
adam_optimizer.v = None
adam_optimizer.t = None
path_sgd = optimize_path(sgd, start, learning_rate=0.1, iterations=iterations)
# Momentum path (velocity already reset above)
path_momentum = optimize_path(
lambda p, g, lr: sgd_momentum(p, g, lr, momentum=0.9),
start, learning_rate=0.1, iterations=iterations
)
# Adam path (state already reset above)
path_adam = optimize_path(
lambda p, g, lr: adam_optimizer(p, g, lr),
start, learning_rate=0.1, iterations=iterations
)
# Plot
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
# Contour plot of loss landscape
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = X**2 + 4*Y**2 # Loss function
contour = ax.contour(X, Y, Z, levels=20, alpha=0.3)
ax.clabel(contour, inline=True, fontsize=8)
# Plot paths
ax.plot(path_sgd[:, 0], path_sgd[:, 1], 'o-', label='SGD', linewidth=2, markersize=4)
ax.plot(path_momentum[:, 0], path_momentum[:, 1], 's-', label='SGD + Momentum', linewidth=2, markersize=4)
ax.plot(path_adam[:, 0], path_adam[:, 1], '^-', label='Adam', linewidth=2, markersize=4)
# Mark start and optimum
ax.plot(*start, 'r*', markersize=20, label='Start')
ax.plot(0, 0, 'g*', markersize=20, label='Optimum')
ax.set_xlabel('Parameter 1', fontsize=12)
ax.set_ylabel('Parameter 2', fontsize=12)
ax.set_title('Optimizer Convergence Paths', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Optimizer Comparison:")
print("-" * 70)
print(f"SGD          : {iterations} iterations, final point: {path_sgd[-1]}")
print(f"SGD+Momentum : {iterations} iterations, final point: {path_momentum[-1]}")
print(f"Adam         : {iterations} iterations, final point: {path_adam[-1]}")
print("\n✅ Adam converges most smoothly and efficiently!")
Default Recommendations:
- Adam: best all-around choice, adaptive learning rates
- SGD + Momentum: good for very large datasets, simple
- RMSprop: an alternative for RNN/LSTM
Hyperparameters:
- Learning rate: start with 0.001 (Adam) or 0.01 (SGD)
- Beta1 (momentum): 0.9 (default)
- Beta2 (RMSprop): 0.999 (default)
5.3.4 Learning Rate and Scheduling
The learning rate is the most important hyperparameter in deep learning.
Impact of the Learning Rate:
- Too small: training is very slow and can get stuck in local minima
- Too large: unstable, divergence, the loss explodes
- Just right: smooth convergence, optimal final performance
import numpy as np
import matplotlib.pyplot as plt
# Simulate training with several learning rates
def simulate_training(learning_rate, iterations=100):
    """Simulate a simple loss curve"""
    np.random.seed(42)
    loss = 10.0
    losses = []
    for i in range(iterations):
        # Simulated gradient (decreasing over time)
        gradient = 5.0 * np.exp(-i/30) + np.random.normal(0, 0.5)
        loss = max(0.01, loss - learning_rate * gradient)
        losses.append(loss)
    return losses
# A range of learning rates
lrs = [0.001, 0.01, 0.1, 0.5, 1.0]
colors = ['#4CAF50', '#2196F3', '#FF9800', '#F44336', '#9C27B0']
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
for lr, color in zip(lrs, colors):
    losses = simulate_training(lr, iterations=100)
    ax.plot(losses, linewidth=2.5, label=f'LR = {lr}', color=color)
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Impact of Learning Rate on Training', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 12])
plt.tight_layout()
plt.show()
print("Learning Rate Guidelines:")
print("-" * 70)
print("0.001 - 0.01 : Safe default for the Adam optimizer")
print("0.01 - 0.1   : For SGD with momentum")
print("0.1 - 1.0    : Too large for most cases, unstable")
print("\n💡 Tip: use a learning rate scheduler for adaptive adjustment!")
Learning Rate Scheduling Strategies:
Step Decay: reduce the LR by a factor every N epochs
LR = LR₀ × γ^(epoch / step_size)
Exponential Decay: smooth exponential decrease
LR = LR₀ × e^(-λ × epoch)
Cosine Annealing: smooth decrease following a cosine curve
LR = LR_min + 0.5 × (LR_max - LR_min) × (1 + cos(π × epoch / max_epochs))
ReduceLROnPlateau: reduce when the validation loss plateaus
- Monitor a validation metric
- Reduce the LR if it does not improve for N epochs
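The first three schedules above are simple enough to sketch directly; the function names and the interpretation of `epoch / step_size` as an integer step count are illustrative assumptions, not any framework's API:

```python
import math

def step_decay(lr0, gamma, step_size, epoch):
    """Multiply the LR by gamma once every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)

def exp_decay(lr0, lam, epoch):
    """Smooth exponential decrease with rate lam."""
    return lr0 * math.exp(-lam * epoch)

def cosine_anneal(lr_min, lr_max, epoch, max_epochs):
    """Cosine curve from lr_max (epoch 0) down to lr_min (max_epochs)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / max_epochs))

print(step_decay(0.1, 0.5, 10, 25))      # two decay steps: 0.025
print(cosine_anneal(0.0, 1.0, 50, 100))  # halfway down the cosine curve
```

In practice one rarely writes these by hand; Keras and PyTorch ship equivalent schedulers, but the formulas they implement are exactly the ones above.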
5.3.5 Batch Size Considerations
The batch size affects:
- Memory usage: larger batch = more memory
- Training speed: larger batch = fewer updates per epoch
- Generalization: smaller batches often generalize better
- Gradient stability: larger batch = more stable gradients
Common Batch Sizes:
- Small: 16-32 (better generalization, less memory)
- Medium: 32-128 (balanced)
- Large: 256-512+ (faster on GPU, may need LR adjustment)
When increasing the batch size, consider:
- Increasing the learning rate proportionally (the linear scaling rule)
- Using warmup: gradually increase the LR at the start of training
Rule of thumb: new_LR = base_LR × (new_batch_size / base_batch_size)
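The rule of thumb above is a one-liner in code; this helper is purely illustrative:

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: scale the LR in proportion to the batch size."""
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.1, 256, 1024))  # 4x the batch size -> 4x the learning rate
```

Treat the result as a starting point only; with very large batches, warmup and a short tuning sweep are still advisable.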
5.4 Regularization Techniques
Regularization prevents overfitting by constraining model complexity.
5.4.1 Overfitting in Neural Networks
Signs of Overfitting:
- The training loss keeps decreasing while the validation loss increases
- A large gap between training and validation accuracy
- The model performs poorly on unseen data
import numpy as np
import matplotlib.pyplot as plt
# Simulate an overfitting scenario
np.random.seed(42)
epochs = np.arange(1, 101)
# Training loss: monotonically decreasing
train_loss = 2.0 * np.exp(-epochs/15) + 0.05
# Validation loss: decreases then increases (overfitting)
val_loss = 2.0 * np.exp(-epochs/20) + 0.1 + 0.003 * (epochs - 30)**2 * (epochs > 30)
# Training accuracy: monotonically increasing
train_acc = 1.0 - 1.0 * np.exp(-epochs/12)
# Validation accuracy: increases then plateaus/decreases
val_acc = 1.0 - 1.0 * np.exp(-epochs/15) - 0.002 * (epochs - 40) * (epochs > 40)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Loss plot
ax1.plot(epochs, train_loss, linewidth=3, label='Training Loss', color='#2196F3')
ax1.plot(epochs, val_loss, linewidth=3, label='Validation Loss', color='#F44336')
ax1.axvline(x=30, color='g', linestyle='--', alpha=0.7, label='Optimal Stopping Point')
ax1.fill_between(epochs, 0, 3, where=(epochs > 30), alpha=0.2, color='red', label='Overfitting Zone')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Overfitting: Loss Curves', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 2.5])
# Accuracy plot
ax2.plot(epochs, train_acc, linewidth=3, label='Training Accuracy', color='#2196F3')
ax2.plot(epochs, val_acc, linewidth=3, label='Validation Accuracy', color='#F44336')
ax2.axvline(x=30, color='g', linestyle='--', alpha=0.7, label='Optimal Stopping Point')
ax2.fill_between(epochs, 0, 1, where=(epochs > 30), alpha=0.2, color='red', label='Overfitting Zone')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy', fontsize=12)
ax2.set_title('Overfitting: Accuracy Curves', fontsize=13, fontweight='bold')
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 1.05])
plt.tight_layout()
plt.show()
print("Overfitting Detection:")
print("-" * 70)
print("✅ Training loss keeps falling while validation loss rises → OVERFIT")
print("✅ Large gap between training and validation metrics → OVERFIT")
print("✅ Model perfect on training data but poor on test data → OVERFIT")
print("\n💡 Solution: regularization techniques!")
5.4.2 Dropout Technique
Dropout is a regularization technique that randomly drops neurons during training.
How It Works:
- At each training step, randomly set a fraction p of neurons to zero
- Scale the remaining neurons by 1/(1-p) to maintain the expected output
- At inference (testing), use all neurons (no dropout)
Effects of Dropout:
- Prevents neurons from becoming too specialized (co-adaptation)
- Ensemble effect: trains many "sub-networks"
- Forces the network to learn robust features
graph LR
subgraph Normal["Normal Forward Pass"]
I1["Input"] --> H1["Hidden"]
H1 --> H2["Hidden"]
H2 --> O1["Output"]
end
subgraph Dropout["With Dropout (p=0.5)"]
I2["Input"] --> H3["Hidden ✓"]
I2 -.->|dropped| H4["Hidden ✗"]
H3 --> H5["Hidden ✓"]
H3 -.->|dropped| H6["Hidden ✗"]
H5 --> O2["Output"]
end
style H4 fill:#ffcdd2,stroke:#f44336
style H6 fill:#ffcdd2,stroke:#f44336
style H3 fill:#c8e6c9,stroke:#4caf50
style H5 fill:#c8e6c9,stroke:#4caf50
import numpy as np
import matplotlib.pyplot as plt
class DropoutLayer:
"""Dropout layer implemented from scratch"""
def __init__(self, dropout_rate=0.5):
"""
Args:
dropout_rate: Fraction of neurons to drop (0.0 - 1.0)
"""
self.dropout_rate = dropout_rate
self.mask = None
def forward(self, X, training=True):
"""
Forward pass with dropout
Args:
X: Input activations
training: If True, apply dropout; if False, no dropout
"""
if training:
# Generate random mask
self.mask = np.random.binomial(1, 1 - self.dropout_rate, size=X.shape)
# Apply the mask and scale
return X * self.mask / (1 - self.dropout_rate)
else:
# No dropout during inference
return X
def backward(self, dout):
"""Backward pass: propagate gradients only through active neurons"""
return dout * self.mask / (1 - self.dropout_rate)
# Demo: visualize the dropout effect
np.random.seed(42)
layer_size = 20
input_activations = np.random.randn(layer_size)
dropout_rates = [0.0, 0.2, 0.5, 0.8]
fig, axes = plt.subplots(1, len(dropout_rates), figsize=(16, 4))
for idx, dropout_rate in enumerate(dropout_rates):
dropout = DropoutLayer(dropout_rate=dropout_rate)
output = dropout.forward(input_activations.copy(), training=True)
# Visualize
ax = axes[idx]
neurons_active = (dropout.mask > 0)
colors = ['#4CAF50' if active else '#F44336' for active in neurons_active]
ax.bar(range(layer_size), np.abs(input_activations), color='lightgray', alpha=0.5, label='Original')
ax.bar(range(layer_size), np.abs(output), color=colors, alpha=0.8, label='After Dropout')
ax.set_title(f'Dropout Rate = {dropout_rate}\n({int((1-dropout_rate)*100)}% neurons active)',
fontsize=12, fontweight='bold')
ax.set_xlabel('Neuron Index')
ax.set_ylabel('Activation')
ax.set_ylim([0, 4])
if idx == 0:
ax.legend()
plt.tight_layout()
plt.show()
print("Dropout Guidelines:")
print("-" * 70)
print("Dropout Rate 0.0 : No dropout (no regularization)")
print("Dropout Rate 0.2 : Light regularization")
print("Dropout Rate 0.5 : Standard choice (recommended)")
print("Dropout Rate 0.8 : Heavy regularization (might underfit)")
print("\n✅ Typical: 0.2-0.5 for hidden layers, 0.1-0.2 for the input layer")
Where to apply:
- Hidden layers (after activation)
- Input layer (lower rate: 0.1-0.2)
- Avoid: Output layer
Dropout rates:
- Default: 0.5 for hidden layers
- Input layer: 0.1-0.2 (lower rate)
- Deep networks: 0.2-0.4 (lower rates for deeper networks)
When NOT to use Dropout:
- Convolutional layers (use other regularization)
- Small datasets (it might cause underfitting)
- When Batch Normalization is already used
5.4.3 L1/L2 Regularization
L2 Regularization (Weight Decay):
Add a penalty term to the loss function:
\[ \mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{original}} + \frac{\lambda}{2m} \sum_{l} \|\mathbf{W}^{[l]}\|_F^2 \]
Effect:
- Penalizes large weights
- Weights shrink toward zero (but not exactly zero)
- Prevents overfitting by forcing a simpler model
L1 Regularization:
\[ \mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{original}} + \frac{\lambda}{m} \sum_{l} \|\mathbf{W}^{[l]}\|_1 \]
Effect:
- Promotes sparsity (many weights become exactly zero)
- Feature selection effect
- Less common in deep learning
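To make the weight-decay effect concrete, here is a small NumPy sketch of one gradient step under the L2 penalty from the formula above (the weights, λ, and learning rate are illustrative values):

```python
import numpy as np

def sgd_step_l2(W, data_grad, lr, lam, m):
    """SGD step on L_reg = L + (lam/(2m))*||W||^2.
    The penalty's gradient is (lam/m)*W, so it pulls every weight toward zero."""
    return W - lr * (data_grad + (lam / m) * W)

W = np.array([2.0, -3.0])
# With the data gradient set to zero, the penalty alone shrinks W
# multiplicatively by a factor (1 - lr*lam/m) = 0.95 here
W_new = sgd_step_l2(W, np.zeros(2), lr=0.1, lam=0.5, m=1)
```

This multiplicative shrinkage is why L2 is called "weight decay": absent any data signal, weights decay geometrically toward zero but never reach it exactly, in contrast to L1's sparsity.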
import numpy as np
import matplotlib.pyplot as plt
# Visualize the effect of L2 regularization
np.random.seed(42)
# Generate weight distributions
weights_no_reg = np.random.randn(1000) * 2.0
weights_l2_light = weights_no_reg * 0.7
weights_l2_heavy = weights_no_reg * 0.4
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# No regularization
axes[0].hist(weights_no_reg, bins=50, color='#F44336', alpha=0.7, edgecolor='black')
axes[0].axvline(x=0, color='k', linestyle='--', linewidth=2)
axes[0].set_title('No Regularization\nλ = 0', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Weight Value')
axes[0].set_ylabel('Frequency')
axes[0].set_xlim([-6, 6])
# Light L2
axes[1].hist(weights_l2_light, bins=50, color='#FF9800', alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='k', linestyle='--', linewidth=2)
axes[1].set_title('Light L2 Regularization\nλ = 0.01', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Weight Value')
axes[1].set_ylabel('Frequency')
axes[1].set_xlim([-6, 6])
# Heavy L2
axes[2].hist(weights_l2_heavy, bins=50, color='#4CAF50', alpha=0.7, edgecolor='black')
axes[2].axvline(x=0, color='k', linestyle='--', linewidth=2)
axes[2].set_title('Heavy L2 Regularization\nλ = 0.1', fontsize=13, fontweight='bold')
axes[2].set_xlabel('Weight Value')
axes[2].set_ylabel('Frequency')
axes[2].set_xlim([-6, 6])
plt.tight_layout()
plt.show()
print("L2 Regularization Effect:")
print("-" * 70)
print("✅ Weights shrink toward zero (weight decay)")
print("✅ Prevents extreme weight values")
print("✅ Model becomes more robust, less sensitive to individual features")
print("\nTypical λ values:")
print("  Small:  0.0001 - 0.001")
print("  Medium: 0.01 - 0.1")
print("  Large:  0.1+")
5.4.4 Early Stopping
Early stopping: stop training when validation performance stops improving.
Algorithm:
- Monitor the validation loss every epoch
- Track the best validation loss
- If it does not improve for N epochs (patience), stop
- Return the model with the best validation performance
import numpy as np
import matplotlib.pyplot as plt
# Simulate early stopping
np.random.seed(42)
epochs = np.arange(1, 101)
# Training loss: monotonically decreasing
train_loss = 2.0 * np.exp(-epochs/15) + 0.05 + np.random.normal(0, 0.02, len(epochs))
# Validation loss: decreases then increases
val_loss = 2.0 * np.exp(-epochs/20) + 0.1 + 0.003 * (epochs - 30)**2 * (epochs > 30)
val_loss += np.random.normal(0, 0.05, len(epochs))
# Early stopping logic
patience = 10
best_val_loss = float('inf')
best_epoch = 0
patience_counter = 0
for epoch in range(len(epochs)):
if val_loss[epoch] < best_val_loss:
best_val_loss = val_loss[epoch]
best_epoch = epoch
patience_counter = 0
else:
patience_counter += 1
if patience_counter >= patience:
stop_epoch = epoch
break
else:
stop_epoch = len(epochs) - 1
# Visualization
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(epochs, train_loss, linewidth=2.5, label='Training Loss', color='#2196F3')
ax.plot(epochs, val_loss, linewidth=2.5, label='Validation Loss', color='#F44336')
ax.axvline(x=best_epoch+1, color='g', linestyle='--', linewidth=2,
label=f'Best Model (epoch {best_epoch+1})')
ax.axvline(x=stop_epoch+1, color='r', linestyle='--', linewidth=2,
label=f'Early Stop (epoch {stop_epoch+1})')
ax.scatter([best_epoch+1], [val_loss[best_epoch]], color='g', s=200, zorder=5, marker='*')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title(f'Early Stopping (Patience = {patience})', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, 2.5])
plt.tight_layout()
plt.show()
print("Early Stopping Summary:")
print("-" * 70)
print(f"Best validation loss: {best_val_loss:.4f} at epoch {best_epoch+1}")
print(f"Training stopped at epoch: {stop_epoch+1}")
print(f"Epochs saved by stopping early: {len(epochs) - (stop_epoch + 1)}")
print(f"Patience: {patience} epochs")
print("\n✅ Early stopping prevents wasted compute and overfitting!")
Patience:
- Small datasets: 5-10 epochs
- Large datasets: 10-20 epochs
- Very large: 20-50 epochs
What to monitor:
- Primary: Validation loss
- Alternative: Validation accuracy/F1 (for imbalanced data)
Bonus:
- Save a checkpoint at each new best epoch
- Restore the best weights after training
5.4.5 Batch Normalization Basics
Batch Normalization normalizes the activations at each layer to stabilize training.
Benefits:
- Faster convergence
- Higher learning rates can be used
- Less sensitive to initialization
- Acts as regularization (similar to dropout)
Formula:
\[ \begin{aligned} \mu_B &= \frac{1}{m} \sum_{i=1}^{m} x_i \\ \sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \\ \hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\ y_i &= \gamma \hat{x}_i + \beta \end{aligned} \]
where \(\gamma\) and \(\beta\) are learnable parameters.
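The four equations above map almost line-for-line onto NumPy; a minimal sketch of the training-mode forward pass (ignoring the running statistics a real layer keeps for inference; the input values are illustrative):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm forward pass: normalize each feature over the mini-batch,
    then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # batch of 3 samples, 2 features
out = batch_norm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
# Each feature column of `out` now has (near) zero mean and unit variance
```

With γ = 1 and β = 0 the layer is a pure normalizer; during training the network learns γ and β, so it can undo the normalization wherever that helps.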
Batch Normalization is especially effective for:
- Deep networks (>5 layers)
- Convolutional Neural Networks
- Large batch sizes
Trade-off:
- Extra computation
- Behavior differs between training and inference
- Less effective for small batch sizes (<16)
Alternatives:
- Layer Normalization (for RNNs/Transformers)
- Group Normalization (for small batches)
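For contrast with batch normalization, a minimal sketch of Layer Normalization, which normalizes across the features within each sample (axis 1) rather than across the batch, so it works regardless of batch size (the input values are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: normalize each sample over its own features."""
    mu = x.mean(axis=1, keepdims=True)   # per-sample mean
    var = x.var(axis=1, keepdims=True)   # per-sample variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
out = layer_norm(x, gamma=np.ones(3), beta=np.zeros(3))
# Each row of `out` now has (near) zero mean and unit variance,
# even though the two samples have very different scales
```

Because the statistics are computed per sample, the behavior is identical at training and inference time, which is one reason Transformers prefer it.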
5.5 Implementation with Keras
Keras is a high-level API on top of TensorFlow that is very user-friendly for building neural networks.
5.5.1 Keras Sequential API
The Sequential API suits a linear stack of layers (the most common architecture).
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")
# Load dataset: Binary classification
data = load_breast_cancer()
X, y = data.data, data.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Preprocessing: Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(f"\nDataset: Breast Cancer (Binary Classification)")
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]}")
print(f"Classes: {np.unique(y_train)}")
# Build an MLP with the Sequential API
model = models.Sequential([
# Input layer (implicit)
layers.Dense(64, activation='relu', input_shape=(X_train.shape[1],), name='hidden1'),
layers.Dropout(0.3),
layers.Dense(32, activation='relu', name='hidden2'),
layers.Dropout(0.3),
layers.Dense(16, activation='relu', name='hidden3'),
# Output layer
layers.Dense(1, activation='sigmoid', name='output')
], name='MLP_Classifier')
# Model summary
print("\n" + "="*70)
print("MODEL ARCHITECTURE")
print("="*70)
model.summary()
# Compile model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()]
)
# Training with early stopping and a model checkpoint
early_stop = keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=15,
restore_best_weights=True,
verbose=1
)
checkpoint = keras.callbacks.ModelCheckpoint(
'best_model.keras',
monitor='val_loss',
save_best_only=True,
verbose=0
)
# Train model
print("\n" + "="*70)
print("TRAINING PROCESS")
print("="*70)
history = model.fit(
X_train, y_train,
validation_split=0.2,
epochs=100,
batch_size=32,
callbacks=[early_stop, checkpoint],
verbose=0  # Set to 1 to see training progress
)
print(f"Training completed: {len(history.history['loss'])} epochs")
# Evaluation
train_results = model.evaluate(X_train, y_train, verbose=0)
test_results = model.evaluate(X_test, y_test, verbose=0)
print("\n" + "="*70)
print("EVALUATION RESULTS")
print("="*70)
print(f"Training - Loss: {train_results[0]:.4f}, Accuracy: {train_results[1]:.4f}")
print(f"Test - Loss: {test_results[0]:.4f}, Accuracy: {test_results[1]:.4f}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss plot
axes[0].plot(history.history['loss'], linewidth=2, label='Training Loss')
axes[0].plot(history.history['val_loss'], linewidth=2, label='Validation Loss')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Training History: Loss', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Accuracy plot
axes[1].plot(history.history['accuracy'], linewidth=2, label='Training Accuracy')
axes[1].plot(history.history['val_accuracy'], linewidth=2, label='Validation Accuracy')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Training History: Accuracy', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Key Points of Keras Sequential:
- Layers stacked linearly: Sequential([layer1, layer2, ...])
- Input shape: specify it in the first layer only
- Activation: can be inline (activation='relu') or a separate layer
- Compile: specify the optimizer, loss, and metrics
- Fit: train with model.fit()
- Callbacks: early stopping, checkpointing, LR scheduling
5.5.2 Keras Functional API
The Functional API is more flexible and suits complex architectures (multi-input, multi-output, skip connections).
from tensorflow import keras
from tensorflow.keras import layers, Model
# Build a complex architecture with the Functional API
def build_functional_mlp(input_dim, num_classes):
"""
Build an MLP with skip connections (residual-like)
"""
# Input layer
inputs = layers.Input(shape=(input_dim,), name='input')
# First branch
x1 = layers.Dense(64, activation='relu', name='branch1_dense1')(inputs)
x1 = layers.Dropout(0.3)(x1)
x1 = layers.Dense(32, activation='relu', name='branch1_dense2')(x1)
# Second branch (shorter path)
x2 = layers.Dense(32, activation='relu', name='branch2_dense')(inputs)
# Merge branches (skip connection)
merged = layers.Add(name='merge')([x1, x2])
merged = layers.Activation('relu')(merged)
# Final layers
x = layers.Dense(16, activation='relu', name='final_dense')(merged)
x = layers.Dropout(0.2)(x)
# Output layer
outputs = layers.Dense(num_classes, activation='softmax', name='output')(x)
# Create model
model = Model(inputs=inputs, outputs=outputs, name='Functional_MLP')
return model
# Create model
functional_model = build_functional_mlp(input_dim=30, num_classes=2)
# Model summary
print("="*70)
print("FUNCTIONAL API MODEL")
print("="*70)
functional_model.summary()
# Visualize the architecture
keras.utils.plot_model(
functional_model,
to_file='functional_mlp_architecture.png',
show_shapes=True,
show_layer_names=True,
rankdir='TB',
dpi=96
)
print("\n✅ Model architecture saved to: functional_mlp_architecture.png")
print("\n💡 The Functional API enables:")
print("   - Multiple inputs/outputs")
print("   - Skip connections (ResNet-style)")
print("   - Shared layers")
print("   - Complex graph topologies")
Use Sequential when:
- Linear stack of layers
- Simple feed-forward architecture
- Quick prototyping
Use Functional when:
- Multi-input or multi-output models
- Skip connections (ResNet, DenseNet)
- Shared layers
- Complex architectures
5.5.3 Model Saving and Loading
# Save entire model (architecture + weights + optimizer state)
model.save('complete_model.keras')
print("✅ Complete model saved: complete_model.keras")
# Save only the weights
model.save_weights('model_weights.weights.h5')
print("✅ Weights saved: model_weights.weights.h5")
# Load the complete model
loaded_model = keras.models.load_model('complete_model.keras')
print("✅ Model loaded successfully")
# Verify the loaded model
test_loss, test_acc = loaded_model.evaluate(X_test, y_test, verbose=0)
print(f"Loaded model accuracy: {test_acc:.4f}")
# Export for production (TensorFlow SavedModel format)
model.export('exported_model')
print("✅ Model exported for production: exported_model/")
5.6 Implementation with PyTorch
PyTorch offers lower-level control and is very popular for research.
5.6.1 PyTorch nn.Module
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
print(f"PyTorch version: {torch.__version__}")
# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
# Load dataset: Multi-class classification
iris = load_iris()
X, y = iris.data, iris.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train).to(device)
y_train_tensor = torch.LongTensor(y_train).to(device)
X_test_tensor = torch.FloatTensor(X_test).to(device)
y_test_tensor = torch.LongTensor(y_test).to(device)
# Create DataLoader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
print(f"\nDataset: Iris (Multi-class Classification)")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X_train.shape[1]}")
print(f"Classes: {len(np.unique(y_train))}")
# Define the MLP model with PyTorch
class MLPClassifier(nn.Module):
"""Custom MLP classifier using PyTorch"""
def __init__(self, input_dim, hidden_dims, num_classes, dropout_rate=0.3):
"""
Args:
input_dim: Number of input features
hidden_dims: List of hidden layer sizes, e.g., [64, 32, 16]
num_classes: Number of output classes
dropout_rate: Dropout probability
"""
super(MLPClassifier, self).__init__()
# Build layers dynamically
layers = []
prev_dim = input_dim
for hidden_dim in hidden_dims:
layers.append(nn.Linear(prev_dim, hidden_dim))
layers.append(nn.ReLU())
layers.append(nn.Dropout(dropout_rate))
prev_dim = hidden_dim
# Output layer
layers.append(nn.Linear(prev_dim, num_classes))
# Create sequential module
self.network = nn.Sequential(*layers)
def forward(self, x):
"""Forward pass"""
return self.network(x)
# Create model
model = MLPClassifier(
input_dim=X_train.shape[1],
hidden_dims=[32, 16, 8],
num_classes=len(np.unique(y_train)),
dropout_rate=0.2
).to(device)
print("\n" + "="*70)
print("PYTORCH MODEL ARCHITECTURE")
print("="*70)
print(model)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
def train_epoch(model, loader, criterion, optimizer, device):
"""Train for one epoch"""
model.train()
total_loss = 0
correct = 0
total = 0
for batch_X, batch_y in loader:
# Forward pass
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Statistics
total_loss += loss.item()
_, predicted = torch.max(outputs.data, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
avg_loss = total_loss / len(loader)
accuracy = correct / total
return avg_loss, accuracy
def evaluate(model, X, y, criterion, device):
"""Evaluate model"""
model.eval()
with torch.no_grad():
outputs = model(X)
loss = criterion(outputs, y)
_, predicted = torch.max(outputs.data, 1)
accuracy = (predicted == y).sum().item() / y.size(0)
return loss.item(), accuracy
# Training
print("\n" + "="*70)
print("TRAINING PROCESS")
print("="*70)
epochs = 100
train_losses = []
train_accs = []
test_losses = []
test_accs = []
for epoch in range(epochs):
train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
test_loss, test_acc = evaluate(model, X_test_tensor, y_test_tensor, criterion, device)
train_losses.append(train_loss)
train_accs.append(train_acc)
test_losses.append(test_loss)
test_accs.append(test_acc)
if (epoch + 1) % 20 == 0:
print(f"Epoch [{epoch+1}/{epochs}] - "
f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} | "
f"Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.4f}")
print("\n" + "="*70)
print("FINAL RESULTS")
print("="*70)
print(f"Final Train Accuracy: {train_accs[-1]:.4f}")
print(f"Final Test Accuracy: {test_accs[-1]:.4f}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss
axes[0].plot(train_losses, linewidth=2, label='Training Loss')
axes[0].plot(test_losses, linewidth=2, label='Test Loss')
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('PyTorch Training: Loss', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Accuracy
axes[1].plot(train_accs, linewidth=2, label='Training Accuracy')
axes[1].plot(test_accs, linewidth=2, label='Test Accuracy')
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('PyTorch Training: Accuracy', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
5.6.2 PyTorch Model Saving and Loading
# Save model weights
torch.save(model.state_dict(), 'pytorch_model_weights.pth')
print("✅ Model weights saved: pytorch_model_weights.pth")
# Save complete checkpoint (model + optimizer state)
checkpoint = {
'epoch': epochs,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'train_loss': train_losses[-1],
'test_loss': test_losses[-1]
}
torch.save(checkpoint, 'pytorch_checkpoint.pth')
print("✅ Complete checkpoint saved: pytorch_checkpoint.pth")
# Load model weights
loaded_model = MLPClassifier(
input_dim=X_train.shape[1],
hidden_dims=[32, 16, 8],
num_classes=len(np.unique(y_train)),
dropout_rate=0.2
).to(device)
loaded_model.load_state_dict(torch.load('pytorch_model_weights.pth', weights_only=True))
loaded_model.eval()
print("✅ Model weights loaded successfully")
# Verify
test_loss, test_acc = evaluate(loaded_model, X_test_tensor, y_test_tensor, criterion, device)
print(f"Loaded model test accuracy: {test_acc:.4f}")
5.6.3 Keras vs PyTorch: Quick Comparison
| Aspect | Keras | PyTorch |
|---|---|---|
| Ease of Use | Very easy, high-level | Moderate, more verbose |
| Flexibility | Limited (functional API helps) | Very flexible |
| Debugging | Harder (compiled graphs) | Easier (eager execution) |
| Research | Less common | Very popular |
| Production | TF ecosystem (TFLite, TF Serving) | TorchServe, ONNX |
| Community | Large (TF ecosystem) | Large (research-focused) |
| Learning Curve | Gentle | Steeper |
Recommendation:
- Beginners: start with Keras (easier)
- Research: PyTorch (more control)
- Production (mobile/edge): Keras/TensorFlow
- Production (server): Either works
5.7 Practical Example: Bank Marketing Campaign
Let's apply everything we have learned to a real-world problem.
Problem: predict whether a customer will subscribe to a term deposit based on marketing campaign data.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow import keras
from tensorflow.keras import layers, models, callbacks
# Simulate bank marketing dataset (simplified version)
np.random.seed(42)
n_samples = 5000
data = {
'age': np.random.randint(18, 70, n_samples),
'balance': np.random.randint(-5000, 50000, n_samples),
'duration': np.random.randint(0, 3000, n_samples),
'campaign': np.random.randint(1, 50, n_samples),
'previous': np.random.randint(0, 40, n_samples),
'job': np.random.choice(['admin', 'technician', 'services', 'management'], n_samples),
'education': np.random.choice(['primary', 'secondary', 'tertiary'], n_samples),
'subscribed': np.random.choice([0, 1], n_samples, p=[0.85, 0.15]) # Imbalanced
}
df = pd.DataFrame(data)
print("="*70)
print("BANK MARKETING DATASET")
print("="*70)
print(df.head(10))
print(f"\nDataset shape: {df.shape}")
print(f"\nClass distribution:")
print(df['subscribed'].value_counts())
print(f"Class imbalance ratio: {df['subscribed'].value_counts()[0] / df['subscribed'].value_counts()[1]:.2f}:1")
# Preprocessing
# Encode categorical variables
le_job = LabelEncoder()
le_education = LabelEncoder()
df['job_encoded'] = le_job.fit_transform(df['job'])
df['education_encoded'] = le_education.fit_transform(df['education'])
# Select features
feature_cols = ['age', 'balance', 'duration', 'campaign', 'previous',
'job_encoded', 'education_encoded']
X = df[feature_cols].values
y = df['subscribed'].values
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Build MLP model
model = models.Sequential([
layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
layers.BatchNormalization(),
layers.Dropout(0.4),
layers.Dense(64, activation='relu'),
layers.BatchNormalization(),
layers.Dropout(0.3),
layers.Dense(32, activation='relu'),
layers.Dropout(0.2),
layers.Dense(1, activation='sigmoid')
], name='Bank_Marketing_MLP')
# Compile
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall(), keras.metrics.AUC()]
)
# Callbacks
early_stop = callbacks.EarlyStopping(
monitor='val_auc',
patience=20,
restore_best_weights=True,
mode='max',
verbose=1
)
lr_schedule = callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=10,
verbose=1
)
# Training
print("\n" + "="*70)
print("TRAINING")
print("="*70)
history = model.fit(
X_train, y_train,
validation_split=0.2,
epochs=100,
batch_size=64,
callbacks=[early_stop, lr_schedule],
verbose=0
)
print(f"Training completed: {len(history.history['loss'])} epochs")
# Evaluation
y_pred_proba = model.predict(X_test, verbose=0).ravel()
y_pred = (y_pred_proba >= 0.5).astype(int)
print("\n" + "="*70)
print("EVALUATION RESULTS")
print("="*70)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Not Subscribed', 'Subscribed']))
print(f"\nROC-AUC Score: {roc_auc_score(y_test, y_pred_proba):.4f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('True')
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, linewidth=3, label=f'AUC = {roc_auc_score(y_test, y_pred_proba):.3f}')
axes[0, 1].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
axes[0, 1].set_xlabel('False Positive Rate', fontsize=12)
axes[0, 1].set_ylabel('True Positive Rate', fontsize=12)
axes[0, 1].set_title('ROC Curve', fontsize=13, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Training History: Loss
axes[1, 0].plot(history.history['loss'], linewidth=2, label='Training Loss')
axes[1, 0].plot(history.history['val_loss'], linewidth=2, label='Validation Loss')
axes[1, 0].set_xlabel('Epoch', fontsize=12)
axes[1, 0].set_ylabel('Loss', fontsize=12)
axes[1, 0].set_title('Training History: Loss', fontsize=13, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Training History: AUC
axes[1, 1].plot(history.history['auc'], linewidth=2, label='Training AUC')
axes[1, 1].plot(history.history['val_auc'], linewidth=2, label='Validation AUC')
axes[1, 1].set_xlabel('Epoch', fontsize=12)
axes[1, 1].set_ylabel('AUC', fontsize=12)
axes[1, 1].set_title('Training History: AUC', fontsize=13, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\n✅ Complete MLP pipeline for a real-world classification problem!")
5.8 Review and Exercises
5.8.1 Review Questions
Fundamental Concepts:
- What is the main difference between deep learning and classical machine learning?
- Why can a single perceptron not solve the XOR problem?
- Explain the role of activation functions in neural networks!
Architecture:
- How do you choose the number of hidden layers and neurons?
- When should you use ReLU vs. sigmoid activation?
- What is the trade-off between deep and wide networks?
Training:
- Explain intuitively how backpropagation works!
- Why does the Adam optimizer often outperform standard SGD?
- What is the impact of a learning rate that is too large or too small?
Regularization:
- How does dropout prevent overfitting?
- When should you use L2 regularization vs. dropout?
- What are the advantages of early stopping over training to convergence?
Implementation:
- When should you use the Keras Sequential vs. Functional API?
- What are the main differences between Keras and PyTorch?
- How do you handle class imbalance in neural networks?
5.8.2 Coding Exercises
Exercise 1: XOR Problem
Implement and train an MLP to solve the XOR problem using Keras. The network must reach 100% accuracy.
# Your solution here
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
# TODO: Build, compile, and train the model
# TODO: Achieve 100% accuracy on the XOR problem
Exercise 2: Hyperparameter Tuning
Experiment with various hyperparameters on the Iris dataset:
- Number of hidden layers (1, 2, 3)
- Neurons per layer (8, 16, 32, 64)
- Learning rate (0.0001, 0.001, 0.01)
- Dropout rate (0.0, 0.2, 0.5)
Plot the results and determine the optimal configuration.
Exercise 3: Custom Loss Function
Implement a custom loss function in PyTorch for Focal Loss (which handles class imbalance):
\[ \text{FL}(p_t) = -(1-p_t)^\gamma \log(p_t) \]
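As a sanity check of the formula, here is a minimal NumPy version for the binary case; the PyTorch exercise would wrap the same computation in an `nn.Module` using `torch` operations:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss. p: predicted P(class=1), y: true labels in {0, 1}."""
    p_t = np.where(y == 1, p, 1 - p)           # probability of the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.99, 0.6, 0.1])
y = np.array([1, 1, 1])
print(focal_loss(p, y))  # confident correct predictions contribute almost nothing
```

Setting `gamma=0` recovers ordinary cross-entropy; larger `gamma` down-weights easy examples more aggressively, which is why focal loss helps on imbalanced data.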
Exercise 4: Transfer Learning Simulation
- Train an MLP on a subset of the dataset (30% of the data)
- Freeze some of the layers and fine-tune on the remaining data
- Compare against training from scratch
5.8.3 Case Study: Credit Card Fraud Detection
Dataset: Simulated credit card transactions (imbalanced: 99.8% normal, 0.2% fraud)
Task:
- Preprocess the data (scaling, handling the imbalance)
- Build an MLP classifier with an appropriate architecture
- Use class weights or oversampling
- Optimize for precision and recall (not just accuracy!)
- Visualize the results and analyze the errors
Deliverables:
- A trained model with precision > 0.90 and recall > 0.80
- Learning curves
- Confusion matrix
- ROC curve and AUC score
- Analysis of false positives and false negatives
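The class-weight step in the task list can be sketched in plain NumPy (scikit-learn's `compute_class_weight('balanced', ...)` implements the same rule); the resulting dict is in the shape Keras expects for the `class_weight` argument of `fit`:

```python
import numpy as np

def balanced_class_weights(y):
    """'Balanced' weighting: n_samples / (n_classes * count_per_class)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Simulated imbalance matching the case study: 99.8% normal, 0.2% fraud
y = np.array([0] * 998 + [1] * 2)
print(balanced_class_weights(y))  # the fraud class gets a far larger weight
```

With 998 normal and 2 fraud samples this gives the fraud class a weight of 250 versus roughly 0.5 for the normal class, so each fraud example contributes correspondingly more to the loss.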
5.9 Key Takeaways
Fundamental Concepts:
- Deep learning uses hierarchical feature learning
- MLPs consist of input, hidden, and output layers
- Activation functions provide non-linearity
- Forward propagation: input → output
- Backpropagation: computes the gradients used for learning
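The forward pass summarized above, sketched for a single example through one ReLU hidden layer and a sigmoid output (the weights here are random placeholders, not trained values):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(42)
x = rng.random(4)                           # 4 input features

W1, b1 = rng.random((8, 4)), np.zeros(8)    # hidden layer: 8 neurons
W2, b2 = rng.random((1, 8)), np.zeros(1)    # output layer: 1 neuron

h = relu(W1 @ x + b1)                       # hidden activations (non-linearity)
y_hat = 1 / (1 + np.exp(-(W2 @ h + b2)))    # sigmoid squashes output into (0, 1)
print(y_hat)
```

Backpropagation then runs the same graph in reverse, applying the chain rule layer by layer to get gradients with respect to `W1`, `b1`, `W2`, and `b2`.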
Training Best Practices:
- Optimizer: Adam (default choice)
- Learning Rate: 0.001 (Adam), 0.01 (SGD)
- Batch Size: 32-128 (balanced)
- Epochs: Use early stopping
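Assuming scikit-learn (used earlier in this chapter), these defaults map roughly onto one `MLPClassifier` configuration; in Keras the same choices would instead go into `compile` (optimizer, learning rate) and `fit` (batch size, an `EarlyStopping` callback):

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=500, noise=0.2, random_state=42)

clf = MLPClassifier(
    hidden_layer_sizes=(32, 32),
    solver='adam',             # Adam as the default optimizer choice
    learning_rate_init=0.001,  # typical Adam learning rate
    batch_size=64,             # within the balanced 32-128 range
    early_stopping=True,       # holds out 10% of the data as a validation set
    n_iter_no_change=10,       # patience: stop after 10 epochs without improvement
    max_iter=1000,
    random_state=42,
)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.3f}")
```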
Regularization:
- Dropout: 0.3-0.5 for hidden layers
- L2: Ξ» = 0.0001-0.01
- Early Stopping: Patience 10-20 epochs
- Batch Normalization: For deep networks
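Inverted dropout, the variant modern frameworks implement, can be sketched in a few lines; frameworks apply it automatically during training only:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(a, rate=0.5, training=True):
    """Inverted dropout: zero each activation with probability `rate`,
    rescaling survivors by 1/(1-rate) so the expected activation is unchanged."""
    if not training:
        return a                         # identity at inference time
    mask = rng.random(a.shape) >= rate   # keep each unit with probability 1 - rate
    return a * mask / (1.0 - rate)

a = np.ones((4, 8))
h = dropout(a, rate=0.5)  # entries are either 0.0 or 2.0
```

Because each forward pass samples a fresh mask, no single neuron can be relied upon, which forces redundant representations and reduces overfitting.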
Implementation:
- Keras: User-friendly, quick prototyping
- PyTorch: Flexible, research-oriented
- Sequential API: Linear architectures
- Functional API: Complex architectures
Common Pitfalls:
- Forgetting to normalize input data
- Using sigmoid for hidden layers (use ReLU!)
- Learning rate too large (divergence)
- Not using validation set (overfitting)
- Ignoring class imbalance
5.10 What's Next?
After mastering MLP fundamentals, you are ready for more advanced deep learning architectures:
Chapter 6: Convolutional Neural Networks (CNN)
- Image classification and computer vision
- Convolutional layers and feature maps
- Transfer learning with pretrained models
- Data augmentation techniques
Chapter 7: Recurrent Neural Networks (RNN/LSTM)
- Sequence modeling (time series, text)
- LSTM and GRU architectures
- Bidirectional RNNs
- Attention mechanisms
Chapter 8: Transformers
- Self-attention mechanism
- BERT, GPT architectures
- Fine-tuning pretrained models
- Modern NLP applications
Continue Learning:
- Experiment with different datasets
- Read research papers (start with survey papers)
- Participate in Kaggle competitions
- Build real-world projects
References and Further Reading
Books:
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Chollet, F. (2021). Deep Learning with Python (2nd ed.). Manning.
- Zhang, A., et al. (2023). Dive into Deep Learning. (Free online: d2l.ai)
Papers:
- Rumelhart, D. E., et al. (1986). "Learning representations by back-propagating errors."
- Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization."
- Srivastava, N., et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting."
Online Resources:
- TensorFlow Tutorials: https://www.tensorflow.org/tutorials
- PyTorch Tutorials: https://pytorch.org/tutorials/
- Fast.ai Course: https://course.fast.ai/
- Stanford CS231n: http://cs231n.stanford.edu/
Practice Platforms:
- Kaggle: https://www.kaggle.com/
- Google Colab: https://colab.research.google.com/
- Papers with Code: https://paperswithcode.com/
Deep learning is an iterative process:
1. Start simple (baseline model)
2. Analyze errors and bottlenecks
3. Iterate with improvements
4. Experiment systematically
5. Document what works and what doesn't
"The best model is the one you can build, understand, and improve."
Good luck and happy deep learning!