Chapter 14: Capstone Project Guide & Best Practices

From Ideation to Implementation: Planning, Executing, and Communicating a Professional ML Project

🎯 Learning Outcomes

After completing this chapter, you will be able to:

  1. Plan a production ML project with a clear scope and a realistic timeline
  2. Formulate the problem through a measurable problem statement (SMART criteria)
  3. Design a comprehensive data strategy, from collection to validation
  4. Build a baseline model and iterate systematically toward the target performance
  5. Evaluate and validate models with metrics appropriate to the use case
  6. Document the project to professional standards, with reproducibility in mind
  7. Present project findings and insights clearly to stakeholders
  8. Identify and avoid the pitfalls that commonly derail capstone projects

14.1 Project Planning & Scoping

14.1.1 Mengapa Scoping Penting?

Many students pour their time into modeling while planning and scoping are neglected. Yet poor scoping is the leading cause of project failure.

Striking statistics:

  • 45% of capstone projects fail to deliver meaningful results because the scope is unclear
  • 60% of students underestimate the timeline at the start of the project
  • 38% attempt problems that are too ambitious for one semester

⚠️ Common Mistakes
  1. Scope Creep: the project starts with a clear scope, then new features keep being added
  2. Underestimating Complexity: "looks easy" → turns out to take 3x longer
  3. Fixing Scope, Not Timeline: the deadline gets squeezed and quality suffers
  4. No Minimum Viable Product (MVP): all or nothing

14.1.2 Project Scoping Framework

Step 1: Define the Project Goal

The project is not just about YOU; it is about the VALUE it will deliver.

Key Questions:

  • Who are the primary stakeholders?
  • What problem is being solved?
  • How will success be measured?
  • What is a realistic timeline?

Examples of good vs. bad scope:

| Bad Scope | Good Scope |
|---|---|
| "Build a ML model for sentiment analysis" | "Build a sentiment analysis model to classify customer feedback (positive/negative/neutral) with a target accuracy of 85%, to help the customer service team prioritize complaints" |
| "Predict stock prices" | "Build an LSTM model to predict intraday price movements (+/- 2% threshold) using 6 months of historical data, to identify trading opportunities with risk-adjusted returns" |
| "Image classification" | "Classify malware vs. benign Windows executable files at 90% recall for security screening, using image representations of the binary files" |

Step 2: Define Success Criteria (SMART)

  • Specific: clearly defined
  • Measurable: quantifiable with concrete metrics
  • Achievable: realistic given the available resources
  • Relevant: matters to the stakeholders
  • Time-bound: has a clear deadline

Example SMART criteria:

❌ Bad: "The model must be accurate"

✅ Good: "Achieve 85%+ accuracy on the test set with a balanced
dataset (n=5000 samples) using Random Forest,
validated with 5-fold cross-validation,
deadline December 31, 2024"
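A criterion this concrete can be checked mechanically at the end of the project. A minimal sketch, with synthetic data standing in for the real n=5000 dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the balanced n=5000 dataset in the criterion
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.5, 0.5], random_state=42)

# 5-fold CV accuracy, exactly as the SMART criterion specifies
scores = cross_val_score(RandomForestClassifier(random_state=42),
                         X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
print("SMART target met" if scores.mean() >= 0.85 else "Below target")
```

Because the criterion names the metric, the validation protocol, and the threshold, the check is a few lines of code rather than a judgment call.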

Step 3: Risk Assessment

Identify potential blockers BEFORE starting:

| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Data unavailable | Medium | Critical | Request it from the company 2 weeks in advance |
| Poor data quality | High | Medium | Plan an intensive data-cleaning phase |
| Model hard to converge | Medium | High | Research state-of-the-art papers, test multiple algorithms early |

14.1.3 Timeline & Milestone Planning

Capstone timeline structure (1 semester = 16 weeks):

Weeks 1-2:   Problem Definition + Planning (10%)
Weeks 3-5:   Data Collection & EDA (15%)
Weeks 6-8:   Feature Engineering & Preprocessing (15%)
Weeks 9-11:  Model Development & Experimentation (25%)
Weeks 12-13: Evaluation & Optimization (15%)
Weeks 14-15: Documentation & Presentation Prep (15%)
Week 16:     Final Presentation & Submission (5%)

Critical Milestones:

📋 Project Milestones Checklist

Month 1: Project Setup (Due: Week 4)
- [ ] Problem statement finalized and approved
- [ ] Stakeholders identified
- [ ] Preliminary data assessment done
- [ ] Team roles & responsibilities defined
- [ ] Git repository set up with a proper structure

Month 2: Data & Baseline (Due: Week 8)
- [ ] Dataset collected and cleaned
- [ ] EDA report completed
- [ ] Data splits (train/val/test) finalized
- [ ] Baseline model implemented
- [ ] Evaluation metrics selected

Month 3: Model Development (Due: Week 12)
- [ ] 3+ models trained and compared
- [ ] Hyperparameter tuning completed
- [ ] Best model selected
- [ ] Cross-validation done
- [ ] Model card drafted

Final 2 Weeks: Finalization
- [ ] Documentation complete
- [ ] Code cleaned & tested
- [ ] Demo prepared
- [ ] Presentation slides ready

14.2 Problem Formulation

14.2.1 Anatomy of a Good Problem Statement

Essential components:

  1. Context: background and business case
  2. Problem: what needs to be solved
  3. Data: what data is available, and how much
  4. Success Metrics: how success will be measured
  5. Constraints: technical and non-technical limitations

Example problem statement:

CONTEXT:
GrowthBank serves 50,000+ B2B customers with an
average loan size of Rp 500 million. Manual credit approval
takes 3-5 days and carries a default rate of 8%.

PROBLEM:
Automate the credit scoring process to cut approval
time to <24 hours and the default rate to <5%,
while maintaining customer satisfaction.

DATA:
- 10,000 historical loans (2018-2023)
- 40+ features: company profile, financials, payment history
- 5% data missing (handled appropriately)

SUCCESS METRICS:
1. Model accuracy: 85%+ pada test set
2. Default recall: 90% (catch bad borrowers)
3. Processing speed: <5 seconds per application
4. Interpretability: Top 5 important features identifiable

CONSTRAINTS:
- Data privacy: PII must be removed/encrypted
- Latency: Must respond in <5 sec
- Availability: 99% uptime required
- Fairness: No discrimination against protected groups

14.2.2 Problem Type Classification

Classification:

  • Binary (yes/no, churn/stay, fraud/legitimate)
  • Multi-class (sentiment: positive/neutral/negative)
  • Multi-label (music genres: rock, pop, jazz simultaneously)

Regression:

  • Continuous values (price, temperature, traffic volume)
  • Time series (stock price prediction, demand forecasting)

Ranking/Recommendation:

  • Prioritize items (search ranking, recommendation system)
  • Matching (matching job seekers to jobs)

Anomaly Detection:

  • Outlier detection (fraud, system intrusion, equipment failure)
  • Novelty detection (new attack types)

Clustering:

  • Customer segmentation
  • Document clustering

Choosing the right problem type determines:

  • Data requirements
  • Metrics selection
  • Algorithm choices
  • Evaluation approach

14.2.3 Defining Metrics

💡 Best Practice: Match Metrics to Business Goals

"60% accuracy" means nothing on its own. Metrics must be:
  1. Aligned with business KPIs
  2. Interpretable (not only to statisticians)
  3. Actionable (translatable into decisions)

Example of metric selection:

PROBLEM: Fraud Detection
↓
BUSINESS GOAL: Catch 95% of frauds, minimize false positives
↓
METRICS: Recall=95% (catch fraud), high Precision
         (avoid false alarms)
↓
IMPLEMENTATION: Select the threshold that maximizes F2-score
                (2x weight on recall)
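The final step, picking the threshold that maximizes F2, can be sketched with scikit-learn's fbeta_score (synthetic imbalanced data stands in for real transactions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data stands in for real fraud transactions
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds; keep the one that maximizes F2
# (beta=2 weights recall twice as heavily as precision)
thresholds = np.linspace(0.05, 0.95, 19)
f2_scores = [fbeta_score(y_val, proba >= t, beta=2) for t in thresholds]
best_t = thresholds[int(np.argmax(f2_scores))]
print(f"Best threshold: {best_t:.2f}, F2: {max(f2_scores):.3f}")
```

Note the threshold is chosen on the validation split, never on the test set.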

Common metrics by problem type:

| Problem Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Classification | Accuracy (balanced), Precision, Recall, F1 | AUC-ROC, Confusion Matrix |
| Imbalanced Classification | Precision, Recall, F1-score, AUC | Sensitivity, Specificity |
| Regression | MAE, RMSE | R², MAPE |
| Ranking | NDCG, MAP | MRR, Recall@K |
| Clustering | Silhouette Score | Davies-Bouldin Index |
| Anomaly Detection | Detection Rate, False Positive Rate | Precision@K, AUROC |

Caution on Single Metric:

# ❌ Don't do this
if accuracy > 0.85:
    print("Model is good!")

# ✅ Do this instead
metrics = {
    'accuracy': 0.85,
    'precision': 0.82,
    'recall': 0.88,
    'f1': 0.85,
    'auc_roc': 0.89
}

# Interpret holistically
print("High recall (0.88) → catches most positives")
print("OK precision (0.82) → some false alarms acceptable")
print("Balanced F1 (0.85) → good overall trade-off")

14.3 Data Strategy

14.3.1 Data Collection Plan

Template: Data Collection Checklist

📋 Data Collection Planning

Data Source
- [ ] Source identified (API, database, CSV, web scraping)
- [ ] Access obtained (permissions, credentials)
- [ ] Data freshness understood (real-time, daily, monthly)
- [ ] Size confirmed (n samples × m features)

Data Quality Assessment
- [ ] Missing values documented (<5% acceptable)
- [ ] Duplicates checked
- [ ] Outliers identified
- [ ] Data type validation done
- [ ] Value ranges reasonable

Data Privacy & Ethics
- [ ] PII removal/anonymization done
- [ ] GDPR/compliance checked
- [ ] Bias in the data identified
- [ ] Consent obtained (if needed)
- [ ] Data retention policy defined

Data Documentation
- [ ] Data dictionary created (each feature explained)
- [ ] Data quality report generated
- [ ] Collection date/period documented
- [ ] Known issues documented

Example: Data Dictionary

Feature Name: transaction_amount
├─ Type: float64
├─ Unit: Indonesian Rupiah (IDR)
├─ Range: 10,000 - 999,999,999
├─ Missing: 0.2% (handled by median imputation)
├─ Distribution: Right-skewed (log-transform applied)
├─ Source: transaction_table.amount
└─ Notes: 3 outliers > 999M (verified, kept)

Feature Name: customer_age
├─ Type: int64
├─ Unit: Years
├─ Range: 18 - 75
├─ Missing: 1.5% (filled with median)
├─ Distribution: Relatively uniform
├─ Source: customer_table.age
└─ Notes: Some suspicious values (999), filtered out
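A data dictionary like this can double as executable validation: encode the documented ranges and count violations. A minimal sketch on a toy frame (the values are made up; the column names follow the dictionary above):

```python
import pandas as pd

# Toy frame standing in for the real tables
df = pd.DataFrame({
    "transaction_amount": [15_000.0, 2_500_000.0, 999_999_999.0],
    "customer_age": [25, 999, 60],  # 999 is a documented suspicious value
})

# Documented valid ranges from the data dictionary
rules = {"transaction_amount": (10_000, 999_999_999),
         "customer_age": (18, 75)}

for col, (lo, hi) in rules.items():
    bad = df[(df[col] < lo) | (df[col] > hi)]
    print(f"{col}: {len(bad)} out-of-range value(s)")
```

Running such checks whenever the data is refreshed catches drift away from the documented ranges early.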

14.3.2 Exploratory Data Analysis (EDA) Structure

Layered EDA Approach:

  1. Univariate Analysis (1 variable at a time)
    • Distribution, central tendency, spread
    • Outliers, skewness, missing values
  2. Bivariate Analysis (2 variables)
    • Correlation with the target
    • Feature relationships
    • Potential interactions
  3. Multivariate Analysis (3+ variables)
    • Feature correlations
    • Clustering patterns
    • Domain insights

EDA Outputs to Document:

Checklist for the EDA report:
- [ ] Dataset shape dan basic info
- [ ] Missing values visualization & handling
- [ ] Distributions (histograms, KDE plots)
- [ ] Outliers identified & approach decided
- [ ] Correlation heatmap & top correlated features
- [ ] Feature importance from EDA
- [ ] Class imbalance (if classification)
- [ ] Data quality issues & resolutions
- [ ] Key insights & hypotheses
- [ ] Feature engineering ideas
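Most univariate items on this checklist come from a handful of pandas one-liners. A minimal sketch on a synthetic frame (the column names are made up):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame; replace with your real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.lognormal(3, 1, 200),     # right-skewed, like money
    "age": rng.integers(18, 75, 200),
    "target": rng.integers(0, 2, 200),
})
df.loc[:4, "amount"] = np.nan  # simulate missing values

print(df.shape)                                    # dataset shape
print(df.isna().mean().round(3))                   # missing rate per column
print(df["target"].value_counts(normalize=True))   # class balance
print(df.corr(numeric_only=True)["target"].round(2))  # correlation with target
```

These numbers feed directly into the checklist items above (shape, missingness, imbalance, correlated features).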

14.3.3 Data Preparation Workflow

Raw Data
   ↓
[Clean] β†’ Remove duplicates, fix obvious errors
   ↓
[Transform] β†’ Handle missing, encode categorical, scale
   ↓
[Validate] β†’ Check quality, range, distribution
   ↓
[Split] β†’ Train (70%) / Validation (15%) / Test (15%)
   ↓
[Document] β†’ Version data, document transformations
   ↓
Ready for Modeling

Key decision points:

| Decision | Options | Trade-offs |
|---|---|---|
| Missing Values | Drop / Impute (mean/median/KNN) | Losing data vs. introducing bias |
| Categorical Encoding | One-hot / Label / Ordinal | Sparsity vs. information |
| Feature Scaling | StandardScaler / MinMaxScaler / RobustScaler | Interpretability vs. performance |
| Imbalanced Data | Oversample / Undersample / SMOTE | Overfitting vs. underfitting |
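Several of these decisions (median imputation, one-hot encoding, scaling) can be wired into a single scikit-learn Pipeline so they are applied consistently and fitted on training data only. A minimal sketch with hypothetical column names (amount, age, segment):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical = ["amount", "age"]
categorical = ["segment"]

preprocess = ColumnTransformer([
    # Median imputation + scaling for numerical columns
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numerical),
    # One-hot for categoricals; unseen categories at predict time are ignored
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

clf = Pipeline([("prep", preprocess),
                ("model", LogisticRegression(max_iter=1000))])

# Toy data just to show the pipeline end to end
X = pd.DataFrame({"amount": [100.0, 250.0, None, 80.0],
                  "age": [25, 40, 31, 52],
                  "segment": ["a", "b", "a", "b"]})
y = [0, 1, 0, 1]
clf.fit(X, y)
print(clf.predict(X))
```

Because every transform lives inside the pipeline, cross-validation refits them per fold, which also prevents the data-leakage pitfalls discussed later in this chapter.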

14.4 Baseline & Iteration

14.4.1 Establishing Baseline

A "baseline" is the simplest possible model for your problem.

💡 Why Baseline Matters

A baseline is not about high performance. A baseline is:
  1. A sanity check → is your model better than the baseline?
  2. A reference point → how much improvement over the baseline?
  3. A proof of concept → is the problem solvable with ML at all?

Baseline ideas by problem type:

CLASSIFICATION:
├─ Majority class (always predict positive/negative)
├─ Random classifier (50% for binary)
├─ Logistic Regression
└─ Decision Tree

REGRESSION:
├─ Mean predictor (always predict the mean)
├─ Median predictor
├─ Linear Regression
└─ Decision Tree

RANKING:
├─ Random ranking
├─ Popularity ranking
└─ TF-IDF based ranking
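scikit-learn ships the first two classification baselines as DummyClassifier, so a trivial reference point costs two lines. A minimal sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90/10) stands in for a real dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

results = {}
for strategy in ["most_frequent", "stratified"]:
    # most_frequent = majority-class baseline; stratified = random baseline
    dummy = DummyClassifier(strategy=strategy, random_state=0)
    dummy.fit(X_train, y_train)
    results[strategy] = dummy.score(X_test, y_test)
    print(f"{strategy}: accuracy={results[strategy]:.3f}")
```

The high accuracy of the majority-class dummy on imbalanced data is exactly why accuracy alone is a misleading target.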

Example: Fraud Detection Baseline

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Baseline 1: Always predict "no fraud" (majority class)
baseline1_accuracy = (fraud_data.label == 0).mean()  # e.g., 98.5%
baseline1_recall = 0  # Catches 0% of frauds

# Baseline 2: Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
baseline2_accuracy = lr.score(X_test, y_test)
# Recall = fraction of actual frauds that are caught
baseline2_recall = recall_score(y_test, lr.predict(X_test))

print(f"Baseline 1 (always negative): Acc={baseline1_accuracy:.1%}")
print(f"Baseline 2 (Logistic): Acc={baseline2_accuracy:.1%}, "
      f"Recall={baseline2_recall:.1%}")

# Your model must beat BOTH the accuracy AND the recall of these baselines

14.4.2 Systematic Iteration Process

Don’t randomly try 100 algorithms. Iterate systematically.

Phase 1: Simple Models (Week 1-2)
  Try: Logistic Regression, Decision Tree, KNN
  Goal: Understand problem & get baseline

Phase 2: Intermediate Models (Week 2-3)
  Try: Random Forest, SVM, Gradient Boosting
  Goal: Find the algorithm that works best

Phase 3: Advanced Models (Week 3-4)
  Try: Neural Networks, Ensemble, State-of-art
  Goal: Push towards target performance

Phase 4: Optimization (Week 4-5)
  Try: Hyperparameter tuning, ensemble methods
  Goal: Final squeeze on performance

Iteration Template to Document:

## Experiment Log

### Experiment 1: Logistic Regression Baseline
- Date: 2024-01-10
- Model: LogisticRegression(C=1.0, max_iter=1000)
- Features: 35 features, no scaling
- Result: Accuracy=0.82, Recall=0.75, F1=0.78
- Note: Baseline established
- Next: Try feature scaling

### Experiment 2: Logistic Regression + Scaling
- Date: 2024-01-11
- Model: LogisticRegression with StandardScaler
- Features: 35 features, StandardScaled
- Result: Accuracy=0.84, Recall=0.78, F1=0.81
- Note: Slight improvement from scaling
- Next: Try feature engineering

### Experiment 3: Random Forest
- Date: 2024-01-12
- Model: RandomForest(n_estimators=100, max_depth=10)
- Features: 35 features + 8 engineered features
- Result: Accuracy=0.87, Recall=0.85, F1=0.86
- Note: Significant improvement!
- Next: Hyperparameter tuning for RF

14.4.3 Common Iteration Pitfalls

⚠️ Things NOT to Do

❌ Tuning on Test Set

# WRONG: Evaluating on test set repeatedly
for hyperparams in search_space:
    model = fit_model(X_train, y_train, hyperparams)
    score = model.score(X_test, y_test)  # OVERFITTING TO TEST!

# RIGHT: Evaluate on validation set
for hyperparams in search_space:
    model = fit_model(X_train, y_train, hyperparams)
    score = model.score(X_val, y_val)  # Use validation
# Final evaluation on the test set only!
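This validation-set discipline is what scikit-learn's GridSearchCV automates: it cross-validates each hyperparameter candidate inside the training split only, so the test set is touched exactly once. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validation happens inside the training split only;
# the test set stays untouched during the search
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
                      cv=5)
search.fit(X_train, y_train)
print("Best C:", search.best_params_["C"])
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

After the search, GridSearchCV refits the best candidate on the full training split, so the single final `score` call on the test set is an honest estimate.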

❌ Data Leakage

# WRONG: Scale entire dataset, then split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit on ALL data!
X_train, X_test = train_test_split(X_scaled)

# RIGHT: Fit scaler only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform using train stats

❌ Metric Obsession

# WRONG: Chasing single metric
# "My model has 97% accuracy! I'm done!"
# But: 0% recall, 500ms latency, not deployable

# RIGHT: Consider all metrics
# Accuracy: 85%, Recall: 90%, Latency: 50ms, Interpretable: Yes

14.5 Evaluation & Validation

14.5.1 Comprehensive Evaluation Framework

Never evaluate a model on a single metric alone.

Model Evaluation Matrix:

| Dimension | Metric | Target |
|---|---|---|
| Accuracy | F1-Score | ≥ 0.85 |
| Fairness | Demographic Parity | < 0.1 difference |
| Robustness | Adversarial Accuracy | > 0.80 |
| Interpretability | SHAP importance | Top-5 features clear |
| Speed | Latency (p95) | < 100 ms |
| Resource Usage | Memory, CPU | < 1 GB RAM |
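Of these dimensions, demographic parity is the easiest to check by hand: compare the positive-prediction rate across groups. A minimal sketch with hypothetical predictions and a made-up binary group attribute:

```python
import numpy as np

# Hypothetical model predictions and a binary protected attribute
rng = np.random.default_rng(42)
y_pred = rng.integers(0, 2, 1000)  # model's 0/1 decisions
group = rng.integers(0, 2, 1000)   # e.g., demographic group 0 vs 1

# Demographic parity: gap in positive-prediction rates between groups
rate_a = y_pred[group == 0].mean()
rate_b = y_pred[group == 1].mean()
gap = abs(rate_a - rate_b)
print(f"Positive rate gap: {gap:.3f} "
      f"({'OK' if gap < 0.1 else 'investigate'})")
```

The < 0.1 cutoff mirrors the target in the table above; real projects should pick the threshold with stakeholders and check it on held-out data, not on random draws as in this sketch.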

14.5.2 Advanced Validation Techniques

1. K-Fold Cross Validation

from sklearn.model_selection import cross_validate

# Better than single train/test split
cv_results = cross_validate(
    model, X, y,
    cv=5,  # 5-fold
    scoring=['accuracy', 'precision', 'recall', 'f1'],
    return_train_score=True
)

# Check for overfitting
train_score = cv_results['train_accuracy'].mean()
test_score = cv_results['test_accuracy'].mean()
print(f"Train: {train_score:.3f}, Test: {test_score:.3f}")

if (train_score - test_score) > 0.15:
    print("⚠️ Possible overfitting! Gap > 15%")

2. Stratified Split (for imbalanced data)

from sklearn.model_selection import StratifiedKFold

# Ensure class distribution maintained
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Training loop

3. Time Series Cross Validation (for sequential data)

from sklearn.model_selection import TimeSeriesSplit

# Don't shuffle! Respect temporal order
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    # Model trained on past, tested on future

14.5.3 Diagnostic Plots

Essential diagnostic plots for a capstone:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, ax=axes[0, 0], cmap='Blues')
axes[0, 0].set_title('Confusion Matrix')

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc_score = auc(fpr, tpr)
axes[0, 1].plot(fpr, tpr, label=f'AUC={auc_score:.3f}')
axes[0, 1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[0, 1].set_xlabel('FPR')
axes[0, 1].set_ylabel('TPR')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend()

# 3. Precision-Recall Curve
from sklearn.metrics import precision_recall_curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
axes[1, 0].plot(recall, precision)
axes[1, 0].set_xlabel('Recall')
axes[1, 0].set_ylabel('Precision')
axes[1, 0].set_title('Precision-Recall Curve')

# 4. Feature Importance
feature_importance = pd.DataFrame({
    'feature': X_test.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

axes[1, 1].barh(feature_importance['feature'][:10],
                feature_importance['importance'][:10])
axes[1, 1].set_title('Top 10 Important Features')

plt.tight_layout()
plt.savefig('model_diagnostics.png', dpi=300, bbox_inches='tight')

14.6 Documentation Best Practices

14.6.1 Project Structure & Documentation

Professional project structure:

my-capstone-project/
├── README.md                          # Project overview
├── DOCUMENTATION.md                   # Detailed documentation
├── LICENSE                            # MIT or similar
├── .gitignore                         # Exclude large files, cache
│
├── data/
│   ├── raw/                           # Original data
│   ├── processed/                     # Cleaned, transformed data
│   └── README.md                      # Data dictionary
│
├── notebooks/
│   ├── 01_eda.ipynb                   # Exploratory analysis
│   ├── 02_preprocessing.ipynb         # Data preparation
│   └── 03_modeling.ipynb              # Model training
│
├── src/
│   ├── __init__.py
│   ├── preprocessing.py               # Data prep functions
│   ├── features.py                    # Feature engineering
│   ├── models.py                      # Model definitions
│   ├── evaluation.py                  # Evaluation metrics
│   └── utils.py                       # Helper functions
│
├── models/
│   ├── model_v1.0.pkl                 # Saved models
│   ├── model_v1.1.pkl
│   └── model_card.md                  # Model documentation
│
├── reports/
│   ├── eda_report.html                # EDA visualization
│   ├── model_comparison.csv           # Experiment results
│   └── final_report.pdf               # Final analysis
│
├── tests/
│   ├── test_preprocessing.py          # Unit tests
│   ├── test_models.py
│   └── test_pipeline.py
│
├── requirements.txt                   # Dependencies
├── setup.py                           # Package setup (if publishing)
└── train.py                           # Main training script

14.6.2 Documentation Template

README.md Example:

# Credit Risk Prediction Model

## Overview
Building an automated credit scoring system to reduce loan
approval time from 3 days to <24 hours while maintaining
default rate below 5%.

## Dataset
- **Source**: GrowthBank historical loans (2018-2023)
- **Size**: 10,000 samples × 42 features
- **Target**: Binary (Default/Non-default), 8% positive class
- **Time Period**: 2018-01-01 to 2023-12-31

## Project Structure
[File structure description]

## Quick Start

pip install -r requirements.txt
python train.py

## Results
- **Best Model**: Random Forest
- **Accuracy**: 87.2% ± 1.3% (5-fold CV)
- **Recall (Default)**: 91.5% (catch 91.5% of defaults)
- **Precision**: 45.2% (acceptable false positive rate)
- **ROC-AUC**: 0.934

## Model Performance
[Performance visualization and metrics table]

## Key Findings
1. Transaction frequency is strongest indicator of default risk
2. Recent payment history more important than historical average
3. Model shows no significant bias against age groups

## Limitations
- Limited to B2B loans (may not generalize to consumer)
- Training data from 2018-2023 (concept drift possible)
- No alternative data sources (e.g., behavioral)

## Future Work
- Implement real-time model monitoring
- Extend to multi-class risk levels (low/medium/high)
- Add fairness constraints for protected attributes

14.6.3 Model Card Documentation

A model card is standardized documentation for an ML model.

# Model Card: Credit Risk Classifier v1.2

## Model Details
- **Model Type**: Random Forest Classification
- **Framework**: Scikit-learn
- **Version**: 1.2
- **Date**: 2024-01-15
- **Authors**: [Your Name]
- **License**: MIT

## Intended Use
- **Intended Use**: Automated credit risk assessment for B2B loans
- **Primary Users**: Credit department, lending officers
- **Out-of-Scope**: Consumer lending, international markets

## Performance
- **Training Data**: 8,000 samples (80%)
- **Test Data**: 2,000 samples (20%)
- **Metric**: Accuracy, Precision, Recall, F1

### Detailed Performance Metrics
                 Precision  Recall  F1-Score  Support
Non-Default (0)       0.93    0.88      0.90     1840
Default (1)           0.45    0.92      0.61      160
Accuracy                                0.87     2000


## Fairness Analysis
- **Gender Bias**: FPR diff = 1.2% (acceptable)
- **Age Bias**: ROC-AUC for <30 = 0.92, >50 = 0.94 (no significant difference)

## Limitations
- Only trained on B2B segment
- Assumes data distribution similar to training period
- Requires regular retraining

## Data and Preprocessing
- **Training Data**: 8000 historical loans with known outcomes
- **Input Features**: 42 features (company profile + financials)
- **Preprocessing**: StandardScaler on numerical, OneHot on categorical

14.7 Code Quality & Reproducibility

14.7.1 Reproducibility Checklist

✅ Reproducibility Checklist

Code Reproducibility:
- [ ] Random seeds fixed (numpy, random, framework)
- [ ] Dependencies pinned in requirements.txt
- [ ] A single entry point (e.g., train.py) runs end to end
- [ ] Configuration kept in a versioned file, not hard-coded

Data Reproducibility:
- [ ] Raw data versioned, or its source and retrieval date documented
- [ ] Preprocessing steps scripted, never done by hand
- [ ] Train/val/test splits saved or derivable from a fixed seed

Results Reproducibility:
- [ ] Metrics logged per experiment (see the experiment log template)
- [ ] Model artifacts saved with version numbers
- [ ] Environment documented (Python version, OS, hardware)

Example: Reproducible Training Script

# train.py - Fully reproducible training

import numpy as np
import random
import tensorflow as tf
from pathlib import Path
import json

# Set random seeds for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

def main(config_path='config.json'):
    """Main training function"""

    # Load configuration
    with open(config_path, 'r') as f:
        config = json.load(f)

    # Load and preprocess data
    X_train, y_train, X_test, y_test = load_data(
        config['data_path'],
        test_size=config['test_size'],
        random_state=RANDOM_SEED
    )

    # Train model
    model = train_model(
        X_train, y_train,
        **config['model_params']
    )

    # Evaluate
    metrics = evaluate_model(model, X_test, y_test)

    # Save results
    save_artifacts(model, metrics, config)

    return metrics

if __name__ == "__main__":
    metrics = main()
    print(f"Final accuracy: {metrics['accuracy']:.3f}")

config.json:

{
  "data_path": "data/processed/",
  "test_size": 0.2,
  "model_params": {
    "n_estimators": 100,
    "max_depth": 10,
    "random_state": 42
  },
  "output_dir": "results/model_v1.2/"
}

14.7.2 Code Quality Standards

Minimal standards for a capstone:

# ✅ Good: Functions with docstrings
def prepare_features(X, categorical_cols, numerical_cols):
    """
    Prepare features for model training.

    Parameters
    ----------
    X : pd.DataFrame
        Input features with categorical and numerical columns
    categorical_cols : list
        Names of categorical columns for one-hot encoding
    numerical_cols : list
        Names of numerical columns for scaling

    Returns
    -------
    X_prepared : np.ndarray
        Prepared feature matrix ready for training
    """
    # Implementation
    return X_prepared

# ✅ Good: Comments explain WHY, not WHAT
# Skip features with >50% missing values
# (imputation would introduce too much bias)
mask = feature_missing_rate < 0.5
X_filtered = X.loc[:, mask]

# ❌ Bad: Comments repeat code
for i in range(len(data)):  # Loop through data
    if data[i] > threshold:  # Check if greater
        results.append(data[i])  # Add to results

Type hints for clarity:

from typing import Tuple, List, Dict
import pandas as pd
import numpy as np

def split_features(
    X: pd.DataFrame,
    target_ratio: float = 0.2
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split into train and test with specified ratio."""
    n_test = int(len(X) * target_ratio)
    return X[:-n_test], X[-n_test:]

def calculate_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Dict[str, float]:
    """Calculate evaluation metrics."""
    return {
        'accuracy': (y_true == y_pred).mean(),
        # ... more metrics
    }

14.8 Technical Report Writing

14.8.1 Report Structure

Standard capstone technical report structure (15-30 pages):

1. Executive Summary (1 page)
   - Problem, solution, results
   - Key recommendation
   - Business impact

2. Introduction (2-3 pages)
   - Context and motivation
   - Problem statement
   - Research questions
   - Contributions

3. Literature Review (2-3 pages)
   - Related work
   - Existing solutions
   - Knowledge gaps
   - How your work differs

4. Methodology (3-4 pages)
   - Problem formulation (mathematical)
   - Algorithms & approaches
   - Evaluation metrics
   - Hyperparameters

5. Data Description (2-3 pages)
   - Dataset characteristics
   - Data collection & preprocessing
   - Feature engineering
   - Class distribution, missing values
   - Train/val/test splits

6. Results (3-5 pages)
   - Model comparison
   - Best model performance
   - Ablation studies
   - Visualizations (confusion matrix, ROC, etc.)

7. Analysis & Discussion (2-3 pages)
   - Interpret results
   - Why did model work/fail
   - Limitations
   - Error analysis

8. Conclusion & Future Work (1-2 pages)
   - Summary of findings
   - Practical implications
   - Directions for future research

9. References (1-2 pages)
   - Academic citations
   - Data sources
   - Software libraries

14.8.2 Writing Best Practices

❌ Avoid:

  • Overly technical jargon without explanation
  • Unsupported claims ("Our model is the best!")
  • Lengthy code listings in the main report
  • Vague statements ("We tried many models")

✅ Do:

  • Explain technical concepts so they are accessible to a general audience
  • Support claims with evidence (metrics, citations)
  • Put code in an appendix
  • Be specific ("We evaluated 12 models using GridSearchCV")

Example: Good vs Bad Writing

❌ BAD:
"We used an RF with optimized hyperparameters to
maximize the AUC on the validation dataset."

✅ GOOD:
"We trained a Random Forest classifier with
100 trees and max depth of 10 (selected via 5-fold
cross-validation to maximize AUC-ROC). The model
achieved 87.2% accuracy on the test set."

14.9 Presentation & Demo Skills

14.9.1 Presentation Structure

Final presentation (15-20 minutes):

0-1 min:   Title slide, introduce yourself
1-2 min:   The Problem (why should audience care?)
2-3 min:   Your Solution (brief overview)
3-8 min:   Key Results (metrics, visualizations)
8-12 min:  Technical Deep Dive (1-2 complex topics)
12-15 min: Limitations & Future Work
15-20 min: Q&A

Slide Guidelines:

  • Slide 1-2: Title + Problem (make audience care!)
  • Slide 3-4: Data overview (n samples, features, class distribution)
  • Slide 5-6: Approach (your method vs baselines)
  • Slide 7-10: Results (visualizations!)
    • Confusion matrix
    • ROC/PR curve
    • Feature importance
    • Model comparison
  • Slide 11-12: Deep dive on 1-2 interesting findings
  • Slide 13: Limitations honestly discussed
  • Slide 14: Future work & lessons learned

💡 Presentation Tips
  1. Lead with Why: "Why should anyone care about this problem?"
  2. Show Impact: how does this solve a real problem?
  3. Use Visualizations: data > tables > text
  4. Tell a Story: not just "We did X and got Y accuracy"
  5. Prepare for Questions: know your code, data, and assumptions
  6. Practice: time yourself, get feedback

14.9.2 Live Demo Best Practices

If demonstrating a working system:

✅ DO:
- [ ] Test demo multiple times before presentation
- [ ] Have backup screenshots/video if live breaks
- [ ] Keep demo simple (don't show complex edge cases)
- [ ] Make prediction in <5 seconds
- [ ] Explain what model learned, not just output

❌ DON'T:
- Don't go off script ("Let me show you something...")
- Don't click around aimlessly
- Don't make predictions on untypical data
- Don't forget what model is doing
- Don't spend >2 minutes on demo

14.10 Project Examples & Case Studies

14.10.1 Case Study 1: Fraud Detection (Classification)

Problem: E-commerce platform with 100K+ transactions/day, fraud rate 0.8%.

Approach:

1. Baseline: Logistic Regression
   - Accuracy: 99.8% (predicting all non-fraud!)
   - Recall: 0% (catches 0 frauds) ❌

2. Strategy: Class imbalance handling
   - SMOTE oversampling for the minority class
   - Use Recall as primary metric
   - Adjust decision threshold

3. Best Model: Gradient Boosting + Custom Threshold
   - Accuracy: 98.5%
   - Recall: 92% (catch 92% of frauds)
   - Precision: 35% (acceptable for fraud detection)
   - Live performance: catches $2M fraud/month

Key Lessons:

  • Accuracy is misleading on imbalanced data
  • Recall > Precision for fraud (catch the fraud, accept some false alarms)
  • Real-time constraints matter (each transaction must be scored in <100ms)

14.10.2 Case Study 2: Predictive Maintenance (Regression)

Problem: Manufacturing plant, reduce unplanned downtime.

Approach:

1. Data: 5 years sensor data (temperature, vibration, pressure)
2. Target: RUL (Remaining Useful Life) prediction
3. Baseline: Linear regression on recent sensor readings
   - RMSE: 150 hours

4. Advanced: Sequence-to-sequence LSTM
   - Input: 7 days sensor history
   - Output: Days until failure
   - RMSE: 42 hours (3x better)
   - Enables preventive maintenance

Challenges & Solutions:

  • Class imbalance: Most machines work fine
    • Solution: Weighted loss function, focus on failures
  • Seasonality: Equipment behaves differently by season
    • Solution: Add seasonal features
  • Concept drift: Equipment degrades over 5 years
    • Solution: Retrain monthly with recent data
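The weighted-loss fix mentioned above can be approximated in scikit-learn with the class_weight option, which scales each class's contribution to the loss. A minimal sketch on synthetic imbalanced classification data (the case study itself predicts RUL, a regression target; this only illustrates the weighting idea):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: failures (class 1) are rare
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=1)

# Plain model vs. model whose loss up-weights the rare failure class
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_train, y_train)

plain_recall = recall_score(y_test, plain.predict(X_test))
weighted_recall = recall_score(y_test, weighted.predict(X_test))
print(f"Plain recall:    {plain_recall:.2f}")
print(f"Weighted recall: {weighted_recall:.2f}")
```

In deep-learning frameworks the same idea appears as per-class or per-sample weights in the loss function; class_weight="balanced" is the sklearn shorthand for inverse-frequency weights.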

14.11 Common Pitfalls & Prevention

14.11.1 Technical Pitfalls

⚠️ Data Leakage (Most Common!)

❌ Problem: Information from the future leaks into training

# WRONG: use future information in features
for idx in range(len(data) - 1):
    # Using data[idx+1] (the future) to predict the label at idx
    features[idx] = [
        data[idx]['price'],
        data[idx + 1]['price'],   # ← FUTURE DATA!
        data[idx + 1]['volume'],  # ← FUTURE DATA!
    ]

# RIGHT: Use only past information
for idx in range(1, len(data)):  # Start from idx=1
    features[idx] = [
        data[idx]['price'],
        data[idx-1]['price'],  # ← Past
        data[idx-1]['volume']   # ← Past
    ]

❌ Problem: Scaling the entire dataset before the train/test split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: fit the scaler on the entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ← fit on ALL data
X_train, X_test = train_test_split(X_scaled)
# The model "sees" test-set statistics during scaling!

# RIGHT: fit the scaler ONLY on training data
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # ← fit on train only
X_test_scaled = scaler.transform(X_test)        # ← transform using train stats

How to prevent:

  • Build features from past information only; never index into the future.
  • Fit every preprocessing step (scaling, imputation, encoding) on the training split only, then apply it to the test split.
  • Audit the pipeline end-to-end for any path where test-set or future information can reach training.
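
One concrete safeguard is to wrap preprocessing and the model in a scikit-learn `Pipeline`: during cross-validation, the scaler is then re-fit inside each fold on that fold's training portion only, so no test statistics leak. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=42)

# The scaler lives inside the pipeline, so each CV fold fits it
# on that fold's training portion only -- no leakage is possible
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f}")
```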

⚠️ Overfitting

❌ Training accuracy 99%, test accuracy 70%

# Signs of overfitting:
if train_acc - test_acc > 0.15:  # Gap > 15%
    print("⚠️ Probable overfitting!")

# Solutions:
# 1. More training data
# 2. Reduce model complexity
# 3. Regularization (L1/L2)
# 4. Early stopping
# 5. Dropout (neural networks)

# ✅ Regularized model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.01, penalty='l2')
# Smaller C means stronger regularization (sklearn's default is C=1.0)
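
To see the train/test gap described above emerge as model complexity grows, scikit-learn's `validation_curve` compares training and cross-validated scores across a complexity parameter. This sketch uses a decision tree on synthetic data, with tree depth as the illustrative complexity knob:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_informative=10, random_state=0)

# Train vs. cross-validated accuracy across tree depths: a widening
# gap signals overfitting, mirroring the >15% gap check above
depths = [1, 3, 5, 10, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for depth, tr, va in zip(depths,
                         train_scores.mean(axis=1),
                         val_scores.mean(axis=1)):
    flag = " <- probable overfitting" if tr - va > 0.15 else ""
    print(f"max_depth={depth:2d}  train={tr:.2f}  val={va:.2f}{flag}")
```

The deepest trees fit the training data almost perfectly while validation accuracy plateaus or drops, which is exactly the gap pattern the check above is meant to catch.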
⚠️ Hyperparameter Tuning on Test Set

# ❌ WRONG: selecting hyperparameters on the test set
best_score = 0
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)  # ← TUNING ON TEST!
    if score > best_score:
        best_score = score
        best_C = C

# ✅ RIGHT: use a validation set for hyperparameter selection
best_score, best_model = 0, None
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # ← USE VALIDATION
    if score > best_score:
        best_score, best_model = score, model

# Evaluate exactly once on the held-out test set
final_score = best_model.score(X_test, y_test)
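
In practice, `GridSearchCV` automates the validation loop above: it does the train/validation splitting internally via cross-validation, so the test set is touched exactly once, at the very end. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# GridSearchCV cross-validates each C on the training data only;
# the held-out test set plays no role in the selection
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
    cv=5)
search.fit(X_train, y_train)

print("best C:", search.best_params_["C"])
print("held-out test score:", search.best_estimator_.score(X_test, y_test))
```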

14.11.2 Project Management Pitfalls

| Pitfall | Symptom | Prevention |
|---|---|---|
| Scope Creep | Keeps adding features | Lock scope in week 2; use a change-request process |
| No Baseline | Don't know if the model is good | Implement a baseline in week 1 |
| Single Train/Test Split | High variance in metrics | Use 5-fold cross-validation |
| Ignoring Class Imbalance | 98% accuracy but useless | Check class distribution on day 1 |
| No Documentation | Can't reproduce results | Write as you code, not at the end |
| Last-Minute Demo | Presentation full of errors | Practice the presentation 1 week before |

14.11.3 Communication Pitfalls

⚠️ Overselling Results

❌ WRONG:

"Our model is 95% accurate!" (on an imbalanced dataset, 95% accuracy can mean nothing more than predicting the majority class)

✅ RIGHT:

"Our model achieves 95% accuracy, 85% recall, and 60% precision. This means we catch 85% of frauds, with 60% of alerts being true positives." (specific, contextualized, honest)
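
The "majority class" trap behind the wrong statement can be demonstrated in a few lines. The toy labels below (95% legitimate, 5% fraud) are illustrative, not taken from the case study:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# A toy imbalanced test set: 95% legitimate (0), 5% fraud (1)
y_true = np.array([0] * 95 + [1] * 5)
# A useless "model" that predicts non-fraud for everything
y_majority = np.zeros(100, dtype=int)

print("Majority-class model:")
print("  accuracy: ", accuracy_score(y_true, y_majority))   # 0.95
print("  recall:   ", recall_score(y_true, y_majority, zero_division=0))    # 0.0
print("  precision:", precision_score(y_true, y_majority, zero_division=0)) # 0.0
```

95% accuracy with zero recall and zero precision: reporting all three metrics, as in the RIGHT statement above, makes such a model impossible to oversell.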

14.12 Project Grading Rubric

14.12.1 Comprehensive Evaluation Rubric

Total: 100 points

1. Problem Definition & Planning (15 points)

| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Problem Statement (0-5 pts) | Clear, specific, measurable, business-aligned (5) | Clear problem, minor scope issues (4) | Problem vague, weak business case (2) | Problem ill-defined (0) |
| Data Planning (0-5 pts) | Complete strategy, quality assessment, privacy check (5) | Data plan documented, mostly complete (4) | Minimal planning, data quality not assessed (2) | No data strategy (0) |
| Feasibility & Timeline (0-5 pts) | Realistic scope, detailed milestones, risk mitigation (5) | Reasonable scope, milestones defined (4) | Over-scoped or unrealistic timeline (2) | No feasibility analysis (0) |

2. Data & Analysis (15 points)

| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Data Quality (0-5 pts) | Clean, well-documented, proper handling of missing values/outliers (5) | Mostly clean, minor issues handled (4) | Some quality issues, incomplete cleaning (2) | Poor data quality (0) |
| EDA & Insights (0-5 pts) | Comprehensive EDA, clear insights, good visualizations (5) | Good EDA, identified patterns (4) | Basic EDA, minimal insights (2) | Insufficient analysis (0) |
| Feature Engineering (0-5 pts) | Smart features, validated impact, interpretable (5) | Some feature engineering, reasonable (4) | Minimal feature work (2) | No feature engineering (0) |

3. Model Development (25 points)

| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Approach (0-8 pts) | Multiple algorithms tested, systematic comparison, justification (8) | 2-3 algorithms, reasonable comparison (6) | Limited algorithm exploration (4) | Single model tried (0) |
| Validation (0-9 pts) | K-fold CV, proper train/val/test split, no data leakage (9) | Proper splitting, no obvious leakage (7) | Basic validation, possible issues (5) | Train/test contamination (0) |
| Hyperparameter Tuning (0-8 pts) | Systematic tuning, documented process, prevents overfitting (8) | Good tuning, reasonable results (6) | Limited tuning (4) | Random/no tuning (0) |

4. Evaluation & Results (20 points)

| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Metrics Selection (0-5 pts) | Appropriate for the problem, multiple metrics, justified (5) | Good metrics, mostly justified (4) | Limited metrics, weak justification (2) | Inappropriate metrics (0) |
| Results & Analysis (0-8 pts) | Clear improvement over baseline, thorough analysis, error discussion (8) | Good results with analysis (6) | Acceptable results, minimal analysis (4) | Poor results, no analysis (0) |
| Reproducibility (0-7 pts) | Complete reproducibility, clear results, documented seeds (7) | Mostly reproducible, good documentation (5) | Somewhat reproducible, missing details (3) | Not reproducible (0) |

5. Documentation & Code Quality (15 points)

| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Code Quality (0-5 pts) | Clean, documented, proper structure, follows best practices (5) | Generally clean, documented, reasonable structure (4) | Some documentation, inconsistent style (2) | Poor documentation, messy code (0) |
| Documentation (0-5 pts) | Complete report, clear writing, all sections thorough (5) | Good documentation, mostly complete (4) | Incomplete documentation, some clarity issues (2) | Minimal documentation (0) |
| Repository (0-5 pts) | Organized structure, clean git history, includes all artifacts (5) | Good organization, mostly complete (4) | Basic organization, missing some items (2) | Disorganized or incomplete (0) |

6. Presentation (10 points)

| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Clarity (0-5 pts) | Clear story, appropriate level, engaging for the audience (5) | Generally clear, minor issues (4) | Some clarity issues, weak organization (2) | Confusing or hard to follow (0) |
| Delivery (0-5 pts) | Confident, well paced, handles questions effectively (5) | Good delivery, minor pacing issues (4) | Nervous or rushed, struggles with questions (2) | Poor presentation skills (0) |

14.12.2 Grading Scale

90-100: A (Excellent)
  - Professional-quality project
  - Clear contribution
  - Production-ready code
  - Excellent presentation

80-89:  B (Good)
  - Solid project with minor issues
  - Good approach and results
  - Well-documented
  - Good presentation

70-79:  C (Fair)
  - Acceptable project, significant gaps
  - Basic approach, adequate results
  - Documentation could be better
  - Adequate presentation

60-69:  D (Poor)
  - Project not fully meeting standards
  - Issues in approach or execution
  - Poor documentation
  - Weak presentation

<60:    F (Fail)
  - Does not meet minimum standards

14.13 Summary

📚 Chapter Summary

1. Project Planning
   - Scoping is the key to success
   - Define the problem with SMART criteria
   - Plan a realistic timeline with milestones

2. Problem Formulation
   - Clear problem statement (context, problem, data, metrics)
   - Match metrics to business goals
   - Consider constraints (latency, privacy, fairness)

3. Data Strategy
   - Comprehensive data collection plan
   - Structured EDA with insights
   - Proper preprocessing with documentation

4. Baseline & Iteration
   - Establish a baseline for context
   - Iterate systematically (not randomly)
   - Document experiments carefully

5. Evaluation & Validation
   - Use multiple metrics (not a single number)
   - K-fold cross-validation for robustness
   - Create diagnostic plots for understanding

6. Documentation
   - Professional project structure
   - Complete README and model card
   - Reproducible code with fixed seeds

7. Presentation
   - Lead with the WHY
   - Use visualizations effectively
   - Tell a coherent story

8. Avoid Common Pitfalls
   - Data leakage (the most common!)
   - Overfitting
   - Tuning on the test set
   - Ignoring class imbalance

Final Checklist: 30 Days Before Submission

✅ Final 30-Day Checklist

Week 1-2 (Code finalization)
- [ ] All code reviewed and cleaned
- [ ] Tests written and passing
- [ ] No hardcoded paths or credentials
- [ ] requirements.txt updated with pinned versions
- [ ] Git history clean (meaningful commits)

Week 2-3 (Documentation)
- [ ] README.md complete and tested
- [ ] Code comments explain WHY, not WHAT
- [ ] Model card written
- [ ] EDA report finalized
- [ ] All results reproducible

Week 3-4 (Presentation)
- [ ] Slides drafted (14-15 slides)
- [ ] Key visualizations created
- [ ] Practice presentation (3x minimum)
- [ ] Get feedback from a mentor or friend
- [ ] Prepare for common Q&A

Final Week (Testing)
- [ ] Run the entire pipeline end-to-end
- [ ] Verify all outputs match the report
- [ ] Check the presentation on the actual equipment
- [ ] Submit early (the day before the deadline)
- [ ] Take a screenshot as proof


🎓 Congratulations! You are ready to start your capstone project!

Remember: a small scope finished well beats a large scope left unfinished.

Focus on QUALITY over QUANTITY.

Good luck! 🚀