Chapter 14: Capstone Project Guide & Best Practices
From Ideation to Implementation: Planning, Executing, and Communicating a Professional ML Project
After completing this chapter, you will be able to:
- Plan a production ML project with a clear scope and a realistic timeline
- Formulate the problem well through a measurable problem statement (SMART criteria)
- Design a comprehensive data strategy, from collection through validation
- Build a baseline model and iterate systematically toward target performance
- Evaluate and validate models with metrics appropriate to the use case
- Document the project to professional standards, with reproducibility
- Present project findings and insights clearly to stakeholders
- Identify and avoid common pitfalls in capstone projects
14.1 Project Planning & Scoping
14.1.1 Why Does Scoping Matter?
Many students pour their time into modeling, while planning and scoping are neglected. Yet poor scoping is the leading cause of project failure.
Striking statistics:
- 45% of capstone projects fail to deliver meaningful results because the scope is unclear
- 60% of students underestimate the timeline at the start of the project
- 38% attempt problems that are too ambitious for one semester
Common failure patterns:
- Scope Creep: start with a clear scope, then keep adding new features
- Underestimating Complexity: "looks easy" → turns out to take 3x longer
- Fixed Scope, Squeezed Timeline: deadlines get compressed and quality suffers
- No Minimum Viable Product (MVP): all or nothing
14.1.2 Project Scoping Framework
Step 1: Define the Project Goal
The project is not just about YOU; it is about the VALUE it will deliver.
Key questions:
- Who are the main stakeholders?
- What problem is being solved?
- How will success be measured?
- What is a realistic timeline?
Examples of Good vs Bad Scope:
| Bad Scope | Good Scope |
|---|---|
| "Build an ML model for sentiment analysis" | "Build a sentiment analysis model to classify customer feedback (positive/negative/neutral) with a target accuracy of 85%, to help the customer service team prioritize complaints" |
| "Predict stock prices" | "Build an LSTM model to predict intraday price movements (+/- 2% threshold) using 6 months of historical data, to identify trading opportunities with risk-adjusted returns" |
| "Image classification" | "Classify malware vs benign Windows executable files at 90% recall for security screening, using image representations of the binary files" |
Step 2: Define Success Criteria (SMART)
- Specific: clearly defined
- Measurable: quantifiable with concrete metrics
- Achievable: realistic given the available resources
- Relevant: matters to stakeholders
- Time-bound: has a clear deadline
Example SMART criterion:
❌ Bad: "The model must be accurate"
✅ Good: "Achieve 85%+ accuracy on the test set with a balanced
   dataset (n=5000 samples) using Random Forest, validated
   with 5-fold cross-validation, deadline 31 December 2024"
Step 3: Risk Assessment
Identify potential blockers BEFORE starting:
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Data unavailable | Medium | Critical | Request it from the company 2 weeks in advance |
| Poor data quality | High | Medium | Plan an intensive data-cleaning phase |
| Model fails to converge | Medium | High | Research state-of-the-art papers, test multiple algorithms early |
14.1.3 Timeline & Milestone Planning
Capstone timeline structure (1 semester = 16 weeks):
Weeks 1-2: Problem Definition + Planning (10%)
Weeks 3-5: Data Collection & EDA (15%)
Weeks 6-8: Feature Engineering & Preprocessing (15%)
Weeks 9-11: Model Development & Experimentation (25%)
Weeks 12-13: Evaluation & Optimization (15%)
Weeks 14-15: Documentation & Presentation Prep (15%)
Week 16: Final Presentation & Submission (5%)
Critical Milestones:
Month 1: Project Setup (Due: Week 4)
- [ ] Problem statement finalized and approved
- [ ] Stakeholders identified
- [ ] Preliminary data assessment done
- [ ] Team roles & responsibilities defined
- [ ] Git repository set up with a proper structure
Month 2: Data & Baseline (Due: Week 8)
- [ ] Dataset collected and cleaned
- [ ] EDA report completed
- [ ] Data splits (train/val/test) finalized
- [ ] Baseline model implemented
- [ ] Evaluation metrics selected
Month 3: Model Development (Due: Week 12)
- [ ] 3+ models trained and compared
- [ ] Hyperparameter tuning completed
- [ ] Best model selected
- [ ] Cross-validation done
- [ ] Model card drafted
Final 2 Weeks: Finalization
- [ ] Documentation complete
- [ ] Code cleaned & tested
- [ ] Demo prepared
- [ ] Presentation slides ready
14.2 Problem Formulation
14.2.1 Anatomy of a Good Problem Statement
Key components:
- Context: background and business case
- Problem: what needs to be solved
- Data: what data is available, and how much
- Success Metrics: how success will be measured
- Constraints: technical and non-technical limits
Example Problem Statement:
CONTEXT:
GrowthBank serves 50,000+ B2B customers with an
average loan size of Rp 500 million. Manual credit approval
takes 3-5 days and has a default rate of 8%.
PROBLEM:
Automate the credit scoring process to cut approval
time to <24 hours and the default rate to <5%,
while maintaining customer satisfaction.
DATA:
- 10,000 historical loans (2018-2023)
- 40+ features: company profile, financials, payment history
- 5% data missing (handled appropriately)
SUCCESS METRICS:
1. Model accuracy: 85%+ pada test set
2. Default recall: 90% (catch bad borrowers)
3. Processing speed: <5 seconds per application
4. Interpretability: Top 5 important features identifiable
CONSTRAINTS:
- Data privacy: PII must be removed/encrypted
- Latency: Must respond in <5 sec
- Availability: 99% uptime required
- Fairness: No discrimination against protected groups
14.2.2 Problem Type Classification
Classification:
- Binary (yes/no, churn/stay, fraud/legitimate)
- Multi-class (sentiment: positive/neutral/negative)
- Multi-label (music genres: rock, pop, jazz simultaneously)
Regression:
- Continuous values (price, temperature, traffic volume)
- Time series (stock price prediction, demand forecasting)
Ranking/Recommendation:
- Prioritize items (search ranking, recommendation system)
- Matching (matching job seekers to jobs)
Anomaly Detection:
- Outlier detection (fraud, system intrusion, equipment failure)
- Novelty detection (new attack types)
Clustering:
- Customer segmentation
- Document clustering
Choosing the right problem type determines:
- Data requirements
- Metrics selection
- Algorithm choices
- Evaluation approach
14.2.3 Defining Metrics
"60% accuracy" means nothing on its own. Metrics must be:
1. Aligned with business KPIs
2. Interpretable (not only to statisticians)
3. Actionable (can be tied to decisions)
Example of Metric Selection:
PROBLEM: Fraud Detection
↓
BUSINESS GOAL: Catch 95% of frauds, minimize false positives
↓
METRICS: Recall = 95% (catch fraud), high precision
(avoid false alarms)
↓
IMPLEMENTATION: Select the threshold that maximizes the F2-score
(2x weight on recall)
Common Metrics by Problem Type:
| Problem Type | Primary Metric | Secondary Metrics |
|---|---|---|
| Classification | Accuracy (balanced), Precision, Recall, F1 | AUC-ROC, Confusion Matrix |
| Imbalanced | Precision, Recall, F1-score, AUC | Sensitivity, Specificity |
| Regression | MAE, RMSE | RΒ², MAPE |
| Ranking | NDCG, MAP | MRR, Recall@K |
| Clustering | Silhouette Score | Davies-Bouldin Index |
| Anomaly | Detection Rate, False Positive Rate | Precision@K, AUROC |
Caution on Single Metric:
# ❌ Don't do this
if accuracy > 0.85:
    print("Model is good!")

# ✅ Do this instead
metrics = {
    'accuracy': 0.85,
    'precision': 0.82,
    'recall': 0.88,
    'f1': 0.85,
    'auc_roc': 0.89
}
# Interpret holistically
print("High recall (0.88) → catches most positives")
print("OK precision (0.82) → some false alarms acceptable")
print("Balanced F1 (0.85) → good overall trade-off")
14.3 Data Strategy
14.3.1 Data Collection Plan
Template: Data Collection Checklist
Data Source
- [ ] Source identified (API, database, CSV, web scraping)
- [ ] Access obtained (permissions, credentials)
- [ ] Data freshness understood (real-time, daily, monthly)
- [ ] Size confirmed (n samples × m features)
Data Quality Assessment
- [ ] Missing values documented (<5% acceptable)
- [ ] Duplicates checked
- [ ] Outliers identified
- [ ] Data type validation done
- [ ] Value ranges reasonable
Data Privacy & Ethics
- [ ] PII removal/anonymization done
- [ ] GDPR/compliance checked
- [ ] Bias in data identified
- [ ] Consent obtained (if needed)
- [ ] Data retention policy defined
Data Documentation
- [ ] Data dictionary created (each feature explained)
- [ ] Data quality report generated
- [ ] Collection date/period documented
- [ ] Known issues documented
Example: Data Dictionary
Feature Name: transaction_amount
├─ Type: float64
├─ Unit: Indonesian Rupiah (IDR)
├─ Range: 10,000 - 999,999,999
├─ Missing: 0.2% (handled by median imputation)
├─ Distribution: Right-skewed (log-transform applied)
├─ Source: transaction_table.amount
└─ Notes: 3 outliers > 999M (verified, kept)
Feature Name: customer_age
├─ Type: int64
├─ Unit: Years
├─ Range: 18 - 75
├─ Missing: 1.5% (filled with median)
├─ Distribution: Relatively uniform
├─ Source: customer_table.age
└─ Notes: Some suspicious values (999), filtered out
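The checks a data dictionary describes can be automated. Below is a minimal sketch; the `check_column` helper, the toy `df` frame, and its values are hypothetical illustrations, not part of the project:

```python
import pandas as pd

def check_column(df, col, dtype, lo, hi, max_missing):
    """Return a list of data-dictionary violations for one column."""
    issues = []
    if str(df[col].dtype) != dtype:
        issues.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    missing = df[col].isna().mean()
    if missing > max_missing:
        issues.append(f"{col}: {missing:.1%} missing (limit {max_missing:.1%})")
    out_of_range = int((~df[col].dropna().between(lo, hi)).sum())
    if out_of_range:
        issues.append(f"{col}: {out_of_range} values outside [{lo}, {hi}]")
    return issues

# Toy frame with a deliberate missing value and an out-of-range age (999)
df = pd.DataFrame({
    "transaction_amount": [10_000.0, 250_000.0, None],
    "customer_age": [25, 999, 40],
})
problems = (check_column(df, "transaction_amount", "float64", 10_000, 999_999_999, 0.05)
            + check_column(df, "customer_age", "int64", 18, 75, 0.05))
for p in problems:
    print(p)
```

Running such checks whenever the data is refreshed turns the dictionary from passive documentation into a validation gate.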
14.3.2 Exploratory Data Analysis (EDA) Structure
Layered EDA Approach:
- Univariate Analysis (1 variable at a time)
  - Distribution, central tendency, spread
  - Outliers, skewness, missing values
- Bivariate Analysis (2 variables)
  - Correlation with the target
  - Feature relationships
  - Potential interactions
- Multivariate Analysis (3+ variables)
  - Feature correlations
  - Clustering patterns
  - Domain insights
EDA Outputs to Document:
Checklist for the EDA Report:
- [ ] Dataset shape dan basic info
- [ ] Missing values visualization & handling
- [ ] Distributions (histograms, KDE plots)
- [ ] Outliers identified & approach decided
- [ ] Correlation heatmap & top correlated features
- [ ] Feature importance from EDA
- [ ] Class imbalance (if classification)
- [ ] Data quality issues & resolutions
- [ ] Key insights & hypotheses
- [ ] Feature engineering ideas
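Several of the checklist items above take only a few lines of pandas. A minimal sketch on a synthetic stand-in dataset (the `df` frame, its column names, and the 10% positive rate are all made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in dataset; replace with your project data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income": rng.lognormal(10, 1, 500),
    "age": rng.integers(18, 75, 500),
    "target": rng.choice([0, 1], 500, p=[0.9, 0.1]),
})
df.loc[rng.choice(500, 20, replace=False), "income"] = np.nan

print(df.shape)                                        # dataset shape
print(df.isna().mean().sort_values(ascending=False))   # missing rate per column
print(df["target"].value_counts(normalize=True))       # class imbalance
print(df.corr(numeric_only=True)["target"].drop("target"))  # correlation with target
print(df["income"].skew())                             # skew flags log-transform candidates
```

Each printed result maps directly to a checklist row, so the EDA report can be regenerated at any time from the raw data.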
14.3.3 Data Preparation Workflow
Raw Data
↓
[Clean] → Remove duplicates, fix obvious errors
↓
[Transform] → Handle missing values, encode categoricals, scale
↓
[Validate] → Check quality, ranges, distributions
↓
[Split] → Train (70%) / Validation (15%) / Test (15%)
↓
[Document] → Version data, document transformations
↓
Ready for Modeling
Key Decision Points:
| Decision | Options | Trade-offs |
|---|---|---|
| Missing Values | Drop / Impute (mean/median/KNN) | Lose data vs bias |
| Categorical Encoding | One-hot / Label / Ordinal | Sparsity vs information |
| Feature Scaling | StandardScaler / MinMaxScaler / RobustScaler | Interpretability vs performance |
| Imbalanced Data | Oversample / Undersample / SMOTE | Overfitting vs underfitting |
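One way to keep these decisions leakage-safe is to express them as a scikit-learn `Pipeline`/`ColumnTransformer`, fitted on the training split only. A minimal sketch with hypothetical column names (`amount`, `age`, `segment`):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists; substitute your own features
numeric_cols = ["amount", "age"]
categorical_cols = ["segment"]

df = pd.DataFrame({
    "amount": [100.0, np.nan, 300.0, 250.0],
    "age": [25, 40, 31, 58],
    "segment": ["retail", "b2b", "retail", "b2b"],
    "target": [0, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.5, random_state=42
)

preprocess = ColumnTransformer([
    # Impute then scale numeric columns; one-hot encode categoricals
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on the training split only: test-set statistics never leak in
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Because the imputation, scaling, and encoding statistics are all learned in `fit_transform` on the train split, the same object can be reused (or cross-validated) without re-deciding any of the table's trade-offs by hand.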
14.4 Baseline & Iteration
14.4.1 Establishing Baseline
"Baseline" = the simplest possible model for your problem.
A baseline is not about high performance. A baseline is:
1. A sanity check → is your model better than the baseline?
2. A reference point → how much improvement over the baseline?
3. A proof of concept → is the problem solvable with ML at all?
Baseline Ideas by Problem Type:
CLASSIFICATION:
├─ Majority class (always predict positive/negative)
├─ Random classifier (50% for binary)
├─ Logistic Regression
└─ Decision Tree
REGRESSION:
├─ Mean predictor (always predict the mean)
├─ Median predictor
├─ Linear Regression
└─ Decision Tree
RANKING:
├─ Random ranking
├─ Popularity ranking
└─ TF-IDF based ranking
Example: Fraud Detection Baseline
from sklearn.metrics import recall_score

# Baseline 1: Always predict "no fraud" (majority class)
baseline1_accuracy = (fraud_data.label == 0).mean()  # e.g., 98.5%
baseline1_recall = 0  # Catches 0% of frauds

# Baseline 2: Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
baseline2_accuracy = lr.score(X_test, y_test)
baseline2_recall = recall_score(y_test, lr.predict(X_test))

print(f"Baseline 1 (always negative): Acc={baseline1_accuracy:.1%}")
print(f"Baseline 2 (Logistic): Acc={baseline2_accuracy:.1%}, "
      f"Recall={baseline2_recall:.1%}")
# Your model must beat BOTH the accuracy AND the recall of these baselines
14.4.2 Systematic Iteration Process
Don't randomly try 100 algorithms. Iterate systematically.
Phase 1: Simple Models (Week 1-2)
Try: Logistic Regression, Decision Tree, KNN
Goal: Understand the problem & establish a baseline
Phase 2: Intermediate Models (Week 2-3)
Try: Random Forest, SVM, Gradient Boosting
Goal: Find the algorithm that works best
Phase 3: Advanced Models (Week 3-4)
Try: Neural networks, ensembles, state-of-the-art methods
Goal: Push toward target performance
Phase 4: Optimization (Week 4-5)
Try: Hyperparameter tuning, ensemble methods
Goal: The final squeeze on performance
Iteration Template to Document:
## Experiment Log
### Experiment 1: Logistic Regression Baseline
- Date: 2024-01-10
- Model: LogisticRegression(C=1.0, max_iter=1000)
- Features: 35 features, no scaling
- Result: Accuracy=0.82, Recall=0.75, F1=0.78
- Note: Baseline established
- Next: Try feature scaling
### Experiment 2: Logistic Regression + Scaling
- Date: 2024-01-11
- Model: LogisticRegression with StandardScaler
- Features: 35 features, StandardScaled
- Result: Accuracy=0.84, Recall=0.78, F1=0.81
- Note: Slight improvement from scaling
- Next: Try feature engineering
### Experiment 3: Random Forest
- Date: 2024-01-12
- Model: RandomForest(n_estimators=100, max_depth=10)
- Features: 35 features + 8 engineered features
- Result: Accuracy=0.87, Recall=0.85, F1=0.86
- Note: Significant improvement!
- Next: Hyperparameter tuning for RF
14.4.3 Common Iteration Pitfalls
❌ Tuning on Test Set
# WRONG: Evaluating on the test set repeatedly
for hyperparams in search_space:
    model = fit_model(X_train, y_train, hyperparams)
    score = model.score(X_test, y_test)  # OVERFITTING TO TEST!

# RIGHT: Evaluate on the validation set
for hyperparams in search_space:
    model = fit_model(X_train, y_train, hyperparams)
    score = model.score(X_val, y_val)  # Use validation
# Final evaluation on the test set only!
❌ Data Leakage
# WRONG: Scale the entire dataset, then split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit on ALL data!
X_train, X_test = train_test_split(X_scaled)

# RIGHT: Fit the scaler only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform using train stats
❌ Metric Obsession
# WRONG: Chasing a single metric
# "My model has 97% accuracy! I'm done!"
# But: 0% recall, 500ms latency, not deployable

# RIGHT: Consider all metrics
# Accuracy: 85%, Recall: 90%, Latency: 50ms, Interpretable: Yes
14.5 Evaluation & Validation
14.5.1 Comprehensive Evaluation Framework
Do not evaluate a model on a single metric alone.
Model Evaluation Matrix:
| Dimension | Metric | Target |
|---|---|---|
| Accuracy | F1-Score | β₯ 0.85 |
| Fairness | Demographic Parity | < 0.1 difference |
| Robustness | Adversarial Accuracy | > 0.80 |
| Interpretability | SHAP importance | Top-5 features clear |
| Speed | Latency (p95) | < 100ms |
| Resource Usage | Memory, CPU | < 1GB RAM |
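Latency targets like the p95 figure above can be measured directly. A rough sketch; the `p95_latency_ms` helper and the `fake_predict` stand-in are illustrative assumptions, not a benchmarking library:

```python
import time
import numpy as np

def p95_latency_ms(predict_fn, batches, warmup=3):
    """Measure 95th-percentile prediction latency in milliseconds."""
    for batch in batches[:warmup]:  # warm up caches before timing
        predict_fn(batch)
    timings = []
    for batch in batches:
        start = time.perf_counter()
        predict_fn(batch)
        timings.append((time.perf_counter() - start) * 1000)
    return float(np.percentile(timings, 95))

# Hypothetical stand-in for model.predict
def fake_predict(x):
    return np.asarray(x).sum()

batches = [np.random.rand(100, 10) for _ in range(50)]
p95 = p95_latency_ms(fake_predict, batches)
print(f"p95 latency: {p95:.3f} ms")
```

Reporting the 95th percentile rather than the mean matches how latency SLAs are usually written: the tail, not the average, is what users notice.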
14.5.2 Advanced Validation Techniques
1. K-Fold Cross Validation
from sklearn.model_selection import cross_validate
# Better than single train/test split
cv_results = cross_validate(
model, X, y,
cv=5, # 5-fold
scoring=['accuracy', 'precision', 'recall', 'f1'],
return_train_score=True
)
# Check for overfitting
train_score = cv_results['train_accuracy'].mean()
test_score = cv_results['test_accuracy'].mean()
print(f"Train: {train_score:.3f}, Test: {test_score:.3f}")
if (train_score - test_score) > 0.15:
print("β οΈ Possible overfitting! Gap > 15%")2. Stratified Split (untuk imbalanced data)
from sklearn.model_selection import StratifiedKFold
# Ensure the class distribution is maintained in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Training loop
3. Time Series Cross Validation (for sequential data)
from sklearn.model_selection import TimeSeriesSplit
# Don't shuffle! Respect the temporal order
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    # Model is trained on the past, tested on the future
14.5.3 Diagnostic Plots
Essential diagnostic plots for the capstone:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, ax=axes[0, 0], cmap='Blues')
axes[0, 0].set_title('Confusion Matrix')

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc_score = auc(fpr, tpr)
axes[0, 1].plot(fpr, tpr, label=f'AUC={auc_score:.3f}')
axes[0, 1].plot([0, 1], [0, 1], 'k--', label='Random')
axes[0, 1].set_xlabel('FPR')
axes[0, 1].set_ylabel('TPR')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend()

# 3. Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
axes[1, 0].plot(recall, precision)
axes[1, 0].set_xlabel('Recall')
axes[1, 0].set_ylabel('Precision')
axes[1, 0].set_title('Precision-Recall Curve')

# 4. Feature Importance
feature_importance = pd.DataFrame({
    'feature': X_test.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
axes[1, 1].barh(feature_importance['feature'][:10],
                feature_importance['importance'][:10])
axes[1, 1].set_title('Top 10 Important Features')

plt.tight_layout()
plt.savefig('model_diagnostics.png', dpi=300, bbox_inches='tight')
14.6 Documentation Best Practices
14.6.1 Project Structure & Documentation
Professional project structure:
my-capstone-project/
├── README.md                  # Project overview
├── DOCUMENTATION.md           # Detailed documentation
├── LICENSE                    # MIT or similar
├── .gitignore                 # Exclude large files, cache
│
├── data/
│   ├── raw/                   # Original data
│   ├── processed/             # Cleaned, transformed data
│   └── README.md              # Data dictionary
│
├── notebooks/
│   ├── 01_eda.ipynb           # Exploratory analysis
│   ├── 02_preprocessing.ipynb # Data preparation
│   └── 03_modeling.ipynb      # Model training
│
├── src/
│   ├── __init__.py
│   ├── preprocessing.py       # Data prep functions
│   ├── features.py            # Feature engineering
│   ├── models.py              # Model definitions
│   ├── evaluation.py          # Evaluation metrics
│   └── utils.py               # Helper functions
│
├── models/
│   ├── model_v1.0.pkl         # Saved models
│   ├── model_v1.1.pkl
│   └── model_card.md          # Model documentation
│
├── reports/
│   ├── eda_report.html        # EDA visualization
│   ├── model_comparison.csv   # Experiment results
│   └── final_report.pdf       # Final analysis
│
├── tests/
│   ├── test_preprocessing.py  # Unit tests
│   ├── test_models.py
│   └── test_pipeline.py
│
├── requirements.txt           # Dependencies
├── setup.py                   # Package setup (if publishing)
└── train.py                   # Main training script
14.6.2 Documentation Template
README.md Example:
# Credit Risk Prediction Model
## Overview
Building an automated credit scoring system to reduce loan
approval time from 3 days to <24 hours while maintaining
default rate below 5%.
## Dataset
- **Source**: GrowthBank historical loans (2018-2023)
- **Size**: 10,000 samples Γ 42 features
- **Target**: Binary (Default/Non-default), 8% positive class
- **Time Period**: 2018-01-01 to 2023-12-31
## Project Structure
[File structure description]
## Quick Start
pip install -r requirements.txt
python train.py
## Results
- **Best Model**: Random Forest
- **Accuracy**: 87.2% Β± 1.3% (5-fold CV)
- **Recall (Default)**: 91.5% (catch 91.5% of defaults)
- **Precision**: 45.2% (acceptable false positive rate)
- **ROC-AUC**: 0.934
## Model Performance
[Performance visualization and metrics table]
## Key Findings
1. Transaction frequency is strongest indicator of default risk
2. Recent payment history more important than historical average
3. Model shows no significant bias against age groups
## Limitations
- Limited to B2B loans (may not generalize to consumer)
- Training data from 2018-2023 (concept drift possible)
- No alternative data sources (e.g., behavioral)
## Future Work
- Implement real-time model monitoring
- Extend to multi-class risk levels (low/medium/high)
- Add fairness constraints for protected attributes
14.6.3 Model Card Documentation
A Model Card is standardized documentation for an ML model.
# Model Card: Credit Risk Classifier v1.2
## Model Details
- **Model Type**: Random Forest Classification
- **Framework**: Scikit-learn
- **Version**: 1.2
- **Date**: 2024-01-15
- **Authors**: [Your Name]
- **License**: MIT
## Intended Use
- **Intended Use**: Automated credit risk assessment for B2B loans
- **Primary Users**: Credit department, lending officers
- **Out-of-Scope**: Consumer lending, international markets
## Performance
- **Training Data**: 8,000 samples (80%)
- **Test Data**: 2,000 samples (20%)
- **Metric**: Accuracy, Precision, Recall, F1
### Detailed Performance Metrics
                 Precision  Recall  F1-Score  Support
Non-Default (0)     0.93     0.88     0.90      1840
Default (1)         0.45     0.92     0.61       160
Accuracy                              0.87      2000
## Fairness Analysis
- **Gender Bias**: FPR diff = 1.2% (acceptable)
- **Age Bias**: ROC-AUC for <30 = 0.92, >50 = 0.94 (no significant difference)
## Limitations
- Only trained on B2B segment
- Assumes data distribution similar to training period
- Requires regular retraining
## Data and Preprocessing
- **Training Data**: 8000 historical loans with known outcomes
- **Input Features**: 42 features (company profile + financials)
- **Preprocessing**: StandardScaler on numerical, OneHot on categorical
14.7 Code Quality & Reproducibility
14.7.1 Reproducibility Checklist
Code Reproducibility:
- [ ] Random seeds fixed (numpy, random, framework)
- [ ] Dependencies pinned in requirements.txt
- [ ] Configuration kept separate from code (e.g., config.json)
Data Reproducibility:
- [ ] Raw data versioned (or checksums recorded)
- [ ] All preprocessing scripted, nothing done by hand
- [ ] Train/val/test split uses a fixed random_state
Results Reproducibility:
- [ ] Experiment parameters and metrics logged
- [ ] Model artifacts saved with version numbers
- [ ] Environment documented (Python and library versions)
Example: Reproducible Training Script
# train.py - Fully reproducible training
import numpy as np
import random
import tensorflow as tf
from pathlib import Path
import json
# Set random seeds for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
def main(config_path='config.json'):
    """Main training function (load_data, train_model, evaluate_model,
    and save_artifacts are project helpers defined in src/)."""
    # Load configuration
    with open(config_path, 'r') as f:
        config = json.load(f)

    # Load and preprocess data
    X_train, y_train, X_test, y_test = load_data(
        config['data_path'],
        test_size=config['test_size'],
        random_state=RANDOM_SEED
    )

    # Train model
    model = train_model(
        X_train, y_train,
        **config['model_params']
    )

    # Evaluate
    metrics = evaluate_model(model, X_test, y_test)

    # Save results
    save_artifacts(model, metrics, config)
    return metrics

if __name__ == "__main__":
    metrics = main()
    print(f"Final accuracy: {metrics['accuracy']:.3f}")
config.json:
{
  "data_path": "data/processed/",
  "test_size": 0.2,
  "model_params": {
    "n_estimators": 100,
    "max_depth": 10,
    "random_state": 42
  },
  "output_dir": "results/model_v1.2/"
}
14.7.2 Code Quality Standards
Minimal standards for a capstone:
# ✅ Good: Functions with docstrings
def prepare_features(X, categorical_cols, numerical_cols):
    """
    Prepare features for model training.

    Parameters
    ----------
    X : pd.DataFrame
        Input features with categorical and numerical columns
    categorical_cols : list
        Names of categorical columns for one-hot encoding
    numerical_cols : list
        Names of numerical columns for scaling

    Returns
    -------
    X_prepared : np.ndarray
        Prepared feature matrix ready for training
    """
    # Implementation
    return X_prepared

# ✅ Good: Comments explain WHY, not WHAT
# Skip features with >50% missing values
# (imputation would introduce too much bias)
mask = feature_missing_rate < 0.5
X_filtered = X.loc[:, mask]

# ❌ Bad: Comments repeat the code
for i in range(len(data)):  # Loop through data
    if data[i] > threshold:  # Check if greater
        results.append(data[i])  # Add to results
Type hints for clarity:
from typing import Tuple, List, Dict
import pandas as pd
import numpy as np
def split_features(
    X: pd.DataFrame,
    target_ratio: float = 0.2
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split into train and test sets with the specified ratio."""
    n_test = int(len(X) * target_ratio)
    return X[:-n_test], X[-n_test:]

def calculate_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray
) -> Dict[str, float]:
    """Calculate evaluation metrics."""
    return {
        'accuracy': (y_true == y_pred).mean(),
        # ... more metrics
    }
14.8 Technical Report Writing
14.8.1 Report Structure
Standard structure for a capstone technical report (15-30 pages):
1. Executive Summary (1 page)
- Problem, solution, results
- Key recommendation
- Business impact
2. Introduction (2-3 pages)
- Context and motivation
- Problem statement
- Research questions
- Contributions
3. Literature Review (2-3 pages)
- Related work
- Existing solutions
- Knowledge gaps
- How your work differs
4. Methodology (3-4 pages)
- Problem formulation (mathematical)
- Algorithms & approaches
- Evaluation metrics
- Hyperparameters
5. Data Description (2-3 pages)
- Dataset characteristics
- Data collection & preprocessing
- Feature engineering
- Class distribution, missing values
- Train/val/test splits
6. Results (3-5 pages)
- Model comparison
- Best model performance
- Ablation studies
- Visualizations (confusion matrix, ROC, etc.)
7. Analysis & Discussion (2-3 pages)
- Interpret results
- Why did model work/fail
- Limitations
- Error analysis
8. Conclusion & Future Work (1-2 pages)
- Summary of findings
- Practical implications
- Directions for future research
9. References (1-2 pages)
- Academic citations
- Data sources
- Software libraries
14.8.2 Writing Best Practices
β Avoid:
- Overly technical jargon tanpa explanation
- Unsupported claims (βOur model is the best!β)
- Lengthy code listings in main report
- Vague statements (βWe tried many modelsβ)
β Do:
- Explain technical concepts tersedia untuk general audience
- Support claims dengan evidence (metrics, citations)
- Put code in appendix
- Be specific (βWe evaluated 12 models using GridSearchCVβ)
Example: Good vs Bad Writing
β BAD:
"We used an RF with optimized hyperparameters to
maximize the AUC on the validation dataset."
β
GOOD:
"We trained a Random Forest classifier with
100 trees and max depth of 10 (selected via 5-fold
cross-validation to maximize AUC-ROC). The model
achieved 87.2% accuracy on the test set."
14.9 Presentation & Demo Skills
14.9.1 Presentation Structure
Final presentation (15-20 minutes):
0-1 min: Title slide, introduce yourself
1-2 min: The Problem (why should audience care?)
2-3 min: Your Solution (brief overview)
3-8 min: Key Results (metrics, visualizations)
8-12 min: Technical Deep Dive (1-2 complex topics)
12-15 min: Limitations & Future Work
15-20 min: Q&A
Slide Guidelines:
- Slide 1-2: Title + Problem (make audience care!)
- Slide 3-4: Data overview (n samples, features, class distribution)
- Slide 5-6: Approach (your method vs baselines)
- Slide 7-10: Results (visualizations!)
- Confusion matrix
- ROC/PR curve
- Feature importance
- Model comparison
- Slide 11-12: Deep dive on 1-2 interesting findings
- Slide 13: Limitations honestly discussed
- Slide 14: Future work & lessons learned
Presentation Tips:
- Lead with Why: "Why should anyone care about this problem?"
- Show Impact: How does this solve a real problem?
- Use Visualizations: data > tables > text
- Tell a Story: not just "We did X and got Y accuracy"
- Prepare for Questions: know your code, data, and assumptions
- Practice: time yourself, get feedback
14.9.2 Live Demo Best Practices
If demonstrating a working system:
✅ DO:
- [ ] Test the demo multiple times before the presentation
- [ ] Have backup screenshots/video in case the live demo breaks
- [ ] Keep the demo simple (don't show complex edge cases)
- [ ] Make a prediction in <5 seconds
- [ ] Explain what the model learned, not just its output
❌ DON'T:
- Don't go off script ("Let me show you something...")
- Don't click around aimlessly
- Don't make predictions on atypical data
- Don't forget what the model is doing
- Don't spend >2 minutes on the demo
14.10 Project Examples & Case Studies
14.10.1 Case Study 1: Fraud Detection (Classification)
Problem: An e-commerce platform with 100K+ transactions/day and a fraud rate of 0.8%.
Approach:
1. Baseline: Logistic Regression
   - Accuracy: 99.8% (predicting everything as non-fraud!)
   - Recall: 0% (catches 0 frauds) ❌
2. Strategy: Class imbalance handling
   - SMOTE oversampling for the minority class
   - Use recall as the primary metric
   - Adjust the decision threshold
3. Best Model: Gradient Boosting + Custom Threshold
   - Accuracy: 98.5%
   - Recall: 92% (catches 92% of frauds)
   - Precision: 35% (acceptable for fraud detection)
   - Live performance: catches $2M of fraud/month
Key Lessons:
- Accuracy is misleading on imbalanced data
- Recall > precision for fraud (catch frauds, accept some false alarms)
- Real-time constraints matter (must score a transaction in <100ms)
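The threshold adjustment in step 2 can be sketched as follows: on validation data, pick the threshold that reaches the target recall with the best precision. Synthetic data stands in here for the (non-public) fraud dataset, and the 0.92 recall target mirrors the case study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the real fraud dataset
X, y = make_classification(n_samples=5000, weights=[0.97], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Among thresholds that reach the target recall on validation data,
# keep the one with the best precision
precision, recall, thresholds = precision_recall_curve(y_val, proba)
ok = recall[:-1] >= 0.92            # last curve point has no threshold
idx = int(np.argmax(precision[:-1] * ok))
threshold = thresholds[idx]
print(f"threshold={threshold:.3f}, "
      f"recall={recall[idx]:.2f}, precision={precision[idx]:.2f}")
```

The chosen threshold would then be applied as `proba >= threshold` at serving time; note it is tuned on the validation split, never on the test set.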
14.10.2 Case Study 2: Predictive Maintenance (Regression)
Problem: Manufacturing plant, reduce unplanned downtime.
Approach:
1. Data: 5 years sensor data (temperature, vibration, pressure)
2. Target: RUL (Remaining Useful Life) prediction
3. Baseline: Linear regression on recent sensor readings
- RMSE: 150 hours
4. Advanced: Sequence-to-sequence LSTM
- Input: 7 days sensor history
- Output: Days until failure
- RMSE: 42 hours (3x better)
- Enables preventive maintenance
Challenges & Solutions:
- Class imbalance: Most machines work fine
- Solution: Weighted loss function, focus on failures
- Seasonality: Equipment behaves differently by season
- Solution: Add seasonal features
- Concept drift: Equipment degrades over 5 years
- Solution: Retrain monthly with recent data
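Two of the solutions above (seasonal features and a weighted loss) can be sketched with scikit-learn's `sample_weight` on synthetic sensor data; all column names and the 200-hour cutoff are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic sensor data standing in for the real plant data
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "temperature": rng.normal(70, 5, n),
    "vibration": rng.normal(0.2, 0.05, n),
    "month": rng.integers(1, 13, n),
    "rul_hours": rng.uniform(0, 2000, n),  # hypothetical remaining useful life
})

# Seasonal features: encode month cyclically so December sits next to January
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)

X = df[["temperature", "vibration", "month_sin", "month_cos"]]
y = df["rul_hours"]

# Weighted loss: upweight samples close to failure (low RUL), since most
# machines are healthy and errors there matter most
weights = np.where(y < 200, 5.0, 1.0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X, y, sample_weight=weights)
print(model.predict(X.head()))
```

The sin/cos pair avoids the artificial jump between month 12 and month 1 that a raw integer encoding would create, and the weights bias the regressor toward accuracy near failure, where predictions actually drive maintenance decisions.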
14.11 Common Pitfalls & Prevention
14.11.1 Technical Pitfalls
❌ Problem: Information from the future leaks into training
# WRONG: Use future information in features
for idx in range(len(data)):
    # Using data[idx+1] (the future) to predict data[idx]
    features[idx] = [
        data[idx]['price'],
        data[idx+1]['price'],   # ❌ FUTURE DATA!
        data[idx+1]['volume']   # ❌ FUTURE DATA!
    ]

# RIGHT: Use only past information
for idx in range(1, len(data)):  # Start from idx=1
    features[idx] = [
        data[idx]['price'],
        data[idx-1]['price'],   # ✅ Past
        data[idx-1]['volume']   # ✅ Past
    ]
❌ Problem: Scaling the entire dataset before the train/test split
# WRONG: Fit the scaler on the entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ❌ Fit on ALL data
X_train, X_test = train_test_split(X_scaled)
# The model "sees" test-set information during scaling!

# RIGHT: Fit the scaler ONLY on training data
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # ✅ Fit on train only
X_test_scaled = scaler.transform(X_test)  # ✅ Transform using train stats
How to prevent: keep all preprocessing inside the training fold (for example, in a scikit-learn Pipeline).
❌ Problem: Training accuracy 99%, test accuracy 70%
# Signs of overfitting:
if train_acc - test_acc > 0.15:  # Gap > 15%
    print("⚠️ Probable overfitting!")

# Solutions:
# 1. More training data
# 2. Reduce model complexity
# 3. Regularization (L1/L2)
# 4. Early stopping
# 5. Dropout (neural networks)

# ✅ Regularized model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.01, penalty='l2')
# C < 1 means stronger regularization
❌ Problem: Tuning hyperparameters on the test set
# WRONG: Evaluating on the test set repeatedly
best_score = 0
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)  # ❌ TUNING ON TEST!
    if score > best_score:
        best_score = score
        best_C = C

# ✅ RIGHT: Use a validation set for hyperparameter selection
best_score = 0
for C in [0.001, 0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # ✅ USE VALIDATION
    if score > best_score:
        best_score = score
        best_C = C
        best_model = model  # keep the winning model

# Final evaluation on the held-out test set
final_score = best_model.score(X_test, y_test)
14.11.2 Project Management Pitfalls
| Pitfall | Symptom | Prevention |
|---|---|---|
| Scope Creep | Keeps adding features | Lock scope week 2, use change request process |
| No Baseline | Don't know if the model is any good | Implement a baseline in week 1 |
| Single Train/Test | High variance in metrics | Use 5-fold cross-validation |
| Ignoring Class Imbalance | 98% accuracy but useless | Check class distribution day 1 |
| No Documentation | Can't reproduce results | Write as you code, not at the end |
| Last-Minute Demo | Presentation full of errors | Practice presentation 1 week before |
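The "check class distribution day 1" prevention above can be sketched with plain Python. This is a minimal illustration; the label list and the 90% threshold are made-up examples, not values from the chapter:

```python
from collections import Counter

def class_distribution(labels):
    """Return each class's share of the dataset, largest class first."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.most_common()}

# Hypothetical fraud labels: 0 = legitimate, 1 = fraud
labels = [0] * 98 + [1] * 2
dist = class_distribution(labels)
print(dist)  # {0: 0.98, 1: 0.02}

# A "model" that always predicts the majority class already scores 98%
# accuracy here, so accuracy alone says almost nothing.
majority_share = max(dist.values())
if majority_share > 0.9:
    print("⚠️ Severe class imbalance; accuracy alone is misleading")
```

Running this on day 1 makes the "98% accuracy but useless" symptom visible before any modeling starts.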
14.11.3 Communication Pitfalls
❌ WRONG:
"Our model is 95% accurate!" (on an imbalanced dataset, 95% accuracy can mean simply predicting the majority class)
✅ RIGHT:
"Our model achieves 95% accuracy, 85% recall, and 60% precision. This means we catch 85% of frauds, and 60% of the alerts we raise are true positives." (specific, contextualized, honest)
14.12 Project Grading Rubric
14.12.1 Comprehensive Evaluation Rubric
Total: 100 points
1. Problem Definition & Planning (15 points)
| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Problem Statement | Clear, specific, measurable, business-aligned | Clear problem, minor scope issues | Problem vague, weak business case | Problem ill-defined |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| Data Planning | Complete strategy, quality assessment, privacy check | Data plan documented, mostly complete | Minimal planning, data quality not assessed | No data strategy |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| Feasibility & Timeline | Realistic scope, detailed milestone, risk mitigation | Reasonable scope, milestone defined | Over-scoped or unrealistic timeline | No feasibility analysis |
| (0-5 pts) | 5 | 4 | 2 | 0 |
2. Data & Analysis (15 points)
| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Data Quality | Clean, well-documented, proper handling of missing/outliers | Mostly clean, minor issues handled | Some quality issues, incomplete cleaning | Poor data quality |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| EDA & Insights | Comprehensive EDA, clear insights, good visualizations | Good EDA, identified patterns | Basic EDA, minimal insights | Insufficient analysis |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| Feature Engineering | Smart features, validated impact, interpretable | Some feature engineering, reasonable | Minimal feature work | No feature engineering |
| (0-5 pts) | 5 | 4 | 2 | 0 |
3. Model Development (25 points)
| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Approach | Multiple algorithms tested, systematic comparison, justification | 2-3 algorithms, reasonable comparison | Limited algorithm exploration | Single model tried |
| (0-8 pts) | 8 | 6 | 4 | 0 |
| Validation | K-fold CV, proper train/val/test split, no data leakage | Proper splitting, no obvious leakage | Basic validation, possible issues | Train/test contamination |
| (0-9 pts) | 9 | 7 | 5 | 0 |
| Hyperparameter Tuning | Systematic tuning, documented process, prevents overfitting | Good tuning, reasonable results | Limited tuning | Random/no tuning |
| (0-8 pts) | 8 | 6 | 4 | 0 |
4. Evaluation & Results (20 points)
| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Metrics Selection | Appropriate for problem, multiple metrics, justified | Good metrics, mostly justified | Limited metrics, weak justification | Inappropriate metrics |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| Results & Analysis | Clear improvement over baseline, thorough analysis, error discussion | Good results with analysis | Acceptable results, minimal analysis | Poor results, no analysis |
| (0-8 pts) | 8 | 6 | 4 | 0 |
| Reproducibility | Complete reproducibility, clear results, documented seeds | Mostly reproducible, good documentation | Somewhat reproducible, missing details | Not reproducible |
| (0-7 pts) | 7 | 5 | 3 | 0 |
5. Documentation & Code Quality (15 points)
| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Code Quality | Clean, documented, proper structure, follows best practices | Generally clean, documented, reasonable structure | Some documentation, inconsistent style | Poor documentation, messy code |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| Documentation | Complete report, clear writing, all sections thorough | Good documentation, mostly complete | Incomplete documentation, some clarity issues | Minimal documentation |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| Repository | Organized structure, git history clean, includes all artifacts | Good organization, mostly complete | Basic organization, missing some items | Disorganized or incomplete |
| (0-5 pts) | 5 | 4 | 2 | 0 |
6. Presentation (10 points)
| Criteria | Excellent (A) | Good (B) | Fair (C) | Poor (D) |
|---|---|---|---|---|
| Clarity | Clear story, appropriate level, engaging for audience | Generally clear, minor issues | Some clarity issues, organization weak | Confusing or hard to follow |
| (0-5 pts) | 5 | 4 | 2 | 0 |
| Delivery | Confident, paced well, handles questions effectively | Good delivery, minor pacing issues | Nervous or rushed, struggles with questions | Poor presentation skills |
| (0-5 pts) | 5 | 4 | 2 | 0 |
14.12.2 Grading Scale
90-100: A (Excellent)
- Professional-quality project
- Clear contribution
- Production-ready code
- Excellent presentation
80-89: B (Good)
- Solid project with minor issues
- Good approach and results
- Well-documented
- Good presentation
70-79: C (Fair)
- Acceptable project, significant gaps
- Basic approach, adequate results
- Documentation could be better
- Adequate presentation
60-69: D (Poor)
- Project does not fully meet standards
- Issues in approach or execution
- Poor documentation
- Weak presentation
<60: F (Fail)
- Does not meet minimum standards
14.13 Summary
1. Project Planning
- Scoping is the key to success
- Define the problem with SMART criteria
- Plan a realistic timeline with milestones
2. Problem Formulation
- Clear problem statement (context, problem, data, metrics)
- Match metrics to business goals
- Consider constraints (latency, privacy, fairness)
3. Data Strategy
- Comprehensive data collection plan
- Structured EDA with insights
- Proper preprocessing with documentation
4. Baseline & Iteration
- Establish a baseline for context
- Iterate systematically (not randomly)
- Document experiments carefully
5. Evaluation & Validation
- Use multiple metrics (not a single number)
- K-fold cross-validation for robustness
- Create diagnostic plots for understanding
6. Documentation
- Professional project structure
- Complete README and model card
- Reproducible code with seeds
7. Presentation
- Lead with WHY
- Use visualizations effectively
- Tell a coherent story
8. Avoid Common Pitfalls
- Data leakage (the most common!)
- Overfitting
- Tuning on the test set
- Ignoring class imbalance
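The reproducibility point in item 6 ("reproducible code with seeds") usually boils down to fixing every random seed once, up front. A minimal stdlib-only sketch is below; the `make_split` helper and `SEED = 42` are illustrative assumptions, and in a real project you would also seed NumPy and any ML framework you use:

```python
import random

SEED = 42  # illustrative; pick one seed and record it in your report

def make_split(n_samples, test_fraction=0.2, seed=SEED):
    """Deterministic train/test index split: same seed -> same split."""
    rng = random.Random(seed)        # local RNG, avoids touching global state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Same seed produces the identical split on every run
train_a, test_a = make_split(100)
train_b, test_b = make_split(100)
assert train_a == train_b and test_a == test_b
```

Documenting the seed in the README lets a grader rerun the pipeline and obtain byte-identical splits and results.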
Final Checklist: 30 Days Before Submission
Week 1-2 (Code finalization)
- [ ] All code reviewed and cleaned
- [ ] Tests written and passing
- [ ] No hardcoded paths or credentials
- [ ] requirements.txt updated with pinned versions
- [ ] Git history clean (meaningful commits)
Week 2-3 (Documentation)
- [ ] README.md complete and tested
- [ ] Code comments explain WHY, not WHAT
- [ ] Model card written
- [ ] EDA report finalized
- [ ] All results reproducible
Week 3-4 (Presentation)
- [ ] Slides drafted (14-15 slides)
- [ ] Key visualizations created
- [ ] Practice presentation (3x minimum)
- [ ] Get feedback from a mentor or friend
- [ ] Prepare for common Q&A
Final Week (Testing)
- [ ] Run the entire pipeline end-to-end
- [ ] Verify all outputs match the report
- [ ] Check the presentation on the actual equipment
- [ ] Submit early (the day before the deadline)
- [ ] Take a screenshot as proof
Congratulations! You are ready to start your capstone project!
Remember: a small scope finished well beats a large scope left unfinished.
Focus on QUALITY over QUANTITY.
Good luck!