Lab 04 - Ensemble Methods & Advanced Model Evaluation

Building Robust Classifiers for Imbalanced Datasets

Author: Machine Learning Course

Published: December 15, 2025

Learning Outcomes

After completing this lab, you will be able to:

  1. Implement ensemble learning methods including Bagging, Boosting, and Stacking
  2. Handle imbalanced datasets using multiple techniques
  3. Calculate and interpret comprehensive evaluation metrics (Precision, Recall, F1, ROC-AUC, PR-AUC)
  4. Analyze bias-variance tradeoffs in ensemble models
  5. Optimize hyperparameters using GridSearchCV with proper validation
  6. Create production-ready model evaluation frameworks
  7. Make data-driven decisions based on business context and cost-benefit analysis
Lab Information
  • Duration: 3-4 hours
  • Difficulty: Intermediate to Advanced
  • Prerequisites: Python, scikit-learn, pandas, understanding of basic ML algorithms
  • Dataset: Credit Card Fraud Detection (highly imbalanced real-world scenario)

11 Background and Motivation

11.1 Why Ensemble Methods Matter

Imagine you’re building a fraud detection system for a bank. A single decision tree might make mistakes, but what if you combined the predictions of 100 trees? This is the core idea behind ensemble methods.

Key Benefits:

  • Reduced Overfitting: By combining multiple models, we average out individual model errors
  • Improved Generalization: Diverse models capture different patterns in the data
  • Robustness: Less sensitive to noise and outliers
  • State-of-the-Art Performance: Most competition-winning solutions use ensembles
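
The variance-reduction intuition can be checked with a minimal simulation (the error rates are hypothetical): if 100 classifiers are each right only 60% of the time and their mistakes are independent, a strict majority vote is right far more often, because independent errors cancel out. Real ensemble members are correlated, so gains in practice are smaller but still substantial.

```python
# Sketch: why averaging many weak learners helps (hypothetical error rates).
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials, p_correct = 100, 10_000, 0.60

# votes[i, j] = 1 if model j is correct on trial i (independent by construction)
votes = rng.random((n_trials, n_models)) < p_correct
majority_correct = votes.sum(axis=1) > n_models / 2  # strict majority

single_accuracy = p_correct
ensemble_accuracy = majority_correct.mean()
print(f"Single model: {single_accuracy:.2f}, majority vote: {ensemble_accuracy:.2f}")
```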

11.2 The Imbalanced Dataset Challenge

In fraud detection, only 0.1-0.5% of transactions are fraudulent. This creates severe class imbalance where:

  • Accuracy is Misleading: A model predicting “Not Fraud” for everything achieves 99.5% accuracy but is useless
  • Standard Metrics Fail: We need specialized metrics like Precision-Recall curves
  • Model Bias: Algorithms tend to ignore the minority class
  • Business Impact: Missing one fraud costs more than one false alarm

11.3 Real-World Context: Cost-Sensitive Learning

Scenario            Cost of False Positive           Cost of False Negative
------------------  -------------------------------  ---------------------------
Fraud Detection     Low (minor inconvenience)        High (financial loss)
Medical Diagnosis   Medium (unnecessary tests)       Very High (missed disease)
Spam Detection      Low (user checks spam folder)    Low (mild annoyance)

Understanding these costs is crucial for choosing the right evaluation metric and decision threshold.
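
To make this concrete, here is a toy expected-cost comparison (all counts and unit costs are hypothetical, chosen only to illustrate the asymmetry): a "conservative" model with few false alarms can cost far more overall than an "aggressive" one, once each missed fraud costs 100x a false alarm.

```python
# Toy expected-cost calculation for the table above (hypothetical numbers).
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total misclassification cost given error counts and unit costs."""
    return fp * cost_fp + fn * cost_fn

# Model A: conservative (few false alarms, misses half the frauds)
# Model B: aggressive  (many false alarms, misses almost nothing)
cost_a = total_cost(fp=20, fn=100, cost_fp=1, cost_fn=100)   # 20 + 10,000
cost_b = total_cost(fp=500, fn=10, cost_fp=1, cost_fn=100)   # 500 + 1,000

print(f"Model A cost: ${cost_a:,}   Model B cost: ${cost_b:,}")
```

Model B makes 25 times as many false alarms yet costs roughly one seventh as much, which is why the evaluation metric must reflect the cost structure.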

12 Lab Setup

12.1 Import Required Libraries

# Data manipulation
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-darkgrid')

# Machine Learning - Base Models
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Ensemble Methods
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
    VotingClassifier
)

# Evaluation Metrics
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    roc_auc_score,
    precision_recall_curve,
    average_precision_score,
    ConfusionMatrixDisplay
)

# Imbalanced Learning (if available, otherwise we'll use class weights)
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    SMOTE_AVAILABLE = True
    print("✓ SMOTE available for handling class imbalance")
except ImportError:
    SMOTE_AVAILABLE = False
    print("⚠ SMOTE not available. We'll use class_weight='balanced' instead")
    print("  Install imbalanced-learn: pip install imbalanced-learn")

print("\n✓ All essential libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

12.2 Load and Explore Dataset

For this lab, we’ll simulate a credit card fraud detection dataset with realistic class imbalance.

Using Real Data

For production work, download the actual Credit Card Fraud Detection dataset from Kaggle:

https://www.kaggle.com/mlg-ulb/creditcardfraud

The dataset contains 284,807 transactions with only 492 frauds (0.172% fraud rate).

# Simulate realistic fraud detection dataset
np.random.seed(42)

def create_fraud_dataset(n_samples=10000, fraud_ratio=0.02):
    """
    Create a synthetic fraud detection dataset with realistic characteristics

    Parameters:
    -----------
    n_samples : int
        Total number of transactions
    fraud_ratio : float
        Proportion of fraudulent transactions (0.02 = 2%)
    """
    n_fraud = int(n_samples * fraud_ratio)
    n_normal = n_samples - n_fraud

    # Normal transactions (legitimate)
    normal_data = np.random.randn(n_normal, 28) * 0.5
    normal_data[:, 0] += 1.0  # Shift distribution

    # Fraudulent transactions (different pattern)
    fraud_data = np.random.randn(n_fraud, 28) * 1.5
    fraud_data[:, 0] -= 2.0  # Different center
    fraud_data[:, 5] += 3.0  # Strong signal in some features
    fraud_data[:, 10] -= 2.5

    # Transaction amounts
    normal_amounts = np.abs(np.random.gamma(2, 50, n_normal))
    fraud_amounts = np.abs(np.random.gamma(3, 150, n_fraud))

    # Combine features
    X_normal = np.column_stack([normal_data, normal_amounts])
    X_fraud = np.column_stack([fraud_data, fraud_amounts])

    X = np.vstack([X_normal, X_fraud])
    y = np.hstack([np.zeros(n_normal), np.ones(n_fraud)])

    # Shuffle
    shuffle_idx = np.random.permutation(n_samples)
    X = X[shuffle_idx]
    y = y[shuffle_idx]

    # Create DataFrame
    feature_names = [f'V{i}' for i in range(1, 29)] + ['Amount']
    df = pd.DataFrame(X, columns=feature_names)
    df['Class'] = y.astype(int)

    return df

# Create dataset
df = create_fraud_dataset(n_samples=10000, fraud_ratio=0.02)

print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())

13 Step 1: Understanding Class Imbalance

The first critical step is visualizing and quantifying the class imbalance problem.

# Calculate class distribution
class_counts = df['Class'].value_counts()
class_percentages = df['Class'].value_counts(normalize=True) * 100

print("Class Distribution:")
print("="*50)
print(f"Normal Transactions (0): {class_counts[0]:,} ({class_percentages[0]:.2f}%)")
print(f"Fraudulent Transactions (1): {class_counts[1]:,} ({class_percentages[1]:.2f}%)")
print(f"\nImbalance Ratio: 1:{class_counts[0]/class_counts[1]:.1f}")
print(f"This means for every 1 fraud, there are {class_counts[0]/class_counts[1]:.1f} normal transactions")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
axes[0].bar(['Normal (0)', 'Fraud (1)'], class_counts.values, color=['#2ecc71', '#e74c3c'])
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Class Distribution (Absolute)', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(class_counts.values):
    axes[0].text(i, v + 50, f'{v:,}', ha='center', fontweight='bold')

# Pie chart
colors = ['#2ecc71', '#e74c3c']
axes[1].pie(class_counts.values, labels=['Normal', 'Fraud'], autopct='%1.2f%%',
            colors=colors, startangle=90, textprops={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
Why This Matters

With such extreme class imbalance, a naive model can achieve 98% accuracy by simply predicting “Not Fraud” for every transaction. This is called the accuracy paradox and highlights why we need better evaluation metrics.
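
The accuracy paradox is easy to reproduce with scikit-learn's DummyClassifier, which here always predicts the majority class. On a synthetic 98/2 split it scores high accuracy while catching zero frauds:

```python
# The accuracy paradox in miniature: a model that learns nothing still
# scores ~98% accuracy on a 98/2 imbalanced dataset, with zero recall.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
X = rng.standard_normal((10_000, 5))
y = (rng.random(10_000) < 0.02).astype(int)  # ~2% positives ("fraud")

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred, zero_division=0)
print(f"Accuracy: {acc:.3f}")  # high, despite learning nothing
print(f"Recall:   {rec:.3f}")  # 0.0: every single fraud is missed
```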

14 Step 2: Baseline Model - The Accuracy Trap

Let’s demonstrate why accuracy alone is misleading on imbalanced datasets.

# Prepare data
X = df.drop('Class', axis=1)
y = df['Class']

# Split data with stratification (maintains class distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training set distribution:")
print(y_train.value_counts())
print("\nTest set distribution:")
print(y_test.value_counts())

# Scale features (important for many algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Naive baseline: Logistic Regression without handling imbalance
baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train_scaled, y_train)
y_pred_baseline = baseline_model.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred_baseline)
precision = precision_score(y_test, y_pred_baseline)
recall = recall_score(y_test, y_pred_baseline)
f1 = f1_score(y_test, y_pred_baseline)

print("\n" + "="*60)
print("BASELINE MODEL PERFORMANCE (WITHOUT HANDLING IMBALANCE)")
print("="*60)
print(f"Accuracy:  {accuracy:.4f}  ✓ Looks great!")
print(f"Precision: {precision:.4f}  ✗ But precision is poor")
print(f"Recall:    {recall:.4f}  ✗ And recall is terrible!")
print(f"F1-Score:  {f1:.4f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_baseline)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Baseline Model - Confusion Matrix\n(Without Handling Imbalance)',
             fontsize=14, fontweight='bold')
plt.show()

print("\nConfusion Matrix Interpretation:")
print(f"True Negatives (TN):  {cm[0,0]:,} - Correctly identified normal transactions")
print(f"False Positives (FP): {cm[0,1]:,} - Normal flagged as fraud (Type I Error)")
print(f"False Negatives (FN): {cm[1,0]:,} - Fraud missed (Type II Error) ⚠ CRITICAL!")
print(f"True Positives (TP):  {cm[1,1]:,} - Correctly identified fraud")

# Calculate the percentage of frauds caught
fraud_catch_rate = (cm[1,1] / (cm[1,0] + cm[1,1])) * 100
print(f"\n⚠ We're only catching {fraud_catch_rate:.1f}% of fraudulent transactions!")
Key Insight: The Cost of False Negatives

In fraud detection:

  • False Positive (FP): Customer annoyed by declined card → ~$1 cost
  • False Negative (FN): Fraud not detected → ~$100-$1000+ cost

Missing frauds (low recall) is much worse than false alarms!

15 Step 3: Ensemble Method 1 - Bagging

Bagging (Bootstrap Aggregating) creates multiple models on random subsets of data and averages their predictions.

# Bagging with balanced class weights
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10, class_weight='balanced'),
    n_estimators=50,
    random_state=42,
    n_jobs=-1
)

bagging_model.fit(X_train_scaled, y_train)
y_pred_bagging = bagging_model.predict(X_test_scaled)
y_proba_bagging = bagging_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("="*60)
print("BAGGING CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_bagging):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_bagging):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_bagging):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_bagging):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba_bagging):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_bagging, target_names=['Normal', 'Fraud']))

# Confusion Matrix
cm_bagging = confusion_matrix(y_test, y_pred_bagging)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_bagging, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Greens', values_format='d')
ax.set_title('Bagging Classifier - Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()

fraud_catch_rate_bagging = (cm_bagging[1,1] / (cm_bagging[1,0] + cm_bagging[1,1])) * 100
print(f"\n✓ Fraud catch rate improved to {fraud_catch_rate_bagging:.1f}%")

16 Step 4: Ensemble Method 2 - Random Forest with Feature Importance

Random Forest extends Bagging by also randomizing feature selection at each split.

# Random Forest with balanced class weights
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=10,
    min_samples_leaf=4,
    class_weight='balanced',
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train_scaled, y_train)
y_pred_rf = rf_model.predict(X_test_scaled)
y_proba_rf = rf_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("="*60)
print("RANDOM FOREST CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_rf):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba_rf):.4f}")

# Feature Importance Analysis
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

# Visualize Feature Importance
fig, ax = plt.subplots(figsize=(10, 8))
top_features = feature_importance.head(15)
ax.barh(range(len(top_features)), top_features['Importance'].values, color='#3498db')
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['Feature'].values)
ax.set_xlabel('Importance Score', fontsize=12)
ax.set_title('Random Forest - Top 15 Feature Importances', fontsize=14, fontweight='bold')
ax.invert_yaxis()
plt.tight_layout()
plt.show()
Feature Importance in Production

Feature importance helps you:

  1. Reduce dimensionality - Remove unimportant features
  2. Understand the model - Which features drive fraud detection?
  3. Feature engineering - Focus efforts on important features
  4. Regulatory compliance - Explain model decisions
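
One caveat: impurity-based importances (feature_importances_) can be biased toward high-cardinality features and are computed on training data. A common complement is permutation importance on held-out data, which measures how much the score drops when each column is shuffled. A self-contained sketch on synthetic data (feature indices are illustrative):

```python
# Complementary check: permutation importance on a held-out split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=3,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column 10 times and record the mean F1 drop;
# a larger drop means the model relied more on that feature.
result = permutation_importance(rf, X_te, y_te, scoring="f1",
                                n_repeats=10, random_state=42)
for i in result.importances_mean.argsort()[::-1][:3]:
    print(f"feature {i}: mean F1 drop = {result.importances_mean[i]:.4f}")
```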

17 Step 5: Ensemble Method 3 - AdaBoost

AdaBoost (Adaptive Boosting) trains models sequentially, focusing on previously misclassified examples.

# AdaBoost with balanced base estimator
ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3, class_weight='balanced'),
    n_estimators=50,
    learning_rate=0.1,
    random_state=42,
    algorithm='SAMME'
)

ada_model.fit(X_train_scaled, y_train)
y_pred_ada = ada_model.predict(X_test_scaled)
y_proba_ada = ada_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("="*60)
print("ADABOOST CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_ada):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_ada):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_ada):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_ada):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba_ada):.4f}")

# Confusion Matrix
cm_ada = confusion_matrix(y_test, y_pred_ada)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_ada, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Oranges', values_format='d')
ax.set_title('AdaBoost Classifier - Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()

18 Step 6: Ensemble Method 4 - Gradient Boosting

Gradient Boosting builds models sequentially, each correcting errors of the previous ensemble.

# Gradient Boosting
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=4,
    subsample=0.8,
    random_state=42
)

gb_model.fit(X_train_scaled, y_train)
y_pred_gb = gb_model.predict(X_test_scaled)
y_proba_gb = gb_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("="*60)
print("GRADIENT BOOSTING CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_gb):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_gb):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_gb):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_gb):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba_gb):.4f}")

# Feature Importance
gb_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 10 Most Important Features (Gradient Boosting):")
print(gb_importance.head(10))

19 Step 7: Ensemble Method 5 - Stacking

Stacking uses a meta-learner to combine predictions from multiple base models.

# Define base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, max_depth=10,
                                   class_weight='balanced', random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)),
    ('ada', AdaBoostClassifier(n_estimators=50, random_state=42))
]

# Meta-learner
meta_learner = LogisticRegression(class_weight='balanced', random_state=42)

# Stacking Classifier
stacking_model = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_learner,
    cv=5,
    n_jobs=-1
)

stacking_model.fit(X_train_scaled, y_train)
y_pred_stack = stacking_model.predict(X_test_scaled)
y_proba_stack = stacking_model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("="*60)
print("STACKING CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_stack):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_stack):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_stack):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_stack):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba_stack):.4f}")

# Confusion Matrix
cm_stack = confusion_matrix(y_test, y_pred_stack)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_stack, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Purples', values_format='d')
ax.set_title('Stacking Classifier - Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()
Stacking Architecture

Base Models (Level 0):

  • Random Forest (captures non-linear patterns)
  • Gradient Boosting (sequential error correction)
  • AdaBoost (focuses on hard examples)

Meta-Learner (Level 1):

  • Logistic Regression (learns optimal weighting of base predictions)

This architecture leverages the strengths of different algorithms!
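
You can also peek inside a fitted StackingClassifier: the meta-learner's coefficients show how much weight each base model's predicted probability receives. A self-contained sketch on synthetic data (for binary problems, scikit-learn keeps one probability column per base estimator, so there is one coefficient per model):

```python
# Inspecting the meta-learner of a fitted StackingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=25, random_state=42)),
                ('dt', DecisionTreeClassifier(max_depth=3, random_state=42))],
    final_estimator=LogisticRegression(),
    cv=3,
).fit(X, y)

# One logistic-regression coefficient per base model's probability column
for (name, _), coef in zip(stack.estimators, stack.final_estimator_.coef_[0]):
    print(f"{name}: meta-weight = {coef:.3f}")
```

A larger positive weight means the meta-learner trusts that base model's probabilities more.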

20 Step 8: Comprehensive Metrics Comparison

Let’s create a comprehensive comparison of all models across multiple metrics.

# Compile all predictions
models_results = {
    'Baseline (LR)': {
        'predictions': y_pred_baseline,
        'probabilities': baseline_model.predict_proba(X_test_scaled)[:, 1]
    },
    'Bagging': {
        'predictions': y_pred_bagging,
        'probabilities': y_proba_bagging
    },
    'Random Forest': {
        'predictions': y_pred_rf,
        'probabilities': y_proba_rf
    },
    'AdaBoost': {
        'predictions': y_pred_ada,
        'probabilities': y_proba_ada
    },
    'Gradient Boosting': {
        'predictions': y_pred_gb,
        'probabilities': y_proba_gb
    },
    'Stacking': {
        'predictions': y_pred_stack,
        'probabilities': y_proba_stack
    }
}

# Calculate all metrics
metrics_df = []
for model_name, results in models_results.items():
    y_pred = results['predictions']
    y_proba = results['probabilities']

    metrics_df.append({
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba),
        'PR-AUC': average_precision_score(y_test, y_proba)
    })

metrics_df = pd.DataFrame(metrics_df)
print("="*80)
print("COMPREHENSIVE MODEL COMPARISON")
print("="*80)
print(metrics_df.to_string(index=False))

# Highlight best scores
print("\n" + "="*80)
print("BEST PERFORMING MODELS PER METRIC")
print("="*80)
for metric in ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC', 'PR-AUC']:
    best_model = metrics_df.loc[metrics_df[metric].idxmax(), 'Model']
    best_score = metrics_df[metric].max()
    print(f"{metric:12s}: {best_model:20s} ({best_score:.4f})")
# Visualize metrics comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC', 'PR-AUC']

for idx, metric in enumerate(metrics_to_plot):
    ax = axes[idx // 3, idx % 3]

    colors = ['#e74c3c' if m == 'Baseline (LR)' else '#3498db'
              for m in metrics_df['Model']]

    bars = ax.bar(range(len(metrics_df)), metrics_df[metric], color=colors)
    ax.set_xticks(range(len(metrics_df)))
    ax.set_xticklabels(metrics_df['Model'], rotation=45, ha='right')
    ax.set_ylabel(metric, fontsize=11)
    ax.set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    ax.set_ylim([0, 1.0])

    # Add value labels
    for i, bar in enumerate(bars):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

21 Step 9: ROC Curve Analysis

ROC (Receiver Operating Characteristic) curves visualize the tradeoff between True Positive Rate and False Positive Rate.

# Plot ROC curves for all models
fig, ax = plt.subplots(figsize=(12, 8))

colors = ['#e74c3c', '#2ecc71', '#3498db', '#f39c12', '#9b59b6', '#1abc9c']

for (model_name, results), color in zip(models_results.items(), colors):
    y_proba = results['probabilities']
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)

    ax.plot(fpr, tpr, color=color, lw=2,
            label=f'{model_name} (AUC = {auc:.3f})')

# Random classifier line
ax.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier (AUC = 0.500)')

ax.set_xlabel('False Positive Rate', fontsize=12, fontweight='bold')
ax.set_ylabel('True Positive Rate (Recall)', fontsize=12, fontweight='bold')
ax.set_title('ROC Curves - Model Comparison', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("ROC-AUC Interpretation:")
print("="*60)
print("AUC = 1.0   : Perfect classifier")
print("AUC = 0.9-1 : Excellent")
print("AUC = 0.8-0.9 : Good")
print("AUC = 0.7-0.8 : Fair")
print("AUC = 0.5   : Random guessing")
print("AUC < 0.5   : Worse than random (predictions inverted)")
ROC-AUC Limitation for Imbalanced Data

ROC-AUC can be misleading on highly imbalanced datasets because:

  • It gives equal weight to both classes
  • A model can have high AUC but still miss most frauds
  • Precision-Recall curves are more informative for imbalanced problems

22 Step 10: Precision-Recall Curve Analysis

For imbalanced datasets, Precision-Recall curves are more informative than ROC curves.

# Plot Precision-Recall curves for all models
fig, ax = plt.subplots(figsize=(12, 8))

colors = ['#e74c3c', '#2ecc71', '#3498db', '#f39c12', '#9b59b6', '#1abc9c']

for (model_name, results), color in zip(models_results.items(), colors):
    y_proba = results['probabilities']
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    pr_auc = average_precision_score(y_test, y_proba)

    ax.plot(recall, precision, color=color, lw=2,
            label=f'{model_name} (AP = {pr_auc:.3f})')

# Baseline (no-skill) line for imbalanced dataset
no_skill = sum(y_test) / len(y_test)
ax.plot([0, 1], [no_skill, no_skill], 'k--', lw=2,
        label=f'No Skill (AP = {no_skill:.3f})')

ax.set_xlabel('Recall (True Positive Rate)', fontsize=12, fontweight='bold')
ax.set_ylabel('Precision', fontsize=12, fontweight='bold')
ax.set_title('Precision-Recall Curves - Model Comparison', fontsize=14, fontweight='bold')
ax.legend(loc='lower left', fontsize=10)
ax.grid(alpha=0.3)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.tight_layout()
plt.show()

print("Precision-Recall Tradeoff:")
print("="*60)
print("High Precision, Low Recall  : Conservative (few false alarms, miss frauds)")
print("Low Precision, High Recall  : Aggressive (catch frauds, many false alarms)")
print("Optimal Balance             : Depends on business cost of FP vs FN")
Choosing the Right Metric

Use ROC-AUC when:

  • Classes are balanced
  • Both FP and FN have similar costs
  • You care about overall ranking ability

Use PR-AUC when:

  • Classes are imbalanced (like fraud detection)
  • Positive class (fraud) is more important
  • You care about precision-recall tradeoff

23 Step 11: Handling Class Imbalance with SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples of the minority class.

if SMOTE_AVAILABLE:
    # Apply SMOTE to training data
    smote = SMOTE(random_state=42)
    X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

    print("Class distribution before SMOTE:")
    print(pd.Series(y_train).value_counts())
    print("\nClass distribution after SMOTE:")
    print(pd.Series(y_train_smote).value_counts())

    # Train Random Forest on balanced data
    rf_smote = RandomForestClassifier(
        n_estimators=100,
        max_depth=15,
        random_state=42,
        n_jobs=-1
    )

    rf_smote.fit(X_train_smote, y_train_smote)
    y_pred_smote = rf_smote.predict(X_test_scaled)
    y_proba_smote = rf_smote.predict_proba(X_test_scaled)[:, 1]

    print("\n" + "="*60)
    print("RANDOM FOREST WITH SMOTE PERFORMANCE")
    print("="*60)
    print(f"Accuracy:  {accuracy_score(y_test, y_pred_smote):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred_smote):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred_smote):.4f}")
    print(f"F1-Score:  {f1_score(y_test, y_pred_smote):.4f}")
    print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba_smote):.4f}")

    # Compare with RF without SMOTE
    print("\nComparison with RF using class_weight='balanced':")
    print(f"{'Metric':<12} {'With SMOTE':>12} {'class_weight':>12} {'Difference':>12}")
    print("-" * 50)
    metrics_compare = [
        ('Precision', precision_score(y_test, y_pred_smote), precision_score(y_test, y_pred_rf)),
        ('Recall', recall_score(y_test, y_pred_smote), recall_score(y_test, y_pred_rf)),
        ('F1-Score', f1_score(y_test, y_pred_smote), f1_score(y_test, y_pred_rf)),
    ]
    for metric_name, smote_val, weight_val in metrics_compare:
        diff = smote_val - weight_val
        print(f"{metric_name:<12} {smote_val:>12.4f} {weight_val:>12.4f} {diff:>+12.4f}")

else:
    print("SMOTE not available. Using class_weight='balanced' is a simpler alternative.")
    print("\nAlternatives to SMOTE:")
    print("1. class_weight='balanced' - Weight samples inversely to class frequency")
    print("2. Random undersampling - Reduce majority class samples")
    print("3. Threshold tuning - Adjust decision threshold (covered in next step)")

24 Step 12: Threshold Tuning for Optimal Business Impact

The default classification threshold is 0.5, but we can optimize this based on business costs.

# Calculate metrics across different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
threshold_metrics = []

# Use Gradient Boosting model for this analysis
for threshold in thresholds:
    y_pred_threshold = (y_proba_gb >= threshold).astype(int)

    cm = confusion_matrix(y_test, y_pred_threshold)
    tn, fp, fn, tp = cm.ravel()

    # Calculate metrics
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    # Business cost (assuming FN costs $100, FP costs $1)
    cost_fp = 1
    cost_fn = 100
    total_cost = (fp * cost_fp) + (fn * cost_fn)

    threshold_metrics.append({
        'Threshold': threshold,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'FP': fp,
        'FN': fn,
        'Total_Cost': total_cost
    })

threshold_df = pd.DataFrame(threshold_metrics)

# Find optimal threshold (minimize cost)
optimal_idx = threshold_df['Total_Cost'].idxmin()
optimal_threshold = threshold_df.loc[optimal_idx, 'Threshold']
optimal_cost = threshold_df.loc[optimal_idx, 'Total_Cost']

print("="*60)
print("THRESHOLD OPTIMIZATION ANALYSIS")
print("="*60)
print(f"Cost Assumptions: FP = $1, FN = $100")
print(f"\nOptimal Threshold: {optimal_threshold:.2f}")
print(f"Total Cost at Optimal: ${optimal_cost:,.0f}")
print(f"\nMetrics at Optimal Threshold:")
print(threshold_df.loc[optimal_idx][['Precision', 'Recall', 'F1-Score', 'FP', 'FN']])

# Visualize threshold impact
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Precision and Recall vs Threshold
axes[0, 0].plot(threshold_df['Threshold'], threshold_df['Precision'],
                'b-', label='Precision', linewidth=2)
axes[0, 0].plot(threshold_df['Threshold'], threshold_df['Recall'],
                'r-', label='Recall', linewidth=2)
axes[0, 0].plot(threshold_df['Threshold'], threshold_df['F1-Score'],
                'g-', label='F1-Score', linewidth=2)
axes[0, 0].axvline(optimal_threshold, color='orange', linestyle='--',
                   label=f'Optimal ({optimal_threshold:.2f})', linewidth=2)
axes[0, 0].set_xlabel('Threshold', fontweight='bold')
axes[0, 0].set_ylabel('Score', fontweight='bold')
axes[0, 0].set_title('Metrics vs Threshold', fontweight='bold', fontsize=12)
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# False Positives and False Negatives vs Threshold
axes[0, 1].plot(threshold_df['Threshold'], threshold_df['FP'],
                'b-', label='False Positives', linewidth=2)
axes[0, 1].plot(threshold_df['Threshold'], threshold_df['FN'],
                'r-', label='False Negatives', linewidth=2)
axes[0, 1].axvline(optimal_threshold, color='orange', linestyle='--',
                   label=f'Optimal ({optimal_threshold:.2f})', linewidth=2)
axes[0, 1].set_xlabel('Threshold', fontweight='bold')
axes[0, 1].set_ylabel('Count', fontweight='bold')
axes[0, 1].set_title('Errors vs Threshold', fontweight='bold', fontsize=12)
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# Total Cost vs Threshold
axes[1, 0].plot(threshold_df['Threshold'], threshold_df['Total_Cost'],
                'purple', linewidth=2)
axes[1, 0].axvline(optimal_threshold, color='orange', linestyle='--',
                   label=f'Optimal ({optimal_threshold:.2f})', linewidth=2)
axes[1, 0].scatter(optimal_threshold, optimal_cost, color='red', s=100,
                   zorder=5, label=f'Min Cost = ${optimal_cost:,.0f}')
axes[1, 0].set_xlabel('Threshold', fontweight='bold')
axes[1, 0].set_ylabel('Total Cost ($)', fontweight='bold')
axes[1, 0].set_title('Business Cost vs Threshold', fontweight='bold', fontsize=12)
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# Confusion Matrix at Optimal Threshold
y_pred_optimal = (y_proba_gb >= optimal_threshold).astype(int)
cm_optimal = confusion_matrix(y_test, y_pred_optimal)
im = axes[1, 1].imshow(cm_optimal, cmap='RdYlGn_r', aspect='auto')
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_yticks([0, 1])
axes[1, 1].set_xticklabels(['Normal', 'Fraud'])
axes[1, 1].set_yticklabels(['Normal', 'Fraud'])
axes[1, 1].set_xlabel('Predicted', fontweight='bold')
axes[1, 1].set_ylabel('Actual', fontweight='bold')
axes[1, 1].set_title(f'Confusion Matrix (Threshold={optimal_threshold:.2f})',
                     fontweight='bold', fontsize=12)

# Add text annotations
for i in range(2):
    for j in range(2):
        text = axes[1, 1].text(j, i, f'{cm_optimal[i, j]:,}',
                              ha="center", va="center", color="black",
                              fontweight='bold', fontsize=14)

plt.colorbar(im, ax=axes[1, 1])
plt.tight_layout()
plt.show()
Business Impact of Threshold Tuning

By adjusting the threshold from 0.5 to the optimal value:

  • We balance precision and recall based on actual business costs
  • In fraud detection, lower thresholds catch more frauds (higher recall)
  • But this increases false alarms (lower precision)
  • The optimal threshold minimizes total business cost
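The cost-based search above can be distilled into a small standalone function. This is a minimal sketch on synthetic labels and scores; the $1 false-positive / $100 false-negative costs are the same illustrative values used elsewhere in this lab:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative per-error costs (assumptions matching this lab's example values)
COST_FP, COST_FN = 1, 100

def optimal_cost_threshold(y_true, y_proba, thresholds=np.arange(0.05, 0.95, 0.05)):
    """Return the (threshold, cost) pair that minimizes total business cost."""
    best_t, best_cost = None, float('inf')
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        cost = fp * COST_FP + fn * COST_FN
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Tiny synthetic example: scores loosely correlated with the true labels
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
y_proba = np.clip(y_true * 0.6 + rng.normal(0.2, 0.2, size=200), 0, 1)

t, c = optimal_cost_threshold(y_true, y_proba)
print(f"Optimal threshold: {t:.2f}, total cost: ${c}")
```

Because missed frauds cost 100x more than false alarms here, the minimum-cost threshold typically lands well below the default 0.5.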

25 Step 13: GridSearchCV for Hyperparameter Tuning

Let’s systematically find the best hyperparameters for our top-performing model.

# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 15, 20],
    'min_samples_split': [5, 10, 20],
    'min_samples_leaf': [2, 4, 8],
    'class_weight': ['balanced', {0: 1, 1: 10}, {0: 1, 1: 20}]
}

# Use stratified k-fold for imbalanced data
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV with F1-score as scoring metric (better for imbalanced data)
print("Starting GridSearchCV... This may take a few minutes.")
print(f"Total combinations to try: {np.prod([len(v) for v in param_grid.values()])}")
print("Using 5-fold stratified cross-validation\n")

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=cv_strategy,
    scoring='f1',  # Optimize for F1-score (balance of precision and recall)
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_scaled, y_train)

print("\n" + "="*60)
print("GRID SEARCH RESULTS")
print("="*60)
print(f"Best F1-Score (CV): {grid_search.best_score_:.4f}")
print(f"\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

# Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
y_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]

print("\n" + "="*60)
print("BEST MODEL PERFORMANCE ON TEST SET")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_best):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_best):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba_best):.4f}")
print(f"PR-AUC:    {average_precision_score(y_test, y_proba_best):.4f}")

# Top 10 parameter combinations
cv_results = pd.DataFrame(grid_search.cv_results_)
top_10 = cv_results.nlargest(10, 'mean_test_score')[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print("\nTop 10 Parameter Combinations:")
print(top_10.to_string(index=False))
GridSearchCV Best Practices

For Imbalanced Data:

  1. Use StratifiedKFold to maintain class distribution in folds
  2. Optimize for f1, recall, or a custom scorer (not accuracy)
  3. Include class_weight in parameter grid
  4. Consider RandomizedSearchCV for large search spaces

Performance Tips:

  • Use n_jobs=-1 to parallelize
  • Start with coarse grid, then refine
  • Monitor with verbose=1 or verbose=2
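The RandomizedSearchCV alternative from tip 4 can be sketched as follows; the synthetic data, the n_iter budget, and the parameter ranges are illustrative choices, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small imbalanced toy dataset (90% / 10% class split)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 8],
    'class_weight': ['balanced', None],
}

# Sample a fixed budget of 10 random combinations instead of
# exhausting all 3*4*4*2 = 96 grid points
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='f1',
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)

print(f"Best CV F1: {search.best_score_:.3f}")
print(f"Best params: {search.best_params_}")
```

The interface mirrors GridSearchCV, so swapping between the two is a one-line change once the search space grows too large for an exhaustive sweep.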

26 Step 14: Cross-Validation with Stratification

Perform robust cross-validation to assess model stability.

# Cross-validation for all models
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models_cv = {
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=15,
                                            class_weight='balanced', random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=5,
                                                     random_state=42),
    'Best Model (GridSearch)': best_model
}

cv_results = []

for model_name, model in models_cv.items():
    # Cross-validate each metric separately (note: each call refits the model per fold)
    accuracy_scores = cross_val_score(model, X_train_scaled, y_train,
                                       cv=cv_strategy, scoring='accuracy', n_jobs=-1)
    precision_scores = cross_val_score(model, X_train_scaled, y_train,
                                        cv=cv_strategy, scoring='precision', n_jobs=-1)
    recall_scores = cross_val_score(model, X_train_scaled, y_train,
                                     cv=cv_strategy, scoring='recall', n_jobs=-1)
    f1_scores = cross_val_score(model, X_train_scaled, y_train,
                                 cv=cv_strategy, scoring='f1', n_jobs=-1)

    cv_results.append({
        'Model': model_name,
        'Accuracy (mean±std)': f"{accuracy_scores.mean():.4f} ± {accuracy_scores.std():.4f}",
        'Precision (mean±std)': f"{precision_scores.mean():.4f} ± {precision_scores.std():.4f}",
        'Recall (mean±std)': f"{recall_scores.mean():.4f} ± {recall_scores.std():.4f}",
        'F1-Score (mean±std)': f"{f1_scores.mean():.4f} ± {f1_scores.std():.4f}"
    })

cv_df = pd.DataFrame(cv_results)
print("="*100)
print("5-FOLD STRATIFIED CROSS-VALIDATION RESULTS")
print("="*100)
print(cv_df.to_string(index=False))

print("\n" + "="*60)
print("INTERPRETATION")
print("="*60)
print("Lower standard deviation indicates more stable performance across folds.")
print("This suggests the model generalizes well and is not overfitting.")

27 Step 15: Final Model Evaluation Checklist

Let’s create a comprehensive production-ready evaluation report.

# Select final model (best from GridSearch)
final_model = best_model
final_model_name = "Random Forest (Optimized)"

# Generate comprehensive report
print("="*80)
print(f"FINAL MODEL EVALUATION REPORT: {final_model_name}")
print("="*80)

# 1. Model Architecture
print("\n1. MODEL ARCHITECTURE")
print("-" * 40)
print(f"Algorithm: {type(final_model).__name__}")
print(f"Parameters: {final_model.get_params()}")

# 2. Test Set Performance
print("\n2. TEST SET PERFORMANCE")
print("-" * 40)
y_pred_final = final_model.predict(X_test_scaled)
y_proba_final = final_model.predict_proba(X_test_scaled)[:, 1]

print(classification_report(y_test, y_pred_final, target_names=['Normal', 'Fraud']))

# 3. Confusion Matrix
print("\n3. CONFUSION MATRIX ANALYSIS")
print("-" * 40)
cm_final = confusion_matrix(y_test, y_pred_final)
tn, fp, fn, tp = cm_final.ravel()

print(f"True Negatives (TN):  {tn:,} - Correctly identified normal transactions")
print(f"False Positives (FP): {fp:,} - Normal flagged as fraud (Type I Error)")
print(f"False Negatives (FN): {fn:,} - Fraud missed (Type II Error)")
print(f"True Positives (TP):  {tp:,} - Correctly identified fraud")

fraud_detection_rate = (tp / (tp + fn)) * 100
false_alarm_rate = (fp / (fp + tn)) * 100

print(f"\nFraud Detection Rate: {fraud_detection_rate:.1f}%")
print(f"False Alarm Rate: {false_alarm_rate:.2f}%")

# 4. Advanced Metrics
print("\n4. ADVANCED METRICS")
print("-" * 40)
print(f"ROC-AUC Score:  {roc_auc_score(y_test, y_proba_final):.4f}")
print(f"PR-AUC Score:   {average_precision_score(y_test, y_proba_final):.4f}")

# 5. Business Impact
print("\n5. BUSINESS IMPACT ANALYSIS")
print("-" * 40)
cost_fp = 1  # Cost of false alarm
cost_fn = 100  # Cost of missed fraud
total_cost = (fp * cost_fp) + (fn * cost_fn)
baseline_cost = len(y_test[y_test == 1]) * cost_fn  # Cost if we catch no frauds

savings = baseline_cost - total_cost
savings_pct = (savings / baseline_cost) * 100

print(f"False Positive Cost: ${fp * cost_fp:,}")
print(f"False Negative Cost: ${fn * cost_fn:,}")
print(f"Total Cost: ${total_cost:,}")
print(f"Baseline Cost (no model): ${baseline_cost:,}")
print(f"Cost Savings: ${savings:,} ({savings_pct:.1f}% reduction)")

# 6. Cross-Validation Stability
print("\n6. CROSS-VALIDATION STABILITY")
print("-" * 40)
cv_f1_scores = cross_val_score(final_model, X_train_scaled, y_train,
                                cv=cv_strategy, scoring='f1', n_jobs=-1)
print(f"F1-Score across 5 folds: {cv_f1_scores}")
print(f"Mean: {cv_f1_scores.mean():.4f}")
print(f"Std:  {cv_f1_scores.std():.4f}")
print(f"Min:  {cv_f1_scores.min():.4f}")
print(f"Max:  {cv_f1_scores.max():.4f}")

# 7. Feature Importance
print("\n7. TOP 10 MOST IMPORTANT FEATURES")
print("-" * 40)
feature_importance_final = pd.DataFrame({
    'Feature': X.columns,
    'Importance': final_model.feature_importances_
}).sort_values('Importance', ascending=False)

print(feature_importance_final.head(10).to_string(index=False))

# 8. Production Readiness Checklist
print("\n8. PRODUCTION READINESS CHECKLIST")
print("-" * 40)
checklist = [
    ("✓", "Model trained and validated"),
    ("✓", "Cross-validation performed"),
    ("✓", "Hyperparameters optimized"),
    ("✓", "Class imbalance handled"),
    ("✓", "Comprehensive metrics calculated"),
    ("✓", "Business impact quantified"),
    ("✓", "Feature importance analyzed"),
    ("✗", "Model serialized and saved (TODO)"),
    ("✗", "Deployment pipeline created (TODO)"),
    ("✗", "Monitoring system setup (TODO)")
]

for status, item in checklist:
    print(f"{status} {item}")

print("\n" + "="*80)
print("RECOMMENDATION")
print("="*80)
print(f"The {final_model_name} is recommended for production deployment.")
print(f"It achieves {fraud_detection_rate:.1f}% fraud detection rate with only")
print(f"{false_alarm_rate:.2f}% false alarm rate, resulting in {savings_pct:.1f}% cost savings.")
print("\nNext steps:")
print("1. Save model using joblib or pickle")
print("2. Set up real-time prediction API")
print("3. Implement monitoring for model drift")
print("4. Plan for periodic retraining")

28 Step 16: Final Visualizations

Create comprehensive visualization summary for presentation and reporting.

# Create comprehensive final visualization
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# 1. Confusion Matrix
ax1 = fig.add_subplot(gs[0, 0])
disp = ConfusionMatrixDisplay(confusion_matrix=cm_final, display_labels=['Normal', 'Fraud'])
disp.plot(ax=ax1, cmap='Blues', values_format='d')
ax1.set_title('Confusion Matrix', fontweight='bold', fontsize=11)

# 2. ROC Curve
ax2 = fig.add_subplot(gs[0, 1])
fpr_final, tpr_final, _ = roc_curve(y_test, y_proba_final)
auc_final = roc_auc_score(y_test, y_proba_final)
ax2.plot(fpr_final, tpr_final, 'b-', lw=2, label=f'ROC (AUC={auc_final:.3f})')
ax2.plot([0, 1], [0, 1], 'k--', lw=1)
ax2.set_xlabel('False Positive Rate', fontweight='bold')
ax2.set_ylabel('True Positive Rate', fontweight='bold')
ax2.set_title('ROC Curve', fontweight='bold', fontsize=11)
ax2.legend()
ax2.grid(alpha=0.3)

# 3. Precision-Recall Curve
ax3 = fig.add_subplot(gs[0, 2])
precision_final, recall_final, _ = precision_recall_curve(y_test, y_proba_final)
pr_auc_final = average_precision_score(y_test, y_proba_final)
ax3.plot(recall_final, precision_final, 'r-', lw=2, label=f'PR (AP={pr_auc_final:.3f})')
ax3.set_xlabel('Recall', fontweight='bold')
ax3.set_ylabel('Precision', fontweight='bold')
ax3.set_title('Precision-Recall Curve', fontweight='bold', fontsize=11)
ax3.legend()
ax3.grid(alpha=0.3)

# 4. Feature Importance
ax4 = fig.add_subplot(gs[1, :])
top_15_features = feature_importance_final.head(15)
ax4.barh(range(len(top_15_features)), top_15_features['Importance'].values, color='#3498db')
ax4.set_yticks(range(len(top_15_features)))
ax4.set_yticklabels(top_15_features['Feature'].values)
ax4.set_xlabel('Importance Score', fontweight='bold')
ax4.set_title('Top 15 Feature Importances', fontweight='bold', fontsize=11)
ax4.invert_yaxis()
ax4.grid(axis='x', alpha=0.3)

# 5. Model Comparison
ax5 = fig.add_subplot(gs[2, 0])
comparison_metrics = metrics_df[['Model', 'F1-Score']].copy()
colors_comp = ['#e74c3c' if 'Baseline' in m else '#3498db' for m in comparison_metrics['Model']]
ax5.bar(range(len(comparison_metrics)), comparison_metrics['F1-Score'], color=colors_comp)
ax5.set_xticks(range(len(comparison_metrics)))
ax5.set_xticklabels(comparison_metrics['Model'], rotation=45, ha='right', fontsize=8)
ax5.set_ylabel('F1-Score', fontweight='bold')
ax5.set_title('Model Comparison (F1-Score)', fontweight='bold', fontsize=11)
ax5.grid(axis='y', alpha=0.3)

# 6. Threshold Analysis
ax6 = fig.add_subplot(gs[2, 1])
ax6.plot(threshold_df['Threshold'], threshold_df['Precision'], 'b-', label='Precision', lw=2)
ax6.plot(threshold_df['Threshold'], threshold_df['Recall'], 'r-', label='Recall', lw=2)
ax6.plot(threshold_df['Threshold'], threshold_df['F1-Score'], 'g-', label='F1', lw=2)
ax6.axvline(optimal_threshold, color='orange', linestyle='--', lw=2, label='Optimal')
ax6.set_xlabel('Threshold', fontweight='bold')
ax6.set_ylabel('Score', fontweight='bold')
ax6.set_title('Threshold Optimization', fontweight='bold', fontsize=11)
ax6.legend(fontsize=8)
ax6.grid(alpha=0.3)

# 7. Business Impact
ax7 = fig.add_subplot(gs[2, 2])
categories = ['Baseline\n(No Model)', 'Final Model']
costs = [baseline_cost, total_cost]
colors_cost = ['#e74c3c', '#2ecc71']
bars = ax7.bar(categories, costs, color=colors_cost)
ax7.set_ylabel('Total Cost ($)', fontweight='bold')
ax7.set_title('Business Cost Comparison', fontweight='bold', fontsize=11)
ax7.grid(axis='y', alpha=0.3)
for bar in bars:
    height = bar.get_height()
    ax7.text(bar.get_x() + bar.get_width()/2., height,
            f'${height:,.0f}', ha='center', va='bottom', fontweight='bold')

# Add savings annotation
savings_text = f'{savings_pct:.1f}% Cost Reduction'
ax7.text(0.5, max(costs)*0.5, savings_text, ha='center',
        fontsize=12, fontweight='bold', color='green',
        bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))

plt.suptitle(f'Final Model Evaluation Summary: {final_model_name}',
             fontsize=16, fontweight='bold', y=0.995)
plt.show()

29 Summary and Key Takeaways

What You’ve Learned

1. Ensemble Methods:

  • Bagging reduces variance through bootstrap aggregation
  • Boosting reduces bias through sequential error correction
  • Stacking combines diverse models for optimal performance
  • Random Forest adds feature randomization to Bagging

2. Handling Imbalanced Data:

  • Accuracy is misleading on imbalanced datasets
  • Use class_weight='balanced' or SMOTE
  • Optimize for F1-score, not accuracy
  • Stratified k-fold maintains class distribution
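The class_weight='balanced' option mentioned above assigns each class a weight inversely proportional to its frequency. A quick check of the weights scikit-learn computes for a 95/5 toy split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 95 normal vs 5 fraud samples: a heavily imbalanced toy label vector
y = np.array([0] * 95 + [1] * 5)

# 'balanced' weight for class c = n_samples / (n_classes * count(c))
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(f"class 0 weight: {weights[0]:.3f}, class 1 weight: {weights[1]:.1f}")
```

The minority class ends up weighted about 19x heavier (10.0 vs ~0.53), which is how errors on rare fraud cases come to dominate the loss during training.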

3. Evaluation Metrics:

  • Precision: Of predicted frauds, how many are actually frauds?
  • Recall: Of actual frauds, how many did we catch?
  • F1-Score: Harmonic mean of precision and recall
  • ROC-AUC: Overall ranking ability (can be misleading for imbalanced data)
  • PR-AUC: Better for imbalanced datasets
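The precision, recall, and F1 definitions above reduce to simple ratios over confusion-matrix counts. A worked example with hypothetical counts (80 frauds caught, 20 false alarms, 10 frauds missed):

```python
# Hypothetical counts: 80 frauds caught, 20 false alarms, 10 frauds missed
tp, fp, fn = 80, 20, 10

precision = tp / (tp + fp)  # of predicted frauds, fraction actually fraud
recall = tp / (tp + fn)     # of actual frauds, fraction caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.3f}")  # 80/100 = 0.800
print(f"Recall:    {recall:.3f}")     # 80/90  ≈ 0.889
print(f"F1-Score:  {f1:.3f}")
```

Note that the harmonic mean punishes imbalance between the two: F1 lands near the lower of precision and recall rather than their arithmetic average.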

4. Production Considerations:

  • Threshold tuning based on business costs
  • Cross-validation for model stability
  • Feature importance for interpretability
  • Business impact quantification
Critical Concepts for Exams

  1. Why ensemble methods work: the bias-variance tradeoff

  2. When to use which ensemble:

    • Bagging: high-variance models (e.g., deep decision trees)
    • Boosting: high-bias models (weak learners)
    • Stacking: combining diverse algorithms

  3. Imbalanced data challenges:

    • The accuracy paradox
    • Class weighting vs. resampling
    • Choosing proper evaluation metrics

  4. GridSearchCV best practices:

    • Stratified CV for imbalanced data
    • An appropriate scoring metric (not accuracy)
    • Balancing search-space size against computation time

30 Further Reading and Resources

Research Papers:

  • Breiman, L. (2001). “Random Forests.” Machine Learning, 45(1), 5-32.
  • Freund, Y., & Schapire, R. E. (1997). “A decision-theoretic generalization of on-line learning and an application to boosting.” Journal of Computer and System Sciences, 55(1), 119-139.
  • Chawla, N. V., et al. (2002). “SMOTE: Synthetic Minority Over-sampling Technique.” JAIR, 16, 321-357.

Online Resources:

  • scikit-learn Ensemble Methods Guide: https://scikit-learn.org/stable/modules/ensemble.html
  • Imbalanced-learn Documentation: https://imbalanced-learn.org/
  • ROC vs PR curves: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/

Kaggle Datasets:

  • Credit Card Fraud Detection: https://www.kaggle.com/mlg-ulb/creditcardfraud
  • IEEE-CIS Fraud Detection: https://www.kaggle.com/c/ieee-fraud-detection

Next Steps

Now that you’ve mastered ensemble methods and evaluation:

  1. Complete the worksheet.md exercises
  2. Experiment with different ensemble combinations
  3. Apply them to other imbalanced datasets
  4. Implement custom cost-sensitive metrics
  5. Deploy your model as a REST API
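Step 4 above, custom cost-sensitive metrics, can be sketched with make_scorer, which wraps any metric function for use in cross_val_score or GridSearchCV. The cost values and synthetic data here are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score

def negative_business_cost(y_true, y_pred, cost_fp=1, cost_fn=100):
    """Total business cost, negated so that 'greater is better' for sklearn."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return -(fp * cost_fp + fn * cost_fn)

# Wrap the metric so CV/grid-search utilities can maximize it
cost_scorer = make_scorer(negative_business_cost)

# Small imbalanced toy dataset
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=42),
    X, y, cv=3, scoring=cost_scorer, n_jobs=-1,
)
print(f"Mean negated cost per fold: {scores.mean():.1f}")
```

Passing scoring=cost_scorer to GridSearchCV would then select hyperparameters that minimize dollar cost directly, instead of optimizing F1 as a proxy.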

Lab Completion: Make sure to save your work and submit all required deliverables according to the rubric!