# Data manipulation
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8-darkgrid')
# Machine Learning - Base Models
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
# Ensemble Methods
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
    VotingClassifier
)
# Evaluation Metrics
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    roc_auc_score,
    precision_recall_curve,
    average_precision_score,
    ConfusionMatrixDisplay
)
# Imbalanced Learning (if available, otherwise we'll use class weights)
try:
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    SMOTE_AVAILABLE = True
    print("✓ SMOTE available for handling class imbalance")
except ImportError:
    SMOTE_AVAILABLE = False
    print("⚠ SMOTE not available. We'll use class_weight='balanced' instead")
    print("  Install imbalanced-learn: pip install imbalanced-learn")
print("\n✓ All essential libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

Lab 04 - Ensemble Methods & Advanced Model Evaluation
Building Robust Classifiers for Imbalanced Datasets
After completing this lab, you will be able to:
- Implement ensemble learning methods including Bagging, Boosting, and Stacking
- Handle imbalanced datasets using multiple techniques
- Calculate and interpret comprehensive evaluation metrics (Precision, Recall, F1, ROC-AUC, PR-AUC)
- Analyze bias-variance tradeoffs in ensemble models
- Optimize hyperparameters using GridSearchCV with proper validation
- Create production-ready model evaluation frameworks
- Make data-driven decisions based on business context and cost-benefit analysis
- Duration: 3-4 hours
- Difficulty: Intermediate to Advanced
- Prerequisites: Python, scikit-learn, pandas, understanding of basic ML algorithms
- Dataset: Credit Card Fraud Detection (highly imbalanced real-world scenario)
11 Background and Motivation
11.1 Why Ensemble Methods Matter
Imagine you’re building a fraud detection system for a bank. A single decision tree might make mistakes, but what if you combined the predictions of 100 trees? This is the core idea behind ensemble methods.
Key Benefits:
- Reduced Overfitting: By combining multiple models, we average out individual model errors
- Improved Generalization: Diverse models capture different patterns in the data
- Robustness: Less sensitive to noise and outliers
- State-of-the-Art Performance: Most competition-winning solutions use ensembles
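The "averaging out errors" claim can be demonstrated without any ML library at all. This standard-library sketch treats each weak learner as an unbiased but noisy estimator and shows that averaging 100 of them shrinks the typical error; the `TRUE_VALUE` constant and Gaussian noise model are illustrative assumptions, not part of the lab's dataset.

```python
import random
import statistics

random.seed(42)
TRUE_VALUE = 1.0  # the quantity each noisy "model" tries to estimate (illustrative)

def noisy_model():
    # One weak learner: unbiased estimate with high variance
    return TRUE_VALUE + random.gauss(0, 1.0)

def ensemble(n_models):
    # Averaging independent predictions cancels much of the noise
    return statistics.mean(noisy_model() for _ in range(n_models))

single_errors = [abs(noisy_model() - TRUE_VALUE) for _ in range(2000)]
ensemble_errors = [abs(ensemble(100) - TRUE_VALUE) for _ in range(2000)]

print(f"Mean error, single model:   {statistics.mean(single_errors):.3f}")
print(f"Mean error, 100-model avg.: {statistics.mean(ensemble_errors):.3f}")
```

For independent errors, the standard deviation of the average falls as 1/sqrt(n), which is why the 100-model ensemble's error is roughly a tenth of a single model's.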
11.2 The Imbalanced Dataset Challenge
In fraud detection, only 0.1-0.5% of transactions are fraudulent. This creates severe class imbalance where:
- Accuracy is Misleading: A model predicting “Not Fraud” for everything achieves 99.5% accuracy but is useless
- Standard Metrics Fail: We need specialized metrics like Precision-Recall curves
- Model Bias: Algorithms tend to ignore the minority class
- Business Impact: Missing one fraud costs more than one false alarm
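The "accuracy is misleading" point can be verified in a few lines of plain Python. The toy numbers below (1000 transactions, 0.5% fraud) mirror the fraud rates quoted above and are chosen only for illustration:

```python
# The accuracy paradox: a model that always predicts "Not Fraud" (0)
# on 1000 transactions of which only 5 (0.5%) are fraudulent.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = tp / sum(y_true)  # fraction of frauds actually caught

print(f"Accuracy: {accuracy:.1%}")  # 99.5% — looks great
print(f"Recall:   {recall:.1%}")    # 0.0% — catches no fraud at all
```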
11.3 Real-World Context: Cost-Sensitive Learning
| Scenario | Cost of False Positive | Cost of False Negative |
|---|---|---|
| Fraud Detection | Low (minor inconvenience) | High (financial loss) |
| Medical Diagnosis | Medium (unnecessary tests) | Very High (missed disease) |
| Spam Detection | Medium (legitimate email lost in spam folder) | Low (mild annoyance) |
Understanding these costs is crucial for choosing the right evaluation metric and decision threshold.
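The cost table turns directly into arithmetic: total cost = FP·cost(FP) + FN·cost(FN). The sketch below uses illustrative fraud-detection costs ($1 per false positive, $100 per false negative — the same assumptions used later in the threshold-tuning step) and made-up error counts, to show why the "noisier" model can still be the cheaper one:

```python
# Expected-cost comparison under assumed fraud-detection costs.
COST_FP, COST_FN = 1, 100  # illustrative, not from real data

def total_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total business cost given false-positive and false-negative counts."""
    return fp * cost_fp + fn * cost_fn

# Conservative model: few false alarms, but misses many frauds
conservative = total_cost(fp=10, fn=40)
# Aggressive model: many false alarms, misses few frauds
aggressive = total_cost(fp=300, fn=5)

print(f"Conservative: ${conservative:,}")  # $4,010
print(f"Aggressive:   ${aggressive:,}")    # $800 — cheaper despite 30x the FPs
```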
12 Lab Setup
12.1 Import Required Libraries
12.2 Load and Explore Dataset
For this lab, we’ll simulate a credit card fraud detection dataset with realistic class imbalance.
For production work, download the actual Credit Card Fraud Detection dataset from Kaggle:
https://www.kaggle.com/mlg-ulb/creditcardfraud
The dataset contains 284,807 transactions with only 492 frauds (0.172% fraud rate).
# Simulate realistic fraud detection dataset
np.random.seed(42)
def create_fraud_dataset(n_samples=10000, fraud_ratio=0.02):
    """
    Create a synthetic fraud detection dataset with realistic characteristics

    Parameters:
    -----------
    n_samples : int
        Total number of transactions
    fraud_ratio : float
        Proportion of fraudulent transactions (0.02 = 2%)
    """
    n_fraud = int(n_samples * fraud_ratio)
    n_normal = n_samples - n_fraud
    # Normal transactions (legitimate)
    normal_data = np.random.randn(n_normal, 28) * 0.5
    normal_data[:, 0] += 1.0  # Shift distribution
    # Fraudulent transactions (different pattern)
    fraud_data = np.random.randn(n_fraud, 28) * 1.5
    fraud_data[:, 0] -= 2.0   # Different center
    fraud_data[:, 5] += 3.0   # Strong signal in some features
    fraud_data[:, 10] -= 2.5
    # Transaction amounts
    normal_amounts = np.abs(np.random.gamma(2, 50, n_normal))
    fraud_amounts = np.abs(np.random.gamma(3, 150, n_fraud))
    # Combine features
    X_normal = np.column_stack([normal_data, normal_amounts])
    X_fraud = np.column_stack([fraud_data, fraud_amounts])
    X = np.vstack([X_normal, X_fraud])
    y = np.hstack([np.zeros(n_normal), np.ones(n_fraud)])
    # Shuffle
    shuffle_idx = np.random.permutation(n_samples)
    X = X[shuffle_idx]
    y = y[shuffle_idx]
    # Create DataFrame
    feature_names = [f'V{i}' for i in range(1, 29)] + ['Amount']
    df = pd.DataFrame(X, columns=feature_names)
    df['Class'] = y.astype(int)
    return df
# Create dataset
df = create_fraud_dataset(n_samples=10000, fraud_ratio=0.02)
print("Dataset Shape:", df.shape)
print("\nFirst few rows:")
print(df.head())
print("\nDataset Info:")
print(df.info())

13 Step 1: Understanding Class Imbalance
The first critical step is visualizing and quantifying the class imbalance problem.
# Calculate class distribution
class_counts = df['Class'].value_counts()
class_percentages = df['Class'].value_counts(normalize=True) * 100
print("Class Distribution:")
print("="*50)
print(f"Normal Transactions (0): {class_counts[0]:,} ({class_percentages[0]:.2f}%)")
print(f"Fraudulent Transactions (1): {class_counts[1]:,} ({class_percentages[1]:.2f}%)")
print(f"\nImbalance Ratio: 1:{class_counts[0]/class_counts[1]:.1f}")
print(f"This means for every 1 fraud, there are {class_counts[0]/class_counts[1]:.1f} normal transactions")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar plot
axes[0].bar(['Normal (0)', 'Fraud (1)'], class_counts.values, color=['#2ecc71', '#e74c3c'])
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Class Distribution (Absolute)', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
for i, v in enumerate(class_counts.values):
    axes[0].text(i, v + 50, f'{v:,}', ha='center', fontweight='bold')
# Pie chart
colors = ['#2ecc71', '#e74c3c']
axes[1].pie(class_counts.values, labels=['Normal', 'Fraud'], autopct='%1.2f%%',
colors=colors, startangle=90, textprops={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title('Class Distribution (Percentage)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

With such extreme class imbalance, a naive model can achieve 98% accuracy by simply predicting “Not Fraud” for every transaction. This is called the accuracy paradox and highlights why we need better evaluation metrics.
14 Step 2: Baseline Model - The Accuracy Trap
Let’s demonstrate why accuracy alone is misleading on imbalanced datasets.
# Prepare data
X = df.drop('Class', axis=1)
y = df['Class']
# Split data with stratification (maintains class distribution)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
print("Training set distribution:")
print(y_train.value_counts())
print("\nTest set distribution:")
print(y_test.value_counts())
# Scale features (important for many algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Naive baseline: Logistic Regression without handling imbalance
baseline_model = LogisticRegression(random_state=42, max_iter=1000)
baseline_model.fit(X_train_scaled, y_train)
y_pred_baseline = baseline_model.predict(X_test_scaled)
# Evaluate
accuracy = accuracy_score(y_test, y_pred_baseline)
precision = precision_score(y_test, y_pred_baseline)
recall = recall_score(y_test, y_pred_baseline)
f1 = f1_score(y_test, y_pred_baseline)
print("\n" + "="*60)
print("BASELINE MODEL PERFORMANCE (WITHOUT HANDLING IMBALANCE)")
print("="*60)
print(f"Accuracy: {accuracy:.4f} ✓ Looks great!")
print(f"Precision: {precision:.4f} ✗ But precision is poor")
print(f"Recall: {recall:.4f} ✗ And recall is terrible!")
print(f"F1-Score: {f1:.4f}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_baseline)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Baseline Model - Confusion Matrix\n(Without Handling Imbalance)',
fontsize=14, fontweight='bold')
plt.show()
print("\nConfusion Matrix Interpretation:")
print(f"True Negatives (TN): {cm[0,0]:,} - Correctly identified normal transactions")
print(f"False Positives (FP): {cm[0,1]:,} - Normal flagged as fraud (Type I Error)")
print(f"False Negatives (FN): {cm[1,0]:,} - Fraud missed (Type II Error) ⚠ CRITICAL!")
print(f"True Positives (TP): {cm[1,1]:,} - Correctly identified fraud")
# Calculate the percentage of frauds caught
fraud_catch_rate = (cm[1,1] / (cm[1,0] + cm[1,1])) * 100
print(f"\n⚠ We're only catching {fraud_catch_rate:.1f}% of fraudulent transactions!")

In fraud detection:
- False Positive (FP): Customer annoyed by declined card → ~$1 cost
- False Negative (FN): Fraud not detected → ~$100-$1000+ cost
Missing frauds (low recall) is much worse than false alarms!
15 Step 3: Ensemble Method 1 - Bagging
Bagging (Bootstrap Aggregating) creates multiple models on random subsets of data and averages their predictions.
# Bagging with balanced class weights
bagging_model = BaggingClassifier(
estimator=DecisionTreeClassifier(max_depth=10, class_weight='balanced'),
n_estimators=50,
random_state=42,
n_jobs=-1
)
bagging_model.fit(X_train_scaled, y_train)
y_pred_bagging = bagging_model.predict(X_test_scaled)
y_proba_bagging = bagging_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
print("="*60)
print("BAGGING CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_bagging):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_bagging):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_bagging):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_bagging):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_bagging):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_bagging, target_names=['Normal', 'Fraud']))
# Confusion Matrix
cm_bagging = confusion_matrix(y_test, y_pred_bagging)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_bagging, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Greens', values_format='d')
ax.set_title('Bagging Classifier - Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()
fraud_catch_rate_bagging = (cm_bagging[1,1] / (cm_bagging[1,0] + cm_bagging[1,1])) * 100
print(f"\n✓ Fraud catch rate improved to {fraud_catch_rate_bagging:.1f}%")

16 Step 4: Ensemble Method 2 - Random Forest with Feature Importance
Random Forest extends Bagging by also randomizing feature selection at each split.
# Random Forest with balanced class weights
rf_model = RandomForestClassifier(
n_estimators=100,
max_depth=15,
min_samples_split=10,
min_samples_leaf=4,
class_weight='balanced',
random_state=42,
n_jobs=-1
)
rf_model.fit(X_train_scaled, y_train)
y_pred_rf = rf_model.predict(X_test_scaled)
y_proba_rf = rf_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
print("="*60)
print("RANDOM FOREST CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_rf):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_rf):.4f}")
# Feature Importance Analysis
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))
# Visualize Feature Importance
fig, ax = plt.subplots(figsize=(10, 8))
top_features = feature_importance.head(15)
ax.barh(range(len(top_features)), top_features['Importance'].values, color='#3498db')
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['Feature'].values)
ax.set_xlabel('Importance Score', fontsize=12)
ax.set_title('Random Forest - Top 15 Feature Importances', fontsize=14, fontweight='bold')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

Feature importance helps you:
1. Reduce dimensionality - Remove unimportant features
2. Understand the model - Which features drive fraud detection?
3. Feature engineering - Focus efforts on important features
4. Regulatory compliance - Explain model decisions
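Impurity-based importances (as used above) are not the only option; a model-agnostic alternative is permutation importance: shuffle one feature's column and measure how much the score drops. This standard-library sketch uses a deliberately simple toy "model" (a fixed threshold on x0) rather than the Random Forest above, purely to illustrate the mechanic:

```python
import random

random.seed(0)
# Toy data: the label depends only on x0; x1 is pure noise
data = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
labels = [1 if x0 > 0 else 0 for x0, _ in data]

def accuracy(rows, ys):
    # A fixed "model" that thresholds x0 at 0
    preds = [1 if x0 > 0 else 0 for x0, _ in rows]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

base = accuracy(data, labels)  # 1.0 by construction

# Permutation importance: shuffle one column, measure the score drop
x0_shuffled = [r[0] for r in data]
random.shuffle(x0_shuffled)
permuted_x0 = [(s, r[1]) for s, r in zip(x0_shuffled, data)]

x1_shuffled = [r[1] for r in data]
random.shuffle(x1_shuffled)
permuted_x1 = [(r[0], s) for r, s in zip(data, x1_shuffled)]

print(f"Baseline accuracy:          {base:.3f}")
print(f"After shuffling x0 (signal): {accuracy(permuted_x0, labels):.3f}")  # large drop
print(f"After shuffling x1 (noise):  {accuracy(permuted_x1, labels):.3f}")  # no drop
```

scikit-learn provides this directly as `sklearn.inspection.permutation_importance`, which would operate on the fitted `rf_model` above.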
17 Step 5: Ensemble Method 3 - AdaBoost
AdaBoost (Adaptive Boosting) trains models sequentially, focusing on previously misclassified examples.
# AdaBoost with balanced base estimator
ada_model = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=3, class_weight='balanced'),
n_estimators=50,
learning_rate=0.1,
random_state=42,
algorithm='SAMME'
)
ada_model.fit(X_train_scaled, y_train)
y_pred_ada = ada_model.predict(X_test_scaled)
y_proba_ada = ada_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
print("="*60)
print("ADABOOST CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_ada):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_ada):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_ada):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_ada):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_ada):.4f}")
# Confusion Matrix
cm_ada = confusion_matrix(y_test, y_pred_ada)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_ada, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Oranges', values_format='d')
ax.set_title('AdaBoost Classifier - Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()

18 Step 6: Ensemble Method 4 - Gradient Boosting
Gradient Boosting builds models sequentially, each correcting errors of the previous ensemble.
# Gradient Boosting
gb_model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=5,
min_samples_split=10,
min_samples_leaf=4,
subsample=0.8,
random_state=42
)
gb_model.fit(X_train_scaled, y_train)
y_pred_gb = gb_model.predict(X_test_scaled)
y_proba_gb = gb_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
print("="*60)
print("GRADIENT BOOSTING CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_gb):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_gb):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_gb):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_gb):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_gb):.4f}")
# Feature Importance
gb_importance = pd.DataFrame({
'Feature': X.columns,
'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nTop 10 Most Important Features (Gradient Boosting):")
print(gb_importance.head(10))

19 Step 7: Ensemble Method 5 - Stacking
Stacking uses a meta-learner to combine predictions from multiple base models.
# Define base models
base_models = [
('rf', RandomForestClassifier(n_estimators=50, max_depth=10,
class_weight='balanced', random_state=42)),
('gb', GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)),
('ada', AdaBoostClassifier(n_estimators=50, random_state=42))
]
# Meta-learner
meta_learner = LogisticRegression(class_weight='balanced', random_state=42)
# Stacking Classifier
stacking_model = StackingClassifier(
estimators=base_models,
final_estimator=meta_learner,
cv=5,
n_jobs=-1
)
stacking_model.fit(X_train_scaled, y_train)
y_pred_stack = stacking_model.predict(X_test_scaled)
y_proba_stack = stacking_model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
print("="*60)
print("STACKING CLASSIFIER PERFORMANCE")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_stack):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_stack):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_stack):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_stack):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_stack):.4f}")
# Confusion Matrix
cm_stack = confusion_matrix(y_test, y_pred_stack)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_stack, display_labels=['Normal', 'Fraud'])
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Purples', values_format='d')
ax.set_title('Stacking Classifier - Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()

Base Models (Level 0):
- Random Forest (captures non-linear patterns)
- Gradient Boosting (sequential error correction)
- AdaBoost (focuses on hard examples)
Meta-Learner (Level 1):
- Logistic Regression (learns optimal weighting of base predictions)
This architecture leverages the strengths of different algorithms!
20 Step 8: Comprehensive Metrics Comparison
Let’s create a comprehensive comparison of all models across multiple metrics.
# Compile all predictions
models_results = {
'Baseline (LR)': {
'predictions': y_pred_baseline,
'probabilities': baseline_model.predict_proba(X_test_scaled)[:, 1]
},
'Bagging': {
'predictions': y_pred_bagging,
'probabilities': y_proba_bagging
},
'Random Forest': {
'predictions': y_pred_rf,
'probabilities': y_proba_rf
},
'AdaBoost': {
'predictions': y_pred_ada,
'probabilities': y_proba_ada
},
'Gradient Boosting': {
'predictions': y_pred_gb,
'probabilities': y_proba_gb
},
'Stacking': {
'predictions': y_pred_stack,
'probabilities': y_proba_stack
}
}
# Calculate all metrics
metrics_df = []
for model_name, results in models_results.items():
    y_pred = results['predictions']
    y_proba = results['probabilities']
    metrics_df.append({
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1-Score': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba),
        'PR-AUC': average_precision_score(y_test, y_proba)
    })
metrics_df = pd.DataFrame(metrics_df)
print("="*80)
print("COMPREHENSIVE MODEL COMPARISON")
print("="*80)
print(metrics_df.to_string(index=False))
# Highlight best scores
print("\n" + "="*80)
print("BEST PERFORMING MODELS PER METRIC")
print("="*80)
for metric in ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC', 'PR-AUC']:
    best_model = metrics_df.loc[metrics_df[metric].idxmax(), 'Model']
    best_score = metrics_df[metric].max()
    print(f"{metric:12s}: {best_model:20s} ({best_score:.4f})")
# Visualize metrics comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC', 'PR-AUC']
for idx, metric in enumerate(metrics_to_plot):
    ax = axes[idx // 3, idx % 3]
    colors = ['#e74c3c' if m == 'Baseline (LR)' else '#3498db'
              for m in metrics_df['Model']]
    bars = ax.bar(range(len(metrics_df)), metrics_df[metric], color=colors)
    ax.set_xticks(range(len(metrics_df)))
    ax.set_xticklabels(metrics_df['Model'], rotation=45, ha='right')
    ax.set_ylabel(metric, fontsize=11)
    ax.set_title(f'{metric} Comparison', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    ax.set_ylim([0, 1.0])
    # Add value labels
    for i, bar in enumerate(bars):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}', ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.show()

21 Step 9: ROC Curve Analysis
ROC (Receiver Operating Characteristic) curves visualize the tradeoff between True Positive Rate and False Positive Rate.
# Plot ROC curves for all models
fig, ax = plt.subplots(figsize=(12, 8))
colors = ['#e74c3c', '#2ecc71', '#3498db', '#f39c12', '#9b59b6', '#1abc9c']
for (model_name, results), color in zip(models_results.items(), colors):
    y_proba = results['probabilities']
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    ax.plot(fpr, tpr, color=color, lw=2,
            label=f'{model_name} (AUC = {auc:.3f})')
# Random classifier line
ax.plot([0, 1], [0, 1], 'k--', lw=2, label='Random Classifier (AUC = 0.500)')
ax.set_xlabel('False Positive Rate', fontsize=12, fontweight='bold')
ax.set_ylabel('True Positive Rate (Recall)', fontsize=12, fontweight='bold')
ax.set_title('ROC Curves - Model Comparison', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=10)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("ROC-AUC Interpretation:")
print("="*60)
print("AUC = 1.0 : Perfect classifier")
print("AUC = 0.9-1 : Excellent")
print("AUC = 0.8-0.9 : Good")
print("AUC = 0.7-0.8 : Fair")
print("AUC = 0.5 : Random guessing")
print("AUC < 0.5 : Worse than random (predictions inverted)")

ROC-AUC can be misleading on highly imbalanced datasets because:
- It gives equal weight to both classes
- A model can have high AUC but still miss most frauds
- Precision-Recall curves are more informative for imbalanced problems
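This failure mode is easy to reproduce by hand. Below, ROC-AUC is computed from first principles (the probability that a random positive outscores a random negative, i.e. the Mann-Whitney formulation); the score values are invented purely to show a near-0.98 AUC coexisting with 5% precision:

```python
# High ROC-AUC, terrible precision: a hand-built example with rare positives.
def roc_auc(scores_pos, scores_neg):
    """ROC-AUC as P(positive outranks negative), with ties counted as half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# 5 frauds score fairly high; 995 normals score low, except a small
# high-scoring cluster that overlaps the fraud range
scores_pos = [0.9, 0.8, 0.8, 0.7, 0.6]
scores_neg = [0.1] * 900 + [0.65] * 95

auc = roc_auc(scores_pos, scores_neg)
flagged_pos = sum(s >= 0.6 for s in scores_pos)  # true positives at threshold 0.6
flagged_neg = sum(s >= 0.6 for s in scores_neg)  # false positives
precision = flagged_pos / (flagged_pos + flagged_neg)

print(f"ROC-AUC:   {auc:.3f}")       # ~0.98: excellent ranking ability
print(f"Precision: {precision:.3f}")  # 0.05: 95% of alerts are false alarms
```

The 95 high-scoring normals barely dent the AUC (they are outnumbered by 900 low scorers) yet swamp the 5 true frauds at any usable threshold, which is exactly what the precision-recall curve exposes.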
22 Step 10: Precision-Recall Curve Analysis
For imbalanced datasets, Precision-Recall curves are more informative than ROC curves.
# Plot Precision-Recall curves for all models
fig, ax = plt.subplots(figsize=(12, 8))
colors = ['#e74c3c', '#2ecc71', '#3498db', '#f39c12', '#9b59b6', '#1abc9c']
for (model_name, results), color in zip(models_results.items(), colors):
    y_proba = results['probabilities']
    precision, recall, _ = precision_recall_curve(y_test, y_proba)
    pr_auc = average_precision_score(y_test, y_proba)
    ax.plot(recall, precision, color=color, lw=2,
            label=f'{model_name} (AP = {pr_auc:.3f})')
# Baseline (no-skill) line for imbalanced dataset
no_skill = sum(y_test) / len(y_test)
ax.plot([0, 1], [no_skill, no_skill], 'k--', lw=2,
label=f'No Skill (AP = {no_skill:.3f})')
ax.set_xlabel('Recall (True Positive Rate)', fontsize=12, fontweight='bold')
ax.set_ylabel('Precision', fontsize=12, fontweight='bold')
ax.set_title('Precision-Recall Curves - Model Comparison', fontsize=14, fontweight='bold')
ax.legend(loc='lower left', fontsize=10)
ax.grid(alpha=0.3)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.tight_layout()
plt.show()
print("Precision-Recall Tradeoff:")
print("="*60)
print("High Precision, Low Recall : Conservative (few false alarms, miss frauds)")
print("Low Precision, High Recall : Aggressive (catch frauds, many false alarms)")
print("Optimal Balance : Depends on business cost of FP vs FN")

Use ROC-AUC when:
- Classes are balanced
- Both FP and FN have similar costs
- You care about overall ranking ability
Use PR-AUC when:
- Classes are imbalanced (like fraud detection)
- Positive class (fraud) is more important
- You care about precision-recall tradeoff
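The "No Skill" baseline drawn in the PR plot above has a simple justification: random flagging yields precision equal to the positive prevalence, regardless of how many transactions are flagged. A quick check, using the 2% fraud rate of the synthetic dataset:

```python
# Why the no-skill PR baseline equals the positive prevalence.
n_total, n_fraud = 10_000, 200  # 2% fraud, matching the synthetic dataset
prevalence = n_fraud / n_total

# Flag a random 10% of transactions: the expected fraud share among them
# is just the overall fraud rate
flagged = 1_000
expected_tp = flagged * prevalence
expected_precision = expected_tp / flagged

print(f"Prevalence (no-skill precision):       {prevalence:.3f}")
print(f"Expected precision of random flagging: {expected_precision:.3f}")
```

This is why `no_skill = sum(y_test) / len(y_test)` in the plotting code is a horizontal line at the fraud rate: any model whose PR curve hugs that line is adding nothing.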
23 Step 11: Handling Class Imbalance with SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples of the minority class.
if SMOTE_AVAILABLE:
    # Apply SMOTE to training data
    smote = SMOTE(random_state=42)
    X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
    print("Class distribution before SMOTE:")
    print(pd.Series(y_train).value_counts())
    print("\nClass distribution after SMOTE:")
    print(pd.Series(y_train_smote).value_counts())
    # Train Random Forest on balanced data
    rf_smote = RandomForestClassifier(
        n_estimators=100,
        max_depth=15,
        random_state=42,
        n_jobs=-1
    )
    rf_smote.fit(X_train_smote, y_train_smote)
    y_pred_smote = rf_smote.predict(X_test_scaled)
    y_proba_smote = rf_smote.predict_proba(X_test_scaled)[:, 1]
    print("\n" + "="*60)
    print("RANDOM FOREST WITH SMOTE PERFORMANCE")
    print("="*60)
    print(f"Accuracy: {accuracy_score(y_test, y_pred_smote):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred_smote):.4f}")
    print(f"Recall: {recall_score(y_test, y_pred_smote):.4f}")
    print(f"F1-Score: {f1_score(y_test, y_pred_smote):.4f}")
    print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_smote):.4f}")
    # Compare with RF without SMOTE
    print("\nComparison with RF using class_weight='balanced':")
    print(f"{'Metric':<12} {'With SMOTE':>12} {'class_weight':>12} {'Difference':>12}")
    print("-" * 50)
    metrics_compare = [
        ('Precision', precision_score(y_test, y_pred_smote), precision_score(y_test, y_pred_rf)),
        ('Recall', recall_score(y_test, y_pred_smote), recall_score(y_test, y_pred_rf)),
        ('F1-Score', f1_score(y_test, y_pred_smote), f1_score(y_test, y_pred_rf)),
    ]
    for metric_name, smote_val, weight_val in metrics_compare:
        diff = smote_val - weight_val
        print(f"{metric_name:<12} {smote_val:>12.4f} {weight_val:>12.4f} {diff:>+12.4f}")
else:
    print("SMOTE not available. Using class_weight='balanced' is a simpler alternative.")
    print("\nAlternatives to SMOTE:")
    print("1. class_weight='balanced' - Weight samples inversely to class frequency")
    print("2. Random undersampling - Reduce majority class samples")
    print("3. Threshold tuning - Adjust decision threshold (covered in next step)")

24 Step 12: Threshold Tuning for Optimal Business Impact
The default classification threshold is 0.5, but we can optimize this based on business costs.
# Calculate metrics across different thresholds
thresholds = np.arange(0.1, 0.9, 0.05)
threshold_metrics = []
# Use Gradient Boosting model for this analysis
for threshold in thresholds:
    y_pred_threshold = (y_proba_gb >= threshold).astype(int)
    cm = confusion_matrix(y_test, y_pred_threshold)
    tn, fp, fn, tp = cm.ravel()
    # Calculate metrics
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    # Business cost (assuming FN costs $100, FP costs $1)
    cost_fp = 1
    cost_fn = 100
    total_cost = (fp * cost_fp) + (fn * cost_fn)
    threshold_metrics.append({
        'Threshold': threshold,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'FP': fp,
        'FN': fn,
        'Total_Cost': total_cost
    })
threshold_df = pd.DataFrame(threshold_metrics)
# Find optimal threshold (minimize cost)
optimal_idx = threshold_df['Total_Cost'].idxmin()
optimal_threshold = threshold_df.loc[optimal_idx, 'Threshold']
optimal_cost = threshold_df.loc[optimal_idx, 'Total_Cost']
print("="*60)
print("THRESHOLD OPTIMIZATION ANALYSIS")
print("="*60)
print(f"Cost Assumptions: FP = $1, FN = $100")
print(f"\nOptimal Threshold: {optimal_threshold:.2f}")
print(f"Total Cost at Optimal: ${optimal_cost:,.0f}")
print(f"\nMetrics at Optimal Threshold:")
print(threshold_df.loc[optimal_idx][['Precision', 'Recall', 'F1-Score', 'FP', 'FN']])
# Visualize threshold impact
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Precision and Recall vs Threshold
axes[0, 0].plot(threshold_df['Threshold'], threshold_df['Precision'],
'b-', label='Precision', linewidth=2)
axes[0, 0].plot(threshold_df['Threshold'], threshold_df['Recall'],
'r-', label='Recall', linewidth=2)
axes[0, 0].plot(threshold_df['Threshold'], threshold_df['F1-Score'],
'g-', label='F1-Score', linewidth=2)
axes[0, 0].axvline(optimal_threshold, color='orange', linestyle='--',
label=f'Optimal ({optimal_threshold:.2f})', linewidth=2)
axes[0, 0].set_xlabel('Threshold', fontweight='bold')
axes[0, 0].set_ylabel('Score', fontweight='bold')
axes[0, 0].set_title('Metrics vs Threshold', fontweight='bold', fontsize=12)
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)
# False Positives and False Negatives vs Threshold
axes[0, 1].plot(threshold_df['Threshold'], threshold_df['FP'],
'b-', label='False Positives', linewidth=2)
axes[0, 1].plot(threshold_df['Threshold'], threshold_df['FN'],
'r-', label='False Negatives', linewidth=2)
axes[0, 1].axvline(optimal_threshold, color='orange', linestyle='--',
label=f'Optimal ({optimal_threshold:.2f})', linewidth=2)
axes[0, 1].set_xlabel('Threshold', fontweight='bold')
axes[0, 1].set_ylabel('Count', fontweight='bold')
axes[0, 1].set_title('Errors vs Threshold', fontweight='bold', fontsize=12)
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)
# Total Cost vs Threshold
axes[1, 0].plot(threshold_df['Threshold'], threshold_df['Total_Cost'],
'purple', linewidth=2)
axes[1, 0].axvline(optimal_threshold, color='orange', linestyle='--',
label=f'Optimal ({optimal_threshold:.2f})', linewidth=2)
axes[1, 0].scatter(optimal_threshold, optimal_cost, color='red', s=100,
zorder=5, label=f'Min Cost = ${optimal_cost:,.0f}')
axes[1, 0].set_xlabel('Threshold', fontweight='bold')
axes[1, 0].set_ylabel('Total Cost ($)', fontweight='bold')
axes[1, 0].set_title('Business Cost vs Threshold', fontweight='bold', fontsize=12)
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)
# Confusion Matrix at Optimal Threshold
y_pred_optimal = (y_proba_gb >= optimal_threshold).astype(int)
cm_optimal = confusion_matrix(y_test, y_pred_optimal)
im = axes[1, 1].imshow(cm_optimal, cmap='RdYlGn_r', aspect='auto')
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_yticks([0, 1])
axes[1, 1].set_xticklabels(['Normal', 'Fraud'])
axes[1, 1].set_yticklabels(['Normal', 'Fraud'])
axes[1, 1].set_xlabel('Predicted', fontweight='bold')
axes[1, 1].set_ylabel('Actual', fontweight='bold')
axes[1, 1].set_title(f'Confusion Matrix (Threshold={optimal_threshold:.2f})',
fontweight='bold', fontsize=12)
# Add text annotations
for i in range(2):
    for j in range(2):
        text = axes[1, 1].text(j, i, f'{cm_optimal[i, j]:,}',
                               ha="center", va="center", color="black",
                               fontweight='bold', fontsize=14)
plt.colorbar(im, ax=axes[1, 1])
plt.tight_layout()
plt.show()

By adjusting the threshold from 0.5 to the optimal value:
- We balance precision and recall based on actual business costs
- In fraud detection, lower thresholds catch more frauds (higher recall)
- But this increases false alarms (lower precision)
- The optimal threshold minimizes total business cost
25 Step 13: GridSearchCV for Hyperparameter Tuning
Let’s systematically find the best hyperparameters for our top-performing model.
# Define parameter grid for Random Forest
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [10, 15, 20],
'min_samples_split': [5, 10, 20],
'min_samples_leaf': [2, 4, 8],
'class_weight': ['balanced', {0: 1, 1: 10}, {0: 1, 1: 20}]
}
# Use stratified k-fold for imbalanced data
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV with F1-score as scoring metric (better for imbalanced data)
print("Starting GridSearchCV... This may take a few minutes.")
print(f"Total combinations to try: {np.prod([len(v) for v in param_grid.values()])}")
print("Using 5-fold stratified cross-validation\n")
grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42, n_jobs=-1),
param_grid=param_grid,
cv=cv_strategy,
scoring='f1', # Optimize for F1-score (balance of precision and recall)
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train_scaled, y_train)
print("\n" + "="*60)
print("GRID SEARCH RESULTS")
print("="*60)
print(f"Best F1-Score (CV): {grid_search.best_score_:.4f}")
print(f"\nBest Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
# Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test_scaled)
y_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]
print("\n" + "="*60)
print("BEST MODEL PERFORMANCE ON TEST SET")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_best):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_best):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_best):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_best):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba_best):.4f}")
print(f"PR-AUC: {average_precision_score(y_test, y_proba_best):.4f}")
# Top 10 parameter combinations
cv_results = pd.DataFrame(grid_search.cv_results_)
top_10 = cv_results.nlargest(10, 'mean_test_score')[
['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print("\nTop 10 Parameter Combinations:")
print(top_10.to_string(index=False))

For Imbalanced Data:
- Use StratifiedKFold to maintain class distribution in folds
- Optimize for f1, recall, or a custom scorer (not accuracy)
- Include class_weight in the parameter grid
- Consider RandomizedSearchCV for large search spaces

Performance Tips:
- Use n_jobs=-1 to parallelize
- Start with a coarse grid, then refine
- Monitor progress with verbose=1 or verbose=2
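For large search spaces, RandomizedSearchCV samples a fixed number of combinations instead of trying them all. A small sketch mirroring the grid above; the `X_demo`/`y_demo` synthetic stand-in data is illustrative:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small synthetic stand-in for the fraud data (10% minority class)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

param_dist = {
    'n_estimators': randint(50, 201),        # sampled uniformly from [50, 200]
    'max_depth': randint(5, 21),
    'min_samples_leaf': randint(1, 9),
    'class_weight': ['balanced', {0: 1, 1: 10}],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,                                # only 10 sampled combinations
    scoring='f1',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_demo, y_demo)
print(f"Best F1 (CV): {random_search.best_score_:.3f}")
print(f"Best params: {random_search.best_params_}")
```

The trade-off: `n_iter` caps computation regardless of grid size, at the cost of possibly missing the exact optimum.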
26 Step 14: Cross-Validation with Stratification
Perform robust cross-validation to assess model stability.
# Cross-validation for all models
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models_cv = {
'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=15,
class_weight='balanced', random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=5,
random_state=42),
'Best Model (GridSearch)': best_model
}
# Cross-validate all four metrics in a single pass per model
from sklearn.model_selection import cross_validate

cv_results = []
for model_name, model in models_cv.items():
    scores = cross_validate(model, X_train_scaled, y_train, cv=cv_strategy,
                            scoring=['accuracy', 'precision', 'recall', 'f1'],
                            n_jobs=-1)
    cv_results.append({
        'Model': model_name,
        'Accuracy (mean±std)': f"{scores['test_accuracy'].mean():.4f} ± {scores['test_accuracy'].std():.4f}",
        'Precision (mean±std)': f"{scores['test_precision'].mean():.4f} ± {scores['test_precision'].std():.4f}",
        'Recall (mean±std)': f"{scores['test_recall'].mean():.4f} ± {scores['test_recall'].std():.4f}",
        'F1-Score (mean±std)': f"{scores['test_f1'].mean():.4f} ± {scores['test_f1'].std():.4f}"
    })
cv_df = pd.DataFrame(cv_results)
print("="*100)
print("5-FOLD STRATIFIED CROSS-VALIDATION RESULTS")
print("="*100)
print(cv_df.to_string(index=False))
print("\n" + "="*60)
print("INTERPRETATION")
print("="*60)
print("Lower standard deviation indicates more stable performance across folds.")
print("This suggests the model generalizes well and is not overfitting.")

27 Step 15: Final Model Evaluation Checklist
Let’s create a comprehensive production-ready evaluation report.
# Select final model (best from GridSearch)
final_model = best_model
final_model_name = "Random Forest (Optimized)"
# Generate comprehensive report
print("="*80)
print(f"FINAL MODEL EVALUATION REPORT: {final_model_name}")
print("="*80)
# 1. Model Architecture
print("\n1. MODEL ARCHITECTURE")
print("-" * 40)
print(f"Algorithm: {type(final_model).__name__}")
print(f"Parameters: {final_model.get_params()}")
# 2. Test Set Performance
print("\n2. TEST SET PERFORMANCE")
print("-" * 40)
y_pred_final = final_model.predict(X_test_scaled)
y_proba_final = final_model.predict_proba(X_test_scaled)[:, 1]
print(classification_report(y_test, y_pred_final, target_names=['Normal', 'Fraud']))
# 3. Confusion Matrix
print("\n3. CONFUSION MATRIX ANALYSIS")
print("-" * 40)
cm_final = confusion_matrix(y_test, y_pred_final)
tn, fp, fn, tp = cm_final.ravel()
print(f"True Negatives (TN): {tn:,} - Correctly identified normal transactions")
print(f"False Positives (FP): {fp:,} - Normal flagged as fraud (Type I Error)")
print(f"False Negatives (FN): {fn:,} - Fraud missed (Type II Error)")
print(f"True Positives (TP): {tp:,} - Correctly identified fraud")
fraud_detection_rate = (tp / (tp + fn)) * 100
false_alarm_rate = (fp / (fp + tn)) * 100
print(f"\nFraud Detection Rate: {fraud_detection_rate:.1f}%")
print(f"False Alarm Rate: {false_alarm_rate:.2f}%")
# 4. Advanced Metrics
print("\n4. ADVANCED METRICS")
print("-" * 40)
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_proba_final):.4f}")
print(f"PR-AUC Score: {average_precision_score(y_test, y_proba_final):.4f}")
# 5. Business Impact
print("\n5. BUSINESS IMPACT ANALYSIS")
print("-" * 40)
cost_fp = 1 # Cost of false alarm
cost_fn = 100 # Cost of missed fraud
total_cost = (fp * cost_fp) + (fn * cost_fn)
baseline_cost = len(y_test[y_test == 1]) * cost_fn # Cost if we catch no frauds
savings = baseline_cost - total_cost
savings_pct = (savings / baseline_cost) * 100
print(f"False Positive Cost: ${fp * cost_fp:,}")
print(f"False Negative Cost: ${fn * cost_fn:,}")
print(f"Total Cost: ${total_cost:,}")
print(f"Baseline Cost (no model): ${baseline_cost:,}")
print(f"Cost Savings: ${savings:,} ({savings_pct:.1f}% reduction)")
# 6. Cross-Validation Stability
print("\n6. CROSS-VALIDATION STABILITY")
print("-" * 40)
cv_f1_scores = cross_val_score(final_model, X_train_scaled, y_train,
cv=cv_strategy, scoring='f1', n_jobs=-1)
print(f"F1-Score across 5 folds: {cv_f1_scores}")
print(f"Mean: {cv_f1_scores.mean():.4f}")
print(f"Std: {cv_f1_scores.std():.4f}")
print(f"Min: {cv_f1_scores.min():.4f}")
print(f"Max: {cv_f1_scores.max():.4f}")
# 7. Feature Importance
print("\n7. TOP 10 MOST IMPORTANT FEATURES")
print("-" * 40)
feature_importance_final = pd.DataFrame({
'Feature': X.columns,
'Importance': final_model.feature_importances_
}).sort_values('Importance', ascending=False)
print(feature_importance_final.head(10).to_string(index=False))
# 8. Production Readiness Checklist
print("\n8. PRODUCTION READINESS CHECKLIST")
print("-" * 40)
checklist = [
("✓", "Model trained and validated"),
("✓", "Cross-validation performed"),
("✓", "Hyperparameters optimized"),
("✓", "Class imbalance handled"),
("✓", "Comprehensive metrics calculated"),
("✓", "Business impact quantified"),
("✓", "Feature importance analyzed"),
("✗", "Model serialized and saved (TODO)"),
("✗", "Deployment pipeline created (TODO)"),
("✗", "Monitoring system setup (TODO)")
]
for status, item in checklist:
print(f"{status} {item}")
print("\n" + "="*80)
print("RECOMMENDATION")
print("="*80)
print(f"The {final_model_name} is recommended for production deployment.")
print(f"It achieves {fraud_detection_rate:.1f}% fraud detection rate with only")
print(f"{false_alarm_rate:.2f}% false alarm rate, resulting in {savings_pct:.1f}% cost savings.")
print("\nNext steps:")
print("1. Save model using joblib or pickle")
print("2. Set up real-time prediction API")
print("3. Implement monitoring for model drift")
print("4. Plan for periodic retraining")

28 Step 16: Final Visualizations
Create a comprehensive visualization summary for presentation and reporting.
# Create comprehensive final visualization
fig = plt.figure(figsize=(18, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
# 1. Confusion Matrix
ax1 = fig.add_subplot(gs[0, 0])
disp = ConfusionMatrixDisplay(confusion_matrix=cm_final, display_labels=['Normal', 'Fraud'])
disp.plot(ax=ax1, cmap='Blues', values_format='d')
ax1.set_title('Confusion Matrix', fontweight='bold', fontsize=11)
# 2. ROC Curve
ax2 = fig.add_subplot(gs[0, 1])
fpr_final, tpr_final, _ = roc_curve(y_test, y_proba_final)
auc_final = roc_auc_score(y_test, y_proba_final)
ax2.plot(fpr_final, tpr_final, 'b-', lw=2, label=f'ROC (AUC={auc_final:.3f})')
ax2.plot([0, 1], [0, 1], 'k--', lw=1)
ax2.set_xlabel('False Positive Rate', fontweight='bold')
ax2.set_ylabel('True Positive Rate', fontweight='bold')
ax2.set_title('ROC Curve', fontweight='bold', fontsize=11)
ax2.legend()
ax2.grid(alpha=0.3)
# 3. Precision-Recall Curve
ax3 = fig.add_subplot(gs[0, 2])
precision_final, recall_final, _ = precision_recall_curve(y_test, y_proba_final)
pr_auc_final = average_precision_score(y_test, y_proba_final)
ax3.plot(recall_final, precision_final, 'r-', lw=2, label=f'PR (AP={pr_auc_final:.3f})')
ax3.set_xlabel('Recall', fontweight='bold')
ax3.set_ylabel('Precision', fontweight='bold')
ax3.set_title('Precision-Recall Curve', fontweight='bold', fontsize=11)
ax3.legend()
ax3.grid(alpha=0.3)
# 4. Feature Importance
ax4 = fig.add_subplot(gs[1, :])
top_15_features = feature_importance_final.head(15)
ax4.barh(range(len(top_15_features)), top_15_features['Importance'].values, color='#3498db')
ax4.set_yticks(range(len(top_15_features)))
ax4.set_yticklabels(top_15_features['Feature'].values)
ax4.set_xlabel('Importance Score', fontweight='bold')
ax4.set_title('Top 15 Feature Importances', fontweight='bold', fontsize=11)
ax4.invert_yaxis()
ax4.grid(axis='x', alpha=0.3)
# 5. Model Comparison
ax5 = fig.add_subplot(gs[2, 0])
comparison_metrics = metrics_df[['Model', 'F1-Score']].copy()
colors_comp = ['#e74c3c' if 'Baseline' in m else '#3498db' for m in comparison_metrics['Model']]
ax5.bar(range(len(comparison_metrics)), comparison_metrics['F1-Score'], color=colors_comp)
ax5.set_xticks(range(len(comparison_metrics)))
ax5.set_xticklabels(comparison_metrics['Model'], rotation=45, ha='right', fontsize=8)
ax5.set_ylabel('F1-Score', fontweight='bold')
ax5.set_title('Model Comparison (F1-Score)', fontweight='bold', fontsize=11)
ax5.grid(axis='y', alpha=0.3)
# 6. Threshold Analysis
ax6 = fig.add_subplot(gs[2, 1])
ax6.plot(threshold_df['Threshold'], threshold_df['Precision'], 'b-', label='Precision', lw=2)
ax6.plot(threshold_df['Threshold'], threshold_df['Recall'], 'r-', label='Recall', lw=2)
ax6.plot(threshold_df['Threshold'], threshold_df['F1-Score'], 'g-', label='F1', lw=2)
ax6.axvline(optimal_threshold, color='orange', linestyle='--', lw=2, label='Optimal')
ax6.set_xlabel('Threshold', fontweight='bold')
ax6.set_ylabel('Score', fontweight='bold')
ax6.set_title('Threshold Optimization', fontweight='bold', fontsize=11)
ax6.legend(fontsize=8)
ax6.grid(alpha=0.3)
# 7. Business Impact
ax7 = fig.add_subplot(gs[2, 2])
categories = ['Baseline\n(No Model)', 'Final Model']
costs = [baseline_cost, total_cost]
colors_cost = ['#e74c3c', '#2ecc71']
bars = ax7.bar(categories, costs, color=colors_cost)
ax7.set_ylabel('Total Cost ($)', fontweight='bold')
ax7.set_title('Business Cost Comparison', fontweight='bold', fontsize=11)
ax7.grid(axis='y', alpha=0.3)
for bar in bars:
height = bar.get_height()
ax7.text(bar.get_x() + bar.get_width()/2., height,
f'${height:,.0f}', ha='center', va='bottom', fontweight='bold')
# Add savings annotation
savings_text = f'{savings_pct:.1f}% Cost Reduction'
ax7.text(0.5, max(costs)*0.5, savings_text, ha='center',
fontsize=12, fontweight='bold', color='green',
bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.7))
plt.suptitle(f'Final Model Evaluation Summary: {final_model_name}',
fontsize=16, fontweight='bold', y=0.995)
plt.show()

29 Summary and Key Takeaways
1. Ensemble Methods:
- Bagging reduces variance through bootstrap aggregation
- Boosting reduces bias through sequential error correction
- Stacking combines diverse models for optimal performance
- Random Forest adds feature randomization to Bagging
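The variance/bias split shows up directly when cross-validating a deep tree, a bagged ensemble of deep trees, and boosted stumps. A sketch on synthetic data (the dataset and any scores it prints are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_informative=8, random_state=0)

models = {
    # High-variance base learner: one fully grown tree
    'Deep tree': DecisionTreeClassifier(random_state=0),
    # Bagging averages many deep trees -> reduces variance
    'Bagged trees': BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                      n_estimators=50, random_state=0),
    # Boosting chains many shallow (high-bias) stumps -> reduces bias
    'AdaBoost stumps': AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    print(f"{name:16s} accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Typically both ensembles beat the single tree, and bagging also shows a smaller fold-to-fold standard deviation.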
2. Handling Imbalanced Data:
- Accuracy is misleading on imbalanced datasets
- Use class_weight='balanced' or SMOTE
- Optimize for F1-score, not accuracy
- Stratified k-fold maintains class distribution
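The two standard remedies can be sketched side by side (synthetic data; the SMOTE branch assumes imbalanced-learn is installed, matching the guarded import at the top of the lab):

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_imb, y_imb = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Original class counts:", Counter(y_imb))

# Option 1: reweight errors instead of resampling (no extra dependency)
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_imb, y_imb)

# Option 2: oversample the minority class with SMOTE (needs imbalanced-learn)
try:
    from imblearn.over_sampling import SMOTE
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_imb, y_imb)
    print("After SMOTE:", Counter(y_res))   # classes now balanced
except ImportError:
    print("imbalanced-learn not installed; class_weight is the fallback")
```

Note that resampling belongs inside the cross-validation loop (e.g. via imblearn's Pipeline), never applied to the test set.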
3. Evaluation Metrics:
- Precision: Of predicted frauds, how many are actually frauds?
- Recall: Of actual frauds, how many did we catch?
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Overall ranking ability (can be misleading for imbalanced data)
- PR-AUC: Better for imbalanced datasets
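These definitions are easy to verify by hand on a toy confusion matrix:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels: 6 actual frauds (1) and 4 normal (0); the model flags 5
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # 3, 1, 2, 4
precision = tp / (tp + fp)                  # 4/5  = 0.80: flagged -> truly fraud
recall    = tp / (tp + fn)                  # 4/6 ≈ 0.67: frauds -> caught
f1 = 2 * precision * recall / (precision + recall)          # harmonic mean
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"sklearn f1_score agrees: {f1_score(y_true, y_pred):.2f}")
# prints: precision=0.80, recall=0.67, f1=0.73
```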
4. Production Considerations:
- Threshold tuning based on business costs
- Cross-validation for model stability
- Feature importance for interpretability
- Business impact quantification
Why ensemble methods work: Bias-variance tradeoff
When to use which ensemble:
- Bagging: High variance models (decision trees)
- Boosting: High bias models (weak learners)
- Stacking: Combining diverse algorithms
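A minimal stacking sketch combining two diverse level-0 learners with a logistic-regression meta-learner (synthetic data, illustrative hyperparameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Diverse level-0 learners; the meta-learner is fit on their
# cross-validated predictions to avoid leaking training labels
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=1)),
        ('knn', KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"Stacked test accuracy: {stack.score(X_te, y_te):.3f}")
```

Stacking pays off most when the base learners make different kinds of errors, which is why diverse algorithm families are preferred over near-duplicates.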
Imbalanced data challenges:
- Accuracy paradox
- Class weighting vs. resampling
- Proper evaluation metrics
GridSearchCV best practices:
- Stratified CV for imbalanced data
- Choose appropriate scoring metric
- Balance search space vs. computation time
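When the scoring metric should reflect asymmetric error costs rather than F1, a custom scorer can be built with `make_scorer`; a sketch reusing the lab's illustrative $1 false-alarm / $100 missed-fraud costs on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

def negative_business_cost(y_true, y_pred, cost_fp=1, cost_fn=100):
    """Higher is better, so negate the total dollar cost of errors."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return -(fp * cost_fp + fn * cost_fn)

cost_scorer = make_scorer(negative_business_cost)

X_demo, y_demo = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=7)
model = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=7)
scores = cross_val_score(model, X_demo, y_demo,
                         cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=7),
                         scoring=cost_scorer)
print(f"Mean negative cost per fold: {scores.mean():.1f}")
```

The same `cost_scorer` object can be passed as `scoring=` to GridSearchCV or RandomizedSearchCV, so the search optimizes dollars instead of a proxy metric.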
30 Further Reading and Resources
Research Papers:
- Breiman, L. (2001). “Random Forests.” Machine Learning, 45(1), 5-32.
- Freund, Y., & Schapire, R. E. (1997). “A decision-theoretic generalization of on-line learning and an application to boosting.”
- Chawla, N. V., et al. (2002). “SMOTE: Synthetic Minority Over-sampling Technique.” JAIR, 16, 321-357.
Online Resources:
- scikit-learn Ensemble Methods Guide: https://scikit-learn.org/stable/modules/ensemble.html
- Imbalanced-learn Documentation: https://imbalanced-learn.org/
- ROC vs PR curves: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
Kaggle Datasets:
- Credit Card Fraud Detection: https://www.kaggle.com/mlg-ulb/creditcardfraud
- IEEE-CIS Fraud Detection: https://www.kaggle.com/c/ieee-fraud-detection
Now that you’ve mastered ensemble methods and evaluation:
1. Complete the worksheet.md exercises
2. Experiment with different ensemble combinations
3. Try on other imbalanced datasets
4. Implement custom cost-sensitive metrics
5. Deploy your model as a REST API
Lab Completion: Make sure to save your work and submit all required deliverables according to the rubric!