# Data manipulation
import pandas as pd
import numpy as np
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10
print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Lab 02 - Titanic Dataset: Data Preprocessing & Feature Engineering
Machine Learning - Politeknik Siber dan Sandi Negara
Lab Overview
Learning Outcomes
After completing this lab, students are expected to be able to:
- [CPMK-2] Perform comprehensive exploratory data analysis (EDA) on real-world data with data-quality problems
- [CPMK-2] Handle missing values using various strategies (deletion, imputation, prediction)
- [CPMK-2] Detect and handle outliers using statistical and visualization methods
- [CPMK-2] Encode categorical variables using one-hot and label encoding techniques
- [CPMK-3] Create new features (feature engineering) from existing data to improve prediction quality
- [CPMK-3] Build a complete, reproducible preprocessing pipeline
Lab Information
| Item | Detail |
|---|---|
| Duration | 3-4 hours |
| Difficulty | Intermediate |
| Prerequisites | Chapter 2: Data Preprocessing & EDA |
| Dataset | Titanic - Machine Learning from Disaster |
| Tools | Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn |
Background Context
8.0.1 Tragedy of RMS Titanic
On April 15, 1912, the RMS Titanic sank after striking an iceberg on her maiden voyage from Southampton to New York City. The tragedy killed 1,502 of the 2,224 passengers and crew, making it one of the worst maritime disasters in history.
8.0.2 Dataset Description
The Titanic dataset contains information about the ship's passengers, including whether they survived. It is very popular for learning machine learning because it offers:
- Real-world data: missing values, outliers, and other common data-quality problems
- Mixed data types: a combination of numerical and categorical features
- Clear objective: predicting survival (binary classification)
- Feature engineering opportunities: many features can be extracted from the existing data
8.0.3 Dataset Features
| Feature | Type | Description |
|---|---|---|
| PassengerId | Integer | Unique passenger ID |
| Survived | Integer | Target variable (0 = did not survive, 1 = survived) |
| Pclass | Integer | Ticket class (1 = First, 2 = Second, 3 = Third) |
| Name | String | Passenger name |
| Sex | String | Gender (male/female) |
| Age | Float | Age in years |
| SibSp | Integer | Number of siblings/spouses aboard |
| Parch | Integer | Number of parents/children aboard |
| Ticket | String | Ticket number |
| Fare | Float | Ticket price |
| Cabin | String | Cabin number |
| Embarked | String | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

Note: the copy bundled with seaborn uses lowercase column names and omits PassengerId, Name, Ticket, and Cabin (it provides a deck column derived from Cabin instead).
9 Part 1: Environment Setup & Data Loading
9.1 Step 1.1: Import Required Libraries
Expected Output:
Libraries imported successfully!
Pandas version: 2.x.x
NumPy version: 1.x.x
9.2 Step 1.2: Load Titanic Dataset
# Load dataset from seaborn
titanic_raw = sns.load_dataset('titanic')
# Create a working copy
df = titanic_raw.copy()
print(f"Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

Expected Output:
Dataset loaded successfully!
Shape: (891, 15)
Memory usage: ~128 KB
9.3 Step 1.3: Initial Data Inspection
# Display first few rows
print("="*80)
print("FIRST 10 ROWS")
print("="*80)
display(df.head(10))
print("\n" + "="*80)
print("DATASET INFO")
print("="*80)
df.info()
print("\n" + "="*80)
print("BASIC STATISTICS")
print("="*80)
display(df.describe(include='all'))

During initial inspection, pay attention to:
- Data types: do they match expectations?
- Missing values: which columns contain null values?
- Value ranges: are there values that make no sense?
- Unique values: how many categories does each categorical variable have?
- Distribution: is the data normally distributed or skewed?
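The checklist above can be run as one-liners. A minimal sketch on a toy frame (synthetic values standing in for the real Titanic rows, not actual data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (synthetic values)
toy = pd.DataFrame({
    'age': [22.0, np.nan, 38.0, 26.0],
    'sex': ['male', 'female', 'female', 'male'],
    'fare': [7.25, 71.28, 8.05, 512.33],
})

print(toy.dtypes)                             # data types
print(toy.isnull().sum())                     # missing values per column
print(toy.select_dtypes('object').nunique())  # category cardinality
print(f"fare skew: {toy['fare'].skew():.2f}") # distribution shape
```

The same four calls applied to `df` answer the first four checklist items directly.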
10 Part 2: Exploratory Data Analysis (EDA)
10.1 Step 2.1: Missing Values Analysis
# Calculate missing values
missing_data = pd.DataFrame({
'Column': df.columns,
'Missing_Count': df.isnull().sum(),
'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2),
'Data_Type': df.dtypes
})
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values(
'Missing_Percentage', ascending=False
)
print("="*80)
print("MISSING VALUES SUMMARY")
print("="*80)
display(missing_data)
# Visualize missing values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Bar plot
missing_data.plot(
kind='barh',
x='Column',
y='Missing_Percentage',
ax=axes[0],
color='salmon',
legend=False
)
axes[0].set_xlabel('Missing Percentage (%)')
axes[0].set_title('Missing Values by Column', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)
# Heatmap
sns.heatmap(
df.isnull(),
yticklabels=False,
cbar=True,
cmap='viridis',
ax=axes[1]
)
axes[1].set_title('Missing Values Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Cabin (77.1% missing): more than three quarters of the values are missing; it was likely recorded mainly for upper-class passengers. (In the seaborn copy this information is in the deck column.)
Age (19.9% missing): about 20% of the values are missing, so a careful imputation strategy is needed.
Embarked (0.22% missing): only two rows; these can be dropped or imputed with the mode.
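A quick sketch of the two lightweight strategies mentioned for Embarked (dropping vs. mode imputation), on a toy Series with the same S/C/Q categories rather than the real column:

```python
import numpy as np
import pandas as pd

s = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'])

# Option 1: drop the few rows with missing values
dropped = s.dropna()

# Option 2: impute with the mode (most frequent value, here 'S')
imputed = s.fillna(s.mode()[0])

print(len(dropped), imputed.isna().sum())
```

With only two missing rows out of 891, both options barely affect the dataset; this lab uses mode imputation in Step 3.1.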
10.2 Step 2.2: Target Variable Analysis
# Survival rate
print("="*80)
print("SURVIVAL STATISTICS")
print("="*80)
print(f"Total passengers: {len(df)}")
print(f"Survived: {df['survived'].sum()} ({df['survived'].mean()*100:.2f}%)")
print(f"Perished: {(1-df['survived']).sum()} ({(1-df['survived'].mean())*100:.2f}%)")
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Count plot
survival_counts = df['survived'].value_counts()
axes[0].bar(['Perished', 'Survived'], survival_counts.values, color=['#e74c3c', '#2ecc71'])
axes[0].set_ylabel('Count')
axes[0].set_title('Survival Count', fontsize=12, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)
# Add value labels
for i, v in enumerate(survival_counts.values):
    axes[0].text(i, v + 10, str(v), ha='center', fontweight='bold')
# Pie chart
axes[1].pie(
survival_counts.values,
labels=['Perished', 'Survived'],
autopct='%1.1f%%',
colors=['#e74c3c', '#2ecc71'],
startangle=90,
explode=(0.05, 0.05)
)
axes[1].set_title('Survival Rate', fontsize=12, fontweight='bold')
# Survival by class
survival_by_class = df.groupby('pclass')['survived'].mean()
axes[2].bar(survival_by_class.index, survival_by_class.values, color='steelblue')
axes[2].set_xlabel('Passenger Class')
axes[2].set_ylabel('Survival Rate')
axes[2].set_title('Survival Rate by Class', fontsize=12, fontweight='bold')
axes[2].set_xticks([1, 2, 3])
axes[2].set_xticklabels(['1st Class', '2nd Class', '3rd Class'])
axes[2].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

10.3 Step 2.3: Univariate Analysis - Numerical Features
# Select numerical columns
numerical_cols = ['age', 'fare', 'sibsp', 'parch']
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for idx, col in enumerate(numerical_cols):
    # Histogram
    axes[0, idx].hist(df[col].dropna(), bins=30, color='skyblue', edgecolor='black')
    axes[0, idx].set_title(f'{col.upper()} - Distribution', fontweight='bold')
    axes[0, idx].set_xlabel(col.capitalize())
    axes[0, idx].set_ylabel('Frequency')
    axes[0, idx].grid(alpha=0.3)
    # Box plot
    axes[1, idx].boxplot(df[col].dropna(), vert=True)
    axes[1, idx].set_title(f'{col.upper()} - Box Plot', fontweight='bold')
    axes[1, idx].set_ylabel(col.capitalize())
    axes[1, idx].grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Statistical summary
print("="*80)
print("NUMERICAL FEATURES STATISTICS")
print("="*80)
display(df[numerical_cols].describe())

From the histograms and box plots, note:
- Age: roughly normal distribution with a peak at younger ages (20-30 years)
- Fare: heavily right-skewed, with extreme outliers (expensive tickets)
- SibSp & Parch: most passengers traveled alone or with a small family
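Since Fare is heavily right-skewed, a log transform is a common remedy when a model later benefits from a more symmetric distribution. A small sketch with made-up fare values (not the real column) showing the skew dropping after `np.log1p`:

```python
import numpy as np
import pandas as pd

# Synthetic fares mimicking the right-skewed shape of the real column
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 71.3, 263.0, 512.3])

print(f"skew before: {fare.skew():.2f}")
fare_log = np.log1p(fare)   # log(1 + x) keeps zero fares finite
print(f"skew after:  {fare_log.skew():.2f}")
```

`log1p` is preferred over a plain `log` here because the Titanic data contains a few zero fares.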
10.4 Step 2.4: Univariate Analysis - Categorical Features
# Select categorical columns
categorical_cols = ['sex', 'pclass', 'embarked', 'who', 'alone']
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()
for idx, col in enumerate(categorical_cols):
    value_counts = df[col].value_counts()
    axes[idx].bar(range(len(value_counts)), value_counts.values, color='coral')
    axes[idx].set_xticks(range(len(value_counts)))
    axes[idx].set_xticklabels(value_counts.index, rotation=45, ha='right')
    axes[idx].set_title(f'{col.upper()} - Distribution', fontweight='bold')
    axes[idx].set_ylabel('Count')
    axes[idx].grid(axis='y', alpha=0.3)
    # Add value labels
    for i, v in enumerate(value_counts.values):
        axes[idx].text(i, v + 5, str(v), ha='center', fontweight='bold')
# Hide extra subplot
axes[-1].axis('off')
plt.tight_layout()
plt.show()

10.5 Step 2.5: Bivariate Analysis - Survival vs Features
# Survival by categorical features
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()
categorical_features = ['sex', 'pclass', 'embarked', 'who', 'alone']
for idx, col in enumerate(categorical_features):
    survival_rate = df.groupby(col)['survived'].agg(['mean', 'count'])
    x_pos = range(len(survival_rate))
    axes[idx].bar(x_pos, survival_rate['mean'], color='teal', alpha=0.7)
    axes[idx].set_xticks(x_pos)
    axes[idx].set_xticklabels(survival_rate.index, rotation=45, ha='right')
    axes[idx].set_ylabel('Survival Rate')
    axes[idx].set_title(f'Survival Rate by {col.upper()}', fontweight='bold')
    axes[idx].set_ylim(0, 1)
    axes[idx].grid(axis='y', alpha=0.3)
    # Add percentage labels
    for i, v in enumerate(survival_rate['mean']):
        axes[idx].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
# Hide extra subplot
axes[-1].axis('off')
plt.tight_layout()
plt.show()

- Gender: women had a much higher survival rate (~74% vs ~19%)
- Class: first class survived at ~63%, third class at only ~24%
- Embarked: passengers from Cherbourg (C) had a higher survival rate
- Alone: passengers traveling with family were more likely to survive
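Per-group survival rates like these can also be computed in one line with `pd.crosstab`. A toy example with synthetic rows (the real column names, but made-up values):

```python
import pandas as pd

toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'survived': [1, 1, 0, 0, 1],
})

# Row-normalised crosstab: survival rate within each sex group
rates = pd.crosstab(toy['sex'], toy['survived'], normalize='index')
print(rates)
```

Applied to `df['sex']` and `df['survived']`, the same call reproduces the ~74% / ~19% split noted above.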
10.6 Step 2.6: Correlation Analysis
# Select numerical features for correlation
correlation_features = ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']
correlation_data = df[correlation_features].copy()
# Encode sex for correlation
correlation_data['sex_male'] = (df['sex'] == 'male').astype(int)
# Calculate correlation matrix
corr_matrix = correlation_data.corr()
# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
corr_matrix,
annot=True,
fmt='.2f',
cmap='coolwarm',
center=0,
square=True,
linewidths=1,
cbar_kws={"shrink": 0.8},
ax=ax
)
ax.set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
# Print strongest correlations with survival
print("="*80)
print("CORRELATIONS WITH SURVIVAL (sorted by absolute value)")
print("="*80)
survival_corr = corr_matrix['survived'].drop('survived').sort_values(key=abs, ascending=False)
for feature, corr in survival_corr.items():
    print(f"{feature:15s}: {corr:+.3f}")

11 Part 3: Missing Values Handling
11.1 Step 3.1: Handle Missing Values in ‘Embarked’
# Check missing embarked
print("="*80)
print("EMBARKED - MISSING VALUES")
print("="*80)
print(f"Missing count: {df['embarked'].isnull().sum()}")
# Show passengers with missing embarked
missing_embarked = df[df['embarked'].isnull()]
print("\nPassengers with missing Embarked:")
display(missing_embarked[['pclass', 'fare', 'embarked', 'sex', 'age']])
# Check mode
print("\nEmbarked distribution:")
print(df['embarked'].value_counts())
print(f"\nMode: {df['embarked'].mode()[0]}")
# Strategy: Impute with mode (Southampton - most common)
# (assignment instead of inplace=True; chained inplace fillna is deprecated in recent pandas)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
print(f"\nAfter imputation - Missing count: {df['embarked'].isnull().sum()}")

11.2 Step 3.2: Handle Missing Values in ‘Age’
# Age missing analysis
print("="*80)
print("AGE - MISSING VALUES ANALYSIS")
print("="*80)
print(f"Missing count: {df['age'].isnull().sum()}")
print(f"Missing percentage: {df['age'].isnull().sum() / len(df) * 100:.2f}%")
# Check age distribution by title
# NOTE: Seaborn's Titanic dataset doesn't include 'name' column
# If you have a dataset with 'name', you can extract title like this:
# df['title'] = df['name'].str.extract(' ([A-Za-z]+)\.', expand=False)
# For now, we'll use sex and pclass for group-based imputation
print("\nAge statistics by Sex and Pclass:")
age_by_group = df.groupby(['sex', 'pclass'])['age'].agg(['mean', 'median', 'count'])
display(age_by_group)
# Visualize age distribution by passenger class
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plot by passenger class
df.boxplot(column='age', by='pclass', ax=axes[0])
axes[0].set_title('Age Distribution by Passenger Class', fontweight='bold')
axes[0].set_xlabel('Passenger Class')
axes[0].set_ylabel('Age')
plt.sca(axes[0])
plt.xticks(rotation=0)
# Missing vs non-missing
age_status = df['age'].isnull().map({True: 'Missing', False: 'Available'})
age_status.value_counts().plot(kind='bar', ax=axes[1], color=['salmon', 'lightgreen'])
axes[1].set_title('Age Data Availability', fontweight='bold')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(['Available', 'Missing'], rotation=0)
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

IMPORTANT: The imputation code below calculates statistics from the ENTIRE dataset for demonstration and learning purposes.
In production, you MUST:
- Split data into train/test FIRST
- Calculate imputation values from TRAINING data ONLY
- Apply those same values to both train and test sets
Why this matters: If you calculate statistics from all data (including test), your model has “seen” test data during preprocessing, leading to overly optimistic performance estimates.
Correct approach will be demonstrated in later steps when we do the final train/test split.
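The leakage-safe pattern described above can be sketched with `SimpleImputer` on a synthetic 'age' column (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 'age' column, with some missing entries
X = pd.DataFrame({'age': [22, np.nan, 38, 26, 35, np.nan, 54, 2, 27, 14]})
y = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])

# 1) Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 2) Fit the imputer on TRAINING data only
imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)

# 3) Apply the same training statistics to both splits
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)
print(imputer.statistics_)  # median learned from the training split only
```

The test split never influences `imputer.statistics_`, which is exactly the property the warning above is about.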
# Strategy: Impute with median age by sex and pclass
print("\n" + "="*80)
print("AGE IMPUTATION STRATEGY")
print("="*80)
print("Using median age grouped by Sex and Pclass")
def impute_age(row):
    if pd.isnull(row['age']):
        # Get median age for same sex and class
        median_age = df[(df['sex'] == row['sex']) &
                        (df['pclass'] == row['pclass'])]['age'].median()
        # If no match, use overall median for that sex
        if pd.isnull(median_age):
            median_age = df[df['sex'] == row['sex']]['age'].median()
        # If still no match, use overall median
        if pd.isnull(median_age):
            median_age = df['age'].median()
        return median_age
    return row['age']
df['age'] = df.apply(impute_age, axis=1)
print(f"\nAfter imputation - Missing count: {df['age'].isnull().sum()}")

Why use the median grouped by Sex and Pclass?
- Sex (male/female) affects the passenger age distribution
- Pclass correlates with age (first-class passengers tend to be older and wealthier)
- The median is more robust to outliers than the mean
The hierarchical fallback guarantees every missing value gets filled:
- First: try the sex + pclass combination
- Second: try sex only
- Third: use the overall median
11.3 Step 3.3: Handle Missing Values in ‘Cabin’
# Cabin analysis (the seaborn copy stores cabin information in the 'deck' column)
print("="*80)
print("CABIN (DECK) - MISSING VALUES ANALYSIS")
print("="*80)
print(f"Missing count: {df['deck'].isnull().sum()}")
print(f"Missing percentage: {df['deck'].isnull().sum() / len(df) * 100:.2f}%")
# Check survival rate by cabin availability
df['has_cabin'] = df['deck'].notna().astype(int)
print("\nSurvival rate by Cabin availability:")
cabin_survival = df.groupby('has_cabin')['survived'].agg(['mean', 'count'])
cabin_survival.index = ['No Cabin Info', 'Has Cabin Info']
display(cabin_survival)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Survival rate
cabin_survival['mean'].plot(kind='bar', ax=axes[0], color=['salmon', 'lightgreen'])
axes[0].set_title('Survival Rate by Cabin Availability', fontweight='bold')
axes[0].set_ylabel('Survival Rate')
axes[0].set_xticklabels(cabin_survival.index, rotation=45, ha='right')
axes[0].grid(axis='y', alpha=0.3)
# Add percentage labels
for i, v in enumerate(cabin_survival['mean']):
    axes[0].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
# Count
cabin_survival['count'].plot(kind='bar', ax=axes[1], color='steelblue')
axes[1].set_title('Count by Cabin Availability', fontweight='bold')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(cabin_survival.index, rotation=45, ha='right')
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
# Strategy: Create binary feature 'has_cabin', drop the original column
print("\n" + "="*80)
print("CABIN HANDLING STRATEGY")
print("="*80)
print("Creating binary feature 'has_cabin' (1 if cabin info exists, 0 otherwise)")
print("Dropping original 'deck' column (77% missing)")
# We already created 'has_cabin' above
# Will drop 'deck' later in feature selection

12 Part 4: Outlier Detection and Handling
12.1 Step 4.1: Detect Outliers using IQR Method
# Function to detect outliers using IQR
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound
# Detect outliers in Age and Fare
print("="*80)
print("OUTLIER DETECTION - AGE")
print("="*80)
age_outliers, age_lower, age_upper = detect_outliers_iqr(df, 'age')
print(f"Lower bound: {age_lower:.2f}")
print(f"Upper bound: {age_upper:.2f}")
print(f"Number of outliers: {len(age_outliers)} ({len(age_outliers)/len(df)*100:.2f}%)")
print("\n" + "="*80)
print("OUTLIER DETECTION - FARE")
print("="*80)
fare_outliers, fare_lower, fare_upper = detect_outliers_iqr(df, 'fare')
print(f"Lower bound: {fare_lower:.2f}")
print(f"Upper bound: {fare_upper:.2f}")
print(f"Number of outliers: {len(fare_outliers)} ({len(fare_outliers)/len(df)*100:.2f}%)")

12.2 Step 4.2: Visualize Outliers
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Age - Box plot
axes[0, 0].boxplot(df['age'], vert=True)
axes[0, 0].axhline(y=age_upper, color='r', linestyle='--', label=f'Upper: {age_upper:.1f}')
axes[0, 0].axhline(y=age_lower, color='r', linestyle='--', label=f'Lower: {age_lower:.1f}')
axes[0, 0].set_ylabel('Age')
axes[0, 0].set_title('Age - Box Plot with IQR Bounds', fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)
# Age - Histogram
axes[0, 1].hist(df['age'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
axes[0, 1].axvline(x=age_upper, color='r', linestyle='--', linewidth=2, label=f'Upper: {age_upper:.1f}')
axes[0, 1].axvline(x=age_lower, color='r', linestyle='--', linewidth=2, label=f'Lower: {age_lower:.1f}')
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Age - Distribution with IQR Bounds', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)
# Fare - Box plot
axes[1, 0].boxplot(df['fare'], vert=True)
axes[1, 0].axhline(y=fare_upper, color='r', linestyle='--', label=f'Upper: {fare_upper:.1f}')
axes[1, 0].set_ylabel('Fare')
axes[1, 0].set_title('Fare - Box Plot with IQR Bounds', fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)
# Fare - Histogram (log scale for better visualization)
axes[1, 1].hist(df['fare'], bins=50, color='coral', edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=fare_upper, color='r', linestyle='--', linewidth=2, label=f'Upper: {fare_upper:.1f}')
axes[1, 1].set_xlabel('Fare')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Fare - Distribution with IQR Bounds', fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
plt.show()

12.3 Step 4.3: Analyze and Handle Outliers
# Analyze outliers
print("="*80)
print("OUTLIER ANALYSIS")
print("="*80)
# Age outliers
print("\nAge Outliers (sample):")
display(age_outliers[['age', 'pclass', 'sex', 'survived']].head(10))
# Check if age outliers are legitimate
print(f"\nAge outliers statistics:")
print(f"Mean survival rate of age outliers: {age_outliers['survived'].mean():.2%}")
print(f"Overall survival rate: {df['survived'].mean():.2%}")
# Fare outliers
print("\n\nFare Outliers (sample - highest fares):")
display(fare_outliers.nlargest(10, 'fare')[['fare', 'pclass', 'sex', 'survived']])
# Check if fare outliers are legitimate
print(f"\nFare outliers statistics:")
print(f"Mean survival rate of fare outliers: {fare_outliers['survived'].mean():.2%}")
print(f"Overall survival rate: {df['survived'].mean():.2%}")
# Decision: Keep outliers but note them
print("\n" + "="*80)
print("OUTLIER HANDLING DECISION")
print("="*80)
print("KEEP all outliers because:")
print("1. Age outliers are legitimate (elderly passengers)")
print("2. Fare outliers represent genuine first-class tickets")
print("3. Outliers contain valuable information about survival patterns")
print("4. Can use robust scaling methods later if needed")

Not every outlier should be removed! Consider:
- Domain knowledge: does the outlier make sense? (an age of 80 is valid)
- Impact on target: do the outliers show a different survival pattern?
- Percentage: if they are under 5% of the data and random, removal is an option
- Alternative: use robust methods (median, robust scaling)
In the Titanic case, the outliers are legitimate data points that carry important information.
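If outliers ever do need taming without dropping rows, capping (winsorising) at the IQR fences is one robust alternative. A sketch on made-up fares, using the same IQR bounds computed in Step 4.1:

```python
import pandas as pd

# Synthetic right-skewed fares for illustration
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.3, 263.0, 512.3])

q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorise: clip values to the IQR fences instead of deleting rows
fare_capped = fare.clip(lower=lower, upper=upper)
print(fare_capped.max() <= upper)
```

Unlike deletion, clipping keeps the row (and its label), only shrinking the extreme value toward the fence.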
13 Part 5: Feature Engineering
13.1 Step 5.1: Extract Title from Name
# Extract the title first; the seaborn copy has no 'name' column, so fall
# back to a coarse proxy built from 'who' and 'sex'
if 'name' in df.columns:
    df['title'] = df['name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
else:
    df['title'] = np.where(df['who'] == 'child',
                           np.where(df['sex'] == 'male', 'Master', 'Miss'),
                           np.where(df['sex'] == 'male', 'Mr', 'Mrs'))
print("="*80)
print("FEATURE ENGINEERING: TITLE EXTRACTION")
print("="*80)
# Show unique titles
print("Original titles:")
print(df['title'].value_counts())
# Group rare titles
title_mapping = {
'Mr': 'Mr',
'Miss': 'Miss',
'Mrs': 'Mrs',
'Master': 'Master',
'Rev': 'Officer',
'Dr': 'Officer',
'Col': 'Officer',
'Major': 'Officer',
'Capt': 'Officer',
'Jonkheer': 'Royalty',
'Don': 'Royalty',
'Sir': 'Royalty',
'Lady': 'Royalty',
'Countess': 'Royalty',
'Dona': 'Royalty',
'Mme': 'Mrs',
'Mlle': 'Miss',
'Ms': 'Miss'
}
df['title'] = df['title'].map(title_mapping)
print("\nRefined titles:")
print(df['title'].value_counts())
# Visualize survival by title
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Count by title
df['title'].value_counts().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Count by Title', fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Title')
axes[0].grid(axis='y', alpha=0.3)
# Survival rate by title
survival_by_title = df.groupby('title')['survived'].mean().sort_values(ascending=False)
survival_by_title.plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Survival Rate by Title', fontweight='bold')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xlabel('Title')
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim(0, 1)
# Add percentage labels
for i, v in enumerate(survival_by_title.values):
    axes[1].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()

13.2 Step 5.2: Create Family Size Feature
print("="*80)
print("FEATURE ENGINEERING: FAMILY SIZE")
print("="*80)
# Create family_size = sibsp + parch + 1 (self)
df['family_size'] = df['sibsp'] + df['parch'] + 1
print("Family Size distribution:")
print(df['family_size'].value_counts().sort_index())
# Create family size categories
def categorize_family_size(size):
    if size == 1:
        return 'Alone'
    elif size <= 4:
        return 'Small'
    else:
        return 'Large'
df['family_size_category'] = df['family_size'].apply(categorize_family_size)
print("\nFamily Size Category distribution:")
print(df['family_size_category'].value_counts())
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Distribution
df['family_size'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='teal')
axes[0].set_title('Family Size Distribution', fontweight='bold')
axes[0].set_xlabel('Family Size')
axes[0].set_ylabel('Count')
axes[0].grid(axis='y', alpha=0.3)
# Survival by family size
survival_by_family = df.groupby('family_size')['survived'].mean()
survival_by_family.plot(kind='line', marker='o', ax=axes[1], color='purple', linewidth=2)
axes[1].set_title('Survival Rate by Family Size', fontweight='bold')
axes[1].set_xlabel('Family Size')
axes[1].set_ylabel('Survival Rate')
axes[1].grid(alpha=0.3)
axes[1].set_ylim(0, 1)
# Survival by family category
survival_by_category = df.groupby('family_size_category')['survived'].mean()
survival_by_category.plot(kind='bar', ax=axes[2], color='orange')
axes[2].set_title('Survival Rate by Family Category', fontweight='bold')
axes[2].set_ylabel('Survival Rate')
axes[2].set_xticklabels(survival_by_category.index, rotation=0)
axes[2].grid(axis='y', alpha=0.3)
axes[2].set_ylim(0, 1)
# Add percentage labels
for i, v in enumerate(survival_by_category.values):
    axes[2].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()

- Alone (size=1): Survival rate ~30% - worst performing group
- Small families (2-4): Survival rate ~50-70% - best performing
- Large families (5+): Survival rate drops - possibly harder to evacuate together
Hypothesis: Small families may have helped each other survive, while solo travelers and large families faced challenges.
13.3 Step 5.3: Create Age Groups
print("="*80)
print("FEATURE ENGINEERING: AGE GROUPS")
print("="*80)
# Create age bins
age_bins = [0, 12, 18, 35, 60, 100]
age_labels = ['Child', 'Teenager', 'Adult', 'Middle Age', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)
print("Age Group distribution:")
print(df['age_group'].value_counts())
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Count by age group
df['age_group'].value_counts().plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Count by Age Group', fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Age Group')
axes[0].grid(axis='y', alpha=0.3)
# Survival by age group
survival_by_age_group = df.groupby('age_group')['survived'].mean()
survival_by_age_group.plot(kind='bar', ax=axes[1], color='salmon')
axes[1].set_title('Survival Rate by Age Group', fontweight='bold')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xlabel('Age Group')
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim(0, 1)
# Add percentage labels
for i, v in enumerate(survival_by_age_group.values):
    axes[1].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()

13.4 Step 5.4: Create Fare Groups
print("="*80)
print("FEATURE ENGINEERING: FARE GROUPS")
print("="*80)
# Create fare bins based on quartiles
fare_bins = [0, 7.91, 14.45, 31, 513] # Based on quartiles
fare_labels = ['Low', 'Medium', 'High', 'Very High']
# include_lowest=True keeps the zero-fare tickets in the 'Low' bin
# (pd.cut uses right-closed intervals by default, so fare == 0 would become NaN)
df['fare_group'] = pd.cut(df['fare'], bins=fare_bins, labels=fare_labels, include_lowest=True)
print("Fare Group distribution:")
print(df['fare_group'].value_counts())
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Count
df['fare_group'].value_counts().plot(kind='bar', ax=axes[0], color='gold')
axes[0].set_title('Count by Fare Group', fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Fare Group')
axes[0].grid(axis='y', alpha=0.3)
# Survival by fare group
survival_by_fare = df.groupby('fare_group')['survived'].mean()
survival_by_fare.plot(kind='bar', ax=axes[1], color='darkgreen')
axes[1].set_title('Survival Rate by Fare Group', fontweight='bold')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xlabel('Fare Group')
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim(0, 1)
# Add percentage labels
for i, v in enumerate(survival_by_fare.values):
    axes[1].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()

13.5 Step 5.5: Create is_alone Feature
print("="*80)
print("FEATURE ENGINEERING: IS_ALONE")
print("="*80)
# Create is_alone (1 if traveling alone, 0 otherwise)
df['is_alone'] = (df['family_size'] == 1).astype(int)
print("Is Alone distribution:")
print(df['is_alone'].value_counts())
# Compare with existing 'alone' column
print("\nComparison with seaborn's 'alone' column:")
print(f"Matches: {(df['is_alone'] == df['alone'].astype(int)).sum()} / {len(df)}")
# Visualize survival
fig, ax = plt.subplots(figsize=(8, 5))
survival_by_alone = df.groupby('is_alone')['survived'].mean()
# survival_by_alone is indexed 0 (with family), 1 (alone), already matching
# the label order, so no reversal of the values is needed
bars = ax.bar(['With Family', 'Alone'], survival_by_alone.values,
              color=['lightgreen', 'salmon'])
ax.set_ylabel('Survival Rate')
ax.set_title('Survival Rate: Alone vs With Family', fontweight='bold', fontsize=14)
ax.grid(axis='y', alpha=0.3)
ax.set_ylim(0, 1)
# Add percentage labels
for i, v in enumerate(survival_by_alone.values):
    ax.text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
print(f"\nSurvival rate - With family: {survival_by_alone[0]*100:.1f}%")
print(f"Survival rate - Alone: {survival_by_alone[1]*100:.1f}%")

14 Part 6: Feature Encoding
14.1 Step 6.1: Label Encoding for Binary Features
print("="*80)
print("FEATURE ENCODING: BINARY FEATURES")
print("="*80)
# Sex: male=1, female=0
df['sex_encoded'] = (df['sex'] == 'male').astype(int)
print("Sex encoding:")
print(df[['sex', 'sex_encoded']].drop_duplicates())
# Embarked: Create dummy variables
print("\n\nEmbarked before encoding:")
print(df['embarked'].value_counts())

14.2 Step 6.2: One-Hot Encoding for Categorical Features
print("="*80)
print("FEATURE ENCODING: ONE-HOT ENCODING")
print("="*80)
# One-hot encode the nominal and binned categorical features
categorical_features = ['embarked', 'title', 'family_size_category', 'age_group', 'fare_group']
# Create dummy variables
df_encoded = pd.get_dummies(df, columns=categorical_features, prefix=categorical_features, drop_first=True)
print(f"Original shape: {df.shape}")
print(f"After one-hot encoding: {df_encoded.shape}")
print("\nNew columns created:")
new_cols = [col for col in df_encoded.columns if col not in df.columns]
for col in sorted(new_cols):
    print(f" - {col}")

14.3 Step 6.3: Label Encoding for Ordinal Features
print("="*80)
print("FEATURE ENCODING: ORDINAL FEATURES")
print("="*80)
# Pclass is already encoded as 1, 2, 3 (ordinal)
print("Pclass (already encoded ordinally):")
print(df_encoded['pclass'].value_counts().sort_index())
# Verify encoding makes sense
print("\nPclass survival rates:")
pclass_survival = df_encoded.groupby('pclass')['survived'].mean()
for pclass, rate in pclass_survival.items():
    print(f" Class {pclass}: {rate*100:.1f}%")

15 Part 7: Feature Selection and Final Dataset
15.1 Step 7.1: Select Features for Modeling
print("="*80)
print("FEATURE SELECTION")
print("="*80)
# Drop unnecessary columns
columns_to_drop = [
'name', # Text, already extracted title
'ticket', # Unique identifier, no pattern
'cabin', # Too many missing values, created has_cabin
'sex', # Encoded as sex_encoded
'embarked', # One-hot encoded
'title', # One-hot encoded
'family_size_category', # One-hot encoded
'age_group', # One-hot encoded
'fare_group', # One-hot encoded
'who', # Redundant with sex and age
'adult_male', # Redundant
'deck', # Mostly missing
'embark_town', # Same as embarked
'alive', # Same as survived
'alone', # We have is_alone
'class' # Same as pclass
]
# Keep only existing columns
columns_to_drop = [col for col in columns_to_drop if col in df_encoded.columns]
df_final = df_encoded.drop(columns=columns_to_drop)
print(f"Columns dropped: {len(columns_to_drop)}")
print(f"Final dataset shape: {df_final.shape}")
print("\nFinal feature set:")
feature_cols = [col for col in df_final.columns if col != 'survived']
for i, col in enumerate(sorted(feature_cols), 1):
    print(f"{i:2d}. {col}")

15.2 Step 7.2: Check for Missing Values
print("="*80)
print("FINAL DATASET - MISSING VALUES CHECK")
print("="*80)
missing_final = df_final.isnull().sum()
if missing_final.sum() == 0:
    print("SUCCESS: No missing values in final dataset!")
else:
    print("WARNING: Missing values found:")
    print(missing_final[missing_final > 0])

15.3 Step 7.3: Feature Summary Statistics
print("="*80)
print("FINAL DATASET - SUMMARY STATISTICS")
print("="*80)
display(df_final.describe())
# Show data types
print("\nData types:")
print(df_final.dtypes.value_counts())

15.4 Step 7.4: Correlation with Target
# Calculate correlation with survived
correlations = df_final.corr()['survived'].drop('survived').sort_values(key=abs, ascending=False)
print("="*80)
print("FEATURE CORRELATIONS WITH SURVIVAL (Top 15)")
print("="*80)
for feature, corr in correlations.head(15).items():
    print(f"{feature:30s}: {corr:+.3f}")
# Visualize top correlations
fig, ax = plt.subplots(figsize=(10, 8))
top_features = correlations.head(15)
colors = ['green' if x > 0 else 'red' for x in top_features.values]
ax.barh(range(len(top_features)), top_features.values, color=colors, alpha=0.6)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features.index)
ax.set_xlabel('Correlation with Survival')
ax.set_title('Top 15 Features Correlated with Survival', fontweight='bold', fontsize=14)
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

16 Part 8: Prepare for Machine Learning
16.1 Step 8.1: Separate Features and Target
print("="*80)
print("PREPARING FOR MACHINE LEARNING")
print("="*80)
# Separate features (X) and target (y)
X = df_final.drop('survived', axis=1)
y = df_final['survived']
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nTarget distribution:")
print(f" Survived (1): {y.sum()} ({y.mean()*100:.2f}%)")
print(f" Perished (0): {(1-y).sum()} ({(1-y.mean())*100:.2f}%)")
print(f"\nFeatures list ({len(X.columns)} features):")
for i, col in enumerate(X.columns, 1):
    print(f"{i:2d}. {col}")

16.2 Step 8.2: Feature Scaling (Preparation)
print("="*80)
print("FEATURE SCALING (StandardScaler)")
print("="*80)
# Initialize scaler
scaler = StandardScaler()
# Features to scale (numerical features)
numerical_features = ['age', 'fare', 'sibsp', 'parch', 'family_size']
# Create a copy for scaling
X_scaled = X.copy()
# Fit and transform
X_scaled[numerical_features] = scaler.fit_transform(X[numerical_features])
print("Before scaling:")
display(X[numerical_features].describe())
print("\nAfter scaling:")
display(X_scaled[numerical_features].describe())

Fit on training data only: do not fit the scaler on the entire dataset before the train-test split!
Transform test data: reuse the scaler that was already fitted on the training data.
When to scale:
- Tree-based models (Random Forest, XGBoost): NO scaling needed
- Distance-based models (KNN, SVM): YES, scaling needed
- Neural Networks: YES, scaling needed
- Linear Regression: YES, scaling recommended (especially with regularization)
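To make the "fit on training data only" rule concrete, here is a minimal sketch on synthetic stand-in data (the column names and values below are illustrative, not the lab dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numerical features (illustration only)
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'age': rng.normal(30, 12, 200).clip(0, 80),
    'fare': rng.exponential(32, 200),
})
y = pd.Series(rng.integers(0, 2, 200))

# 1. Split FIRST, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Fit on the training split only, then reuse that fit on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train
X_test_scaled = scaler.transform(X_test)        # applies the SAME mean/std

print(X_train_scaled.mean(axis=0).round(2))     # ~zero for the train split
```

The test split's mean will generally not be exactly zero after scaling; that asymmetry is expected, and it is exactly what prevents information from the test set leaking into preprocessing.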
16.3 Step 8.3: Save Preprocessed Data
# Save final preprocessed dataset
output_path = 'titanic_preprocessed.csv'
df_final.to_csv(output_path, index=False)
print("="*80)
print("DATASET SAVED")
print("="*80)
print(f"File: {output_path}")
print(f"Shape: {df_final.shape}")
print(f"Size: {df_final.memory_usage(deep=True).sum() / 1024:.2f} KB")

17 Part 9: Final Summary
17.1 Step 9.1: Preprocessing Pipeline Summary
print("="*80)
print("DATA PREPROCESSING PIPELINE SUMMARY")
print("="*80)
summary = {
    'Original Dataset': {
        'Rows': len(titanic_raw),
        'Columns': len(titanic_raw.columns),
        'Missing Values': titanic_raw.isnull().sum().sum(),
    },
    'Final Dataset': {
        'Rows': len(df_final),
        'Columns': len(df_final.columns),
        'Missing Values': df_final.isnull().sum().sum(),
    },
    'Transformations': {
        'Missing Values Handled': 'Embarked (mode), Age (median by title/class), Cabin (binary feature)',
        'Outliers Handled': 'Kept (legitimate values)',
        'Features Engineered': 'title, family_size, is_alone, age_group, fare_group, has_cabin',
        'Encoding Applied': 'One-hot (embarked, title, categories), Label (sex)',
        'Features Dropped': len(columns_to_drop),
        # final = original + created - dropped, so created = final - original + dropped
        'Features Created': len(df_final.columns) - len(titanic_raw.columns) + len(columns_to_drop)
    }
}
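The "Age (median by title/class)" strategy recorded in this summary can be sketched with a groupby-transform. The toy frame below is illustrative only; the lab's actual columns follow the same pattern:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the age-imputation step (values are illustrative)
df = pd.DataFrame({
    'title':  ['Mr', 'Mr', 'Miss', 'Miss', 'Mr', 'Miss'],
    'pclass': [3, 3, 1, 1, 3, 1],
    'age':    [22.0, np.nan, 30.0, 28.0, 24.0, np.nan],
})

# Median age within each (title, pclass) group, broadcast back to row shape
group_median = df.groupby(['title', 'pclass'])['age'].transform('median')
df['age'] = df['age'].fillna(group_median)
print(df['age'].tolist())  # missing ages filled with their group's median
```

Group-wise medians preserve the structure that "Mr in 3rd class" and "Miss in 1st class" have very different typical ages, which a single global median would erase.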
for section, details in summary.items():
    print(f"\n{section}:")
    for key, value in details.items():
        print(f"  {key}: {value}")

17.2 Step 9.2: Key Insights
print("\n" + "="*80)
print("KEY INSIGHTS FROM PREPROCESSING")
print("="*80)
insights = [
    "1. Gender was the strongest predictor: Women had 74% survival vs 19% for men",
    "2. Class matters: 1st class 63% survival, 3rd class only 24%",
    "3. Age affects survival: Children had highest survival rates",
    "4. Family size is important: Small families (2-4) had best survival rates",
    "5. Fare correlates with survival: Higher fare = higher survival (class proxy)",
    "6. Title reveals social status and age: Miss/Mrs higher survival than Mr",
    "7. Having cabin info indicates higher class and survival",
    "8. Embarked from Cherbourg had higher survival (more 1st class passengers)"
]
for insight in insights:
    print(f"\n{insight}")

17.3 Step 9.3: Next Steps for Machine Learning
print("\n" + "="*80)
print("READY FOR MACHINE LEARNING!")
print("="*80)
next_steps = """
Next steps to build prediction models:
1. TRAIN-TEST SPLIT
- Split data 80-20 or 70-30
- Use stratified split to preserve class balance
- Set random_state for reproducibility
2. MODEL SELECTION
- Start with baseline (Logistic Regression)
- Try ensemble methods (Random Forest, XGBoost)
- Compare performance metrics
3. MODEL EVALUATION
- Accuracy, Precision, Recall, F1-Score
- Confusion Matrix
- ROC Curve and AUC
4. HYPERPARAMETER TUNING
- Grid Search or Random Search
- Cross-validation
5. FEATURE IMPORTANCE
- Identify most important features
- Consider feature selection
The preprocessing is complete and data is ready for modeling!
"""
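The first two items in the checklist above (stratified train-test split, then a Logistic Regression baseline) can be sketched as follows. The synthetic features here stand in for the preprocessed dataset and are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features (illustration only)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f'f{i}' for i in range(5)])
y = (X['f0'] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# 1. Stratified 80-20 split with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Simple baseline model, to beat later with ensembles
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Baseline accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

`stratify=y` keeps the survived/perished proportions (roughly) identical in both splits, so baseline metrics are not distorted by an unlucky class imbalance in the test set.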
print(next_steps)

Lab Deliverables
What to Submit
Jupyter Notebook (.ipynb) containing:
- All code cells, fully executed
- Clear visualization output
- Markdown cells with analysis and interpretation
- Conclusions for each preprocessing stage
Preprocessed Dataset (titanic_preprocessed.csv)
- The final preprocessing output, ready for modeling
Report (PDF) containing:
- Executive summary of the preprocessing pipeline
- Justification for every preprocessing decision
- Visualizations of the key insights
- Recommendations for modeling
Grading Rubric
See rubric.md for the detailed grading criteria.
Total: 25 points
References
- Kaggle Titanic Competition: https://www.kaggle.com/c/titanic
- Pandas Documentation: https://pandas.pydata.org/docs/
- Scikit-learn Preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
- Seaborn Tutorial: https://seaborn.pydata.org/tutorial.html
- Understand the why: don't just copy-paste code; understand the reasoning behind each preprocessing step
- Document decisions: record why you chose one method over another
- Visualize everything: plots help you understand the data and communicate insights
- Think about leakage: never use information from the test set when preprocessing the training set
- Iterate and experiment: try multiple strategies and compare the results