Lab 02 - Titanic Dataset: Data Preprocessing & Feature Engineering

Machine Learning - Politeknik Siber dan Sandi Negara

Author

Girinoto, S.Si., M.Si.

Lab Overview

Learning Outcomes

After completing this lab, students are expected to be able to:

  1. [CPMK-2] Perform comprehensive exploratory data analysis (EDA) on real-world data with data quality problems
  2. [CPMK-2] Handle missing values using a range of strategies (deletion, imputation, prediction)
  3. [CPMK-2] Detect and handle outliers using statistical methods and visualization
  4. [CPMK-2] Encode categorical variables using one-hot and label encoding techniques
  5. [CPMK-3] Engineer new features from the existing data to improve prediction quality
  6. [CPMK-3] Build a complete, reproducible preprocessing pipeline

Lab Information

Item            Detail
Duration        3-4 hours
Difficulty      Intermediate
Prerequisites   Chapter 2: Data Preprocessing & EDA
Dataset         Titanic - Machine Learning from Disaster
Tools           Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn

Background Context

8.0.1 Tragedy of RMS Titanic

On April 15, 1912, the RMS Titanic sank after striking an iceberg on her maiden voyage from Southampton to New York City. The disaster killed 1502 of the 2224 passengers and crew, making it one of the worst maritime disasters in history.

8.0.2 Dataset Description

The Titanic dataset contains information about the ship's passengers, including whether or not they survived. It is a very popular dataset for learning machine learning because of its:

  • Real-world data: it has missing values, outliers, and other common data quality problems
  • Mixed data types: a combination of numerical and categorical features
  • Clear objective: predicting survival (binary classification)
  • Feature engineering opportunities: many new features can be extracted from the existing data

8.0.3 Dataset Features

Feature      Type     Description
PassengerId  Integer  Unique passenger ID
Survived     Integer  Target variable (0 = did not survive, 1 = survived)
Pclass       Integer  Ticket class (1 = First, 2 = Second, 3 = Third)
Name         String   Passenger name
Sex          String   Gender (male/female)
Age          Float    Age in years
SibSp        Integer  Number of siblings/spouses aboard
Parch        Integer  Number of parents/children aboard
Ticket       String   Ticket number
Fare         Float    Ticket price
Cabin        String   Cabin number
Embarked     String   Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Note: the table above describes the original Kaggle dataset. The seaborn copy loaded in this lab uses lowercase column names, omits Name, Ticket, and Cabin (cabin information appears as the derived deck column), and adds the derived columns class, who, adult_male, embark_town, alive, and alone - 15 columns in total.

9 Part 1: Environment Setup & Data Loading

9.1 Step 1.1: Import Required Libraries

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Expected Output:

Libraries imported successfully!
Pandas version: 2.x.x
NumPy version: 1.x.x

9.2 Step 1.2: Load Titanic Dataset

# Load dataset from seaborn
titanic_raw = sns.load_dataset('titanic')

# Create a working copy
df = titanic_raw.copy()

print(f"Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

Expected Output:

Dataset loaded successfully!
Shape: (891, 15)
Memory usage: ~128 KB

9.3 Step 1.3: Initial Data Inspection

# Display first few rows
print("="*80)
print("FIRST 10 ROWS")
print("="*80)
display(df.head(10))

print("\n" + "="*80)
print("DATASET INFO")
print("="*80)
df.info()

print("\n" + "="*80)
print("BASIC STATISTICS")
print("="*80)
display(df.describe(include='all'))
Data Inspection Checklist

When doing the initial inspection, pay attention to (a quick-check sketch follows the list):

  1. Data types: do they match expectations?
  2. Missing values: which columns contain null values?
  3. Value ranges: are there any implausible values?
  4. Unique values: how many categories does each categorical variable have?
  5. Distribution: is the data normally distributed or skewed?
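
A minimal sketch of quick commands for working through this checklist, using the seaborn column names loaded above:

# 1-2. Data types and missing values at a glance
print(df.dtypes)
print(df.isnull().sum())

# 3. Value ranges for the numerical columns
print(df[['age', 'fare', 'sibsp', 'parch']].agg(['min', 'max']))

# 4. Cardinality of the categorical columns
print(df.select_dtypes(include=['object', 'category']).nunique())

# 5. Skewness as a quick distribution check (around 0 means roughly symmetric)
print(df[['age', 'fare']].skew())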

10 Part 2: Exploratory Data Analysis (EDA)

10.1 Step 2.1: Missing Values Analysis

# Calculate missing values
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2),
    'Data_Type': df.dtypes
})

missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values(
    'Missing_Percentage', ascending=False
)

print("="*80)
print("MISSING VALUES SUMMARY")
print("="*80)
display(missing_data)

# Visualize missing values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
missing_data.plot(
    kind='barh',
    x='Column',
    y='Missing_Percentage',
    ax=axes[0],
    color='salmon',
    legend=False
)
axes[0].set_xlabel('Missing Percentage (%)')
axes[0].set_title('Missing Values by Column', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Heatmap
sns.heatmap(
    df.isnull(),
    yticklabels=False,
    cbar=True,
    cmap='viridis',
    ax=axes[1]
)
axes[1].set_title('Missing Values Heatmap', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
Missing Data Patterns

deck (~77% missing): more than three quarters of the values are missing - cabin/deck information was likely recorded mainly for upper-class passengers. (This is the seaborn counterpart of the Kaggle Cabin column.)

age (19.9% missing): roughly one value in five is missing - this needs a careful imputation strategy.

embarked (0.22% missing): only two rows - these can safely be dropped or imputed with the mode. The three strategy families from the learning outcomes (deletion, imputation, prediction) are sketched below.
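
Learning outcome 2 names three broad strategy families - deletion, imputation, and prediction. A minimal sketch of each on a throwaway copy of the numeric columns (IterativeImputer is experimental in scikit-learn and must be enabled explicitly; shown for illustration only):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

demo = df[['age', 'fare', 'pclass', 'sibsp', 'parch']].copy()

# Strategy 1: deletion - drop the rows where age is missing
deleted = demo.dropna(subset=['age'])

# Strategy 2: imputation - fill missing ages with the median
imputed = demo.assign(age=demo['age'].fillna(demo['age'].median()))

# Strategy 3: prediction - model age from the other columns
predicted = pd.DataFrame(IterativeImputer(random_state=42).fit_transform(demo),
                         columns=demo.columns)

print(f"Rows: all={len(demo)}, after deletion={len(deleted)}")
print(f"Remaining NaN after prediction: {predicted['age'].isnull().sum()}")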

10.2 Step 2.2: Target Variable Analysis

# Survival rate
print("="*80)
print("SURVIVAL STATISTICS")
print("="*80)
print(f"Total passengers: {len(df)}")
print(f"Survived: {df['survived'].sum()} ({df['survived'].mean()*100:.2f}%)")
print(f"Perished: {(1-df['survived']).sum()} ({(1-df['survived'].mean())*100:.2f}%)")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Count plot (sort by index so 0 = Perished always comes first,
# independent of which class is more frequent)
survival_counts = df['survived'].value_counts().sort_index()
axes[0].bar(['Perished', 'Survived'], survival_counts.values, color=['#e74c3c', '#2ecc71'])
axes[0].set_ylabel('Count')
axes[0].set_title('Survival Count', fontsize=12, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Add value labels
for i, v in enumerate(survival_counts.values):
    axes[0].text(i, v + 10, str(v), ha='center', fontweight='bold')

# Pie chart
axes[1].pie(
    survival_counts.values,
    labels=['Perished', 'Survived'],
    autopct='%1.1f%%',
    colors=['#e74c3c', '#2ecc71'],
    startangle=90,
    explode=(0.05, 0.05)
)
axes[1].set_title('Survival Rate', fontsize=12, fontweight='bold')

# Survival by class
survival_by_class = df.groupby('pclass')['survived'].mean()
axes[2].bar(survival_by_class.index, survival_by_class.values, color='steelblue')
axes[2].set_xlabel('Passenger Class')
axes[2].set_ylabel('Survival Rate')
axes[2].set_title('Survival Rate by Class', fontsize=12, fontweight='bold')
axes[2].set_xticks([1, 2, 3])
axes[2].set_xticklabels(['1st Class', '2nd Class', '3rd Class'])
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

10.3 Step 2.3: Univariate Analysis - Numerical Features

# Select numerical columns
numerical_cols = ['age', 'fare', 'sibsp', 'parch']

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

for idx, col in enumerate(numerical_cols):
    # Histogram
    axes[0, idx].hist(df[col].dropna(), bins=30, color='skyblue', edgecolor='black')
    axes[0, idx].set_title(f'{col.upper()} - Distribution', fontweight='bold')
    axes[0, idx].set_xlabel(col.capitalize())
    axes[0, idx].set_ylabel('Frequency')
    axes[0, idx].grid(alpha=0.3)

    # Box plot
    axes[1, idx].boxplot(df[col].dropna(), vert=True)
    axes[1, idx].set_title(f'{col.upper()} - Box Plot', fontweight='bold')
    axes[1, idx].set_ylabel(col.capitalize())
    axes[1, idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical summary
print("="*80)
print("NUMERICAL FEATURES STATISTICS")
print("="*80)
display(df[numerical_cols].describe())
Univariate Analysis Insights

From the histograms and box plots, note:

  • Age: a roughly normal distribution with a peak among young adults (20-30 years)
  • Fare: heavily right-skewed, with extreme outliers (expensive tickets) - see the log-transform sketch below
  • SibSp & Parch: most passengers traveled alone or with a small family

10.4 Step 2.4: Univariate Analysis - Categorical Features

# Select categorical columns
categorical_cols = ['sex', 'pclass', 'embarked', 'who', 'alone']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

for idx, col in enumerate(categorical_cols):
    value_counts = df[col].value_counts()

    axes[idx].bar(range(len(value_counts)), value_counts.values, color='coral')
    axes[idx].set_xticks(range(len(value_counts)))
    axes[idx].set_xticklabels(value_counts.index, rotation=45, ha='right')
    axes[idx].set_title(f'{col.upper()} - Distribution', fontweight='bold')
    axes[idx].set_ylabel('Count')
    axes[idx].grid(axis='y', alpha=0.3)

    # Add value labels
    for i, v in enumerate(value_counts.values):
        axes[idx].text(i, v + 5, str(v), ha='center', fontweight='bold')

# Hide extra subplot
axes[-1].axis('off')

plt.tight_layout()
plt.show()

10.5 Step 2.5: Bivariate Analysis - Survival vs Features

# Survival by categorical features
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

categorical_features = ['sex', 'pclass', 'embarked', 'who', 'alone']

for idx, col in enumerate(categorical_features):
    survival_rate = df.groupby(col)['survived'].agg(['mean', 'count'])

    x_pos = range(len(survival_rate))
    axes[idx].bar(x_pos, survival_rate['mean'], color='teal', alpha=0.7)
    axes[idx].set_xticks(x_pos)
    axes[idx].set_xticklabels(survival_rate.index, rotation=45, ha='right')
    axes[idx].set_ylabel('Survival Rate')
    axes[idx].set_title(f'Survival Rate by {col.upper()}', fontweight='bold')
    axes[idx].set_ylim(0, 1)
    axes[idx].grid(axis='y', alpha=0.3)

    # Add percentage labels
    for i, v in enumerate(survival_rate['mean']):
        axes[idx].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')

# Hide extra subplot
axes[-1].axis('off')

plt.tight_layout()
plt.show()
Key Survival Patterns
  1. Gender: women had a far higher survival rate (~74% vs ~19% for men) - the crosstab sketch below tabulates this
  2. Class: 1st class survival rate ~63%, 3rd class only ~24%
  3. Embarked: passengers who boarded at Cherbourg (C) had a higher survival rate
  4. Alone: passengers traveling with family tended to survive more often
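
These rates can be read straight off a contingency table; a minimal sketch with pd.crosstab and pd.pivot_table:

# Survival proportion per gender (each row sums to 1)
print(pd.crosstab(df['sex'], df['survived'], normalize='index').round(3))

# Two-way breakdown: mean survival rate by class and gender
print(pd.pivot_table(df, values='survived', index='pclass',
                     columns='sex', aggfunc='mean').round(3))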

10.6 Step 2.6: Correlation Analysis

# Select numerical features for correlation
correlation_features = ['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']
correlation_data = df[correlation_features].copy()

# Encode sex for correlation
correlation_data['sex_male'] = (df['sex'] == 'male').astype(int)

# Calculate correlation matrix
corr_matrix = correlation_data.corr()

# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=1,
    cbar_kws={"shrink": 0.8},
    ax=ax
)
ax.set_title('Feature Correlation Matrix', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Print strongest correlations with survival
print("="*80)
print("CORRELATIONS WITH SURVIVAL (sorted by absolute value)")
print("="*80)
survival_corr = corr_matrix['survived'].drop('survived').sort_values(key=abs, ascending=False)
for feature, corr in survival_corr.items():
    print(f"{feature:15s}: {corr:+.3f}")

11 Part 3: Missing Values Handling

11.1 Step 3.1: Handle Missing Values in ‘Embarked’

# Check missing embarked
print("="*80)
print("EMBARKED - MISSING VALUES")
print("="*80)
print(f"Missing count: {df['embarked'].isnull().sum()}")

# Show passengers with missing embarked
missing_embarked = df[df['embarked'].isnull()]
print("\nPassengers with missing Embarked:")
display(missing_embarked[['pclass', 'fare', 'embarked', 'sex', 'age']])

# Check mode
print("\nEmbarked distribution:")
print(df['embarked'].value_counts())
print(f"\nMode: {df['embarked'].mode()[0]}")

# Strategy: impute with the mode (Southampton - the most common port)
# (assigning the result avoids pandas' deprecated chained inplace fillna)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

print(f"\nAfter imputation - Missing count: {df['embarked'].isnull().sum()}")

11.2 Step 3.2: Handle Missing Values in ‘Age’

# Age missing analysis
print("="*80)
print("AGE - MISSING VALUES ANALYSIS")
print("="*80)
print(f"Missing count: {df['age'].isnull().sum()}")
print(f"Missing percentage: {df['age'].isnull().sum() / len(df) * 100:.2f}%")

# Check age distribution by title
# NOTE: Seaborn's Titanic dataset doesn't include 'name' column
# If you have a dataset with 'name', you can extract title like this:
# df['title'] = df['name'].str.extract(' ([A-Za-z]+)\.', expand=False)
# For now, we'll use sex and pclass for group-based imputation

print("\nAge statistics by Sex and Pclass:")
age_by_group = df.groupby(['sex', 'pclass'])['age'].agg(['mean', 'median', 'count'])
display(age_by_group)

# Visualize age distribution by passenger class
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot by passenger class (figsize is ignored when an ax is supplied)
df.boxplot(column='age', by='pclass', ax=axes[0])
fig.suptitle('')  # suppress pandas' automatic "grouped by" suptitle
axes[0].set_title('Age Distribution by Passenger Class', fontweight='bold')
axes[0].set_xlabel('Passenger Class')
axes[0].set_ylabel('Age')
plt.sca(axes[0])
plt.xticks(rotation=0)

# Missing vs non-missing
age_status = df['age'].isnull().map({True: 'Missing', False: 'Available'})
age_status.value_counts().plot(kind='bar', ax=axes[1], color=['salmon', 'lightgreen'])
axes[1].set_title('Age Data Availability', fontweight='bold')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(['Available', 'Missing'], rotation=0)
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()
⚠️ Data Leakage Alert!

IMPORTANT: The imputation code below calculates statistics from the ENTIRE dataset for demonstration and learning purposes.

In production, you MUST:

  1. Split data into train/test FIRST
  2. Calculate imputation values from TRAINING data ONLY
  3. Apply those same values to both train and test sets

Why this matters: If you calculate statistics from all data (including test), your model has “seen” test data during preprocessing, leading to overly optimistic performance estimates.

The correct approach is sketched below and applied again when we reach the final train/test split.
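
A minimal sketch of that leakage-safe pattern, using the SimpleImputer imported at the top (the feature subset and variable names below are illustrative):

from sklearn.model_selection import train_test_split

# 1. Split FIRST
X_demo = df[['age', 'fare', 'pclass']]
y_demo = df['survived']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

# 2. Fit the imputer on the TRAINING split only
imputer = SimpleImputer(strategy='median')
imputer.fit(X_tr)

# 3. Apply the SAME learned medians to both splits
X_tr_imputed = imputer.transform(X_tr)
X_te_imputed = imputer.transform(X_te)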

# Strategy: Impute with median age by sex and pclass
print("\n" + "="*80)
print("AGE IMPUTATION STRATEGY")
print("="*80)
print("Using median age grouped by Sex and Pclass")

def impute_age(row):
    if pd.isnull(row['age']):
        # Get median age for same sex and class
        median_age = df[(df['sex'] == row['sex']) &
                        (df['pclass'] == row['pclass'])]['age'].median()

        # If no match, use overall median for that sex
        if pd.isnull(median_age):
            median_age = df[df['sex'] == row['sex']]['age'].median()

        # If still no match, use overall median
        if pd.isnull(median_age):
            median_age = df['age'].median()

        return median_age
    return row['age']

df['age'] = df.apply(impute_age, axis=1)

print(f"\nAfter imputation - Missing count: {df['age'].isnull().sum()}")
Age Imputation Strategy

Why use the median grouped by Sex and Pclass?

  1. Sex (male/female) influences the age distribution of the passengers

  2. Pclass correlates with age (1st-class passengers tend to be older and wealthier)

  3. The median is more robust to outliers than the mean

  4. The hierarchical fallback guarantees that every missing value gets filled (a vectorized alternative is sketched below):

    • First: try the sex + pclass combination
    • Second: try sex only
    • Third: use the overall median
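
For reference, the same sex + pclass group-median imputation can be written in vectorized form with groupby().transform(). A sketch (at this point age is already filled, so rerunning it is a no-op; the final fillna is the safety net for the hypothetical case of an entirely empty group):

# Vectorized equivalent of impute_age
age_filled = df.groupby(['sex', 'pclass'])['age'].transform(
    lambda s: s.fillna(s.median())
)
df['age'] = age_filled.fillna(df['age'].median())  # overall-median fallback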

11.3 Step 3.3: Handle Missing Values in ‘Cabin’

# Cabin analysis
# NOTE: seaborn's dataset has no 'cabin' column; cabin information is
# exposed as 'deck' (the first letter of the Kaggle Cabin field)
print("="*80)
print("CABIN (DECK) - MISSING VALUES ANALYSIS")
print("="*80)
print(f"Missing count: {df['deck'].isnull().sum()}")
print(f"Missing percentage: {df['deck'].isnull().sum() / len(df) * 100:.2f}%")

# Check survival rate by cabin availability
df['has_cabin'] = df['deck'].notna().astype(int)

print("\nSurvival rate by Cabin availability:")
cabin_survival = df.groupby('has_cabin')['survived'].agg(['mean', 'count'])
cabin_survival.index = ['No Cabin Info', 'Has Cabin Info']
display(cabin_survival)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Survival rate
cabin_survival['mean'].plot(kind='bar', ax=axes[0], color=['salmon', 'lightgreen'])
axes[0].set_title('Survival Rate by Cabin Availability', fontweight='bold')
axes[0].set_ylabel('Survival Rate')
axes[0].set_xticklabels(cabin_survival.index, rotation=45, ha='right')
axes[0].grid(axis='y', alpha=0.3)

# Add percentage labels
for i, v in enumerate(cabin_survival['mean']):
    axes[0].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')

# Count
cabin_survival['count'].plot(kind='bar', ax=axes[1], color='steelblue')
axes[1].set_title('Count by Cabin Availability', fontweight='bold')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(cabin_survival.index, rotation=45, ha='right')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Strategy: create binary feature 'has_cabin', drop the original 'deck' column
print("\n" + "="*80)
print("CABIN HANDLING STRATEGY")
print("="*80)
print("Creating binary feature 'has_cabin' (1 if cabin/deck info exists, 0 otherwise)")
print("Dropping original 'deck' column (~77% missing)")

# 'has_cabin' was already created above
# 'deck' will be dropped later during feature selection

12 Part 4: Outlier Detection and Handling

12.1 Step 4.1: Detect Outliers using IQR Method

# Function to detect outliers using IQR
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]

    return outliers, lower_bound, upper_bound

# Detect outliers in Age and Fare
print("="*80)
print("OUTLIER DETECTION - AGE")
print("="*80)
age_outliers, age_lower, age_upper = detect_outliers_iqr(df, 'age')
print(f"Lower bound: {age_lower:.2f}")
print(f"Upper bound: {age_upper:.2f}")
print(f"Number of outliers: {len(age_outliers)} ({len(age_outliers)/len(df)*100:.2f}%)")

print("\n" + "="*80)
print("OUTLIER DETECTION - FARE")
print("="*80)
fare_outliers, fare_lower, fare_upper = detect_outliers_iqr(df, 'fare')
print(f"Lower bound: {fare_lower:.2f}")
print(f"Upper bound: {fare_upper:.2f}")
print(f"Number of outliers: {len(fare_outliers)} ({len(fare_outliers)/len(df)*100:.2f}%)")

12.2 Step 4.2: Visualize Outliers

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Age - Box plot
axes[0, 0].boxplot(df['age'], vert=True)
axes[0, 0].axhline(y=age_upper, color='r', linestyle='--', label=f'Upper: {age_upper:.1f}')
axes[0, 0].axhline(y=age_lower, color='r', linestyle='--', label=f'Lower: {age_lower:.1f}')
axes[0, 0].set_ylabel('Age')
axes[0, 0].set_title('Age - Box Plot with IQR Bounds', fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# Age - Histogram
axes[0, 1].hist(df['age'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
axes[0, 1].axvline(x=age_upper, color='r', linestyle='--', linewidth=2, label=f'Upper: {age_upper:.1f}')
axes[0, 1].axvline(x=age_lower, color='r', linestyle='--', linewidth=2, label=f'Lower: {age_lower:.1f}')
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Age - Distribution with IQR Bounds', fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# Fare - Box plot
axes[1, 0].boxplot(df['fare'], vert=True)
axes[1, 0].axhline(y=fare_upper, color='r', linestyle='--', label=f'Upper: {fare_upper:.1f}')
axes[1, 0].set_ylabel('Fare')
axes[1, 0].set_title('Fare - Box Plot with IQR Bounds', fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)

# Fare - Histogram
axes[1, 1].hist(df['fare'], bins=50, color='coral', edgecolor='black', alpha=0.7)
axes[1, 1].axvline(x=fare_upper, color='r', linestyle='--', linewidth=2, label=f'Upper: {fare_upper:.1f}')
axes[1, 1].set_xlabel('Fare')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Fare - Distribution with IQR Bounds', fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

12.3 Step 4.3: Analyze and Handle Outliers

# Analyze outliers
print("="*80)
print("OUTLIER ANALYSIS")
print("="*80)

# Age outliers
print("\nAge Outliers (sample):")
display(age_outliers[['age', 'pclass', 'sex', 'survived']].head(10))

# Check if age outliers are legitimate
print(f"\nAge outliers statistics:")
print(f"Mean survival rate of age outliers: {age_outliers['survived'].mean():.2%}")
print(f"Overall survival rate: {df['survived'].mean():.2%}")

# Fare outliers
print("\n\nFare Outliers (sample - highest fares):")
display(fare_outliers.nlargest(10, 'fare')[['fare', 'pclass', 'sex', 'survived']])

# Check if fare outliers are legitimate
print(f"\nFare outliers statistics:")
print(f"Mean survival rate of fare outliers: {fare_outliers['survived'].mean():.2%}")
print(f"Overall survival rate: {df['survived'].mean():.2%}")

# Decision: Keep outliers but note them
print("\n" + "="*80)
print("OUTLIER HANDLING DECISION")
print("="*80)
print("KEEP all outliers because:")
print("1. Age outliers are legitimate (elderly passengers)")
print("2. Fare outliers represent genuine first-class tickets")
print("3. Outliers contain valuable information about survival patterns")
print("4. Can use robust scaling methods later if needed")
Outlier Handling Strategy

Not every outlier should be removed! Consider:

  1. Domain knowledge: does the outlier make sense? (an age of 80 is valid)
  2. Impact on target: do the outliers show a different survival pattern?
  3. Percentage: if they are <5% of the data and random, removal can be acceptable
  4. Alternative: use robust methods (median, robust scaling) - see the sketch below

In the Titanic case, the outliers are legitimate data points that carry important information.
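
A minimal sketch of the robust-scaling alternative mentioned in point 4: sklearn's RobustScaler centers on the median and scales by the IQR, so extreme fares affect it far less than StandardScaler's mean/std (the fit-on-training-data-only caveat from Step 3.2 still applies):

from sklearn.preprocessing import RobustScaler

robust = RobustScaler()  # (x - median) / IQR, per column
fare_robust = robust.fit_transform(df[['fare']])

print(f"Raw fare      - median: {df['fare'].median():.2f}, max: {df['fare'].max():.2f}")
print(f"Robust-scaled - median: {np.median(fare_robust):.2f}, max: {fare_robust.max():.2f}")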


13 Part 5: Feature Engineering

13.1 Step 5.1: Extract Title from Name

# Seaborn's dataset has no 'name' column, so real titles cannot be extracted.
# As a workaround, build a simplified proxy title from 'who' and 'sex'
# (with Kaggle data, extract it from Name via the regex shown in Step 3.2).
print("="*80)
print("FEATURE ENGINEERING: TITLE EXTRACTION")
print("="*80)

df['title'] = np.where(df['who'] == 'child',
                       np.where(df['sex'] == 'male', 'Master', 'Miss'),
                       np.where(df['sex'] == 'male', 'Mr', 'Mrs'))

# Show unique titles
print("Proxy titles:")
print(df['title'].value_counts())

# Group rare titles (only relevant when titles come from the Kaggle Name
# field; the proxy above produces just the first four keys)
title_mapping = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Rev': 'Officer',
    'Dr': 'Officer',
    'Col': 'Officer',
    'Major': 'Officer',
    'Capt': 'Officer',
    'Jonkheer': 'Royalty',
    'Don': 'Royalty',
    'Sir': 'Royalty',
    'Lady': 'Royalty',
    'Countess': 'Royalty',
    'Dona': 'Royalty',
    'Mme': 'Mrs',
    'Mlle': 'Miss',
    'Ms': 'Miss'
}

df['title'] = df['title'].map(title_mapping)

print("\nRefined titles:")
print(df['title'].value_counts())

# Visualize survival by title
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count by title
df['title'].value_counts().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Count by Title', fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Title')
axes[0].grid(axis='y', alpha=0.3)

# Survival rate by title
survival_by_title = df.groupby('title')['survived'].mean().sort_values(ascending=False)
survival_by_title.plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Survival Rate by Title', fontweight='bold')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xlabel('Title')
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim(0, 1)

# Add percentage labels
for i, v in enumerate(survival_by_title.values):
    axes[1].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

13.2 Step 5.2: Create Family Size Feature

print("="*80)
print("FEATURE ENGINEERING: FAMILY SIZE")
print("="*80)

# Create family_size = sibsp + parch + 1 (self)
df['family_size'] = df['sibsp'] + df['parch'] + 1

print("Family Size distribution:")
print(df['family_size'].value_counts().sort_index())

# Create family size categories
def categorize_family_size(size):
    if size == 1:
        return 'Alone'
    elif size <= 4:
        return 'Small'
    else:
        return 'Large'

df['family_size_category'] = df['family_size'].apply(categorize_family_size)

print("\nFamily Size Category distribution:")
print(df['family_size_category'].value_counts())

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Distribution
df['family_size'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='teal')
axes[0].set_title('Family Size Distribution', fontweight='bold')
axes[0].set_xlabel('Family Size')
axes[0].set_ylabel('Count')
axes[0].grid(axis='y', alpha=0.3)

# Survival by family size
survival_by_family = df.groupby('family_size')['survived'].mean()
survival_by_family.plot(kind='line', marker='o', ax=axes[1], color='purple', linewidth=2)
axes[1].set_title('Survival Rate by Family Size', fontweight='bold')
axes[1].set_xlabel('Family Size')
axes[1].set_ylabel('Survival Rate')
axes[1].grid(alpha=0.3)
axes[1].set_ylim(0, 1)

# Survival by family category
survival_by_category = df.groupby('family_size_category')['survived'].mean()
survival_by_category.plot(kind='bar', ax=axes[2], color='orange')
axes[2].set_title('Survival Rate by Family Category', fontweight='bold')
axes[2].set_ylabel('Survival Rate')
axes[2].set_xticklabels(survival_by_category.index, rotation=0)
axes[2].grid(axis='y', alpha=0.3)
axes[2].set_ylim(0, 1)

# Add percentage labels
for i, v in enumerate(survival_by_category.values):
    axes[2].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
Family Size Insights
  • Alone (size=1): Survival rate ~30% - worst performing group
  • Small families (2-4): Survival rate ~50-70% - best performing
  • Large families (5+): Survival rate drops - possibly harder to evacuate together

Hypothesis: Small families may have helped each other survive, while solo travelers and large families faced challenges.

13.3 Step 5.3: Create Age Groups

print("="*80)
print("FEATURE ENGINEERING: AGE GROUPS")
print("="*80)

# Create age bins
age_bins = [0, 12, 18, 35, 60, 100]
age_labels = ['Child', 'Teenager', 'Adult', 'Middle Age', 'Senior']

df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)

print("Age Group distribution:")
print(df['age_group'].value_counts())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count by age group
df['age_group'].value_counts().plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_title('Count by Age Group', fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Age Group')
axes[0].grid(axis='y', alpha=0.3)

# Survival by age group
survival_by_age_group = df.groupby('age_group')['survived'].mean()
survival_by_age_group.plot(kind='bar', ax=axes[1], color='salmon')
axes[1].set_title('Survival Rate by Age Group', fontweight='bold')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xlabel('Age Group')
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim(0, 1)

# Add percentage labels
for i, v in enumerate(survival_by_age_group.values):
    axes[1].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

13.4 Step 5.4: Create Fare Groups

print("="*80)
print("FEATURE ENGINEERING: FARE GROUPS")
print("="*80)

# Create fare bins based on the observed quartiles of fare
fare_bins = [0, 7.91, 14.45, 31, 513]
fare_labels = ['Low', 'Medium', 'High', 'Very High']

# include_lowest=True keeps passengers with fare == 0 in 'Low'
# (pd.cut intervals are open on the left by default, which would make them NaN)
df['fare_group'] = pd.cut(df['fare'], bins=fare_bins, labels=fare_labels,
                          include_lowest=True)

print("Fare Group distribution:")
print(df['fare_group'].value_counts())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count
df['fare_group'].value_counts().plot(kind='bar', ax=axes[0], color='gold')
axes[0].set_title('Count by Fare Group', fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].set_xlabel('Fare Group')
axes[0].grid(axis='y', alpha=0.3)

# Survival by fare group
survival_by_fare = df.groupby('fare_group')['survived'].mean()
survival_by_fare.plot(kind='bar', ax=axes[1], color='darkgreen')
axes[1].set_title('Survival Rate by Fare Group', fontweight='bold')
axes[1].set_ylabel('Survival Rate')
axes[1].set_xlabel('Fare Group')
axes[1].grid(axis='y', alpha=0.3)
axes[1].set_ylim(0, 1)

# Add percentage labels
for i, v in enumerate(survival_by_fare.values):
    axes[1].text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()
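
Instead of hard-coding the quartile edges, pd.qcut computes equal-frequency bins directly; a sketch that should reproduce roughly the same grouping:

# Quartile-based binning without manual edges
fare_group_q = pd.qcut(df['fare'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
print(fare_group_q.value_counts().sort_index())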

13.5 Step 5.5: Create is_alone Feature

print("="*80)
print("FEATURE ENGINEERING: IS_ALONE")
print("="*80)

# Create is_alone (1 if traveling alone, 0 otherwise)
df['is_alone'] = (df['family_size'] == 1).astype(int)

print("Is Alone distribution:")
print(df['is_alone'].value_counts())

# Compare with existing 'alone' column
print("\nComparison with seaborn's 'alone' column:")
print(f"Matches: {(df['is_alone'] == df['alone'].astype(int)).sum()} / {len(df)}")

# Visualize survival
fig, ax = plt.subplots(figsize=(8, 5))

survival_by_alone = df.groupby('is_alone')['survived'].mean()
# Index order is 0 (with family), 1 (alone) - keep it so labels match values
bars = ax.bar(['With Family', 'Alone'], survival_by_alone.values,
              color=['lightgreen', 'salmon'])
ax.set_ylabel('Survival Rate')
ax.set_title('Survival Rate: Alone vs With Family', fontweight='bold', fontsize=14)
ax.grid(axis='y', alpha=0.3)
ax.set_ylim(0, 1)

# Add percentage labels
for i, v in enumerate(survival_by_alone.values):
    ax.text(i, v + 0.02, f'{v*100:.1f}%', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nSurvival rate - With family: {survival_by_alone[0]*100:.1f}%")
print(f"Survival rate - Alone: {survival_by_alone[1]*100:.1f}%")

14 Part 6: Feature Encoding

14.1 Step 6.1: Label Encoding for Binary Features

print("="*80)
print("FEATURE ENCODING: BINARY FEATURES")
print("="*80)

# Sex: male=1, female=0
df['sex_encoded'] = (df['sex'] == 'male').astype(int)

print("Sex encoding:")
print(df[['sex', 'sex_encoded']].drop_duplicates())

# Embarked: Create dummy variables
print("\n\nEmbarked before encoding:")
print(df['embarked'].value_counts())
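
The comparison-based encoding above is explicit and easy to read. For reference, sklearn's LabelEncoder (imported at the top but not otherwise used in this lab) produces the same mapping, a sketch:

# LabelEncoder sorts classes alphabetically: female -> 0, male -> 1
le = LabelEncoder()
sex_le = le.fit_transform(df['sex'])
print(dict(zip(le.classes_, le.transform(le.classes_))))
print(f"Matches sex_encoded: {(sex_le == df['sex_encoded']).all()}")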

14.2 Step 6.2: One-Hot Encoding for Categorical Features

print("="*80)
print("FEATURE ENCODING: ONE-HOT ENCODING")
print("="*80)

# One-hot encode: embarked, title, family_size_category
categorical_features = ['embarked', 'title', 'family_size_category', 'age_group', 'fare_group']

# Create dummy variables (dtype=int yields numeric 0/1 columns instead of
# pandas' default bool, which keeps describe() and corr() output tidy)
df_encoded = pd.get_dummies(df, columns=categorical_features,
                            prefix=categorical_features, drop_first=True, dtype=int)

print(f"Original shape: {df.shape}")
print(f"After one-hot encoding: {df_encoded.shape}")

print("\nNew columns created:")
new_cols = [col for col in df_encoded.columns if col not in df.columns]
for col in sorted(new_cols):
    print(f"  - {col}")

14.3 Step 6.3: Label Encoding for Ordinal Features

print("="*80)
print("FEATURE ENCODING: ORDINAL FEATURES")
print("="*80)

# Pclass is already encoded as 1, 2, 3 (ordinal)
print("Pclass (already encoded ordinally):")
print(df_encoded['pclass'].value_counts().sort_index())

# Verify encoding makes sense
print("\nPclass survival rates:")
pclass_survival = df_encoded.groupby('pclass')['survived'].mean()
for pclass, rate in pclass_survival.items():
    print(f"  Class {pclass}: {rate*100:.1f}%")

15 Part 7: Feature Selection and Final Dataset

15.1 Step 7.1: Select Features for Modeling

print("="*80)
print("FEATURE SELECTION")
print("="*80)

# Drop unnecessary columns
columns_to_drop = [
    'name',           # Not present in the seaborn copy (kept for Kaggle data)
    'ticket',         # Not present in the seaborn copy (unique ID, no pattern)
    'cabin',          # Not present in the seaborn copy ('deck' carries this info)
    'sex',            # Encoded as sex_encoded
    'embarked',       # One-hot encoded
    'title',          # One-hot encoded
    'family_size_category',  # One-hot encoded
    'age_group',      # One-hot encoded
    'fare_group',     # One-hot encoded
    'who',            # Redundant with sex and age (also used to build the proxy title)
    'adult_male',     # Redundant
    'deck',           # Mostly missing; replaced by has_cabin
    'embark_town',    # Same as embarked
    'alive',          # String duplicate of survived
    'alone',          # We have is_alone
    'class'           # String duplicate of pclass
]

# Keep only existing columns
columns_to_drop = [col for col in columns_to_drop if col in df_encoded.columns]

df_final = df_encoded.drop(columns=columns_to_drop)

print(f"Columns dropped: {len(columns_to_drop)}")
print(f"Final dataset shape: {df_final.shape}")

print("\nFinal feature set:")
feature_cols = [col for col in df_final.columns if col != 'survived']
for i, col in enumerate(sorted(feature_cols), 1):
    print(f"{i:2d}. {col}")

15.2 Step 7.2: Check for Missing Values

print("="*80)
print("FINAL DATASET - MISSING VALUES CHECK")
print("="*80)

missing_final = df_final.isnull().sum()
if missing_final.sum() == 0:
    print("SUCCESS: No missing values in final dataset!")
else:
    print("WARNING: Missing values found:")
    print(missing_final[missing_final > 0])

15.3 Step 7.3: Feature Summary Statistics

print("="*80)
print("FINAL DATASET - SUMMARY STATISTICS")
print("="*80)

display(df_final.describe())

# Show data types
print("\nData types:")
print(df_final.dtypes.value_counts())

15.4 Step 7.4: Correlation with Target

# Calculate correlation with survived
correlations = df_final.corr()['survived'].drop('survived').sort_values(key=abs, ascending=False)

print("="*80)
print("FEATURE CORRELATIONS WITH SURVIVAL (Top 15)")
print("="*80)

for feature, corr in correlations.head(15).items():
    print(f"{feature:30s}: {corr:+.3f}")

# Visualize top correlations
fig, ax = plt.subplots(figsize=(10, 8))

top_features = correlations.head(15)
colors = ['green' if x > 0 else 'red' for x in top_features.values]

ax.barh(range(len(top_features)), top_features.values, color=colors, alpha=0.6)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features.index)
ax.set_xlabel('Correlation with Survival')
ax.set_title('Top 15 Features Correlated with Survival', fontweight='bold', fontsize=14)
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

16 Part 8: Prepare for Machine Learning

16.1 Step 8.1: Separate Features and Target

print("="*80)
print("PREPARING FOR MACHINE LEARNING")
print("="*80)

# Separate features (X) and target (y)
X = df_final.drop('survived', axis=1)
y = df_final['survived']

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")

print(f"\nTarget distribution:")
print(f"  Survived (1): {y.sum()} ({y.mean()*100:.2f}%)")
print(f"  Perished (0): {(1-y).sum()} ({(1-y.mean())*100:.2f}%)")

print(f"\nFeatures list ({len(X.columns)} features):")
for i, col in enumerate(X.columns, 1):
    print(f"{i:2d}. {col}")

16.2 Step 8.2: Feature Scaling (Preparation)

print("="*80)
print("FEATURE SCALING (StandardScaler)")
print("="*80)

# Initialize scaler
scaler = StandardScaler()

# Features to scale (numerical features)
numerical_features = ['age', 'fare', 'sibsp', 'parch', 'family_size']

# Create a copy for scaling
X_scaled = X.copy()

# Fit and transform
X_scaled[numerical_features] = scaler.fit_transform(X[numerical_features])

print("Before scaling:")
display(X[numerical_features].describe())

print("\nAfter scaling:")
display(X_scaled[numerical_features].describe())
Scaling Best Practices
  1. Fit on training data only: never fit the scaler on the full dataset before the train-test split! (See the sketch after this list.)

  2. Transform test data: reuse the scaler that was fitted on the training data

  3. When to scale:

    • Tree-based models (Random Forest, XGBoost): no scaling needed
    • Distance-based models (KNN, SVM): scaling needed
    • Neural Networks: scaling needed
    • Linear/logistic regression: scaling recommended (effectively required with regularization or gradient-based solvers)
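
A minimal sketch of points 1 and 2, reusing the X, y, numerical_features, and StandardScaler already defined above (train_test_split comes from sklearn.model_selection):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only...
scaler = StandardScaler().fit(X_train[numerical_features])

# ...then apply the SAME learned mean/std to both splits
X_train = X_train.copy()
X_test = X_test.copy()
X_train[numerical_features] = scaler.transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])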

16.3 Step 8.3: Save Preprocessed Data

# Save final preprocessed dataset
output_path = 'titanic_preprocessed.csv'
df_final.to_csv(output_path, index=False)

print("="*80)
print("DATASET SAVED")
print("="*80)
print(f"File: {output_path}")
print(f"Shape: {df_final.shape}")
print(f"Size: {df_final.memory_usage(deep=True).sum() / 1024:.2f} KB")

17 Part 9: Final Summary

17.1 Step 9.1: Preprocessing Pipeline Summary

print("="*80)
print("DATA PREPROCESSING PIPELINE SUMMARY")
print("="*80)

summary = {
    'Original Dataset': {
        'Rows': len(titanic_raw),
        'Columns': len(titanic_raw.columns),
        'Missing Values': titanic_raw.isnull().sum().sum(),
    },
    'Final Dataset': {
        'Rows': len(df_final),
        'Columns': len(df_final.columns),
        'Missing Values': df_final.isnull().sum().sum(),
    },
    'Transformations': {
        'Missing Values Handled': 'Embarked (mode), Age (median by sex/pclass), Deck (binary has_cabin feature)',
        'Outliers Handled': 'Kept (legitimate values)',
        'Features Engineered': 'title, family_size, is_alone, age_group, fare_group, has_cabin',
        'Encoding Applied': 'One-hot (embarked, title, categories), Label (sex)',
        'Features Dropped': len(columns_to_drop),
        'Features Created': len(df_final.columns) - len(titanic_raw.columns) + len(columns_to_drop)
    }
}

for section, details in summary.items():
    print(f"\n{section}:")
    for key, value in details.items():
        print(f"  {key}: {value}")

17.2 Step 9.2: Key Insights

print("\n" + "="*80)
print("KEY INSIGHTS FROM PREPROCESSING")
print("="*80)

insights = [
    "1. Gender was the strongest predictor: Women had 74% survival vs 19% for men",
    "2. Class matters: 1st class 63% survival, 3rd class only 24%",
    "3. Age affects survival: Children had highest survival rates",
    "4. Family size is important: Small families (2-4) had best survival rates",
    "5. Fare correlates with survival: Higher fare = higher survival (class proxy)",
    "6. Title reveals social status and age: Miss/Mrs higher survival than Mr",
    "7. Having cabin info indicates higher class and survival",
    "8. Embarked from Cherbourg had higher survival (more 1st class passengers)"
]

for insight in insights:
    print(f"\n{insight}")

17.3 Step 9.3: Next Steps for Machine Learning

print("\n" + "="*80)
print("READY FOR MACHINE LEARNING!")
print("="*80)

next_steps = """
Next steps to build prediction models:

1. TRAIN-TEST SPLIT
   - Split data 80-20 or 70-30
   - Use stratified split to preserve class balance
   - Set random_state for reproducibility

2. MODEL SELECTION
   - Start with baseline (Logistic Regression)
   - Try ensemble methods (Random Forest, XGBoost)
   - Compare performance metrics

3. MODEL EVALUATION
   - Accuracy, Precision, Recall, F1-Score
   - Confusion Matrix
   - ROC Curve and AUC

4. HYPERPARAMETER TUNING
   - Grid Search or Random Search
   - Cross-validation

5. FEATURE IMPORTANCE
   - Identify most important features
   - Consider feature selection

The preprocessing is complete and data is ready for modeling!
"""

print(next_steps)
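
As a bridge to the next lab, a minimal end-to-end sketch of steps 1-3: a stratified split and a Logistic Regression baseline wrapped in a scikit-learn Pipeline, so the scaler is fitted on training data only (the model choice and parameters here are illustrative, not prescribed by this lab):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

baseline = Pipeline([
    ('scaler', StandardScaler()),            # fitted inside the pipeline, on train only
    ('model', LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

y_pred = baseline.predict(X_test)
print(f"Baseline accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred))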

Lab Deliverables

What to Submit

  1. Jupyter Notebook (.ipynb) containing:

    • All code cells, executed
    • Clear output visualizations
    • Markdown cells with analysis and interpretation
    • Conclusions for each preprocessing stage
  2. Preprocessed Dataset (titanic_preprocessed.csv)

    • The final preprocessing result, ready for modeling
  3. Report (PDF) containing:

    • An executive summary of the preprocessing pipeline
    • A justification for every preprocessing decision
    • Visualizations of the key insights
    • Recommendations for modeling

Grading Rubric

See rubric.md for the detailed grading criteria.

Total: 25 points


References

  1. Kaggle Titanic Competition: https://www.kaggle.com/c/titanic
  2. Pandas Documentation: https://pandas.pydata.org/docs/
  3. Scikit-learn Preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
  4. Seaborn Tutorial: https://seaborn.pydata.org/tutorial.html

Tips for Success
  1. Understand the why: don't just copy-paste code - understand the reasoning behind every preprocessing step
  2. Document decisions: record why one method was chosen over another
  3. Visualize everything: plots help you understand the data and communicate insights
  4. Think about leakage: never use information from the test set when preprocessing the training set
  5. Iterate and experiment: try different strategies and compare the results