Chapter 2: Data Preprocessing and Exploratory Data Analysis (EDA)

Preparing High-Quality Data for Machine Learning

🎯 Learning Outcomes

After studying this chapter, you will be able to:

  1. Perform comprehensive exploratory data analysis (EDA) to understand the structure and characteristics of a dataset
  2. Identify and handle missing values with appropriate strategies
  3. Detect and treat outliers using a variety of techniques
  4. Encode categorical variables with methods suited to ML
  5. Apply feature scaling and normalization effectively
  6. Engineer new features that improve a model's predictive power
  7. Prevent data leakage and ensure data quality for production models

2.1 Introduction: The Importance of Data Quality

The Garbage In, Garbage Out (GIGO) Principle

"Garbage in, garbage out" - a fundamental principle in machine learning

Data is the heart of every ML project. Even the best algorithm cannot produce a high-quality model if it is trained on low-quality data. In practice, data scientists spend roughly 70-80% of their time on data preprocessing and EDA, not on modeling.

In this chapter, we will learn:

  • How to understand data in depth through EDA
  • How to clean and prepare data for modeling
  • How to handle common problems such as missing values and outliers
  • How to engineer more powerful features
  • How to prevent data leakage and ensure reproducibility

Key Questions:

  • How do you explore a new dataset systematically?
  • What are the common data quality problems, and how do you fix them?
  • How do you prepare data so it is ready for ML algorithms?

2.2 Exploratory Data Analysis (EDA)

Important

EDA is the systematic investigation of a dataset to understand its characteristics, patterns, and anomalies. EDA is not about building models; it is about understanding the data and finding the insights hidden within it.

2.2.1 EDA Goals and Activities

Goals of EDA:

  • Understand data structure and content
  • Identify missing values, outliers, and anomalies
  • Discover patterns and relationships
  • Generate hypotheses for further analysis
  • Inform preprocessing and feature engineering decisions

EDA Activities:

Code
flowchart TD
    A["Dataset"] --> B["Load & Inspect"]
    B --> C["Basic Statistics"]
    C --> D["Univariate Analysis"]
    D --> D1["Distributions \n Histograms \n Box Plots"]

    D1 --> E["Bivariate Analysis"]
    E --> E1["Correlations \n Scatter Plots \n Pair Plots"]

    E1 --> F["Multivariate Analysis"]
    F --> F1["Heatmaps \n PCA Plots \n Clustering"]

    F1 --> G["Data Quality Report"]
    G --> H["Ready for Preprocessing"]

    style A fill:#e8f5e9
    style H fill:#c8e6c9
    style D fill:#a1d99f
    style E fill:#a1d99f
    style F fill:#a1d99f


Exploratory Data Analysis Process Flow

2.2.2 Step 1: Load & Inspect Data

The first step is to load the data and get an initial overview.

Code
import pandas as pd
import numpy as np

# Load sample dataset (Titanic)
url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'
df = pd.read_csv(url)

print("📊 DATASET SHAPE:")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\n📋 FIRST FEW ROWS:")
print(df.head())

print("\n📌 DATA TYPES:")
print(df.dtypes)

print("\n❓ MISSING VALUES:")
print(df.isnull().sum())

print("\n📏 BASIC INFO:")
print(f"Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")
print(f"Duplicate rows: {df.duplicated().sum()}")

2.2.3 Step 2: Univariate Analysis

Analyze each variable individually to understand its distribution.

For Numeric Variables:

  • Central tendency: Mean, median, mode
  • Spread: Std dev, variance, range
  • Shape: Skewness, kurtosis
  • Visualization: Histogram, KDE plot, box plot
Univariate analysis examples
import matplotlib.pyplot as plt
import seaborn as sns

# Descriptive statistics
print("📊 DESCRIPTIVE STATISTICS:")
print(df.describe())

print("\n📈 SKEWNESS & KURTOSIS:")
print(f"Age skewness: {df['Age'].skew():.3f}")
print(f"Age kurtosis: {df['Age'].kurtosis():.3f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
df['Age'].hist(bins=30, ax=axes[0], edgecolor='black')
axes[0].set_title('Age Distribution (Histogram)')
axes[0].set_xlabel('Age')

# KDE plot
df['Age'].plot(kind='kde', ax=axes[1])
axes[1].set_title('Age Distribution (KDE)')
axes[1].set_xlabel('Age')

# Box plot
df['Age'].plot(kind='box', ax=axes[2])
axes[2].set_title('Age (Box Plot)')

plt.tight_layout()
plt.show()

print("\n✅ Univariate analysis shows Age distribution")

For Categorical Variables:

  • Frequency distribution
  • Unique values count
  • Mode
  • Visualization: Bar plot, pie chart
Categorical variable analysis
print("📊 CATEGORICAL VARIABLES:")
print(f"Embarked unique values: {df['Embarked'].nunique()}")
print(f"Embarked value counts:\n{df['Embarked'].value_counts()}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df['Embarked'].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Embarked Port Distribution')
axes[0].set_xlabel('Port')

df['Pclass'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
axes[1].set_title('Passenger Class Distribution')

plt.tight_layout()
plt.show()

2.2.4 Step 3: Bivariate Analysis

Analyze the relationship between two variables.

Bivariate analysis examples
# Correlation matrix
print("📊 CORRELATION MATRIX:")
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numeric_cols].corr()
print(corr_matrix)

# Scatter plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Numeric vs Numeric
df.plot(x='Age', y='Fare', kind='scatter', ax=axes[0], alpha=0.5)
axes[0].set_title('Age vs Fare Relationship')

# Categorical vs Numeric
df.boxplot(column='Age', by='Pclass', ax=axes[1])
axes[1].set_title('Age by Passenger Class')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

print("\n📈 Survival rate by gender:")
print(df.groupby('Sex')['Survived'].mean())

2.2.5 Step 4: Multivariate Analysis & Correlation Heatmap

Correlation heatmap
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

print("✅ Heatmap reveals feature relationships")
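
The flowchart earlier also lists PCA plots among the multivariate tools. As a complement to the heatmap, below is a minimal sketch of a 2-component PCA projection; the column selection and plot styling are illustrative, not prescribed by the original analysis.

PCA projection sketch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize a few numeric columns, then project onto two principal components
num_df = df[['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']].dropna()
X_std = StandardScaler().fit_transform(num_df)

pca = PCA(n_components=2)
components = pca.fit_transform(X_std)

plt.figure(figsize=(6, 5))
plt.scatter(components[:, 0], components[:, 1],
            c=df.loc[num_df.index, 'Survived'], cmap='coolwarm', alpha=0.5)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Projection of Numeric Features')
plt.show()

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")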

2.3 Data Cleaning & Preprocessing

Real-world data always has problems. Let's learn how to handle them.

2.3.1 Handling Missing Values

Missing values can occur because of:

  • Data entry errors
  • Equipment failure
  • Confidentiality reasons
  • Data loss
Missing Value Strategies

Never drop all rows with missing values without careful consideration!

Choose a strategy based on:

  • How many values are missing (percentage/threshold)?
  • Whether the values are MCAR, MAR, or MNAR (missing completely at random, at random, or not at random); see the diagnostic sketch below
  • How important the variable is
  • The impact on model performance
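
A quick way to probe whether missingness is MCAR or MAR is to compare another variable across rows with and without the value. Below is a minimal diagnostic sketch, assuming the Titanic frame df loaded earlier:

Missingness diagnostic sketch
# If survival rates differ sharply between rows with and without a recorded
# Age, the missingness is probably not completely at random (MCAR)
age_missing = df['Age'].isnull()
print(df.groupby(age_missing)['Survived'].mean())

# Cross-tabulate missingness against another feature such as Pclass
print(pd.crosstab(age_missing, df['Pclass'], normalize='columns'))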

Missing Value Handling Strategies:

Code
flowchart TD
    A["Missing Values \n Detected"] --> B{"Percentage \n Missing?"}

    B -->|> 50%| C["Drop Column \n Too Much Missing"]
    B -->|< 50%| D{"Type of \n Variable?"}

    D -->|Numeric| E["Fill Strategies"]
    D -->|Categorical| F["Fill Strategies"]

    E --> E1["Mean/Median \n Forward Fill \n Interpolation"]
    F --> F1["Mode \n Create 'Unknown' \n Forward Fill"]

    E1 --> G["Check Distribution"]
    F1 --> G

    G --> H["Evaluate Impact"]
    H --> I["Proceed with Model"]

    style C fill:#ffcdd2
    style E1 fill:#c8e6c9
    style F1 fill:#c8e6c9
    style I fill:#a5d6a7


Missing Value Handling Strategies

Imputation Methods:

Missing value handling examples
print("❓ MISSING VALUES ANALYSIS:")
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0])

# Strategy 1: Drop column if > 50% missing
print("\n📋 STRATEGY 1: Drop Columns")
df_clean = df.drop(['Cabin'], axis=1)  # Cabin is ~77% missing
print("Dropped Cabin column")

# Strategy 2: Impute numeric values with the median
print("\n📋 STRATEGY 2: Impute Numeric with Median")
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
print("Age missing values filled with median")

# Strategy 3: Impute categorical values with the mode
print("\n📋 STRATEGY 3: Impute Categorical with Mode")
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])
print("Embarked filled with mode")

# Strategy 4: Forward fill (for time series)
print("\n📋 STRATEGY 4: Forward Fill (Time Series)")
# df_clean = df_clean.ffill()
print("For time series, forward fill propagates the previous value")

print("\n✅ Missing values handled:")
print(df_clean.isnull().sum())

2.3.2 Handling Outliers

Outliers are values that differ greatly from the majority of the data. They can be:

  • Valid: natural variation (e.g., a billionaire in an income dataset)
  • Invalid: data entry errors, sensor malfunction

Detecting Outliers:

Method 1: IQR (Interquartile Range)

Q1 = 25th percentile
Q3 = 75th percentile
IQR = Q3 - Q1

Lower Bound = Q1 - 1.5 * IQR
Upper Bound = Q3 + 1.5 * IQR

Outliers = values outside [Lower Bound, Upper Bound]

Method 2: Z-Score

Z = (value - mean) / std_dev

|Z| > 3 → likely outlier (about 99.7% of normally distributed data falls within |Z| ≤ 3)
Outlier detection methods
# Method 1: IQR
print("🔍 OUTLIER DETECTION - IQR METHOD:")
Q1 = df_clean['Age'].quantile(0.25)
Q3 = df_clean['Age'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = df_clean[(df_clean['Age'] < lower_bound) |
                        (df_clean['Age'] > upper_bound)]
print(f"IQR bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")
print(f"Outliers detected: {len(outliers_iqr)}")

# Method 2: Z-Score
print("\n🔍 OUTLIER DETECTION - Z-SCORE METHOD:")
from scipy import stats
z_scores = np.abs(stats.zscore(df_clean['Age'].dropna()))
outliers_zscore = (z_scores > 3).sum()
print(f"Outliers (|Z| > 3): {outliers_zscore}")

# Handling outliers
print("\n✅ OUTLIER HANDLING STRATEGIES:")
print("1. Remove: df.drop(outlier_indices)")
print("2. Cap: df['col'].clip(lower, upper)")
print("3. Transform: log, sqrt, Box-Cox")
print("4. Keep: if valid and important")

# Example: Cap at bounds
df_clean['Age_capped'] = df_clean['Age'].clip(lower_bound, upper_bound)
print("\n✅ Age outliers capped at IQR bounds")
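
Strategy 3 above mentions transformations. Here is a minimal sketch of a log transform on the right-skewed Fare column; np.log1p is used so zero fares are handled, and the Fare_log column name is illustrative:

Log transform sketch
# A log transform compresses the long right tail of Fare
print(f"Fare skewness before: {df_clean['Fare'].skew():.2f}")
df_clean['Fare_log'] = np.log1p(df_clean['Fare'])
print(f"Fare skewness after log1p: {df_clean['Fare_log'].skew():.2f}")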

2.3.3 Encoding Categorical Variables

ML algorithms require numeric input, so categorical variables must be encoded.

Encoding Strategies:

Code
flowchart TD
    A["Categorical Variable"] --> B{"Ordinal \n or \n Nominal?"}

    B -->|Ordinal| C["Ordinal Encoding"]
    C --> C1["low→1, medium→2 \n high→3 \n Preserves order"]

    B -->|Nominal| D{"Binary \n or \n Multi?"}

    D -->|Binary| E["Label Encoding"]
    E --> E1["No→0, Yes→1 \n Simple & Fast"]

    D -->|Multi| F["One-Hot Encoding"]
    F --> F1["Create dummy \n variables \n No multicollinearity"]

    C1 --> G["Ready for Model"]
    E1 --> G
    F1 --> G

    style C1 fill:#bbdefb
    style E1 fill:#90caf9
    style F1 fill:#64b5f6
    style G fill:#c8e6c9


Categorical Variable Encoding Strategies

Categorical encoding examples
print("📋 ENCODING STRATEGIES:")

# Strategy 1: Label Encoding (binary categorical)
print("\n1️⃣ LABEL ENCODING (Binary):")
df_clean['Sex_encoded'] = df_clean['Sex'].map({'male': 1, 'female': 0})
print(df_clean[['Sex', 'Sex_encoded']].head())

# Strategy 2: Ordinal Encoding (ordered categorical)
print("\n2️⃣ ORDINAL ENCODING (Ordered Categories):")
class_mapping = {'Lower': 1, 'Middle': 2, 'Upper': 3}
# df_clean['Class_encoded'] = df_clean['Class'].map(class_mapping)
print("Maps preserve order: Lower < Middle < Upper")

# Strategy 3: One-Hot Encoding (nominal categorical)
print("\n3️⃣ ONE-HOT ENCODING (Nominal):")
embarked_encoded = pd.get_dummies(df_clean['Embarked'], prefix='Embarked')
print(embarked_encoded.head())

print("\n✅ Categorical variables encoded for ML")
One-Hot Encoding Caution

Dummy Variable Trap: if a variable has k categories, do not create k dummy variables!

Use k-1 dummy variables (drop the first category) to avoid perfect multicollinearity.

pd.get_dummies(df['col'], drop_first=True)
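
A quick check of the resulting column counts illustrates the trap; this sketch reuses the Embarked column (3 categories) from df_clean:

Dummy variable trap demonstration
# Embarked has 3 categories: 3 dummies without drop_first, 2 with it
full = pd.get_dummies(df_clean['Embarked'], prefix='Embarked')
reduced = pd.get_dummies(df_clean['Embarked'], prefix='Embarked', drop_first=True)
print(f"k dummies:   {list(full.columns)}")    # collinear with an intercept
print(f"k-1 dummies: {list(reduced.columns)}")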

2.3.4 Feature Scaling & Normalization

Some algorithms are sensitive to feature scale (e.g., distance-based and gradient-based methods). Scaling brings features onto a comparable range.

Scaling Methods:

Feature scaling methods
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Original data
ages = df_clean['Age'].values.reshape(-1, 1)

print("📊 ORIGINAL DATA STATISTICS:")
print(f"Mean: {ages.mean():.2f}, Std: {ages.std():.2f}")
print(f"Min: {ages.min():.2f}, Max: {ages.max():.2f}")

# Method 1: Standardization (Z-score normalization)
print("\n1️⃣ STANDARDIZATION (Z-score):")
scaler_std = StandardScaler()
ages_standardized = scaler_std.fit_transform(ages)
print(f"After standardization - Mean: {ages_standardized.mean():.2f}, Std: {ages_standardized.std():.2f}")
print(f"Formula: (x - mean) / std")

# Method 2: Min-Max Scaling (Normalization)
print("\n2️⃣ MIN-MAX SCALING (Normalization):")
scaler_minmax = MinMaxScaler(feature_range=(0, 1))
ages_normalized = scaler_minmax.fit_transform(ages)
print(f"After min-max - Min: {ages_normalized.min():.2f}, Max: {ages_normalized.max():.2f}")
print(f"Formula: (x - min) / (max - min)")

# Method 3: Robust Scaling (resistant to outliers)
print("\n3️⃣ ROBUST SCALING (Resistant to Outliers):")
scaler_robust = RobustScaler()
ages_robust = scaler_robust.fit_transform(ages)
print(f"After robust scaling - Median: {np.median(ages_robust):.2f}")
print(f"Formula: (x - median) / IQR")

print("\n📊 COMPARISON:")
print(f"Original range: [{ages.min():.2f}, {ages.max():.2f}]")
print(f"Standardized range: [{ages_standardized.min():.2f}, {ages_standardized.max():.2f}]")
print(f"Normalized range: [{ages_normalized.min():.2f}, {ages_normalized.max():.2f}]")

print("\n✅ Choose scaling based on algorithm requirements")

When should you scale?

Algorithm                                Needs Scaling?
Linear Regression, Logistic Regression   ✅ Yes
Neural Networks                          ✅ Yes
SVM, KNN                                 ✅ Yes
Decision Trees, Random Forest            ❌ No
Gradient Boosting                        ❌ No
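
To see why distance-based models care about scale, the minimal sketch below compares KNN with and without scaling on two Titanic features; the feature choice is illustrative and the exact accuracies will vary:

KNN with and without scaling
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Without scaling, Fare (large range) dominates Age in Euclidean distance
data = df_clean[['Age', 'Fare', 'Survived']].dropna()
X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare']], data['Survived'], test_size=0.2, random_state=42)

knn_raw = KNeighborsClassifier().fit(X_train, y_train)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

print(f"KNN accuracy without scaling: {knn_raw.score(X_test, y_test):.3f}")
print(f"KNN accuracy with scaling:    {knn_scaled.score(X_test, y_test):.3f}")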

2.4 Feature Engineering

Feature engineering is the art and science of creating new features from raw data to improve model performance.

2.4.1 Types of Feature Engineering

1. Polynomial Features

Polynomial feature creation
from sklearn.preprocessing import PolynomialFeatures

# Original feature
X = df_clean[['Age']].values[:100]

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print("📊 POLYNOMIAL FEATURES:")
print(f"Original: {X.shape[1]} feature")
print(f"After polynomial (degree 2): {X_poly.shape[1]} features")
print("Features: Age, Age²")
print("\nExample:")
print("Age=25 → [25, 625]")

2. Interaction Features

Code
# Create interaction feature
df_clean['Age_Fare_Interaction'] = df_clean['Age'] * df_clean['Fare']

print("📊 INTERACTION FEATURES:")
print("Combined Age × Fare")
print(df_clean[['Age', 'Fare', 'Age_Fare_Interaction']].head())

3. Domain-Specific Features

Code
# Extract title from name
print("📊 DOMAIN-SPECIFIC FEATURES:")
df_clean['Title'] = df_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print("Extracted title from passenger name:")
print(df_clean[['Name', 'Title']].head())

# Family size feature
print("\nFamily size from SibSp and Parch:")
df_clean['FamilySize'] = df_clean['SibSp'] + df_clean['Parch'] + 1
print(df_clean[['SibSp', 'Parch', 'FamilySize']].head())

2.4.2 Feature Selection

Not all features are useful. Feature selection removes features that:

  • Are not relevant to the target
  • Are redundant with other features
  • Add noise or encourage overfitting

Feature Selection Methods:

Code
flowchart TD
    A["Features \n Brainstorm"] --> B{"Selection \n Method?"}

    B --> C["Statistical Methods"]
    C --> C1["Univariate Tests \n Correlation Analysis \n Mutual Information"]

    B --> D["Model-Based"]
    D --> D1["Feature Importance \n Permutation \n SHAP Values"]

    B --> E["Iterative"]
    E --> E1["Recursive Elimination \n Forward Selection \n Backward Elimination"]

    C1 --> F["Selected Features"]
    D1 --> F
    E1 --> F

    F --> G["Model Training"]

    style F fill:#c8e6c9
    style G fill:#a5d6a7


Feature Selection Methods

Feature selection techniques
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# Prepare data
X = df_clean[['Age', 'Fare', 'Pclass']].dropna()
y = df_clean.loc[X.index, 'Survived']

print("📊 FEATURE SELECTION METHODS:")

# Method 1: Univariate statistical test
print("\n1️⃣ UNIVARIATE STATISTICAL TEST:")
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
scores = selector.scores_

for i, col in enumerate(X.columns):
    print(f"{col}: {scores[i]:.2f}")

# Method 2: Feature Importance from Random Forest
print("\n2️⃣ FEATURE IMPORTANCE (Random Forest):")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

for i, col in enumerate(X.columns):
    print(f"{col}: {rf.feature_importances_[i]:.3f}")

print("\n✅ Features ranked by importance")
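
The flowchart also lists iterative methods such as recursive feature elimination (RFE). Below is a minimal sketch on the same X and y; the estimator choice is illustrative:

Recursive feature elimination sketch
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Method 3: Recursive Feature Elimination (iterative, model-based)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

for col, keep, rank in zip(X.columns, rfe.support_, rfe.ranking_):
    print(f"{col}: selected={keep}, rank={rank}")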

2.5 Data Validation & Quality Checks

Before modeling, make sure your data is of high quality.

2.5.1 Data Leakage Prevention

⚠️ Data Leakage Warning!

Data leakage occurs when information from the test set "leaks" into the training set, causing overly optimistic model performance.

Examples of data leakage:

  1. Scaling the data before the train-test split
  2. Using future information to predict the past
  3. Including target-variable information in the features
  4. Tuning hyperparameters on the test set

Best practices for preventing data leakage:

Data leakage prevention
from sklearn.model_selection import train_test_split

print("✅ CORRECT WORKFLOW:")
print("""
1. Train-test split FIRST
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

2. Fit the preprocessor ONLY on the training data
   scaler.fit(X_train)  # NOT fit on the entire X!

3. Transform BOTH sets with the already-fitted scaler
   X_train_scaled = scaler.transform(X_train)
   X_test_scaled = scaler.transform(X_test)

4. Train the model on the training set only
   model.fit(X_train_scaled, y_train)

5. Evaluate on the test set
   score = model.score(X_test_scaled, y_test)
""")

print("❌ INCORRECT WORKFLOW (leads to data leakage):")
print("""
1. Scale all the data at once
   X_scaled = scaler.fit_transform(X)  # WRONG!

2. Only then split
   X_train, X_test = train_test_split(X_scaled)

   Result: the scaler has already "seen" the test set → leakage!
""")

2.5.2 Data Quality Checklist

# Comprehensive data quality checks
print("""
📋 DATA QUALITY CHECKLIST:

Completeness:
✓ No critical missing values (or properly handled)
✓ All rows have valid data

Accuracy:
✓ Values are within expected ranges
✓ No obvious data entry errors
✓ Outliers validated as legitimate

Consistency:
✓ Consistent data types
✓ Consistent naming conventions
✓ No duplicates (unless intentional)

Validity:
✓ All values conform to business rules
✓ Categorical variables have expected categories
✓ Numeric values are reasonable

Uniqueness:
✓ Primary key is unique
✓ No unintended duplicates
✓ De-duplication done properly

Timeliness:
✓ Data is current/recent enough
✓ No time-series data gaps
✓ Temporal consistency maintained
""")

2.6 Complete Preprocessing Pipeline

Let's combine all of the steps into a reusable preprocessing pipeline.

Complete preprocessing pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

print("🔧 BUILDING PREPROCESSING PIPELINE:")

# Define numeric and categorical columns
numeric_features = ['Age', 'Fare']
categorical_features = ['Embarked', 'Sex']

# Create transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

print("✅ Preprocessing pipeline created")
print("\nSteps:")
print("1. Impute missing values")
print("2. Scale numeric features")
print("3. Encode categorical features")
print("4. Combine all features")

# Fit preprocessing on training data
X = df_clean[numeric_features + categorical_features]
X_preprocessed = preprocessor.fit_transform(X)

print(f"\n✅ Input shape: {X.shape}")
print(f"✅ Output shape: {X_preprocessed.shape}")
print("✅ Ready for modeling!")
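
The real payoff of a pipeline is chaining it with a model, so that cross-validation refits the imputer, scaler, and encoder inside every fold and leakage is avoided automatically. A minimal sketch; the classifier choice is illustrative:

Pipeline with model and cross-validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Preprocessing + model in one estimator: each CV fold fits the
# preprocessor on its own training portion only
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

y = df_clean['Survived']
scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")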

2.7 Case Study: Titanic Dataset Preprocessing

Let's apply all of these concepts to the Titanic dataset.

Problem

Predict whether a Titanic passenger survived (survival prediction)

Step-by-Step Preprocessing

Complete Titanic preprocessing
print("🚢 TITANIC DATASET PREPROCESSING:\n")

# 1. Load data
print("1️⃣ LOAD DATA")
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv')
print(f"Shape: {df.shape}")

# 2. Analyze missing values
print("\n2️⃣ ANALYZE MISSING VALUES")
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0])

# 3. Handle missing values
print("\n3️⃣ HANDLE MISSING VALUES")
df_processed = df.copy()
df_processed['Age'] = df_processed['Age'].fillna(df_processed['Age'].median())
df_processed['Embarked'] = df_processed['Embarked'].fillna(df_processed['Embarked'].mode()[0])
df_processed = df_processed.drop(['Cabin'], axis=1)
print("✅ Missing values handled")

# 4. Feature engineering
print("\n4️⃣ FEATURE ENGINEERING")
df_processed['Title'] = df_processed['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df_processed['FamilySize'] = df_processed['SibSp'] + df_processed['Parch'] + 1
df_processed['IsAlone'] = (df_processed['FamilySize'] == 1).astype(int)
print("✅ New features created: Title, FamilySize, IsAlone")

# 5. Encode categorical variables
print("\n5️⃣ ENCODE CATEGORICAL VARIABLES")
df_processed['Sex'] = df_processed['Sex'].map({'male': 1, 'female': 0})
df_processed = pd.get_dummies(df_processed, columns=['Embarked', 'Title'], drop_first=True)
print("✅ Categorical variables encoded")

# 6. Select features for modeling
print("\n6️⃣ SELECT FEATURES")
feature_cols = [col for col in df_processed.columns if col not in ['PassengerId', 'Name', 'Ticket', 'Survived']]
X = df_processed[feature_cols]
y = df_processed['Survived']
print(f"✅ Features selected: {X.shape[1]} features, {X.shape[0]} samples")

# 7. Final validation
print("\n7️⃣ FINAL VALIDATION")
print(f"No missing values remaining: {X.isnull().sum().sum() == 0}")
print(f"All features numeric: {all(pd.api.types.is_numeric_dtype(t) for t in X.dtypes)}")
print(f"Target class distribution: {y.value_counts().to_dict()}")

print("\n✅ DATA READY FOR MODELING!")
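
As a final step before Chapter 3, split the data first so that any further fitting touches only the training portion, per Section 2.5.1. A minimal sketch:

Final train-test split sketch
from sklearn.model_selection import train_test_split

# Split before any further fitting so the test set stays untouched
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")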

🧪 Hands-on Exercise 2

Objective: practice EDA and preprocessing on a real dataset

Instructions:

  1. Load a dataset (Titanic or Iris)

  2. Perform a complete EDA:

    • Check shape, dtypes, missing values
    • Univariate analysis (histograms, box plots)
    • Bivariate analysis (correlations, scatter plots)
  3. Preprocess the data:

    • Handle missing values
    • Detect and handle outliers
    • Encode categorical variables
  4. Apply feature engineering:

    • Create at least 2 new features
    • Perform feature selection
  5. Build a preprocessing pipeline

Bonus: Visualize before/after distributions


πŸ“ Review Questions

Conceptual Questions

  1. Explain the difference between exploratory data analysis (EDA) and data preprocessing. Why are both stages important?

  2. You find that 70% of the values in a column are missing. What is the right decision, and why?

  3. Explain three methods for detecting outliers. How do you determine whether an outlier is an error or valid data?

  4. What are the differences between standardization (z-score), normalization (min-max), and robust scaling? When should each be used?

  5. Explain data leakage and give two concrete examples. How can it be prevented?

Practical Questions

  1. Your dataset has an 'Age' column with values ranging from 0 to 1500 (an error). What is the right cleaning strategy?

  2. The 'Gender' variable has 3 categories: 'M', 'F', and 'Unknown'. How would you encode it for an ML model?

  3. You have a 'Birth_Date' feature. What features could you engineer from it for churn prediction?

  4. After preprocessing, you have 150 features. How do you select the 20 best ones?

  5. Where does scaling belong in the preprocessing pipeline? Why does the timing matter?


🎯 Key Takeaways

✅ EDA is a critical step for understanding your data in depth

✅ Missing values must be handled with an appropriate strategy, not simply dropped

✅ Outliers can be valid or erroneous - validate before handling them

✅ Encoding categorical variables is a requirement for ML algorithms

✅ Feature scaling matters for distance-based and gradient-based algorithms

✅ Feature engineering can significantly improve model performance

✅ Data leakage prevention is fundamental for reliable models

✅ Preprocessing pipelines should be reproducible and automated


📚 References

Further Reading

  • Pandas Documentation: https://pandas.pydata.org/
  • Scikit-learn Preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
  • Feature Engineering Tutorial: https://www.kaggle.com/learn/feature-engineering

Tools & Libraries

  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computing
  • Scikit-learn: Preprocessing and feature engineering
  • Matplotlib/Seaborn: Data visualization

✨ Conclusion

Data preprocessing and EDA are the foundation of successful machine learning projects. The time you invest in understanding and cleaning your data pays large dividends in the form of better model performance.

In Chapter 3 (Classical ML Algorithms), we will use this preprocessed data to train a variety of algorithms and understand the strengths and weaknesses of each. 🚀