Mempersiapkan Data Berkualitas Tinggi untuk Machine Learning
Bab 2: Data Preprocessing dan Exploratory Data Analysis (EDA)
Hasil Pembelajaran (Learning Outcomes)
Setelah mempelajari bab ini, Anda akan mampu:
Melakukan exploratory data analysis (EDA) komprehensif untuk memahami struktur dan karakteristik data
Mengidentifikasi dan menangani missing values dengan strategi yang tepat
Mendeteksi dan mengatasi outliers menggunakan berbagai teknik
Mengenkode variabel kategorikal dengan metode yang sesuai untuk ML
Menerapkan feature scaling dan normalization secara efektif
Merekayasa fitur-fitur baru yang meningkatkan predictive power model
Mencegah data leakage dan memastikan data quality untuk production models
2.1 Pengantar: Pentingnya Data Quality
Prinsip Garbage In, Garbage Out (GIGO)
"Garbage in, garbage out" - prinsip fundamental dalam machine learning
Data adalah jantung dari setiap ML project. Algoritma terbaik sekalipun tidak bisa menghasilkan model berkualitas jika dilatih dengan data berkualitas rendah. Dalam praktik nyata, 70-80% waktu data scientist dihabiskan untuk data preprocessing dan EDA, bukan untuk modeling.
Dalam bab ini, kita akan mempelajari:
Bagaimana memahami data secara mendalam melalui EDA
Bagaimana membersihkan dan mempersiapkan data untuk modeling
Bagaimana menangani masalah umum seperti missing values dan outliers
Bagaimana merekayasa fitur yang lebih powerful
Bagaimana mencegah data leakage dan memastikan reproducibility
Pertanyaan Utama:
Bagaimana cara memahami dataset baru secara sistematis?
Apa masalah data quality yang umum dan bagaimana mengatasinya?
Bagaimana cara mempersiapkan data sehingga siap untuk algoritma ML?
2.2 Exploratory Data Analysis (EDA)
Important
EDA adalah proses investigasi sistematis terhadap dataset untuk memahami karakteristik, pola, dan anomalinya. EDA bukan tentang membuat model, tetapi tentang memahami data, mencari insight yang tersembunyi dalam data.
2.2.1 Tujuan dan Aktivitas EDA
Tujuan EDA:
Understand data structure dan content
Identify missing values, outliers, anomalies
Discover patterns dan relationships
Generate hypotheses untuk analisis selanjutnya
Inform preprocessing dan feature engineering decisions
Aktivitas EDA:
flowchart TD
A["Dataset"] --> B["Load & Inspect"]
B --> C["Basic Statistics"]
C --> D["Univariate Analysis"]
D --> D1["Distributions \n Histograms \n Box Plots"]
D1 --> E["Bivariate Analysis"]
E --> E1["Correlations \n Scatter Plots \n Pair Plots"]
E1 --> F["Multivariate Analysis"]
F --> F1["Heatmaps \n PCA Plots \n Clustering"]
F1 --> G["Data Quality Report"]
G --> H["Ready for Preprocessing"]
style A fill:#e8f5e9
style H fill:#c8e6c9
style D fill:#a1d99f
style E fill:#a1d99f
style F fill:#a1d99f
Exploratory Data Analysis Process Flow
2.2.2 Step 1: Load & Inspect Data
Langkah pertama adalah memuat data dan mendapatkan gambaran awal.
```python
import pandas as pd
import numpy as np

# Load sample dataset (Titanic)
url = 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'
df = pd.read_csv(url)

print("DATASET SHAPE:")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\nFIRST FEW ROWS:")
print(df.head())

print("\nDATA TYPES:")
print(df.dtypes)

print("\nMISSING VALUES:")
print(df.isnull().sum())

print("\nBASIC INFO:")
print(f"Memory usage: {df.memory_usage().sum() / 1024**2:.2f} MB")
print(f"Duplicate rows: {df.duplicated().sum()}")
```
2.2.3 Step 2: Univariate Analysis
Analisis setiap variabel secara individual untuk memahami distribusinya.
Untuk Variabel Numerik:
Central tendency: Mean, median, mode
Spread: Std dev, variance, range
Shape: Skewness, kurtosis
Visualisasi: Histogram, KDE plot, box plot
Klik untuk melihat univariate analysis
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Descriptive statistics
print("DESCRIPTIVE STATISTICS:")
print(df.describe())

print("\nSKEWNESS & KURTOSIS:")
print(f"Age skewness: {df['Age'].skew():.3f}")
print(f"Age kurtosis: {df['Age'].kurtosis():.3f}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
df['Age'].hist(bins=30, ax=axes[0], edgecolor='black')
axes[0].set_title('Age Distribution (Histogram)')
axes[0].set_xlabel('Age')

# KDE plot
df['Age'].plot(kind='kde', ax=axes[1])
axes[1].set_title('Age Distribution (KDE)')
axes[1].set_xlabel('Age')

# Box plot
df['Age'].plot(kind='box', ax=axes[2])
axes[2].set_title('Age (Box Plot)')

plt.tight_layout()
plt.show()

print("\nUnivariate analysis shows Age distribution")
```
Untuk Variabel Kategorikal:
Frequency distribution
Unique values count
Mode
Visualisasi: Bar plot, pie chart
Categorical variable analysis
```python
print("CATEGORICAL VARIABLES:")
print(f"Embarked unique values: {df['Embarked'].nunique()}")
print(f"Embarked value counts:\n{df['Embarked'].value_counts()}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df['Embarked'].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Embarked Port Distribution')
axes[0].set_xlabel('Port')

df['Pclass'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
axes[1].set_title('Passenger Class Distribution')

plt.tight_layout()
plt.show()
```
2.2.4 Step 3: Bivariate Analysis
Analisis relationship antara dua variabel.
Bivariate analysis examples
```python
# Correlation matrix
print("CORRELATION MATRIX:")
numeric_cols = df.select_dtypes(include=[np.number]).columns
corr_matrix = df[numeric_cols].corr()
print(corr_matrix)

# Scatter plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Numeric vs numeric
df.plot(x='Age', y='Fare', kind='scatter', ax=axes[0], alpha=0.5)
axes[0].set_title('Age vs Fare Relationship')

# Categorical vs numeric
df.boxplot(column='Age', by='Pclass', ax=axes[1])
axes[1].set_title('Age by Passenger Class')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

print("\nSurvival rate by gender:")
print(df.groupby('Sex')['Survived'].mean())
```
2.2.5 Step 4: Multivariate Analysis & Correlation Heatmap
Correlation heatmap

```python
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation Heatmap of Numerical Features')
plt.tight_layout()
plt.show()

print("Heatmap reveals feature relationships")
```
2.3 Data Cleaning & Preprocessing
Data real-world selalu memiliki masalah. Mari kita belajar cara mengatasinya.
2.3.1 Handling Missing Values
Missing values dapat terjadi karena:
Data entry errors
Equipment failure
Confidentiality reasons
Data loss
Missing Value Strategies
Jangan pernah menghapus semua rows dengan missing values tanpa pertimbangan!
Pilih strategi berdasarkan:
Seberapa banyak missing values (threshold)?
Apakah missing values MCAR, MAR, atau MNAR (Missing Completely At Random, Missing At Random, Missing Not At Random)?
Pentingnya variabel tersebut
Dampak pada model performance
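Logika pemilihan strategi di atas dapat disketsakan sebagai satu fungsi kecil. Nama fungsi, ambang 50%, dan data contohnya hipotetis, hanya untuk ilustrasi:

```python
import pandas as pd
import numpy as np

def impute_by_threshold(df, drop_threshold=0.5):
    """Drop kolom yang missing di atas threshold; imputasi median untuk
    kolom numerik dan mode untuk kolom kategorikal di bawahnya."""
    out = df.copy()
    missing_frac = out.isnull().mean()
    # Buang kolom dengan proporsi missing di atas threshold
    out = out.drop(columns=missing_frac[missing_frac > drop_threshold].index)
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Contoh kecil (data hipotetis)
df_demo = pd.DataFrame({
    'age': [22, np.nan, 35, 41],
    'city': ['A', 'B', None, 'B'],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
})
clean = impute_by_threshold(df_demo)
```

Kolom `mostly_missing` (75% missing) dibuang, sedangkan `age` dan `city` diimputasi dengan median dan mode masing-masing.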
Strategi Handling Missing Values:
flowchart TD
A["Missing Values \n Detected"] --> B{"Percentage \n Missing?"}
B -->|> 50%| C["Drop Column \n Too Much Missing"]
B -->|< 50%| D{"Type of \n Variable?"}
D -->|Numeric| E["Fill Strategies"]
D -->|Categorical| F["Fill Strategies"]
E --> E1["Mean/Median \n Forward Fill \n Interpolation"]
F --> F1["Mode \n Create 'Unknown' \n Forward Fill"]
E1 --> G["Check Distribution"]
F1 --> G
G --> H["Evaluate Impact"]
H --> I["Proceed with Model"]
style C fill:#ffcdd2
style E1 fill:#c8e6c9
style F1 fill:#c8e6c9
style I fill:#a5d6a7
Missing Values Handling Strategies
Metode Imputation:
Missing value handling examples
```python
print("MISSING VALUES ANALYSIS:")
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0])

# Strategy 1: Drop column if > 50% missing
print("\nSTRATEGY 1: Drop Columns")
df_clean = df.drop(['Cabin'], axis=1)  # Cabin has 77% missing
print("Dropped Cabin column")

# Strategy 2: Impute numerical with median
print("\nSTRATEGY 2: Impute Numeric with Median")
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
print("Age missing values filled with median")

# Strategy 3: Impute categorical with mode
print("\nSTRATEGY 3: Impute Categorical with Mode")
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])
print("Embarked filled with mode")

# Strategy 4: Forward fill (untuk time series)
print("\nSTRATEGY 4: Forward Fill (Time Series)")
# df_clean = df_clean.ffill()
print("For time series, forward fill propagates previous value")

print("\nMissing values handled:")
print(df_clean.isnull().sum())
```
2.3.2 Handling Outliers
Outliers adalah nilai yang sangat berbeda dari mayoritas data. Mereka bisa:
Valid: Natural variation (e.g., billionaire dalam dataset income)
Invalid: Data entry error, sensor malfunction
Deteksi Outliers:
Method 1: IQR (Interquartile Range). Dengan Q1 = persentil ke-25 dan Q3 = persentil ke-75, IQR = Q3 - Q1; Lower Bound = Q1 - 1.5 * IQR, Upper Bound = Q3 + 1.5 * IQR. Nilai di luar rentang ini dianggap outlier.
Method 2: Z-Score. Z = (value - mean) / std_dev; |Z| > 3 (di luar 99.7% data) kemungkinan besar outlier.
Outlier detection methods

```python
from scipy import stats

# Method 1: IQR
print("OUTLIER DETECTION - IQR METHOD:")
Q1 = df_clean['Age'].quantile(0.25)
Q3 = df_clean['Age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = df_clean[(df_clean['Age'] < lower_bound) | (df_clean['Age'] > upper_bound)]
print(f"IQR bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")
print(f"Outliers detected: {len(outliers_iqr)}")

# Method 2: Z-score
print("\nOUTLIER DETECTION - Z-SCORE METHOD:")
z_scores = np.abs(stats.zscore(df_clean['Age'].dropna()))
outliers_zscore = (z_scores > 3).sum()
print(f"Outliers (|Z| > 3): {outliers_zscore}")

# Handling strategies
print("\nOUTLIER HANDLING STRATEGIES:")
print("1. Remove: df.drop(outlier_indices)")
print("2. Cap: df['col'].clip(lower, upper)")
print("3. Transform: log, sqrt, Box-Cox")
print("4. Keep: if valid and important")

# Example: cap at IQR bounds
df_clean['Age_capped'] = df_clean['Age'].clip(lower_bound, upper_bound)
print("\nAge outliers capped at IQR bounds")
```
2.3.3 Encoding Categorical Variables
Algoritma ML membutuhkan input numerik. Variabel kategorikal perlu dienkode.
Strategi Encoding:
flowchart TD
A["Categorical Variable"] --> B{"Ordinal \n or \n Nominal?"}
B -->|Ordinal| C["Ordinal Encoding"]
C --> C1["lowβ1, mediumβ2 \n highβ3 \n Preserves order"]
B -->|Nominal| D{"Binary \n or \n Multi?"}
D -->|Binary| E["Label Encoding"]
E --> E1["Noβ0, Yesβ1 \n Simple & Fast"]
D -->|Multi| F["One-Hot Encoding"]
F --> F1["Create dummy \n variables \n No multicollinearity"]
C1 --> G["Ready for Model"]
E1 --> G
F1 --> G
style C1 fill:#bbdefb
style E1 fill:#90caf9
style F1 fill:#64b5f6
style G fill:#c8e6c9
Categorical Variable Encoding Strategies
Categorical encoding examples

```python
print("ENCODING STRATEGIES:")

# Strategy 1: Label encoding (binary categorical)
print("\n1. LABEL ENCODING (Binary):")
df_clean['Sex_encoded'] = df_clean['Sex'].map({'male': 1, 'female': 0})
print(df_clean[['Sex', 'Sex_encoded']].head())

# Strategy 2: Ordinal encoding (ordered categories)
print("\n2. ORDINAL ENCODING (Ordered Categories):")
class_mapping = {'Lower': 1, 'Middle': 2, 'Upper': 3}
# df_clean['Class_encoded'] = df_clean['Class'].map(class_mapping)
print("Maps preserve order: Lower < Middle < Upper")

# Strategy 3: One-hot encoding (nominal categorical)
print("\n3. ONE-HOT ENCODING (Nominal):")
embarked_encoded = pd.get_dummies(df_clean['Embarked'], prefix='Embarked')
print(embarked_encoded.head())

print("\nCategorical variables encoded for ML")
```

One-Hot Encoding Caution
Dummy Variable Trap: jika memiliki k kategori, jangan buat k dummy variables! Gunakan k-1 dummy variables (drop first category) untuk menghindari perfect multicollinearity: pd.get_dummies(df['col'], drop_first=True)

2.3.4 Feature Scaling & Normalization
Beberapa algoritma sensitif terhadap skala feature (e.g., distance-based, gradient-based). Scaling menormalisasi features ke range yang sama.
Metode Scaling:
Feature scaling methods

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Original data
ages = df_clean['Age'].values.reshape(-1, 1)
print("ORIGINAL DATA STATISTICS:")
print(f"Mean: {ages.mean():.2f}, Std: {ages.std():.2f}")
print(f"Min: {ages.min():.2f}, Max: {ages.max():.2f}")

# Method 1: Standardization (Z-score): (x - mean) / std
scaler_std = StandardScaler()
ages_standardized = scaler_std.fit_transform(ages)
print(f"\nStandardized - Mean: {ages_standardized.mean():.2f}, Std: {ages_standardized.std():.2f}")

# Method 2: Min-max scaling (normalization): (x - min) / (max - min)
scaler_minmax = MinMaxScaler(feature_range=(0, 1))
ages_normalized = scaler_minmax.fit_transform(ages)
print(f"Min-Max - Min: {ages_normalized.min():.2f}, Max: {ages_normalized.max():.2f}")

# Method 3: Robust scaling (resistant to outliers): (x - median) / IQR
scaler_robust = RobustScaler()
ages_robust = scaler_robust.fit_transform(ages)
print(f"Robust - Median: {np.median(ages_robust):.2f}")

print("\nChoose scaling based on algorithm requirements")
```

Kapan menggunakan scaling?

| Algoritma | Membutuhkan Scaling? |
|-----------|----------------------|
| Linear Regression, Logistic Regression | Yes |
| Neural Networks | Yes |
| SVM, KNN | Yes |
| Decision Trees, Random Forest | No |
| Gradient Boosting | No |

2.4 Feature Engineering
Feature engineering adalah seni & sains menciptakan features baru dari raw data yang meningkatkan model performance.
2.4.1 Tipe-Tipe Feature Engineering
1. Polynomial Features

```python
from sklearn.preprocessing import PolynomialFeatures

# Original feature
X = df_clean[['Age']].values[:100]

# Create polynomial features: Age, Age^2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print("POLYNOMIAL FEATURES:")
print(f"Original: {X.shape[1]} feature")
print(f"After polynomial (degree 2): {X_poly.shape[1]} features")
print("Example: Age=25 -> [25, 625]")
```

2. Interaction Features

```python
# Create interaction feature
df_clean['Age_Fare_Interaction'] = df_clean['Age'] * df_clean['Fare']
print("INTERACTION FEATURES: Age x Fare")
print(df_clean[['Age', 'Fare', 'Age_Fare_Interaction']].head())
```

3. Domain-Specific Features
```python
# Extract title from name
print("DOMAIN-SPECIFIC FEATURES:")
df_clean['Title'] = df_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print("Extracted title from passenger name:")
print(df_clean[['Name', 'Title']].head())

# Family size feature
print("\nFamily size from SibSp and Parch:")
df_clean['FamilySize'] = df_clean['SibSp'] + df_clean['Parch'] + 1
print(df_clean[['SibSp', 'Parch', 'FamilySize']].head())
```
2.4.2 Feature Selection
Tidak semua features berguna. Feature selection menghilangkan features yang:
Tidak relevant dengan target
Redundant dengan features lain
Meningkatkan noise/overfitting
Metode Feature Selection:
flowchart TD
A["Features \n Brainstorm"] --> B{"Selection \n Method?"}
B --> C["Statistical Methods"]
C --> C1["Univariate Tests \n Correlation Analysis \n Mutual Information"]
B --> D["Model-Based"]
D --> D1["Feature Importance \n Permutation \n SHAP Values"]
B --> E["Iterative"]
E --> E1["Recursive Elimination \n Forward Selection \n Backward Elimination"]
C1 --> F["Selected Features"]
D1 --> F
E1 --> F
F --> G["Model Training"]
style F fill:#c8e6c9
style G fill:#a5d6a7
Feature Selection Methods
Feature selection techniques
```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# Prepare data
X = df_clean[['Age', 'Fare', 'Pclass']].dropna()
y = df_clean.loc[X.index, 'Survived']

print("FEATURE SELECTION METHODS:")

# Method 1: Univariate statistical test
print("\n1. UNIVARIATE STATISTICAL TEST:")
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
scores = selector.scores_
for i, col in enumerate(X.columns):
    print(f"{col}: {scores[i]:.2f}")

# Method 2: Feature importance from Random Forest
print("\n2. FEATURE IMPORTANCE (Random Forest):")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
for i, col in enumerate(X.columns):
    print(f"{col}: {rf.feature_importances_[i]:.3f}")

print("\nFeatures ranked by importance")
```
2.5 Data Validation & Quality Checks
Sebelum modeling, pastikan data berkualitas tinggi.
2.5.1 Data Leakage Prevention
Data Leakage Warning!
Data leakage terjadi ketika informasi dari test set "bocor" ke training set, menyebabkan overly optimistic model performance.
Contoh data leakage:
Scaling data sebelum train-test split
Using future information untuk predict past
Including target variable information di features
Tuning hyperparameters pada test set
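Alur kerja yang benar (split dulu, baru fit preprocessor pada training set) dapat didemonstrasikan end-to-end pada data sintetis berikut; data, model, dan angkanya hanya ilustrasi:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Data sintetis: label ditentukan secara linear oleh dua fitur pertama
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 1. Split DULU, sebelum preprocessing apa pun
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Fit scaler HANYA pada training set
scaler = StandardScaler().fit(X_train)

# 3. Transform kedua set dengan scaler yang sama
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# 4-5. Train pada training set, evaluasi pada test set
model = LogisticRegression().fit(X_train_s, y_train)
score = model.score(X_test_s, y_test)
```

Memindahkan `scaler.fit` ke seluruh `X` (sebelum split) adalah versi bocor dari alur yang sama.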
Best Practices untuk mencegah data leakage:
Data leakage prevention
```python
from sklearn.model_selection import train_test_split

print("CORRECT WORKFLOW:")
print("""
1. Train-test split PERTAMA
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
2. Fit preprocessor HANYA pada training data
   scaler.fit(X_train)  # NOT fit on entire X!
3. Transform BOTH sets dengan scaler yang sudah fitted
   X_train_scaled = scaler.transform(X_train)
   X_test_scaled = scaler.transform(X_test)
4. Train model pada training set saja
   model.fit(X_train_scaled, y_train)
5. Evaluate pada test set
   score = model.score(X_test_scaled, y_test)
""")

print("INCORRECT WORKFLOW (leads to data leakage):")
print("""
1. Scaling semua data sekaligus
   X_scaled = scaler.fit_transform(X)  # WRONG!
2. Kemudian baru split
   X_train, X_test = train_test_split(X_scaled)
Result: Test set sudah "dilihat" scaler -> leakage!
""")
```
2.5.2 Data Quality Checklist
```python
# Comprehensive data quality checks
print("""
DATA QUALITY CHECKLIST:

Completeness:
- No critical missing values (or properly handled)
- All rows have valid data

Accuracy:
- Values are within expected ranges
- No obvious data entry errors
- Outliers validated as legitimate

Consistency:
- Consistent data types
- Consistent naming conventions
- No duplicates (unless intentional)

Validity:
- All values conform to business rules
- Categorical variables have expected categories
- Numeric values are reasonable

Uniqueness:
- Primary key is unique
- No unintended duplicates
- De-duplication done properly

Timeliness:
- Data is current/recent enough
- No time-series data gaps
- Temporal consistency maintained
""")
```
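Sebagian checklist di atas dapat diotomatisasi. Sketsa berikut menghitung beberapa metrik dasar (completeness, uniqueness, validity); nama fungsi `quality_report` dan data contohnya hipotetis:

```python
import pandas as pd
import numpy as np

def quality_report(df, numeric_ranges=None):
    """Kembalikan dict berisi metrik kualitas sederhana untuk sebuah DataFrame."""
    report = {
        'n_rows': len(df),
        # Uniqueness: baris duplikat
        'n_duplicates': int(df.duplicated().sum()),
        # Completeness: missing per kolom
        'missing_per_column': df.isnull().sum().to_dict(),
    }
    # Validity: nilai numerik di luar range bisnis yang diharapkan
    if numeric_ranges:
        violations = {}
        for col, (lo, hi) in numeric_ranges.items():
            violations[col] = int((~df[col].between(lo, hi)).sum())
        report['range_violations'] = violations
    return report

# Contoh kecil: satu umur negatif (invalid) dan satu baris duplikat
df_demo = pd.DataFrame({'age': [25, -3, 40, 40],
                        'fare': [7.3, 80.0, 15.5, 15.5]})
report = quality_report(df_demo, numeric_ranges={'age': (0, 120)})
```

Laporan semacam ini mudah dijalankan ulang setiap kali data di-refresh, sebelum pipeline modeling dijalankan.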
2.6 Complete Preprocessing Pipeline
Mari kita gabungkan semua langkah menjadi preprocessing pipeline yang reusable.
Tools utama yang digunakan dalam bab ini:
Scikit-learn: preprocessing dan feature engineering
Matplotlib/Seaborn: data visualization
Kesimpulan
Data preprocessing dan EDA adalah fondasi dari machine learning projects yang sukses. Waktu yang Anda investasikan dalam memahami dan membersihkan data akan membayar dividen besar dalam bentuk model performance yang lebih baik.
Di Bab 3 (Classical ML Algorithms), kita akan menggunakan data yang sudah dipreproses ini untuk melatih berbagai algoritma dan memahami kekuatan serta kelemahan masing-masing.
---title: "Bab 2: Data Preprocessing dan Exploratory Data Analysis"subtitle: "Mempersiapkan Data Berkualitas Tinggi untuk Machine Learning"number-sections: false---# Bab 2: Data Preprocessing dan Exploratory Data Analysis (EDA) {#sec-chapter-02}::: {.callout-note}## π― Hasil Pembelajaran (Learning Outcomes)Setelah mempelajari bab ini, Anda akan mampu:1. **Melakukan** exploratory data analysis (EDA) komprehensif untuk memahami struktur dan karakteristik data2. **Mengidentifikasi** dan **menangani** missing values dengan strategi yang tepat3. **Mendeteksi** dan **mengatasi** outliers menggunakan berbagai teknik4. **Mengenkode** variabel kategorikal dengan metode yang sesuai untuk ML5. **Menerapkan** feature scaling dan normalization secara efektif6. **Mengrekayasa** fitur-fitur baru yang meningkatkan predictive power model7. **Mencegah** data leakage dan memastikan data quality untuk production models:::## 2.1 Pengantar: Pentingnya Data Quality> **"Garbage in, garbage out"** - Prinsip fundamental dalam machine learningData adalah jantung dari setiap ML project. Algoritma terbaik sekalipun tidak bisa menghasilkan model berkualitas jika dilatih dengan data berkualitas rendah. 
Dalam praktik nyata, **70-80% waktu data scientist dihabiskan untuk data preprocessing dan EDA**, bukan untuk modeling.Dalam bab ini, kita akan mempelajari:- Bagaimana **memahami data secara mendalam** melalui EDA- Bagaimana **membersihkan dan mempersiapkan data** untuk modeling- Bagaimana **menangani masalah umum** seperti missing values dan outliers- Bagaimana **mengrekayasa fitur** yang lebih powerful- Bagaimana **mencegah data leakage** dan memastikan reproducibility**Pertanyaan Utama:**- Bagaimana cara memahami dataset baru secara sistematis?- Apa masalah data quality yang umum dan bagaimana mengatasinya?- Bagaimana cara mempersiapkan data sehingga siap untuk algoritma ML?---## 2.2 Exploratory Data Analysis (EDA)::: {.callout-important}EDA adalah proses investigasi sistematis terhadap dataset untuk memahami karakteristik, pola, dan anomalinya. EDA bukan tentang membuat model, tetapi tentang **memahami data**, mencari *insight* yang tersembunyi dalam data.:::### 2.2.1 Tujuan dan Aktivitas EDA**Tujuan EDA:**- Understand data structure dan content- Identify missing values, outliers, anomalies- Discover patterns dan relationships- Generate hypotheses untuk selanjutnya- Inform preprocessing dan feature engineering decisions**Aktivitas EDA:**```{mermaid}%%| fig-cap: "Exploratory Data Analysis Process Flow"flowchart TD A["Dataset"] --> B["Load & Inspect"] B --> C["Basic Statistics"] C --> D["Univariate Analysis"] D --> D1["Distributions \n Histograms \n Box Plots"] D1 --> E["Bivariate Analysis"] E --> E1["Correlations \n Scatter Plots \n Pair Plots"] E1 --> F["Multivariate Analysis"] F --> F1["Heatmaps \n PCA Plots \n Clustering"] F1 --> G["Data Quality Report"] G --> H["Ready for Preprocessing"] style A fill:#e8f5e9 style H fill:#c8e6c9 style D fill:#a1d99f style E fill:#a1d99f style F fill:#a1d99f```### 2.2.2 Step 1: Load & Inspect DataLangkah pertama adalah memuat data dan mendapatkan gambaran awal.```{python}#| echo: true#| eval: trueimport pandas as pdimport 
numpy as np# Load sample dataset (Titanic)url ='https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'df = pd.read_csv(url)print("π DATASET SHAPE:")print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")print("\nπ FIRST FEW ROWS:")print(df.head())print("\nπ DATA TYPES:")print(df.dtypes)print("\nβ MISSING VALUES:")print(df.isnull().sum())print("\nπ BASIC INFO:")print(f"Memory usage: {df.memory_usage().sum() /1024**2:.2f} MB")print(f"Duplicate rows: {df.duplicated().sum()}")```### 2.2.3 Step 2: Univariate AnalysisAnalisis setiap variabel secara individual untuk memahami distribusinya.**Untuk Variabel Numerik:**- Central tendency: Mean, median, mode- Spread: Std dev, variance, range- Shape: Skewness, kurtosis- Visualisasi: Histogram, KDE plot, box plot```{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Klik untuk melihat univariate analysis"import matplotlib.pyplot as pltimport seaborn as sns# Descriptive statisticsprint("π DESCRIPTIVE STATISTICS:")print(df.describe())print("\nπ SKEWNESS & KURTOSIS:")print(f"Age skewness: {df['Age'].skew():.3f}")print(f"Age kurtosis: {df['Age'].kurtosis():.3f}")# Visualizationfig, axes = plt.subplots(1, 3, figsize=(15, 4))# Histogramdf['Age'].hist(bins=30, ax=axes[0], edgecolor='black')axes[0].set_title('Age Distribution (Histogram)')axes[0].set_xlabel('Age')# KDE plotdf['Age'].plot(kind='kde', ax=axes[1])axes[1].set_title('Age Distribution (KDE)')axes[1].set_xlabel('Age')# Box plotdf['Age'].plot(kind='box', ax=axes[2])axes[2].set_title('Age (Box Plot)')plt.tight_layout()plt.show()print("\nβ Univariate analysis shows Age distribution")```**Untuk Variabel Kategorikal:**- Frequency distribution- Unique values count- Mode- Visualisasi: Bar plot, pie chart```{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Categorical variable analysis"print("π CATEGORICAL VARIABLES:")print(f"Embarked unique values: {df['Embarked'].nunique()}")print(f"Embarked value 
counts:\n{df['Embarked'].value_counts()}")# Visualizationfig, axes = plt.subplots(1, 2, figsize=(12, 4))df['Embarked'].value_counts().plot(kind='bar', ax=axes[0])axes[0].set_title('Embarked Port Distribution')axes[0].set_xlabel('Port')df['Pclass'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')axes[1].set_title('Passenger Class Distribution')plt.tight_layout()plt.show()```### 2.2.4 Step 3: Bivariate AnalysisAnalisis relationship antara dua variabel.```{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Bivariate analysis examples"# Correlation matrixprint("π CORRELATION MATRIX:")numeric_cols = df.select_dtypes(include=[np.number]).columnscorr_matrix = df[numeric_cols].corr()print(corr_matrix)# Scatter plotfig, axes = plt.subplots(1, 2, figsize=(12, 4))# Numeric vs Numericdf.plot(x='Age', y='Fare', kind='scatter', ax=axes[0], alpha=0.5)axes[0].set_title('Age vs Fare Relationship')# Categorical vs Numericdf.boxplot(column='Age', by='Pclass', ax=axes[1])axes[1].set_title('Age by Passenger Class')plt.suptitle('') # Remove default titleplt.tight_layout()plt.show()print("\nπ Survival rate by gender:")print(df.groupby('Sex')['Survived'].mean())```### 2.2.5 Step 4: Multivariate Analysis & Correlation Heatmap```{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Correlation heatmap"# Correlation heatmapplt.figure(figsize=(10, 8))sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0)plt.title('Correlation Heatmap of Numerical Features')plt.tight_layout()plt.show()print("β Heatmap reveals feature relationships")```---## 2.3 Data Cleaning & PreprocessingData real-world selalu memiliki masalah. 
Mari kita belajar cara mengatasinya.### 2.3.1 Handling Missing ValuesMissing values dapat terjadi karena:- Data entry errors- Equipment failure- Confidentiality reasons- Data loss::: {.callout-important}## Missing Value Strategies**Jangan pernah menghapus semua rows dengan missing values tanpa pertimbangan!**Pilih strategi berdasarkan:- Seberapa banyak missing values (threshold)?- Apakah missing values MCAR, MAR, atau MNAR?- Pentingnya variabel tersebut- Dampak pada model performance:::**Strategi Handling Missing Values:**```{mermaid}%%| fig-cap: "Missing Values Handling Strategies"flowchart TD A["Missing Values \n Detected"] --> B{"Percentage \n Missing?"} B -->|> 50%| C["Drop Column \n Too Much Missing"] B -->|< 50%| D{"Type of \n Variable?"} D -->|Numeric| E["Fill Strategies"] D -->|Categorical| F["Fill Strategies"] E --> E1["Mean/Median \n Forward Fill \n Interpolation"] F --> F1["Mode \n Create 'Unknown' \n Forward Fill"] E1 --> G["Check Distribution"] F1 --> G G --> H["Evaluate Impact"] H --> I["Proceed with Model"] style C fill:#ffcdd2 style E1 fill:#c8e6c9 style F1 fill:#c8e6c9 style I fill:#a5d6a7```**Metode Imputation:**```{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Missing value handling examples"print("β MISSING VALUES ANALYSIS:")missing_pct = (df.isnull().sum() /len(df)) *100print(missing_pct[missing_pct >0])# Strategy 1: Drop column if > 50% missingprint("\nπ STRATEGY 1: Drop Columns")df_clean = df.drop(['Cabin'], axis=1) # Cabin has 77% missingprint(f"Dropped Cabin column")# Strategy 2: Impute numerical with medianprint("\nπ STRATEGY 2: Impute Numeric with Median")df_clean['Age'].fillna(df_clean['Age'].median(), inplace=True)print(f"Age missing values filled with median")# Strategy 3: Impute categorical with modeprint("\nπ STRATEGY 3: Impute Categorical with Mode")df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0], inplace=True)print(f"Embarked filled with mode")# Strategy 4: Forward fill (untuk time 
series)print("\nπ STRATEGY 4: Forward Fill (Time Series)")# df_clean.fillna(method='ffill', inplace=True)print("For time series, forward fill propagates previous value")print("\nβ Missing values handled:")print(df_clean.isnull().sum())```### 2.3.2 Handling OutliersOutliers adalah nilai yang sangat berbeda dari mayoritas data. Mereka bisa:- **Valid:** Natural variation (e.g., billionaire dalam dataset income)- **Invalid:** Data entry error, sensor malfunction**Deteksi Outliers:****Method 1: IQR (Interquartile Range)**```Q1 = 25th percentileQ3 = 75th percentileIQR = Q3 - Q1Lower Bound = Q1 - 1.5 * IQRUpper Bound = Q3 + 1.5 * IQROutliers = values outside [Lower Bound, Upper Bound]```**Method 2: Z-Score**```Z = (value - mean) / std_dev|Z| > 3 β 99.7% of data (likely outlier)``````{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Outlier detection methods"# Method 1: IQRprint("π OUTLIER DETECTION - IQR METHOD:")Q1 = df_clean['Age'].quantile(0.25)Q3 = df_clean['Age'].quantile(0.75)IQR = Q3 - Q1lower_bound = Q1 -1.5* IQRupper_bound = Q3 +1.5* IQRoutliers_iqr = df_clean[(df_clean['Age'] < lower_bound) | (df_clean['Age'] > upper_bound)]print(f"IQR bounds: [{lower_bound:.1f}, {upper_bound:.1f}]")print(f"Outliers detected: {len(outliers_iqr)}")# Method 2: Z-Scoreprint("\nπ OUTLIER DETECTION - Z-SCORE METHOD:")from scipy import statsz_scores = np.abs(stats.zscore(df_clean['Age'].dropna()))outliers_zscore = (z_scores >3).sum()print(f"Outliers (|Z| > 3): {outliers_zscore}")# Handling outliersprint("\nβ OUTLIER HANDLING STRATEGIES:")print("1. Remove: df.drop(outlier_indices)")print("2. Cap: df['col'].clip(lower, upper)")print("3. Transform: log, sqrt, Box-Cox")print("4. Keep: if valid and important")# Example: Cap at boundsdf_clean['Age_capped'] = df_clean['Age'].clip(lower_bound, upper_bound)print("\nβ Age outliers capped at IQR bounds")```### 2.3.3 Encoding Categorical VariablesAlgoritma ML membutuhkan input numerik. 
Variabel kategorikal perlu dienkode.**Strategi Encoding:**```{mermaid}%%| fig-cap: "Categorical Variable Encoding Strategies"flowchart TD A["Categorical Variable"] --> B{"Ordinal \n or \n Nominal?"} B -->|Ordinal| C["Ordinal Encoding"] C --> C1["lowβ1, mediumβ2 \n highβ3 \n Preserves order"] B -->|Nominal| D{"Binary \n or \n Multi?"} D -->|Binary| E["Label Encoding"] E --> E1["Noβ0, Yesβ1 \n Simple & Fast"] D -->|Multi| F["One-Hot Encoding"] F --> F1["Create dummy \n variables \n No multicollinearity"] C1 --> G["Ready for Model"] E1 --> G F1 --> G style C1 fill:#bbdefb style E1 fill:#90caf9 style F1 fill:#64b5f6 style G fill:#c8e6c9``````{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Categorical encoding examples"print("π ENCODING STRATEGIES:")# Strategy 1: Label Encoding (Binary categorical)print("\n1οΈβ£ LABEL ENCODING (Binary):")df_clean['Sex_encoded'] = df_clean['Sex'].map({'male': 1, 'female': 0})print(df_clean[['Sex', 'Sex_encoded']].head())# Strategy 2: Ordinal Encoding (Ordinal categorical)print("\n2οΈβ£ ORDINAL ENCODING (Ordered Categories):")class_mapping = {'Lower': 1, 'Middle': 2, 'Upper': 3}# df_clean['Class_encoded'] = df_clean['Class'].map(class_mapping)print("Maps preserve order: Lower < Middle < Upper")# Strategy 3: One-Hot Encoding (Nominal categorical)print("\n3οΈβ£ ONE-HOT ENCODING (Nominal):")embarked_encoded = pd.get_dummies(df_clean['Embarked'], prefix='Embarked')print(embarked_encoded.head())print("\nβ Categorical variables encoded for ML")```::: {.callout-warning}## One-Hot Encoding Caution**Dummy Variable Trap:** Jika memiliki k kategori, jangan buat k dummy variables!Gunakan k-1 dummy variables (drop first category) untuk menghindari perfect multicollinearity.```pythonpd.get_dummies(df['col'], drop_first=True)```:::### 2.3.4 Feature Scaling & NormalizationBeberapa algoritma sensitif terhadap skala feature (e.g., distance-based, gradient-based). 
Scaling menormalisasi features ke range yang sama.**Metode Scaling:**```{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Feature scaling methods"from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler# Original dataages = df_clean['Age'].values.reshape(-1, 1)print("π ORIGINAL DATA STATISTICS:")print(f"Mean: {ages.mean():.2f}, Std: {ages.std():.2f}")print(f"Min: {ages.min():.2f}, Max: {ages.max():.2f}")# Method 1: Standardization (Z-score normalization)print("\n1οΈβ£ STANDARDIZATION (Z-score):")scaler_std = StandardScaler()ages_standardized = scaler_std.fit_transform(ages)print(f"After standardization - Mean: {ages_standardized.mean():.2f}, Std: {ages_standardized.std():.2f}")print(f"Formula: (x - mean) / std")# Method 2: Min-Max Scaling (Normalization)print("\n2οΈβ£ MIN-MAX SCALING (Normalization):")scaler_minmax = MinMaxScaler(feature_range=(0, 1))ages_normalized = scaler_minmax.fit_transform(ages)print(f"After min-max - Min: {ages_normalized.min():.2f}, Max: {ages_normalized.max():.2f}")print(f"Formula: (x - min) / (max - min)")# Method 3: Robust Scaling (resistant to outliers)print("\n3οΈβ£ ROBUST SCALING (Resistant to Outliers):")scaler_robust = RobustScaler()ages_robust = scaler_robust.fit_transform(ages)print(f"After robust scaling - Median: {np.median(ages_robust):.2f}")print(f"Formula: (x - median) / IQR")print("\nπ COMPARISON:")print(f"Original range: [{ages.min():.2f}, {ages.max():.2f}]")print(f"Standardized range: [{ages_standardized.min():.2f}, {ages_standardized.max():.2f}]")print(f"Normalized range: [{ages_normalized.min():.2f}, {ages_normalized.max():.2f}]")print("\nβ Choose scaling based on algorithm requirements")```**Kapan menggunakan scaling?**| Algoritma | Membutuhkan Scaling? 
||-----------|-----------------|| Linear Regression, Logistic Regression | β Yes || Neural Networks | β Yes || SVM, KNN | β Yes || Decision Trees, Random Forest | β No || Gradient Boosting | β No |---## 2.4 Feature EngineeringFeature engineering adalah seni & sains **menciptakan features baru** dari raw data yang meningkatkan model performance.### 2.4.1 Tipe-Tipe Feature Engineering**1. Polynomial Features**```{python}#| echo: true#| eval: true#| code-fold: true#| code-summary: "Polynomial feature creation"from sklearn.preprocessing import PolynomialFeatures# Original featureX = df_clean[['Age']].values[:100]# Create polynomial featurespoly = PolynomialFeatures(degree=2, include_bias=False)X_poly = poly.fit_transform(X)print("π POLYNOMIAL FEATURES:")print(f"Original: {X.shape[1]} feature")print(f"After polynomial (degree 2): {X_poly.shape[1]} features")print(f"Features: Age, AgeΒ²")print(f"\nExample:")print(f"Age=25 β [25, 625]")```**2. Interaction Features**```{python}#| echo: true#| eval: true# Create interaction featuredf_clean['Age_Fare_Interaction'] = df_clean['Age'] * df_clean['Fare']print("π INTERACTION FEATURES:")print("Combined Age Γ Fare")print(df_clean[['Age', 'Fare', 'Age_Fare_Interaction']].head())```**3. Domain-Specific Features**```{python}#| echo: true#| eval: true# Extract title from nameprint("π DOMAIN-SPECIFIC FEATURES:")df_clean['Title'] = df_clean['Name'].str.extract(' ([A-Za-z]+)\.', expand=False)print("Extracted title from passenger name:")print(df_clean[['Name', 'Title']].head())# Family size featureprint("\nFamily size from SibSp and Parch:")df_clean['FamilySize'] = df_clean['SibSp'] + df_clean['Parch'] +1print(df_clean[['SibSp', 'Parch', 'FamilySize']].head())```### 2.4.2 Feature SelectionTidak semua features berguna. 
Feature selection removes features that:

- are not relevant to the target
- are redundant with other features
- add noise and increase overfitting

**Feature selection methods:**

```{mermaid}
%%| fig-cap: "Feature Selection Methods"
flowchart TD
    A["Features \n Brainstorm"] --> B{"Selection \n Method?"}
    B --> C["Statistical Methods"]
    C --> C1["Univariate Tests \n Correlation Analysis \n Mutual Information"]
    B --> D["Model-Based"]
    D --> D1["Feature Importance \n Permutation \n SHAP Values"]
    B --> E["Iterative"]
    E --> E1["Recursive Elimination \n Forward Selection \n Backward Elimination"]
    C1 --> F["Selected Features"]
    D1 --> F
    E1 --> F
    F --> G["Model Training"]
    style F fill:#c8e6c9
    style G fill:#a5d6a7
```

```{python}
#| echo: true
#| eval: true
#| code-fold: true
#| code-summary: "Feature selection techniques"
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# Prepare data
X = df_clean[['Age', 'Fare', 'Pclass']].dropna()
y = df_clean.loc[X.index, 'Survived']

print("📊 FEATURE SELECTION METHODS:")

# Method 1: Univariate statistical test
print("\n1️⃣ UNIVARIATE STATISTICAL TEST:")
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
scores = selector.scores_
for i, col in enumerate(X.columns):
    print(f"{col}: {scores[i]:.2f}")

# Method 2: Feature importance from Random Forest
print("\n2️⃣ FEATURE IMPORTANCE (Random Forest):")
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
for i, col in enumerate(X.columns):
    print(f"{col}: {rf.feature_importances_[i]:.3f}")

print("\n✅ Features ranked by importance")
```

---

## 2.5 Data Validation & Quality Checks

Before modeling, make sure the data is of high quality.

### 2.5.1 Data Leakage Prevention

::: {.callout-warning}
## ⚠️ Data Leakage Warning!

**Data leakage occurs when information from the test set "leaks" into the training process**, producing overly optimistic model performance.

**Examples of data leakage:**

1. Scaling the data before the train-test split
2. Using future information to predict the past
3. Including information about the target variable in the features
4. Tuning hyperparameters on the test set
:::

**Best practices for preventing data leakage:**

```{python}
#| echo: true
#| eval: true
#| code-fold: true
#| code-summary: "Data leakage prevention"
from sklearn.model_selection import train_test_split

print("✅ CORRECT WORKFLOW:")
print("""
1. Train-test split FIRST
   X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

2. Fit the preprocessor ONLY on the training data
   scaler.fit(X_train)  # NOT fit on the entire X!

3. Transform BOTH sets with the already-fitted scaler
   X_train_scaled = scaler.transform(X_train)
   X_test_scaled = scaler.transform(X_test)

4. Train the model on the training set only
   model.fit(X_train_scaled, y_train)

5. Evaluate on the test set
   score = model.score(X_test_scaled, y_test)
""")

print("❌ INCORRECT WORKFLOW (leads to data leakage):")
print("""
1. Scaling all the data at once
   X_scaled = scaler.fit_transform(X)  # WRONG!

2. Splitting only afterwards
   X_train, X_test = train_test_split(X_scaled)

   Result: the test set has already been "seen" by the scaler → leakage!
""")
```

### 2.5.2 Data Quality Checklist

```python
# Comprehensive data quality checks
print("""
📋 DATA QUALITY CHECKLIST:

Completeness:
✅ No critical missing values (or properly handled)
✅ All rows have valid data

Accuracy:
✅ Values are within expected ranges
✅ No obvious data entry errors
✅ Outliers validated as legitimate

Consistency:
✅ Consistent data types
✅ Consistent naming conventions
✅ No duplicates (unless intentional)

Validity:
✅ All values conform to business rules
✅ Categorical variables have expected categories
✅ Numeric values are reasonable

Uniqueness:
✅ Primary key is unique
✅ No unintended duplicates
✅ De-duplication done properly

Timeliness:
✅ Data is current/recent enough
✅ No time-series data gaps
✅ Temporal consistency maintained
""")
```

---

## 2.6 Complete Preprocessing Pipeline

Let's combine all the steps into a **reusable preprocessing pipeline**.

```{python}
#| echo: true
#| eval: true
#| code-fold: true
#| code-summary: "Complete preprocessing pipeline"
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

print("🔧 BUILDING PREPROCESSING PIPELINE:")

# Define numeric and categorical columns
numeric_features = ['Age', 'Fare']
categorical_features = ['Embarked', 'Sex']

# Create transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

print("✅ Preprocessing pipeline created")
print("\nSteps:")
print("1. Impute missing values")
print("2. Scale numeric features")
print("3. Encode categorical features")
print("4. Combine all features")

# Fit preprocessing on training data
X = df_clean[numeric_features + categorical_features]
X_preprocessed = preprocessor.fit_transform(X)

print(f"\n✅ Input shape: {X.shape}")
print(f"✅ Output shape: {X_preprocessed.shape}")
print("✅ Ready for modeling!")
```

---

## 2.7 Case Study: Titanic Dataset Preprocessing

Let's apply all of these concepts to the Titanic dataset.

### Problem

Predict whether a Titanic passenger survived (survival prediction).

### Step-by-Step Preprocessing

```{python}
#| echo: true
#| eval: true
#| code-fold: true
#| code-summary: "Complete Titanic preprocessing"
print("🚢 TITANIC DATASET PREPROCESSING:\n")

# 1. Load data
print("1️⃣ LOAD DATA")
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv')
print(f"Shape: {df.shape}")

# 2. Analyze missing values
print("\n2️⃣ ANALYZE MISSING VALUES")
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0])

# 3. Handle missing values
print("\n3️⃣ HANDLE MISSING VALUES")
df_processed = df.copy()
df_processed['Age'] = df_processed['Age'].fillna(df_processed['Age'].median())
df_processed['Embarked'] = df_processed['Embarked'].fillna(df_processed['Embarked'].mode()[0])
df_processed = df_processed.drop(['Cabin'], axis=1)
print("✅ Missing values handled")

# 4. Feature engineering
print("\n4️⃣ FEATURE ENGINEERING")
df_processed['Title'] = df_processed['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df_processed['FamilySize'] = df_processed['SibSp'] + df_processed['Parch'] + 1
df_processed['IsAlone'] = (df_processed['FamilySize'] == 1).astype(int)
print("✅ New features created: Title, FamilySize, IsAlone")

# 5. Encode categorical variables
print("\n5️⃣ ENCODE CATEGORICAL VARIABLES")
df_processed['Sex'] = df_processed['Sex'].map({'male': 1, 'female': 0})
df_processed = pd.get_dummies(df_processed, columns=['Embarked', 'Title'], drop_first=True)
print("✅ Categorical variables encoded")

# 6. Select features for modeling
print("\n6️⃣ SELECT FEATURES")
feature_cols = [col for col in df_processed.columns
                if col not in ['PassengerId', 'Name', 'Ticket', 'Survived']]
X = df_processed[feature_cols]
y = df_processed['Survived']
print(f"✅ Features selected: {X.shape[1]} features, {X.shape[0]} samples")

# 7. Final validation
print("\n7️⃣ FINAL VALIDATION")
print(f"No missing values remaining: {X.isnull().sum().sum() == 0}")
print(f"All numeric features: {X.dtypes.nunique() == 1}")
print(f"Target variable balance: {y.value_counts().to_dict()}")
print("\n✅ DATA READY FOR MODELING!")
```

---

## 🧪 Hands-on Exercise 2

**Objective:** Practice EDA and preprocessing with a real dataset

**Instructions:**

1. Load a dataset (Titanic or Iris)
2. Perform a complete EDA:
   - Check shape, dtypes, missing values
   - Univariate analysis (histograms, box plots)
   - Bivariate analysis (correlations, scatter plots)
3. Preprocess the data:
   - Handle missing values
   - Detect and handle outliers
   - Encode categorical variables
4. Engineer new features:
   - At least 2 new features
   - Perform feature selection
5. Build a preprocessing pipeline

**Bonus:** Visualize the before/after distributions

---

## 📝 Review Questions

### Conceptual Questions

1. Explain the difference between **Exploratory Data Analysis (EDA)** and **Data Preprocessing**. Why are both stages important?
2. You find that 70% of the values in a column are missing. What is the right decision, and why?
3. Explain three methods for detecting outliers. How do you determine whether an outlier is an error or valid data?
4. What is the difference between **standardization (z-score)**, **normalization (min-max)**, and **robust scaling**? When should each be used?
5. Explain **data leakage** and give two concrete examples. How can it be prevented?

### Practical Questions

6. Your dataset has an 'Age' column with values ranging from 0 to 1500 (an error). What is an appropriate cleaning strategy?
7. The 'Gender' variable has 3 categories: 'M', 'F', 'Unknown'. How would you encode it for an ML model?
8. You have a 'Birth_Date' feature. What features could you engineer from it for churn prediction?
9. After preprocessing, you have 150 features. How do you select the 20 best ones?
10. Where does scaling belong in the preprocessing pipeline? Why does the timing matter?

---

## 🎯 Key Takeaways

- ✅ EDA is a critical step for understanding your data in depth
- ✅ Missing values must be handled with an appropriate strategy, not simply dropped
- ✅ Outliers can be valid or errors; validate them before handling
- ✅ Encoding categorical variables is a requirement for ML algorithms
- ✅ Feature scaling matters for distance-based and gradient-based algorithms
- ✅ Feature engineering can significantly improve model performance
- ✅ **Data leakage prevention** is fundamental to reliable models
- ✅ Preprocessing pipelines should be reproducible and automated

---

## 📚 References

### Further Reading

- Pandas Documentation: https://pandas.pydata.org/
- Scikit-learn Preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
- Feature Engineering Tutorial: https://www.kaggle.com/learn/feature-engineering

### Tools & Libraries

- **Pandas:** Data manipulation and analysis
- **NumPy:** Numerical computing
- **Scikit-learn:** Preprocessing and feature engineering
- **Matplotlib/Seaborn:** Data visualization

---

## ✨ Conclusion

Data preprocessing and EDA are the foundation of successful machine learning projects.
The time you invest in understanding and cleaning your data pays large dividends in the form of better model performance.

In **Chapter 3 (Classical ML Algorithms)**, we will use this preprocessed data to train a range of algorithms and understand the strengths and weaknesses of each. 🚀

---
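As a final worked example tying the leakage warning of Section 2.5 to the pipeline of Section 2.6, the preprocessing steps can be wrapped together with a model into a single estimator, so that cross-validation re-fits the imputers, scaler, and encoder inside every fold and the leakage pattern described above becomes structurally impossible. The sketch below runs on synthetic toy data; the column names mirror the Titanic example, and the choice of `LogisticRegression` is an illustrative assumption, not part of the chapter's case study.

```python
# Leakage-safe evaluation: preprocessing + model as ONE estimator.
# NOTE: the data here is randomly generated toy data for illustration only.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'Age': rng.normal(30, 10, n),
    'Fare': rng.exponential(30, n),
    'Sex': rng.choice(['male', 'female'], n),
    'Embarked': rng.choice(['S', 'C', 'Q'], n),
})
y = (rng.random(n) < 0.4).astype(int)  # toy binary target

numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scaler', StandardScaler())])
categorical = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(drop='first'))])
preprocessor = ColumnTransformer([('num', numeric, ['Age', 'Fare']),
                                  ('cat', categorical, ['Sex', 'Embarked'])])

# Each CV fold re-fits imputer/scaler/encoder on that fold's training
# rows only, so the test rows never leak into the fitted statistics.
clf = Pipeline([('prep', preprocessor),
                ('model', LogisticRegression(max_iter=1000))])

scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Passing `clf` directly to `cross_val_score` is the key design choice: scaling a dataset once and then splitting it, as warned against in Section 2.5, simply cannot happen with this structure.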