Lab 01: Python Environment Setup & Iris Dataset EDA

An Introduction to Machine Learning with a Classic Dataset

Author

Machine Learning Course - Semester 6

Lab Information

General Information

  • Duration: 2-3 hours
  • Difficulty: Beginner
  • Prerequisites: basic Python, understanding of descriptive statistics
  • Tools: Python 3.8+, Jupyter Notebook/Google Colab

Learning Objectives

After completing this lab, students should be able to:

  1. Set up a Python environment for machine learning with the standard libraries (scikit-learn, pandas, numpy, matplotlib, seaborn)
  2. Load and explore the Iris dataset from scikit-learn
  3. Perform exploratory data analysis (EDA) to understand the characteristics of the data
  4. Understand the train/test split concept in machine learning
  5. Train a simple classification model and interpret its results
  6. Evaluate model performance using basic metrics

Mapping to Course Learning Outcomes (CPMK)

  • CPMK-1: Understand fundamental machine learning concepts
  • CPMK-2: Apply the basic stages of a machine learning pipeline

Prerequisites Checklist

Before starting the lab, make sure you have:

Background: The Iris Dataset

Why the Iris Dataset?

The Iris dataset is one of the best-known datasets in machine learning and statistics. It was introduced by the statistician and biologist Ronald Fisher in 1936.

Why is Iris so well suited to learning ML?

  1. The right size: 150 samples - not too large, not too small
  2. Multivariate: 4 distinct numeric features
  3. Multi-class classification: 3 iris species
  4. Balanced dataset: every class has the same number of samples (50 each)
  5. Clean data: no missing values
  6. Visualizable: easy to plot for intuitive understanding

Dataset Description

The Iris dataset contains measurements of 150 iris flowers from 3 different species:

  • Iris Setosa
  • Iris Versicolor
  • Iris Virginica

Four features are measured (in cm):

  1. Sepal Length
  2. Sepal Width
  3. Petal Length
  4. Petal Width

Botanical Note

  • Sepal: the outer part of the flower that protects the petals, usually green
  • Petal: the colorful inner part of the flower

Lab Steps

Step 1: Verify and Set Up the Environment

First, we verify that all required libraries are installed.

# Import the required libraries
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set random seed for reproducibility
np.random.seed(42)

# Configure plot style (the 'seaborn-v0_8-darkgrid' name requires matplotlib >= 3.6)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Display versions
print("=" * 60)
print("ENVIRONMENT VERIFICATION")
print("=" * 60)
print(f"Python Version: {sys.version}")
print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"Matplotlib Version: {plt.matplotlib.__version__}")
print(f"Seaborn Version: {sns.__version__}")
print(f"Scikit-learn Version: {sklearn.__version__}")
print("=" * 60)
print("✓ All libraries imported successfully!")
print("=" * 60)
Troubleshooting: Library Not Found

If any library is missing, install it with:

pip install numpy pandas matplotlib seaborn scikit-learn jupyter

In Google Colab, the main libraries are preinstalled.

Step 2: Load Iris Dataset

Scikit-learn ships the Iris dataset built in. Let's load it and explore its structure.

# Load Iris dataset from scikit-learn
iris = datasets.load_iris()

# Display dataset information
print("=" * 60)
print("IRIS DATASET STRUCTURE")
print("=" * 60)
print(f"Type: {type(iris)}")
print(f"\nAvailable keys: {iris.keys()}")
print("\n" + "=" * 60)

# Display dataset description
print("\nDATASET DESCRIPTION:")
print("=" * 60)
print(iris.DESCR[:500] + "...\n")  # First 500 characters
# Extract features and target
X = iris.data  # Features (4 columns)
y = iris.target  # Target (species labels: 0, 1, 2)

# Feature names and target names
feature_names = iris.feature_names
target_names = iris.target_names

print("=" * 60)
print("DATASET COMPONENTS")
print("=" * 60)
print(f"Features shape (X): {X.shape}")
print(f"Target shape (y): {y.shape}")
print(f"\nFeature names: {feature_names}")
print(f"Target names: {target_names}")
print("=" * 60)

Step 3: Explore the Data Structure with Pandas

For easier analysis, we convert the data to a Pandas DataFrame.

# Create DataFrame
df = pd.DataFrame(data=X, columns=feature_names)
df['species'] = y
df['species_name'] = df['species'].apply(lambda x: target_names[x])

print("=" * 60)
print("DATAFRAME STRUCTURE")
print("=" * 60)
print(df.head(10))
print("\n" + "=" * 60)
print("DATAFRAME INFO")
print("=" * 60)
df.info()
# Check for missing values
print("\n" + "=" * 60)
print("MISSING VALUES CHECK")
print("=" * 60)
print(df.isnull().sum())
print("\n✓ No missing values found!")
# Display value counts for each species
print("\n" + "=" * 60)
print("CLASS DISTRIBUTION")
print("=" * 60)
print(df['species_name'].value_counts())
print("\n✓ Balanced dataset - each class has 50 samples")

Step 4: Statistical Summary

Let's look at the descriptive statistics for each feature.

# Overall statistical summary
print("=" * 60)
print("OVERALL STATISTICAL SUMMARY")
print("=" * 60)
print(df[feature_names].describe().round(2))
# Statistical summary by species
print("\n" + "=" * 60)
print("STATISTICAL SUMMARY BY SPECIES")
print("=" * 60)
for species in target_names:
    print(f"\n{species.upper()}:")
    print("-" * 60)
    species_data = df[df['species_name'] == species][feature_names]
    print(species_data.describe().round(2))
Key Observations

Note that:

  • Setosa tends to have smaller petals
  • Virginica tends to be larger overall
  • Versicolor sits in between
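A compact way to confirm these observations is to group the DataFrame by species and compare the feature means. A minimal sketch, rebuilt here to be self-contained (in the notebook you can reuse the `df` from Step 3):

```python
# Per-species feature means - a numeric check of the observations above
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species_name'] = [iris.target_names[t] for t in iris.target]

# One row per species, one column per feature
means = df.groupby('species_name')[iris.feature_names].mean().round(2)
print(means)
```

The petal columns should increase from setosa through versicolor to virginica, matching the bullets above.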

Step 5: Data Visualization - Feature Distributions

Visualization helps us understand the distributions and patterns in the data.

# Create distribution plots for all features
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Distribution of Iris Features', fontsize=16, fontweight='bold')

for idx, feature in enumerate(feature_names):
    row = idx // 2
    col = idx % 2
    ax = axes[row, col]

    # Histogram for each species
    for species in target_names:
        data = df[df['species_name'] == species][feature]
        ax.hist(data, alpha=0.6, label=species, bins=15)

    ax.set_xlabel(feature, fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.set_title(f'{feature} Distribution by Species', fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Step 6: Data Visualization - Scatter Plots

Scatter plots help us see the relationships between features.

# Create pairwise scatter plots
fig, axes = plt.subplots(4, 4, figsize=(14, 12))
fig.suptitle('Pairwise Relationships of Iris Features',
             fontsize=16, fontweight='bold')

colors = ['red', 'green', 'blue']
species_colors = [colors[i] for i in y]

for i in range(4):
    for j in range(4):
        ax = axes[i, j]

        if i == j:
            # Diagonal: histogram
            for idx, species in enumerate(target_names):
                data = df[df['species_name'] == species][feature_names[i]]
                ax.hist(data, alpha=0.6, color=colors[idx], bins=15)
            ax.set_ylabel('Frequency', fontsize=9)
        else:
            # Off-diagonal: scatter plot
            ax.scatter(X[:, j], X[:, i], c=species_colors, alpha=0.6, s=30)

        # Labels
        if i == 3:
            ax.set_xlabel(feature_names[j], fontsize=9)
        if j == 0:
            ax.set_ylabel(feature_names[i], fontsize=9)

        ax.grid(True, alpha=0.3)

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=colors[i], label=target_names[i])
                   for i in range(3)]
fig.legend(handles=legend_elements, loc='upper right', fontsize=10)

plt.tight_layout()
plt.show()
Insights from the Scatter Plots

  • Petal length vs. petal width: shows very clear separation between the species
  • Sepal measurements: overlap more, especially between Versicolor and Virginica
  • This indicates that the petal measurements are more informative for classification

Step 7: Correlation Analysis

Let's analyze the correlations between features.

# Calculate correlation matrix
correlation_matrix = df[feature_names].corr()

print("=" * 60)
print("CORRELATION MATRIX")
print("=" * 60)
print(correlation_matrix.round(3))

# Visualize correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
            center=0, square=True, linewidths=1,
            cbar_kws={"shrink": 0.8}, fmt='.3f')
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
Interpreting the Correlations

  • Petal length and petal width: very high correlation (~0.96)
  • Sepal length and petal length/width: high correlation
  • Sepal width: lower correlation with the other features
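The headline ~0.96 figure can be checked directly with NumPy alone; a small sketch, assuming the usual column order (petal length is column 2, petal width is column 3):

```python
# Verify the petal length / petal width correlation with NumPy
import numpy as np
from sklearn import datasets

X = datasets.load_iris().data
r = np.corrcoef(X[:, 2], X[:, 3])[0, 1]  # Pearson correlation coefficient
print(f"petal length vs petal width: r = {r:.3f}")
```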

Step 8: Train/Test Split

A fundamental concept in ML: splitting the data into training and testing sets.

# Split data into training and testing sets
# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("=" * 60)
print("TRAIN/TEST SPLIT RESULTS")
print("=" * 60)
print(f"Total samples: {len(X)}")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print("\n" + "=" * 60)
print("CLASS DISTRIBUTION IN SPLITS")
print("=" * 60)

# Check class distribution
print("\nTraining set:")
unique, counts = np.unique(y_train, return_counts=True)
for species_idx, count in zip(unique, counts):
    print(f"  {target_names[species_idx]}: {count} samples")

print("\nTesting set:")
unique, counts = np.unique(y_test, return_counts=True)
for species_idx, count in zip(unique, counts):
    print(f"  {target_names[species_idx]}: {count} samples")
Why a Train/Test Split?

  • Training set: used to train the model
  • Testing set: used to evaluate the model on data it has never seen
  • Stratify: keeps the class proportions equal in the training and testing sets
  • Random state: for reproducibility
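What `stratify=y` guarantees can be seen by comparing two splits side by side; a small self-contained sketch:

```python
# With stratify=y the 20% test split keeps exactly 10 samples per class;
# without it, the per-class counts are left to chance.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("stratified test counts:", np.bincount(y_test_strat))  # [10 10 10]
print("plain test counts:     ", np.bincount(y_test_plain))
```

On Iris the unstratified counts are usually close anyway, because the full dataset is balanced; on imbalanced data the difference matters much more.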

Step 9: Feature Scaling (Standardization)

Some ML algorithms are sensitive to feature scale. Let's standardize the data.

# Initialize StandardScaler
scaler = StandardScaler()

# Fit scaler on training data and transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("=" * 60)
print("FEATURE SCALING")
print("=" * 60)
print("\nOriginal training data (first 5 samples):")
print(X_train[:5])
print("\nScaled training data (first 5 samples):")
print(X_train_scaled[:5].round(3))

print("\n" + "=" * 60)
print("SCALING PARAMETERS")
print("=" * 60)
print(f"Mean values: {scaler.mean_.round(3)}")
print(f"Std deviations: {scaler.scale_.round(3)}")
Important: Fit vs. Transform

  • fit_transform on the training set: learns the parameters (mean, std) and transforms
  • transform on the test set: only transforms, using the parameters learned from training
  • NEVER re-fit the scaler on the test set - doing so causes data leakage!
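The rule can be verified numerically: after `fit_transform`, each scaled training feature has mean ≈ 0 and std ≈ 1, while the test set (transformed with the training parameters) only lands near those values. A self-contained sketch:

```python
# Fit the scaler on the training data only; transform both sets with it.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std AND transforms
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print("train means:", X_train_scaled.mean(axis=0).round(3))  # ~[0 0 0 0]
print("test means: ", X_test_scaled.mean(axis=0).round(3))   # close, not exact
```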

Step 10: Train Your First ML Model - Logistic Regression

Now we will train our first classification model!

# Initialize Logistic Regression model
log_reg = LogisticRegression(max_iter=200, random_state=42)

# Train the model
print("=" * 60)
print("TRAINING LOGISTIC REGRESSION MODEL")
print("=" * 60)
log_reg.fit(X_train_scaled, y_train)
print("✓ Model trained successfully!")

# Make predictions
y_pred_train = log_reg.predict(X_train_scaled)
y_pred_test = log_reg.predict(X_test_scaled)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

print("\n" + "=" * 60)
print("MODEL PERFORMANCE")
print("=" * 60)
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Testing Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

Step 11: Evaluate Model Performance

Let's evaluate the model in more detail using the classification report and confusion matrix.

# Classification Report
print("=" * 60)
print("CLASSIFICATION REPORT - LOGISTIC REGRESSION")
print("=" * 60)
print(classification_report(y_test, y_pred_test,
                          target_names=target_names))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Confusion Matrix - Counts
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=target_names, yticklabels=target_names,
            ax=axes[0], cbar_kws={"shrink": 0.8})
axes[0].set_title('Confusion Matrix - Counts', fontweight='bold', fontsize=12)
axes[0].set_ylabel('True Label', fontsize=11)
axes[0].set_xlabel('Predicted Label', fontsize=11)

# Confusion Matrix - Normalized
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Greens',
            xticklabels=target_names, yticklabels=target_names,
            ax=axes[1], cbar_kws={"shrink": 0.8})
axes[1].set_title('Confusion Matrix - Normalized', fontweight='bold', fontsize=12)
axes[1].set_ylabel('True Label', fontsize=11)
axes[1].set_xlabel('Predicted Label', fontsize=11)

plt.tight_layout()
plt.show()

print("\nConfusion Matrix Interpretation:")
print("- Diagonal values: Correct predictions")
print("- Off-diagonal values: Misclassifications")
Understanding the Classification Report

  • Precision: of everything predicted as class X, how much was correct?
  • Recall: of everything that is actually class X, how much was found?
  • F1-Score: harmonic mean of precision and recall
  • Support: the number of true samples of that class in the test set
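These definitions are easy to compute by hand. A sketch on a small made-up binary example (hypothetical labels, not the Iris results), checked against scikit-learn:

```python
# Precision and recall from first principles on hypothetical labels
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # 5 actual positives
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 0])  # 4 predicted positives

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives  -> 3
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives -> 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives -> 2

precision = tp / (tp + fp)  # of all predicted positives, fraction correct: 0.75
recall = tp / (tp + fn)     # of all actual positives, fraction found: 0.6
print(precision, recall)
```

`classification_report` repeats this calculation once per class, treating each class in turn as the positive one.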

Step 12: Compare with a Decision Tree

Let's compare against another algorithm: the Decision Tree. (Tree-based models are insensitive to feature scale, so scaling is not strictly needed for them; we reuse the scaled data here simply for consistency with Step 10.)

# Initialize Decision Tree model
dt_clf = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train the model
print("=" * 60)
print("TRAINING DECISION TREE MODEL")
print("=" * 60)
dt_clf.fit(X_train_scaled, y_train)
print("✓ Model trained successfully!")

# Make predictions
y_pred_dt_train = dt_clf.predict(X_train_scaled)
y_pred_dt_test = dt_clf.predict(X_test_scaled)

# Calculate accuracy
dt_train_accuracy = accuracy_score(y_train, y_pred_dt_train)
dt_test_accuracy = accuracy_score(y_test, y_pred_dt_test)

print("\n" + "=" * 60)
print("MODEL PERFORMANCE")
print("=" * 60)
print(f"Training Accuracy: {dt_train_accuracy:.4f} ({dt_train_accuracy*100:.2f}%)")
print(f"Testing Accuracy: {dt_test_accuracy:.4f} ({dt_test_accuracy*100:.2f}%)")
# Classification Report for Decision Tree
print("=" * 60)
print("CLASSIFICATION REPORT - DECISION TREE")
print("=" * 60)
print(classification_report(y_test, y_pred_dt_test,
                          target_names=target_names))

Step 13: Model Comparison

Let's compare the two models visually.

# Create comparison visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Model accuracy comparison
models = ['Logistic\nRegression', 'Decision\nTree']
train_scores = [train_accuracy, dt_train_accuracy]
test_scores = [test_accuracy, dt_test_accuracy]

x = np.arange(len(models))
width = 0.35

axes[0].bar(x - width/2, train_scores, width, label='Training', alpha=0.8)
axes[0].bar(x + width/2, test_scores, width, label='Testing', alpha=0.8)
axes[0].set_ylabel('Accuracy', fontsize=11)
axes[0].set_title('Model Accuracy Comparison', fontweight='bold', fontsize=12)
axes[0].set_xticks(x)
axes[0].set_xticklabels(models)
axes[0].legend()
axes[0].set_ylim([0.9, 1.01])
axes[0].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, v in enumerate(train_scores):
    axes[0].text(i - width/2, v + 0.005, f'{v:.3f}',
                ha='center', va='bottom', fontsize=9)
for i, v in enumerate(test_scores):
    axes[0].text(i + width/2, v + 0.005, f'{v:.3f}',
                ha='center', va='bottom', fontsize=9)

# Confusion matrices side by side
cm_lr = confusion_matrix(y_test, y_pred_test)
cm_dt = confusion_matrix(y_test, y_pred_dt_test)

# Create combined confusion matrix visualization
combined_data = np.column_stack([cm_lr.flatten(), cm_dt.flatten()])
labels = []
for i, true_label in enumerate(target_names):
    for j, pred_label in enumerate(target_names):
        labels.append(f'{true_label[:4]}\n{pred_label[:4]}')

x_pos = np.arange(len(labels))
axes[1].bar(x_pos - width/2, combined_data[:, 0], width,
           label='Logistic Regression', alpha=0.8)
axes[1].bar(x_pos + width/2, combined_data[:, 1], width,
           label='Decision Tree', alpha=0.8)
axes[1].set_ylabel('Count', fontsize=11)
axes[1].set_title('Prediction Patterns Comparison', fontweight='bold', fontsize=12)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(labels, fontsize=8, rotation=45, ha='right')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Step 14: Feature Importance (Decision Tree)

A Decision Tree reports how important each feature was for its splits.

# Get feature importance
feature_importance = dt_clf.feature_importances_

# Create DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("=" * 60)
print("FEATURE IMPORTANCE (DECISION TREE)")
print("=" * 60)
print(importance_df.to_string(index=False))

# Visualize feature importance
plt.figure(figsize=(10, 6))
bars = plt.barh(importance_df['Feature'], importance_df['Importance'], alpha=0.8)

# Color bars
colors = plt.cm.viridis(importance_df['Importance'] / importance_df['Importance'].max())
for bar, color in zip(bars, colors):
    bar.set_color(color)

plt.xlabel('Importance Score', fontsize=11)
plt.ylabel('Features', fontsize=11)
plt.title('Feature Importance in Decision Tree Classifier',
         fontweight='bold', fontsize=13, pad=20)
plt.grid(True, alpha=0.3, axis='x')

# Add value labels
for i, (feature, importance) in enumerate(zip(importance_df['Feature'],
                                               importance_df['Importance'])):
    plt.text(importance + 0.01, i, f'{importance:.4f}',
            va='center', fontsize=10)

plt.tight_layout()
plt.show()
Interpreting Feature Importance

  • The higher the value, the more important that feature is for classification
  • The petal measurements are usually more important than the sepal measurements
  • This is consistent with our earlier visual observations
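Impurity-based importances can be cross-checked with permutation importance, which measures how much test accuracy drops when a single feature's values are shuffled. A sketch (hyperparameters mirror Step 12; `n_repeats=20` is an arbitrary choice):

```python
# Cross-check the tree's impurity-based importances with permutation importance
import numpy as np
from sklearn import datasets
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)

dt = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Shuffle each feature 20 times on the test set and average the accuracy drop
result = permutation_importance(dt, X_test, y_test, n_repeats=20,
                                random_state=42)
for name, imp in zip(iris.feature_names, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Here too, the petal features should dominate, in line with the impurity-based scores above.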

Step 15: Make Predictions on New Data

Let's predict the species of some new iris flowers!

# Create some hypothetical new iris measurements
new_samples = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Likely Setosa (small petals)
    [6.5, 3.0, 5.5, 1.8],  # Likely Virginica (large petals)
    [5.9, 3.0, 4.2, 1.5],  # Likely Versicolor (medium petals)
])

print("=" * 60)
print("PREDICTIONS ON NEW DATA")
print("=" * 60)
print("\nNew Iris Measurements:")
print(pd.DataFrame(new_samples, columns=feature_names))

# Scale the new data
new_samples_scaled = scaler.transform(new_samples)

# Make predictions with both models
pred_lr = log_reg.predict(new_samples_scaled)
pred_dt = dt_clf.predict(new_samples_scaled)

# Get prediction probabilities (Logistic Regression)
pred_proba_lr = log_reg.predict_proba(new_samples_scaled)

print("\n" + "=" * 60)
print("PREDICTIONS")
print("=" * 60)
for i in range(len(new_samples)):
    print(f"\nSample {i+1}:")
    print(f"  Measurements: {new_samples[i]}")
    print(f"  Logistic Regression: {target_names[pred_lr[i]]}")
    print(f"  Decision Tree: {target_names[pred_dt[i]]}")
    print(f"  Probabilities (LR):")
    for j, species in enumerate(target_names):
        print(f"    - {species}: {pred_proba_lr[i][j]:.4f} ({pred_proba_lr[i][j]*100:.2f}%)")

Summary and Key Takeaways

What Have We Learned?

  1. Environment Setup: how to set up a Python environment for machine learning
  2. Data Loading: loading a dataset from scikit-learn
  3. EDA: exploratory analysis with statistics and visualization
  4. Data Preparation: train/test split and feature scaling
  5. Model Training: training two classification models (Logistic Regression and Decision Tree)
  6. Model Evaluation: using accuracy, the classification report, and the confusion matrix
  7. Model Comparison: comparing the performance of different algorithms
  8. Predictions: using a model to predict new data

Results

  • Both models reach high accuracy (>95%) on the Iris dataset
  • Iris Setosa is the easiest species to separate
  • The petal measurements are more informative than the sepal measurements
  • No significant overfitting (training and testing accuracy are similar)

Next Steps

In the upcoming labs you will learn about:

  • Other classification algorithms (SVM, KNN, Random Forest)
  • Cross-validation for more robust evaluation
  • Hyperparameter tuning
  • Handling imbalanced datasets
  • Feature engineering

Troubleshooting Common Issues

Issue 1: Import Error

Problem: ModuleNotFoundError: No module named 'sklearn'

Solution:

pip install scikit-learn
# or, with conda
conda install scikit-learn

Issue 2: Matplotlib Display

Problem: Plots do not appear in the Jupyter Notebook

Solution:

%matplotlib inline
# Add this to the first cell of the notebook

Issue 3: Different Results

Problem: Results differ on every run

Solution:

# Set a random seed for reproducibility
np.random.seed(42)
# Also pass a random_state parameter to the ML functions

Issue 4: Memory Warning

Problem: Warnings about memory usage

Solution:

# For large datasets, use batch processing,
# or increase the RAM allocated to Jupyter

Resources and References

Documentation

Dataset Information

  • UCI Iris Dataset
  • Fisher, R.A. (1936). "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7(2), 179-188.