Lab 10: Building a RAG System with FAISS

Retrieval-Augmented Generation for Q&A and Document Processing

Author

Pembelajaran Mesin - Data Science for Cybersecurity

Published

December 15, 2025

24 Introduction

24.1 Learning Objectives

After completing this lab, you should be able to:

  1. Understand the Retrieval-Augmented Generation (RAG) architecture
  2. Use sentence-transformers to create document embeddings
  3. Build a vector database with FAISS
  4. Perform similarity search and retrieval
  5. Implement a RAG-based Q&A system
  6. Optimize performance with indexing strategies
  7. Evaluate retrieval and ranking quality
  8. Handle edge cases and errors

24.2 Lab Overview

This lab focuses on building a RAG system that integrates document retrieval with language model capabilities.

24.2.1 The RAG Concept

Retrieval-Augmented Generation (RAG) is a technique that combines:

  • Retrieval: fetching relevant documents from a knowledge base
  • Augmentation: adding the retrieved documents to the prompt
  • Generation: using an LLM to generate the answer

graph LR
    A[Query/Question] --> B[Embedding]
    B --> C[Vector Search]
    C --> D[Retrieve Documents]
    D --> E[Create Context]
    E --> F[Generate Answer]
    F --> G[Response]

    style A fill:#e6f3ff
    style B fill:#ffe6e6
    style C fill:#ffffcc
    style D fill:#ccffcc
    style E fill:#e6ccff
    style F fill:#ffcccc
    style G fill:#ccffe6

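In code, these three stages map onto a short loop. A minimal sketch, assuming hypothetical embed(), vector_search(), and llm_generate() helpers (each stage is implemented concretely later in this lab):

# Minimal RAG loop (illustrative; embed, vector_search, and llm_generate
# are hypothetical placeholders that this lab builds out step by step)
def answer_with_rag(question: str, top_k: int = 3) -> str:
    query_vector = embed(question)                  # Retrieval: embed the query
    documents = vector_search(query_vector, top_k)  # Retrieval: nearest chunks
    context = "\n\n".join(documents)                # Augmentation: build context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm_generate(prompt)                     # Generation: LLM answers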

24.2.2 Lab Structure

graph TD
    A[Setup & Installation] --> B[Document Loading]
    B --> C[Text Preprocessing]
    C --> D[Embedding Creation]
    D --> E[FAISS Indexing]
    E --> F[Basic Retrieval]
    F --> G[QA System]
    G --> H[Optimization]

    style A fill:#e6f3ff
    style B fill:#ffe6e6
    style C fill:#ffffcc
    style D fill:#ccffcc
    style E fill:#e6ccff
    style F fill:#ffcccc
    style G fill:#ccffe6
    style H fill:#ffccff


24.3 Environment Setup

24.3.1 Install Dependencies

import subprocess
import sys

packages = [
    'sentence-transformers',
    'faiss-cpu',      # Use 'faiss-gpu' if you have a GPU
    'langchain',
    'pypdf',
    'python-dotenv'
]

for package in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

print("✓ All packages installed successfully!")

24.3.2 Import Libraries

# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Text processing
import re
from collections import Counter
from typing import List, Tuple, Dict

# Embeddings & Vector Search
from sentence_transformers import SentenceTransformer
import faiss

# File handling
from pypdf import PdfReader
import json

# Utilities
from tqdm import tqdm
import time

print("✓ All imports successful!")

24.3.3 Setup Directories

# Create directories
dirs = {
    'data': Path('data'),
    'embeddings': Path('embeddings'),
    'indices': Path('indices'),
    'results': Path('results'),
    'figures': Path('figures'),
}

for name, path in dirs.items():
    path.mkdir(exist_ok=True, parents=True)
    print(f"✓ {name}: {path}")

24.3.4 Configure Settings

# Configuration
CONFIG = {
    'embedding_model': 'sentence-transformers/all-MiniLM-L6-v2',  # Fast & accurate
    'chunk_size': 300,        # Characters per chunk
    'chunk_overlap': 50,      # Overlap between chunks
    'similarity_threshold': 0.5,
    'top_k': 3,              # Retrieve top 3 documents
    'embedding_dim': 384,    # Output dimension of all-MiniLM-L6-v2
    'seed': 42,
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

np.random.seed(CONFIG['seed'])

25 Part 1: Document Loading & Preprocessing

25.1 Load Sample Documents

def create_sample_documents():
    """Create sample documents for demonstration"""

    documents = [
        {
            'title': 'Machine Learning Basics',
            'content': '''Machine Learning is a branch of Artificial Intelligence that enables
            computers to learn from data without being explicitly programmed. There are three main types:
            1. Supervised Learning: learning from labeled data
            2. Unsupervised Learning: finding patterns in unlabeled data
            3. Reinforcement Learning: learning through interaction and rewards
            Machine learning has been applied in many domains such as computer vision,
            natural language processing, and predictive analytics.'''
        },
        {
            'title': 'Deep Learning & Neural Networks',
            'content': '''Deep Learning uses artificial neural networks with multiple layers
            to extract complex features. Popular architectures include:
            - Convolutional Neural Networks (CNN): for image processing
            - Recurrent Neural Networks (RNN): for sequence data
            - Transformer Networks: for NLP tasks
            Deep learning has achieved state-of-the-art performance in applications
            such as image recognition, machine translation, and speech recognition.'''
        },
        {
            'title': 'Natural Language Processing',
            'content': '''NLP is the technology that enables computers to understand and process human language.
            Core NLP tasks include:
            - Tokenization: splitting text into tokens
            - Named Entity Recognition: identifying entities such as names and locations
            - Sentiment Analysis: determining the sentiment of a text
            - Machine Translation: translating between languages
            Modern NLP relies on powerful transformer-based models such as BERT and GPT.'''
        },
        {
            'title': 'Computer Vision Applications',
            'content': '''Computer Vision is the branch of AI focused on enabling computers to
            understand and analyze visual information from the surrounding world. Main applications:
            - Image Classification: categorizing images
            - Object Detection: detecting and localizing objects in an image
            - Semantic Segmentation: pixel-level segmentation of images
            - Face Recognition: identifying and verifying faces
            CNN architectures such as ResNet and YOLOv5 achieve high accuracy on these tasks.'''
        },
        {
            'title': 'Reinforcement Learning',
            'content': '''Reinforcement Learning is a machine learning paradigm in which an agent learns
            by interacting with an environment and receiving rewards or penalties.
            Key concepts:
            - Agent: the entity that takes actions
            - Environment: the system that responds to actions
            - Reward: a signal indicating the quality of an action
            - Policy: the agent's strategy for choosing actions
            RL is used in game AI, robotics, and autonomous systems.'''
        }
    ]

    return documents

# Create documents
documents = create_sample_documents()
print(f"✓ Created {len(documents)} sample documents")
print("\nDocuments:")
for i, doc in enumerate(documents, 1):
    print(f"  {i}. {doc['title']}")

25.2 Text Preprocessing

def preprocess_text(text: str) -> str:
    """Preprocess text: lowercase, remove extra spaces"""
    # Lowercase
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove special characters (keep basic punctuation)
    text = re.sub(r'[^\w\s.,!?-]', '', text)

    return text

def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks"""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # Reached the end; avoid emitting an extra overlap-only chunk
        start = end - overlap  # Step forward, keeping `overlap` chars of context

    return chunks

# Create chunks from documents
document_chunks = []
chunk_metadata = []

for doc in documents:
    content = doc['content']
    preprocessed = preprocess_text(content)
    chunks = chunk_text(preprocessed, CONFIG['chunk_size'], CONFIG['chunk_overlap'])

    for chunk_idx, chunk in enumerate(chunks):
        document_chunks.append(chunk)
        chunk_metadata.append({
            'document_title': doc['title'],
            'chunk_index': chunk_idx,
            'chunk_text': chunk,
            'original_length': len(doc['content'])
        })

print(f"✓ Created {len(document_chunks)} chunks from documents")
print(f"  Average chunk length: {np.mean([len(c) for c in document_chunks]):.0f} chars")
print(f"\nFirst chunk example:")
print(f"  {document_chunks[0][:200]}...")

26 Part 2: Embedding Creation & FAISS Indexing

26.1 Load Embedding Model

def load_embedding_model(model_name: str):
    """Load pre-trained embedding model"""
    print(f"Loading embedding model: {model_name}")
    model = SentenceTransformer(model_name)
    print(f"✓ Model loaded!")
    print(f"  Model dimension: {model.get_sentence_embedding_dimension()}")

    return model

# Load model
embedding_model = load_embedding_model(CONFIG['embedding_model'])

26.2 Create Embeddings

def create_embeddings(texts: List[str], model, batch_size: int = 32):
    """Create embeddings for texts"""
    print(f"Creating embeddings for {len(texts)} texts...")

    embeddings = model.encode(texts, show_progress_bar=True, batch_size=batch_size)

    print(f"✓ Embeddings created!")
    print(f"  Shape: {embeddings.shape}")
    print(f"  Data type: {embeddings.dtype}")

    return embeddings

# Create embeddings
embeddings = create_embeddings(document_chunks, embedding_model)

# Save embeddings
np.save(dirs['embeddings'] / 'document_embeddings.npy', embeddings)
print(f"✓ Embeddings saved to {dirs['embeddings']}/document_embeddings.npy")

26.3 Build FAISS Index

def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """Build FAISS index using L2 distance"""
    print(f"Building FAISS index...")

    # FAISS expects C-contiguous float32 vectors
    embeddings = np.ascontiguousarray(embeddings, dtype='float32')

    # Create L2 index (Euclidean distance)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)

    # Add vectors to index
    index.add(embeddings)

    print(f"✓ FAISS index created!")
    print(f"  Total vectors: {index.ntotal}")
    print(f"  Vector dimension: {dimension}")

    return index

# Build index
faiss_index = build_faiss_index(embeddings)

# Save index
index_path = dirs['indices'] / 'documents.index'
faiss.write_index(faiss_index, str(index_path))
print(f"✓ Index saved to {index_path}")

26.4 Verify Index

def verify_index(index: faiss.Index, embeddings: np.ndarray):
    """Verify index consistency"""
    print("Verifying index...")

    # Test with first embedding
    test_embedding = embeddings[:1].astype('float32')
    distances, indices = index.search(test_embedding, 5)

    print(f"✓ Index verification successful!")
    print(f"  Nearest neighbors to first embedding:")
    for i, (dist, idx) in enumerate(zip(distances[0], indices[0]), 1):
        print(f"    {i}. Index {idx}, Distance: {dist:.4f}")

verify_index(faiss_index, embeddings)
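
IndexFlatL2 ranks by (squared) Euclidean distance. If cosine similarity is preferred, a common alternative is an inner-product index over L2-normalized vectors; a minimal sketch, assuming queries are normalized the same way:

# Alternative: cosine similarity via inner product on normalized vectors
ip_embeddings = embeddings.astype('float32').copy()
faiss.normalize_L2(ip_embeddings)            # In-place L2 normalization
ip_index = faiss.IndexFlatIP(ip_embeddings.shape[1])
ip_index.add(ip_embeddings)

query_vec = embedding_model.encode(["what is machine learning"]).astype('float32')
faiss.normalize_L2(query_vec)                # Queries must be normalized too
scores, ids = ip_index.search(query_vec, 3)  # Scores are now cosine similarities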

27 Part 3: Document Retrieval System

27.1 Basic Retrieval

def retrieve_similar_documents(query: str,
                              embedding_model,
                              faiss_index: faiss.Index,
                              document_chunks: List[str],
                              chunk_metadata: List[Dict],
                              k: int = 3) -> List[Dict]:
    """Retrieve the k most similar chunks for a query"""

    # Embed query (FAISS requires float32)
    query_embedding = embedding_model.encode([query], convert_to_numpy=True).astype('float32')

    # Search in FAISS (IndexFlatL2 returns squared L2 distances)
    distances, indices = faiss_index.search(query_embedding, k)

    # Prepare results
    results = []
    for rank, (distance, idx) in enumerate(zip(distances[0], indices[0]), 1):
        metadata = chunk_metadata[idx]
        results.append({
            'rank': rank,
            'distance': float(distance),
            'similarity_score': 1 / (1 + distance),  # Map distance to a (0, 1] score
            'document_title': metadata['document_title'],
            'chunk_text': document_chunks[idx],
            'chunk_index': int(idx)
        })

    return results

# Test retrieval
queries = [
    "What is machine learning?",
    "How does deep learning work?",
    "What are some NLP applications?",
]

print("=" * 80)
print("RETRIEVAL RESULTS")
print("=" * 80)

for query in queries:
    print(f"\nQuery: {query}")
    results = retrieve_similar_documents(query, embedding_model, faiss_index,
                                        document_chunks, chunk_metadata, k=2)

    for result in results:
        print(f"\n  [{result['rank']}] {result['document_title']} (Similarity: {result['similarity_score']:.3f})")
        print(f"      {result['chunk_text'][:150]}...")

27.2 Evaluate Retrieval Quality

def evaluate_retrieval(results: List[Dict]) -> Dict:
    """Evaluate retrieval quality"""

    if not results:
        return {}

    avg_similarity = np.mean([r['similarity_score'] for r in results])
    max_similarity = max([r['similarity_score'] for r in results])
    min_similarity = min([r['similarity_score'] for r in results])

    return {
        'num_results': len(results),
        'avg_similarity': avg_similarity,
        'max_similarity': max_similarity,
        'min_similarity': min_similarity,
        'top_document': results[0]['document_title']
    }

# Evaluate queries
print("\n" + "=" * 80)
print("RETRIEVAL EVALUATION")
print("=" * 80)

for query in queries:
    results = retrieve_similar_documents(query, embedding_model, faiss_index,
                                        document_chunks, chunk_metadata, k=3)
    evaluation = evaluate_retrieval(results)

    print(f"\nQuery: {query}")
    print(f"  Top Document: {evaluation['top_document']}")
    print(f"  Avg Similarity: {evaluation['avg_similarity']:.3f}")
    print(f"  Similarity Range: [{evaluation['min_similarity']:.3f}, {evaluation['max_similarity']:.3f}]")

28 Part 4: QA System Implementation

28.1 Simple QA System

class SimpleRAGSystem:
    """Simple RAG system for Q&A"""

    def __init__(self, embedding_model, faiss_index, document_chunks, chunk_metadata):
        self.embedding_model = embedding_model
        self.faiss_index = faiss_index
        self.document_chunks = document_chunks
        self.chunk_metadata = chunk_metadata

    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """Retrieve relevant documents"""

        # Embed query
        query_embedding = self.embedding_model.encode([query], convert_to_numpy=True).astype('float32')

        # Search
        distances, indices = self.faiss_index.search(query_embedding, top_k)

        # Format results
        results = []
        for rank, (distance, idx) in enumerate(zip(distances[0], indices[0]), 1):
            similarity = 1 / (1 + distance)  # Same distance-to-similarity mapping as in Part 3
            results.append({
                'rank': rank,
                'similarity': similarity,
                'text': self.document_chunks[idx],
                'metadata': self.chunk_metadata[idx]
            })

        return results

    def build_context(self, results: List[Dict]) -> str:
        """Build context from retrieved documents"""
        context = "Dokumen Relevan:\n"
        for result in results:
            context += f"\n[Dokumen {result['rank']}] {result['metadata']['document_title']}:\n"
            context += f"{result['text']}\n"

        return context

    def generate_answer(self, query: str, top_k: int = 3) -> Dict:
        """Generate answer from retrieved context"""

        # Retrieve
        results = self.retrieve(query, top_k)

        # Build context
        context = self.build_context(results)

        # Create prompt (a real system would pass this to an LLM)
        prompt = f"""Based on the following documents, answer the question clearly and concisely.

{context}

Question: {query}
Answer:"""

        return {
            'query': query,
            'retrieved_documents': len(results),
            'context': context,
            'prompt': prompt,
            'retrieval_results': results
        }

# Initialize QA system
qa_system = SimpleRAGSystem(embedding_model, faiss_index, document_chunks, chunk_metadata)

# Test QA
test_questions = [
    "What is the difference between supervised and unsupervised learning?",
    "Explain the CNN architecture and its uses",
    "How does reinforcement learning differ from supervised learning?"
]

print("=" * 80)
print("QA SYSTEM TEST")
print("=" * 80)

for question in test_questions:
    result = qa_system.generate_answer(question, top_k=2)

    print(f"\nQuestion: {question}")
    print(f"Retrieved Documents: {result['retrieved_documents']}")
    print(f"\nContext Preview:")
    print(result['context'][:300] + "...")
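
generate_answer() stops at prompt construction. To produce an actual answer, the prompt can be passed to any LLM; a minimal sketch using the Hugging Face transformers pipeline (the model name here is only a placeholder, not part of this lab):

# Sketch: feed the RAG prompt to a local LLM via transformers
# (assumes `pip install transformers`; swap in an instruction-tuned model)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # Placeholder model
rag_output = qa_system.generate_answer(test_questions[0], top_k=2)
completion = generator(rag_output['prompt'], max_new_tokens=128)
print(completion[0]['generated_text'])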

29 Part 5: Performance Analysis

29.1 Speed Benchmark

def benchmark_retrieval(qa_system, queries: List[str], iterations: int = 10):
    """Benchmark retrieval speed"""

    times = []

    for _ in range(iterations):
        for query in queries:
            start = time.time()
            qa_system.retrieve(query, top_k=3)
            elapsed = time.time() - start
            times.append(elapsed)

    times = np.array(times) * 1000  # Convert to milliseconds

    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'min': np.min(times),
        'max': np.max(times),
        'percentile_95': np.percentile(times, 95)
    }

# Benchmark
print("Benchmarking retrieval speed...")
benchmark_results = benchmark_retrieval(qa_system, queries, iterations=5)

print("\nRetrieval Speed Benchmark:")
print(f"  Mean: {benchmark_results['mean']:.2f} ms")
print(f"  Std Dev: {benchmark_results['std']:.2f} ms")
print(f"  Min: {benchmark_results['min']:.2f} ms")
print(f"  Max: {benchmark_results['max']:.2f} ms")
print(f"  95th Percentile: {benchmark_results['percentile_95']:.2f} ms")

29.2 Similarity Distribution Analysis

def analyze_similarity_distribution(qa_system, queries: List[str]):
    """Analyze similarity score distribution"""

    all_similarities = []

    for query in queries:
        results = qa_system.retrieve(query, top_k=5)
        similarities = [r['similarity'] for r in results]
        all_similarities.extend(similarities)

    return {
        'mean': np.mean(all_similarities),
        'std': np.std(all_similarities),
        'min': np.min(all_similarities),
        'max': np.max(all_similarities),
        'median': np.median(all_similarities),
        'values': all_similarities
    }

# Analyze distribution
similarity_analysis = analyze_similarity_distribution(qa_system, queries)

print("\nSimilarity Score Analysis:")
print(f"  Mean: {similarity_analysis['mean']:.3f}")
print(f"  Median: {similarity_analysis['median']:.3f}")
print(f"  Std Dev: {similarity_analysis['std']:.3f}")
print(f"  Range: [{similarity_analysis['min']:.3f}, {similarity_analysis['max']:.3f}]")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(similarity_analysis['values'], bins=20, color='steelblue', edgecolor='black')
axes[0].set_xlabel('Similarity Score')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Similarity Scores')
axes[0].grid(True, alpha=0.3)

# Box plot
axes[1].boxplot(similarity_analysis['values'], vert=True)
axes[1].set_ylabel('Similarity Score')
axes[1].set_title('Box Plot of Similarity Scores')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(dirs['figures'] / 'similarity_distribution.png', dpi=300, bbox_inches='tight')
print("\n✓ Figure saved: similarity_distribution.png")

30 Part 6: Advanced Topics

30.1 Document Batch Processing

def process_documents_batch(document_list: List[Dict],
                           chunk_size: int = 300,
                           overlap: int = 50) -> Tuple[List[str], List[Dict]]:
    """Preprocess and chunk a batch of documents"""

    all_chunks = []
    all_metadata = []

    for doc_idx, doc in enumerate(tqdm(document_list, desc="Processing documents")):
        content = doc.get('content', '')
        preprocessed = preprocess_text(content)
        chunks = chunk_text(preprocessed, chunk_size, overlap)

        for chunk_idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            all_metadata.append({
                'doc_id': doc_idx,
                'doc_title': doc.get('title', 'Unknown'),
                'chunk_id': chunk_idx,
                'chunk_text': chunk,
            })

    return all_chunks, all_metadata

# Example usage
processed_chunks, processed_metadata = process_documents_batch(documents)
print(f"Processed {len(processed_chunks)} chunks from {len(documents)} documents")

31 Summary & Best Practices

31.1 Key Takeaways

  1. RAG Architecture: Combines retrieval + generation for better answers
  2. Embeddings: Semantic representations enable meaningful similarity search
  3. FAISS: Efficient vector indexing for fast retrieval
  4. Chunking: Balance between context preservation and retrieval precision
  5. Evaluation: Assess retrieval quality through similarity metrics

31.2 Performance Tips

  • Use smaller embedding models for speed (all-MiniLM-L6-v2)
  • Build IVF indices for large document collections (millions of chunks); see the sketch after this list
  • Batch embeddings for efficiency
  • Cache results for frequently asked questions
  • Consider re-ranking with more expensive models
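
A flat index compares the query against every stored vector, which stops scaling around millions of chunks. An IVF index clusters the vectors and searches only a few clusters per query. A minimal sketch, where corpus_embeddings is a hypothetical large float32 array (the toy corpus in this lab is too small to train an IVF index meaningfully):

# Sketch: IVF index for large collections
# (`corpus_embeddings` is a hypothetical large float32 array)
dimension = corpus_embeddings.shape[1]
nlist = 100                                # Number of clusters (illustrative)
quantizer = faiss.IndexFlatL2(dimension)   # Coarse quantizer over cluster centroids
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
ivf_index.train(corpus_embeddings)         # k-means training is required before add()
ivf_index.add(corpus_embeddings)
ivf_index.nprobe = 10                      # Clusters probed per query (recall vs. speed)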

31.3 Best Practices

  • Document chunks should be self-contained
  • Include metadata with each chunk
  • Experiment with chunk size (typically 200-500 chars)
  • Use a similarity threshold to filter low-quality results (see the sketch after this list)
  • Monitor retrieval metrics in production
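
The similarity_threshold value from CONFIG can be applied as a simple post-filter on the Part 3 result schema; a minimal sketch:

# Sketch: drop retrieval hits below the configured similarity threshold
def filter_by_threshold(results: List[Dict],
                        threshold: float = CONFIG['similarity_threshold']) -> List[Dict]:
    """Keep only results whose similarity_score clears the threshold"""
    return [r for r in results if r['similarity_score'] >= threshold]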

31.4 Extensions

  • Add filtering by metadata/tags
  • Implement re-ranking with cross-encoders (sketched after this list)
  • Use dense passage retrieval (DPR)
  • Integrate with LLMs for answer generation
  • Add feedback loop for continuous improvement
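
Of these, cross-encoder re-ranking is often the highest-impact upgrade: over-retrieve candidates with FAISS, then rescore (query, passage) pairs with a slower but more accurate model. A sketch using sentence-transformers' CrossEncoder (the checkpoint name is one common public model, assumed here):

# Sketch: re-rank FAISS candidates with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = "What is machine learning?"
candidates = qa_system.retrieve(query, top_k=5)    # Over-retrieve with FAISS
pairs = [(query, c['text']) for c in candidates]
scores = reranker.predict(pairs)                   # One relevance score per pair

# Sort candidates by cross-encoder score, best first
reranked = [c for _, c in sorted(zip(scores, candidates),
                                 key=lambda t: t[0], reverse=True)]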

32 Appendix: Code Reference

32.1 Complete RAG Pipeline

def complete_rag_pipeline(query: str,
                         documents: List[Dict],
                         embedding_model,
                         chunk_size: int = 300):
    """Complete pipeline from documents to answer"""

    # Step 1: Preprocess and chunk
    chunks, metadata = [], []
    for doc in documents:
        text = preprocess_text(doc['content'])
        doc_chunks = chunk_text(text, chunk_size)
        chunks.extend(doc_chunks)
        for idx, chunk in enumerate(doc_chunks):
            metadata.append({'title': doc['title'], 'chunk_id': idx})

    # Step 2: Create embeddings
    embeddings = embedding_model.encode(chunks, convert_to_numpy=True).astype('float32')

    # Step 3: Build index
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    # Step 4: Retrieve
    query_embedding = embedding_model.encode([query], convert_to_numpy=True).astype('float32')
    distances, indices = index.search(query_embedding, 3)

    # Step 5: Format results
    results = []
    for distance, idx in zip(distances[0], indices[0]):
        results.append({
            'document': metadata[idx]['title'],
            'similarity': 1 / (1 + distance),
            'text': chunks[idx]
        })

    return results

# Usage
qa_results = complete_rag_pipeline("machine learning definition", documents, embedding_model)
for result in qa_results:
    print(f"{result['document']} ({result['similarity']:.3f}): {result['text'][:100]}...")

Lab created for the Pembelajaran Mesin course (Data Science for Cybersecurity). Last updated: December 2025.