Bab 10: RAG & AI Agents

Retrieval-Augmented Generation, Vector Databases & Intelligent Agents

🎯 Hasil Pembelajaran (Learning Outcomes)

Setelah mempelajari bab ini, Anda akan mampu:

  1. Memahami konsep Retrieval-Augmented Generation (RAG) dan motivasinya
  2. Mengidentifikasi komponen-komponen sistem RAG (embeddings, vector DB, retrieval)
  3. Mengimplementasikan vector databases dengan FAISS dan ChromaDB
  4. Membangun RAG pipeline sederhana dengan LangChain
  5. Menerapkan chunking strategies dan semantic search
  6. Mengembangkan AI agents dengan tools dan memory
  7. Mengevaluasi kualitas RAG systems menggunakan berbagai metrics

10.1 Dari LLM ke RAG: Mengapa Kita Butuh Retrieval?

10.1.1 Limitasi Large Language Models

Masalah Fundamental LLMs:

Meskipun powerful, LLMs seperti GPT dan Llama memiliki critical limitations:

1. Knowledge Cutoff 📅:

  • Model hanya “tahu” data sampai tanggal training
  • GPT-4 (trained 2023) tidak tahu berita 2024
  • Tidak bisa update knowledge tanpa re-training (costly!)

2. Hallucination 🎭:

  • LLMs bisa generate jawaban yang sounds plausible tapi faktanya salah
  • Tidak ada grounding ke factual sources
  • Berbahaya untuk critical applications (medical, legal, financial)

3. Domain-Specific Knowledge 🏢:

  • LLMs tidak tahu internal company documents
  • Tidak punya akses ke proprietary data
  • Tidak bisa query real-time databases

4. No Source Attribution 📚:

  • Tidak bisa cite sources
  • Sulit verify informasi
  • Compliance & legal issues
💡 Contoh Problem

User: “Berapa harga saham Apple hari ini?”

LLM tanpa RAG:

  • “Maaf, saya tidak punya data real-time…”
  • Atau worse: hallucinates a price!

LLM dengan RAG:

  1. Retrieve dari stock API/database
  2. Ground response pada data terkini
  3. Jawaban akurat dengan source citation

10.1.2 Solusi: Retrieval-Augmented Generation (RAG)

Definisi:

RAG adalah teknik yang mengkombinasikan retrieval (pencarian informasi dari knowledge base) dengan generation (LLM text generation) untuk menghasilkan jawaban yang lebih akurat, factual, dan dapat diverifikasi.
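Alur retrieval + generation dalam definisi di atas dapat disketsakan dalam beberapa baris Python. Ini ilustrasi minimal: fungsi `embed` di bawah hanyalah stub bag-of-characters (bukan model embedding sungguhan), dan langkah generation diganti dengan mencetak prompt yang akan dikirim ke LLM.

```python
import numpy as np

def embed(text):
    # Stub embedding: frekuensi huruf + normalisasi L2 (hanya ilustrasi,
    # pengganti model embedding sungguhan)
    vec = np.zeros(26)
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, corpus, k=2):
    # Ambil k dokumen dengan cosine similarity tertinggi terhadap query
    q = embed(query)
    return sorted(corpus, key=lambda doc: -np.dot(q, embed(doc)))[:k]

def build_rag_prompt(query, corpus, k=2):
    # Retrieval + konstruksi prompt; di sistem nyata prompt ini dikirim ke LLM
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "RAG menggabungkan retrieval dengan generation untuk jawaban yang grounded",
    "Jakarta adalah ibu kota Indonesia",
]
print(build_rag_prompt("Apa itu RAG?", corpus, k=1))
```

Sketsa ini hanya menunjukkan urutan langkahnya; bagian-bagian selanjutnya mengganti setiap stub dengan komponen sungguhan (sentence transformers, vector database, dan LLM).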

Arsitektur RAG:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4A90E2', 'primaryTextColor': '#333', 'primaryBorderColor': '#2E5C8A', 'lineColor': '#2E5C8A', 'secondaryColor': '#50C878', 'tertiaryColor': '#FFD700'}}}%%
graph TB
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]

    D[Document Corpus] --> E[Text Chunking]
    E --> F[Embedding Generation]
    F --> G[(Vector Database)]

    C --> G
    G --> H[Retrieve Top-K<br/>Relevant Chunks]

    H --> I[Construct Prompt]
    A --> I
    I --> J[LLM Generation]
    J --> K[Response with<br/>Source Citations]

    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style K fill:#50C878,stroke:#2E8B57,stroke-width:2px,color:#fff
    style G fill:#FFD700,stroke:#FFA500,stroke-width:2px,color:#333
    style J fill:#FF6B6B,stroke:#C92A2A,stroke-width:2px,color:#fff

Cara Kerja RAG:

import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Step 1: Document Processing (Left)
ax.add_patch(FancyBboxPatch((0.5, 7.5), 2, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#4A90E2', facecolor='#E3F2FD', linewidth=2))
ax.text(1.5, 8.5, 'Document\nCorpus', ha='center', va='center', fontsize=11, fontweight='bold')

ax.arrow(1.5, 7.5, 0, -0.8, head_width=0.15, head_length=0.15, fc='#2E5C8A', ec='#2E5C8A')

ax.add_patch(FancyBboxPatch((0.5, 5.5), 2, 1, boxstyle="round,pad=0.1",
                             edgecolor='#4A90E2', facecolor='#E3F2FD', linewidth=2))
ax.text(1.5, 6, 'Chunking', ha='center', va='center', fontsize=10, fontweight='bold')

ax.arrow(1.5, 5.5, 0, -0.8, head_width=0.15, head_length=0.15, fc='#2E5C8A', ec='#2E5C8A')

ax.add_patch(FancyBboxPatch((0.5, 3.5), 2, 1, boxstyle="round,pad=0.1",
                             edgecolor='#4A90E2', facecolor='#E3F2FD', linewidth=2))
ax.text(1.5, 4, 'Embeddings', ha='center', va='center', fontsize=10, fontweight='bold')

ax.arrow(1.5, 3.5, 0.8, -0.5, head_width=0.15, head_length=0.15, fc='#2E5C8A', ec='#2E5C8A')

# Vector Database (Center)
ax.add_patch(FancyBboxPatch((3, 1.5), 2.5, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#FFA500', facecolor='#FFF8DC', linewidth=3))
ax.text(4.25, 2.25, 'Vector\nDatabase', ha='center', va='center', fontsize=11, fontweight='bold')

# Step 2: Query Processing (Right)
ax.add_patch(FancyBboxPatch((7, 7.5), 2, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#50C878', facecolor='#E8F5E9', linewidth=2))
ax.text(8, 8.5, 'User\nQuery', ha='center', va='center', fontsize=11, fontweight='bold')

ax.arrow(8, 7.5, 0, -1.3, head_width=0.15, head_length=0.15, fc='#2E8B57', ec='#2E8B57')

ax.add_patch(FancyBboxPatch((7, 5), 2, 1, boxstyle="round,pad=0.1",
                             edgecolor='#50C878', facecolor='#E8F5E9', linewidth=2))
ax.text(8, 5.5, 'Query\nEmbedding', ha='center', va='center', fontsize=10, fontweight='bold')

ax.arrow(8, 5, -2.5, -2, head_width=0.15, head_length=0.15, fc='#2E8B57', ec='#2E8B57')

# Retrieval arrow
ax.arrow(5.5, 2.5, 1.3, 1.5, head_width=0.15, head_length=0.15, fc='#9C27B0', ec='#9C27B0')
ax.text(6.5, 4, 'Top-K\nRetrieval', ha='center', va='center', fontsize=9,
        bbox=dict(boxstyle='round', facecolor='#F3E5F5', alpha=0.8))

# LLM Generation
ax.add_patch(FancyBboxPatch((7, 1.5), 2, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#FF6B6B', facecolor='#FFEBEE', linewidth=2))
ax.text(8, 2.25, 'LLM\nGeneration', ha='center', va='center', fontsize=11, fontweight='bold')

ax.arrow(8, 1.5, 0, -0.8, head_width=0.15, head_length=0.15, fc='#C92A2A', ec='#C92A2A')

# Final Response
ax.add_patch(FancyBboxPatch((6.5, 0), 3, 0.5, boxstyle="round,pad=0.1",
                             edgecolor='#50C878', facecolor='#C8E6C9', linewidth=3))
ax.text(8, 0.25, 'Response + Citations', ha='center', va='center',
        fontsize=11, fontweight='bold')

# Title
ax.text(5, 9.5, 'RAG Pipeline Architecture', ha='center', va='center',
        fontsize=14, fontweight='bold', color='#333')

plt.tight_layout()
plt.show()

Keuntungan RAG:

| Aspek | LLM Murni | RAG System |
|---|---|---|
| Knowledge | Static (cutoff date) | ✅ Dynamic, updatable |
| Hallucination | Frequent | ✅ Reduced (grounded) |
| Source | No citation | ✅ Traceable sources |
| Domain Data | Generic only | ✅ Custom knowledge base |
| Cost | Low inference | Medium (retrieval + gen) |
| Latency | Fast (~100ms) | Slower (~500ms) |

10.1.3 Use Cases RAG di Industry

1. Customer Support Chatbots 🤖:

  • Knowledge base: Product manuals, FAQs, troubleshooting guides
  • Example: “How do I reset my password?” → Retrieve from internal docs

2. Legal Document Analysis ⚖️:

  • Corpus: Case law, regulations, contracts
  • Example: “Find precedents for patent infringement”

3. Medical Q&A 🏥:

  • Database: Medical literature, clinical guidelines
  • Example: “Treatment options for Type 2 diabetes”

4. Code Documentation 💻:

  • Codebase + docs retrieval
  • Example: “How to authenticate API requests in our system?”

5. Academic Research 📚:

  • Literature search + summarization
  • Example: “Recent advances in quantum computing 2024”

10.2 Embeddings & Vector Representations

10.2.1 Apa itu Embeddings?

Definisi:

Embedding adalah representasi vektor (array of numbers) dari text, image, atau data lain dalam high-dimensional space, dimana semantic similarity tercermin sebagai geometric proximity.

Intuisi:

import numpy as np
import matplotlib.pyplot as plt

# Simulate word embeddings (normally 768 or 1536 dimensions)
# We'll create simple 2D examples for visualization

np.random.seed(42)

# Categories of words
animals = {
    'cat': [0.2, 0.8], 'dog': [0.3, 0.85], 'lion': [0.25, 0.75],
    'tiger': [0.28, 0.72], 'elephant': [0.35, 0.7]
}

tech = {
    'computer': [0.8, 0.3], 'laptop': [0.82, 0.28], 'phone': [0.85, 0.35],
    'tablet': [0.83, 0.32], 'monitor': [0.78, 0.27]
}

fruits = {
    'apple': [0.5, 0.2], 'banana': [0.52, 0.18], 'orange': [0.48, 0.22],
    'grape': [0.51, 0.19], 'mango': [0.49, 0.21]
}

fig, ax = plt.subplots(figsize=(12, 8))

# Plot each category
for word, pos in animals.items():
    ax.scatter(pos[0], pos[1], c='#FF6B6B', s=300, alpha=0.6, edgecolors='darkred', linewidths=2)
    ax.annotate(word, pos, fontsize=11, ha='center', va='center', fontweight='bold')

for word, pos in tech.items():
    ax.scatter(pos[0], pos[1], c='#4A90E2', s=300, alpha=0.6, edgecolors='darkblue', linewidths=2)
    ax.annotate(word, pos, fontsize=11, ha='center', va='center', fontweight='bold')

for word, pos in fruits.items():
    ax.scatter(pos[0], pos[1], c='#50C878', s=300, alpha=0.6, edgecolors='darkgreen', linewidths=2)
    ax.annotate(word, pos, fontsize=11, ha='center', va='center', fontweight='bold')

# Add cluster circles
from matplotlib.patches import Circle
ax.add_patch(Circle((0.27, 0.76), 0.15, fill=False, edgecolor='#FF6B6B',
                     linewidth=2, linestyle='--', label='Animals'))
ax.add_patch(Circle((0.816, 0.3), 0.08, fill=False, edgecolor='#4A90E2',
                     linewidth=2, linestyle='--', label='Technology'))
ax.add_patch(Circle((0.5, 0.2), 0.05, fill=False, edgecolor='#50C878',
                     linewidth=2, linestyle='--', label='Fruits'))

ax.set_xlabel('Dimension 1', fontsize=12, fontweight='bold')
ax.set_ylabel('Dimension 2', fontsize=12, fontweight='bold')
ax.set_title('Embedding Space: Semantic Similarity as Distance',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11, loc='upper right')
ax.grid(True, alpha=0.3)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])

plt.tight_layout()
plt.show()

Key Properties:

  1. Similar words = Close vectors:

    • “cat” dan “dog” berdekatan
    • “laptop” dan “computer” berdekatan
  2. Different concepts = Distant vectors:

    • “cat” jauh dari “laptop”
  3. Mathematical operations:

    • King - Man + Woman ≈ Queen
    • Paris - France + Italy ≈ Rome
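Properti aritmetika ini mudah didemonstrasikan dengan vektor mainan 2D (nilai dan dimensinya hipotetis, hanya untuk ilustrasi; embedding sungguhan berdimensi ratusan):

```python
import numpy as np

# Vektor mainan 2D: dimensi-0 ~ "royalty", dimensi-1 ~ "gender"
vectors = {
    "king":  np.array([0.9,  0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "queen": np.array([0.9, -0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# King - Man + Woman = ?
result = vectors["king"] - vectors["man"] + vectors["woman"]

# Cari kata dengan vektor terdekat terhadap hasil operasi
closest = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(closest)  # queen
```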

10.2.2 Sentence & Document Embeddings

Word Embeddings vs. Sentence Embeddings:

| Type | Example | Dimension | Use Case |
|---|---|---|---|
| Word | "cat" → [0.2, 0.8, …] | 300 (Word2Vec) | Word similarity |
| Sentence | "The cat sleeps" → [0.5, …] | 768 (BERT) | Semantic search |
| Document | Full article → [0.3, …] | 1536 (OpenAI) | Document retrieval |

Implementasi dengan Sentence Transformers:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "Machine learning adalah subset dari artificial intelligence",
    "Deep learning menggunakan neural networks dengan banyak layer",
    "Python adalah bahasa pemrograman populer untuk data science",
    "Jakarta adalah ibu kota Indonesia"
]

# Generate embeddings
embeddings = model.encode(sentences)

print(f"Embedding shape: {embeddings.shape}")  # (4, 384)
print(f"First embedding (truncated):\n{embeddings[0][:10]}")

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)

print("\nSimilarity Matrix:")
print(similarity_matrix)

Output:

Embedding shape: (4, 384)
First embedding (truncated):
[ 0.0234 -0.1234  0.5678 ... ]

Similarity Matrix:
[[1.000 0.812 0.456 0.123]   # Sent 1 vs all
 [0.812 1.000 0.489 0.098]   # Sent 2 vs all
 [0.456 0.489 1.000 0.156]   # Sent 3 vs all
 [0.123 0.098 0.156 1.000]]  # Sent 4 vs all

Interpretasi:

  • Sentence 1 dan 2 (ML/DL) sangat similar (0.812)
  • Sentence 4 (Jakarta) paling berbeda dari semua (low similarity)

10.2.3 Cosine Similarity untuk Retrieval

Formula:

\[ \text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} \]

Range: -1 (opposite) to +1 (identical)
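Formula di atas dapat diimplementasikan langsung dengan NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(A, B) = (A . B) / (||A|| * ||B||)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (identik)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (ortogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (berlawanan arah)
```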

Visualisasi:

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Case 1: High similarity
ax = axes[0]
v1 = np.array([0.8, 0.6])
v2 = np.array([0.7, 0.5])
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
          color='#4A90E2', width=0.01, label='Vector A')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
          color='#50C878', width=0.01, label='Vector B')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_title('High Similarity\n(cos ≈ 0.99)', fontweight='bold')

# Case 2: Medium similarity
ax = axes[1]
v1 = np.array([0.8, 0.6])
v2 = np.array([0.6, 0.8])
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
          color='#4A90E2', width=0.01, label='Vector A')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
          color='#FFD700', width=0.01, label='Vector C')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_title('Medium Similarity\n(cos ≈ 0.96)', fontweight='bold')

# Case 3: Low similarity
ax = axes[2]
v1 = np.array([0.8, 0.2])
v2 = np.array([0.2, 0.8])
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
          color='#4A90E2', width=0.01, label='Vector A')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
          color='#FF6B6B', width=0.01, label='Vector D')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_title('Low Similarity\n(cos ≈ 0.47)', fontweight='bold')

plt.tight_layout()
plt.show()
⚠️ Common Pitfalls
  1. Dimensionality Matters:

    • Lebih tinggi ≠ selalu lebih baik
    • 384-dim cukup untuk banyak tasks
    • 1536-dim untuk complex semantic understanding
  2. Model Selection:

    • all-MiniLM-L6-v2 (384-dim): Fast, good general purpose
    • all-mpnet-base-v2 (768-dim): Better quality, slower
    • OpenAI text-embedding-ada-002 (1536-dim): Best quality, API cost
  3. Normalization:

    • Always normalize vectors untuk cosine similarity
    • sklearn.preprocessing.normalize() atau manual L2 norm
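Normalisasi L2 manual (poin 3 di atas) hanyalah membagi vektor dengan norm-nya; setelah itu, dot product sama dengan cosine similarity:

```python
import numpy as np

def l2_normalize(v):
    # Bagi vektor dengan L2 norm-nya sehingga panjangnya menjadi 1
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

v = l2_normalize([3.0, 4.0])
print(v)                  # [0.6 0.8]
print(np.linalg.norm(v))  # 1.0

# Untuk vektor yang sudah dinormalisasi: dot product == cosine similarity
a, b = l2_normalize([1, 2]), l2_normalize([2, 1])
print(np.dot(a, b))  # ~0.8
```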

10.3 Vector Databases

10.3.1 Mengapa Butuh Vector Database?

Problem Statement:

Bayangkan Anda punya 1 million documents. Untuk setiap query:

  • Compute cosine similarity dengan semua 1M vectors
  • Sort untuk find top-K
  • Time complexity: O(N × D) dimana N=documents, D=dimensions

Result: 🐌 SANGAT LAMBAT!

Solusi: Vector Database dengan ANN (Approximate Nearest Neighbors)
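Baseline brute-force O(N × D) di atas bisa ditulis dalam beberapa baris NumPy; inilah biaya yang dihindari ANN index (angka N dan D di bawah hanya ilustrasi dengan data acak):

```python
import numpy as np

rng = np.random.default_rng(42)
N, D = 100_000, 384  # 100 ribu dokumen, embedding 384 dimensi

# Simulasi database embedding yang sudah dinormalisasi
db = rng.normal(size=(N, D)).astype('float32')
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = rng.normal(size=D).astype('float32')
query /= np.linalg.norm(query)

# Brute force: N dot product + sorting penuh hanya untuk mengambil top-5
scores = db @ query
top_k = np.argsort(-scores)[:5]
print(top_k)
```

Untuk satu query hal ini masih terasa cepat, tetapi biayanya linear terhadap N; pada puluhan juta dokumen dan ribuan query per detik, ANN index (IVF, HNSW, dll.) menjadi keharusan.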

10.3.2 Pilihan Vector Databases

| Database | Type | Best For | Pros | Cons |
|---|---|---|---|---|
| FAISS | Library | Research, prototyping | Fast, free, Facebook-backed | Not distributed |
| ChromaDB | Embedded | Small-medium apps | Easy, Python-native | Limited scale |
| Pinecone | Cloud | Production apps | Managed, scalable | Cost, vendor lock-in |
| Weaviate | Self-hosted | Enterprise | Open source, GraphQL | Complex setup |
| Milvus | Self-hosted | Large scale | Distributed, fast | Requires infrastructure |
| Qdrant | Self-hosted | Modern apps | Rust-based, fast | Newer, smaller community |

10.3.3 Implementasi dengan FAISS

Installation:

pip install faiss-cpu  # or faiss-gpu for GPU support

Example: Building a simple vector search:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Sample documents
documents = [
    "Machine learning adalah cabang dari AI yang fokus pada pembelajaran dari data",
    "Deep learning menggunakan neural networks dengan banyak hidden layers",
    "Natural language processing memproses dan memahami bahasa manusia",
    "Computer vision memungkinkan komputer memahami gambar dan video",
    "Reinforcement learning belajar melalui trial and error dengan rewards",
    "Transfer learning memanfaatkan model pre-trained untuk task baru",
    "Ensemble methods menggabungkan multiple models untuk hasil lebih baik",
    "Python adalah bahasa pemrograman populer untuk data science",
]

# Generate embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(documents)

# Get embedding dimension
d = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2
print(f"Embedding dimension: {d}")

# Create FAISS index
# IndexFlatL2: exact search dengan L2 distance
# (untuk cosine similarity, bisa juga pakai IndexFlatIP pada vektor ternormalisasi)
index = faiss.IndexFlatL2(d)

# Normalize vectors untuk cosine similarity
faiss.normalize_L2(embeddings)

# Add vectors to index
index.add(embeddings.astype('float32'))

print(f"Total vectors dalam index: {index.ntotal}")

# Query
query = "Apa itu neural networks?"
query_embedding = model.encode([query])
faiss.normalize_L2(query_embedding)

# Search for top-3 most similar documents
k = 3
distances, indices = index.search(query_embedding.astype('float32'), k)

print(f"\nQuery: '{query}'")
print("\nTop-3 hasil retrieval:")
for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
    # FAISS IndexFlatL2 mengembalikan *squared* L2 distance;
    # untuk vektor ternormalisasi berlaku d^2 = 2 - 2*cos, sehingga cos = 1 - d^2/2
    similarity = 1 - (dist / 2)
    print(f"{i+1}. [Score: {similarity:.3f}] {documents[idx]}")

Output:

Embedding dimension: 384
Total vectors dalam index: 8

Query: 'Apa itu neural networks?'

Top-3 hasil retrieval:
1. [Score: 0.892] Deep learning menggunakan neural networks dengan banyak hidden layers
2. [Score: 0.734] Machine learning adalah cabang dari AI yang fokus pada pembelajaran dari data
3. [Score: 0.698] Transfer learning memanfaatkan model pre-trained untuk task baru

10.3.4 Implementasi dengan ChromaDB

ChromaDB lebih user-friendly dan persistent:

import chromadb

# Persistent client (API ChromaDB >= 0.4); data akan disimpan di ./chroma_db
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="ml_documents",
    metadata={"description": "Machine learning knowledge base"}
)

# Add documents
documents = [
    "Machine learning adalah cabang dari AI yang fokus pada pembelajaran dari data",
    "Deep learning menggunakan neural networks dengan banyak hidden layers",
    "Natural language processing memproses dan memahami bahasa manusia",
]

collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"category": "ML", "difficulty": "beginner"},
        {"category": "DL", "difficulty": "intermediate"},
        {"category": "NLP", "difficulty": "intermediate"}
    ]
)

# Query
results = collection.query(
    query_texts=["Apa itu neural networks?"],
    n_results=2,
    where={"difficulty": "intermediate"}  # Optional metadata filter
)

print("Hasil retrieval:")
for doc, dist, meta in zip(results['documents'][0],
                            results['distances'][0],
                            results['metadatas'][0]):
    print(f"[Distance: {dist:.3f}] {doc}")
    print(f"  Metadata: {meta}\n")
💡 Best Practices
  1. Choose the right index type:

    • IndexFlatL2: Exact search, small datasets (<100K)
    • IndexIVFFlat: Approximate, medium datasets (100K-1M)
    • IndexHNSW: Fast approximate, large datasets (>1M)
  2. Batch processing:

    • Add vectors in batches (e.g., 1000 at a time)
    • Faster than one-by-one
  3. Persistence:

    • FAISS: Save/load dengan faiss.write_index() dan faiss.read_index()
    • ChromaDB: Otomatis persistent ke disk

10.4 Building RAG Systems

10.4.1 Text Chunking Strategies

Mengapa Chunking?

  • Documents terlalu panjang untuk embed sekaligus
  • Token limits (e.g., 512 tokens untuk BERT)
  • Better retrieval granularity

Chunking Methods:

# Method 1: Fixed-size chunking
def chunk_by_tokens(text, chunk_size=512, overlap=50):
    """
    Split text into fixed-size chunks dengan overlap.

    Args:
        text: Input text
        chunk_size: Number of tokens per chunk
        overlap: Number of overlapping tokens antar chunk

    Returns:
        List of text chunks
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    tokens = tokenizer.encode(text, add_special_tokens=False)

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text)
        start += (chunk_size - overlap)

    return chunks

# Method 2: Semantic chunking (berdasarkan paragraf/section)
def chunk_by_paragraph(text):
    """Split by paragraph boundaries."""
    paragraphs = text.split('\n\n')
    return [p.strip() for p in paragraphs if p.strip()]

# Method 3: Recursive character splitting (LangChain style)
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Hierarchy of split points
)

sample_text = """
Machine learning adalah bidang yang berkembang pesat.
Dalam beberapa tahun terakhir, deep learning telah merevolusi berbagai domain.

Computer vision kini dapat mengenali objek dengan akurasi superhuman.
Natural language processing memungkinkan chatbot yang sangat natural.

Ke depannya, AI akan semakin terintegrasi dalam kehidupan sehari-hari.
"""

chunks = text_splitter.split_text(sample_text)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i}:")
    print(chunk)

Comparison:

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Fixed tokens | Consistent size, fast | Might split mid-sentence | Technical docs |
| Paragraph | Semantic coherence | Variable size | Articles, books |
| Recursive | Best of both worlds | More complex | General purpose |

10.4.2 Complete RAG Pipeline dengan LangChain

LangChain adalah framework populer untuk building LLM applications.

Installation:

pip install langchain langchain-community chromadb openai

Full RAG Example:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI  # atau Ollama untuk local LLM

# Step 1: Load documents
# Misalnya dari folder berisi txt files
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('./docs/', glob="**/*.txt")
documents = loader.load()

print(f"Loaded {len(documents)} documents")

# Step 2: Chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
texts = text_splitter.split_documents(documents)
print(f"Split into {len(texts)} chunks")

# Step 3: Create embeddings & vector store
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Step 4: Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}  # Retrieve top-3 chunks
)

# Step 5: Create QA chain
llm = OpenAI(temperature=0)  # Ganti dengan model pilihan Anda

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff", "map_reduce", "refine", or "map_rerank"
    retriever=retriever,
    return_source_documents=True
)

# Step 6: Query!
query = "Apa perbedaan antara supervised dan unsupervised learning?"
result = qa_chain.invoke({"query": query})

print(f"Question: {query}")
print(f"\nAnswer: {result['result']}")
print(f"\nSources:")
for i, doc in enumerate(result['source_documents'], 1):
    print(f"{i}. {doc.metadata.get('source', 'Unknown')}")
    print(f"   {doc.page_content[:200]}...")

Chain Types Explained:

  1. Stuff: Masukkan semua retrieved docs ke satu prompt
    • Pros: Simple, best quality
    • Cons: Limited by context window
  2. Map-Reduce: Process each doc separately, then combine
    • Pros: Handles banyak docs
    • Cons: Might lose connections
  3. Refine: Iteratively refine answer dengan each doc
    • Pros: Good for comprehensive answers
    • Cons: Slower, more LLM calls
  4. Map-Rerank: Score each doc’s answer, pilih terbaik
    • Pros: High quality
    • Cons: Expensive (many LLM calls)
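Ide chain type "stuff" dapat disketsakan sebagai satu fungsi konstruksi prompt (ini ilustrasi konsep, bukan implementasi LangChain yang sebenarnya):

```python
def stuff_prompt(question, docs):
    # Semua retrieved chunk "di-stuff" ke dalam satu prompt;
    # batasannya: total panjang harus muat di context window LLM
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, 1))
    return (
        "Jawab pertanyaan berikut HANYA berdasarkan konteks.\n\n"
        f"Konteks:\n{context}\n\n"
        f"Pertanyaan: {question}\nJawaban:"
    )

docs = [
    "Supervised learning menggunakan data berlabel.",
    "Unsupervised learning menemukan pola tanpa label.",
]
print(stuff_prompt("Apa beda supervised dan unsupervised learning?", docs))
```

Map-reduce, refine, dan map-rerank berbeda hanya pada *cara* chunk-chunk ini diproses: per dokumen lalu digabung, iteratif, atau di-skor lalu dipilih yang terbaik.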

10.5 AI Agents: Dari RAG ke Autonomous Systems

10.5.1 Apa itu AI Agents?

Definisi:

AI Agent adalah sistem yang dapat menggunakan tools, memory, dan reasoning untuk secara autonomous menyelesaikan tasks complex.

RAG vs. Agents:

| Aspect | RAG System | AI Agent |
|---|---|---|
| Capability | Retrieve & answer | Reason, plan, act |
| Tools | None (just retrieval) | Can use tools (calculator, API, etc.) |
| Memory | Stateless | Can maintain memory |
| Autonomy | Single-step | Multi-step planning |
| Example | Q&A chatbot | Personal assistant |

Agent Architecture:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4A90E2', 'primaryTextColor': '#333', 'primaryBorderColor': '#2E5C8A', 'lineColor': '#2E5C8A', 'secondaryColor': '#50C878', 'tertiaryColor': '#FFD700'}}}%%
graph TB
    A[User Task] --> B[Agent Core<br/>LLM Reasoning]

    B --> C{Decision}
    C -->|Need Info| D[Retrieval Tool<br/>RAG/Search]
    C -->|Need Calculation| E[Calculator Tool]
    C -->|Need API Data| F[API Tool]
    C -->|Need Memory| G[Memory Store]

    D --> H[Observation]
    E --> H
    F --> H
    G --> H

    H --> B

    C -->|Task Complete| I[Final Answer]

    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style B fill:#FF6B6B,stroke:#C92A2A,stroke-width:2px,color:#fff
    style I fill:#50C878,stroke:#2E8B57,stroke-width:2px,color:#fff
    style C fill:#FFD700,stroke:#FFA500,stroke-width:2px,color:#333

10.5.2 ReAct Pattern: Reasoning + Acting

ReAct Framework (Yao et al., 2022):

  • Reason: Think about what to do
  • Act: Execute an action
  • Observe: See the result
  • Repeat until task solved

Example Task: “What’s the weather in Jakarta and should I bring an umbrella?”

Agent Trace:

Thought: I need to get current weather data for Jakarta
Action: weather_api
Action Input: {"city": "Jakarta", "country": "ID"}
Observation: {"temperature": 28, "condition": "rainy", "humidity": 85}

Thought: It's rainy with high humidity. User should bring umbrella.
Action: Final Answer
Action Input: "The current weather in Jakarta is rainy with 28°C and 85% humidity. Yes, you should definitely bring an umbrella!"
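Loop Reason→Act→Observe di atas dapat disketsakan dalam Python murni. LLM dan tool di sini hanyalah stub hipotetis agar strukturnya terlihat; framework seperti LangChain mengisi stub ini dengan LLM dan tool sungguhan.

```python
def stub_llm(scratchpad):
    # Stub pengganti LLM: memutuskan aksi berikutnya dari isi scratchpad
    if "Observation" not in scratchpad:
        return "weather_api", '{"city": "Jakarta"}'
    return "final_answer", "Hujan di Jakarta, sebaiknya bawa payung."

def weather_api(action_input):
    # Stub tool cuaca (sistem nyata memanggil API sungguhan)
    return '{"condition": "rainy", "temperature": 28}'

tools = {"weather_api": weather_api}

def run_agent(question, max_iterations=5):
    scratchpad = f"Question: {question}"
    for _ in range(max_iterations):  # batas iterasi mencegah runaway loop
        action, action_input = stub_llm(scratchpad)              # Reason
        if action == "final_answer":
            return action_input
        observation = tools[action](action_input)                # Act
        scratchpad += f"\nAction: {action}\nObservation: {observation}"  # Observe
    return "Gagal menyelesaikan task dalam batas iterasi."

print(run_agent("Cuaca Jakarta? Perlu payung?"))
```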

10.5.3 Building an Agent dengan LangChain

from langchain.agents import Tool, AgentExecutor, create_react_agent
from langchain.prompts import PromptTemplate
from langchain_community.llms import OpenAI
from langchain.chains import LLMMathChain

# Define tools
llm = OpenAI(temperature=0)

# Tool 1: Calculator
llm_math = LLMMathChain.from_llm(llm)

calculator = Tool(
    name="Calculator",
    func=llm_math.run,
    description="Useful untuk mathematical calculations. Input harus math expression."
)

# Tool 2: RAG search (dari section sebelumnya)
def rag_search(query: str) -> str:
    """Search knowledge base."""
    result = qa_chain.invoke({"query": query})
    return result['result']

knowledge_base = Tool(
    name="KnowledgeBase",
    func=rag_search,
    description="Useful untuk pertanyaan tentang machine learning concepts. Input harus pertanyaan lengkap."
)

# Tool 3: Python REPL (optional, untuk code execution)
from langchain_experimental.utilities import PythonREPL  # perlu: pip install langchain-experimental
python_repl = PythonREPL()

python_tool = Tool(
    name="PythonREPL",
    func=python_repl.run,
    description="Useful untuk execute Python code. Input harus valid Python code."
)

# Combine tools
tools = [calculator, knowledge_base, python_tool]

# Create agent prompt
template = """You are a helpful AI assistant. Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought: {agent_scratchpad}"""

prompt = PromptTemplate.from_template(template)

# Create agent
agent = create_react_agent(llm, tools, prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,  # Print reasoning steps
    max_iterations=5,
    handle_parsing_errors=True
)

# Example queries
queries = [
    "Berapa hasil dari 25 * 47 + 138?",
    "Apa itu gradient descent? Lalu hitung derivatif dari x^2 + 3x + 5 di x=2",
    "Generate list 10 angka Fibonacci menggunakan Python, lalu hitung mean-nya"
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print(f"{'='*60}")

    result = agent_executor.invoke({"input": query})
    print(f"\nFinal Answer: {result['output']}")

Output Example:

============================================================
Query: Berapa hasil dari 25 * 47 + 138?
============================================================

> Entering new AgentExecutor chain...
Thought: I need to perform a mathematical calculation
Action: Calculator
Action Input: 25 * 47 + 138
Observation: 1313
Thought: I now know the final answer
Final Answer: 1313

> Finished chain.

Final Answer: 1313

10.5.4 Agent Memory: Conversation History

Types of Memory:

  1. ConversationBufferMemory: Store semua conversation
  2. ConversationSummaryMemory: Summarize old messages
  3. ConversationBufferWindowMemory: Keep last N messages
  4. VectorStoreRetrieverMemory: Semantic search pada history
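Ide di balik ConversationBufferWindowMemory (tipe 3) mudah disketsakan dalam Python murni: simpan hanya N pesan terakhir (sketsa konsep, bukan API LangChain):

```python
from collections import deque

class BufferWindowMemory:
    # Sketsa konsep buffer window: deque(maxlen=k) otomatis membuang pesan tertua
    def __init__(self, k=3):
        self.messages = deque(maxlen=k)

    def add(self, role, content):
        self.messages.append((role, content))

    def as_context(self):
        # Gabungkan history menjadi string untuk disisipkan ke prompt
        return "\n".join(f"{role}: {content}" for role, content in self.messages)

memory = BufferWindowMemory(k=2)
memory.add("user", "Halo, nama saya Budi")
memory.add("ai", "Halo Budi!")
memory.add("user", "Apa itu neural network?")
print(memory.as_context())  # hanya 2 pesan terakhir yang tersisa
```

Trade-off-nya terlihat langsung: window kecil hemat token tetapi bisa "melupakan" fakta awal percakapan; ConversationSummaryMemory mengatasi ini dengan merangkum pesan lama alih-alih membuangnya.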

Implementation:

from langchain.memory import ConversationBufferMemory
from langchain.agents import initialize_agent, AgentType

# Create memory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Initialize agent dengan memory
agent_with_memory = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=True
)

# Multi-turn conversation
print(agent_with_memory.run("Halo, nama saya Budi"))
# Output: "Hello Budi! How can I help you today?"

print(agent_with_memory.run("Apa itu neural network?"))
# Output: *retrieves from knowledge base*

print(agent_with_memory.run("Siapa nama saya tadi?"))
# Output: "Your name is Budi" (remembered from the conversation)
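The trade-off behind ConversationBufferWindowMemory (strategy 3 above) can be sketched without LangChain: keep only the last N exchanges so the prompt stays bounded. `WindowMemory` here is a hypothetical stand-in, not LangChain's actual implementation.

```python
from collections import deque

class WindowMemory:
    """Keep only the last `window` (user, assistant) turns."""

    def __init__(self, window=3):
        self.turns = deque(maxlen=window)  # old turns are evicted automatically

    def add_turn(self, user_msg, assistant_msg):
        self.turns.append((user_msg, assistant_msg))

    def as_prompt(self):
        """Render the remembered turns as prompt context."""
        lines = []
        for user_msg, assistant_msg in self.turns:
            lines.append(f"User: {user_msg}")
            lines.append(f"Assistant: {assistant_msg}")
        return "\n".join(lines)

memory = WindowMemory(window=2)
memory.add_turn("Halo, nama saya Budi", "Hello Budi!")
memory.add_turn("Apa itu neural network?", "A neural network is ...")
memory.add_turn("Siapa nama saya?", "Your name is Budi")

print(memory.as_prompt())  # only the last 2 turns survive
```

With `window=2`, the first greeting has already been evicted, which is exactly the failure mode to watch for: facts mentioned early in a long conversation silently drop out of the window.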
💡 Agent Best Practices
  1. Tool Descriptions Matter: Clear descriptions → better tool selection
  2. Limit Tools: Too many tools confuse the agent (max 5-7)
  3. Validation: Always validate tool outputs before using
  4. Error Handling: Gracefully handle tool failures
  5. Cost Control: Set max_iterations to avoid runaway loops
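Best practice 1 ("tool descriptions matter") can be illustrated without an LLM: a ReAct-style agent chooses tools by matching the task against their descriptions. `pick_tool` below is a hypothetical keyword-overlap selector, far cruder than an LLM, but the failure mode is the same: vague descriptions lead to wrong tool choices.

```python
def pick_tool(query, tools):
    """Pick the tool whose description shares the most words with the query."""
    query_words = set(query.lower().split())
    best, best_score = None, -1
    for name, description in tools.items():
        score = len(query_words & set(description.lower().split()))
        if score > best_score:
            best, best_score = name, score
    return best

good_tools = {
    "Calculator": "evaluate arithmetic expressions and math calculations",
    "Wikipedia": "look up factual information about people places and concepts",
}

print(pick_tool("calculate 25 * 47 math", good_tools))  # prints: Calculator
```

If both descriptions were the generic "a useful tool", every query would tie and selection would degrade to order in the dict; an LLM agent degrades in an analogous way when descriptions are uninformative.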

10.6 Evaluating RAG Systems

10.6.1 Evaluation Metrics

Challenge: there is no single “ground truth” for generative tasks!

Solution: Multiple evaluation dimensions

1. Retrieval Quality:

Code
# Precision@K: what fraction of retrieved docs are relevant?
def precision_at_k(retrieved_docs, relevant_docs, k):
    """
    Calculate Precision@K.

    Args:
        retrieved_docs: List of retrieved doc IDs
        relevant_docs: Set of truly relevant doc IDs
        k: Number of top results to consider

    Returns:
        Precision score [0, 1]
    """
    top_k = retrieved_docs[:k]
    relevant_in_top_k = [doc for doc in top_k if doc in relevant_docs]
    return len(relevant_in_top_k) / k

# Recall@K: what fraction of relevant docs were retrieved?
def recall_at_k(retrieved_docs, relevant_docs, k):
    """Calculate Recall@K."""
    top_k = retrieved_docs[:k]
    relevant_in_top_k = [doc for doc in top_k if doc in relevant_docs]
    return len(relevant_in_top_k) / len(relevant_docs) if relevant_docs else 0

# Mean Reciprocal Rank (MRR)
def mrr(retrieved_docs, relevant_docs):
    """
    Calculate MRR.
    First relevant doc at position 1 → score = 1
    First relevant doc at position 2 → score = 0.5
    """
    for i, doc in enumerate(retrieved_docs, 1):
        if doc in relevant_docs:
            return 1 / i
    return 0

# Example
retrieved = ['doc3', 'doc1', 'doc7', 'doc2', 'doc5']
relevant = {'doc1', 'doc2', 'doc4'}

print(f"Precision@3: {precision_at_k(retrieved, relevant, 3):.2f}")  # 1/3 = 0.33
print(f"Recall@3: {recall_at_k(retrieved, relevant, 3):.2f}")        # 1/3 = 0.33
print(f"MRR: {mrr(retrieved, relevant):.2f}")                        # 1/2 = 0.50

2. Generation Quality:

Code
# BLEU, ROUGE (traditional metrics)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

reference = "Machine learning is a subset of artificial intelligence"
generated = "Machine learning is part of AI"

# BLEU (precision-focused); smoothing avoids zero scores on short sentences
smoothie = SmoothingFunction().method1
bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=smoothie)
print(f"BLEU: {bleu:.2f}")

# ROUGE (recall-focused)
rouge = Rouge()
scores = rouge.get_scores(generated, reference)
print(f"ROUGE-L F1: {scores[0]['rouge-l']['f']:.2f}")
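BLEU and ROUGE depend on the nltk and rouge packages. For quick, dependency-free checks, a token-overlap F1 in the spirit of SQuAD-style QA evaluation is a common alternative; this is an illustrative sketch, not part of either library:

```python
from collections import Counter

def token_f1(reference, generated):
    """Harmonic mean of token precision and recall between two strings."""
    ref_tokens = reference.lower().split()
    gen_tokens = generated.lower().split()
    overlap = Counter(ref_tokens) & Counter(gen_tokens)  # multiset intersection
    n_common = sum(overlap.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(gen_tokens)
    recall = n_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Machine learning is a subset of artificial intelligence"
generated = "Machine learning is part of AI"

print(f"Token F1: {token_f1(reference, generated):.2f}")  # 0.57
```

Note the weakness shared with BLEU/ROUGE: "AI" and "artificial intelligence" count as a miss, which is why the semantic metrics below are also needed.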

3. Faithfulness (Groundedness):

Is the generated answer grounded in the retrieved context?

Code
# Using LLM as judge
def evaluate_faithfulness(context, answer, llm):
    """
    Check if answer is supported by context.

    Returns:
        score [0-5], reasoning
    """
    prompt = f"""Given the following context and answer, rate how well the answer is supported by the context on a scale of 0-5:

Context: {context}

Answer: {answer}

Rating (0=not supported at all, 5=fully supported):
Reasoning: """

    response = llm(prompt)
    return response

# Example
context = "Python was created by Guido van Rossum in 1991."
answer = "Python was developed in the early 1990s"

score = evaluate_faithfulness(context, answer, llm)
print(score)
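An LLM judge costs an API call per check. A crude lexical proxy (the fraction of answer content words that appear in the context) can serve as a cheap first-pass filter; this is an illustrative heuristic, not a substitute for the LLM judge above.

```python
def lexical_groundedness(context, answer, min_len=4):
    """Fraction of answer content words (len >= min_len) found in the context."""
    context_words = set(context.lower().split())
    answer_words = [w for w in answer.lower().split() if len(w) >= min_len]
    if not answer_words:
        return 1.0  # nothing substantive to verify
    supported = sum(1 for w in answer_words if w in context_words)
    return supported / len(answer_words)

context = "Python was created by Guido van Rossum in 1991."
answer = "Python was developed in the early 1990s"

score = lexical_groundedness(context, answer)
print(f"Lexical groundedness: {score:.2f}")  # 0.25
```

The example also shows the heuristic's limit: the answer is true and grounded in spirit, but paraphrase ("created in 1991" vs "developed in the early 1990s") scores low, so low lexical scores should trigger the LLM judge rather than an automatic rejection.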

4. Relevance:

Is the answer relevant to the question?

Code
def evaluate_relevance(question, answer, llm):
    """Rate answer relevance to question."""
    prompt = f"""Rate how relevant the answer is to the question (0-5):

Question: {question}
Answer: {answer}

Rating:
Reasoning: """

    return llm(prompt)

10.6.2 End-to-End Evaluation Framework

Code
import pandas as pd  # used to build the results DataFrame below

class RAGEvaluator:
    """Comprehensive RAG evaluation."""

    def __init__(self, rag_chain, llm_judge):
        self.rag_chain = rag_chain
        self.llm_judge = llm_judge

    def evaluate(self, test_cases):
        """
        Evaluate the RAG system on test cases.

        Args:
            test_cases: List of dicts with keys:
                - 'question': str
                - 'expected_answer': str (optional)
                - 'relevant_docs': set (optional)

        Returns:
            Evaluation results dataframe
        """
        results = []

        for case in test_cases:
            # Run RAG
            output = self.rag_chain({"query": case['question']})
            answer = output['result']
            retrieved_docs = [doc.metadata['id'] for doc in output['source_documents']]

            # Evaluate retrieval
            if 'relevant_docs' in case:
                precision = precision_at_k(retrieved_docs, case['relevant_docs'], 3)
                recall = recall_at_k(retrieved_docs, case['relevant_docs'], 3)
            else:
                precision, recall = None, None

            # Evaluate generation
            context = "\n".join([doc.page_content for doc in output['source_documents']])
            faithfulness = self.evaluate_faithfulness(context, answer)
            relevance = self.evaluate_relevance(case['question'], answer)

            results.append({
                'question': case['question'],
                'answer': answer,
                'precision@3': precision,
                'recall@3': recall,
                'faithfulness': faithfulness,
                'relevance': relevance
            })

        return pd.DataFrame(results)

    def evaluate_faithfulness(self, context, answer):
        """LLM-based faithfulness check."""
        # Implementation similar to above
        pass

    def evaluate_relevance(self, question, answer):
        """LLM-based relevance check."""
        # Implementation similar to above
        pass

# Usage
test_cases = [
    {
        'question': 'Apa itu backpropagation?',
        'relevant_docs': {'doc12', 'doc34'}
    },
    {
        'question': 'Perbedaan CNN dan RNN?',
        'relevant_docs': {'doc45', 'doc67', 'doc89'}
    }
]

evaluator = RAGEvaluator(qa_chain, llm)
results = evaluator.evaluate(test_cases)
print(results)

10.6.3 A/B Testing RAG Configurations

Test different configurations:

Code
import pandas as pd
import matplotlib.pyplot as plt

# Configurations to test
configs = [
    {
        'name': 'Baseline',
        'chunk_size': 500,
        'chunk_overlap': 50,
        'k': 3,
        'search_type': 'similarity'
    },
    {
        'name': 'Larger Chunks',
        'chunk_size': 1000,
        'chunk_overlap': 100,
        'k': 3,
        'search_type': 'similarity'
    },
    {
        'name': 'More Retrieval',
        'chunk_size': 500,
        'chunk_overlap': 50,
        'k': 5,
        'search_type': 'similarity'
    },
    {
        'name': 'MMR (diversity)',
        'chunk_size': 500,
        'chunk_overlap': 50,
        'k': 3,
        'search_type': 'mmr'  # Maximal Marginal Relevance
    }
]

# Run experiments
results = []
for config in configs:
    # Build RAG with this config ('name' is just a label, not a parameter)
    params = {k: v for k, v in config.items() if k != 'name'}
    rag = build_rag_system(**params)

    # Evaluate the freshly built pipeline
    evaluator = RAGEvaluator(rag, llm)
    eval_results = evaluator.evaluate(test_cases)

    # Aggregate
    results.append({
        'config': config['name'],
        'avg_precision': eval_results['precision@3'].mean(),
        'avg_faithfulness': eval_results['faithfulness'].mean(),
        'avg_relevance': eval_results['relevance'].mean()
    })

results_df = pd.DataFrame(results)
print(results_df)

# Visualize
results_df.plot(x='config', kind='bar', figsize=(12, 6))
plt.title('RAG Configuration Comparison')
plt.ylabel('Score')
plt.legend(['Precision@3', 'Faithfulness', 'Relevance'])
plt.tight_layout()
plt.show()
📊 Evaluation Best Practices
  1. Use Multiple Metrics: No single metric captures all aspects
  2. LLM-as-Judge: effective for semantic evaluation
  3. Human Eval: the gold standard, but expensive (use it for validation)
  4. Regression Testing: track metrics over time
  5. Domain-Specific: customize metrics for your use case
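Regression testing (practice 4) can be as simple as keeping a metric history and flagging drops beyond a tolerance. A minimal sketch; the class name and threshold are illustrative choices:

```python
class MetricTracker:
    """Track a metric across evaluation runs and flag regressions."""

    def __init__(self, tolerance=0.05):
        self.history = {}          # metric name -> list of scores
        self.tolerance = tolerance

    def record(self, name, score):
        """Record a score; return True if it regressed vs the previous run."""
        runs = self.history.setdefault(name, [])
        regressed = bool(runs) and score < runs[-1] - self.tolerance
        runs.append(score)
        return regressed

tracker = MetricTracker(tolerance=0.05)
tracker.record("faithfulness", 0.82)         # first run, no baseline yet
print(tracker.record("faithfulness", 0.80))  # within tolerance -> False
print(tracker.record("faithfulness", 0.70))  # dropped 0.10 -> True
```

In practice the history would be persisted (e.g. alongside CI runs) so that a prompt or chunking change that silently hurts faithfulness is caught before deployment.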

10.7 Production Considerations

10.7.1 Scalability & Performance

Challenges:

  1. Latency: retrieval + generation can be slow
  2. Throughput: many concurrent users
  3. Cost: API calls for embeddings and the LLM are expensive

Solutions:

Code
# 1. Caching (simple approach)
from datetime import datetime, timedelta

cache = {}
CACHE_EXPIRY = timedelta(hours=1)

def cached_rag(query):
    """Cache RAG results for repeated queries."""
    if query in cache:
        result, timestamp = cache[query]
        if datetime.now() - timestamp < CACHE_EXPIRY:
            print("Cache hit!")
            return result

    # Cache miss
    result = qa_chain({"query": query})
    cache[query] = (result, datetime.now())
    return result

# 2. Batch processing untuk embeddings
def batch_embed(texts, batch_size=32):
    """Embed in batches for efficiency."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_embeddings = model.encode(batch)
        embeddings.extend(batch_embeddings)
    return embeddings

# 3. Async processing
import asyncio

async def async_rag(query):
    """Asynchronous RAG for concurrency."""
    # Parallelize retrieval and the LLM call where possible
    retrieval_task = asyncio.create_task(retrieve(query))
    # ... implementation
    pass
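To make the async stub concrete, here is a runnable sketch with simulated I/O: independent queries are served concurrently with `asyncio.gather`, so total wall time tracks the slowest query rather than the sum. All names and delays are illustrative stand-ins for real retrieval and LLM calls.

```python
import asyncio

async def retrieve(query):
    """Stand-in for vector-store retrieval (simulated I/O wait)."""
    await asyncio.sleep(0.05)
    return [f"doc about {query}"]

async def generate(query, docs):
    """Stand-in for the LLM call."""
    await asyncio.sleep(0.05)
    return f"Answer to '{query}' based on {len(docs)} docs"

async def async_rag(query):
    # Within one query the steps are sequential: retrieval feeds generation
    docs = await retrieve(query)
    return await generate(query, docs)

async def main():
    queries = ["What is RAG?", "What is an agent?", "What is FAISS?"]
    # Across queries, all three pipelines run concurrently
    answers = await asyncio.gather(*(async_rag(q) for q in queries))
    for answer in answers:
        print(answer)

asyncio.run(main())
```

Real implementations would use an async HTTP client (or the async APIs of the LLM/vector-store SDKs) in place of `asyncio.sleep`.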

10.7.2 Monitoring & Observability

Key Metrics to Track:

Code
import logging
from datetime import datetime

class RAGMonitor:
    """Monitor RAG system performance."""

    def __init__(self):
        self.metrics = {
            'total_queries': 0,
            'avg_latency': 0,
            'cache_hit_rate': 0,
            'error_rate': 0
        }
        self.logger = logging.getLogger(__name__)

    def log_query(self, query, latency, cached, error=None):
        """Log each query."""
        self.metrics['total_queries'] += 1

        # Update metrics
        n = self.metrics['total_queries']
        self.metrics['avg_latency'] = (
            (self.metrics['avg_latency'] * (n-1) + latency) / n
        )

        # Hits contribute 1 and misses 0, so the running rate stays correct
        hit = 1 if cached else 0
        self.metrics['cache_hit_rate'] = (
            (self.metrics['cache_hit_rate'] * (n-1) + hit) / n
        )

        err = 1 if error else 0
        self.metrics['error_rate'] = (
            (self.metrics['error_rate'] * (n-1) + err) / n
        )
        if error:
            self.logger.error(f"Query error: {error}")

        # Log to file/database
        self.logger.info(f"Query: {query[:50]}... | Latency: {latency:.2f}s | Cached: {cached}")

    def get_metrics(self):
        """Return current metrics."""
        return self.metrics

# Usage
monitor = RAGMonitor()

def monitored_rag(query):
    """RAG with monitoring."""
    start_time = datetime.now()
    cached = False
    error = None

    try:
        # Check membership before calling: cached_rag() inserts the entry
        # on a miss, so checking afterwards would always report a hit
        cached = query in cache
        result = cached_rag(query)
    except Exception as e:
        error = str(e)
        raise
    finally:
        latency = (datetime.now() - start_time).total_seconds()
        monitor.log_query(query, latency, cached, error)

    return result
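Running averages hide tail latency; production dashboards usually track percentiles over a recent window instead. A minimal sketch (the window size and nearest-rank method are illustrative choices):

```python
from collections import deque

class LatencyWindow:
    """Keep the last `size` latency samples and report percentiles."""

    def __init__(self, size=1000):
        self.samples = deque(maxlen=size)  # oldest samples fall off

    def add(self, latency):
        self.samples.append(latency)

    def percentile(self, p):
        """Nearest-rank percentile over the current window."""
        ordered = sorted(self.samples)
        if not ordered:
            return None
        rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[rank]

window = LatencyWindow(size=100)
for ms in [120, 90, 110, 2000, 100, 95, 105, 98, 102, 99]:
    window.add(ms)

print(f"p50: {window.percentile(50)}")  # median-ish latency
print(f"p95: {window.percentile(95)}")  # tail dominated by the 2000 ms outlier
```

Here the mean (~292 ms) would suggest everything is fine, while the p95 exposes the slow outlier that some users actually experience.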

10.7.3 Security & Privacy

Concerns:

  1. Data Privacy: User queries might contain sensitive info
  2. Injection Attacks: Malicious prompts
  3. Data Leakage: Retrieved docs might expose confidential info

Mitigations:

Code
# 1. Input sanitization
def sanitize_input(query):
    """Remove potentially malicious content."""
    # Remove excessively long inputs
    max_length = 500
    query = query[:max_length]

    # Remove common injection patterns
    forbidden_patterns = ['ignore previous', 'disregard', 'system:']
    for pattern in forbidden_patterns:
        if pattern.lower() in query.lower():
            raise ValueError("Potentially malicious input detected")

    return query

# 2. Access control
def check_permissions(user_id, doc_id):
    """Check whether the user can access the document."""
    # Query database/ACL
    # Return True/False
    pass

def filtered_retrieval(query, user_id):
    """Retrieve only docs the user is allowed to access."""
    all_results = vectorstore.similarity_search(query, k=10)

    # Filter by permissions
    filtered = [
        doc for doc in all_results
        if check_permissions(user_id, doc.metadata['id'])
    ]

    return filtered[:3]  # Return top-3 after filtering

# 3. PII redaction
import re

def redact_pii(text):
    """Remove personally identifiable information."""
    # Email
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', text)

    # Phone (Indonesia)
    text = re.sub(r'\b(?:\+62|0)\d{9,11}\b', '[PHONE]', text)

    # ID numbers (example: 16 digits)
    text = re.sub(r'\b\d{16}\b', '[ID_NUMBER]', text)

    return text
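A quick sanity check of the patterns above (the function is repeated so the snippet is self-contained; the sample contact details are made up):

```python
import re

def redact_pii(text):
    """Remove personally identifiable information (same patterns as above)."""
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', text)
    text = re.sub(r'\b(?:\+62|0)\d{9,11}\b', '[PHONE]', text)  # Indonesian phone
    text = re.sub(r'\b\d{16}\b', '[ID_NUMBER]', text)          # 16-digit ID
    return text

sample = "Hubungi budi@example.com atau 081234567890, KTP 1234567890123456."
print(redact_pii(sample))
# Hubungi [EMAIL] atau [PHONE], KTP [ID_NUMBER].
```

Order matters: the phone pattern runs before the ID pattern, but its `\b(?:\+62|0)` anchor prevents it from matching inside the 16-digit number.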

10.8 Studi Kasus: Customer Support RAG System

10.8.1 Problem Statement

Scenario:

An e-commerce company wants to build a customer support chatbot that can:

  • Answer FAQ from knowledge base (100+ documents)
  • Handle 1000+ queries/day
  • Provide accurate answers dengan source citations
  • Reduce support ticket volume by 40%

10.8.2 System Design

Architecture:

Code
# Knowledge base: FAQs, product manuals, policies
documents = load_documents('./knowledge_base/')

# Processing pipeline
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)
chunks = text_splitter.split_documents(documents)

# Embeddings: Indonesian-capable model, wrapped for LangChain
# (Chroma expects a LangChain Embeddings object, not a raw SentenceTransformer)
from langchain.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name='LazarusNLP/IndoBERT-base-uncased')

# Vector DB: ChromaDB for persistence
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./customer_support_db"
)

# Hybrid retrieval for better accuracy
ensemble_retriever = create_ensemble_retriever(
    vectorstore, chunks, weights=[0.6, 0.4]
)

# LLM: OpenAI GPT-4 (can be swapped for a local model)
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-4", temperature=0)

# QA chain with a custom prompt
from langchain.prompts import PromptTemplate

template = """Anda adalah asisten customer support yang membantu. Gunakan konteks berikut untuk menjawab pertanyaan pelanggan.

PENTING:
- Jika Anda tidak tahu jawabannya, katakan "Maaf, saya tidak memiliki informasi tersebut. Silakan hubungi tim support kami."
- Jangan membuat informasi
- Berikan jawaban yang ramah dan profesional
- Sertakan referensi ke sumber jika relevan

Konteks:
{context}

Pertanyaan: {question}

Jawaban:"""

QA_PROMPT = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=ensemble_retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": QA_PROMPT},
    return_source_documents=True
)

10.8.3 Implementation & Results

Deployment:

Code
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    user_id: str

class Response(BaseModel):
    answer: str
    sources: list[str]
    confidence: float

@app.post("/ask", response_model=Response)
async def ask_question(query: Query):
    """API endpoint for customer queries."""
    try:
        # Sanitize input
        question = sanitize_input(query.question)

        # Run RAG
        result = qa_chain({"query": question})

        # Extract sources
        sources = [
            doc.metadata.get('source', 'Unknown')
            for doc in result['source_documents']
        ]

        # Estimate confidence (simple heuristic)
        confidence = estimate_confidence(result)

        return Response(
            answer=result['result'],
            sources=sources,
            confidence=confidence
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

def estimate_confidence(result):
    """Estimate answer confidence from retrieval scores."""
    # Simple heuristic: average similarity of retrieved docs
    # Real implementation would be more sophisticated
    return 0.85  # Placeholder
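The placeholder above can be replaced with a simple heuristic that maps retrieval similarity into [0, 1]; real systems often calibrate against human labels instead. A hypothetical sketch, assuming the retriever exposes per-document similarity scores (the `floor`/`ceiling` values are illustrative):

```python
def estimate_confidence_from_scores(scores, floor=0.3, ceiling=0.9):
    """Map mean similarity of retrieved docs to a confidence in [0, 1].

    Mean scores at or below `floor` give 0; at or above `ceiling`, 1.
    """
    if not scores:
        return 0.0
    mean_score = sum(scores) / len(scores)
    # Linear rescale, clamped to [0, 1]
    confidence = (mean_score - floor) / (ceiling - floor)
    return max(0.0, min(1.0, confidence))

print(estimate_confidence_from_scores([0.82, 0.78, 0.75]))  # strong retrieval
print(estimate_confidence_from_scores([0.35, 0.31, 0.28]))  # weak retrieval
```

Low-confidence answers can then be routed to a human agent instead of being shown to the customer, which is how the "I don't know" fallback below is enforced in practice.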

Results After 3 Months:

| Metric | Before RAG | After RAG | Improvement |
|---|---|---|---|
| Ticket Volume | 5000/month | 2800/month | 44% reduction |
| Response Time | 2 hours | 30 seconds | 99.3% faster |
| Customer Satisfaction | 3.2/5 | 4.5/5 | 41% increase |
| Support Cost | $50K/month | $32K/month | 36% savings |
| Accuracy (human eval) | N/A | 87% | New capability ✅ |

Lessons Learned:

  1. Chunking matters: Experimented with 500, 800, 1000 tokens → 800 optimal
  2. Hybrid search: Improved precision from 0.72 to 0.84
  3. Prompt engineering: Custom prompt reduced hallucinations significantly
  4. Fallback important: “I don’t know” better than wrong answer
  5. Continuous improvement: Weekly analysis of low-confidence answers

10.9 Ringkasan (Summary)

Konsep Kunci:

  1. RAG = Retrieval + Generation:

    • Solve LLM limitations (knowledge cutoff, hallucinations)
    • Combine semantic search with generative AI
    • Enable domain-specific, updatable knowledge
  2. Embeddings:

    • Semantic representations as vectors
    • Cosine similarity for retrieval
    • Models: Sentence Transformers, OpenAI embeddings
  3. Vector Databases:

    • Efficient similarity search at scale
    • Options: FAISS (library), ChromaDB (embedded), Pinecone (cloud)
    • ANN algorithms for the speed vs accuracy trade-off
  4. RAG Pipeline:

    • Document → Chunking → Embedding → Vector DB
    • Query → Embedding → Retrieval → LLM → Response
    • LangChain simplifies implementation
  5. AI Agents:

    • Beyond RAG: tools, memory, reasoning
    • ReAct pattern: Reason → Act → Observe
    • Can plan multi-step tasks autonomously
  6. Evaluation:

    • Retrieval: Precision@K, Recall@K, MRR
    • Generation: Faithfulness, Relevance
    • LLM-as-judge for semantic evaluation
  7. Production:

    • Caching and batching for performance
    • Monitoring metrics (latency, error rate)
    • Security: input sanitization, access control

Aplikasi Praktis:

  • Customer support chatbots
  • Document Q&A systems
  • Code documentation assistants
  • Research literature search
  • Legal/medical knowledge bases

Next Steps:

  • Chapter 11: MLOps & Deployment (production ML systems)
  • Advanced topics: Multi-modal RAG, agentic workflows
  • Fine-tuning embeddings for domain specificity

10.10 Latihan & Tugas (Exercises)

📝 Practice Exercises

Conceptual Questions:

  1. Explain the fundamental difference between a plain LLM and a RAG system. When would you use each?

  2. Why is cosine similarity more popular than Euclidean distance for semantic search? Give an example where the two produce different rankings.

  3. What are the trade-offs between the following chunking strategies:

    • Fixed-size tokens (512 tokens)
    • Paragraph-based
    • Recursive character splitting

    When is each a good fit?

  4. Compare and contrast:

    • A RAG system vs a plain retrieval database
    • An AI agent vs a traditional chatbot
    • A vector database (FAISS) vs a relational database (PostgreSQL)
  5. In RAG evaluation, why do we need multiple metrics (Precision, Recall, Faithfulness, Relevance)? Give an example where one metric is high while another is low.

Practical Assignments:

Task 1: Build a Simple RAG System

  • Input: a collection of 20+ text documents (Wikipedia articles, technical docs, etc.)
  • Tasks:
    1. Implement a chunking strategy of your choice
    2. Generate embeddings with Sentence Transformers
    3. Build a FAISS index
    4. Implement a retrieval function
    5. Test with 5 sample queries
  • Output:
    • Python script with complete code
    • Report showing query results with retrieved chunks
  • Rubric:
    • Chunking implementation (20%)
    • FAISS integration (20%)
    • Retrieval accuracy (30%)
    • Code quality & documentation (15%)
    • Report clarity (15%)

Task 2: Compare Embedding Models

  • Objective: compare different embedding models for Indonesian text
  • Models to test:
    • sentence-transformers/all-MiniLM-L6-v2 (multilingual)
    • indobenchmark/indobert-base-p1 (Indonesian BERT)
    • LazarusNLP/IndoBERT-base-uncased (Indonesian)
  • Tasks:
    1. Create a test set: 10 queries, each with 3 relevant + 7 irrelevant docs
    2. Compute embeddings with each model
    3. Evaluate with Precision@3, Recall@3, MRR
    4. Analyze trade-offs (speed vs accuracy)
  • Deliverable:
    • Comparison table with metrics
    • Visualization (bar charts)
    • Analysis report (1-2 pages)

Task 3: Build a RAG System with LangChain

  • Scenario: academic paper Q&A system
  • Requirements:
    1. Load multiple PDF papers (use PyPDF or similar)
    2. Implement chunking with RecursiveCharacterTextSplitter
    3. Use ChromaDB for persistence
    4. Implement a QA chain with a custom prompt
    5. Add source citations to responses
  • Bonus:
    • Implement hybrid search (vector + BM25)
    • Add conversation memory
    • Build a simple Streamlit/Gradio UI
  • Rubric:
    • Core functionality (40%)
    • Source citations (15%)
    • Code organization (15%)
    • Documentation (15%)
    • Bonus features (15%)

Task 4: Simple AI Agent

  • Objective: build an agent that can use multiple tools
  • Required tools:
    1. Calculator (for math)
    2. Wikipedia search (for facts)
    3. Weather API (for current weather)
  • Tasks:
    1. Define tool wrappers
    2. Create a ReAct agent with LangChain
    3. Test with diverse queries:
      • Pure math: “What’s 234 * 567?”
      • Fact + math: “Population of Jakarta times 3?”
      • Multi-step: “What’s the weather in Paris? Is it above average?”
  • Deliverable:
    • Agent code
    • Trace logs showing reasoning steps
    • Analysis report discussing successes/failures
  • Rubric:
    • Tool implementation (30%)
    • Agent reasoning quality (30%)
    • Test coverage (20%)
    • Analysis depth (20%)

Task 5: RAG Evaluation Pipeline

  • Objective: build an evaluation framework for a RAG system
  • Requirements:
    1. Create a test dataset: 20+ question-answer pairs with ground truth
    2. Implement metrics:
      • Retrieval: Precision@K, Recall@K
      • Generation: Faithfulness (LLM-as-judge)
    3. Run the evaluation on 2 different RAG configurations
    4. Visualize the results
  • Deliverable:
    • Evaluation code
    • Test dataset (JSON/CSV)
    • Results report with recommendations
  • Rubric:
    • Metric implementation (35%)
    • Test dataset quality (20%)
    • Comparative analysis (25%)
    • Recommendations (20%)

Final Project (Capstone):

“Domain-Specific RAG System”

Build a complete RAG system for a domain of your choice:

  • Options:

    • Medical Q&A (from medical literature)
    • Legal assistant (from regulations/case law)
    • Code documentation helper
    • Academic research assistant

Requirements:

  1. Document collection (50+ documents)
  2. Full RAG pipeline (chunking → embedding → retrieval → generation)
  3. Evaluation with a test set (20+ queries)
  4. Simple web interface (Streamlit/Gradio)
  5. Documentation (architecture, usage, evaluation results)

Grading:

  • System functionality (30%)
  • Retrieval quality (20%)
  • Generation quality (20%)
  • User interface (15%)
  • Documentation (15%)

10.11 Referensi & Bacaan Lebih Lanjut

Books:

  • Tunstall et al., 2022. “Natural Language Processing with Transformers”. O’Reilly.
  • Huyen, 2022. “Designing Machine Learning Systems”. O’Reilly. (Chapter on embeddings)



Mapping to Program Learning Outcomes

| CPMK | Sub-CPMK Covered |
|---|---|
| CPMK-3 | Implementation of modern AI systems |
| CPMK-4 | Evaluation of model performance |
| CPMK-5 | Production ML deployment |

Related Labs: Lab 10 - Building RAG System for Indonesian Documents
Related Chapters: Chapter 8 (Transformers), Chapter 11 (MLOps & Deployment)
Estimated Reading Time: 120 minutes
Estimated Practice Time: 10-12 hours


Last Updated: December 6, 2024
Version: 1.0
Author: Tim Pengembang - Politeknik Siber dan Sandi Negara