Bab 10: RAG & AI Agents

Retrieval-Augmented Generation, Vector Databases & Intelligent Agents

🎯 Hasil Pembelajaran (Learning Outcomes)

Setelah mempelajari bab ini, Anda akan mampu:

  1. Memahami konsep Retrieval-Augmented Generation (RAG) dan motivasinya
  2. Mengidentifikasi komponen-komponen sistem RAG (embeddings, vector DB, retrieval)
  3. Mengimplementasikan vector databases dengan FAISS dan ChromaDB
  4. Membangun RAG pipeline sederhana dengan LangChain
  5. Menerapkan chunking strategies dan semantic search
  6. Mengembangkan AI agents dengan tools dan memory
  7. Mengevaluasi kualitas RAG systems menggunakan berbagai metrics

10.1 Dari LLM ke RAG: Mengapa Kita Butuh Retrieval?

10.1.1 Limitasi Large Language Models

Masalah Fundamental LLMs:

Meskipun powerful, LLMs seperti GPT dan Llama memiliki critical limitations:

1. Knowledge Cutoff 📅:

  • Model hanya “tahu” data sampai tanggal training
  • GPT-4 (trained 2023) tidak tahu berita 2024
  • Tidak bisa update knowledge tanpa re-training (costly!)

2. Hallucination 🎭:

  • LLMs bisa generate jawaban yang sounds plausible tapi faktanya salah
  • Tidak ada grounding ke factual sources
  • Berbahaya untuk critical applications (medical, legal, financial)

3. Domain-Specific Knowledge 🏢:

  • LLMs tidak tahu internal company documents
  • Tidak punya akses ke proprietary data
  • Tidak bisa query real-time databases

4. No Source Attribution 📚:

  • Tidak bisa cite sources
  • Sulit verify informasi
  • Compliance & legal issues
💡 Contoh Problem

User: “Berapa harga saham Apple hari ini?”

LLM tanpa RAG:

  • “Maaf, saya tidak punya data real-time…”
  • Atau worse: hallucinates a price!

LLM dengan RAG:

  1. Retrieve dari stock API/database
  2. Ground response pada data terkini
  3. Jawaban akurat dengan source citation

10.1.2 Solusi: Retrieval-Augmented Generation (RAG)

Definisi:

RAG adalah teknik yang mengkombinasikan retrieval (pencarian informasi dari knowledge base) dengan generation (LLM text generation) untuk menghasilkan jawaban yang lebih akurat, factual, dan dapat diverifikasi.
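Alur retrieval + generation dalam definisi di atas dapat disketsakan dalam beberapa baris Python. Ini ilustrasi minimal: fungsi `embed` di bawah hanyalah stub bag-of-characters (bukan model embedding sungguhan), dan langkah generation diganti dengan mencetak prompt yang akan dikirim ke LLM.

```python
import numpy as np

def embed(text):
    # Stub embedding: frekuensi huruf + normalisasi L2 (hanya ilustrasi,
    # pengganti model embedding sungguhan)
    vec = np.zeros(26)
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, corpus, k=2):
    # Ambil k dokumen dengan cosine similarity tertinggi terhadap query
    q = embed(query)
    return sorted(corpus, key=lambda doc: -np.dot(q, embed(doc)))[:k]

def build_rag_prompt(query, corpus, k=2):
    # Retrieval + konstruksi prompt; di sistem nyata prompt ini dikirim ke LLM
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "RAG menggabungkan retrieval dengan generation untuk jawaban yang grounded",
    "Jakarta adalah ibu kota Indonesia",
]
print(build_rag_prompt("Apa itu RAG?", corpus, k=1))
```

Sketsa ini hanya menunjukkan urutan langkahnya; bagian-bagian selanjutnya mengganti setiap stub dengan komponen sungguhan (sentence transformers, vector database, dan LLM).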

Arsitektur RAG:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4A90E2', 'primaryTextColor': '#333', 'primaryBorderColor': '#2E5C8A', 'lineColor': '#2E5C8A', 'secondaryColor': '#50C878', 'tertiaryColor': '#FFD700'}}}%%
graph TB
    A[User Query] --> B[Query Embedding]
    B --> C[Vector Search]

    D[Document Corpus] --> E[Text Chunking]
    E --> F[Embedding Generation]
    F --> G[(Vector Database)]

    C --> G
    G --> H[Retrieve Top-K<br/>Relevant Chunks]

    H --> I[Construct Prompt]
    A --> I
    I --> J[LLM Generation]
    J --> K[Response with<br/>Source Citations]

    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style K fill:#50C878,stroke:#2E8B57,stroke-width:2px,color:#fff
    style G fill:#FFD700,stroke:#FFA500,stroke-width:2px,color:#333
    style J fill:#FF6B6B,stroke:#C92A2A,stroke-width:2px,color:#fff

Cara Kerja RAG:

import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Step 1: Document Processing (Left)
ax.add_patch(FancyBboxPatch((0.5, 7.5), 2, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#4A90E2', facecolor='#E3F2FD', linewidth=2))
ax.text(1.5, 8.5, 'Document\nCorpus', ha='center', va='center', fontsize=11, fontweight='bold')

ax.arrow(1.5, 7.5, 0, -0.8, head_width=0.15, head_length=0.15, fc='#2E5C8A', ec='#2E5C8A')

ax.add_patch(FancyBboxPatch((0.5, 5.5), 2, 1, boxstyle="round,pad=0.1",
                             edgecolor='#4A90E2', facecolor='#E3F2FD', linewidth=2))
ax.text(1.5, 6, 'Chunking', ha='center', va='center', fontsize=10, fontweight='bold')

ax.arrow(1.5, 5.5, 0, -0.8, head_width=0.15, head_length=0.15, fc='#2E5C8A', ec='#2E5C8A')

ax.add_patch(FancyBboxPatch((0.5, 3.5), 2, 1, boxstyle="round,pad=0.1",
                             edgecolor='#4A90E2', facecolor='#E3F2FD', linewidth=2))
ax.text(1.5, 4, 'Embeddings', ha='center', va='center', fontsize=10, fontweight='bold')

ax.arrow(1.5, 3.5, 0.8, -0.5, head_width=0.15, head_length=0.15, fc='#2E5C8A', ec='#2E5C8A')

# Vector Database (Center)
ax.add_patch(FancyBboxPatch((3, 1.5), 2.5, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#FFA500', facecolor='#FFF8DC', linewidth=3))
ax.text(4.25, 2.25, 'Vector\nDatabase', ha='center', va='center', fontsize=11, fontweight='bold')

# Step 2: Query Processing (Right)
ax.add_patch(FancyBboxPatch((7, 7.5), 2, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#50C878', facecolor='#E8F5E9', linewidth=2))
ax.text(8, 8.5, 'User\nQuery', ha='center', va='center', fontsize=11, fontweight='bold')

ax.arrow(8, 7.5, 0, -1.3, head_width=0.15, head_length=0.15, fc='#2E8B57', ec='#2E8B57')

ax.add_patch(FancyBboxPatch((7, 5), 2, 1, boxstyle="round,pad=0.1",
                             edgecolor='#50C878', facecolor='#E8F5E9', linewidth=2))
ax.text(8, 5.5, 'Query\nEmbedding', ha='center', va='center', fontsize=10, fontweight='bold')

ax.arrow(8, 5, -2.5, -2, head_width=0.15, head_length=0.15, fc='#2E8B57', ec='#2E8B57')

# Retrieval arrow
ax.arrow(5.5, 2.5, 1.3, 1.5, head_width=0.15, head_length=0.15, fc='#9C27B0', ec='#9C27B0')
ax.text(6.5, 4, 'Top-K\nRetrieval', ha='center', va='center', fontsize=9,
        bbox=dict(boxstyle='round', facecolor='#F3E5F5', alpha=0.8))

# LLM Generation
ax.add_patch(FancyBboxPatch((7, 1.5), 2, 1.5, boxstyle="round,pad=0.1",
                             edgecolor='#FF6B6B', facecolor='#FFEBEE', linewidth=2))
ax.text(8, 2.25, 'LLM\nGeneration', ha='center', va='center', fontsize=11, fontweight='bold')

ax.arrow(8, 1.5, 0, -0.8, head_width=0.15, head_length=0.15, fc='#C92A2A', ec='#C92A2A')

# Final Response
ax.add_patch(FancyBboxPatch((6.5, 0), 3, 0.5, boxstyle="round,pad=0.1",
                             edgecolor='#50C878', facecolor='#C8E6C9', linewidth=3))
ax.text(8, 0.25, 'Response + Citations', ha='center', va='center',
        fontsize=11, fontweight='bold')

# Title
ax.text(5, 9.5, 'RAG Pipeline Architecture', ha='center', va='center',
        fontsize=14, fontweight='bold', color='#333')

plt.tight_layout()
plt.show()

Keuntungan RAG:

| Aspek | LLM Murni | RAG System |
|---|---|---|
| Knowledge | Static (cutoff date) | ✅ Dynamic, updatable |
| Hallucination | Frequent | ✅ Reduced (grounded) |
| Source | No citation | ✅ Traceable sources |
| Domain Data | Generic only | ✅ Custom knowledge base |
| Cost | Low inference | Medium (retrieval + gen) |
| Latency | Fast (~100ms) | Slower (~500ms) |

10.1.3 Use Cases RAG di Industry

1. Customer Support Chatbots 🤖:

  • Knowledge base: Product manuals, FAQs, troubleshooting guides
  • Example: “How do I reset my password?” → Retrieve from internal docs

2. Legal Document Analysis ⚖️:

  • Corpus: Case law, regulations, contracts
  • Example: “Find precedents for patent infringement”

3. Medical Q&A 🏥:

  • Database: Medical literature, clinical guidelines
  • Example: “Treatment options for Type 2 diabetes”

4. Code Documentation 💻:

  • Codebase + docs retrieval
  • Example: “How to authenticate API requests in our system?”

5. Academic Research 📚:

  • Literature search + summarization
  • Example: “Recent advances in quantum computing 2024”

10.2 Embeddings & Vector Representations

10.2.1 Apa itu Embeddings?

Definisi:

Embedding adalah representasi vektor (array of numbers) dari text, image, atau data lain dalam high-dimensional space, dimana semantic similarity tercermin sebagai geometric proximity.

Intuisi:

import numpy as np
import matplotlib.pyplot as plt

# Simulate word embeddings (normally 768 or 1536 dimensions)
# We'll create simple 2D examples for visualization

np.random.seed(42)

# Categories of words
animals = {
    'cat': [0.2, 0.8], 'dog': [0.3, 0.85], 'lion': [0.25, 0.75],
    'tiger': [0.28, 0.72], 'elephant': [0.35, 0.7]
}

tech = {
    'computer': [0.8, 0.3], 'laptop': [0.82, 0.28], 'phone': [0.85, 0.35],
    'tablet': [0.83, 0.32], 'monitor': [0.78, 0.27]
}

fruits = {
    'apple': [0.5, 0.2], 'banana': [0.52, 0.18], 'orange': [0.48, 0.22],
    'grape': [0.51, 0.19], 'mango': [0.49, 0.21]
}

fig, ax = plt.subplots(figsize=(12, 8))

# Plot each category
for word, pos in animals.items():
    ax.scatter(pos[0], pos[1], c='#FF6B6B', s=300, alpha=0.6, edgecolors='darkred', linewidths=2)
    ax.annotate(word, pos, fontsize=11, ha='center', va='center', fontweight='bold')

for word, pos in tech.items():
    ax.scatter(pos[0], pos[1], c='#4A90E2', s=300, alpha=0.6, edgecolors='darkblue', linewidths=2)
    ax.annotate(word, pos, fontsize=11, ha='center', va='center', fontweight='bold')

for word, pos in fruits.items():
    ax.scatter(pos[0], pos[1], c='#50C878', s=300, alpha=0.6, edgecolors='darkgreen', linewidths=2)
    ax.annotate(word, pos, fontsize=11, ha='center', va='center', fontweight='bold')

# Add cluster circles
from matplotlib.patches import Circle
ax.add_patch(Circle((0.27, 0.76), 0.15, fill=False, edgecolor='#FF6B6B',
                     linewidth=2, linestyle='--', label='Animals'))
ax.add_patch(Circle((0.816, 0.3), 0.08, fill=False, edgecolor='#4A90E2',
                     linewidth=2, linestyle='--', label='Technology'))
ax.add_patch(Circle((0.5, 0.2), 0.05, fill=False, edgecolor='#50C878',
                     linewidth=2, linestyle='--', label='Fruits'))

ax.set_xlabel('Dimension 1', fontsize=12, fontweight='bold')
ax.set_ylabel('Dimension 2', fontsize=12, fontweight='bold')
ax.set_title('Embedding Space: Semantic Similarity as Distance',
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11, loc='upper right')
ax.grid(True, alpha=0.3)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])

plt.tight_layout()
plt.show()

Key Properties:

  1. Similar words = Close vectors:

    • “cat” dan “dog” berdekatan
    • “laptop” dan “computer” berdekatan
  2. Different concepts = Distant vectors:

    • “cat” jauh dari “laptop”
  3. Mathematical operations:

    • King - Man + Woman ≈ Queen
    • Paris - France + Italy ≈ Rome
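Properti aritmetika ini mudah didemonstrasikan dengan vektor mainan 2D (nilai dan dimensinya hipotetis, hanya untuk ilustrasi; embedding sungguhan berdimensi ratusan):

```python
import numpy as np

# Vektor mainan 2D: dimensi-0 ~ "royalty", dimensi-1 ~ "gender"
vectors = {
    "king":  np.array([0.9,  0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
    "queen": np.array([0.9, -0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# King - Man + Woman = ?
result = vectors["king"] - vectors["man"] + vectors["woman"]

# Cari kata dengan vektor terdekat terhadap hasil operasi
closest = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(closest)  # queen
```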

10.2.2 Sentence & Document Embeddings

Word Embeddings vs. Sentence Embeddings:

| Type | Example | Dimension | Use Case |
|---|---|---|---|
| Word | "cat" → [0.2, 0.8, …] | 300 (Word2Vec) | Word similarity |
| Sentence | "The cat sleeps" → [0.5, …] | 768 (BERT) | Semantic search |
| Document | Full article → [0.3, …] | 1536 (OpenAI) | Document retrieval |

Implementasi dengan Sentence Transformers:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load pre-trained model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "Machine learning adalah subset dari artificial intelligence",
    "Deep learning menggunakan neural networks dengan banyak layer",
    "Python adalah bahasa pemrograman populer untuk data science",
    "Jakarta adalah ibu kota Indonesia"
]

# Generate embeddings
embeddings = model.encode(sentences)

print(f"Embedding shape: {embeddings.shape}")  # (4, 384)
print(f"First embedding (truncated):\n{embeddings[0][:10]}")

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)

print("\nSimilarity Matrix:")
print(similarity_matrix)

Output:

Embedding shape: (4, 384)
First embedding (truncated):
[ 0.0234 -0.1234  0.5678 ... ]

Similarity Matrix:
[[1.000 0.812 0.456 0.123]   # Sent 1 vs all
 [0.812 1.000 0.489 0.098]   # Sent 2 vs all
 [0.456 0.489 1.000 0.156]   # Sent 3 vs all
 [0.123 0.098 0.156 1.000]]  # Sent 4 vs all

Interpretasi:

  • Sentence 1 dan 2 (ML/DL) sangat similar (0.812)
  • Sentence 4 (Jakarta) paling berbeda dari semua (low similarity)

10.2.3 Cosine Similarity untuk Retrieval

Formula:

\[ \text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} \]

Range: -1 (opposite) to +1 (identical)
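Formula di atas dapat diimplementasikan langsung dengan NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(A, B) = (A . B) / (||A|| * ||B||)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (identik)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (ortogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (berlawanan arah)
```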

Visualisasi:

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Case 1: High similarity
ax = axes[0]
v1 = np.array([0.8, 0.6])
v2 = np.array([0.7, 0.5])
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
          color='#4A90E2', width=0.01, label='Vector A')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
          color='#50C878', width=0.01, label='Vector B')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_title('High Similarity\n(cos ≈ 0.99)', fontweight='bold')

# Case 2: Medium similarity
ax = axes[1]
v1 = np.array([0.8, 0.6])
v2 = np.array([0.6, 0.8])
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
          color='#4A90E2', width=0.01, label='Vector A')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
          color='#FFD700', width=0.01, label='Vector C')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_title('Medium Similarity\n(cos ≈ 0.96)', fontweight='bold')

# Case 3: Low similarity
ax = axes[2]
v1 = np.array([0.8, 0.2])
v2 = np.array([0.2, 0.8])
ax.quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1,
          color='#4A90E2', width=0.01, label='Vector A')
ax.quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1,
          color='#FF6B6B', width=0.01, label='Vector D')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_title('Low Similarity\n(cos ≈ 0.47)', fontweight='bold')

plt.tight_layout()
plt.show()
⚠️ Common Pitfalls
  1. Dimensionality Matters:

    • Lebih tinggi ≠ selalu lebih baik
    • 384-dim cukup untuk banyak tasks
    • 1536-dim untuk complex semantic understanding
  2. Model Selection:

    • all-MiniLM-L6-v2 (384-dim): Fast, good general purpose
    • all-mpnet-base-v2 (768-dim): Better quality, slower
    • OpenAI text-embedding-ada-002 (1536-dim): Best quality, API cost
  3. Normalization:

    • Always normalize vectors untuk cosine similarity
    • sklearn.preprocessing.normalize() atau manual L2 norm
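Normalisasi L2 manual (poin 3 di atas) hanyalah membagi vektor dengan norm-nya; setelah itu, dot product sama dengan cosine similarity:

```python
import numpy as np

def l2_normalize(v):
    # Bagi vektor dengan L2 norm-nya sehingga panjangnya menjadi 1
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

v = l2_normalize([3.0, 4.0])
print(v)                  # [0.6 0.8]
print(np.linalg.norm(v))  # 1.0

# Untuk vektor yang sudah dinormalisasi: dot product == cosine similarity
a, b = l2_normalize([1, 2]), l2_normalize([2, 1])
print(np.dot(a, b))  # ~0.8
```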

10.3 Vector Databases

10.3.1 Mengapa Butuh Vector Database?

Problem Statement:

Bayangkan Anda punya 1 million documents. Untuk setiap query:

  • Compute cosine similarity dengan semua 1M vectors
  • Sort untuk find top-K
  • Time complexity: O(N × D) dimana N=documents, D=dimensions

Result: 🐌 SANGAT LAMBAT!

Solusi: Vector Database dengan ANN (Approximate Nearest Neighbors)
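Baseline brute-force O(N × D) di atas bisa ditulis dalam beberapa baris NumPy; inilah biaya yang dihindari ANN index (angka N dan D di bawah hanya ilustrasi dengan data acak):

```python
import numpy as np

rng = np.random.default_rng(42)
N, D = 100_000, 384  # 100 ribu dokumen, embedding 384 dimensi

# Simulasi database embedding yang sudah dinormalisasi
db = rng.normal(size=(N, D)).astype('float32')
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = rng.normal(size=D).astype('float32')
query /= np.linalg.norm(query)

# Brute force: N dot product + sorting penuh hanya untuk mengambil top-5
scores = db @ query
top_k = np.argsort(-scores)[:5]
print(top_k)
```

Untuk satu query hal ini masih terasa cepat, tetapi biayanya linear terhadap N; pada puluhan juta dokumen dan ribuan query per detik, ANN index (IVF, HNSW, dll.) menjadi keharusan.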

10.3.2 Pilihan Vector Databases

| Database | Type | Best For | Pros | Cons |
|---|---|---|---|---|
| FAISS | Library | Research, prototyping | Fast, free, Facebook-backed | Not distributed |
| ChromaDB | Embedded | Small-medium apps | Easy, Python-native | Limited scale |
| Pinecone | Cloud | Production apps | Managed, scalable | Cost, vendor lock-in |
| Weaviate | Self-hosted | Enterprise | Open source, GraphQL | Complex setup |
| Milvus | Self-hosted | Large scale | Distributed, fast | Requires infrastructure |
| Qdrant | Self-hosted | Modern apps | Rust-based, fast | Newer, smaller community |

10.3.3 Implementasi dengan FAISS

Installation:

pip install faiss-cpu  # or faiss-gpu for GPU support

Example: Building a simple vector search:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Sample documents
documents = [
    "Machine learning adalah cabang dari AI yang fokus pada pembelajaran dari data",
    "Deep learning menggunakan neural networks dengan banyak hidden layers",
    "Natural language processing memproses dan memahami bahasa manusia",
    "Computer vision memungkinkan komputer memahami gambar dan video",
    "Reinforcement learning belajar melalui trial and error dengan rewards",
    "Transfer learning memanfaatkan model pre-trained untuk task baru",
    "Ensemble methods menggabungkan multiple models untuk hasil lebih baik",
    "Python adalah bahasa pemrograman populer untuk data science",
]

# Generate embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(documents)

# Get embedding dimension
d = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2
print(f"Embedding dimension: {d}")

# Create FAISS index
# IndexFlatL2: exact search dengan L2 distance
# (untuk cosine similarity, bisa juga pakai IndexFlatIP pada vektor ternormalisasi)
index = faiss.IndexFlatL2(d)

# Normalize vectors untuk cosine similarity
faiss.normalize_L2(embeddings)

# Add vectors to index
index.add(embeddings.astype('float32'))

print(f"Total vectors dalam index: {index.ntotal}")

# Query
query = "Apa itu neural networks?"
query_embedding = model.encode([query])
faiss.normalize_L2(query_embedding)

# Search for top-3 most similar documents
k = 3
distances, indices = index.search(query_embedding.astype('float32'), k)

print(f"\nQuery: '{query}'")
print("\nTop-3 hasil retrieval:")
for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
    # FAISS IndexFlatL2 mengembalikan *squared* L2 distance;
    # untuk vektor ternormalisasi berlaku d^2 = 2 - 2*cos, sehingga cos = 1 - d^2/2
    similarity = 1 - (dist / 2)
    print(f"{i+1}. [Score: {similarity:.3f}] {documents[idx]}")

Output:

Embedding dimension: 384
Total vectors dalam index: 8

Query: 'Apa itu neural networks?'

Top-3 hasil retrieval:
1. [Score: 0.892] Deep learning menggunakan neural networks dengan banyak hidden layers
2. [Score: 0.734] Machine learning adalah cabang dari AI yang fokus pada pembelajaran dari data
3. [Score: 0.698] Transfer learning memanfaatkan model pre-trained untuk task baru

10.3.4 Implementasi dengan ChromaDB

ChromaDB lebih user-friendly dan persistent:

import chromadb

# Persistent client (API ChromaDB >= 0.4); data akan disimpan di ./chroma_db
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="ml_documents",
    metadata={"description": "Machine learning knowledge base"}
)

# Add documents
documents = [
    "Machine learning adalah cabang dari AI yang fokus pada pembelajaran dari data",
    "Deep learning menggunakan neural networks dengan banyak hidden layers",
    "Natural language processing memproses dan memahami bahasa manusia",
]

collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"],
    metadatas=[
        {"category": "ML", "difficulty": "beginner"},
        {"category": "DL", "difficulty": "intermediate"},
        {"category": "NLP", "difficulty": "intermediate"}
    ]
)

# Query
results = collection.query(
    query_texts=["Apa itu neural networks?"],
    n_results=2,
    where={"difficulty": "intermediate"}  # Optional metadata filter
)

print("Hasil retrieval:")
for doc, dist, meta in zip(results['documents'][0],
                            results['distances'][0],
                            results['metadatas'][0]):
    print(f"[Distance: {dist:.3f}] {doc}")
    print(f"  Metadata: {meta}\n")
💡 Best Practices
  1. Choose the right index type:

    • IndexFlatL2: Exact search, small datasets (<100K)
    • IndexIVFFlat: Approximate, medium datasets (100K-1M)
    • IndexHNSW: Fast approximate, large datasets (>1M)
  2. Batch processing:

    • Add vectors in batches (e.g., 1000 at a time)
    • Faster than one-by-one
  3. Persistence:

    • FAISS: Save/load dengan faiss.write_index() dan faiss.read_index()
    • ChromaDB: Otomatis persistent ke disk

10.4 Building RAG Systems

10.4.1 Text Chunking Strategies

Mengapa Chunking?

  • Documents terlalu panjang untuk embed sekaligus
  • Token limits (e.g., 512 tokens untuk BERT)
  • Better retrieval granularity

Chunking Methods:

# Method 1: Fixed-size chunking
def chunk_by_tokens(text, chunk_size=512, overlap=50):
    """
    Split text into fixed-size chunks dengan overlap.

    Args:
        text: Input text
        chunk_size: Number of tokens per chunk
        overlap: Number of overlapping tokens antar chunk

    Returns:
        List of text chunks
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    tokens = tokenizer.encode(text, add_special_tokens=False)

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text)
        start += (chunk_size - overlap)

    return chunks

# Method 2: Semantic chunking (berdasarkan paragraf/section)
def chunk_by_paragraph(text):
    """Split by paragraph boundaries."""
    paragraphs = text.split('\n\n')
    return [p.strip() for p in paragraphs if p.strip()]

# Method 3: Recursive character splitting (LangChain style)
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Hierarchy of split points
)

sample_text = """
Machine learning adalah bidang yang berkembang pesat.
Dalam beberapa tahun terakhir, deep learning telah merevolusi berbagai domain.

Computer vision kini dapat mengenali objek dengan akurasi superhuman.
Natural language processing memungkinkan chatbot yang sangat natural.

Ke depannya, AI akan semakin terintegrasi dalam kehidupan sehari-hari.
"""

chunks = text_splitter.split_text(sample_text)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i}:")
    print(chunk)

Comparison:

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Fixed tokens | Consistent size, fast | Might split mid-sentence | Technical docs |
| Paragraph | Semantic coherence | Variable size | Articles, books |
| Recursive | Best of both worlds | More complex | General purpose |

10.4.2 Complete RAG Pipeline dengan LangChain

LangChain adalah framework populer untuk building LLM applications.

Installation:

pip install langchain langchain-community chromadb openai

Full RAG Example:

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_community.llms import OpenAI  # atau Ollama untuk local LLM

# Step 1: Load documents
# Misalnya dari folder berisi txt files
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('./docs/', glob="**/*.txt")
documents = loader.load()

print(f"Loaded {len(documents)} documents")

# Step 2: Chunking
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
texts = text_splitter.split_documents(documents)
print(f"Split into {len(texts)} chunks")

# Step 3: Create embeddings & vector store
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = Chroma.from_documents(
    documents=texts,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Step 4: Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}  # Retrieve top-3 chunks
)

# Step 5: Create QA chain
llm = OpenAI(temperature=0)  # Ganti dengan model pilihan Anda

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff", "map_reduce", "refine", or "map_rerank"
    retriever=retriever,
    return_source_documents=True
)

# Step 6: Query!
query = "Apa perbedaan antara supervised dan unsupervised learning?"
result = qa_chain.invoke({"query": query})

print(f"Question: {query}")
print(f"\nAnswer: {result['result']}")
print(f"\nSources:")
for i, doc in enumerate(result['source_documents'], 1):
    print(f"{i}. {doc.metadata.get('source', 'Unknown')}")
    print(f"   {doc.page_content[:200]}...")

Chain Types Explained:

  1. Stuff: Masukkan semua retrieved docs ke satu prompt
    • Pros: Simple, best quality
    • Cons: Limited by context window
  2. Map-Reduce: Process each doc separately, then combine
    • Pros: Handles banyak docs
    • Cons: Might lose connections
  3. Refine: Iteratively refine answer dengan each doc
    • Pros: Good for comprehensive answers
    • Cons: Slower, more LLM calls
  4. Map-Rerank: Score each doc’s answer, pilih terbaik
    • Pros: High quality
    • Cons: Expensive (many LLM calls)
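Ide chain type "stuff" dapat disketsakan sebagai satu fungsi konstruksi prompt (ini ilustrasi konsep, bukan implementasi LangChain yang sebenarnya):

```python
def stuff_prompt(question, docs):
    # Semua retrieved chunk "di-stuff" ke dalam satu prompt;
    # batasannya: total panjang harus muat di context window LLM
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, 1))
    return (
        "Jawab pertanyaan berikut HANYA berdasarkan konteks.\n\n"
        f"Konteks:\n{context}\n\n"
        f"Pertanyaan: {question}\nJawaban:"
    )

docs = [
    "Supervised learning menggunakan data berlabel.",
    "Unsupervised learning menemukan pola tanpa label.",
]
print(stuff_prompt("Apa beda supervised dan unsupervised learning?", docs))
```

Map-reduce, refine, dan map-rerank berbeda hanya pada *cara* chunk-chunk ini diproses: per dokumen lalu digabung, iteratif, atau di-skor lalu dipilih yang terbaik.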

10.5 AI Agents: Dari RAG ke Autonomous Systems

10.5.1 Apa itu AI Agents?

Definisi:

AI Agent adalah sistem yang dapat menggunakan tools, memory, dan reasoning untuk secara autonomous menyelesaikan tasks complex.

RAG vs. Agents:

| Aspect | RAG System | AI Agent |
|---|---|---|
| Capability | Retrieve & answer | Reason, plan, act |
| Tools | None (just retrieval) | Can use tools (calculator, API, etc.) |
| Memory | Stateless | Can maintain memory |
| Autonomy | Single-step | Multi-step planning |
| Example | Q&A chatbot | Personal assistant |

Agent Architecture:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#4A90E2', 'primaryTextColor': '#333', 'primaryBorderColor': '#2E5C8A', 'lineColor': '#2E5C8A', 'secondaryColor': '#50C878', 'tertiaryColor': '#FFD700'}}}%%
graph TB
    A[User Task] --> B[Agent Core<br/>LLM Reasoning]

    B --> C{Decision}
    C -->|Need Info| D[Retrieval Tool<br/>RAG/Search]
    C -->|Need Calculation| E[Calculator Tool]
    C -->|Need API Data| F[API Tool]
    C -->|Need Memory| G[Memory Store]

    D --> H[Observation]
    E --> H
    F --> H
    G --> H

    H --> B

    C -->|Task Complete| I[Final Answer]

    style A fill:#4A90E2,stroke:#2E5C8A,stroke-width:2px,color:#fff
    style B fill:#FF6B6B,stroke:#C92A2A,stroke-width:2px,color:#fff
    style I fill:#50C878,stroke:#2E8B57,stroke-width:2px,color:#fff
    style C fill:#FFD700,stroke:#FFA500,stroke-width:2px,color:#333

10.5.2 ReAct Pattern: Reasoning + Acting

ReAct Framework (Yao et al., 2022):

  • Reason: Think about what to do
  • Act: Execute an action
  • Observe: See the result
  • Repeat until task solved

Example Task: “What’s the weather in Jakarta and should I bring an umbrella?”

Agent Trace:

Thought: I need to get current weather data for Jakarta
Action: weather_api
Action Input: {"city": "Jakarta", "country": "ID"}
Observation: {"temperature": 28, "condition": "rainy", "humidity": 85}

Thought: It's rainy with high humidity. User should bring umbrella.
Action: Final Answer
Action Input: "The current weather in Jakarta is rainy with 28°C and 85% humidity. Yes, you should definitely bring an umbrella!"
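Loop Reason→Act→Observe di atas dapat disketsakan dalam Python murni. LLM dan tool di sini hanyalah stub hipotetis agar strukturnya terlihat; framework seperti LangChain mengisi stub ini dengan LLM dan tool sungguhan.

```python
def stub_llm(scratchpad):
    # Stub pengganti LLM: memutuskan aksi berikutnya dari isi scratchpad
    if "Observation" not in scratchpad:
        return "weather_api", '{"city": "Jakarta"}'
    return "final_answer", "Hujan di Jakarta, sebaiknya bawa payung."

def weather_api(action_input):
    # Stub tool cuaca (sistem nyata memanggil API sungguhan)
    return '{"condition": "rainy", "temperature": 28}'

tools = {"weather_api": weather_api}

def run_agent(question, max_iterations=5):
    scratchpad = f"Question: {question}"
    for _ in range(max_iterations):  # batas iterasi mencegah runaway loop
        action, action_input = stub_llm(scratchpad)              # Reason
        if action == "final_answer":
            return action_input
        observation = tools[action](action_input)                # Act
        scratchpad += f"\nAction: {action}\nObservation: {observation}"  # Observe
    return "Gagal menyelesaikan task dalam batas iterasi."

print(run_agent("Cuaca Jakarta? Perlu payung?"))
```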

10.5.3 Building an Agent dengan LangChain

from langchain.agents import Tool, AgentExecutor, create_react_agent
from langchain.prompts import PromptTemplate
from langchain_community.llms import OpenAI
from langchain.chains import LLMMathChain

# Define tools
llm = OpenAI(temperature=0)

# Tool 1: Calculator
llm_math = LLMMathChain.from_llm(llm)

calculator = Tool(
    name="Calculator",
    func=llm_math.run,
    description="Useful untuk mathematical calculations. Input harus math expression."
)

# Tool 2: RAG search (dari section sebelumnya)
def rag_search(query: str) -> str:
    """Search knowledge base."""
    result = qa_chain.invoke({"query": query})
    return result['result']

knowledge_base = Tool(
    name="KnowledgeBase",
    func=rag_search,
    description="Useful untuk pertanyaan tentang machine learning concepts. Input harus pertanyaan lengkap."
)

# Tool 3: Python REPL (optional, untuk code execution)
from langchain_experimental.utilities import PythonREPL  # perlu: pip install langchain-experimental
python_repl = PythonREPL()

python_tool = Tool(
    name="PythonREPL",
    func=python_repl.run,
    description="Useful untuk execute Python code. Input harus valid Python code."
)

# Combine tools
tools = [calculator, knowledge_base, python_tool]

# Create agent prompt
template = """You are a helpful AI assistant. Answer the following questions as best you can. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

Question: {input}
Thought: {agent_scratchpad}"""

prompt = PromptTemplate.from_template(template)

# Create agent
agent = create_react_agent(llm, tools, prompt)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,  # Print reasoning steps
    max_iterations=5,
    handle_parsing_errors=True
)

# Example queries
queries = [
    "Berapa hasil dari 25 * 47 + 138?",
    "Apa itu gradient descent? Lalu hitung derivatif dari x^2 + 3x + 5 di x=2",
    "Generate list 10 angka Fibonacci menggunakan Python, lalu hitung mean-nya"
]

for query in queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print(f"{'='*60}")

    result = agent_executor.invoke({"input": query})
    print(f"\nFinal Answer: {result['output']}")

Output Example:

============================================================
Query: Berapa hasil dari 25 * 47 + 138?
============================================================

> Entering new AgentExecutor chain...
Thought: I need to perform a mathematical calculation
Action: Calculator
Action Input: 25 * 47 + 138
Observation: 1313
Thought: I now know the final answer
Final Answer: 1313

> Finished chain.

Final Answer: 1313

10.5.4 Agent Memory: Conversation History

Types of Memory:

  1. ConversationBufferMemory: Store semua conversation
  2. ConversationSummaryMemory: Summarize old messages
  3. ConversationBufferWindowMemory: Keep last N messages
  4. VectorStoreRetrieverMemory: Semantic search pada history
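Ide di balik ConversationBufferWindowMemory (tipe 3) mudah disketsakan dalam Python murni: simpan hanya N pesan terakhir (sketsa konsep, bukan API LangChain):

```python
from collections import deque

class BufferWindowMemory:
    # Sketsa konsep buffer window: deque(maxlen=k) otomatis membuang pesan tertua
    def __init__(self, k=3):
        self.messages = deque(maxlen=k)

    def add(self, role, content):
        self.messages.append((role, content))

    def as_context(self):
        # Gabungkan history menjadi string untuk disisipkan ke prompt
        return "\n".join(f"{role}: {content}" for role, content in self.messages)

memory = BufferWindowMemory(k=2)
memory.add("user", "Halo, nama saya Budi")
memory.add("ai", "Halo Budi!")
memory.add("user", "Apa itu neural network?")
print(memory.as_context())  # hanya 2 pesan terakhir yang tersisa
```

Trade-off-nya terlihat langsung: window kecil hemat token tetapi bisa "melupakan" fakta awal percakapan; ConversationSummaryMemory mengatasi ini dengan merangkum pesan lama alih-alih membuangnya.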

Implementation:

from langchain.memory import ConversationBufferMemory
from langchain.agents import initialize_agent, AgentType

# Create memory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Initialize agent dengan memory
agent_with_memory = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=True
)

# Multi-turn conversation
print(agent_with_memory.run("Halo, nama saya Budi"))
# Output: "Hello Budi! How can I help you today?"

print(agent_with_memory.run("Apa itu neural network?"))
# Output: *retrieves from knowledge base*

print(agent_with_memory.run("Siapa nama saya tadi?"))
# Output: "Your name is Budi" (remembered from the conversation)
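The trade-off behind ConversationBufferWindowMemory (strategy 3 above) can be sketched without LangChain: keep only the last N exchanges so the prompt stays bounded. `WindowMemory` here is a hypothetical stand-in, not LangChain's actual implementation.

```python
from collections import deque

class WindowMemory:
    """Keep only the last `window` (user, assistant) turns."""

    def __init__(self, window=3):
        self.turns = deque(maxlen=window)  # old turns are evicted automatically

    def add_turn(self, user_msg, assistant_msg):
        self.turns.append((user_msg, assistant_msg))

    def as_prompt(self):
        """Render the remembered turns as prompt context."""
        lines = []
        for user_msg, assistant_msg in self.turns:
            lines.append(f"User: {user_msg}")
            lines.append(f"Assistant: {assistant_msg}")
        return "\n".join(lines)

memory = WindowMemory(window=2)
memory.add_turn("Halo, nama saya Budi", "Hello Budi!")
memory.add_turn("Apa itu neural network?", "A neural network is ...")
memory.add_turn("Siapa nama saya?", "Your name is Budi")

print(memory.as_prompt())  # only the last 2 turns survive
```

With `window=2`, the first greeting has already been evicted, which is exactly the failure mode to watch for: facts mentioned early in a long conversation silently drop out of the window.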
💡 Agent Best Practices
  1. Tool Descriptions Matter: Clear descriptions → better tool selection
  2. Limit Tools: Too many tools confuse the agent (max 5-7)
  3. Validation: Always validate tool outputs before using
  4. Error Handling: Gracefully handle tool failures
  5. Cost Control: Set max_iterations to avoid runaway loops
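Best practice 1 ("tool descriptions matter") can be illustrated without an LLM: a ReAct-style agent chooses tools by matching the task against their descriptions. `pick_tool` below is a hypothetical keyword-overlap selector, far cruder than an LLM, but the failure mode is the same: vague descriptions lead to wrong tool choices.

```python
def pick_tool(query, tools):
    """Pick the tool whose description shares the most words with the query."""
    query_words = set(query.lower().split())
    best, best_score = None, -1
    for name, description in tools.items():
        score = len(query_words & set(description.lower().split()))
        if score > best_score:
            best, best_score = name, score
    return best

good_tools = {
    "Calculator": "evaluate arithmetic expressions and math calculations",
    "Wikipedia": "look up factual information about people places and concepts",
}

print(pick_tool("calculate 25 * 47 math", good_tools))  # prints: Calculator
```

If both descriptions were the generic "a useful tool", every query would tie and selection would degrade to order in the dict; an LLM agent degrades in an analogous way when descriptions are uninformative.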

10.6 Evaluating RAG Systems

10.6.1 Evaluation Metrics

Challenge: there is no single “ground truth” for generative tasks!

Solution: Multiple evaluation dimensions

1. Retrieval Quality:

Code
# Precision@K: what fraction of retrieved docs are relevant?
def precision_at_k(retrieved_docs, relevant_docs, k):
    """
    Calculate Precision@K.

    Args:
        retrieved_docs: List of retrieved doc IDs
        relevant_docs: Set of truly relevant doc IDs
        k: Number of top results to consider

    Returns:
        Precision score [0, 1]
    """
    top_k = retrieved_docs[:k]
    relevant_in_top_k = [doc for doc in top_k if doc in relevant_docs]
    return len(relevant_in_top_k) / k

# Recall@K: what fraction of relevant docs were retrieved?
def recall_at_k(retrieved_docs, relevant_docs, k):
    """Calculate Recall@K."""
    top_k = retrieved_docs[:k]
    relevant_in_top_k = [doc for doc in top_k if doc in relevant_docs]
    return len(relevant_in_top_k) / len(relevant_docs) if relevant_docs else 0

# Mean Reciprocal Rank (MRR)
def mrr(retrieved_docs, relevant_docs):
    """
    Calculate MRR.
    First relevant doc at position 1 → score = 1
    First relevant doc at position 2 → score = 0.5
    """
    for i, doc in enumerate(retrieved_docs, 1):
        if doc in relevant_docs:
            return 1 / i
    return 0

# Example
retrieved = ['doc3', 'doc1', 'doc7', 'doc2', 'doc5']
relevant = {'doc1', 'doc2', 'doc4'}

print(f"Precision@3: {precision_at_k(retrieved, relevant, 3):.2f}")  # 1/3 = 0.33
print(f"Recall@3: {recall_at_k(retrieved, relevant, 3):.2f}")        # 1/3 = 0.33
print(f"MRR: {mrr(retrieved, relevant):.2f}")                        # 1/2 = 0.50

2. Generation Quality:

Code
# BLEU, ROUGE (traditional metrics)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

reference = "Machine learning is a subset of artificial intelligence"
generated = "Machine learning is part of AI"

# BLEU (precision-focused); smoothing avoids zero scores on short sentences
smoothie = SmoothingFunction().method1
bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=smoothie)
print(f"BLEU: {bleu:.2f}")

# ROUGE (recall-focused)
rouge = Rouge()
scores = rouge.get_scores(generated, reference)
print(f"ROUGE-L F1: {scores[0]['rouge-l']['f']:.2f}")
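BLEU and ROUGE depend on the nltk and rouge packages. For quick, dependency-free checks, a token-overlap F1 in the spirit of SQuAD-style QA evaluation is a common alternative; this is an illustrative sketch, not part of either library:

```python
from collections import Counter

def token_f1(reference, generated):
    """Harmonic mean of token precision and recall between two strings."""
    ref_tokens = reference.lower().split()
    gen_tokens = generated.lower().split()
    overlap = Counter(ref_tokens) & Counter(gen_tokens)  # multiset intersection
    n_common = sum(overlap.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(gen_tokens)
    recall = n_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "Machine learning is a subset of artificial intelligence"
generated = "Machine learning is part of AI"

print(f"Token F1: {token_f1(reference, generated):.2f}")  # 0.57
```

Note the weakness shared with BLEU/ROUGE: "AI" and "artificial intelligence" count as a miss, which is why the semantic metrics below are also needed.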

3. Faithfulness (Groundedness):

Is the generated answer grounded in the retrieved context?

Code
# Using LLM as judge
def evaluate_faithfulness(context, answer, llm):
    """
    Check if answer is supported by context.

    Returns:
        score [0-5], reasoning
    """
    prompt = f"""Given the following context and answer, rate how well the answer is supported by the context on a scale of 0-5:

Context: {context}

Answer: {answer}

Rating (0=not supported at all, 5=fully supported):
Reasoning: """

    response = llm(prompt)
    return response

# Example
context = "Python was created by Guido van Rossum in 1991."
answer = "Python was developed in the early 1990s"

score = evaluate_faithfulness(context, answer, llm)
print(score)
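An LLM judge costs an API call per check. A crude lexical proxy (the fraction of answer content words that appear in the context) can serve as a cheap first-pass filter; this is an illustrative heuristic, not a substitute for the LLM judge above.

```python
def lexical_groundedness(context, answer, min_len=4):
    """Fraction of answer content words (len >= min_len) found in the context."""
    context_words = set(context.lower().split())
    answer_words = [w for w in answer.lower().split() if len(w) >= min_len]
    if not answer_words:
        return 1.0  # nothing substantive to verify
    supported = sum(1 for w in answer_words if w in context_words)
    return supported / len(answer_words)

context = "Python was created by Guido van Rossum in 1991."
answer = "Python was developed in the early 1990s"

score = lexical_groundedness(context, answer)
print(f"Lexical groundedness: {score:.2f}")  # 0.25
```

The example also shows the heuristic's limit: the answer is true and grounded in spirit, but paraphrase ("created in 1991" vs "developed in the early 1990s") scores low, so low lexical scores should trigger the LLM judge rather than an automatic rejection.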

4. Relevance:

Is the answer relevant to the question?

Code
def evaluate_relevance(question, answer, llm):
    """Rate answer relevance to question."""
    prompt = f"""Rate how relevant the answer is to the question (0-5):

Question: {question}
Answer: {answer}

Rating:
Reasoning: """

    return llm(prompt)

10.6.2 End-to-End Evaluation Framework

Code
import pandas as pd  # used to build the results DataFrame below

class RAGEvaluator:
    """Comprehensive RAG evaluation."""

    def __init__(self, rag_chain, llm_judge):
        self.rag_chain = rag_chain
        self.llm_judge = llm_judge

    def evaluate(self, test_cases):
        """
        Evaluate the RAG system on test cases.

        Args:
            test_cases: List of dicts with keys:
                - 'question': str
                - 'expected_answer': str (optional)
                - 'relevant_docs': set (optional)

        Returns:
            Evaluation results dataframe
        """
        results = []

        for case in test_cases:
            # Run RAG
            output = self.rag_chain({"query": case['question']})
            answer = output['result']
            retrieved_docs = [doc.metadata['id'] for doc in output['source_documents']]

            # Evaluate retrieval
            if 'relevant_docs' in case:
                precision = precision_at_k(retrieved_docs, case['relevant_docs'], 3)
                recall = recall_at_k(retrieved_docs, case['relevant_docs'], 3)
            else:
                precision, recall = None, None

            # Evaluate generation
            context = "\n".join([doc.page_content for doc in output['source_documents']])
            faithfulness = self.evaluate_faithfulness(context, answer)
            relevance = self.evaluate_relevance(case['question'], answer)

            results.append({
                'question': case['question'],
                'answer': answer,
                'precision@3': precision,
                'recall@3': recall,
                'faithfulness': faithfulness,
                'relevance': relevance
            })

        return pd.DataFrame(results)

    def evaluate_faithfulness(self, context, answer):
        """LLM-based faithfulness check."""
        # Implementation similar to above
        pass

    def evaluate_relevance(self, question, answer):
        """LLM-based relevance check."""
        # Implementation similar to above
        pass

# Usage
test_cases = [
    {
        'question': 'Apa itu backpropagation?',
        'relevant_docs': {'doc12', 'doc34'}
    },
    {
        'question': 'Perbedaan CNN dan RNN?',
        'relevant_docs': {'doc45', 'doc67', 'doc89'}
    }
]

evaluator = RAGEvaluator(qa_chain, llm)
results = evaluator.evaluate(test_cases)
print(results)

10.6.3 A/B Testing RAG Configurations

Test different configurations:

Code
import pandas as pd
import matplotlib.pyplot as plt

# Configurations to test
configs = [
    {
        'name': 'Baseline',
        'chunk_size': 500,
        'chunk_overlap': 50,
        'k': 3,
        'search_type': 'similarity'
    },
    {
        'name': 'Larger Chunks',
        'chunk_size': 1000,
        'chunk_overlap': 100,
        'k': 3,
        'search_type': 'similarity'
    },
    {
        'name': 'More Retrieval',
        'chunk_size': 500,
        'chunk_overlap': 50,
        'k': 5,
        'search_type': 'similarity'
    },
    {
        'name': 'MMR (diversity)',
        'chunk_size': 500,
        'chunk_overlap': 50,
        'k': 3,
        'search_type': 'mmr'  # Maximal Marginal Relevance
    }
]

# Run experiments
results = []
for config in configs:
    # Build RAG with this config ('name' is just a label, not a parameter)
    params = {k: v for k, v in config.items() if k != 'name'}
    rag = build_rag_system(**params)

    # Evaluate the freshly built pipeline
    evaluator = RAGEvaluator(rag, llm)
    eval_results = evaluator.evaluate(test_cases)

    # Aggregate
    results.append({
        'config': config['name'],
        'avg_precision': eval_results['precision@3'].mean(),
        'avg_faithfulness': eval_results['faithfulness'].mean(),
        'avg_relevance': eval_results['relevance'].mean()
    })

results_df = pd.DataFrame(results)
print(results_df)

# Visualize
results_df.plot(x='config', kind='bar', figsize=(12, 6))
plt.title('RAG Configuration Comparison')
plt.ylabel('Score')
plt.legend(['Precision@3', 'Faithfulness', 'Relevance'])
plt.tight_layout()
plt.show()
📊 Evaluation Best Practices
  1. Use Multiple Metrics: No single metric captures all aspects
  2. LLM-as-Judge: effective for semantic evaluation
  3. Human Eval: the gold standard, but expensive (use it for validation)
  4. Regression Testing: track metrics over time
  5. Domain-Specific: customize metrics for your use case
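Regression testing (practice 4) can be as simple as keeping a metric history and flagging drops beyond a tolerance. A minimal sketch; the class name and threshold are illustrative choices:

```python
class MetricTracker:
    """Track a metric across evaluation runs and flag regressions."""

    def __init__(self, tolerance=0.05):
        self.history = {}          # metric name -> list of scores
        self.tolerance = tolerance

    def record(self, name, score):
        """Record a score; return True if it regressed vs the previous run."""
        runs = self.history.setdefault(name, [])
        regressed = bool(runs) and score < runs[-1] - self.tolerance
        runs.append(score)
        return regressed

tracker = MetricTracker(tolerance=0.05)
tracker.record("faithfulness", 0.82)         # first run, no baseline yet
print(tracker.record("faithfulness", 0.80))  # within tolerance -> False
print(tracker.record("faithfulness", 0.70))  # dropped 0.10 -> True
```

In practice the history would be persisted (e.g. alongside CI runs) so that a prompt or chunking change that silently hurts faithfulness is caught before deployment.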

10.7 Production Considerations

10.7.1 Scalability & Performance

Challenges:

  1. Latency: retrieval + generation can be slow
  2. Throughput: many concurrent users
  3. Cost: API calls for embeddings and the LLM are expensive

Solutions:

Code
# 1. Caching (simple approach)
from datetime import datetime, timedelta

cache = {}
CACHE_EXPIRY = timedelta(hours=1)

def cached_rag(query):
    """Cache RAG results for repeated queries."""
    if query in cache:
        result, timestamp = cache[query]
        if datetime.now() - timestamp < CACHE_EXPIRY:
            print("Cache hit!")
            return result

    # Cache miss
    result = qa_chain({"query": query})
    cache[query] = (result, datetime.now())
    return result

# 2. Batch processing untuk embeddings
def batch_embed(texts, batch_size=32):
    """Embed in batches for efficiency."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        batch_embeddings = model.encode(batch)
        embeddings.extend(batch_embeddings)
    return embeddings

# 3. Async processing
import asyncio

async def async_rag(query):
    """Asynchronous RAG for concurrency."""
    # Parallelize retrieval and the LLM call where possible
    retrieval_task = asyncio.create_task(retrieve(query))
    # ... implementation
    pass
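To make the async stub concrete, here is a runnable sketch with simulated I/O: independent queries are served concurrently with `asyncio.gather`, so total wall time tracks the slowest query rather than the sum. All names and delays are illustrative stand-ins for real retrieval and LLM calls.

```python
import asyncio

async def retrieve(query):
    """Stand-in for vector-store retrieval (simulated I/O wait)."""
    await asyncio.sleep(0.05)
    return [f"doc about {query}"]

async def generate(query, docs):
    """Stand-in for the LLM call."""
    await asyncio.sleep(0.05)
    return f"Answer to '{query}' based on {len(docs)} docs"

async def async_rag(query):
    # Within one query the steps are sequential: retrieval feeds generation
    docs = await retrieve(query)
    return await generate(query, docs)

async def main():
    queries = ["What is RAG?", "What is an agent?", "What is FAISS?"]
    # Across queries, all three pipelines run concurrently
    answers = await asyncio.gather(*(async_rag(q) for q in queries))
    for answer in answers:
        print(answer)

asyncio.run(main())
```

Real implementations would use an async HTTP client (or the async APIs of the LLM/vector-store SDKs) in place of `asyncio.sleep`.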

10.7.2 Monitoring & Observability

Key Metrics to Track:

Code
import logging
from datetime import datetime

class RAGMonitor:
    """Monitor RAG system performance."""

    def __init__(self):
        self.metrics = {
            'total_queries': 0,
            'avg_latency': 0,
            'cache_hit_rate': 0,
            'error_rate': 0
        }
        self.logger = logging.getLogger(__name__)

    def log_query(self, query, latency, cached, error=None):
        """Log each query."""
        self.metrics['total_queries'] += 1

        # Update metrics
        n = self.metrics['total_queries']
        self.metrics['avg_latency'] = (
            (self.metrics['avg_latency'] * (n-1) + latency) / n
        )

        # Hits contribute 1 and misses 0, so the running rate stays correct
        hit = 1 if cached else 0
        self.metrics['cache_hit_rate'] = (
            (self.metrics['cache_hit_rate'] * (n-1) + hit) / n
        )

        err = 1 if error else 0
        self.metrics['error_rate'] = (
            (self.metrics['error_rate'] * (n-1) + err) / n
        )
        if error:
            self.logger.error(f"Query error: {error}")

        # Log to file/database
        self.logger.info(f"Query: {query[:50]}... | Latency: {latency:.2f}s | Cached: {cached}")

    def get_metrics(self):
        """Return current metrics."""
        return self.metrics

# Usage
monitor = RAGMonitor()

def monitored_rag(query):
    """RAG with monitoring."""
    start_time = datetime.now()
    cached = False
    error = None

    try:
        # Check membership before calling: cached_rag() inserts the entry
        # on a miss, so checking afterwards would always report a hit
        cached = query in cache
        result = cached_rag(query)
    except Exception as e:
        error = str(e)
        raise
    finally:
        latency = (datetime.now() - start_time).total_seconds()
        monitor.log_query(query, latency, cached, error)

    return result
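Running averages hide tail latency; production dashboards usually track percentiles over a recent window instead. A minimal sketch (the window size and nearest-rank method are illustrative choices):

```python
from collections import deque

class LatencyWindow:
    """Keep the last `size` latency samples and report percentiles."""

    def __init__(self, size=1000):
        self.samples = deque(maxlen=size)  # oldest samples fall off

    def add(self, latency):
        self.samples.append(latency)

    def percentile(self, p):
        """Nearest-rank percentile over the current window."""
        ordered = sorted(self.samples)
        if not ordered:
            return None
        rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[rank]

window = LatencyWindow(size=100)
for ms in [120, 90, 110, 2000, 100, 95, 105, 98, 102, 99]:
    window.add(ms)

print(f"p50: {window.percentile(50)}")  # median-ish latency
print(f"p95: {window.percentile(95)}")  # tail dominated by the 2000 ms outlier
```

Here the mean (~292 ms) would suggest everything is fine, while the p95 exposes the slow outlier that some users actually experience.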

10.7.3 Security & Privacy

Concerns:

  1. Data Privacy: User queries might contain sensitive info
  2. Injection Attacks: Malicious prompts
  3. Data Leakage: Retrieved docs might expose confidential info

Mitigations:

Code
# 1. Input sanitization
def sanitize_input(query):
    """Remove potentially malicious content."""
    # Remove excessively long inputs
    max_length = 500
    query = query[:max_length]

    # Remove common injection patterns
    forbidden_patterns = ['ignore previous', 'disregard', 'system:']
    for pattern in forbidden_patterns:
        if pattern.lower() in query.lower():
            raise ValueError("Potentially malicious input detected")

    return query

# 2. Access control
def check_permissions(user_id, doc_id):
    """Check whether the user can access the document."""
    # Query database/ACL
    # Return True/False
    pass

def filtered_retrieval(query, user_id):
    """Retrieve only docs the user is allowed to access."""
    all_results = vectorstore.similarity_search(query, k=10)

    # Filter by permissions
    filtered = [
        doc for doc in all_results
        if check_permissions(user_id, doc.metadata['id'])
    ]

    return filtered[:3]  # Return top-3 after filtering

# 3. PII redaction
import re

def redact_pii(text):
    """Remove personally identifiable information."""
    # Email
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', text)

    # Phone (Indonesia)
    text = re.sub(r'\b(?:\+62|0)\d{9,11}\b', '[PHONE]', text)

    # ID numbers (example: 16 digits)
    text = re.sub(r'\b\d{16}\b', '[ID_NUMBER]', text)

    return text
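A quick sanity check of the patterns above (the function is repeated so the snippet is self-contained; the sample contact details are made up):

```python
import re

def redact_pii(text):
    """Remove personally identifiable information (same patterns as above)."""
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', text)
    text = re.sub(r'\b(?:\+62|0)\d{9,11}\b', '[PHONE]', text)  # Indonesian phone
    text = re.sub(r'\b\d{16}\b', '[ID_NUMBER]', text)          # 16-digit ID
    return text

sample = "Hubungi budi@example.com atau 081234567890, KTP 1234567890123456."
print(redact_pii(sample))
# Hubungi [EMAIL] atau [PHONE], KTP [ID_NUMBER].
```

Order matters: the phone pattern runs before the ID pattern, but its `\b(?:\+62|0)` anchor prevents it from matching inside the 16-digit number.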

10.8 Studi Kasus: Customer Support RAG System

10.8.1 Problem Statement

Scenario:

An e-commerce company wants to build a customer support chatbot that can:

  • Answer FAQ from knowledge base (100+ documents)
  • Handle 1000+ queries/day
  • Provide accurate answers dengan source citations
  • Reduce support ticket volume by 40%

10.8.2 System Design

Architecture:

Code
# Knowledge base: FAQs, product manuals, policies
documents = load_documents('./knowledge_base/')

# Processing pipeline
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)
chunks = text_splitter.split_documents(documents)

# Embeddings: Indonesian-capable model, wrapped for LangChain
# (Chroma expects a LangChain Embeddings object, not a raw SentenceTransformer)
from langchain.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name='LazarusNLP/IndoBERT-base-uncased')

# Vector DB: ChromaDB for persistence
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./customer_support_db"
)

# Hybrid retrieval for better accuracy
ensemble_retriever = create_ensemble_retriever(
    vectorstore, chunks, weights=[0.6, 0.4]
)

# LLM: OpenAI GPT-4 (can be swapped for a local model)
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-4", temperature=0)

# QA chain with a custom prompt
from langchain.prompts import PromptTemplate

template = """Anda adalah asisten customer support yang membantu. Gunakan konteks berikut untuk menjawab pertanyaan pelanggan.

PENTING:
- Jika Anda tidak tahu jawabannya, katakan "Maaf, saya tidak memiliki informasi tersebut. Silakan hubungi tim support kami."
- Jangan membuat informasi
- Berikan jawaban yang ramah dan profesional
- Sertakan referensi ke sumber jika relevan

Konteks:
{context}

Pertanyaan: {question}

Jawaban:"""

QA_PROMPT = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=ensemble_retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": QA_PROMPT},
    return_source_documents=True
)

10.8.3 Implementation & Results

Deployment:

Code
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str
    user_id: str

class Response(BaseModel):
    answer: str
    sources: list[str]
    confidence: float

@app.post("/ask", response_model=Response)
async def ask_question(query: Query):
    """API endpoint for customer queries."""
    try:
        # Sanitize input
        question = sanitize_input(query.question)

        # Run RAG
        result = qa_chain({"query": question})

        # Extract sources
        sources = [
            doc.metadata.get('source', 'Unknown')
            for doc in result['source_documents']
        ]

        # Estimate confidence (simple heuristic)
        confidence = estimate_confidence(result)

        return Response(
            answer=result['result'],
            sources=sources,
            confidence=confidence
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

def estimate_confidence(result):
    """Estimate answer confidence from retrieval scores."""
    # Simple heuristic: average similarity of retrieved docs
    # Real implementation would be more sophisticated
    return 0.85  # Placeholder
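The placeholder above can be replaced with a simple heuristic that maps retrieval similarity into [0, 1]; real systems often calibrate against human labels instead. A hypothetical sketch, assuming the retriever exposes per-document similarity scores (the `floor`/`ceiling` values are illustrative):

```python
def estimate_confidence_from_scores(scores, floor=0.3, ceiling=0.9):
    """Map mean similarity of retrieved docs to a confidence in [0, 1].

    Mean scores at or below `floor` give 0; at or above `ceiling`, 1.
    """
    if not scores:
        return 0.0
    mean_score = sum(scores) / len(scores)
    # Linear rescale, clamped to [0, 1]
    confidence = (mean_score - floor) / (ceiling - floor)
    return max(0.0, min(1.0, confidence))

print(estimate_confidence_from_scores([0.82, 0.78, 0.75]))  # strong retrieval
print(estimate_confidence_from_scores([0.35, 0.31, 0.28]))  # weak retrieval
```

Low-confidence answers can then be routed to a human agent instead of being shown to the customer, which is how the "I don't know" fallback below is enforced in practice.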

Results After 3 Months:

| Metric | Before RAG | After RAG | Improvement |
|---|---|---|---|
| Ticket Volume | 5000/month | 2800/month | 44% reduction |
| Response Time | 2 hours | 30 seconds | 99.3% faster |
| Customer Satisfaction | 3.2/5 | 4.5/5 | 41% increase |
| Support Cost | $50K/month | $32K/month | 36% savings |
| Accuracy (human eval) | N/A | 87% | New capability ✅ |

Lessons Learned:

  1. Chunking matters: Experimented with 500, 800, 1000 tokens → 800 optimal
  2. Hybrid search: Improved precision from 0.72 to 0.84
  3. Prompt engineering: Custom prompt reduced hallucinations significantly
  4. Fallback important: “I don’t know” better than wrong answer
  5. Continuous improvement: Weekly analysis of low-confidence answers

10.9 Ringkasan (Summary)

Konsep Kunci:

  1. RAG = Retrieval + Generation:

    • Solve LLM limitations (knowledge cutoff, hallucinations)
    • Combine semantic search with generative AI
    • Enable domain-specific, updatable knowledge
  2. Embeddings:

    • Semantic representations as vectors
    • Cosine similarity for retrieval
    • Models: Sentence Transformers, OpenAI embeddings
  3. Vector Databases:

    • Efficient similarity search at scale
    • Options: FAISS (library), ChromaDB (embedded), Pinecone (cloud)
    • ANN algorithms for the speed vs accuracy trade-off
  4. RAG Pipeline:

    • Document → Chunking → Embedding → Vector DB
    • Query → Embedding → Retrieval → LLM → Response
    • LangChain simplifies implementation
  5. AI Agents:

    • Beyond RAG: tools, memory, reasoning
    • ReAct pattern: Reason → Act → Observe
    • Can plan multi-step tasks autonomously
  6. Evaluation:

    • Retrieval: Precision@K, Recall@K, MRR
    • Generation: Faithfulness, Relevance
    • LLM-as-judge for semantic evaluation
  7. Production:

    • Caching and batching for performance
    • Monitoring metrics (latency, error rate)
    • Security: input sanitization, access control

Aplikasi Praktis:

  • Customer support chatbots
  • Document Q&A systems
  • Code documentation assistants
  • Research literature search
  • Legal/medical knowledge bases

Next Steps:

  • Chapter 11: MLOps & Deployment (production ML systems)
  • Advanced topics: Multi-modal RAG, agentic workflows
  • Fine-tuning embeddings for domain specificity

10.10 Latihan & Tugas (Exercises)

📝 Practice Exercises

Conceptual Questions:

  1. Explain the fundamental difference between a plain LLM and a RAG system. When would you use each?

  2. Why is cosine similarity more popular than Euclidean distance for semantic search? Give an example where the two produce different rankings.

  3. What are the trade-offs between the following chunking strategies:

    • Fixed-size tokens (512 tokens)
    • Paragraph-based
    • Recursive character splitting

    When is each a good fit?

  4. Compare and contrast:

    • A RAG system vs a plain retrieval database
    • An AI agent vs a traditional chatbot
    • A vector database (FAISS) vs a relational database (PostgreSQL)
  5. In RAG evaluation, why do we need multiple metrics (Precision, Recall, Faithfulness, Relevance)? Give an example where one metric is high while another is low.

Practical Assignments:

Task 1: Build a Simple RAG System

  • Input: a collection of 20+ text documents (Wikipedia articles, technical docs, etc.)
  • Tasks:
    1. Implement a chunking strategy of your choice
    2. Generate embeddings with Sentence Transformers
    3. Build a FAISS index
    4. Implement a retrieval function
    5. Test with 5 sample queries
  • Output:
    • Python script with complete code
    • Report showing query results with retrieved chunks
  • Rubric:
    • Chunking implementation (20%)
    • FAISS integration (20%)
    • Retrieval accuracy (30%)
    • Code quality & documentation (15%)
    • Report clarity (15%)

Task 2: Compare Embedding Models

  • Objective: compare different embedding models for Indonesian text
  • Models to test:
    • sentence-transformers/all-MiniLM-L6-v2 (multilingual)
    • indobenchmark/indobert-base-p1 (Indonesian BERT)
    • LazarusNLP/IndoBERT-base-uncased (Indonesian)
  • Tasks:
    1. Create a test set: 10 queries, each with 3 relevant + 7 irrelevant docs
    2. Compute embeddings with each model
    3. Evaluate with Precision@3, Recall@3, MRR
    4. Analyze trade-offs (speed vs accuracy)
  • Deliverable:
    • Comparison table with metrics
    • Visualization (bar charts)
    • Analysis report (1-2 pages)

Task 3: Build a RAG System with LangChain

  • Scenario: academic paper Q&A system
  • Requirements:
    1. Load multiple PDF papers (use PyPDF or similar)
    2. Implement chunking with RecursiveCharacterTextSplitter
    3. Use ChromaDB for persistence
    4. Implement a QA chain with a custom prompt
    5. Add source citations to responses
  • Bonus:
    • Implement hybrid search (vector + BM25)
    • Add conversation memory
    • Build a simple Streamlit/Gradio UI
  • Rubric:
    • Core functionality (40%)
    • Source citations (15%)
    • Code organization (15%)
    • Documentation (15%)
    • Bonus features (15%)

Task 4: Simple AI Agent

  • Objective: build an agent that can use multiple tools
  • Required tools:
    1. Calculator (for math)
    2. Wikipedia search (for facts)
    3. Weather API (for current weather)
  • Tasks:
    1. Define tool wrappers
    2. Create a ReAct agent with LangChain
    3. Test with diverse queries:
      • Pure math: “What’s 234 * 567?”
      • Fact + math: “Population of Jakarta times 3?”
      • Multi-step: “What’s the weather in Paris? Is it above average?”
  • Deliverable:
    • Agent code
    • Trace logs showing reasoning steps
    • Analysis report discussing successes/failures
  • Rubric:
    • Tool implementation (30%)
    • Agent reasoning quality (30%)
    • Test coverage (20%)
    • Analysis depth (20%)

Task 5: RAG Evaluation Pipeline

  • Objective: build an evaluation framework for a RAG system
  • Requirements:
    1. Create a test dataset: 20+ question-answer pairs with ground truth
    2. Implement metrics:
      • Retrieval: Precision@K, Recall@K
      • Generation: Faithfulness (LLM-as-judge)
    3. Run the evaluation on 2 different RAG configurations
    4. Visualize the results
  • Deliverable:
    • Evaluation code
    • Test dataset (JSON/CSV)
    • Results report with recommendations
  • Rubric:
    • Metric implementation (35%)
    • Test dataset quality (20%)
    • Comparative analysis (25%)
    • Recommendations (20%)

Final Project (Capstone):

“Domain-Specific RAG System”

Build a complete RAG system for a domain of your choice:

  • Options:

    • Medical Q&A (from medical literature)
    • Legal assistant (from regulations/case law)
    • Code documentation helper
    • Academic research assistant

Requirements:

  1. Document collection (50+ documents)
  2. Full RAG pipeline (chunking → embedding → retrieval → generation)
  3. Evaluation with a test set (20+ queries)
  4. Simple web interface (Streamlit/Gradio)
  5. Documentation (architecture, usage, evaluation results)

Grading:

  • System functionality (30%)
  • Retrieval quality (20%)
  • Generation quality (20%)
  • User interface (15%)
  • Documentation (15%)

10.11 Referensi & Bacaan Lebih Lanjut

Books:

  • Tunstall et al., 2022. “Natural Language Processing with Transformers”. O’Reilly.
  • Huyen, 2022. “Designing Machine Learning Systems”. O’Reilly. (Chapter on embeddings)



Mapping to Program Learning Outcomes

| CPMK | Sub-CPMK Covered |
|---|---|
| CPMK-3 | Implementation of modern AI systems |
| CPMK-4 | Evaluation of model performance |
| CPMK-5 | Production ML deployment |

Related Labs: Lab 10 - Building RAG System for Indonesian Documents
Related Chapters: Chapter 8 (Transformers), Chapter 11 (MLOps & Deployment)
Estimated Reading Time: 120 minutes
Estimated Practice Time: 10-12 hours


Last Updated: December 6, 2024
Version: 1.0
Author: Tim Pengembang - Politeknik Siber dan Sandi Negara