Use sentence-transformers to embed documents
Build a vector database with FAISS
Perform similarity search and retrieval
Implement a RAG-based Q&A system
Optimize performance with indexing strategies
Evaluate retrieval and ranking quality
Handle edge cases and errors
24.2 Lab Overview
This lab focuses on building a RAG system that integrates document retrieval with language model capabilities.
24.2.1 RAG Concepts
Retrieval-Augmented Generation (RAG) is a technique that combines:
Retrieval: fetching relevant documents from a knowledge base
Augmentation: adding the retrieved documents to the prompt
Generation: using an LLM to generate the answer
graph LR
A[Query] --> B[Embedding]
B --> C[Vector Search]
C --> D[Retrieve Documents]
D --> E[Create Context]
E --> F[Generate Answer]
F --> G[Response]
style A fill:#e6f3ff
style B fill:#ffe6e6
style C fill:#ffffcc
style D fill:#ccffcc
style E fill:#e6ccff
style F fill:#ffcccc
style G fill:#ccffe6
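The three stages above can be sketched end-to-end with toy stand-ins. The `toy_retrieve` function and the identity `generate` callable below are hypothetical placeholders for the FAISS search and LLM call built later in this lab:

```python
def rag_answer(query, kb, retrieve, generate, top_k=2):
    docs = retrieve(query, kb, top_k)                       # 1. Retrieval
    context = "\n".join(docs)                               # 2. Augmentation
    prompt = f"Context:\n{context}\n\nQuestion: {query}"    # 3. Generation input
    return generate(prompt)


def toy_retrieve(query, kb, top_k):
    # Toy retrieval: rank documents by word overlap with the query.
    words = set(query.lower().replace("?", "").split())
    ranked = sorted(kb, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:top_k]


kb = ["machine learning learns from data",
      "paris is the capital of france"]

# Identity "LLM" so the assembled prompt can be inspected directly.
answer = rag_answer("what is machine learning?", kb, toy_retrieve, generate=lambda p: p)
print(answer.splitlines()[1])  # the top-ranked document
```

In the real system the lexical overlap is replaced by dense-vector similarity and the identity function by an actual language model, but the control flow stays the same.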
24.2.2 Lab Structure
graph TD
A[Setup & Installation] --> B[Document Loading]
B --> C[Text Preprocessing]
C --> D[Embedding Creation]
D --> E[FAISS Indexing]
E --> F[Basic Retrieval]
F --> G[QA System]
G --> H[Optimization]
style A fill:#e6f3ff
style B fill:#ffe6e6
style C fill:#ffffcc
style D fill:#ccffcc
style E fill:#e6ccff
style F fill:#ffcccc
style G fill:#ccffe6
style H fill:#ffccff
24.3 Environment Setup
24.3.1 Install Dependencies
import subprocess
import sys

packages = [
    'sentence-transformers',
    'faiss-cpu',  # use 'faiss-gpu' if a GPU is available
    'langchain',
    'pypdf',
    'python-dotenv',
]

for package in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

print("✓ All packages installed successfully!")
24.3.2 Import Libraries
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Text processing
import re
from collections import Counter
from typing import List, Tuple, Dict

# Embeddings & vector search
from sentence_transformers import SentenceTransformer
import faiss

# File handling
from pypdf import PdfReader
import json

# Utilities
from tqdm import tqdm
import time

print("✓ All imports successful!")
# Configuration
CONFIG = {
    'embedding_model': 'sentence-transformers/all-MiniLM-L6-v2',  # fast & accurate
    'chunk_size': 300,             # characters per chunk
    'chunk_overlap': 50,           # overlap between chunks
    'similarity_threshold': 0.5,
    'top_k': 3,                    # retrieve top 3 documents
    'embedding_dim': 384,          # dimension of embeddings
    'seed': 42,
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

np.random.seed(CONFIG['seed'])
25 Part 1: Document Loading & Preprocessing
25.1 Load Sample Documents
def create_sample_documents():
    """Create sample documents for demonstration"""
    documents = [
        {
            'title': 'Machine Learning Basics',
            'content': '''Machine Learning adalah cabang dari Artificial Intelligence yang
memungkinkan komputer belajar dari data tanpa diprogram secara eksplisit.
Ada tiga jenis utama:
1. Supervised Learning: Belajar dari labeled data
2. Unsupervised Learning: Menemukan pattern dalam unlabeled data
3. Reinforcement Learning: Belajar melalui interaksi dan reward
Machine learning telah diaplikasikan dalam berbagai domain seperti
computer vision, natural language processing, dan predictive analytics.'''
        },
        {
            'title': 'Deep Learning & Neural Networks',
            'content': '''Deep Learning menggunakan artificial neural networks dengan
multiple layers untuk ekstraksi features yang kompleks.
Arsitektur populer termasuk:
- Convolutional Neural Networks (CNN): untuk image processing
- Recurrent Neural Networks (RNN): untuk sequence data
- Transformer Networks: untuk NLP tasks
Deep learning telah mencapai state-of-the-art performance dalam berbagai
aplikasi seperti image recognition, machine translation, dan speech recognition.'''
        },
        {
            'title': 'Natural Language Processing',
            'content': '''NLP adalah teknologi yang memungkinkan komputer memahami dan
memproses bahasa manusia.
Task utama NLP meliputi:
- Tokenization: Memecah text menjadi tokens
- Named Entity Recognition: Mengidentifikasi entities seperti nama, lokasi
- Sentiment Analysis: Menentukan sentimen dalam text
- Machine Translation: Menerjemahkan antar bahasa
Modern NLP menggunakan transformer-based models seperti BERT dan GPT
yang sangat powerful.'''
        },
        {
            'title': 'Computer Vision Applications',
            'content': '''Computer Vision adalah cabang AI yang fokus pada pembuatan
komputer yang bisa memahami dan menganalisa visual information dari dunia sekitar.
Aplikasi utama:
- Image Classification: Mengkategorikan images
- Object Detection: Mendeteksi dan melokalisasi objects dalam image
- Semantic Segmentation: Segmentasi pixel-level dari images
- Face Recognition: Identifikasi dan verifikasi faces
CNN architecture seperti ResNet dan YOLOv5 mencapai akurasi tinggi dalam tasks ini.'''
        },
        {
            'title': 'Reinforcement Learning',
            'content': '''Reinforcement Learning adalah paradigma machine learning dimana
agent belajar dengan berinteraksi dengan environment dan menerima rewards atau penalties.
Konsep kunci:
- Agent: Entity yang melakukan actions
- Environment: Sistem yang merespon actions
- Reward: Signal yang menunjukkan kualitas action
- Policy: Strategi agent untuk memilih actions
RL digunakan dalam game AI, robotics, dan autonomous systems.'''
        },
    ]
    return documents


# Create documents
documents = create_sample_documents()
print(f"✓ Created {len(documents)} sample documents")
print("\nDocuments:")
for i, doc in enumerate(documents, 1):
    print(f"  {i}. {doc['title']}")
25.2 Text Preprocessing
def preprocess_text(text: str) -> str:
    """Preprocess text: lowercase, remove extra spaces"""
    # Lowercase
    text = text.lower()
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove special characters (keep basic punctuation)
    text = re.sub(r'[^\w\s.,!?-]', '', text)
    return text


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks"""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk.strip())
        start = end - overlap  # step back to create the overlap
    return chunks


# Create chunks from documents
document_chunks = []
chunk_metadata = []

for doc in documents:
    content = doc['content']
    preprocessed = preprocess_text(content)
    chunks = chunk_text(preprocessed, CONFIG['chunk_size'], CONFIG['chunk_overlap'])
    for chunk_idx, chunk in enumerate(chunks):
        document_chunks.append(chunk)
        chunk_metadata.append({
            'document_title': doc['title'],
            'chunk_index': chunk_idx,
            'chunk_text': chunk,
            'original_length': len(doc['content']),
        })

print(f"✓ Created {len(document_chunks)} chunks from documents")
print(f"  Average chunk length: {np.mean([len(c) for c in document_chunks]):.0f} chars")
print("\nFirst chunk example:")
print(f"  {document_chunks[0][:200]}...")
26 Part 2: Embedding Creation & FAISS Indexing
26.1 Load Embedding Model
def load_embedding_model(model_name: str):
    """Load a pre-trained embedding model"""
    print(f"Loading embedding model: {model_name}")
    model = SentenceTransformer(model_name)
    print("✓ Model loaded!")
    print(f"  Model dimension: {model.get_sentence_embedding_dimension()}")
    return model


# Load model
embedding_model = load_embedding_model(CONFIG['embedding_model'])
26.2 Create Embeddings
def create_embeddings(texts: List[str], model, batch_size: int = 32):
    """Create embeddings for a list of texts"""
    print(f"Creating embeddings for {len(texts)} texts...")
    embeddings = model.encode(texts, show_progress_bar=True, batch_size=batch_size)
    print("✓ Embeddings created!")
    print(f"  Shape: {embeddings.shape}")
    print(f"  Data type: {embeddings.dtype}")
    return embeddings


# Create embeddings
embeddings = create_embeddings(document_chunks, embedding_model)

# Save embeddings (`dirs` holds the project directory paths set up earlier)
np.save(dirs['embeddings'] / 'document_embeddings.npy', embeddings)
print(f"✓ Embeddings saved to {dirs['embeddings']}/document_embeddings.npy")
26.3 Build FAISS Index
def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """Build FAISS index using L2 distance"""
    print("Building FAISS index...")
    # FAISS expects float32, C-contiguous arrays
    embeddings = np.ascontiguousarray(embeddings.astype('float32'))
    # Create a flat L2 (Euclidean distance) index
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    # Add vectors to the index
    index.add(embeddings)
    print("✓ FAISS index created!")
    print(f"  Total vectors: {index.ntotal}")
    print(f"  Vector dimension: {dimension}")
    return index


# Build index
faiss_index = build_faiss_index(embeddings)

# Save index
index_path = dirs['indices'] / 'documents.index'
faiss.write_index(faiss_index, str(index_path))
print(f"✓ Index saved to {index_path}")
26.4 Verify Index
def verify_index(index: faiss.Index, embeddings: np.ndarray):
    """Verify index consistency"""
    print("Verifying index...")
    # Test with the first embedding
    test_embedding = embeddings[:1].astype('float32')
    distances, indices = index.search(test_embedding, 5)
    print("✓ Index verification successful!")
    print("  Nearest neighbors to first embedding:")
    for i, (dist, idx) in enumerate(zip(distances[0], indices[0]), 1):
        print(f"    {i}. Index {idx}, Distance: {dist:.4f}")


verify_index(faiss_index, embeddings)
27 Part 3: Document Retrieval System
27.1 Basic Retrieval
def retrieve_similar_documents(query: str, embedding_model,
                               faiss_index: faiss.Index,
                               document_chunks: List[str],
                               k: int = 3) -> List[Dict]:
    """Retrieve similar documents for a query"""
    # Embed query
    query_embedding = embedding_model.encode([query], convert_to_numpy=True).astype('float32')
    # Search in FAISS
    distances, indices = faiss_index.search(query_embedding, k)
    # Prepare results
    results = []
    for rank, (distance, idx) in enumerate(zip(distances[0], indices[0]), 1):
        metadata = chunk_metadata[idx]
        results.append({
            'rank': rank,
            'distance': float(distance),
            'similarity_score': 1 / (1 + distance),  # convert distance to a similarity score
            'document_title': metadata['document_title'],
            'chunk_text': document_chunks[idx],
            'chunk_index': idx,
        })
    return results


# Test retrieval
queries = [
    "Apa itu machine learning?",
    "Bagaimana deep learning bekerja?",
    "Sebutkan aplikasi NLP",
]

print("=" * 80)
print("RETRIEVAL RESULTS")
print("=" * 80)

for query in queries:
    print(f"\nQuery: {query}")
    results = retrieve_similar_documents(query, embedding_model, faiss_index,
                                         document_chunks, k=2)
    for result in results:
        print(f"\n  [{result['rank']}] {result['document_title']} "
              f"(Similarity: {result['similarity_score']:.3f})")
        print(f"      {result['chunk_text'][:150]}...")
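The `1 / (1 + distance)` conversion used here is an ad-hoc monotone mapping, not a true cosine similarity. If genuine cosine scores are wanted, one option is to normalize the embeddings (sentence-transformers supports `normalize_embeddings=True` in `encode`); for unit vectors, squared L2 distance and cosine similarity are directly related, as this small check illustrates:

```python
import numpy as np

# For unit-normalized vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b),
# so L2 distance and cosine similarity produce the same ranking.
rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 384)).astype('float32')
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

cos_sim = float(a @ b)
sq_l2 = float(np.sum((a - b) ** 2))
print(round(sq_l2, 4) == round(2 - 2 * cos_sim, 4))  # True
```

This is why either an `IndexFlatL2` over normalized vectors or a FAISS inner-product index (`IndexFlatIP`) can be used when cosine ranking is the goal.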
27.2 Evaluate Retrieval Quality
def evaluate_retrieval(results: List[Dict]) -> Dict:
    """Evaluate retrieval quality"""
    if not results:
        return {}
    avg_similarity = np.mean([r['similarity_score'] for r in results])
    max_similarity = max([r['similarity_score'] for r in results])
    min_similarity = min([r['similarity_score'] for r in results])
    return {
        'num_results': len(results),
        'avg_similarity': avg_similarity,
        'max_similarity': max_similarity,
        'min_similarity': min_similarity,
        'top_document': results[0]['document_title'],
    }


# Evaluate queries
print("\n" + "=" * 80)
print("RETRIEVAL EVALUATION")
print("=" * 80)

for query in queries:
    results = retrieve_similar_documents(query, embedding_model, faiss_index,
                                         document_chunks, k=3)
    evaluation = evaluate_retrieval(results)
    print(f"\nQuery: {query}")
    print(f"  Top Document: {evaluation['top_document']}")
    print(f"  Avg Similarity: {evaluation['avg_similarity']:.3f}")
    print(f"  Similarity Range: [{evaluation['min_similarity']:.3f}, {evaluation['max_similarity']:.3f}]")
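Similarity statistics alone say nothing about whether the right document came back. A complementary metric, assuming a small set of hand-labeled query-to-document pairs (the labels and results below are illustrative, not real lab output), is hit rate @ k:

```python
# Hit rate @ k: the fraction of queries whose expected document
# appears anywhere in the top-k results.
def hit_rate_at_k(results_per_query, expected_titles):
    hits = sum(
        any(r['document_title'] == expected for r in results)
        for results, expected in zip(results_per_query, expected_titles)
    )
    return hits / len(expected_titles)


# Illustrative labels: one hit out of two queries.
fake_results = [
    [{'document_title': 'Machine Learning Basics'}],
    [{'document_title': 'Computer Vision Applications'}],
]
expected = ['Machine Learning Basics', 'Natural Language Processing']
print(hit_rate_at_k(fake_results, expected))  # 0.5
```

The result dictionaries produced by `retrieve_similar_documents` already carry `document_title`, so they can be fed to this function directly once expected labels exist.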
28 Part 4: QA System Implementation
28.1 Simple QA System
class SimpleRAGSystem:
    """Simple RAG system for Q&A"""

    def __init__(self, embedding_model, faiss_index, document_chunks, chunk_metadata):
        self.embedding_model = embedding_model
        self.faiss_index = faiss_index
        self.document_chunks = document_chunks
        self.chunk_metadata = chunk_metadata

    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """Retrieve relevant documents"""
        # Embed query
        query_embedding = self.embedding_model.encode([query], convert_to_numpy=True).astype('float32')
        # Search
        distances, indices = self.faiss_index.search(query_embedding, top_k)
        # Format results
        results = []
        for rank, (distance, idx) in enumerate(zip(distances[0], indices[0]), 1):
            similarity = 1 / (1 + distance)
            results.append({
                'rank': rank,
                'similarity': similarity,
                'text': self.document_chunks[idx],
                'metadata': self.chunk_metadata[idx],
            })
        return results

    def build_context(self, results: List[Dict]) -> str:
        """Build context from retrieved documents"""
        context = "Dokumen Relevan:\n"
        for result in results:
            context += f"\n[Dokumen {result['rank']}] {result['metadata']['document_title']}:\n"
            context += f"{result['text']}\n"
        return context

    def generate_answer(self, query: str, top_k: int = 3) -> Dict:
        """Generate an answer from the retrieved context"""
        # Retrieve
        results = self.retrieve(query, top_k)
        # Build context
        context = self.build_context(results)
        # Create prompt (a real system would pass this to an LLM)
        prompt = f"""Berdasarkan dokumen berikut, jawab pertanyaan dengan jelas dan singkat.

{context}

Pertanyaan: {query}
Jawaban:"""
        return {
            'query': query,
            'retrieved_documents': len(results),
            'context': context,
            'prompt': prompt,
            'retrieval_results': results,
        }


# Initialize QA system
qa_system = SimpleRAGSystem(embedding_model, faiss_index, document_chunks, chunk_metadata)

# Test QA
test_questions = [
    "Apa perbedaan antara supervised dan unsupervised learning?",
    "Jelaskan arsitektur CNN dan gunanya",
    "Bagaimana reinforcement learning berbeda dari supervised learning?",
]

print("=" * 80)
print("QA SYSTEM TEST")
print("=" * 80)

for question in test_questions:
    result = qa_system.generate_answer(question, top_k=2)
    print(f"\nQuestion: {question}")
    print(f"Retrieved Documents: {result['retrieved_documents']}")
    print("\nContext Preview:")
    print(result['context'][:300] + "...")
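The lab objectives also mention handling edge cases and errors; a minimal defensive wrapper, assuming a `retrieve_fn` callable standing in for the retrieval functions above, might look like this:

```python
# Defensive wrapper around a retrieval callable: reject empty queries
# and clamp k to the index size (retrieve_fn is a hypothetical stand-in).
def safe_retrieve(query, index_size, retrieve_fn, k=3):
    if not isinstance(query, str) or not query.strip():
        raise ValueError("Query must be a non-empty string")
    if index_size == 0:
        return []                  # nothing indexed yet
    k = min(k, index_size)         # FAISS pads missing hits with index -1 otherwise
    return retrieve_fn(query, k)


print(len(safe_retrieve("ok", 2, lambda q, k: list(range(k)), k=5)))  # 2
try:
    safe_retrieve("   ", 10, lambda q, k: [])
except ValueError as err:
    print(err)  # Query must be a non-empty string
```

Clamping `k` matters because a flat FAISS index asked for more neighbors than it holds fills the remaining slots with index `-1`, which would crash the metadata lookups in `retrieve`.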