pageindex-rag
PageIndex RAG Architecture
PageIndex replaces vector-based similarity search with LLM-driven hierarchical navigation. It reports 98.7% accuracy on financial document benchmarks by reasoning through document structure instead of matching embeddings.
Core Innovation: Why Vector RAG Fails
Query-Knowledge Mismatch: Vector similarity measures surface semantics, not task relevance. A query such as "What are debt trends?" retrieves passages that mention "trends", not the passages that actually analyze how debt changed.
Hard Chunking: Fixed-size chunks of 512-1000 tokens cut text mid-sentence and break contextual continuity. A financial statement split across chunks loses the relationship between its assets and liabilities.
Context Window Deterioration: Retrieving 10-20 chunks creates a needle-in-a-haystack problem in which the relevant information is buried among irrelevant passages.
Cross-Reference Blindness: Vector retrieval cannot follow references such as "see Appendix G" or "Section 3.2" without manual preprocessing.
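The hard-chunking failure is easy to reproduce. The sketch below is a minimal illustration, not how any particular vector pipeline is implemented. It assumes whitespace-separated words as a stand-in for tokens and uses a tiny chunk size so the effect is visible. The helper names `fixed_size_chunks` and `ends_mid_sentence` and the sample statement are invented for this example.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of `chunk_size` words, ignoring sentence boundaries.

    Assumption: whitespace-delimited words approximate tokens.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def ends_mid_sentence(chunk: str) -> bool:
    """Return True if the chunk does not end on sentence-final punctuation."""
    return not chunk.rstrip().endswith((".", "!", "?"))


# Made-up sample text: the three sentences only make sense when read together.
statement = (
    "Total assets were 120 million at year end. "
    "Total liabilities were 80 million at year end. "
    "Equity therefore stood at 40 million."
)

chunks = fixed_size_chunks(statement, chunk_size=6)
broken = [c for c in chunks if ends_mid_sentence(c)]

for chunk in chunks:
    marker = "cut mid-sentence" if ends_mid_sentence(chunk) else "complete"
    print(f"[{marker}] {chunk}")
```

In this toy input, the asset and liability figures land in different chunks, so a retriever that returns only one of them cannot support a question about their relationship. Real pipelines use larger chunks and often add overlap, which reduces the problem but does not remove it.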
PageIndex Solution
Replace vector databases with hierarchical tree indices stored as JSON: