Doc-to-Vector Dataset Generator
Transform documents into high-quality vector search datasets.
Pipeline Steps
- Extract text from various formats (PDF, DOCX, HTML)
- Clean text (remove noise, normalize)
- Chunk strategically (semantic boundaries)
- Add metadata (source, timestamps, classification)
- Deduplicate (near-duplicate detection)
- Quality check (length, content validation)
- Export JSONL (one chunk per line)
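The steps after extraction can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual implementation. All function names (`clean_text`, `chunk_paragraphs`, `is_near_duplicate`, `build_records`, `to_jsonl`), the record fields, and the thresholds are assumptions made for this example. Paragraph packing stands in for semantic chunking, and Jaccard similarity over word shingles stands in for near-duplicate detection.

```python
import hashlib
import json
import re
from datetime import datetime, timezone


def clean_text(text: str) -> str:
    """Normalize line endings and whitespace, keeping paragraph breaks."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces/tabs
    text = re.sub(r"\n\s*\n+", "\n\n", text)   # one blank line between paragraphs
    return text.strip()


def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para  # an oversized paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks


def _shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Compare two chunks by Jaccard similarity of their shingle sets."""
    sa, sb = _shingles(a), _shingles(b)
    return len(sa & sb) / len(sa | sb) >= threshold


def build_records(raw_text: str, source: str,
                  min_chars: int = 20, max_chars: int = 500) -> list[dict]:
    """Clean, chunk, filter, deduplicate, and attach metadata."""
    created_at = datetime.now(timezone.utc).isoformat()
    records, kept = [], []
    for chunk in chunk_paragraphs(clean_text(raw_text), max_chars):
        if len(chunk) < min_chars:
            continue  # quality check: drop very short chunks
        if any(is_near_duplicate(chunk, seen) for seen in kept):
            continue  # pairwise comparison; quadratic, fine for small inputs
        kept.append(chunk)
        records.append({
            "id": hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16],
            "text": chunk,
            "source": source,
            "created_at": created_at,
        })
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `to_jsonl(build_records(text, source="report.pdf"))` would yield one JSON object per surviving chunk. The pairwise duplicate check is quadratic in the number of chunks, so a larger corpus would likely need an approximate method such as MinHash instead.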
Text Extraction
# PDF extraction
import pymupdf