doc-to-vector-dataset-generator

Doc-to-Vector Dataset Generator

Transform documents into high-quality vector search datasets.

Pipeline Steps

  1. Extract text from various formats (PDF, DOCX, HTML)
  2. Clean text (remove noise, normalize)
  3. Chunk strategically (semantic boundaries)
  4. Add metadata (source, timestamps, classification)
  5. Deduplicate (near-duplicate detection)
  6. Quality check (length, content validation)
  7. Export JSONL (one chunk per line)
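
Steps 2 through 7 above can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual implementation: every function name, the record fields (`id`, `text`, `source`, `created_at`), the size limits, and the 0.8 similarity threshold are assumptions chosen for the example. Near-duplicate detection is shown here as word-shingle Jaccard similarity, a simple stand-in that compares each chunk against every kept chunk and so only suits small datasets. The classification metadata from step 4 is left out because the source does not say how it is assigned.

```python
import hashlib
import json
import re
from datetime import datetime, timezone


def clean_text(text: str) -> str:
    """Step 2: drop control characters and normalize whitespace."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Step 3: pack whole paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def shingles(text: str, k: int = 3) -> set[str]:
    """Lowercased word k-grams used to compare chunks."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Step 5: keep a chunk only if it is not too similar to a kept one."""
    kept, kept_sets = [], []
    for chunk in chunks:
        s = shingles(chunk)
        if all(jaccard(s, other) < threshold for other in kept_sets):
            kept.append(chunk)
            kept_sets.append(s)
    return kept


def passes_quality(chunk: str, min_chars: int = 20) -> bool:
    """Step 6: require a minimum length and at least one letter."""
    return len(chunk) >= min_chars and any(c.isalpha() for c in chunk)


def build_records(text: str, source: str) -> list[dict]:
    """Steps 2-6 combined, attaching source and timestamp metadata (step 4)."""
    now = datetime.now(timezone.utc).isoformat()
    records = []
    for chunk in deduplicate(chunk_paragraphs(clean_text(text))):
        if passes_quality(chunk):
            chunk_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
            records.append(
                {"id": chunk_id, "text": chunk, "source": source, "created_at": now}
            )
    return records


def to_jsonl(records: list[dict]) -> str:
    """Step 7: serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `to_jsonl(build_records(raw_text, "report.pdf"))` would yield a string ready to write to a `.jsonl` file. Newlines inside a chunk are escaped by `json.dumps`, so each record still occupies exactly one line.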

Text Extraction

# PDF extraction
import pymupdf
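
The snippet above is cut off after the import. One plausible continuation, offered as a sketch and not as the author's original code, opens the file with PyMuPDF and concatenates the text of each page. The function names `extract_pdf_text` and `join_pages` are hypothetical. `pymupdf.open` and `Page.get_text` are part of PyMuPDF's documented API.

```python
def join_pages(pages: list[str]) -> str:
    """Join per-page text, skipping pages that are empty after stripping."""
    return "\n\n".join(p.strip() for p in pages if p.strip())


def extract_pdf_text(path: str) -> str:
    """Return the plain text of every page in the PDF at `path`."""
    # Imported lazily so join_pages stays usable without PyMuPDF installed.
    import pymupdf

    with pymupdf.open(path) as doc:
        return join_pages([page.get_text() for page in doc])
```

Separating pages with a blank line keeps page boundaries visible to a later paragraph-based chunking step. Scanned PDFs without a text layer would return empty strings here and need OCR, which this sketch does not cover.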

First seen: Jan 24, 2026