protocol-entity-extraction
Protocol Entity Extraction
Extract every key entity the SAP requires from a Protocol PDF/DOCX and output it as structured JSON.
Quick Start
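from utils.document_parser import get_document_parser

# Run inside an async function; protocol_path is the path to the Protocol PDF/DOCX.
parser = get_document_parser()
synopsis = await parser.parse(protocol_path, page_range=[1, 15])  # Synopsis pages
stats = await parser.parse(protocol_path, page_range=[90, 115])  # Statistics pages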
Parsing Strategy
Targeted multi-section parsing: Synopsis (pp. 1-15, table-heavy) -> Design (pp. 30-50) -> Statistics (pp. 90-115, plain text) -> SAP Appendix (p. 240 onward)
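The section plan above can be written down as data and resolved against the actual page count of a document. This is a minimal sketch, not part of the skill itself: `Section`, `SECTION_PLAN`, and `page_ranges` are hypothetical names, and only the page numbers come from the strategy line above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Section:
    """One targeted slice of the protocol (hypothetical helper type)."""
    name: str
    first_page: int
    last_page: Optional[int]  # None means "through the end of the document"
    note: str = ""


# Page numbers taken from the strategy line; section keys are illustrative.
SECTION_PLAN = [
    Section("synopsis", 1, 15, "table-heavy"),
    Section("design", 30, 50),
    Section("statistics", 90, 115, "plain text"),
    Section("sap_appendix", 240, None),
]


def page_ranges(plan: List[Section], total_pages: int) -> Dict[str, List[int]]:
    """Resolve each section to a concrete [first, last] pair.

    Sections that start beyond the end of the document are skipped, and
    end pages are clamped to the document length.
    """
    ranges: Dict[str, List[int]] = {}
    for section in plan:
        if section.first_page > total_pages:
            continue
        if section.last_page is None:
            last = total_pages
        else:
            last = min(section.last_page, total_pages)
        ranges[section.name] = [section.first_page, last]
    return ranges
```

Each resulting pair has the same shape as the `page_range` argument used in Quick Start, so a caller could loop over the dictionary and parse one section at a time.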
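Parsing Engines (four-tier fallback)
DocumentParser automatically selects the best available engine:
- Unstructured API (cloud): requires UNSTRUCTURED_API_KEY; strongest at tables and OCR
- Docling (local AI): open source from IBM; uses vision and language models for table recognition; no API key needed; the recommended first choice for local parsing
- pdfplumber (local): text plus simple tables
- PyPDF2 (local): plain-text-only last resort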