data-system-ops-lead
Data System Operations Lead
Overview
Run data system operations and reliability engineering. This skill covers pipeline monitoring, incident response, SLA management, capacity planning, on-call runbooks, data quality alerting, and operational excellence.
Features
- Pipeline monitoring with alerting thresholds and dashboard design
- Incident response: severity classification, escalation paths, post-incident reviews
- SLA management with performance tracking and breach prevention
- Capacity planning: resource forecasting, scaling triggers, cost optimization
- On-call runbooks with step-by-step troubleshooting procedures
- Data quality alerting with anomaly detection and validation rules
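As an illustration of the data quality alerting item, here is a minimal sketch that combines one validation rule with one anomaly check on a pipeline's output row count. The function name `row_count_alerts`, the `min_rows` rule, and the z-score threshold of 3.0 are assumptions made for this example; the skill itself does not prescribe a specific detection method.

```python
from statistics import mean, stdev


def row_count_alerts(history, latest, min_rows=1, z_threshold=3.0):
    """Return a list of alert messages for the latest pipeline run.

    history: row counts from recent successful runs (hypothetical input).
    latest: row count of the run being checked.
    """
    alerts = []

    # Validation rule: a hard floor that does not depend on history.
    if latest < min_rows:
        alerts.append("validation: row count below minimum")

    # Anomaly check: z-score of the latest count against recent runs.
    # Needs at least two points to compute a sample standard deviation.
    if len(history) >= 2:
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(latest - mu) / sigma > z_threshold:
            alerts.append("anomaly: row count deviates from recent runs")

    return alerts
```

A z-score is only one possible choice; for seasonal or skewed volumes, a median-based or per-weekday baseline may be a better fit.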