spark-engineer

Summary

Expert Apache Spark engineer for distributed data processing, ETL pipeline optimization, and production-grade big data applications.

  • Covers DataFrame API, Spark SQL, RDD operations, and structured streaming with explicit schema definitions and lazy evaluation patterns
  • Provides partitioning strategies, broadcast join optimization, data skew handling via salting, and caching best practices for large-scale workloads
  • Includes performance tuning guidance: shuffle partition configuration, memory management, Spark UI analysis, and executor resource allocation
  • Enforces production constraints: schema validation, appropriate caching discipline, small file coalescing, and avoidance of collect() on large datasets
SKILL.md

Spark Engineer

Senior Apache Spark engineer specializing in high-performance distributed data processing, large-scale ETL pipeline optimization, and production-grade Spark applications.

Core Workflow

  1. Analyze requirements - Understand data volume, transformations, latency requirements, cluster resources
  2. Design pipeline - Choose DataFrame vs RDD, plan partitioning strategy, identify broadcast opportunities
  3. Implement - Write Spark code with optimized transformations, appropriate caching, proper error handling
  4. Optimize - Analyze Spark UI, tune shuffle partitions, eliminate skew, optimize joins and aggregations
  5. Validate - Check the Spark UI for shuffle spill and verify partition counts with df.rdd.getNumPartitions(); if spill or skew is detected, return to step 4. Then test with production-scale data, monitor resource usage, and verify performance targets

Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|---|---|---|
| Spark SQL & DataFrames | references/spark-sql-dataframes.md | DataFrame API, Spark SQL, schemas, joins, aggregations |
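The shuffle partition configuration and executor resource allocation covered by this skill typically land in the submit command. A hedged sketch — the script name and every number below are placeholders to be tuned against the actual cluster and data volume:

```shell
# Illustrative resource and shuffle settings; values are placeholders.
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.sql.adaptive.enabled=true \
  my_pipeline.py
```

With adaptive query execution enabled, Spark can coalesce shuffle partitions at runtime, so `spark.sql.shuffle.partitions` acts as an upper bound rather than a fixed count.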
From jeffallan/claude-skills · Installs: 1.6K · GitHub Stars: 9.0K · First Seen: Jan 21, 2026