spark-optimization
Apache Spark job optimization through partitioning, memory tuning, shuffle reduction, and join strategies.
- Covers partitioning strategies, broadcast joins, bucketed joins, and skew handling with salting techniques to minimize shuffle overhead
- Includes caching and persistence patterns with storage level selection, checkpointing for complex lineages, and memory configuration breakdown
- Provides data format optimization for Parquet and Delta Lake, column pruning, predicate pushdown, and Z-ordering for multi-dimensional queries
- Features monitoring and debugging patterns including query plan analysis, stage metrics tracking, and partition skew detection
- Adaptive Query Execution (AQE) configuration and production-ready settings for executor memory, parallelism, serialization, and compression
Apache Spark Optimization
Production patterns for optimizing Apache Spark jobs including partitioning strategies, memory management, shuffle optimization, and performance tuning.
When to Use This Skill
- Optimizing slow Spark jobs
- Tuning memory and executor configuration
- Implementing efficient partitioning strategies
- Debugging Spark performance issues
- Scaling Spark pipelines for large datasets
- Reducing shuffle and data skew
Core Concepts
1. Spark Execution Model
Driver Program
More from wshobson/agents
tailwind-design-system
Build scalable design systems with Tailwind CSS v4, design tokens, component libraries, and responsive patterns. Use when creating component libraries, implementing design systems, or standardizing UI patterns.
41.0Ktypescript-advanced-types
Master TypeScript's advanced type system including generics, conditional types, mapped types, template literals, and utility types for building type-safe applications. Use when implementing complex type logic, creating reusable type utilities, or ensuring compile-time type safety in TypeScript projects.
40.4Knodejs-backend-patterns
Build production-ready Node.js backend services with Express/Fastify, implementing middleware patterns, error handling, authentication, database integration, and API design best practices. Use when creating Node.js servers, REST APIs, GraphQL backends, or microservices architectures.
31.8Kpython-performance-optimization
Profile and optimize Python code using cProfile, memory profilers, and performance best practices. Use when debugging slow Python code, optimizing bottlenecks, or improving application performance.
22.1Kapi-design-principles
Master REST and GraphQL API design principles to build intuitive, scalable, and maintainable APIs that delight developers. Use when designing new APIs, reviewing API specifications, or establishing API design standards.
20.3Kpython-testing-patterns
Implement comprehensive testing strategies with pytest, fixtures, mocking, and test-driven development. Use when writing Python tests, setting up test suites, or implementing testing best practices.
19.7K