XPU Triton Kernels for Intel GPUs

This skill provides patterns and guidance for developing optimized Triton kernels for Intel XPU devices (Battlemage/Arc Pro B50). It integrates the Xe-Forge optimization framework, an LLM-driven loop that transforms PyTorch code into fast Triton kernels.

Quick Start

Optimize a Kernel (Xe-Forge Workflow)

The full optimization workflow analyzes a PyTorch baseline, generates Triton kernel variants in a branching trial tree, benchmarks each on XPU hardware, and finalizes the best result.
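
To make the "branching trial tree" idea concrete, here is a minimal sketch of how trials might be recorded and the best one selected. The `Trial` class, its fields, and `best_trial` are hypothetical names chosen for illustration; they are not taken from `scripts/trial_manager.py`, and the timings are placeholder numbers rather than measurements.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Trial:
    """One kernel variant in the trial tree (hypothetical structure)."""
    name: str
    time_ms: Optional[float] = None  # None until the variant has been benchmarked
    correct: bool = False            # True once outputs match the PyTorch baseline
    children: List["Trial"] = field(default_factory=list)

    def branch(self, name: str) -> "Trial":
        """Create a child variant derived from this one."""
        child = Trial(name)
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Trial"]:
        """Yield this trial and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def best_trial(root: Trial) -> Optional[Trial]:
    """Return the fastest trial that is both benchmarked and correct."""
    candidates = [t for t in root.walk() if t.correct and t.time_ms is not None]
    return min(candidates, key=lambda t: t.time_ms, default=None)


# Placeholder timings for illustration only.
root = Trial("baseline", time_ms=1.80, correct=True)
tiled = root.branch("tiled_gemm")
tiled.time_ms, tiled.correct = 1.10, True
fused = tiled.branch("fused_epilogue")
fused.time_ms, fused.correct = 0.70, True
broken = root.branch("aggressive_unroll")
broken.time_ms, broken.correct = 0.50, False  # fastest, but fails the correctness check

print(best_trial(root).name)  # fused_epilogue
```

Filtering on correctness before comparing timings keeps a fast but wrong variant from being finalized, which is the main reason to track both fields on every node.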

# 1. Analyze the baseline
python scripts/analyze_kernel.py test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 2. Initialize trial tracking
python scripts/trial_manager.py init 70_Gemm_Sigmoid test_kernels/70_Gemm_Sigmoid_Scaling_ResidualAdd_pytorch.py

# 3. Validate a generated kernel (no GPU needed)
python scripts/validate_triton.py my_kernel.py
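
As an illustration of what a GPU-free check could cover, the sketch below parses a kernel file's source with Python's standard `ast` module and lists the functions decorated with `@triton.jit`, without importing Triton or touching a device. This is an assumption about the kind of static check that is possible, not a description of what `scripts/validate_triton.py` actually does; `find_triton_kernels` and `SAMPLE` are hypothetical names.

```python
import ast
from typing import List


def find_triton_kernels(source: str) -> List[str]:
    """Return names of functions decorated with @triton.jit.

    Works on source text only, so neither Triton nor a GPU is required.
    Raises SyntaxError if the source does not parse.
    """
    tree = ast.parse(source)
    kernels = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        for dec in node.decorator_list:
            # Handle both @triton.jit and call-style decorators such as @triton.jit(...)
            target = dec.func if isinstance(dec, ast.Call) else dec
            if ast.unparse(target) in ("triton.jit", "jit"):
                kernels.append(node.name)
                break
    return kernels


# A small kernel used only as input text; it is parsed, never executed.
SAMPLE = '''
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n, scale, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * scale, mask=mask)

def launch(x, out, scale):
    pass
'''

print(find_triton_kernels(SAMPLE))  # ['scale_kernel']
```

A check like this catches syntax errors and missing kernel decorators early, but it cannot confirm numerical correctness or performance; those still require a run on XPU hardware.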
