perf-torch-sync-free


Writing Sync-Free PyTorch Code

Sync-free code keeps the CPU queuing work for the GPU without waiting for earlier GPU operations to complete. With host-device synchronizations eliminated, the GPU always has work queued and does not sit idle waiting on the host.
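As an illustration of the idea, here is a minimal sketch that contrasts a loop that synchronizes on every iteration with one that defers the read-back. The function names `running_loss_with_syncs` and `running_loss_sync_free` are hypothetical and chosen for this example only; the code falls back to the CPU when no GPU is present, in which case no synchronization occurs at all.

```python
import torch

# Use the GPU when one is present; the sketch still runs on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def running_loss_with_syncs(losses):
    """Accumulate on the host: .item() copies each value to the CPU,
    which on CUDA blocks the host thread once per iteration."""
    total = 0.0
    for loss in losses:
        total += loss.item()  # host-device synchronization on CUDA
    return total


def running_loss_sync_free(losses):
    """Accumulate on the device: each add is only queued, so the host
    keeps issuing work. The result stays a device tensor."""
    total = torch.zeros((), device=losses[0].device)
    for loss in losses:
        total += loss  # no read-back, no host wait
    return total


# Hypothetical per-step losses, used only for illustration.
losses = [torch.tensor(float(i), device=device) for i in range(1, 5)]

synced = running_loss_with_syncs(losses)
deferred = running_loss_sync_free(losses)
```

Reading `deferred` on the host (for logging, say) still costs one synchronization, but it is paid once after the loop rather than on every step.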

Every host-device synchronization ultimately calls one of three CUDA driver APIs that block the CPU thread:

  • cuEventSynchronize -- CPU waits until a specific GPU event completes
  • cuStreamSynchronize -- CPU waits until all work on a stream finishes
  • cuCtxSynchronize -- CPU waits until all work across all streams in the current context finishes
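The three levels of blocking listed above have explicit counterparts in PyTorch's public API, sketched below. The pairing of each PyTorch call with a specific driver function is an assumption made for illustration: PyTorch goes through the CUDA runtime, and the exact driver call may differ by version. The helper name `demo_blocking_calls` is hypothetical.

```python
import torch


def demo_blocking_calls():
    """Issue one explicit wait at each granularity.
    Returns None when no CUDA device is available."""
    if not torch.cuda.is_available():
        return None

    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x  # queued asynchronously on the current stream

    done = torch.cuda.Event()
    done.record()  # marks this point in the current stream

    # Event level: wait only until the recorded point is reached
    # (assumed to correspond to cuEventSynchronize).
    done.synchronize()

    # Stream level: wait for everything queued on this stream
    # (assumed to correspond to cuStreamSynchronize).
    torch.cuda.current_stream().synchronize()

    # Device level: wait for all streams on the current device
    # (assumed to correspond to cuCtxSynchronize).
    torch.cuda.synchronize()

    return done.query()  # True once the event has completed


result = demo_blocking_calls()
```

Implicit synchronizations such as `tensor.item()` or `tensor.cpu()` are harder to spot than these explicit calls because nothing in their names indicates that the host will block.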

When to Use

Reach for this skill when you encounter:

First Seen
May 8, 2026
perf-torch-sync-free — nvidia/tensorrt-llm