CUDA Guide

Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC

Core Principles

  1. Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
  2. Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
  3. Coalesced Access: Adjacent threads must access adjacent memory addresses; a strided or misaligned access pattern can cut effective bandwidth by up to 32x
  4. Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
  5. Minimize Host-Device Transfers: PCIe bandwidth is the bottleneck; overlap transfers with computation using streams and pinned memory
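Principle 3 can be made concrete with a minimal sketch. The kernel and variable names below are illustrative, not part of the guide:

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i touches element i, so one 32-thread warp reads
// 32 consecutive floats -- a single 128-byte memory transaction.
__global__ void scale_coalesced(float *out, const float *in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = s * in[i];
}

// Strided: thread i touches element i * stride, so each lane of a warp
// lands in a different cache line -- up to 32 transactions per warp,
// which is the 32x bandwidth loss described above.
__global__ void scale_strided(float *out, const float *in, float s,
                              int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = s * in[i];
}
```

Both kernels do the same arithmetic; only the access pattern differs, which is why memory-access planning comes before algorithmic tuning.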

Guardrails

Error Checking

  • ALWAYS check CUDA API return values with a macro wrapper
  • ALWAYS call cudaGetLastError() after every kernel launch
  • ALWAYS call cudaDeviceSynchronize() before reading kernel results on the host (a synchronous cudaMemcpy also synchronizes implicitly, but an explicit sync surfaces asynchronous kernel errors at a known point)
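The three guardrails above combine into a pattern like the following sketch; the macro name CUDA_CHECK is a common convention, not mandated by the guide:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA API call so failures report file, line, and cause.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",           \
                    cudaGetErrorName(err_), __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: fill an array with a constant.
__global__ void fill(float *p, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));

    fill<<<(n + 255) / 256, 256>>>(d, 1.0f, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during execution

    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

Kernel launches return no error code themselves, which is why cudaGetLastError() and a synchronization call are both needed: the first reports launch failures, the second reports faults that occur while the kernel runs.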
Repository: ar4mirez/samuel
Installs: 11
First seen: Mar 1, 2026