CUDA Guide
Applies to: CUDA 11+, GPU Computing, Deep Learning, Scientific Computing, HPC
Core Principles
- Parallelism First: Design algorithms for thousands of concurrent threads; serial thinking is the primary enemy of GPU performance
- Memory Hierarchy Awareness: Global memory is 100x slower than shared memory and 1000x slower than registers; every kernel design starts with memory access planning
- Coalesced Access: Adjacent threads must access adjacent memory addresses; a scattered or heavily strided access pattern can cut effective bandwidth by up to 32x, because each thread's access may require its own memory transaction (see the coalescing sketch after this list)
- Occupancy Over Cleverness: Maximize active warps per SM by managing register count, shared memory usage, and block dimensions together
- Minimize Host-Device Transfers: PCIe bandwidth is typically the slowest link in the system; overlap transfers with computation using streams and pinned memory (see the overlap sketch after this list)
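To make the coalescing point above concrete, the following sketch contrasts a coalesced and a strided access pattern; the kernel names and parameters are illustrative assumptions, not taken from this guide:

```cuda
// Coalesced: consecutive threads read/write consecutive elements, so the 32
// accesses of a warp collapse into one or two wide memory transactions.
__global__ void scaleCoalesced(const float* in, float* out, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = alpha * in[i];   // adjacent threads -> adjacent addresses
}

// Strided: consecutive threads touch elements far apart, so each access can
// require its own transaction and effective bandwidth drops sharply.
__global__ void scaleStrided(const float* in, float* out, float alpha,
                             int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = alpha * in[i];   // adjacent threads -> addresses stride apart
}
```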
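And a minimal sketch of the transfer-overlap principle, assuming a hypothetical processChunk kernel and illustrative chunk sizes; error checking is omitted here and in practice should go through the macro described under Guardrails:

```cuda
#include <cuda_runtime.h>

// Hypothetical per-chunk kernel: squares each element in place.
__global__ void processChunk(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main()
{
    const int nChunks = 8, chunkElems = 1 << 20;   // illustrative sizes
    float *h_buf = nullptr, *d_buf = nullptr;

    // Pinned (page-locked) host memory is required for cudaMemcpyAsync to
    // overlap with kernel execution instead of degrading to a blocking copy.
    cudaMallocHost((void**)&h_buf, (size_t)nChunks * chunkElems * sizeof(float));
    cudaMalloc((void**)&d_buf, (size_t)nChunks * chunkElems * sizeof(float));
    for (size_t i = 0; i < (size_t)nChunks * chunkElems; ++i) h_buf[i] = 1.0f;

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    // Round-robin chunks over two streams so one chunk's copies can overlap
    // with another chunk's kernel.
    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t st = streams[c % 2];
        size_t off = (size_t)c * chunkElems;
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunkElems * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        processChunk<<<(chunkElems + 255) / 256, 256, 0, st>>>(d_buf + off,
                                                               chunkElems);
        cudaMemcpyAsync(h_buf + off, d_buf + off, chunkElems * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();   // all chunks are back in h_buf after this point

    for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```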
Guardrails
Error Checking
- ALWAYS check CUDA API return values with a macro wrapper (a sketch follows this list)
- ALWAYS call `cudaGetLastError()` after every kernel launch
- ALWAYS call `cudaDeviceSynchronize()` before reading kernel results on the host
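A minimal sketch of such a macro wrapper and of the two calls above around a launch; the CUDA_CHECK name and the fill kernel are illustrative, not mandated by this guide:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures report file/line and abort.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                       \
                    cudaGetErrorString(err_), __FILE__, __LINE__);            \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    } while (0)

// Trivial example kernel: fills a buffer with a constant.
__global__ void fill(float* out, float value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main()
{
    const int n = 1 << 20;
    float* d_out = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d_out, n * sizeof(float)));

    fill<<<(n + 255) / 256, 256>>>(d_out, 1.0f, n);
    CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // surfaces errors raised during execution
                                           // and makes results safe to read on the host

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```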
Related skills