# Transformer Architecture Guide

Understand, implement, and adapt Transformer architectures for NLP, computer vision, and multimodal research, from the original attention mechanism to modern variants.

## The Original Transformer

The Transformer (Vaswani et al., 2017, "Attention Is All You Need") replaced recurrence and convolution with self-attention as the primary sequence modeling mechanism.

### Core Components

| Component | Function | Key Parameters |
|---|---|---|
| Multi-Head Self-Attention | Computes attention weights across all positions | `d_model`, `n_heads`, `d_k`, `d_v` |
| Feed-Forward Network | Position-wise nonlinear transformation | `d_model`, `d_ff` |
| Positional Encoding | Injects sequence order information | Sinusoidal or learned |
| Layer Normalization | Stabilizes training | Pre-norm or post-norm |
| Residual Connections | Enables gradient flow in deep networks | Add before or after norm |
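
To show how these components fit together, here is a minimal sketch of a single pre-norm encoder block in PyTorch. The class name, default dimensions, and dropout rate are illustrative assumptions, not reference code from the original paper.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal pre-norm Transformer encoder block (illustrative sketch)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Position-wise feed-forward network: d_model -> d_ff -> d_model
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Pre-norm: normalize before each sublayer, then add the residual.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

x = torch.randn(2, 16, 512)      # (batch, seq_len, d_model)
print(EncoderBlock()(x).shape)   # torch.Size([2, 16, 512])
```

The pre-norm arrangement shown here normalizes the input to each sublayer; the post-norm variant from the 2017 paper instead normalizes after the residual addition, which typically requires learning-rate warmup to train stably.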

### Self-Attention Mechanism
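
Given queries Q, keys K, and values V, scaled dot-product attention computes `Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V`, where the `1/sqrt(d_k)` scaling keeps dot products from growing with the head dimension and saturating the softmax. A minimal sketch, assuming unbatched inputs of shape `(seq_len, d_k)` and no masking:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (unmasked sketch)."""
    d_k = q.size(-1)
    # Pairwise similarity of every query with every key.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                 # weighted sum of values

q = k = v = torch.randn(16, 64)   # seq_len=16, d_k=64
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                  # torch.Size([16, 64])
```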
