torch-tensor-parallelism
Tensor Parallelism Implementation Guide
This skill provides guidance for implementing tensor parallelism patterns in PyTorch, specifically for ColumnParallelLinear and RowParallelLinear layers that distribute computation across multiple devices.
Core Concepts
Tensor Parallelism Overview
Tensor parallelism splits individual layers across multiple devices to parallelize computation within a single forward/backward pass. The two primary patterns are:
- ColumnParallelLinear: Shards weights along the output dimension (columns). Each device computes a portion of the output features, then results are concatenated via all-gather.
- RowParallelLinear: Shards weights along the input dimension (rows). Each device computes partial outputs using its shard of the input, then results are summed via all-reduce.
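The two sharding patterns above can be checked numerically in a single process. The following is a minimal sketch, not a distributed implementation: `world_size` and the tensor shapes are hypothetical values chosen for illustration, the ranks are simulated with Python lists, and the weight is assumed to use the `nn.Linear` layout `(out_features, in_features)`, so "columns" of the mathematical weight correspond to dim 0 of the stored tensor.

```python
import torch

torch.manual_seed(0)
world_size = 4  # hypothetical number of simulated ranks
batch, in_features, out_features = 3, 8, 12  # illustrative shapes

x = torch.randn(batch, in_features, dtype=torch.float64)
# nn.Linear layout: weight has shape (out_features, in_features)
weight = torch.randn(out_features, in_features, dtype=torch.float64)
reference = x @ weight.t()  # unsharded result, shape (batch, out_features)

# Column parallel: split the output features across ranks.
col_shards = torch.chunk(weight, world_size, dim=0)
# Each rank sees the full input and produces a slice of the output features.
col_partials = [x @ w.t() for w in col_shards]
# Simulated all-gather: concatenate the slices along the feature dimension.
col_output = torch.cat(col_partials, dim=-1)

# Row parallel: split the input features across ranks.
row_shards = torch.chunk(weight, world_size, dim=1)
x_shards = torch.chunk(x, world_size, dim=-1)
# Each rank produces a full-size but partial output from its input slice.
row_partials = [xs @ w.t() for xs, w in zip(x_shards, row_shards)]
# Simulated all-reduce: sum the partial outputs elementwise.
row_output = torch.stack(row_partials).sum(dim=0)

col_matches = torch.allclose(col_output, reference)
row_matches = torch.allclose(row_output, reference)
```

Note the difference in partial shapes: each column-parallel partial has `out_features // world_size` features, while each row-parallel partial already has the full output shape and only becomes correct after the sum.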
Critical Implementation Requirement
When implementing tensor parallelism (especially in simulation or testing contexts), the forward pass must actually perform the collective operations, not just compute local shards:
- ColumnParallelLinear: Must concatenate outputs from all ranks (all-gather semantics)
- RowParallelLinear: Must sum partial outputs from all ranks (all-reduce semantics)
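As one possible illustration of this requirement, here is a hedged sketch of a column-parallel layer whose forward pass performs the gather itself. The class body, initialization scale, and fallback behavior are assumptions made for this example, not the skill's reference implementation. It uses `torch.distributed.all_gather` when a process group is initialized and otherwise behaves as a single rank. Because `all_gather` fills its output tensors without autograd history, the sketch reinserts the local tensor so gradients still reach this rank's weight; a full implementation would also need to reduce input gradients across ranks, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist


class ColumnParallelLinear(nn.Module):
    """Sketch: each rank owns out_features // world_size output features."""

    def __init__(self, in_features, out_features):
        super().__init__()
        initialized = dist.is_available() and dist.is_initialized()
        # Assumption: without a process group, act as a single rank.
        self.world_size = dist.get_world_size() if initialized else 1
        self.rank = dist.get_rank() if initialized else 0
        if out_features % self.world_size != 0:
            raise ValueError("out_features must be divisible by world_size")
        local_out = out_features // self.world_size
        # Local shard in nn.Linear layout: (local_out_features, in_features)
        self.weight = nn.Parameter(torch.randn(local_out, in_features) * 0.02)

    def forward(self, x):
        local = F.linear(x, self.weight)  # this rank's slice of the output
        if self.world_size == 1:
            return local
        # The collective happens inside forward, not in the caller.
        gathered = [torch.empty_like(local) for _ in range(self.world_size)]
        dist.all_gather(gathered, local.detach().contiguous())
        # Restore the autograd-tracked local slice in this rank's slot.
        gathered[self.rank] = local
        return torch.cat(gathered, dim=-1)


# Single-process usage: no process group, so world_size falls back to 1.
torch.manual_seed(0)
layer = ColumnParallelLinear(8, 12)
x = torch.randn(3, 8)
out = layer(x)
```

Returning only `local` when `world_size > 1` would produce an output with `out_features // world_size` features, which is the failure mode the requirement above warns about.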