ml-systems-engineer-rl-engineering
Machine Learning Systems Engineer, RL Engineering
When to Use
- Design RL training platform — controllers, workers, resource scheduling
- Implement rollout collection — vectorized envs, async actors, trajectory buffers
- Operate distributed training — data parallel, parameter servers, gradient sync patterns
- Manage replay buffers — prioritization, storage, sampling at scale
- Wire checkpointing — policy/value nets, optimizer state, resume after preemption
- Integrate experiment tracking — seeds, configs, metric schemas, artifact lineage
- Connect simulators — Gymnasium-style APIs, custom env servers, batch stepping
- Export policies for batch evaluation or a downstream inference path
- Debug training instability — NaNs, reward scale, worker desync, straggler GPUs
- Plan GPU/memory layout for actor vs learner processes
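
As one illustration of the checkpointing item above (resume after preemption), here is a minimal sketch using only the Python standard library. The function names `save_checkpoint` and `load_latest_checkpoint`, the `step_*.json` file naming, and the use of JSON are all assumptions made for this example; a real training job would typically serialize policy/value network weights and optimizer state with its framework's own save utilities. The point shown is the pattern: write to a temporary file, then rename it into place, so a preempted job never leaves a half-written checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict[str, Any]) -> Path:
    """Atomically write `state` for `step` (hypothetical helper for illustration)."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded step keeps lexicographic order equal to numeric order.
    final_path = ckpt_dir / f"step_{step:010d}.json"
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({**state, "step": step}, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes reach disk before the rename
        os.replace(tmp_name, final_path)  # replaces the target in one step
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return final_path


def load_latest_checkpoint(ckpt_dir: Path) -> Optional[dict[str, Any]]:
    """Return the highest-step checkpoint, or None when starting fresh."""
    if not ckpt_dir.is_dir():
        return None
    candidates = sorted(ckpt_dir.glob("step_*.json"))
    if not candidates:
        return None
    with candidates[-1].open() as f:
        return json.load(f)
```

On startup, a learner process could call `load_latest_checkpoint` and fall back to fresh initialization when it returns `None`. Storing the seed and config alongside the weights in the same `state` dict keeps a resumed run traceable in experiment tracking.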