sparse-autoencoder-training

Pass

Audited by Gen Agent Trust Hub on May 11, 2026

Risk Level: SAFE | EXTERNAL_DOWNLOADS | COMMAND_EXECUTION | DATA_EXFILTRATION
Full Analysis
  • [EXTERNAL_DOWNLOADS]: The skill fetches pre-trained Sparse Autoencoders and datasets from HuggingFace and identifies the official SAELens repository on GitHub as a primary resource. These are recognized services within the machine learning research community.
  • [COMMAND_EXECUTION]: Instructions include the installation of the sae-lens and transformer-lens packages via standard package managers. The provided code examples demonstrate legitimate use of these libraries for neural network analysis.
  • [DATA_EXFILTRATION]: Training configurations include options to log metrics to Weights & Biases (W&B). This sends run metrics to an external service, but it is standard practice for experiment tracking in machine learning.
  • [CREDENTIALS_UNSAFE]: References to HuggingFace API tokens for model uploads use placeholders like hf_token, following safe development practices by avoiding hardcoded secrets.
  • [DATA_INGESTION_SURFACE]: The skill is designed to process untrusted data from the 'Pile' dataset and user-supplied text prompts to train and analyze autoencoders. While this represents an attack surface for indirect prompt injection, the capabilities (training and feature discovery) are consistent with the skill's primary research purpose.
      • Ingestion points: dataset_path configuration in SKILL.md and prompt variables in references/tutorials.md.
      • Boundary markers: None present in the code examples.
      • Capability inventory: Model training, feature steering, and data upload capabilities across all referenced scripts.
      • Sanitization: No specific sanitization or filtering of input text is implemented in the provided examples.
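
The two gaps noted above (no boundary markers, no sanitization) could be narrowed with a small pre-processing step before user-supplied prompts reach the model. The following is a minimal sketch using only the Python standard library, not code from the audited skill: the names sanitize_prompt, wrap_untrusted, MAX_CHARS, and the <untrusted_input> marker are all hypothetical, and the length cap is an arbitrary assumption.

```python
import unicodedata

# Hypothetical values: neither the cap nor the marker comes from the audited skill.
MAX_CHARS = 2000
OPEN_MARK = "<untrusted_input>"
CLOSE_MARK = "</untrusted_input>"


def sanitize_prompt(text: str, max_chars: int = MAX_CHARS) -> str:
    """Drop Unicode 'Other' (C*) characters except newline/tab, then cap length.

    Note: category C also covers format characters such as zero-width joiners,
    so this is deliberately aggressive and may alter some scripts.
    """
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    return cleaned[:max_chars]


def wrap_untrusted(text: str) -> str:
    """Surround sanitized text with explicit boundary markers.

    Any copy of the markers inside the text is removed first so the
    input cannot close the boundary early.
    """
    body = sanitize_prompt(text).replace(OPEN_MARK, "").replace(CLOSE_MARK, "")
    return f"{OPEN_MARK}\n{body}\n{CLOSE_MARK}"
```

A step like this would sit between the ingestion points listed above and any code that tokenizes or displays the text. Whether filtering is appropriate for training data drawn from a corpus such as the Pile is a separate judgment, since it changes the distribution the autoencoder sees.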
Audit Metadata
  Risk Level: SAFE
  Analyzed: May 11, 2026, 02:49 PM