ai-model-benchmarking

Pass

Audited by Gen Agent Trust Hub on Apr 2, 2026

Risk Level: SAFE
Full Analysis
  • [SAFE]: The skill provides educational documentation and code snippets for benchmarking AI models using the well-known EleutherAI lm-evaluation-harness library.
  • [EXTERNAL_DOWNLOADS]: Includes instructions to install the 'lm-eval' package (the PyPI distribution of lm-evaluation-harness, a widely used evaluation library in this domain) via pip.
  • [COMMAND_EXECUTION]: Demonstrates standard CLI usage of the lm_eval tool for running academic benchmarks like MMLU and GSM8K.
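For context, the kind of CLI invocation the findings above describe can be sketched as follows. This is a minimal illustration, not the skill's actual snippet: the helper name `build_lm_eval_command`, the model ID `EleutherAI/pythia-160m`, the batch size, and the output directory are all assumptions chosen for the example. The script only assembles and prints the command; it does not execute it.

```python
import shlex
import shutil


def build_lm_eval_command(model_id, tasks, batch_size=8, output_path="results"):
    """Return the argv list for one lm_eval run (nothing is executed here).

    Hypothetical helper for illustration; the flags are the lm_eval CLI's
    documented options, but the default values are assumptions.
    """
    return [
        "lm_eval",
        "--model", "hf",                          # Hugging Face transformers backend
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),               # comma-separated task names
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # Placeholder model; substitute the model actually being benchmarked.
    cmd = build_lm_eval_command("EleutherAI/pythia-160m", ["mmlu", "gsm8k"])
    print(shlex.join(cmd))

    # Only check whether the tool is installed; do not run the benchmark.
    if shutil.which("lm_eval") is None:
        print("lm_eval not found on PATH; install it with: pip install lm-eval")
```

Passing the returned list to `subprocess.run(cmd, check=True)` would start the evaluation. Building the command as a list rather than a shell string avoids shell interpolation of the model ID or task names, which is relevant to the [COMMAND_EXECUTION] category above.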
Audit Metadata
Risk Level: SAFE
Analyzed: Apr 2, 2026, 03:02 PM