ai-model-benchmarking

Pass

Audited by Gen Agent Trust Hub on Apr 2, 2026

Risk Level: SAFE

Full Analysis

[SAFE]: The skill provides educational documentation and code snippets for benchmarking AI models using the well-known EleutherAI lm-evaluation-harness library.
[EXTERNAL_DOWNLOADS]: Includes instructions to install the 'lm-eval' package via pip. This package is the PyPI distribution of lm-evaluation-harness, a widely used library for this domain.
[COMMAND_EXECUTION]: Demonstrates standard CLI usage of the lm_eval tool for running academic benchmarks like MMLU and GSM8K.
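
For context, the kind of CLI invocation covered by the COMMAND_EXECUTION finding might look like the sketch below. It only assembles and prints the command and does not run it. The helper name `build_lm_eval_command`, the model ID `EleutherAI/pythia-160m`, and the batch size are illustrative assumptions, not taken from the audited skill. The flag names follow the lm-eval 0.4.x CLI and should be checked against `lm_eval --help` for the installed version.

```python
import shlex


def build_lm_eval_command(model_id, tasks, batch_size=8):
    """Return the argv list for an lm_eval run; nothing is executed here.

    Hypothetical helper for illustration only.
    """
    return [
        "lm_eval",
        "--model", "hf",                          # Hugging Face transformers backend
        "--model_args", f"pretrained={model_id}",  # which checkpoint to load
        "--tasks", ",".join(tasks),               # comma-separated task names
        "--batch_size", str(batch_size),
    ]


# Assumed example values: a small public model and the two benchmarks named above.
cmd = build_lm_eval_command("EleutherAI/pythia-160m", ["mmlu", "gsm8k"])

# Print a shell-safe rendering so the command can be reviewed before anyone runs it.
print(shlex.join(cmd))
```

Building the command as a list and printing it keeps the review step separate from execution, which suits an audit context: the reader can see exactly what would be passed to `lm_eval` before anything is downloaded or run.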

Audit Metadata

Risk Level: SAFE

Analyzed: Apr 2, 2026, 03:02 PM
