The Agent Skills Directory

[EXTERNAL_DOWNLOADS]: The documentation recommends installing standard, well-known machine learning and benchmarking libraries such as lm-eval, transformers, and vllm from official package registries.
[COMMAND_EXECUTION]: Workflow examples demonstrate the use of shell commands and Python's os.system to automate periodic model evaluations and performance comparisons.
[REMOTE_CODE_EXECUTION]: The skill documents the intended use of the allow_code_execution flag, which functional correctness benchmarks like HumanEval require because they execute model-generated code in a controlled manner.
[DYNAMIC_EXECUTION]: Detailed guides cover creating custom evaluation tasks that use Python utility functions for specialized data processing and metric aggregation.
[SAFE]: No malicious obfuscation, unauthorized data access, or persistence mechanisms were detected. The skill follows security best practices by recommending environment variables for API key management.
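
The patterns noted in the findings above (shell-driven evaluation runs and API keys read from environment variables) can be sketched roughly as follows. This is a minimal illustration, not code taken from the skill: the helper names get_api_key and build_eval_command, the default variable name OPENAI_API_KEY, and the example arguments are all hypothetical. The lm_eval flags used (--model, --model_args, --tasks, --output_path) are assumed to match the installed lm-eval version.

```python
import os


def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment and fail loudly if it is missing.

    The variable name is an assumption; use whatever your provider expects.
    """
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


def build_eval_command(model_args: str, tasks: list[str], output_path: str) -> list[str]:
    """Build an lm-eval invocation as an argument list.

    Returning a list (rather than one shell string) lets the caller pass it
    to subprocess.run without shell interpolation of the arguments.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--output_path", output_path,
    ]
```

A caller could pass the result to subprocess.run(cmd, check=True). Compared with building a string for os.system, an argument list avoids shell quoting problems when model names or paths contain spaces or special characters.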

evaluating-llms-harness