hugging-face-evaluation
[ Official ] Add and manage evaluation results in Hugging Face model cards. Supports extracting evaluation tables from README content, importing scores from the Artificial Analysis API, and running custom model evaluations.
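For context, Hugging Face model cards store evaluation results as `model-index` metadata in the README front matter. Below is a minimal sketch of adding one score programmatically with the `huggingface_hub` helpers; the repo ID, task, dataset, and metric value are illustrative placeholders, not output from this skill:

```python
from huggingface_hub import metadata_eval_result, metadata_update

# Build model-index metadata for a single eval result
# (model name, task, dataset, and score below are placeholders).
metadata = metadata_eval_result(
    model_pretty_name="my-model",
    task_pretty_name="Text Classification",
    task_id="text-classification",
    metrics_pretty_name="Accuracy",
    metrics_id="accuracy",
    metrics_value=0.91,
    dataset_pretty_name="GLUE (SST-2)",
    dataset_id="glue",
)

# Merge the result into the model card on the Hub without
# overwriting existing metadata.
metadata_update("username/my-model", metadata, overwrite=False)
```

`metadata_update` commits the merged metadata to the repo's README.md front matter, which the Hub renders as the model's evaluation table.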
Guide evaluation of healthcare AI systems with domain-specific safety criteria, clinical accuracy rubrics, and score interpretation. Use when building or reviewing health/medical AI evaluations.
Use when auditing Claude skills and commands for quality. Supports Quick Scan (changed skills only) and Full Stocktake modes with sequential subagent batch evaluation.
cpp-testing is a specialized AI agent skill for automating C++ test workflows using GoogleTest and CMake.
requesting-code-review is a skill that enables developers to request code reviews, ensuring work meets requirements through subagent-driven evaluation.
sandbox-sdk is a Cloudflare Workers solution that provides a sandboxed environment for running untrusted code, ideal for developers who need isolated execution.
Get alternative perspectives via Google Gemini on complex design decisions, architecture trade-offs, debugging dead ends, or approach evaluation.
Benchmark Manager is a skill that manages AILANG evaluation benchmarks with features like prompt integration, debugging, and best practices.
eval is a Thoughtbox intention ledger for agents, designed to evaluate AI decisions against its decision-making metrics.
Generate visualizations from completed experiment evaluations using inspect-viz. Use after run-experiment to create interactive HTML plots from inspect-ai evaluation logs.
MixSeek Agent Skills collection for AI coding assistants. Provides workspace management, team configuration, evaluation setup, and debugging tools for MixSeek-Core.
LLM-as-judge evaluation framework with a 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and supporting evidence.
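To make the weighted composite concrete, here is a small illustrative sketch; the weights and example scores are placeholders, not the framework's actual values:

```python
# Illustrative weighted composite over the five rubric dimensions.
# Weights are placeholders; the framework's actual weights may differ.
WEIGHTS = {
    "accuracy": 0.30,
    "groundedness": 0.25,
    "coherence": 0.15,
    "completeness": 0.15,
    "helpfulness": 0.15,
}

def composite_score(scores: dict[str, float]) -> float:
    """Combine per-dimension judge scores (e.g. 1-5) into one weighted score."""
    assert set(scores) == set(WEIGHTS), "expected exactly the five rubric dimensions"
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

print(composite_score({
    "accuracy": 4.0, "groundedness": 5.0, "coherence": 4.5,
    "completeness": 3.5, "helpfulness": 4.0,
}))  # -> 4.25
```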