llm-evaluation AI Agent Skills Search Results

typescript-sdk

comet-ml

TypeScript SDK patterns for Opik. Use when working in sdks/typescript.

★ 17.8k

⑂ 0

AI

hugging-face-evaluation

[ Official ]

huggingface

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations

★ 9.5k

⑂ 0

AI

budmem

BudEcosystem

Bud AI Foundry - A comprehensive inference stack for compound AI deployment, optimization and scaling. Bud Stack provides intelligent infrastructure automation, performance optimization, and seamless…

★ 10

⑂ 0

AI

ground-truth-evaluation

oaknational

A collection of tools for working with the Oak Open Curriculum Data, including a published MCP server

★ 3

⑂ 0

Developer

agent-evaluation

oimiragieo

LLM-as-judge evaluation framework with 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evi…

★ 14

⑂ 0

Developer
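
The "weighted composite score" in the agent-evaluation entry above can be illustrated with a minimal sketch. The five dimension names come from the listing; the weights, the 1-5 scale, and the function name composite_score are assumptions made for illustration and are not taken from the skill itself.

```python
from typing import Dict

# Dimension names are from the listing; these weights are illustrative assumptions.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "accuracy": 0.30,
    "groundedness": 0.25,
    "coherence": 0.15,
    "completeness": 0.15,
    "helpfulness": 0.15,
}


def composite_score(scores: Dict[str, float],
                    weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Return the weighted mean of per-dimension judge scores.

    `scores` maps each rubric dimension to a numeric score (assumed 1-5 here).
    Weights are normalized, so they do not have to sum to exactly 1.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[dim] * w for dim, w in weights.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    judged = {
        "accuracy": 4,
        "groundedness": 5,
        "coherence": 3,
        "completeness": 4,
        "helpfulness": 4,
    }
    print(round(composite_score(judged), 2))
```

Normalizing by the total weight keeps the composite on the same scale as the individual dimension scores, which makes it easier to compare runs if the weights are later tuned.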

evaluation

mshraditya

Evaluation of agent systems, which calls for different approaches than evaluation of traditional software

★ 0

⑂ 0

Developer

Healthcare AI Evaluation

MFD3000

Guide evaluation of healthcare AI systems with domain-specific safety criteria, clinical accuracy rubrics, and score interpretation. Use when building or reviewing health/medical AI evaluations.

★ 0

⑂ 0

Developer

debug-stuck-eval

METR

Debug stuck Hawk/Inspect AI evaluations. Use when user mentions stuck eval, eval not progressing, eval hanging, samples not completing, eval set frozen, runner stuck, 500 errors in eval, retry loop, e…

★ 21

⑂ 0

AI

eval-harness

[ Featured ]

affaan-m

eval-harness is a formal evaluation framework implementing eval-driven development principles for Claude Code sessions

★ 108.5k

⑂ 0

Developer

skill-stocktake

[ Featured ]

affaan-m

Use when auditing Claude skills and commands for quality. Supports Quick Scan (changed skills only) and Full Stocktake modes with sequential subagent batch evaluation.

★ 108.5k

⑂ 0

Developer

launch-prep

benchflow-ai

Framework for creating complex, high-fidelity RL environments and evaluation tasks

★ 203

⑂ 0

Developer

e2e

langwatch

Generate and verify E2E tests for a feature. Explores live app, creates test plan, generates tests, runs and fixes until passing.

★ 2.8k

⑂ 0

AI
