AI Agent Skills Search Results: "evaluation"

typescript-sdk

comet-ml

TypeScript SDK patterns for Opik. Use when working in sdks/typescript.

★ 17.8k

⑂ 0

AI

hugging-face-evaluation

[ Official ]

huggingface

Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations

★ 9.5k

⑂ 0

AI

carrier-relationship-management

[ Featured ]

affaan-m

carrier-relationship-management is a skill that automates and optimizes freight management processes using AI-powered tools.

★ 108.5k

⑂ 0

Developer

e2e

langwatch

Generate and verify E2E tests for a feature. Explores live app, creates test plan, generates tests, runs and fixes until passing.

★ 2.8k

⑂ 0

AI

ground-truth-evaluation

oaknational

A collection of tools for working with the Oak Open Curriculum Data, including a published MCP server

★ 3

⑂ 0

Developer

agent-evaluation

oimiragieo

LLM-as-judge evaluation framework with 5-dimension rubric (accuracy, groundedness, coherence, completeness, helpfulness) for scoring AI-generated content quality with weighted composite scores and evi

★ 14

⑂ 0

Developer

tech-stack-evaluator

matteocervelli

Auto-activates during requirements analysis to evaluate technical stack

★ 20

⑂ 0

Developer

debug-stuck-eval

METR

Debug stuck Hawk/Inspect AI evaluations. Use when user mentions stuck eval, eval not progressing, eval hanging, samples not completing, eval set frozen, runner stuck, 500 errors in eval, retry loop, e

★ 21

⑂ 0

AI

evaluation

mshraditya

Evaluating agent systems requires different approaches than evaluating traditional software

★ 0

⑂ 0

Developer

Healthcare AI Evaluation

MFD3000

Guide evaluation of healthcare AI systems with domain-specific safety criteria, clinical accuracy rubrics, and score interpretation. Use when building or reviewing health/medical AI evaluations.

★ 0

⑂ 0

Developer

launch-prep

benchflow-ai

Framework for creating high fidelity and complex RL environments and evaluation tasks

★ 203

⑂ 0

Developer

evaluating-llms

ancoleman

Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for

★ 0

⑂ 0

Developer
