{"kind":"Skill","metadata":{"namespace":"community","name":"agentic-eval","version":"0.1.0"},"spec":{"description":"|","files":{"SKILL.md":"---\nname: agentic-eval\ndescription: |\n  Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:\n  - Implementing self-critique and reflection loops\n  - Building evaluator-optimizer pipelines for quality-critical generation\n  - Creating test-driven code refinement workflows\n  - Designing rubric-based or LLM-as-judge evaluation systems\n  - Adding iterative improvement to agent outputs (code, reports, analysis)\n  - Measuring and improving agent response quality\n---\n\n# Agentic Evaluation Patterns\n\nPatterns for self-improvement through iterative evaluation and refinement.\n\n## Overview\n\nEvaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.\n\n```\nGenerate → Evaluate → Critique → Refine → Output\n    ↑                              │\n    └──────────────────────────────┘\n```\n\n## When to Use\n\n- **Quality-critical generation**: Code, reports, analysis requiring high accuracy\n- **Tasks with clear evaluation criteria**: Defined success metrics exist\n- **Content requiring specific standards**: Style guides, compliance, formatting\n\n---\n\n## Pattern 1: Basic Reflection\n\nAgent evaluates and improves its own output through self-critique.\n\n```python\ndef reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -\u003e str:\n    \"\"\"Generate with reflection loop.\"\"\"\n    output = llm(f\"Complete this task:\\n{task}\")\n    \n    for i in range(max_iterations):\n        # Self-critique\n        critique = llm(f\"\"\"\n        Evaluate this output against criteria: {criteria}\n        Output: {output}\n        Rate each: PASS/FAIL with feedback as JSON.\n        \"\"\")\n        \n        critique_data = json.loads(critique)\n        all_pass = all(c[\"status\"] == 
\"PASS\" for c in critique_data.values())\n        if all_pass:\n            return output\n        \n        # Refine based on critique\n        failed = {k: v[\"feedback\"] for k, v in critique_data.items() if v[\"status\"] == \"FAIL\"}\n        output = llm(f\"Improve to address: {failed}\\nOriginal: {output}\")\n    \n    return output\n```\n\n**Key insight**: Use structured JSON output for reliable parsing of critique results.\n\n---\n\n## Pattern 2: Evaluator-Optimizer\n\nSeparate generation and evaluation into distinct components for clearer responsibilities.\n\n```python\nclass EvaluatorOptimizer:\n    def __init__(self, score_threshold: float = 0.8):\n        self.score_threshold = score_threshold\n    \n    def generate(self, task: str) -\u003e str:\n        return llm(f\"Complete: {task}\")\n    \n    def evaluate(self, output: str, task: str) -\u003e dict:\n        return json.loads(llm(f\"\"\"\n        Evaluate output for task: {task}\n        Output: {output}\n        Return JSON: {{\"overall_score\": 0-1, \"dimensions\": {{\"accuracy\": ..., \"clarity\": ...}}}}\n        \"\"\"))\n    \n    def optimize(self, output: str, feedback: dict) -\u003e str:\n        return llm(f\"Improve based on feedback: {feedback}\\nOutput: {output}\")\n    \n    def run(self, task: str, max_iterations: int = 3) -\u003e str:\n        output = self.generate(task)\n        for _ in range(max_iterations):\n            evaluation = self.evaluate(output, task)\n            if evaluation[\"overall_score\"] \u003e= self.score_threshold:\n                break\n            output = self.optimize(output, evaluation)\n        return output\n```\n\n---\n\n## Pattern 3: Code-Specific Reflection\n\nTest-driven refinement loop for code generation.\n\n```python\nclass CodeReflector:\n    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -\u003e str:\n        code = llm(f\"Write Python code for: {spec}\")\n        tests = llm(f\"Generate pytest tests for: {spec}\\nCode: 
{code}\")\n        \n        for _ in range(max_iterations):\n            result = run_tests(code, tests)\n            if result[\"success\"]:\n                return code\n            code = llm(f\"Fix error: {result['error']}\\nCode: {code}\")\n        return code\n```\n\n---\n\n## Evaluation Strategies\n\n### Outcome-Based\nEvaluate whether output achieves the expected result.\n\n```python\ndef evaluate_outcome(task: str, output: str, expected: str) -\u003e str:\n    return llm(f\"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}\")\n```\n\n### LLM-as-Judge\nUse an LLM to compare and rank outputs.\n\n```python\ndef llm_judge(output_a: str, output_b: str, criteria: str) -\u003e str:\n    # Include both candidates in the prompt so the judge can actually compare them\n    return llm(f\"Compare outputs A and B for {criteria}. Which is better and why?\\nA: {output_a}\\nB: {output_b}\")\n```\n\n### Rubric-Based\nScore outputs against weighted dimensions.\n\n```python\nRUBRIC = {\n    \"accuracy\": {\"weight\": 0.4},\n    \"clarity\": {\"weight\": 0.3},\n    \"completeness\": {\"weight\": 0.3}\n}\n\ndef evaluate_with_rubric(output: str, rubric: dict) -\u003e float:\n    scores = json.loads(llm(f\"Rate 1-5 for each dimension: {list(rubric.keys())}\\nOutput: {output}\"))\n    return sum(scores[d] * rubric[d][\"weight\"] for d in rubric) / 5\n```\n\n---\n\n## Best Practices\n\n| Practice | Rationale |\n|----------|-----------|\n| **Clear criteria** | Define specific, measurable evaluation criteria upfront |\n| **Iteration limits** | Set max iterations (3-5) to prevent infinite loops |\n| **Convergence check** | Stop if output score isn't improving between iterations |\n| **Log history** | Keep full trajectory for debugging and analysis |\n| **Structured output** | Use JSON for reliable parsing of evaluation results |\n\n---\n\n## Quick Start Checklist\n\n```markdown\n## Evaluation Implementation Checklist\n\n### Setup\n- [ ] Define evaluation criteria/rubric\n- [ ] Set score threshold for \"good enough\"\n- [ ] Configure max iterations (default: 
3)\n\n### Implementation\n- [ ] Implement generate() function\n- [ ] Implement evaluate() function with structured output\n- [ ] Implement optimize() function\n- [ ] Wire up the refinement loop\n\n### Safety\n- [ ] Add convergence detection\n- [ ] Log all iterations for debugging\n- [ ] Handle evaluation parse failures gracefully\n```\n"},"import":{"commit_sha":"541b7819d8c3545c6df122491af4fa1eae415779","imported_at":"2026-05-18T20:05:35Z","license_text":"MIT License\n\nCopyright GitHub, Inc.\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.","owner":"github","repo":"github/awesome-copilot","source_url":"https://github.com/github/awesome-copilot/tree/541b7819d8c3545c6df122491af4fa1eae415779/skills/agentic-eval"}},"content_hash":[215,223,30,168,241,91,163,90,180,109,78,179,79,218,53,34,22,37,66,132,30,177,251,156,162,111,13,229,12,192,149,13],"trust_level":"unsigned","yanked":false}