{"kind":"Skill","metadata":{"namespace":"community","name":"phoenix-evals","version":"0.1.0"},"spec":{"description":"Build and run evaluators for AI/LLM applications using Phoenix.","files":{"SKILL.md":"---\nname: phoenix-evals\ndescription: Build and run evaluators for AI/LLM applications using Phoenix.\nlicense: Apache-2.0\ncompatibility: Requires Phoenix server. Python skills need phoenix and openai packages; TypeScript skills need @arizeai/phoenix-client.\nmetadata:\n  author: oss@arize.com\n  version: \"1.0.0\"\n  languages: \"Python, TypeScript\"\n---\n\n# Phoenix Evals\n\nBuild evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.\n\n## Quick Reference\n\n| Task | Files |\n| ---- | ----- |\n| Setup | [setup-python](references/setup-python.md), [setup-typescript](references/setup-typescript.md) |\n| Decide what to evaluate | [evaluators-overview](references/evaluators-overview.md) |\n| Choose a judge model | [fundamentals-model-selection](references/fundamentals-model-selection.md) |\n| Use pre-built evaluators | [evaluators-pre-built](references/evaluators-pre-built.md) |\n| Build code evaluator | [evaluators-code-python](references/evaluators-code-python.md), [evaluators-code-typescript](references/evaluators-code-typescript.md) |\n| Build LLM evaluator | [evaluators-llm-python](references/evaluators-llm-python.md), [evaluators-llm-typescript](references/evaluators-llm-typescript.md), [evaluators-custom-templates](references/evaluators-custom-templates.md) |\n| Batch evaluate DataFrame | [evaluate-dataframe-python](references/evaluate-dataframe-python.md) |\n| Run experiment | [experiments-running-python](references/experiments-running-python.md), [experiments-running-typescript](references/experiments-running-typescript.md) |\n| Create dataset | [experiments-datasets-python](references/experiments-datasets-python.md), [experiments-datasets-typescript](references/experiments-datasets-typescript.md) |\n| Generate synthetic 
data | [experiments-synthetic-python](references/experiments-synthetic-python.md), [experiments-synthetic-typescript](references/experiments-synthetic-typescript.md) |\n| Validate evaluator accuracy | [validation](references/validation.md), [validation-evaluators-python](references/validation-evaluators-python.md), [validation-evaluators-typescript](references/validation-evaluators-typescript.md) |\n| Sample traces for review | [observe-sampling-python](references/observe-sampling-python.md), [observe-sampling-typescript](references/observe-sampling-typescript.md) |\n| Analyze errors | [error-analysis](references/error-analysis.md), [error-analysis-multi-turn](references/error-analysis-multi-turn.md), [axial-coding](references/axial-coding.md) |\n| RAG evals | [evaluators-rag](references/evaluators-rag.md) |\n| Avoid common mistakes | [common-mistakes-python](references/common-mistakes-python.md), [fundamentals-anti-patterns](references/fundamentals-anti-patterns.md) |\n| Production | [production-overview](references/production-overview.md), [production-guardrails](references/production-guardrails.md), [production-continuous](references/production-continuous.md) |\n\n## Workflows\n\n**Starting Fresh:**\n[observe-tracing-setup](references/observe-tracing-setup.md) → [error-analysis](references/error-analysis.md) → [axial-coding](references/axial-coding.md) → [evaluators-overview](references/evaluators-overview.md)\n\n**Building Evaluator:**\n[fundamentals](references/fundamentals.md) → [common-mistakes-python](references/common-mistakes-python.md) → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}\n\n**RAG Systems:**\n[evaluators-rag](references/evaluators-rag.md) → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)\n\n**Production:**\n[production-overview](references/production-overview.md) → [production-guardrails](references/production-guardrails.md) → 
[production-continuous](references/production-continuous.md)\n\n## Reference Categories\n\n| Prefix | Description |\n| ------ | ----------- |\n| `fundamentals-*` | Types, scores, anti-patterns |\n| `observe-*` | Tracing, sampling |\n| `error-analysis-*` | Finding failures |\n| `axial-coding-*` | Categorizing failures |\n| `evaluators-*` | Code, LLM, RAG evaluators |\n| `experiments-*` | Datasets, running experiments |\n| `validation-*` | Validating evaluator accuracy against human labels |\n| `production-*` | CI/CD, monitoring |\n\n## Key Principles\n\n| Principle | Action |\n| --------- | ------ |\n| Error analysis first | Can't automate what you haven't observed |\n| Custom \u003e generic | Build from your failures |\n| Code first | Deterministic before LLM |\n| Validate judges | \u003e80% TPR/TNR |\n| Binary \u003e Likert | Pass/fail, not 1-5 |\n","references/axial-coding.md":"# Axial Coding\n\nGroup open-ended notes into structured failure taxonomies.\n\n## Process\n\n1. **Gather** - Collect open coding notes\n2. **Pattern** - Group notes with common themes\n3. **Name** - Create actionable category names\n4. 
**Quantify** - Count failures per category\n\n## Example Taxonomy\n\n```yaml\nfailure_taxonomy:\n  content_quality:\n    hallucination: [invented_facts, fictional_citations]\n    incompleteness: [partial_answer, missing_key_info]\n    inaccuracy: [wrong_numbers, wrong_dates]\n  \n  communication:\n    tone_mismatch: [too_casual, too_formal]\n    clarity: [ambiguous, jargon_heavy]\n  \n  context:\n    user_context: [ignored_preferences, misunderstood_intent]\n    retrieved_context: [ignored_documents, wrong_context]\n  \n  safety:\n    missing_disclaimers: [legal, medical, financial]\n```\n\n## Add Annotation (Python)\n\n```python\nfrom phoenix.client import Client\n\nclient = Client()\nclient.spans.add_span_annotation(\n    span_id=\"abc123\",\n    annotation_name=\"failure_category\",\n    label=\"hallucination\",\n    explanation=\"invented a feature that doesn't exist\",\n    annotator_kind=\"HUMAN\",\n    sync=True,\n)\n```\n\n## Add Annotation (TypeScript)\n\n```typescript\nimport { addSpanAnnotation } from \"@arizeai/phoenix-client/spans\";\n\nawait addSpanAnnotation({\n  spanAnnotation: {\n    spanId: \"abc123\",\n    name: \"failure_category\",\n    label: \"hallucination\",\n    explanation: \"invented a feature that doesn't exist\",\n    annotatorKind: \"HUMAN\",\n  }\n});\n```\n\n## Agent Failure Taxonomy\n\n```yaml\nagent_failures:\n  planning: [wrong_plan, incomplete_plan]\n  tool_selection: [wrong_tool, missed_tool, unnecessary_call]\n  tool_execution: [wrong_parameters, type_error]\n  state_management: [lost_context, stuck_in_loop]\n  error_recovery: [no_fallback, wrong_fallback]\n```\n\n## Transition Matrix (Agents)\n\nShows where failures occur between states:\n\n```python\nfrom collections import defaultdict\n\nimport pandas as pd\n\n\ndef build_transition_matrix(conversations, states):\n    # find_last_success / find_first_failure are not defined here; supply helpers for your trace format\n    matrix = defaultdict(lambda: defaultdict(int))\n    for conv in conversations:\n        if conv[\"failed\"]:\n            last_success = find_last_success(conv)\n            first_failure = find_first_failure(conv)\n            matrix[last_success][first_failure] += 1\n    # Columns: last successful state; rows: first failing state\n    return pd.DataFrame(matrix).fillna(0)\n```\n\n## Principles\n\n- **MECE** - Each failure fits ONE category\n- **Actionable** - Categories suggest fixes\n- **Bottom-up** - Let categories emerge from data\n","references/common-mistakes-python.md":"# Common Mistakes (Python)\n\nPatterns that LLMs frequently generate incorrectly from training data.\n\n## Legacy Model Classes\n\n```python\n# WRONG\nfrom phoenix.evals import OpenAIModel, AnthropicModel\nmodel = OpenAIModel(model=\"gpt-4\")\n\n# RIGHT\nfrom phoenix.evals import LLM\nllm = LLM(provider=\"openai\", model=\"gpt-4o\")\n```\n\n**Why**: `OpenAIModel`, `AnthropicModel`, etc. are legacy 1.0 wrappers in `phoenix.evals.legacy`.\nThe `LLM` class is provider-agnostic and is the current 2.0 API.\n\n## Using run_evals Instead of evaluate_dataframe\n\n```python\n# WRONG — legacy 1.0 API\nfrom phoenix.evals import run_evals\nresults = run_evals(dataframe=df, evaluators=[eval1], provide_explanation=True)\n# Returns list of DataFrames\n\n# RIGHT — current 2.0 API\nfrom phoenix.evals import evaluate_dataframe\nresults_df = evaluate_dataframe(dataframe=df, evaluators=[eval1])\n# Returns single DataFrame with {name}_score dict columns\n```\n\n**Why**: `run_evals` is the legacy 1.0 batch function. 
`evaluate_dataframe` is the current\n2.0 function with a different return format.\n\n## Wrong Result Column Names\n\n```python\n# WRONG — column doesn't exist\nscore = results_df[\"relevance\"].mean()\n\n# WRONG — column exists but contains dicts, not numbers\nscore = results_df[\"relevance_score\"].mean()\n\n# RIGHT — extract numeric score from dict\nscores = results_df[\"relevance_score\"].apply(\n    lambda x: x.get(\"score\", 0.0) if isinstance(x, dict) else 0.0\n)\nscore = scores.mean()\n```\n\n**Why**: `evaluate_dataframe` returns columns named `{name}_score` containing Score dicts\nlike `{\"name\": \"...\", \"score\": 1.0, \"label\": \"...\", \"explanation\": \"...\"}`.\n\n## Deprecated project_name Parameter\n\n```python\n# WRONG\ndf = client.spans.get_spans_dataframe(project_name=\"my-project\")\n\n# RIGHT\ndf = client.spans.get_spans_dataframe(project_identifier=\"my-project\")\n```\n\n**Why**: `project_name` is deprecated in favor of `project_identifier`, which also\naccepts project IDs.\n\n## Wrong Client Constructor\n\n```python\n# WRONG\nclient = Client(endpoint=\"https://app.phoenix.arize.com\")\nclient = Client(url=\"https://app.phoenix.arize.com\")\n\n# RIGHT — for remote/cloud Phoenix\nclient = Client(base_url=\"https://app.phoenix.arize.com\", api_key=\"...\")\n\n# ALSO RIGHT — for local Phoenix (falls back to env vars or localhost:6006)\nclient = Client()\n```\n\n**Why**: The parameter is `base_url`, not `endpoint` or `url`. For local instances,\n`Client()` with no args works fine. 
For remote instances, `base_url` and `api_key` are required.\n\n## Too-Aggressive Time Filters\n\n```python\n# WRONG — often returns zero spans\nfrom datetime import datetime, timedelta\ndf = client.spans.get_spans_dataframe(\n    project_identifier=\"my-project\",\n    start_time=datetime.now() - timedelta(hours=1),\n)\n\n# RIGHT — use limit to control result size instead\ndf = client.spans.get_spans_dataframe(\n    project_identifier=\"my-project\",\n    limit=50,\n)\n```\n\n**Why**: Traces may be from any time period. A 1-hour window frequently returns\nnothing. Use `limit=` to control result size instead.\n\n## Not Filtering Spans Appropriately\n\n```python\n# WRONG — fetches all spans including internal LLM calls, retrievers, etc.\ndf = client.spans.get_spans_dataframe(project_identifier=\"my-project\")\n\n# RIGHT for end-to-end evaluation — filter to top-level spans\ndf = client.spans.get_spans_dataframe(\n    project_identifier=\"my-project\",\n    root_spans_only=True,\n)\n\n# RIGHT for RAG evaluation — fetch child spans for retriever/LLM metrics\nall_spans = client.spans.get_spans_dataframe(\n    project_identifier=\"my-project\",\n)\nretriever_spans = all_spans[all_spans[\"span_kind\"] == \"RETRIEVER\"]\nllm_spans = all_spans[all_spans[\"span_kind\"] == \"LLM\"]\n```\n\n**Why**: For end-to-end evaluation (e.g., overall answer quality), use `root_spans_only=True`.\nFor RAG systems, you often need child spans separately — retriever spans for\nDocumentRelevance and LLM spans for Faithfulness. 
Choose the right span level\nfor your evaluation target.\n\n## Assuming Span Output is Plain Text\n\n```python\n# WRONG — output may be JSON, not plain text\ndf[\"output\"] = df[\"attributes.output.value\"]\n\n# RIGHT — parse JSON and extract the answer field\nimport json\n\ndef extract_answer(output_value):\n    if not isinstance(output_value, str):\n        return str(output_value) if output_value is not None else \"\"\n    try:\n        parsed = json.loads(output_value)\n        if isinstance(parsed, dict):\n            for key in (\"answer\", \"result\", \"output\", \"response\"):\n                if key in parsed:\n                    return str(parsed[key])\n    except (json.JSONDecodeError, TypeError):\n        pass\n    return output_value\n\ndf[\"output\"] = df[\"attributes.output.value\"].apply(extract_answer)\n```\n\n**Why**: LangChain and other frameworks often output structured JSON from root spans,\nlike `{\"context\": \"...\", \"question\": \"...\", \"answer\": \"...\"}`. Evaluators need\nthe actual answer text, not the raw JSON.\n\n## Using @create_evaluator for LLM-Based Evaluation\n\n```python\n# WRONG — @create_evaluator doesn't call an LLM\n@create_evaluator(name=\"relevance\", kind=\"llm\")\ndef relevance(input: str, output: str) -\u003e str:\n    pass  # No LLM is involved\n\n# RIGHT — use ClassificationEvaluator for LLM-based evaluation\nfrom phoenix.evals import ClassificationEvaluator, LLM\n\nrelevance = ClassificationEvaluator(\n    name=\"relevance\",\n    prompt_template=\"Is this relevant?\\n{{input}}\\n{{output}}\\nAnswer:\",\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    choices={\"relevant\": 1.0, \"irrelevant\": 0.0},\n)\n```\n\n**Why**: `@create_evaluator` wraps a plain Python function. 
Setting `kind=\"llm\"`\nmarks it as LLM-based but you must implement the LLM call yourself.\nFor LLM-based evaluation, prefer `ClassificationEvaluator` which handles\nthe LLM call, structured output parsing, and explanations automatically.\n\n## Using llm_classify Instead of ClassificationEvaluator\n\n```python\n# WRONG — legacy 1.0 API\nfrom phoenix.evals import llm_classify\nresults = llm_classify(\n    dataframe=df,\n    template=template_str,\n    model=model,\n    rails=[\"relevant\", \"irrelevant\"],\n)\n\n# RIGHT — current 2.0 API\nfrom phoenix.evals import ClassificationEvaluator, async_evaluate_dataframe, LLM\n\nclassifier = ClassificationEvaluator(\n    name=\"relevance\",\n    prompt_template=template_str,\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    choices={\"relevant\": 1.0, \"irrelevant\": 0.0},\n)\nresults_df = await async_evaluate_dataframe(dataframe=df, evaluators=[classifier])\n```\n\n**Why**: `llm_classify` is the legacy 1.0 function. The current pattern is to create\nan evaluator with `ClassificationEvaluator` and run it with `async_evaluate_dataframe()`.\n\n## Using HallucinationEvaluator\n\n```python\n# WRONG — deprecated\nfrom phoenix.evals import HallucinationEvaluator\neval = HallucinationEvaluator(model)\n\n# RIGHT — use FaithfulnessEvaluator\nfrom phoenix.evals.metrics import FaithfulnessEvaluator\nfrom phoenix.evals import LLM\neval = FaithfulnessEvaluator(llm=LLM(provider=\"openai\", model=\"gpt-4o\"))\n```\n\n**Why**: `HallucinationEvaluator` is deprecated. `FaithfulnessEvaluator` is its replacement,\nusing \"faithful\"/\"unfaithful\" labels with maximized score (1.0 = faithful).\n","references/error-analysis-multi-turn.md":"# Error Analysis: Multi-Turn Conversations\n\nDebugging complex multi-turn conversation traces.\n\n## The Approach\n\n1. **End-to-end first** - Did the conversation achieve the goal?\n2. **Find first failure** - Trace backwards to root cause\n3. 
**Simplify** - Try single-turn before debugging multi-turn\n4. **N-1 testing** - Isolate turn-specific vs capability issues\n\n## Find First Upstream Failure\n\n```\nTurn 1: User asks about flights ✓\nTurn 2: Assistant asks for dates ✓\nTurn 3: User provides dates ✓\nTurn 4: Assistant searches WRONG dates ← FIRST FAILURE\nTurn 5: Shows wrong flights (consequence)\nTurn 6: User frustrated (consequence)\n```\n\nFocus on Turn 4, not Turn 6.\n\n## Simplify First\n\nBefore debugging multi-turn, test single-turn:\n\n```python\n# If single-turn also fails → problem is retrieval/knowledge\n# If single-turn passes → problem is conversation context\nresponse = chat(\"What's the return policy for electronics?\")\n```\n\n## N-1 Testing\n\nGive turns 1 to N-1 as context, test turn N:\n\n```python\ncontext = conversation[:n-1]\nresponse = chat_with_context(context, user_message_n)\n# Compare to actual turn N\n```\n\nThis isolates whether the error comes from the conversation context or from the underlying capability.\n\n## Checklist\n\n1. Did conversation achieve goal? (E2E)\n2. Which turn first went wrong?\n3. Can you reproduce with single-turn?\n4. Is error from context or capability? (N-1 test)\n","references/error-analysis.md":"# Error Analysis\n\nReview traces to discover failure modes before building evaluators.\n\n## Process\n\n1. **Sample** - 100+ traces (errors, negative feedback, random)\n2. **Open Code** - Write free-form notes per trace\n3. **Axial Code** - Group notes into failure categories\n4. **Quantify** - Count failures per category\n5. **Prioritize** - Rank by frequency × severity\n\n## Sample Traces\n\n### Span-level sampling (Python — DataFrame)\n\n```python\nimport pandas as pd\nfrom phoenix.client import Client\n\n# Client() works for local Phoenix (falls back to env vars or localhost:6006)\n# For remote/cloud: Client(base_url=\"https://app.phoenix.arize.com\", api_key=\"...\")\nclient = Client()\nspans_df = client.spans.get_spans_dataframe(project_identifier=\"my-app\")\n\n# Build representative sample\n# Note: .sample(n) raises ValueError if fewer than n rows match; lower n for small projects\nsample = pd.concat([\n    spans_df[spans_df[\"status_code\"] == \"ERROR\"].sample(30),\n    spans_df[spans_df[\"feedback\"] == \"negative\"].sample(30),\n    spans_df.sample(40),\n]).drop_duplicates(\"span_id\").head(100)\n```\n\n### Span-level sampling (TypeScript)\n\n```typescript\nimport { getSpans } from \"@arizeai/phoenix-client/spans\";\n\nconst { spans: errors } = await getSpans({\n  project: { projectName: \"my-app\" },\n  statusCode: \"ERROR\",\n  limit: 30,\n});\nconst { spans: allSpans } = await getSpans({\n  project: { projectName: \"my-app\" },\n  limit: 70,\n});\nconst sample = [...errors, ...allSpans.sort(() =\u003e Math.random() - 0.5).slice(0, 40)];\nconst unique = [...new Map(sample.map((s) =\u003e [s.context.span_id, s])).values()].slice(0, 100);\n```\n\n### Trace-level sampling (Python)\n\nWhen errors span multiple spans (e.g., agent workflows), sample whole traces:\n\n```python\nfrom datetime import datetime, timedelta\n\ntraces = client.traces.get_traces(\n    project_identifier=\"my-app\",\n    start_time=datetime.now() - timedelta(hours=24),\n    include_spans=True,\n    sort=\"latency_ms\",\n    order=\"desc\",\n    limit=100,\n)\n# Each trace has: trace_id, start_time, end_time, spans\n```\n\n### Trace-level sampling (TypeScript)\n\n```typescript\nimport { getTraces } from \"@arizeai/phoenix-client/traces\";\n\nconst { traces } = await getTraces({\n  project: { projectName: \"my-app\" },\n  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),\n  includeSpans: true,\n  limit: 
100,\n});\n```\n\n## Add Notes (Python)\n\n```python\nclient.spans.add_span_note(\n    span_id=\"abc123\",\n    note=\"wrong timezone - said 3pm EST but user is PST\"\n)\n```\n\n## Add Notes (TypeScript)\n\n```typescript\nimport { addSpanNote } from \"@arizeai/phoenix-client/spans\";\n\nawait addSpanNote({\n  spanNote: {\n    spanId: \"abc123\",\n    note: \"wrong timezone - said 3pm EST but user is PST\"\n  }\n});\n```\n\n## What to Note\n\n| Type | Examples |\n| ---- | -------- |\n| Factual errors | Wrong dates, prices, made-up features |\n| Missing info | Didn't answer question, omitted details |\n| Tone issues | Too casual/formal for context |\n| Tool issues | Wrong tool, wrong parameters |\n| Retrieval | Wrong docs, missing relevant docs |\n\n## Good Notes\n\n```\nBAD:  \"Response is bad\"\nGOOD: \"Response says ships in 2 days but policy is 5-7 days\"\n```\n\n## Group into Categories\n\n```python\ncategories = {\n    \"factual_inaccuracy\": [\"wrong shipping time\", \"incorrect price\"],\n    \"hallucination\": [\"made up a discount\", \"invented feature\"],\n    \"tone_mismatch\": [\"informal for enterprise client\"],\n}\n# Priority = Frequency × Severity\n```\n\n## Retrieve Existing Annotations\n\n### Python\n\n```python\n# From a spans DataFrame\nannotations_df = client.spans.get_span_annotations_dataframe(\n    spans_dataframe=sample,\n    project_identifier=\"my-app\",\n    include_annotation_names=[\"quality\", \"correctness\"],\n)\n# annotations_df has: span_id (index), name, label, score, explanation\n\n# Or from specific span IDs\nannotations_df = client.spans.get_span_annotations_dataframe(\n    span_ids=[\"span-id-1\", \"span-id-2\"],\n    project_identifier=\"my-app\",\n)\n```\n\n### TypeScript\n\n```typescript\nimport { getSpanAnnotations } from \"@arizeai/phoenix-client/spans\";\n\nconst { annotations } = await getSpanAnnotations({\n  project: { projectName: \"my-app\" },\n  spanIds: [\"span-id-1\", \"span-id-2\"],\n  includeAnnotationNames: 
[\"quality\", \"correctness\"],\n});\n\nfor (const ann of annotations) {\n  console.log(`${ann.span_id}: ${ann.name} = ${ann.result?.label} (${ann.result?.score})`);\n}\n```\n\n## Saturation\n\nStop when new traces reveal no new failure modes. Minimum: 100 traces.\n","references/evaluate-dataframe-python.md":"# Batch Evaluation with evaluate_dataframe (Python)\n\nRun evaluators across a DataFrame. The core 2.0 batch evaluation API.\n\n## Preferred: async_evaluate_dataframe\n\nFor batch evaluations (especially with LLM evaluators), prefer the async version\nfor better throughput:\n\n```python\nfrom phoenix.evals import async_evaluate_dataframe\n\nresults_df = await async_evaluate_dataframe(\n    dataframe=df,              # pandas DataFrame with columns matching evaluator params\n    evaluators=[eval1, eval2], # List of evaluators\n    concurrency=5,             # Max concurrent LLM calls (default 3)\n    exit_on_error=False,       # Optional: stop on first error (default True)\n    max_retries=3,             # Optional: retry failed LLM calls (default 10)\n)\n```\n\n## Sync Version\n\n```python\nfrom phoenix.evals import evaluate_dataframe\n\nresults_df = evaluate_dataframe(\n    dataframe=df,              # pandas DataFrame with columns matching evaluator params\n    evaluators=[eval1, eval2], # List of evaluators\n    exit_on_error=False,       # Optional: stop on first error (default True)\n    max_retries=3,             # Optional: retry failed LLM calls (default 10)\n)\n```\n\n## Result Column Format\n\n`async_evaluate_dataframe` / `evaluate_dataframe` returns a copy of the input DataFrame with added columns.\n**Result columns contain dicts, NOT raw numbers.**\n\nFor each evaluator named `\"foo\"`, two columns are added:\n\n| Column | Type | Contents |\n| ------ | ---- | -------- |\n| `foo_score` | `dict` | `{\"name\": \"foo\", \"score\": 1.0, \"label\": \"True\", \"explanation\": \"...\", \"metadata\": {...}, \"kind\": \"code\", \"direction\": \"maximize\"}` 
|\n| `foo_execution_details` | `dict` | `{\"status\": \"success\", \"exceptions\": [], \"execution_seconds\": 0.001}` |\n\nOnly non-None fields appear in the score dict.\n\n### Extracting Numeric Scores\n\n```python\n# WRONG — these will fail or produce unexpected results\nscore = results_df[\"relevance\"].mean()                    # KeyError!\nscore = results_df[\"relevance_score\"].mean()              # Tries to average dicts!\n\n# RIGHT — extract the numeric score from each dict\nscores = results_df[\"relevance_score\"].apply(\n    lambda x: x.get(\"score\", 0.0) if isinstance(x, dict) else 0.0\n)\nmean_score = scores.mean()\n```\n\n### Extracting Labels\n\n```python\nlabels = results_df[\"relevance_score\"].apply(\n    lambda x: x.get(\"label\", \"\") if isinstance(x, dict) else \"\"\n)\n```\n\n### Extracting Explanations (LLM evaluators)\n\n```python\nexplanations = results_df[\"relevance_score\"].apply(\n    lambda x: x.get(\"explanation\", \"\") if isinstance(x, dict) else \"\"\n)\n```\n\n### Finding Failures\n\n```python\nscores = results_df[\"relevance_score\"].apply(\n    lambda x: x.get(\"score\", 0.0) if isinstance(x, dict) else 0.0\n)\nfailed_mask = scores \u003c 0.5\nfailures = results_df[failed_mask]\n```\n\n## Input Mapping\n\nEvaluators receive each row as a dict. Column names must match the evaluator's\nexpected parameter names. 
If they don't match, use `.bind()` or `bind_evaluator`:\n\n```python\nfrom phoenix.evals import bind_evaluator, create_evaluator, async_evaluate_dataframe\n\n@create_evaluator(name=\"check\", kind=\"code\")\ndef check(response: str) -\u003e bool:\n    return len(response.strip()) \u003e 0\n\n# Option 1: Use .bind() method on the evaluator\ncheck.bind(input_mapping={\"response\": \"answer\"})\nresults_df = await async_evaluate_dataframe(dataframe=df, evaluators=[check])\n\n# Option 2: Use bind_evaluator function\nbound = bind_evaluator(evaluator=check, input_mapping={\"response\": \"answer\"})\nresults_df = await async_evaluate_dataframe(dataframe=df, evaluators=[bound])\n```\n\nOr simply rename columns to match:\n\n```python\ndf = df.rename(columns={\n    \"attributes.input.value\": \"input\",\n    \"attributes.output.value\": \"output\",\n})\n```\n\n## DO NOT use run_evals\n\n```python\n# WRONG — legacy 1.0 API\nfrom phoenix.evals import run_evals\nresults = run_evals(dataframe=df, evaluators=[eval1])\n# Returns List[DataFrame] — one per evaluator\n\n# RIGHT — current 2.0 API\nfrom phoenix.evals import async_evaluate_dataframe\nresults_df = await async_evaluate_dataframe(dataframe=df, evaluators=[eval1])\n# Returns single DataFrame with {name}_score dict columns\n```\n\nKey differences:\n- `run_evals` returns a **list** of DataFrames (one per evaluator)\n- `async_evaluate_dataframe` returns a **single** DataFrame with all results merged\n- `async_evaluate_dataframe` uses `{name}_score` dict column format\n- `async_evaluate_dataframe` uses `bind_evaluator` for input mapping (not `input_mapping=` param)\n","references/evaluators-code-python.md":"# Evaluators: Code Evaluators in Python\n\nDeterministic evaluators without LLM. 
Fast, cheap, reproducible.\n\n## Basic Pattern\n\n```python\nimport re\nimport json\nfrom phoenix.evals import create_evaluator\n\n@create_evaluator(name=\"has_citation\", kind=\"code\")\ndef has_citation(output: str) -\u003e bool:\n    return bool(re.search(r'\\[\\d+\\]', output))\n\n@create_evaluator(name=\"json_valid\", kind=\"code\")\ndef json_valid(output: str) -\u003e bool:\n    try:\n        json.loads(output)\n        return True\n    except json.JSONDecodeError:\n        return False\n```\n\n## Parameter Binding\n\n| Parameter | Description |\n| --------- | ----------- |\n| `output` | Task output |\n| `input` | Example input |\n| `expected` | Expected output |\n| `metadata` | Example metadata |\n\n```python\n@create_evaluator(name=\"matches_expected\", kind=\"code\")\ndef matches_expected(output: str, expected: dict) -\u003e bool:\n    return output.strip() == expected.get(\"answer\", \"\").strip()\n```\n\n## Common Patterns\n\n- **Regex**: `re.search(pattern, output)`\n- **JSON schema**: `jsonschema.validate()`\n- **Keywords**: `keyword in output.lower()`\n- **Length**: `len(output.split())`\n- **Similarity**: `editdistance.eval()` or Jaccard\n\n## Return Types\n\n| Return type | Result |\n| ----------- | ------ |\n| `bool` | `True` → score=1.0, label=\"True\"; `False` → score=0.0, label=\"False\" |\n| `float`/`int` | Used as the `score` value directly |\n| `str` (short, ≤3 words) | Used as the `label` value |\n| `str` (long, ≥4 words) | Used as the `explanation` value |\n| `dict` with `score`/`label`/`explanation` | Mapped to Score fields directly |\n| `Score` object | Used as-is |\n\n## Important: Code vs LLM Evaluators\n\nThe `@create_evaluator` decorator wraps a plain Python function.\n\n- `kind=\"code\"` (default): For deterministic evaluators that don't call an LLM.\n- `kind=\"llm\"`: Marks the evaluator as LLM-based, but **you** must implement the LLM\n  call inside the function. 
The decorator does not call an LLM for you.\n\nFor most LLM-based evaluation, prefer `ClassificationEvaluator` which handles\nthe LLM call, structured output parsing, and explanations automatically:\n\n```python\nfrom phoenix.evals import ClassificationEvaluator, LLM\n\nrelevance = ClassificationEvaluator(\n    name=\"relevance\",\n    prompt_template=\"Is this relevant?\\n{{input}}\\n{{output}}\\nAnswer:\",\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    choices={\"relevant\": 1.0, \"irrelevant\": 0.0},\n)\n```\n\n## Pre-Built\n\n```python\nfrom phoenix.client.experiments import create_evaluator\nfrom phoenix.evals.metrics import MatchesRegex\n\ndate_format = MatchesRegex(pattern=r\"\\d{4}-\\d{2}-\\d{2}\")\n\n\n@create_evaluator(name=\"contains_any_keyword\", kind=\"code\")\ndef contains_any_keyword(output, expected):\n    keywords = expected.get(\"keywords\", [])\n    return any(kw.lower() in str(output).lower() for kw in keywords)\n\n\n@create_evaluator(name=\"json_parseable\", kind=\"code\")\ndef json_parseable(output):\n    import json\n\n    try:\n        json.loads(output)\n        return True\n    except (json.JSONDecodeError, TypeError):\n        return False\n```\n","references/evaluators-code-typescript.md":"# Evaluators: Code Evaluators in TypeScript\n\nDeterministic evaluators without LLM. Fast, cheap, reproducible.\n\n## Basic Pattern\n\n```typescript\nimport { createEvaluator } from \"@arizeai/phoenix-evals\";\n\nconst containsCitation = createEvaluator\u003c{ output: string }\u003e(\n  ({ output }) =\u003e /\\[\\d+\\]/.test(output) ? 
1 : 0,\n  { name: \"contains_citation\", kind: \"CODE\" }\n);\n```\n\n## With Full Results (asExperimentEvaluator)\n\n```typescript\nimport { asExperimentEvaluator } from \"@arizeai/phoenix-client/experiments\";\n\nconst jsonValid = asExperimentEvaluator({\n  name: \"json_valid\",\n  kind: \"CODE\",\n  evaluate: async ({ output }) =\u003e {\n    try {\n      JSON.parse(String(output));\n      return { score: 1.0, label: \"valid_json\" };\n    } catch (e) {\n      return { score: 0.0, label: \"invalid_json\", explanation: String(e) };\n    }\n  },\n});\n```\n\n## Parameter Types\n\n```typescript\ninterface EvaluatorParams {\n  input: Record\u003cstring, unknown\u003e;\n  output: unknown;\n  expected: Record\u003cstring, unknown\u003e;\n  metadata: Record\u003cstring, unknown\u003e;\n}\n```\n\n## Common Patterns\n\n- **Regex**: `/pattern/.test(output)`\n- **JSON**: `JSON.parse()` + zod schema\n- **Keywords**: `output.includes(keyword)`\n- **Similarity**: `fastest-levenshtein`\n","references/evaluators-custom-templates.md":"# Evaluators: Custom Templates\n\nDesign LLM judge prompts.\n\n## Complete Template Pattern\n\n```python\nTEMPLATE = \"\"\"Evaluate faithfulness of the response to the context.\n\n\u003ccontext\u003e{{context}}\u003c/context\u003e\n\u003cresponse\u003e{{output}}\u003c/response\u003e\n\nCRITERIA:\n\"faithful\" = ALL claims supported by context\n\"unfaithful\" = ANY claim NOT in context\n\nEXAMPLES:\nContext: \"Price is $10\" → Response: \"It costs $10\" → faithful\nContext: \"Price is $10\" → Response: \"About $15\" → unfaithful\n\nEDGE CASES:\n- Empty context → cannot_evaluate\n- \"I don't know\" when appropriate → faithful\n- Partial faithfulness → unfaithful (strict)\n\nAnswer (faithful/unfaithful):\"\"\"\n```\n\n## Template Structure\n\n1. Task description\n2. Input variables in XML tags\n3. Criteria definitions\n4. Examples (2-4 cases)\n5. Edge cases\n6. 
Output format\n\n## XML Tags\n\n```\n\u003cquestion\u003e{{input}}\u003c/question\u003e\n\u003cresponse\u003e{{output}}\u003c/response\u003e\n\u003ccontext\u003e{{context}}\u003c/context\u003e\n\u003creference\u003e{{reference}}\u003c/reference\u003e\n```\n\n## Common Mistakes\n\n| Mistake | Fix |\n| ------- | --- |\n| Vague criteria | Define each label exactly |\n| No examples | Include 2-4 cases |\n| Ambiguous format | Specify exact output |\n| No edge cases | Address ambiguity |\n","references/evaluators-llm-python.md":"# Evaluators: LLM Evaluators in Python\n\nLLM evaluators use a language model to judge outputs. Use when criteria are subjective.\n\n## Quick Start\n\n```python\nfrom phoenix.evals import ClassificationEvaluator, LLM\n\nllm = LLM(provider=\"openai\", model=\"gpt-4o\")\n\nHELPFULNESS_TEMPLATE = \"\"\"Rate how helpful the response is.\n\n\u003cquestion\u003e{{input}}\u003c/question\u003e\n\u003cresponse\u003e{{output}}\u003c/response\u003e\n\n\"helpful\" means directly addresses the question.\n\"not_helpful\" means does not address the question.\n\nYour answer (helpful/not_helpful):\"\"\"\n\nhelpfulness = ClassificationEvaluator(\n    name=\"helpfulness\",\n    prompt_template=HELPFULNESS_TEMPLATE,\n    llm=llm,\n    choices={\"not_helpful\": 0, \"helpful\": 1}\n)\n```\n\n## Template Variables\n\nUse XML tags to wrap variables for clarity:\n\n| Variable | XML Tag |\n| -------- | ------- |\n| `{{input}}` | `\u003cquestion\u003e{{input}}\u003c/question\u003e` |\n| `{{output}}` | `\u003cresponse\u003e{{output}}\u003c/response\u003e` |\n| `{{reference}}` | `\u003creference\u003e{{reference}}\u003c/reference\u003e` |\n| `{{context}}` | `\u003ccontext\u003e{{context}}\u003c/context\u003e` |\n\n## create_classifier (Factory)\n\nShorthand factory that returns a `ClassificationEvaluator`. 
Prefer direct\n`ClassificationEvaluator` instantiation for more parameters/customization:\n\n```python\nfrom phoenix.evals import create_classifier, LLM\n\nrelevance = create_classifier(\n    name=\"relevance\",\n    prompt_template=\"\"\"Is this response relevant to the question?\n\u003cquestion\u003e{{input}}\u003c/question\u003e\n\u003cresponse\u003e{{output}}\u003c/response\u003e\nAnswer (relevant/irrelevant):\"\"\",\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    choices={\"relevant\": 1.0, \"irrelevant\": 0.0},\n)\n```\n\n## Input Mapping\n\nColumn names must match template variables. Rename columns or use `bind_evaluator`:\n\n```python\n# Option 1: Rename columns to match template variables\ndf = df.rename(columns={\"user_query\": \"input\", \"ai_response\": \"output\"})\n\n# Option 2: Use bind_evaluator\nfrom phoenix.evals import bind_evaluator\n\nbound = bind_evaluator(\n    evaluator=helpfulness,\n    input_mapping={\"input\": \"user_query\", \"output\": \"ai_response\"},\n)\n```\n\n## Running\n\n```python\nfrom phoenix.evals import evaluate_dataframe\n\nresults_df = evaluate_dataframe(dataframe=df, evaluators=[helpfulness])\n```\n\n## Best Practices\n\n1. **Be specific** - Define exactly what pass/fail means\n2. **Include examples** - Show concrete cases for each label\n3. **Explanations by default** - `ClassificationEvaluator` includes explanations automatically\n4. **Study built-in prompts** - See\n   `phoenix.evals.__generated__.classification_evaluator_configs` for examples\n   of well-structured evaluation prompts (Faithfulness, Correctness, DocumentRelevance, etc.)\n","references/evaluators-llm-typescript.md":"# Evaluators: LLM Evaluators in TypeScript\n\nLLM evaluators use a language model to judge outputs. 
Uses Vercel AI SDK.\n\n## Quick Start\n\n```typescript\nimport { createClassificationEvaluator } from \"@arizeai/phoenix-evals\";\nimport { openai } from \"@ai-sdk/openai\";\n\nconst helpfulness = await createClassificationEvaluator\u003c{\n  input: string;\n  output: string;\n}\u003e({\n  name: \"helpfulness\",\n  model: openai(\"gpt-4o\"),\n  promptTemplate: `Rate helpfulness.\n\u003cquestion\u003e{{input}}\u003c/question\u003e\n\u003cresponse\u003e{{output}}\u003c/response\u003e\nAnswer (helpful/not_helpful):`,\n  choices: { not_helpful: 0, helpful: 1 },\n});\n```\n\n## Template Variables\n\nUse XML tags: `\u003cquestion\u003e{{input}}\u003c/question\u003e`, `\u003cresponse\u003e{{output}}\u003c/response\u003e`, `\u003ccontext\u003e{{context}}\u003c/context\u003e`\n\n## Custom Evaluator with asExperimentEvaluator\n\n```typescript\nimport { asExperimentEvaluator } from \"@arizeai/phoenix-client/experiments\";\n\nconst customEval = asExperimentEvaluator({\n  name: \"custom\",\n  kind: \"LLM\",\n  evaluate: async ({ input, output }) =\u003e {\n    // Your LLM call here\n    return { score: 1.0, label: \"pass\", explanation: \"...\" };\n  },\n});\n```\n\n## Pre-Built Evaluators\n\n```typescript\nimport { createFaithfulnessEvaluator } from \"@arizeai/phoenix-evals\";\n\nconst faithfulnessEvaluator = createFaithfulnessEvaluator({\n  model: openai(\"gpt-4o\"),\n});\n```\n\n## Best Practices\n\n- Be specific about criteria\n- Include examples in prompts\n- Use `\u003cthinking\u003e` for chain of thought\n","references/evaluators-overview.md":"# Evaluators: Overview\n\nWhen and how to build automated evaluators.\n\n## Decision Framework\n\n```\nShould I Build an Evaluator?\n        │\n        ▼\nCan I fix it with a prompt change?\n    YES → Fix the prompt first\n    NO  → Is this a recurring issue?\n          YES → Build evaluator\n          NO  → Add to watchlist\n```\n\n**Don't automate prematurely.** Many issues are simple prompt fixes.\n\n## Evaluator 
Requirements\n\n1. **Clear criteria** - Specific, not \"Is it good?\"\n2. **Labeled test set** - 100+ examples with human labels\n3. **Measured accuracy** - Know TPR/TNR before deploying\n\n## Evaluator Lifecycle\n\n1. **Discover** - Error analysis reveals pattern\n2. **Design** - Define criteria and test cases\n3. **Implement** - Build code or LLM evaluator\n4. **Calibrate** - Validate against human labels\n5. **Deploy** - Add to experiment/CI pipeline\n6. **Monitor** - Track accuracy over time\n7. **Maintain** - Update as product evolves\n\n## What NOT to Automate\n\n- **Rare issues** - \u003c5 instances? Watchlist, don't build\n- **Quick fixes** - Fixable by prompt change? Fix it\n- **Evolving criteria** - Stabilize definition first\n","references/evaluators-pre-built.md":"# Evaluators: Pre-Built\n\nUse for exploration only. Validate before production.\n\n## Python\n\n```python\nfrom phoenix.evals import LLM\nfrom phoenix.evals.metrics import FaithfulnessEvaluator\n\nllm = LLM(provider=\"openai\", model=\"gpt-4o\")\nfaithfulness_eval = FaithfulnessEvaluator(llm=llm)\n```\n\n**Note**: `HallucinationEvaluator` is deprecated. Use `FaithfulnessEvaluator` instead.\nIt uses \"faithful\"/\"unfaithful\" labels with score 1.0 = faithful.\n\n## TypeScript\n\n```typescript\nimport { createHallucinationEvaluator } from \"@arizeai/phoenix-evals\";\nimport { openai } from \"@ai-sdk/openai\";\n\nconst hallucinationEval = createHallucinationEvaluator({ model: openai(\"gpt-4o\") });\n```\n\n## Available (2.0)\n\n| Evaluator | Type | Description |\n| --------- | ---- | ----------- |\n| `FaithfulnessEvaluator` | LLM | Is the response faithful to the context? |\n| `CorrectnessEvaluator` | LLM | Is the response correct? |\n| `DocumentRelevanceEvaluator` | LLM | Are retrieved documents relevant? |\n| `ToolSelectionEvaluator` | LLM | Did the agent select the right tool? |\n| `ToolInvocationEvaluator` | LLM | Did the agent invoke the tool correctly? 
|\n| `ToolResponseHandlingEvaluator` | LLM | Did the agent handle the tool response well? |\n| `MatchesRegex` | Code | Does output match a regex pattern? |\n| `PrecisionRecallFScore` | Code | Precision/recall/F-score metrics |\n| `exact_match` | Code | Exact string match |\n\nLegacy evaluators (`HallucinationEvaluator`, `QAEvaluator`, `RelevanceEvaluator`,\n`ToxicityEvaluator`, `SummarizationEvaluator`) are in `phoenix.evals.legacy` and deprecated.\n\n## When to Use\n\n| Situation | Recommendation |\n| --------- | -------------- |\n| Exploration | Find traces to review |\n| Find outliers | Sort by scores |\n| Production | Validate first (\u003e80% human agreement) |\n| Domain-specific | Build custom |\n\n## Exploration Pattern\n\n```python\nfrom phoenix.evals import evaluate_dataframe\n\nresults_df = evaluate_dataframe(dataframe=traces, evaluators=[faithfulness_eval])\n\n# Score columns contain dicts — extract numeric scores\nscores = results_df[\"faithfulness_score\"].apply(\n    lambda x: x.get(\"score\", 0.0) if isinstance(x, dict) else 0.0\n)\nlow_scores = results_df[scores \u003c 0.5]   # Review these\nhigh_scores = results_df[scores \u003e 0.9]  # Also sample\n```\n\n## Validation Required\n\n```python\nfrom sklearn.metrics import classification_report\n\nprint(classification_report(human_labels, evaluator_results[\"label\"]))\n# Target: \u003e80% agreement\n```\n","references/evaluators-rag.md":"# Evaluators: RAG Systems\n\nRAG has two distinct components requiring different evaluation approaches.\n\n## Two-Phase Evaluation\n\n```\nRETRIEVAL                    GENERATION\n─────────                    ──────────\nQuery → Retriever → Docs     Docs + Query → LLM → Answer\n         │                              │\n    IR Metrics              LLM Judges / Code Checks\n```\n\n**Debug retrieval first** using IR metrics, then tackle generation quality.\n\n## Retrieval Evaluation (IR Metrics)\n\nUse traditional information retrieval metrics:\n\n| Metric | What It 
Measures |\n| ------ | ---------------- |\n| Recall@k | Of all relevant docs, how many in top k? |\n| Precision@k | Of k retrieved docs, how many relevant? |\n| MRR | How high is first relevant doc? |\n| NDCG | Quality weighted by position |\n\n```python\n# Requires query-document relevance labels\ndef recall_at_k(retrieved_ids, relevant_ids, k=5):\n    retrieved_set = set(retrieved_ids[:k])\n    relevant_set = set(relevant_ids)\n    if not relevant_set:\n        return 0.0\n    return len(retrieved_set \u0026 relevant_set) / len(relevant_set)\n```\n\n## Creating Retrieval Test Data\n\nGenerate query-document pairs synthetically:\n\n```python\n# Reverse process: document → questions that document answers\ndef generate_retrieval_test(documents):\n    test_pairs = []\n    for doc in documents:\n        # `llm` is a placeholder for your model call; assumed to return a list of question strings\n        questions = llm(f\"Generate 3 questions this document answers:\\n{doc}\")\n        for q in questions:\n            test_pairs.append({\"query\": q, \"relevant_doc_id\": doc.id})\n    return test_pairs\n```\n\n## Generation Evaluation\n\nUse LLM judges for qualities code can't measure:\n\n| Eval | Question |\n| ---- | -------- |\n| **Faithfulness** | Are all claims supported by retrieved context? |\n| **Relevance** | Does answer address the question? |\n| **Completeness** | Does answer cover key points from context? 
|\n\n```python\nfrom phoenix.evals import ClassificationEvaluator, LLM\n\nFAITHFULNESS_TEMPLATE = \"\"\"Given the context and answer, is every claim in the answer supported by the context?\n\n\u003ccontext\u003e{{context}}\u003c/context\u003e\n\u003canswer\u003e{{output}}\u003c/answer\u003e\n\n\"faithful\" = ALL claims supported by context\n\"unfaithful\" = ANY claim NOT in context\n\nAnswer (faithful/unfaithful):\"\"\"\n\nfaithfulness = ClassificationEvaluator(\n    name=\"faithfulness\",\n    prompt_template=FAITHFULNESS_TEMPLATE,\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    choices={\"unfaithful\": 0, \"faithful\": 1}\n)\n```\n\n## RAG Failure Taxonomy\n\nCommon failure modes to evaluate:\n\n```yaml\nretrieval_failures:\n  - no_relevant_docs: Query returns unrelated content\n  - partial_retrieval: Some relevant docs missed\n  - wrong_chunk: Right doc, wrong section\n\ngeneration_failures:\n  - hallucination: Claims not in retrieved context\n  - ignored_context: Answer doesn't use retrieved docs\n  - incomplete: Missing key information from context\n  - wrong_synthesis: Misinterprets or miscombines sources\n```\n\n## Evaluation Order\n\n1. **Retrieval first** - If wrong docs, generation will fail\n2. **Faithfulness** - Is answer grounded in context?\n3. 
**Answer quality** - Does answer address the question?\n\nFix retrieval problems before debugging generation.\n","references/experiments-datasets-python.md":"# Experiments: Datasets in Python\n\nCreating and managing evaluation datasets.\n\n## Creating Datasets\n\n`create_dataset()` upserts: if a dataset with the same name already exists it is updated in-place; re-running with identical inputs is a no-op.\n\n```python\nfrom phoenix.client import Client\n\nclient = Client()\n\n# From examples\ndataset = client.datasets.create_dataset(\n    name=\"qa-test-v1\",\n    examples=[\n        {\n            \"input\": {\"question\": \"What is 2+2?\"},\n            \"output\": {\"answer\": \"4\"},\n            \"metadata\": {\"category\": \"math\"},\n        },\n    ],\n)\n\n# With stable example IDs for targeted updates across uploads\ndataset = client.datasets.create_dataset(\n    name=\"qa-test-v1\",\n    examples=[\n        {\n            \"id\": \"q-001\",                      # stable ID — server updates this row, not inserts\n            \"input\": {\"question\": \"What is 2+2?\"},\n            \"output\": {\"answer\": \"4\"},\n            \"metadata\": {\"category\": \"math\"},\n        },\n    ],\n)\n\n# From DataFrame\ndataset = client.datasets.create_dataset(\n    dataframe=df,\n    name=\"qa-test-v1\",\n    input_keys=[\"question\"],\n    output_keys=[\"answer\"],\n    metadata_keys=[\"category\"],\n    split_key=\"split\",        # single split column (use this instead of deprecated split_keys)\n    example_id_key=\"id\",      # column containing stable example IDs\n)\n```\n\n## From Production Traces\n\n```python\nspans_df = client.spans.get_spans_dataframe(project_identifier=\"my-app\")\n\ndataset = client.datasets.create_dataset(\n    dataframe=spans_df[[\"input.value\", \"output.value\"]],\n    name=\"production-sample-v1\",\n    input_keys=[\"input.value\"],\n    output_keys=[\"output.value\"],\n)\n```\n\n## Retrieving Datasets\n\n```python\ndataset = 
client.datasets.get_dataset(name=\"qa-test-v1\")\ndf = dataset.to_dataframe()\n```\n\n## Key Parameters\n\n| Parameter | Description |\n| --------- | ----------- |\n| `input_keys` | Columns for task input |\n| `output_keys` | Columns for expected output |\n| `metadata_keys` | Additional context |\n| `example_id_key` | Column with stable example IDs; server updates the matching row instead of inserting |\n| `split_key` | Single column for split assignment (replaces deprecated `split_keys`) |\n| `split_keys` | **Deprecated** — use `split_key` (singular) instead |\n\n## Using Evaluators in Experiments\n\n### Evaluators as experiment evaluators\n\nPass phoenix-evals evaluators directly to `run_experiment` as the `evaluators` argument:\n\n```python\nfrom functools import partial\nfrom phoenix.client import AsyncClient\nfrom phoenix.evals import ClassificationEvaluator, LLM, bind_evaluator\n\n# Define an LLM evaluator\nrefusal = ClassificationEvaluator(\n    name=\"refusal\",\n    prompt_template=\"Is this a refusal?\\nQuestion: {{query}}\\nResponse: {{response}}\",\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    choices={\"refusal\": 0, \"answer\": 1},\n)\n\n# Bind to map dataset columns to evaluator params\nrefusal_evaluator = bind_evaluator(refusal, {\"query\": \"input.query\", \"response\": \"output\"})\n\n# Define experiment task\nasync def run_rag_task(input, rag_engine):\n    return rag_engine.query(input[\"query\"])\n\n# Run experiment with the evaluator\nexperiment = await AsyncClient().experiments.run_experiment(\n    dataset=ds,\n    task=partial(run_rag_task, rag_engine=query_engine),\n    experiment_name=\"baseline\",\n    evaluators=[refusal_evaluator],\n    concurrency=10,\n)\n```\n\n### Evaluators as the task (meta evaluation)\n\nUse an LLM evaluator as the experiment **task** to test the evaluator itself\nagainst human annotations:\n\n```python\nfrom phoenix.evals import create_evaluator\n\n# The evaluator IS the task being tested\ndef 
run_refusal_eval(input, evaluator):\n    result = evaluator.evaluate(input)\n    return result[0]\n\n# A simple heuristic checks judge vs human agreement\n@create_evaluator(name=\"exact_match\")\ndef exact_match(output, expected):\n    return float(output[\"score\"]) == float(expected[\"refusal_score\"])\n\n# Run: evaluator is the task, exact_match evaluates it\nexperiment = await AsyncClient().experiments.run_experiment(\n    dataset=annotated_dataset,\n    task=partial(run_refusal_eval, evaluator=refusal),\n    experiment_name=\"judge-v1\",\n    evaluators=[exact_match],\n    concurrency=10,\n)\n```\n\nThis pattern lets you iterate on evaluator prompts until they align with human judgments.\nSee `tutorials/evals/evals-2/evals_2.0_rag_demo.ipynb` for a full worked example.\n\n## Best Practices\n\n- **Upsert by default**: Re-upload to the same name to update in-place; use `example_id_key` so the server targets specific rows instead of treating every upload as new data\n- **Versioning**: Version with tags or new names (e.g., `qa-test-v2`) when you want a clean snapshot, not just incremental edits\n- **Metadata**: Track source, category, difficulty\n- **Balance**: Ensure diverse coverage across categories\n- **Avoid `split_keys`**: Pass `split_key` (singular) — `split_keys` is deprecated and emits a `DeprecationWarning`\n","references/experiments-datasets-typescript.md":"# Experiments: Datasets in TypeScript\n\nCreating and managing evaluation datasets.\n\n## Creating Datasets\n\n`createDataset()` upserts: if a dataset with the same name already exists it is updated to match the provided examples. 
Re-running with identical inputs is a no-op.\n\n```typescript\nimport { createClient } from \"@arizeai/phoenix-client\";\nimport { createDataset } from \"@arizeai/phoenix-client/datasets\";\n\nconst client = createClient();\n\nconst { datasetId } = await createDataset({\n  client,\n  name: \"qa-test-v1\",\n  examples: [\n    {\n      input: { question: \"What is 2+2?\" },\n      output: { answer: \"4\" },\n      metadata: { category: \"math\" },\n    },\n  ],\n});\n\n// With stable example IDs for targeted updates across uploads\nconst { datasetId } = await createDataset({\n  client,\n  name: \"qa-test-v1\",\n  examples: [\n    {\n      id: \"q-001\",                        // stable ID — server updates this row, not inserts\n      input: { question: \"What is 2+2?\" },\n      output: { answer: \"4\" },\n      metadata: { category: \"math\" },\n    },\n  ],\n});\n```\n\n## Example Structure\n\n```typescript\ninterface Example {\n  input: Record\u003cstring, unknown\u003e;    // Task input\n  output?: Record\u003cstring, unknown\u003e | null;  // Expected output\n  metadata?: Record\u003cstring, unknown\u003e | null; // Additional context\n  splits?: string | string[] | null; // Split assignment (\"train\", [\"train\", \"easy\"], etc.)\n  spanId?: string | null;            // OTEL span ID to link back to source trace\n  id?: string | null;                // Stable user-provided ID; server updates matching row\n}\n```\n\n## From Production Traces\n\n```typescript\nimport { getSpans } from \"@arizeai/phoenix-client/spans\";\n\nconst { spans } = await getSpans({\n  project: { projectName: \"my-app\" },\n  parentId: null, // root spans only\n  limit: 100,\n});\n\nconst examples = spans.map((span) =\u003e ({\n  input: { query: span.attributes?.[\"input.value\"] },\n  output: { response: span.attributes?.[\"output.value\"] },\n  metadata: { spanId: span.context.span_id },\n}));\n\nawait createDataset({ client, name: \"production-sample\", examples });\n```\n\n## 
Retrieving Datasets\n\n```typescript\nimport { getDataset, listDatasets } from \"@arizeai/phoenix-client/datasets\";\n\nconst dataset = await getDataset({ client, datasetId: \"...\" });\nconst all = await listDatasets({ client });\n```\n\n## Best Practices\n\n- **Upsert by default**: Re-upload to the same name to update in-place; use `id` on examples so the server targets specific rows instead of treating every upload as new data\n- **Versioning**: Version with new names (e.g., `qa-test-v2`) when you want a clean snapshot, not just incremental edits\n- **Metadata**: Track source, category, provenance\n- **Type safety**: Use the `Example` type from `@arizeai/phoenix-client/datasets`\n","references/experiments-overview.md":"# Experiments: Overview\n\nSystematic testing of AI systems with datasets, tasks, and evaluators.\n\n## Structure\n\n```\nDATASET     → Examples: {input, expected_output, metadata}\nTASK        → function(input) → output\nEVALUATORS  → (input, output, expected) → score\nEXPERIMENT  → Run task on all examples, score results\n```\n\n## Basic Usage\n\n```python\nfrom phoenix.client import Client\n\nclient = Client()\nexperiment = client.experiments.run_experiment(\n    dataset=my_dataset,\n    task=my_task,\n    evaluators=[accuracy, faithfulness],\n    experiment_name=\"improved-retrieval-v2\",\n)\n\nprint(experiment.aggregate_scores)\n# {'accuracy': 0.85, 'faithfulness': 0.92}\n```\n\n## Workflow\n\n1. **Create dataset** - From traces, synthetic data, or manual curation\n2. **Define task** - The function to test (your LLM pipeline)\n3. **Select evaluators** - Code and/or LLM-based\n4. **Run experiment** - Execute and score\n5. 
**Analyze \u0026 iterate** - Review, modify task, re-run\n\n## Dry Runs\n\nTest setup before full execution:\n\n```python\nexperiment = client.experiments.run_experiment(\n    dataset=dataset,\n    task=task,\n    evaluators=evaluators,\n    dry_run=3,\n)  # Just 3 examples\n```\n\n## Async Usage\n\nUse `AsyncClient` when your task or evaluators make network calls and you want higher throughput:\n\n```python\nfrom phoenix.client import AsyncClient\n\nclient = AsyncClient()\nexperiment = await client.experiments.run_experiment(\n    dataset=my_dataset,\n    task=my_async_task,\n    evaluators=[accuracy, faithfulness],\n    experiment_name=\"improved-retrieval-v2\",\n)\n```\n\n## Best Practices\n\n- **Name meaningfully**: `\"improved-retrieval-v2-2024-01-15\"` not `\"test\"`\n- **Version datasets**: Don't modify existing\n- **Multiple evaluators**: Combine perspectives\n","references/experiments-running-python.md":"# Experiments: Running Experiments in Python\n\nExecute experiments with `run_experiment`.\n\n## Basic Usage\n\n```python\nfrom phoenix.client import Client\nfrom phoenix.client.experiments import run_experiment\n\nclient = Client()\ndataset = client.datasets.get_dataset(name=\"qa-test-v1\")\n\ndef my_task(example):\n    return call_llm(example.input[\"question\"])\n\ndef exact_match(output, expected):\n    return 1.0 if output.strip().lower() == expected[\"answer\"].strip().lower() else 0.0\n\nexperiment = run_experiment(\n    dataset=dataset,\n    task=my_task,\n    evaluators=[exact_match],\n    experiment_name=\"qa-experiment-v1\",\n)\n```\n\n## Task Functions\n\n```python\n# Basic task\ndef task(example):\n    return call_llm(example.input[\"question\"])\n\n# With context (RAG)\ndef rag_task(example):\n    return call_llm(f\"Context: {example.input['context']}\\nQ: {example.input['question']}\")\n```\n\n## Evaluator Parameters\n\n| Parameter | Access |\n| --------- | ------ |\n| `output` | Task output |\n| `expected` | Example expected output |\n| 
`input` | Example input |\n| `metadata` | Example metadata |\n\n## Options\n\n```python\nexperiment = run_experiment(\n    dataset=dataset,\n    task=my_task,\n    evaluators=evaluators,\n    experiment_name=\"my-experiment\",\n    dry_run=3,       # Test with 3 examples\n    repetitions=3,   # Run each example 3 times\n)\n```\n\n## Results\n\n```python\nprint(experiment.aggregate_scores)\n# {'accuracy': 0.85, 'faithfulness': 0.92}\n\nfor run in experiment.runs:\n    print(run.output, run.scores)\n```\n\n## Stability\n\nSingle-run scores are noisy when either the task or the evaluator is non-deterministic — an LLM call, tool use, streaming output, an LLM-as-judge. On a small dataset, that per-run noise can swamp the signal from a prompt change.\n\nAveraging over repetitions lets the score you report reflect the prompt rather than the sampling noise:\n\n```python\nrun_experiment(\n    # ...\n    repetitions=3,\n)\n```\n\nThings to consider:\n\n- Reach for repetitions when the task or the evaluator is an LLM call and the dataset is small.\n- Prefer repetitions when per-example cost is low and you mostly want to settle the score; prefer growing the dataset when you also need to cover more behaviors.\n- Skip repetitions when both the task and the evaluator are deterministic (e.g. 
string comparison against a ground truth) — a single run is the answer.\n\nConsider adding repetitions when:\n\n- Repeat runs of the same experiment drift in ways that feel larger than the differences you're trying to measure.\n- A prompt change flips example labels in ways that don't track with how the outputs actually changed.\n- The judge's reasoning on the same output reads differently from one run to the next.\n\nWith the default `repetitions=1`, every score comes from a single sample per example, so don't trust a tuning decision based on a single 10-example run.\n\n## Add Evaluations Later\n\n```python\nfrom phoenix.client.experiments import evaluate_experiment\n\nevaluate_experiment(experiment=experiment, evaluators=[new_evaluator])\n```\n","references/experiments-running-typescript.md":"# Experiments: Running Experiments in TypeScript\n\nExecute experiments with `runExperiment`.\n\n## Basic Usage\n\n```typescript\nimport { createClient } from \"@arizeai/phoenix-client\";\nimport {\n  runExperiment,\n  asExperimentEvaluator,\n} from \"@arizeai/phoenix-client/experiments\";\n\nconst client = createClient();\n\nconst task = async (example: { input: Record\u003cstring, unknown\u003e }) =\u003e {\n  return await callLLM(example.input.question as string);\n};\n\nconst exactMatch = asExperimentEvaluator({\n  name: \"exact_match\",\n  kind: \"CODE\",\n  evaluate: async ({ output, expected }) =\u003e ({\n    score: output === expected?.answer ? 1.0 : 0.0,\n    label: output === expected?.answer ? 
\"match\" : \"no_match\",\n  }),\n});\n\nconst experiment = await runExperiment({\n  client,\n  experimentName: \"qa-experiment-v1\",\n  dataset: { datasetId: \"your-dataset-id\" },\n  task,\n  evaluators: [exactMatch],\n});\n```\n\n## Task Functions\n\n```typescript\n// Basic task\nconst task = async (example) =\u003e await callLLM(example.input.question as string);\n\n// With context (RAG)\nconst ragTask = async (example) =\u003e {\n  const prompt = `Context: ${example.input.context}\\nQ: ${example.input.question}`;\n  return await callLLM(prompt);\n};\n```\n\n## Evaluator Parameters\n\n```typescript\ninterface EvaluatorParams {\n  input: Record\u003cstring, unknown\u003e;\n  output: unknown;\n  expected: Record\u003cstring, unknown\u003e;\n  metadata: Record\u003cstring, unknown\u003e;\n}\n```\n\n## Options\n\n```typescript\nconst experiment = await runExperiment({\n  client,\n  experimentName: \"my-experiment\",\n  dataset: { datasetName: \"qa-test-v1\" },\n  task,\n  evaluators,\n  repetitions: 3, // Run each example 3 times\n  maxConcurrency: 5, // Limit concurrent executions\n});\n```\n\n## Stability\n\nSingle-run scores are noisy when either the task or the evaluator is non-deterministic — an LLM call, tool use, streaming output, an LLM-as-judge. On a small dataset, that per-run noise can swamp the signal from a prompt change.\n\nAveraging over repetitions lets the score you report reflect the prompt rather than the sampling noise:\n\n```typescript\nawait runExperiment({\n  // ...\n  repetitions: 3,\n});\n```\n\nThings to consider:\n\n- Reach for repetitions when the task or the evaluator is an LLM call and the dataset is small.\n- Prefer repetitions when per-example cost is low and you mostly want to settle the score; prefer growing the dataset when you also need to cover more behaviors.\n- Skip repetitions when both the task and the evaluator are deterministic (e.g. 
string comparison against a ground truth) — a single run is the answer.\n\nConsider adding repetitions when:\n\n- Repeat runs of the same experiment drift in ways that feel larger than the differences you're trying to measure.\n- A prompt change flips example labels in ways that don't track with how the outputs actually changed.\n- The judge's reasoning on the same output reads differently from one run to the next.\n\nWith the default `repetitions: 1`, every score comes from a single sample per example, so don't trust a tuning decision based on a single 10-example run.\n\n## Add Evaluations Later\n\n```typescript\nimport { evaluateExperiment } from \"@arizeai/phoenix-client/experiments\";\n\nawait evaluateExperiment({ client, experiment, evaluators: [newEvaluator] });\n```\n","references/experiments-synthetic-python.md":"# Experiments: Generating Synthetic Test Data\n\nCreating diverse, targeted test data for evaluation.\n\n## Dimension-Based Approach\n\nDefine axes of variation, then generate combinations:\n\n```python\ndimensions = {\n    \"issue_type\": [\"billing\", \"technical\", \"shipping\"],\n    \"customer_mood\": [\"frustrated\", \"neutral\", \"happy\"],\n    \"complexity\": [\"simple\", \"moderate\", \"complex\"],\n}\n```\n\n## Two-Step Generation\n\n1. **Generate tuples** (combinations of dimension values)\n2. **Convert to natural queries** (separate LLM call per tuple)\n\n```python\n# Step 1: Create tuples\ntuples = [\n    (\"billing\", \"frustrated\", \"complex\"),\n    (\"shipping\", \"neutral\", \"simple\"),\n]\n\n# Step 2: Convert to natural query\ndef tuple_to_query(t):\n    prompt = f\"\"\"Generate a realistic customer message:\n    Issue: {t[0]}, Mood: {t[1]}, Complexity: {t[2]}\n    \n    Write naturally, include typos if appropriate. 
Don't be formulaic.\"\"\"\n    return llm(prompt)\n```\n\n## Target Failure Modes\n\nDimensions should target known failures from error analysis:\n\n```python\n# From error analysis findings\ndimensions = {\n    \"timezone\": [\"EST\", \"PST\", \"UTC\", \"ambiguous\"],  # Known failure\n    \"date_format\": [\"ISO\", \"US\", \"EU\", \"relative\"],   # Known failure\n}\n```\n\n## Quality Control\n\n- **Validate**: Check for placeholder text, minimum length\n- **Deduplicate**: Remove near-duplicate queries using embeddings\n- **Balance**: Ensure coverage across dimension values\n\n## When to Use\n\n| Use Synthetic | Use Real Data |\n| ------------- | ------------- |\n| Limited production data | Sufficient traces |\n| Testing edge cases | Validating actual behavior |\n| Pre-launch evals | Post-launch monitoring |\n\n## Sample Sizes\n\n| Purpose | Size |\n| ------- | ---- |\n| Initial exploration | 50-100 |\n| Comprehensive eval | 100-500 |\n| Per-dimension | 10-20 per combination |\n","references/experiments-synthetic-typescript.md":"# Experiments: Generating Synthetic Test Data (TypeScript)\n\nCreating diverse, targeted test data for evaluation.\n\n## Dimension-Based Approach\n\nDefine axes of variation, then generate combinations:\n\n```typescript\nconst dimensions = {\n  issueType: [\"billing\", \"technical\", \"shipping\"],\n  customerMood: [\"frustrated\", \"neutral\", \"happy\"],\n  complexity: [\"simple\", \"moderate\", \"complex\"],\n};\n```\n\n## Two-Step Generation\n\n1. **Generate tuples** (combinations of dimension values)\n2. 
**Convert to natural queries** (separate LLM call per tuple)\n\n```typescript\nimport { generateText } from \"ai\";\nimport { openai } from \"@ai-sdk/openai\";\n\n// Step 1: Create tuples\ntype Tuple = [string, string, string];\nconst tuples: Tuple[] = [\n  [\"billing\", \"frustrated\", \"complex\"],\n  [\"shipping\", \"neutral\", \"simple\"],\n];\n\n// Step 2: Convert to natural query\nasync function tupleToQuery(t: Tuple): Promise\u003cstring\u003e {\n  const { text } = await generateText({\n    model: openai(\"gpt-4o\"),\n    prompt: `Generate a realistic customer message:\n    Issue: ${t[0]}, Mood: ${t[1]}, Complexity: ${t[2]}\n    \n    Write naturally, include typos if appropriate. Don't be formulaic.`,\n  });\n  return text;\n}\n```\n\n## Target Failure Modes\n\nDimensions should target known failures from error analysis:\n\n```typescript\n// From error analysis findings\nconst dimensions = {\n  timezone: [\"EST\", \"PST\", \"UTC\", \"ambiguous\"], // Known failure\n  dateFormat: [\"ISO\", \"US\", \"EU\", \"relative\"], // Known failure\n};\n```\n\n## Quality Control\n\n- **Validate**: Check for placeholder text, minimum length\n- **Deduplicate**: Remove near-duplicate queries using embeddings\n- **Balance**: Ensure coverage across dimension values\n\n```typescript\nfunction validateQuery(query: string): boolean {\n  const minLength = 20;\n  const hasPlaceholder = /\\[.*?\\]|\u003c.*?\u003e/.test(query);\n  return query.length \u003e= minLength \u0026\u0026 !hasPlaceholder;\n}\n```\n\n## When to Use\n\n| Use Synthetic | Use Real Data |\n| ------------- | ------------- |\n| Limited production data | Sufficient traces |\n| Testing edge cases | Validating actual behavior |\n| Pre-launch evals | Post-launch monitoring |\n\n## Sample Sizes\n\n| Purpose | Size |\n| ------- | ---- |\n| Initial exploration | 50-100 |\n| Comprehensive eval | 100-500 |\n| Per-dimension | 10-20 per combination |\n","references/fundamentals-anti-patterns.md":"# Anti-Patterns\n\nCommon 
mistakes and fixes.\n\n| Anti-Pattern | Problem | Fix |\n| ------------ | ------- | --- |\n| Generic metrics | Pre-built scores don't match your failures | Build from error analysis |\n| Vibe-based | No quantification | Measure with experiments |\n| Ignoring humans | Uncalibrated LLM judges | Validate \u003e80% TPR/TNR |\n| Premature automation | Evaluators for imagined problems | Let observed failures drive |\n| Saturation blindness | 100% pass = no signal | Keep capability evals at 50-80% |\n| Similarity metrics | BERTScore/ROUGE for generation | Use for retrieval only |\n| Model switching | Hoping a model works better | Error analysis first |\n| Single-run scoring | LLM judges and non-deterministic tasks add per-run noise that can drown the signal from a prompt change on a small dataset | Set `repetitions` on `runExperiment` (or grow the dataset) when the task or judge is an LLM call |\n\n## Quantify Changes\n\n```python\nfrom phoenix.client import Client\n\nclient = Client()\nbaseline = client.experiments.run_experiment(dataset=dataset, task=old_prompt, evaluators=evaluators)\nimproved = client.experiments.run_experiment(dataset=dataset, task=new_prompt, evaluators=evaluators)\nprint(f\"Improvement: {improved.pass_rate - baseline.pass_rate:+.1%}\")\n```\n\n## Don't Use Similarity for Generation\n\n```python\n# BAD\nscore = bertscore(output, reference)\n\n# GOOD\ncorrect_facts = check_facts_against_source(output, context)\n```\n\n## Error Analysis Before Model Change\n\n```python\n# BAD\nfor model in models:\n    results = test(model)\n\n# GOOD\nfailures = analyze_errors(results)\n# Then decide if model change is warranted\n```\n","references/fundamentals-model-selection.md":"# Model Selection\n\nError analysis first, model changes last.\n\n## Decision Tree\n\n```\nPerformance Issue?\n       │\n       ▼\nError analysis suggests model problem?\n    NO  → Fix prompts, retrieval, tools\n    YES → Is it a capability gap?\n          YES → Consider model change\n      
    NO  → Fix the actual problem\n```\n\n## Judge Model Selection\n\n| Principle | Action |\n| --------- | ------ |\n| Start capable | Use gpt-4o first |\n| Optimize later | Test cheaper after criteria stable |\n| Same model OK | Judge does different task |\n\n```python\n# Start with capable model\njudge = ClassificationEvaluator(\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    ...\n)\n\n# After validation, test cheaper\njudge_cheap = ClassificationEvaluator(\n    llm=LLM(provider=\"openai\", model=\"gpt-4o-mini\"),\n    ...\n)\n# Compare TPR/TNR on same test set\n```\n\n## Don't Model Shop\n\n```python\nfrom phoenix.client import Client\n\nclient = Client()\n\n# BAD\nfor model in [\"gpt-4o\", \"claude-3\", \"gemini-pro\"]:\n    results = client.experiments.run_experiment(\n        dataset=dataset,\n        task=lambda input, _model=model: task(input, model=_model),\n        evaluators=evaluators,\n    )\n\n# GOOD\nfailures = analyze_errors(results)\n# \"Ignores context\" → Fix prompt\n# \"Can't do math\" → Maybe try better model\n```\n\n## When Model Change Is Warranted\n\n- Failures persist after prompt optimization\n- Capability gaps (reasoning, math, code)\n- Error analysis confirms model limitation\n","references/fundamentals.md":"# Fundamentals\n\nApplication-specific tests for AI systems. 
Code first, LLM for nuance, human for truth.\n\n## Evaluator Types\n\n| Type | Speed | Cost | Use Case |\n| ---- | ----- | ---- | -------- |\n| **Code** | Fast | Cheap | Regex, JSON, format, exact match |\n| **LLM** | Medium | Medium | Subjective quality, complex criteria |\n| **Human** | Slow | Expensive | Ground truth, calibration |\n\n**Decision:** Code first → LLM only when code can't capture criteria → Human for calibration.\n\n## Score Structure\n\n| Property | Required | Description |\n| -------- | -------- | ----------- |\n| `name` | Yes | Evaluator name |\n| `kind` | Yes | `\"code\"`, `\"llm\"`, `\"human\"` |\n| `score` | No* | 0-1 numeric |\n| `label` | No* | `\"pass\"`, `\"fail\"` |\n| `explanation` | No | Rationale |\n\n*One of `score` or `label` required.\n\n## Binary \u003e Likert\n\nUse pass/fail, not 1-5 scales. Clearer criteria, easier calibration.\n\n```python\n# Multiple binary checks instead of one Likert scale\nevaluators = [\n    AnswersQuestion(),    # Yes/No\n    UsesContext(),        # Yes/No\n    NoHallucination(),    # Yes/No\n]\n```\n\n## Quick Patterns\n\n### Code Evaluator\n\n```python\nimport re\n\nfrom phoenix.evals import create_evaluator\n\n@create_evaluator(name=\"has_citation\", kind=\"code\")\ndef has_citation(output: str) -\u003e bool:\n    # Pass when the output contains a bracketed numeric citation such as [1]\n    return bool(re.search(r'\\[\\d+\\]', output))\n```\n\n### LLM Evaluator\n\n```python\nfrom phoenix.evals import ClassificationEvaluator, LLM\n\nevaluator = ClassificationEvaluator(\n    name=\"helpfulness\",\n    prompt_template=\"...\",\n    llm=LLM(provider=\"openai\", model=\"gpt-4o\"),\n    choices={\"not_helpful\": 0, \"helpful\": 1}\n)\n```\n\n### Run Experiment\n\n```python\nfrom phoenix.client.experiments import run_experiment\n\nexperiment = run_experiment(\n    dataset=dataset,\n    task=my_task,\n    evaluators=[evaluator1, evaluator2],\n)\nprint(experiment.aggregate_scores)\n```\n","references/observe-sampling-python.md":"# Observe: Sampling Strategies\n\nHow to efficiently sample production 
traces for review.\n\n## Strategies\n\n### 1. Failure-Focused (Highest Priority)\n\n```python\nerrors = spans_df[spans_df[\"status_code\"] == \"ERROR\"]\nnegative_feedback = spans_df[spans_df[\"feedback\"] == \"negative\"]\n```\n\n### 2. Outliers\n\n```python\nlong_responses = spans_df.nlargest(50, \"response_length\")\nslow_responses = spans_df.nlargest(50, \"latency_ms\")\n```\n\n### 3. Stratified (Coverage)\n\n```python\n# Sample equally from each category\nby_query_type = spans_df.groupby(\"metadata.query_type\").apply(\n    lambda x: x.sample(min(len(x), 20))\n)\n```\n\n### 4. Metric-Guided\n\n```python\n# Review traces flagged by automated evaluators\nflagged = spans_df[eval_results[\"label\"] == \"hallucinated\"]\nborderline = spans_df[(eval_results[\"score\"] \u003e 0.3) \u0026 (eval_results[\"score\"] \u003c 0.7)]\n```\n\n## Building a Review Queue\n\n```python\nimport pandas as pd\n\ndef build_review_queue(spans_df, max_traces=100):\n    # Failures first, then outliers, then a random sample for coverage\n    queue = pd.concat([\n        spans_df[spans_df[\"status_code\"] == \"ERROR\"],\n        spans_df[spans_df[\"feedback\"] == \"negative\"],\n        spans_df.nlargest(10, \"response_length\"),\n        spans_df.sample(min(30, len(spans_df))),\n    ]).drop_duplicates(\"span_id\").head(max_traces)\n    return queue\n```\n\n## Sample Size Guidelines\n\n| Purpose | Size |\n| ------- | ---- |\n| Initial exploration | 50-100 |\n| Error analysis | 100+ (until saturation) |\n| Golden dataset | 100-500 |\n| Judge calibration | 100+ per class |\n\n**Saturation:** Stop when new traces show the same failure patterns.\n\n## Trace-Level Sampling\n\nWhen you need whole requests (all spans per trace), use `get_traces`:\n\n```python\nfrom phoenix.client import Client\nfrom datetime import datetime, timedelta\n\nclient = Client()\n\n# Recent traces with full span trees\ntraces = client.traces.get_traces(\n    project_identifier=\"my-app\",\n    limit=100,\n    include_spans=True,\n)\n\n# Time-windowed sampling (e.g., last hour)\ntraces = 
client.traces.get_traces(\n    project_identifier=\"my-app\",\n    start_time=datetime.now() - timedelta(hours=1),\n    limit=50,\n    include_spans=True,\n)\n\n# Filter by session (multi-turn conversations)\ntraces = client.traces.get_traces(\n    project_identifier=\"my-app\",\n    session_id=\"user-session-abc\",\n    include_spans=True,\n)\n\n# Sort by latency to find slowest requests\ntraces = client.traces.get_traces(\n    project_identifier=\"my-app\",\n    sort=\"latency_ms\",\n    order=\"desc\",\n    limit=50,\n)\n```\n","references/observe-sampling-typescript.md":"# Observe: Sampling Strategies (TypeScript)\n\nHow to efficiently sample production traces for review.\n\n## Strategies\n\n### 1. Failure-Focused (Highest Priority)\n\nUse server-side filters to fetch only what you need:\n\n```typescript\nimport { getSpans } from \"@arizeai/phoenix-client/spans\";\n\n// Server-side filter — only ERROR spans are returned\nconst { spans: errors } = await getSpans({\n  project: { projectName: \"my-project\" },\n  statusCode: \"ERROR\",\n  limit: 100,\n});\n\n// Fetch only LLM spans\nconst { spans: llmSpans } = await getSpans({\n  project: { projectName: \"my-project\" },\n  spanKind: \"LLM\",\n  limit: 100,\n});\n\n// Filter by span name\nconst { spans: chatSpans } = await getSpans({\n  project: { projectName: \"my-project\" },\n  name: \"chat_completion\",\n  limit: 100,\n});\n```\n\n### 2. Outliers\n\n```typescript\nconst { spans } = await getSpans({\n  project: { projectName: \"my-project\" },\n  limit: 200,\n});\nconst latency = (s: (typeof spans)[number]) =\u003e\n  new Date(s.end_time).getTime() - new Date(s.start_time).getTime();\nconst sorted = [...spans].sort((a, b) =\u003e latency(b) - latency(a));\nconst slowResponses = sorted.slice(0, 50);\n```\n\n### 3. 
Stratified (Coverage)\n\n```typescript\n// Sample equally from each category\nfunction stratifiedSample\u003cT\u003e(items: T[], groupBy: (item: T) =\u003e string, perGroup: number): T[] {\n  const groups = new Map\u003cstring, T[]\u003e();\n  for (const item of items) {\n    const key = groupBy(item);\n    if (!groups.has(key)) groups.set(key, []);\n    groups.get(key)!.push(item);\n  }\n  return [...groups.values()].flatMap((g) =\u003e g.slice(0, perGroup));\n}\n\nconst { spans } = await getSpans({\n  project: { projectName: \"my-project\" },\n  limit: 500,\n});\nconst byQueryType = stratifiedSample(spans, (s) =\u003e s.attributes?.[\"metadata.query_type\"] ?? \"unknown\", 20);\n```\n\n### 4. Metric-Guided\n\n```typescript\nimport { getSpanAnnotations } from \"@arizeai/phoenix-client/spans\";\n\n// Fetch annotations for your spans, then filter by label\nconst { annotations } = await getSpanAnnotations({\n  project: { projectName: \"my-project\" },\n  spanIds: spans.map((s) =\u003e s.context.span_id),\n  includeAnnotationNames: [\"hallucination\"],\n});\n\nconst flaggedSpanIds = new Set(\n  annotations.filter((a) =\u003e a.result?.label === \"hallucinated\").map((a) =\u003e a.span_id)\n);\nconst flagged = spans.filter((s) =\u003e flaggedSpanIds.has(s.context.span_id));\n```\n\n## Trace-Level Sampling\n\nWhen you need whole requests (all spans in a trace), use `getTraces`:\n\n```typescript\nimport { getTraces } from \"@arizeai/phoenix-client/traces\";\n\n// Recent traces with full span trees\nconst { traces } = await getTraces({\n  project: { projectName: \"my-project\" },\n  limit: 100,\n  includeSpans: true,\n});\n\n// Filter by session (e.g., multi-turn conversations)\nconst { traces: sessionTraces } = await getTraces({\n  project: { projectName: \"my-project\" },\n  sessionId: \"user-session-abc\",\n  includeSpans: true,\n});\n\n// Time-windowed sampling\nconst { traces: recentTraces } = await getTraces({\n  project: { projectName: \"my-project\" },\n  
startTime: new Date(Date.now() - 60 * 60 * 1000), // last hour\n  limit: 50,\n  includeSpans: true,\n});\n```\n\n## Building a Review Queue\n\n```typescript\n// Combine server-side filters into a review queue\nconst { spans: errorSpans } = await getSpans({\n  project: { projectName: \"my-project\" },\n  statusCode: \"ERROR\",\n  limit: 30,\n});\nconst { spans: allSpans } = await getSpans({\n  project: { projectName: \"my-project\" },\n  limit: 100,\n});\nconst random = allSpans.sort(() =\u003e Math.random() - 0.5).slice(0, 30);\n\nconst combined = [...errorSpans, ...random];\nconst unique = [...new Map(combined.map((s) =\u003e [s.context.span_id, s])).values()];\nconst reviewQueue = unique.slice(0, 100);\n```\n\n## Sample Size Guidelines\n\n| Purpose | Size |\n| ------- | ---- |\n| Initial exploration | 50-100 |\n| Error analysis | 100+ (until saturation) |\n| Golden dataset | 100-500 |\n| Judge calibration | 100+ per class |\n\n**Saturation:** Stop when new traces show the same failure patterns.\n","references/observe-tracing-setup.md":"# Observe: Tracing Setup\n\nConfigure tracing to capture data for evaluation.\n\n## Quick Setup\n\n```python\n# Python\nfrom phoenix.otel import register\n\nregister(project_name=\"my-app\", auto_instrument=True)\n```\n\n```typescript\n// TypeScript\nimport { registerPhoenix } from \"@arizeai/phoenix-otel\";\n\nregisterPhoenix({ projectName: \"my-app\", autoInstrument: true });\n```\n\n## Essential Attributes\n\n| Attribute | Why It Matters |\n| --------- | -------------- |\n| `input.value` | User's request |\n| `output.value` | Response to evaluate |\n| `retrieval.documents` | Context for faithfulness |\n| `tool.name`, `tool.parameters` | Agent evaluation |\n| `llm.model_name` | Track by model |\n\n## Custom Attributes for Evals\n\n```python\nspan.set_attribute(\"metadata.client_type\", \"enterprise\")\nspan.set_attribute(\"metadata.query_category\", \"billing\")\n```\n\n## Exporting for Evaluation\n\n### Spans (Python — 
DataFrame)\n\n```python\nfrom phoenix.client import Client\n\n# Client() works for local Phoenix (falls back to env vars or localhost:6006)\n# For remote/cloud: Client(base_url=\"https://app.phoenix.arize.com\", api_key=\"...\")\nclient = Client()\nspans_df = client.spans.get_spans_dataframe(\n    project_identifier=\"my-app\",  # NOT project_name= (deprecated)\n    root_spans_only=True,\n)\n\ndataset = client.datasets.create_dataset(\n    name=\"error-analysis-set\",\n    dataframe=spans_df[[\"input.value\", \"output.value\"]],\n    input_keys=[\"input.value\"],\n    output_keys=[\"output.value\"],\n)\n```\n\n### Spans (TypeScript)\n\n```typescript\nimport { getSpans } from \"@arizeai/phoenix-client/spans\";\n\nconst { spans } = await getSpans({\n  project: { projectName: \"my-app\" },\n  parentId: null, // root spans only\n  limit: 100,\n});\n```\n\n### Traces (Python — structured)\n\nUse `get_traces` when you need full trace trees (e.g., multi-turn conversations, agent workflows):\n\n```python\nfrom datetime import datetime, timedelta\n\ntraces = client.traces.get_traces(\n    project_identifier=\"my-app\",\n    start_time=datetime.now() - timedelta(hours=24),\n    include_spans=True,  # includes all spans per trace\n    limit=100,\n)\n# Each trace has: trace_id, start_time, end_time, spans (when include_spans=True)\n```\n\n### Traces (TypeScript)\n\n```typescript\nimport { getTraces } from \"@arizeai/phoenix-client/traces\";\n\nconst { traces } = await getTraces({\n  project: { projectName: \"my-app\" },\n  startTime: new Date(Date.now() - 24 * 60 * 60 * 1000),\n  includeSpans: true,\n  limit: 100,\n});\n```\n\n## Uploading Evaluations as Annotations\n\n### Python\n\n```python\nfrom phoenix.evals import evaluate_dataframe\nfrom phoenix.evals.utils import to_annotation_dataframe\n\n# Run evaluations\nresults_df = evaluate_dataframe(dataframe=spans_df, evaluators=[my_eval])\n\n# Format results for Phoenix annotations\nannotations_df = 
to_annotation_dataframe(results_df)\n\n# Upload to Phoenix\nclient.spans.log_span_annotations_dataframe(dataframe=annotations_df)\n```\n\n### TypeScript\n\n```typescript\nimport { logSpanAnnotations } from \"@arizeai/phoenix-client/spans\";\n\nawait logSpanAnnotations({\n  spanAnnotations: [\n    {\n      spanId: \"abc123\",\n      name: \"quality\",\n      label: \"good\",\n      score: 0.95,\n      annotatorKind: \"LLM\",\n    },\n  ],\n});\n```\n\nAnnotations are visible in the Phoenix UI alongside your traces.\n\n## Verify\n\nRequired attributes: `input.value`, `output.value`, `status_code`\nFor RAG: `retrieval.documents`\nFor agents: `tool.name`, `tool.parameters`\n","references/production-continuous.md":"# Production: Continuous Evaluation\n\nCapability vs regression evals and the ongoing feedback loop.\n\n## Two Types of Evals\n\n| Type | Pass Rate Target | Purpose | Update |\n| ---- | ---------------- | ------- | ------ |\n| **Capability** | 50-80% | Measure improvement | Add harder cases |\n| **Regression** | 95-100% | Catch breakage | Add fixed bugs |\n\n## Saturation\n\nWhen capability evals hit \u003e95% pass rate, they're saturated:\n1. Graduate passing cases to regression suite\n2. Add new challenging cases to capability suite\n\n## Feedback Loop\n\n```\nProduction → Sample traffic → Run evaluators → Find failures\n    ↑                                              ↓\nDeploy  ←  Run CI evals  ←  Create test cases  ←  Error analysis\n```\n\n## Implementation\n\nBuild a continuous monitoring loop:\n\n1. **Sample recent traces** at regular intervals (e.g., 100 traces per hour)\n2. **Run evaluators** on sampled traces\n3. **Log results** to Phoenix for tracking\n4. **Queue concerning results** for human review\n5. **Create test cases** from recurring failure patterns\n\n### Python\n\n```python\nfrom phoenix.client import Client\nfrom datetime import datetime, timedelta\n\nclient = Client()\n\n# 1. 
Sample recent spans (includes full attributes for evaluation)\nspans_df = client.spans.get_spans_dataframe(\n    project_identifier=\"my-app\",\n    start_time=datetime.now() - timedelta(hours=1),\n    root_spans_only=True,\n    limit=100,\n)\n\n# 2. Run evaluators\nfrom phoenix.evals import evaluate_dataframe\n\nresults_df = evaluate_dataframe(\n    dataframe=spans_df,\n    evaluators=[quality_eval, safety_eval],\n)\n\n# 3. Upload results as annotations\nfrom phoenix.evals.utils import to_annotation_dataframe\n\nannotations_df = to_annotation_dataframe(results_df)\nclient.spans.log_span_annotations_dataframe(dataframe=annotations_df)\n```\n\n### TypeScript\n\n```typescript\nimport { getSpans, logSpanAnnotations } from \"@arizeai/phoenix-client/spans\";\n\n// 1. Sample recent spans\nconst { spans } = await getSpans({\n  project: { projectName: \"my-app\" },\n  startTime: new Date(Date.now() - 60 * 60 * 1000),\n  parentId: null, // root spans only\n  limit: 100,\n});\n\n// 2. Run evaluators (user-defined)\nconst results = await Promise.all(\n  spans.map(async (span) =\u003e ({\n    spanId: span.context.span_id,\n    ...(await runEvaluators(span, [qualityEval, safetyEval])),\n  }))\n);\n\n// 3. 
Upload results as annotations\nawait logSpanAnnotations({\n  spanAnnotations: results.map((r) =\u003e ({\n    spanId: r.spanId,\n    name: \"quality\",\n    score: r.qualityScore,\n    label: r.qualityLabel,\n    annotatorKind: \"LLM\" as const,\n  })),\n});\n```\n\nFor trace-level monitoring (e.g., agent workflows), use `get_traces`/`getTraces` to identify the traces worth reviewing:\n\n```python\n# Python: identify slow traces\ntraces = client.traces.get_traces(\n    project_identifier=\"my-app\",\n    start_time=datetime.now() - timedelta(hours=1),\n    sort=\"latency_ms\",\n    order=\"desc\",\n    limit=50,\n)\n```\n\n```typescript\n// TypeScript: fetch the last hour of traces (this call applies no latency sort)\nimport { getTraces } from \"@arizeai/phoenix-client/traces\";\n\nconst { traces } = await getTraces({\n  project: { projectName: \"my-app\" },\n  startTime: new Date(Date.now() - 60 * 60 * 1000),\n  limit: 50,\n});\n```\n\n## Alerting\n\n| Condition | Severity | Action |\n| --------- | -------- | ------ |\n| Regression \u003c 98% | Critical | Page oncall |\n| Capability declining | Warning | Slack notify |\n| Capability \u003e 95% for 7d | Info | Schedule review |\n\n## Key Principles\n\n- **Two suites** - Capability + Regression always\n- **Graduate cases** - Move consistent passes to regression\n- **Track trends** - Monitor over time, not just snapshots\n","references/production-guardrails.md":"# Production: Guardrails vs Evaluators\n\nGuardrails block in real time. 
Evaluators measure asynchronously.\n\n## Key Distinction\n\n```\nRequest → [INPUT GUARDRAIL] → LLM → [OUTPUT GUARDRAIL] → Response\n                                            │\n                                            └──→ ASYNC EVALUATOR (background)\n```\n\n## Guardrails\n\n| Aspect | Requirement |\n| ------ | ----------- |\n| Timing | Synchronous, blocking |\n| Latency | \u003c 100ms |\n| Purpose | Prevent harm |\n| Type | Code-based (deterministic) |\n\n**Use for:** PII detection, prompt injection, profanity, length limits, format validation.\n\n## Evaluators\n\n| Aspect | Characteristic |\n| ------ | -------------- |\n| Timing | Async, background |\n| Latency | Can be seconds |\n| Purpose | Measure quality |\n| Type | Can use LLMs |\n\n**Use for:** Helpfulness, faithfulness, tone, completeness, citation accuracy.\n\n## Decision\n\n| Question | Answer |\n| -------- | ------ |\n| Must block harmful content? | Guardrail |\n| Measuring quality? | Evaluator |\n| Need LLM judgment? | Evaluator |\n| \u003c 100ms required? | Guardrail |\n| False positives = angry users? | Evaluator |\n\n## LLM Guardrails: Rarely\n\nOnly use LLM guardrails if:\n- Latency budget \u003e 1s\n- Error cost \u003e\u003e LLM cost\n- Low volume\n- Fallback exists\n\n**Key Principle:** Guardrails prevent harm (block). 
Evaluators measure quality (log).\n","references/production-overview.md":"# Production: Overview\n\nCI/CD evals vs production monitoring - complementary approaches.\n\n## Two Evaluation Modes\n\n| Aspect | CI/CD Evals | Production Monitoring |\n| ------ | ----------- | -------------------- |\n| **When** | Pre-deployment | Post-deployment, ongoing |\n| **Data** | Fixed dataset | Sampled traffic |\n| **Goal** | Prevent regression | Detect drift |\n| **Response** | Block deploy | Alert \u0026 analyze |\n\n## CI/CD Evaluations\n\n```python\nfrom phoenix.client import Client\n\nclient = Client()\n\n# Fast, deterministic checks\nci_evaluators = [\n    has_required_format,\n    no_pii_leak,\n    safety_check,\n    regression_test_suite,\n]\n\n# Small but representative dataset (~100 examples)\nclient.experiments.run_experiment(dataset=ci_dataset, task=task, evaluators=ci_evaluators)\n```\n\nSet thresholds: regression=0.95, safety=1.0, format=0.98.\n\n## Production Monitoring\n\n### Python\n\n```python\nfrom phoenix.client import Client\nfrom datetime import datetime, timedelta\n\nclient = Client()\n\n# Sample recent traces (last hour)\ntraces = client.traces.get_traces(\n    project_identifier=\"my-app\",\n    start_time=datetime.now() - timedelta(hours=1),\n    include_spans=True,\n    limit=100,\n)\n\n# Run evaluators on sampled traffic\nfor trace in traces:\n    results = run_evaluators_async(trace, production_evaluators)\n    if any(r[\"score\"] \u003c 0.5 for r in results):\n        alert_on_failure(trace, results)\n```\n\n### TypeScript\n\n```typescript\nimport { getTraces } from \"@arizeai/phoenix-client/traces\";\nimport { getSpans } from \"@arizeai/phoenix-client/spans\";\n\n// Sample recent traces (last hour)\nconst { traces } = await getTraces({\n  project: { projectName: \"my-app\" },\n  startTime: new Date(Date.now() - 60 * 60 * 1000),\n  includeSpans: true,\n  limit: 100,\n});\n\n// Or sample spans directly for evaluation\nconst { spans } = await 
getSpans({\n  project: { projectName: \"my-app\" },\n  startTime: new Date(Date.now() - 60 * 60 * 1000),\n  limit: 100,\n});\n\n// Run evaluators on sampled traffic\nfor (const span of spans) {\n  const results = await runEvaluators(span, productionEvaluators);\n  if (results.some((r) =\u003e r.score \u003c 0.5)) {\n    await alertOnFailure(span, results);\n  }\n}\n```\n\nPrioritize: errors → negative feedback → random sample.\n\n## Feedback Loop\n\n```\nProduction finds failure → Error analysis → Add to CI dataset → Prevents future regression\n```\n","references/setup-python.md":"# Setup: Python\n\nPackages required for Phoenix evals and experiments.\n\n## Installation\n\n```bash\n# Core Phoenix package (includes client, evals, otel)\npip install arize-phoenix\n\n# Or install individual packages\npip install arize-phoenix-client   # Phoenix client only\npip install arize-phoenix-evals    # Evaluation utilities\npip install arize-phoenix-otel     # OpenTelemetry integration\n```\n\n## LLM Providers\n\nFor LLM-as-judge evaluators, install your provider's SDK:\n\n```bash\npip install openai      # OpenAI\npip install anthropic   # Anthropic\npip install google-generativeai  # Google\n```\n\n## Validation (Optional)\n\n```bash\npip install scikit-learn  # For TPR/TNR metrics\n```\n\n## Quick Verify\n\n```python\nfrom phoenix.client import Client\nfrom phoenix.evals import LLM, ClassificationEvaluator\nfrom phoenix.otel import register\n\n# All imports should work\nprint(\"Phoenix Python setup complete\")\n```\n\n## Key Imports (Evals 2.0)\n\n```python\nfrom phoenix.client import Client\nfrom phoenix.evals import (\n    ClassificationEvaluator,      # LLM classification evaluator (preferred)\n    LLM,                          # Provider-agnostic LLM wrapper\n    async_evaluate_dataframe,     # Batch evaluate a DataFrame (preferred, async)\n    evaluate_dataframe,           # Batch evaluate a DataFrame (sync)\n    create_evaluator,             # Decorator for code-based 
evaluators\n    create_classifier,            # Factory for LLM classification evaluators\n    bind_evaluator,               # Map column names to evaluator params\n    Score,                        # Score dataclass\n)\nfrom phoenix.evals.utils import to_annotation_dataframe  # Format results for Phoenix annotations\n```\n\n**Prefer**: `ClassificationEvaluator` over `create_classifier` (more parameters/customization).\n**Prefer**: `async_evaluate_dataframe` over `evaluate_dataframe` (better throughput for LLM evals).\n\n**Do NOT use** legacy 1.0 imports: `OpenAIModel`, `AnthropicModel`, `run_evals`, `llm_classify`.\n","references/setup-typescript.md":"# Setup: TypeScript\n\nPackages required for Phoenix evals and experiments.\n\n## Installation\n\n```bash\n# Using npm\nnpm install @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel\n\n# Using pnpm\npnpm add @arizeai/phoenix-client @arizeai/phoenix-evals @arizeai/phoenix-otel\n```\n\n## LLM Providers\n\nFor LLM-as-judge evaluators, install Vercel AI SDK providers:\n\n```bash\nnpm install ai @ai-sdk/openai      # Vercel AI SDK + OpenAI\nnpm install @ai-sdk/anthropic      # Anthropic\nnpm install @ai-sdk/google         # Google\n```\n\nOr use direct provider SDKs:\n\n```bash\nnpm install openai                 # OpenAI direct\nnpm install @anthropic-ai/sdk      # Anthropic direct\n```\n\n## Quick Verify\n\n```typescript\nimport { createClient } from \"@arizeai/phoenix-client\";\nimport { createClassificationEvaluator } from \"@arizeai/phoenix-evals\";\nimport { registerPhoenix } from \"@arizeai/phoenix-otel\";\n\n// All imports should work\nconsole.log(\"Phoenix TypeScript setup complete\");\n```\n","references/validation-evaluators-python.md":"# Validating Evaluators (Python)\n\nValidate LLM evaluators against human-labeled examples. 
Target \u003e80% TPR/TNR/Accuracy.\n\n## Calculate Metrics\n\n```python\nfrom sklearn.metrics import classification_report, confusion_matrix\n\nprint(classification_report(human_labels, evaluator_predictions))\n\ncm = confusion_matrix(human_labels, evaluator_predictions)\ntn, fp, fn, tp = cm.ravel()\ntpr = tp / (tp + fn)\ntnr = tn / (tn + fp)\nprint(f\"TPR: {tpr:.2f}, TNR: {tnr:.2f}\")\n```\n\n## Correct Production Estimates\n\n```python\ndef correct_estimate(observed, tpr, tnr):\n    \"\"\"Adjust observed pass rate using known TPR/TNR.\"\"\"\n    return (observed - (1 - tnr)) / (tpr - (1 - tnr))\n```\n\n## Find Misclassified\n\n```python\n# False Positives: Evaluator pass, human fail\nfp_mask = (evaluator_predictions == 1) \u0026 (human_labels == 0)\nfalse_positives = dataset[fp_mask]\n\n# False Negatives: Evaluator fail, human pass\nfn_mask = (evaluator_predictions == 0) \u0026 (human_labels == 1)\nfalse_negatives = dataset[fn_mask]\n```\n\n## Red Flags\n\n- TPR or TNR \u003c 70%\n- Large gap between TPR and TNR\n- Kappa \u003c 0.6\n","references/validation-evaluators-typescript.md":"# Validating Evaluators (TypeScript)\n\nValidate an LLM evaluator against human-labeled examples before deploying it.\nTarget: **\u003e80% TPR and \u003e80% TNR**.\n\nRoles are inverted compared to a normal task experiment:\n\n| Normal experiment | Evaluator validation |\n|---|---|\n| Task = agent logic | Task = run the evaluator under test |\n| Evaluator = judge output | Evaluator = exact-match vs human ground truth |\n| Dataset = agent examples | Dataset = golden hand-labeled examples |\n\n## Golden Dataset\n\nUse a separate dataset name so validation experiments don't mix with task experiments in Phoenix.\nStore human ground truth in `metadata.groundTruthLabel`. 
Aim for ~50/50 balance:\n\n```typescript\nimport type { Example } from \"@arizeai/phoenix-client/types/datasets\";\n\nconst goldenExamples: Example[] = [\n  { input: { q: \"Capital of France?\" }, output: { answer: \"Paris\" },       metadata: { groundTruthLabel: \"correct\" } },\n  { input: { q: \"Capital of France?\" }, output: { answer: \"Lyon\" },        metadata: { groundTruthLabel: \"incorrect\" } },\n  { input: { q: \"Capital of France?\" }, output: { answer: \"Major city...\" }, metadata: { groundTruthLabel: \"incorrect\" } },\n];\n\nconst VALIDATOR_DATASET = \"my-app-qa-evaluator-validation\"; // separate from task dataset\nconst POSITIVE_LABEL = \"correct\";\nconst NEGATIVE_LABEL = \"incorrect\";\n```\n\n## Validation Experiment\n\n```typescript\nimport { createClient } from \"@arizeai/phoenix-client\";\nimport { createOrGetDataset, getDatasetExamples } from \"@arizeai/phoenix-client/datasets\";\nimport { asExperimentEvaluator, runExperiment } from \"@arizeai/phoenix-client/experiments\";\nimport { myEvaluator } from \"./myEvaluator.js\";\n\nconst client = createClient();\n\nconst { datasetId } = await createOrGetDataset({ client, name: VALIDATOR_DATASET, examples: goldenExamples });\nconst { examples } = await getDatasetExamples({ client, dataset: { datasetId } });\nconst groundTruth = new Map(examples.map((ex) =\u003e [ex.id, ex.metadata?.groundTruthLabel as string]));\n\n// Task: invoke the evaluator under test\nconst task = async (example: (typeof examples)[number]) =\u003e {\n  const result = await myEvaluator.evaluate({ input: example.input, output: example.output, metadata: example.metadata });\n  return result.label ?? \"unknown\";\n};\n\n// Evaluator: exact-match against human ground truth\nconst exactMatch = asExperimentEvaluator({\n  name: \"exact-match\", kind: \"CODE\",\n  evaluate: ({ output, metadata }) =\u003e {\n    const expected = metadata?.groundTruthLabel as string;\n    const predicted = typeof output === \"string\" ? 
output : \"unknown\";\n    return { score: predicted === expected ? 1 : 0, label: predicted, explanation: `Expected: ${expected}, Got: ${predicted}` };\n  },\n});\n\nconst experiment = await runExperiment({\n  client, experimentName: `evaluator-validation-${Date.now()}`,\n  dataset: { datasetId }, task, evaluators: [exactMatch],\n});\n\n// Compute confusion matrix\nconst runs = Object.values(experiment.runs);\nconst predicted = new Map((experiment.evaluationRuns ?? [])\n  .filter((e) =\u003e e.name === \"exact-match\")\n  .map((e) =\u003e [e.experimentRunId, e.result?.label ?? null]));\n\nlet tp = 0, fp = 0, tn = 0, fn = 0;\nfor (const run of runs) {\n  if (run.error) continue;\n  const p = predicted.get(run.id), a = groundTruth.get(run.datasetExampleId);\n  if (!p || !a) continue;\n  if (a === POSITIVE_LABEL \u0026\u0026 p === POSITIVE_LABEL) tp++;\n  else if (a === NEGATIVE_LABEL \u0026\u0026 p === POSITIVE_LABEL) fp++;\n  else if (a === NEGATIVE_LABEL \u0026\u0026 p === NEGATIVE_LABEL) tn++;\n  else if (a === POSITIVE_LABEL \u0026\u0026 p === NEGATIVE_LABEL) fn++;\n}\nconst total = tp + fp + tn + fn;\nconst tpr = tp + fn \u003e 0 ? (tp / (tp + fn)) * 100 : 0;\nconst tnr = tn + fp \u003e 0 ? 
(tn / (tn + fp)) * 100 : 0;\nconsole.log(`TPR: ${tpr.toFixed(1)}%  TNR: ${tnr.toFixed(1)}%  Accuracy: ${((tp + tn) / total * 100).toFixed(1)}%`);\n```\n\n## Results \u0026 Quality Rules\n\nWith `POSITIVE_LABEL = \"correct\"`, as defined above:\n\n| Metric | Target | Low value means |\n|---|---|---|\n| TPR (sensitivity) | \u003e80% | Rejects correct outputs (false negatives) |\n| TNR (specificity) | \u003e80% | Passes incorrect outputs (false positives) |\n| Accuracy | \u003e80% | General weakness |\n\n**Golden dataset rules:** ~50/50 balance · include edge cases · human-labeled only · never mutate (append new versions) · 20–50 examples is enough.\n\n**Re-validate when:** prompt template changes · judge model changes · criteria updated · production FP/FN spike.\n\n## See Also\n\n- `validation.md` — Metric definitions and concepts\n- `experiments-running-typescript.md` — `runExperiment` API\n- `experiments-datasets-typescript.md` — `createOrGetDataset` / `getDatasetExamples`\n","references/validation.md":"# Validation\n\nValidate LLM judges against human labels before deploying. 
Target \u003e80% agreement.\n\n## Requirements\n\n| Requirement | Target |\n| ----------- | ------ |\n| Test set size | 100+ examples |\n| Balance | ~50/50 pass/fail |\n| Accuracy | \u003e80% |\n| TPR/TNR | Both \u003e70% |\n\n## Metrics\n\n| Metric | Formula | Use When |\n| ------ | ------- | -------- |\n| **Accuracy** | (TP+TN) / Total | General |\n| **TPR (Recall)** | TP / (TP+FN) | Quality assurance |\n| **TNR (Specificity)** | TN / (TN+FP) | Safety-critical |\n| **Cohen's Kappa** | Agreement beyond chance | Comparing evaluators |\n\n## Quick Validation\n\n```python\nfrom sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score\n\nprint(classification_report(human_labels, evaluator_predictions))\nprint(f\"Kappa: {cohen_kappa_score(human_labels, evaluator_predictions):.3f}\")\n\n# Get TPR/TNR\ncm = confusion_matrix(human_labels, evaluator_predictions)\ntn, fp, fn, tp = cm.ravel()\ntpr = tp / (tp + fn)\ntnr = tn / (tn + fp)\n```\n\n## Golden Dataset Structure\n\n```python\ngolden_example = {\n    \"input\": \"What is the capital of France?\",\n    \"output\": \"Paris is the capital.\",\n    \"ground_truth_label\": \"correct\",\n}\n```\n\n## Building Golden Datasets\n\n1. Sample production traces (errors, negative feedback, edge cases)\n2. Balance ~50/50 pass/fail\n3. Expert labels each example\n4. 
Version datasets (never modify existing)\n\n```python\n# GOOD - create new version\ngolden_v2 = golden_v1 + [new_example]\n\n# BAD - never modify existing\ngolden_v1.append(new_example)\n```\n\n## Warning Signs\n\n- All pass or all fail → too lenient/strict\n- Random results → criteria unclear\n- TPR/TNR \u003c 70% → needs improvement\n\n## Re-Validate When\n\n- Prompt template changes\n- Judge model changes\n- Criteria changes\n- Monthly\n"},"import":{"commit_sha":"541b7819d8c3545c6df122491af4fa1eae415779","imported_at":"2026-05-18T20:05:35Z","license_text":"MIT License\n\nCopyright GitHub, Inc.\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.","owner":"github","repo":"github/awesome-copilot","source_url":"https://github.com/github/awesome-copilot/tree/541b7819d8c3545c6df122491af4fa1eae415779/plugins/phoenix/skills/phoenix-evals"}},"content_hash":[128,117,178,159,169,30,252,206,37,189,215,150,92,41,59,244,241,108,149,227,109,165,161,72,171,33,34,143,160,0,89,124],"trust_level":"unsigned","yanked":false}
