Day 13: Evaluation & Testing AI Systems

🎯 Cieľ dňa

Nastaviť eval pipeline s DeepEval, LLM-as-judge, a red-teaming. 35% real-world AI incidents bolo spôsobených jednoduchými promptmi.

core practice

Pytest-native LLM eval framework. 50+ metrík vrátane hallucination, relevancy, task completion.

Praktický tutorial: setup, test cases, evaluations, interpretácia výsledkov.

LangSmith: datasets, evaluators (human, heuristic, LLM-as-judge), tracing.

US government guide: AI red teaming ako TEVV. NIST AI Risk Management Framework.

Microsoft PyRIT, DeepTeam, Giskard. Prompt injection, PII leakage, hallucination probing.

LLM-as-Judge — Jeden LLM evaluuje output iného podľa definovaných kritérií. Dominantný eval prístup 2025.

Evaluation Metrics — Answer relevancy, faithfulness, task completion, coherence, toxicity, bias

Red Teaming — Adversarial testing: prompt injection, jailbreaking, data leakage. 35% incidentov = simple prompts.

Tracing — Zaznamenanie každého kroku agenta (inputs, outputs, tool calls, latency) pre debugging

Regression Testing — Eval suites v CI/CD zachytia quality regressions pred deploymentom. DeepEval + pytest.

Nastav eval pipeline pre tvojho ReAct agenta.

Nainštaluj DeepEval, napíš 10+ test casov
Pokry: správny tool selection, accurate answers, hallucination resistance
Implementuj LLM-as-judge scoring pre answer quality
Pridaj 3 adversarial testy: prompt injection, harmful content, hallucination trigger
Spusti deepeval test run a analyzuj výsledky