Week 2 · Day 13/30

Evaluation & Testing AI Systems

Unit testy pre AI, eval frameworks, red-teaming

📅 2026-03-16 ⏱️ 6-7 hodín 📊 Agent Systems
Celkový progres 43%

🎯 Cieľ dňa

Nastaviť eval pipeline s DeepEval, LLM-as-judge, a red-teaming. 35% real-world AI incidents bolo spôsobených jednoduchými promptmi.

core practice

📚 Study Resources

DeepEval — Getting Started

Pytest-native LLM eval framework. 50+ metrík vrátane hallucination, relevancy, task completion.

docs
💻

DataCamp — Evaluate LLMs with DeepEval

Praktický tutorial: setup, test cases, evaluations, interpretácia výsledkov.

tutorial
📊

Analytics Vidhya — LangSmith Evaluation

LangSmith: datasets, evaluators (human, heuristic, LLM-as-judge), tracing.

tutorial
🛡️

CISA — AI Red Teaming Guide

US government guide: AI red teaming ako TEVV. NIST AI Risk Management Framework.

official
⚔️

Giskard — Best 7 AI Red Teaming Tools

Microsoft PyRIT, DeepTeam, Giskard. Prompt injection, PII leakage, hallucination probing.

comparison

💡 Key Concepts

LLM-as-Judge — Jeden LLM evaluuje output iného podľa definovaných kritérií. Dominantný eval prístup 2025.
Evaluation Metrics — Answer relevancy, faithfulness, task completion, coherence, toxicity, bias
Red Teaming — Adversarial testing: prompt injection, jailbreaking, data leakage. 35% incidentov = simple prompts.
Tracing — Zaznamenanie každého kroku agenta (inputs, outputs, tool calls, latency) pre debugging
Regression Testing — Eval suites v CI/CD zachytia quality regressions pred deploymentom. DeepEval + pytest.

🔧 Praktické cvičenie

Nastav eval pipeline pre tvojho ReAct agenta.

  1. Nainštaluj DeepEval, napíš 10+ test casov
  2. Pokry: správny tool selection, accurate answers, hallucination resistance
  3. Implementuj LLM-as-judge scoring pre answer quality
  4. Pridaj 3 adversarial testy: prompt injection, harmful content, hallucination trigger
  5. Spusti deepeval test run a analyzuj výsledky