Unit testy pre AI, eval frameworks, red-teaming
Nastaviť eval pipeline s DeepEval, LLM-as-judge, a red-teaming. 35% real-world AI incidents bolo spôsobených jednoduchými promptmi.
Pytest-native LLM eval framework. 50+ metrík vrátane hallucination, relevancy, task completion.
docsPraktický tutorial: setup, test cases, evaluations, interpretácia výsledkov.
tutorialLangSmith: datasets, evaluators (human, heuristic, LLM-as-judge), tracing.
tutorialUS government guide: AI red teaming ako TEVV. NIST AI Risk Management Framework.
officialMicrosoft PyRIT, DeepTeam, Giskard. Prompt injection, PII leakage, hallucination probing.
comparisonNastav eval pipeline pre tvojho ReAct agenta.