Day 25: Private Benchmarks & Evals

🎯 Goal of the day

Build a custom evaluation pipeline specific to your use cases. Public benchmarks are not enough, because they do not measure how a model behaves on your own inputs.

core practice

A guide to building your own LLM eval metrics and integrating them into CI/CD.

LLM-as-judge with chain-of-thought for evaluating against ANY custom criterion.

A practical Python tutorial: setup, test cases, custom metrics.

Private Benchmarks: eval suites specific to your use case. Real questions, real expected outputs.

G-Eval: LLM-as-judge plus chain-of-thought (CoT) for evaluating against custom criteria. More flexible than fixed metrics.

Continuous Evaluation: evals run inside the CI/CD pipeline. Every commit triggers the eval suite, which catches regressions.

Golden Dataset: a curated set of input-output pairs. Ground truth for systematic testing.
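
The golden dataset and continuous evaluation ideas above can be sketched together. This is a minimal sketch under stated assumptions, not a definitive implementation: the JSONL layout (the `input` and `expected` fields), the `normalize` rule, the `run_suite` and `gate` helpers, the `fake_model` stand-in, and the 0.8 threshold are all hypothetical choices made for illustration. The model is passed in as a plain callable so the same suite could be pointed at any agent.

```python
import json
from typing import Callable, Dict, Iterable, List

# Hypothetical golden dataset: one JSON object per line (JSONL).
# The field names "input" and "expected" are an assumption, not a standard.
GOLDEN_JSONL = """\
{"input": "What is 2 + 2?", "expected": "4"}
{"input": "Capital of Slovakia?", "expected": "Bratislava"}
{"input": "Opposite of hot?", "expected": "cold"}
"""


def load_golden(text: str) -> List[Dict[str, str]]:
    """Parse JSONL text into a list of test cases, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def normalize(answer: str) -> str:
    """Crude normalization (assumed rule): trim, lowercase, drop trailing periods."""
    return answer.strip().lower().rstrip(".")


def run_suite(model: Callable[[str], str], cases: Iterable[Dict[str, str]]) -> float:
    """Return the fraction of cases whose normalized answer matches exactly."""
    cases = list(cases)
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(
        normalize(model(case["input"])) == normalize(case["expected"])
        for case in cases
    )
    return passed / len(cases)


def gate(score: float, threshold: float = 0.8) -> int:
    """Map a score to a process exit code: 0 passes the CI job, 1 fails it."""
    return 0 if score >= threshold else 1


def fake_model(prompt: str) -> str:
    """Stand-in for a real agent; it gets the last question wrong on purpose."""
    canned = {
        "What is 2 + 2?": "4",
        "Capital of Slovakia?": "bratislava.",
        "Opposite of hot?": "warm",
    }
    return canned[prompt]
```

In a CI job, a small script could compute `score = run_suite(my_agent, load_golden(text))` and finish with `sys.exit(gate(score))`, so that a commit which drops accuracy below the threshold fails the build. Exact match is the simplest possible metric; it suits short factual answers and is usually too strict for free-form text.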

Build a private eval suite for your Ollama agents.
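
One possible starting point for an LLM-as-judge step against a local Ollama server is sketched below. It is a simplified, G-Eval-inspired judge, not the G-Eval method itself: it asks for step-by-step reasoning followed by a score line, and it omits G-Eval's token-probability weighting. The endpoint is Ollama's default local `/api/generate` route; the model name `llama3`, the 1-5 scale, the `SCORE:` output convention, and the helper names (`ask_ollama`, `build_judge_prompt`, `parse_score`, `judge`) are assumptions for illustration.

```python
import json
import re
import urllib.request
from typing import Callable, Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def ask_ollama(prompt: str, model: str = "llama3") -> str:
    """Send one non-streaming prompt to a local Ollama server.

    The model name is an assumption; use whichever model you have pulled.
    """
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


def build_judge_prompt(criterion: str, question: str, answer: str) -> str:
    """Ask the judge to reason step by step, then emit a parseable score line."""
    return (
        "You are grading an AI agent's answer.\n"
        f"Criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "First explain your reasoning step by step.\n"
        "Then finish with a final line in the form 'SCORE: <integer 1-5>'."
    )


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last 'SCORE: n' value in the range 1-5, or None if absent."""
    matches = re.findall(r"SCORE:\s*([1-5])\b", judge_output)
    return int(matches[-1]) if matches else None


def judge(
    criterion: str,
    question: str,
    answer: str,
    llm: Callable[[str], str] = ask_ollama,
) -> Optional[int]:
    """Score one answer; the judge LLM is injectable so it can be stubbed in tests."""
    return parse_score(llm(build_judge_prompt(criterion, question, answer)))
```

With an Ollama server running locally, `judge("factual accuracy", question, agent_answer)` would call the real model. Passing a stub as `llm` keeps the parsing logic testable without a server, and a `None` result signals that the judge did not follow the output format and the case should be retried or flagged.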