#evaluation
evaluation shows up across 4 sections and 13 pages in this workspace. Use this page as a topic map, not just an archive.
Start here
If you are new to this topic, begin with the strongest entry points, then move into related notes and supporting material.
Where it appears
- Systems: 8 pages
- Sentences: 1 page
- Self: 1 page
- Shelf: 3 pages
Decision-Making Under Uncertainty in AI Runtimes
A practical framework for making accountable decisions in AI systems when evidence is partial, time is limited, and outcomes are high-impact.
Evaluation as a Runtime Discipline
Why evaluation should live inside the operating loop of an AI system instead of being treated as an occasional review ritual.
Evaluation Is a Human Problem
Why benchmarks are not enough and judgment defines quality.
From Ad-Hoc Prompts to Repeatable Agent Workflows
A practical case study showing how structured instructions, handoff memory, and quality gates improved consistency and coverage in this repository.
Knowledge Management as Runtime Memory
Why modern AI teams should treat knowledge management as a live runtime memory system, not a static documentation archive.
What LLM-Ops Actually Means
LLM-Ops is governance over time: understanding the lifecycle of probabilistic systems.
Observability First: How AI Systems Learn After Launch
Why observability is the missing layer between model output and reliable product behavior in production AI systems.
Skill Evaluation and Versioning
How to define expected behavior, detect regressions, version skill changes safely, and decide when rollback is the right move.
Observability turns behavior into knowledge.
How I Run a Weekly Eval Loop
A small review ritual for checking whether my AI workflows are getting clearer or only getting faster.
Context window stress test
A small experiment to see where longer context starts to degrade quality.
Evaluation and prompting references
Shortlist for building safer, more measurable prompts.
Retrieval and grounding evaluation kit
A compact resource pack for checking whether an AI system retrieves the right evidence before it answers.