#evaluation

Evaluation shows up across 4 sections and 13 pages in this workspace. Use this page as a topic map, not just an archive.

Start here

If you are new to this topic, begin with the strongest entry points, then move on to related notes and supporting material.

Where it appears

  • Systems: 8 pages
  • Sentences: 1 page
  • Self: 1 page
  • Shelf: 3 pages

systems

Decision-Making Under Uncertainty in AI Runtimes

A practical framework for making accountable decisions in AI systems when evidence is partial, time is limited, and outcomes are high-impact.

systems

Evaluation as a Runtime Discipline

Why evaluation should live inside the operating loop of an AI system instead of being treated as an occasional review ritual.

systems

Evaluation Is a Human Problem

Why benchmarks are not enough and why judgment defines quality.

systems

From Ad-Hoc Prompts to Repeatable Agent Workflows

A practical case study showing how structured instructions, handoff memory, and quality gates improved consistency and coverage in this repository.

systems

Knowledge Management as Runtime Memory

Why modern AI teams should treat knowledge management as a live runtime memory system, not a static documentation archive.

systems

What LLM-Ops Actually Means

LLM-Ops is governance over time: understanding the lifecycle of probabilistic systems.

systems

Observability First: How AI Systems Learn After Launch

Why observability is the missing layer between model output and reliable product behavior in production AI systems.

systems

Skill Evaluation and Versioning

How to define expected behavior, detect regressions, version skill changes safely, and decide when rollback is the right move.

sentences

Observability turns behavior into knowledge.

self

How I Run a Weekly Eval Loop

A small review ritual for checking whether my AI workflows are getting clearer or only getting faster.

shelf

Context window stress test

A small experiment to see where longer context starts to degrade quality.

shelf

Evaluation and prompting references

Shortlist for building safer, more measurable prompts.

shelf

Retrieval and grounding evaluation kit

A compact resource pack for checking whether an AI system retrieves the right evidence before it answers.