One behavior slice
Give the agent a narrow requirement, an explicit write set, and a stop rule for when the fix needs more files. This prevents overnight scope creep.
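The write set and stop rule can be checked mechanically after each agent step. Here is a minimal sketch, assuming the agent works in a git checkout and the allowed write set is a file of exact paths, one per line. The helper name `outside_write_set` and the file name `allowed-paths.txt` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every changed path that is NOT in the allowed write set.
# $1 = file listing allowed paths, one per line; changed paths arrive on stdin.
outside_write_set() {
  # -F fixed strings, -x whole-line match, -v invert the match, -f read patterns from file.
  # grep exits 1 when nothing is printed, so "|| true" keeps "no violations" from being an error.
  grep -F -x -v -f "$1" || true
}

# Example usage (assumes a git checkout; the file name is illustrative):
#   violations=$(git diff --name-only | outside_write_set allowed-paths.txt)
#   [ -z "$violations" ] || { echo "STOP: out-of-scope edits: $violations"; exit 1; }
```

Note that `git diff --name-only` does not list untracked files, so a stricter check would also need to look at newly created paths.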
Overnight agent testing
Agentic apps are harder to test than web E2E flows: they involve streaming output, fuzzy correctness, tool calls, state drift, and a host that may sleep while the run is active.
For streaming chat, assert state transitions, tool calls, guardrails, final receipt fields, and representative user turns instead of only snapshots.
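One way to assert state transitions rather than snapshots is to check that key events occur in the expected order. The sketch below assumes the run writes one event name per line to a log file. That log format, the event names, and the helper name `events_in_order` are assumptions for illustration, not part of any specific tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named events appear in the log in this order.
# Other events may occur in between. Event names must not contain spaces.
# $1 = log with one event name per line (assumed format); remaining args = expected events.
events_in_order() {
  _log=$1; shift
  awk -v want="$*" '
    BEGIN { n = split(want, w, " "); i = 1 }
    i <= n && $0 == w[i] { i++ }       # advance when the next expected event is seen
    END { exit (i > n ? 0 : 1) }       # exit 0 only if every expected event was matched
  ' "$_log"
}

# Example usage (file and event names are illustrative):
#   events_in_order run-events.log user_turn tool_call final_receipt || echo "transition missing"
```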
Record host and worker heartbeats. If the laptop slept, network dropped, or the terminal died, the run should say that instead of pretending success.
Host heartbeat
Run this next to the overnight test so sleep/wake, lid close, and network interruptions show up as missing timestamps.
while true; do date >> ~/overnight-agent-host.log; sleep 30; done
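To turn missing timestamps into an explicit report, the log can be scanned for gaps. The sketch below assumes a variant of the loop above that writes epoch seconds with `date +%s` (supported by GNU and BSD/macOS `date`, though not required by POSIX) instead of the default `date` format. The helper name `heartbeat_gaps` and the 90-second threshold are hypothetical.

```shell
#!/bin/sh
# Assumed variant of the heartbeat loop that logs epoch seconds, which are easy to compare:
#   while true; do date +%s >> ~/overnight-agent-host.log; sleep 30; done

# Hypothetical helper: print every gap longer than $2 seconds between
# consecutive heartbeats in file $1. Exits 1 if any gap was found, 0 otherwise.
heartbeat_gaps() {
  awk -v max="$2" '
    NR > 1 && $1 - prev > max { print "gap of " ($1 - prev) " s after " prev; found = 1 }
    { prev = $1 }
    END { exit (found ? 1 : 0) }
  ' "$1"
}

# Example usage: with a 30 s interval, treat anything over 90 s as a sleep or outage.
#   heartbeat_gaps ~/overnight-agent-host.log 90 || echo "host was not awake for the whole run"
```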
Take A Coffee gives Codex a temporary awake mode, host checks, a restore step, and a receipt for local overnight runs on macOS and Windows.