Evidence dashboard
Four-layer pipeline. Every layer ships with artifacts; no number below is self-reported. The figures are the live-registered verification trail from the Prism42 open-source repo.
Layer 1 — Kernel correctness
Five-role adversarial dialectic against GPU attention kernels on H100 SXM. Every finding ships with an executed PoC log. Session durability is verified; five-agent delegation will land once Anthropic callable_agents exits research preview on this workspace. A minimal sketch of the PoC-style check follows the artifacts line below.
artifacts · findings/kernel-audits/ · benchmarked on H100 SXM5 + B300 SXM6 (torch.compile)
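A minimal sketch of what an executed PoC log reduces to, assuming a hypothetical candidate_attention stand-in for the kernel under audit; PyTorch's scaled_dot_product_attention is used as the reference, and the tolerance is an illustrative bf16 bound, not the audit's actual threshold.

```python
# PoC-style correctness check for an attention kernel.
# `candidate_attention` is a hypothetical stand-in for the audited kernel;
# the reference is PyTorch's scaled_dot_product_attention (SDPA).
import torch
import torch.nn.functional as F

def candidate_attention(q, k, v):
    # Naive placeholder implementation standing in for the kernel under test.
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(torch.softmax(scores, dim=-1), v)

def poc_check(batch=2, heads=8, seq=512, dim=64, seed=0):
    torch.manual_seed(seed)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.bfloat16 if device == "cuda" else torch.float32
    q, k, v = (torch.randn(batch, heads, seq, dim, device=device, dtype=dtype)
               for _ in range(3))
    ref = F.scaled_dot_product_attention(q, k, v)
    out = candidate_attention(q, k, v)
    # Loose bound: bf16 accumulation order legitimately differs across kernels.
    max_err = (out.float() - ref.float()).abs().max().item()
    print(f"max abs error vs SDPA reference: {max_err:.3e}")
    assert max_err < 2e-2, "candidate kernel diverges from the SDPA reference"

if __name__ == "__main__":
    poc_check()
```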
Layer 2 — Inference performance
Measured decode latency: 22.53 µs p50 (FlashInfer fa3 on H100 SXM5, DeepSeek-V3 MLA, bf16, T=4096); 43.25 µs p50 (torch.compile on B300 SXM6). Decode is <1% of the full conversational voice-turn budget; the STT + TTS + dialogue-manager roundtrip dominates. See docs/anthropic-elevenlabs-agent-bp-2026-04-21.md §5 for the full budget math.
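The <1% claim is straightforward to sanity-check. The sketch below redoes the arithmetic with the measured H100 p50; the tokens-per-turn and turn-budget figures are illustrative assumptions, not the values from §5.

```python
# Back-of-envelope check of the "<1% of the voice-turn budget" claim.
# DECODE_P50_US is the measured number from this section; the other two
# constants are illustrative assumptions, not the §5 budget values.
DECODE_P50_US = 22.53     # measured: FlashInfer fa3, H100 SXM5, bf16, T=4096
TOKENS_PER_TURN = 120     # assumed: tokens decoded per conversational reply
TURN_BUDGET_MS = 800.0    # assumed: end-to-end voice-turn latency budget

decode_ms = DECODE_P50_US * TOKENS_PER_TURN / 1000.0   # 2.70 ms
share = decode_ms / TURN_BUDGET_MS
print(f"decode: {decode_ms:.2f} ms = {share:.2%} of the turn budget")
# -> decode: 2.70 ms = 0.34% of the turn budget; STT + TTS + dialogue
#    management account for essentially everything else.
```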
Layer 3 — Clinical reasoning
Opus 4.7 HealthBench Hard baseline: 0.196 ± 0.068 (mean of N=3 independent runs; ± is the 95% CI half-width) on the declared 30-example subset. This is the first published Opus 4.7 HealthBench Hard number. Harness-delta gate: the paired 95% CI of the per-example delta must exclude 0.
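A sketch of the harness-delta gate as stated, assuming per-example grades from two harness versions; the score arrays are synthetic stand-ins, and the paired t-interval is one reasonable reading of "paired 95% CI".

```python
# Paired 95% CI gate on the per-example harness delta. Per the stated rule,
# the gate only passes when the CI excludes 0. The grades below are
# hypothetical stand-ins for HealthBench Hard per-example scores.
import numpy as np
from scipy import stats

def harness_delta_gate(scores_a, scores_b, alpha=0.05):
    """Paired t-interval on (b - a); returns (lo, hi, passes)."""
    d = np.asarray(scores_b, dtype=float) - np.asarray(scores_a, dtype=float)
    n = d.size
    mean, sem = d.mean(), d.std(ddof=1) / np.sqrt(n)
    half = stats.t.ppf(1 - alpha / 2, df=n - 1) * sem
    lo, hi = mean - half, mean + half
    return lo, hi, bool(lo > 0 or hi < 0)

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 0.5, size=30)                 # hypothetical 30-example subset
b = np.clip(a + rng.normal(0.05, 0.05, size=30), 0, 1)
lo, hi, passes = harness_delta_gate(a, b)
print(f"delta 95% CI: [{lo:.3f}, {hi:.3f}]  gate: {'PASS' if passes else 'BLOCK'}")
```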
Layer 4 — Voice + product usability
Live 911-style calls against the 14-agent PSAP stack (5 voice-facing + 4 oversight + 2 post-session + 3 orchestration). Every turn is graded by a cross-vendor rubric chain (gpt-5-5 primary). The 42-scenario red-team battery runs in CI per corpus/red-team/psap-fixtures-v0.1.yaml, and the release gate blocks any deploy in which any of the 42 scenarios fails. A sketch of the gate logic follows.
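What the release gate reduces to in CI, sketched below; the fixture path is the one named above, but run_scenario and the YAML schema (a top-level scenarios list with id fields) are assumptions, not the stack's documented interface.

```python
# CI release gate over the red-team battery: load the fixture corpus and
# block the deploy if any scenario fails.
import sys
import yaml

FIXTURES = "corpus/red-team/psap-fixtures-v0.1.yaml"

def run_scenario(scenario: dict) -> bool:
    """Hypothetical stand-in for the PSAP scenario runner: drives one
    911-style call against the 14-agent stack and grades every turn."""
    return True  # placeholder so the gate skeleton executes end to end

def main() -> int:
    with open(FIXTURES) as f:
        corpus = yaml.safe_load(f)
    # Assumed schema: a top-level `scenarios` list whose items carry an `id`.
    scenarios = corpus["scenarios"]
    failures = [s["id"] for s in scenarios if not run_scenario(s)]
    if failures:
        print(f"release gate: BLOCK ({len(failures)}/{len(scenarios)} failed): {failures}")
        return 1
    print(f"release gate: PASS ({len(scenarios)}/{len(scenarios)} scenarios)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```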