# Appendix A10 — Experiment Design

## Purpose
This appendix gives a practical structure for designing experiments so that model, fusion, and policy changes can be compared fairly.
## What a good experiment should answer
Before it runs, a useful experiment should state:
- what changed
- what hypothesis is being tested
- what data split is used
- what metrics will decide success
- what segments must not regress
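One way to keep these questions honest is to refuse to run any experiment whose plan leaves one of them blank. A minimal sketch, assuming a plain-dict plan format (the field names here are illustrative, not a prescribed schema):

```python
REQUIRED_FIELDS = ("change", "hypothesis", "data_split",
                   "decision_metrics", "protected_segments")

def missing_fields(plan: dict) -> list:
    """Return the questions a draft experiment plan leaves unanswered."""
    return [f for f in REQUIRED_FIELDS if not plan.get(f)]

draft = {
    "change": "swap base model A -> B",
    "hypothesis": "model B lowers APCER on replay attacks",
    "data_split": "eval-split-v3",
}
print(missing_fields(draft))  # → ['decision_metrics', 'protected_segments']
```

A plan that returns an empty list here has at least named its change, hypothesis, split, decision metrics, and protected segments before any result exists to bias the write-up.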
## Common experiment types
| Experiment type | Example question |
|---|---|
| base-model comparison | is model B better than model A on replay attacks? |
| fusion experiment | does calibrated fusion reduce false accepts without hurting retries? |
| threshold search | which score bands fit onboarding best? |
| channel-specific policy | should web use a stricter threshold than app? |
| data ablation | does low-light data improve false rejects in dim conditions? |
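The threshold-search row can be sketched concretely. This sketch assumes a score convention (higher score = more likely bona fide, accept when the score meets the threshold), a candidate grid, and a BPCER budget, none of which this appendix prescribes: among thresholds that keep BPCER within budget, pick the one with the lowest APCER.

```python
def apcer(attack_scores, t):
    # fraction of attack presentations accepted at threshold t
    return sum(s >= t for s in attack_scores) / len(attack_scores)

def bpcer(bonafide_scores, t):
    # fraction of bona fide presentations rejected at threshold t
    return sum(s < t for s in bonafide_scores) / len(bonafide_scores)

def threshold_search(attack_scores, bonafide_scores, candidates, max_bpcer):
    """Among thresholds meeting the BPCER budget, return the lowest-APCER one."""
    feasible = [t for t in candidates if bpcer(bonafide_scores, t) <= max_bpcer]
    return min(feasible, key=lambda t: apcer(attack_scores, t)) if feasible else None

# toy scores, purely illustrative
best = threshold_search(
    attack_scores=[0.1, 0.2, 0.4, 0.7],
    bonafide_scores=[0.5, 0.6, 0.8, 0.9],
    candidates=[0.3, 0.5, 0.6],
    max_bpcer=0.25,
)
print(best)  # → 0.5
```

Returning `None` when no candidate meets the budget is deliberate: it forces the experiment to report "no acceptable threshold exists on this split" rather than silently shipping the least-bad option.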
## A useful experiment template
- objective
- hypothesis
- data and split version
- model / policy versions under test
- evaluation metrics
- protected segments
- acceptance criteria
- failure review plan
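The template above can be kept as a structured record per experiment, which also answers the "no experiment record" mistake below. A minimal sketch; the field names and example values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class ExperimentRecord:
    """One record per experiment, filled in before the run starts."""
    objective: str
    hypothesis: str
    data_split_version: str
    versions_under_test: dict      # which model / policy versions are compared
    metrics: list                  # metrics that decide success
    protected_segments: list       # segments that must not regress
    acceptance_criteria: list
    failure_review_plan: str

record = ExperimentRecord(
    objective="reduce false accepts on replay attacks",
    hypothesis="calibrated fusion lowers APCER without raising retries",
    data_split_version="eval-split-v3",
    versions_under_test={"fusion": "calibrated-v2", "threshold": "unchanged"},
    metrics=["APCER_replay", "BPCER", "retry_rate", "p95_latency_ms"],
    protected_segments=["low_light", "older_devices"],
    acceptance_criteria=["APCER_replay improves", "BPCER within tolerance"],
    failure_review_plan="triage failed replay samples by capture channel",
)
```

Pinning the split and the versions under test in the record is what makes the experiment reproducible after the fact.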
## Example acceptance criteria
- APCER (attack presentation classification error rate) improves on replay attacks
- BPCER (bona fide presentation classification error rate) does not regress beyond tolerance
- retry rate does not rise above allowed level
- p95 latency stays within budget
- no protected segment has major degradation
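Criteria like these can be encoded as an automatic gate so that pass/fail is decided by the pre-registered rules rather than by post-hoc judgment. A sketch with made-up tolerances, metric names, and baseline/candidate values (all assumptions, to be tuned per product):

```python
def meets_acceptance(baseline, candidate,
                     bpcer_tol=0.002, retry_tol=0.005,
                     latency_budget_ms=300, segment_tol=0.01):
    """Return (passed, per-check results) for a candidate run vs. a baseline run."""
    checks = {
        "apcer_improves": candidate["apcer_replay"] < baseline["apcer_replay"],
        "bpcer_within_tolerance": candidate["bpcer"] <= baseline["bpcer"] + bpcer_tol,
        "retry_within_tolerance": candidate["retry_rate"] <= baseline["retry_rate"] + retry_tol,
        "latency_within_budget": candidate["p95_latency_ms"] <= latency_budget_ms,
        "no_segment_degradation": all(
            candidate["segment_bpcer"][seg] <= base + segment_tol
            for seg, base in baseline["segment_bpcer"].items()
        ),
    }
    return all(checks.values()), checks

baseline = {"apcer_replay": 0.040, "bpcer": 0.020, "retry_rate": 0.030,
            "p95_latency_ms": 280, "segment_bpcer": {"low_light": 0.035}}
candidate = {"apcer_replay": 0.025, "bpcer": 0.021, "retry_rate": 0.031,
             "p95_latency_ms": 290, "segment_bpcer": {"low_light": 0.041}}
passed, checks = meets_acceptance(baseline, candidate)
print(passed)  # → True
```

Returning the per-check dict alongside the overall verdict matters: when a candidate fails, the failure review plan needs to know which criterion tripped, not just that the gate closed.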
## Common mistakes
| Mistake | Why it hurts |
|---|---|
| changing the model and the threshold together, with no control | the improvement cannot be attributed to either change |
| testing on the same data used for tuning | results are optimistically biased |
| no segment analysis | hidden regressions survive |
| no experiment record | hard to reproduce findings |