When we shipped the first version of our seven-day risk score, two years ago, it was a single integer between zero and one hundred. The model was good — it had held up in cross-validation on retrospective data from three prior seasons across two clubs. The reception from the medical staff was not good. They read the number, nodded, and went back to their existing workflow. The score changed nothing.
The objection was not that the score was wrong, at least not in any measurable sense. The objection was that nobody could push back on it. A head physiotherapist who has spent fifteen years reading bodies will not, and should not, defer to a number whose provenance is opaque. Clinical judgment is built on the ability to disagree with a signal, to say "I see what you see, but here is why this player is different." A black-box score takes that ability away. The staff had no way to argue with the model, so they ignored it. Perfectly rational.
We spent the next twelve months rebuilding the surface without changing the underlying prediction engine. The work was not technical — it was translational. We needed to learn how the staff already thought about risk, and map our output onto that mental model instead of asking them to adopt ours.
The result was two principles that everything else is built on. First, every score is delivered with its three strongest contributing factors, named in the language the staff uses: training load spike, sleep deficit, previous ipsilateral injury within sixty days. Each factor includes a sparkline of the trailing fourteen days, visible on hover, so the clinician can verify the trajectory, not just the label. Second, when a similar pattern exists in the player's own history, the score is accompanied by a text reference to that analogue: "This combination of factors resembles the week of October 14, when the player was held out of two sessions and recovered fully by match day." The framing shifts from abstract prediction to concrete comparison.
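The first principle can be sketched in a few lines. This is an illustration under stated assumptions, not our production code: the internal feature names, the per-factor contribution values, the `STAFF_LABELS` mapping, and the `explain` function are all hypothetical. It also assumes the prediction engine can give a signed contribution per factor, and that only factors pushing risk upward are worth surfacing.

```python
from dataclasses import dataclass

# Hypothetical mapping from internal feature names to the staff's vocabulary.
STAFF_LABELS = {
    "acwr_7d": "training load spike",
    "sleep_debt_hrs": "sleep deficit",
    "prior_ipsi_injury_60d": "previous ipsilateral injury within sixty days",
}


@dataclass
class Explanation:
    score: int          # the 0-100 risk score
    factors: list[str]  # staff-language labels, strongest first


def explain(score: int, contributions: dict[str, float], k: int = 3) -> Explanation:
    """Attach the k strongest risk-raising factors to a score."""
    # Assumption: positive contributions raise risk; negative ones are dropped.
    raising = [item for item in contributions.items() if item[1] > 0]
    ranked = sorted(raising, key=lambda item: item[1], reverse=True)
    # Fall back to the internal name if no staff label has been agreed yet.
    labels = [STAFF_LABELS.get(name, name) for name, _ in ranked[:k]]
    return Explanation(score=score, factors=labels)


# Made-up contribution values, purely for illustration.
example = explain(
    78,
    {
        "acwr_7d": 0.31,
        "sleep_debt_hrs": 0.22,
        "prior_ipsi_injury_60d": 0.18,
        "minutes_played": 0.05,
        "days_since_match": -0.10,
    },
)
print(example.factors)
```

The point of the mapping table is that the translation into staff language is a deliberate, reviewable artifact, separate from whatever the model calls its features internally.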
The analogue system turned out to be the more important of the two. A model that says "78% risk" invites skepticism. A model that says "this looks like the week before the last hamstring" invites a conversation. The staff started annotating those analogues ("no, this time it is different, he rested during the break"), and those annotations became training signal for the next model iteration. The feedback loop closed.
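One simple way to think about analogue lookup is as a nearest-neighbour search over the player's own past weeks. The sketch below is a minimal illustration of that idea, not a description of how our matching actually works: the `WeekRecord` type, the Euclidean distance, the normalised factor values, the `max_distance` threshold, and the dates are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WeekRecord:
    week_start: date
    factors: dict[str, float]  # assumed to be normalised to comparable scales
    outcome: str               # what happened that week, in the staff's words


def distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance over the union of factor names (missing = 0.0)."""
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def find_analogue(
    current: dict[str, float],
    history: list[WeekRecord],
    max_distance: float = 0.5,  # hypothetical cut-off; would need tuning
) -> Optional[WeekRecord]:
    """Return the most similar past week, or None if nothing is close enough."""
    best = min(history, key=lambda w: distance(current, w.factors), default=None)
    if best is None or distance(current, best.factors) > max_distance:
        return None
    return best


# Illustrative history for one player; the values and the year are made up.
history = [
    WeekRecord(date(2023, 10, 14), {"load": 0.85, "sleep": 0.75},
               "held out of two sessions, recovered fully by match day"),
    WeekRecord(date(2023, 11, 4), {"load": 0.20, "sleep": 0.10},
               "normal training week"),
]

match = find_analogue({"load": 0.90, "sleep": 0.70}, history)
if match is not None:
    print(f"Resembles the week of {match.week_start:%B %d}: {match.outcome}")
```

Returning `None` when nothing is close enough matters as much as the match itself: a forced comparison to a dissimilar week would give the staff something misleading to argue with.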
What changed was not the model's accuracy, which improved marginally, but the rate at which the score influenced decisions. In the first quarter after the redesign, the staff acted on elevated scores (by modifying training load, scheduling extra recovery, or initiating a conversation with the player) at roughly three times the rate of the quarter before it. They did not trust the score more. They trusted their ability to evaluate it.
This distinction matters for any medical AI that hopes to be used in practice, not just admired in a research paper. An explainable risk score does not replace the physiotherapist's judgment. It gives them something substantive to agree or disagree with. The decision stays with the human. The model just brings the relevant pattern to their attention, framed in their own language, at the moment it matters.
We apply the same explainability principle to our return-to-play protocols. When the platform suggests a recovery progression, it surfaces the specific physiological signals that inform each step. The staff can accept, modify, or override the suggestion based on their direct observation of the player. The model adapts. The clinician stays in charge.