Select a recommendation from the list
The same evaluator run on the coach-synthesized rec (Sonnet 4.5 acting as a fitness/nutrition coach, grounded in the Preffect KB plus PubMed citations). Quantifies headroom vs. production.
Each row is a rec where the coach scored ≤ production. Use this to find KB gaps, off-target retrieval, and cases where production was genuinely already good. Click any row to jump to the rec's full evaluation.
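A minimal sketch of the filter behind this table, assuming a per-rec DataFrame with `prod_score` and `coach_score` columns (the column names are assumptions, not the production schema):

```python
import pandas as pd

def coach_not_better(evals: pd.DataFrame) -> pd.DataFrame:
    """Keep recs where the coach scored no better than production, worst gaps first.

    evals: one row per rec with 'prod_score' and 'coach_score' columns (assumed names).
    """
    out = evals.copy()
    out["delta"] = out["coach_score"] - out["prod_score"]
    return out[out["delta"] <= 0].sort_values("delta")
```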
| Domain | Rec title | Prod | Coach | Δ | Worst dim | Likely cause |
|---|---|---|---|---|---|---|
Clinical (Grounding + Suitability) = is the advice scientifically sound? Behavioral (Actionability + Motivation Fit) = is it persuasive / executable? The larger gap points to the bottleneck layer.
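A sketch of the Clinical vs. Behavioral split described above; the dimension keys mirror the tooltip wording and are assumptions, not confirmed field names:

```python
def clinical_behavioral_gap(scores: dict[str, float]) -> dict[str, float | str]:
    """scores: per-rec rubric scores keyed by dimension (assumed key names)."""
    clinical = (scores["grounding"] + scores["suitability"]) / 2
    behavioral = (scores["actionability"] + scores["motivation_fit"]) / 2
    return {
        "clinical": clinical,
        "behavioral": behavioral,
        # The lower of the two sub-scores is the bottleneck layer.
        "bottleneck": "clinical" if clinical < behavioral else "behavioral",
    }
```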
Where did the user's action (accept/complete) land vs. the clinical score we gave? Upper-left is the most important bucket for health tech: users engaging with clinically weak recs is a safety signal, not a success signal.
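One way the quadrant assignment could be sketched, assuming engagement is a boolean (accept/complete) and the clinical-score threshold is purely illustrative:

```python
def quadrant(engaged: bool, clinical_score: float, threshold: float = 3.5) -> str:
    """engaged: user accepted or completed the rec; threshold is an assumed cut-off."""
    if engaged and clinical_score < threshold:
        return "engaged, clinically weak (safety signal)"
    if engaged:
        return "engaged, clinically sound"
    if clinical_score >= threshold:
        return "not engaged, clinically sound"
    return "not engaged, clinically weak"
```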
Biggest Δ = the production layer to fix first. Grounding → retrieval/context. Motivation Fit → persona/prompt. Suitability → signal-conflict resolution. Actionability → specificity rules.
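A sketch of the biggest-Δ-to-layer mapping described above; the dimension keys and layer labels simply mirror the tooltip text and are assumptions:

```python
LAYER_FOR_DIMENSION = {
    "grounding": "retrieval/context",
    "motivation_fit": "persona/prompt",
    "suitability": "signal-conflict resolution",
    "actionability": "specificity rules",
}

def layer_to_fix(deltas: dict[str, float]) -> str:
    """deltas: coach-minus-production gap per dimension (assumed keys); largest gap wins."""
    worst_dim = max(deltas, key=deltas.get)
    return LAYER_FOR_DIMENSION[worst_dim]
```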
evaluator.py:EVAL_SYSTEM_PROMPT.
Kept as-is until agreed.
Where users drop off: helps identify whether the problem is content (dismissed) or execution (accepted but not completed).
Quality bars + 7-day MA + coach line on the left axis; engagement % on the right. Vertical dashed lines mark production prompt/context deployments; see the strip below for commit details.
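A sketch of the 7-day moving average on the quality line, assuming one row per day with a `prod_score` column (assumed name):

```python
import pandas as pd

def add_seven_day_ma(daily: pd.DataFrame) -> pd.DataFrame:
    """daily: one row per day, indexed by date, with a 'prod_score' column (assumed name)."""
    out = daily.sort_index().copy()
    out["prod_score_7d_ma"] = out["prod_score"].rolling(window=7, min_periods=1).mean()
    return out
```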
Each dot is one day of production recs. Upper-right = higher-quality recs correlate with more user action. Color shifts old → recent; recent points clustering up-right = production is improving over time.
Average score per domain per day
Are weak dimensions (personalization, engagement likelihood) improving with prompt tuning?
Average score across all evaluated recs per rubric dimension
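A sketch of the two aggregations above (per-domain daily average and per-dimension average), assuming long-format columns `date`, `domain`, `dimension`, and `score` (assumed names):

```python
import pandas as pd

def domain_daily_avg(evals: pd.DataFrame) -> pd.DataFrame:
    """Average score per domain per day; one column per domain."""
    return evals.groupby(["date", "domain"])["score"].mean().unstack("domain")

def dimension_avg(evals: pd.DataFrame) -> pd.Series:
    """Average score per rubric dimension across all evaluated recs."""
    return evals.groupby("dimension")["score"].mean()
```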
| Date | Recs | Prod | Coach | Δ | Sleep | Nutr | Phys | Emot | Accept% | Dismiss% | Refresh% | Complete% | Flagged | Safety |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|