Rec Evaluator — Preffect

Exclude internal accounts ⓘ

Max recs per user per day: ⓘ

Total Recs

LLM-Evaluated ⓘ

Avg Quality ⓘ

Flagged <5 ⓘ

⚠ Safety Flags (—)

Binary: is_safe=false · treat as incidents

🧑‍🏫 Coach Benchmark (KB-grounded alternative)

Same evaluator run on the coach-synthesized rec (Sonnet 4.5 as fitness/nutrition coach, grounded in Preffect KB + PubMed citation). Quantifies headroom vs. production.

Not a production candidate. Coach calls Sonnet 4.5 at ~$0.01/rec and ~20s latency — too slow/expensive to serve inline. It exists as a gold-standard ceiling and as the "preferred" side of preference pairs used to DPO-fine-tune a smaller, faster production model. The coach-vs-production Δ is the quality headroom that training can recover.

Production avg

Coach avg

Δ quality

Coach win rate

Wins / ties / losses

When the Coach Lost or Tied — Failure-Mode Inspection

—

Each row is a rec where the coach scored ≤ production. Use this to find KB gaps, off-target retrieval, and cases where production was genuinely already good. Click any row to jump to the rec's full evaluation.

Domain	Rec title	Prod	Coach	Δ	Worst dim	Likely cause

Where the Gap Lives — Headroom

Production Coach (KB-grounded)

Biggest Δ = the production layer to fix first. Grounding→retrieval/context. Motivation Fit→persona/prompt. Suitability→signal-conflict resolution. Actionability→specificity rules.

By Dimension

By Domain

🤖 How "Quality Score" is computed (Claude Sonnet 4.5)

The evaluator scores each rec on 5 dimensions (1–10). Quality is NOT a simple average — it weights toward the weakest of the first four:

quality_score = min(G, S, A, M) + 0.5 × (mean(G, S, A, M) − min(G, S, A, M))
where G=Grounding · S=Suitability · A=Actionability · M=Motivation Fit

Why this shape: a rec with great Actionability but zero Grounding (hallucinated data) shouldn't average to 6 — it should score near the hallucination level. So the lowest dim acts as a floor, and the other three add a smaller bonus when they're strong.

Dimensions: Grounding · Suitability · Actionability · Motivation Fit. Engagement Likelihood is tracked separately (not in quality) so we can later validate it against observed user behavior.

Open policy question for the team: Is Suitability (clinical fit) weighted enough? Currently any dim including Suitability can be the floor — so a suitability-low rec is capped at a low quality. If the team decides clinical fit deserves an additional weight, change the formula in evaluator.py:EVAL_SYSTEM_PROMPT. Kept as-is until agreed.

— · - safety flags raised · Evaluator is status-blind (never sees accept/dismiss/complete).

User Engagement (production recs)

% of evaluated recs by user action — what users actually did

Accept Rate

Complete Rate

Dismiss Rate

Refresh Rate

Week-over-Week Change

This week vs previous 7 days · arrows show change

Engagement Funnel

Where users drop off — helps identify whether the problem is content (dismissed) or execution (accepted but not completed)

Quality vs Engagement Over Time

Quality bars + 7-day MA + coach line on the left axis; engagement % on the right. Vertical dashed lines mark production prompt/context deployments — see the strip below for commit details.

Quality Score vs User Completion Rate

Each dot is one day of production recs. Upper-right = higher-quality recs correlate with more user action. Color shifts old → recent — recent points clustering up-right = production is improving over time.

Quality by Domain

Average score per domain per day

Dimension Scores Over Time

Are weak dimensions (personalization, engagement likelihood) improving with prompt tuning?

Overall Dimension Averages

Average score across all evaluated recs per rubric dimension

Daily Breakdown

Date	Recs	Prod	Coach	Δ	Sleep	Nutr	Phys	Emot	Accept%	Dismiss%	Refresh%	Complete%	Flagged	Safety

Rec Quality Evaluator

Failed to load trend data