Quality Scoring
Session 11.5 · ~5 min read
From "This Feels Right" to "This Scores 38"
Subjective quality assessment does not scale. When you are the only reviewer, "I know good when I see it" works. When you add a second reviewer, your definitions diverge. When you batch-produce 20 pieces per week, your standards drift. A rubric fixes this by encoding your quality standards into measurable dimensions.
The New York Times built exactly this kind of framework. Their internal tool, Stet, codifies institutional editorial knowledge into a concrete rubric to score AI-generated copy. The principle is universal: if you can define what quality means in numbers, you can enforce it consistently.
Quality Rubric: A scoring framework with defined dimensions, each rated on a fixed scale, that converts subjective editorial judgment into a repeatable, auditable number. The rubric encodes your standards so they survive changes in mood, fatigue, and reviewer.
The Five Scoring Dimensions
Your rubric should have 5 dimensions. Fewer than 5 and you miss important quality signals. More than 7 and the rubric becomes a chore that reviewers skip. Five is the practical optimum.
The dimensions below are a starting point. Modify them to match your content type.
| Dimension | What It Measures | Score 10 | Score 0 |
|---|---|---|---|
| Accuracy | Factual correctness of all verifiable claims | Every claim verified, sources cited, no hallucinations | Multiple fabricated facts, invented sources, wrong numbers |
| Voice Consistency | Match to target voice profile | Indistinguishable from author's natural writing | Generic AI voice with no personality markers |
| Structural Clarity | Logical flow, section organization, argument progression | Each section builds on the previous, clear transitions, no redundancy | Random paragraph order, ideas repeated, no coherent argument |
| Originality of Insight | Presence of ideas that could not be generated by prompting any model | Contains practitioner knowledge, specific examples, and positions only the author could take | Entirely generic advice available in any search result |
| AI Artifact Absence | Freedom from the 15 forensic markers (inverse scale) | Zero detectable AI markers | More than 10 markers present across the piece |
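To make the rubric concrete, here is a minimal sketch of it as data, assuming Python. The `Dimension` class and `RUBRIC` list are illustrative names, not part of any library, and the anchors are abbreviated from the table above:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    anchor_10: str  # what a 10 looks like
    anchor_0: str   # what a 0 looks like

# The five dimensions from the table above, with abbreviated anchors.
RUBRIC = [
    Dimension("Accuracy", "Every claim verified, sources cited", "Fabricated facts, invented sources"),
    Dimension("Voice Consistency", "Indistinguishable from the author", "Generic AI voice"),
    Dimension("Structural Clarity", "Each section builds on the last", "Random paragraph order"),
    Dimension("Originality of Insight", "Knowledge only the author could supply", "Generic search-result advice"),
    Dimension("AI Artifact Absence", "Zero detectable AI markers", "More than 10 markers present"),
]

def total_score(ratings: dict[str, int]) -> int:
    """Sum the 0-10 ratings across all five dimensions (maximum 50)."""
    assert set(ratings) == {d.name for d in RUBRIC}, "rate every dimension"
    assert all(0 <= r <= 10 for r in ratings.values()), "ratings must be 0-10"
    return sum(ratings.values())
```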
The Scoring Action Matrix
A score without an action is a decoration. Each score range maps to a specific editorial action.
```mermaid
flowchart TD
    A["Score Content<br/>(5 dimensions × 0-10)"] --> B{Total Score?}
    B -->|"40-50"| C["Publish<br/>Light proofread only"]
    B -->|"30-39"| D["Rework<br/>Targeted edits on weak dimensions"]
    B -->|"20-29"| E["Major Revision<br/>Structural and voice overhaul"]
    B -->|"Below 20"| F["Regenerate<br/>Prompt revision required"]
    style C fill:#6b8f71,color:#111
    style D fill:#c8a882,color:#111
    style E fill:#c47a5a,color:#111
    style F fill:#c47a5a,color:#111
```
| Score Range | Action | Typical Time Investment | Expected Output |
|---|---|---|---|
| 40-50 | Publish after proofread | 5-10 minutes | Ready for audience |
| 30-39 | Targeted rework on lowest-scoring dimensions | 20-40 minutes | Publishable after second review |
| 20-29 | Major revision: restructure, inject voice, verify facts | 45-90 minutes | Might reach publishable; consider regeneration |
| Below 20 | Discard and regenerate with revised prompt | Regeneration time + new review cycle | New output from improved prompt |
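The matrix translates directly into code. A short sketch under the same assumptions as above, with the band thresholds taken exactly from the table; the action names are illustrative:

```python
def action_for(total: int) -> str:
    """Map a 0-50 total score to the editorial action in the matrix above."""
    if total >= 40:
        return "publish"          # light proofread only
    if total >= 30:
        return "rework"           # targeted edits on weak dimensions
    if total >= 20:
        return "major_revision"   # structural and voice overhaul
    return "regenerate"           # revise the prompt and start over

def weakest_dimensions(ratings: dict[str, int], n: int = 2) -> list[str]:
    """For the rework band, list the n lowest-scoring dimensions to fix first."""
    return sorted(ratings, key=ratings.get)[:n]
```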
Calibration
A rubric is only useful if it produces consistent scores. To calibrate, score 5 pieces of content you already know the quality of: one piece of your own best writing, one piece of writing you admire from someone else, one good AI output, one mediocre AI output, and one obvious slop piece.
Your best writing should score 40+, and so should the admired writing (rate its voice consistency against its author's profile, not yours). Good AI output should land at 28-35, mediocre AI output at 18-27, and obvious slop below 18.
If the scores do not match your intuitive quality ranking, adjust the rubric. Either the dimension definitions are wrong, the scale anchors are wrong, or you are weighting dimensions incorrectly. Calibration is iterative. Expect 2-3 rounds before the rubric reliably matches your judgment.
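One way to automate the consistency check is to encode the expected bands and flag any piece that lands outside its band. A minimal sketch, again assuming Python; the piece names are placeholders for your own calibration set:

```python
# Expected score bands for the five-piece calibration set described above.
CALIBRATION_BANDS = {
    "my_best_writing": (40, 50),
    "admired_writing": (40, 50),
    "good_ai_output": (28, 35),
    "mediocre_ai_output": (18, 27),
    "obvious_slop": (0, 17),
}

def calibration_misses(scored: dict[str, int]) -> list[str]:
    """Return every piece whose rubric score fell outside its expected band.

    An empty list means the rubric matches your intuitive ranking;
    otherwise, adjust dimension definitions or scale anchors and rescore.
    """
    misses = []
    for name, (low, high) in CALIBRATION_BANDS.items():
        if not low <= scored[name] <= high:
            misses.append(f"{name}: scored {scored[name]}, expected {low}-{high}")
    return misses
```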
Using the Rubric in Production
Every piece of content that exits your pipeline should have a score card attached. Not stored separately, not remembered vaguely, but recorded alongside the content in a simple log. Over time, this log reveals patterns: which content types consistently score low, which prompt templates produce the highest scores, and whether your quality is improving or degrading as you scale.
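A plain append-only JSONL file is enough for this log. A minimal sketch, assuming Python; the file name and record fields are illustrative choices, not a prescribed schema:

```python
import datetime
import json
import pathlib

LOG_PATH = pathlib.Path("quality_scores.jsonl")  # illustrative location

def log_score_card(content_id: str, ratings: dict[str, int], action: str) -> None:
    """Append one score card per reviewed piece to an append-only JSONL log."""
    record = {
        "content_id": content_id,
        "date": datetime.date.today().isoformat(),
        "ratings": ratings,             # per-dimension 0-10 scores
        "total": sum(ratings.values()),
        "action": action,               # publish / rework / major_revision / regenerate
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Because each record keeps the per-dimension ratings rather than just the total, you can later group the log by content type or prompt template to surface the patterns described above.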
Google's own Search Quality Evaluator Guidelines use a similar approach. Human quality raters evaluate search results against defined rubrics with specific criteria for each rating level. E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) is a rubric. Your rubric is the content-production equivalent.
Further Reading
- Inside The New York Times's A.I. Toolkit, Investigative Reporters and Editors (2025)
- Google AI Content Guidelines: Complete 2026 Guide, Koanthic
- IMPRESS Best Practice Note: The Use of Artificial Intelligence (April 2025)
- Google Quality Raters Update 2025 Checks AI-Generated Content, SlideShare
Assignment
Build your quality rubric. Define 5 scoring dimensions relevant to your content type (you can use the ones in this session or create your own). For each dimension, define what a 10 looks like and what a 0 looks like. Score 5 pieces of content of varying quality. If the scores do not match your intuitive ranking, adjust the rubric and score again. Repeat until the numbers reflect reality.