Quality Scoring
Session 11.5 · ~5 min read
From "This Feels Right" to "This Scores 38"
Subjective quality assessment does not scale. When you are the only reviewer, "I know good when I see it" works. When you add a second reviewer, your definitions diverge. When you batch-produce 20 pieces per week, your standards drift. A rubric fixes this by encoding your quality standards into measurable dimensions.
The New York Times built exactly this kind of framework. Their internal tool, Stet, codifies institutional editorial knowledge into a concrete rubric to score AI-generated copy. The principle is universal: if you can define what quality means in numbers, you can enforce it consistently.
Quality Rubric: A scoring framework with defined dimensions, each rated on a fixed scale, that converts subjective editorial judgment into a repeatable, auditable number. The rubric encodes your standards so they survive changes in mood, fatigue, and reviewer.
The Five Scoring Dimensions
Your rubric should have 5 dimensions. Fewer than 5 and you miss important quality signals. More than 7 and the rubric becomes a chore that reviewers skip. Five is the practical optimum.
The dimensions below are a starting point. Modify them to match your content type.
| Dimension | What It Measures | Score 10 | Score 0 |
|---|---|---|---|
| Accuracy | Factual correctness of all verifiable claims | Every claim verified, sources cited, no hallucinations | Multiple fabricated facts, invented sources, wrong numbers |
| Voice Consistency | Match to target voice profile | Indistinguishable from author's natural writing | Generic AI voice with no personality markers |
| Structural Clarity | Logical flow, section organization, argument progression | Each section builds on the previous, clear transitions, no redundancy | Random paragraph order, ideas repeated, no coherent argument |
| Originality of Insight | Presence of ideas that could not be generated by prompting any model | Contains practitioner knowledge, specific examples, and positions only the author could take | Entirely generic advice available in any search result |
| AI Artifact Absence | Freedom from the 15 forensic markers (inverse scale) | Zero detectable AI markers | More than 10 markers present across the piece |
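To make the rubric concrete, here is a minimal sketch of it as data, assuming Python. The `Dimension` class and `RUBRIC` list are illustrative names, not part of any library, and the anchors are abbreviated from the table above:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    anchor_10: str  # what a 10 looks like
    anchor_0: str   # what a 0 looks like

# The five dimensions from the table above, with abbreviated anchors.
RUBRIC = [
    Dimension("Accuracy", "Every claim verified, sources cited", "Fabricated facts, invented sources"),
    Dimension("Voice Consistency", "Indistinguishable from the author", "Generic AI voice"),
    Dimension("Structural Clarity", "Each section builds on the last", "Random paragraph order"),
    Dimension("Originality of Insight", "Knowledge only the author could supply", "Generic search-result advice"),
    Dimension("AI Artifact Absence", "Zero detectable AI markers", "More than 10 markers present"),
]

def total_score(ratings: dict[str, int]) -> int:
    """Sum the 0-10 ratings across all five dimensions (maximum 50)."""
    assert set(ratings) == {d.name for d in RUBRIC}, "rate every dimension"
    assert all(0 <= r <= 10 for r in ratings.values()), "ratings must be 0-10"
    return sum(ratings.values())
```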
The Scoring Action Matrix
A score without an action is a decoration. Each score range maps to a specific editorial action.
```mermaid
flowchart TD
    A["Score Content<br/>(5 dimensions × 0-10)"] --> B{Total Score?}
    B -->|"40-50"| C["Publish<br/>Light proofread only"]
    B -->|"30-39"| D["Rework<br/>Targeted edits on weak dimensions"]
    B -->|"20-29"| E["Major Revision<br/>Structural and voice overhaul"]
    B -->|"Below 20"| F["Regenerate<br/>Prompt revision required"]
    style C fill:#6b8f71,color:#111
    style D fill:#c8a882,color:#111
    style E fill:#c47a5a,color:#111
    style F fill:#c47a5a,color:#111
```
| Score Range | Action | Typical Time Investment | Expected Output |
|---|---|---|---|
| 40-50 | Publish after proofread | 5-10 minutes | Ready for audience |
| 30-39 | Targeted rework on lowest-scoring dimensions | 20-40 minutes | Publishable after second review |
| 20-29 | Major revision: restructure, inject voice, verify facts | 45-90 minutes | Might reach publishable; consider regeneration |
| Below 20 | Discard and regenerate with revised prompt | Regeneration time + new review cycle | New output from improved prompt |
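The matrix translates directly into code. A short sketch under the same assumptions as above, with the band thresholds taken exactly from the table; the action names are illustrative:

```python
def action_for(total: int) -> str:
    """Map a 0-50 total score to the editorial action in the matrix above."""
    if total >= 40:
        return "publish"          # light proofread only
    if total >= 30:
        return "rework"           # targeted edits on weak dimensions
    if total >= 20:
        return "major_revision"   # structural and voice overhaul
    return "regenerate"           # revise the prompt and start over

def weakest_dimensions(ratings: dict[str, int], n: int = 2) -> list[str]:
    """For the rework band, list the n lowest-scoring dimensions to fix first."""
    return sorted(ratings, key=ratings.get)[:n]
```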
Calibration
A rubric is only useful if it produces consistent scores. To calibrate, score 5 pieces of content you already know the quality of: one piece of your own best writing, one piece of writing you admire from someone else, one good AI output, one mediocre AI output, and one obvious slop piece.
Your best writing should score 40+, and so should the admired writing (rate its voice consistency against its author's profile, not yours). Good AI output should land at 28-35, mediocre AI output at 18-27, and obvious slop below 18.
If the scores do not match your intuitive quality ranking, adjust the rubric. Either the dimension definitions are wrong, the scale anchors are wrong, or you are weighting dimensions incorrectly. Calibration is iterative. Expect 2-3 rounds before the rubric reliably matches your judgment.
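One way to automate the consistency check is to encode the expected bands and flag any piece that lands outside its band. A minimal sketch, again assuming Python; the piece names are placeholders for your own calibration set:

```python
# Expected score bands for the five-piece calibration set described above.
CALIBRATION_BANDS = {
    "my_best_writing": (40, 50),
    "admired_writing": (40, 50),
    "good_ai_output": (28, 35),
    "mediocre_ai_output": (18, 27),
    "obvious_slop": (0, 17),
}

def calibration_misses(scored: dict[str, int]) -> list[str]:
    """Return every piece whose rubric score fell outside its expected band.

    An empty list means the rubric matches your intuitive ranking;
    otherwise, adjust dimension definitions or scale anchors and rescore.
    """
    misses = []
    for name, (low, high) in CALIBRATION_BANDS.items():
        if not low <= scored[name] <= high:
            misses.append(f"{name}: scored {scored[name]}, expected {low}-{high}")
    return misses
```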
Using the Rubric in Production
Every piece of content that exits your pipeline should have a score card attached. Not stored separately, not remembered vaguely, but recorded alongside the content in a simple log. Over time, this log reveals patterns: which content types consistently score low, which prompt templates produce the highest scores, and whether your quality is improving or degrading as you scale.
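A plain append-only JSONL file is enough for this log. A minimal sketch, assuming Python; the file name and record fields are illustrative choices, not a prescribed schema:

```python
import datetime
import json
import pathlib

LOG_PATH = pathlib.Path("quality_scores.jsonl")  # illustrative location

def log_score_card(content_id: str, ratings: dict[str, int], action: str) -> None:
    """Append one score card per reviewed piece to an append-only JSONL log."""
    record = {
        "content_id": content_id,
        "date": datetime.date.today().isoformat(),
        "ratings": ratings,             # per-dimension 0-10 scores
        "total": sum(ratings.values()),
        "action": action,               # publish / rework / major_revision / regenerate
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Because each record keeps the per-dimension ratings rather than just the total, you can later group the log by content type or prompt template to surface the patterns described above.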
Google's own Search Quality Evaluator Guidelines use a similar approach. Human quality raters evaluate search results against defined rubrics with specific criteria for each rating level. E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) is a rubric. Your rubric is the content-production equivalent.
Further Reading
- Inside The New York Times's A.I. Toolkit, Investigative Reporters and Editors (2025)
- Google AI Content Guidelines: Complete 2026 Guide, Koanthic
- IMPRESS Best Practice Note: The Use of Artificial Intelligence (April 2025)
- Google Quality Raters Update 2025 Checks AI-Generated Content, SlideShare
Assignment
Build your quality rubric. Define 5 scoring dimensions relevant to your content type (you can use the ones in this session or create your own). For each dimension, define what a 10 looks like and what a 0 looks like. Score 5 pieces of content of varying quality. If the scores do not match your intuitive ranking, adjust the rubric and score again. Repeat until the numbers reflect reality.