Generations Are Cheap.
Taste Is Not.
I can generate a thousand drafts of this post in under a minute for less than a dollar. I don’t send it out that way though. The companies, the products, and the sense of good taste are all built by people, not inference. That’s why we’re collectively developing an “AI sense.” It’s triggered when the photo is slightly off, or the grammatical tells start to show up everywhere. It’s not just you.
Cheap Generations, Expensive Discernment
Generations are cheap today. Small models capable of real work go from 10 to 40 cents per million tokens. We’re being flooded with so many half-cooked outputs that we can actually taste what’s still raw. As Meltwater puts it: “Brands that have made craftsmanship and quality a key part of their identities have a huge advantage as public sentiment turns against low-quality genAI content. Emphasizing qualities like expert design, human curation, and interpersonal connection is helping such brands appeal to discerning, AI-slop-averse audiences.”
That shift hasn’t reversed since then; if anything, it’s accelerating.
In a world where generations are cheap and answers are easier than questions, good taste is the value. Retaining your ability to say “no, this simply won’t fly,” despite your agent’s kind disposition, is now a survival instinct.
After the Camera
Before the invention of photography, you would hire a skilled portrait artist and sit for a full day while the artist captured every detail. But once people had instant capture in their hands, billions of pictures could flow from a finger resting on a button. Portrait artists didn’t completely disappear, but the skill of photography deepened with respect to framing, timing, editing, sequencing, and curation.
I’d guess we’re at that stage now: the people working with AI tools today are either developing the perfect lighting techniques or just taking a candid selfie.
What Models Can’t Do
Now that we’re understanding AI capabilities better, we need to take a look at the wide shot. Can matters of taste and judgment be the last bastion of human expertise?
“Tools that once assisted human analysis now take the lead while we sleep, sorting, summarizing, and sifting through what matters. These systems never tire, but they also don’t yet care in the way people do. Caring, for now, remains a human responsibility.” — Harvard Chan, March 2026
You can’t program a machine to care like a human does. A model optimized for ‘correct task completions, as efficiently as possible’ isn’t necessarily thinking about which metrics are most important for you or a client. Sticking another LLM on top to make the decisions for you just kicks the can down the road at double the price.
The science suggests human specialists are still king: when researchers built RubricBench specifically to test whether frontier models could write their own evaluation criteria, they found a “substantial capability gap between human-annotated and model-generated rubrics” (arXiv:2603.01562, 2026). Models can grade against a rubric; they cannot reliably author one. Expert judgment is the input that the whole stack depends on.
Operationalizing Taste
We’ve taken a similar approach in authoring our own rubrics. The methodology is simple: rather than ask “is this output good,” decompose what “good” actually means first, put it into a structured set of criteria an expert can defend, then evaluate against that. This is the operationalization of taste.
Some aspects of taste are generally shared, and that’s what generic models do well. But the only one who can deconstruct what taste means from your perspective is you.
If you’re in leadership, running a business, or heading brand development, this should matter to you. Are you letting it ride on AI self-improvement and accepting generic outputs, or are you authoring your own future?
Frequently Asked Questions
How much have LLM inference costs actually dropped?
For a constant performance level, LLM inference cost has fallen roughly 10x per year since 2022, faster than the PC compute curve and faster than dotcom-era bandwidth (a16z, 2024). Concretely: GPT-4-equivalent quality cost about $20 per million tokens in late 2022; by late 2025 it was closer to $0.40. Small models capable of real work now run 10 to 40 cents per million tokens.
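To make the arithmetic concrete, here is a small sketch that works from the price points quoted above. It computes the yearly decline factor implied by the two endpoints, then estimates what a batch of a thousand drafts costs at small-model prices. The function names are illustrative, and the 1,500 tokens per draft is an assumption of this sketch, not a measured figure.

```python
def annual_decline_factor(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly factor f such that start_price / f**years == end_price."""
    return (start_price / end_price) ** (1.0 / years)


def batch_cost_usd(n_drafts: int, tokens_per_draft: int, usd_per_million_tokens: float) -> float:
    """Cost of generating n_drafts at a flat per-token price."""
    total_tokens = n_drafts * tokens_per_draft
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Price points quoted above: about $20 (late 2022) to about $0.40 (late 2025).
factor = annual_decline_factor(start_price=20.0, end_price=0.40, years=3)

# ASSUMPTION: roughly 1,500 output tokens per draft; input tokens are ignored.
TOKENS_PER_DRAFT = 1_500
low = batch_cost_usd(1_000, TOKENS_PER_DRAFT, 0.10)   # 10 cents per million tokens
high = batch_cost_usd(1_000, TOKENS_PER_DRAFT, 0.40)  # 40 cents per million tokens

print(f"implied decline: about {factor:.1f}x per year")
print(f"1,000 drafts: ${low:.2f} to ${high:.2f}")
```

Under these assumptions a thousand drafts stay well under a dollar. Note that the two quoted endpoints work out to a smaller yearly factor than the roughly 10x headline, so the two figures may cover different model tiers or time windows.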
Can LLMs reliably grade their own outputs?
Not at the level the marketing implies. RubricBench (arXiv:2603.01562, 2026) found a substantial capability gap between human-annotated and model-generated rubrics; even frontier models struggle to autonomously author the evaluation criteria they then grade against. AdaRubric (arXiv:2603.21362) reports that a static LLM-as-judge setup correlates with human judgment at r ≈ 0.46, while task-specific expert rubrics raise that correlation to r ≈ 0.77.
What is rubric-based LLM evaluation?
Rubric-based evaluation decomposes 'is this output good?' into a set of explicit, atomic, weighted criteria, each scored independently. Instead of asking a judge model for a single holistic verdict, you assess specific dimensions tied to the task and aggregate. The methodology imports decades of educational measurement and psychometrics into AI evaluation (Autorubric, arXiv:2603.00077; Stanford SCALE Initiative).
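As a minimal sketch of that decomposition, the snippet below models a rubric as a list of weighted criteria and aggregates independent per-criterion scores into a weighted mean. The `Criterion` class, the `aggregate` function, the three example criteria, and their weights are all hypothetical, chosen for illustration; they are not taken from Autorubric or any other cited tool, and a weighted mean is only one of several reasonable ways to aggregate.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str  # what an expert would check, stated so it can be defended
    weight: float     # relative importance; weights need not sum to 1


def aggregate(rubric: list[Criterion], scores: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each expected in [0, 1]."""
    missing = [c.name for c in rubric if c.name not in scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    for c in rubric:
        if not 0.0 <= scores[c.name] <= 1.0:
            raise ValueError(f"score for {c.name!r} must be in [0, 1]")
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(c.weight * scores[c.name] for c in rubric) / total_weight


# Hypothetical rubric for a brand blog post; an expert would author the real one.
rubric = [
    Criterion("voice", "Reads like our brand, not like a generic assistant", 3.0),
    Criterion("accuracy", "Every factual claim is sourced or verifiable", 2.0),
    Criterion("brevity", "No paragraph restates an earlier one", 1.0),
]

# Each criterion is scored on its own, by a person or by a judge model.
scores = {"voice": 0.5, "accuracy": 1.0, "brevity": 1.0}

overall = aggregate(rubric, scores)
print(f"overall: {overall:.2f}")  # overall: 0.75
```

The point of the structure is that the weak dimension stays visible: a holistic verdict would hide that this draft is accurate and tight but only halfway on voice.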
Why does AI-generated content feel generic?
A generic model is optimized to perform across the average of every domain in its training data, not the specific things your business, voice, or audience care about. Without an explicit specification of what 'good' means in your context, the model defaults to broadly safe outputs. Domain specificity comes from upstream taste, not from the model itself.
What is 'AI slop'?
AI slop is the flood of low-quality, generic AI-generated content saturating feeds, search results, and content platforms. The term gained currency through 2024 and 2025 as generation costs collapsed and platforms saw measurable drops in content quality; Pinterest rolled out filters for AI imagery in October 2025 after sustained user complaints. Slop isn't defined by being AI-generated; it's defined by being undifferentiated.
Can taste be developed, or is it just personal preference?
Taste is articulated judgment, not preference. Paul Graham's argument in 'Taste for Makers' still holds: as you continue to design things your tastes change, and you know they're improving, which means your earlier tastes were not merely different but worse. Taste is a refinable, defensible faculty; in technical contexts it can be made explicit through tools like rubrics.