The Amazon AI-tokens problem isn’t an Amazon problem.

Measuring token consumption means exactly zero to your customers. So why are companies like Amazon forcing their employees to do it? Measure outcomes instead.

Last week, news broke that Amazon is measuring its employees on how many AI tokens they consume.

Within days, employees responded the way employees always respond to a metric they don’t believe in and are forced to track. They started gaming it. Internal reports describe people writing scripts to run dummy prompts overnight, padding prompts with filler so each one costs more, and asking the AI questions they already knew the answer to, all to keep the token meter running.

The technical name for this is tokenmaxxing. The older, less catchy name is Goodhart’s Law: when a measure becomes a target, it stops being a good measure.

This is not uniquely an Amazon problem.

It is what happens every time a company measures AI adoption and calls it an AI strategy. Adoption is an output. It is a tactic, tool or way of working meant to produce a meaningful change in human behavior (in this case, the behavior of staff). That change in behavior is the outcome, and outcomes are what you’re ultimately trying to produce because that is where the value is. When the metric execs are watching tracks the output, the work that gets done is the work that creates more output, not more value.

The shift that exposed it

For most of the last two decades, the constraint on a team’s output was execution. Building was slow. So we measured the things that produced building: story points, velocity, sprint commits and tickets closed. None of those metrics actually told us whether we built the right thing, but they correlated with shipping, and shipping was the bottleneck and the goal, so the proxy was good enough.

AI broke the bottleneck. The cost of producing an output dropped to nearly zero. And the metrics that used to be useful proxies for questions like “are people working?” or “is the team productive?” became increasingly useless. So leadership reached for the next available proxy: are people using AI? This resulted in metrics for token usage, tool adoption, time spent in Copilot and the number of agents deployed. All the things that are easy to count.

What you measure is what you get. So you get tokenmaxxing.

Five signs you’re already there

  1. You have AI usage targets but no outcome targets paired to them. The dashboard has a number for “tokens consumed this quarter” but nothing for “customer behavior that changed because of it.”
  2. Token consumption charts have started appearing in the all-hands deck. What’s still missing are charts for things like customer behavior and retention.
  3. The opening question in 1:1s has shifted from “how did the work you did impact our customers?” to “how are you using AI?” These are two fundamentally different questions. The first one asks for an outcome. The second one asks for an activity (or output).
  4. New hire intros now lead with the AI tool stack instead of the problem-solving approach. “She’s a heavy Claude user, runs Cursor for IDE work, has a Notion-AI workflow.” Sadly this tells you nothing about whether this person is any good at the actual job of being a product manager.
  5. A “show me your prompts” culture has replaced a “show me your results” culture. Rather than focus on the impact each person’s or team’s work is having on the business, everyone is chasing the same shortcuts so they can seem productive.

The fix is the harder question

The fix is not to stop measuring AI use. The fix is to measure the right AI metric.

A good AI metric has three properties. It points at a customer behavior or business outcome, not a tool activity. It can move in either direction and a thoughtful person can defend the call either way. And it survives the question “would we still care about this number if AI didn’t exist?”

Token consumption fails all three. “Did the customer return next week?” passes all three. “Did the feature ship faster and make the customer’s job easier?” passes all three. “Did the experiment we ran tell us something we didn’t know before?” passes all three.
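If it helps to make the test concrete, here is a minimal sketch of the three properties as a checklist you could run a candidate metric through. Everything in it is an illustrative assumption rather than an established tool: the `MetricCandidate` type, the field names, and the yes/no answers are hypothetical, and in practice those answers come from an honest conversation among the team, not from code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCandidate:
    """A proposed metric plus the team's answers to the three questions.

    Hypothetical structure for illustration; the booleans are judgment
    calls made by people, not values computed from data.
    """
    name: str
    tracks_outcome: bool         # customer behavior or business outcome, not tool activity
    defensible_either_way: bool  # it can move up or down and either call can be defended
    matters_without_ai: bool     # we would still care about it if AI didn't exist


def failed_properties(metric: MetricCandidate) -> list[str]:
    """Return the properties this metric fails, in the order the article lists them."""
    checks = {
        "points at an outcome": metric.tracks_outcome,
        "can move in either direction": metric.defensible_either_way,
        "survives the no-AI question": metric.matters_without_ai,
    }
    return [label for label, passed in checks.items() if not passed]


def is_good_metric(metric: MetricCandidate) -> bool:
    """A metric qualifies only if it fails none of the three properties."""
    return not failed_properties(metric)


# Answers below mirror the article's own verdicts on these two metrics.
tokens = MetricCandidate("tokens consumed this quarter", False, False, False)
returned = MetricCandidate("customer returned next week", True, True, True)

for candidate in (tokens, returned):
    failures = failed_properties(candidate)
    verdict = "keep" if not failures else "drop: fails " + "; ".join(failures)
    print(f"{candidate.name} -> {verdict}")
```

The point of writing it down this way is that a metric has to pass all three checks, not two out of three. A number that tracks a real outcome but only matters because AI exists is still an activity metric in disguise.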

When Karpathy described agentic engineering as “supervising work, inspecting diffs, building evaluation loops,” activities that map almost line for line to product management, the metric implication was there but left unstated. The job is to supervise direction, not to count activity. The same logic applies to how “done” gets defined for AI features now. You specify an outcome distribution, not a checkbox.

The uncomfortable part

Tokenmaxxing is not really an Amazon story. It’s the story of every organization that hasn’t done the harder work of agreeing on what outcome they actually want. Without an outcome, the only thing left to measure is activity. And the only thing employees can do with an activity metric is perform it.

The metricmaxxing crisis predates AI. AI just made it even more painfully visible.

So before you add another “AI usage” chart to your Q3 deck, ask the harder question. What is the customer supposed to do differently on the other side of all this AI activity? If nobody in the room has an answer, the dashboard isn’t the problem. The strategy is.

And no amount of token consumption is going to fix it.
