Plausible vs Correct (AI)
In our rush of AI optimism, I’m seeing some echoes of past over-reaches.
1. Personalizing/humanizing systems: using “he/she” instead of “it”. AI apps use sophisticated statistical methods and zettabytes of human-originated samples, sure, but they are still statistical systems. Built by people, for people. We attribute intelligence that isn’t there, because the output is derived from human activity and therefore reads like human activity.
We’ve been making this cognitive mistake for at least 60 years. In 1964, human observers decided that Eliza (a very early chatbot) had human-like feelings and intelligence even though it (“she”) was just parsing and reversing simple sentences. Eliza was plausible.
Mirage, rather than hallucination? In a recent talk, Nate Angell made the case that the word ‘hallucinations’ implies sentience: when we say “this AI system hallucinated” and laugh about the mistake, we’re assigning blame and intent to the machine. A better word would be ‘mirage’, since a mirage requires a human observer to perceive the optical illusion. No humans looking at the horizon means no mirages. Sand isn’t sentient.
2. Plausible vs. Correct. These systems take what we feed them: truth, lies, random text, interesting sentences, bizarre inputs, gibberish, tabloid gossip. Systems don’t “know” what’s true; they retrieve frequent/dominant/voted-up inputs. GIGO. If enough history deniers post enough documents claiming that Lincoln survived Booth’s 1865 assassination attempt and escaped to France, it will take human historians to identify the lie and filter it out. (And the next 10,000 conspiracy theories, Sisyphus.) We get plausible answers, not necessarily correct answers. If we want to train systems on only what’s true, then the burden is on us (humans) to decide what’s true and filter our inputs. Topic by topic.
3. Cherrypicking. If I chat-generate 200 sonnets in Shakespeare’s style and merchandize the one that amuses me the most, I’m giving the impression that the system did this unaided. That the system understands Shakespeare rather than samples his words alongside centuries of (human) scholarship about his sonnets. (“To be, or not to be… trying is the first step towards failure.”)
Why does this matter to product folks?
There are lots of discussions about how AI will replace product managers. IMO, that’s a fundamental misunderstanding of how we add value. We’re not paid to deliver large volumes of generic user stories, stitched together from thousands of sample user stories. (As if there were a generically valuable user story across products and companies and contexts.) We’re paid to have deep insights into the needs of our human end users based on actual human discovery/validation/interviews, to spot new business opportunities or unique feature improvements, to guess at future impact/outcomes, and to push our teams (of humans!) to stay focused and build something truly new.
If a chatbot gives you prompts that help you think, great! But no one actually cares if a story starts with “As a user…” And no one wants a backlog clogged with a statistical blend of other companies’ (out-of-context) user stories. The valuable parts of our work are insights and organizational skills and problem-focused empathy and the ability to anticipate market reactions, not entering tickets.
We (product folks, maker teams, tech companies) are responsible for systems that produce good outcomes and meet our customers’ real-world goals. Regardless of what tech we deploy. So we’re responsible for delivering correct answers to users (not just plausible ones) wherever correctness matters.
For example, chatbots seem like a safe, relatively well-understood bit of tech. But context and the cost of wrong responses matter:
- An ecommerce player wants to help shoppers make better product choices, ideally buying more stuff. That’s very low risk, since a bot would make suggestions that the consumer is free to ignore. And we can compare the quality of AI recommendations against other search methods to see which delivers more relevant (correct) suggestions, since our human users will either buy or not buy the suggested item. No harm done; incremental revenue delivered.
- A medical app might deploy an advice bot with recommended medicines/treatments, but without a medical professional reviewing every result. Done wrong, people may die.
So in the second case, we need ongoing review of medical recommendations by qualified (expert) humans, even if just a statistical sample, and a metric for success. We might know that the reported incidence of medication errors in acute hospitals is ~6.5 per 100 admissions, so an automated system should cut that by at least 50% (to roughly 3.25 per 100 admissions) or not be worth deploying.
(Outsourcing this review to low-wage non-expert reviewers isn’t sufficient, since they don’t know what’s correct/true. Recycling previously correct answers misses newly launched drugs, updated science, newly discovered drug interactions, etc. Waiting for anecdotal patient reports of side effects or death is too slow and inhumane. So I’d expect these systems to degrade over time.)
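To make that success metric concrete, here’s a minimal sketch of the arithmetic: it assumes the ~6.5-per-100 baseline above, a 50% reduction target, and a hypothetical sampled expert review (the sample numbers are made up for illustration).

```python
# Hypothetical sketch: does a sampled expert review of the bot's advice
# clear the "cut medication errors by at least 50%" bar?

BASELINE_PER_100 = 6.5                   # reported errors per 100 admissions (human baseline)
TARGET_PER_100 = BASELINE_PER_100 * 0.5  # 50% reduction => 3.25 per 100

def meets_success_bar(errors_found: int, cases_reviewed: int) -> bool:
    """True if the sampled error rate is at or below the reduction target."""
    observed_per_100 = 100.0 * errors_found / cases_reviewed
    return observed_per_100 <= TARGET_PER_100

# e.g. clinicians review a random sample of 400 bot recommendations and flag 9 errors
print(meets_success_bar(9, 400))  # 2.25 per 100 -> True, clears the bar
```

The point isn’t the code; it’s that someone (human, expert, accountable) has to keep feeding that review loop, or we never know whether the system is still correct.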
Or the ML app that speeds up a bank’s review of mortgages, using 50 years of data reflecting now-illegal redlining. Do we send the software to jail, or are our human execs liable? It’s increasingly important for product folks to understand how these systems work, where their data inevitably falls short, and how they are likely to decay, before we unleash them on the world.
Sound Byte
There will always be shiny new tech. Let’s choose the right tech for the right reasons, focused on what will help us serve our (human) users and customers best.
Originally published at https://www.mironov.com on July 23, 2024.