→ Proof Points

Welcome back {{first_name}},

Did you know that in World War II, Allied cryptographers broke Axis codes by looking for telltales: small statistical patterns that should not have been there?

You would think that in 2026 we would not need to be doing this. The whole promise of AI is that it does the scrutinising for us, sifting through data at a speed and scale no human can match. But what if the AI is not scrutinising at all? What if it is just reading the data, believing it, and passing it along?

That is what this week's deep dive is about.

(P.S. Small plug… Did you know one of my Wikipedia articles is read 1.75 million times a year?)

– Paul

THE LATEST PODCAST
Why Most Healthcare AI Companies Fail

In 2016, Geoffrey Hinton told the world that radiologists would lose their jobs to AI within five years. It's 2026, and we're still short of radiologists. Dr Hugh Harvey has a few thoughts on that.

This week I sat down with Hugh — radiologist, ex-Babylon Health, the man behind Europe's first CE-marked deep learning mammography device, and now regulatory consultant to over 150 medical device companies at Hardian Health. He's seen more AI companies crash into the regulatory wall than probably anyone else in Europe.

We got into class one washing (a strategy that makes your product less useful and eventually gets you caught), why AI-generated regulatory documentation is already a real problem, and the concept he calls "digital thalidomide."

Listen on Your Favourite Platform:

YouTube
Spotify
Apple Podcasts

DEEP DIVE
A Fake Disease Passed Peer Review

I have a soft spot for scientists who debunk things that don't exist. 

I know this because I am one.

Years ago, I noticed something odd. People kept telling me they cried more when watching movies on airplanes. Everyone had a theory… low cabin oxygen, altitude-induced emotions, the loneliness of transatlantic travel. Even Virgin Atlantic started putting "weepy warnings" before sad films.

As an ALS researcher, part of my work has involved studying uncontrollable laughter and crying. So when a made-up condition called "altitude-adjusted lachrymosity syndrome" (AALS) started circulating in the media, I decided to actually test it. My colleague and I surveyed a thousand people who had watched movies both on planes and on the ground, controlling for genre, film quality, and IMDb ratings. 

The finding? If you watched the same film in both settings, you were equally likely to cry. The reason people thought they cried more on planes was simple: on a long-haul flight, you watch 4 to 6 films back to back. You cry at roughly one in four. 

Do the maths, and of course you remember crying on a plane! Most people just don't sit through six films in a row at home.
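
If you want the back-of-the-envelope version, here is a tiny sketch in Python (assuming, purely for illustration, a one-in-four chance of tears at any given film and five films per flight):

    # Chance of crying at least once on a long-haul flight,
    # if each film independently carries a 1-in-4 chance of tears.
    p_cry_per_film = 0.25
    n_films = 5
    p_at_least_one_cry = 1 - (1 - p_cry_per_film) ** n_films
    print(round(p_at_least_one_cry, 2))  # roughly 0.76 — odds are you cried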

It was a fun study, but underneath the fun was a serious point about how heuristics trick us. I thought of all this when I read about bixonomania last week.

Breaking News! A fake disease lands

Almira Osmanovic Thunstrom, a medical researcher at the University of Gothenburg, invented a fictional eye condition: eyelid discolouration and sore eyes, supposedly caused by excessive blue light from screens. She wrote two fake academic papers about it and uploaded them to a preprint server.

The disease? “Bixonomania.”

The paper was not subtle. The funder was listed as the "Professor Sideshow Bob Foundation." One of the acknowledgements thanked a researcher at Starfleet Academy aboard the USS Enterprise. The paper itself literally stated: "This entire paper is made up."

Within weeks, ChatGPT, Google Gemini, Microsoft Copilot, and Perplexity were all confidently diagnosing bixonomania as a real medical condition, complete with clinical explanations and specialist referrals. Then three researchers in India cited one of the fake preprints in a peer-reviewed paper published by Springer Nature, describing bixonomania as an "emerging" condition. 

That paper has since been retracted.

This is not a new phenomenon

There is a long (and occasionally amusing) history of scientists testing the limits of academic publishing. In 1996, physicist Alan Sokal submitted a paper titled "Transgressing the Boundaries: Towards a Transformative Hermeneutics of Quantum Gravity" to Social Text, a cultural studies journal. It was deliberate nonsense: word soup designed to sound impressive while meaning nothing. Yet… it got published. The ensuing scandal, known as the Sokal Affair, raised serious questions about whether some journals were evaluating ideas or just evaluating tone.

Here's what I think: Bixonomania is the Sokal Affair for the AI age. 

What would you believe?

As readers know, I'm a big fan of preprints for burgeoning MedTech and HealthTech startups. (Fun fact: I played a small part in getting them accepted in medicine through my work on the BMJ's editorial board.)

Preprints undergo far less scrutiny than peer-reviewed publications. That is by design… they are works in progress, meant to share findings quickly before the full review process is complete. 

But the principle depends on one assumption: that the reader can distinguish between a preprint and a peer-reviewed paper. 

Human readers generally can. LLMs, it turns out, cannot. Or at least, they are not critical readers. And that matters enormously, because the chatbot is increasingly the primary interface by which people understand health information.

A study published earlier this year in The Lancet Digital Health tested this directly. Researchers probed 20 LLMs with over 3.4 million prompts containing fabricated medical information, embedded in realistic clinical notes, social media posts, and simulated patient vignettes. The results were striking: falsehoods written in authoritative clinical prose were far more likely to be accepted by the models than the same claims written informally. When misinformation sounds like a doctor wrote it, AI believes it. 

There are some benefits

Ask ChatGPT what the best online symptom checker is, and it will very likely tell you Ada Health. I think that is partly because Ada is genuinely excellent, but also because Ada has a substantial body of peer-reviewed publications behind it. The LLMs appear to heavily weight peer-reviewed sources. Which is great… until someone figures out that flooding preprint servers with credible-sounding nonsense is a cheap way to manipulate what chatbots recommend.

What happens when we fabricate data

The bixonomania story is, on some level, funny. Any author who manages to squeeze in references to The Simpsons and Star Trek wins my upvote. But a separate investigation published in Nature in the same week tells a far less amusing version of the same story.

Researchers identified two widely used health datasets on Kaggle — a competition platform for data science — that appear to have been fabricated. One claimed to contain data from 5,000 stroke patients. The other: 100,000 diabetes patients. Together, they have been used in 124 peer-reviewed papers to train machine-learning models that predict disease risk. At least two of those models have been deployed in hospitals, in Indonesia and Spain.

The diabetes dataset had a telltale sign that the data was fake. Among 100,000 supposed patients, only 18 distinct blood glucose values appeared. In real patient data, you would expect a near-continuous distribution: 3.2, 4.7, 5.6, 8.1. Instead, the data showed 2.4, 2.4, 4.3, 4.3, 4.3. Repeating. That is not what real biology looks like.
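
If you wanted to run that sanity check yourself, a minimal sketch might look like this (the DataFrame and the "blood_glucose" column name are hypothetical, and the 1% threshold is my own rough rule of thumb, not something taken from the Nature investigation):

    import pandas as pd

    def distinct_value_telltale(df: pd.DataFrame, column: str, min_ratio: float = 0.01) -> bool:
        # Count how many distinct readings appear in a numeric column.
        n_rows = len(df)
        n_distinct = df[column].nunique()
        print(f"{column}: {n_distinct} distinct values across {n_rows} rows")
        # Real clinical measurements are near-continuous, so far fewer distinct
        # values than ~1% of rows (say, 18 out of 100,000) deserves a closer look.
        return n_distinct < max(2, int(min_ratio * n_rows))

    # Hypothetical usage: would flag the fabricated-looking dataset described above.
    # suspicious = distinct_value_telltale(patients_df, "blood_glucose")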

Think of it this way: if a self-driving car's map data were wrong, it might drive you off a cliff. That is an easy risk to understand. A disease-risk model trained on fabricated patient data is the medical equivalent, and it deserves to be seen in the same light.

As a scientist, I know when I am being spoofed about clinical research. I can spot the telltales. But what if my chatbot was giving me information that was this wrong about something I’m not an expert in, like my personal taxes? Parenting a teen? Or prepping for my next job interview? What if the misinformation was in a domain where I couldn’t easily validate the approach?

That is the question that should keep all of us thinking. Not whether we can trust AI, but whether we have any way of knowing when we should not.

— Paul

FROM OUR DESK
This Month at ProofStack

We hit 100 subscribers to Proof Points two weeks ago! And I am closing in on 10,000 followers on LinkedIn! Thank you for being here from the beginning.

Behind the scenes, I met with the MHRA — the UK's medicines and devices regulator — last week as part of my work with the University of Birmingham. The topic was digital mental health interventions: whether wellness apps are sufficiently regulated, overregulated, or not regulated at all, and where tools like OpenAI's products sit within that framework. It was a closed-door conversation, but I will share what I can in a future issue!

UPCOMING CONFERENCES
What We're Attending

I am preparing a keynote for the Hardian Summit on optimising the alignment between three types of evidence: the evidence regulators want to see, the evidence the market wants to see, and the evidence your academic training taught you to produce.

The Hardian Summit is one of the few spaces where regulators and innovators are in the same room, which tends to make the conversation more honest than most conferences. Let me know if you’ll be there and come say hi!

This was a longer deep dive than usual, and I hope you enjoyed it! I have a ton of thoughts about AI in the HealthTech industry, and if any of this resonates with you, I hope you'll reply to me here.

Thanks for reading,

Paul Wicks, PhD
Founder & CEO, ProofStack Health
Move Fast. Prove Things.

P.S. Here are 3 ways I can help you: 

  1. Take the Evidence Scorecard Quiz. Answer 15 questions and we’ll send you a personalised report with feedback tailored to your specific needs.

  2. Follow or connect with me on LinkedIn. I publish top resources and in-depth insights related to building your evidence stack.

  3. Book a strategy session. Uncover the gaps in the evidence and marketing for your Digital Health/MedTech startup.

Keep Reading