All models are wrong, but some are useful
When Models Fail: What Process Safety Can Learn from LLM Hallucination
The title of this post is taken from a 1976 paper by the statistician George Box.
As regular readers know, we are publishing a series of posts based on the well-known and highly regarded Process Safety Beacon series, published by the Center for Chemical Process Safety (CCPS).
PSM Elements
The aim of our analyses is to identify the elements of Process Safety Management (PSM) that are most applicable to each reported incident. (The current index of posts is provided at Process Safety Beacon Index.)
For example, the analysis of the October 2025 Beacon Refinery Cooling Tower Explosion and Fire suggests that the following elements were particularly pertinent:
Element 9. Safe Work Practices
Element 10. Asset Integrity / Reliability
Element 14. Operational Readiness
Although the post was not written by a Large Language Model (LLM), it could have been, and the output would still have been useful because the analysis is subjective. In other words, LLMs are useful when asked for an opinion. However, when we ask them to provide accurate information, we are skating on much thinner ice.
To test this statement, go to your favorite LLM and enter the following prompt:
Using the CCPS process safety elements, state which of them is the ‘most important’.
My guess is that you will disagree with its response, which is good. A useful conversation has started.
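If you prefer to run this experiment programmatically, here is a minimal sketch. It assumes the OpenAI Python SDK with an API key already set in the environment; the model name is a placeholder, and any chat-capable LLM endpoint would serve equally well.

```python
# Minimal sketch: send the same subjective prompt to an LLM and compare its
# answer with your own judgment. Assumes the OpenAI Python SDK is installed
# and OPENAI_API_KEY is set; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Using the CCPS process safety elements, "
    "state which of them is the 'most important'."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whichever model you use
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```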
Raw Information
There is another benefit provided by the Beacons and similar reports from other authoritative sources. They provide raw, unadulterated, unprocessed data that is not curated by a model: in this case, the CCPS 20-element model. That model is useful, but it will not always provide complete insights.
Raw data preserves the contradictions that models discard
Which brings the conversation back to LLMs. Like people and organizations, LLMs tend to discard confusing, inexplicable, long-tail or outlying data. Hence, one of the fundamental difficulties with these tools is ‘hallucination’ ― a term that can be defined as follows,
A model saying something that sounds fluent and plausible, but is false, unsupported, or logically inconsistent.
There are many causes of hallucination, but one of the most important is the tendency to smooth out missing, noisy, conflicting, or low-quality data. Such data may be under-represented in the LLM’s training corpus, so it is effectively ‘forgotten’. (The second-order effect of this phenomenon is that LLMs increasingly feed on their own output, creating a self-reinforcing feedback loop. The risk associated with this ‘self-training’ will grow as AI-generated content becomes a large fraction of what’s scraped from the web, until published PSM material trends toward what is sometimes called ‘AI sludge’.)
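That narrowing effect is easy to demonstrate with a toy simulation (an illustration of the statistics only, not a claim about any real training pipeline): fit a distribution to data, resample from the fit, discard the tails, and repeat. After a few ‘generations’ the spread collapses, which is the skeleton of what researchers call model collapse.

```python
# Toy illustration of model collapse. Each "generation" is trained only on
# the previous generation's output, with outliers trimmed away, so the
# long-tail information steadily disappears.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # the "authentic" data

for generation in range(1, 6):
    mu, sigma = data.mean(), data.std()
    synthetic = rng.normal(mu, sigma, size=10_000)  # the model's output
    # Keep only the "plausible" middle, discarding rare or confusing points.
    lo, hi = np.percentile(synthetic, [5, 95])
    data = synthetic[(synthetic >= lo) & (synthetic <= hi)]
    print(f"generation {generation}: standard deviation = {data.std():.3f}")
```

The printed standard deviation shrinks every generation: the model has not become wrong about the average, it has simply forgotten that the extremes ever existed.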
So, when reading reports from the CCPS and other authoritative bodies, we should always be looking for facts and data that don’t seem to make sense, or that cannot be neatly pigeon-holed into someone’s model.
Which takes us back to the Cooling Tower Explosion report (which I selected at random before starting this post). One conclusion in the report reads,
Cooling towers are typically seen as a low-hazard process since they “only” process water.
This is a useful insight.
Training data (whether human or LLM) tells us that cooling water is ‘good’ because it removes energy from a potentially hazardous situation. We can also say that ‘cooling towers are safe because they cannot catch fire’. (In practice, anyone with extensive PHA experience knows to slow the discussion when it comes to cooling towers, which are surprisingly hazardous.)
Conclusion
The message of this post is not about cooling towers or the validity of the 20-element CCPS model. The message is to always look at the raw data, and then to spot facts that ‘don’t make sense’ or that don’t fit a preconceived pattern.
Here are some issues to consider when reading Beacons and similar reports.
Model Does Not Fit The Data
Did people behave in a way the procedures don’t anticipate?
Was there an interaction of systems (control, maintenance, contractors, utilities) that our bow-ties and P&IDs failed to depict?
Was the initiating cause something we would classify as ‘non-credible’?
Unidentified ‘Raw Signals’
Unusual alarms, workarounds, missing tools, overtime, staffing, conflicting priorities.
Informal practices (“that valve always sticks”, “we always bypass that trip at startup”).
Smoothing Safety Data
How many near misses never get reported because they were resolved on the spot?
How much detail is lost when a near miss is summarized into a single category in a database? (See the sketch after this list.)
Do investigation reports create a neat and linear narrative when the incident itself remains confusing, messy and misunderstood?
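The loss that occurs when a near miss is reduced to a single category is easy to picture with a small, entirely hypothetical sketch. The reports, keyword rules, and category names below are invented for illustration, but the pattern (free text in, a single label out) is the common one.

```python
# Hypothetical illustration: three different near-miss reports collapse into
# three database categories, and the detail that made each one interesting
# (the raw signal) is no longer recoverable from the counts.
reports = [
    "Relief valve lifted briefly during startup; operator bypassed the trip.",
    "Cooling tower fan vibration alarm ignored because 'it always does that'.",
    "Contractor isolated the wrong pump; error caught by a second operator.",
]

def categorize(report: str) -> str:
    # Crude keyword rules standing in for a real incident-database schema.
    if "bypass" in report or "trip" in report:
        return "Safeguard bypassed"
    if "alarm" in report:
        return "Alarm management"
    return "Human error"

summary: dict[str, int] = {}
for report in reports:
    label = categorize(report)
    summary[label] = summary.get(label, 0) + 1

print(summary)  # {'Safeguard bypassed': 1, 'Alarm management': 1, 'Human error': 1}
# The counts survive; the context and contradictions in the raw text do not.
```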
Ten-Question Quiz
The following quiz provides a quick way of checking your understanding of LLM limitations as they apply to the analysis of incident information.
Questions
1. What is the most accurate definition of hallucination in large language models (LLMs)?
a. A model refusing to answer difficult questions
b. A model generating fluent but factually incorrect or fabricated information
c. A model failing to load its training data
d. A model using excessive computing resources

2. Which factor most directly contributes to hallucination in LLMs?
a. Training on high-quality curated data only
b. Optimization for next-token prediction rather than truth
c. Use of too many external grounding tools
d. Over-reliance on numerical data

3. What is ‘model collapse’ in the context of LLM training?
a. A model that stops working due to hardware failure
b. Degradation of model quality when future models are trained mostly on earlier model outputs
c. A temporary bug that occurs during fine-tuning
d. A model that is too large to run on small devices

4. Why does training on synthetic (AI-generated) data risk degrading future model performance?
a. Synthetic data is always inaccurate
b. AI output uses too many tokens
c. Synthetic data increases training cost
d. Synthetic data tends to eliminate outliers and rare patterns, narrowing the distribution

5. What term is used to describe raw, messy, uncurated information that includes outliers and contradictions?
a. Clean data
b. Structured data
c. Authentic data
d. Synthetic baseline data

6. Which phenomenon in process safety most closely parallels LLM hallucination?
a. Excessive redundancy in safety instrumented systems
b. The ‘paper plant’ that exists in procedures and dashboards instead of the real operating conditions
c. Routine equipment inspections
d. ISO 9001 quality audits

7. What organizational risk is analogous to model collapse in AI?
a. Over-reliance on external consultants
b. Using only sanitized incident summaries and ignoring raw events and weak signals
c. Installing too many alarms in the control room
d. Implementing new maintenance software

8. Why is it important to examine ‘unusual’ or ‘unexpected’ details when reviewing Process Safety Beacons or incident reports?
a. They are usually caused by operator error
b. These details help justify capital projects
c. They often represent rare failure modes that can be overlooked when data is over-smoothed
d. They are required for regulatory compliance

9. Which of the following is a practical way to prevent ‘organizational hallucination’ in process safety?
a. Eliminate near-miss reporting to reduce noise
b. Rely only on annual PHA results
c. Encourage everyone to report contradictions and ‘things that don’t make sense’
d. Remove historical incident data older than five years

10. What is the recommended mindset when using PHA, LOPA, or bow-tie models in process safety?
a. Treat them as complete and definitive descriptions of plant reality
b. View them as simplified models that are useful but inevitably incomplete
c. Replace them annually to ensure accuracy
d. Use them only to satisfy audit requirements
Answer Key
1. b
2. b
3. b
4. d
5. c
6. b
7. b
8. c
9. c
10. b