
Expert system or system expert?

June 3, 2025

In this article, James unpacks a common misconception about generative AI: that it can deliver powerful workplace insights straight out of the box. He explains why truly responsible use of these tools demands deep system expertise, something most organisations don’t have. Enter Audiem: an expert system designed to do the heavy lifting for you, making it easier to extract meaningful, reliable insights without becoming a data scientist.

James Pinder
Director and Co-Founder

Back in 2018, while Ian Ellison and I were working as workplace consultants, we came up with the idea for Audiem. We wondered if we could use natural-language processing to help us make sense of the free-text employee feedback we were collecting on our consultancy projects.

Fast‑forward a few years, and we launched Audiem in the summer of 2022 with Chris Moriarty, following a year or two of behind‑the‑scenes development work. In November of that year, ChatGPT was launched by OpenAI, an event that sparked the current cycle of interest in artificial intelligence (AI).

Conversational chatbots such as ChatGPT are amazing tools. When used appropriately, they can boost productivity by saving time on routine or creative tasks, such as writing and brainstorming ideas. I use them regularly in my day‑to‑day work and so do my colleagues.

Occasionally, people will ask us if ChatGPT and similar chatbots can do what Audiem does. I can see why people ask this question - after all, they both involve AI - but it also reveals a lack of understanding of how chatbots (and the large language models, or LLMs, that underpin them) work, and what they are and aren't good at.

Large language models currently play only a small role in Audiem’s data‑processing pipeline - we use one for a single, discrete task. We’ll be using more LLMs in the future, but the heavy lifting in our data analysis is - and will continue to be - carried out by our custom‑trained text‑classification models.

Right now, we use 22 of these models to tag sentences according to sentiment and the different parts of the Workplace Mix. We’re about to launch model number 23, which I’ll tell you about in a future article. These models take a long time to train and fine‑tune, but once deployed they are very fast, very accurate and consistent.

When I talk about consistency, I mean getting the same outputs from the same inputs, time after time. Feed the same dataset through our pipeline multiple times and you’ll get the same sentiment scores each time, because our sentiment models are deterministic. The outputs would only change if we upgraded the models.
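
To make that concrete, here’s a minimal sketch of what determinism means in practice. It uses an off‑the‑shelf sentiment model from the Hugging Face transformers library - not one of Audiem’s custom‑trained models, so the model name is purely illustrative:

```python
from transformers import pipeline

# An off-the-shelf sentiment model (illustrative; not one of Audiem's
# custom-trained classifiers). With fixed weights and no sampling,
# inference is deterministic: same input, same output.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

comment = "The new office layout makes collaboration so much easier."

# Run the same sentence through the model several times: the label
# and score are identical on every pass.
for run in range(1, 4):
    result = classifier(comment)[0]
    print(f"Run {run}: {result['label']} ({result['score']:.6f})")
```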

Now let’s try to do the same with a chatbot/LLM. I uploaded 948 comments from one of our demo datasets to ChatGPT o3 (one of OpenAI’s “smartest and most capable models”) and gave it this prompt:

 “I’ve attached some employee feedback. Tag each comment by sentiment (very positive through to very negative) and confidence score and output as an Excel file”

The confidence score is how certain the LLM is that the label it has chosen is the correct label.
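
For anyone who wants to try something similar, here’s a rough sketch of that request made through OpenAI’s Python SDK rather than the ChatGPT interface. The model name and the comments are illustrative; in practice you’d load the full dataset from a file:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative comments; in practice, load all 948 from the dataset.
comments = [
    "The new breakout spaces are great for informal catch-ups.",
    "It's impossible to find a quiet desk after 9am.",
]

prompt = (
    "I've attached some employee feedback. Tag each comment by "
    "sentiment (very positive through to very negative) and "
    "confidence score:\n\n"
    + "\n".join(f"- {c}" for c in comments)
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; the experiment above used ChatGPT o3
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```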

I then repeated this in three separate chats, with the same input data and the same prompt each time. Here’s how the three chats compared, comment by comment:

[Chart: comment‑by‑comment comparison of sentiment labels across the three chats]

And even when comment‑level labels matched between chats, the confidence scores more often than not differed.

Hmmm… not exactly confidence‑inspiring, is it? If you’re using a chatbot to analyse sentiment in different datasets, how can you confidently compare like with like? How would you even know which results to use?

This happens because LLMs are probabilistic: every time they generate text, they sample from a distribution of possible next words, so identical inputs can produce different outputs - and therefore different sentiment labels and confidence scores - from one run to the next.
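
A toy example makes the mechanism concrete. The distribution below is invented purely for illustration (real models sample over tens of thousands of possible tokens), but the principle is the same:

```python
import random

# An invented next-token distribution, e.g. for the unfinished
# sentence "Overall, the feedback was ..." (illustrative only).
tokens = ["positive", "negative", "mixed", "neutral"]
weights = [0.45, 0.30, 0.15, 0.10]

# Probabilistic generation: each run samples from the distribution,
# so the same input can produce different outputs.
for run in range(1, 4):
    sampled = random.choices(tokens, weights=weights, k=1)[0]
    print(f"Run {run} (sampled): {sampled}")

# A deterministic classifier always takes the highest-probability
# option instead, so its output never varies between runs.
print("Deterministic (argmax):", tokens[weights.index(max(weights))])
```

(Turning a model’s temperature down to zero pushes it towards the most likely token every time, but chat interfaces don’t expose that control, and even API calls at temperature zero aren’t guaranteed to be perfectly repeatable in practice.)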

Now, at this point you could argue that the prompt could be improved, so that the guidance is more specific. Or perhaps you could feed the results from one LLM into another LLM for checking. All good ideas, but it’s suddenly starting to become rather complicated, isn’t it?
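
As a sketch of that second idea - a hypothetical two‑pass chain, again using OpenAI’s Python SDK with an illustrative model name - the checking step might look something like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_labels(labelled_output: str) -> str:
    """Hypothetical second pass: ask another model to review the
    first model's sentiment labels. Illustrative only."""
    review_prompt = (
        "Review these sentiment labels for employee feedback and flag "
        "any you disagree with, giving a short reason:\n\n"
        + labelled_output
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative reviewer model
        messages=[{"role": "user", "content": review_prompt}],
    )
    return response.choices[0].message.content
```

Of course, the reviewing model is probabilistic too, so you now have two sources of run‑to‑run variation instead of one.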

And we’re only talking about sentiment here, arguably one of the more established areas of natural‑language processing.

Relying on LLMs for this sort of analysis means that you need to become a system expert - essentially a data analyst - just to make sense of your free‑text data. This runs completely counter to why we created Audiem: we wanted anyone to be able to get insights from free‑text feedback, easily and quickly.

Audiem is, in many ways, an expert system. Not in the traditional sense of using hard‑coded rules to make decisions, but because it:

  • Embodies decades of collective workplace knowledge and expertise
  • Does the analytical heavy lifting for you, tagging and organising data so that you can make sense of it

Over time Audiem’s expertise will improve as we add new functionality and features. The release of Audiem 2.0 next month will be a key step in that direction.

If you’ve made it this far, I hope this article has given you some food for thought. If you want to find out more about what Audiem does and how it does it, get in touch at hello@audiem.io.
