Training an LLM
Alright, let's pull back the curtain on how these Large Language Models actually get built because understanding the training process helps you know what you're working with when you're using AI in your audits.
Think of LLM training like building the ultimate research assistant from scratch. It starts with data collection and preparation. And we're talking about gathering text from virtually everywhere: books, websites, code repositories, news articles, even Reddit threads. Imagine if you had to read every piece of documentation that ever existed to become an expert on everything. That's essentially what's happening here, except it's happening at internet scale. But here's where it gets interesting for auditors: this raw data needs to be cleaned and processed, just like how you'd clean up a messy general ledger before diving into your testing. The data gets scrubbed for errors, irrelevant junk gets filtered out, and then it's broken down into what are called tokens.
Think of tokens as the building blocks of language that the AI can actually work with. They're like the individual line items in a trial balance - each one has meaning, but together they tell the bigger story. A word like "unreconciled" might get split into separate tokens such as "un" and "reconciled." Each token gets assigned a unique number, kind of like how every GL account gets a unique code.
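If you're curious what that looks like in practice, here's a minimal sketch using the open-source tiktoken tokenizer (one of several in common use). The exact way a word gets split, and the ID numbers assigned, vary from tokenizer to tokenizer, so treat the output as illustrative.

```python
import tiktoken

# Load one commonly used tokenizer; other models use different vocabularies.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("The balance remains unreconciled at year end.")
print(tokens)  # a list of integer token IDs (the exact numbers depend on the tokenizer)

# Map each ID back to the piece of text it represents.
for token_id in tokens:
    print(token_id, repr(enc.decode([token_id])))
```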
Now comes the heavy lifting: pre-training. This is where the model learns patterns in language through something called self-supervised learning. Unlike your CPA exam where you had specific questions with right answers, the LLM learns by playing an endless game of "fill in the blank" with itself. The most common method is next-token prediction (also known as causal language modeling).
Here's how it works: the model sees a sequence like "The client's internal controls over financial reporting are..." and tries to predict what comes next. Maybe it guesses "effective" when the actual next word is "deficient." When there's a mismatch like that, the model adjusts its internal parameters (weights and biases) through a process called backpropagation and gradient descent. Think of it like updating your risk assessment when you find a control deficiency.
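To make that concrete, here's a deliberately tiny sketch of a single training step in PyTorch. It is nothing like a production LLM (real models use transformer layers and vocabularies of 100,000+ tokens), but the loop is the same idea: predict the next token, measure the mismatch, backpropagate, nudge the weights.

```python
import torch
import torch.nn as nn

# A toy "predict the next token" model: embed 4 context tokens, then score
# every token in a tiny 50-token vocabulary as the possible next token.
vocab_size, embed_dim, context_len = 50, 16, 4
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(embed_dim * context_len, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

context = torch.tensor([[3, 17, 42, 8]])   # made-up token IDs standing in for "The client's internal controls..."
actual_next = torch.tensor([29])           # the token that actually came next in the training text

logits = model(context)                    # the model's score for every possible next token
loss = loss_fn(logits, actual_next)        # how wrong was the prediction? (the loss)
loss.backward()                            # backpropagation: work out which weights to blame
optimizer.step()                           # gradient descent: adjust the weights slightly
```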
This adjustment minimizes the loss function, which quantifies how far off the prediction was. Through billions or trillions of such predictions across the massive dataset, the model builds up an understanding of language patterns, grammar, and context, and even picks up factual knowledge embedded in the text. The transformer architecture with its attention mechanisms is what makes this possible. It's like having the ability to simultaneously consider every relevant piece of information when making a judgment, rather than just looking at things in isolation.
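If you want a feel for the attention mechanism itself, here's a minimal, illustrative sketch of scaled dot-product attention, the core operation inside a transformer. Real models add learned projection matrices and run many attention "heads" in parallel; this just shows the basic "every token weighs every other token" idea.

```python
import torch
import torch.nn.functional as F

seq_len, dim = 4, 16                      # 4 tokens, each represented by a 16-number vector
x = torch.randn(seq_len, dim)             # stand-in embeddings for the 4 tokens

q, k, v = x, x, x                         # self-attention: queries, keys, values all come from the same tokens
scores = q @ k.T / dim ** 0.5             # how relevant is every token to every other token?
weights = F.softmax(scores, dim=-1)       # turn relevance scores into weights that sum to 1
attended = weights @ v                    # each token's new representation blends in the tokens it attends to
```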
After pre-training, many LLMs go through fine-tuning to make them more useful for specific tasks. This includes instruction tuning, where the model learns to follow commands more precisely, like training it to respond properly when you ask it to "summarize this contract" versus "identify risks in this contract." Another critical fine-tuning technique is Reinforcement Learning from Human Feedback (RLHF). This is basically quality control by human reviewers. Think of it like having experienced partners review and score the model's responses, then using that feedback to improve future performance. This is crucial for making sure the AI gives helpful, accurate, and professionally appropriate responses, which is especially important when you're dealing with sensitive audit documentation.
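As a rough illustration (the exact schema varies by provider and dataset), instruction-tuning data is just prompts paired with good responses, while RLHF data records which of two candidate responses a human reviewer preferred. The examples below are hypothetical:

```python
# Hypothetical instruction-tuning example: a prompt paired with the response
# we want the model to learn to produce.
instruction_example = {
    "instruction": "Summarize this contract in two sentences.",
    "input": "<full contract text goes here>",
    "response": "The agreement engages the vendor for three years of IT support...",
}

# Hypothetical RLHF preference example: human reviewers ranked two candidate
# answers, and the model is trained to favor the "chosen" style.
preference_example = {
    "prompt": "Identify risks in this contract.",
    "chosen": "Key risks include an uncapped indemnity clause and ...",
    "rejected": "This contract looks fine.",
}
```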
The computational power required for all this is staggering. We're talking about processing datasets larger than any audit firm's entire document repository, using computing resources that cost millions of dollars. The end result is a model that can read your workpapers, understand context, and generate responses that actually make sense in your professional environment.
Here's something crucial for auditors to understand: an LLM is just software code. It can't reach out and grab your files, access your network, or do anything beyond what it's specifically designed to do. It has no memory between conversations. Each time you start a new chat, it's like meeting the AI for the first time.
What looks like "memory" in a conversation is actually the system sending your entire chat history back to the model with each new message. It's like having to re-read an entire email chain every time someone replies.
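Here's a rough sketch of what that looks like under the hood, using the message format most chat APIs follow (exact field names vary by provider). Notice that the full transcript is re-sent with every request:

```python
# The running transcript lives in your application, not in the model.
messages = [
    {"role": "user", "content": "Summarize the revenue recognition memo."},
    {"role": "assistant", "content": "The memo concludes that ..."},
]

# A follow-up question is appended, and the ENTIRE list is sent back to the
# model so it can "remember" what was said earlier.
messages.append({"role": "user", "content": "What risks did it flag?"})

# Hypothetical API call; the real function name depends on the provider's SDK.
# response = client.chat.completions.create(model="...", messages=messages)
```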
Data handling practices vary significantly between AI providers and service tiers. Many providers offer business plans where your data is not used for training, while some free consumer versions may use interactions to improve their models (typically with opt-out options available). It's important to read the terms of service to understand how your specific plan handles data, just like reviewing your engagement letter before starting an audit.
Kansaro will never use your data as training data.