A Friendly Walkthrough of How Large Language Models Are Trained
- Sam Schriemer
A simple explanation of the pre-training phase (no math panic required). This post draws on Andrej Karpathy’s Deep Dive into Large Language Models and summarizes selected aspects of the pre-training process. Karpathy is an AI researcher and former OpenAI and Tesla engineer, widely known for his educational work on neural networks.
In this blog post, we’ll recap the pre-training portion of the video - the part that often triggers reactions like “Wait… numbers, symbols, tokens?” or “What does training even mean?” Our goal is to demystify this process and give you a general overview of what’s happening: no equations, no machine learning degree required, and nothing heavier than a few tiny, optional code sketches along the way. Just a clear, human explanation of how large language models get started.
What Does “Pre-Training” Actually Mean?
When people talk about training a large language model, they’re usually talking about pre-training. This is the very first and most foundational phase of building an LLM. During pre-training, the model isn’t answering questions or acting like an assistant yet. Instead, it’s simply learning how language works by reading a massive amount of text and spotting patterns.
In the video, this entire process is broken down into three big steps: gathering internet text, converting that text into a format a computer can understand, and then training a neural network to predict what comes next in a piece of text. While that might sound intimidating, the underlying idea is surprisingly simple.
Gathering and Cleaning Internet Text
Everything starts with data. To teach a model how language works, you need an enormous amount of text. That text comes from the internet and is collected using web crawlers. These crawlers download the raw HTML of websites, which includes everything from articles and blog posts to navigation menus, ads, and styling code.
Of course, a language model doesn’t need to learn from cookie banners or website footers. So before the text can be used, it goes through heavy cleaning and filtering. This step removes low-quality sources, spam, adult or scam content, and anything that isn’t meaningful written language. The visible text is extracted, unnecessary formatting is stripped away, and personally identifiable information like phone numbers or addresses is removed.
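To make the cleaning step concrete, here is a minimal sketch of extracting only the visible, human-readable text from a page. It assumes the BeautifulSoup library and a made-up snippet of HTML; real pipelines, like the one behind the FineWeb dataset shown below, use far more filters and heuristics than this.

```python
from bs4 import BeautifulSoup

# A made-up page: one meaningful paragraph surrounded by boilerplate.
raw_html = """
<html><body>
  <nav>Home | About | Login</nav>
  <p>Large language models learn patterns from text.</p>
  <footer>Cookie notice - we use cookies.</footer>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Drop structural elements that carry no meaningful language.
for tag in soup(["nav", "footer", "script", "style"]):
    tag.decompose()

# Keep only the visible text.
print(soup.get_text(separator=" ", strip=True))
# -> Large language models learn patterns from text.
```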
[Image: example rows from the FineWeb dataset]
You can think of the image above as showing large amounts of web text that have already been filtered in many ways. Each row represents content pulled from a website and cleaned so that only the meaningful, human-readable text remains. Once this filtering is complete, all of that text is merged into one massive collection, resulting in a continuous stream of language like the example shown below.
[Image: the cleaned text merged into one continuous stream of language]
Turning Text into Tokens
Once all of this text has been collected and cleaned, the next question is how to actually feed it into a neural network. Neural networks do not understand documents, paragraphs, or even words. They expect a single, long, one-dimensional sequence of symbols. At the most basic level, everything a computer processes becomes combinations of just two symbols, 0 and 1. While this works, it creates extremely long sequences, and sequence length is a limited resource. Instead of working directly with raw bits, we use smarter representations that capture the same text with more meaningful symbols and shorter sequences. This is the bridge between human-readable text and machine-readable input, and it starts by treating all that cleaned internet data as one continuous stream of text, like in the image below.
[Image: cleaned web text laid out as a single continuous sequence]
To make these long sequences more manageable, the text is first broken down into bytes: each byte groups eight bits into one symbol, trading a slightly larger alphabet (256 possible values) for a sequence one-eighth as long. From there, it helps to stop thinking in terms of numbers and instead think of each unit as a unique symbol, almost like an emoji the model can recognize. Tokenization builds on this idea by grouping together symbols that frequently appear next to each other. Using a technique called byte-pair encoding, the tokenizer repeatedly merges the most common adjacent pair of symbols into a new, larger symbol, shortening the overall sequence while preserving the information. This process continues until the text is represented as a sequence of tokens, each with its own unique ID.
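If you're curious what those merges look like, here is a toy sketch of the core byte-pair encoding loop. It is deliberately simplified - real tokenizers train their merges on huge corpora and apply many special rules - but the mechanic of "find the most frequent adjacent pair, replace it with a new symbol" is the same.

```python
from collections import Counter

def most_common_pair(ids):
    """Find the adjacent pair of symbols that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the new symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes (values 0-255), then repeatedly merge the most
# frequent adjacent pair into a brand-new symbol.
ids = list("hello hello hello".encode("utf-8"))
for step in range(3):
    pair = most_common_pair(ids)
    ids = merge(ids, pair, 256 + step)  # new token IDs start after the byte range
    print(f"merged {pair} -> {256 + step}, sequence length now {len(ids)}")
```

Each merge makes the sequence a little shorter and the vocabulary a little bigger, which is exactly the trade-off tokenization is making.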
The model never sees the original text, only these IDs.
Tools like Tiktokenizer make this process visible, but the key idea is simple: tokenization is a translation step that turns human language into a compact, structured sequence of symbols that a neural network can learn from.
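If you want to see real token IDs without leaving your editor, OpenAI's tiktoken library exposes the same tokenizers the GPT models use (assuming you have it installed; "gpt2" is one of several available encodings):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the tokenizer used by GPT-2

ids = enc.encode("Hello world, this is tokenization!")
print(ids)              # the integer IDs the model actually sees
print(enc.decode(ids))  # round-trips back to the original text
```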
Learning by Predicting What Comes Next
Once the text has been converted into tokens, training can begin. The model is shown a short sequence of token IDs, which act as its context. You can think of this as giving the model the first few words of a sentence and asking it to guess what comes next. The input to the neural network is simply a sequence of tokens, and the output is a prediction of the next token in the sequence.
At the start of training, these predictions are essentially random. For each input sequence, the model assigns probabilities to every token it knows, making an initial guess about which one is most likely to come next. Because the correct next token is already known from the dataset, the model’s guess can be compared to the right answer. If the correct token has a low probability, the model is adjusted so that next time it assigns a higher probability to that token and lower probabilities to the others.
[Image: the model’s predicted probabilities compared against the known correct next token]
This cycle of guessing, comparing, and correcting happens repeatedly for every token across the entire dataset, all in parallel. Over time, the model’s predictions begin to better match the patterns found in real language. This repeated process is what we mean when we say the model is learning to predict what comes next.
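For readers who want a peek under the hood, here is a heavily simplified sketch of one guess-compare-adjust step in PyTorch. The tiny embedding-plus-linear model stands in for a real transformer, and the token IDs are arbitrary; only the shape of the loop matters.

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 50_000, 64

# A deliberately tiny stand-in for the real network: embed the last token,
# then score every token in the vocabulary as a possible next token.
embed = torch.nn.Embedding(vocab_size, embed_dim)
head = torch.nn.Linear(embed_dim, vocab_size)
opt = torch.optim.SGD(list(embed.parameters()) + list(head.parameters()), lr=0.1)

context = torch.tensor([15496, 995])    # token IDs acting as context
target = torch.tensor([11])             # the known correct next token

logits = head(embed(context[-1:]))      # guess: a score for every possible next token
loss = F.cross_entropy(logits, target)  # compare: low probability on the target -> high loss

opt.zero_grad()
loss.backward()  # correct: work out which way to nudge each parameter...
opt.step()       # ...and nudge, so the right token gets more probable next time
```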
What’s Inside the Neural Network?
Behind the scenes, the model’s behavior is controlled by billions of adjustable values called parameters. These parameters determine how input tokens are transformed into output predictions. At the start of training, they are set randomly, which is why the model’s early predictions are poor.
[Image: a sample of the raw parameter values inside a network]
Each parameter plays a very small role, but together they form a massive mathematical system that shapes the model’s outputs. As training progresses, these parameters are gradually adjusted so the model becomes better at producing accurate predictions. Karpathy’s video compares them to knobs on a DJ mixing board: turning one knob slightly changes the sound, and adjusting many knobs together can dramatically improve the result.
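You can get a feel for the sheer number of knobs with a few lines of PyTorch. Even this throwaway two-layer network has tens of thousands of parameters, and every one of them starts out as a random number:

```python
import torch

# A small network: every weight and bias below is one "knob".
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # 33,088 knobs in this toy; LLMs have billions
print(model[0].weight[0, :5])      # five of them: random values before any training
```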
From Training to Inference
Once training is complete, the model stops learning and moves into a phase called inference. This is the part we interact with as users. To generate new text, the model starts with a small set of prefix tokens, usually the words you type into a prompt. These tokens are fed into the network, which produces a probability distribution over all possible next tokens. Rather than always picking the most likely option, the model samples from this distribution, similar to flipping a biased coin. This introduces variation and makes responses feel natural rather than repetitive.
Each time a token is sampled, it is added to the sequence and fed back into the model, allowing the process to repeat. Token by token, the model builds a response while following the patterns it learned during training. From a user’s perspective, inference simply feels like the model “thinking” and typing out an answer. Under the hood, it’s just repeatedly predicting and sampling the next token based on probabilities learned from vast amounts of text.
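Stripped of everything else, the inference loop looks roughly like this sketch, where `model` stands for any trained network that returns a score for every token in the vocabulary at each position:

```python
import torch

def generate(model, prefix_ids, n_new_tokens):
    """Sample tokens one at a time, feeding each back in as new context."""
    ids = list(prefix_ids)
    for _ in range(n_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]    # scores for the next token
        probs = torch.softmax(logits, dim=-1)         # turn scores into probabilities
        next_id = torch.multinomial(probs, 1).item()  # the biased coin flip
        ids.append(next_id)                           # append and repeat
    return ids
```

Because `torch.multinomial` samples rather than always taking the top-scoring token, the same prompt can produce different responses on different runs.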
Why This Takes So Much Computing Power
Training a large language model is not a one-time calculation. It is a long, iterative process where the model is updated millions or even billions of times. Each training step makes a tiny adjustment to the neural network based on how well it predicted the next token. Researchers monitor this process using a metric called loss, which is a single number that summarizes how well the model is performing at that moment. A lower loss means the model’s predictions are improving. As training runs, the goal is simply to see that loss gradually decrease over time.
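Concretely, the standard loss for next-token prediction is the average of -log(probability the model gave the correct token). A tiny worked example, with made-up probabilities:

```python
import math

# Probabilities the model assigned to the correct next token at three positions.
p_correct = [0.02, 0.25, 0.90]

# Average negative log probability: large when the right tokens were unlikely,
# close to zero when the model was confident and correct.
loss = sum(-math.log(p) for p in p_correct) / len(p_correct)
print(f"loss = {loss:.3f}")  # about 1.80 here
```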
Because these updates are happening constantly and across massive amounts of data, this kind of training cannot realistically be done on a laptop. The models are too large, and the calculations are too intensive. Instead, training runs in the cloud on specialized hardware called GPUs, which are particularly well suited for the kind of parallel math neural networks require. Each line you see in a training log represents one small update to the model, and collectively, millions of these updates slowly shape the network into something useful.
To scale this process, GPUs are grouped together into machines, machines are grouped into clusters, and clusters are housed in large data centers. The more GPUs you have, the more data you can train on and the faster the model can improve. This is why demand for GPUs has skyrocketed and why training cutting-edge language models is something only large organizations can typically afford. Once training is complete and the loss has stabilized, the final model can be released and used for inference, which is the much lighter-weight process users interact with.
Recap: The Big Picture (and What Comes Next)
When you zoom out, pre-training follows a simple loop. Collect and clean large amounts of text, break it into tokens, then train a neural network to predict what comes next, adjusting it each time it’s wrong. Repeat this at a massive scale until the predictions become consistent and strong.
By the end of pre-training, the model has not memorized the internet. Instead, it has learned the patterns of language, which are compressed into its parameters. This stage accounts for most of the time, cost, and computing power involved in building a model and can take months.
After that comes post-training, where the model is turned into an assistant. Instead of predicting internet text, it is trained on conversations between humans and assistants, learning how to respond in a helpful and conversational way. This process still relies on tokens and neural network training, but it is far lighter than pre-training and can take hours rather than months.
We’ll dive deeper into post-training, conversations, and how assistants are shaped in the next post.
