Diffusion-Based LLMs (dLLMs) and LLaDA
Conventional autoregressive (AR) LLMs are Transformer-based models that generate text left-to-right and apply triangular causal masking to prevent access to future tokens. Diffusion-based language models (dLLMs), such as LLaDA, are also typically Transformer-based architecturally, but they differ critically in their attention masking: they do not use a causal mask, allowing each position to attend to the entire input context.
What does this change? Without the left-to-right constraint, dLLMs can model bidirectional dependencies. The core Transformer architecture remains, but the underlying approach to probability modeling differs: the model can evaluate the full context around any position in the text and predict (i.e., infill) the masked spans that need to be completed.
How Are These Models Trained?
Traditional LLMs are trained under the principle of maximum likelihood, using the next-token prediction objective. At each step, the model receives all previous tokens as input and attempts to generate the following one. This autoregressive (AR) paradigm factorizes the joint probability distribution via the chain rule. More specifically:
Autoregressive models decompose the joint probability of a sequence into conditional probabilities of the “current token given the past context” at each step. This is, in fact, a direct application of the chain rule from probability theory.
For a sequence of random variables x1, x2, …, xT:

P(x1, x2, …, xT) = P(x1) · P(x2 | x1) · P(x3 | x1, x2) · … · P(xT | x1, …, xT−1)
This equality arises from the sequential application of the definition of conditional probability. Importantly, it holds for any ordering of the variables; if the sequence is permuted according to some ordering π, the decomposition still applies.
AR models adopt a “left-to-right” ordering and consistently apply this rule.
To illustrate, consider a simple numerical example:
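A minimal Python sketch, with made-up per-step probabilities for a three-token sequence; it shows that multiplying the conditionals and summing their logarithms give the same result, which foreshadows the NLL discussion further down.

```python
import math

# Made-up conditionals for the sequence "the cat sat":
#   P(the) = 0.4, P(cat | the) = 0.2, P(sat | the cat) = 0.5
step_probs = [0.4, 0.2, 0.5]

joint = math.prod(step_probs)                      # chain rule: multiply the conditionals
log_joint = sum(math.log(p) for p in step_probs)   # equivalently, add the log-probabilities

print(joint)                # 0.04
print(math.exp(log_joint))  # 0.04 (same value, computed in log space)
```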
During training, these models use a two-stage procedure that corrupts the text and then reconstructs it. In the forward process, the original text is partially masked (i.e., corrupted by adding noise). In the reverse process, the model removes this masking to reconstruct the original. In LLaDA’s training, for each example the fraction of masked tokens is sampled at random from the interval [0, 1], and the model learns to predict the hidden tokens under that randomly chosen masking rate. This is a markedly different approach, which raises a key question: how much masking should be applied?
Sometimes only a few words are masked (an easy task); other times nearly all words are masked (a much harder task). By being exposed to both extremes, the model learns to perform conditional completion as well as sampling from the model distribution.
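To make the forward process concrete, here is a toy sketch; the example sentence, the "[MASK]" string, and the helper name are placeholders for illustration, not LLaDA’s actual tokenizer or code.

```python
import random

def forward_mask(tokens, mask_token="[MASK]"):
    """Corrupt a token sequence as described above: draw a masking rate t
    uniformly from [0, 1], then hide each token independently with probability t."""
    t = random.random()
    corrupted = [mask_token if random.random() < t else tok for tok in tokens]
    return corrupted, t

sentence = "the cat sat on the mat".split()
noisy, t = forward_mask(sentence)
print(f"mask rate t = {t:.2f}: {' '.join(noisy)}")
# e.g. mask rate t = 0.72: [MASK] cat [MASK] [MASK] [MASK] mat
# The reverse process is the model's job: predict the original tokens at the
# [MASK] positions, given everything that is still visible.
```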
With this setup, LLaDA’s training objective optimizes a variational bound on the likelihood under the model distribution. In other words, diffusion models approach the maximum-likelihood target indirectly, through a variational objective. This differs from standard masked language models like BERT, which are trained with a fixed masking rate. Thanks to the random-rate masking, LLaDA’s training loss becomes an upper bound on the model’s negative log-likelihood (NLL), enabling full generative training. Thus, a dLLM learns the text probability distribution much like an autoregressive LLM does, but instead of focusing on the next token, it learns by infilling the missing pieces at each step.
Mathematically, diffusion-based language models optimize an ELBO (evidence lower bound) during training to adjust the model parameters, and LLaDA’s loss is derived so that it forms an upper bound on the NLL.
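As a rough PyTorch-style sketch of this kind of objective: the 1/t weighting mirrors the masked-denoising loss described in the LLaDA paper, but MASK_ID, the model interface, and the exact normalization here are illustrative assumptions, not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder mask-token id; the real id depends on the tokenizer

def masked_denoising_loss(model, x0):
    """Corrupt x0 at a random rate t, predict the hidden tokens, and weight the
    cross-entropy on masked positions by 1/t. In expectation over t and the
    masking, a loss of this form gives an upper bound on the (per-token) NLL."""
    b, L = x0.shape
    t = torch.rand(b, 1).clamp(min=1e-3)               # masking rate per example
    mask = torch.rand(b, L) < t                        # which positions get hidden
    xt = torch.where(mask, torch.full_like(x0, MASK_ID), x0)

    logits = model(xt)                                 # (b, L, vocab); no causal mask
    token_nll = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")
    per_example = (token_nll * mask).sum(dim=1) / t.squeeze(1)
    return per_example.mean() / L                      # per-token average over the batch
```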
What is negative log-likelihood (NLL)?
It is the negative of the logarithm of the likelihood that the model assigns to the data; in learning, it is used as the loss to be minimized.
Consider a dataset of N examples, {(x1, y1), (x2, y2), …, (xN, yN)}. The negative log-likelihood of the data under the model is:

NLL(θ) = −∑i=1..N log pθ(yi | xi)
So what did we do here?
Our goal is to learn the model parameters θ such that the probability of the observed data is maximized. In practice, this is equivalently achieved by minimizing the negative log-likelihood (NLL).
Terminology:
- xi: input for the i-th example (feature vector, text context, image, etc.).
- yi: target for the i-th example (label, next token, regression target).
- pθ(yi∣xi): the (conditional) probability the model assigns to the correct target yi given xi (a density if yi is continuous).
- ∑i=1..N: the sum over all N data points (i = 1 to N).
- Leading “−”: designed so that maximizing likelihood ⇔ minimizing NLL.
Why this form?
- Products become sums: under the log, the chain-rule product of probabilities turns into a sum of log-probabilities, which is numerically stable and easier to optimize.
- Harsh penalty for mistakes: if the model assigns low probability to the correct class, then −log p is large; as p → 0, NLL → ∞ (see the short example after this list).
- Equivalent to MLE: minimizing NLL is exactly the same as maximizing the (log-)likelihood.
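As a quick illustration of the first two points (the numbers are made up):

```python
import math

# Probabilities a hypothetical classifier assigns to the *correct* label
# for four training examples; the last one is almost missed.
p_correct = [0.9, 0.8, 0.95, 0.01]

nll = -sum(math.log(p) for p in p_correct)   # a sum of -log p terms, not a product
print(f"total NLL = {nll:.3f}")              # ≈ 4.985; the 0.01 example alone contributes ≈ 4.605

for p in p_correct:
    print(f"p = {p:<4} -> -log p = {-math.log(p):.3f}")
# p = 0.9  -> -log p = 0.105
# p = 0.8  -> -log p = 0.223
# p = 0.95 -> -log p = 0.051
# p = 0.01 -> -log p = 4.605
```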
The expected NLL under the true data distribution P\* is equal to the cross-entropy H(P\*, Pθ), and the cross-entropy decomposes as H(P\*, Pθ) = H(P\*) + KL(P\* || Pθ), where the entropy term does not depend on the model.
Therefore, reducing the NLL brings Pθ closer to the true distribution.
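A tiny numerical check of this identity, with made-up distributions over three outcomes:

```python
import math

p_true  = [0.5, 0.3, 0.2]   # hypothetical "true" distribution P*
p_model = [0.4, 0.4, 0.2]   # hypothetical model distribution P_theta

cross_entropy = -sum(p * math.log(q) for p, q in zip(p_true, p_model))
entropy       = -sum(p * math.log(p) for p in p_true)
kl            =  sum(p * math.log(p / q) for p, q in zip(p_true, p_model))

print(round(cross_entropy, 4))   # 1.0549
print(round(entropy + kl, 4))    # 1.0549 -> H(P*, Pθ) = H(P*) + KL(P* || Pθ)
```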
In AR sequences, for a target sequence y1:T:

NLL(θ) = −∑t=1..T log pθ(yt | y1:t−1)
The average NLL per token (ANLL) is typically reported, and perplexity is simply its exponential, PPL = exp(ANLL); the lower both values, the better the model fits the data.
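For completeness, a small sketch of these sequence-level quantities, using made-up per-token probabilities from a hypothetical AR model:

```python
import math

# p_theta(y_t | y_1:t-1) for each token of one target sequence (made-up values)
token_probs = [0.25, 0.6, 0.1, 0.8, 0.5]

nll  = -sum(math.log(p) for p in token_probs)   # sequence-level NLL
anll = nll / len(token_probs)                   # average NLL per token
ppl  = math.exp(anll)                           # perplexity

print(f"NLL = {nll:.3f}, ANLL = {anll:.3f}, perplexity = {ppl:.3f}")
# NLL = 5.116, ANLL = 1.023, perplexity = 2.782
```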
As a result, while autoregressive modeling directly maximizes log P(X), diffusion modeling instead employs a loss function that maximizes a lower bound on log P(X). This can be viewed as a weighted sum of prediction errors across different masking levels (timesteps). Such an approach means that, in principle, diffusion models are also performing (approximate) maximum-likelihood learning. Indeed, the LLaDA paper demonstrates that the loss function used constitutes an upper bound on the negative log-likelihood (NLL) of the model distribution.
Comparison with Modern LLMs
Diffusion-based LLMs are still a relatively new paradigm compared to their autoregressive counterparts. Yet, as of 2025, research results have been highly promising.
Accuracy and Task Performance
Models such as LLaDA have shown competitive performance against traditional LLMs in comprehensive benchmark evaluations. For instance, the LLaDA-8B model (2025) demonstrated a scaling curve similar to autoregressive models of comparable size when trained and tested on equivalent data. On the MMLU benchmark, LLaDA-8B achieved 65.9% accuracy, closely matching the performance of the autoregressive LLaMA3–8B (65.4%). Across 15 zero/few-shot tasks, LLaDA-8B significantly outperformed LLaMA2–7B and performed on par with LLaMA3–8B. This marks an important milestone.
These findings highlight that dLLMs also possess strong in-context learning abilities. Moreover, LLaDA appears to scale its performance as effectively as autoregressive models. A particularly striking detail is that LLaDA-8B was trained on only 2.3 trillion tokens, whereas LLaMA3–8B — achieving nearly identical performance — was trained on 15 trillion tokens. This points to a potential data-efficiency advantage of the diffusion paradigm.
Logical and Mathematical Reasoning Tasks
LLaDA has excelled in tasks requiring mathematical reasoning and logical inference. On GSM8K, LLaDA-8B reached 70.7%, far surpassing LLaMA3–8B (53.1%). On the MATH benchmark, LLaDA achieved 27.3%, compared to LLaMA3–8B’s 15%. These tasks often require multi-step inference and reverse reasoning (tracing results backward), where LLaDA’s bidirectional context modeling may provide a key advantage.
Additionally, LLaDA successfully overcame the so-called “reversal curse,” a challenge where autoregressive LLMs typically struggle. In a reversal poem-completion test, where the model is given a line and must produce the line that precedes it, LLaDA even outperformed GPT-4-class models. While GPT-4o remains stronger in standard forward text generation, LLaDA demonstrated superior performance on this reverse-order completion task, a remarkable result.
Conclusion
These results confirm LLaDA’s success and suggest that diffusion models are poised to play a transformative role in the future of language modeling.
I hope you found this article engaging. In my next piece, I will once again explore an equally fascinating model.
See you next time…
