Deep Dive · Tabular Machine Learning

How Do Tabular Foundation Models Work? A TabPFN Walkthrough

From synthetic causal graphs to amortised posteriors. A ground-up explanation of how TabPFN works, what it learns, and why it represents a genuinely new way of thinking about prediction on structured data.

By Mohamed Salem · ~18 min read · Beginner · equations kept minimal

For most of its history, tabular machine learning has operated under a single paradigm: you train a model on your dataset. Every model starts from scratch. Every dataset is an island. Gradient-boosted trees (e.g. XGBoost, LightGBM, CatBoost) sit at the top of benchmark leaderboards for a reason: they are remarkably good at squeezing signal from structured data with no prior knowledge at all.

However, this agnostic approach leaves a lot on the table. When a human data scientist looks at a new dataset, they don't start from scratch. They bring years of intuition to bear on the data: which features tend to interact, what kinds of noise patterns are plausible, and what processes are likely to have generated the data. They bring a prior. Classical out-of-the-box ML methods have no mechanism for this.

Tabular foundation models change the framing entirely.[3] Rather than training a model on your dataset, you give the entire dataset to a pretrained transformer as input, and it produces a prediction in a single forward pass; no gradient updates, no retraining. The model has already absorbed, during pretraining, a broad prior over what tabular data can look like. At inference time, it reads your data and answers the question: given everything I've learned about how the world generates datasets, what is the most likely outcome for this query point?
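To ground this, here is what that workflow looks like with the open-source tabpfn package. This is a minimal sketch assuming the v2 scikit-learn-style interface (TabPFNClassifier with fit / predict_proba); treat any option details as version-dependent rather than authoritative.

```python
# Minimal sketch using the open-source `tabpfn` package, assuming its
# scikit-learn-style interface; details may differ between versions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()      # loads pretrained weights; no task-specific training
clf.fit(X_train, y_train)     # "fit" just stores the context; no gradient updates
proba = clf.predict_proba(X_test)  # one forward pass over context + queries
```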

This article explains, from the ground up, how this works. We cover the training data generation, the architecture, the loss function, and, most importantly, what the model actually learns. Each section is accompanied by an interactive illustration you can explore.

Part 1 — Where the training data comes from: synthetic causal graphs

The most unusual thing about tabular foundation models is not their architecture; it is what they train on. Despite the fact that most real-world business data is fundamentally tabular, there is no large, readily available corpus of real tabular datasets (the way language models train on text). Instead, to train a tabular foundation model, we generate our own training data procedurally, using random causal graphs.

The process works like this. A random Directed Acyclic Graph (DAG) is sampled. Each node in the graph represents either an input feature or a target variable. Edges represent causal influences: parent nodes affect child nodes via learned weight matrices and nonlinear activation functions (ReLU, tanh, sigmoid). Random noise is injected at the root nodes and is propagated forward through the graph. The final values at each node form a single row of a synthetic dataset.
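A minimal sketch of this generation loop is below. It illustrates the idea rather than reproducing Prior Labs' actual prior: the graph density, weight distributions, and activation choices are stand-ins, and for simplicity noise is injected at every node rather than only at the roots.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dag(n_nodes, p_edge=0.4):
    """Random DAG as an adjacency matrix; allowing edges only i -> j with
    i < j guarantees acyclicity (index order is a topological order)."""
    return np.triu(rng.random((n_nodes, n_nodes)) < p_edge, k=1)

def sample_row(adj, weights, activations, noise_scale=0.3):
    """Propagate noise forward through the graph to produce one row."""
    n = adj.shape[0]
    values = np.zeros(n)
    for j in range(n):
        parents = np.flatnonzero(adj[:, j])
        pre = values[parents] @ weights[parents, j] if parents.size else 0.0
        values[j] = activations[j](pre + noise_scale * rng.normal())
    return values

n_nodes = 6
adj = sample_dag(n_nodes)
weights = rng.normal(size=(n_nodes, n_nodes))
acts = rng.choice([np.tanh, lambda v: np.maximum(v, 0.0)], size=n_nodes)

# One synthetic dataset: many rows sampled from the SAME causal model;
# the last node is designated as the prediction target.
data = np.stack([sample_row(adj, weights, acts) for _ in range(200)])
X, y = data[:, :-1], data[:, -1]
```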

Repeat this millions of times, with different graph structures, different weights, different noise levels, and you get an enormous diversity of synthetic datasets, each with different linearity, different feature interactions, different noise distributions. This is the pretraining corpus: approximately 130 million distinct synthetic datasets, each containing many rows and each generated from a different causal model. The total number of training rows is therefore 130 million multiplied by the average dataset size.

The reasoning behind this design is elegant. If your goal is to learn a prior over "what kinds of relationships can arise in real-world tabular data," one approach is to curate a massive collection of real datasets. But real datasets are biased, limited, and hard to label correctly for meta-learning. The alternative, generating from a broad prior over causal structures, can produce infinite diversity while remaining mathematically principled. You are training the model to be good at inference over the class of problems that can be described by a causal DAG, which covers most practical tabular prediction tasks. It also allows you to pretrain your model towards the kind of tasks you expect to encounter.

Interactive · Synthetic data generation

Step through the DAG generation process: see a causal graph sampled, edge weights assigned, noise injected at root nodes, and values propagated forward to produce one row of synthetic training data. You can also open it in a new tab.

Note on feature names. The feature labels in the illustration above (age, income, debt…) are added for readability only; the synthetic datasets used in pretraining have no feature names. That said, the structure of the DAG does matter: a prior that generates causal graphs resembling real-world processes (e.g., income influencing loan outcome rather than the reverse) will produce a model better calibrated for datasets from similar domains.

Part 2 — The architecture: a transformer that reads datasets, not sentences

Scope note. This section describes the architecture of TabPFN v2 specifically, as implemented in the Prior Labs source code. Other tabular foundation models, including other versions of TabPFN, TabICL, and ConTextTab, use different architectural choices. The core ideas of synthetic pretraining and in-context learning are shared across the field, but the specific attention design, feature encoding, and loss function described here are TabPFN v2's.

The model is a transformer, but one redesigned around the structure of tabular data rather than sequences of tokens. Understanding how it differs from a standard transformer is essential for understanding why it works.

The token is a cell, not a word

In a language model, a token is a word or subword. In tabular foundation models, a token is, loosely speaking, one cell (more accurately, a (row, feature-group) group of adjacent cells). The input is a table with n rows and f features. Features can be grouped into blocks; each block is linearly projected into a d-dimensional embedding, and the label y is encoded separately and appended as a final column. The full tensor entering the transformer has shape:

Internal tensor shape after encoding: (batch, seq_len, num_feature_groups + 1, d_model). The +1 is the label column: for training rows it holds the true label; for query rows it is filled with NaNs, and its value after all L transformer layers is what gets decoded into a prediction. What these dimensions mean in practice: during pretraining, the batch dimension indexes the synthetic datasets processed together in one training batch; at inference time, batch = 1 and the model processes one dataset per forward pass. seq_len is the number of training rows plus the number of query rows, all seen together in a single context.

This is the first architectural peculiarity worth noting: the label is not treated as a separate output target sitting outside the model; it is inside the same tensor as the features, occupying the last feature-group slot. This becomes important when we describe how the attention mechanism flows in the model.
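As a shape check, here is a hypothetical encoding step in PyTorch. It is not the Prior Labs encoder (the real model has per-position projections, missing-value handling, and preprocessing this sketch omits), and query rows get a zero placeholder in the label column where the real model inserts its NaN-style mask:

```python
import torch

n_train, n_query, n_features, d_model = 8, 2, 5, 16
features_per_group = 1
n_groups = n_features // features_per_group
rows = n_train + n_query

X = torch.randn(rows, n_features)
y_train = torch.randn(n_train)

feat_enc = torch.nn.Linear(features_per_group, d_model)  # feature encoder
y_enc = torch.nn.Linear(1, d_model)                      # separate y_encoder

# Project each feature group of each row into d_model dimensions.
feat_tokens = feat_enc(X.view(rows, n_groups, features_per_group))

# Label column: true labels for training rows, a placeholder for query rows
# (the real model marks these as missing rather than zero).
y_col = torch.zeros(rows, 1)
y_col[:n_train, 0] = y_train
label_tokens = y_enc(y_col).unsqueeze(1)                 # (rows, 1, d_model)

grid = torch.cat([feat_tokens, label_tokens], dim=1).unsqueeze(0)
print(grid.shape)  # (1, 10, 6, 16) = (batch, seq_len, num_feature_groups + 1, d_model)
```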

Two kinds of attention — and why both matter

In the model, attention is applied twice in sequence, over two completely different dimensions of the tensor. This is the heart of the architecture and the main adaptation to tabular data.

Column attention runs within a single row, across the feature-group dimension. For each row independently, every feature-group token, including the label slot, attends to every other feature-group token in the same row. This is what the code literally does: it applies multi-head attention with the feature-group dimension as the sequence dimension.

The significance of this for training rows: the label is visible. When column attention runs on a training row, the feature tokens can directly read the label, and the label token can read the features. The model learns which feature patterns co-occur with which labels, at the level of individual cells within a single example. This is where feature interaction happens: income, debt, and age, for example, can jointly attend to the label, learning a rich representation of how those features relate to outcomes.

Row attention runs at a fixed feature-group position, across rows. The code transposes the tensor before passing it to the attention module (making rows the sequence dimension), so each token attends to the same feature-group position in other rows. For query/test rows, a critical constraint is enforced: the Keys and Values come only from training rows. Test rows cannot attend to each other or to their own labels (i.e. there is no leakage).

The most consequential row-attention pass is the one over the label slot. The query row's masked label token attends to the label tokens of every training row. Since those training-row label tokens have already been enriched by column attention with feature information, this single attention operation transfers a weighted combination of label values, calibrated by feature similarity, into the query's label slot. This is the in-context learning mechanism.

After both attention passes, an MLP is applied per-token (each (row, feature-group) cell independently), followed by LayerNorm with a residual connection. This is the standard transformer sublayer structure. Then the whole sequence — col attn → row attn → MLP+LN — repeats for each of the L identical layers.
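The sketch below shows how both passes can be expressed with stock PyTorch attention by reshaping the same tensor twice. It is a simplified stand-in for the real layer; normalisation placement and masking details differ in the Prior Labs code.

```python
import torch

B, S, G, D = 1, 10, 6, 16   # batch, rows, feature groups incl. label, d_model
n_train = 8                  # the first 8 rows are training rows
x = torch.randn(B, S, G, D)

col_attn = torch.nn.MultiheadAttention(D, num_heads=4, batch_first=True)
row_attn = torch.nn.MultiheadAttention(D, num_heads=4, batch_first=True)
mlp = torch.nn.Sequential(torch.nn.Linear(D, 4 * D), torch.nn.GELU(),
                          torch.nn.Linear(4 * D, D))
ln1, ln2, ln3 = (torch.nn.LayerNorm(D) for _ in range(3))

# 1) Column attention: feature groups are the sequence dim, per row.
h = x.reshape(B * S, G, D)
h = ln1(h + col_attn(h, h, h, need_weights=False)[0])
x = h.reshape(B, S, G, D)

# 2) Row attention: rows are the sequence dim, per feature-group position.
#    Keys/values come only from training rows, so query rows can never
#    read other query rows or their own (masked) labels.
h = x.transpose(1, 2).reshape(B * G, S, D)
ctx = h[:, :n_train]
h = ln2(h + row_attn(h, ctx, ctx, need_weights=False)[0])
x = h.reshape(B, G, S, D).transpose(1, 2)

# 3) Per-token MLP, residual, LayerNorm; the whole layer repeats L times.
x = ln3(x + mlp(x))
```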

Interactive · Row & column attention

Step through pre-selected examples showing exactly which cells are Queries, which are Keys (amber border), and which become Values (green fill), for both column attention and row attention. The label column is included. The visualization uses the features_per_group=1 default; each feature is its own token. You can also open it in a new tab.

Attention to detail: To simplify, we can think of attention as looking at the values in the row-column intersections. The real implementation is a bit more nuanced: before attention runs, each raw feature value is projected from a single scalar into a d-dimensional embedding vector by a learned linear encoder. The label column is encoded by a separate y_encoder. The attention mechanism then operates on these dense embedding vectors, not on the raw numbers. Note also that a token can represent more than a single feature: adjacent features can be grouped together into a single token before projection.

Why this is better than a pure MLP

It is worth pausing to ask: why can't a sufficiently deep MLP solve this problem? The answer cuts to the core of what makes in-context learning possible.

An MLP trained on a tabular dataset learns a fixed mapping from input features to predictions. The weights encode the learned function, and that function applies identically to every new input. The MLP has no mechanism to condition its predictions on other data points. Each prediction is made in isolation. This is fine when you have a large, representative training set; the MLP can approximate the true function well. But it cannot do "local reasoning": it cannot notice that a particular test point is nearly identical to three specific training examples, and up-weight those examples accordingly.

Advanced note: There are deeper reasons an MLP cannot serve as a general-purpose tabular foundation model. The first is subtle but fundamental. A plain MLP is rotationally invariant: applying any rotation to the feature matrix produces the same learned function. This sounds harmless, but it is not (see Grinsztajn et al. (2022)[5]). The embedding step in TabPFN explicitly breaks rotational invariance by learning a separate projection for each feature position. Second, a plain MLP assumes a fixed number of input dimensions tied to specific positions; it cannot be pretrained across datasets with different schemas. Attention has no such constraint: each token attends to all others regardless of column count or order.

Attention changes this fundamentally. By attending over the training set, the model's computation is conditioned on the entire observed dataset. The effective function applied to a query point is not fixed; it is a function of the context.

| Capability | Standard MLP | TabPFN (attention-based) |
| --- | --- | --- |
| Feature interactions | Learned implicitly; fixed after training | Dynamic; recomputed per query at inference |
| Heterogeneous feature types | Requires manual preprocessing (one-hot, normalisation) | All types embedded into a shared vector space |
| Cross-sample reasoning | None; each prediction isolated | Native; attends over full training set in context |
| Uncertainty estimation | Requires dropout, ensembles, or special heads | Output is a full distribution; uncertainty is explicit |
| New dataset schema | Must retrain from scratch | Dataset provided as context; forward pass only |
| Rotation invariance | Rotationally invariant; cannot anchor to the natural feature basis; sample complexity grows with irrelevant features[5] | Embedding layer breaks rotational invariance; each feature gets its own learned projection tied to its position |
| Small data | Prone to overfitting; needs regularisation | Leverages pretraining prior |

There is also a compositionality argument. Stacked attention layers compute higher-order feature dependencies without explicit enumeration. After one layer, each token's representation is a nonlinear blend of all its neighbours. In layer 2, attention over those blended representations is effectively computing interactions of interactions. By layer L, every token reflects a deep, nonlinear function of all pairwise relationships in the data, iterated L times. This gives the model access to the equivalent of arbitrary interaction terms without the combinatorial explosion of specifying them explicitly.

Advanced · From raw table to token grid: the full preprocessing and encoding pipeline

The encoding step bridges the familiar tabular world (rows and columns of numbers) and the transformer world (sequences of embedding vectors). Understanding it also resolves the tension between the attention visualization, which shows named features attending to each other, and the architecture diagram, which refers to feature groups and a tensor of shape (batch, seq_len, num_feature_groups+1, d_model). Step through the four stages in the interactive illustration below.


On permutation invariance. With features_per_group=1, every feature is its own isolated token. Swapping two columns produces the same set of tokens in a different order, and since column attention has no positional encoding, the output is identical: the model is exactly permutation-invariant per forward pass. With features_per_group > 1, features grouped together are treated differently from features across groups, breaking strict invariance. Ensembling recovers this: each estimator applies a different random feature shuffle before grouping, so the averaged prediction is approximately permutation-invariant across the ensemble passes even when individual passes are not, as the sketch below illustrates.
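Here, `model_forward` is a hypothetical stand-in for one TabPFN forward pass; everything else is plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_predict(model_forward, X_train, y_train, X_test, n_estimators=4):
    """Average predictions over random feature shuffles.

    `model_forward` is a hypothetical stand-in for one TabPFN forward pass.
    With features_per_group > 1, each shuffle regroups the features
    differently, so averaging restores approximate permutation invariance."""
    preds = []
    for _ in range(n_estimators):
        perm = rng.permutation(X_train.shape[1])       # random column order
        preds.append(model_forward(X_train[:, perm], y_train, X_test[:, perm]))
    return np.mean(preds, axis=0)
```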

Interactive · Architecture flow

A clickable diagram tracing the full forward pass: raw tabular data → encoder (with label column appended) → L stacked layers each containing col attn, row attn, and MLP → extraction of the label slot from test rows → regression or classification output. You can also open it in a new tab.

Part 3 — The loss function: regression as classification

A surprising design choice for tabular foundation models arises in how regression is handled. Instead of predicting a single number, the model outputs a full probability distribution over possible target values. It does this by turning regression into a classification problem. The mechanism is cleaner than it might sound.

Scope note. This section describes TabPFN v2's bar distribution loss. Other tabular foundation models use different loss functions. The principle of outputting a predictive distribution rather than a point estimate is broadly shared, but the specific discretisation mechanism described here is TabPFN v2's design.

What is NLL? NLL stands for Negative Log-Likelihood. To understand it, start with the question: "How probable does the model think the true answer is?" That probability is the likelihood. If the model assigns high probability to the correct target value, it is doing well; if it spreads probability elsewhere, it is doing poorly. The log of a probability is always negative or zero (since probabilities are between 0 and 1), so a model that assigns probability 1.0 to the right answer gets log(1) = 0, a perfect score. A model that assigns very low probability gets a very large negative log, and negating that gives a large positive loss. Minimising NLL therefore means training the model to assign high probability to what actually happened. It is the standard way to train any probabilistic model.

Building the buckets

The target range is divided into N buckets. The borders between buckets are not evenly spaced; rather, they are chosen so that each bucket has equal probability under the prior (i.e., equal to 1/N). In practice this means computing quantiles of a large prior-data sample. Equal-probability buckets mean the softmax over buckets starts with a uniform prior, which is well-calibrated: before seeing any data, all outcomes are equally likely.

A key point that is easy to miss: the same bin boundaries are shared across all datasets. They are computed once from a large pooled sample drawn from the prior distribution — spanning many different synthetic datasets with different target ranges. Individual dataset targets are then normalised (z-scored) before being mapped to these shared bins, so the bins always cover the relevant range regardless of the dataset's original scale. This also means that while each dataset has its own unique distribution of targets, they all speak the same "binned language" during training.

Discretisation of this kind is a well-established technique in statistics and machine learning; with a large enough N, even continuous distributions can be represented accurately. In the bar distribution, N is chosen so that the expected error from discretisation is negligible relative to the model's uncertainty — and because predictions are formed by taking the mean or median of the distribution rather than the bin centre, the approach is even more accurate than it might first appear.
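A sketch of the binning step, using a heavy-tailed stand-in for the pooled prior sample (the distribution choice and N are illustrative):

```python
import numpy as np

N = 1000  # number of buckets; the real model fixes a large N once

# One-time step: pool target values across many synthetic prior datasets
# (simulated here with a heavy-tailed stand-in) and take equal-probability
# quantiles as the shared bucket borders.
prior_sample = np.random.default_rng(0).standard_t(df=3, size=1_000_000)
borders = np.quantile(prior_sample, np.linspace(0.0, 1.0, N + 1))

def target_to_bucket(y, y_mean, y_std):
    """Z-score a dataset's targets, then map them into the shared bins."""
    z = (np.asarray(y) - y_mean) / y_std
    k = np.searchsorted(borders, z, side="right") - 1
    return np.clip(k, 0, N - 1)  # guard against values beyond the tails
```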

The model outputs a logit vector of length N. A softmax turns this into a probability histogram; a bar chart where each bar represents the probability that the true target falls in that bucket. This is the "bar distribution" or Riemann distribution.

The loss: density NLL, not categorical cross-entropy

Given a true target value y, it is mapped to its bucket index using a binary search. The loss is then:

Bar distribution NLL loss \[ \mathrm{NLL} = -\log \frac{p_k}{\Delta_k} = -\log p_k + \log \Delta_k \] where p_k is the softmax probability of the target bucket and Δ_k is that bucket's width. Dividing p_k by the width turns a bucket probability into a density, so the +log(Δ_k) term converts categorical cross-entropy into a proper density NLL: a uniform distribution over [0,1] gives a loss of exactly 0, not log(N).

This scaling matters. Without it, a model that spreads probability evenly across many narrow buckets would be penalised relative to one using fewer wide buckets, an artefact of discretisation rather than a reflection of predictive quality. The density NLL is invariant to the number of buckets chosen, making it a proper scoring rule.

For regression, the final point prediction is typically the mean or median of this distribution. But the distribution itself is the output. The model gives you calibrated uncertainty for free, as a natural consequence of the architecture, not as an add-on.
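Putting the pieces together, here is a minimal NumPy sketch of the loss and the point prediction. It is illustrative only; the real implementation handles borders, tail clipping, and batching more carefully.

```python
import numpy as np

def bar_distribution_nll(logits, borders, y):
    """Density NLL: -log(p_k / Δ_k) for the bucket k containing y."""
    k = np.searchsorted(borders, y, side="right") - 1
    p = np.exp(logits - logits.max())
    p /= p.sum()                                  # softmax over buckets
    return -np.log(p[k]) + np.log(borders[k + 1] - borders[k])

N = 10
borders = np.linspace(0.0, 1.0, N + 1)            # equal-width for the demo

# Sanity check: a uniform prediction over [0, 1] scores exactly 0,
# independent of N; that is the density correction at work.
print(bar_distribution_nll(np.zeros(N), borders, y=0.42))  # ~0.0

# The point prediction is the mean of the distribution, not a bin lookup.
p = np.full(N, 1.0 / N)
centres = 0.5 * (borders[:-1] + borders[1:])
print(p @ centres)  # 0.5
```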

For classification, the output is simpler: a standard softmax over classes. But even here, the same column and row attention machinery applies. The label column in the transformer holds the class index, and the same label-slot extraction mechanism produces the logits.

Interactive · Bar distribution & loss

Step through four frames: (1) building equal-probability borders from the prior, (2) mapping a continuous target to its bucket, (3) the NLL loss formula with the density correction term, (4) the training loop — synthetic datasets cycling through, NLL converging. You can also open it in a new tab.

Part 4 — What the model actually learns: amortised Bayesian inference

This is the most important section.

The model's weights do not store data. They do not memorise which synthetic datasets they have seen, or look them up at inference time. What the weights store is an algorithm: a compressed function that maps "what a dataset looks like" to "what the predictive distribution should be."

Here is what that means in plain terms. During pretraining, the model sees ~130 million datasets. For each one, a label is masked, and the model is trained to predict it from the remaining context. Over time the model discovers that certain patterns in the training rows reliably predict outcomes (linear relationships, clusters, interaction effects), and it learns to recognise those patterns and apply the appropriate predictive logic. By the end of training, the weights have compressed this learning into a function that can identify the statistical character of any new dataset and output a well-calibrated prediction in a single forward pass.
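To see this "learning an algorithm" effect in miniature, the runnable toy below meta-trains a tiny context-conditioned network across thousands of freshly sampled tasks. Mean-pooling stands in for attention and binary cross-entropy for the bar-distribution NLL; every name here is illustrative, and none of it is TabPFN code.

```python
import torch

torch.manual_seed(0)

def sample_task(n=48, f=4):
    """Toy prior: each task is a fresh random linear decision rule."""
    X = torch.randn(n, f)
    w = torch.randn(f)
    y = (X @ w > 0).float()
    return X, y

f, d = 4, 64
row_embed = torch.nn.Sequential(torch.nn.Linear(f + 1, d), torch.nn.ReLU())
head = torch.nn.Sequential(torch.nn.Linear(d + f, d), torch.nn.ReLU(),
                           torch.nn.Linear(d, 1))
opt = torch.optim.Adam([*row_embed.parameters(), *head.parameters()], lr=1e-3)

for step in range(3000):                  # meta-training: a new task per step
    X, y = sample_task()
    Xc, yc, Xq, yq = X[:32], y[:32], X[32:], y[32:]  # context / masked queries
    ctx = row_embed(torch.cat([Xc, yc[:, None]], dim=1)).mean(0)
    logits = head(torch.cat([Xq, ctx.expand(len(Xq), -1)], dim=1)).squeeze(-1)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, yq)
    opt.zero_grad(); loss.backward(); opt.step()

# After training, the SAME frozen weights solve a brand-new task in one
# forward pass: the weights store a procedure, not any particular dataset.
```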

At inference time, the attention mechanism is the instrument that reads the dataset's character. When the query row's label slot attends to training rows via row attention, it is asking: which training examples are most relevant to this query? When column attention runs on a training row, it is asking: what pattern do these features and this label represent together? By the time information has propagated through L layers of alternating attention and MLP, the label slot of the query row holds a rich encoding of the question "given that the training set looks like this, what should the label be?" and the decoder reads that encoding to produce the output distribution.

No memory lookup. No similarity search over stored datasets. Just a forward pass that has been trained to implement good predictive reasoning.

"The model doesn't store data. It stores the algorithm for reasoning about data."
Advanced · The formal connection to Bayesian inference (PFN paper Insight 1)

The informal description above has a precise mathematical grounding in the PFN paper (Müller et al., 2022).[2] During pretraining, the model minimises the expected NLL of masked labels across all datasets sampled from the prior:

TabPFN training objective (PFN paper §3) \[\min_{\theta}\ \mathbb{E}_{D \sim p(D)}\left[-\log q_{\theta}\left(y_{\mathrm{test}} \mid x_{\mathrm{test}}, D_{\mathrm{train}}\right)\right]\] Minimised over randomly sampled datasets D from the prior, with one label masked as the prediction target.

By Insight 1 of the paper, minimising this objective is equivalent to minimising the KL divergence between the model's output and the true Bayesian posterior predictive distribution, averaged over all datasets in the prior:

PFN paper Insight 1 — equivalence to KL minimisation \[\mathbb{E}_{D}\left[\mathrm{KL}\left(p(y \mid x, D)\ \|\ q_{\theta}(y \mid x, D)\right)\right] \to 0\] At the optimum, the model's output qθ equals the true posterior predictive p(y | x, D) for every dataset in the prior.

This is a strong result: train a network to predict masked labels across a diverse enough prior, and it converges to the Bayesian posterior predictive distribution. The weights have learned to compress the algorithm for computing posteriors into a single forward pass.

The connection to kernel methods. There is also a useful analogy to Gaussian processes. In GP regression, you specify a kernel function that measures similarity between data points, and predictions are weighted combinations of training labels where weights reflect similarity to the query. TabPFN's row-attention weights play the same role: high weight means "this training example is informative about this query." The difference is that in a GP the kernel is fixed and hand-specified (RBF, Matérn, etc.), whereas here the notion of "similarity" is itself learned from the pretraining distribution and adjusted on the fly for each new dataset the model reads.
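In code, the analogy is a few lines. This conceptual sketch treats a dot-product score as the "kernel" and predicts a query's label as a softmax-weighted average of training labels; in the real model the queries and keys are learned, multi-headed, and stacked across layers.

```python
import numpy as np

def kernel_smoother(x_query, X_train, y_train, temp=1.0):
    """Attention-as-kernel: softmax(similarity) gives the weights, and the
    prediction is the weighted combination of training labels."""
    scores = X_train @ x_query / temp        # stand-in for Q·K^T similarity
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # softmax over training rows
    return w @ y_train

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5])
print(kernel_smoother(X_train[0], X_train, y_train))  # pulled toward y_train[0]
```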

Interactive · What the model learns

An animated illustration of the training-to-inference arc: five diverse synthetic datasets flow in, each updating the weights via NLL minimisation. Then a new unseen dataset arrives — the same fixed weights read its structure through attention and output the appropriate posterior. Same weights. Different dataset. Correct posterior. You can also open it in a new tab.

Part 5 — Where this breaks: the limits of the pretraining prior

The power of this framework comes with a real and principled limitation. The model's predictions are only as good as the match between your data and the pretraining prior.

If your dataset comes from a data-generating process that is genuinely out-of-distribution relative to the synthetic DAGs the model trained on, the model's inferred posterior will be poorly calibrated. It will map your data to the closest pretraining experience it can find, which may be wrong in ways that are hard to diagnose.

The model's attention weights are powerful precisely because they reflect a learned notion of similarity, but that notion was defined by the pretraining distribution. Feed it something the pretraining has never seen, and the similarity measure becomes unreliable.

Closing: a different way of thinking about tabular prediction

The standard mental model for tabular ML is: collect data, train model, deploy. Tabular foundation models replace this with: collect data, pass to pretrained model, done. The training happened a long time ago, on synthetic data, and what it produced was not a classifier for any specific task; it was an algorithm for Bayesian reasoning about tabular data in general.

This is a meaningful conceptual shift. It means that the quality of your prediction is no longer primarily determined by how much data you have or how carefully you tuned your gradient-boosted tree. It is determined by how well the pretraining prior matches the distribution your data comes from. For most practical tabular prediction tasks, that match is surprisingly good. For novel domains, it requires thought.

The attention mechanism is the piece that makes this possible. Without row attention, the model could not condition predictions on the training set. Without column attention, it could not learn feature interactions within examples. Without the label column sitting inside the same token tensor as the features, the label-slot extraction trick would not work. The architecture is not a standard transformer applied to a table; it is a transformer redesigned around the specific computational requirements of in-context Bayesian inference.

Understanding this at a mechanistic level (knowing that row attention is a learned kernel, that column attention enables dynamic feature interactions, and that the loss function is a proper density scoring rule) gives you a principled basis for knowing when to trust the model and when to be skeptical. It also, hopefully, makes the system feel less like a black box and more like a well-reasoned engineering choice.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. NeurIPS 2017. arXiv:1706.03762
  2. Müller, S., Hollmann, N., Pineda Arango, S., Grabocka, J., & Hutter, F. (2022). Transformers can do Bayesian inference. ICLR 2022. arXiv:2112.10510
  3. Hollmann, N., Müller, S., Eggensperger, K., & Hutter, F. (2023). TabPFN: A transformer that solves small tabular classification problems in a second. ICLR 2023. arXiv:2207.01848
  4. Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S. B., Schirrmeister, R. T., & Hutter, F. (2025). Accurate predictions on small data with a tabular foundation model. Nature, 637(8045), 319–326. doi:10.1038/s41586-024-08328-6
  5. Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why tree-based models still outperform deep learning on tabular data. Advances in Neural Information Processing Systems 35 (NeurIPS 2022). doi:10.5555/3600270.3600307