Tabular foundation models learn by training on millions of synthetic datasets. Each one is generated from a random Directed Acyclic Graph (DAG), a causal model in which each node is a feature computed from its parents. Each new dataset is built in three steps: (1) sample a random DAG, (2) assign edge weights and node functions, (3) repeatedly inject noise at the root nodes and propagate it forward through the graph, producing one row of data per pass.
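A minimal NumPy sketch of that generator; the edge probability, the small pool of node functions, and the standard-normal root noise are illustrative assumptions, not the actual prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dag(n_nodes, edge_prob=0.3):
    """Random DAG: edges only go from lower to higher node index, so the
    node order is already a topological order and no cycles can occur."""
    adj = rng.random((n_nodes, n_nodes)) < edge_prob
    return np.triu(adj, k=1)

def sample_dataset(n_rows=128, n_nodes=8, n_features=4):
    adj = sample_dag(n_nodes)                                     # step 1: random DAG
    weights = rng.normal(size=(n_nodes, n_nodes)) * adj           # step 2: edge weights
    fns = [np.tanh, np.sin, lambda v: v]                          # step 2: node functions (toy pool)
    node_fn = [fns[rng.integers(len(fns))] for _ in range(n_nodes)]

    rows = np.zeros((n_rows, n_nodes))
    for r in range(n_rows):                                       # step 3, once per row
        vals = np.zeros(n_nodes)
        for j in range(n_nodes):                                  # walk nodes in topological order
            if not adj[:, j].any():
                vals[j] = rng.normal()                            # root node: inject noise
            else:
                vals[j] = node_fn[j](weights[:, j] @ vals)        # propagate from parents
        rows[r] = vals
    cols = rng.permutation(n_nodes)
    X, y = rows[:, cols[:n_features]], rows[:, cols[n_features]]  # some nodes become features, one the label
    return X, y

X, y = sample_dataset()
```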
The model holds one token per (row, feature-group) cell, plus an extra column of tokens for the label y. Column attention runs within a single row, across all feature tokens and the label token, letting features read the label on training rows. Row attention runs at a fixed feature-group position, across rows; for test rows, keys and values come only from training rows, which transfers label information into each test row's label slot (sketched in code below the table). That slot is then extracted and decoded as the final prediction.
| age | income | debt | score | label (y) |
|---|---|---|---|---|
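A minimal NumPy sketch of that row-attention step at the label column, using single-head scaled dot-product attention and toy dimensions of my own choosing (the real model is multi-head and runs this at every column position, not just the label):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def row_attention_label_slot(label_tokens, n_train, d_k):
    """Row attention at the label column.

    label_tokens: (n_rows, d) embeddings of the label slot, one per row.
    Training rows carry the true y; test rows carry a placeholder.
    Keys and values are restricted to training rows, so the training
    labels flow into the test rows' label slots."""
    n_rows, d = label_tokens.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))

    Q, K, V = label_tokens @ Wq, label_tokens @ Wk, label_tokens @ Wv
    scores = Q @ K.T / np.sqrt(d_k)               # (n_rows, n_rows) attention scores
    scores[:, n_train:] = -np.inf                 # keys/values come only from training rows
    return softmax(scores, axis=-1) @ V           # updated label slot for every row

tokens = np.random.default_rng(1).normal(size=(6, 16))    # 4 train rows + 2 test rows
updated = row_attention_label_slot(tokens, n_train=4, d_k=16)
```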
Starting from a raw tabular dataset, the model encodes each (row, feature-group) cell independently, appending the label y as a final column, then passes the tensor through L identical layers. Each layer runs column attention, then row attention, then an MLP (each followed by LayerNorm). After all layers, only the label-slot embeddings of the test rows are extracted and decoded to produce predictions.
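A schematic PyTorch skeleton of that forward pass, assuming a plain linear per-cell encoder, `nn.MultiheadAttention` for both attention directions, and placeholder widths, head counts, and decoder; the real model's encoders, normalisation placement, and feature grouping differ:

```python
import torch
import torch.nn as nn

class Layer(nn.Module):
    """Column attention, then row attention, then an MLP, each followed by LayerNorm."""
    def __init__(self, d, heads=4):
        super().__init__()
        self.col_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln = nn.ModuleList([nn.LayerNorm(d) for _ in range(3)])

    def forward(self, x, n_train):
        R, C, d = x.shape                                  # (rows, columns incl. label, d)
        a, _ = self.col_attn(x, x, x)                      # within a row, across columns
        x = self.ln[0](x + a)
        xt = x.transpose(0, 1)                             # (columns, rows, d): rows become the sequence
        mask = torch.zeros(R, R, dtype=torch.bool)
        mask[:, n_train:] = True                           # keys/values: training rows only
        a, _ = self.row_attn(xt, xt, xt, attn_mask=mask)
        x = self.ln[1](x + a.transpose(0, 1))
        return self.ln[2](x + self.mlp(x))

class TabularTransformer(nn.Module):
    def __init__(self, d=64, n_layers=4, n_buckets=100):
        super().__init__()
        self.cell_enc = nn.Linear(1, d)                    # encode each cell independently
        self.layers = nn.ModuleList([Layer(d) for _ in range(n_layers)])
        self.decoder = nn.Linear(d, n_buckets)             # label slot -> bucket logits

    def forward(self, table, n_train):
        # table: (rows, columns) with the label y appended as the last column;
        # test rows carry a placeholder (e.g. zero) in the label column.
        x = self.cell_enc(table.unsqueeze(-1))             # (rows, columns, d)
        for layer in self.layers:
            x = layer(x, n_train)
        label_slots = x[n_train:, -1]                      # test rows, label column only
        return self.decoder(label_slots)                   # (n_test, n_buckets) logits

model = TabularTransformer()
table = torch.randn(10, 5)          # 8 train rows + 2 test rows, 4 features + label column
table[8:, -1] = 0.0                 # placeholder label for the test rows
logits = model(table, n_train=8)    # (2, 100) bucket logits, one vector per test row
```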
Tabular foundation models turn regression into a classification problem by discretising the target range into N equal-probability buckets. The model outputs a softmax over the buckets, a histogram that approximates the full predictive distribution. The loss is the NLL of the true target's bucket, with the bucket probability divided by the bucket width so that the output is a proper density.
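A NumPy sketch of the bucketing and the width-scaled NLL; the function names, the ten buckets, and the toy targets are illustrative, not the model's actual configuration:

```python
import numpy as np

def make_buckets(y_train, n_buckets=10):
    """Equal-probability bucket edges: each bucket holds roughly the same
    fraction of the training targets."""
    qs = np.linspace(0.0, 1.0, n_buckets + 1)
    return np.quantile(y_train, qs)

def nll(logits, y_true, edges):
    """NLL of the true target's bucket. Dividing the bucket probability by
    the bucket width turns the histogram into a proper piecewise-constant
    density over y."""
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                               # softmax over buckets
    b = np.clip(np.searchsorted(edges, y_true, side="right") - 1, 0, len(edges) - 2)
    width = edges[b + 1] - edges[b]
    density = probs[b] / width                                # density at the true bucket
    return -np.log(density)

y_train = np.random.default_rng(0).normal(size=1000)
edges = make_buckets(y_train, n_buckets=10)
logits = np.zeros(10)                                         # uniform prediction over buckets
print(nll(logits, y_true=0.3, edges=edges))
```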
The model's weights don't store data; they store an algorithm. During training on hundreds of millions of synthetic datasets, the model minimises the expected NLL of masked labels across all datasets in the prior. By the PFN paper's Insight 1, this is equivalent to minimising the KL divergence between the model's output and the true Bayesian posterior predictive distribution averaged over every dataset the prior can generate. The weights learn to recognise the statistical fingerprint of a dataset (its noise level, linearity, feature interactions) and map it to the posterior that a Bayesian reasoner with the same prior would produce. At inference, reading the training set through attention is the computation.
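In symbols (notation mine, not necessarily the paper's), with $q_\theta$ the model's predictive distribution and $p$ the synthetic prior over datasets $(D_{\mathrm{train}}, x, y)$, the training objective and its Bayesian reading are:

```latex
\begin{align*}
\mathcal{L}(\theta)
  &= \mathbb{E}_{(D_{\mathrm{train}},\,x,\,y)\sim p}
     \bigl[-\log q_\theta(y \mid x, D_{\mathrm{train}})\bigr] \\
  &= \mathbb{E}_{(D_{\mathrm{train}},\,x)\sim p}
     \Bigl[\mathrm{KL}\bigl(p(\,\cdot \mid x, D_{\mathrm{train}})
       \,\big\|\, q_\theta(\,\cdot \mid x, D_{\mathrm{train}})\bigr)\Bigr]
     \;+\; \underbrace{\mathbb{E}\Bigl[H\bigl(p(\,\cdot \mid x, D_{\mathrm{train}})\bigr)\Bigr]}_{\text{independent of }\theta}
\end{align*}
```

The entropy term does not depend on $\theta$, so driving the expected NLL down is the same as driving the expected KL divergence to the true Bayesian posterior predictive down, averaged over every dataset the prior can generate.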
Before a single attention operation runs, each estimator in the M-member ensemble passes the data through a multi-stage preprocessing pipeline. Each stage transforms the raw numbers into a form the transformer can reason over.
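To make the idea concrete, here is a hedged sketch of what such a pipeline could look like for one ensemble member; the specific stages shown (outlier clipping, per-feature standardisation, tail compression, a per-member feature permutation) are assumptions chosen for illustration, not the model's actual pipeline:

```python
import numpy as np

def preprocess(X, rng):
    """Illustrative pipeline for one ensemble member; the stages are assumed, not the real model's."""
    X = X.astype(float).copy()
    lo, hi = np.nanpercentile(X, [1, 99], axis=0)
    X = np.clip(X, lo, hi)                                   # stage 1: clip outliers
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)        # stage 2: per-feature z-score
    X = np.sign(X) * np.log1p(np.abs(X))                     # stage 3: compress heavy tails
    perm = rng.permutation(X.shape[1])                       # stage 4: per-member feature order
    return X[:, perm], perm

def ensemble_views(X, M=8, seed=0):
    """One preprocessed view of the table per ensemble member."""
    rng = np.random.default_rng(seed)
    return [preprocess(X, rng) for _ in range(M)]

X = np.random.default_rng(1).normal(size=(200, 5)) ** 3      # skewed toy data
views = ensemble_views(X, M=4)
```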