tabpfn 0.1.0

  deep learning, modeling

  Max Kuhn

We’re stoked to announce the release of tabpfn 0.1.0. TabPFN is a pre-trained deep learning model, written in Python, for making predictions on tabular data. The R package tabpfn is an interface to this model via reticulate.

You can install it from CRAN with:

install.packages("tabpfn")

What is TabPFN?

The “tab” means tabular, which is code for everyday rectangular data structures that we find in csv files and databases.

The “pfn” is more complicated – it stands for “prior-data fitted network”. The model is trained entirely on synthetic datasets. The developers created a complex graph model that can simulate a wide variety of data-generating mechanisms, including correlation structures, distributional skewness, missing-data mechanisms, interactions, latent variables, and more. It can also simulate random supervised relationships linking potential predictors to the outcome. During training, a very large number of these datasets were simulated, and each simulated dataset, in effect, constitutes a single “training set data point”. For example, if a batch size of 64 was used during training, that means 64 randomly generated datasets were used in that iteration.

From these data sets, a complex deep learning model is created that captures a huge number of possible relationships. The model is sophisticated enough and trained in a manner that allows it to effectively emulate Bayesian estimation.

When we use the pre-trained model, our training set matters, even though there is no new estimation. The model includes an attention mechanism that “primes the model” by focusing on the types of relationships in your training data. In that way, the pre-fitted network is deliberately biased to effectively predict our new samples. This leads to in-context learning.

And it works; in fact, it works really well.

License for the Underlying Model

PriorLabs created TabPFN. Version 2.5 of the model, which contains several improvements, requires an API key to access the model parameters. Without one, an error occurs:

This model is gated and requires you to accept its terms. Please follow these steps: 1. Visit https://huggingface.co/Prior-Labs/tabpfn_2_5 in your browser and accept the terms of use. 2. Log in to your Hugging Face account via the command line by running: hf auth login (Alternatively, you can set the HF_TOKEN environment variable with a read token).

The license permits “Non-Commercial Use Only”, which covers simply trying the model out.

Instructions for installing the package and obtaining the API key are in the package’s manual.
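For example, one way to make a Hugging Face read token available to your R session is to set the HF_TOKEN environment variable mentioned in the error message above (the token value below is a placeholder; use your own):

```r
# Set a Hugging Face read token for the current R session.
# "hf_xxx" is a placeholder value.
Sys.setenv(HF_TOKEN = "hf_xxx")

# To make this persistent across sessions, add a line such as
#   HF_TOKEN=hf_xxx
# to your ~/.Renviron file instead.
```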

Also, the model is most efficient when a GPU is available (by an order of magnitude or two). This may seem obvious to anyone already working with deep learning models, but it is a fairly new requirement for those strictly working with traditional tabular data models.
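If you are unsure whether a GPU is visible to the underlying Python session, one quick check is to ask torch directly. This sketch assumes reticulate is configured and the Python torch package is installed:

```r
library(reticulate)

# Import the Python torch module into R.
torch <- import("torch")

# TRUE if an NVIDIA GPU is available via CUDA:
torch$cuda$is_available()

# On Apple silicon, the Metal (MPS) backend can be checked similarly:
torch$backends$mps$is_available()
```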

Usage

The syntax is idiomatic R: it supports fitting interfaces via data frames/vectors, formulas, and recipes. The standard R predict() method is used for prediction, and augment() is also available.
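As a rough sketch of what those interfaces look like (the formula interface is the one used later in this post; the x/y and recipe argument names here are assumptions based on the usual tidymodels conventions, and `dat` is a hypothetical data frame with an `outcome` column):

```r
library(tabpfn)

# Formula interface:
fit_f <- tab_pfn(outcome ~ ., data = dat)

# Data frame / vector (x/y) interface:
fit_xy <- tab_pfn(x = dat[, -1], y = dat$outcome)

# Recipe interface:
library(recipes)
rec <- recipe(outcome ~ ., data = dat)
fit_rec <- tab_pfn(rec, data = dat)
```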

When evaluating pre-trained models, there is a possibility that they may have memorized well-known datasets (e.g., Ames housing, Palmer penguins). TabPFN isn’t trained on real data, but just in case, we’ll use lesser-known data here. Worley (1987) derived a mechanistic model for the flow rate of liquids from two aquifers positioned vertically (i.e., the “upper” and “lower” aquifers). We’ll simulate data from that model and add pure-noise predictors to increase the difficulty. The outcome is very skewed, so we’ll analyze it on the log scale.

Additionally, we’ll load the tidymodels packages for simulation, data splitting, and visualization, along with the probably package for calibration plots.

library(tabpfn)
library(tidymodels)
library(probably)

set.seed(17)
aquifier_data <-
 sim_regression(2000,  method = "worley_1987") |>
 bind_cols(sim_noise(2000, 50)) |>
 mutate(outcome = log10(outcome))

We’ll use a stratified 3:1 training and testing split:

set.seed(8223)
aquifier_split <- initial_split(aquifier_data, strata = outcome)
aquifier_split
## <Training/Testing/Total>
## <1500/500/2000>
aquifier_train <- training(aquifier_split)
aquifier_test  <- testing(aquifier_split)

and “fit” the model:

tab_fit <- tab_pfn(outcome ~ ., data = aquifier_train)

Again, the model does not actually fit anything new. This computes the embeddings for the training set data and stores them for the prediction stage.

The predict() method computes the model’s predictions. As previously mentioned, a GPU is not strictly required for these computations, but if more than a trivial amount of data are being predicted, execution time can be very long without one.

Since we’ll want to evaluate and plot the data, we’ll use augment(), which just runs predict() and binds the results to the data being predicted:

tab_pred <- augment(tab_fit, aquifier_test)
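Conceptually, this is roughly equivalent to predicting and column-binding by hand (a sketch of the idea; augment() also takes care of details such as column placement):

```r
# Roughly what augment() does: predict, then bind to the predicted data.
tab_pred <- predict(tab_fit, aquifier_test) |>
  dplyr::bind_cols(aquifier_test)
```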

How does it work?

tab_pred |> metrics(outcome, .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.104 
## 2 rsq     standard      0.937 
## 3 mae     standard      0.0829
tab_pred |> cal_plot_regression(outcome, .pred)

[Calibration plot of the observed and predicted outcome values for the test set]

That looks good, especially with no training.

Next Steps

There is a lot more functionality to add to the package, including additional prediction types and interpretability tools. Many of these are available in extensions.

We’ll also add a new parsnip model type for TabPFN and other integrations with tidymodels in the summer.

Acknowledgements

A huge thanks to Tomasz Kalinowski and Daniel Falbel for their support on this and all of their hard work on reticulate and torch.

Thanks also to the contributors to date: @frankiethull, @mthulin, and @t-kalinowski.