Intro to probabilistic programming

Introduction

If one wants to estimate an unknown quantity of interest, the question of uncertainty arises.

An example of this can be found in weather forecasts: how likely is it to rain today or tomorrow?

What is the best way to assess your uncertainty, and to form your best estimate, in a given situation?

Probability

According to some scholars, probability reflects one's beliefs about an event and the uncertainty (or certainty) attached to it. In this view, there are only subjective beliefs.

Let's say a friend flips a coin, hides the result from me, then peeks at it and sees Heads. He is now certain that the probability that this precise coin flip landed Heads is 1.
But I am not, and since I haven't received any other information, I am going to assign a probability that reflects my belief that the coin is fairly balanced: a probability of 0.5.

For punters, one can see this probability as the inverse of the fair decimal odds I would quote for this bet.

Suppose my friend then tells me that he has seen this coin before, and that it gave Heads most of the time. Is my belief still the same? No: I am going to update it with this new information.

Bayes' theorem (or formula) tells us exactly how to update one's belief, i.e. how to update the implied odds.

Probability density

To understand Bayesian inference, one needs to grasp the concept of a probability density.

It is used to compute probabilities. Consider the event that a random variable X lies in an interval [a, b], i.e. a ≤ X ≤ b.

To obtain the probability of our event, we measure the area under the density curve (an integral) between a and b.
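
Writing f for the density of X, this is:

$$P(a \le X \le b) = \int_a^b f(x)\, dx$$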

Bayesian inference

In Bayesian inference, we mostly pay attention to the numerator of Bayes' formula, applied to probability distributions.

Indeed, we have a set of observed data, D, and unobserved parameters, theta.

We want to estimate the posterior probability distribution of the unobserved parameters given the (observed) data.

Bayes' theorem states the following:
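
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

Here p(D | theta) is the likelihood of the data, p(theta) is the prior, and the denominator p(D) is a constant that does not depend on theta.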

To get rid of this constant, we can multiply the two terms of the numerator (likelihood and prior), then normalize the result so that the posterior sums (integrates) to 1 across all possible theta values.

In practice we work with log-probabilities, so the above turns into a sum: log-posterior = log-likelihood + log-prior + C, where C is a constant independent of theta.

Coin tossing

Let's flip a coin. We don't know whether the coin is biased. We can only toss it and observe the outcome.

Denote by p the probability of heads. We want to estimate this quantity.

Use the slider below to toss the coin nTosses times. We then show the resulting posterior distribution over p.

You can also change the number of tosses above, and select a prior below. Watch how the prior changes the posterior distribution.

The largest differences appear with a small number of tosses; as the number of tosses grows, the posterior usually converges towards the same limiting distribution regardless of the prior.

Note that to be fully Bayesian™, you should probably report the mean of the posterior distribution rather than its mode (the value with the highest probability).
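
To make the brute-force computation concrete, here is a minimal sketch of a grid approximation, using the log form above (log-likelihood + log-prior, then normalize). The data (62 heads out of 100 tosses) are made up for illustration, and the Normal(0.5, 0.15) prior matches the NumPyro model shown later; the widget above may of course do things differently.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import binom, norm

# Made-up data for illustration: 62 heads out of 100 tosses.
n_tosses, n_heads = 100, 62

# Grid of candidate values for p, the probability of heads.
p_grid = np.linspace(0.001, 0.999, 999)

# log-posterior = log-likelihood + log-prior + constant
log_lik = binom.logpmf(n_heads, n_tosses, p_grid)
log_prior = norm.logpdf(p_grid, loc=0.5, scale=0.15)
log_post = log_lik + log_prior

# Normalize so that the posterior sums to 1 over the grid.
posterior = np.exp(log_post - logsumexp(log_post))

posterior_mean = np.sum(p_grid * posterior)    # the "fully Bayesian" point estimate
posterior_mode = p_grid[np.argmax(posterior)]  # the MAP estimate
print(posterior_mean, posterior_mode)

As the number of tosses grows, both the posterior mean and the mode converge to the observed frequency of heads, whatever reasonable prior you pick.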

Let's take bets

Suppose you haven't seen the dashed line, and that you have only witnessed some tosses. Now, I toss the coin one more time.

What fair decimal odds would you quote on the next toss landing Heads (a sketch of the odds arithmetic follows the list):

  • after 5 tosses?
  • 10 tosses?
  • 100 tosses?
  • 400 tosses?
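
The arithmetic is the same in every case: the fair decimal odds are simply the inverse of your probability that the next toss lands Heads. A tiny sketch, with hypothetical posterior samples of p (not the ones from the plot above):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior samples of p, e.g. from MCMC or the grid computation above.
posterior_samples = rng.beta(63, 39, size=5_000)

p_heads = posterior_samples.mean()  # posterior predictive probability of heads
fair_odds = 1.0 / p_heads           # fair decimal odds for "next toss is Heads"
print(round(p_heads, 3), round(fair_odds, 2))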

Probabilistic programming

So far, we've done mostly hand-crafted (with my hand) brute-force Bayesian inference.

Probabilistic programming languages (PPLs) let us specify a model in a dedicated domain-specific language (DSL), while the inference engine takes care of the hard work.

For instance, in NumPyro, a recent Python PPL built on JAX, you can specify the above problem as:

import numpyro as ny
import numpyro.distributions as dist


def model(N, obs=None):
    # Prior belief about p, the probability of heads.
    p = ny.sample("p", dist.Normal(0.5, 0.15))
    # Likelihood of the observed number of heads in N tosses.
    ny.sample("count_heads", dist.Binomial(N, p), obs=obs)

Then you only specify the automatic inference algorithm, and that's it!
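
For example, to run NumPyro's NUTS sampler on this model (the 62 heads out of 100 tosses are again a made-up example):

from jax import random
from numpyro.infer import MCMC, NUTS

# Sample from the posterior of p with NUTS.
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), N=100, obs=62)
mcmc.print_summary()  # posterior mean, std and quantiles for p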

In our case, the problem is one-dimensional: we only have one parameter to identify.

In real-world™ scenarios, there are often dozens of parameters to identify. It becomes computationally intractable to perform brute-force inference over all of them, which is why smart approximation algorithms such as Markov chain Monte Carlo (MCMC) are so useful.

The best ones, such as HMC and its variant NUTS, leverage automatic differentiation (AD). For a short overview of automatic differentiation (in French), see this post.
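
To give a feel for why AD matters here: HMC needs the gradient of the log-posterior with respect to the parameters, and JAX provides it automatically. A minimal sketch on our coin problem (unnormalized log-posterior, same made-up counts as above):

import jax
import jax.numpy as jnp

def log_posterior(p, n_heads=62, n_tosses=100):
    # Unnormalized log-posterior = log-likelihood + log-prior (constants dropped).
    log_lik = n_heads * jnp.log(p) + (n_tosses - n_heads) * jnp.log1p(-p)
    log_prior = -0.5 * ((p - 0.5) / 0.15) ** 2  # Normal(0.5, 0.15) prior on p
    return log_lik + log_prior

grad_log_post = jax.grad(log_posterior)  # exact gradient via AD, no finite differences
print(grad_log_post(0.4))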

To go further

Probability is interpreted as a formal system of logic, the natural extension of Aristotelian logic (in which every statement is either true or false) into the realm of reasoning in the presence of uncertainty.