Statistical inference

The procedure of drawing conclusions about model parameters from randomly distributed data is known as statistical inference. This section explains the mathematical theory behind statistical inference. If you are already familiar with this subject, feel free to skip to the next section.

General case

For a statistician a model is simply a probability density function (PDF) f, which describes the (joint) distribution of a set of observables \vec x=(x_1,\ldots,x_n) in the sample space \mathcal X=\R^n and depends on some parameters \vec\omega\in\Omega. The parameter space \Omega can be any (discrete or continuous) set and f can be any function of \vec x and \vec\omega with

(1)\int d^n\vec x\,f(\vec x,\vec\omega) = 1

for all \vec\omega\in\Omega. Statistical inference can help us answer the following questions:

  1. Which choice of the parameters is most compatible with some observed data \vec x_0\in\mathcal X?
  2. Based on the observed data \vec x_0, which parts of the parameter space \Omega can be ruled out (at which confidence level)?

Let us begin with the first question. What we need is a function \hat{\vec\omega}:\R^n\to\Omega which assigns to each observable vector \vec x an estimate of the model parameters. Such a function is called an estimator. There are different ways to define an estimator, but the most popular choice (and the one that myFitter supports) is the maximum likelihood estimator defined by

\hat{\vec\omega}(\vec x) = \argmax_{\vec\omega\in\Omega}f(\vec x,\vec\omega)
\eqsep.

In other words, for a given observation \vec x the maximum likelihood estimate of the parameters is given by those parameters which maximise f(\vec x,\cdot). When we keep the observables fixed and regard f as a function of the parameters we call it the likelihood function. Finding the maximum of the likelihood function is commonly referred to as fitting the model f to the data \vec x. It usually requires numerical optimisation techniques.
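
Finding this maximum in practice typically means minimising -2\ln f numerically. As a rough illustration (a hypothetical toy model in Python, not the myFitter API), one might write:

    # Illustrative sketch only: a made-up toy model, not the myFitter API.
    import numpy as np
    from scipy.optimize import minimize

    # Toy model: two observables with Gaussian errors sigma, whose means depend
    # non-linearly on a single parameter omega: mu(omega) = (omega, omega**2).
    sigma = np.array([0.5, 0.8])
    x_obs = np.array([1.2, 1.9])   # hypothetical observed data x_0

    def chi2(omega, x):
        """-2 ln f(x, omega) up to an omega-independent constant."""
        mu = np.array([omega[0], omega[0] ** 2])
        return np.sum(((x - mu) / sigma) ** 2)

    # Maximising the likelihood is the same as minimising chi2.
    fit = minimize(chi2, x0=[1.0], args=(x_obs,))
    print("maximum likelihood estimate:", fit.x)
    print("chi2 at the minimum:", fit.fun)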

The second question is answered (in frequentist statistics) by performing a hypothesis test. Let’s say we want to know if the model is realised with parameters from some subset \Omega_0\subset\Omega. We call this the null hypothesis. To test this hypothesis we have to define some function T(\vec x), called a test statistic, which quantifies the disagreement of the observation \vec x with the null hypothesis. The phrase “quantifying the disagreement” is meant in the rather vague sense that T(\vec x) should be large when \vec x looks like it was not drawn from a distribution with parameters in \Omega_0. In principle any function on \mathcal{X} can be used as a test statistic, but of course some are less useful than others. The most popular choice (and the one myFitter is designed for) is

T(\vec x) = -2\ln\frac{\max_{\vec\omega_0\in\Omega_0}f(\vec x,\vec\omega_0)}
                      {\max_{\vec\omega\in\Omega}f(\vec x,\vec\omega)} \eqsep.

For this choice of T, the test is called a likelihood ratio test. With a slight abuse of notation, we define

(2)\chi^2(\vec x,\vec\omega) \equiv -2\ln f(\vec x,\vec\omega)
\eqsep.

Then we may write

(3)T(\vec x)
=  \min_{\vec\omega_0\in\Omega_0}\chi^2(\vec x,\vec\omega_0)
  -\min_{\vec\omega\in\Omega}\chi^2(\vec x,\vec\omega)
\equiv \Delta\chi^2(\vec x)
\eqsep.
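
In practice \Delta\chi^2 is obtained from two fits: one with the parameters restricted to \Omega_0 and one over the full parameter space \Omega. A minimal Python sketch (the same hypothetical toy model as above, with a null hypothesis that fixes \omega=1) could look like this:

    # Illustrative sketch only; toy model and data are hypothetical.
    import numpy as np
    from scipy.optimize import minimize

    sigma = np.array([0.5, 0.8])
    x_obs = np.array([1.2, 1.9])

    def chi2(omega, x):
        mu = np.array([omega[0], omega[0] ** 2])   # theory prediction mu(omega)
        return np.sum(((x - mu) / sigma) ** 2)     # -2 ln f up to a constant

    # Full fit over Omega.
    chi2_full = minimize(chi2, x0=[1.0], args=(x_obs,)).fun
    # Constrained fit over Omega_0 = {1} (a single point, so no minimisation needed).
    chi2_null = chi2(np.array([1.0]), x_obs)

    delta_chi2 = chi2_null - chi2_full
    print("Delta chi2 =", delta_chi2)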

Clearly large values of \Delta\chi^2 indicate that the null hypothesis is unlikely, but at which value should we reject it? To determine that we have to compute another quantity called the p-value

(4)p\equiv p(\Omega_0,\vec x_0) =
\max_{\vec\omega_0\in\Omega_0} \int d^n\vec y\,f(\vec y,\vec\omega_0)
\theta(\Delta\chi^2(\vec y)-\Delta\chi^2(\vec x_0))
\eqsep,

where \theta is the Heaviside step function. Note that the integral is simply the probability that, for a random vector \vec y of toy observables drawn from the distribution f(\cdot,\vec\omega_0), the test statistic \Delta\chi^2(\vec y) is larger than the observed value \Delta\chi^2(\vec x_0). The meaning of the p-value is the following: if the null hypothesis is true and we reject it for values of \Delta\chi^2 larger than the observed value \Delta\chi^2(\vec x_0), the probability of wrongly rejecting the null hypothesis is at most p(\Omega_0,\vec x_0). Thus, the smaller the p-value, the more confident we can be in rejecting the null hypothesis. The complementary probability 1-p(\Omega_0,\vec x_0) is therefore called the confidence level at which we may reject the null hypothesis \Omega_0, based on the observation \vec x_0.

Instead of specifying the p-value directly, people often give the Z-value (or number of ‘standard deviations’ or ‘sigmas’), which is related to the p-value by

Z = \sqrt 2\Erf^{-1}(1-p)
\eqsep,

where \Erf denotes the error function. The relation is chosen in such a way that the integral of a standard normal distribution from -Z to +Z is the confidence level 1-p. So, if someone tells you that they have discovered something at 5\,\sigma, they have (usually) done a likelihood ratio test and rejected the “no signal” hypothesis at a confidence level corresponding to Z=5.
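
For reference, the conversion between p and Z can be done with any library that provides the inverse error function; for instance (with a hypothetical p-value):

    # p <-> Z conversion, illustrative only.
    import numpy as np
    from scipy.special import erf, erfinv

    p = 5.7e-7                          # hypothetical p-value
    Z = np.sqrt(2.0) * erfinv(1.0 - p)
    print("Z =", Z)                     # roughly 5 for this p

    # And back again:
    print("p =", 1.0 - erf(Z / np.sqrt(2.0)))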

Linear regression models and Wilks’ theorem

Evaluating the right-hand side of (4) exactly for a realistic model can be extremely challenging. However, for a specific class of models called linear regression models we can evaluate (4) analytically. The good news is that most models can be approximated by linear regression models and that this approximation is often sufficient for the purpose of estimating the p-value. A concrete formulation of this statement is given by a theorem by Wilks, which we shall discuss shortly. The myFitter package was originally written to handle problems where Wilks’ theorem is not applicable, but of course the framework can also be used in cases where Wilks’ theorem applies.

In a linear regression model the parameter space \Omega=\R^k is a k-dimensional real vector space and the PDF f is a (multi-dimensional) normal distribution with a fixed (i.e. parameter-independent) covariance matrix \Sigma centered on an observable vector \vec\mu(\vec\omega) which is an affine-linear function of the parameters \vec\omega:

(5)f(\vec x,\vec\omega) &= \mathcal{N}(\vec\mu(\vec\omega), \Sigma)(\vec x)
\propto \exp[-\tfrac12(\vec x-\vec\mu(\vec\omega))^\trans \Sigma^{-1}
              (\vec x-\vec\mu(\vec\omega))]
 \eqsep,\\
 \vec\mu(\vec\omega) &= A\vec\omega + \vec b
 \eqsep,

where \Sigma is a symmetric, positive definite matrix, A is an (n\times k)-matrix with maximal rank and \vec b is an n-dimensional real vector. Now assume that the parameter space \Omega_0 under the null hypothesis is a k_0-dimensional affine sub-space of \Omega (i.e. a linear sub-space with a constant shift). In this case the formula (4) for the p-value becomes

(6)p(\Omega_0,\vec x_0) = Q_{(k-k_0)/2}(\Delta\chi^2(\vec x_0)/2)
\eqsep,

where Q is the normalised upper incomplete Gamma function, which can be found in any decent special functions library. Note that, for a linear regression model, the p-value only depends on the dimension of the affine sub-space \Omega_0 but not on its position within the larger space \Omega.
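
In Python, for instance, the right-hand side of (6) can be evaluated with scipy.special.gammaincc, or equivalently via the survival function of a \chi^2 distribution with k-k_0 degrees of freedom (the numbers below are made up):

    # Evaluating (6), illustrative only.
    from scipy.special import gammaincc
    from scipy.stats import chi2 as chi2_dist

    delta_chi2 = 3.84   # hypothetical observed Delta chi2
    k, k0 = 2, 1        # hypothetical dimensions of Omega and Omega_0

    p = gammaincc((k - k0) / 2.0, delta_chi2 / 2.0)
    print(p)                                     # ~0.05 for one degree of freedom
    print(chi2_dist.sf(delta_chi2, df=k - k0))   # same number via the chi^2 survival function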

The models we encounter in global fits are usually not linear regression models. They typically belong to a class of models known as non-linear regression models. The PDF of a non-linear regression model still has the form (5) with a fixed covariance matrix \Sigma, but the means \vec\mu may depend on the parameters \vec\omega in an arbitrary way. Furthermore, the parameter spaces \Omega and \Omega_0 don’t have to be vector spaces but can be arbitrary smooth manifolds. The means \vec\mu of a non-linear regression model are usually the theory predictions for the observables and have a known functional dependence on the parameters \vec\omega of the theory. The information about experimental uncertainties (and correlations) is encoded in the covariance matrix \Sigma.
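
For orientation, the \chi^2 function (2) of a non-linear regression model is, up to an irrelevant constant, just the quadratic form appearing in (5). A hedged Python sketch with invented predictions and covariance might read:

    # chi^2 of a hypothetical non-linear regression model, illustrative only.
    import numpy as np

    Sigma = np.array([[0.25, 0.05],
                      [0.05, 0.64]])    # fixed covariance matrix of the observables
    Sigma_inv = np.linalg.inv(Sigma)

    def mu(omega):
        """Made-up non-linear theory predictions for two observables."""
        return np.array([omega[0] * np.exp(omega[1]), omega[0] + omega[1] ** 2])

    def chi2(x, omega):
        r = x - mu(omega)
        return r @ Sigma_inv @ r        # -2 ln f(x, omega) up to a constant

    print(chi2(np.array([1.0, 2.0]), np.array([0.5, 0.7])))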

Since any smooth non-linear function can be locally approximated by a linear function it seems plausible that (6) can still hold approximately even for non-linear regression models. This is essentially the content of Wilks’ theorem. It states that, for a general model (not only non-linear regression models) which satisfies certain regularity conditions, (6) holds asymptotically in the limit of an infinite number of independent observations. The parameter spaces \Omega and \Omega_0 in the general case can be smooth manifolds with dimensions k and k_0, respectively. Note, however, that for Wilks’ theorem to hold \Omega_0 must still be a subset of \Omega.

Plug-in p-values and the myFitter method

In practice we cannot make an arbitrarily large number of independent observations. Thus (6) is only an approximation whose quality depends on the amount of collected data and the model under consideration. To test the quality of the approximation (6) for a specific model one must evaluate or at least approximate (4) by other means, e.g. numerically. Such computations are called toy simulations, and their computational cost can be immense. Note that the evaluation of \Delta\chi^2 (and thus of the integrand in (4)) requires a numerical optimisation. And then the integral in (4) still needs to be maximised over \vec\omega_0.

In most cases the integral is reasonably close to its maximum value when \vec\omega_0 is set to its maximum likelihood estimate for the observed data \vec x_0 and the parameter space \Omega_0:

\hat{\vec\omega}_0(\vec x_0)\equiv
\argmax_{\vec\omega\in\Omega_0}f(\vec x_0,\vec\omega)
\eqsep.

Thus, the plug-in p-value

\hat p(\Omega_0,\vec x_0) =
\int d^n\vec y\,f(\vec y,\hat{\vec\omega}_0(\vec x_0))
\theta(\Delta\chi^2(\vec y)-\Delta\chi^2(\vec x_0))

is often a good approximation to the real p-value.

The standard numerical approach to computing the remaining integral is to generate a large number of toy observations \vec y_1,\ldots,\vec y_N distributed according to f(\cdot,\hat{\vec\omega}_0(\vec x_0)) and then determine the fraction of toy observations where \Delta\chi^2(\vec y_i)>\Delta\chi^2(\vec x_0). If the p-value is small this method can become very inefficient. The efficiency can be significantly improved by importance sampling methods, i.e. by drawing toy observations from some other (suitably constructed) sampling density which avoids the region with \Delta\chi^2(\vec y)<\Delta\chi^2(\vec x_0). This method lies at the heart of the myFitter package.
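
For illustration, the plain (non-importance-sampled) version of such a toy simulation might look as follows; everything here is hypothetical (it is not how myFitter implements the importance-sampled version), and in a real application each evaluation of \Delta\chi^2 would involve the two fits from (3) rather than the trivial stand-in below:

    # Plain toy Monte Carlo estimate of the plug-in p-value, illustrative only.
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical plug-in means and covariance of f(., omega_hat_0),
    # and a hypothetical observed value of Delta chi2.
    mu_hat = np.array([1.0, 2.0])
    Sigma = np.diag([0.25, 0.64])
    dchi2_obs = 3.84

    def delta_chi2(y):
        # Stand-in for the two minimisations in (3); replace with real fits.
        return np.sum((y - mu_hat) ** 2 / np.diag(Sigma))

    N = 100_000
    toys = rng.multivariate_normal(mu_hat, Sigma, size=N)
    dchi2_toys = np.array([delta_chi2(y) for y in toys])
    p_hat = np.mean(dchi2_toys > dchi2_obs)
    print("plug-in p-value estimate:", p_hat)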

Gaussian and systematic errors

As mentioned above the most common type of model we deal with in global fits are non-linear regression models where the means \vec\mu(\vec\omega) represent theoretical predictions for the observables and the information about experimental uncertainties is encoded in the covariance matrix \Sigma. If the value of an observable is given as 7.1\pm 0.5 in a scientific text this usually implies that the observable has a Gaussian distribution with a standard deviation of 0.5. Measurements from different experiments are usually uncorrelated. Thus the covariance matrix for n independent observables is

\Sigma = \diag(\sigma_1^2,\ldots,\sigma_n^2)\eqsep,

where the \sigma_i are the errors of the individual measurements. If several quantities are measured in the same experiment they can be correlated. Information about the correlations is usually presented in the form of a correlation matrix \rho. The correlation matrix is always symmetric and its diagonal entries are 1. (Thus it is common to only show the lower or upper triangle of \rho in a publication.) The elements \rho_{ij} of \rho are called correlation coefficients. For n observables with errors \sigma_i and correlation coefficients \rho_{ij} the covariance matrix \Sigma is given by

\Sigma_{ij} = \sigma_i\sigma_j\rho_{ij}\eqsep.
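
A short Python sketch of this construction (with invented errors and correlation coefficients):

    # Building the covariance matrix from errors and a correlation matrix, illustrative only.
    import numpy as np

    sigma = np.array([0.5, 0.8, 1.2])          # hypothetical errors sigma_i
    rho = np.array([[1.0,  0.3,  0.0],
                    [0.3,  1.0, -0.2],
                    [0.0, -0.2,  1.0]])        # hypothetical correlation matrix

    Sigma = np.outer(sigma, sigma) * rho       # Sigma_ij = sigma_i * sigma_j * rho_ij
    print(Sigma)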

In some cases the error of an observable is broken up into a statistical and a systematic component. Typical notations are 7.1\pm 0.3(\text{stat})\pm 0.2(\text{syst}) or simply 7.1\pm 0.3\pm 0.2, with an indication in the text which component is the systematic one. Sometimes the systematic error is asymmetric and denoted as 7.1\pm 0.3^{+0.1}_{-0.2}.

What are we supposed to do with this extra information? This depends largely on the context, i.e. on the nature of the systematic uncertainties being quoted. For example, the theoretical prediction of an observable x (or its extraction from raw experimental data) might require knowledge of some other quantity \lambda which has been measured elsewhere with some Gaussian uncertainty. In this case one also speaks of a parametric uncertainty. If you know the dependence of x on \lambda it is best to treat \lambda as a parameter of your model, add an observable which represents the measurement of \lambda (i.e. with a Gaussian distribution centered on \lambda and an appropriate standard deviation), and use only the statistical error to model the distribution of x. In particular, this is the only (correct) way in which you can combine x with other observables that depend on the same parameter \lambda. In this case the parameter \lambda is called a nuisance parameter, since we are not interested in its value but need it to extract values for the parameters we are interested in.
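
As a sketch of what this means in terms of the \chi^2 function (2): if x has statistical error \sigma_\text{stat} and theory prediction \mu(\vec\omega,\lambda), and the external measurement of \lambda is \lambda_0\pm\sigma_\lambda (hypothetical symbols for this example), the relevant contribution is, up to constants and assuming no correlations,

\chi^2 \supset \frac{(x-\mu(\vec\omega,\lambda))^2}{\sigma_\text{stat}^2}
             + \frac{(\lambda_0-\lambda)^2}{\sigma_\lambda^2}
\eqsep,

and \lambda is then minimised over together with \vec\omega in every fit.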

If you don’t know the dependence of x on \lambda but you are sure that the systematic uncertainty for x given in a paper is (mainly) parametric due to \lambda, and that \lambda does not affect any other observables in your fit, you can combine the statistical and systematic error in quadrature. This means you assume a Gaussian distribution for x with standard deviation (\sigma^2_\text{stat} + \sigma^2_\text{syst})^{1/2}, where \sigma_\text{stat} is the statistical and \sigma_\text{syst} the systematic error (which should be symmetric in this case). This procedure is also correct if the systematic error is the combined parametric uncertainty due to several parameters, as long as none of these parameters affect any other observables in your fit.
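
For the example 7.1\pm 0.3(\text{stat})\pm 0.2(\text{syst}) quoted above this prescription gives a combined standard deviation of (0.3^2 + 0.2^2)^{1/2} = \sqrt{0.13}\approx 0.36.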

So far in our discussion we have assumed that the nuisance parameter(s) \lambda can be measured and have a Gaussian distribution. This is not always the case. If, for example, the theoretical prediction for an observable is an approximation there will be a constant offset between the theory prediction and the measured value. This offset does not average out when the experiment is repeated many times, and it cannot be measured separately either. At best we can find some sort of upper bound on the size of this offset. The extraction of an observable from the raw data may also depend on parameters which can only be bounded but not measured. In this case the offset between the theory prediction and the mean of the measured quantity should be treated as an additional model parameter whose values are restricted to a finite range. Assume that the systematic error of x is given as {}^{+\delta_+}_{-\delta_-} with \delta_+,\delta_->0. Let \mu_\text{th}(\vec\omega) be the theory prediction for x. The mean \mu of the observable x then depends on an additional nuisance parameter \delta which takes values in the interval [-\delta_-,+\delta_+]:

\mu(\vec\omega,\delta) = \mu_\text{th}(\vec\omega) - \delta
\eqsep\text{with}\eqsep\delta\in[-\delta_-,+\delta_+]
\eqsep.

Note that the formula above holds for systematic uncertainties associated with the measurement of x. If the theory prediction for an observable is quoted with a systematic error {}^{+\delta_+}_{-\delta_-} the correct range for \delta is [-\delta_+,+\delta_-].
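
As an illustration of the measurement-side convention above (a sketch with made-up numbers, not the myFitter API), the offset \delta can simply be included as an additional fit parameter with box constraints:

    # Offset nuisance parameter with a bounded range, illustrative only.
    import numpy as np
    from scipy.optimize import minimize

    sigma_stat = 0.3                       # hypothetical statistical error of x
    x_obs = 7.1                            # hypothetical measured value
    delta_minus, delta_plus = 0.2, 0.1     # hypothetical systematic error ^{+0.1}_{-0.2}

    def mu_th(omega):
        return omega                       # made-up theory prediction for x

    def chi2(params):
        omega, delta = params
        mu = mu_th(omega) - delta          # mean including the offset nuisance parameter
        return ((x_obs - mu) / sigma_stat) ** 2

    fit = minimize(chi2, x0=[7.0, 0.0],
                   bounds=[(None, None), (-delta_minus, delta_plus)])
    print("fitted omega and delta:", fit.x)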