The procedure of drawing conclusions about model parameters from randomly distributed data is known as statistical inference. This section explains the mathematical theory behind statistical inference. If you are already familiar with this subject feel free to skip to the next section.
For a statistician, a model is simply a probability density function (PDF) $f(x|\omega)$ which describes the (joint) distribution of a set of observables $x$ in the sample space $\mathcal{X}$ and depends on some parameters $\omega$. The parameter space $\Omega$ can be any (discrete or continuous) set, and $f$ can be any function of $x$ and $\omega$ with
$$\int_{\mathcal{X}} f(x|\omega)\,dx = 1 \qquad (1)$$
for all $\omega \in \Omega$. Statistical inference can help us answer the following questions:

1. Which values of the parameters $\omega$ describe the observed data best?
2. Are the observed data compatible with parameters from a given subset of the parameter space?
Let us begin with the first question. What we need is a function $\hat\omega : \mathcal{X} \to \Omega$ which assigns to each observable vector $x$ an estimate of the model parameters. Such a function is called an estimator. There are different ways to define an estimator, but the most popular choice (and the one that myFitter supports) is the maximum likelihood estimator defined by
$$\hat\omega(x) = \operatorname*{argmax}_{\omega\in\Omega} f(x|\omega)\;.$$
In other words, for a given observation $x$ the maximum likelihood estimate of the parameters is the parameter value which maximises $f(x|\omega)$. When we keep the observables $x$ fixed and regard $f(x|\omega)$ as a function of the parameters $\omega$ we call it the likelihood function. Finding the maximum of the likelihood function is commonly referred to as fitting the model to the data $x$. It usually requires numerical optimisation techniques.
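As a minimal illustration (myFitter itself is not used here), the following sketch fits an invented one-parameter model with two independent Gaussian observables by numerically minimising the negative log-likelihood; all names and numbers are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Invented one-parameter model: theory predictions for two
# independent Gaussian observables as a function of omega.
def prediction(omega):
    return np.array([omega, omega**2])

sigma = np.array([0.3, 0.5])   # assumed experimental errors
x_obs = np.array([1.1, 1.4])   # assumed observed values

# Negative log-likelihood, up to an omega-independent constant.
def nll(params):
    r = (x_obs - prediction(params[0])) / sigma
    return 0.5 * np.sum(r**2)

# Fitting the model to the data: maximise the likelihood by
# minimising the negative log-likelihood numerically.
fit = minimize(nll, x0=[1.0])
print("maximum likelihood estimate:", fit.x[0])
```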
The second question is answered (in frequentist statistics) by performing a hypothesis test. Let’s say we want to know if the model is realised with parameters from some subset $\Omega_0 \subseteq \Omega$. We call this the null hypothesis. To test this hypothesis we have to define some function $t : \mathcal{X} \to \mathbb{R}$, called a test statistic, which quantifies the disagreement of the observation $x$ with the null hypothesis. The phrase “quantifying the disagreement” is meant in the rather vague sense that $t(x)$ should be large when $x$ looks like it was not drawn from a distribution with parameters in $\Omega_0$. In principle any real-valued function on $\mathcal{X}$ can be used as a test statistic, but of course some are less useful than others. The most popular choice (and the one myFitter is designed for) is
$$t(x) = -2\ln\frac{\max_{\omega\in\Omega_0} f(x|\omega)}{\max_{\omega\in\Omega} f(x|\omega)}\;.$$
For this choice of $t$, the test is called a likelihood ratio test. With a slight abuse of notation, we define
$$\chi^2(x|\Omega') = \min_{\omega\in\Omega'}\bigl[-2\ln f(x|\omega)\bigr] \qquad (2)$$
for any subset $\Omega' \subseteq \Omega$. Then we may write
$$t(x) = \chi^2(x|\Omega_0) - \chi^2(x|\Omega)\;. \qquad (3)$$
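Continuing the sketch above, the two terms in (3) come from two fits: one over the full parameter space and one restricted to the null hypothesis. In this invented example the null hypothesis is a single point, so the constrained minimum is just a function evaluation:

```python
# chi^2(x|Omega): minimise -2 ln f over the full parameter space
# (the omega-independent constant in -2 ln f cancels in eq. (3)).
chi2_full = 2.0 * minimize(nll, x0=[1.0]).fun

# chi^2(x|Omega_0): here the null hypothesis is the single point
# omega = 1.0, so no minimisation is needed.
chi2_null = 2.0 * nll([1.0])

# Likelihood ratio test statistic, eq. (3).
t_obs = chi2_null - chi2_full
```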
Clearly large values of $t(x)$ indicate that the null hypothesis is unlikely, but at which value should we reject it? To determine that we have to compute another quantity called the p-value:
$$p(x) = \max_{\omega\in\Omega_0} \int_{\mathcal{X}} dx'\, f(x'|\omega)\, \theta\bigl(t(x') - t(x)\bigr)\;, \qquad (4)$$
where $\theta$ is the Heaviside step function. Note that the integral is simply the probability that, for a random vector $x'$ of toy observables drawn from the distribution $f(x'|\omega)$, the test statistic $t(x')$ is larger than the observed value $t(x)$. The meaning of the p-value is the following: if the null hypothesis is true and we reject it for values of $t$ larger than the observed value $t(x)$, the probability of wrongly rejecting the null hypothesis is at most $p(x)$. Thus, the smaller the p-value, the more confident we can be in rejecting the null hypothesis. The complementary probability $1 - p(x)$ is therefore called the confidence level at which we may reject the null hypothesis $\omega \in \Omega_0$, based on the observation $x$.
Instead of specifying the p-value directly, people often give the Z-value (or number of ‘standard deviations’ or ‘sigmas’), which is related to the p-value by
$$p = 1 - \operatorname{erf}\!\left(\frac{Z}{\sqrt{2}}\right)\;,$$
where ‘$\operatorname{erf}$’ denotes the error function. The relation is chosen in such a way that the integral of a normal distribution from $-Z$ to $+Z$ standard deviations (about its mean) is the confidence level $1 - p$. So, if someone tells you that they have discovered something at $5\sigma$ they have (usually) done a likelihood ratio test and rejected the “no signal” hypothesis at a confidence level corresponding to $Z = 5$.
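In code, the conversion between p-values and Z-values is a one-liner with the error function; for example:

```python
import numpy as np
from scipy.special import erf, erfinv

def p_from_z(z):
    return 1.0 - erf(z / np.sqrt(2.0))   # p = 1 - erf(Z/sqrt(2))

def z_from_p(p):
    return np.sqrt(2.0) * erfinv(1.0 - p)

print(p_from_z(5.0))   # ~5.7e-7, the famous '5 sigma' threshold
```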
Evaluating the right-hand side of (4) exactly for a realistic model can be extremely challenging. However, for a specific class of models called linear regression models we can evaluate (4) analytically. The good news is that most models can be approximated by linear regression models and that this approximation is often sufficient for the purpose of estimating the p-value. A concrete formulation of this statement is given by a theorem by Wilks, which we shall discuss shortly. The myFitter package was originally written to handle problems where Wilks’ theorem is not applicable, but of course the framework can also be used in cases where Wilks’ theorem applies.
In a linear regression model the parameter space $\Omega$ is an $n$-dimensional real vector space and the PDF $f(x|\omega)$ is a (multi-dimensional) normal distribution with a fixed (i.e. parameter-independent) covariance matrix $\Sigma$, centered on an observable-vector $\bar x(\omega)$ which is an affine-linear function of the parameters $\omega$:
$$f(x|\omega) = \frac{1}{\sqrt{(2\pi)^N \det\Sigma}}\, \exp\Bigl[-\tfrac{1}{2}\,(x - A\omega - b)^T \Sigma^{-1} (x - A\omega - b)\Bigr]\;, \qquad (5)$$
where $\Sigma$ is a symmetric, positive definite $N\times N$ matrix, $A$ is a non-singular (i.e. rank-$n$) $N\times n$ matrix and $b$ is an $N$-dimensional real vector. Now assume that the parameter space $\Omega_0$ under the null hypothesis is an $n_0$-dimensional affine sub-space of $\Omega$ (i.e. a linear sub-space with a constant shift). In this case the formula (4) for the p-value becomes
$$p(x) = Q\!\left(\frac{n - n_0}{2},\, \frac{t(x)}{2}\right)\;, \qquad (6)$$
where $Q(a, z) = \Gamma(a, z)/\Gamma(a)$ is the normalised upper incomplete Gamma function, which can be found in any decent special functions library. Note that, for a linear regression model, the p-value only depends on the dimension $n_0$ of the affine sub-space $\Omega_0$ but not on its position within the larger space $\Omega$.
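For instance, SciPy provides the normalised upper incomplete Gamma function as scipy.special.gammaincc, so (6) is a one-liner:

```python
from scipy.special import gammaincc   # Q(a, z) = Gamma(a, z) / Gamma(a)

def p_value_linear(t_obs, n, n0):
    # Eq. (6): p = Q((n - n0)/2, t(x)/2), i.e. the survival function
    # of a chi-square distribution with n - n0 degrees of freedom.
    return gammaincc(0.5 * (n - n0), 0.5 * t_obs)

print(p_value_linear(4.0, n=2, n0=1))   # ~0.0455, about 2 sigma
```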
The models we encounter in global fits are usually not linear regression models. They typically belong to a class of models known as non-linear regression models. The PDF of a non-linear regression model still has the form (5) with a fixed covariance matrix $\Sigma$, but the means $\bar x(\omega)$ may depend on the parameters $\omega$ in an arbitrary way. Furthermore, the parameter spaces $\Omega$ and $\Omega_0$ don’t have to be vector spaces but can be arbitrary smooth manifolds. The means $\bar x(\omega)$ of a non-linear regression model are usually the theory predictions for the observables and have a known functional dependence on the parameters $\omega$ of the theory. The information about experimental uncertainties (and correlations) is encoded in the covariance matrix $\Sigma$.
Since any smooth non-linear function can be locally approximated by a linear function, it seems plausible that (6) can still hold approximately even for non-linear regression models. This is essentially the content of Wilks’ theorem. It states that, for a general model (not only non-linear regression models) which satisfies certain regularity conditions, (6) holds asymptotically in the limit of an infinite number of independent observations. The parameter spaces $\Omega$ and $\Omega_0$ in the general case can be smooth manifolds with dimensions $n$ and $n_0$, respectively. Note, however, that for Wilks’ theorem to hold $\Omega_0$ must still be a subset of $\Omega$.
In practice we cannot make an arbitrarily large number of independent observations. Thus (6) is only an approximation whose quality depends on the amount of collected data and the model under consideration. To test the quality of the approximation (6) for a specific model one must evaluate, or at least approximate, (4) by other means, e.g. numerically. Such computations are called toy simulations, and their computational cost can be immense. Note that the evaluation of $t(x')$ (and thus of the integrand in (4)) requires a numerical optimisation. And then the integral in (4) still needs to be maximised over $\Omega_0$.
In most cases the integral is reasonably close to its maximum value when $\omega$ is set to its maximum likelihood estimate for the observed data $x$ and the parameter space $\Omega_0$:
$$\hat\omega_0(x) = \operatorname*{argmax}_{\omega\in\Omega_0} f(x|\omega)\;.$$
Thus, the plug-in p-value
$$\hat p(x) = \int_{\mathcal{X}} dx'\, f(x'|\hat\omega_0(x))\, \theta\bigl(t(x') - t(x)\bigr)$$
is often a good approximation to the real p-value.
The standard numerical approach to computing the remaining integral is to generate a large number of toy observations $x'$ distributed according to $f(x'|\hat\omega_0(x))$ and then determine the fraction of toy observations with $t(x') > t(x)$. If the p-value is small this method can become very inefficient. The efficiency can be significantly improved by importance sampling methods, i.e. by drawing toy observations from some other (suitably constructed) sampling density which avoids the region with $t(x') < t(x)$. This method lies at the heart of the myFitter package.
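A minimal sketch of plain (non-importance-sampled) toy simulations for a Gaussian model, assuming a hypothetical function t_statistic that performs the two fits described above:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def toy_p_value(t_obs, mean_fit, sigma, t_statistic, n_toys=10_000):
    """Plug-in p-value from plain toy simulations: draw toys from the
    model PDF at the constrained best-fit point and count the fraction
    with t(toy) > t(observed)."""
    toys = rng.normal(mean_fit, sigma, size=(n_toys, len(sigma)))
    # Each call to t_statistic involves numerical fits, which is what
    # makes toy simulations so expensive.
    t_toys = np.array([t_statistic(x) for x in toys])
    return np.mean(t_toys > t_obs)
```

To resolve a p-value of order $10^{-7}$ (a $5\sigma$ test) with this plain approach one would need on the order of $10^{7}$ toys just to see a few events in the tail, which is why importance sampling pays off.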
As mentioned above, the most common type of model we deal with in global fits are non-linear regression models where the means $\bar x(\omega)$ represent theoretical predictions for the observables and the information about experimental uncertainties is encoded in the covariance matrix $\Sigma$. If the value of an observable is given as $\bar x \pm 0.5$ in a scientific text this usually implies that the observable has a Gaussian distribution centered on $\bar x$ with a standard deviation of 0.5. Measurements from different experiments are usually uncorrelated. Thus the covariance matrix for $N$ independent observables is
$$\Sigma = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_N^2)\;,$$
where the $\sigma_i$ are the errors of the individual measurements. If several quantities are measured in the same experiment they can be correlated. Information about the correlations is usually presented in the form of a correlation matrix $\rho$. The correlation matrix is always symmetric and its diagonal entries are 1. (Thus it is common to only show the lower or upper triangle of $\rho$ in a publication.) The elements $\rho_{ij}$ of $\rho$ are called correlation coefficients. For $N$ observables with errors $\sigma_i$ and correlation coefficients $\rho_{ij}$ the covariance matrix $\Sigma$ is given by
$$\Sigma_{ij} = \sigma_i\, \rho_{ij}\, \sigma_j \qquad \text{(no sum over $i$ and $j$)}\;.$$
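In NumPy, for example, this construction is a one-liner; the errors and correlations below are invented:

```python
import numpy as np

sigma = np.array([0.5, 0.3, 0.8])        # invented errors

rho = np.array([[ 1.0,  0.2,  0.0],      # invented correlation matrix
                [ 0.2,  1.0, -0.4],
                [ 0.0, -0.4,  1.0]])

# Sigma_ij = sigma_i * rho_ij * sigma_j (element-wise, no summation).
cov = np.outer(sigma, sigma) * rho
```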
In some cases the error of an observable is broken up into a statistical and a systematic component. Typical notations are $\bar x \pm \sigma_\text{stat} \pm \sigma_\text{syst}$ or simply $\bar x \pm \sigma_1 \pm \sigma_2$ with an indication in the text which component is the systematic one. Sometimes the systematic error is asymmetric and denoted as $\bar x \pm \sigma_\text{stat}\,{}^{+\sigma_+}_{-\sigma_-}$.
What are we supposed to do with this extra information? This depends largely on the context, i.e. on the nature of the systematic uncertainties being quoted. For example, the theoretical prediction of an observable $x_i$ (or its extraction from raw experimental data) might require knowledge of some other quantity $\lambda$ which has been measured elsewhere with some Gaussian uncertainty. In this case one also speaks of a parametric uncertainty. If you know the dependence of $x_i$ on $\lambda$ it is best to treat $\lambda$ as a parameter of your model, add an observable which represents the measurement of $\lambda$ (i.e. with a Gaussian distribution centered on $\lambda$ and an appropriate standard deviation), and use only the statistical error to model the distribution of $x_i$. In particular, this is the only (correct) way in which you can combine $x_i$ with other observables that depend on the same parameter $\lambda$. In this case the parameter $\lambda$ is called a nuisance parameter, since we are not interested in its value but need it to extract values for the parameters we are interested in.
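As a sketch of this recipe with invented numbers and an invented dependence of $x_i$ on $\lambda$: the nuisance parameter gets its own Gaussian term in the chi-square, and only the statistical error enters the term for $x_i$:

```python
import numpy as np
from scipy.optimize import minimize

x_obs_i, sigma_stat = 2.3, 0.4   # invented measurement of x_i (stat. error only)
lam_meas, lam_err = 4.18, 0.03   # invented external measurement of lambda

def chi2(params):
    omega, lam = params
    x_pred = omega * lam         # hypothetical dependence of x_i on omega and lambda
    return (((x_obs_i - x_pred) / sigma_stat) ** 2
            + ((lam_meas - lam) / lam_err) ** 2)

# lambda is profiled along with the parameter of interest omega.
fit = minimize(chi2, x0=[0.5, lam_meas])
```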
If you don’t know the dependence of $x_i$ on $\lambda$ but you are sure that the systematic uncertainty for $x_i$ given in a paper is (mainly) parametric due to $\lambda$, and that $\lambda$ does not affect any other observables in your fit, you can combine the statistical and systematic error in quadrature. This means you assume a Gaussian distribution for $x_i$ with standard deviation $\sigma = \sqrt{\sigma_\text{stat}^2 + \sigma_\text{syst}^2}$, where $\sigma_\text{stat}$ is the statistical and $\sigma_\text{syst}$ the systematic error (which should be symmetric in this case). This procedure is also correct if the systematic error is the combined parametric uncertainty due to several parameters, as long as none of these parameters affect any other observables in your fit.
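For example, a measurement quoted as $\bar x \pm 0.4\,(\text{stat}) \pm 0.3\,(\text{syst})$ would then be treated as a Gaussian observable with standard deviation $\sigma = \sqrt{0.4^2 + 0.3^2} = 0.5$.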
So far in our discussion we have assumed that the nuisance parameter(s) can be measured and have a Gaussian distribution. This is not always the case. If, for example, the theoretical prediction for an observable is an approximation, there will be a constant offset between the theory prediction and the measured value. This offset does not average out when the experiment is repeated many times, and it cannot be measured separately either. At best we can find some sort of upper bound on the size of this offset. The extraction of an observable from the raw data may also depend on parameters which can only be bounded but not measured. In this case the offset between the theory prediction and the mean of the measured quantity should be treated as an additional model parameter whose values are restricted to a finite range. Assume that the systematic error of $x_i$ is given as ${}^{+\sigma_+}_{-\sigma_-}$ with $\sigma_\pm > 0$. Let $\mu_i(\omega)$ be the theory prediction for $x_i$. The mean $\bar x_i$ of the observable $x_i$ then depends on an additional nuisance parameter $\delta$ which takes values in the interval $[-\sigma_+, +\sigma_-]$:
$$\bar x_i(\omega, \delta) = \mu_i(\omega) + \delta\;, \qquad \delta \in [-\sigma_+, +\sigma_-]\;.$$
Note that the formula above holds for systematic uncertainties associated with the measurement of $x_i$. If the theory prediction for an observable is quoted with a systematic error ${}^{+\sigma_+}_{-\sigma_-}$, the correct range for $\delta$ is $[-\sigma_-, +\sigma_+]$.
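A minimal sketch of such a bounded offset in a fit, again with invented numbers: the offset $\delta$ enters the prediction as in the formula above and is confined to its allowed interval via optimiser bounds:

```python
import numpy as np
from scipy.optimize import minimize

x_obs_i, sigma_stat = 2.3, 0.4   # invented measurement of x_i
sig_plus, sig_minus = 0.2, 0.1   # invented systematic error +0.2 / -0.1

def chi2(params):
    omega, delta = params
    mu = omega**2                # hypothetical theory prediction for x_i
    return ((x_obs_i - (mu + delta)) / sigma_stat) ** 2

# For a systematic quoted on the measurement, delta is restricted
# to the interval [-sigma_+, +sigma_-] as in the formula above.
fit = minimize(chi2, x0=[1.5, 0.0],
               bounds=[(None, None), (-sig_plus, sig_minus)])
```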