
E-values

In statistical hypothesis testing, e-values quantify the evidence in the data against a null hypothesis (e.g., "the coin is fair", or, in a medical context, "this new treatment has no effect"). They serve as a more robust alternative to p-values, addressing some shortcomings of the latter.

In contrast to p-values, e-values can deal with optional continuation: e-values of subsequent experiments (e.g. clinical trials concerning the same treatment) may simply be multiplied to provide a new, "product" e-value that represents the evidence in the joint experiment. This works even if, as often happens in practice, the decision to perform later experiments may depend in vague, unknown ways on the data observed in earlier experiments, and it is not known beforehand how many trials will be conducted: the product e-value remains a meaningful quantity, leading to tests with Type-I error control. For this reason, e-values and their sequential extension, the e-process, are the fundamental building blocks for anytime-valid statistical methods (e.g. confidence sequences). Another advantage over p-values is that any weighted average of e-values remains an e-value, even if the individual e-values are arbitrarily dependent. This is one of the reasons why e-values have also turned out to be useful tools in multiple testing.[1]

E-values can be interpreted in a number of different ways: first, the reciprocal of any e-value is itself a p-value, but a special, conservative one, quite different from p-values used in practice. Second, they are broad generalizations of likelihood ratios and are also related to, yet distinct from, Bayes factors. Third, they have an interpretation as bets. Finally, in a sequential context, they can also be interpreted as increments of nonnegative supermartingales. Interest in e-values has exploded since 2019, when the term 'e-value' was coined and a number of breakthrough results were achieved by several research groups. The first overview article appeared in 2023.[2]

Definition and mathematical background

Let the null hypothesis $H_0$ be given as a set of distributions for data $Y$. Usually $Y = (X_1, \ldots, X_\tau)$ with each $X_i$ a single outcome and $\tau$ a fixed sample size or some stopping time. We shall refer to such $Y$, which represent the full sequence of outcomes of a statistical experiment, as a sample or batch of outcomes. But in some cases $Y$ may also be an unordered bag of outcomes or a single outcome.

An e-variable or e-statistic is a nonnegative random variable $E = E(Y)$ such that under all $P \in H_0$, its expected value is bounded by 1:

$\mathbb{E}_P[E] \leq 1.$

The value taken by the e-variable $E$ is called the e-value. In practice, the term e-value (a number) is often used when one is really referring to the underlying e-variable (a random variable, that is, a measurable function of the data).
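The defining inequality is easy to check numerically. The following sketch is an assumed toy example (not taken from the sources above): under the simple null "the coin is fair", the statistic that pays 2 on heads and 0 on tails has expectation exactly 1 and is therefore an e-variable.

```python
import numpy as np

# Assumed toy example: under the null "the coin is fair", P(X = 1) = 1/2,
# the statistic E(X) = 2*X (pays 2 on heads, 0 on tails) has expectation
# exactly 1, so it satisfies the defining inequality E_P[E] <= 1.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)   # fair-coin draws under the null
e = 2.0 * x
print(e.mean())                        # close to 1
```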

Interpretations

As conservative p-values

For any e-variable $E$, any $0 < \alpha \leq 1$ and all $P \in H_0$, it holds that

$P\left(E \geq \tfrac{1}{\alpha}\right) = P\left(\tfrac{1}{E} \leq \alpha\right) \leq \alpha. \qquad (*)$

In words: $1/E$ is a p-value, and the e-value based test with significance level $\alpha$, which rejects the null if $1/E \leq \alpha$, has Type-I error bounded by $\alpha$. But, whereas with standard p-values the inequality (*) above is usually an equality (with continuous-valued data) or near-equality (with discrete data), this is not the case with e-variables. This makes e-value-based tests more conservative (less powerful) than those based on standard p-values, and this conservativeness is the price to pay for safety (i.e., retaining Type-I error guarantees) under optional continuation and averaging.
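A small simulation can illustrate the Type-I error bound. The sketch below is an assumed example; the Bernoulli(0.6) vs. Bernoulli(0.5) likelihood ratio it uses anticipates the next subsection. It computes the conservative p-value $\min(1, 1/E)$ and checks empirically that rejecting when $E \geq 1/\alpha$ happens with probability well below $\alpha$ under the null.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n_sims, n = 0.05, 20_000, 50

# Null: fair coin. E-variable: likelihood ratio of Bernoulli(0.6) vs Bernoulli(0.5).
x = rng.integers(0, 2, size=(n_sims, n))
log_e = (x * np.log(0.6 / 0.5) + (1 - x) * np.log(0.4 / 0.5)).sum(axis=1)
e = np.exp(log_e)

p = np.minimum(1.0, 1.0 / e)          # the conservative p-value 1/E, capped at 1
print((e >= 1 / alpha).mean())        # empirical Type-I error, well below alpha
print((p <= alpha).mean())            # same event, expressed via the p-value
```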

As generalizations of likelihood ratios

Let $H_0 = \{P_0\}$ be a simple null hypothesis. Let $Q$ be any other distribution on $Y$, and let

$E = \frac{q(Y)}{p_0(Y)}$

be their likelihood ratio. Then $E$ is an e-variable. Conversely, any e-variable relative to a simple null $H_0 = \{P_0\}$ can be written as a likelihood ratio with respect to some distribution $Q$. Thus, when the null is simple, e-variables coincide with likelihood ratios. E-variables exist for general composite nulls as well, and they may then be thought of as generalizations of likelihood ratios. The two main ways of constructing e-variables, UI and RIPr (see below), both lead to expressions that are variations of likelihood ratios.

Two other standard generalizations of the likelihood ratio are (a) the generalized likelihood ratio as used in the standard, classical likelihood ratio test and (b) the Bayes factor. Importantly, neither (a) nor (b) is an e-variable in general: generalized likelihood ratios in sense (a) are not e-variables unless the alternative is simple (see below under "universal inference"). Bayes factors are e-variables if the null is simple. To see this, note that, if $\mathcal{Q} = \{Q_\theta : \theta \in \Theta\}$ represents a statistical model and $w$ is a prior density on $\Theta$, then we can set $Q$ as above to be the Bayes marginal distribution with density

$q(Y) = \int q_\theta(Y)\, w(\theta)\, d\theta$

and then $E = q(Y)/p_0(Y)$ is also a Bayes factor of $H_0$ vs. $H_1 = \mathcal{Q}$. If the null is composite, then some special e-variables can be written as Bayes factors with some very special priors, but most Bayes factors one encounters in practice are not e-variables and many e-variables one encounters in practice are not Bayes factors.[2]

As bets

Suppose you can buy a ticket for 1 monetary unit, with nonnegative pay-off $E = E(Y)$. The statements "$E$ is an e-variable" and "if the null hypothesis is true, you do not expect to gain any money by engaging in this bet" are logically equivalent. This is because $E$ being an e-variable means that the gain of buying the ticket is the pay-off minus the cost, i.e. $E - 1$, which has expectation $\leq 0$ under the null. Based on this interpretation, the product e-value for a sequence of tests can be interpreted as the amount of money you have gained by sequentially betting with pay-offs given by the individual e-variables and always re-investing all your gains.[3]

The betting interpretation becomes particularly visible if we rewrite an e-variable as $E = 1 + \lambda U$, where $U$ has expectation $\leq 0$ under all $P \in H_0$ and $\lambda \in \mathbb{R}$ is chosen so that $E \geq 0$ a.s. Any e-variable can be written in the $1 + \lambda U$ form, although with parametric nulls, writing it as a likelihood ratio is usually mathematically more convenient. The $1 + \lambda U$ form, on the other hand, is often more convenient in nonparametric settings. As a prototypical example,[4] consider the case that $Y = (X_1, \ldots, X_n)$ with the $X_i$ taking values in the bounded interval $[0, 1]$. According to $H_0$, the $X_i$ are i.i.d. according to a distribution $P$ with mean $\mu$; no other assumptions about $P$ are made. Then we may first construct a family of e-variables for single outcomes, $E_i^{(\lambda)} = 1 + \lambda(X_i - \mu)$, for any $\lambda \in [-1/(1-\mu), 1/\mu]$ (these are the $\lambda$ for which $E_i^{(\lambda)}$ is guaranteed to be nonnegative). We may then define a new e-variable for the complete data vector $Y$ by taking the product

$E = \prod_{i=1}^n E_i^{(\breve\lambda \mid X^{i-1})},$

where $\breve\lambda \mid X^{i-1}$ is an estimate for $\lambda$, based only on past data $X^{i-1} = (X_1, \ldots, X_{i-1})$, and designed to make $E_i^{(\lambda)}$ as large as possible in the "e-power" or "GRO" sense (see below). Waudby-Smith and Ramdas use this approach to construct "nonparametric" confidence intervals for the mean that tend to be significantly narrower than those based on more classical methods such as the Chernoff, Hoeffding and Bernstein bounds.[4]
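A minimal sketch of this product construction for a bounded mean follows. The particular plug-in rule for $\breve\lambda$ (a clipped running mean) and the function name betting_e_value are simplifications chosen here for illustration; they are not the estimator of Waudby-Smith and Ramdas.

```python
import numpy as np

def betting_e_value(x, mu0):
    """Product e-value for the null 'the X_i have mean mu0', with X_i in [0, 1]."""
    e, mean_hat, count = 1.0, 0.5, 1.0
    for xi in x:
        # plug-in bet based only on past data, kept inside the allowed range
        lam = float(np.clip(mean_hat - mu0, -0.5 / (1.0 - mu0), 0.5 / mu0))
        e *= 1.0 + lam * (xi - mu0)                    # per-outcome e-variable E_i
        mean_hat = (mean_hat * count + xi) / (count + 1.0)
        count += 1.0
    return e

rng = np.random.default_rng(2)
print(betting_e_value(rng.uniform(size=200), mu0=0.5))     # null true (mean 1/2): stays modest
print(betting_e_value(rng.beta(4, 2, size=200), mu0=0.5))  # mean 2/3: typically grows large
```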

A fundamental property: optional continuation

E-values are more suitable than p-values when one expects follow-up tests involving the same null hypothesis with different data or experimental set-ups. This includes, for example, combining individual results in a meta-analysis. The advantage of e-values in this setting is that they allow for optional continuation. Indeed, they have been employed in what may be the world's first fully 'online' meta-analysis with explicit Type-I error control.[5]

Informally, optional continuation implies that the product of any number of e-values, $E^{(1)}, E^{(2)}, \ldots$, defined on independent samples $Y^{(1)}, Y^{(2)}, \ldots$, is itself an e-value, even if the definition of each e-value is allowed to depend on all previous outcomes, and no matter what rule is used to decide when to stop gathering new samples (e.g. to perform new trials). It follows that, for any significance level $0 < \alpha < 1$, if the null is true, then the probability that a product of e-values will ever become larger than $1/\alpha$ is bounded by $\alpha$. Thus if we decide to combine the samples observed so far and reject the null if the product e-value is larger than $1/\alpha$, then our Type-I error probability remains bounded by $\alpha$. We say that testing based on e-values remains safe (Type-I valid) under optional continuation.

Mathematically, this is shown by first showing that the product e-variables form a nonnegative discrete-time martingale in the filtration generated by $Y^{(1)}, Y^{(2)}, \ldots$ (the individual e-variables are then increments of this martingale). The results then follow as a consequence of Doob's optional stopping theorem and Ville's inequality.

We already implicitly used product e-variables in the example above, where we defined e-variables on individual outcomes $X_i$ and designed a new e-value by taking products. Thus, in the example the individual outcomes $X_i$ play the role of 'batches' (full samples) $Y^{(j)}$ above, and we can therefore even engage in optional stopping "within" the original batch $Y$: we may stop the data analysis at any individual outcome (not just "batch of outcomes") we like, for whatever reason, and reject if the product so far exceeds $1/\alpha$. Not all e-variables defined for batches of outcomes $Y$ can be decomposed as a product of per-outcome e-values in this way though. If this is not possible, we cannot use them for optional stopping (within a sample $Y$) but only for optional continuation (from one sample $Y^{(j)}$ to the next $Y^{(j+1)}$, and so on).
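The following sketch illustrates optional continuation under an assumed Gaussian trial setting (all names and parameters are illustrative): each trial contributes a likelihood-ratio e-value, the e-values are multiplied, and data collection stops as soon as the running product exceeds $1/\alpha$.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.05

def trial_e_value(effect, n=30):
    # likelihood ratio of N(effect, 1) vs N(0, 1) on one trial's data
    x = rng.normal(loc=effect, scale=1.0, size=n)
    return float(np.exp(np.sum(effect * x - effect ** 2 / 2)))

product = 1.0
for trial in range(1, 11):                 # up to 10 trials, continue only while needed
    product *= trial_e_value(effect=0.3)   # here the data carry a real effect
    if product >= 1 / alpha:
        print(f"reject after trial {trial}, product e-value = {product:.1f}")
        break
else:
    print(f"no rejection after 10 trials, product e-value = {product:.1f}")
```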

Construction and optimality

If we set $E := 1$ independently of the data, we get a trivial e-value: it is an e-variable by definition, but it will never allow us to reject the null hypothesis. This example shows that some e-variables may be better than others, in a sense to be defined below. Intuitively, a good e-variable is one that tends to be large (much larger than 1) if the alternative is true. This is analogous to the situation with p-values: both e-values and p-values can be defined without referring to an alternative, but if an alternative is available, we would like them to be small (p-values) or large (e-values) with high probability. In standard hypothesis tests, the quality of a valid test is formalized by the notion of statistical power, but this notion has to be suitably modified in the context of e-values.[2][6]

The standard notion of the quality of an e-variable relative to a given alternative $H_1$, used by most authors in the field, is a generalization of the Kelly criterion in economics and (since it does exhibit close relations to classical power) is sometimes called e-power;[7] the optimal e-variable in this sense is known as log-optimal or growth-rate optimal (often abbreviated to GRO[6]). In the case of a simple alternative $H_1 = \{Q\}$, the e-power of a given e-variable $E$ is simply defined as the expectation $\mathbb{E}_Q[\log E]$; in the case of composite alternatives, there are various versions (e.g. worst-case absolute, worst-case relative)[6] of e-power and GRO.

Simple alternative, simple null: likelihood ratio

Let $H_0 = \{P_0\}$ and $H_1 = \{Q\}$ both be simple. Then the likelihood ratio e-variable $E = q(Y)/p_0(Y)$ has maximal e-power in the sense above, i.e. it is GRO.[2]

Simple alternative, composite null: reverse information projection (RIPr)

Let $H_1 = \{Q\}$ be simple and $H_0 = \{P_\theta : \theta \in \Theta_0\}$ be composite, such that all elements of $H_0 \cup H_1$ have densities (denoted by lower-case letters) relative to the same underlying measure. Grünwald et al. show that under weak regularity conditions, the GRO e-variable exists, is essentially unique, and is given by

$E = \frac{q(Y)}{p_{\curvearrowleft Q}(Y)}$

where $p_{\curvearrowleft Q}$ is the Reverse Information Projection (RIPr) of $Q$ onto the convex hull of $H_0$.[6] Under further regularity conditions (and in all practically relevant cases encountered so far), $p_{\curvearrowleft Q}$ is given by a Bayes marginal density: there exists a specific, unique distribution $W$ on $\Theta_0$ such that $p_{\curvearrowleft Q}(Y) = \int_{\Theta_0} p_\theta(Y)\, dW(\theta)$.

Simple alternative, composite null: universal inference (UI)

In the same setting as above, Wasserman et al.[8] show that, under no regularity conditions at all,

$E = \frac{q(Y)}{\sup_{P \in H_0} p(Y)} \;\left( = \frac{q(Y)}{p_{\hat\theta \mid Y}(Y)} \right)$

is an e-variable (with the second equality holding if the MLE (maximum likelihood estimator) $\hat\theta \mid Y$ based on data $Y$ is always well-defined). This way of constructing e-variables has been called the universal inference (UI) method, "universal" referring to the fact that no regularity conditions are required.
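As an assumed toy instance of the UI construction (not drawn from the sources above), take the composite null $\{N(\mu,1): \mu \leq 0\}$ and the simple alternative $Q = N(1,1)$; the supremum in the denominator is then attained at the null-constrained MLE $\min(\bar{X}, 0)$.

```python
import numpy as np
from scipy.stats import norm

def ui_e_value(y):
    """UI e-variable for H0 = {N(mu,1): mu <= 0} against the simple alternative N(1,1)."""
    mu_hat_null = min(float(np.mean(y)), 0.0)                 # null-constrained MLE
    log_q = norm.logpdf(y, loc=1.0, scale=1.0).sum()          # alternative density
    log_p = norm.logpdf(y, loc=mu_hat_null, scale=1.0).sum()  # sup over the null
    return float(np.exp(log_q - log_p))

rng = np.random.default_rng(4)
print(ui_e_value(rng.normal(0.0, 1.0, size=40)))   # null true: typically small
print(ui_e_value(rng.normal(1.0, 1.0, size=40)))   # alternative true: typically large
```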

Composite alternative, simple null

Now let $H_0 = \{P\}$ be simple and $H_1 = \{Q_\theta : \theta \in \Theta_1\}$ be composite, such that all elements of $H_0 \cup H_1$ have densities relative to the same underlying measure. There are now two generic, closely related ways of obtaining e-variables that are close to growth-optimal (appropriately redefined[2] for composite $H_1$): Robbins' method of mixtures and the plug-in method, originally due to Wald[9] but, in essence, re-discovered by Philip Dawid as "prequential plug-in"[10] and Jorma Rissanen as "predictive MDL".[11] The method of mixtures essentially amounts to "being Bayesian about the numerator" (the reason it is not called "Bayesian method" is that, when both null and alternative are composite, the numerator may often not be a Bayes marginal): we posit any prior distribution $W$ on $\Theta_1$ and set

$\bar{q}_W(Y) = \int_{\Theta_1} q_\theta(Y)\, dW(\theta)$

and use the e-variable $E := \bar{q}_W(Y)/p(Y)$.
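A minimal sketch of the method of mixtures, under the assumption of a Bernoulli(1/2) null, a Bernoulli($\theta$) alternative and a uniform prior $W$ (so that the mixture has a closed form via the Beta function); the function name is illustrative.

```python
import numpy as np
from scipy.special import betaln

def mixture_e_value(x):
    """Mixture e-variable: uniform prior over Bernoulli(theta) vs the Bernoulli(1/2) null."""
    x = np.asarray(x)
    n, k = x.size, int(x.sum())
    log_mixture = betaln(k + 1, n - k + 1)   # log integral of theta^k (1-theta)^(n-k) dtheta
    log_null = n * np.log(0.5)
    return float(np.exp(log_mixture - log_null))

rng = np.random.default_rng(5)
print(mixture_e_value(rng.integers(0, 2, size=100)))     # fair coin: typically small
print(mixture_e_value(rng.binomial(1, 0.7, size=100)))   # biased coin: typically large
```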

To explicate the plug-in method, suppose that $Y = (X_1, \ldots, X_n)$, where $X_1, X_2, \ldots$ constitute a stochastic process, and let $\breve\theta \mid X^i$ be an estimator of $\theta \in \Theta_1$ based on data $X^i = (X_1, \ldots, X_i)$ for $i \geq 0$. In practice one usually takes a "smoothed" maximum likelihood estimator (such as, for example, the regression coefficients in ridge regression), initially set to some "default value" $\breve\theta \mid X^0 = \theta_0$. One now recursively constructs a density $\bar{q}_{\breve\theta}$ for $X^n$ by setting $\bar{q}_{\breve\theta}(X^n) = \prod_{i=1}^n q_{\breve\theta \mid X^{i-1}}(X_i \mid X^{i-1})$.
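A corresponding sketch of the prequential plug-in construction, again assuming a Bernoulli(1/2) null; the smoothed estimator used here is the Laplace rule of succession, one possible choice among many.

```python
import numpy as np

def plugin_e_value(x):
    """Prequential plug-in e-variable against the Bernoulli(1/2) null."""
    log_e, heads, seen = 0.0, 0, 0
    for xi in x:
        theta = (heads + 1) / (seen + 2)        # smoothed estimate from past data only
        q = theta if xi == 1 else 1.0 - theta   # plug-in alternative density for X_i
        log_e += np.log(q) - np.log(0.5)        # divide by the null density
        heads, seen = heads + xi, seen + 1
    return float(np.exp(log_e))

rng = np.random.default_rng(6)
print(plugin_e_value(rng.integers(0, 2, size=100)))     # fair coin: typically small
print(plugin_e_value(rng.binomial(1, 0.7, size=100)))   # biased coin: typically large
```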

Effectively, both the method of mixtures and the plug-in method can be thought of as learning a specific instantiation of the alternative that explains the data well.[2]

Composite null and alternative

In parametric settings, we can simply combine the main methods for the composite alternative (obtaining $\bar{q}_{\breve\theta}$ or $\bar{q}_W$) with the main methods for the composite null (UI or RIPr, using the single distribution $\bar{q}_{\breve\theta}$ or $\bar{q}_W$ as an alternative). Note in particular that when using the plug-in method together with the UI method, the resulting e-variable will look like

$\frac{\prod_{i=1}^n q_{\breve\theta \mid X^{i-1}}(X_i)}{q_{\hat\theta \mid X^n}(X^n)}$

which resembles, but is still fundamentally different from, the generalized likelihood ratio as used in the classical likelihood ratio test.
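The sketch below combines a plug-in numerator with a UI-style denominator in an assumed toy setting (one-sided Gaussian null $\{N(\mu,1): \mu \leq 0\}$ against the alternative $\{N(\mu,1): \mu > 0\}$); the estimator and function name are illustrative choices meant only to show the shape of the formula above.

```python
import numpy as np
from scipy.stats import norm

def plugin_ui_e_value(x):
    """Plug-in numerator (alternative mu > 0) over UI denominator (null mu <= 0)."""
    x = np.asarray(x, dtype=float)
    log_num = 0.0
    for i, xi in enumerate(x):
        mu_plugin = max(float(np.mean(x[:i])), 0.0) if i > 0 else 0.0  # from past data only
        log_num += norm.logpdf(xi, loc=mu_plugin, scale=1.0)
    mu_null = min(float(np.mean(x)), 0.0)                              # constrained MLE under H0
    log_den = norm.logpdf(x, loc=mu_null, scale=1.0).sum()
    return float(np.exp(log_num - log_den))

rng = np.random.default_rng(7)
print(plugin_ui_e_value(rng.normal(0.0, 1.0, size=50)))   # null true: typically small
print(plugin_ui_e_value(rng.normal(0.8, 1.0, size=50)))   # alternative true: typically large
```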

The advantage of the UI method compared to RIPr is that (a) it can be applied whenever the MLE can be efficiently computed - in many such cases, it is not known whether/how the reverse information projection can be calculated; and (b) it 'automatically' gives not just an e-variable but a full e-process (see below): if we replace $n$ in the formula above by a general stopping time $\tau$, the resulting ratio is still an e-variable; for the reverse information projection this automatic e-process generation only holds in special cases.

Its main disadvantage compared to RIPr is that it can be substantially sub-optimal in terms of the e-power/GRO criterion, which means that it leads to tests which also have less classical statistical power than RIPr-based methods. Thus, for settings in which the RIPr-method is computationally feasible and leads to e-processes, it is to be preferred. These include the z-test, t-test and corresponding linear regressions, k-sample tests with Bernoulli, Gaussian and Poisson distributions and the logrank test (an R package is available for a subset of these), as well as conditional independence testing under a model-X assumption.[12] However, in many other statistical testing problems, it is currently (2023) unknown whether fast implementations of the reverse information projection exist, and they may very well not exist (e.g. generalized linear models without the model-X assumption).

In nonparametric settings (such as testing a mean as in the example above, or nonparametric 2-sample testing), it is often more natural to consider e-variables of the $1 + \lambda U$ type. However, while these superficially look very different from likelihood ratios, they can often still be interpreted as such and sometimes can even be re-interpreted as implementing a version of the RIPr construction.[2]

Finally, in practice, one sometimes resorts to mathematically or computationally convenient combinations of RIPr, UI and other methods.[2] For example, RIPr is applied to get optimal e-variables for small blocks of outcomes and these are then multiplied to obtain e-variables for larger samples - these e-variables work well in practice but cannot be considered optimal anymore.

A third construction method: p-to-e (and e-to-p) calibration

There exist functions that convert p-values into e-values.[13][14][15] Such functions are called p-to-e calibrators. Formally, a calibrator is a nonnegative decreasing function $f : [0,1] \to [0,\infty)$ which, when applied to a p-variable (a random variable whose value is a p-value), yields an e-variable. A calibrator $f$ is said to dominate another calibrator $g$ if $f \geq g$, and this domination is strict if the inequality is strict. An admissible calibrator is one that is not strictly dominated by any other calibrator. One can show that for a function to be a calibrator, it must have an integral of at most 1 over the uniform probability measure.

One family of admissible calibrators is given by the set of functions $f_\kappa$, $0 < \kappa < 1$, with $f_\kappa(p) = \kappa p^{\kappa - 1}$. Another calibrator is given by integrating out $\kappa$:

$\int_0^1 \kappa p^{\kappa-1}\, d\kappa = \frac{1 - p + p\log p}{p\,(\log p)^2}.$

Conversely, an e-to-p calibrator transforms e-values back into p-variables. Interestingly, the following calibrator dominates all other e-to-p calibrators:

$f(t) = \min(1, 1/t).$

While of theoretical importance, calibration is not much used in the practical design of e-variables, since the resulting e-variables are often far from growth-optimal for any given $H_1$.[6]
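For completeness, the calibrators above are straightforward to implement; the sketch below simply codes the three formulas from this subsection (function names are illustrative).

```python
import numpy as np

def p_to_e_power(p, kappa=0.5):
    """Admissible calibrator f_kappa(p) = kappa * p**(kappa - 1), for 0 < kappa < 1."""
    return kappa * p ** (kappa - 1)

def p_to_e_integrated(p):
    """Calibrator obtained by integrating kappa out: (1 - p + p log p) / (p (log p)^2)."""
    return (1 - p + p * np.log(p)) / (p * np.log(p) ** 2)

def e_to_p(e):
    """The dominating e-to-p calibrator min(1, 1/e)."""
    return min(1.0, 1.0 / e)

print(p_to_e_power(0.01), p_to_e_integrated(0.01))   # a small p-value maps to e-values > 1
print(e_to_p(40.0))                                  # an e-value of 40 gives p = 0.025
```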

E-Processes

Definition

Now consider data $X_1, X_2, \ldots$ arriving sequentially, constituting a discrete-time stochastic process. Let $E_1, E_2, \ldots$ be another discrete-time process where, for each $n$, $E_n$ can be written as a (measurable) function of the first $n$ outcomes $X_1, \ldots, X_n$. We call $E_1, E_2, \ldots$ an e-process if, for any stopping time $\tau$, $E_\tau$ is an e-variable, i.e. $\mathbb{E}_P[E_\tau] \leq 1$ for all $P \in H_0$.

In basic cases, the stopping time can be defined by any rule that determines, at each sample size $n$, based only on the data observed so far, whether to stop collecting data or not. For example, this could be "stop when you have seen four consecutive outcomes larger than 1", "stop at $n = 100$", or the level-$\alpha$-aggressive rule, "stop as soon as you can reject at level $\alpha$, i.e. at the smallest $n$ such that $E_n \geq 1/\alpha$", and so on. With e-processes, we obtain an e-variable with any such rule. Crucially, the data analyst may not know the rule used for stopping. For example, her boss may tell her to stop data collecting and she may not know exactly why - nevertheless, she gets a valid e-variable and Type-I error control. This is in sharp contrast to data analysis based on p-values (which becomes invalid if stopping rules are not determined in advance) or classical Wald-style sequential analysis (which works with data of varying length but, again, with stopping times that need to be determined in advance). In more complex cases, the stopping time has to be defined relative to some slightly reduced filtration, but this is not a big restriction in practice. In particular, the level-$\alpha$-aggressive rule is always allowed. Because of this validity under optional stopping, e-processes are the fundamental building block of confidence sequences, also known as anytime-valid confidence intervals.[16][2]

Technically, e-processes are generalizations of test supermartingales, which are nonnegative supermartingales with starting value 1: any test supermartingale constitutes an e-process but not vice versa.
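The defining property $\mathbb{E}_P[E_\tau] \leq 1$ can be illustrated by simulation. The sketch below is an assumed example: it stops a product test martingale at a data-dependent time (the level-0.05-aggressive rule, capped at 200 outcomes) and checks that the stopped value still has expectation at most 1 under the null.

```python
import numpy as np

rng = np.random.default_rng(8)

def stopped_value(max_n=200, threshold=20.0):
    """Run a product test martingale under the fair-coin null and stop aggressively."""
    m = 1.0
    for _ in range(max_n):
        x = rng.integers(0, 2)
        m *= (0.6 / 0.5) if x == 1 else (0.4 / 0.5)   # Bernoulli(0.6) vs Bernoulli(0.5)
        if m >= threshold:                            # level-0.05-aggressive stopping
            break
    return m

stopped = np.array([stopped_value() for _ in range(20_000)])
print(stopped.mean())           # about 1, consistent with E_P[E_tau] <= 1
print((stopped >= 20).mean())   # chance of ever rejecting, below alpha = 0.05
```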

Construction

E-processes can be constructed in a number of ways. Often, one starts with an e-variable $E_i$ for $X_i$ whose definition is allowed to depend on previous data, i.e.,

for all $P \in H_0$: $\mathbb{E}_P[E_i \mid X_1, \ldots, X_{i-1}] \leq 1$

(again, in complex testing problems this definition needs to be modified a bit using reduced filtrations). Then the product process $M_1, M_2, \ldots$ with $M_n = E_1 \cdot E_2 \cdots E_n$ is a test supermartingale, and hence also an e-process (note that we already used this construction in the example described under "e-values as bets" above: for fixed $\lambda$, the e-values $E_i^{(\lambda)}$ did not depend on past data, but by choosing $\lambda = \breve\lambda \mid X^{i-1}$ based on the past, they became dependent on past data).

Another way to construct an e-process is to use the universal inference construction described above for sample sizes $n = 1, 2, \ldots$ The resulting sequence of e-values $E_1, E_2, \ldots$ will then always be an e-process.[2]

History

Historically, e-values implicitly appear as building blocks of nonnegative supermartingales in the pioneering work on anytime-valid confidence methods by well-known mathematician Herbert Robbins and some of his students.[16] The first time e-values (or something very much like them) were treated as a quantity of independent interest was by another well-known mathematician, Leonid Levin, in 1976, within the theory of algorithmic randomness. With the exception of contributions by pioneer V. Vovk in various papers with various collaborators (e.g.[14][13]), and an independent re-invention of the concept in an entirely different field,[17] the concept did not catch on at all until 2019, when, within just a few months, several pioneering papers by several research groups appeared on arXiv (the corresponding journal publications referenced below sometimes coming years later). In these papers, the concept was finally given a proper name ("S-Value"[6] and "E-Value";[15] later versions of the former paper[6] also adopted "E-Value"), its general properties were described,[15] two generic ways to construct such e-values were given,[8] and their intimate relation to betting was laid out.[3] Since then, interest by researchers around the world has been surging. In 2023 the first overview paper on "safe, anytime-valid methods", in which e-values play a central role, appeared.[2]

References

  1. ^ Wang, Ruodu; Ramdas, Aaditya (2022-07-01). "False Discovery Rate Control with E-values". Journal of the Royal Statistical Society Series B: Statistical Methodology. 84 (3): 822–852. arXiv:2009.02824. doi:10.1111/rssb.12489. ISSN 1369-7412.
  2. ^ a b c d e f g h i j k Ramdas, Aaditya; Grünwald, Peter; Vovk, Vladimir; Shafer, Glenn (2023-11-01). "Game-Theoretic Statistics and Safe Anytime-Valid Inference". Statistical Science. 38 (4). arXiv:2210.01948. doi:10.1214/23-sts894. ISSN 0883-4237.
  3. ^ a b Shafer, Glenn (2021-04-01). "Testing by Betting: A Strategy for Statistical and Scientific Communication". Journal of the Royal Statistical Society Series A: Statistics in Society. 184 (2): 407–431. doi:10.1111/rssa.12647. ISSN 0964-1998.
  4. ^ a b Waudby-Smith, Ian; Ramdas, Aaditya (2023-02-16). "Estimating means of bounded random variables by betting". Journal of the Royal Statistical Society Series B: Statistical Methodology. arXiv:2010.09686. doi:10.1093/jrsssb/qkad009. ISSN 1369-7412.
  5. ^ Ter Schure, J.A. (Judith); Ly, Alexander; Belin, Lisa; Benn, Christine S.; Bonten, Marc J.M.; Cirillo, Jeffrey D.; Damen, Johanna A.A.; Fronteira, Inês; Hendriks, Kelly D. (2022-12-19). Bacillus Calmette-Guérin vaccine to reduce COVID-19 infections and hospitalisations in healthcare workers – a living systematic review and prospective ALL-IN meta-analysis of individual participant data from randomised controlled trials (Report). Infectious Diseases (except HIV/AIDS). doi:10.1101/2022.12.15.22283474.
  6. ^ a b c d e f g Grünwald, Peter; De Heide, Rianne; Koolen, Wouter (2024). "Safe Testing". Journal of the Royal Statistical Society, Series B.
  7. ^ Wang, Qiuqi; Wang, Ruodu; Ziegel, Johanna (2022). "E-backtesting". SSRN Electronic Journal. doi:10.2139/ssrn.4206997. ISSN 1556-5068.
  8. ^ a b Wasserman, Larry; Ramdas, Aaditya; Balakrishnan, Sivaraman (2020-07-06). "Universal inference". Proceedings of the National Academy of Sciences. 117 (29): 16880–16890. arXiv:1912.11436. doi:10.1073/pnas.1922664117. ISSN 0027-8424.
  9. ^ Wald, Abraham (1947). Sequential analysis (Section 10.10). J. Wiley & sons, Incorporated.
  10. ^ Dawid, A. P. (2004-07-15). "Prequential Analysis". Encyclopedia of Statistical Sciences. doi:10.1002/0471667196.ess0335. ISBN 978-0-471-15044-2.
  11. ^ Rissanen, J. (July 1984). "Universal coding, information, prediction, and estimation". IEEE Transactions on Information Theory. 30 (4): 629–636. doi:10.1109/tit.1984.1056936. ISSN 0018-9448.
  12. ^ Candès, Emmanuel; Fan, Yingying; Janson, Lucas; Lv, Jinchi (2018-01-08). "Panning for Gold: 'Model-X' Knockoffs for High Dimensional Controlled Variable Selection". Journal of the Royal Statistical Society Series B: Statistical Methodology. 80 (3): 551–577. arXiv:1610.02351. doi:10.1111/rssb.12265. ISSN 1369-7412.
  13. ^ a b Shafer, Glenn; Shen, Alexander; Vereshchagin, Nikolai; Vovk, Vladimir (2011-02-01). "Test Martingales, Bayes Factors and p-Values". Statistical Science. 26 (1). arXiv:0912.4269. doi:10.1214/10-sts347. ISSN 0883-4237.
  14. ^ a b Vovk, V. G. (January 1993). "A Logic of Probability, with Application to the Foundations of Statistics". Journal of the Royal Statistical Society, Series B (Methodological). 55 (2): 317–341. doi:10.1111/j.2517-6161.1993.tb01904.x. ISSN 0035-9246.
  15. ^ a b c Vovk, Vladimir; Wang, Ruodu (2021-06-01). "E-values: Calibration, combination and applications". The Annals of Statistics. 49 (3). arXiv:1912.06116. doi:10.1214/20-aos2020. ISSN 0090-5364.
  16. ^ a b Darling, D. A.; Robbins, Herbert (July 1967). "Confidence Sequences for Mean, Variance, and Median". Proceedings of the National Academy of Sciences. 58 (1): 66–68. doi:10.1073/pnas.58.1.66. ISSN 0027-8424. PMC 335597. PMID 16578652.
  17. ^ Zhang, Yanbao; Glancy, Scott; Knill, Emanuel (2011-12-22). "Asymptotically optimal data analysis for rejecting local realism". Physical Review A. 84 (6): 062118. arXiv:1108.2468. doi:10.1103/physreva.84.062118. ISSN 1050-2947.
