
Bias–variance tradeoff

In statistics and machine learning, the bias–variance tradeoff describes the relationship between a model's complexity, the accuracy of its predictions, and how well it can make predictions on previously unseen data that were not used to train the model. In general, as we increase the number of tunable parameters in a model, it becomes more flexible, and can better fit a training data set. It is said to have lower error, or bias. However, for more flexible models, there will tend to be greater variance to the model fit each time we take a set of samples to create a new training data set. It is said that there is greater variance in the model's estimated parameters.

[Figure: a function (red) approximated using radial basis functions (blue); panels show the function and noisy data, then fits for spread = 5, spread = 1, and spread = 0.1.] Several trials are shown in each graph. For each trial, a few noisy data points are provided as a training set (top). For a wide spread (image 2) the bias is high: the RBFs cannot fully approximate the function (especially the central dip), but the variance between different trials is low. As spread decreases (images 3 and 4) the bias decreases: the blue curves more closely approximate the red. However, depending on the noise in different trials, the variance between trials increases. In the lowermost image the approximated values for x = 0 vary wildly depending on where the data points were located.

[Figure: bias and variance as functions of model complexity.]
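The setup in the figure is easy to reproduce numerically. Below is a minimal sketch, not the code behind the figure: it assumes Gaussian radial basis functions centered on the training points, fit by least squares with a tiny ridge term for numerical stability, and an arbitrary stand-in target function with a central dip:

```python
import numpy as np

def target(x):
    # A smooth function with a central dip, standing in for the red curve.
    return np.sin(x) - 2.0 * np.exp(-16.0 * x**2)

def fit_rbf(x_train, y_train, spread, ridge=1e-8):
    """Least-squares fit of Gaussian RBFs centered at the training points."""
    # Design matrix: one Gaussian bump per training point.
    Phi = np.exp(-((x_train[:, None] - x_train[None, :]) / spread) ** 2)
    # The tiny ridge term keeps the solve well-posed when bumps overlap heavily.
    w = np.linalg.solve(Phi + ridge * np.eye(len(x_train)), y_train)
    return lambda x: np.exp(-((x[:, None] - x_train[None, :]) / spread) ** 2) @ w

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 200)
for spread in (5.0, 1.0, 0.1):
    preds = []
    for trial in range(50):                    # several trials, as in the figure
        x_train = rng.uniform(-3, 3, size=15)  # a few noisy data points
        y_train = target(x_train) + rng.normal(0, 0.3, size=15)
        preds.append(fit_rbf(x_train, y_train, spread)(xs))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - target(xs)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"spread={spread}: bias^2~{bias2:.3f}, variance~{var:.3f}")
```

Averaging over trials, wide spreads give a large squared bias but small trial-to-trial variance, while narrow spreads make the variance term grow, mirroring the panels above.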

The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:[1][2]

  • The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
  • The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the irreducible error, resulting from noise in the problem itself.

Motivation

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.

It is a frequently made fallacy[3][4] to assume that complex models must have high variance. High-variance models are 'complex' in some sense, but the reverse need not be true. In addition, one has to be careful how to define complexity: in particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by an example adapted from:[5] the model $f_{a,b}(x) = a \sin(bx)$ has only two parameters ($a$, $b$) but it can interpolate any number of points by oscillating with a high enough frequency, resulting in both a high bias and high variance.

An analogy can be made to the relationship between accuracy and precision. Accuracy is a description of bias and can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, test data may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight line fit to data exhibiting quadratic behavior overall. Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the reason is due to inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error is different depending on the balance between bias and variance. To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.
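A minimal sketch of this graphical example, assuming quadratic ground truth, Gaussian noise, and numpy's polynomial fitting; the degrees and sample sizes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(-1, 1, 12)
x_test = np.linspace(-1, 1, 200)
f = lambda x: 2.0 * x**2          # data exhibiting quadratic behavior overall
y_train = f(x_train) + rng.normal(0, 0.2, size=x_train.shape)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The straight line (degree 1) underfits, the degree-9 polynomial fits the training noise, and the quadratic typically does best on the test grid.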

Bias–variance decomposition of mean squared error

Suppose that we have a training set consisting of a set of points $x_1, \dots, x_n$ and real values $y_i$ associated with each point $x_i$. We assume that there is a function $f(x)$ such that $y = f(x) + \varepsilon$, where the noise, $\varepsilon$, has zero mean and variance $\sigma^2$.

We want to find a function $\hat{f}(x; D)$ that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm based on a training dataset (sample) $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$. We make "as well as possible" precise by measuring the mean squared error between $y$ and $\hat{f}(x; D)$: we want $\big( y - \hat{f}(x; D) \big)^2$ to be minimal, both for $x_1, \dots, x_n$ and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_i$ contain noise $\varepsilon$; this means we must be prepared to accept an irreducible error in any function we come up with.

Finding an $\hat{f}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function $\hat{f}$ we select, we can decompose its expected error on an unseen sample $x$ (i.e. conditional to x) as follows:[6]: 34 [7]: 223

$$\operatorname{E}_{D, \varepsilon} \Big[ \big( y - \hat{f}(x; D) \big)^2 \Big] = \Big( \operatorname{Bias}_D \big[ \hat{f}(x; D) \big] \Big)^2 + \operatorname{Var}_D \big[ \hat{f}(x; D) \big] + \sigma^2$$

where

$$\operatorname{Bias}_D \big[ \hat{f}(x; D) \big] = \operatorname{E}_D \big[ \hat{f}(x; D) \big] - f(x) = \operatorname{E}_D \big[ \hat{f}(x; D) \big] - \operatorname{E}_{y \mid x} \big[ y(x) \big]$$

$$\operatorname{Var}_D \big[ \hat{f}(x; D) \big] = \operatorname{E}_D \Big[ \big( \operatorname{E}_D [ \hat{f}(x; D) ] - \hat{f}(x; D) \big)^2 \Big]$$

and

$$\sigma^2 = \operatorname{E}_y \Big[ \big( y - \underbrace{f(x)}_{\operatorname{E}_{y \mid x}[y]} \big)^2 \Big]$$

The expectation ranges over different choices of the training set $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, all sampled from the same joint distribution $P(x, y)$, which can for example be done via bootstrapping. The three terms represent:

  • the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method. E.g., when approximating a non-linear function $f(x)$ using a learning method for linear models, there will be error in the estimates $\hat{f}(x)$ due to this assumption;
  • the variance of the learning method, or, intuitively, how much the learning method $\hat{f}(x)$ will move around its mean;
  • the irreducible error $\sigma^2$.

Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.[6]: 34 

The more complex the model $\hat{f}(x)$ is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger.
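Because the expectation is over training sets $D$, each term can be estimated by simple resampling. A minimal sketch, assuming a sinusoidal ground truth and polynomial learners of increasing complexity (all choices illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * x)        # true function f
sigma = 0.3                        # noise standard deviation
x0 = np.array([0.5])               # fixed query point x

for degree in (1, 3, 9):
    preds = []
    for _ in range(2000):          # resample the training set D many times
        x_train = rng.uniform(-1, 1, size=20)
        y_train = f(x_train) + rng.normal(0, sigma, size=20)
        preds.append(np.polyval(np.polyfit(x_train, y_train, degree), x0)[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)[0]) ** 2     # squared bias at x0
    var = preds.var()                          # variance of the learner at x0
    print(f"degree {degree}: bias^2={bias2:.4f}, var={var:.4f}, "
          f"total={bias2 + var + sigma**2:.4f}")  # adds the irreducible sigma^2
```

As the degree grows, the bias term shrinks while the variance term grows, and the irreducible $\sigma^2$ stays fixed.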

Derivation

The derivation of the bias–variance decomposition for squared error proceeds as follows.[8][9] For notational convenience, we abbreviate $f = f(x)$ and $\hat{f} = \hat{f}(x; D)$, and we drop the $D$ subscript on our expectation operators.

Let us write the mean squared error of our model:

$$\text{MSE} \triangleq \operatorname{E} \big[ (y - \hat{f})^2 \big] = \operatorname{E} \big[ y^2 - 2 y \hat{f} + \hat{f}^2 \big] = \operatorname{E} \big[ y^2 \big] - 2 \operatorname{E} \big[ y \hat{f} \big] + \operatorname{E} \big[ \hat{f}^2 \big]$$

First,

$$\operatorname{E} \big[ \hat{f}^2 \big] = \operatorname{Var}( \hat{f} ) + \operatorname{E} [ \hat{f} ]^2 \qquad \text{since } \operatorname{Var}[X] \triangleq \operatorname{E} \big[ ( X - \operatorname{E}[X] )^2 \big] = \operatorname{E}[X^2] - \operatorname{E}[X]^2 \text{ for any random variable } X$$

Secondly, since we model $y = f + \varepsilon$, we show that

$$\begin{aligned} \operatorname{E}[y^2] &= \operatorname{E} \big[ (f + \varepsilon)^2 \big] \\ &= \operatorname{E}[f^2] + 2 \operatorname{E}[f \varepsilon] + \operatorname{E}[\varepsilon^2] && \text{by linearity of } \operatorname{E} \\ &= f^2 + 2 f \operatorname{E}[\varepsilon] + \operatorname{E}[\varepsilon^2] && \text{since } f \text{ does not depend on the data} \\ &= f^2 + 2 f \cdot 0 + \sigma^2 && \text{since } \varepsilon \text{ has zero mean and variance } \sigma^2 \end{aligned}$$

Lastly,

$$\begin{aligned} \operatorname{E}[y \hat{f}] &= \operatorname{E} \big[ (f + \varepsilon) \hat{f} \big] \\ &= \operatorname{E}[f \hat{f}] + \operatorname{E}[\varepsilon \hat{f}] && \text{by linearity of } \operatorname{E} \\ &= \operatorname{E}[f \hat{f}] + \operatorname{E}[\varepsilon] \operatorname{E}[\hat{f}] && \text{since } \hat{f} \text{ and } \varepsilon \text{ are independent} \\ &= f \operatorname{E}[\hat{f}] && \text{since } \operatorname{E}[\varepsilon] = 0 \end{aligned}$$

Eventually, we plug these three formulas into our previous derivation of $\text{MSE}$ and thus show that:

$$\begin{aligned} \text{MSE} &= f^2 + \sigma^2 - 2 f \operatorname{E}[\hat{f}] + \operatorname{Var}[\hat{f}] + \operatorname{E}[\hat{f}]^2 \\ &= \big( f - \operatorname{E}[\hat{f}] \big)^2 + \sigma^2 + \operatorname{Var}[\hat{f}] \\ &= \operatorname{Bias}[\hat{f}]^2 + \sigma^2 + \operatorname{Var}[\hat{f}] \end{aligned}$$

Finally, the MSE loss function (or negative log-likelihood) is obtained by taking the expectation value over $x \sim P$:

$$\text{MSE} = \operatorname{E}_x \Big[ \operatorname{Bias}_D \big[ \hat{f}(x; D) \big]^2 + \operatorname{Var}_D \big[ \hat{f}(x; D) \big] \Big] + \sigma^2$$

Approaches

Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance; for example,

  • linear and generalized linear models can be regularized to decrease their variance at the cost of increasing their bias;[10]
  • in artificial neural networks, the variance increases and the bias decreases as the number of hidden units increases,[11] although this classical assumption has been the subject of recent debate;[4] as in GLMs, regularization is typically applied;
  • in k-nearest neighbor models, a high value of k leads to high bias and low variance (see below);
  • in instance-based learning, regularization can be achieved by varying the mixture of prototypes and exemplars;[12]
  • in decision trees, the depth of the tree determines the variance; decision trees are commonly pruned to control variance.[6]: 307

One way of resolving the trade-off is to use mixture models and ensemble learning.[13][14] For example, boosting combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.
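The variance reduction from bagging can be measured directly: fit the same learner on many freshly drawn training sets and see how much its predictions fluctuate. A minimal sketch, assuming scikit-learn is available; the tree settings and ensemble size are illustrative, not canonical:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x)
x_grid = np.linspace(-1, 1, 100).reshape(-1, 1)

def prediction_variance(make_model, n_sets=50):
    """Variance of fitted predictions across independently drawn training sets."""
    preds = []
    for _ in range(n_sets):
        X = rng.uniform(-1, 1, size=(40, 1))
        y = f(X).ravel() + rng.normal(0, 0.3, size=40)
        preds.append(make_model().fit(X, y).predict(x_grid))
    return np.array(preds).var(axis=0).mean()

print("single tree :", prediction_variance(lambda: DecisionTreeRegressor()))
print("bagged trees:", prediction_variance(
    lambda: BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)))
```

The bagged ensemble's predictions typically vary far less across training sets than a single unpruned tree's, at little cost in bias.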

Model validation methods such as cross-validation (statistics) can be used to tune models so as to optimize the trade-off.
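A minimal numpy-only sketch of k-fold cross-validation used to choose a complexity parameter, here the degree of a polynomial fit (the fold count and degree grid are arbitrary illustrative choices):

```python
import numpy as np

def kfold_mse(x, y, degree, k=5):
    """Average held-out MSE of a polynomial of the given degree over k folds."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=60)
y = np.sin(2 * x) + rng.normal(0, 0.2, size=60)

scores = {d: kfold_mse(x, y, d) for d in range(1, 10)}
print("CV-selected degree:", min(scores, key=scores.get))
```

Degrees that underfit and degrees that overfit both score poorly on the held-out folds, so the minimizer lands near the complexity that balances the two error sources.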

k-nearest neighbors

In the case of k-nearest neighbors regression, when the expectation is taken over the possible labeling of a fixed training set, a closed-form expression exists that relates the bias–variance decomposition to the parameter k:[7]: 37, 223

$$\operatorname{E} \big[ (y - \hat{f}(x))^2 \mid X = x \big] = \left( f(x) - \frac{1}{k} \sum_{i=1}^k f(N_i(x)) \right)^2 + \frac{\sigma^2}{k} + \sigma^2$$

where $N_1(x), \dots, N_k(x)$ are the k nearest neighbors of x in the training set. The bias (first term) is a monotone rising function of k, while the variance (second term) drops off as k is increased. In fact, under "reasonable assumptions" the bias of the first-nearest neighbor (1-NN) estimator vanishes entirely as the size of the training set approaches infinity.[11]
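Both terms of this closed form can be evaluated exactly on synthetic data, since $f$ and $\sigma$ are then known. A minimal sketch with an arbitrary smooth $f$ and a fixed one-dimensional training design:

```python
import numpy as np

f = lambda x: np.sin(3 * x)          # known true function
sigma = 0.3                          # known noise standard deviation
x_train = np.linspace(-1, 1, 50)     # fixed training inputs
x0 = 0.2                             # query point x

for k in (1, 5, 15, 40):
    nn = np.argsort(np.abs(x_train - x0))[:k]       # N_1(x), ..., N_k(x)
    bias2 = (f(x0) - f(x_train[nn]).mean()) ** 2    # first term: squared bias
    var = sigma**2 / k                              # second term: variance
    print(f"k={k}: bias^2={bias2:.5f}, var={var:.5f}, "
          f"expected error={bias2 + var + sigma**2:.5f}")
```

As k grows, ever more distant neighbors enter the average, so the bias term rises while the variance term decays as $\sigma^2 / k$.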

Applications

In regression

The bias–variance decomposition forms the conceptual basis for regression regularization methods such as Lasso and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides non-biased regression estimates, the lower variance solutions produced by regularization techniques provide superior MSE performance.
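This can be verified with the closed-form estimators, since ridge regression only adds a penalty term to the OLS normal equations. A minimal sketch on deliberately collinear predictors, where the variance of OLS is largest; the penalty value is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(5)
beta = np.array([1.0, 1.0])    # true coefficients

def mean_sq_estimation_error(lam, n_sets=2000):
    """Average squared error of the coefficient estimates over resampled data."""
    errs = []
    for _ in range(n_sets):
        z = rng.normal(size=30)
        # Two nearly collinear predictors make the OLS solution unstable.
        X = np.column_stack([z, z + rng.normal(0, 0.05, size=30)])
        y = X @ beta + rng.normal(0, 1.0, size=30)
        b = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)  # ridge (OLS if lam=0)
        errs.append(np.sum((b - beta) ** 2))
    return np.mean(errs)

print("OLS  :", mean_sq_estimation_error(lam=0.0))  # unbiased, high variance
print("ridge:", mean_sq_estimation_error(lam=1.0))  # biased, much lower variance
```

Here the small bias that the penalty introduces is more than repaid by the collapse in variance, which is exactly the MSE argument made above.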

In classification

The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition.[15][16] Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected squared error of the predicted probabilities with respect to the true probabilities can be decomposed as before.[17]

It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimized by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimize variance.[18]

In reinforcement learning

Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm (independently of the quantity of data) while the overfitting term comes from the fact that the amount of data is limited.[19]

In human learning

While widely discussed in the context of machine learning, the bias–variance dilemma has been examined in the context of human cognition, most notably by Gerd Gigerenzer and co-workers in the context of learned heuristics. They have argued (see references below) that the human brain resolves the dilemma in the case of the typically sparse, poorly-characterised training sets provided by experience by adopting high-bias/low-variance heuristics. This reflects the fact that a zero-bias approach has poor generalisability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.[20]

Geman et al.[11] argue that the bias–variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of "hard wiring" that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.

See also

  • Accuracy and precision
  • Bias of an estimator
  • Double descent
  • Gauss–Markov theorem
  • Hyperparameter optimization
  • Law of total variance
  • Minimum-variance unbiased estimator
  • Model selection
  • Regression model validation
  • Supervised learning
  • Cramér–Rao inequality
  • Prediction interval

References

  1. ^ Kohavi, Ron; Wolpert, David H. (1996). "Bias Plus Variance Decomposition for Zero-One Loss Functions". ICML. 96.
  2. ^ Luxburg, Ulrike V.; Schölkopf, B. (2011). "Statistical learning theory: Models, concepts, and results". Handbook of the History of Logic. 10: Section 2.4.
  3. ^ Neal, Brady (2019). "On the Bias-Variance Tradeoff: Textbooks Need an Update". arXiv:1912.08286 [cs.LG].
  4. ^ a b Neal, Brady; Mittal, Sarthak; Baratin, Aristide; Tantia, Vinayak; Scicluna, Matthew; Lacoste-Julien, Simon; Mitliagkas, Ioannis (2018). "A Modern Take on the Bias-Variance Tradeoff in Neural Networks". arXiv:1810.08591 [cs.LG].
  5. ^ Vapnik, Vladimir (2000). The nature of statistical learning theory. New York: Springer-Verlag. doi:10.1007/978-1-4757-3264-1. ISBN 978-1-4757-3264-1. S2CID 7138354.
  6. ^ a b c James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2013). An Introduction to Statistical Learning. Springer.
  7. ^ a b Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2009). The Elements of Statistical Learning. Archived from the original on 2015-01-26. Retrieved 2014-08-20.
  8. ^ Vijayakumar, Sethu (2007). "The Bias–Variance Tradeoff" (PDF). University of Edinburgh. Retrieved 19 August 2014.
  9. ^ Shakhnarovich, Greg (2011). "Notes on derivation of bias-variance decomposition in linear regression" (PDF). Archived from the original (PDF) on 21 August 2014. Retrieved 20 August 2014.
  10. ^ Belsley, David (1991). Conditioning diagnostics : collinearity and weak data in regression. New York (NY): Wiley. ISBN 978-0471528890.
  11. ^ a b c Geman, Stuart; Bienenstock, Élie; Doursat, René (1992). "Neural networks and the bias/variance dilemma" (PDF). Neural Computation. 4: 1–58. doi:10.1162/neco.1992.4.1.1. S2CID 14215320.
  12. ^ Gagliardi, Francesco (May 2011). "Instance-based classifiers applied to medical databases: diagnosis and knowledge extraction". Artificial Intelligence in Medicine. 52 (3): 123–139. doi:10.1016/j.artmed.2011.04.002. PMID 21621400.
  13. ^ Ting, Jo-Anne; Vijaykumar, Sethu; Schaal, Stefan (2011). "Locally Weighted Regression for Control". In Sammut, Claude; Webb, Geoffrey I. (eds.). Encyclopedia of Machine Learning (PDF). Springer. p. 615. Bibcode:2010eoml.book.....S.
  14. ^ Fortmann-Roe, Scott (2012). "Understanding the Bias–Variance Tradeoff".
  15. ^ Domingos, Pedro (2000). A unified bias-variance decomposition (PDF). ICML.
  16. ^ Valentini, Giorgio; Dietterich, Thomas G. (2004). "Bias–variance analysis of support vector machines for the development of SVM-based ensemble methods" (PDF). Journal of Machine Learning Research. 5: 725–775.
  17. ^ Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008). "Vector Space Classification" (PDF). Introduction to Information Retrieval. Cambridge University Press. pp. 308–314.
  18. ^ Brain, Damian; Webb, Geoffrey (2002). The Need for Low Bias Algorithms in Classification Learning From Large Data Sets (PDF). Proceedings of the Sixth European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002).
  19. ^ Francois-Lavet, Vincent; Rabusseau, Guillaume; Pineau, Joelle; Ernst, Damien; Fonteneau, Raphael (2019). "On Overfitting and Asymptotic Bias in Batch Reinforcement Learning with Partial Observability". Journal of Artificial Intelligence Research. 65: 1–30. doi:10.1613/jair.1.11478.
  20. ^ Gigerenzer, Gerd; Brighton, Henry (2009). "Homo Heuristicus: Why Biased Minds Make Better Inferences". Topics in Cognitive Science. 1 (1): 107–143. doi:10.1111/j.1756-8765.2008.01006.x. hdl:11858/00-001M-0000-0024-F678-0. PMID 25164802.

External links

  • MLU-Explain: The Bias Variance Tradeoff — An interactive visualization of the bias-variance tradeoff in LOESS Regression and K-Nearest Neighbors.

Literature

  • Harry L. Van Trees; Kristine L. Bell, "Exploring Estimator Bias–Variance Tradeoffs Using the Uniform CR Bound," in Bayesian Bounds for Parameter Estimation and Nonlinear Filtering/Tracking, IEEE, 2007, pp. 451–466, doi:10.1109/9780470544198.ch40.
