
Out-of-bag error

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating (bagging). Bagging samples the training set with replacement to create the training sample (the "bag") for each base model. The OOB error is the mean prediction error on each training sample x_i, computed using only the trees that did not have x_i in their bootstrap sample.[1]
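Stated a bit more formally (the notation below is introduced here for illustration and is not drawn verbatim from the cited sources), for a classification forest of B trees trained on n samples with 0–1 loss L, the OOB error can be written as

\[
\operatorname{err}_{\mathrm{OOB}} = \frac{1}{n}\sum_{i=1}^{n} L\!\left(y_i,\ \hat f^{\mathrm{OOB}}(x_i)\right),
\qquad
\hat f^{\mathrm{OOB}}(x_i) = \operatorname{majority\ vote}\left\{\hat f_b(x_i) : x_i \notin \text{bag}_b,\ b = 1,\dots,B\right\},
\]

where \(\hat f_b\) denotes the tree trained on the b-th bootstrap sample.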

Bootstrap aggregating allows one to obtain an out-of-bag estimate of prediction performance by evaluating each base learner's predictions on the observations that were not used to build it.

Out-of-bag dataset

When bootstrap aggregating is performed, two independent sets are created. One set, the bootstrap sample, is the data chosen to be "in-the-bag" by sampling with replacement. The out-of-bag set is all data not chosen in the sampling process.

When this process is repeated, such as when building a random forest, many bootstrap samples and OOB sets are created. The OOB sets can be aggregated into one dataset, but each sample is only considered out-of-bag for the trees that do not include it in their bootstrap sample. The picture below shows that for each bag sampled, the data is separated into two groups.

 
Figure: Visualizing the bagging process. Sampling 4 patients from the original set with replacement and showing the out-of-bag sets. Only patients in the bootstrap sample would be used to train the model for that bag.

This example shows how bagging could be used in the context of diagnosing disease. The original dataset is a set of patients, but each model is trained only on the patients in its bag. The patients in each out-of-bag set can be used to test their respective models. The test would consider whether the model can accurately determine if the patient has the disease.
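To make the in-bag/out-of-bag split concrete, here is a minimal sketch (not from the article) of forming one bootstrap sample and its out-of-bag set with NumPy, mirroring the 4-patient illustration above; the patient labels are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
patients = np.array(["P1", "P2", "P3", "P4"])  # original dataset

# "In-the-bag": draw the same number of patients with replacement,
# so some patients can appear more than once.
in_bag = rng.choice(patients, size=len(patients), replace=True)

# Out-of-bag: every patient that was never drawn for this bag.
out_of_bag = np.setdiff1d(patients, in_bag)

print("bootstrap sample:", in_bag)      # trains the model for this bag
print("out-of-bag set:  ", out_of_bag)  # tests the model for this bag
```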

Calculating out-of-bag error

Since each out-of-bag set is not used to train the model, it is a good test for the performance of the model. The specific calculation of OOB error depends on the implementation of the model, but a general calculation is as follows.

  1. Find all models (or trees, in the case of a random forest) that were not trained on the OOB instance.
  2. Take the majority vote of these models' predictions for the OOB instance and compare it to the true value of the OOB instance.
  3. Compile the OOB error over all instances in the OOB dataset.
 
Figure: An illustration of OOB error.
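A minimal sketch of the calculation above, assuming `trees` is a list of fitted classifiers with scikit-learn-style `predict` methods and `bootstrap_indices[t]` records which training rows were drawn for tree `t`; both names are illustrative rather than part of any particular library.

```python
import numpy as np

def oob_error(trees, bootstrap_indices, X, y):
    """OOB misclassification rate over the training set (X, y)."""
    n = len(y)
    votes = [[] for _ in range(n)]          # OOB predictions collected per sample

    for tree, idx in zip(trees, bootstrap_indices):
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]   # rows this tree never saw
        if not oob:
            continue
        for i, pred in zip(oob, tree.predict(X[oob])):
            votes[i].append(pred)

    errors = counted = 0
    for i in range(n):
        if not votes[i]:                    # sample appeared in every bag; skip it
            continue
        majority = max(set(votes[i]), key=votes[i].count)  # majority vote of OOB trees
        errors += int(majority != y[i])
        counted += 1
    return errors / counted
```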

The bagging process can be customized to fit the needs of a model. To ensure an accurate model, the bootstrap training sample size should be close to that of the original set.[2] The number of iterations (trees) of the model (forest) should also be considered when estimating the true OOB error. Because the OOB error stabilizes over many iterations, starting with a high number of iterations is a good idea.[3]
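As a rough illustration of that stabilization (a sketch on synthetic data, not part of the article), scikit-learn's RandomForestClassifier can report the OOB score while the forest is grown incrementally with warm_start, so the OOB error can be tracked as trees are added:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = RandomForestClassifier(warm_start=True, oob_score=True,
                             bootstrap=True, random_state=0)
for n_trees in (25, 50, 100, 200, 400):
    clf.set_params(n_estimators=n_trees)
    clf.fit(X, y)                         # warm_start keeps earlier trees and adds more
    # oob_score_ is OOB accuracy, so 1 - oob_score_ is the OOB error rate
    print(f"{n_trees:4d} trees  OOB error = {1.0 - clf.oob_score_:.3f}")
```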

As the illustration above shows, once the forest is set up, the OOB error can be found using the method described above.

Comparison to cross-validation

Out-of-bag error and cross-validation (CV) are different methods of estimating the prediction error of a machine learning model. Over many iterations, the two methods should produce very similar error estimates. That is, once the OOB error stabilizes, it will converge to the cross-validation (specifically leave-one-out cross-validation) error.[3] The advantage of the OOB method is that it requires less computation and allows one to test the model as it is being trained.
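A brief sketch (again on synthetic data, not from the cited sources) comparing the two estimates: the OOB error comes for free from a single fitted forest, while a 5-fold CV estimate requires five additional fits, and with enough trees the two error rates are typically close.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)
oob_err = 1.0 - forest.oob_score_            # OOB error from the single fit

cv_acc = cross_val_score(
    RandomForestClassifier(n_estimators=500, random_state=0), X, y, cv=5
).mean()
print(f"OOB error: {oob_err:.3f}   5-fold CV error: {1.0 - cv_acc:.3f}")
```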

Accuracy and Consistency

Out-of-bag error is used frequently for error estimation within random forests. However, a study by Silke Janitza and Roman Hornung concluded that out-of-bag error tends to overestimate the true error in settings with an equal number of observations from all response classes (balanced samples), small sample sizes, a large number of predictor variables, small correlation between predictors, and weak effects.[4]

See also

  Boosting (meta-algorithm)
  Bootstrap aggregating
  Bootstrapping (statistics)
  Cross-validation (statistics)
  Random forest
  Random subspace method (attribute bagging)

References

  1. James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert (2013). An Introduction to Statistical Learning. Springer. pp. 316–321.
  2. Ong, Desmond (2014). A primer to bootstrapping; and an overview of doBootstrap (PDF). pp. 2–4.
  3. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (PDF). Springer. pp. 592–593.
  4. Janitza, Silke; Hornung, Roman (2018-08-06). "On the overestimation of random forest's out-of-bag error". PLOS ONE. 13 (8): e0201904. doi:10.1371/journal.pone.0201904. ISSN 1932-6203. PMC 6078316. PMID 30080866.
