
Scott's rule

Scott's rule is a method to select the number of bins in a histogram.[1] It is widely employed in data analysis software including R,[2] Python[3] and Microsoft Excel, where it is the default bin selection method.[4]

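For a set of $n$ observations $x_i$, let $\hat{f}(x)$ be the histogram approximation of some function $f(x)$. The integrated mean squared error (IMSE) is

$$\text{IMSE} = E\left[ \int_{-\infty}^{\infty} dx \, \left( \hat{f}(x) - f(x) \right)^2 \right]$$

where $E[\cdot]$ denotes the expectation across many independent draws of $n$ data points. By Taylor expanding to first order in $h$, the bin width, Scott showed that the optimal width is

$$h^* = \left( 6 \Big/ \int_{-\infty}^{\infty} f'(x)^2 \, dx \right)^{1/3} n^{-1/3}$$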

This formula is also the basis for the Freedman–Diaconis rule.

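By taking a normal reference, i.e. assuming that $f(x)$ is a normal distribution, the equation for $h^*$ becomes

$$h^* = \left( 24 \sqrt{\pi} \right)^{1/3} \sigma \, n^{-1/3} \approx 3.5 \, \sigma \, n^{-1/3}$$

where $\sigma$ is the standard deviation of the normal distribution and is estimated from the data. With this value of bin width Scott demonstrates that[5]

$$\text{IMSE} \propto n^{-2/3}$$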

showing how quickly the histogram approximation approaches the true distribution as the number of samples increases.
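The normal-reference width can be evaluated directly from a sample. The following is a minimal sketch rather than a reference implementation: the function names `scott_bin_width` and `scott_bin_count` are hypothetical, and using the sample standard deviation as the estimate of $\sigma$ is an assumption (software packages may use a different estimator).

```python
import math
import random
import statistics


def scott_bin_width(data):
    """Normal-reference bin width h = (24*sqrt(pi))**(1/3) * s * n**(-1/3)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Assumption: sigma is estimated by the sample standard deviation.
    s = statistics.stdev(data)
    return (24 * math.sqrt(math.pi)) ** (1 / 3) * s * n ** (-1 / 3)


def scott_bin_count(data):
    """Number of equal-width bins needed to cover the data range at that width."""
    h = scott_bin_width(data)
    if h == 0:
        return 1  # all observations identical: a single bin suffices
    return math.ceil((max(data) - min(data)) / h)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
    print(scott_bin_width(sample), scott_bin_count(sample))
```

The bin count is obtained by dividing the observed data range by the width and rounding up, so unlike the width itself it depends on the extreme values in the sample.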

Terrell–Scott rule

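Another approach developed by Terrell and Scott[6] is based on the observation that, among all densities $g(x)$ defined on a compact interval, say $|x| < 1/2$, with derivatives which are absolutely continuous, the density which minimises $\int_{-\infty}^{\infty} dx \, \left( g^{(k)}(x) \right)^2$ is

$$f_k(x) = \begin{cases} \dfrac{(2k+1)!}{2^{2k} (k!)^2} \left( 1 - 4x^2 \right)^k, & |x| \leq 1/2 \\ 0, & |x| > 1/2 \end{cases}$$

Using this with $k=1$ in the expression for $h^*$ gives an upper bound on the value of bin width. For a density supported on an interval $(a, b)$ the bound is

$$h^*_{TS} = \frac{b-a}{(2n)^{1/3}}$$

So, for functions satisfying the continuity conditions, at least

$$k_{TS} = \frac{b-a}{h^*_{TS}} = (2n)^{1/3}$$

bins should be used.[7]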
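The $k=1$ case can be checked by hand. The lines below are a sketch of that calculation, assuming the density is supported on $|x| \leq 1/2$ (so that $b - a = 1$) and substituting it into Scott's expression for the optimal width:

```latex
\begin{align*}
f_1(x) &= \tfrac{3}{2} \left( 1 - 4x^2 \right), \qquad f_1'(x) = -12x, \\
\int_{-1/2}^{1/2} f_1'(x)^2 \, dx &= 144 \int_{-1/2}^{1/2} x^2 \, dx = 12, \\
h^* &= \left( \frac{6}{12} \right)^{1/3} n^{-1/3} = (2n)^{-1/3}.
\end{align*}
```

Stretching the support to an interval of width $b - a$ scales the width by the same factor, giving $(b-a)/(2n)^{1/3}$ (for example $(4/n)^{1/3}$ when $b - a = 2$), so the number of bins $(b-a)/h^*$ is $(2n)^{1/3}$ regardless of the interval.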

 
10000 samples from a normal distribution binned using different rules. The Scott rule uses 48 bins, the Terrell-Scott rule uses 28 and Sturges's rule 15.

This rule is also called the oversmoothed rule[7] or the Rice rule,[8] so called because both authors worked at Rice University. The Rice rule is often reported with the factor of 2 outside the cube root, $2 n^{1/3}$, and may be considered a different rule. The key difference from Scott's rule is that this rule does not assume the data is normally distributed, and the number of bins depends only on the number of samples, not on any other property of the data.

In general $(2n)^{1/3}$ is not an integer, so $\lceil (2n)^{1/3} \rceil$ is used, where $\lceil \cdot \rceil$ denotes the ceiling function.
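The sample-size-only rules are simple enough to evaluate in a few lines. This is an illustrative sketch with hypothetical function names; the Sturges formula $\lceil \log_2 n \rceil + 1$ is not given in this article and is included, in its commonly quoted form, only to compare against the bin counts in the figure caption.

```python
import math


def terrell_scott_bins(n):
    """Terrell-Scott (oversmoothed) lower bound: ceil((2n)^(1/3))."""
    return math.ceil((2 * n) ** (1 / 3))


def rice_bins(n):
    """Rice rule variant with the factor of 2 outside the cube root."""
    return math.ceil(2 * n ** (1 / 3))


def sturges_bins(n):
    """Sturges's rule in its commonly quoted form (assumed here)."""
    return math.ceil(math.log2(n)) + 1


if __name__ == "__main__":
    n = 10_000
    print(terrell_scott_bins(n), rice_bins(n), sturges_bins(n))
```

For $n = 10000$ the Terrell–Scott and Sturges functions return 28 and 15, matching the caption, while the variant with the factor of 2 outside the cube root gives a noticeably larger count, which illustrates why it may be considered a different rule.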

References

  1. ^ Scott, David W. (1979). "On optimal and data-based histograms". Biometrika. 66 (3): 605–610. doi:10.1093/biomet/66.3.605.
  2. ^ https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist
  3. ^ https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges
  4. ^ "Excel:Create a histogram".
  5. ^ Scott, D.W. (2010). "Scott's rule". Wiley Interdisciplinary Reviews: Computational Statistics. 2 (4): 497–502.
  6. ^ Terrell, G.R.; Scott, D.W. (1985). "Oversmoothed nonparametric density estimates". Journal of the American Statistical Association. 80 (389): 209–214.
  7. ^ a b Scott, D.W. (2009). "Sturges' rule". WIREs Computational Statistics. 1 (3): 303–306. doi:10.1002/wics.35. S2CID 197483064.
  8. ^ Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University (chapter 2 "Graphing Distributions", section "Histograms")
