Scott's rule

Scott's rule is a method for selecting the bin width, and hence the number of bins, of a histogram.^[1] It is widely used in data analysis software, including R,^[2] Python^[3] and Microsoft Excel, where it is the default bin selection method.^[4]

For a set of $n$ observations $x_{i}$, let ${\hat {f}}(x)$ be the histogram approximation of the underlying probability density $f(x)$. The integrated mean squared error (IMSE) is

{\text{IMSE}}=E\left[\int _{-\infty }^{\infty }dx({\hat {f}}(x)-f(x))^{2}\right]

where $E[\cdot ]$ denotes the expectation across many independent draws of $n$ data points. By Taylor expanding to first order in the bin width $h$, Scott showed that the optimal width is

h^{*}=\left(6/\int _{-\infty }^{\infty }f'(x)^{2}dx\right)^{1/3}n^{-1/3}

This formula is also the basis for the Freedman–Diaconis rule.
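The expression for $h^{*}$ can be recovered from the standard leading-order form of the IMSE. The following is a sketch of that step, not a reproduction of Scott's derivation; the shorthand $R(f')$ is introduced here for convenience and does not appear in the source. To leading order the IMSE is the sum of a variance term and a squared-bias term:

{\text{IMSE}}(h)\approx {\frac {1}{nh}}+{\frac {h^{2}}{12}}R(f'),\qquad R(f')=\int _{-\infty }^{\infty }f'(x)^{2}dx

Setting the derivative with respect to $h$ to zero gives

-{\frac {1}{nh^{2}}}+{\frac {h}{6}}R(f')=0\quad \Rightarrow \quad h^{*}=\left({\frac {6}{R(f')}}\right)^{1/3}n^{-1/3}

A wider bin lowers the variance term but raises the bias term, and $h^{*}$ is the width at which the two contributions balance.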

Taking a normal reference, i.e. assuming that $f(x)$ is the density of a normal distribution, the equation for $h^{*}$ becomes

h^{*}=\left(24{\sqrt {\pi }}\right)^{1/3}\sigma n^{-1/3}\approx 3.5\sigma n^{-1/3}

where $\sigma$ is the standard deviation of the normal distribution, which in practice is estimated from the data. With this bin width, Scott showed that^[5]

{\text{IMSE}}\propto n^{-2/3}

showing how quickly the histogram approximation approaches the true distribution as the number of samples increases.
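The normal-reference formula is simple to evaluate directly. Below is a minimal sketch in Python using only the standard library. The function names `scott_bin_width` and `scott_bin_count` are hypothetical. Two choices in it are assumptions rather than part of the rule as stated above: the population standard deviation (`statistics.pstdev`) stands in for $\sigma$, and the bin count is taken as the data range divided by $h^{*}$, rounded up. The sample data are synthetic.

```python
import math
import random
import statistics

# Constant (24 * sqrt(pi))**(1/3), roughly 3.49, from the normal-reference formula.
SCOTT_FACTOR = (24.0 * math.sqrt(math.pi)) ** (1.0 / 3.0)


def scott_bin_width(data):
    """Normal-reference bin width h* = SCOTT_FACTOR * sigma * n**(-1/3)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Assumption: sigma is estimated with the population standard deviation.
    sigma = statistics.pstdev(data)
    return SCOTT_FACTOR * sigma * n ** (-1.0 / 3.0)


def scott_bin_count(data):
    """Number of bins of width h* needed to cover the data range (assumed convention)."""
    h = scott_bin_width(data)
    if h == 0.0:
        return 1  # all observations identical: a single bin suffices
    return max(1, math.ceil((max(data) - min(data)) / h))


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]
    print(f"bin width: {scott_bin_width(sample):.4f}")
    print(f"bin count: {scott_bin_count(sample)}")
```

Because the width depends on $\sigma$ and the count depends on the observed range, both outputs vary from sample to sample. That is the sense in which Scott's rule is data-based.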

Terrell–Scott rule

Another approach, developed by Terrell and Scott,^[6] is based on the observation that, among all densities $g(x)$ defined on a compact interval, say $|x|\leq 1/2$, with derivatives which are absolutely continuous, the density which minimises $\int _{-\infty }^{\infty }dx(g^{(k)}(x))^{2}$ is

f_{k}(x)={\begin{cases}{\frac {(2k+1)!}{2^{2k}(k!)^{2}}}(1-4x^{2})^{k},\quad &|x|\leq 1/2\\0&|x|>1/2\end{cases}}

Using this with $k=1$ in the expression for $h^{*}$ gives an upper bound on the bin width,

h_{TS}^{*}=\left({\frac {1}{2n}}\right)^{1/3}.

So, for densities satisfying the continuity conditions on an interval $[a,b]$ (here of width $b-a=1$), at least

k_{TS}={\frac {b-a}{h_{TS}^{*}}}=\left(2n\right)^{1/3}

bins should be used.^[7]
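The $k=1$ case can be checked by direct substitution. This is a worked sketch on the unit-width interval $|x|\leq 1/2$ used above:

f_{1}(x)={\frac {3}{2}}(1-4x^{2}),\qquad f_{1}'(x)=-12x,\qquad \int _{-1/2}^{1/2}f_{1}'(x)^{2}dx=144\int _{-1/2}^{1/2}x^{2}dx=12

Inserting the value 12 into the expression for $h^{*}$ and taking the reciprocal gives the number of bins on the unit-width interval:

{\frac {1}{h^{*}}}=\left({\frac {12n}{6}}\right)^{1/3}=\left(2n\right)^{1/3}

Since $f_{1}$ minimises the integral of the squared first derivative among the admissible densities, any other such density has a larger integral and therefore a smaller optimal bin width. This is why $\left(2n\right)^{1/3}$ acts as a lower bound on the number of bins.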

Figure: 10000 samples from a normal distribution binned using different rules. The Scott rule uses 48 bins, the Terrell–Scott rule 28 and Sturges's rule 15.

This rule is also called the oversmoothed rule^[7] or the Rice rule,^[8] the latter because both authors worked at Rice University. The Rice rule is often reported with the factor of 2 outside the cube root, $2n^{1/3}$, in which case it may be considered a different rule. The key difference from Scott's rule is that this rule does not assume the data are normally distributed, and the number of bins depends only on the number of samples, not on any other property of the data.

In general $\left(2n\right)^{1/3}$ is not an integer, so $\lceil \left(2n\right)^{1/3}\rceil$ is used, where $\lceil \cdot \rceil$ denotes the ceiling function.
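The rounded bin count can be sketched in a few lines of Python. The helper names below are hypothetical. The ceiling of the cube root is computed with an exact integer check, an implementation choice made here so that floating-point rounding cannot change the result when $2n$ is a perfect cube. The variant with the factor of 2 outside the cube root is included for comparison, using the identity $2n^{1/3}=(8n)^{1/3}$.

```python
def ceil_cube_root(m: int) -> int:
    """Smallest positive integer k with k**3 >= m."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    # Start from the floating-point estimate, then correct it exactly.
    k = max(1, round(m ** (1.0 / 3.0)))
    while k ** 3 < m:
        k += 1
    while k > 1 and (k - 1) ** 3 >= m:
        k -= 1
    return k


def terrell_scott_bins(n: int) -> int:
    """ceil((2n)**(1/3)): the Terrell-Scott lower bound on the bin count."""
    return ceil_cube_root(2 * n)


def rice_bins(n: int) -> int:
    """ceil(2 * n**(1/3)), computed as ceil((8n)**(1/3))."""
    return ceil_cube_root(8 * n)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, terrell_scott_bins(n), rice_bins(n))
```

For $n=10000$ the first function returns 28, since $27^{3}=19683<20000\leq 28^{3}=21952$.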

References

[1] Scott, David W. (1979). "On optimal and data-based histograms". Biometrika. 66 (3): 605–610. doi:10.1093/biomet/66.3.605.
[2] https://www.rdocumentation.org/packages/graphics/versions/3.6.2/topics/hist
[3] https://numpy.org/doc/stable/reference/generated/numpy.histogram_bin_edges.html#numpy.histogram_bin_edges
[4] "Excel:Create a histogram".
[5] Scott, D. W. (2010). "Scott's rule". Wiley Interdisciplinary Reviews: Computational Statistics. 2 (4): 497–502.
[6] Terrell, G. R.; Scott, D. W. (1985). "Oversmoothed nonparametric density estimates". Journal of the American Statistical Association. 80 (389): 209–214.
[7] Scott, D. W. (2009). "Sturges' rule". WIREs Computational Statistics. 1 (3): 303–306. doi:10.1002/wics.35. S2CID 197483064.
[8] Online Statistics Education: A Multimedia Course of Study (http://onlinestatbook.com/). Project Leader: David M. Lane, Rice University (chapter 2 "Graphing Distributions", section "Histograms").
