
Qualitative variation

An index of qualitative variation (IQV) is a measure of statistical dispersion in nominal distributions. Examples include the variation ratio or the information entropy.

Properties

There are several types of indices used for the analysis of nominal data. Several are standard statistics that are used elsewhere: range, standard deviation, variance, mean deviation, coefficient of variation, median absolute deviation, interquartile range and quartile deviation.

In addition to these, several statistics have been developed with nominal data in mind. A number have been devised and summarized by Wilcox (Wilcox 1967; Wilcox 1973), who requires the following standardization properties to be satisfied:

  • Variation varies between 0 and 1.
  • Variation is 0 if and only if all cases belong to a single category.
  • Variation is 1 if and only if cases are evenly divided across all categories.[1]

In particular, the value of these standardized indices does not depend on the number of categories or number of samples.

For any index, the closer to uniform the distribution, the larger the variance, and the larger the differences in frequencies across categories, the smaller the variance.

Indices of qualitative variation are then analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation.

One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.

Wilcox's indexes

Wilcox gives a number of formulae for various indices of QV (Wilcox 1973). The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio and is analogous to variance as deviation from the mean.

ModVR

The formula for the variation around the mode (ModVR) is derived as follows:

$M = \sum_{i=1}^K (f_m - f_i)$

where fm is the modal frequency, K is the number of categories and fi is the frequency of the ith group.

This can be simplified to

$M = Kf_m - N$

where N is the total size of the sample.

Freeman's index (or variation ratio) is[2]

$v = 1 - \frac{f_m}{N}$

This is related to M as follows:

$\frac{\frac{f_m}{N} - \frac{1}{K}}{\frac{N}{K}\,\frac{K-1}{N}} = \frac{M}{N(K-1)}$

The ModVR is defined as

$\operatorname{ModVR} = 1 - \frac{Kf_m - N}{N(K-1)} = \frac{K(N - f_m)}{N(K-1)} = \frac{Kv}{K-1}$

where v is Freeman's index.

Low values of ModVR correspond to small amounts of variation and high values to larger amounts of variation.

When K is large, ModVR is approximately equal to Freeman's index v.
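As an illustration, here is a minimal Python sketch (the frequencies are made up for the example) that computes Freeman's variation ratio v and ModVR from the formulas above:

```python
# Minimal sketch with assumed example frequencies.
freqs = [20, 10, 5, 5]                   # hypothetical nominal counts
N = sum(freqs)                           # total sample size
K = len(freqs)                           # number of categories
f_m = max(freqs)                         # modal frequency

v = 1 - f_m / N                          # Freeman's index (variation ratio)
mod_vr = K * (N - f_m) / (N * (K - 1))   # equals K*v/(K-1)

print(v, mod_vr)                         # 0.5 and 0.666..., here
```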

RanVR

This is based on the range around the mode. It is defined to be

$\operatorname{RanVR} = 1 - \frac{f_m - f_l}{f_m} = \frac{f_l}{f_m}$

where fm is the modal frequency and fl is the lowest frequency.

AvDev

This is an analog of the mean deviation. It is defined as the arithmetic mean of the absolute differences of each value from the mean.

$\operatorname{AvDev} = 1 - \frac{1}{2N}\frac{K}{K-1}\sum_{i=1}^K \left| f_i - \frac{N}{K} \right|$

MNDif

This is an analog of the mean difference: the average of the differences of all the possible pairs of variate values, taken regardless of sign. The mean difference differs from the mean and standard deviation because it depends on the spread of the variate values among themselves and not on the deviations from some central value.[3]

$\operatorname{MNDif} = 1 - \frac{1}{N(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} |f_i - f_j|$

where fi and fj are the ith and jth frequencies respectively.

The MNDif is the Gini coefficient applied to qualitative data.

VarNC

This is an analog of the variance.

$\operatorname{VarNC} = 1 - \frac{1}{N^2}\frac{K}{K-1}\sum \left( f_i - \frac{N}{K} \right)^2$

It is the same index as Mueller and Schuessler's Index of Qualitative Variation[4] and Gibbs' M2 index.

It is distributed as a chi square variable with K – 1 degrees of freedom.[5]

StDev

Wilson has suggested two versions of this statistic.

The first is based on AvDev.

$\operatorname{StDev}_1 = 1 - \sqrt{\frac{\sum_{i=1}^K \left(f_i - \frac{N}{K}\right)^2}{\left(N - \frac{N}{K}\right)^2 + (K-1)\left(\frac{N}{K}\right)^2}}$

The second is based on MNDif

$\operatorname{StDev}_2 = 1 - \sqrt{\frac{\sum_{i=1}^{K-1} \sum_{j=i+1}^K (f_i - f_j)^2}{N^2(K-1)}}$

HRel

This index was originally developed by Claude Shannon for use in specifying the properties of communication channels.

$\operatorname{HRel} = \frac{-\sum p_i \log_2 p_i}{\log_2 K}$

where pi = fi / N.

This is equivalent to information entropy divided by $\log_2 K$ and is useful for comparing relative variation between frequency tables of multiple sizes.
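A short Python sketch (with assumed frequencies) shows HRel as the Shannon entropy of the observed proportions divided by its maximum, $\log_2 K$:

```python
import math

# Sketch with assumed example frequencies.
freqs = [50, 30, 20]
N = sum(freqs)
K = len(freqs)
p = [f / N for f in freqs]

h = -sum(pi * math.log2(pi) for pi in p)   # entropy in bits
h_rel = h / math.log2(K)                   # 0 (one category) .. 1 (uniform)
print(round(h_rel, 3))
```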

B index

Wilcox adapted a proposal of Kaiser[6] based on the geometric mean and created the B index, which is defined as

$B = 1 - \sqrt{1 - \left(\sqrt[K]{\prod_{i=1}^K \frac{f_i K}{N}}\right)^2}$

R packages

Several of these indices have been implemented in the R language.[7]

Gibbs' indices and related formulae

Gibbs & Poston Jr (1975) proposed six indexes.[8]

M1

The unstandardized index (M1) (Gibbs & Poston Jr 1975, p. 471) is

$M1 = 1 - \sum_{i=1}^K p_i^2$

where K is the number of categories and $p_i = f_i/N$ is the proportion of observations that fall in a given category i.

M1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category,[9] so this formula for IQV is a standardized likelihood of a random pair falling in the same category. This index has also been referred to as the index of differentiation, the index of sustenance differentiation and the geographical differentiation index, depending on the context in which it has been used.

M2

A second index, M2 (Gibbs & Poston Jr 1975, p. 472),[10] is:

$M2 = \frac{K}{K-1}\left(1 - \sum_{i=1}^K p_i^2\right)$

where K is the number of categories and $p_i = f_i/N$ is the proportion of observations that fall in a given category i. The factor of $\frac{K}{K-1}$ is for standardization.

M1 and M2 can be interpreted in terms of variance of a multinomial distribution (Swanson 1976) (there called an "expanded binomial model"). M1 is the variance of the multinomial distribution and M2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution.
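A minimal Python sketch (counts assumed for the example) computes M1 and its standardized form M2 directly from the proportions:

```python
# Sketch with assumed category counts.
counts = [30, 30, 20, 20]
N = sum(counts)
K = len(counts)
p = [c / N for c in counts]

m1 = 1 - sum(pi ** 2 for pi in p)   # chance a random pair differs
m2 = (K / (K - 1)) * m1             # standardized so a uniform split gives 1
print(m1, m2)
```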

M4

The M4 index is

$M4 = \frac{\sum_{i=1}^K |X_i - m|}{2\sum_{i=1}^K X_i}$

where m is the mean.

M6

The formula for M6 is

$M6 = K\left[1 - \frac{\sum_{i=1}^K |X_i - m|}{2N}\right]$

where K is the number of categories, Xi is the number of data points in the ith category, N is the total number of data points, |·| is the absolute value (modulus) and

$m = \frac{\sum_{i=1}^K X_i}{N}$

This formula can be simplified to

$M6 = K\left[1 - \frac{\sum_{i=1}^K \left| p_i - \frac{1}{N} \right|}{2}\right]$

where pi is the proportion of the sample in the ith category.

In practice M1 and M6 tend to be highly correlated which militates against their combined use.

Related indices

The sum

$\sum_{i=1}^K p_i^2$

has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl–Hirschman index (HHI) in economics. A variant of this is known as the Hunter–Gaston index in microbiology.[11]

In linguistics and cryptanalysis this sum is known as the repeat rate. The incidence of coincidence (IC) is an unbiased estimator of this statistic[12]

$\operatorname{IC} = \sum \frac{f_i (f_i - 1)}{n(n-1)}$

where fi is the count of the ith grapheme in the text and n is the total number of graphemes in the text.
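As a sketch, the repeat rate and its unbiased incidence-of-coincidence estimator can be computed in Python from grapheme counts (the text string here is only an assumed example):

```python
from collections import Counter

# Sketch with an assumed short text.
text = "qualitativevariation"
counts = Counter(text)
n = len(text)

repeat_rate = sum((f / n) ** 2 for f in counts.values())
ic = sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))
print(repeat_rate, ic)   # the two estimates converge as n grows
```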

M1

The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability,[13] Simpson's measure of diversity,[14] Bachi's index of linguistic homogeneity,[15] Mueller and Schuessler's index of qualitative variation,[16] Gibbs and Martin's index of industry diversification,[17] Lieberson's index,[18] and Blau's index in sociology, psychology and management studies.[19] The formulations of all these indices are identical.

Simpson's D is defined as

$D = 1 - \sum_{i=1}^K \frac{n_i(n_i - 1)}{n(n-1)}$

where n is the total sample size and ni is the number of items in the ith category.

For large n we have

$D \sim 1 - \sum_{i=1}^K p_i^2$

Another statistic that has been proposed is the coefficient of unalikeability which ranges between 0 and 1.[20]

$u = \frac{\sum c(x, y)}{n^2 - n}$

where n is the sample size and c(x,y) = 1 if x and y are alike and 0 otherwise.

For large n we have

$u \sim 1 - \sum_{i=1}^K p_i^2$

where K is the number of categories.

Another related statistic is the quadratic entropy

$H_2 = 2\left(1 - \sum_{i=1}^K p_i^2\right)$

which is itself related to the Gini index.

M2

Greenberg's monolingual non weighted index of linguistic diversity[21] is the M2 statistic defined above.

M7

Another index – the M7 – was created based on the M4 index of Gibbs & Poston Jr (1975)[22]

$M7 = \frac{\sum_{i=1}^K \sum_{j=1}^L (R_{ij} - R)^2}{\sum R_{ij}}$

where

$R_{ij} = \frac{O_{ij}}{E_{ij}} = \frac{O_{ij}}{n_i p_j}$

and

$R = \frac{\sum_{i=1}^K \sum_{j=1}^L R_{ij}}{\sum_{i=1}^K n_i}$

where K is the number of categories, L is the number of subtypes, Oij and Eij are the number observed and expected respectively of subtype j in the ith category, ni is the number in the ith category and pj is the proportion of subtype j in the complete sample.

Note: This index was designed to measure women's participation in the workplace: the two subtypes it was developed for were male and female.

Other single sample indices

These indices are summary statistics of the variation within the sample.

Berger–Parker index

The Berger–Parker index equals the maximum $p_i$ value in the dataset, i.e. the proportional abundance of the most abundant type.[23] This corresponds to the weighted generalized mean of the $p_i$ values when q approaches infinity, and hence equals the inverse of true diversity of order infinity (1/D).

Brillouin index of diversity

This index is strictly applicable only to entire populations rather than to finite samples. It is defined as

$I_B = \frac{\log N! - \sum_{i=1}^K \log n_i!}{N}$

where N is total number of individuals in the population, ni is the number of individuals in the ith category and N! is the factorial of N. Brillouin's index of evenness is defined as

$E_B = \frac{I_B}{I_{B(\max)}}$

where IB(max) is the maximum value of IB.

Hill's diversity numbers

Hill suggested a family of diversity numbers[24]

$N_a = \frac{1}{\sqrt[a-1]{\sum_{i=1}^K p_i^a}}$

For given values of a, several of the other indices can be computed (see the sketch after this list):

  • a = 0: Na = species richness
  • a = 1: Na = Shannon's index
  • a = 2: Na = 1/Simpson's index (without the small sample correction)
  • a = 3: Na = 1/Berger–Parker index
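A small Python sketch (proportions assumed for the example) evaluates $N_a$, handling the limit a → 1 via the standard identity $N_1 = e^H$ with H the Shannon entropy:

```python
import math

# Sketch with assumed proportions.
p = [0.5, 0.3, 0.2]

def hill(a, p):
    if a == 1:
        # limiting case: exp(Shannon entropy)
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** a for pi in p) ** (1 / (1 - a))

print(hill(0, p))   # species richness: 3.0
print(hill(1, p))   # exp(Shannon entropy)
print(hill(2, p))   # 1 / Simpson's index (no small-sample correction)
```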

Hill also suggested a family of evenness measures

$E_{a,b} = \frac{N_a}{N_b}$

where a > b.

Hill's E4 is

$E_4 = \frac{N_2}{N_1}$

Hill's E5 is

$E_5 = \frac{N_2 - 1}{N_1 - 1}$

Margalef's index

$I_{\mathrm{Marg}} = \frac{S - 1}{\log_e N}$

where S is the number of data types in the sample and N is the total size of the sample.[25]

Menhinick's index

$I_{\mathrm{Men}} = \frac{S}{\sqrt{N}}$

where S is the number of data types in the sample and N is the total size of the sample.[26]

In linguistics this index is identical to the Kuraszkiewicz index (Guiard index), where S is the number of distinct words (types) and N is the total number of words (tokens) in the text being examined.[27][28] This index can be derived as a special case of the Generalised Torquist function.[29]

Q statistic

This is a statistic invented by Kempton and Taylor[30] and involves the quartiles of the sample. It is defined as

$Q = \frac{\frac{1}{2}\left(n_{R_1} + n_{R_2}\right) + \sum_{j=R_1+1}^{R_2-1} n_j}{\log\left(R_2 / R_1\right)}$

where R1 and R2 are the 25% and 75% quartiles respectively on the cumulative species curve, nj is the number of species in the jth category, nRi is the number of species in the class where Ri falls (i = 1 or 2).

Shannon–Wiener index

This is taken from information theory

$H = \log_e N - \frac{1}{N} \sum n_i \log_e n_i = -\sum p_i \log_e p_i$

where N is the total number in the sample and pi is the proportion in the ith category.

In ecology where this index is commonly used, H usually lies between 1.5 and 3.5 and only rarely exceeds 4.0.

An approximate formula for the standard deviation (SD) of H is

$\operatorname{SD}(H) = \frac{1}{N}\left[\sum p_i (\log_e p_i)^2 - H^2\right]$

where pi is the proportion made up by the ith category and N is the total in the sample.

A more accurate approximate value of the variance of H (var(H)) is given by[31]

$\operatorname{var}(H) = \frac{\sum p_i (\log_e p_i)^2 - \left(\sum p_i \log_e p_i\right)^2}{N} + \frac{K-1}{2N^2} + \frac{-1 + \sum p_i^{-2} - \sum p_i^{-1} \log_e p_i + \sum p_i^{-1} \sum p_i \log_e p_i}{6N^3}$

where N is the sample size and K is the number of categories.

A related index is the Pielou J defined as

$J = \frac{H}{\log_e S}$

One difficulty with this index is that S is unknown for a finite sample. In practice S is usually set to the maximum present in any category in the sample.
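A short Python sketch (counts assumed, with S taken as the number of observed categories, as the text suggests is done in practice) computes H and Pielou's J:

```python
import math

# Sketch with assumed counts.
counts = [40, 30, 20, 10]
N = sum(counts)
p = [c / N for c in counts]

H = -sum(pi * math.log(pi) for pi in p)   # Shannon-Wiener index (natural log)
S = len(counts)                           # assumed: observed categories
J = H / math.log(S)                       # Pielou's J: 1 when perfectly even
print(round(H, 3), round(J, 3))
```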

Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to other values of q than unity. It can be expressed:

${}^qH = \frac{1}{1-q} \ln\left(\sum_{i=1}^K p_i^q\right)$

which equals

${}^qH = \ln\left(\frac{1}{\sqrt[q-1]{\sum_{i=1}^K p_i p_i^{q-1}}}\right) = \ln\left({}^qD\right)$

This means that taking the logarithm of true diversity based on any value of q gives the Rényi entropy corresponding to the same value of q.

The value of ${}^qD$ is also known as the Hill number.[24]
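A minimal Python sketch (assumed proportions) computes the Rényi entropy of order q and the corresponding Hill number $e^{{}^qH}$; q = 1 recovers the Shannon entropy:

```python
import math

# Sketch with assumed proportions.
p = [0.5, 0.3, 0.2]

def renyi(q, p):
    if q == 1:
        # limiting case: Shannon entropy
        return -sum(pi * math.log(pi) for pi in p)
    return math.log(sum(pi ** q for pi in p)) / (1 - q)

for q in (0, 1, 2):
    print(q, renyi(q, p), math.exp(renyi(q, p)))   # entropy and Hill number
```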

McIntosh's D and E

McIntosh proposed a measure of diversity:[32]

$I = \sqrt{\sum_{i=1}^K n_i^2}$

where ni is the number in the ith category and K is the number of categories.

He also proposed several normalized versions of this index. The first is D:

$D = \frac{N - I}{N - \sqrt{N}}$

where N is the total sample size.

This index has the advantage of expressing the observed diversity as a proportion of the absolute maximum diversity at a given N.

Another proposed normalization is E, the ratio of observed diversity to the maximum possible diversity for a given N and K (i.e., if all species are equal in number of individuals):

$E = \frac{N - I}{N - \frac{N}{\sqrt{K}}}$

Fisher's alpha

This was the first index to be derived for diversity.[33]

$K = \alpha \ln\left(1 + \frac{N}{\alpha}\right)$

where K is the number of categories and N is the number of data points in the sample. Fisher's α has to be estimated numerically from the data.

The expected number of individuals in the rth category, where the categories have been placed in order of increasing size, is

$\operatorname{E}(n_r) = \alpha \frac{X^r}{r}$

where X is an empirical parameter lying between 0 and 1. While X is best estimated numerically an approximate value can be obtained by solving the following two equations

$N = \frac{\alpha X}{1 - X}$
$K = -\alpha \ln(1 - X)$

where K is the number of categories and N is the total sample size.

The variance of α is approximately[34]

$\operatorname{var}(\alpha) = \frac{\alpha}{\ln(X)(1 - X)}$
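As a numerical illustration, the pair of equations above can be solved by eliminating α (giving $K/N = -(1-X)\ln(1-X)/X$) and bisecting on X. The K and N values below are assumed for the example:

```python
import math

# Sketch: solve the two Fisher equations for X and alpha by bisection.
K, N = 25, 1000   # assumed example values

def k_over_n(x):
    # From N = aX/(1-X) and K = -a*ln(1-X): K/N = -(1-X)*ln(1-X)/X
    return -(1 - x) * math.log(1 - x) / x

lo, hi = 1e-9, 1 - 1e-9
for _ in range(200):
    mid = (lo + hi) / 2
    # k_over_n decreases as X rises, so step toward the target K/N
    if k_over_n(mid) > K / N:
        lo = mid
    else:
        hi = mid

X = (lo + hi) / 2
alpha = N * (1 - X) / X
print(X, alpha, alpha * math.log(1 + N / alpha))   # last value ~ K
```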

Strong's index

This index (Dw) is the distance between the Lorenz curve of species distribution and the 45 degree line. It is closely related to the Gini coefficient.[35]

In symbols it is

$D_w = \max\left( \frac{c_i}{K} - \frac{i}{N} \right)$

where max() is the maximum value taken over the N data points, K is the number of categories (or species) in the data set and ci is the cumulative total up to and including the ith category.

Simpson's E

This is related to Simpson's D and is defined as

$E = \frac{1/D}{K}$

where D is Simpson's D and K is the number of categories in the sample.

Smith & Wilson's indices

Smith and Wilson suggested a number of indices based on Simpson's D.

$E_1 = \frac{1 - D}{1 - \frac{1}{K}}$
$E_2 = \frac{\log_e D}{\log_e K}$

where D is Simpson's D and K is the number of categories.

Heip's index

$E = \frac{e^H - 1}{K - 1}$

where H is the Shannon entropy and K is the number of categories.

This index is closely related to Sheldon's index which is

$E = \frac{e^H}{K}$

where H is the Shannon entropy and K is the number of categories.

Camargo's index

This index was created by Camargo in 1993.[36]

$E = 1 - \sum_{i=1}^K \sum_{j=i+1}^K \frac{|p_i - p_j|}{K}$

where K is the number of categories and pi is the proportion in the ith category.

Smith and Wilson's B

This index was proposed by Smith and Wilson in 1996.[37]

 

where θ is the slope of the log(abundance)-rank curve.

Nee, Harvey, and Cotgreave's index

This is the slope of the log(abundance)-rank curve.

Bulla's E

There are two versions of this index - one for continuous distributions (Ec) and the other for discrete (Ed).[38]

 
 

where

 

is the Schoener–Czekanoski index, K is the number of categories and N is the sample size.

Horn's information theory index

This index (Rik) is based on Shannon's entropy.[39] It is defined as

 

where

 
 
 
 
 
 
 

In these equations xij and xkj are the number of times the jth data type appears in the ith or kth sample respectively.

Rarefaction index

In a rarefied sample a random subsample of n items is chosen from the N items in total. In this sample some groups may be necessarily absent. Let $X_n$ be the number of groups still present in the subsample of n items. $X_n$ is less than K, the number of categories, whenever at least one group is missing from this subsample.

The rarefaction curve, $f(n)$, is defined as the expected value of $X_n$:

$f(n) = K - \binom{N}{n}^{-1} \sum_{i=1}^K \binom{N - N_i}{n}$

where $N_i$ is the number of items in the ith group.

Note that 0 ≤ f(n) ≤ K.

Furthermore,

 

Despite being defined at discrete values of n, these curves are most frequently displayed as continuous functions.[40]

This index is discussed further in Rarefaction (ecology).
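A short Python sketch (group counts assumed) evaluates the hypergeometric form of the rarefaction curve given above:

```python
from math import comb

# Sketch with assumed group counts.
counts = [50, 30, 15, 4, 1]
N = sum(counts)

def rarefy(n):
    # P(group i absent from a subsample of size n) = C(N - N_i, n) / C(N, n)
    return sum(1 - comb(N - Ni, n) / comb(N, n) for Ni in counts)

for n in (1, 10, N):
    print(n, rarefy(n))   # rises from 1 toward K = 5
```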

Caswell's V

This is a z-type statistic based on Shannon's entropy.[41]

$V = \frac{H - \operatorname{E}(H)}{\operatorname{SD}(H)}$

where H is the Shannon entropy, E(H) is the expected Shannon entropy for a neutral model of distribution and SD(H) is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou

$\operatorname{SD}(H) = \frac{1}{N}\left[\sum p_i (\log_e p_i)^2 - H^2\right]$

where pi is the proportion made up by the ith category and N is the total in the sample.

Lloyd & Ghelardi's index

This is

$J = \frac{K'}{K}$

where K is the number of categories and K' is the number of categories according to MacArthur's broken stick model yielding the observed diversity.

Average taxonomic distinctness index

This index is used to compare the relationship between hosts and their parasites.[42] It incorporates information about the phylogenetic relationship amongst the host species.

 

where s is the number of host species used by a parasite and ωij is the taxonomic distinctness between host species i and j.

Index of qualitative variation

Several indices with this name have been proposed.

One of these is

$\operatorname{IQV} = \frac{K}{K - 1}\left(1 - \sum_{i=1}^K p_i^2\right)$

where K is the number of categories and pi is the proportion of the sample that lies in the ith category.

Theil's H

This index is also known as the multigroup entropy index or the information theory index. It was proposed by Theil in 1972.[43] The index is a weighted average of the samples' entropies.

Let

 

and

 

where pi is the proportion of type i in the ath sample, r is the total number of samples, ni is the size of the ith sample, N is the size of the population from which the samples were obtained and E is the entropy of the population.

Indices for comparison of two or more data types within a single sample

Several of these indexes have been developed to document the degree to which different data types of interest may coexist within a geographic area.

Index of dissimilarity

Let A and B be two types of data item. Then the index of dissimilarity is

$D = \frac{1}{2}\sum_{i=1}^K \left| \frac{A_i}{A} - \frac{B_i}{B} \right|$

where

$A = \sum A_i$
$B = \sum B_i$

Ai is the number of data type A at sample site i, Bi is the number of data type B at sample site i, K is the number of sites sampled and |·| is the absolute value.

This index is probably better known as the index of dissimilarity (D).[44] It is closely related to the Gini index.

This index is biased as its expectation under a uniform distribution is > 0.
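A minimal Python sketch (site counts assumed) computes the index of dissimilarity in the half-sum-of-absolute-differences form given above:

```python
# Sketch with assumed per-site counts for types A and B.
A_counts = [10, 20, 30, 40]
B_counts = [40, 30, 20, 10]
A, B = sum(A_counts), sum(B_counts)

D = 0.5 * sum(abs(a / A - b / B) for a, b in zip(A_counts, B_counts))
print(D)   # 0: identical distributions; 1: complete segregation
```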

A modification of this index has been proposed by Gorard and Taylor.[45] Their index (GT) is

 

Index of segregation

The index of segregation (IS)[46] is

$\operatorname{IS} = \frac{1}{2}\sum_{i=1}^K \left| \frac{A_i}{A} - \frac{t_i - A_i}{T - A} \right|$

where

$A = \sum A_i$
$T = \sum t_i$

and K is the number of units, Ai is the number of data type A in unit i and ti is the total number of all data types in unit i.

Hutchen's square root index

This index (H) is defined as[47]

 

where pi is the proportion of the sample composed of the ith variate.

Lieberson's isolation index

This index ( Lxy ) was invented by Lieberson in 1981.[48]

 

where Xi and Yi are the variables of interest at the ith site, K is the number of sites examined and Xtot is the total number of variates of type X in the study.

Bell's index

This index is defined as[49]

 

where px is the proportion of the sample made up of variates of type X and

 

where Nx is the total number of variates of type X in the study, K is the number of samples in the study and xi and pi are the number of variates and the proportion of variates of type X respectively in the ith sample.

Index of isolation

The index of isolation is

 

where K is the number of units in the study, and Ai and ti are the number of units of type A and the number of all units in the ith sample.

A modified index of isolation has also been proposed

 

The MII lies between 0 and 1.

Gorard's index of segregation

This index (GS) is defined as

 

where

 
 

and Ai and ti are the number of data items of type A and the total number of items in the ith sample.

Index of exposure

This index is defined as

 

where

 

and Ai and Bi are the number of types A and B in the ith category and ti is the total number of data points in the ith category.

Ochiai index

This is a binary form of the cosine index.[50] It is used to compare presence/absence data of two data types (here A and B). It is defined as

$O = \frac{a}{\sqrt{(a + b)(a + c)}}$

where a is the number of sample units where both A and B are found, b is number of sample units where A but not B occurs and c is the number of sample units where type B is present but not type A.

Kulczyński's coefficient

This coefficient was invented by Stanisław Kulczyński in 1927[51] and is an index of association between two types (here A and B). It varies in value between 0 and 1. It is defined as

$K = \frac{1}{2}\left(\frac{a}{a + b} + \frac{a}{a + c}\right)$

where a is the number of sample units where type A and type B are present, b is the number of sample units where type A but not type B is present and c is the number of sample units where type B is present but not type A.

Yule's Q

This index was invented by Yule in 1900.[52] It concerns the association of two different types (here A and B). It is defined as

$Q = \frac{ad - bc}{ad + bc}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. Q varies in value between -1 and +1. In the ordinal case Q is known as the Goodman-Kruskal γ.

Because the denominator potentially may be zero, Leinhert and Sporer have recommended adding +1 to a, b, c and d.[53]

Yule's Y

This index is defined as

$Y = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
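Both Yule statistics can be computed from a 2×2 presence/absence table, as in this Python sketch (cell counts assumed):

```python
import math

# Sketch: a, b, c, d are assumed cell counts of a 2x2 table
# (both present, A only, B only, neither).
a, b, c, d = 30, 10, 5, 55

Q = (a * d - b * c) / (a * d + b * c)
Y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))
print(Q, Y)   # both lie in [-1, +1]; |Y| never exceeds |Q|
```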

Baroni–Urbani–Buser coefficient

This index was invented by Baroni-Urbani and Buser in 1976.[54] It varies between 0 and 1 in value. It is defined as

$BUB = \frac{\sqrt{ad} + a}{\sqrt{ad} + a + b + c}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

When d = 0, this index is identical to the Jaccard index.

Hamman coefficient

This coefficient is defined as

$H = \frac{(a + d) - (b + c)}{N}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Rogers–Tanimoto coefficient

This coefficient is defined as

$RT = \frac{a + d}{a + 2(b + c) + d}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Sokal–Sneath coefficient

This coefficient is defined as

$SS = \frac{2(a + d)}{2(a + d) + b + c}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Sokal's binary distance

This coefficient is defined as

$SBD = \sqrt{\frac{b + c}{N}}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Russel–Rao coefficient

This coefficient is defined as

$RR = \frac{a}{N}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Phi coefficient

This coefficient is defined as

$\phi = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

Soergel's coefficient

This coefficient is defined as

 

where b is the number of samples where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Simpson's coefficient

This coefficient is defined as

$S = \frac{a}{a + \min(b, c)}$

where a is the number of samples where types A and B are both present, b is the number of samples where type A is present but not type B and c is the number of samples where type B is present but not type A.

Dennis' coefficient

This coefficient is defined as

$D = \frac{ad - bc}{\sqrt{N(a + b)(a + c)}}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Forbes' coefficient

This coefficient was proposed by Stephen Alfred Forbes in 1907.[55] It is defined as

$F = \frac{aN}{(a + b)(a + c)}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size (N = a + b + c + d).

A modification of this coefficient which does not require the knowledge of d has been proposed by Alroy[56]

 

where n = a + b + c.

Simple match coefficient

This coefficient is defined as

$SM = \frac{a + d}{N}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Fossum's coefficient

This coefficient is defined as

$F = \frac{N\left(a - \frac{1}{2}\right)^2}{(a + b)(a + c)}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Stile's coefficient

This coefficient is defined as

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A, d is the sample count where neither type A nor type B are present, n equals a + b + c + d and || is the modulus (absolute value) of the difference.

Michael's coefficient

This coefficient is defined as

$M = \frac{4(ad - bc)}{(a + d)^2 + (b + c)^2}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

Peirce's coefficient

In 1884 Charles Peirce suggested[57] the following coefficient

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

Hawkin–Dotson coefficient

In 1975 Hawkin and Dotson proposed the following coefficient

$HD = \frac{1}{2}\left(\frac{a}{a + b + c} + \frac{d}{b + c + d}\right)$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Benini coefficient

In 1901 Benini proposed the following coefficient

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A. Min(b, c) is the minimum of b and c.

Gilbert coefficient

Gilbert proposed the following coefficient

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

Gini index

The Gini index is

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A.

Modified Gini index

The modified Gini index is

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A.

Kuhn's index

Kuhn proposed the following coefficient in 1965

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A. K is a normalizing parameter. N is the sample size.

This index is also known as the coefficient of arithmetic means.

Eyraud index

Eyraud proposed the following coefficient in 1936

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present.

Soergel distance

This is defined as

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present. N is the sample size.

Tanimoto index

This is defined as

$T = \frac{a}{a + b + c}$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present. N is the sample size.

Piatetsky–Shapiro's index

This is defined as

 

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A.

Indices for comparison between two or more samples

Czekanowski's quantitative index

This is also known as the Bray–Curtis index, Schoener's index, least common percentage index, index of affinity or proportional similarity. It is related to the Sørensen similarity index.

$C = \frac{2\sum \min(x_i, x_j)}{\sum x_i + \sum x_j}$

where xi and xj are the number of species in sites i and j respectively and the minimum is taken over the number of species in common between the two sites.

Canberra metric

The Canberra distance is a weighted version of the L1 metric. It was introduced in 1966[58] and refined in 1967[59] by G. N. Lance and W. T. Williams. It is used to define a distance between two vectors – here two sites with K categories within each site.

The Canberra distance d between vectors p and q in a K-dimensional real vector space is

$d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^K \frac{|p_i - q_i|}{|p_i| + |q_i|}$

where pi and qi are the values of the ith category of the two vectors.
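A short Python sketch (vectors assumed; terms where both values are zero are skipped, a common convention) computes the Canberra distance:

```python
# Sketch with assumed category counts at two sites.
p = [3, 0, 2, 5]
q = [1, 4, 2, 3]

d = sum(abs(pi - qi) / (abs(pi) + abs(qi))
        for pi, qi in zip(p, q) if pi or qi)   # skip 0/0 terms by convention
print(d)
```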

Sorensen's coefficient of community

This is used to measure similarities between communities.

$CC = \frac{2c}{s_1 + s_2}$

where s1 and s2 are the number of species in community 1 and 2 respectively and c is the number of species common to both areas.

Jaccard's index

This is a measure of the similarity between two samples:

$J = \frac{A}{A + B + C}$

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.

This index was invented in 1902 by the Swiss botanist Paul Jaccard.[60]

Under a random distribution the expected value of J is[61]

 

The standard error of this index with the assumption of a random distribution is

 

where N is the total size of the sample.

Dice's index

This is a measure of the similarity between two samples:

$D = \frac{2A}{2A + B + C}$

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.
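Both similarity measures are easy to compute from two sets of data points, as in this Python sketch (set contents assumed):

```python
# Sketch with assumed samples represented as sets.
s1 = {"a", "b", "c", "d"}
s2 = {"c", "d", "e"}

A = len(s1 & s2)              # shared data points
B = len(s1 - s2)              # only in the first sample
C = len(s2 - s1)              # only in the second sample

jaccard = A / (A + B + C)
dice = 2 * A / (2 * A + B + C)
print(jaccard, dice)          # Dice is never smaller than Jaccard
```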

Match coefficient

This is a measure of the similarity between two samples:

$M = \frac{N - B - C}{N}$

where N is the number of data points in the two samples and B and C are the data points found only in the first and second samples respectively.

Morisita's index

Morisita's index of dispersion ( Im ) is the scaled probability that two points chosen at random from the whole population are in the same sample.[62] Higher values indicate a more clumped distribution.

$I_m = n\,\frac{\sum x^2 - \sum x}{\left(\sum x\right)^2 - \sum x}$

An alternative formulation is

 

where n is the total sample size, m is the sample mean and x are the individual values with the sum taken over the whole sample. It is also equal to

 

where IMC is Lloyd's index of crowding.[63]

This index is relatively independent of the population density but is affected by the sample size.

Morisita showed that the statistic[62]

$I_m\left(\sum x - 1\right) + n - \sum x$

is distributed as a chi-squared variable with n − 1 degrees of freedom.

An alternative significance test for this index has been developed for large samples.[64]

 

where m is the overall sample mean, n is the number of sample units and z is the normal distribution abscissa. Significance is tested by comparing the value of z against the values of the normal distribution.

Morisita's overlap index

Morisita's overlap index is used to compare overlap among samples.[65] The index is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats:

$C_D = \frac{2\sum_{i=1}^S x_i y_i}{(D_x + D_y)\,XY}$
xi is the number of times species i is represented in the total X from one sample.
yi is the number of times species i is represented in the total Y from another sample.
Dx and Dy are the Simpson's index values for the x and y samples respectively.
S is the number of unique species.

CD = 0 if the two samples do not overlap in terms of species, and CD = 1 if the species occur in the same proportions in both samples.

Horn introduced a modification of the index:[66]

$C_H = \frac{2\sum_{i=1}^S x_i y_i}{\left(\frac{\sum_{i=1}^S x_i^2}{X^2} + \frac{\sum_{i=1}^S y_i^2}{Y^2}\right) XY}$
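A Python sketch (species counts assumed; Dx and Dy computed with the small-sample Simpson form, an assumption consistent with the definitions above) evaluates both versions:

```python
# Sketch with assumed species counts in two samples.
x = [10, 20, 30]
y = [30, 20, 10]
X, Y = sum(x), sum(y)

# Simpson's index values for each sample (small-sample form assumed)
Dx = sum(xi * (xi - 1) for xi in x) / (X * (X - 1))
Dy = sum(yi * (yi - 1) for yi in y) / (Y * (Y - 1))

cross = sum(xi * yi for xi, yi in zip(x, y))
CD = 2 * cross / ((Dx + Dy) * X * Y)                      # Morisita
CH = 2 * cross / ((sum(xi**2 for xi in x) / X**2 +
                   sum(yi**2 for yi in y) / Y**2) * X * Y)  # Horn
print(CD, CH)
```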

Standardised Morisita's index

Smith-Gill developed a statistic based on Morisita's index which is independent of both sample size and population density and bounded by −1 and +1. This statistic is calculated as follows[67]

First determine Morisita's index ( Id ) in the usual fashion. Then let k be the number of units the population was sampled from. Calculate the two critical values

$M_u = \frac{\chi^2_{0.975} - k + \sum x_i}{\left(\sum x_i\right) - 1}$
$M_c = \frac{\chi^2_{0.025} - k + \sum x_i}{\left(\sum x_i\right) - 1}$

where χ2 is the chi square value for n − 1 degrees of freedom at the 97.5% and 2.5% levels of confidence.

The standardised index ( Ip ) is then calculated from one of the formulae below

When Id ≥ Mc > 1

$I_p = 0.5 + 0.5\left(\frac{I_d - M_c}{k - M_c}\right)$

When Mc > Id ≥ 1

$I_p = 0.5\left(\frac{I_d - 1}{M_c - 1}\right)$

When 1 > Id ≥ Mu

$I_p = -0.5\left(\frac{I_d - 1}{M_u - 1}\right)$

When 1 > Mu > Id

$I_p = -0.5 + 0.5\left(\frac{I_d - M_u}{M_u}\right)$

Ip ranges between +1 and −1 with 95% confidence intervals of ±0.5. Ip has the value of 0 if the pattern is random; if the pattern is uniform, Ip < 0 and if the pattern shows aggregation, Ip > 0.

Peet's evenness indices

These indices are a measure of evenness between samples.[68]

 
 

where I is an index of diversity, Imax and Imin are the maximum and minimum values of I between the samples being compared.

Loevinger's coefficient

Loevinger has suggested a coefficient H defined as follows:

 

where pmax and pmin are the maximum and minimum proportions in the sample.

Tversky index

The Tversky index [69] is an asymmetric measure that lies between 0 and 1.

For samples A and B the Tversky index (S) is

$S(A, B) = \frac{|A \cap B|}{|A \cap B| + \alpha|A - B| + \beta|B - A|}$

The values of α and β are arbitrary. Setting both α and β to 0.5 gives Dice's coefficient. Setting both to 1 gives Tanimoto's coefficient.
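A Python sketch (sets assumed) shows how the two special cases fall out of the Tversky formula:

```python
# Sketch: Tversky index for assumed sets.
def tversky(A, B, alpha, beta):
    inter = len(A & B)
    return inter / (inter + alpha * len(A - B) + beta * len(B - A))

A = {"a", "b", "c", "d"}
B = {"c", "d", "e"}
print(tversky(A, B, 0.5, 0.5))   # alpha = beta = 0.5: Dice's coefficient
print(tversky(A, B, 1.0, 1.0))   # alpha = beta = 1: Tanimoto/Jaccard
```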

A symmetrical variant of this index has also been proposed.[70]

 

where

 
 

Several similar indices have been proposed.

Monostori et al. proposed the SymmetricSimilarity index[71]

 

where d(X) is some measure derived from X.

Bernstein and Zobel have proposed the S2 and S3 indexes[72]

 
 

S3 is simply twice the SymmetricSimilarity index. Both are related to Dice's coefficient.

Metrics used

A number of metrics (distances between samples) have been proposed.

Euclidean distance

While this is usually used in quantitative work it may also be used in qualitative work. This is defined as

$d_{jk} = \sqrt{\sum_{i=1}^N (x_{ij} - x_{ik})^2}$

where djk is the distance between xij and xik.

Gower's distance

This is defined as

$d = \frac{\sum_{i=1}^K w_i d_i}{\sum_{i=1}^K w_i}$

where di is the distance between the ith samples and wi is the weighting given to the ith distance.

Manhattan distance

While this is more commonly used in quantitative work it may also be used in qualitative work. This is defined as

$d_{jk} = \sum_{i=1}^N |x_{ij} - x_{ik}|$

where djk is the distance between xij and xik and || is the absolute value of the difference between xij and xik.

A modified version of the Manhattan distance can be used to find a zero (root) of a polynomial of any degree using Lill's method.

Prevosti's distance

This is related to the Manhattan distance. It was described by Prevosti et al. and was used to compare differences between chromosomes.[73] Let P and Q be two collections of r finite probability distributions. Let these distributions have values that are divided into k categories. Then the distance DPQ is

$D_{PQ} = \frac{1}{2r} \sum_{j=1}^r \sum_{i=1}^{k_j} |p_{ji} - q_{ji}|$

where r is the number of discrete probability distributions in each population, kj is the number of categories in distributions Pj and Qj and pji (respectively qji) is the theoretical probability of category i in distribution Pj (Qj) in population P(Q).

Its statistical properties were examined by Sanchez et al.[74] who recommended a bootstrap procedure to estimate confidence intervals when testing for differences between samples.

Other metrics

Let

 
 
 

where min(x,y) is the lesser value of the pair x and y.

Then

 

is the Manhattan distance,

 

is the Bray−Curtis distance,

 

is the Jaccard (or Ruzicka) distance and

 

is the Kulczynski distance.

Similarities between texts

HaCohen-Kerner et al. have proposed a variety of metrics for comparing two or more texts.[75]

Ordinal data

If the categories are at least ordinal then a number of other indices may be computed.

Leik's D

Leik's measure of dispersion (D) is one such index.[76] Let there be K categories and let pi be fi/N where fi is the number in the ith category and let the categories be arranged in ascending order. Let

$c_a = \sum_{i=1}^a p_i$

where $1 \le a \le K$. Let $d_a = c_a$ if $c_a \le 0.5$ and $d_a = 1 - c_a$ otherwise. Then

 

Normalised Herfindahl measure

This is the square of the coefficient of variation divided by N − 1 where N is the sample size.

$v = \frac{s^2 / m^2}{N - 1}$

where m is the mean and s is the standard deviation.

Potential-for-conflict Index

The potential-for-conflict Index (PCI) describes the ratio of scoring on either side of a rating scale's centre point.[77] This index requires at least ordinal data. This ratio is often displayed as a bubble graph.

The PCI uses an ordinal scale with an odd number of rating points (−n to +n) centred at 0. It is calculated as follows

 

where Z = 2n, |·| is the absolute value (modulus), $r_+$ is the number of responses on the positive side of the scale, $r_-$ is the number of responses on the negative side of the scale, $X_+$ are the responses on the positive side of the scale, $X_-$ are the responses on the negative side of the scale and

 

Theoretical difficulties are known to exist with the PCI. The PCI can be computed only for scales with a neutral center point and an equal number of response options on either side of it. Also a uniform distribution of responses does not always yield the midpoint of the PCI statistic but rather varies with the number of possible responses or values in the scale. For example, five-, seven- and nine-point scales with a uniform distribution of responses give PCIs of 0.60, 0.57 and 0.50 respectively.

The first of these problems is relatively minor, as most ordinal scales with an even number of response options can be extended (or reduced) by a single value to give an odd number of possible responses. Scales can usually be recentred if this is required. The second problem is more difficult to resolve and may limit the PCI's applicability.

The PCI has been extended[78]

qualitative, variation, this, article, multiple, issues, please, help, improve, discuss, these, issues, talk, page, learn, when, remove, these, template, messages, this, article, needs, additional, citations, verification, please, help, improve, this, article,. This article has multiple issues Please help improve it or discuss these issues on the talk page Learn how and when to remove these template messages This article needs additional citations for verification Please help improve this article by adding citations to reliable sources Unsourced material may be challenged and removed Find sources Qualitative variation news newspapers books scholar JSTOR April 2024 Learn how and when to remove this message This article may contain an excessive amount of intricate detail that may interest only a particular audience Please help by spinning off or relocating any relevant information and removing excessive detail that may be against Wikipedia s inclusion policy April 2024 Learn how and when to remove this message This article needs attention from an expert in Statistics Please add a reason or a talk parameter to this template to explain the issue with the article WikiProject Statistics may be able to help recruit an expert April 2024 Learn how and when to remove this message An index of qualitative variation IQV is a measure of statistical dispersion in nominal distributions Examples include the variation ratio or the information entropy Contents 1 Properties 2 Wilcox s indexes 2 1 ModVR 2 2 RanVR 2 3 AvDev 2 4 MNDif 2 5 VarNC 2 6 StDev 2 7 HRel 2 8 B index 2 9 R packages 3 Gibb s indices and related formulae 3 1 M1 3 2 M2 3 3 M4 3 4 M6 3 5 Related indices 4 Other single sample indices 4 1 Berger Parker index 4 2 Brillouin index of diversity 4 3 Hill s diversity numbers 4 4 Margalef s index 4 5 Menhinick s index 4 6 Q statistic 4 7 Shannon Wiener index 4 8 Renyi entropy 4 9 McIntosh s D and E 4 10 Fisher s alpha 4 11 Strong s index 4 12 Simpson s E 4 13 Smith amp Wilson s indices 4 14 Heip s index 4 15 Camargo s index 4 16 Smith and Wilson s B 4 17 Nee Harvey and Cotgreave s index 4 18 Bulla s E 4 19 Horn s information theory index 4 20 Rarefaction index 4 21 Caswell s V 4 22 Lloyd amp Ghelardi s index 4 23 Average taxonomic distinctness index 4 24 Index of qualitative variation 4 25 Theil s H 5 Indices for comparison of two or more data types within a single sample 5 1 Index of dissimilarity 5 2 Index of segregation 5 3 Hutchen s square root index 5 4 Lieberson s isolation index 5 5 Bell s index 5 6 Index of isolation 5 7 Gorard s index of segregation 5 8 Index of exposure 5 9 Ochiai index 5 10 Kulczynski s coefficient 5 11 Yule s Q 5 12 Yule s Y 5 13 Baroni Urbani Buser coefficient 5 14 Hamman coefficient 5 15 Rogers Tanimoto coefficient 5 16 Sokal Sneath coefficient 5 17 Sokal s binary distance 5 18 Russel Rao coefficient 5 19 Phi coefficient 5 20 Soergel s coefficient 5 21 Simpson s coefficient 5 22 Dennis coefficient 5 23 Forbes coefficient 5 24 Simple match coefficient 5 25 Fossum s coefficient 5 26 Stile s coefficient 5 27 Michael s coefficient 5 28 Peirce s coefficient 5 29 Hawkin Dotson coefficient 5 30 Benini coefficient 5 31 Gilbert coefficient 5 32 Gini index 5 33 Modified Gini index 5 34 Kuhn s index 5 35 Eyraud index 5 36 Soergel distance 5 37 Tanimoto index 5 38 Piatetsky Shapiro s index 6 Indices for comparison between two or more samples 6 1 Czekanowski s quantitative index 6 2 Canberra metric 6 3 Sorensen s coefficient of community 6 4 Jaccard s index 6 5 Dice s index 6 6 
Match coefficient 6 7 Morisita s index 6 8 Morisita s overlap index 6 9 Standardised Morisita s index 6 10 Peet s evenness indices 6 11 Loevinger s coefficient 6 12 Tversky index 7 Metrics used 7 1 Euclidean distance 7 2 Gower s distance 7 3 Manhattan distance 7 4 Prevosti s distance 7 5 Other metrics 7 6 Similarities between texts 8 Ordinal data 8 1 Leik s D 8 2 Normalised Herfindahl measure 8 3 Potential for conflict Index 8 4 van der Eijk s A 9 Related statistics 9 1 Birthday problem 9 2 Birthday death day problem 9 3 Rand index 9 4 Adjusted Rand index 9 4 1 The contingency table 9 4 2 Definition 10 Evaluation of indices 11 See also 12 Notes 13 ReferencesProperties editThere are several types of indices used for the analysis of nominal data Several are standard statistics that are used elsewhere range standard deviation variance mean deviation coefficient of variation median absolute deviation interquartile range and quartile deviation In addition to these several statistics have been developed with nominal data in mind A number have been summarized and devised by Wilcox Wilcox 1967 Wilcox 1973 who requires the following standardization properties to be satisfied Variation varies between 0 and 1 Variation is 0 if and only if all cases belong to a single category Variation is 1 if and only if cases are evenly divided across all categories 1 In particular the value of these standardized indices does not depend on the number of categories or number of samples For any index the closer to uniform the distribution the larger the variance and the larger the differences in frequencies across categories the smaller the variance Indices of qualitative variation are then analogous to information entropy which is minimized when all cases belong to a single category and maximized in a uniform distribution Indeed information entropy can be used as an index of qualitative variation One characterization of a particular index of qualitative variation IQV is as a ratio of observed differences to maximum differences Wilcox s indexes editWilcox gives a number of formulae for various indices of QV Wilcox 1973 the first which he designates DM for Deviation from the Mode is a standardized form of the variation ratio and is analogous to variance as deviation from the mean ModVR edit The formula for the variation around the mode ModVR is derived as follows M i 1 K f m f i displaystyle M sum i 1 K f m f i nbsp where fm is the modal frequency K is the number of categories and fi is the frequency of the ith group This can be simplified to M K f m N displaystyle M Kf m N nbsp where N is the total size of the sample Freeman s index or variation ratio is 2 v 1 f m N displaystyle v 1 frac f m N nbsp This is related to M as follows f m N 1 K N K K 1 N M N K 1 displaystyle frac frac f m N frac 1 K frac N K frac K 1 N frac M N K 1 nbsp The ModVR is defined as ModVR 1 K f m N N K 1 K N f m N K 1 K v K 1 displaystyle operatorname ModVR 1 frac Kf m N N K 1 frac K N f m N K 1 frac Kv K 1 nbsp where v is Freeman s index Low values of ModVR correspond to small amount of variation and high values to larger amounts of variation When K is large ModVR is approximately equal to Freeman s index v RanVR edit This is based on the range around the mode It is defined to be RanVR 1 f m f l f m f l f m displaystyle operatorname RanVR 1 frac f m f l f m frac f l f m nbsp where fm is the modal frequency and fl is the lowest frequency AvDev edit This is an analog of the mean deviation It is defined as the arithmetic mean of the absolute 
differences of each value from the mean AvDev 1 1 2 N K K 1 i 1 K f i N K displaystyle operatorname AvDev 1 frac 1 2N frac K K 1 sum i 1 K left f i frac N K right nbsp MNDif edit This is an analog of the mean difference the average of the differences of all the possible pairs of variate values taken regardless of sign The mean difference differs from the mean and standard deviation because it is dependent on the spread of the variate values among themselves and not on the deviations from some central value 3 MNDif 1 1 N K 1 i 1 K 1 j i 1 K f i f j displaystyle operatorname MNDif 1 frac 1 N K 1 sum i 1 K 1 sum j i 1 K f i f j nbsp where fi and fj are the ith and jth frequencies respectively The MNDif is the Gini coefficient applied to qualitative data VarNC edit This is an analog of the variance VarNC 1 1 N 2 K K 1 f i N K 2 displaystyle operatorname VarNC 1 frac 1 N 2 frac K K 1 sum left f i frac N K right 2 nbsp It is the same index as Mueller and Schussler s Index of Qualitative Variation 4 and Gibbs M2 index It is distributed as a chi square variable with K 1 degrees of freedom 5 StDev edit Wilson has suggested two versions of this statistic The first is based on AvDev StDev 1 1 i 1 K f i N K 2 N N K 2 K 1 N K 2 displaystyle operatorname StDev 1 1 sqrt frac sum i 1 K left f i frac N K right 2 left N frac N K right 2 K 1 left frac N K right 2 nbsp The second is based on MNDif StDev 2 1 i 1 K 1 j i 1 K f i f j 2 N 2 K 1 displaystyle operatorname StDev 2 1 sqrt frac sum i 1 K 1 sum j i 1 K f i f j 2 N 2 K 1 nbsp HRel edit This index was originally developed by Claude Shannon for use in specifying the properties of communication channels HRel p i log 2 p i log 2 K displaystyle operatorname HRel frac sum p i log 2 p i log 2 K nbsp where pi fi N This is equivalent to information entropy divided by the log 2 K displaystyle log 2 K nbsp and is useful for comparing relative variation between frequency tables of multiple sizes B index edit Wilcox adapted a proposal of Kaiser 6 based on the geometric mean and created the B index The B index is defined as B 1 1 i 1 k f i K N k 2 displaystyle B 1 sqrt 1 left sqrt k prod i 1 k frac f i K N right 2 nbsp R packages edit Several of these indices have been implemented in the R language 7 Gibb s indices and related formulae editGibbs amp Poston Jr 1975 proposed six indexes 8 M1 edit The unstandardized index M1 Gibbs amp Poston Jr 1975 p 471 is M 1 1 i 1 K p i 2 displaystyle M1 1 sum i 1 K p i 2 nbsp where K is the number of categories and p i f i N displaystyle p i f i N nbsp is the proportion of observations that fall in a given category i M1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category 9 so this formula for IQV is a standardized likelihood of a random pair falling in the same category This index has also referred to as the index of differentiation the index of sustenance differentiation and the geographical differentiation index depending on the context it has been used in M2 edit A second index is the M2 10 Gibbs amp Poston Jr 1975 p 472 is M 2 K K 1 1 i 1 K p i 2 displaystyle M2 frac K K 1 left 1 sum i 1 K p i 2 right nbsp where K is the number of categories and p i f i N displaystyle p i f i N nbsp is the proportion of observations that fall in a given category i The factor of K K 1 displaystyle frac K K 1 nbsp is for standardization M1 and M2 can be interpreted in terms of variance of a multinomial distribution Swanson 1976 there called an expanded binomial model M1 is the variance of 
the multinomial distribution and M2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution M4 edit The M4 index is M 4 i 1 K X i m 2 i 1 K X i displaystyle M4 frac sum i 1 K X i m 2 sum i 1 K X i nbsp where m is the mean M6 edit The formula for M6 is M 6 K 1 i 1 K X i m 2 N displaystyle M6 K left 1 frac sum i 1 K X i m 2N right nbsp where K is the number of categories Xi is the number of data points in the ith category N is the total number of data points is the absolute value modulus and m i 1 K X i N displaystyle m frac sum i 1 K X i N nbsp This formula can be simplified M 6 K 1 i 1 K p i 1 N 2 displaystyle M6 K left 1 frac sum i 1 K left p i frac 1 N right 2 right nbsp where pi is the proportion of the sample in the ith category In practice M1 and M6 tend to be highly correlated which militates against their combined use Related indices edit The sum i 1 K p i 2 displaystyle sum i 1 K p i 2 nbsp has also found application This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl Hirschman index HHI in economics A variant of this is known as the Hunter Gaston index in microbiology 11 In linguistics and cryptanalysis this sum is known as the repeat rate The incidence of coincidence IC is an unbiased estimator of this statistic 12 IC f i f i 1 n n 1 displaystyle operatorname IC sum frac f i f i 1 n n 1 nbsp where fi is the count of the ith grapheme in the text and n is the total number of graphemes in the text M1 The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names These include Gini s index of mutability 13 Simpson s measure of diversity 14 Bachi s index of linguistic homogeneity 15 Mueller and Schuessler s index of qualitative variation 16 Gibbs and Martin s index of industry diversification 17 Lieberson s index 18 and Blau s index in sociology psychology and management studies 19 The formulation of all these indices are identical Simpson s D is defined as D 1 i 1 K n i n i 1 n n 1 displaystyle D 1 sum i 1 K frac n i n i 1 n n 1 nbsp where n is the total sample size and ni is the number of items in the ith category For large n we have u 1 i 1 K p i 2 displaystyle u sim 1 sum i 1 K p i 2 nbsp Another statistic that has been proposed is the coefficient of unalikeability which ranges between 0 and 1 20 u c x y n 2 n displaystyle u frac c x y n 2 n nbsp where n is the sample size and c x y 1 if x and y are alike and 0 otherwise For large n we have u 1 i 1 K p i 2 displaystyle u sim 1 sum i 1 K p i 2 nbsp where K is the number of categories Another related statistic is the quadratic entropy H 2 2 1 i 1 K p i 2 displaystyle H 2 2 left 1 sum i 1 K p i 2 right nbsp which is itself related to the Gini index M2 Greenberg s monolingual non weighted index of linguistic diversity 21 is the M2 statistic defined above M7 Another index the M7 was created based on the M4 index of Gibbs amp Poston Jr 1975 22 M 7 i 1 K j 1 L R i R 2 R i displaystyle M7 frac sum i 1 K sum j 1 L R i R 2 sum R i nbsp where R i j O i j E i j O i j n i p j displaystyle R ij frac O ij E ij frac O ij n i p j nbsp and R i 1 K j 1 L R i j i 1 K n i displaystyle R frac sum i 1 K sum j 1 L R ij sum i 1 K n i nbsp where K is the number of categories L is the number of subtypes Oij and Eij are the number observed and expected respectively of subtype j in the ith category ni is the number in the ith category and pj is the proportion of subtype j in the complete sample Note This index was 
designed to measure women s participation in the work place the two subtypes it was developed for were male and female Other single sample indices editThese indices are summary statistics of the variation within the sample Berger Parker index edit The Berger Parker index equals the maximum p i displaystyle p i nbsp value in the dataset i e the proportional abundance of the most abundant type 23 This corresponds to the weighted generalized mean of the p i displaystyle p i nbsp values when q approaches infinity and hence equals the inverse of true diversity of order infinity 1 D Brillouin index of diversity edit This index is strictly applicable only to entire populations rather than to finite samples It is defined as I B log N i 1 K log n i N displaystyle I B frac log N sum i 1 K log n i N nbsp where N is total number of individuals in the population ni is the number of individuals in the ith category and N is the factorial of N Brillouin s index of evenness is defined as E B I B I B max displaystyle E B I B I B max nbsp where IB max is the maximum value of IB Hill s diversity numbers edit Hill suggested a family of diversity numbers 24 N a 1 i 1 K p i a a 1 displaystyle N a frac 1 left sum i 1 K p i a right a 1 nbsp For given values of a several of the other indices can be computed a 0 Na species richness a 1 Na Shannon s index a 2 Na 1 Simpson s index without the small sample correction a 3 Na 1 Berger Parker index Hill also suggested a family of evenness measures E a b N a N b displaystyle E a b frac N a N b nbsp where a gt b Hill s E4 is E 4 N 2 N 1 displaystyle E 4 frac N 2 N 1 nbsp Hill s E5 is E 5 N 2 1 N 1 1 displaystyle E 5 frac N 2 1 N 1 1 nbsp Margalef s index edit I Marg S 1 log e N displaystyle I text Marg frac S 1 log e N nbsp where S is the number of data types in the sample and N is the total size of the sample 25 Menhinick s index edit I M e n S N displaystyle I mathrm Men frac S sqrt N nbsp where S is the number of data types in the sample and N is the total size of the sample 26 In linguistics this index is the identical with the Kuraszkiewicz index Guiard index where S is the number of distinct words types and N is the total number of words tokens in the text being examined 27 28 This index can be derived as a special case of the Generalised Torquist function 29 Q statistic edit This is a statistic invented by Kempton and Taylor 30 and involves the quartiles of the sample It is defined as Q 1 2 n R 1 n R 2 j R 1 1 R 2 1 n j log R 2 R 1 displaystyle Q frac frac 1 2 n R1 n R2 sum j R 1 1 R 2 1 n j log R 2 R 1 nbsp where R1 and R2 are the 25 and 75 quartiles respectively on the cumulative species curve nj is the number of species in the jth category nRi is the number of species in the class where Ri falls i 1 or 2 Shannon Wiener index edit This is taken from information theory H log e N 1 N n i p i log p i displaystyle H log e N frac 1 N sum n i p i log p i nbsp where N is the total number in the sample and pi is the proportion in the ith category In ecology where this index is commonly used H usually lies between 1 5 and 3 5 and only rarely exceeds 4 0 An approximate formula for the standard deviation SD of H is SD H 1 N p i log e p i 2 H 2 displaystyle operatorname SD H frac 1 N left sum p i log e p i 2 H 2 right nbsp where pi is the proportion made up by the ith category and N is the total in the sample A more accurate approximate value of the variance of H var H is given by 31 var H p i log p i 2 p i log p i 2 N K 1 2 N 2 1 p i 2 p i 1 log p i p i 1 p i log p i 6 N 3 
A more accurate approximate value for the variance of H, var(H), is given by[31]

\operatorname{var}(H) = \frac{\sum p_i (\log p_i)^2 - \left(\sum p_i \log p_i\right)^2}{N} + \frac{K - 1}{2N^2} + \frac{-1 + \sum p_i^{-1} - \sum p_i^{-1} \log p_i + \sum p_i^{-1} \sum p_i \log p_i}{6N^3}

where N is the sample size and K is the number of categories.

A related index is Pielou's J, defined as

J = \frac{H}{\log_e S}

One difficulty with this index is that S is unknown for a finite sample. In practice S is usually set to the maximum number of categories present in the sample.

Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to values of q other than unity. It can be expressed as

{}^qH = \frac{1}{1 - q} \ln\left(\sum_{i=1}^K p_i^q\right)

which equals

{}^qH = \ln\left(\frac{1}{\sqrt[q-1]{\sum_{i=1}^K p_i p_i^{q-1}}}\right) = \ln({}^qD)

This means that taking the logarithm of the true diversity based on any value of q gives the Rényi entropy corresponding to the same value of q. The value of {}^qD is also known as the Hill number.[24]

McIntosh's D and E

McIntosh proposed a measure of diversity:[32]

I = \sqrt{\sum_{i=1}^K n_i^2}

where n_i is the number in the ith category and K is the number of categories.

He also proposed several normalized versions of this index. First is D:

D = \frac{N - I}{N - \sqrt{N}}

where N is the total sample size.

This index has the advantage of expressing the observed diversity as a proportion of the absolute maximum diversity at a given N.

Another proposed normalization is E, the ratio of observed diversity to the maximum possible diversity at a given N and K (i.e. if all species were equal in number of individuals):

E = \frac{N - I}{N - \frac{N}{\sqrt{K}}}

Fisher's alpha

This was the first index to be derived for diversity:[33]

K = \alpha \ln\left(1 + \frac{N}{\alpha}\right)

where K is the number of categories and N is the number of data points in the sample. Fisher's α has to be estimated numerically from the data.

The expected number of individuals in the rth category, where the categories have been placed in increasing size, is

\operatorname{E}(n_r) = \alpha \frac{X^r}{r}

where X is an empirical parameter lying between 0 and 1. While X is best estimated numerically, an approximate value can be obtained by solving the following two equations:

N = \frac{\alpha X}{1 - X}

K = -\alpha \ln(1 - X)

where K is the number of categories and N is the total sample size.

The variance of α is approximately[34]

\operatorname{var}(\alpha) = \frac{\alpha}{\ln\left(\frac{X}{1 - X}\right)}

Strong's index

This index (D_w) is the distance between the Lorenz curve of the species distribution and the 45 degree line. It is closely related to the Gini coefficient.[35] In symbols it is

D_w = \max_i\left[\frac{c_i}{K} - \frac{i}{N}\right]

where the maximum is taken over the N data points, K is the number of categories (or species) in the data set and c_i is the cumulative total up to and including the ith category.

Simpson's E

This is related to Simpson's D and is defined as

E = \frac{1/D}{K}

where D is Simpson's D and K is the number of categories in the sample.

Smith & Wilson's indices

Smith and Wilson suggested a number of indices based on Simpson's D:

E_1 = \frac{1 - D}{1 - \frac{1}{K}}

E_2 = \frac{\log_e D}{\log_e K}

where D is Simpson's D and K is the number of categories.

Heip's index

E = \frac{e^H - 1}{K - 1}

where H is the Shannon entropy and K is the number of categories.

This index is closely related to Sheldon's index, which is

E = \frac{e^H}{K}

where H is the Shannon entropy and K is the number of categories.
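The identity ln(^qD) = ^qH linking the Hill numbers to the Rényi entropy is easy to verify numerically. A minimal sketch, assuming a list of strictly positive proportions summing to 1 (the names are illustrative):

```python
import math

def hill_number(p, q):
    """Hill number (true diversity) of order q; the q = 1 case is taken
    as the limit exp(H), where H is the Shannon entropy."""
    if abs(q - 1) < 1e-9:
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))

def renyi_entropy(p, q):
    """Renyi entropy of order q: the log of the Hill number of the same order."""
    return math.log(hill_number(p, q))

p = [0.4, 0.3, 0.2, 0.1]
for q in (0, 1, 2):
    # q = 0 gives species richness (4); q = 2 gives 1/Simpson's index.
    print(q, hill_number(p, q), renyi_entropy(p, q))
```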
Camargo's index

This index was created by Camargo in 1993:[36]

E = 1 - \sum_{i=1}^K \sum_{j=i+1}^K \frac{|p_i - p_j|}{K}

where K is the number of categories and p_i is the proportion in the ith category.

Smith and Wilson's B

This index was proposed by Smith and Wilson in 1996:[37]

B = 1 - \frac{2}{\pi}\arctan(\theta)

where θ is the slope of the log(abundance)-rank curve.

Nee, Harvey and Cotgreave's index

This is the slope of the log(abundance)-rank curve.

Bulla's E

There are two versions of this index, one for continuous distributions (E_c) and the other for discrete distributions (E_d):[38]

E_c = \frac{O - \frac{1}{K}}{1 - \frac{1}{K}}

E_d = \frac{O - \frac{1}{K} - \frac{K - 1}{N}}{1 - \frac{1}{K} - \frac{K - 1}{N}}

where

O = 1 - \frac{1}{2}\sum_{i=1}^K \left|p_i - \frac{1}{K}\right|

is the Schoener-Czekanoski index, K is the number of categories and N is the sample size.

Horn's information theory index

This index (R_{ik}) is based on Shannon's entropy.[39] It is defined as

R_{ik} = \frac{H_{\max} - H_{\mathrm{obs}}}{H_{\max} - H_{\min}}

where

X = \sum x_{ij}, \quad Y = \sum x_{kj}

H(X) = \sum \frac{x_{ij}}{X} \log \frac{X}{x_{ij}}

H(Y) = \sum \frac{x_{kj}}{Y} \log \frac{Y}{x_{kj}}

H_{\min} = \frac{X}{X + Y} H(X) + \frac{Y}{X + Y} H(Y)

H_{\max} = \sum \left[\frac{x_{ij}}{X + Y} \log \frac{X + Y}{x_{ij}} + \frac{x_{kj}}{X + Y} \log \frac{X + Y}{x_{kj}}\right]

H_{\mathrm{obs}} = \sum \frac{x_{ij} + x_{kj}}{X + Y} \log \frac{X + Y}{x_{ij} + x_{kj}}

In these equations x_{ij} and x_{kj} are the number of times the jth data type appears in the ith or kth sample respectively.

Rarefaction index

In a rarefied sample a random subsample n is chosen from the total N items. In this sample some groups may necessarily be absent. Let X_n be the number of groups still present in the subsample of n items. X_n is less than K, the number of categories, whenever at least one group is missing from the subsample.

The rarefaction curve, f(n), is defined as

f(n) = \operatorname{E}[X_n] = K - \binom{N}{n}^{-1} \sum_{i=1}^K \binom{N - N_i}{n}

Note that 0 ≤ f(n) ≤ K. Furthermore,

f(0) = 0, \quad f(1) = 1, \quad f(N) = K

Despite being defined at discrete values of n, these curves are most frequently displayed as continuous functions.[40]

This index is discussed further in Rarefaction (ecology).
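The rarefaction curve can be evaluated exactly with binomial coefficients. A minimal sketch (the abundance vector is made up for the example):

```python
from math import comb

def rarefaction(counts, n):
    """Expected number of groups in a random subsample of n items:
    f(n) = K - C(N, n)^(-1) * sum_i C(N - N_i, n)."""
    N, K = sum(counts), len(counts)
    return K - sum(comb(N - Ni, n) for Ni in counts) / comb(N, n)

counts = [50, 30, 15, 4, 1]   # category abundances, N = 100, K = 5
for n in (0, 1, 10, 50, 100):
    print(n, round(rarefaction(counts, n), 3))   # f(0) = 0, f(1) = 1, f(N) = K
```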
Caswell's V

This is a z-type statistic based on Shannon's entropy:[41]

V = \frac{H - \operatorname{E}(H)}{\operatorname{SD}(H)}

where H is the Shannon entropy, E(H) is the expected Shannon entropy under a neutral model of distribution and SD(H) is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou:

\operatorname{SD}(H) = \left[\frac{\sum p_i (\log_e p_i)^2 - H^2}{N}\right]^{1/2}

where p_i is the proportion made up by the ith category and N is the total in the sample.

Lloyd & Ghelardi's index

This is

I_{LG} = \frac{K'}{K}

where K is the number of categories and K′ is the number of categories that, according to MacArthur's broken stick model, would yield the observed diversity.

Average taxonomic distinctness index

This index is used to compare the relationship between hosts and their parasites.[42] It incorporates information about the phylogenetic relationship amongst the host species:

S_{TD} = 2\frac{\sum\sum_{i<j} \omega_{ij}}{s(s - 1)}

where s is the number of host species used by a parasite and ω_{ij} is the taxonomic distinctness between host species i and j.

Index of qualitative variation

Several indices with this name have been proposed. One of these is

\operatorname{IQV} = \frac{K\left(100^2 - \sum_{i=1}^K p_i^2\right)}{100^2(K - 1)} = \frac{K}{K - 1}\left[1 - \sum_{i=1}^K \left(\frac{p_i}{100}\right)^2\right]

where K is the number of categories and p_i is the percentage of the sample that lies in the ith category.

Theil's H

This index is also known as the multigroup entropy index or the information theory index. It was proposed by Theil in 1972.[43] The index is a weighted average of the sample entropies. Let

E_a = -\sum_{i=1}^a p_i \log p_i

and

H = \sum_{i=1}^r \frac{n_i(E - E_i)}{NE}

where p_i is the proportion of type i in the ath sample, r is the total number of samples, n_i is the size of the ith sample, N is the size of the population from which the samples were obtained and E is the entropy of the population.

Indices for comparison of two or more data types within a single sample

Several of these indices have been developed to document the degree to which different data types of interest may coexist within a geographic area.

Index of dissimilarity

Let A and B be two types of data item. Then the index of dissimilarity is

D = \frac{1}{2}\sum_{i=1}^K \left|\frac{A_i}{A} - \frac{B_i}{B}\right|

where

A = \sum_{i=1}^K A_i, \quad B = \sum_{i=1}^K B_i

A_i is the number of data type A at sample site i, B_i is the number of data type B at sample site i, K is the number of sites sampled and | | is the absolute value.

This index is probably better known as the index of dissimilarity (D).[44] It is closely related to the Gini index.

This index is biased, as its expectation under a uniform distribution is greater than 0.

A modification of this index has been proposed by Gorard and Taylor.[45] Their index (GT) is

GT = D\left(1 - \frac{A}{A + B}\right)

Index of segregation

The index of segregation (SI)[46] is

SI = \frac{1}{2}\sum_{i=1}^K \left|\frac{A_i}{A} - \frac{t_i - A_i}{T - A}\right|

where

A = \sum_{i=1}^K A_i, \quad T = \sum_{i=1}^K t_i

and K is the number of units, A_i is the number of data type A in unit i and t_i is the total number of all data types in unit i.

Hutchens' square root index

This index (H) is defined as[47]

H = 1 - \sum_{i=1}^K \sum_{j<i} \sqrt{p_i p_j}

where p_i is the proportion of the sample composed of the ith variate.

Lieberson's isolation index

(See also: isolation index.) This index (L_{xy}) was invented by Lieberson in 1981:[48]

L_{xy} = \frac{1}{N}\sum_{i=1}^K \frac{X_i Y_i}{X_{\mathrm{tot}}}

where X_i and Y_i are the variables of interest at the ith site, K is the number of sites examined and X_tot is the total number of variates of type X in the study.
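The index of dissimilarity and the index of segregation defined above share the same half-sum-of-absolute-differences shape. A minimal sketch, assuming parallel lists of per-site counts (all names and numbers are illustrative):

```python
def dissimilarity_index(A, B):
    """Index of dissimilarity: D = 1/2 * sum_i |A_i/A - B_i/B|."""
    At, Bt = sum(A), sum(B)
    return 0.5 * sum(abs(a / At - b / Bt) for a, b in zip(A, B))

def segregation_index(A, t):
    """Index of segregation: SI = 1/2 * sum_i |A_i/A - (t_i - A_i)/(T - A)|."""
    At, Tt = sum(A), sum(t)
    return 0.5 * sum(abs(a / At - (ti - a) / (Tt - At)) for a, ti in zip(A, t))

A = [10, 20, 30, 25, 15]            # counts of type A at five sites
B = [30, 25, 20, 15, 10]            # counts of type B at the same sites
t = [a + b for a, b in zip(A, B)]   # total counts per site
print(dissimilarity_index(A, B))    # 0 = identical distributions, 1 = complete separation
print(segregation_index(A, t))      # equals D here, since the units contain only A and B
```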
Bell's index

This index is defined as[49]

I_R = \frac{p_{xx} - p_x}{1 - p_x}

where p_x is the proportion of the sample made up of variates of type X and

p_{xx} = \frac{\sum_{i=1}^K x_i p_i}{N_x}

where N_x is the total number of variates of type X in the study, K is the number of samples in the study and x_i and p_i are the number of variates and the proportion of variates of type X respectively in the ith sample.

Index of isolation

The index of isolation is

II = \sum_{i=1}^K \frac{A_i}{A} \frac{A_i}{t_i}

where K is the number of units in the study and A_i and t_i are the number of units of type A and the number of all units in the ith sample.

A modified index of isolation has also been proposed:

MII = \frac{II - \frac{A}{T}}{1 - \frac{A}{T}}

The MII lies between 0 and 1.

Gorard's index of segregation

This index (GS) is defined as

GS = \frac{1}{2}\sum_{i=1}^K \left|\frac{A_i}{A} - \frac{t_i}{T}\right|

where

A = \sum_{i=1}^K A_i, \quad T = \sum_{i=1}^K t_i

and A_i and t_i are the number of data items of type A and the total number of items in the ith sample.

Index of exposure

This index is defined as

IE = \sum_{i=1}^K \frac{A_i}{A} \frac{B_i}{t_i}

where

A = \sum_{i=1}^K A_i

and A_i and B_i are the number of types A and B in the ith category and t_i is the total number of data points in the ith category.

Ochiai index

This is a binary form of the cosine index.[50] It is used to compare presence/absence data of two data types (here A and B). It is defined as

O = \frac{a}{\sqrt{(a + b)(a + c)}}

where a is the number of sample units where both A and B are found, b is the number of sample units where A but not B occurs and c is the number of sample units where type B is present but not type A.

Kulczynski's coefficient

This coefficient was invented by Stanisław Kulczyński in 1927[51] and is an index of association between two types (here A and B). It varies in value between 0 and 1. It is defined as

K = \frac{a}{2}\left(\frac{1}{a + b} + \frac{1}{a + c}\right)

where a is the number of sample units where type A and type B are both present, b is the number of sample units where type A but not type B is present and c is the number of sample units where type B is present but not type A.

Yule's Q

This index was invented by Yule in 1900.[52] It concerns the association of two different types (here A and B). It is defined as

Q = \frac{ad - bc}{ad + bc}

where a is the number of samples where types A and B are both present, b is the number where type A is present but not type B, c is the number where type B is present but not type A and d is the number where neither type A nor type B is present. Q varies in value between -1 and +1. In the ordinal case Q is known as the Goodman-Kruskal γ.

Because the denominator may potentially be zero, Leinhert and Sporer have recommended adding +1 to a, b, c and d.[53]

Yule's Y

This index is defined as

Y = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}

where a, b, c and d are the four 2 × 2 cell counts defined above.
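Yule's Q and Y, like the other 2 × 2 coefficients in this section, are one-liners once the four cell counts are known. A minimal sketch (the counts are hypothetical):

```python
import math

def yules_q(a, b, c, d):
    """Yule's Q = (ad - bc)/(ad + bc). Leinhert and Sporer's suggestion of
    adding 1 to each cell guards against a zero denominator."""
    return (a * d - b * c) / (a * d + b * c)

def yules_y(a, b, c, d):
    """Yule's Y replaces the products with their square roots,
    pulling the value towards 0 relative to Q."""
    sq_ad, sq_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (sq_ad - sq_bc) / (sq_ad + sq_bc)

# a: both present, b: A only, c: B only, d: neither
a, b, c, d = 30, 10, 5, 55
print(yules_q(a, b, c, d))   # about 0.94
print(yules_y(a, b, c, d))   # about 0.70
```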
Baroni-Urbani-Buser coefficient

This index was invented by Baroni-Urbani and Buser in 1976.[54] It varies between 0 and 1 in value. It is defined as

BUB = \frac{\sqrt{ad} + a}{\sqrt{ad} + a + b + c} = \frac{\sqrt{ad} + a}{N + \sqrt{ad} - d} = 1 - \frac{N - a - d}{N + \sqrt{ad} - d}

where a is the number of samples where types A and B are both present, b is the number where type A is present but not type B, c is the number where type B is present but not type A and d is the number where neither type A nor type B is present. N = a + b + c + d is the sample size.

When d = 0, this index is identical to the Jaccard index.

In the coefficients that follow, a, b, c, d and N retain these meanings.

Hamman coefficient

This coefficient is defined as

H = \frac{(a + d) - (b + c)}{a + b + c + d} = \frac{(a + d) - (b + c)}{N}

Rogers-Tanimoto coefficient

This coefficient is defined as

RT = \frac{a + d}{a + 2(b + c) + d} = \frac{a + d}{N + b + c}

Sokal-Sneath coefficient

This coefficient is defined as

SS = \frac{2(a + d)}{2(a + d) + b + c} = \frac{2(a + d)}{N + a + d}

Sokal's binary distance

This coefficient is defined as

SBD = \sqrt{\frac{b + c}{a + b + c + d}} = \sqrt{\frac{b + c}{N}}

Russel-Rao coefficient

This coefficient is defined as

RR = \frac{a}{a + b + c + d} = \frac{a}{N}

Phi coefficient

This coefficient is defined as

\varphi = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}

Soergel's coefficient

This coefficient is defined as

S = \frac{b + c}{b + c + d} = \frac{b + c}{N - a}

Simpson's coefficient

This coefficient is defined as

S = \frac{a}{a + \min(b, c)}
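Because all of these coefficients are functions of the same four counts, they can be computed together. A minimal sketch of a few of them (the dictionary keys and the counts are illustrative):

```python
import math

def binary_coefficients(a, b, c, d):
    """A selection of the 2x2 coefficients above; a: both present,
    b: A only, c: B only, d: neither, N = a + b + c + d."""
    N = a + b + c + d
    return {
        "Ochiai": a / math.sqrt((a + b) * (a + c)),
        "Kulczynski": (a / 2) * (1 / (a + b) + 1 / (a + c)),
        "Baroni-Urbani-Buser": (math.sqrt(a * d) + a) / (math.sqrt(a * d) + a + b + c),
        "Hamman": ((a + d) - (b + c)) / N,
        "Rogers-Tanimoto": (a + d) / (N + b + c),
        "Sokal-Sneath": 2 * (a + d) / (N + a + d),
        "Russel-Rao": a / N,
        "Phi": (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d)),
    }

for name, value in binary_coefficients(30, 10, 5, 55).items():
    print(f"{name}: {value:.3f}")
```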
Dennis coefficient

This coefficient is defined as

D = \frac{ad - bc}{\sqrt{(a + b + c + d)(a + b)(a + c)}} = \frac{ad - bc}{\sqrt{N(a + b)(a + c)}}

Forbes coefficient

This coefficient was proposed by Stephen Alfred Forbes in 1907.[55] It is defined as

F = \frac{aN}{(a + b)(a + c)}

where N = a + b + c + d.

A modification of this coefficient, which does not require knowledge of d, has been proposed by Alroy:[56]

F_A = \frac{a(n + \sqrt{n})}{a(n + \sqrt{n}) + \frac{3}{2}bc} = 1 - \frac{3bc}{2a(n + \sqrt{n}) + 3bc}

where n = a + b + c.

Simple match coefficient

This coefficient is defined as

SM = \frac{a + d}{a + b + c + d} = \frac{a + d}{N}

Fossum's coefficient

This coefficient is defined as

F = \frac{(a + b + c + d)\left(a - \frac{1}{2}\right)^2}{(a + b)(a + c)} = \frac{N\left(a - \frac{1}{2}\right)^2}{(a + b)(a + c)}

Stile's coefficient

This coefficient is defined as

S = \log\left(\frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a + b)(a + c)(b + d)(c + d)}\right)

where n = a + b + c + d and |ad - bc| is the modulus (absolute value) of the difference.

Michael's coefficient

This coefficient is defined as

M = \frac{4(ad - bc)}{(a + d)^2 + (b + c)^2}

Peirce's coefficient

In 1884 Charles Peirce suggested the following coefficient:[57]

P = \frac{ab + bc}{ab + 2bc + cd}

Hawkin-Dotson coefficient

In 1975 Hawkin and Dotson proposed the following coefficient:

HD = \frac{1}{2}\left(\frac{a}{a + b + c} + \frac{d}{b + c + d}\right) = \frac{1}{2}\left(\frac{a}{N - d} + \frac{d}{N - a}\right)
Benini coefficient

In 1901 Benini proposed the following coefficient:

B = \frac{a - (a + b)(a + c)}{a + \min(b, c) - (a + b)(a + c)}

where min(b, c) is the minimum of b and c.

Gilbert coefficient

Gilbert proposed the following coefficient:

G = \frac{a - (a + b)(a + c)}{a + b + c - (a + b)(a + c)} = \frac{a - (a + b)(a + c)}{N - (a + b)(a + c) - d}

Gini index

The Gini index is

G = \frac{a - (a + b)(a + c)}{\sqrt{\left(1 - (a + b)^2\right)\left(1 - (a + c)^2\right)}}

Modified Gini index

The modified Gini index is

G_M = \frac{a - (a + b)(a + c)}{1 - \frac{|b - c|}{2} - (a + b)(a + c)}

Kuhn's index

Kuhn proposed the following coefficient in 1965:

I = \frac{2(ad - bc)}{K(2a + b + c)} = \frac{2(ad - bc)}{K(N + a - d)}

where K is a normalizing parameter. This index is also known as the coefficient of arithmetic means.

Eyraud index

Eyraud proposed the following coefficient in 1936:

I = \frac{a - (a + b)(a + c)}{(a + b)(a + c)(b + d)(c + d)}

Soergel distance

This is defined as

\operatorname{SD} = \frac{b + c}{b + c + d} = \frac{b + c}{N - a}

(identical to Soergel's coefficient above).

Tanimoto index

This is defined as

TI = 1 - \frac{a}{b + c + d} = 1 - \frac{a}{N - a} = \frac{N - 2a}{N - a}

Piatetsky-Shapiro's index

This is defined as

PSI = a - bc
Indices for comparison between two or more samples

Czekanowski's quantitative index

This is also known as the Bray-Curtis index, Schoener's index, the least common percentage index, the index of affinity or the proportional similarity. It is related to the Sørensen similarity index:

CZI = \frac{2\sum \min(x_i, x_j)}{\sum (x_i + x_j)}

where x_i and x_j are the number of species in sites i and j respectively and the minimum is taken over the number of species in common between the two sites.

Canberra metric

The Canberra distance is a weighted version of the L1 metric. It was introduced in 1966[58] and refined in 1967[59] by G. N. Lance and W. T. Williams. It is used to define a distance between two vectors, here two sites with K categories within each site.

The Canberra distance d between vectors p and q in a K-dimensional real vector space is

d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^K \frac{|p_i - q_i|}{|p_i| + |q_i|}

where p_i and q_i are the values of the ith category of the two vectors.

Sørensen's coefficient of community

This is used to measure similarities between communities:

CC = \frac{2c}{s_1 + s_2}

where s_1 and s_2 are the number of species in communities 1 and 2 respectively and c is the number of species common to both areas.

Jaccard's index

This is a measure of the similarity between two samples:

J = \frac{A}{A + B + C}

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.

This index was invented in 1902 by the Swiss botanist Paul Jaccard.[60]

Under a random distribution the expected value of J is[61]

J = \frac{1}{A}\left(\frac{1}{A + B + C}\right)

The standard error of this index, under the assumption of a random distribution, is

\operatorname{SE}(J) = \sqrt{\frac{A(B + C)}{N(A + B + C)^3}}

where N is the total size of the sample.

Dice's index

This is a measure of the similarity between two samples:

D = \frac{2A}{2A + B + C}

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively. (See also Sørensen-Dice coefficient.)

Match coefficient

This is a measure of the similarity between two samples:

M = \frac{N - B - C}{N} = 1 - \frac{B + C}{N}

where N is the number of data points in the two samples and B and C are the data points found only in the first and second samples respectively.

Morisita's index

Morisita's index of dispersion (I_m) is the scaled probability that two points chosen at random from the whole population are in the same sample.[62] Higher values indicate a more clumped distribution:

I_m = \frac{\sum x(x - 1)}{nm(m - 1)}

An alternative formulation is

I_m = n\frac{\sum x^2 - \sum x}{\left(\sum x\right)^2 - \sum x}

where n is the total sample size, m is the sample mean and x are the individual values, with the sums taken over the whole sample. It is also equal to

I_m = \frac{n \cdot \mathrm{IMC}}{nm - 1}

where IMC is Lloyd's index of crowding.[63]

This index is relatively independent of the population density but is affected by the sample size.

Morisita showed that the statistic[62]

I_m\left(\sum x - 1\right) + n - \sum x

is distributed as a chi-squared variable with n - 1 degrees of freedom.
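A minimal sketch of Jaccard's index, Dice's index and the Czekanowski (Bray-Curtis) similarity (all names and data are illustrative; the first two take the shared and unique counts A, B, C directly):

```python
def jaccard(A, B, C):
    """Jaccard: shared items over shared plus unique; A shared, B and C unique."""
    return A / (A + B + C)

def dice(A, B, C):
    """Dice doubles the weight on the shared count."""
    return 2 * A / (2 * A + B + C)

def czekanowski(x, y):
    """Quantitative (Bray-Curtis) similarity between two abundance vectors."""
    return 2 * sum(min(xi, yi) for xi, yi in zip(x, y)) / (sum(x) + sum(y))

print(jaccard(8, 3, 2))                           # about 0.615
print(dice(8, 3, 2))                              # about 0.762
print(czekanowski([10, 0, 5, 3], [8, 2, 5, 0]))   # 2*13/33, about 0.788
```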
An alternative significance test for this index has been developed for large samples:[64]

z = \frac{I_m - 1}{\sqrt{2/(nm^2)}}

where m is the overall sample mean, n is the number of sample units and z is the normal distribution abscissa. Significance is tested by comparing the value of z against the values of the normal distribution.

Morisita's overlap index

Morisita's overlap index is used to compare overlap among samples.[65] The index is based on the assumption that increasing the size of the samples will increase the diversity, because it will include different habitats:

C_D = \frac{2\sum_{i=1}^S x_i y_i}{(D_x + D_y)XY}

where x_i is the number of times species i is represented in the total X from one sample, y_i is the number of times species i is represented in the total Y from another sample, D_x and D_y are the Simpson's index values for the x and y samples respectively and S is the number of unique species.

C_D = 0 if the two samples do not overlap in terms of species, and C_D = 1 if the species occur in the same proportions in both samples.

Horn introduced a modification of the index:[66]

C_H = \frac{2\sum_{i=1}^S x_i y_i}{\left(\frac{\sum_{i=1}^S x_i^2}{X^2} + \frac{\sum_{i=1}^S y_i^2}{Y^2}\right)XY}

Standardised Morisita's index

Smith-Gill developed a statistic based on Morisita's index which is independent of both sample size and population density and is bounded by -1 and +1. This statistic is calculated as follows.[67]

First determine Morisita's index (I_d) in the usual fashion. Then let k be the number of units the population was sampled from. Calculate the two critical values

M_u = \frac{\chi^2_{0.975} - k + \sum x}{\sum x - 1}

M_c = \frac{\chi^2_{0.025} - k + \sum x}{\sum x - 1}

where χ²₀.₉₇₅ and χ²₀.₀₂₅ are the chi-square values for k - 1 degrees of freedom with 97.5% and 2.5% of the area in the right tail respectively (so that M_u < 1 < M_c).

The standardised index (I_p) is then calculated from one of the formulae below:

When I_d ≥ M_c > 1:

I_p = 0.5 + 0.5\left(\frac{I_d - M_c}{k - M_c}\right)

When M_c > I_d ≥ 1:

I_p = 0.5\left(\frac{I_d - 1}{M_c - 1}\right)

When 1 > I_d ≥ M_u:

I_p = -0.5\left(\frac{I_d - 1}{M_u - 1}\right)

When 1 > M_u > I_d:

I_p = -0.5 + 0.5\left(\frac{I_d - M_u}{M_u}\right)

I_p ranges between -1 and +1 with 95% confidence intervals of ±0.5. I_p has the value of 0 if the pattern is random; if the pattern is uniform, I_p < 0, and if the pattern shows aggregation, I_p > 0.

Peet's evenness indices

These indices are a measure of evenness between samples:[68]

E_1 = \frac{I - I_{\min}}{I_{\max} - I_{\min}}

E_2 = \frac{I}{I_{\max}}

where I is an index of diversity and I_max and I_min are the maximum and minimum values of I between the samples being compared.

Loevinger's coefficient

Loevinger has suggested a coefficient H defined as follows:

H = \sqrt{\frac{p_{\max}(1 - p_{\min})}{p_{\min}(1 - p_{\max})}}

where p_max and p_min are the maximum and minimum proportions in the sample.

Tversky index

The Tversky index[69] is an asymmetric measure that lies between 0 and 1. For samples A and B, the Tversky index (S) is

S = \frac{|A \cap B|}{|A \cap B| + \alpha|A - B| + \beta|B - A|}

The values of α and β are arbitrary. Setting both α and β to 0.5 gives Dice's coefficient; setting both to 1 gives Tanimoto's coefficient.
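A sketch of the standardised Morisita workflow described above, assuming quadrat counts and using scipy.stats.chi2 for the critical values (the data and the SciPy dependency are assumptions of this example; note that "97.5% of the area in the right tail" corresponds to the lower 2.5% quantile):

```python
from scipy.stats import chi2   # assumed dependency for the chi-square quantiles

def morisita(xs):
    """Morisita's index of dispersion: n(sum x^2 - sum x)/((sum x)^2 - sum x)."""
    n, s = len(xs), sum(xs)
    return n * (sum(x * x for x in xs) - s) / (s * s - s)

def standardised_morisita(xs):
    """Smith-Gill standardised index, bounded in [-1, 1]; a sketch of the
    four-case formula given above. xs are counts per sampling unit."""
    k, s = len(xs), sum(xs)
    Id = morisita(xs)
    # Right-tail convention: 97.5% to the right = ppf(0.025), and vice versa.
    Mu = (chi2.ppf(0.025, k - 1) - k + s) / (s - 1)
    Mc = (chi2.ppf(0.975, k - 1) - k + s) / (s - 1)
    if Id >= Mc > 1:
        return 0.5 + 0.5 * (Id - Mc) / (k - Mc)
    if Mc > Id >= 1:
        return 0.5 * (Id - 1) / (Mc - 1)
    if 1 > Id >= Mu:
        return -0.5 * (Id - 1) / (Mu - 1)
    return -0.5 + 0.5 * (Id - Mu) / Mu

quadrats = [12, 3, 0, 7, 1, 9, 0, 2]    # made-up counts per quadrat
print(morisita(quadrats))               # about 1.81: clumped
print(standardised_morisita(quadrats))  # > 0: aggregation
```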
A symmetrical variant of this index has also been proposed:[70]

S_1 = \frac{|A \cap B|}{|A \cap B| + \beta\left(\alpha a + (1 - \alpha)b\right)}

where

a = \min(|X - Y|, |Y - X|), \quad b = \max(|X - Y|, |Y - X|)

Several similar indices have been proposed.

Monostori et al. proposed the SymmetricSimilarity index:[71]

SS(A, B) = \frac{|d(A) \cap d(B)|}{|d(A)| + |d(B)|}

where d(X) is some measure derived from X.

Bernstein and Zobel have proposed the S2 and S3 indexes:[72]

S2 = \frac{|d(A) \cap d(B)|}{\min(|d(A)|, |d(B)|)}

S3 = \frac{2|d(A) \cap d(B)|}{|d(A)| + |d(B)|}

S3 is simply twice the SymmetricSimilarity index. Both are related to Dice's coefficient.

Metrics used

A number of metrics (distances between samples) have been proposed.

Euclidean distance

While this is usually used in quantitative work, it may also be used in qualitative work. It is defined as

d_{jk} = \sqrt{\sum_{i=1}^N (x_{ij} - x_{ik})^2}

where d_{jk} is the distance between samples j and k.

Gower's distance

This is defined as

GD = \frac{\sum_{i=1}^n w_i d_i}{\sum_{i=1}^n w_i}

where d_i is the distance between the ith samples and w_i is the weighting given to the ith distance.

Manhattan distance

While this is more commonly used in quantitative work, it may also be used in qualitative work. It is defined as

d_{jk} = \sum_{i=1}^N |x_{ij} - x_{ik}|

where d_{jk} is the distance between samples j and k and | | is the absolute value of the difference between x_{ij} and x_{ik}.

A modified version of the Manhattan distance can be used to find a zero (root) of a polynomial of any degree using Lill's method.

Prevosti's distance

This is related to the Manhattan distance. It was described by Prevosti et al. and was used to compare differences between chromosomes.[73] Let P and Q be two collections of r finite probability distributions. Let the values of these distributions be divided into k categories. Then the distance D_PQ is

D_{PQ} = \frac{1}{r}\sum_{j=1}^r \sum_{i=1}^{k_j} |p_{ji} - q_{ji}|

where r is the number of discrete probability distributions in each population, k_j is the number of categories in distributions P_j and Q_j and p_{ji} (respectively q_{ji}) is the theoretical probability of category i in distribution P_j (Q_j) in population P (Q).

Its statistical properties were examined by Sanchez et al.,[74] who recommended a bootstrap procedure to estimate confidence intervals when testing for differences between samples.

Other metrics

Let

A = \sum x_{ij}, \quad B = \sum x_{ik}, \quad J = \sum \min(x_{ij}, x_{ik})

where min(x, y) is the lesser value of the pair x and y. Then

d_{jk} = A + B - 2J

is the Manhattan distance,

d_{jk} = \frac{A + B - 2J}{A + B}

is the Bray-Curtis distance,

d_{jk} = \frac{A + B - 2J}{A + B - J}

is the Jaccard (or Ruzicka) distance and

d_{jk} = 1 - \frac{1}{2}\left(\frac{J}{A} + \frac{J}{B}\right)

is the Kulczynski distance.
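The four "other metrics" above differ only in how the quantities A, B and J are combined, as a short sketch makes clear (the data are illustrative, and the counts are assumed non-negative):

```python
def abj(x, y):
    """A = sum of one vector, B = sum of the other, J = sum of coordinate-wise minima."""
    return sum(x), sum(y), sum(min(xi, yi) for xi, yi in zip(x, y))

def manhattan(x, y):
    A, B, J = abj(x, y)
    return A + B - 2 * J    # equals sum |x_i - y_i| for non-negative vectors

def bray_curtis(x, y):
    A, B, J = abj(x, y)
    return (A + B - 2 * J) / (A + B)

def jaccard_distance(x, y):
    A, B, J = abj(x, y)
    return (A + B - 2 * J) / (A + B - J)

def kulczynski_distance(x, y):
    A, B, J = abj(x, y)
    return 1 - 0.5 * (J / A + J / B)

x, y = [10, 0, 5, 3], [8, 2, 5, 0]
for f in (manhattan, bray_curtis, jaccard_distance, kulczynski_distance):
    print(f.__name__, round(f(x, y), 3))
```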
Similarities between texts

HaCohen-Kerner et al. have proposed a variety of metrics for comparing two or more texts.[75]

Ordinal data

If the categories are at least ordinal then a number of other indices may be computed.

Leik's D

Leik's measure of dispersion (D) is one such index.[76] Let there be K categories, let p_i be f_i/N, where f_i is the number in the ith category, and let the categories be arranged in ascending order. Let

c_a = \sum_{i=1}^a p_i

where a ≤ K. Let d_a = c_a if c_a ≤ 0.5 and 1 - c_a otherwise. Then

D = 2\sum_{a=1}^K \frac{d_a}{K - 1}

Normalised Herfindahl measure

This is the square of the coefficient of variation divided by N - 1, where N is the sample size:

H = \frac{1}{N - 1}\frac{s^2}{m^2}

where m is the mean and s is the standard deviation.

Potential for conflict index

The potential for conflict index (PCI) describes the ratio of scoring on either side of a rating scale's centre point.[77] This index requires at least ordinal data. The ratio is often displayed as a bubble graph.

The PCI uses an ordinal scale with an odd number of rating points (-n to +n) centred at 0. It is calculated as follows:

\operatorname{PCI} = \frac{X_t}{Z}\left[1 - \left|\frac{\sum_{i=1}^{r_+} X_+}{X_t} - \frac{\sum_{i=1}^{r_-} |X_-|}{X_t}\right|\right]

where Z = 2n, | | is the absolute value (modulus), r_+ is the number of responses on the positive side of the scale, r_- is the number of responses on the negative side of the scale, X_+ are the responses on the positive side of the scale, X_- are the responses on the negative side of the scale and

X_t = \sum_{i=1}^{r_+} |X_+| + \sum_{i=1}^{r_-} |X_-|

Theoretical difficulties are known to exist with the PCI. The PCI can be computed only for scales with a neutral centre point and an equal number of response options on either side of it. Also, a uniform distribution of responses does not always yield the midpoint of the PCI statistic, but rather varies with the number of possible responses or values in the scale. For example, five-, seven- and nine-point scales with a uniform distribution of responses give PCIs of 0.60, 0.57 and 0.50 respectively. The first of these problems is relatively minor, as most ordinal scales with an even number of responses can be extended (or reduced) by a single value to give an odd number of possible responses, and the scale can usually be recentred if this is required. The second problem is more difficult to resolve and may limit the PCI's applicability.

The PCI has been extended to a second version, the PCI2.[78]
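A minimal sketch of Leik's D from the ordinal-data section above (the frequency vectors are illustrative); the two test cases exercise the boundary behaviour:

```python
def leik_d(freqs):
    """Leik's ordinal dispersion measure. freqs are category counts in
    ascending category order; cumulative proportions are folded at 0.5
    (d_a = c_a if c_a <= 0.5 else 1 - c_a) and the folded values are
    summed and normalised by (K - 1)/2."""
    N, K = sum(freqs), len(freqs)
    c, total = 0.0, 0.0
    for f in freqs:
        c += f / N
        total += c if c <= 0.5 else 1 - c
    return 2 * total / (K - 1)

print(leik_d([10, 0, 0, 0, 0]))   # 0.0: every case in one category
print(leik_d([5, 0, 0, 0, 5]))    # 1.0: perfect polarisation at the extremes
```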
