
Curse of dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming.[1][2] The curse generally refers to issues that arise when the number of datapoints is small (in a suitably defined sense) relative to the intrinsic dimension of the data.

Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. In order to obtain a reliable result, the amount of data needed often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.

Domains

Combinatorics

In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is also known as the combinatorial explosion. Even in the simplest case of $d$ binary variables, the number of possible combinations is already $2^d$, exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations.
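A few lines of Python make the doubling concrete (a throwaway sketch, not tied to any source):

```python
# Enumerating every assignment of d binary variables yields 2**d combinations,
# so the work doubles with each added variable.
from itertools import product

for d in range(1, 6):
    combinations = list(product([0, 1], repeat=d))
    print(f"d = {d}: {len(combinations)} combinations")  # 2, 4, 8, 16, 32
```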

Sampling

There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, $10^2 = 100$ evenly spaced sample points suffice to sample a unit interval (try to visualize a "1-dimensional" cube) with no more than $10^{-2} = 0.01$ distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of $10^{-2} = 0.01$ between adjacent points would require $10^{20} = (10^2)^{10}$ sample points. In general, with a spacing distance of $10^{-n}$ the 10-dimensional hypercube appears to be a factor of $10^{n(10-1)} = (10^n)^{10}/(10^n)$ "larger" than the 1-dimensional hypercube, which is the unit interval. In the above example $n = 2$: when using a sampling distance of 0.01 the 10-dimensional hypercube appears to be $10^{18}$ times "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below.
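The arithmetic can be written out directly (a small sketch using the article's numbers; the helper name is ours):

```python
# Lattice points needed to cover the unit hypercube at a given spacing grow as
# (points per axis) ** dimension.
def lattice_points(spacing: float, dim: int) -> int:
    points_per_axis = round(1.0 / spacing)   # e.g. 100 points at spacing 0.01
    return points_per_axis ** dim

print(lattice_points(0.01, 1))                                # 100
print(lattice_points(0.01, 10))                               # 100**10 = 10**20
print(lattice_points(0.01, 10) // lattice_points(0.01, 1))    # factor of 10**18
```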

Optimization

When solving dynamic optimization problems by numerical backward induction, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the "state variable" is large.[3]
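The obstacle can be made concrete with a toy sketch (illustrative only, not the formulation in [3]): discretize each state dimension into k grid points and run one backward-induction sweep per stage; every sweep must visit all k**n grid states, so the work grows exponentially in the state dimension n. The grid size, dynamics, and rewards below are invented for illustration.

```python
from itertools import product

def backward_induction(values_next, reward, transition, grid, actions=(-1, 0, 1)):
    """One stage of backward induction: V_t(s) = max_a [ r(s, a) + V_{t+1}(T(s, a)) ]."""
    return {
        s: max(reward(s, a) + values_next[transition(s, a)] for a in actions)
        for s in grid
    }

points_per_dim, n_dims, horizon = 10, 4, 5
grid = list(product(range(points_per_dim), repeat=n_dims))   # 10**4 states

def clip(v):
    return min(max(v, 0), points_per_dim - 1)

def transition(s, a):                                        # toy dynamics
    return tuple(clip(x + a) for x in s)

def reward(s, a):                                            # toy objective
    return -sum(s) - abs(a)

V = {s: 0.0 for s in grid}                                   # terminal values
for _ in range(horizon):
    V = backward_induction(V, reward, transition, grid)

print(f"{len(grid)} state evaluations per stage ({points_per_dim}**{n_dims}); "
      "doubling the state dimension squares the work")
```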

Machine learning

In machine learning problems that involve learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values. In an abstract sense, as the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.[4]

A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation.[5] In machine learning and insofar as predictive performance is concerned, the curse of dimensionality is used interchangeably with the peaking phenomenon,[5] which is also known as Hughes phenomenon.[6] This phenomenon states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased but beyond a certain dimensionality it starts deteriorating instead of improving steadily.[7][8][9]
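The peaking behaviour can be illustrated with a small simulation (a hedged sketch, not the analysis of the cited works): a nearest-mean classifier is trained on a fixed, small sample while features carrying progressively weaker signal are added; the test error typically falls at first and then climbs again. All distributions and sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_max = 20, 2000, 200
signal = 1.0 / (1.0 + np.arange(d_max))       # feature usefulness decays with index

def error_rate(d):
    mu = signal[:d]
    Xtr = np.vstack([rng.normal(+mu, 1.0, (n_train, d)),
                     rng.normal(-mu, 1.0, (n_train, d))])
    ytr = np.r_[np.ones(n_train), -np.ones(n_train)]
    Xte = np.vstack([rng.normal(+mu, 1.0, (n_test, d)),
                     rng.normal(-mu, 1.0, (n_test, d))])
    yte = np.r_[np.ones(n_test), -np.ones(n_test)]
    m_pos, m_neg = Xtr[ytr == 1].mean(0), Xtr[ytr == -1].mean(0)
    w, midpoint = m_pos - m_neg, (m_pos + m_neg) / 2   # nearest-mean decision rule
    pred = np.sign((Xte - midpoint) @ w)
    return float(np.mean(pred != yte))

for d in (1, 2, 5, 10, 50, 200):
    print(d, round(error_rate(d), 3))   # error typically dips, then rises again
```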

Nevertheless, in the context of a simple classifier (e.g., linear discriminant analysis in the multivariate Gaussian model under the assumption of a common known covariance matrix), Zollanvari et al. showed both analytically and empirically that as long as the relative cumulative efficacy of an additional feature set (with respect to features that are already part of the classifier) is greater (or less) than the size of this additional feature set, the expected error of the classifier constructed using these additional features will be less (or greater) than the expected error of the classifier constructed without them. In other words, both the size of additional features and their (relative) cumulative discriminatory effect are important in observing a decrease or increase in the average predictive power.[10]

In metric learning, higher dimensions can sometimes allow a model to achieve better performance. After normalizing embeddings to the surface of a hypersphere, FaceNet achieves the best performance using 128 dimensions as opposed to 64, 256, or 512 dimensions in one ablation study.[11] A loss function for unitary-invariant dissimilarity between word embeddings was found to be minimized in high dimensions.[12]

Data mining

Genetic mutations in individuals data set
Individual name Gene 1 Gene 2 ... Gene 2000
Individual 1 1 0 ... 1
... ... ... ... ...
Individual 200 0 1 ... 1

In data mining, the curse of dimensionality refers to a data set with too many features.

Consider the first table, which depicts 200 individuals and 2000 genes (features), with a 1 or 0 denoting whether or not an individual has a genetic mutation in that gene. A data mining application for this data set might be finding correlations between specific genetic mutations and building a classification algorithm, such as a decision tree, to determine whether an individual has cancer.

Growth of association pair permutations as pair size grows
Pair size (r)   Permutation calculation   Permutations per row
2   2000!/(2000 - 2)!   3,998,000
3   2000!/(2000 - 3)!   7,988,004,000
4   2000!/(2000 - 4)!   15,952,043,988,000
5   2000!/(2000 - 5)!   31,840,279,800,048,000

A common data mining practice in this domain would be to create association rules between genetic mutations that lead to the development of cancers. To do this, one would have to loop through each genetic mutation of each individual, find the other genetic mutations that co-occur above a desired threshold, and create pairs; one would start with pairs, then triples, then quadruples, and so on, until the candidate set becomes empty. The complexity of this algorithm can lead to calculating all permutations of gene tuples for each individual or row. Given that the formula for the number of permutations of $n$ items taken $r$ at a time is $\frac{n!}{(n-r)!}$, the number of permutations of three genes out of 2000 is $\frac{2000!}{1997!} = 7{,}988{,}004{,}000$ tuples to evaluate for each individual. The number of tuples grows factorially as the tuple size increases; the growth is depicted in the permutation table above, and a quick numerical check is sketched below.
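The table values follow directly from the permutation formula; a quick check (plain arithmetic, nothing domain-specific) using Python's standard library:

```python
# P(n, r) = n! / (n - r)! ordered selections of r genes out of n = 2000.
from math import perm   # Python 3.8+

for r in range(2, 6):
    print(r, perm(2000, r))
# 2 3998000
# 3 7988004000
# 4 15952043988000
# 5 31840279800048000
```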

As the permutation table shows, one of the major problems data miners face regarding the curse of dimensionality is that the space of possible parameter values grows exponentially or factorially as the number of features in the data set grows. This problem critically affects both the computational time and the space requirements when searching for associations or optimal features.

Another problem data miners may face when dealing with too many features is that the number of false predictions or classifications tends to increase as the number of features in the data set grows. In terms of the classification problem discussed above, keeping every data point could lead to a higher number of false positives and false negatives in the model.

This may seem counterintuitive, but consider the genetic mutation table above, which depicts all genetic mutations for each individual. Each genetic mutation, whether or not it correlates with cancer, will have some input or weight in the model that guides the decision-making process of the algorithm. There may be mutations that are outliers or that dominate the overall distribution of genetic mutations even though they do not correlate with cancer. These features may work against the model, making it more difficult to obtain optimal results.

This problem has no universal solution, and it is up to the data miner to address it. The first step is to explore the data in order to understand how it can be used to solve the problem: one must understand what the data mean, and what one is trying to discover, before deciding whether anything should be removed from the data set. A feature selection or dimensionality reduction algorithm can then be created or used to remove samples or features from the data set where necessary. One example of such methods is the interquartile range (IQR) method, used to remove outliers by discarding values of a feature that fall far outside the range between its first and third quartiles; a minimal sketch follows.
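The sketch below shows one common form of IQR filtering (a generic recipe with an invented helper name and the conventional 1.5 multiplier, not a method prescribed by the article's sources):

```python
import numpy as np

def iqr_filter(values, k: float = 1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]; drop everything else."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values >= q1 - k * iqr) & (values <= q3 + k * iqr)
    return values[mask]

print(iqr_filter([1, 2, 2, 3, 3, 3, 4, 4, 100]))  # drops the outlier 100
```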

Distance function

When a measure such as a Euclidean distance is defined using many coordinates, there is little difference in the distances between different pairs of points.

One way to illustrate the "vastness" of high-dimensional Euclidean space is to compare the proportion of an inscribed hypersphere with radius $r$ and dimension $d$ to that of a hypercube with edges of length $2r$. The volume of such a sphere is $\frac{2r^d \pi^{d/2}}{d\,\Gamma(d/2)}$, where $\Gamma$ is the gamma function, while the volume of the cube is $(2r)^d$. As the dimension $d$ of the space increases, the hypersphere becomes an insignificant volume relative to that of the hypercube. This can clearly be seen by comparing the proportions as the dimension $d$ goes to infinity:

$\frac{V_{\mathrm{hypersphere}}}{V_{\mathrm{hypercube}}} = \frac{\pi^{d/2}}{d\,2^{d-1}\,\Gamma(d/2)} \to 0$ as $d \to \infty$.

Furthermore, the distance between the center and the corners is $r\sqrt{d}$, which increases without bound for fixed $r$.
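The collapse of this volume ratio is easy to tabulate (a short numerical check of the standard formulas above; the function name is ours):

```python
from math import pi, gamma

def ball_to_cube_ratio(d: int) -> float:
    # V_sphere(r, d) / V_cube(2r, d) = pi**(d/2) / (d * 2**(d-1) * gamma(d/2))
    return pi ** (d / 2) / (d * 2 ** (d - 1) * gamma(d / 2))

for d in (2, 3, 5, 10, 20):
    print(d, ball_to_cube_ratio(d))
# d=2: pi/4 ~ 0.785; d=3: pi/6 ~ 0.524; d=10: ~ 0.0025; d=20: ~ 2.5e-8
```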

In this sense, when points are uniformly generated in a high-dimensional hypercube, almost all points are much farther than $r$ units away from the centre. In high dimensions, the volume of the d-dimensional unit hypercube (with coordinates of the vertices $\pm 1$) is concentrated near a sphere with radius $\sqrt{d}/\sqrt{3}$ for large dimension $d$. Indeed, for each coordinate $x_i$ the average value of $x_i^2$ in the cube is[13]

$\left\langle x_i^2 \right\rangle = \frac{1}{2}\int_{-1}^{1} x^2 \, dx = \frac{1}{3}$.

The variance of $x_i^2$ for the uniform distribution in the cube is

$\frac{1}{2}\int_{-1}^{1} x^4 \, dx - \left\langle x_i^2 \right\rangle^2 = \frac{1}{5} - \frac{1}{9} = \frac{4}{45}$.

Therefore, the squared distance from the origin, $r^2 = \sum_i x_i^2$, has average value $d/3$ and variance $4d/45$. For large $d$, the distribution of $r^2/d$ is close to the normal distribution with mean $1/3$ and standard deviation $\frac{2}{\sqrt{45d}}$, according to the central limit theorem. Thus, when uniformly generating points in high dimensions, both the "middle" of the hypercube and the corners are empty, and all the volume is concentrated near the surface of a sphere of "intermediate" radius $\sqrt{d/3}$.
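These statements are easy to verify numerically. The following Monte Carlo sketch (NumPy-based, not code from the cited source [13]) samples points uniformly from the cube $[-1,1]^d$ and checks that the squared norm concentrates around $d/3$ with variance $4d/45$, so the relative spread of $r^2/d$ shrinks as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20_000
for d in (3, 30, 300):
    x = rng.uniform(-1.0, 1.0, size=(n_samples, d))
    r2 = (x ** 2).sum(axis=1)                 # squared distance from the origin
    print(f"d={d:3d}  mean={r2.mean():7.2f} (d/3={d/3:7.2f})  "
          f"var={r2.var():6.2f} (4d/45={4 * d / 45:6.2f})  "
          f"std of r2/d={r2.std() / d:.4f}")   # ~ 2/sqrt(45d), shrinking with d
```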

This also helps to understand the chi-squared distribution. Indeed, the (non-central) chi-squared distribution associated to a random point in the interval [-1, 1] is the same as the distribution of the squared length of a random point in the d-cube. By the law of large numbers, this distribution concentrates in a narrow band around $d$ times the variance ($\sigma^2$) of the original distribution. This illuminates the chi-squared distribution and also illustrates that most of the volume of the d-cube concentrates near the boundary of a sphere of radius $\sigma\sqrt{d}$.

A further development of this phenomenon is as follows. Any fixed distribution on the real numbers induces a product distribution on points in $\mathbb{R}^d$. For any fixed $n$, it turns out that the difference between the minimum and the maximum distance between a random reference point $Q$ and a list of $n$ random data points $P_1, \dots, P_n$ becomes indiscernible compared to the minimum distance:[14]

$\lim_{d \to \infty} E\left[\frac{\operatorname{dist}_{\max}(d) - \operatorname{dist}_{\min}(d)}{\operatorname{dist}_{\min}(d)}\right] = 0$.

This is often cited as distance functions losing their usefulness (for the nearest-neighbor criterion in feature-comparison algorithms, for example) in high dimensions. However, recent research has shown this to hold only in the artificial scenario where the one-dimensional coordinate distributions are independent and identically distributed.[15] When attributes are correlated, the data can become easier to work with and can provide higher distance contrast; the signal-to-noise ratio was found to play an important role, so feature selection should be used.[15]
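The limit above is easy to observe empirically. The following sketch (illustrative, only loosely following the i.i.d.-uniform setting of [14], with arbitrarily chosen sizes) measures the relative gap between the farthest and nearest data point from random query points:

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, n_queries = 1000, 50
for d in (2, 10, 100, 1000):
    data = rng.uniform(size=(n_points, d))
    queries = rng.uniform(size=(n_queries, d))
    gaps = []
    for q in queries:
        dist = np.linalg.norm(data - q, axis=1)
        gaps.append((dist.max() - dist.min()) / dist.min())
    print(f"d={d:5d}  mean relative gap = {np.mean(gaps):.3f}")  # shrinks with d
```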

More recently, it has been suggested that there may be a conceptual flaw in the argument that contrast-loss creates a curse in high dimensions. Machine learning can be understood as the problem of assigning instances to their respective generative process of origin, with class labels acting as symbolic representations of individual generative processes. The curse's derivation assumes all instances are independent, identical outcomes of a single high dimensional generative process. If there is only one generative process, there would exist only one (naturally occurring) class and machine learning would be conceptually ill-defined in both high and low dimensions. Thus, the traditional argument that contrast-loss creates a curse, may be fundamentally inappropriate. In addition, it has been shown that when the generative model is modified to accommodate multiple generative processes, contrast-loss can morph from a curse to a blessing, as it ensures that the nearest-neighbor of an instance is almost-surely its most closely related instance. From this perspective, contrast-loss makes high dimensional distances especially meaningful and not especially non-meaningful as is often argued.[16]

Nearest neighbor search

The effect complicates nearest neighbor search in high dimensional space. It is not possible to quickly reject candidates by using the difference in one coordinate as a lower bound for a distance based on all the dimensions.[17][18]
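A rough illustration of why a single-coordinate difference stops being a useful lower bound (a sketch with arbitrary sizes, not a benchmark of any particular index structure): for uniform data, the one-coordinate gap is at most 1 while the nearest-neighbor distance grows on the order of $\sqrt{d}$, so the test "reject if the coordinate gap already exceeds the best distance found" almost never fires.

```python
import numpy as np

rng = np.random.default_rng(2)
n_points = 2000
for d in (2, 5, 20, 100):
    data = rng.uniform(size=(n_points, d))
    q = rng.uniform(size=d)
    dist = np.linalg.norm(data - q, axis=1)
    best = dist.min()                              # distance to the true nearest neighbor
    prunable = np.abs(data[:, 0] - q[0]) > best    # candidates rejectable by coordinate 0
    print(f"d={d:3d}  best={best:.3f}  prunable by one coordinate: {prunable.mean():.1%}")
```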

However, it has recently been observed that the mere number of dimensions does not necessarily result in difficulties,[19] since relevant additional dimensions can also increase the contrast. In addition, for the resulting ranking it remains useful to discern close and far neighbors. Irrelevant ("noise") dimensions, however, reduce the contrast in the manner described above. In time series analysis, where the data are inherently high-dimensional, distance functions also work reliably as long as the signal-to-noise ratio is high enough.[20]

k-nearest neighbor classification

Another effect of high dimensionality on distance functions concerns k-nearest neighbor (k-NN) graphs constructed from a data set using a distance function. As the dimension increases, the indegree distribution of the k-NN digraph becomes skewed with a peak on the right because of the emergence of a disproportionate number of hubs, that is, data-points that appear in many more k-NN lists of other data-points than the average. This phenomenon can have a considerable impact on various techniques for classification (including the k-NN classifier), semi-supervised learning, and clustering,[21] and it also affects information retrieval.[22]
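The emergence of hubs can be seen in a few lines (a sketch in the spirit of [21], with arbitrary sample sizes, not the authors' experiments): for standard normal data, the maximum in-degree of the k-NN digraph typically grows markedly with the dimension even though the mean in-degree is always k.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 1000, 10
for d in (3, 20, 100):
    x = rng.normal(size=(n, d))
    # pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b, self-distances excluded
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T
    np.fill_diagonal(d2, np.inf)
    knn = np.argsort(d2, axis=1)[:, :k]            # k nearest neighbors of each point
    indegree = np.bincount(knn.ravel(), minlength=n)
    print(f"d={d:3d}  max in-degree={indegree.max():3d}  (mean in-degree is always {k})")
```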

Anomaly detection

In a 2012 survey, Zimek et al. identified the following problems when searching for anomalies in high-dimensional data:[15]

  1. Concentration of scores and distances: derived values such as distances become numerically similar
  2. Irrelevant attributes: in high dimensional data, a significant number of attributes may be irrelevant
  3. Definition of reference sets: for local methods, reference sets are often nearest-neighbor based
  4. Incomparable scores for different dimensionalities: different subspaces produce incomparable scores
  5. Interpretability of scores: the scores often no longer convey a semantic meaning
  6. Exponential search space: the search space can no longer be systematically scanned
  7. Data snooping bias: given the large search space, for every desired significance a hypothesis can be found
  8. Hubness: certain objects occur more frequently in neighbor lists than others.

Many of the analyzed specialized methods tackle one or another of these problems, but there remain many open research questions.

Blessing of dimensionality

Surprisingly, and despite the expected "curse of dimensionality" difficulties, common-sense heuristics based on the most straightforward methods "can yield results which are almost surely optimal" for high-dimensional problems.[23] The term "blessing of dimensionality" was introduced in the late 1990s.[23] Donoho, in his "Millennium manifesto", clearly explained why the "blessing of dimensionality" will form a basis of future data mining.[24] The effects of the blessing of dimensionality were discovered in many applications and found their foundation in the concentration of measure phenomena.[25] One example of the blessing of dimensionality phenomenon is the linear separability of a random point from a large finite random set with high probability, even if this set is exponentially large: the number of elements in this random set can grow exponentially with dimension. Moreover, the separating linear functional can be selected in the form of the simplest linear Fisher discriminant. This separability theorem was proven for a wide class of probability distributions: general uniformly log-concave distributions, product distributions in a cube and many other families (reviewed recently in [25]).

"The blessing of dimensionality and the curse of dimensionality are two sides of the same coin."[26] For example, the typical property of essentially high-dimensional probability distributions in a high-dimensional space is: the squared distance of random points to a selected point is, with high probability, close to the average (or median) squared distance. This property significantly simplifies the expected geometry of data and indexing of high-dimensional data (blessing),[27] but, at the same time, it makes the similarity search in high dimensions difficult and even useless (curse).[28]

Zimek et al.[15] noted that while the typical formalizations of the curse of dimensionality affect i.i.d. data, data that are well separated in each attribute become easier to handle even in high dimensions, and argued that the signal-to-noise ratio matters: data become easier with each attribute that adds signal, and harder with attributes that only add noise (irrelevant error). In particular, for unsupervised data analysis this effect is known as swamping.

See also

  Bellman equation
  Clustering high-dimensional data
  Concentration of measure
  Dimension reduction
  Dynamic programming
  Fourier-related transforms
  Grand Tour
  Linear least squares
  Model order reduction
  Multilinear PCA
  Multilinear subspace learning
  Principal component analysis
  Singular value decomposition

References

  1. ^ Bellman, Richard Ernest; Rand Corporation (1957). Dynamic programming. Princeton University Press. p. ix. ISBN 978-0-691-07951-6.
    Republished: Bellman, Richard Ernest (2003). Dynamic Programming. Courier Dover Publications. ISBN 978-0-486-42809-3.
  2. ^ Bellman, Richard Ernest (1961). Adaptive control processes: a guided tour. Princeton University Press. ISBN 9780691079011.
  3. ^ Taylor, C. Robert (1993). "Dynamic Programming and the Curses of Dimensionality". Applications Of Dynamic Programming To Agricultural Decision Problems. Westview Press. pp. 1–10. ISBN 0-8133-8641-1.
  4. ^ Udacity (2015-02-23). "Curse of Dimensionality - Georgia Tech - Machine Learning". YouTube. Retrieved 2022-06-29.
  5. ^ a b Koutroumbas, Konstantinos; Theodoridis, Sergios (2008). Pattern Recognition (4th ed.). Burlington. ISBN 978-1-59749-272-0. Retrieved 2018-01-08.
  6. ^ Hughes, G.F. (January 1968). "On the mean accuracy of statistical pattern recognizers". IEEE Transactions on Information Theory. 14 (1): 55–63. doi:10.1109/TIT.1968.1054102. S2CID 206729491.
  7. ^ Trunk, G. V. (July 1979). "A Problem of Dimensionality: A Simple Example". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (3): 306–307. doi:10.1109/TPAMI.1979.4766926. PMID 21868861. S2CID 13086902.
  8. ^ B. Chandrasekaran; A. K. Jain (1974). "Quantization Complexity and Independent Measurements". IEEE Transactions on Computers. 23 (8): 102–106. doi:10.1109/T-C.1974.223789. S2CID 35360973.
  9. ^ McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley Interscience. ISBN 978-0-471-69115-0. MR 1190469.
  10. ^ Zollanvari, A.; James, A. P.; Sameni, R. (2020). "A Theoretical Analysis of the Peaking Phenomenon in Classification". Journal of Classification. 37 (2): 421–434. doi:10.1007/s00357-019-09327-3. S2CID 253851666.
  11. ^ Schroff, Florian; Kalenichenko, Dmitry; Philbin, James (June 2015). "FaceNet: A unified embedding for face recognition and clustering" (PDF). 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 815–823. arXiv:1503.03832. doi:10.1109/CVPR.2015.7298682. ISBN 978-1-4673-6964-0. S2CID 206592766.
  12. ^ Yin, Zi; Shen, Yuanyuan (2018). "On the Dimensionality of Word Embedding" (PDF). Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
  13. ^ Bailey, D.H.; Borwein, J.M.; Crandall, R.E. (2006), "Box integrals", Journal of Computational and Applied Mathematics, 206: 196–208, doi:10.1016/j.cam.2006.06.010, S2CID 2763194
  14. ^ Beyer, K.; Goldstein, J.; Ramakrishnan, R.; Shaft, U. (1999). "When is "Nearest Neighbor" Meaningful?". Database Theory — ICDT'99. LNCS. Vol. 1540. pp. 217–235. doi:10.1007/3-540-49257-7_15. ISBN 978-3-540-65452-0. S2CID 206634099.
  15. ^ a b c d Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining. 5 (5): 363–387. doi:10.1002/sam.11161. S2CID 6724536.
  16. ^ Lin, Wen-Yan; Liu, Siying; Ren, Changhao; Cheung, Ngai-Man; Li, Hongdong; Matsushita, Yasuyuki (2021). "Shell Theory: A Statistical Model of Reality". IEEE Transactions on Pattern Analysis and Machine Intelligence. 44 (10): 6438–6453. doi:10.1109/TPAMI.2021.3084598. ISSN 1939-3539. PMID 34048335. S2CID 235242104.
  17. ^ Marimont, R.B.; Shapiro, M.B. (1979). "Nearest Neighbour Searches and the Curse of Dimensionality". IMA J Appl Math. 24 (1): 59–70. doi:10.1093/imamat/24.1.59.
  18. ^ Chávez, Edgar; Navarro, Gonzalo; Baeza-Yates, Ricardo; Marroquín, José Luis (2001). "Searching in Metric Spaces". ACM Computing Surveys. 33 (3): 273–321. CiteSeerX 10.1.1.100.7845. doi:10.1145/502807.502808. S2CID 3201604.
  19. ^ Houle, M. E.; Kriegel, H. P.; Kröger, P.; Schubert, E.; Zimek, A. (2010). Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? (PDF). Scientific and Statistical Database Management. Lecture Notes in Computer Science. Vol. 6187. p. 482. doi:10.1007/978-3-642-13818-8_34. ISBN 978-3-642-13817-1.
  20. ^ Bernecker, T.; Houle, M. E.; Kriegel, H. P.; Kröger, P.; Renz, M.; Schubert, E.; Zimek, A. (2011). Quality of Similarity Rankings in Time Series. Symposium on Spatial and Temporal Databases. Lecture Notes in Computer Science. Vol. 6849. p. 422. doi:10.1007/978-3-642-22922-0_25. ISBN 978-3-642-22921-3.
  21. ^ Radovanović, Miloš; Nanopoulos, Alexandros; Ivanović, Mirjana (2010). "Hubs in space: Popular nearest neighbors in high-dimensional data" (PDF). Journal of Machine Learning Research. 11: 2487–2531.
  22. ^ Radovanović, M.; Nanopoulos, A.; Ivanović, M. (2010). On the existence of obstinate results in vector space models. 33rd international ACM SIGIR conference on Research and development in information retrieval - SIGIR '10. p. 186. doi:10.1145/1835449.1835482. ISBN 9781450301534.
  23. ^ a b Kainen, Paul C. (1997), "Utilizing Geometric Anomalies of High Dimension: When Complexity Makes Computation Easier", in Kárný, M.; Warwick, K. (eds.), Computer Intensive Methods in Control and Signal Processing, pp. 283–294, doi:10.1007/978-1-4612-1996-5_18, ISBN 978-1-4612-7373-8
  24. ^ Donoho, David L. (2000), "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", Invited lecture at Mathematical Challenges of the 21st Century, AMS National Meeting, Los Angeles, CA, USA, August 6-12, 2000, CiteSeerX 10.1.1.329.3392
  25. ^ a b Gorban, Alexander N.; Makarov, Valery A.; Tyukin, Ivan Y. (2020). "High-Dimensional Brain in a High-Dimensional World: Blessing of Dimensionality". Entropy. 22 (1): 82. arXiv:2001.04959. Bibcode:2020Entrp..22...82G. doi:10.3390/e22010082. PMC 7516518. PMID 33285855.
  26. ^ Gorban, Alexander N.; Tyukin, Ivan Y. (2018). "Blessing of dimensionality: mathematical foundations of the statistical physics of data". Phil. Trans. R. Soc. A. 376 (2118): 20170237. arXiv:1801.03421. Bibcode:2018RSPTA.37670237G. doi:10.1098/rsta.2017.0237. PMC 5869543. PMID 29555807.
  27. ^ Hecht-Nielsen, Robert (1994), "Context vectors: general-purpose approximate meaning representations self-organized from raw data", in Zurada, J.M.; Marks, R.J.; Robinson, C.J. (eds.), Computational intelligence: imitating life; Proceedings of World Congress on Computational Intelligence, Neural Networks; 1994; Orlando; FL, Piscataway, NJ: IEEE Press, pp. 43–56, ISBN 0780311043
  28. ^ Pestov, Vladimir (2013). "Is the k-NN classifier in high dimensions affected by the curse of dimensionality?". Comput. Math. Appl. 65 (10): 43–56. doi:10.1016/j.camwa.2012.09.011.
