
History of artificial neural networks

Linear neural network

The simplest kind of feedforward neural network is a linear network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each node. The mean squared error between these calculated outputs and given target values is minimized by adjusting the weights. This technique has been known for over two centuries as the method of least squares or linear regression. It was used as a means of finding a good rough linear fit to a set of points by Legendre (1805) and Gauss (1795) for the prediction of planetary movement.[1][2][3][4][5]
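
As a concrete illustration of the least-squares fit described above, the following short numpy sketch recovers the slope and intercept of a noisy line; the data and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)   # noisy points around a line

# A single layer of output nodes computing w*x + b: minimizing the mean squared
# error has the classical closed-form least-squares solution.
A = np.stack([x, np.ones_like(x)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(w, 2), round(b, 2))                      # close to 2.0 and 1.0
```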

Recurrent network architectures

Wilhelm Lenz and Ernst Ising created and analyzed the Ising model (1925)[6] which is essentially a non-learning artificial recurrent neural network (RNN) consisting of neuron-like threshold elements.[4] In 1972, Shun'ichi Amari made this architecture adaptive.[7][4] His learning RNN was popularised by John Hopfield in 1982.[8]

Perceptrons and other early neural networks

Warren McCulloch and Walter Pitts[9] (1943) also considered a non-learning computational model for neural networks.[10] This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence. This work led to work on nerve networks and their link to finite automata.[11]

In the late 1940s, D. O. Hebb[12] created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Hebbian learning is unsupervised learning. This evolved into models for long-term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. Farley and Clark[13] (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956).[14]

Rosenblatt[15] (1958) created the perceptron, an algorithm for pattern recognition. Using mathematical notation, Rosenblatt also described circuitry beyond the basic perceptron, such as the exclusive-or circuit, which could not be processed by neural networks at the time. In 1959, a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells.[16]

Some say that research stagnated following Minsky and Papert (1969),[17] who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks. However, by the time this book came out, methods for training multilayer perceptrons (MLPs) by deep learning were already known.[4]

First deep learning

The first deep learning MLP was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa in 1965, as the Group Method of Data Handling.[18][19][20] This method employs incremental layer-by-layer training based on regression analysis, in which useless units in hidden layers are pruned with the help of a validation set.
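
As a rough illustration only (a simplified sketch in the spirit of GMDH, not Ivakhnenko and Lapa's exact procedure), the following numpy code grows a network layer by layer: each layer fits quadratic units on pairs of inputs by least squares on training data and keeps only the units that generalize best on a held-out validation set. The data, layer width, and selection rule are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def quad_features(a, b):
    # Quadratic "partial description" of two inputs: 1, a, b, a^2, b^2, ab.
    return np.stack([np.ones_like(a), a, b, a * a, b * b, a * b], axis=1)

def fit_unit(a, b, y):
    # Least-squares (regression) fit of one candidate unit.
    coef, *_ = np.linalg.lstsq(quad_features(a, b), y, rcond=None)
    return coef

def unit_output(a, b, coef):
    return quad_features(a, b) @ coef

# Toy regression problem with four inputs.
X_train, X_val = rng.normal(size=(200, 4)), rng.normal(size=(100, 4))
target = lambda X: X[:, 0] * X[:, 1] + 0.5 * X[:, 2]
y_train = target(X_train) + 0.1 * rng.normal(size=200)
y_val = target(X_val)

layer_train, layer_val = X_train, X_val
for depth in range(3):                                # grow the network layer by layer
    candidates = []
    for i, j in itertools.combinations(range(layer_train.shape[1]), 2):
        coef = fit_unit(layer_train[:, i], layer_train[:, j], y_train)
        err = np.mean((unit_output(layer_val[:, i], layer_val[:, j], coef) - y_val) ** 2)
        candidates.append((err, i, j, coef))
    candidates.sort(key=lambda c: c[0])               # prune "useless" units ...
    keep = candidates[:4]                             # ... keeping the best few by validation error
    layer_train = np.stack([unit_output(layer_train[:, i], layer_train[:, j], c)
                            for _, i, j, c in keep], axis=1)
    layer_val = np.stack([unit_output(layer_val[:, i], layer_val[:, j], c)
                          for _, i, j, c in keep], axis=1)
    print(f"layer {depth + 1}: best validation MSE = {keep[0][0]:.4f}")
```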

The first deep learning MLP trained by stochastic gradient descent[21] was published in 1967 by Shun'ichi Amari.[22][4] In computer experiments conducted by Amari's student Saito, a five layer MLP with two modifiable layers learned useful internal representations to classify non-linearly separable pattern classes.[4]

Backpropagation

The backpropagation algorithm is an efficient application of the Leibniz chain rule (1673)[23] to networks of differentiable nodes.[4] It is also known as the reverse mode of automatic differentiation or reverse accumulation, due to Seppo Linnainmaa (1970).[24][25][26][27][4] The term "back-propagating errors" was introduced in 1962 by Frank Rosenblatt,[28][4] but he did not have an implementation of this procedure, although Henry J. Kelley had a continuous precursor of backpropagation[29] already in 1960 in the context of control theory.[4] In 1982, Paul Werbos applied backpropagation to MLPs in the way that has become standard.[30] In 1986, David E. Rumelhart et al. published an experimental analysis of the technique.[31]
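
A minimal sketch of what the reverse mode of automatic differentiation computes for a two-layer network with a squared-error loss, written in plain numpy; the data, sizes, and learning rate are illustrative, and the update shown is an ordinary full-batch gradient descent step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))                    # inputs
y = rng.normal(size=(32, 1))                    # targets
W1 = 0.1 * rng.normal(size=(3, 8))
W2 = 0.1 * rng.normal(size=(8, 1))

for step in range(500):
    # Forward pass.
    h = np.tanh(X @ W1)
    out = h @ W2
    loss = np.mean((out - y) ** 2)

    # Backward pass: the chain rule applied from the loss back to each weight matrix.
    d_out = 2 * (out - y) / len(X)              # dL/d(out)
    d_W2 = h.T @ d_out                          # dL/dW2
    d_h = d_out @ W2.T                          # dL/dh
    d_W1 = X.T @ (d_h * (1 - h ** 2))           # dL/dW1, through tanh's derivative

    # Gradient descent update.
    W2 -= 0.1 * d_W2
    W1 -= 0.1 * d_W1

print(round(float(loss), 4))
```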

Self-organizing maps

Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982.[32][33] SOMs are neurophysiologically inspired[34] artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.
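
The following numpy sketch shows competitive learning in a small SOM on toy three-dimensional data; the grid size, learning rate, and neighborhood schedule are illustrative choices rather than Kohonen's original settings.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(size=(1000, 3))                  # e.g. colors in [0, 1]^3
grid = 10
weights = rng.uniform(size=(grid, grid, 3))         # one prototype vector per map node
coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"), axis=-1)

for t, x in enumerate(data):
    lr = 0.5 * np.exp(-t / 500)                     # decaying learning rate
    sigma = 3.0 * np.exp(-t / 500)                  # shrinking neighborhood radius
    # Competitive step: find the best-matching unit (BMU).
    bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (grid, grid))
    # Cooperative step: pull the BMU and its map neighbors toward the input,
    # which preserves the topological structure of the data on the 2-D grid.
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
    weights += lr * h * (x - weights)
```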

SOMs create internal representations reminiscent of the cortical homunculus,[35] a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.

Support vector machines

Support vector machines, developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues (Boser et al., 1992; Isabelle Guyon et al., 1993; Cortes and Vapnik, 1995; Vapnik et al., 1997), and simpler methods such as linear classifiers gradually overtook neural networks.[citation needed] However, neural networks transformed domains such as the prediction of protein structures.[36][37]

Convolutional neural networks (CNNs)

The origin of the CNN architecture is the "neocognitron"[38] introduced by Kunihiko Fukushima in 1980.[39][40] It was inspired by work of Hubel and Wiesel in the 1950s and 1960s which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.
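
The two layer types can be sketched directly in numpy: a convolutional layer whose units share a single filter over patches of the previous layer, followed by a downsampling layer that spatially averages its patch, as in the neocognitron. The image and filter below are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    # Valid 2-D convolution: each output unit's receptive field is a patch of the input,
    # and all units share the same weight vector (the "filter").
    kh, kw = kernel.shape
    H, W = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def avg_pool(feature_map, size=2):
    # Downsampling by spatial averaging over non-overlapping patches.
    H, W = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:H * size, :W * size].reshape(H, size, W, size).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])  # an illustrative 2x2 filter
features = avg_pool(conv2d(image, edge_filter))
print(features.shape)                               # (3, 3)
```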

In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function.[41][4] The rectifier has become the most popular activation function for CNNs and deep neural networks in general.[42]

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance.[43] It did so by utilizing weight sharing in combination with backpropagation training.[44] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.[43]

In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.[45][46]

In 1989, Yann LeCun et al. trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.[47] Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang et al. modified their model by removing the last fully connected layer and applied it to medical image object segmentation in 1991[48] and breast cancer detection in mammograms in 1994.[49]

In 1990 Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling in order to realize a speaker independent isolated word recognition system.[50] In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.[51][52][53][54] Max-pooling is often used in modern CNNs.[55]
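
In contrast to the spatial averaging sketched earlier, a max-pooling downsampling unit propagates the largest activation in its patch, as in this small numpy example (the feature map is made up):

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Each downsampling unit outputs the maximum activation in its patch.
    H, W = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:H * size, :W * size].reshape(H, size, W, size).max(axis=(1, 3))

fm = np.array([[1.0, 3.0, 0.0, 2.0],
               [4.0, 2.0, 1.0, 5.0],
               [0.0, 1.0, 7.0, 0.0],
               [2.0, 0.0, 3.0, 6.0]])
print(max_pool(fm))        # [[4. 5.]
                           #  [2. 7.]]
```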

LeNet-5, a 7-level CNN by Yann LeCun et al. (1998)[56] that classifies digits, was applied by several banks to recognize hand-written numbers on checks (British English: cheques) digitized in 32x32 pixel images. Processing higher-resolution images requires larger and deeper CNNs, so this technique is constrained by the availability of computing resources.

In 2010, backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.[57] Behnke (2003) had relied only on the sign of the gradient (Rprop)[58] on problems such as image reconstruction and face localization. Rprop is a first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992.[59]

In 2011, a deep GPU-based CNN called "DanNet" by Dan Ciresan, Ueli Meier, and Juergen Schmidhuber achieved human-competitive performance for the first time in computer vision contests.[60] Subsequently, a similar GPU-based CNN by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge 2012.[61] A very deep CNN with over 100 layers by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft won the ImageNet 2015 contest.[62]

ANNs were able to guarantee shift invariance for dealing with small and large natural objects in large cluttered scenes only when invariance extended beyond shift to all ANN-learned concepts, such as location, type (object class label), scale, and lighting. This was realized in Developmental Networks (DNs),[63] whose embodiments are Where-What Networks, WWN-1 (2008)[64] through WWN-7 (2013).[65]

Artificial curiosity and generative adversarial networks

In 1991, Juergen Schmidhuber published adversarial neural networks that contest with each other in the form of a zero-sum game, where one network's gain is the other network's loss.[66][67][68] The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity." Earlier adversarial machine learning systems "neither involved unsupervised neural networks nor were about modeling data nor used gradient descent."[68]

In 2014, this adversarial principle was used in a generative adversarial network (GAN) by Ian Goodfellow et al.[69] Here the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set. This can be used to create realistic deepfakes.[70]
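
A minimal GAN training loop might look like the following PyTorch sketch on toy one-dimensional Gaussian "real" data; the architectures, hyperparameters, and data are illustrative assumptions, not those of Goodfellow et al.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0       # samples from the "real" distribution
    fake = G(torch.randn(64, 8))                # generated samples

    # Discriminator: output near 1 for real samples and near 0 for generated ones.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator (one network's gain is the other's loss).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# The generator's output mean should drift toward the real mean of 3.0.
print(G(torch.randn(1000, 8)).mean().item())
```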

In 1992, Schmidhuber also published another type of gradient-based adversarial neural network, in which the goal of the zero-sum game is to create disentangled representations of input patterns. This was called predictability minimization.[71][72]

Nvidia's StyleGAN (2018)[73] is based on the Progressive GAN by Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.[74] Here the GAN generator is grown from small to large scale in a pyramidal fashion. StyleGANs improve consistency between fine and coarse details in the generator network.

Transformers and their variants

Many modern large language models such as ChatGPT, GPT-4, and BERT use a feedforward neural network called the Transformer, introduced by Ashish Vaswani et al. in their 2017 paper "Attention Is All You Need."[75] Transformers have increasingly become the model of choice for natural language processing problems,[76] replacing recurrent neural networks (RNNs) such as long short-term memory (LSTM).[77]

Basic ideas for this go back a long way: in 1992, Juergen Schmidhuber published the Transformer with "linearized self-attention" (save for a normalization operator),[78] which is also called the "linear Transformer."[79][80][4] He advertised it as an "alternative to RNNs"[78] that can learn "internal spotlights of attention,"[81] and experimentally applied it to problems of variable binding.[78] Here a slow feedforward neural network learns by gradient descent to control the fast weights of another neural network through outer products of self-generated activation patterns called "FROM" and "TO" which in Transformer terminology are called "key" and "value" for "self-attention."[80] This fast weight "attention mapping" is applied to queries. The 2017 Transformer[75] combines this with a softmax operator and a projection matrix.[4]
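
The fast weight mechanism described above can be sketched in a few lines of numpy: keys and values program a fast weight matrix through outer products, and the resulting "attention mapping" is applied to queries. The dimensions and data below are made up, and the softmax and projection matrix of the 2017 Transformer are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 4, 3, 5                           # key dimension, value dimension, sequence length
keys    = rng.normal(size=(T, d_k))             # "FROM" patterns
values  = rng.normal(size=(T, d_v))             # "TO" patterns
queries = rng.normal(size=(T, d_k))

W = np.zeros((d_v, d_k))                        # fast weights, initially empty
outputs = []
for t in range(T):
    W += np.outer(values[t], keys[t])           # "write": rank-1 fast weight update
    outputs.append(W @ queries[t])              # "read": apply the attention mapping to the query
outputs = np.stack(outputs)

# Without normalization this equals unnormalized linear attention:
# output_t = sum over s <= t of (key_s . query_t) * value_s.
check = np.stack([sum((keys[s] @ queries[t]) * values[s] for s in range(t + 1)) for t in range(T)])
assert np.allclose(outputs, check)
```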

Transformers are also increasingly being used in computer vision.[82]

Deep learning with unsupervised or self-supervised pre-training

In the 1980s, backpropagation did not work well for deep FNNs and RNNs. Here the word "deep" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth.[83] The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For an FNN, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For RNNs, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.

To overcome this problem, Juergen Schmidhuber (1992) proposed a self-supervised hierarchy of RNNs pre-trained one level at a time by self-supervised learning.[84] This "neural history compressor" uses predictive coding to learn internal representations at multiple self-organizing time scales.[4] The deep architecture may be used to reproduce the original data from the top level feature activations.[84] The RNN hierarchy can be "collapsed" into a single RNN, by "distilling" a higher level "chunker" network into a lower level "automatizer" network.[84][4] In 1993, a chunker solved a deep learning task whose CAP depth exceeded 1000.[85] Such history compressors can substantially facilitate downstream supervised deep learning.[4]

Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine[86] to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.[87][88] In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.[89]
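
A minimal sketch of a binary restricted Boltzmann machine trained with one step of contrastive divergence (CD-1, a common training shortcut assumed here rather than taken from the cited papers); all sizes, rates, and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.1 * rng.normal(size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy binary data: two prototype patterns with a few bits flipped as noise.
prototypes = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = prototypes[rng.integers(0, 2, size=100)]
data = np.abs(data - (rng.uniform(size=data.shape) < 0.05))

for epoch in range(50):
    for v0 in data:
        # Positive phase: infer hidden units from the data.
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.uniform(size=n_hidden) < p_h0).astype(float)
        # Negative phase: one Gibbs step (reconstruct the visibles, re-infer the hiddens).
        p_v1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.uniform(size=n_visible) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b_h)
        # CD-1 update: data-driven statistics minus model-driven statistics.
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_v += lr * (v0 - v1)
        b_h += lr * (p_h0 - p_h1)
```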

The vanishing gradient problem and its solutions

Sepp Hochreiter's diploma thesis (1991)[90] was called "one of the most important documents in the history of machine learning" by his supervisor Juergen Schmidhuber.[4] Hochreiter not only tested the neural history compressor,[84] but also identified and analyzed the vanishing gradient problem.[90][91] He proposed recurrent residual connections to solve this problem. This led to the deep learning method called long short-term memory (LSTM), published in 1997.[92] LSTM recurrent neural networks can learn "very deep learning" tasks[83] with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The "vanilla LSTM" with forget gate was introduced in 1999 by Felix Gers, Schmidhuber and Fred Cummins.[93] LSTM has become the most cited neural network of the 20th century.[4]
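
One forward step of a "vanilla" LSTM cell with input, forget, and output gates can be sketched in numpy as follows; the additive cell update is the path that preserves gradients over long credit assignment paths. Weight shapes and initialization are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One input weight matrix, recurrent weight matrix, and bias per gate
# (input i, forget f, output o) and for the candidate update g.
Wx = {g: 0.1 * rng.normal(size=(d_in, d_hid)) for g in "ifog"}
Wh = {g: 0.1 * rng.normal(size=(d_hid, d_hid)) for g in "ifog"}
b = {g: np.zeros(d_hid) for g in "ifog"}

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(x @ Wx["i"] + h_prev @ Wh["i"] + b["i"])   # input gate
    f = sigmoid(x @ Wx["f"] + h_prev @ Wh["f"] + b["f"])   # forget gate
    o = sigmoid(x @ Wx["o"] + h_prev @ Wh["o"] + b["o"])   # output gate
    g = np.tanh(x @ Wx["g"] + h_prev @ Wh["g"] + b["g"])   # candidate cell update
    c = f * c_prev + i * g        # additive ("recurrent residual") cell update
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(10, d_in)):   # run the cell over a short input sequence
    h, c = lstm_step(x, h, c)
print(h.shape)
```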

In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used LSTM principles to create the Highway network, a feedforward neural network with hundreds of layers, much deeper than previous networks.[94][95] Seven months later, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun won the ImageNet 2015 competition with an open-gated or gateless Highway network variant called the Residual neural network.[96] This has become the most cited neural network of the 21st century.[4]
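
The relation between the two architectures can be sketched as follows: a Highway layer mixes a transformed path and an identity "carry" path through a learned gate, while a residual layer behaves like the special case with the gate fixed open. The layer width and weights below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_h = 0.1 * rng.normal(size=(d, d))
W_t = 0.1 * rng.normal(size=(d, d))
b_t = -2.0 * np.ones(d)                    # negative bias: gates start mostly "closed" (carry)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def highway_layer(x):
    H = np.tanh(x @ W_h)                   # transform path
    T = sigmoid(x @ W_t + b_t)             # transform gate
    return T * H + (1.0 - T) * x           # gated mix of transform and identity carry

def residual_layer(x):
    return x + np.tanh(x @ W_h)            # gate effectively open: y = x + F(x)

x = rng.normal(size=d)
print(highway_layer(x).shape, residual_layer(x).shape)
```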

In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU[41] of Kunihiko Fukushima also helps to overcome the vanishing gradient problem,[97] compared to widely used activation functions prior to 2011.

Hardware-based designs

The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.[98]

Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices[99] for very large scale principal components analyses and convolution may create a new class of neural computing because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).[100] Ciresan and colleagues (2010)[101] in Schmidhuber's group showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.

Contests

Between 2009 and 2012, recurrent neural networks and deep feedforward neural networks developed in Schmidhuber's research group won eight international competitions in pattern recognition and machine learning.[102][103] For example, the bi-directional and multi-dimensional long short-term memory (LSTM)[104][105][106][107] of Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three languages to be learned.[106][105]

Ciresan and colleagues won pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[108] the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge[109] and others. Their neural networks were the first pattern recognizers to achieve human-competitive/superhuman performance[60] on benchmarks such as traffic sign recognition (IJCNN 2012) and the MNIST handwritten digits problem.

Researchers demonstrated (2010) that deep neural networks interfaced to a hidden Markov model with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search.[citation needed]

GPU-based implementations[110] of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[108] the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge,[109] the ImageNet Competition[61] and others.

Deep, highly nonlinear neural architectures similar to the neocognitron[111] and the "standard architecture of vision",[112] inspired by simple and complex cells, were pre-trained with unsupervised methods by Hinton.[88][87] A team from his lab won a 2012 contest sponsored by Merck to design software to help find molecules that might identify new drugs.[113]

References

  1. ^ Mansfield Merriman, "A List of Writings Relating to the Method of Least Squares"
  2. ^ Stigler, Stephen M. (1981). "Gauss and the Invention of Least Squares". Ann. Stat. 9 (3): 465–474. doi:10.1214/aos/1176345451.
  3. ^ Bretscher, Otto (1995). Linear Algebra With Applications (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
  4. ^ a b c d e f g h i j k l m n o p q r s Schmidhuber, Juergen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].
  5. ^ Stigler, Stephen M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge: Harvard. ISBN 0-674-40340-1.
  6. ^ Brush, Stephen G. (1967). "History of the Lenz-Ising Model". Reviews of Modern Physics. 39 (4): 883–893. Bibcode:1967RvMP...39..883B. doi:10.1103/RevModPhys.39.883.
  7. ^ Amari, Shun-Ichi (1972). "Learning patterns and pattern sequences by self-organizing nets of threshold elements". IEEE Transactions. C (21): 1197–1206.
  8. ^ Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities". Proceedings of the National Academy of Sciences. 79 (8): 2554–2558. Bibcode:1982PNAS...79.2554H. doi:10.1073/pnas.79.8.2554. PMC 346238. PMID 6953413.
  9. ^ McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133. doi:10.1007/BF02478259.
  10. ^ Kleene, S.C. (1956). "Representation of Events in Nerve Nets and Finite Automata". Annals of Mathematics Studies. No. 34. Princeton University Press. pp. 3–41. Retrieved 17 June 2017.
  11. ^ Kleene, S.C. (1956). "Representation of Events in Nerve Nets and Finite Automata". Annals of Mathematics Studies. No. 34. Princeton University Press. pp. 3–41. Retrieved 2017-06-17.
  12. ^ Hebb, Donald (1949). The Organization of Behavior. New York: Wiley. ISBN 978-1-135-63190-1.
  13. ^ Farley, B.G.; W.A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory. 4 (4): 76–84. doi:10.1109/TIT.1954.1057468.
  14. ^ Rochester, N.; J.H. Holland; L.H. Habit; W.L. Duda (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory. 2 (3): 80–93. doi:10.1109/TIT.1956.1056810.
  15. ^ Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain". Psychological Review. 65 (6): 386–408. CiteSeerX 10.1.1.588.3775. doi:10.1037/h0042519. PMID 13602029. S2CID 12781225.
  16. ^ David H. Hubel and Torsten N. Wiesel (2005). Brain and visual perception: the story of a 25-year collaboration. Oxford University Press US. p. 106. ISBN 978-0-19-517618-6.
  17. ^ Minsky, Marvin; Papert, Seymour (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-63022-1.
  18. ^ Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
  19. ^ Ivakhnenko, A. G. (1973). Cybernetic Predicting Devices. CCM Information Corporation.
  20. ^ Ivakhnenko, A. G.; Grigorʹevich Lapa, Valentin (1967). Cybernetics and forecasting techniques. American Elsevier Pub. Co.
  21. ^ Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
  22. ^ Amari, Shun'ichi (1967). "A theory of adaptive pattern classifier". IEEE Transactions. EC (16): 279–307.
  23. ^ Leibniz, Gottfried Wilhelm Freiherr von (1920). The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes (Leibniz published the chain rule in a 1676 memoir). Open court publishing Company. ISBN 9780598818461.
  24. ^ Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (Masters) (in Finnish). University of Helsinki. pp. 6–7.
  25. ^ Linnainmaa, Seppo (1976). "Taylor expansion of the accumulated rounding error". BIT Numerical Mathematics. 16 (2): 146–160. doi:10.1007/bf01931367. S2CID 122357351.
  26. ^ Griewank, Andreas (2012). "Who Invented the Reverse Mode of Differentiation?". Optimization Stories. Documenta Matematica, Extra Volume ISMP. pp. 389–400. S2CID 15568746.
  27. ^ Griewank, Andreas; Walther, Andrea (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition. SIAM. ISBN 978-0-89871-776-1.
  28. ^ Rosenblatt, Frank (1962). Principles of Neurodynamics. Spartan, New York.
  29. ^ Kelley, Henry J. (1960). "Gradient theory of optimal flight paths". ARS Journal. 30 (10): 947–954. doi:10.2514/8.5282.
  30. ^ Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System Modeling and Optimization. Springer. pp. 762–770. Archived (PDF) from the original on 14 April 2016. Retrieved 2 July 2017.
  31. ^ Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". David E. Rumelhart, James L. McClelland, and the PDP research group. (editors), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundation. MIT Press, 1986.
  32. ^ Kohonen, Teuvo; Honkela, Timo (2007). "Kohonen Network". Scholarpedia. 2 (1): 1568. Bibcode:2007SchpJ...2.1568K. doi:10.4249/scholarpedia.1568.
  33. ^ Kohonen, Teuvo (1982). "Self-Organized Formation of Topologically Correct Feature Maps". Biological Cybernetics. 43 (1): 59–69. doi:10.1007/bf00337288. S2CID 206775459.
  34. ^ Von der Malsburg, C (1973). "Self-organization of orientation sensitive cells in the striate cortex". Kybernetik. 14 (2): 85–100. doi:10.1007/bf00288907. PMID 4786750. S2CID 3351573.
  35. ^ . Lexico Dictionaries | English. Archived from the original on May 18, 2021. Retrieved 6 February 2022.
  36. ^ Qian, N.; Sejnowski, T.J. (1988). "Predicting the secondary structure of globular proteins using neural network models" (PDF). Journal of Molecular Biology. 202 (4): 865–884. doi:10.1016/0022-2836(88)90564-5. PMID 3172241. Qian1988.
  37. ^ Rost, B.; Sander, C. (1993). "Prediction of protein secondary structure at better than 70% accuracy" (PDF). Journal of Molecular Biology. 232 (2): 584–599. doi:10.1006/jmbi.1993.1413. PMID 8345525. Rost1993.
  38. ^ Fukushima, K. (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717.
  39. ^ Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved 16 November 2013.
  40. ^ LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
  41. ^ a b Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225.
  42. ^ Ramachandran, Prajit; Barret, Zoph; Quoc, V. Le (October 16, 2017). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].
  43. ^ a b Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.
  44. ^ Alexander Waibel et al., Phoneme Recognition Using Time-Delay Neural Networks IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328. – 339 March 1989.
  45. ^ Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.
  46. ^ Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.
  47. ^ LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, 1, pp. 541–551, 1989.
  48. ^ Zhang, Wei (1991). "Image processing of human corneal endothelium based on a learning network". Applied Optics. 30 (29): 4211–7. Bibcode:1991ApOpt..30.4211Z. doi:10.1364/AO.30.004211. PMID 20706526.
  49. ^ Zhang, Wei (1994). "Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network". Medical Physics. 21 (4): 517–24. Bibcode:1994MedPh..21..517Z. doi:10.1118/1.597177. PMID 8058017.
  50. ^ Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). . First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.
  51. ^ J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively," Proc. International Joint Conference on Neural Networks, Baltimore, Maryland, vol I, pp. 576–581, June, 1992.
  52. ^ J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images," Proc. 4th International Conf. Computer Vision, Berlin, Germany, pp. 121–128, May, 1993.
  53. ^ J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron," International Journal of Computer Vision, vol. 25, no. 2, pp. 105–139, Nov. 1997.
  54. ^ Weng, J; Ahuja, N; Huang, TS (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". 1993 (4th) International Conference on Computer Vision. pp. 121–128. doi:10.1109/ICCV.1993.378228. ISBN 0-8186-3870-2. S2CID 8619176.
  55. ^ Schmidhuber, Jürgen (2015). "Deep Learning". Scholarpedia. 10 (11): 1527–54. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
  56. ^ LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. CiteSeerX 10.1.1.32.9552. doi:10.1109/5.726791. S2CID 14542261. Retrieved October 7, 2016.
  57. ^ Dominik Scherer, Andreas C. Müller, and Sven Behnke: "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition," In 20th International Conference Artificial Neural Networks (ICANN), pp. 92–101, 2010. doi:10.1007/978-3-642-15825-4_10.
  58. ^ Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.
  59. ^ Martin Riedmiller und Heinrich Braun: Rprop – A Fast Adaptive Learning Algorithm. Proceedings of the International Symposium on Computer and Information Science VII, 1992
  60. ^ a b Ciresan, Dan; Meier, U.; Schmidhuber, J. (June 2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3642–3649. arXiv:1202.2745. Bibcode:2012arXiv1202.2745C. CiteSeerX 10.1.1.300.3283. doi:10.1109/cvpr.2012.6248110. ISBN 978-1-4673-1228-8. S2CID 2161592.
  61. ^ a b Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffry (2012). "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada.
  62. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition" (PDF). 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1. S2CID 206594692.
  63. ^ J. Weng, "Why Have We Passed 'Neural Networks Do not Abstract Well'?," Natural Intelligence: the INNS Magazine, vol. 1, no.1, pp. 13–22, 2011.
  64. ^ Z. Ji, J. Weng, and D. Prokhorov, "Where-What Network 1: Where and What Assist Each Other Through Top-down Connections," Proc. 7th International Conference on Development and Learning (ICDL'08), Monterey, CA, Aug. 9–12, pp. 1–6, 2008.
  65. ^ X. Wu, G. Guo, and J. Weng, "Skull-closed Autonomous Development: WWN-7 Dealing with Scales," Proc. International Conference on Brain-Mind, July 27–28, East Lansing, Michigan, pp. 1–9, 2013.
  66. ^ Schmidhuber, Jürgen (1991). "A possibility for implementing curiosity and boredom in model-building neural controllers". Proc. SAB'1991. MIT Press/Bradford Books. pp. 222–227.
  67. ^ Schmidhuber, Jürgen (2010). "Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010)". IEEE Transactions on Autonomous Mental Development. 2 (3): 230–247. doi:10.1109/TAMD.2010.2056368. S2CID 234198.
  68. ^ a b Schmidhuber, Jürgen (2020). "Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991)". Neural Networks. 127: 58–66. arXiv:1906.04493. doi:10.1016/j.neunet.2020.04.008. PMID 32334341. S2CID 216056336.
  69. ^ Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks (PDF). Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014). pp. 2672–2680. Archived (PDF) from the original on 22 November 2019. Retrieved 20 August 2019.
  70. ^ "Prepare, Don't Panic: Synthetic Media and Deepfakes". witness.org. Archived from the original on 2 December 2020. Retrieved 25 November 2020.
  71. ^ Schmidhuber, Jürgen (November 1992). "Learning Factorial Codes by Predictability Minimization". Neural Computation. 4 (6): 863–879. doi:10.1162/neco.1992.4.6.863. S2CID 42023620.
  72. ^ Schmidhuber, Jürgen; Eldracher, Martin; Foltin, Bernhard (1996). "Semilinear predictability minimzation produces well-known feature detectors". Neural Computation. 8 (4): 773–786. doi:10.1162/neco.1996.8.4.773. S2CID 16154391.
  73. ^ "GAN 2.0: NVIDIA's Hyperrealistic Face Generator". SyncedReview.com. December 14, 2018. Retrieved October 3, 2019.
  74. ^ Karras, Tero; Aila, Timo; Laine, Samuli; Lehtinen, Jaakko (October 1, 2017). "Progressive Growing of GANs for Improved Quality, Stability, and Variation". arXiv:1710.10196.
  75. ^ a b Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-06-12). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
  76. ^ Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
  77. ^ Hochreiter, Sepp; Schmidhuber, Jürgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
  78. ^ a b c Schmidhuber, Jürgen (1 November 1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
  79. ^ Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz; Kaiser, Lukasz; Belanger, David; Colwell, Lucy; Weller, Adrian (2020). "Rethinking Attention with Performers". arXiv:2009.14794 [cs.CL].
  80. ^ a b Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
  81. ^ Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
  82. ^ He, Cheng (31 December 2021). "Transformer in CV". Transformer in CV. Towards Data Science.
  83. ^ a b Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
  84. ^ a b c d Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression" (PDF). Neural Computation. 4 (2): 234–242. doi:10.1162/neco.1992.4.2.234. S2CID 18271205.
  85. ^ Schmidhuber, Jürgen (1993). Habilitation Thesis (PDF).
  86. ^ Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory.". In D. E. Rumelhart; J. L. McClelland; PDP Research Group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. pp. 194–281. ISBN 9780262680530.
  87. ^ a b Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation. 18 (7): 1527–1554. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
  88. ^ a b Hinton, Geoffrey (2009-05-31). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947. ISSN 1941-6016.
  89. ^ Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning". arXiv:1112.6209 [cs.LG].
  90. ^ a b S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen" (Archived 2015-03-06 at the Wayback Machine), Diploma thesis, Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber, 1991.
  91. ^ Hochreiter, S.; et al. (15 January 2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kolen, John F.; Kremer, Stefan C. (eds.). A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons. ISBN 978-0-7803-5369-5.
  92. ^ Hochreiter, Sepp; Schmidhuber, Jürgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
  93. ^ Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
  94. ^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
  95. ^ Srivastava, Rupesh K; Greff, Klaus; Schmidhuber, Juergen (2015). "Training Very Deep Networks". Advances in Neural Information Processing Systems. Curran Associates, Inc. 28: 2377–2385.
  96. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
  97. ^ Xavier Glorot; Antoine Bordes; Yoshua Bengio (2011). Deep sparse rectifier neural networks (PDF). AISTATS. Rectifier and softplus activation functions. The second one is a smooth version of the first.
  98. ^ Mead, Carver A.; Ismail, Mohammed (8 May 1989). Analog VLSI Implementation of Neural Systems (PDF). The Kluwer International Series in Engineering and Computer Science. Vol. 80. Norwell, MA: Kluwer Academic Publishers. doi:10.1007/978-1-4613-1639-8. ISBN 978-1-4613-1639-8.
  99. ^ Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. (2008). "Memristive switching mechanism for metal/oxide/metal nanodevices". Nat. Nanotechnol. 3 (7): 429–433. doi:10.1038/nnano.2008.160. PMID 18654568.
  100. ^ Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. (2008). "The missing memristor found". Nature. 453 (7191): 80–83. Bibcode:2008Natur.453...80S. doi:10.1038/nature06932. PMID 18451858. S2CID 4367148.
  101. ^ Cireşan, Dan Claudiu; Meier, Ueli; Gambardella, Luca Maria; Schmidhuber, Jürgen (2010-09-21). "Deep, Big, Simple Neural Nets for Handwritten Digit Recognition". Neural Computation. 22 (12): 3207–3220. arXiv:1003.0358. doi:10.1162/neco_a_00052. ISSN 0899-7667. PMID 20858131. S2CID 1918673.
  102. ^ 2012 Kurzweil AI Interview (Archived 2018-08-31 at the Wayback Machine) with Jürgen Schmidhuber on the eight competitions won by his Deep Learning team 2009–2012.
  103. ^ . www.kurzweilai.net. Archived from the original on 2018-08-31. Retrieved 2017-06-16.
  104. ^ Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Advances in Neural Information Processing Systems 22 (NIPS'22), 7–10 December 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.
  105. ^ a b Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (2009). "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868. CiteSeerX 10.1.1.139.4502. doi:10.1109/tpami.2008.137. PMID 19299860. S2CID 14635907.
  106. ^ a b Graves, Alex; Schmidhuber, Jürgen (2009). Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris; Culotta, Aron (eds.). "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". Neural Information Processing Systems (NIPS) Foundation. Curran Associates, Inc. 21: 545–552.
  107. ^ Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (May 2009). "A Novel Connectionist System for Unconstrained Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868. CiteSeerX 10.1.1.139.4502. doi:10.1109/tpami.2008.137. ISSN 0162-8828. PMID 19299860. S2CID 14635907.
  108. ^ a b Cireşan, Dan; Meier, Ueli; Masci, Jonathan; Schmidhuber, Jürgen (August 2012). "Multi-column deep neural network for traffic sign classification". Neural Networks. Selected Papers from IJCNN 2011. 32: 333–338. CiteSeerX 10.1.1.226.8219. doi:10.1016/j.neunet.2012.02.023. PMID 22386783.
  109. ^ a b Ciresan, Dan; Giusti, Alessandro; Gambardella, Luca M.; Schmidhuber, Juergen (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q. (eds.). Advances in Neural Information Processing Systems 25 (PDF). Curran Associates, Inc. pp. 2843–2851.
  110. ^ Ciresan, D. C.; Meier, U.; Masci, J.; Gambardella, L. M.; Schmidhuber, J. (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). International Joint Conference on Artificial Intelligence. doi:10.5591/978-1-57735-516-8/ijcai11-210.
  111. ^ Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608.
  112. ^ Riesenhuber, M; Poggio, T (1999). "Hierarchical models of object recognition in cortex". Nature Neuroscience. 2 (11): 1019–1025. doi:10.1038/14819. PMID 10526343. S2CID 8920227.
  113. ^ Markoff, John (November 23, 2012). "Scientists See Promise in Deep-Learning Programs". New York Times.

External links

  • "Lecun 2019-7-11 ACM Tech Talk". Google Docs. Retrieved 2020-02-13.

history, artificial, neural, networks, this, article, multiple, issues, please, help, improve, discuss, these, issues, talk, page, learn, when, remove, these, template, messages, this, article, relies, excessively, references, primary, sources, please, improve. This article has multiple issues Please help improve it or discuss these issues on the talk page Learn how and when to remove these template messages This article relies excessively on references to primary sources Please improve this article by adding secondary or tertiary sources Find sources History of artificial neural networks news newspapers books scholar JSTOR August 2022 Learn how and when to remove this template message This article needs to be updated Please help update this article to reflect recent events or newly available information September 2021 Learn how and when to remove this template message Contents 1 Linear neural network 2 Recurrent network architectures 3 Perceptrons and other early neural networks 4 First deep learning 5 Backpropagation 6 Self organizing maps 7 Support vector machines 8 Convolutional neural networks CNNs 9 Artificial curiosity and generative adversarial networks 10 Transformers and their variants 11 Deep learning with unsupervised or self supervised pre training 12 The vanishing gradient problem and its solutions 13 Hardware based designs 14 Contests 15 References 16 External linksLinear neural network EditThe simplest kind of feedforward neural network is a linear network which consists of a single layer of output nodes the inputs are fed directly to the outputs via a series of weights The sum of the products of the weights and the inputs is calculated in each node The mean squared errors between these calculated outputs and a given target values are minimized by creating an adjustment to the weights This technique has been known for over two centuries as the method of least squares or linear regression It was used as a means of finding a good rough linear fit to a set of points by Legendre 1805 and Gauss 1795 for the prediction of planetary movement 1 2 3 4 5 Recurrent network architectures EditMain article Recurrent neural network Wilhelm Lenz and Ernst Ising created and analyzed the Ising model 1925 6 which is essentially a non learning artificial recurrent neural network RNN consisting of neuron like threshold elements 4 In 1972 Shun ichi Amari made this architecture adaptive 7 4 His learning RNN was popularised by John Hopfield in 1982 8 Perceptrons and other early neural networks EditWarren McCulloch and Walter Pitts 9 1943 also considered a non learning computational model for neural networks 10 This model paved the way for research to split into two approaches One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence This work led to work on nerve networks and their link to finite automata 11 In the early 1940s D O Hebb 12 created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning Hebbian learning is unsupervised learning This evolved into models for long term potentiation Researchers started applying these ideas to computational models in 1948 with Turing s B type machines Farley and Clark 13 1954 first used computational machines then called calculators to simulate a Hebbian network Other neural network computational machines were created by Rochester Holland Habit and Duda 1956 14 Rosenblatt 15 1958 created the perceptron an algorithm for pattern 
recognition With mathematical notation Rosenblatt described circuitry not in the basic perceptron such as the exclusive or circuit that could not be processed by neural networks at the time In 1959 a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex simple cells and complex cells 16 Some say that research stagnated following Minsky and Papert 1969 17 who discovered that basic perceptrons were incapable of processing the exclusive or circuit and that computers lacked sufficient power to process useful neural networks However by the time this book came out methods for training multilayer perceptrons MLPs by deep learning were already known 4 First deep learning EditThe first deep learning MLP was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa in 1965 as the Group Method of Data Handling 18 19 20 This method employs incremental layer by layer training based on regression analysis where useless units in hidden layers are pruned with the help of a validation set The first deep learning MLP trained by stochastic gradient descent 21 was published in 1967 by Shun ichi Amari 22 4 In computer experiments conducted by Amari s student Saito a five layer MLP with two modifiable layers learned useful internal representations to classify non linearily separable pattern classes 4 Backpropagation EditMain article Backpropagation The backpropagation algorithm is an efficient application of the Leibniz chain rule 1673 23 to networks of differentiable nodes 4 It is also known as the reverse mode of automatic differentiation or reverse accumulation due to Seppo Linnainmaa 1970 24 25 26 27 4 The term back propagating errors was introduced in 1962 by Frank Rosenblatt 28 4 but he did not have an implementation of this procedure although Henry J Kelley had a continuous precursor of backpropagation 29 already in 1960 in the context of control theory 4 In 1982 Paul Werbos applied backpropagation to MLPs in the way that has become standard 30 In 1986 David E Rumelhart et al published an experimental analysis of the technique 31 Self organizing maps EditMain article Self organizing map Self organizing maps SOMs were described by Teuvo Kohonen in 1982 32 33 SOMs are neurophysiologically inspired 34 artificial neural networks that learn low dimensional representations of high dimensional data while preserving the topological structure of the data They are trained using competitive learning SOMs create internal representations reminiscent of the cortical homunculus 35 a distorted representation of the human body based on a neurological map of the areas and proportions of the human brain dedicated to processing sensory functions for different parts of the body Support vector machines EditMain article Support vector machine Support vector machines developed at AT amp T Bell Laboratories by Vladimir Vapnik with colleagues Boser et al 1992 Isabelle Guyon et al 1993 Corinna Cortes 1995 Vapnik et al 1997 and simpler methods such as linear classifiers gradually overtook neural networks citation needed However neural networks transformed domains such as the prediction of protein structures 36 37 Convolutional neural networks CNNs EditMain article Convolutional neural network The origin of the CNN architecture is the neocognitron 38 introduced by Kunihiko Fukushima in 1980 39 40 It was inspired by work of Hubel and Wiesel in the 1950s and 1960s which showed that cat visual cortices contain neurons that individually respond to 
small regions of the visual field The neocognitron introduced the two basic types of layers in CNNs convolutional layers and downsampling layers A convolutional layer contains units whose receptive fields cover a patch of the previous layer The weight vector the set of adaptive parameters of such a unit is often called a filter Units can share filters Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers Such a unit typically computes the average of the activations of the units in its patch This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted In 1969 Kunihiko Fukushima also introduced the ReLU rectified linear unit activation function 41 4 The rectifier has become the most popular activation function for CNNs and deep neural networks in general 42 The time delay neural network TDNN was introduced in 1987 by Alex Waibel and was one of the first CNNs as it achieved shift invariance 43 It did so by utilizing weight sharing in combination with backpropagation training 44 Thus while also using a pyramidal structure as in the neocognitron it performed a global optimization of the weights instead of a local one 43 In 1988 Wei Zhang et al applied backpropagation to a CNN a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer for alphabet recognition They also proposed an implementation of the CNN with an optical computing system 45 46 In 1989 Yann LeCun et al trained a CNN with the purpose of recognizing handwritten ZIP codes on mail While the algorithm worked training required 3 days 47 Learning was fully automatic performed better than manual coefficient design and was suited to a broader range of image recognition problems and image types Subsequently Wei Zhang et al modified their model by removing the last fully connected layer and applied it for medical image object segmentation in 1991 48 and breast cancer detection in mammograms in 1994 49 In 1990 Yamaguchi et al introduced max pooling a fixed filtering operation that calculates and propagates the maximum value of a given region They combined TDNNs with max pooling in order to realize a speaker independent isolated word recognition system 50 In a variant of the neocognitron called the cresceptron instead of using Fukushima s spatial averaging J Weng et al also used max pooling where a downsampling unit computes the maximum of the activations of the units in its patch 51 52 53 54 Max pooling is often used in modern CNNs 55 LeNet 5 a 7 level CNN by Yann LeCun et al in 1998 56 that classifies digits was applied by several banks to recognize hand written numbers on checks British English cheques digitized in 32x32 pixel images The ability to process higher resolution images requires larger and more layers of CNNs so this technique is constrained by the availability of computing resources In 2010 Backpropagation training through max pooling was accelerated by GPUs and shown to perform better than other pooling variants 57 Behnke 2003 relied only on the sign of the gradient Rprop 58 on problems such as image reconstruction and face localization Rprop is a first order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992 59 In 2011 a deep GPU based CNN called DanNet by Dan Ciresan Ueli Meier and Juergen Schmidhuber achieved human competitive performance for the first time in computer vision contests 60 Subsequently a similar GPU based CNN by Alex 
Krizhevsky Ilya Sutskever and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge 2012 61 A very deep CNN with over 100 layers by Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun of Microsoft won the ImageNet 2015 contest 62 ANNs were able to guarantee shift invariance to deal with small and large natural objects in large cluttered scenes only when invariance extended beyond shift to all ANN learned concepts such as location type object class label scale lighting and others This was realized in Developmental Networks DNs 63 whose embodiments are Where What Networks WWN 1 2008 64 through WWN 7 2013 65 Artificial curiosity and generative adversarial networks EditMain article Generative adversarial network In 1991 Juergen Schmidhuber published adversarial neural networks that contest with each other in the form of a zero sum game where one network s gain is the other network s loss 66 67 68 The first network is a generative model that models a probability distribution over output patterns The second network learns by gradient descent to predict the reactions of the environment to these patterns This was called artificial curiosity Earlier adversarial machine learning systems neither involved unsupervised neural networks nor were about modeling data nor used gradient descent 68 In 2014 this adversarial principle was used in a generative adversarial network GAN by Ian Goodfellow et al 69 Here the environmental reaction is 1 or 0 depending on whether the first network s output is in a given set This can be used to create realistic deepfakes 70 In 1992 Schmidhuber also published another type of gradient based adversarial neural networks where the goal of the zero sum game is to create disentangled representations of input patterns This was called predictability minimization 71 72 Nvidia s StyleGAN 2018 73 is based on the Progressive GAN by Tero Karras Timo Aila Samuli Laine and Jaakko Lehtinen 74 Here the GAN generator is grown from small to large scale in a pyramidal fashion StyleGANs improve consistency between fine and coarse details in the generator network Transformers and their variants EditMain article Transformer machine learning model Many modern large language models such as ChatGPT GPT 4 and BERT use a feedforward neural network called Transformer by Ashish Vaswani et al in their 2017 paper Attention Is All You Need 75 Transformers have increasingly become the model of choice for natural language processing problems 76 replacing recurrent neural networks RNNs such as long short term memory LSTM 77 Basic ideas for this go back a long way in 1992 Juergen Schmidhuber published the Transformer with linearized self attention save for a normalization operator 78 which is also called the linear Transformer 79 80 4 He advertised it as an alternative to RNNs 78 that can learn internal spotlights of attention 81 and experimentally applied it to problems of variable binding 78 Here a slow feedforward neural network learns by gradient descent to control the fast weights of another neural network through outer products of self generated activation patterns called FROM and TO which in Transformer terminology are called key and value for self attention 80 This fast weight attention mapping is applied to queries The 2017 Transformer 75 combines this with a softmax operator and a projection matrix 4 Transformers are also increasingly being used in computer vision 82 Deep learning with unsupervised or self supervised pre training EditIn the 1980s backpropagation did not work well 
Transformers and their variants Edit

Main article: Transformer (machine learning model)

Many modern large language models such as ChatGPT, GPT-4, and BERT use a feedforward neural network called Transformer, introduced by Ashish Vaswani et al. in their 2017 paper "Attention Is All You Need".[75] Transformers have increasingly become the model of choice for natural language processing problems,[76] replacing recurrent neural networks (RNNs) such as long short-term memory (LSTM).[77]

Basic ideas for this go back a long way: in 1992, Juergen Schmidhuber published the Transformer with linearized self-attention (save for a normalization operator),[78] which is also called the "linear Transformer".[79][80][4] He advertised it as an "alternative to RNNs"[78] that can learn "internal spotlights of attention",[81] and experimentally applied it to problems of variable binding.[78] Here a slow feedforward neural network learns by gradient descent to control the fast weights of another neural network through outer products of self-generated activation patterns called "FROM" and "TO", which in Transformer terminology are called "key" and "value" for "self-attention".[80] This fast-weight "attention mapping" is applied to queries. The 2017 Transformer[75] combines this with a softmax operator and a projection matrix.[4] Transformers are also increasingly being used in computer vision.[82]
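The fast-weight construction described above (summing outer products of values and keys, then applying the resulting matrix to queries) can be sketched as follows. This is a simplified illustration, assuming PyTorch; the tensor shapes are arbitrary, and the softmax variant is shown without masking or an output projection for brevity.

    # Sketch of linearized (fast-weight) attention versus softmax attention.
    # Shapes are illustrative assumptions, not taken from the cited papers.
    import torch

    T, d = 6, 4                        # sequence length, key/value dimension
    K = torch.randn(T, d)              # keys   ("FROM" patterns)
    V = torch.randn(T, d)              # values ("TO" patterns)
    Q = torch.randn(T, d)              # queries

    # Linearized attention: accumulate a fast weight matrix as a running sum of
    # outer products value_t * key_t^T, then apply it to each query.
    W_fast = torch.zeros(d, d)
    linear_out = []
    for t in range(T):
        W_fast = W_fast + torch.outer(V[t], K[t])   # a slow net would generate K[t], V[t] here
        linear_out.append(W_fast @ Q[t])
    linear_out = torch.stack(linear_out)

    # The 2017 Transformer additionally normalizes attention scores with a softmax
    # (shown here without masking or the projection matrix).
    scores = (Q @ K.T) / d ** 0.5
    softmax_out = torch.softmax(scores, dim=-1) @ V

    print(linear_out.shape, softmax_out.shape)      # both (T, d)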
Deep learning with unsupervised or self-supervised pre-training Edit

In the 1980s, backpropagation did not work well for deep FNNs and RNNs. Here the word "deep" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth.[83] The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For an FNN, the depth of the CAPs is that of the network: the number of hidden layers plus one, as the output layer is also parameterized. For RNNs, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.

To overcome this problem, Juergen Schmidhuber (1992) proposed a hierarchy of RNNs pre-trained one level at a time by self-supervised learning.[84] This "neural history compressor" uses predictive coding to learn internal representations at multiple self-organizing time scales.[4] The deep architecture may be used to reproduce the original data from the top-level feature activations.[84] The RNN hierarchy can be collapsed into a single RNN by "distilling" a higher-level "chunker" network into a lower-level "automatizer" network.[84][4] In 1993, a chunker solved a deep learning task whose CAP depth exceeded 1000.[85] Such history compressors can substantially facilitate downstream supervised deep learning.[4]

Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine (RBM)[86] to model each layer. The RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations.[87][88]

In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.[89]
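A minimal sketch of greedy layer-wise pre-training in this spirit is shown below, assuming NumPy: each restricted Boltzmann machine is trained with one-step contrastive divergence, and its hidden activations become the training data for the next layer. The layer sizes, learning rate, and toy data are illustrative assumptions, not the exact procedure of the cited work.

    # Greedy layer-wise pre-training with RBMs trained by one-step contrastive
    # divergence (CD-1). All sizes, rates, and data are illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    def train_rbm(data, n_hidden, epochs=50, lr=0.05):
        n_visible = data.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
        for _ in range(epochs):
            v0 = data
            p_h0 = sigmoid(v0 @ W + b_h)                        # hidden probabilities
            h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sampled hidden states
            p_v1 = sigmoid(h0 @ W.T + b_v)                      # reconstruction
            p_h1 = sigmoid(p_v1 @ W + b_h)
            W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data) # CD-1 weight update
            b_v += lr * (v0 - p_v1).mean(axis=0)
            b_h += lr * (p_h0 - p_h1).mean(axis=0)
        return W, b_h

    # Greedy stacking: each RBM models the hidden activations of the one below it.
    X = (rng.random((200, 32)) < 0.3).astype(float)   # toy binary data
    layers, layer_input = [], X
    for n_hidden in (16, 8):
        W, b_h = train_rbm(layer_input, n_hidden)
        layer_input = sigmoid(layer_input @ W + b_h)  # features fed to the next layer
        layers.append((W, b_h))
    print([W.shape for W, _ in layers])               # [(32, 16), (16, 8)]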
The vanishing gradient problem and its solutions Edit

Main article: Long short-term memory

Sepp Hochreiter's diploma thesis (1991)[90] was called "one of the most important documents in the history of machine learning" by his supervisor Juergen Schmidhuber.[4] Hochreiter not only tested the neural history compressor,[84] but also identified and analyzed the vanishing gradient problem.[90][91] He proposed recurrent residual connections to solve this problem. This led to the deep learning method called long short-term memory (LSTM), published in 1997.[92] LSTM recurrent neural networks can learn "very deep learning" tasks[83] with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The "vanilla LSTM" with forget gate was introduced in 1999 by Felix Gers, Schmidhuber, and Fred Cummins.[93] LSTM has become the most cited neural network of the 20th century.[4]

In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used LSTM principles to create the Highway network, a feedforward neural network with hundreds of layers, much deeper than previous networks.[94][95] Seven months later, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun won the ImageNet 2015 competition with an open-gated or gateless Highway network variant called the Residual neural network.[96] This has become the most cited neural network of the 21st century.[4]

In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU[41] of Kunihiko Fukushima also helps to overcome the vanishing gradient problem,[97] compared to widely used activation functions prior to 2011.
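The gating idea behind Highway networks, and its open-gated residual special case, can be sketched as follows, assuming PyTorch; the layer width and the gate-bias initialization are illustrative choices rather than the exact published configuration.

    # Sketch of a Highway layer and its "open-gated" special case, the residual block.
    # Width and initialization are illustrative assumptions.
    import torch
    import torch.nn as nn

    class HighwayLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.H = nn.Linear(dim, dim)      # candidate transformation
            self.T = nn.Linear(dim, dim)      # transform gate
            self.T.bias.data.fill_(-2.0)      # bias the gate toward carrying the input early in training

        def forward(self, x):
            h = torch.relu(self.H(x))
            t = torch.sigmoid(self.T(x))      # gate values in [0, 1]
            return t * h + (1.0 - t) * x      # gated mix of transformation and identity

    class ResidualLayer(nn.Module):
        """Gateless variant: y = x + H(x), i.e. the gate is held open."""
        def __init__(self, dim):
            super().__init__()
            self.H = nn.Linear(dim, dim)

        def forward(self, x):
            return x + torch.relu(self.H(x))

    x = torch.randn(4, 64)
    deep = nn.Sequential(*[HighwayLayer(64) for _ in range(50)])   # many stacked layers
    print(deep(x).shape, ResidualLayer(64)(x).shape)

Because the identity path lets signals and gradients pass through unchanged, stacks of many such layers remain trainable, which is the property that Highway and Residual networks exploit.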
Hardware-based designs Edit

The development of metal-oxide-semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.[98]

Computational devices were created in CMOS for both biophysical simulation and neuromorphic computing, inspired by the structure and function of the human brain. Nanodevices[99] for very-large-scale principal components analyses and convolution may create a new class of neural computing, because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).[100] Ciresan and colleagues (2010),[101] in Schmidhuber's group, showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.

Contests Edit

Between 2009 and 2012, recurrent neural networks and deep feedforward neural networks developed in Schmidhuber's research group won eight international competitions in pattern recognition and machine learning.[102][103] For example, the bi-directional and multi-dimensional long short-term memory (LSTM)[104][105][106][107] of Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three languages to be learned.[106][105]

Ciresan and colleagues won pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[108] the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge,[109] and others. Their neural networks were the first pattern recognizers to achieve human-competitive or superhuman performance[60] on benchmarks such as traffic sign recognition (IJCNN 2012) or the MNIST handwritten digits problem.

Researchers demonstrated (2010) that deep neural networks interfaced to a hidden Markov model with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search.[citation needed] GPU-based implementations[110] of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,[108] the ISBI 2012 Segmentation of Neuronal Structures in EM Stacks challenge,[109] the ImageNet Competition,[61] and others.

Deep, highly nonlinear neural architectures similar to the neocognitron[111] and the "standard architecture of vision"[112] inspired by simple and complex cells were pre-trained with unsupervised methods by Hinton.[88][87] A team from his lab won a 2012 contest sponsored by Merck to design software to help find molecules that might identify new drugs.[113]

References Edit

1. Mansfield Merriman, "A List of Writings Relating to the Method of Least Squares".
2. Stigler, Stephen M. (1981). "Gauss and the Invention of Least Squares". Ann. Stat. 9 (3): 465–474. doi:10.1214/aos/1176345451.
3. Bretscher, Otto (1995). Linear Algebra With Applications (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
4. Schmidhuber, Juergen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].
5. Stigler, Stephen M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge: Harvard. ISBN 0-674-40340-1.
6. Brush, Stephen G. (1967). "History of the Lenz-Ising Model". Reviews of Modern Physics. 39 (4): 883–893. Bibcode:1967RvMP...39..883B. doi:10.1103/RevModPhys.39.883.
7. Amari, Shun-Ichi (1972). "Learning patterns and pattern sequences by self-organizing nets of threshold elements". IEEE Transactions. C (21): 1197–1206.
8. Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities". Proceedings of the National Academy of Sciences. 79 (8): 2554–2558. Bibcode:1982PNAS...79.2554H. doi:10.1073/pnas.79.8.2554. PMC 346238. PMID 6953413.
9. McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133. doi:10.1007/BF02478259.
10. Kleene, S. C. (1956). "Representation of Events in Nerve Nets and Finite Automata". Annals of Mathematics Studies. No. 34. Princeton University Press. pp. 3–41. Retrieved 17 June 2017.
11. Kleene, S. C. (1956). "Representation of Events in Nerve Nets and Finite Automata". Annals of Mathematics Studies. No. 34. Princeton University Press. pp. 3–41. Retrieved 2017-06-17.
12. Hebb, Donald (1949). The Organization of Behavior. New York: Wiley. ISBN 978-1-135-63190-1.
13. Farley, B. G.; W. A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory. 4 (4): 76–84. doi:10.1109/TIT.1954.1057468.
14. Rochester, N.; J. H. Holland; L. H. Habit; W. L. Duda (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory. 2 (3): 80–93. doi:10.1109/TIT.1956.1056810.
15. Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain". Psychological Review. 65 (6): 386–408. CiteSeerX 10.1.1.588.3775. doi:10.1037/h0042519. PMID 13602029. S2CID 12781225.
16. David H. Hubel and Torsten N. Wiesel (2005). Brain and visual perception: the story of a 25-year collaboration. Oxford University Press US. p. 106. ISBN 978-0-19-517618-6.
17. Minsky, Marvin; Papert, Seymour (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-63022-1.
18. Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
19. Ivakhnenko, A. G. (1973). Cybernetic Predicting Devices. CCM Information Corporation.
20. Ivakhnenko, A. G.; Grigorʹevich Lapa, Valentin (1967). Cybernetics and forecasting techniques. American Elsevier Pub. Co.
21. Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
22. Amari, Shun-ichi (1967). "A theory of adaptive pattern classifier". IEEE Transactions. EC (16): 279–307.
23. Leibniz, Gottfried Wilhelm Freiherr von (1920). The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes. Open Court Publishing Company. ISBN 9780598818461. "Leibniz published the chain rule in a 1676 memoir."
24. Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (Masters thesis, in Finnish). University of Helsinki. pp. 6–7.
25. Linnainmaa, Seppo (1976). "Taylor expansion of the accumulated rounding error". BIT Numerical Mathematics. 16 (2): 146–160. doi:10.1007/bf01931367. S2CID 122357351.
26. Griewank, Andreas (2012). "Who Invented the Reverse Mode of Differentiation?". Optimization Stories. Documenta Mathematica, Extra Volume ISMP. pp. 389–400. S2CID 15568746.
27. Griewank, Andreas; Walther, Andrea (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM. ISBN 978-0-89871-776-1.
28. Rosenblatt, Frank (1962). Principles of Neurodynamics. Spartan, New York.
29. Kelley, Henry J. (1960). "Gradient theory of optimal flight paths". ARS Journal. 30 (10): 947–954. doi:10.2514/8.5282.
30. Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System modeling and optimization. Springer. pp. 762–770. Archived (PDF) from the original on 14 April 2016. Retrieved 2 July 2017.
31. Rumelhart, David E.; Geoffrey E. Hinton; R. J. Williams. "Learning Internal Representations by Error Propagation". In: David E. Rumelhart, James L. McClelland, and the PDP research group (eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundation. MIT Press, 1986.
32. Kohonen, Teuvo; Honkela, Timo (2007). "Kohonen Network". Scholarpedia. 2 (1): 1568. Bibcode:2007SchpJ...2.1568K. doi:10.4249/scholarpedia.1568.
33. Kohonen, Teuvo (1982). "Self-Organized Formation of Topologically Correct Feature Maps". Biological Cybernetics. 43 (1): 59–69. doi:10.1007/bf00337288. S2CID 206775459.
34. Von der Malsburg, C. (1973). "Self-organization of orientation sensitive cells in the striate cortex". Kybernetik. 14 (2): 85–100. doi:10.1007/bf00288907. PMID 4786750. S2CID 3351573.
35. "Homunculus | Meaning & Definition in UK English". Lexico.com. Lexico Dictionaries English. Archived from the original on May 18, 2021. Retrieved 6 February 2022.
36. Qian, N.; Sejnowski, T. J. (1988). "Predicting the secondary structure of globular proteins using neural network models" (PDF). Journal of Molecular Biology. 202 (4): 865–884. doi:10.1016/0022-2836(88)90564-5. PMID 3172241.
37. Rost, B.; Sander, C. (1993). "Prediction of protein secondary structure at better than 70% accuracy" (PDF). Journal of Molecular Biology. 232 (2): 584–599. doi:10.1006/jmbi.1993.1413. PMID 8345525.
38. Fukushima, K. (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717.
39. Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved 16 November 2013.
40. LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
41. Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225.
42. Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (October 16, 2017). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].
43. Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.
44. Alexander Waibel et al., "Phoneme Recognition Using Time-Delay Neural Networks". IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328–339, March 1989.
45. Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of the Annual Conference of the Japan Society of Applied Physics.
46. Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.
47. LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition". Neural Computation, 1, pp. 541–551, 1989.
48. Zhang, Wei (1991). "Image processing of human corneal endothelium based on a learning network". Applied Optics. 30 (29): 4211–7. Bibcode:1991ApOpt..30.4211Z. doi:10.1364/AO.30.004211. PMID 20706526.
49. Zhang, Wei (1994). "Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network". Medical Physics. 21 (4): 517–24. Bibcode:1994MedPh..21..517Z. doi:10.1118/1.597177. PMID 8058017.
50. Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.
51. J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively". Proc. International Joint Conference on Neural Networks, Baltimore, Maryland, vol. I, pp. 576–581, June 1992.
52. J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images". Proc. 4th International Conf. Computer Vision, Berlin, Germany, pp. 121–128, May 1993.
53. J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron". International Journal of Computer Vision, vol. 25, no. 2, pp. 105–139, Nov. 1997.
54. Weng, J.; Ahuja, N.; Huang, T. S. (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". 1993 4th International Conference on Computer Vision. pp. 121–128. doi:10.1109/ICCV.1993.378228. ISBN 0-8186-3870-2. S2CID 8619176.
55. Schmidhuber, Jurgen (2015). "Deep Learning". Scholarpedia. 10 (11): 1527–54. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
56. LeCun, Yann; Leon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. CiteSeerX 10.1.1.32.9552. doi:10.1109/5.726791. S2CID 14542261. Retrieved October 7, 2016.
57. Dominik Scherer, Andreas C. Müller, and Sven Behnke: "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition". In 20th International Conference on Artificial Neural Networks (ICANN), pp. 92–101, 2010. doi:10.1007/978-3-642-15825-4_10.
58. Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.
59. Martin Riedmiller and Heinrich Braun: "Rprop – A Fast Adaptive Learning Algorithm". Proceedings of the International Symposium on Computer and Information Science VII, 1992.
60. Ciresan, Dan; Meier, U.; Schmidhuber, J. (June 2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3642–3649. arXiv:1202.2745. Bibcode:2012arXiv1202.2745C. CiteSeerX 10.1.1.300.3283. doi:10.1109/cvpr.2012.6248110. ISBN 978-1-4673-1228-8. S2CID 2161592.
61. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey (2012). "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada.
62. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition" (PDF). 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1. S2CID 206594692.
63. J. Weng, "Why Have We Passed 'Neural Networks Do not Abstract Well'?". Natural Intelligence: the INNS Magazine, vol. 1, no. 1, pp. 13–22, 2011.
64. Z. Ji, J. Weng, and D. Prokhorov, "Where-What Network 1: Where and What Assist Each Other Through Top-down Connections". Proc. 7th International Conference on Development and Learning (ICDL'08), Monterey, CA, Aug. 9–12, pp. 1–6, 2008.
65. X. Wu, G. Guo, and J. Weng, "Skull-closed Autonomous Development: WWN-7 Dealing with Scales". Proc. International Conference on Brain-Mind, July 27–28, East Lansing, Michigan, pp. 1–9, 2013.
66. Schmidhuber, Jurgen (1991). "A possibility for implementing curiosity and boredom in model-building neural controllers". Proc. SAB'1991. MIT Press/Bradford Books. pp. 222–227.
67. Schmidhuber, Jurgen (2010). "Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010)". IEEE Transactions on Autonomous Mental Development. 2 (3): 230–247. doi:10.1109/TAMD.2010.2056368. S2CID 234198.
68. Schmidhuber, Jurgen (2020). "Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991)". Neural Networks. 127: 58–66. arXiv:1906.04493. doi:10.1016/j.neunet.2020.04.008. PMID 32334341. S2CID 216056336.
69. Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). "Generative Adversarial Networks" (PDF). Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014). pp. 2672–2680. Archived (PDF) from the original on 22 November 2019. Retrieved 20 August 2019.
70. "Prepare, Don't Panic: Synthetic Media and Deepfakes". witness.org. Archived from the original on 2 December 2020. Retrieved 25 November 2020.
71. Schmidhuber, Jurgen (November 1992). "Learning Factorial Codes by Predictability Minimization". Neural Computation. 4 (6): 863–879. doi:10.1162/neco.1992.4.6.863. S2CID 42023620.
72. Schmidhuber, Jurgen; Eldracher, Martin; Foltin, Bernhard (1996). "Semilinear predictability minimization produces well-known feature detectors". Neural Computation. 8 (4): 773–786. doi:10.1162/neco.1996.8.4.773. S2CID 16154391.
73. "GAN 2.0: NVIDIA's Hyperrealistic Face Generator". SyncedReview.com. December 14, 2018. Retrieved October 3, 2019.
74. Karras, Tero; Aila, Timo; Laine, Samuli; Lehtinen, Jaakko (October 1, 2017). "Progressive Growing of GANs for Improved Quality, Stability, and Variation". arXiv:1710.10196.
75. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (2017-06-12). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
76. Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
77. Hochreiter, Sepp; Schmidhuber, Jurgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
78. Schmidhuber, Jurgen (1 November 1992). "Learning to control fast-weight memories: an alternative to recurrent nets". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
79. Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz; Kaiser, Lukasz; Belanger, David; Colwell, Lucy; Weller, Adrian (2020). "Rethinking Attention with Performers". arXiv:2009.14794 [cs.CL].
80. Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jurgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
81. Schmidhuber, Jurgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
82. He, Cheng (31 December 2021). "Transformer in CV". Towards Data Science.
83. Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
84. Schmidhuber, Jurgen (1992). "Learning complex, extended sequences using the principle of history compression" (PDF). Neural Computation. 4 (2): 234–242. doi:10.1162/neco.1992.4.2.234. S2CID 18271205.
85. Schmidhuber, Jurgen (1993). Habilitation Thesis (PDF).
86. Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory". In D. E. Rumelhart; J. L. McClelland; PDP Research Group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. pp. 194–281. ISBN 9780262680530.
87. Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation. 18 (7): 1527–1554. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
88. Hinton, Geoffrey (2009-05-31). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947. ISSN 1941-6016.
89. Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning". arXiv:1112.6209 [cs.LG].
90. S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen". Archived 2015-03-06 at the Wayback Machine. Diploma thesis, Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber, 1991.
91. Hochreiter, S.; et al. (15 January 2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kolen, John F.; Kremer, Stefan C. (eds.). A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons. ISBN 978-0-7803-5369-5.
92. Hochreiter, Sepp; Schmidhuber, Jurgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
93. Gers, Felix; Schmidhuber, Jurgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
94. Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jurgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
95. Srivastava, Rupesh K.; Greff, Klaus; Schmidhuber, Juergen (2015). "Training Very Deep Networks". Advances in Neural Information Processing Systems. Curran Associates, Inc. 28: 2377–2385.
96. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
97. Xavier Glorot; Antoine Bordes; Yoshua Bengio (2011). "Deep sparse rectifier neural networks" (PDF). AISTATS. "Rectifier and softplus activation functions. The second one is a smooth version of the first."
98. Mead, Carver A.; Ismail, Mohammed (8 May 1989). Analog VLSI Implementation of Neural Systems (PDF). The Kluwer International Series in Engineering and Computer Science. Vol. 80. Norwell, MA: Kluwer Academic Publishers. doi:10.1007/978-1-4613-1639-8. ISBN 978-1-4613-1639-8.
99. Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. (2008). "Memristive switching mechanism for metal/oxide/metal nanodevices". Nat. Nanotechnol. 3 (7): 429–433. doi:10.1038/nnano.2008.160. PMID 18654568.
100. Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. (2008). "The missing memristor found". Nature. 453 (7191): 80–83. Bibcode:2008Natur.453...80S. doi:10.1038/nature06932. PMID 18451858. S2CID 4367148.
101. Ciresan, Dan Claudiu; Meier, Ueli; Gambardella, Luca Maria; Schmidhuber, Jurgen (2010-09-21). "Deep, Big, Simple Neural Nets for Handwritten Digit Recognition". Neural Computation. 22 (12): 3207–3220. arXiv:1003.0358. doi:10.1162/neco_a_00052. ISSN 0899-7667. PMID 20858131. S2CID 1918673.
102. "2012 Kurzweil AI Interview". Archived 2018-08-31 at the Wayback Machine. With Jurgen Schmidhuber on the eight competitions won by his Deep Learning team 2009–2012.
103. "How bio-inspired deep learning keeps winning competitions | KurzweilAI". www.kurzweilai.net. Archived from the original on 2018-08-31. Retrieved 2017-06-16.
104. Graves, Alex; Schmidhuber, Jurgen. "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". In Advances in Neural Information Processing Systems 22 (NIPS'22), 7–10 December 2009, Vancouver, BC. Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.
105. Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (2009). "A Novel Connectionist System for Improved Unconstrained Handwriting Recognition" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868. CiteSeerX 10.1.1.139.4502. doi:10.1109/tpami.2008.137. PMID 19299860. S2CID 14635907.
106. Graves, Alex; Schmidhuber, Jurgen (2009). Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris; Culotta, Aron (eds.). "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks". Neural Information Processing Systems (NIPS) Foundation. Curran Associates, Inc. 21: 545–552.
107. Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (May 2009). "A Novel Connectionist System for Unconstrained Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868. CiteSeerX 10.1.1.139.4502. doi:10.1109/tpami.2008.137. ISSN 0162-8828. PMID 19299860. S2CID 14635907.
108. Ciresan, Dan; Meier, Ueli; Masci, Jonathan; Schmidhuber, Jurgen (August 2012). "Multi-column deep neural network for traffic sign classification". Neural Networks. Selected Papers from IJCNN 2011. 32: 333–338. CiteSeerX 10.1.1.226.8219. doi:10.1016/j.neunet.2012.02.023. PMID 22386783.
109. Ciresan, Dan; Giusti, Alessandro; Gambardella, Luca M.; Schmidhuber, Juergen (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q. (eds.). Advances in Neural Information Processing Systems 25 (PDF). Curran Associates, Inc. pp. 2843–2851.
110. Ciresan, D. C.; Meier, U.; Masci, J.; Gambardella, L. M.; Schmidhuber, J. (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). International Joint Conference on Artificial Intelligence. doi:10.5591/978-1-57735-516-8/IJCAI11-210.
111. Fukushima, K. (1980). "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position". Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608.
112. Riesenhuber, M.; Poggio, T. (1999). "Hierarchical models of object recognition in cortex". Nature Neuroscience. 2 (11): 1019–1025. doi:10.1038/14819. PMID 10526343. S2CID 8920227.
113. Markoff, John (November 23, 2012). "Scientists See Promise in Deep Learning Programs". New York Times.

External links Edit

"Lecun 20190711 ACM Tech Talk". Google Docs. Retrieved 2020-02-13.
