
Generative adversarial network

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI.[1][2] The concept was initially developed by Ian Goodfellow and his colleagues in June 2014.[3] In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

An illustration of how a GAN works.

Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics. Though originally proposed as a form of generative model for unsupervised learning, GANs have also proved useful for semi-supervised learning,[4] fully supervised learning,[5] and reinforcement learning.[6]

The core idea of a GAN is "indirect" training through the discriminator, another neural network that judges how "realistic" an input seems and that is itself updated dynamically.[7] The generator is therefore not trained to minimize the distance to a specific image, but rather to fool the discriminator. This enables the model to learn in an unsupervised manner.

GANs are similar to mimicry in evolutionary biology, with an evolutionary arms race between the two networks.

Definition

Mathematical

The original GAN is defined as the following game:[3]

Each probability space $(\Omega, \mu_{\text{ref}})$ defines a GAN game.

There are 2 players: generator and discriminator.

The generator's strategy set is $\mathcal{P}(\Omega)$, the set of all probability measures $\mu_G$ on $\Omega$.

The discriminator's strategy set is the set of Markov kernels $\mu_D : \Omega \to \mathcal{P}[0,1]$, where $\mathcal{P}[0,1]$ is the set of probability measures on $[0,1]$.

The GAN game is a zero-sum game, with objective function

$$L(\mu_G, \mu_D) := \mathbb{E}_{x \sim \mu_{\text{ref}},\, y \sim \mu_D(x)}[\ln y] + \mathbb{E}_{x \sim \mu_G,\, y \sim \mu_D(x)}[\ln(1 - y)].$$

The generator aims to minimize the objective, and the discriminator aims to maximize the objective.

The generator's task is to approach $\mu_G \approx \mu_{\text{ref}}$, that is, to match its own output distribution as closely as possible to the reference distribution. The discriminator's task is to output a value close to 1 when the input appears to be from the reference distribution, and to output a value close to 0 when the input looks like it came from the generator distribution.

In practice

The generative network generates candidates while the discriminative network evaluates them.[3] The contest operates in terms of data distributions. Typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution. The generative network's training objective is to increase the error rate of the discriminative network, i.e., to "fool" the discriminator by producing novel candidates that it judges to be part of the true data distribution rather than synthesized.[3][8]

A known dataset serves as the initial training data for the discriminator. Training involves presenting it with samples from the training dataset until it achieves acceptable accuracy. The generator is trained based on whether it succeeds in fooling the discriminator. Typically, the generator is seeded with randomized input that is sampled from a predefined latent space (e.g. a multivariate normal distribution). Thereafter, candidates synthesized by the generator are evaluated by the discriminator. Independent backpropagation procedures are applied to both networks so that the generator produces better samples, while the discriminator becomes more skilled at flagging synthetic samples.[9] When used for image generation, the generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network.
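The alternating updates described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the procedure of any particular paper: the reference data are drawn from N(3, 1), the generator is the affine map G(z) = a*z + b, the discriminator is the logistic unit D(x) = sigmoid(w*x + c), and the gradients are derived by hand. The function name `train_toy_gan` and all hyperparameters are hypothetical; practical GANs use deep networks and automatic differentiation.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_toy_gan(steps=500, batch=256, lr_d=0.05, lr_g=0.05, seed=0):
    """Alternate one discriminator step and one generator step per iteration."""
    rng = np.random.default_rng(seed)
    a, b = 1.0, 0.0   # generator G(z) = a*z + b (assumed toy form)
    w, c = 0.0, 0.0   # discriminator D(x) = sigmoid(w*x + c) (assumed toy form)
    for _ in range(steps):
        x_real = rng.normal(3.0, 1.0, size=batch)   # training samples
        z = rng.normal(0.0, 1.0, size=batch)        # latent samples
        x_fake = a * z + b                          # generator candidates

        # Discriminator: gradient ascent on E[ln D(x_real)] + E[ln(1 - D(x_fake))].
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = np.mean((1.0 - d_real) * x_real) - np.mean(d_fake * x_fake)
        grad_c = np.mean(1.0 - d_real) - np.mean(d_fake)
        w += lr_d * grad_w
        c += lr_d * grad_c

        # Generator: gradient descent on E[ln(1 - D(G(z)))].
        d_fake = sigmoid(w * x_fake + c)
        grad_x = -d_fake * w   # d/dx of ln(1 - sigmoid(w*x + c))
        a -= lr_g * np.mean(grad_x * z)
        b -= lr_g * np.mean(grad_x)
    return a, b, w, c

a, b, w, c = train_toy_gan()
print(f"generator: a={a:.3f}, b={b:.3f}; discriminator: w={w:.3f}, c={c:.3f}")
```

In this toy setting the generator offset b first drifts toward the reference mean, because the discriminator learns to score larger x as more "real"; the sketch makes no claim that the iteration converges.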

Relation to other statistical machine learning methods

GANs are implicit generative models,[10] which means that they neither explicitly model the likelihood function nor provide a means for finding the latent variable corresponding to a given sample, unlike alternatives such as flow-based generative models.

 
Main types of deep generative models that perform maximum likelihood estimation[11]

Compared to fully visible belief networks such as WaveNet and PixelRNN and autoregressive models in general, GANs can generate one complete sample in one pass, rather than multiple passes through the network.

Compared to Boltzmann machines and nonlinear ICA, there is no restriction on the type of function used by the network.

Since neural networks are universal approximators, GANs are asymptotically consistent. Variational autoencoders might be universal approximators, but it is not proven as of 2017.[11]

Mathematical properties

Measure-theoretic considerations

This section provides some of the mathematical theory behind these methods.

In modern probability theory based on measure theory, a probability space also needs to be equipped with a σ-algebra. As a result, a more rigorous definition of the GAN game would make the following changes:

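Each probability space $(\Omega, \mathcal{B}, \mu_{\text{ref}})$ defines a GAN game.

The generator's strategy set is $\mathcal{P}(\Omega, \mathcal{B})$, the set of all probability measures $\mu_G$ on the measure-space $(\Omega, \mathcal{B})$.

The discriminator's strategy set is the set of Markov kernels $\mu_D : (\Omega, \mathcal{B}) \to \mathcal{P}([0,1], \mathcal{B}([0,1]))$, where $\mathcal{B}([0,1])$ is the Borel σ-algebra on $[0,1]$.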

Since issues of measurability never arise in practice, these will not concern us further.

Choice of the strategy set

In the most generic version of the GAN game described above, the strategy set for the discriminator contains all Markov kernels $\mu_D : \Omega \to \mathcal{P}[0,1]$, and the strategy set for the generator contains arbitrary probability distributions $\mu_G$ on $\Omega$.

However, as shown below, the optimal discriminator strategy against any $\mu_G$ is deterministic, so there is no loss of generality in restricting the discriminator's strategies to deterministic functions $D : \Omega \to [0,1]$. In most applications, $D$ is a deep neural network function.

As for the generator, while $\mu_G$ could theoretically be any computable probability distribution, in practice, it is usually implemented as a pushforward: $\mu_G = \mu_Z \circ G^{-1}$. That is, start with a random variable $z \sim \mu_Z$, where $\mu_Z$ is a probability distribution that is easy to compute (such as the uniform distribution, or the Gaussian distribution), then define a function $G : \Omega_Z \to \Omega$. Then the distribution $\mu_G$ is the distribution of $G(z)$.

Consequently, the generator's strategy is usually defined as just $G$, leaving $z \sim \mu_Z$ implicit. In this formalism, the GAN game objective is

$$L(G, D) := \mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \mathbb{E}_{z \sim \mu_Z}[\ln(1 - D(G(z)))].$$

Generative reparametrization

The GAN architecture has two main components. One is casting optimization into a game, of form $\min_G \max_D L(G, D)$, which is different from the usual kind of optimization, of form $\min_\theta L(\theta)$. The other is the decomposition of $\mu_G$ into $\mu_Z \circ G^{-1}$, which can be understood as a reparametrization trick.

To see its significance, one must compare GAN with previous methods for learning generative models, which were plagued with "intractable probabilistic computations that arise in maximum likelihood estimation and related strategies".[3]

At the same time, Kingma and Welling[12] and Rezende et al.[13] developed the same idea of reparametrization into a general stochastic backpropagation method. Among its first applications was the variational autoencoder.
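One practical consequence of writing the generator as a pushforward is that the objective L(G, D) = E[ln D(x)] + E[ln(1 - D(G(z)))] can be estimated by plain sampling, without ever evaluating a density. The sketch below illustrates this under assumptions that are not in the text: a one-dimensional reference N(2, 1), a shift map as generator, and hypothetical helper names (`gan_objective`, `D_half`, `D_logistic`).

```python
import numpy as np

def gan_objective(D, G, x_ref, z):
    """Monte Carlo estimate of E[ln D(x)] + E[ln(1 - D(G(z)))]."""
    return np.mean(np.log(D(x_ref))) + np.mean(np.log1p(-D(G(z))))

rng = np.random.default_rng(0)
x_ref = rng.normal(2.0, 1.0, size=10_000)  # samples from a toy reference distribution
z = rng.normal(0.0, 1.0, size=10_000)      # samples from the latent distribution

G = lambda z: z + 2.0                       # pushes N(0, 1) forward to N(2, 1)
D_half = lambda x: np.full_like(x, 0.5)     # discriminator that always answers 1/2
D_logistic = lambda x: 1.0 / (1.0 + np.exp(-(x - 2.0)))  # an arbitrary alternative

value_half = gan_objective(D_half, G, x_ref, z)
value_logistic = gan_objective(D_logistic, G, x_ref, z)
print(value_half, value_logistic)
```

With the constant discriminator the estimate equals -2 ln 2 regardless of the samples, since both expectations reduce to ln(1/2).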

Move order and strategic equilibria

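In the original paper, as well as most subsequent papers, it is usually assumed that the generator moves first, and the discriminator moves second, thus giving the following minimax game:

$$\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D) := \mathbb{E}_{x \sim \mu_{\text{ref}},\, y \sim \mu_D(x)}[\ln y] + \mathbb{E}_{x \sim \mu_G,\, y \sim \mu_D(x)}[\ln(1 - y)].$$

If both the generator's and the discriminator's strategy sets are spanned by a finite number of strategies, then by the minimax theorem,

$$\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D) = \max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D),$$
that is, the move order does not matter.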

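However, since neither strategy set is finitely spanned, the minimax theorem does not apply, and the idea of an "equilibrium" becomes delicate. To wit, there are the following different concepts of equilibrium:

  • Equilibrium when generator moves first, and discriminator moves second:
    $\hat\mu_G \in \arg\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D), \quad \hat\mu_D \in \arg\max_{\mu_D} L(\hat\mu_G, \mu_D)$
  • Equilibrium when discriminator moves first, and generator moves second:
    $\hat\mu_D \in \arg\max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D), \quad \hat\mu_G \in \arg\min_{\mu_G} L(\mu_G, \hat\mu_D)$
  • Nash equilibrium $(\hat\mu_D, \hat\mu_G)$, which is stable under simultaneous move order:
    $\hat\mu_D \in \arg\max_{\mu_D} L(\hat\mu_G, \mu_D), \quad \hat\mu_G \in \arg\min_{\mu_G} L(\mu_G, \hat\mu_D)$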

For general games, these equilibria do not have to agree, or even to exist. For the original GAN game, these equilibria all exist, and are all equal. However, for more general GAN games, these do not necessarily exist, or agree.[14]
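A small matrix game shows why move order can matter. The example below is a hedged illustration using matching pennies, a standard toy game that is not a GAN: with pure strategies the min-max and max-min values differ, while uniform mixing makes every reply equally good, which is the situation the minimax theorem covers. The row player stands in for the minimizing generator and the column player for the maximizing discriminator.

```python
import numpy as np

# payoff[i, j]: value when the minimizer plays row i and the maximizer plays column j.
payoff = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])

# Minimizer commits first: for each row the maximizer picks the best column.
min_max = payoff.max(axis=1).min()
# Maximizer commits first: for each column the minimizer picks the best row.
max_min = payoff.min(axis=0).max()

# Uniform mixed strategies: expected payoff against each pure reply.
uniform = np.array([0.5, 0.5])
mixed_vs_cols = uniform @ payoff   # minimizer mixes, maximizer replies with a column
mixed_vs_rows = payoff @ uniform   # maximizer mixes, minimizer replies with a row

print(min_max, max_min, mixed_vs_cols, mixed_vs_rows)
```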

Main theorems for GAN game

The original GAN paper proved the following two theorems:[3]

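Theorem (the optimal discriminator computes the Jensen–Shannon divergence). For any fixed generator strategy $\mu_G$, let the optimal reply be $D^* = \arg\max_D L(\mu_G, D)$, then

$$D^*(x) = \frac{d\mu_{\text{ref}}}{d(\mu_{\text{ref}} + \mu_G)}, \qquad L(\mu_G, D^*) = 2 D_{JS}(\mu_{\text{ref}}; \mu_G) - 2\ln 2,$$

where the derivative is the Radon–Nikodym derivative, and $D_{JS}$ is the Jensen–Shannon divergence.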

Proof

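By Jensen's inequality,

$$\mathbb{E}_{x \sim \mu_{\text{ref}},\, y \sim \mu_D(x)}[\ln y] \le \mathbb{E}_{x \sim \mu_{\text{ref}}}\left[\ln \mathbb{E}_{y \sim \mu_D(x)}[y]\right]$$
and similarly for the other term. Therefore, the optimal reply can be deterministic, i.e. $\mu_D(x) = \delta_{D(x)}$ for some function $D : \Omega \to [0,1]$, in which case
$$L(\mu_G, \mu_D) = \mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))].$$

To define suitable density functions, we define a base measure $\mu := \mu_{\text{ref}} + \mu_G$, which allows us to take the Radon–Nikodym derivatives

$$\rho_{\text{ref}} = \frac{d\mu_{\text{ref}}}{d\mu}, \qquad \rho_G = \frac{d\mu_G}{d\mu}$$
with $\rho_{\text{ref}} + \rho_G = 1$.

We then have

$$L(\mu_G, \mu_D) = \int \mu(dx) \left[\rho_{\text{ref}}(x) \ln D(x) + \rho_G(x) \ln(1 - D(x))\right].$$

The integrand is just the negative cross-entropy between two Bernoulli random variables with parameters $\rho_{\text{ref}}(x)$ and $D(x)$. We can write this as $-H(\rho_{\text{ref}}(x)) - D_{KL}(\rho_{\text{ref}}(x) \,\|\, D(x))$, where $H$ is the binary entropy function, so

$$L(\mu_G, \mu_D) = -\int \mu(dx) \left(H(\rho_{\text{ref}}(x)) + D_{KL}(\rho_{\text{ref}}(x) \,\|\, D(x))\right).$$

This means that the optimal strategy for the discriminator is $D^*(x) = \rho_{\text{ref}}(x)$, with

$$L(\mu_G, D^*) = -\int \mu(dx)\, H(\rho_{\text{ref}}(x)) = 2 D_{JS}(\mu_{\text{ref}}; \mu_G) - 2\ln 2$$

after routine calculation.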
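The statement can be checked numerically on a finite sample space, where measures are probability vectors and the Radon–Nikodym derivative is an elementwise ratio. The sketch below is only a sanity check under that finite-space assumption; the helper names `objective` and `js_divergence` are hypothetical.

```python
import numpy as np

def objective(p_ref, p_g, d):
    """L(mu_G, D) on a finite space with a deterministic discriminator d."""
    return np.sum(p_ref * np.log(d)) + np.sum(p_g * np.log(1.0 - d))

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats."""
    m = 0.5 * (p + q)
    kl = lambda u, v: np.sum(u * np.log(u / v))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
p_ref = rng.dirichlet(np.ones(6))   # toy reference distribution
p_g = rng.dirichlet(np.ones(6))     # toy generator distribution

d_star = p_ref / (p_ref + p_g)      # claimed optimal discriminator
best = objective(p_ref, p_g, d_star)
target = 2.0 * js_divergence(p_ref, p_g) - 2.0 * np.log(2.0)

# Any other discriminator should score no higher than d_star.
d_other = rng.uniform(0.05, 0.95, size=6)
other = objective(p_ref, p_g, d_other)
print(best, target, other)
```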

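Interpretation: For any fixed generator strategy $\mu_G$, the optimal discriminator keeps track of the likelihood ratio between the reference distribution and the generator distribution:

$$D(x) = \frac{\mu_{\text{ref}}(dx)}{\mu_{\text{ref}}(dx) + \mu_G(dx)} = \sigma\left(\ln \mu_{\text{ref}}(dx) - \ln \mu_G(dx)\right),$$
where $\sigma$ is the logistic function. In particular, if the prior probability for an image $x$ to come from the reference distribution is equal to $\frac{1}{2}$, then $D(x)$ is just the posterior probability that $x$ came from the reference distribution:
$$D(x) = \Pr(x \text{ came from the reference distribution} \mid x).$$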

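Theorem (the unique equilibrium point). For any GAN game, there exists a pair $(\hat\mu_D, \hat\mu_G)$ that is both a sequential equilibrium and a Nash equilibrium:

$$\begin{aligned}
& L(\hat\mu_G, \hat\mu_D) = \min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D) = \max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D) = -2\ln 2, \\
& \hat\mu_D \in \arg\max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D), \quad \hat\mu_G \in \arg\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D), \\
& \hat\mu_D \in \arg\max_{\mu_D} L(\hat\mu_G, \mu_D), \quad \hat\mu_G \in \arg\min_{\mu_G} L(\mu_G, \hat\mu_D), \\
& \forall x \in \Omega,\ \hat\mu_D(x) = \delta_{1/2}, \quad \hat\mu_G = \mu_{\text{ref}}.
\end{aligned}$$

That is, the generator perfectly mimics the reference, and the discriminator outputs $\frac{1}{2}$ deterministically on all inputs.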

Proof

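From the previous proposition,

$$\arg\min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D) = \mu_{\text{ref}}; \qquad \min_{\mu_G} \max_{\mu_D} L(\mu_G, \mu_D) = -2\ln 2.$$

For any fixed discriminator strategy $\mu_D$, any $\mu_G$ concentrated on the set

$$\left\{x \;\middle|\; \mathbb{E}_{y \sim \mu_D(x)}[\ln(1 - y)] = \inf_{x'} \mathbb{E}_{y \sim \mu_D(x')}[\ln(1 - y)]\right\}$$
is an optimal strategy for the generator. Thus,
$$\arg\max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D) = \arg\max_{\mu_D} \left(\mathbb{E}_{x \sim \mu_{\text{ref}},\, y \sim \mu_D(x)}[\ln y] + \inf_x \mathbb{E}_{y \sim \mu_D(x)}[\ln(1 - y)]\right).$$

By Jensen's inequality, the discriminator can only improve by adopting the deterministic strategy of always playing $D(x) = \mathbb{E}_{y \sim \mu_D(x)}[y]$. Therefore,

$$\arg\max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D) = \arg\max_D \left(\mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \inf_x \ln(1 - D(x))\right).$$

By Jensen's inequality,

$$\mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \inf_x \ln(1 - D(x)) \le \ln \mathbb{E}_{x \sim \mu_{\text{ref}}}[D(x)] + \ln\left(1 - \sup_x D(x)\right) \le \ln\left[\sup_x D(x) \left(1 - \sup_x D(x)\right)\right] \le \ln \frac{1}{4},$$

with equality if $D(x) = \frac{1}{2}$, so

$$\forall x \in \Omega,\ \hat\mu_D(x) = \delta_{1/2}; \qquad \max_{\mu_D} \min_{\mu_G} L(\mu_G, \mu_D) = -2\ln 2.$$

Finally, to check that this is a Nash equilibrium, note that when $\mu_G = \mu_{\text{ref}}$, we have

$$L(\mu_G, \mu_D) = \mathbb{E}_{x \sim \mu_{\text{ref}},\, y \sim \mu_D(x)}[\ln(y(1 - y))],$$
which is always maximized by $y = \frac{1}{2}$.

When $\forall x \in \Omega,\ \mu_D(x) = \delta_{1/2}$, any strategy is optimal for the generator.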

Training and evaluating GAN

Training

Unstable convergence

While the GAN game has a unique global equilibrium point when both the generator and discriminator have access to their entire strategy sets, the equilibrium is no longer guaranteed when they have a restricted strategy set.[14]

In practice, the generator has access only to measures of form $\mu_Z \circ G_\theta^{-1}$, where $G_\theta$ is a function computed by a neural network with parameters $\theta$, and $\mu_Z$ is an easily sampled distribution, such as the uniform or normal distribution. Similarly, the discriminator has access only to functions of form $D_\zeta$, a function computed by a neural network with parameters $\zeta$. These restricted strategy sets take up a vanishingly small proportion of their entire strategy sets.[15]

Further, even if an equilibrium still exists, it can only be found by searching in the high-dimensional space of all possible neural network functions. The standard strategy of using gradient descent to find the equilibrium often does not work for GAN, and often the game "collapses" into one of several failure modes. To improve the convergence stability, some training strategies start with an easier task, such as generating low-resolution images[16] or simple images (one object with uniform background),[17] and gradually increase the difficulty of the task during training. This essentially translates to applying a curriculum learning scheme.[18]
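Why plain gradient steps can fail to find an equilibrium is easiest to see on a toy problem. The sketch below uses the bilinear game f(x, y) = x*y, a standard illustrative example that is not the GAN objective itself: x descends, y ascends, and the simultaneous updates spiral away from the equilibrium at the origin. The function name `simultaneous_gda` is hypothetical.

```python
import numpy as np

def simultaneous_gda(x, y, lr=0.1, steps=100):
    """Simultaneous gradient descent (on x) and ascent (on y) for f(x, y) = x * y."""
    radii = [np.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x                 # df/dx = y, df/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        radii.append(np.hypot(x, y))          # distance from the equilibrium (0, 0)
    return np.array(radii)

radii = simultaneous_gda(1.0, 1.0)
print(radii[0], radii[-1])
```

Each step multiplies the distance from the origin by sqrt(1 + lr^2), so the iterates move steadily outward instead of converging.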

Mode collapse

GANs often suffer from mode collapse where they fail to generalize properly, missing entire modes from the input data. For example, a GAN trained on the MNIST dataset containing many samples of each digit might only generate pictures of digit 0. This was named in the first paper as the "Helvetica scenario".

One way this can happen is if the generator learns too fast compared to the discriminator. If the discriminator $D$ is held constant, then the optimal generator would only output elements of $\arg\max_x D(x)$.[19] So for example, if during GAN training for generating MNIST dataset, for a few epochs, the discriminator somehow prefers the digit 0 slightly more than other digits, the generator may seize the opportunity to generate only digit 0, then be unable to escape the local minimum after the discriminator improves.
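The best-response argument can be made concrete with ten abstract "digit modes". The sketch below is a toy illustration with made-up discriminator scores (digit 0 scored slightly higher than the rest); it compares the generator loss E[ln(1 - D(x))] for a generator spread uniformly over the modes with one concentrated on the highest-scoring mode.

```python
import numpy as np

# Hypothetical fixed discriminator scores for the ten digit modes.
d_scores = np.full(10, 0.50)
d_scores[0] = 0.55   # the discriminator slightly prefers digit 0

def generator_loss(mu_g, d):
    """E_{x ~ mu_g}[ln(1 - D(x))], which the generator tries to minimize."""
    return np.sum(mu_g * np.log(1.0 - d))

uniform = np.full(10, 0.1)                  # covers every mode equally
collapsed = np.zeros(10)
collapsed[np.argmax(d_scores)] = 1.0        # all mass on the preferred mode

uniform_loss = generator_loss(uniform, d_scores)
collapsed_loss = generator_loss(collapsed, d_scores)
print(uniform_loss, collapsed_loss)
```

Against a frozen discriminator the collapsed generator achieves the lower loss, even though it ignores nine of the ten modes.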

Some researchers perceive the root problem to be a weak discriminative network that fails to notice the pattern of omission, while others assign blame to a bad choice of objective function. Many solutions have been proposed, but it is still an open problem.[20][21]

Even the state-of-the-art architecture, BigGAN (2019), could not avoid mode collapse. The authors resorted to "allowing collapse to occur at the later stages of training, by which time a model is sufficiently trained to achieve good results".[22]

Two time-scale update rule

The two time-scale update rule (TTUR) was proposed to make GAN convergence more stable by making the learning rate of the generator lower than that of the discriminator. The authors argued that the generator should move slower than the discriminator, so that it does not "drive the discriminator steadily into new regions without capturing its gathered information".

They proved that a general class of games that included the GAN game, when trained under TTUR, "converges under mild assumptions to a stationary local Nash equilibrium".[23]

They also proposed using the Adam stochastic optimization[24] to avoid mode collapse, as well as the Fréchet inception distance for evaluating GAN performances.

Vanishing gradient

Conversely, if the discriminator learns too fast compared to the generator, then the discriminator could almost perfectly distinguish $\mu_{G_\theta}$ from $\mu_{\text{ref}}$. In such a case, the generator $G_\theta$ could be stuck with a very high loss no matter which direction it changes its $\theta$, meaning that the gradient $\nabla_\theta L(G_\theta, D_\zeta)$ would be close to zero. The generator then cannot learn, a case of the vanishing gradient problem.[15]

Intuitively speaking, the discriminator is too good, and since the generator cannot take any small step (only small steps are considered in gradient descent) to improve its payoff, it does not even try.

One important method for solving this problem is the Wasserstein GAN.

Evaluation

GANs are usually evaluated by Inception score (IS), which measures how varied the generator's outputs are (as classified by an image classifier, usually Inception-v3), or Fréchet inception distance (FID), which measures how similar the generator's outputs are to a reference set (as classified by a learned image featurizer, such as Inception-v3 without its final layer). Many papers that propose new GAN architectures for image generation report how their architectures break the state of the art on FID or IS.
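The FID is commonly computed as the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images: the squared distance between the means plus Tr(C1 + C2 - 2 (C1 C2)^(1/2)). The sketch below shows only that final distance computation. It assumes the features have already been extracted; random arrays stand in for Inception-v3 activations, and the helper name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov1 @ cov2, which are real and nonnegative for covariance matrices.
    eigenvalues = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigenvalues.real, 0.0, None)))
    return diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt

rng = np.random.default_rng(0)
real_features = rng.normal(0.0, 1.0, size=(500, 4))   # stand-in for real-image features
fake_features = rng.normal(0.5, 1.2, size=(500, 4))   # stand-in for generated-image features

stats = lambda f: (f.mean(axis=0), np.cov(f, rowvar=False))
fid_like = frechet_distance(*stats(real_features), *stats(fake_features))
self_distance = frechet_distance(*stats(real_features), *stats(real_features))
print(fid_like, self_distance)
```

Lower values indicate that the two feature distributions are closer; comparing a set with itself gives a distance of numerically zero.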

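Another evaluation method is the Learned Perceptual Image Patch Similarity (LPIPS), which starts with a learned image featurizer $f_\theta : \text{Image} \to \mathbb{R}^n$, and finetunes it by supervised learning on a set of $(x, x', \operatorname{perceptual\ difference}(x, x'))$, where $x$ is an image, $x'$ is a perturbed version of it, and $\operatorname{perceptual\ difference}(x, x')$ is how much they differ, as reported by human subjects. The model is finetuned so that it can approximate $\|f_\theta(x) - f_\theta(x')\| \approx \operatorname{perceptual\ difference}(x, x')$. This finetuned model is then used to define $\operatorname{LPIPS}(x, x') := \|f_\theta(x) - f_\theta(x')\|$.[25]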

Other evaluation methods are reviewed in [26].

Variants

There is a veritable zoo of GAN variants.[27] Some of the most prominent are as follows:

Conditional GAN

Conditional GANs are similar to standard GANs except they allow the model to conditionally generate samples based on additional information. For example, if we want to generate a cat face given a dog picture, we could use a conditional GAN.

The generator in a GAN game generates $\mu_G$, a probability distribution on the probability space $\Omega$. This leads to the idea of a conditional GAN, where instead of generating one probability distribution on $\Omega$, the generator generates a different probability distribution $\mu_G(c)$ on $\Omega$, for each given class label $c$.

For example, for generating images that look like ImageNet, the generator should be able to generate a picture of cat when given the class label "cat".

In the original paper,[3] the authors noted that GAN can be trivially extended to conditional GAN by providing the labels to both the generator and the discriminator.

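Concretely, the conditional GAN game is just the GAN game with class labels provided:

$$L(\mu_G, D) := \mathbb{E}_{c \sim \mu_C,\, x \sim \mu_{\text{ref}}(c)}[\ln D(x, c)] + \mathbb{E}_{c \sim \mu_C,\, x \sim \mu_G(c)}[\ln(1 - D(x, c))],$$
where $\mu_C$ is a probability distribution over classes, $\mu_{\text{ref}}(c)$ is the probability distribution of real images of class $c$, and $\mu_G(c)$ the probability distribution of images generated by the generator when given class label $c$.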

In 2017, a conditional GAN learned to generate 1000 image classes of ImageNet.[28]

GANs with alternative architectures

The GAN game is a general framework and can be run with any reasonable parametrization of the generator $G$ and discriminator $D$. In the original paper, the authors demonstrated it using multilayer perceptron networks and convolutional neural networks. Many alternative architectures have been tried.

Deep convolutional GAN (DCGAN):[29] For both generator and discriminator, uses only deep networks consisting entirely of convolution-deconvolution layers, that is, fully convolutional networks.[30]

Self-attention GAN (SAGAN):[31] Starts with the DCGAN, then adds residually-connected standard self-attention modules to the generator and discriminator.

Variational autoencoder GAN (VAEGAN):[32] Uses a variational autoencoder (VAE) for the generator.

Transformer GAN (TransGAN):[33] Uses the pure transformer architecture for both the generator and discriminator, entirely devoid of convolution-deconvolution layers.

Flow-GAN:[34] Uses a flow-based generative model for the generator, allowing efficient computation of the likelihood function.

GANs with alternative objectives

Many GAN variants are merely obtained by changing the loss functions for the generator and discriminator.

Original GAN:

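We recast the original GAN objective into a form more convenient for comparison, with each player minimizing its own loss:

$$\begin{cases}
\min_D L_D(D, \mu_G) = -\mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] - \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))] \\
\min_G L_G(D, \mu_G) = \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))]
\end{cases}$$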

Original GAN, non-saturating loss:

This objective for the generator was recommended in the original paper for faster convergence.[3]

$$\min_G L_G(D, \mu_G) = -\mathbb{E}_{x \sim \mu_G}[\ln D(x)]$$
The effect of using this objective is analyzed in Section 2.2.2 of Arjovsky et al.[35]
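The difference between the two generator objectives shows up in their gradients when the discriminator confidently rejects a generated sample. The sketch below is a hedged illustration: it writes the discriminator output as sigmoid(s) for a logit s (an assumed parametrization) and compares the derivative with respect to s of the saturating loss ln(1 - sigmoid(s)) with that of the non-saturating loss -ln(sigmoid(s)).

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad_saturating(s):
    """d/ds of ln(1 - sigmoid(s)), the original generator loss."""
    return -sigmoid(s)

def grad_non_saturating(s):
    """d/ds of -ln(sigmoid(s)), the non-saturating generator loss."""
    return sigmoid(s) - 1.0

# s = -10 means the discriminator is almost certain the sample is generated.
for s in (-10.0, 0.0, 10.0):
    print(s, grad_saturating(s), grad_non_saturating(s))
```

At s = -10 the saturating gradient is nearly zero, while the non-saturating gradient stays close to -1, so the generator still receives a learning signal.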

Original GAN, maximum likelihood:

$$\min_G L_G(D, \mu_G) = -\mathbb{E}_{x \sim \mu_G}\left[(\exp \circ\, \sigma^{-1} \circ D)(x)\right],$$
where $\sigma$ is the logistic function. When the discriminator is optimal, the generator gradient is the same as in maximum likelihood estimation, even though GAN cannot perform maximum likelihood estimation itself.[36][37]

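Hinge loss GAN:[38]

$$L_D = -\mathbb{E}_{x \sim \mu_{\text{ref}}}[\min(0, -1 + D(x))] - \mathbb{E}_{x \sim \mu_G}[\min(0, -1 - D(x))]$$
$$L_G = -\mathbb{E}_{x \sim \mu_G}[D(x)]$$
Least squares GAN:[39]
$$L_D = \mathbb{E}_{x \sim \mu_{\text{ref}}}[(D(x) - b)^2] + \mathbb{E}_{x \sim \mu_G}[(D(x) - a)^2]$$
$$L_G = \mathbb{E}_{x \sim \mu_G}[(D(x) - c)^2]$$
where $a, b, c$ are parameters to be chosen. The authors recommended $a = -1, b = 1, c = 0$.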
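As a hedged sketch, the hinge and least-squares objectives can be written as plain functions of the discriminator's raw outputs on real and generated batches. The forms used here are the commonly cited ones: hinge discriminator loss E[max(0, 1 - D(x_real))] + E[max(0, 1 + D(x_fake))] with generator loss -E[D(x_fake)], and least-squares losses with targets a for generated samples, b for real samples, and c for the generator. The function names are hypothetical.

```python
import numpy as np

def hinge_d_loss(d_real, d_fake):
    """Discriminator hinge loss on raw (unbounded) scores."""
    return np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake))

def hinge_g_loss(d_fake):
    """Generator hinge loss: raise the scores of generated samples."""
    return -np.mean(d_fake)

def lsgan_d_loss(d_real, d_fake, a=-1.0, b=1.0):
    """Least-squares discriminator loss with targets b (real) and a (generated)."""
    return np.mean((d_real - b) ** 2) + np.mean((d_fake - a) ** 2)

def lsgan_g_loss(d_fake, c=0.0):
    """Least-squares generator loss with target c."""
    return np.mean((d_fake - c) ** 2)

d_real = np.array([2.0, 0.5])    # example scores on real samples
d_fake = np.array([-2.0, 0.0])   # example scores on generated samples
print(hinge_d_loss(d_real, d_fake), hinge_g_loss(d_fake))
print(lsgan_d_loss(d_real, d_fake), lsgan_g_loss(d_fake))
```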

Wasserstein GAN (WGAN)

The Wasserstein GAN modifies the GAN game at two points:

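  • The discriminator's strategy set is the set of measurable functions of type $D : \Omega \to \mathbb{R}$ with bounded Lipschitz norm: $\|D\|_L \le K$, where $K$ is a fixed positive constant.
  • The objective is
    $L_{WGAN}(\mu_G, D) := \mathbb{E}_{x \sim \mu_{\text{ref}}}[D(x)] - \mathbb{E}_{x \sim \mu_G}[D(x)]$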

One of its purposes is to solve the problem of mode collapse (see above).[15] The authors claim "In no experiment did we see evidence of mode collapse for the WGAN algorithm".

GANs with more than 2 players

Adversarial autoencoder

An adversarial autoencoder (AAE)[40] is more autoencoder than GAN. The idea is to start with a plain autoencoder, but train a discriminator to discriminate the latent vectors from a reference distribution (often the normal distribution).

InfoGAN

In conditional GAN, the generator receives both a noise vector $z$ and a label $c$, and produces an image $G(z, c)$. The discriminator receives image-label pairs $(x, c)$, and computes $D(x, c)$.

When the training dataset is unlabeled, conditional GAN does not work directly.

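The idea of InfoGAN is to decree that every latent vector in the latent space can be decomposed as $(z, c)$: an incompressible noise part $z$, and an informative label part $c$, and to encourage the generator to comply with the decree by encouraging it to maximize $I(c, G(z, c))$, the mutual information between $c$ and $G(z, c)$, while making no demands on the mutual information between $z$ and $G(z, c)$.

Unfortunately, $I(c, G(z, c))$ is intractable in general. The key idea of InfoGAN is Variational Mutual Information Maximization:[41] indirectly maximize it by maximizing a lower bound

$$\hat I(G, Q) = \mathbb{E}_{z \sim \mu_Z,\, c \sim \mu_C}[\ln Q(c \mid G(z, c))]; \qquad I(c, G(z, c)) \ge H(c) + \sup_Q \hat I(G, Q),$$
where $H(c)$ is the entropy of the label distribution, which does not depend on $G$ or $Q$, and $Q$ ranges over all Markov kernels of type $Q : \Omega_X \to \mathcal{P}(\Omega_C)$.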

The InfoGAN game is defined as follows:[42]

Three probability spaces define an InfoGAN game:

  • $(\Omega_X, \mu_{\text{ref}})$, the space of reference images.
  • $(\Omega_Z, \mu_Z)$, the fixed random noise generator.
  • $(\Omega_C, \mu_C)$, the fixed random information generator.

There are 3 players in 2 teams: generator, Q, and discriminator. The generator and Q are on one team, and the discriminator on the other team.

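The objective function is

$$L(G, Q, D) = L_{GAN}(G, D) - \lambda \hat I(G, Q),$$
where $L_{GAN}(G, D) = \mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] + \mathbb{E}_{z \sim \mu_Z,\, c \sim \mu_C}[\ln(1 - D(G(z, c)))]$ is the original GAN game objective, and $\hat I(G, Q) = \mathbb{E}_{z \sim \mu_Z,\, c \sim \mu_C}[\ln Q(c \mid G(z, c))]$.

Generator-Q team aims to minimize the objective, and discriminator aims to maximize it:

$$\min_{G, Q} \max_D L(G, Q, D)$$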

Bidirectional GAN (BiGAN)

The standard GAN generator is a function of type $G : \Omega_Z \to \Omega_X$, that is, it is a mapping from a latent space $\Omega_Z$ to the image space $\Omega_X$. This can be understood as a "decoding" process, whereby every latent vector $z \in \Omega_Z$ is a code for an image $x \in \Omega_X$, and the generator performs the decoding. This naturally leads to the idea of training another network that performs "encoding", creating an autoencoder out of the encoder-generator pair.

Already in the original paper,[3] the authors noted that "Learned approximate inference can be performed by training an auxiliary network to predict $z$ given $x$". The bidirectional GAN architecture performs exactly this.[43]

The BiGAN is defined as follows:

Two probability spaces define a BiGAN game:

  • $(\Omega_X, \mu_X)$, the space of reference images.
  • $(\Omega_Z, \mu_Z)$, the latent space.

There are 3 players in 2 teams: generator, encoder, and discriminator. The generator and encoder are on one team, and the discriminator on the other team.

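The generator's strategies are functions $G : \Omega_Z \to \Omega_X$, and the encoder's strategies are functions $E : \Omega_X \to \Omega_Z$. The discriminator's strategies are functions $D : \Omega_X \times \Omega_Z \to [0, 1]$.

The objective function is

$$L(G, E, D) = \mathbb{E}_{x \sim \mu_X}[\ln D(x, E(x))] + \mathbb{E}_{z \sim \mu_Z}[\ln(1 - D(G(z), z))]$$

Generator-encoder team aims to minimize the objective, and discriminator aims to maximize it:

$$\min_{G, E} \max_D L(G, E, D)$$

In the paper, they gave a more abstract definition of the objective as:

$$L(G, E, D) = \mathbb{E}_{(x, z) \sim \mu_{E,X}}[\ln D(x, z)] + \mathbb{E}_{(x, z) \sim \mu_{G,Z}}[\ln(1 - D(x, z))],$$
where $\mu_{E,X}(dx, dz) = \mu_X(dx) \cdot \delta_{E(x)}(dz)$ is the probability distribution on $\Omega_X \times \Omega_Z$ obtained by pushing $\mu_X$ forward via $x \mapsto (x, E(x))$, and $\mu_{G,Z}(dx, dz) = \delta_{G(z)}(dx) \cdot \mu_Z(dz)$ is the probability distribution on $\Omega_X \times \Omega_Z$ obtained by pushing $\mu_Z$ forward via $z \mapsto (G(z), z)$.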

Applications of bidirectional models include semi-supervised learning,[44] interpretable machine learning,[45] and neural machine translation.[46]

CycleGAN

CycleGAN is an architecture for performing translations between two domains, such as between photos of horses and photos of zebras, or photos of night cities and photos of day cities.

The CycleGAN game is defined as follows:[47]

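There are two probability spaces $(\Omega_X, \mu_X), (\Omega_Y, \mu_Y)$, corresponding to the two domains needed for translations fore-and-back.

There are 4 players in 2 teams: generators $G_X : \Omega_X \to \Omega_Y,\ G_Y : \Omega_Y \to \Omega_X$, and discriminators $D_X : \Omega_X \to [0, 1],\ D_Y : \Omega_Y \to [0, 1]$.

The objective function is

$$L(G_X, G_Y, D_X, D_Y) = L_{GAN}(G_X, D_Y) + L_{GAN}(G_Y, D_X) + \lambda L_{cycle}(G_X, G_Y),$$

where $\lambda$ is a positive adjustable parameter, $L_{GAN}$ is the GAN game objective (each generator is paired with the discriminator of its output domain), and $L_{cycle}$ is the cycle consistency loss:

$$L_{cycle}(G_X, G_Y) = \mathbb{E}_{x \sim \mu_X}\|G_Y(G_X(x)) - x\| + \mathbb{E}_{y \sim \mu_Y}\|G_X(G_Y(y)) - y\|$$
The generators aim to minimize the objective, and the discriminators aim to maximize it:
$$\min_{G_X, G_Y} \max_{D_X, D_Y} L(G_X, G_Y, D_X, D_Y)$$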

Unlike previous work like pix2pix,[48] which requires paired training data, CycleGAN requires no paired data. For example, to train a pix2pix model to turn a summer scenery photo into a winter scenery photo and back, the dataset must contain pairs of the same place in summer and winter, shot at the same angle; CycleGAN would only need a set of summer scenery photos, and an unrelated set of winter scenery photos.
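The cycle consistency term needs only unpaired samples from the two domains: each sample is translated to the other domain and back, and the round trip is compared with the original. The sketch below is a toy illustration with scalar "images" and hand-written maps standing in for the two generators; the names `g_x`, `g_y_inverse`, and `g_y_wrong` are hypothetical, and the absolute difference is used as the norm.

```python
import numpy as np

def cycle_loss(g_x, g_y, xs, ys):
    """E|g_y(g_x(x)) - x| + E|g_x(g_y(y)) - y| over unpaired batches xs and ys."""
    forward_cycle = np.mean(np.abs(g_y(g_x(xs)) - xs))
    backward_cycle = np.mean(np.abs(g_x(g_y(ys)) - ys))
    return forward_cycle + backward_cycle

g_x = lambda x: 2.0 * x + 1.0            # toy translator from domain X to domain Y
g_y_inverse = lambda y: (y - 1.0) / 2.0  # exact inverse of g_x
g_y_wrong = lambda y: y / 2.0            # translator that is not cycle-consistent

rng = np.random.default_rng(0)
xs = rng.normal(size=100)   # unpaired samples from domain X
ys = rng.normal(size=100)   # unpaired samples from domain Y

consistent = cycle_loss(g_x, g_y_inverse, xs, ys)
inconsistent = cycle_loss(g_x, g_y_wrong, xs, ys)
print(consistent, inconsistent)
```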

GANs with particularly large or small scales

BigGAN

The BigGAN is essentially a self-attention GAN trained on a large scale (up to 80 million parameters) to generate large images of ImageNet (up to 512 x 512 resolution), with numerous engineering tricks to make it converge.[22][49]

Invertible data augmentation

When there is insufficient training data, the reference distribution $\mu_{\text{ref}}$ cannot be well-approximated by the empirical distribution given by the training dataset. In such cases, data augmentation can be applied, to allow training GAN on smaller datasets. Naïve data augmentation, however, brings its problems.

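Consider the original GAN game, slightly reformulated as follows:

$$\begin{cases}
\min_D L_D(D, \mu_G) = -\mathbb{E}_{x \sim \mu_{\text{ref}}}[\ln D(x)] - \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))] \\
\min_G L_G(D, \mu_G) = \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))]
\end{cases}$$
Now we use data augmentation by randomly sampling semantic-preserving transforms $T : \Omega \to \Omega$ and applying them to the dataset, to obtain the reformulated GAN game:
$$\begin{cases}
\min_D L_D(D, \mu_G) = -\mathbb{E}_{x \sim \mu_{\text{ref}},\, T \sim \mu_{\text{trans}}}[\ln D(T(x))] - \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))] \\
\min_G L_G(D, \mu_G) = \mathbb{E}_{x \sim \mu_G}[\ln(1 - D(x))]
\end{cases}$$
This is equivalent to a GAN game with a different distribution $\mu_{\text{ref}}'$, sampled by $T(x)$, with $x \sim \mu_{\text{ref}},\ T \sim \mu_{\text{trans}}$. For example, if $\mu_{\text{ref}}$ is the distribution of images in ImageNet, and $\mu_{\text{trans}}$ samples identity-transform with probability 0.5, and horizontal-reflection with probability 0.5, then $\mu_{\text{ref}}'$ is the distribution of images in ImageNet and horizontally-reflected ImageNet, combined.

The result of such training would be a generator that mimics $\mu_{\text{ref}}'$. For example, it would generate images that look like they are randomly cropped, if the data augmentation uses random cropping.

The solution is to apply data augmentation to both generated and real images:

$$\begin{cases}
\min_D L_D(D, \mu_G) = -\mathbb{E}_{x \sim \mu_{\text{ref}},\, T \sim \mu_{\text{trans}}}[\ln D(T(x))] - \mathbb{E}_{x \sim \mu_G,\, T \sim \mu_{\text{trans}}}[\ln(1 - D(T(x)))] \\
\min_G L_G(D, \mu_G) = \mathbb{E}_{x \sim \mu_G,\, T \sim \mu_{\text{trans}}}[\ln(1 - D(T(x)))]
\end{cases}$$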
The authors demonstrated high-quality generation using just 100-picture-large datasets.[50]

The StyleGAN-2-ADA paper points out a further point on data augmentation: it must be invertible.[51] Continue with the example of generating ImageNet pictures. If the data augmentation is "randomly rotate the picture by 0, 90, 180, 270 degrees with equal probability", then there is no way for the generator to know which is the true orientation: Consider two generators $G, G'$, such that for any latent $z$, the generated image $G(z)$ is a 90-degree rotation of $G'(z)$. They would have exactly the same expected loss, and so neither is preferred over the other.

The solution is to only use invertible data augmentation: instead of "randomly rotate the picture by 0, 90, 180, 270 degrees with equal probability", use "randomly rotate the picture by 90, 180, 270 degrees with 0.1 probability each, and keep the picture as it is with 0.7 probability". This way, the generator is still rewarded for keeping images oriented the same way as un-augmented ImageNet pictures.

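Abstractly, the effect of randomly sampling transformations $T : \Omega \to \Omega$ from the distribution $\mu_{\text{trans}}$ is to define a Markov kernel $K_{\text{trans}} : \Omega \to \mathcal{P}(\Omega)$. Then, the data-augmented GAN game pushes the generator to find some $\hat\mu_G \in \mathcal{P}(\Omega)$, such that

$$K_{\text{trans}} * \mu_{\text{ref}} = K_{\text{trans}} * \hat\mu_G,$$
where $*$ is the Markov kernel convolution. A data-augmentation method is defined to be invertible if its Markov kernel $K_{\text{trans}}$ satisfies
$$K_{\text{trans}} * \mu = K_{\text{trans}} * \mu' \implies \mu = \mu' \quad \forall \mu, \mu' \in \mathcal{P}(\Omega).$$
Immediately by definition, we see that composing multiple invertible data-augmentation methods results in yet another invertible method. Also by definition, if the data-augmentation method is invertible, then using it in a GAN game does not change the optimal strategy $\hat\mu_G$ for the generator, which is still $\mu_{\text{ref}}$.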

There are two prototypical examples of invertible Markov kernels:

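Discrete case: Invertible stochastic matrices, when $\Omega$ is finite.

For example, if $\Omega = \{\uparrow, \downarrow, \leftarrow, \rightarrow\}$ is the set of four images of an arrow, pointing in 4 directions, and the data augmentation is "randomly rotate the picture by 90, 180, 270 degrees with probability $p$ each, and keep the picture as it is with probability $(1 - 3p)$", then the Markov kernel $K_{\text{trans}}$ can be represented as a stochastic matrix:

$$[K_{\text{trans}}] = \begin{bmatrix} (1 - 3p) & p & p & p \\ p & (1 - 3p) & p & p \\ p & p & (1 - 3p) & p \\ p & p & p & (1 - 3p) \end{bmatrix}$$
and $K_{\text{trans}}$ is an invertible kernel iff $[K_{\text{trans}}]$ is an invertible matrix, that is, $p \neq 1/4$.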
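The four-arrow example can be checked directly. The sketch below builds the 4x4 stochastic matrix with off-diagonal entries p and diagonal entries 1 - 3p, and confirms that it is invertible for p = 0.1 but singular for p = 0.25, where every orientation becomes equally likely. The helper name `rotation_kernel` and the example distribution `mu` are assumptions for illustration.

```python
import numpy as np

def rotation_kernel(p):
    """Stochastic matrix: rotate by 90/180/270 degrees with probability p each."""
    return (1.0 - 4.0 * p) * np.eye(4) + p * np.ones((4, 4))

mu = np.array([0.7, 0.1, 0.1, 0.1])   # toy distribution over the four orientations

k_good = rotation_kernel(0.1)
augmented = k_good @ mu                       # distribution after augmentation
recovered = np.linalg.solve(k_good, augmented)  # invert the augmentation

det_good = np.linalg.det(k_good)
det_bad = np.linalg.det(rotation_kernel(0.25))  # uniform rotation loses orientation
print(det_good, det_bad, recovered)
```

With p = 0.1 the original distribution is recovered exactly from the augmented one, which is what allows the generator to learn the true orientation.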

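Continuous case: The gaussian kernel, when $\Omega = \mathbb{R}^n$ for some $n \ge 1$.

For example, if $\Omega = \mathbb{R}^{256^2}$ is the space of 256x256 images, and the data-augmentation method is "generate a gaussian noise $z \sim \mathcal{N}(0, I_{256^2})$, then add $\epsilon z$ to the image", then $K_{\text{trans}}$ is just convolution by the density function of $\mathcal{N}(0, \epsilon^2 I_{256^2})$. This is invertible, because convolution by a gaussian is just convolution by the heat kernel, so given any $\mu \in \mathcal{P}(\mathbb{R}^n)$, the convolved distribution $K_{\text{trans}} * \mu$ can be obtained by heating up $\mathbb{R}^n$ precisely according to $\mu$, then waiting for a time $t$ proportional to $\epsilon^2$. With that, we can recover $\mu$ by running the heat equation backwards in time for the same duration $t$.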

More examples of invertible data augmentations are found in the paper.[51]

SinGAN

SinGAN pushes data augmentation to the limit, by using only a single image as training data and performing data augmentation on it. The GAN architecture is adapted to this training method by using a multi-scale pipeline.

The generator $G$ is decomposed into a pyramid of generators $G = G_1 \circ G_2 \circ \cdots \circ G_N$, with the lowest one generating the image $G_N(z_N)$ at the lowest resolution, then the generated image is scaled up to $r(G_N(z_N))$, and fed to the next level to generate an image $G_{N-1}(z_{N-1} + r(G_N(z_N)))$ at a higher resolution, and so on. The discriminator is decomposed into a pyramid as well.[52]

StyleGAN series

The StyleGAN family is a series of architectures published by Nvidia's research division.

Progressive GAN

Progressive GAN[16] is a method for training GAN for large-scale image generation stably, by growing a GAN generator from small to large scale in a pyramidal fashion. Like SinGAN, it decomposes the generator as $G = G_1 \circ G_2 \circ \cdots \circ G_N$, and the discriminator as $D = D_1 \circ D_2 \circ \cdots \circ D_N$.

During training, at first only $G_N, D_1$ are used in a GAN game to generate 4x4 images. Then $G_{N-1}, D_2$ are added to reach the second stage of GAN game, to generate 8x8 images, and so on, until we reach a GAN game to generate 1024x1024 images.

To avoid shock between stages of the GAN game, each new layer is "blended in" (Figure 2 of the paper[16]). For example, this is how the second stage GAN game starts:

  • Just before, the GAN game consists of the pair $G_N, D_1$ generating and discriminating 4x4 images.
  • Just after, the GAN game consists of the pair $((1 - \alpha) + \alpha \cdot G_{N-1}) \circ u \circ G_N,\ D_1 \circ d \circ ((1 - \alpha) + \alpha \cdot D_2)$ generating and discriminating 8x8 images. Here, the functions $u, d$ are image up- and down-sampling functions, and $\alpha$ is a blend-in factor (much like an alpha in image composing) that smoothly glides from 0 to 1.
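The generator side of the blend-in can be sketched as a weighted sum of the plainly upsampled image and the output of the newly added layers. This is a minimal illustration under assumptions: nearest-neighbour upsampling stands in for the upsampling function, and `new_block` is a hypothetical placeholder for the new, still untrained higher-resolution layers.

```python
import numpy as np

def upsample(img):
    """Nearest-neighbour 2x upsampling of a 2-D image."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def blended_generator_output(low_res, new_block, alpha):
    """(1 - alpha) * upsampled image + alpha * new layers applied to it."""
    up = upsample(low_res)
    return (1.0 - alpha) * up + alpha * new_block(up)

low_res = np.arange(16, dtype=float).reshape(4, 4)   # stand-in for a 4x4 generated image
new_block = lambda img: np.tanh(img)                 # hypothetical new 8x8 layers

start = blended_generator_output(low_res, new_block, alpha=0.0)  # old behaviour only
end = blended_generator_output(low_res, new_block, alpha=1.0)    # new layers only
print(start.shape, end.shape)
```

At alpha = 0 the output is just the upsampled 4x4 image, so the new stage starts from the behaviour the previous stage already learned.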

StyleGAN-1

 
The main architecture of StyleGAN-1 and StyleGAN-2

StyleGAN-1 is designed as a combination of Progressive GAN with neural style transfer.[53]

The key architectural choice of StyleGAN-1 is a progressive growth mechanism, similar to Progressive GAN. Each generated image starts as a constant $4 \times 4 \times 512$ array, and is repeatedly passed through style blocks. Each style block applies a "style latent vector" via affine transform ("adaptive instance normalization"), similar to how neural style transfer uses the Gramian matrix. It then adds noise, and normalizes (subtract the mean, then divide by the standard deviation).
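Adaptive instance normalization itself is a short operation: each feature channel is normalized to zero mean and unit standard deviation, then rescaled and shifted by values derived from the style vector. The sketch below is a hedged illustration rather than StyleGAN's actual code: the affine map `A`, `b` from style vector to per-channel scale and bias is a random stand-in for a learned layer, and the function name `adain` is hypothetical.

```python
import numpy as np

def adain(x, scale, bias, eps=1e-8):
    """Normalize each channel of x (shape C x H x W), then apply style scale and bias."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    return scale[:, None, None] * normalized + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))   # feature maps with 3 channels
w = rng.normal(size=5)           # style latent vector

# Hypothetical learned affine transform mapping the style vector to scale and bias.
A = rng.normal(size=(6, 5))
b = np.zeros(6)
style = A @ w + b
scale, bias = style[:3], style[3:]

y = adain(x, scale, bias)
print(y.mean(axis=(1, 2)), y.std(axis=(1, 2)))
```

After the operation each channel's mean equals its style bias and its standard deviation equals the magnitude of its style scale, so the style vector fully determines these per-channel statistics.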

At training time, usually only one style latent vector is used per image generated, but sometimes two ("mixing regularization") in order to encourage each style block to independently perform its stylization without expecting help from other style blocks (since they might receive an entirely different style latent vector).

After training, multiple style latent vectors can be fed into each style block. Those fed to the lower layers control the large-scale styles, and those fed to the higher layers control the fine-detail styles.

Style-mixing between two images $x, x'$ can be performed as well. First, run a gradient descent to find $z, z'$ such that $G(z) \approx x,\ G(z') \approx x'$. This is called "projecting an image back to style latent space". Then, $z$ can be fed to the lower style blocks, and $z'$ to the higher style blocks, to generate a composite image that has the large-scale style of $x$, and the fine-detail style of $x'$. Multiple images can also be composed this way.

StyleGAN-2

StyleGAN-2 improves upon StyleGAN-1, by using the style latent vector to transform the convolution layer's weights instead, thus solving the "blob" problem.[54]

This was updated by the StyleGAN-2-ADA ("ADA" stands for "adaptive"),[51] which uses invertible data augmentation as described above. It also tunes the amount of data augmentation applied by starting at zero, and gradually increasing it until an "overfitting heuristic" reaches a target level, thus the name "adaptive".

StyleGAN-3

StyleGAN-3[55] improves upon StyleGAN-2 by solving the "texture sticking" problem, which can be seen in the official videos.[56] The authors analyzed the problem using the Nyquist–Shannon sampling theorem, and argued that the layers in the generator learned to exploit the high-frequency signal in the pixels they operate upon.

To solve this, they proposed imposing strict lowpass filters between the generator's layers, so that the generator is forced to operate on the pixels in a way faithful to the continuous signals they represent, rather than operating on them as merely discrete signals. They further imposed rotational and translational equivariance by using more signal filters. The resulting StyleGAN-3 solves the texture sticking problem and generates images that rotate and translate smoothly.
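The effect of a lowpass filter on a sampled signal can be illustrated in one dimension. The three-tap kernel below is a deliberately crude stand-in chosen for this sketch. It is not the filter design used in StyleGAN-3, and the function name `lowpass` is hypothetical.

```python
def lowpass(signal, kernel=(0.25, 0.5, 0.25)):
    """Circular convolution of a 1-D signal with a small smoothing kernel."""
    n, half = len(signal), len(kernel) // 2
    return [sum(k * signal[(i + j - half) % n] for j, k in enumerate(kernel))
            for i in range(n)]


# The highest frequency representable on the grid alternates sign every sample.
nyquist = [1.0, -1.0] * 4
constant = [3.0] * 8

filtered_nyquist = lowpass(nyquist)    # every entry is 0.0: the component is removed
filtered_constant = lowpass(constant)  # every entry is 3.0: low frequencies pass through
```

A layer that only ever sees filtered input cannot base its output on sample-level detail such as the alternating pattern, which is the kind of pixel-grid dependence the filters are meant to suppress.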

Applications

generative, adversarial, network, confused, with, adversarial, machine, learning, generative, adversarial, network, class, machine, learning, frameworks, prominent, framework, approaching, generative, concept, initially, developed, goodfellow, colleagues, june. Not to be confused with Adversarial machine learning A generative adversarial network GAN is a class of machine learning frameworks and a prominent framework for approaching generative AI 1 2 The concept was initially developed by Ian Goodfellow and his colleagues in June 2014 3 In a GAN two neural networks contest with each other in the form of a zero sum game where one agent s gain is another agent s loss An illustration of how a GAN works Given a training set this technique learns to generate new data with the same statistics as the training set For example a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers having many realistic characteristics Though originally proposed as a form of generative model for unsupervised learning GANs have also proved useful for semi supervised learning 4 fully supervised learning 5 and reinforcement learning 6 The core idea of a GAN is based on the indirect training through the discriminator another neural network that can tell how realistic the input seems which itself is also being updated dynamically 7 This means that the generator is not trained to minimize the distance to a specific image but rather to fool the discriminator This enables the model to learn in an unsupervised manner GANs are similar to mimicry in evolutionary biology with an evolutionary arms race between both networks Contents 1 Definition 1 1 Mathematical 1 2 In practice 1 3 Relation to other statistical machine learning methods 2 Mathematical properties 2 1 Measure theoretic considerations 2 2 Choice of the strategy set 2 3 Generative reparametrization 2 4 Move order and strategic equilibria 2 5 Main theorems for GAN game 3 Training 
and evaluating GAN 3 1 Training 3 1 1 Unstable convergence 3 1 2 Mode collapse 3 1 3 Two time scale update rule 3 1 4 Vanishing gradient 3 2 Evaluation 4 Variants 4 1 Conditional GAN 4 2 GANs with alternative architectures 4 3 GANs with alternative objectives 4 4 Wasserstein GAN WGAN 4 5 GANs with more than 2 players 4 5 1 Adversarial autoencoder 4 5 2 InfoGAN 4 5 3 Bidirectional GAN BiGAN 4 5 4 CycleGAN 4 6 GANs with particularly large or small scales 4 6 1 BigGAN 4 6 2 Invertible data augmentation 4 6 3 SinGAN 4 7 StyleGAN series 4 7 1 Progressive GAN 4 7 2 StyleGAN 1 4 7 3 StyleGAN 2 4 7 4 StyleGAN 3 5 Applications 5 1 Fashion art and advertising 5 2 Interactive Media 5 3 Science 5 4 Video games 5 5 AI generated video 5 6 Audio synthesis 5 7 Concerns about malicious applications 5 8 Transfer learning 5 9 Miscellaneous applications 6 History 7 References 8 External linksDefinition editMathematical editThe original GAN is defined as the following game 3 Each probability space W m ref displaystyle Omega mu text ref nbsp defines a GAN game There are 2 players generator and discriminator The generator s strategy set is P W displaystyle mathcal P Omega nbsp the set of all probability measures m G displaystyle mu G nbsp on W displaystyle Omega nbsp The discriminator s strategy set is the set of Markov kernels m D W P 0 1 displaystyle mu D Omega to mathcal P 0 1 nbsp where P 0 1 displaystyle mathcal P 0 1 nbsp is the set of probability measures on 0 1 displaystyle 0 1 nbsp The GAN game is a zero sum game with objective functionL m G m D E x m ref y m D x ln y E x m G y m D x ln 1 y displaystyle L mu G mu D mathbb E x sim mu text ref y sim mu D x ln y mathbb E x sim mu G y sim mu D x ln 1 y nbsp The generator aims to minimize the objective and the discriminator aims to maximize the objective The generator s task is to approach m G m ref displaystyle mu G approx mu text ref nbsp that is to match its own output distribution as closely as possible to the reference 
distribution The discriminator s task is to output a value close to 1 when the input appears to be from the reference distribution and to output a value close to 0 when the input looks like it came from the generator distribution In practice edit The generative network generates candidates while the discriminative network evaluates them 3 The contest operates in terms of data distributions Typically the generative network learns to map from a latent space to a data distribution of interest while the discriminative network distinguishes candidates produced by the generator from the true data distribution The generative network s training objective is to increase the error rate of the discriminative network i e fool the discriminator network by producing novel candidates that the discriminator thinks are not synthesized are part of the true data distribution 3 8 A known dataset serves as the initial training data for the discriminator Training involves presenting it with samples from the training dataset until it achieves acceptable accuracy The generator is trained based on whether it succeeds in fooling the discriminator Typically the generator is seeded with randomized input that is sampled from a predefined latent space e g a multivariate normal distribution Thereafter candidates synthesized by the generator are evaluated by the discriminator Independent backpropagation procedures are applied to both networks so that the generator produces better samples while the discriminator becomes more skilled at flagging synthetic samples 9 When used for image generation the generator is typically a deconvolutional neural network and the discriminator is a convolutional neural network Relation to other statistical machine learning methods edit GANs are implicit generative models 10 which means that they do not explicitly model the likelihood function nor provide a means for finding the latent variable corresponding to a given sample unlike alternatives such as flow based 
generative model nbsp Main types of deep generative models that perform maximum likelihood estimation 11 Compared to fully visible belief networks such as WaveNet and PixelRNN and autoregressive models in general GANs can generate one complete sample in one pass rather than multiple passes through the network Compared to Boltzmann machines and nonlinear ICA there is no restriction on the type of function used by the network Since neural networks are universal approximators GANs are asymptotically consistent Variational autoencoders might be universal approximators but it is not proven as of 2017 11 Mathematical properties editMeasure theoretic considerations edit This section provides some of the mathematical theory behind these methods In modern probability theory based on measure theory a probability space also needs to be equipped with a s algebra As a result a more rigorous definition of the GAN game would make the following changes Each probability space W B m ref displaystyle Omega mathcal B mu text ref nbsp defines a GAN game The generator s strategy set is P W B displaystyle mathcal P Omega mathcal B nbsp the set of all probability measures m G displaystyle mu G nbsp on the measure space W B displaystyle Omega mathcal B nbsp The discriminator s strategy set is the set of Markov kernels m D W B P 0 1 B 0 1 displaystyle mu D Omega mathcal B to mathcal P 0 1 mathcal B 0 1 nbsp where B 0 1 displaystyle mathcal B 0 1 nbsp is the Borel s algebra on 0 1 displaystyle 0 1 nbsp Since issues of measurability never arise in practice these will not concern us further Choice of the strategy set edit In the most generic version of the GAN game described above the strategy set for the discriminator contains all Markov kernels m D W P 0 1 displaystyle mu D Omega to mathcal P 0 1 nbsp and the strategy set for the generator contains arbitrary probability distributions m G displaystyle mu G nbsp on W displaystyle Omega nbsp However as shown below the optimal discriminator 
strategy against any m G displaystyle mu G nbsp is deterministic so there is no loss of generality in restricting the discriminator s strategies to deterministic functions D W 0 1 displaystyle D Omega to 0 1 nbsp In most applications D displaystyle D nbsp is a deep neural network function As for the generator while m G displaystyle mu G nbsp could theoretically be any computable probability distribution in practice it is usually implemented as a pushforward m G m Z G 1 displaystyle mu G mu Z circ G 1 nbsp That is start with a random variable z m Z displaystyle z sim mu Z nbsp where m Z displaystyle mu Z nbsp is a probability distribution that is easy to compute such as the uniform distribution or the Gaussian distribution then define a function G W Z W displaystyle G Omega Z to Omega nbsp Then the distribution m G displaystyle mu G nbsp is the distribution of G z displaystyle G z nbsp Consequently the generator s strategy is usually defined as just G displaystyle G nbsp leaving z m Z displaystyle z sim mu Z nbsp implicit In this formalism the GAN game objective isL G D E x m ref ln D x E z m Z ln 1 D G z displaystyle L G D mathbb E x sim mu text ref ln D x mathbb E z sim mu Z ln 1 D G z nbsp Generative reparametrization edit The GAN architecture has two main components One is casting optimization into a game of form min G max D L G D displaystyle min G max D L G D nbsp which is different from the usual kind of optimization of form min 8 L 8 displaystyle min theta L theta nbsp The other is the decomposition of m G displaystyle mu G nbsp into m Z G 1 displaystyle mu Z circ G 1 nbsp which can be understood as a reparametrization trick To see its significance one must compare GAN with previous methods for learning generative models which were plagued with intractable probabilistic computations that arise in maximum likelihood estimation and related strategies 3 At the same time Kingma and Welling 12 and Rezende et al 13 developed the same idea of reparametrization into 
a general stochastic backpropagation method Among its first applications was the variational autoencoder Move order and strategic equilibria edit In the original paper as well as most subsequent papers it is usually assumed that the generator moves first and the discriminator moves second thus giving the following minimax game min m G max m D L m G m D E x m ref y m D x ln y E x m G y m D x ln 1 y displaystyle min mu G max mu D L mu G mu D mathbb E x sim mu text ref y sim mu D x ln y mathbb E x sim mu G y sim mu D x ln 1 y nbsp If both the generator s and the discriminator s strategy sets are spanned by a finite number of strategies then by the minimax theorem min m G max m D L m G m D max m D min m G L m G m D displaystyle min mu G max mu D L mu G mu D max mu D min mu G L mu G mu D nbsp that is the move order does not matter However since the strategy sets are both not finitely spanned the minimax theorem does not apply and the idea of an equilibrium becomes delicate To wit there are the following different concepts of equilibrium Equilibrium when generator moves first and discriminator moves second m G arg min m G max m D L m G m D m D arg max m D L m G m D displaystyle hat mu G in arg min mu G max mu D L mu G mu D quad hat mu D in arg max mu D L hat mu G mu D quad nbsp Equilibrium when discriminator moves first and generator moves second m D arg max m D min m G L m G m D m G arg min m G L m G m D displaystyle hat mu D in arg max mu D min mu G L mu G mu D quad hat mu G in arg min mu G L mu G hat mu D nbsp Nash equilibrium m D m G displaystyle hat mu D hat mu G nbsp which is stable under simultaneous move order m D arg max m D L m G m D m G arg min m G L m G m D displaystyle hat mu D in arg max mu D L hat mu G mu D quad hat mu G in arg min mu G L mu G hat mu D nbsp For general games these equilibria do not have to agree or even to exist For the original GAN game these equilibria all exist and are all equal However for more general GAN games these do not 
necessarily exist or agree 14 Main theorems for GAN game editThe original GAN paper proved the following two theorems 3 Theorem the optimal discriminator computes the Jensen Shannon divergence For any fixed generator strategy m G displaystyle mu G nbsp let the optimal reply be D arg max D L m G D displaystyle D arg max D L mu G D nbsp thenD x d m ref d m ref m G L m G D 2 D J S m ref m G 2 ln 2 displaystyle begin aligned D x amp frac d mu text ref d mu text ref mu G L mu G D amp 2D JS mu text ref mu G 2 ln 2 end aligned nbsp where the derivative is the Radon Nikodym derivative and D J S displaystyle D JS nbsp is the Jensen Shannon divergence Proof By Jensen s inequality E x m ref y m D x ln y E x m ref ln E y m D x y displaystyle mathbb E x sim mu text ref y sim mu D x ln y leq mathbb E x sim mu text ref ln mathbb E y sim mu D x y nbsp and similarly for the other term Therefore the optimal reply can be deterministic i e m D x d D x displaystyle mu D x delta D x nbsp for some function D W 0 1 displaystyle D Omega to 0 1 nbsp in which case L m G m D E x m ref ln D x E x m G ln 1 D x displaystyle L mu G mu D mathbb E x sim mu text ref ln D x mathbb E x sim mu G ln 1 D x nbsp To define suitable density functions we define a base measure m m ref m G displaystyle mu mu text ref mu G nbsp which allows us to take the Radon Nikodym derivativesr ref d m ref d m r G d m G d m displaystyle rho text ref frac d mu text ref d mu quad rho G frac d mu G d mu nbsp with r ref r G 1 displaystyle rho text ref rho G 1 nbsp We then haveL m G m D m d x r ref x ln D x r G x ln 1 D x displaystyle L mu G mu D int mu dx left rho text ref x ln D x rho G x ln 1 D x right nbsp The integrand is just the negative cross entropy between two Bernoulli random variables with parameters r ref x displaystyle rho text ref x nbsp and D x displaystyle D x nbsp We can write this as H r ref x D K L r ref x D x displaystyle H rho text ref x D KL rho text ref x D x nbsp where H displaystyle H nbsp is the binary 
entropy function soL m G m D m d x H r ref x D K L r ref x D x displaystyle L mu G mu D int mu dx H rho text ref x D KL rho text ref x D x nbsp This means that the optimal strategy for the discriminator is D x r ref x displaystyle D x rho text ref x nbsp withL m G m D m d x H r ref x D J S m ref m G 2 ln 2 displaystyle L mu G mu D int mu dx H rho text ref x D JS mu text ref mu G 2 ln 2 nbsp after routine calculation Interpretation For any fixed generator strategy m G displaystyle mu G nbsp the optimal discriminator keeps track of the likelihood ratio between the reference distribution and the generator distribution D x 1 D x d m ref d m G x m ref d x m G d x D x s ln m ref d x ln m G d x displaystyle frac D x 1 D x frac d mu text ref d mu G x frac mu text ref dx mu G dx quad D x sigma ln mu text ref dx ln mu G dx nbsp where s displaystyle sigma nbsp is the logistic function In particular if the prior probability for an image x displaystyle x nbsp to come from the reference distribution is equal to 1 2 displaystyle frac 1 2 nbsp then D x displaystyle D x nbsp is just the posterior probability that x displaystyle x nbsp came from the reference distribution D x P r x came from reference distribution x displaystyle D x Pr x text came from reference distribution x nbsp Theorem the unique equilibrium point For any GAN game there exists a pair m D m G displaystyle hat mu D hat mu G nbsp that is both a sequential equilibrium and a Nash equilibrium L m G m D min m G max m D L m G m D max m D min m G L m G m D 2 ln 2 m D arg max m D min m G L m G m D m G arg min m G max m D L m G m D m D arg max m D L m G m D m G arg min m G L m G m D x W m D x d 1 2 m G m ref displaystyle begin aligned L hat mu G hat mu D min mu G max mu D L mu G mu D amp max mu D min mu G L mu G mu D 2 ln 2 hat mu D in arg max mu D min mu G L mu G mu D amp quad hat mu G in arg min mu G max mu D L mu G mu D hat mu D in arg max mu D L hat mu G mu D amp quad hat mu G in arg min mu G L mu G hat mu D forall x 
in Omega hat mu D x delta frac 1 2 amp quad hat mu G mu text ref end aligned nbsp That is the generator perfectly mimics the reference and the discriminator outputs 1 2 displaystyle frac 1 2 nbsp deterministically on all inputs Proof From the previous proposition arg min m G max m D L m G m D m ref min m G max m D L m G m D 2 ln 2 displaystyle arg min mu G max mu D L mu G mu D mu text ref quad min mu G max mu D L mu G mu D 2 ln 2 nbsp For any fixed discriminator strategy m D displaystyle mu D nbsp any m G displaystyle mu G nbsp concentrated on the set x E y m D x ln 1 y inf x E y m D x ln 1 y displaystyle x mathbb E y sim mu D x ln 1 y inf x mathbb E y sim mu D x ln 1 y nbsp is an optimal strategy for the generator Thus arg max m D min m G L m G m D arg max m D E x m ref y m D x ln y inf x E y m D x ln 1 y displaystyle arg max mu D min mu G L mu G mu D arg max mu D mathbb E x sim mu text ref y sim mu D x ln y inf x mathbb E y sim mu D x ln 1 y nbsp By Jensen s inequality the discriminator can only improve by adopting the deterministic strategy of always playing D x E y m D x y displaystyle D x mathbb E y sim mu D x y nbsp Therefore arg max m D min m G L m G m D arg max D E x m ref ln D x inf x ln 1 D x displaystyle arg max mu D min mu G L mu G mu D arg max D mathbb E x sim mu text ref ln D x inf x ln 1 D x nbsp By Jensen s inequality ln E x m ref D x inf x ln 1 D x ln E x m ref D x ln 1 sup x D x ln E x m ref D x 1 sup x D x ln sup x D x 1 sup x D x ln 1 4 displaystyle ln mathbb E x sim mu text ref D x inf x ln 1 D x ln mathbb E x sim mu text ref D x ln 1 sup x D x ln mathbb E x sim mu text ref D x 1 sup x D x leq ln sup x D x 1 sup x D x leq ln frac 1 4 nbsp with equality if D x 1 2 displaystyle D x frac 1 2 nbsp so x W m D x d 1 2 max m D min m G L m G m D 2 ln 2 displaystyle forall x in Omega hat mu D x delta frac 1 2 quad max mu D min mu G L mu G mu D 2 ln 2 nbsp Finally to check that this is a Nash equilibrium note that when m G m ref displaystyle mu G mu text 
ref nbsp we haveL m G m D E x m ref y m D x ln y 1 y displaystyle L mu G mu D mathbb E x sim mu text ref y sim mu D x ln y 1 y nbsp which is always maximized by y 1 2 displaystyle y frac 1 2 nbsp When x W m D x d 1 2 displaystyle forall x in Omega mu D x delta frac 1 2 nbsp any strategy is optimal for the generator Training and evaluating GAN editTraining edit Unstable convergence edit While the GAN game has a unique global equilibrium point when both the generator and discriminator have access to their entire strategy sets the equilibrium is no longer guaranteed when they have a restricted strategy set 14 In practice the generator has access only to measures of form m Z G 8 1 displaystyle mu Z circ G theta 1 nbsp where G 8 displaystyle G theta nbsp is a function computed by a neural network with parameters 8 displaystyle theta nbsp and m Z displaystyle mu Z nbsp is an easily sampled distribution such as the uniform or normal distribution Similarly the discriminator has access only to functions of form D z displaystyle D zeta nbsp a function computed by a neural network with parameters z displaystyle zeta nbsp These restricted strategy sets take up a vanishingly small proportion of their entire strategy sets 15 Further even if an equilibrium still exists it can only be found by searching in the high dimensional space of all possible neural network functions The standard strategy of using gradient descent to find the equilibrium often does not work for GAN and often the game collapses into one of several failure modes To improve the convergence stability some training strategies start with an easier task such as generating low resolution images 16 or simple images one object with uniform background 17 and gradually increase the difficulty of the task during training This essentially translates to applying a curriculum learning scheme 18 Mode collapse edit GANs often suffer from mode collapse where they fail to generalize properly missing entire modes from the input 
data For example a GAN trained on the MNIST dataset containing many samples of each digit might only generate pictures of digit 0 This was named in the first paper as the Helvetica scenario One way this can happen is if the generator learns too fast compared to the discriminator If the discriminator D displaystyle D nbsp is held constant then the optimal generator would only output elements of arg max x D x displaystyle arg max x D x nbsp 19 So for example if during GAN training for generating MNIST dataset for a few epochs the discriminator somehow prefers the digit 0 slightly more than other digits the generator may seize the opportunity to generate only digit 0 then be unable to escape the local minimum after the discriminator improves Some researchers perceive the root problem to be a weak discriminative network that fails to notice the pattern of omission while others assign blame to a bad choice of objective function Many solutions have been proposed but it is still an open problem 20 21 Even the state of the art architecture BigGAN 2019 could not avoid mode collapse The authors resorted to allowing collapse to occur at the later stages of training by which time a model is sufficiently trained to achieve good results 22 Two time scale update rule edit The two time scale update rule TTUR is proposed to make GAN convergence more stable by making the learning rate of the generator lower than that of the discriminator The authors argued that the generator should move slower than the discriminator so that it does not drive the discriminator steadily into new regions without capturing its gathered information They proved that a general class of games that included the GAN game when trained under TTUR converges under mild assumptions to a stationary local Nash equilibrium 23 They also proposed using the Adam stochastic optimization 24 to avoid mode collapse as well as the Frechet inception distance for evaluating GAN performances Vanishing gradient edit Conversely 
if the discriminator learns too fast compared to the generator then the discriminator could almost perfectly distinguish m G 8 m ref displaystyle mu G theta mu text ref nbsp In such case the generator G 8 displaystyle G theta nbsp could be stuck with a very high loss no matter which direction it changes its 8 displaystyle theta nbsp meaning that the gradient 8 L G 8 D z displaystyle nabla theta L G theta D zeta nbsp would be close to zero In such case the generator cannot learn a case of the vanishing gradient problem 15 Intuitively speaking the discriminator is too good and since the generator cannot take any small step only small steps are considered in gradient descent to improve its payoff it does not even try One important method for solving this problem is the Wasserstein GAN Evaluation edit GANs are usually evaluated by Inception score IS which measures how varied the generator s outputs are as classified by an image classifier usually Inception v3 or Frechet inception distance FID which measures how similar the generator s outputs are to a reference set as classified by a learned image featurizer such as Inception v3 without its final layer Many papers that propose new GAN architectures for image generation report how their architectures break the state of the art on FID or IS Another evaluation method is the Learned Perceptual Image Patch Similarity LPIPS which starts with a learned image featurizer f 8 Image R n displaystyle f theta text Image to mathbb R n nbsp and finetunes it by supervised learning on a set of x x PerceptualDifference x x displaystyle x x text PerceptualDifference x x nbsp where x displaystyle x nbsp is an image x displaystyle x nbsp is a perturbed version of it and PerceptualDifference x x displaystyle text PerceptualDifference x x nbsp is how much they differ as reported by human subjects The model is finetuned so that it can approximate f 8 x f 8 x PerceptualDifference x x displaystyle f theta x f theta x approx text 
PerceptualDifference x x nbsp This finetuned model is then used to define LPIPS x x f 8 x f 8 x displaystyle text LPIPS x x f theta x f theta x nbsp 25 Other evaluation methods are reviewed in 26 Variants editThere is a veritable zoo of GAN variants 27 Some of the most prominent are as follows Conditional GAN edit Conditional GANs are similar to standard GANs except they allow the model to conditionally generate samples based on additional information For example if we want to generate a cat face given a dog picture we could use a conditional GAN The generator in a GAN game generates m G displaystyle mu G nbsp a probability distribution on the probability space W displaystyle Omega nbsp This leads to the idea of a conditional GAN where instead of generating one probability distribution on W displaystyle Omega nbsp the generator generates a different probability distribution m G c displaystyle mu G c nbsp on W displaystyle Omega nbsp for each given class label c displaystyle c nbsp For example for generating images that look like ImageNet the generator should be able to generate a picture of cat when given the class label cat In the original paper 3 the authors noted that GAN can be trivially extended to conditional GAN by providing the labels to both the generator and the discriminator Concretely the conditional GAN game is just the GAN game with class labels provided L m G D E c m C x m ref c ln D x c E c m C x m G c ln 1 D x c displaystyle L mu G D mathbb E c sim mu C x sim mu text ref c ln D x c mathbb E c sim mu C x sim mu G c ln 1 D x c nbsp where m C displaystyle mu C nbsp is a probability distribution over classes m ref c displaystyle mu text ref c nbsp is the probability distribution of real images of class c displaystyle c nbsp and m G c displaystyle mu G c nbsp the probability distribution of images generated by the generator when given class label c displaystyle c nbsp In 2017 a conditional GAN learned to generate 1000 image classes of ImageNet 28 GANs 
with alternative architectures edit The GAN game is a general framework and can be run with any reasonable parametrization of the generator G displaystyle G nbsp and discriminator D displaystyle D nbsp In the original paper the authors demonstrated it using multilayer perceptron networks and convolutional neural networks Many alternative architectures have been tried Deep convolutional GAN DCGAN 29 For both generator and discriminator uses only deep networks consisting entirely of convolution deconvolution layers that is fully convolutional networks 30 Self attention GAN SAGAN 31 Starts with the DCGAN then adds residually connected standard self attention modules to the generator and discriminator Variational autoencoder GAN VAEGAN 32 Uses a variational autoencoder VAE for the generator Transformer GAN TransGAN 33 Uses the pure transformer architecture for both the generator and discriminator entirely devoid of convolution deconvolution layers Flow GAN 34 Uses flow based generative model for the generator allowing efficient computation of the likelihood function GANs with alternative objectives edit Many GAN variants are merely obtained by changing the loss functions for the generator and discriminator Original GAN We recast the original GAN objective into a form more convenient for comparison min D L D D m G E x m G ln D x E x m ref ln 1 D x min G L G D m G E x m G ln 1 D x displaystyle begin cases min D L D D mu G mathbb E x sim mu G ln D x mathbb E x sim mu text ref ln 1 D x min G L G D mu G mathbb E x sim mu G ln 1 D x end cases nbsp Original GAN non saturating loss This objective for generator was recommended in the original paper for faster convergence 3 L G E x m G ln D x displaystyle L G mathbb E x sim mu G ln D x nbsp The effect of using this objective is analyzed in Section 2 2 2 of Arjovsky et al 35 Original GAN maximum likelihood L G E x m G exp s 1 D x displaystyle L G mathbb E x sim mu G exp circ sigma 1 circ D x nbsp where s displaystyle sigma nbsp 
is the logistic function When the discriminator is optimal the generator gradient is the same as in maximum likelihood estimation even though GAN cannot perform maximum likelihood estimation itself 36 37 Hinge loss GAN 38 L D E x p ref min 0 1 D x E x m G min 0 1 D x displaystyle L D mathbb E x sim p text ref left min left 0 1 D left x right right right mathbb E x sim mu G left min left 0 1 D left x right right right nbsp L G E x m G D x displaystyle L G mathbb E x sim mu G D left x right nbsp Least squares GAN 39 L D E x m ref D x b 2 E x m G D x a 2 displaystyle L D mathbb E x sim mu text ref D x b 2 mathbb E x sim mu G D x a 2 nbsp L G E x m G D x c 2 displaystyle L G mathbb E x sim mu G D x c 2 nbsp where a b c displaystyle a b c nbsp are parameters to be chosen The authors recommended a 1 b 1 c 0 displaystyle a 1 b 1 c 0 nbsp Wasserstein GAN WGAN edit Main article Wasserstein GAN The Wasserstein GAN modifies the GAN game at two points The discriminator s strategy set is the set of measurable functions of type D W R displaystyle D Omega to mathbb R nbsp with bounded Lipschitz norm D L K displaystyle D L leq K nbsp where K displaystyle K nbsp is a fixed positive constant The objective isL W G A N m G D E x m G D x E x m ref D x displaystyle L WGAN mu G D mathbb E x sim mu G D x mathbb E x sim mu text ref D x nbsp One of its purposes is to solve the problem of mode collapse see above 15 The authors claim In no experiment did we see evidence of mode collapse for the WGAN algorithm GANs with more than 2 players edit Adversarial autoencoder edit An adversarial autoencoder AAE 40 is more autoencoder than GAN The idea is to start with a plain autoencoder but train a discriminator to discriminate the latent vectors from a reference distribution often the normal distribution InfoGAN edit In conditional GAN the generator receives both a noise vector z displaystyle z nbsp and a label c displaystyle c nbsp and produces an image G z c displaystyle G z c nbsp The 
discriminator receives image-label pairs $(x, c)$ and computes $D(x, c)$.

When the training dataset is unlabeled, conditional GAN does not work directly.

The idea of InfoGAN is to decree that every latent vector in the latent space can be decomposed as $(z, c)$: an incompressible noise part $z$ and an informative label part $c$, and to encourage the generator to comply with the decree by encouraging it to maximize $I(c, G(z, c))$, the mutual information between $c$ and $G(z, c)$, while making no demands on the mutual information between $z$ and $G(z, c)$.

Unfortunately, $I(c, G(z, c))$ is intractable in general. The key idea of InfoGAN is Variational Mutual Information Maximization:[41] indirectly maximize it by maximizing a lower bound

$$\hat I(G, Q) = \mathbb{E}_{z\sim\mu_Z, c\sim\mu_C}[\ln Q(c \mid G(z, c))]; \quad I(c, G(z, c)) \geq \sup_Q \hat I(G, Q)$$

where $Q$ ranges over all Markov kernels of type $Q: \Omega_Y \to \mathcal{P}(\Omega_C)$.

The InfoGAN game is defined as follows:[42]

Three probability spaces define an InfoGAN game:

$(\Omega_X, \mu_{\text{ref}})$, the space of reference images.

$(\Omega_Z, \mu_Z)$, the fixed random noise generator.

$(\Omega_C, \mu_C)$, the fixed random information generator.

There are 3 players in 2 teams: generator, Q, and discriminator. The generator and Q are on one team, and the discriminator on the other team.

The objective function is

$$L(G, Q, D) = L_{GAN}(G, D) - \lambda \hat I(G, Q)$$

where $L_{GAN}(G, D) = \mathbb{E}_{x\sim\mu_{\text{ref}}}[\ln D(x)] + \mathbb{E}_{z\sim\mu_Z}[\ln(1 - D(G(z, c)))]$ is the original GAN game objective, and $\hat I(G, Q) = \mathbb{E}_{z\sim\mu_Z, c\sim\mu_C}[\ln Q(c \mid G(z, c))]$.

The generator-Q team aims to minimize the objective, and the discriminator aims to maximize it:

$$\min_{G, Q} \max_D L(G, Q, D)$$

Bidirectional GAN (BiGAN)

The standard GAN generator is a function of type $G: \Omega_Z \to \Omega_X$, that is, it is a mapping from a latent space $\Omega_Z$ to the image space $\Omega_X$. This can be understood as a "decoding" process, whereby every latent vector $z \in \Omega_Z$ is a code for an image $x \in \Omega_X$, and the generator performs the decoding. This naturally leads to the idea of training another network that performs "encoding", creating an autoencoder out of the encoder-generator pair.

Already in the original paper,[3] the authors noted that "Learned approximate inference can be performed by training an auxiliary network to predict $z$ given $x$". The bidirectional GAN architecture performs exactly this.[43]

The BiGAN is defined as follows:

Two probability spaces define a BiGAN game:

$(\Omega_X, \mu_X)$, the space of reference images.

$(\Omega_Z, \mu_Z)$, the latent space.

There are 3 players in 2 teams: generator, encoder, and discriminator. The generator and encoder are on one team, and the discriminator on the other team.

The generator's strategies are functions $G: \Omega_Z \to \Omega_X$, and the encoder's strategies are functions $E: \Omega_X \to \Omega_Z$. The discriminator's strategies are functions $D: \Omega_X \times \Omega_Z \to [0, 1]$, since the discriminator scores image-latent pairs.

The objective function is

$$L(G, E, D) = \mathbb{E}_{x\sim\mu_X}[\ln D(x, E(x))] + \mathbb{E}_{z\sim\mu_Z}[\ln(1 - D(G(z), z))]$$

The generator-encoder team aims to minimize the objective, and the discriminator aims to maximize it:

$$\min_{G, E} \max_D L(G, E, D)$$

In the paper, they gave a more abstract definition of the objective as

$$L(G, E, D) = \mathbb{E}_{(x, z)\sim\mu_{E, X}}[\ln D(x, z)] + \mathbb{E}_{(x, z)\sim\mu_{G, Z}}[\ln(1 - D(x, z))]$$

where $\mu_{E, X}(dx, dz) = \mu_X(dx) \cdot \delta_{E(x)}(dz)$ is the probability distribution on $\Omega_X \times \Omega_Z$ obtained by pushing $\mu_X$ forward via $x \mapsto (x, E(x))$, and $\mu_{G, Z}(dx, dz) = \delta_{G(z)}(dx) \cdot \mu_Z(dz)$ is the probability distribution on $\Omega_X \times \Omega_Z$ obtained by pushing $\mu_Z$ forward via $z \mapsto (G(z), z)$.

Applications of bidirectional models include semi-supervised learning,[44] interpretable machine learning,[45] and neural machine translation.[46]

CycleGAN

CycleGAN is an architecture for performing translations between two domains, such as between photos of horses and photos of zebras, or photos of night cities and photos of day cities.

The CycleGAN game is defined as follows:[47]

There are two probability spaces $(\Omega_X, \mu_X), (\Omega_Y, \mu_Y)$, corresponding to the two domains needed for translations fore-and-back.

There are 4 players in 2 teams: generators $G_X: \Omega_X \to \Omega_Y, G_Y: \Omega_Y \to \Omega_X$, and discriminators $D_X: \Omega_X \to [0, 1], D_Y: \Omega_Y \to [0, 1]$.

The objective function is

$$L(G_X, G_Y, D_X, D_Y) = L_{GAN}(G_X, D_Y) + L_{GAN}(G_Y, D_X) + \lambda L_{cycle}(G_X, G_Y)$$

where $\lambda$ is a positive adjustable parameter, $L_{GAN}$ is the GAN game objective (each generator is paired with the discriminator of the domain it maps into), and $L_{cycle}$ is the cycle consistency loss:

$$L_{cycle}(G_X, G_Y) = \mathbb{E}_{x\sim\mu_X}\|G_Y(G_X(x)) - x\| + \mathbb{E}_{y\sim\mu_Y}\|G_X(G_Y(y)) - y\|$$

The generators aim to minimize the objective, and the discriminators aim to maximize it:

$$\min_{G_X, G_Y} \max_{D_X, D_Y} L(G_X, G_Y, D_X, D_Y)$$

Unlike previous work like pix2pix,[48] which requires paired training data, CycleGAN requires no paired data. For example, to train a pix2pix model to turn a summer scenery photo into a winter scenery photo and back, the dataset must contain pairs of the same place in summer and winter, shot at the same angle; CycleGAN would only need a set of summer scenery photos and an unrelated set of winter scenery photos.

GANs with particularly large or small scales

BigGAN

The BigGAN is essentially a self-attention GAN trained on a large scale (up to 80 million parameters) to generate large images of ImageNet (up to 512 x 512 resolution), with numerous engineering tricks to make it converge.[22][49]

Invertible data augmentation

When there is insufficient training data, the reference distribution $\mu_{\text{ref}}$ cannot be well approximated by the empirical distribution given by the training dataset. In such cases, data augmentation can be applied to allow training a GAN on smaller datasets. Naive data augmentation, however, brings its own problems.

Consider the original GAN game, slightly reformulated as follows:

$$\begin{cases} \min_D L_D(D, \mu_G) = -\mathbb{E}_{x\sim\mu_{\text{ref}}}[\ln D(x)] - \mathbb{E}_{x\sim\mu_G}[\ln(1 - D(x))] \\ \min_G L_G(D, \mu_G) = \mathbb{E}_{x\sim\mu_G}[\ln(1 - D(x))] \end{cases}$$

Now we use data augmentation by randomly sampling semantic-preserving transforms $T: \Omega \to \Omega$ and applying them to the dataset, to obtain the reformulated GAN game:

$$\begin{cases} \min_D L_D(D, \mu_G) = -\mathbb{E}_{x\sim\mu_{\text{ref}}, T\sim\mu_{trans}}[\ln D(T(x))] - \mathbb{E}_{x\sim\mu_G}[\ln(1 - D(x))] \\ \min_G L_G(D, \mu_G) = \mathbb{E}_{x\sim\mu_G}[\ln(1 - D(x))] \end{cases}$$

This is equivalent to a GAN game with a different distribution $\mu'_{\text{ref}}$, sampled by $T(x)$, with $x \sim \mu_{\text{ref}}, T \sim \mu_{trans}$. For example, if $\mu_{\text{ref}}$ is the distribution of images in ImageNet, and $\mu_{trans}$ samples the identity transform with probability 0.5 and horizontal reflection with probability 0.5, then $\mu'_{\text{ref}}$ is the distribution of images in ImageNet and horizontally reflected ImageNet, combined.

The result of such training would be a generator that mimics $\mu'_{\text{ref}}$. For example, it would generate images that look like they are randomly cropped, if the data augmentation uses random cropping.

The solution is to apply data augmentation to both generated and real images:

$$\begin{cases} \min_D L_D(D, \mu_G) = -\mathbb{E}_{x\sim\mu_{\text{ref}}, T\sim\mu_{trans}}[\ln D(T(x))] - \mathbb{E}_{x\sim\mu_G, T\sim\mu_{trans}}[\ln(1 - D(T(x)))] \\ \min_G L_G(D, \mu_G) = \mathbb{E}_{x\sim\mu_G, T\sim\mu_{trans}}[\ln(1 - D(T(x)))] \end{cases}$$

The authors demonstrated high-quality generation using datasets of just 100 pictures.[50]

The StyleGAN-2-ADA paper makes a further point on data augmentation: it must be invertible.[51] Continue with the example of generating ImageNet pictures. If the data augmentation is "randomly rotate the picture by 0, 90, 180, 270 degrees with equal probability", then there is no way for the generator to know which is the true orientation: consider two generators $G, G'$ such that for any latent $z$, the generated image $G(z)$ is a 90-degree rotation of $G'(z)$. They would have
exactly the same expected loss, and so neither is preferred over the other.

The solution is to only use invertible data augmentation: instead of "randomly rotate the picture by 0, 90, 180, 270 degrees with equal probability", use "randomly rotate the picture by 90, 180, 270 degrees with 0.1 probability each, and keep the picture as it is with 0.7 probability". This way, the generator is still rewarded for keeping images oriented the same way as un-augmented ImageNet pictures.

Abstractly, the effect of randomly sampling transformations $T: \Omega \to \Omega$ from the distribution $\mu_{trans}$ is to define a Markov kernel $K_{trans}: \Omega \to \mathcal{P}(\Omega)$. Then, the data-augmented GAN game pushes the generator to find some $\hat\mu_G \in \mathcal{P}(\Omega)$ such that

$$K_{trans} * \mu_{\text{ref}} = K_{trans} * \hat\mu_G$$

where $*$ is the Markov kernel convolution. A data-augmentation method is defined to be invertible if its Markov kernel $K_{trans}$ satisfies

$$K_{trans} * \mu = K_{trans} * \mu' \implies \mu = \mu' \quad \forall \mu, \mu' \in \mathcal{P}(\Omega)$$

Immediately by definition, we see that composing multiple invertible data-augmentation methods results in yet another invertible method. Also by definition, if the data-augmentation method is invertible, then using it in a GAN game does not change the optimal strategy $\hat\mu_G$ for the generator, which is still $\mu_{\text{ref}}$.

There are two prototypical examples of invertible Markov kernels:

Discrete case: invertible stochastic matrices, when $\Omega$ is finite.

For example, if $\Omega = \{\uparrow, \downarrow, \leftarrow, \rightarrow\}$ is the set of four images of an arrow pointing in 4 directions, and the data augmentation is to randomly rotate the picture by 90, 180, 270 degrees with probability
$p$ each, and keep the picture as it is with probability $1 - 3p$, then the Markov kernel $K_{trans}$ can be represented as a stochastic matrix:

$$[K_{trans}] = \begin{bmatrix} 1 - 3p & p & p & p \\ p & 1 - 3p & p & p \\ p & p & 1 - 3p & p \\ p & p & p & 1 - 3p \end{bmatrix}$$

and $K_{trans}$ is an invertible kernel iff $[K_{trans}]$ is an invertible matrix, that is, $p \neq 1/4$.

Continuous case: the gaussian kernel, when $\Omega = \mathbb{R}^n$ for some $n \geq 1$.

For example, if $\Omega = \mathbb{R}^{256^2}$ is the space of 256x256 images, and the data-augmentation method is "generate a gaussian noise $z \sim \mathcal{N}(0, I_{256^2})$, then add $\epsilon z$ to the image", then $K_{trans}$ is just convolution by the density function of $\mathcal{N}(0, \epsilon^2 I_{256^2})$. This is invertible, because convolution by a gaussian is just convolution by the heat kernel, so given any $\mu \in \mathcal{P}(\mathbb{R}^n)$, the convolved distribution $K_{trans} * \mu$ can be obtained by heating up $\mathbb{R}^n$ precisely according to $\mu$, then waiting for time $\epsilon^2 / 4$. With that, we can recover $\mu$ by running the heat equation backwards in time for $\epsilon^2 / 4$.

More examples of invertible data augmentations are found in the paper.[51]

SinGAN

SinGAN pushes data augmentation to the limit, by using only a single image as training data and performing data augmentation on it. The GAN architecture is adapted to this training method by using a multi-scale pipeline.

The generator $G$ is
decomposed into a pyramid of generators $G = G_1 \circ G_2 \circ \cdots \circ G_N$, with the lowest one generating the image $G_N(z_N)$ at the lowest resolution. The generated image is then scaled up to $r(G_N(z_N))$ and fed to the next level to generate an image $G_{N-1}(z_{N-1} + r(G_N(z_N)))$ at a higher resolution, and so on. The discriminator is decomposed into a pyramid as well.[52]

StyleGAN series

Main article: StyleGAN

The StyleGAN family is a series of architectures published by Nvidia's research division.

Progressive GAN

Progressive GAN[16] is a method for training a GAN for large-scale image generation stably, by growing a GAN generator from small to large scale in a pyramidal fashion. Like SinGAN, it decomposes the generator as $G = G_1 \circ G_2 \circ \cdots \circ G_N$, and the discriminator as $D = D_1 \circ D_2 \circ \cdots \circ D_N$.

During training, at first only $G_N, D_N$ are used in a GAN game to generate 4x4 images. Then $G_{N-1}, D_{N-1}$ are added to reach the second stage of the GAN game, to generate 8x8 images, and so on, until we reach a GAN game to generate 1024x1024 images.

To avoid shock between stages of the GAN game, each new layer is "blended in" (Figure 2 of the paper[16]). For example, this is how the second stage GAN game starts:

Just before, the GAN game consists of the pair $G_N, D_N$ generating and discriminating 4x4 images.

Just after, the GAN game consists of the pair $((1 - \alpha) + \alpha \cdot G_{N-1}) \circ u \circ G_N, D_N \circ d \circ ((1 - \alpha) + \alpha \cdot D_{N-1})$ generating and discriminating 8x8 images. Here, the functions $u, d$ are image up- and down-sampling functions, and $\alpha$ is a blend-in factor (much like an alpha in image composing) that smoothly glides from 0
to 1.

StyleGAN-1

The main architecture of StyleGAN-1 and StyleGAN-2

StyleGAN-1 is designed as a combination of Progressive GAN with neural style transfer.[53]

The key architectural choice of StyleGAN-1 is a progressive growth mechanism, similar to Progressive GAN. Each generated image starts as a constant $4 \times 4 \times 512$ array and is repeatedly passed through style blocks. Each style block applies a "style latent vector" via an affine transform ("adaptive instance normalization"), similar to how neural style transfer uses the Gramian matrix. It then adds noise and normalizes (subtract the mean, then divide by the standard deviation).

At training time, usually only one style latent vector is used per image generated, but sometimes two ("mixing regularization"), in order to encourage each style block to independently perform its stylization without expecting help from other style blocks (since they might receive an entirely different style latent vector).

After training, multiple style latent vectors can be fed into each style block. Those fed to the lower layers control the large-scale styles, and those fed to the higher layers control the fine-detail styles.

Style-mixing between two images $x, x'$ can be performed as well. First, run a gradient descent to find $z, z'$ such that $G(z) \approx x, G(z') \approx x'$. This is called "projecting an image back to style latent space". Then, $z$ can be fed to the lower style blocks, and $z'$ to the higher style blocks, to generate a composite image that has the large-scale style of $x$ and the fine-detail style of $x'$. Multiple images can also be composed this way.

StyleGAN-2

StyleGAN-2 improves upon StyleGAN-1 by using the style latent vector to transform the convolution layer's weights instead, thus solving the "blob" problem.[54]

This was updated by the StyleGAN-2-ADA ("ADA" stands for "adaptive"),[51] which uses invertible
data augmentation as described above. It also tunes the amount of data augmentation applied by starting at zero and gradually increasing it until an "overfitting heuristic" reaches a target level, thus the name "adaptive".

StyleGAN-3

StyleGAN-3[55] improves upon StyleGAN-2 by solving the "texture sticking" problem, which can be seen in the official videos.[56] They analyzed the problem using the Nyquist-Shannon sampling theorem, and argued that the layers in the generator learned to exploit the high-frequency signal in the pixels they operate upon.

To solve this, they proposed imposing strict lowpass filters between each generator's layers, so that the generator is forced to operate on the pixels in a way faithful to the continuous signals they represent, rather than operate on them as merely discrete signals. They further imposed rotational and translational invariance by using more signal filters. The resulting StyleGAN-3 is able to solve the texture sticking problem, as well as generating images that rotate and translate smoothly.

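The discrete invertibility condition from the invertible data augmentation section (the rotation kernel $K_{trans}$ is invertible iff $p \neq 1/4$) can be checked with exact arithmetic. The following is a minimal Python sketch written for illustration, not code from the StyleGAN-2-ADA paper; the helper names `rotation_kernel`, `determinant`, and `apply_kernel` are hypothetical.

```python
from fractions import Fraction


def rotation_kernel(p):
    """4x4 stochastic matrix: keep the image with probability 1 - 3p,
    move to each of the other three orientations with probability p."""
    p = Fraction(p)
    return [[1 - 3 * p if i == j else p for j in range(4)] for i in range(4)]


def determinant(matrix):
    """Exact determinant via Gaussian elimination over Fractions."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero pivot.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def apply_kernel(mu, kernel):
    """Push a distribution mu (row vector) through the kernel."""
    n = len(mu)
    return [sum(mu[i] * kernel[i][j] for i in range(n)) for j in range(n)]


if __name__ == "__main__":
    for p in (Fraction(1, 10), Fraction(1, 4)):
        k = rotation_kernel(p)
        print("p =", p, "det =", determinant(k))
        # Two different point-mass distributions (two arrow orientations).
        print(apply_kernel([1, 0, 0, 0], k), apply_kernel([0, 1, 0, 0], k))
```

At $p = 1/4$ every input distribution is mapped to the uniform distribution, so the augmented data no longer reveal the true orientation; at $p = 1/10$ the two point masses stay distinguishable, which matches the determinant $(1 - 4p)^3$ being nonzero.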