
Text-to-image model

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description. Such models began to be developed in the mid-2010s, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney, began to approach the quality of real photographs and human-drawn art.

An image conditioned on the prompt "an astronaut riding a horse, by Hiroshige", generated by Stable Diffusion, a large-scale text-to-image model released in 2022

Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.[1]

History

Before the rise of deep learning, attempts to build text-to-image models were limited to collages formed by arranging existing component images, such as from a database of clip art.[2][3]

The inverse task, image captioning, was more tractable and a number of image captioning deep learning models came prior to the first text-to-image models.[4]

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously-introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences.[4] Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not represented in the training data (such as a red school bus), and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.[4][5]

Eight images generated from the text prompt "A stop sign is flying in blue skies." by AlignDRAW (2015). Enlarged to show detail.[6]

In 2016, Reed, Akata, Yan et al. became the first to use generative adversarial networks for the text-to-image task.[5][7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details.[5] Later systems include VQGAN+CLIP,[8] XMC-GAN, and GauGAN2.[9]

One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a transformer system announced in January 2021.[10] A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022,[11] followed by Stable Diffusion publicly released in August 2022.[12]

Following the development of text-to-image models, language-model-powered text-to-video platforms such as Runway, Make-A-Video,[13] Imagen Video,[14] Midjourney,[15] and Phenaki[16] can generate video from text and/or image prompts.[17]

In August 2022, it was further shown that large text-to-image foundation models can be "personalized". Text-to-image personalization allows the model to be taught a new concept using a small set of images of an object that was not included in the foundation model's training set. This is achieved by textual inversion, namely, finding a new text term that corresponds to these images.
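A minimal, runnable toy in Python/PyTorch illustrating the core mechanism of textual inversion (the tiny linear "generator" below is a hypothetical stand-in for a real frozen text-to-image model, and all sizes are arbitrary assumptions): every model weight is frozen, and only the embedding of one new pseudo-token is optimized so that the model reproduces the handful of example images.

import torch
import torch.nn as nn

torch.manual_seed(0)
frozen_generator = nn.Linear(768, 3 * 32 * 32)   # stand-in for a frozen foundation model
for p in frozen_generator.parameters():
    p.requires_grad = False                      # model weights stay fixed

concept_images = torch.rand(4, 3 * 32 * 32)      # small set of images of the new concept
new_token_emb = torch.randn(768, requires_grad=True)  # the only trainable tensor
optimizer = torch.optim.Adam([new_token_emb], lr=1e-2)

for step in range(200):
    recon = frozen_generator(new_token_emb)       # "render" the new concept
    loss = ((recon - concept_images) ** 2).mean() # match the example images
    optimizer.zero_grad()
    loss.backward()                               # gradients reach only the token embedding
    optimizer.step()
# new_token_emb can now stand in for the new concept inside text prompts.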
DALL-E 2 (April 2022) and DALL-E 3 (September 2023) interpretations of the prompt "A stop sign is flying in blue skies."

Architecture and training

High-level architecture diagram showing the more notable machine learning models and applications in the AI art landscape, and pertinent relationships and dependencies

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and use one or more auxiliary deep learning models to upscale it, filling in finer details.
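The following is a minimal, illustrative PyTorch sketch of this two-stage design, not a reproduction of any particular published system; all module names and sizes are arbitrary assumptions. A small transformer text encoder pools a prompt into an embedding, a toy generator maps noise plus that embedding to a low-resolution image, and an auxiliary upscaler fills in finer detail.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Transformer encoder that pools a token sequence into one embedding."""
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))      # (batch, seq_len, dim)
        return h.mean(dim=1)                      # pooled text embedding

class Generator(nn.Module):
    """Maps (noise, text embedding) to a low-resolution 64x64 RGB image."""
    def __init__(self, noise_dim=128, text_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 64 * 64 * 3), nn.Tanh(),
        )

    def forward(self, z, text_emb):
        x = self.net(torch.cat([z, text_emb], dim=-1))
        return x.view(-1, 3, 64, 64)

class Upscaler(nn.Module):
    """Auxiliary model that fills in detail at 4x resolution (64 -> 256)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear"),
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

# Generate one image from a (fake) tokenized prompt.
tokens = torch.randint(0, 10000, (1, 16))
emb = TextEncoder()(tokens)
low_res = Generator()(torch.randn(1, 128), emb)
high_res = Upscaler()(low_res)                    # (1, 3, 256, 256)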

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the theretofore standard approach.[18]
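A hedged sketch of this frozen-encoder idea, reusing the toy TextEncoder and Generator classes from the sketch above: the language model's weights are frozen, so gradient updates reach only the image generator. The mean-squared-error loss is a placeholder for illustration, not the objective any production system uses.

import torch

text_encoder = TextEncoder()               # pretrained separately on text in practice
generator = Generator()
for p in text_encoder.parameters():        # freeze the language model
    p.requires_grad = False
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

def training_step(tokens, real_images):
    with torch.no_grad():                  # no gradients flow through the encoder
        emb = text_encoder(tokens)
    fake = generator(torch.randn(tokens.shape[0], 128), emb)
    loss = torch.nn.functional.mse_loss(fake, real_images)   # placeholder loss
    optimizer.zero_grad()
    loss.backward()                        # updates only the generator
    optimizer.step()
    return loss.item()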

Datasets

Examples of images and captions from three public datasets which are commonly used to train text-to-image models

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is COCO (Common Objects in Context). Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects, with five captions per image, generated by human annotators. Oxford 102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets, because of their narrow range of subject matter.[7]
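As a concrete illustration, torchvision ships a CocoCaptions dataset class that yields exactly such (image, captions) pairs; the file paths below are placeholders for a local copy of the 2014 release (the pycocotools package is required).

import torchvision.datasets as dsets
import torchvision.transforms as T

coco = dsets.CocoCaptions(
    root="path/to/train2014",                   # placeholder: image directory
    annFile="path/to/captions_train2014.json",  # placeholder: annotation file
    transform=T.ToTensor(),
)
image, captions = coco[0]    # one image tensor and its list of ~5 human captions
print(image.shape, captions[:2])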

Evaluation

Evaluating and comparing the quality of text-to-image models is a challenging problem, and involves assessing multiple desirable properties. As with any generative image model, it is desirable that the generated images be realistic (in the sense of appearing as if they could plausibly have come from the training set), and diverse in their style. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement.[7]

A common algorithmic metric for assessing image quality and diversity is Inception score (IS), which is based on the distribution of labels predicted by a pretrained Inceptionv3 image classification model when applied to a sample of images generated by the text-to-image model. The score is increased when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance, which compares the distribution of generated images and real training images, according to features extracted by one of the final layers of a pretrained image classification model.[7]
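In standard notation (these are the metrics' usual definitions, not specific to text-to-image models): with p(y|x) the classifier's label distribution for a generated image x and p(y) its marginal over the generated sample, the Inception score is

\mathrm{IS} = \exp\left( \mathbb{E}_{x \sim p_g} \, D_{\mathrm{KL}}\!\left( p(y \mid x) \,\|\, p(y) \right) \right)

and, with (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) the means and covariances of the extracted features for real and generated images respectively, the Fréchet inception distance is

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right)

Lower FID indicates the generated distribution lies closer to the real one, whereas higher IS is better.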

Impact and applications

The exhibition "Thinking Machines: Art and Design in the Computer Age, 1959–1989" at MoMA provided an overview of AI applications for art, architecture, and design. Exhibitions showcasing the use of AI to produce art include the 2016 Google-sponsored benefit and auction at the Gray Area Foundation in San Francisco, where artists experimented with the DeepDream algorithm, and the 2017 exhibition "Unhuman: Art in the Age of AI", which took place in Los Angeles and Frankfurt. In spring 2018, the Association for Computing Machinery dedicated a magazine issue to the subject of computers and art. In June 2018, "Duet for Human and Machine", an art piece permitting viewers to interact with an artificial intelligence, premiered at the Beall Center for Art + Technology. The Austrian Ars Electronica and Museum of Applied Arts, Vienna opened exhibitions on AI in 2019. Ars Electronica's 2019 festival "Out of the box" explored art's role in a sustainable societal transformation.

Examples of how such tools can augment human creativity include enabling the expansion of noncommercial niche genres (such as cyberpunk derivatives like solarpunk) by amateurs, novel entertainment, novel imaginative childhood play, very fast prototyping,[19] increased accessibility of art-making,[19] and greater artistic output per unit of effort, expense, or time[19] – for example, via generating drafts, inspiration, draft refinements, and image components (inpainting).

See also

Artificial intelligence art

References

  1. ^ Vincent, James (May 24, 2022). "All these images were generated by Google's latest text-to-image AI". The Verge. Vox Media. Retrieved May 28, 2022.
  2. ^ Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan (October 2019), A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis, arXiv:1910.09399
  3. ^ Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley (2007). "A text-to-picture synthesis system for augmenting communication" (PDF). AAAI. 7: 1590–1595.
  4. ^ a b c Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan (November 2015). "Generating Images from Captions with Attention". ICLR. arXiv:1511.02793.
  5. ^ a b c Reed, Scott; Akata, Zeynep; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak (June 2016). "Generative Adversarial Text to Image Synthesis" (PDF). International Conference on Machine Learning.
  6. ^ Mansimov, Elman; Parisotto, Emilio; Ba, Jimmy Lei; Salakhutdinov, Ruslan (February 29, 2016). "Generating Images from Captions with Attention". International Conference on Learning Representations. arXiv:1511.02793.
  7. ^ a b c d Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv:2101.09983. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
  8. ^ Rodriguez, Jesus. "🌅 Edge#229: VQGAN + CLIP". thesequence.substack.com. Retrieved 2022-10-10.
  9. ^ Rodriguez, Jesus. "🎆🌆 Edge#231: Text-to-Image Synthesis with GANs". thesequence.substack.com. Retrieved 2022-10-10.
  10. ^ Coldewey, Devin (5 January 2021). "OpenAI's DALL-E creates plausible images of literally anything you ask it to". TechCrunch.
  11. ^ Coldewey, Devin (6 April 2022). "OpenAI's new DALL-E model draws anything — but bigger, better and faster than before". TechCrunch.
  12. ^ "Stable Diffusion Public Release". Stability.Ai. Retrieved 2022-10-27.
  13. ^ Kumar, Ashish (2022-10-03). "Meta AI Introduces 'Make-A-Video': An Artificial Intelligence System That Generates Videos From Text". MarkTechPost. Retrieved 2022-10-03.
  14. ^ Edwards, Benj (2022-10-05). "Google's newest AI generator creates HD video from text prompts". Ars Technica. Retrieved 2022-10-25.
  15. ^ Rodriguez, Jesus. "🎨 Edge#237: What is Midjourney?". thesequence.substack.com. Retrieved 2022-10-26.
  16. ^ "Phenaki". phenaki.video. Retrieved 2022-10-03.
  17. ^ Edwards, Benj (9 September 2022). "Runway teases AI-powered text-to-video editing using written prompts". Ars Technica. Retrieved 12 September 2022.
  18. ^ Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
  19. ^ a b c Elgan, Mike (1 November 2022). "How 'synthetic media' will transform business forever". Computerworld. Retrieved 9 November 2022.
