Synthetic data

Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models.[1]

Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated.

Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. In many sensitive applications, datasets theoretically exist but cannot be released to the general public;[2] synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation.

Usefulness

Synthetic data is generated to meet specific needs or certain conditions that may not be found in the original, real data. This can be useful when designing many systems, from simulations based on theoretical value, to database processors, etc. This helps detect and solve unexpected issues such as information processing limitations. Synthetic data are often generated to represent the authentic data and allows a baseline to be set.[3] Another benefit of synthetic data is to protect the privacy and confidentiality of authentic data, while still allowing for use in testing systems.

The abstract of a scientific article on testing fraud detection systems describes software that generates synthetic data for this purpose: "This enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment."[3] In defense and military contexts, synthetic data is seen as a potentially valuable tool for developing and improving complex AI systems, particularly where high-quality real-world data is scarce.[4]

History

Scientific modelling of physical systems, which makes it possible to run simulations and thereby estimate, compute, or generate data points that have not been observed in reality, has a long history that runs concurrent with the history of physics itself. For example, research into the synthesis of audio and voice can be traced back to the 1930s and before, driven by developments such as the telephone and audio recording. Digitization gave rise to software synthesizers from the 1970s onwards.

In the context of privacy-preserving statistical analysis, fully synthetic data was proposed by Rubin in 1993.[5] Rubin originally designed this approach to synthesize the Decennial Census long-form responses for the short-form households. He then released samples that did not include any actual long-form records, thereby preserving the anonymity of the households.[6] Later that year, the idea of partially synthetic data was introduced by Little, who used it to synthesize the sensitive values on the public use file.[7]

In 1994, Fienberg proposed a critical refinement, in which a parametric posterior predictive distribution (instead of a Bayes bootstrap) is used to do the sampling.[6] Later, other important contributors to the development of synthetic data generation were Trivellore Raghunathan, Jerry Reiter, Donald Rubin, John M. Abowd, and Jim Woodcock. Collectively they devised a solution for treating partially synthetic data with missing data, and developed the technique of Sequential Regression Multivariate Imputation.[6]

Calculations

Researchers may test an algorithmic framework on synthetic data, which can be "the only source of ground truth on which they can objectively assess the performance of their algorithms".[8]

Synthetic data can be generated through the use of random lines with different orientations and starting positions.[9] More complicated datasets can be generated with a "synthesizer build": a model or equation, fitted to the original data as closely as possible, which can then be used to generate further data.[10]

Constructing a synthesizer build amounts to constructing a statistical model. In a linear regression example, the original data can be plotted and a line of best fit estimated from it; this line is the synthesizer created from the original data. The next step is to generate more synthetic data from this synthesizer build, i.e. from the fitted line's equation. In this way, the new data can be used for studies and research while protecting the confidentiality of the original data.[10]
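
As a concrete illustration of this idea, the following minimal sketch in Python (using NumPy and scikit-learn; the data, noise model, and variable names are hypothetical, not taken from the cited tutorial) fits a regression line to original data and then samples new synthetic points from the fitted line plus estimated noise.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Hypothetical original (confidential) data: a predictor x and a response y.
    x_real = rng.uniform(0, 10, size=200).reshape(-1, 1)
    y_real = 3.0 * x_real.ravel() + 2.0 + rng.normal(0.0, 1.5, size=200)

    # Step 1: the "synthesizer build" -- a model fitted to the original data.
    synthesizer = LinearRegression().fit(x_real, y_real)
    residual_std = np.std(y_real - synthesizer.predict(x_real))

    # Step 2: generate synthetic data from the fitted line plus estimated noise,
    # so the original (x_real, y_real) pairs never need to be released.
    x_synth = rng.uniform(0, 10, size=1000).reshape(-1, 1)
    y_synth = synthesizer.predict(x_synth) + rng.normal(0.0, residual_std, size=1000)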

David Jensen from the Knowledge Discovery Laboratory explains the motivation for generating synthetic data: "Researchers frequently need to explore the effects of certain data characteristics on their data model."[10] To help construct datasets exhibiting specific properties, such as auto-correlation or degree disparity, the Proximity system can generate synthetic data having one of several types of graph structure: random graphs generated by some random process, lattice graphs with a ring structure, lattice graphs with a grid structure, etc.[10] In all cases, generation follows the same two-step process:

  1. Generate the empty graph structure.
  2. Generate attribute values based on user-supplied prior probabilities.

Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.[10]
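
A minimal sketch of this two-step process, assuming the NetworkX library and made-up attribute priors (an illustration of the idea only, not the Proximity tool's own scripting interface), might look as follows.

    import random
    import networkx as nx

    random.seed(42)

    # Step 1: generate the (initially attribute-free) graph structure.
    random_graph = nx.erdos_renyi_graph(n=50, p=0.05)   # random process
    ring_lattice = nx.cycle_graph(50)                    # ring-structured lattice
    grid_lattice = nx.grid_2d_graph(5, 10)               # grid-structured lattice

    # Step 2: assign attribute values from user-supplied prior probabilities.
    priors = {"A": 0.1, "B": 0.9}  # hypothetical priors over a categorical attribute
    for node in random_graph.nodes:
        value = random.choices(list(priors), weights=list(priors.values()))[0]
        random_graph.nodes[node]["attribute"] = value

    # Because attribute values of related objects may be dependent, a simple
    # collective pass can, for example, copy a neighbour's value with some probability.
    for u, v in random_graph.edges:
        if random.random() < 0.3:
            random_graph.nodes[v]["attribute"] = random_graph.nodes[u]["attribute"]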

Applications

Fraud detection and confidentiality systems

Fraud detection and confidentiality systems can be tested and trained using synthetic data. Specific algorithms and generators are designed to create realistic data,[11] which then assists in teaching a system how to react to certain situations or criteria. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data were not used, the software would only be trained to react to the situations present in the authentic data, and it might not recognize other types of intrusion.[3]
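
As an illustration of this kind of augmentation (a toy sketch with made-up feature names and distributions, not the approach of the cited paper), a classifier can be trained on authentic benign traffic together with synthetically generated intrusion records representing attack patterns absent from the authentic data.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)

    # Authentic traffic (hypothetical features: requests per second, payload size).
    benign = rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(1000, 2))
    y_benign = np.zeros(1000)

    # Synthetic intrusions: attack patterns not present in the authentic data,
    # e.g. bursts of very high request rates with small payloads.
    synthetic_attacks = rng.normal(loc=[200.0, 50.0], scale=[30.0, 10.0], size=(200, 2))
    y_attacks = np.ones(200)

    # Train on the authentic data augmented with the synthetic intrusion instances.
    X = np.vstack([benign, synthetic_attacks])
    y = np.concatenate([y_benign, y_attacks])
    detector = RandomForestClassifier(random_state=0).fit(X, y)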

Scientific research

Researchers doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing.

Real data can contain information that researchers may not want released,[12] so synthetic data is sometimes used to protect the privacy and confidentiality of a dataset. Using synthetic data reduces confidentiality and privacy issues since it holds no personal information and cannot be traced back to any individual.

Machine learning

Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to enable more data science experiments via the construction of general-purpose synthetic data generators, such as the Synthetic Data Vault.[13] In general, synthetic data has several natural advantages:

  • once the synthetic environment is ready, it is fast and cheap to produce as much data as needed;
  • synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand;
  • the synthetic environment can be modified to improve the model and training;
  • synthetic data can be used as a substitute for certain real data segments that contain, e.g., sensitive information.
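
General-purpose generators of the kind mentioned above typically follow a fit-then-sample workflow: learn a statistical model of the real table, then draw as many synthetic rows as needed. The sketch below is a stand-in for that workflow using scikit-learn's GaussianMixture and made-up data; it is not the Synthetic Data Vault's own API.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(3)

    # Hypothetical "real" tabular data with two numeric columns.
    real_data = np.column_stack([
        rng.normal(40.0, 12.0, size=500),     # e.g. age
        rng.lognormal(10.0, 0.5, size=500),   # e.g. income
    ])

    # Fit a generative model to the real data ...
    generator = GaussianMixture(n_components=5, random_state=0).fit(real_data)

    # ... then sample as many synthetic rows as needed.
    synthetic_data, _ = generator.sample(n_samples=10_000)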

Such use of synthetic training data has been proposed for computer vision applications, in particular object detection, where the synthetic environment is a 3D model of the object,[14] and for learning to navigate environments from visual information.

At the same time, transfer learning remains a nontrivial problem, and synthetic data has not become ubiquitous yet. Research results indicate that adding a small amount of real data significantly improves transfer learning with synthetic data. Advances in generative adversarial networks (GANs) led to the natural idea that such models can produce data which is then used for training. Since at least 2016, such adversarial training has been successfully used to produce synthetic data of sufficient quality to achieve state-of-the-art results in some domains, without even needing to mix real data back in with the generated synthetic data.[15]
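
The benefit of mixing in a small amount of real data can be illustrated with a toy experiment (scikit-learn assumed; all data here is simulated, with a shift parameter standing in for the synthetic-to-real domain gap): train one classifier on the "synthetic" samples alone and another on those samples plus a small "real" subset, then compare both on held-out "real" data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(7)

    def make_data(n, shift):
        """Two-class toy data; `shift` models the synthetic-to-real domain gap."""
        class0 = rng.normal([0.0 + shift, 0.0], 1.0, size=(n // 2, 2))
        class1 = rng.normal([2.0 + shift, 2.0], 1.0, size=(n // 2, 2))
        X = np.vstack([class0, class1])
        y = np.array([0] * (n // 2) + [1] * (n // 2))
        return X, y

    X_synth, y_synth = make_data(2000, shift=0.8)           # abundant synthetic data
    X_real_small, y_real_small = make_data(50, shift=0.0)   # scarce real data
    X_real_test, y_real_test = make_data(1000, shift=0.0)   # held-out real data

    synthetic_only = LogisticRegression().fit(X_synth, y_synth)
    mixed = LogisticRegression().fit(
        np.vstack([X_synth, X_real_small]),
        np.concatenate([y_synth, y_real_small]),
    )

    print("synthetic only:", accuracy_score(y_real_test, synthetic_only.predict(X_real_test)))
    print("synthetic + real:", accuracy_score(y_real_test, mixed.predict(X_real_test)))

In practice, the scarce real samples are often weighted more heavily or used in a separate fine-tuning step rather than simply concatenated with the synthetic data.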

Examples

In 1987, a Navlab autonomous vehicle used 1200 synthetic road images as one approach to training.[16]

In 2021, Microsoft released a database of 100,000 synthetic faces based on 500 real faces that it claims "match real data in accuracy".[16][17]

See also

  • Surrogate data
  • Reinforcement learning
  • Rendering (computer graphics)

References

  1. ^ "What is synthetic data? - Definition from WhatIs.com". SearchCIO. Retrieved 2022-09-08.
  2. ^ Nikolenko, Sergey I. (2021). Synthetic Data for Deep Learning. Springer Optimization and Its Applications. Vol. 174. doi:10.1007/978-3-030-75178-4. ISBN 978-3-030-75177-7. S2CID 202750227.
  3. ^ a b c Barse, E.L.; Kvarnström, H.; Jonsson, E. (2003). Synthesizing test data for fraud detection systems. Proceedings of the 19th Annual Computer Security Applications Conference. IEEE. doi:10.1109/CSAC.2003.1254343.
  4. ^ Deng, Harry (30 November 2023). "Exploring Synthetic Data for Artificial Intelligence and Autonomous Systems: A Primer". United Nations Institute for Disarmament Research.
  5. ^ "Discussion: Statistical Disclosure Limitation". Journal of Official Statistics. 9: 461–468. 1993.
  6. ^ a b c Abowd, John M. "Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods. [Powerpoint slides]". Retrieved 17 February 2011.
  7. ^ "Statistical Analysis of Masked Data". Journal of Official Statistics. 9: 407–426. 1993.
  8. ^ Jackson, Charles; Murphy, Robert F.; Kovačević, Jelena (September 2009). "Intelligent Acquisition and Learning of Fluorescence Microscope Data Models" (PDF). IEEE Transactions on Image Processing. 18 (9): 2071–84. Bibcode:2009ITIP...18.2071J. doi:10.1109/TIP.2009.2024580. PMID 19502128. S2CID 3718670.
  9. ^ Wang, Aiqi; Qiu, Tianshuang; Shao, Longtan (July 2009). "A Simple Method of Radial Distortion Correction with Centre of Distortion Estimation". Journal of Mathematical Imaging and Vision. 35 (3): 165–172. doi:10.1007/s10851-009-0162-1. S2CID 207175690.
  10. ^ a b c d e David Jensen (2004). "6. Using Scripts". Proximity 4.3 Tutorial.
  11. ^ Deng, Robert H.; Bao, Feng; Zhou, Jianying (December 2002). Information and Communications Security. Proceedings of the 4th International Conference, ICICS 2002 Singapore. ISBN 9783540361596.
  12. ^ Abowd, John M.; Lane, Julia (June 9–11, 2004). New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers. Privacy in Statistical Databases: CASC Project Final Conference, Proceedings. Barcelona, Spain. doi:10.1007/978-3-540-25955-8_22.
  13. ^ Patki, Neha; Wedge, Roy; Veeramachaneni, Kalyan. The Synthetic Data Vault. Data Science and Advanced Analytics (DSAA) 2016. IEEE. doi:10.1109/DSAA.2016.49.
  14. ^ Peng, Xingchao; Sun, Baochen; Ali, Karim; Saenko, Kate (2015). "Learning Deep Object Detectors from 3D Models". arXiv:1412.7122 [cs.CV].
  15. ^ Shrivastava, Ashish; Pfister, Tomas; Tuzel, Oncel; Susskind, Josh; Wang, Wenda; Webb, Russ (2016). "Learning from Simulated and Unsupervised Images through Adversarial Training". arXiv:1612.07828 [cs.CV].
  16. ^ a b "Neural Networks Need Data to Learn. Even If It's Fake". June 2023. Retrieved 17 June 2023.
  17. ^ Wood, Erroll; Baltrušaitis, Tadas; Hewitt, Charlie; Dziadzio, Sebastian; Cashman, Thomas J.; Shotton, Jamie (2021). "Fake It Till You Make It: Face Analysis in the Wild Using Synthetic Data Alone". Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV): 3681–3691. arXiv:2109.15102.
  • Duncan, G. (2006). "Statistical Confidentiality: Is Synthetic Data the Answer?". Archived from the original on 2006-09-05.
  • Adam Coates and Blake Carpenter and Carl Case and Sanjeev Satheesh and Bipin Suresh and Tao Wang and David J. Wu and Andrew Y. Ng (2011). "Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning" (PDF). ICDAR. pp. 440–445. Retrieved 13 May 2014.
  • "Three Common Misconceptions about Synthetic and Anonymised Data". 28 November 2019.

Further reading

  • Fienberg, Stephen E. (1994). "Conflicts between the needs for access to statistical information and demands for confidentiality". Journal of Official Statistics. 10 (2): 115–132.
  • Little, Roderick J.A. (1993). "Statistical Analysis of Masked Data". Journal of Official Statistics. 9 (2): 407–426.
  • Raghunathan, T.E.; Reiter, J.P.; Rubin, D.B. (2003). "Multiple Imputation for Statistical Disclosure Limitation" (PDF). Journal of Official Statistics. 19 (1): 1–16.
  • Reiter, Jerome P. (2004). "Simultaneous Use of Multiple Imputation for Missing Data and Disclosure Limitation" (PDF). Survey Methodology. 30: 235–242.
