Data engineering

Data engineering refers to the building of systems to enable the collection and usage of data. This data is usually used to enable subsequent analysis and data science, which often involves machine learning.[1][2] Making the data usable usually involves substantial compute and storage, as well as data processing and cleaning.

History

In the 1970s and 1980s, the term information engineering methodology (IEM) was coined to describe database design and the use of software for data analysis and processing.[3][4] These techniques were intended for use by database administrators (DBAs) and systems analysts, based on an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems. A key early contributor (often called the "father" of information engineering methodology) was the Australian Clive Finkelstein, who wrote several articles about it between 1976 and 1980, and also co-authored an influential Savant Institute report on it with James Martin.[5][6][7] Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role in revamping IEM as well as helping to design the IEM software product (user data), which helped automate IEM.

In the early 2000s, data and data tooling were generally held by the information technology (IT) teams in most companies.[8] Other teams then used the data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.

In the early 2010s, with the rise of the internet, the massive increase in data volumes, velocity, and variety led to the term big data to describe the data itself, and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer.[3][8] Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft, and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data, and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing, and metadata management.[3][8] This change in approach was particularly focused on cloud computing.[8] Data started to be handled and used by many parts of the business, such as sales and marketing, and not just IT.[8]

Tools

Compute

High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is dataflow programming, in which the computation is represented as a directed graph (dataflow graph); nodes are the operations, and edges represent the flow of data.[9] Popular implementations include Apache Spark and the deep-learning-specific TensorFlow.[9][10][11] More recent implementations, such as Differential/Timely Dataflow, have used incremental computing for much more efficient data processing.[9][12][13]
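The dataflow idea can be illustrated with a small sketch in plain Python. It is not how Spark or TensorFlow are implemented; the graph contents and the names `GRAPH` and `evaluate` are hypothetical, chosen only to show nodes as operations and edges as the data passed between them.

```python
# Hypothetical dataflow graph: each node is (operation, list of input nodes).
# The input lists are the incoming edges along which data flows.
GRAPH = {
    "source": (lambda: [1, 2, 3, 4], []),
    "double": (lambda xs: [2 * x for x in xs], ["source"]),
    "multiples_of_4": (lambda xs: [x for x in xs if x % 4 == 0], ["double"]),
    "total": (lambda xs, ys: sum(xs) + sum(ys), ["double", "multiples_of_4"]),
}


def evaluate(graph, name, cache=None):
    """Compute a node's output, evaluating each upstream node at most once."""
    cache = {} if cache is None else cache
    if name not in cache:
        operation, inputs = graph[name]
        # Follow the incoming edges first, then apply this node's operation.
        args = [evaluate(graph, upstream, cache) for upstream in inputs]
        cache[name] = operation(*args)
    return cache[name]


total = evaluate(GRAPH, "total")
print(total)
```

Because every node's output is cached, the shared "double" node is computed once even though two downstream nodes consume it, which is the kind of reuse a dataflow runtime can exploit.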

Storage

Data is stored in a variety of ways; one of the key deciding factors is how the data will be used.

Databases

If the data is structured and some form of online transaction processing is required, then databases are generally used.[14] Originally, relational databases were mostly used, with strong ACID transaction correctness guarantees; most relational databases use SQL for their queries. However, with the growth of data in the 2010s, NoSQL databases have also become popular, because they scale horizontally more easily than relational databases by giving up the ACID transaction guarantees, and because they reduce the object-relational impedance mismatch.[15] More recently, NewSQL databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.[16][17][18][19]
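As a minimal sketch of what an ACID transaction guarantees, the following Python example uses the standard-library sqlite3 module with an in-memory database. The `accounts` table and the `transfer` helper are hypothetical and exist only for illustration; the point is atomicity: either both updates take effect, or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits one account before the debit fails, yet the final balances show no trace of it, because the whole transaction is rolled back.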

Data warehouses

If the data is structured and online analytical processing is required (but not online transaction processing), then data warehouses are a common choice.[20] They enable data analysis, mining, and artificial intelligence on a much larger scale than databases can allow,[20] and indeed data often flows from databases into data warehouses.[21] Business analysts, data engineers, and data scientists can access data warehouses using tools such as SQL or business intelligence software.[21]
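The difference between transactional and analytical access can be sketched with a typical analytical query: instead of reading or updating single rows, it scans and aggregates many of them. The example below uses an in-memory SQLite database purely as a stand-in, since real data warehouses use different storage and execution engines; the `sales` table and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", 2021, 120.0),
        ("north", 2022, 150.0),
        ("south", 2021, 80.0),
        ("south", 2022, 95.0),
        ("south", 2022, 5.0),
    ],
)

# Analytical query: aggregate all rows by region and year.
rows = conn.execute(
    "SELECT region, year, SUM(amount) "
    "FROM sales GROUP BY region, year ORDER BY region, year"
).fetchall()

for region, year, total in rows:
    print(region, year, total)
```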

Data lakes

A data lake is a centralized repository for storing, processing, and securing large volumes of data. A data lake can contain structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be created on premises or in a cloud-based environment using the services from public cloud vendors such as Amazon, Microsoft, or Google.

Files

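If the data is less structured, it is often simply stored as files. There are several options:

  • File systems represent data hierarchically in nested folders.[22]
  • Block storage splits data into regularly sized chunks;[22] this often matches up with virtual hard drives or solid state drives.
  • Object storage manages data using metadata;[22] often each file is assigned a key such as a UUID.[23]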

Management

The number and variety of different data processes and storage locations can become overwhelming for users. This inspired the usage of a workflow management system (e.g. Airflow) to allow the data tasks to be specified, created, and monitored.[24] The tasks are often specified as a directed acyclic graph (DAG).[24]
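To show why a DAG is a convenient way to specify tasks, the sketch below orders a hypothetical pipeline with Python's standard-library graphlib module (Python 3.9 or later). It does not use Airflow's API; the task names and the helpers `execution_order` and `is_acyclic` are assumptions made for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
TASKS = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def execution_order(dag):
    """Return the tasks in an order where every dependency runs first."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """A workflow is only schedulable if its dependency graph has no cycle."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


order = execution_order(TASKS)
print(order)
```

Requiring the graph to be acyclic is what guarantees that such an order exists; a cycle would mean two tasks each wait on the other.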

Lifecycle

Business planning

Business objectives that executives set for the future are defined in strategic business plans, refined in more detail in tactical business plans, and implemented in operational business plans. Most businesses today recognize the fundamental need to develop a business plan that follows this strategy. These plans are often difficult to implement because of a lack of transparency at the tactical and operational levels of organizations. This kind of planning requires feedback so that problems caused by miscommunication and misinterpretation of the business plan can be corrected early.

Systems design

The design of data systems involves several components, such as architecting data platforms and designing data stores.[25][26]

Data modeling

This is the process of producing a data model, an abstract model to describe the data and relationships between different parts of the data.[27]
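As an illustration only, a very small data model can be written down with Python dataclasses. The entities `Customer`, `Order`, and `OrderLine` and their fields are hypothetical; the sketch just shows how a model captures both the data and the relationships between its parts.

```python
from dataclasses import dataclass, field


@dataclass
class Customer:
    customer_id: int
    name: str


@dataclass
class OrderLine:
    product: str
    quantity: int
    unit_price: float


@dataclass
class Order:
    order_id: int
    customer: Customer  # many orders can refer to one customer
    lines: list[OrderLine] = field(default_factory=list)  # one order, many lines

    def total(self) -> float:
        """Sum of quantity times unit price over all lines of the order."""
        return sum(line.quantity * line.unit_price for line in self.lines)


order = Order(
    order_id=1,
    customer=Customer(customer_id=7, name="Example Ltd"),
    lines=[OrderLine("widget", 2, 3.0), OrderLine("gadget", 1, 4.5)],
)
print(order.total())
```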

Roles

Data engineer

A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into insights.[28] Data engineers focus on the production readiness of data and on aspects such as formats, resilience, scaling, and security. They usually come from a software engineering background and are proficient in programming languages like Java, Python, Scala, and Rust.[29][3] Compared with data scientists, they tend to be more familiar with databases, architecture, cloud computing, and Agile software development.[3]
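The extract, transform, load (ETL) pattern mentioned above can be sketched at toy scale in Python. Everything here is an assumption for illustration: the inline CSV text `RAW`, the `signups` table, and the three helper functions stand in for real sources, cleaning rules, and storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw input with stray whitespace, a missing date, and mixed case.
RAW = (
    "username,signup_date,country\n"
    " alice ,2022-01-05,us\n"
    "bob,,DE\n"
    "carol,2022-03-09,de\n"
)


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a signup date and normalize the fields."""
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue
        cleaned.append(
            (row["username"].strip(), row["signup_date"], row["country"].upper())
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups "
        "(username TEXT, signup_date TEXT, country TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
loaded = conn.execute(
    "SELECT username, country FROM signups ORDER BY username"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one testable on its own, which matters more as pipelines grow.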

Data scientist

Data scientists are more focused on the analysis of the data; they tend to be more familiar with mathematics, algorithms, statistics, and machine learning.[3]

See also

  • Big data
  • Information technology
  • Software engineering
  • Computer science

References

  1. ^ "What is Data Engineering? | A Quick Glance of Data Engineering". EDUCBA. January 5, 2020. Retrieved July 31, 2022.
  2. ^ "Introduction to Data Engineering". Dremio. Retrieved July 31, 2022.
  3. ^ a b c d e f Black, Nathan (January 15, 2020). "What is Data Engineering and Why Is It So Important?". QuantHub. Retrieved July 31, 2022.
  4. ^ "Information Engineering - an overview | ScienceDirect Topics". www.sciencedirect.com. Retrieved August 23, 2022.
  5. ^ "Information engineering", part 3, part 4, part 5, part 6, by Clive Finkelstein. In Computerworld, In depths, appendix. May 25 – June 15, 1981.
  6. ^ Christopher Allen, Simon Chatwin, Catherine Creary (2003). Introduction to Relational Databases and SQL Programming.
  7. ^ Terry Halpin, Tony Morgan (2010). Information Modeling and Relational Databases. p. 343
  8. ^ a b c d e Dodds, Eric. "The History of the Data Engineering and the Megatrends". Rudderstack. Retrieved July 31, 2022.
  9. ^ a b c Schwarzkopf, Malte (March 7, 2020). "The Remarkable Utility of Dataflow Computing". ACM SIGOPS. Retrieved July 31, 2022.
  10. ^ "sparkpaper" (PDF). Retrieved July 31, 2022.
  11. ^ Abadi, Martin; Barham, Paul; Chen, Jianmin; Chen, Zhifeng; Davis, Andy; Dean, Jeffrey; Devin, Matthieu; Ghemawat, Sanjay; Irving, Geoffrey; Isard, Michael; Kudlur, Manjunath; Levenberg, Josh; Monga, Rajat; Moore, Sherry; Murray, Derek G.; Steiner, Benoit; Tucker, Paul; Vasudevan, Vijay; Warden, Pete; Wicke, Martin; Yu, Yuan; Zheng, Xiaoqiang (2016). "TensorFlow: A system for large-scale machine learning". 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283. Retrieved July 31, 2022.
  12. ^ McSherry, Frank; Murray, Derek; Isaacs, Rebecca; Isard, Michael (January 5, 2013). "Differential dataflow". Microsoft. Retrieved July 31, 2022.
  13. ^ "Differential Dataflow". Timely Dataflow. July 30, 2022. Retrieved July 31, 2022.
  14. ^ "Lecture Notes | Database Systems | Electrical Engineering and Computer Science | MIT OpenCourseWare". ocw.mit.edu. Retrieved July 31, 2022.
  15. ^ Leavitt, Neal (2010). "Will NoSQL Databases Live Up to Their Promise?" (PDF). IEEE Computer. 43 (2): 12–14. doi:10.1109/MC.2010.58. S2CID 26876882.
  16. ^ Aslett, Matthew (2011). "How Will The Database Incumbents Respond To NoSQL And NewSQL?" (PDF). 451 Group (published April 4, 2011). Retrieved February 22, 2020.
  17. ^ Pavlo, Andrew; Aslett, Matthew (2016). "What's Really New with NewSQL?" (PDF). SIGMOD Record. Retrieved February 22, 2020.
  18. ^ Stonebraker, Michael (June 16, 2011). "NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps". Communications of the ACM Blog. Retrieved February 22, 2020.
  19. ^ Hoff, Todd (September 24, 2012). "Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In". Retrieved February 22, 2020.
  20. ^ a b "What is a Data Warehouse?". www.ibm.com. Retrieved July 31, 2022.
  21. ^ a b "What is a Data Warehouse? | Key Concepts | Amazon Web Services". Amazon Web Services, Inc. Retrieved July 31, 2022.
  22. ^ a b c "File storage, block storage, or object storage?". www.redhat.com. Retrieved July 31, 2022.
  23. ^ "Cloud Object Storage – Amazon S3 – Amazon Web Services". Amazon Web Services, Inc. Retrieved July 31, 2022.
  24. ^ a b "Home". Apache Airflow. Retrieved July 31, 2022.
  25. ^ "Introduction to Data Engineering". Coursera. Retrieved July 31, 2022.
  26. ^ Finkelstein, Clive. What are The Phases of Information Engineering.
  27. ^ "What is Data Modelling? Overview, Basic Concepts, and Types in Detail". Simplilearn.com. June 15, 2021. Retrieved July 31, 2022.
  28. ^ Tamir, Mike; Miller, Steven; Gagliardi, Alessandro (December 11, 2015). "The Data Engineer". Rochester, NY. doi:10.2139/ssrn.2762013. S2CID 113342650. SSRN 2762013.
  29. ^ "Data Engineer vs. Data Scientist". Springboard Blog. February 7, 2019. Retrieved March 14, 2021.

Further reading

  • John Hares (1992). "Information Engineering for the Advanced Practitioner", Wiley.
  • Clive Finkelstein (1989). An Introduction to Information Engineering: From Strategic Planning to Information Systems. Sydney: Addison-Wesley.
  • Clive Finkelstein (1992). "Information Engineering: Strategic Systems Development". Sydney: Addison-Wesley.
  • Ian Macdonald (1986). "Information engineering". in: Information Systems Design Methodologies. T.W. Olle et al. (ed.). North-Holland.
  • Ian Macdonald (1988). "Automating the Information engineering methodology with the Information Engineering Facility". In: Computerized Assistance during the Information Systems Life Cycle. T.W. Olle et al. (ed.). North-Holland.
  • James Martin and Clive Finkelstein. (1981). Information engineering. Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK.
  • James Martin (1989). Information engineering. (3 volumes), Prentice-Hall Inc.
  • Clive Finkelstein (2006) "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". First Edition, Artech House, Norwood MA in hardcover.
  • Clive Finkelstein (2011) "Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies". Second Edition is in PDF at www.ies.aust.com and as an ebook on the Apple iPad and ebook on the Amazon Kindle.
  • Reis, Joe; Housley, Matt (2022) "Fundamentals of Data Engineering". O'Reilly Media, Inc. ISBN 9781098108304

External links

  • The Complex Method IEM
  • Enterprise Engineering and Rapid Delivery of Enterprise Architecture
