RCFile

Within database management systems, the RCFile (Record Columnar File)[1] is a data placement structure that determines how relational tables are stored on computer clusters. It is designed for systems that use the MapReduce framework. The RCFile structure comprises a data storage format, a data compression approach, and optimization techniques for data reading. It is able to meet all four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to dynamic data access patterns.

RCFile is the result of research and collaborative efforts from Facebook, The Ohio State University, and the Institute of Computing Technology at the Chinese Academy of Sciences.

Summary

Data storage format

For example, consider a database table with four columns (c1 to c4):

c1 c2 c3 c4
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44
51 52 53 54

To serialize the table, RCFile partitions it first horizontally and then vertically, instead of only horizontally as a row-oriented DBMS (row-store) does. The horizontal partitioning splits the table into multiple row groups based on the row-group size, a user-specified value that determines the size of each row group. For example, the table above is partitioned into two row groups if the user specifies three rows as the size of each row group.

Row Group 1
c1 c2 c3 c4
11 12 13 14
21 22 23 24
31 32 33 34
Row Group 2
c1 c2 c3 c4
41 42 43 44
51 52 53 54

Then, within every row group, RCFile partitions the data vertically, as a column-store does. Thus, the table is serialized as:

Row Group 1      Row Group 2
11, 21, 31;      41, 51;
12, 22, 32;      42, 52;
13, 23, 33;      43, 53;
14, 24, 34;      44, 54;
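The two partitioning steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the on-disk RCFile layout: the function name serialize_rcfile_style is hypothetical, the row-group size is counted in rows as in the example above, and the result is kept as nested in-memory lists.

```python
from typing import List, Sequence


def serialize_rcfile_style(rows: Sequence[Sequence[int]],
                           row_group_size: int) -> List[List[List[int]]]:
    """Split rows into row groups, then lay out each group column by column."""
    row_groups = []
    for start in range(0, len(rows), row_group_size):
        group = rows[start:start + row_group_size]     # horizontal partitioning
        columns = [list(col) for col in zip(*group)]   # vertical partitioning
        row_groups.append(columns)
    return row_groups


table = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
]

layout = serialize_rcfile_style(table, row_group_size=3)
for number, columns in enumerate(layout, start=1):
    # One line per row group; columns are separated by semicolons.
    print(f"Row Group {number}:",
          "; ".join(", ".join(map(str, col)) for col in columns))
```

With a row-group size of three, the first group holds three values per column and the second holds the remaining two, matching the serialized layout shown above.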

Column data compression

Within each row group, columns are compressed to reduce storage space usage. Because the values of a column are stored adjacently, patterns within the column can be detected and a suitable compression algorithm can be selected to achieve a high compression ratio.
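A rough sketch of per-column compression follows. It assumes zlib as a stand-in codec and newline-joined text as the column encoding; both are illustrative choices rather than what RCFile itself prescribes, and the column names country and user_id are made up for the example.

```python
import zlib
from typing import Dict, List


def compress_row_group(columns: Dict[str, List[str]]) -> Dict[str, bytes]:
    """Compress each column of one row group independently."""
    compressed = {}
    for name, values in columns.items():
        raw = "\n".join(values).encode("utf-8")   # values of one column, stored adjacently
        compressed[name] = zlib.compress(raw)
    return compressed


def decompress_column(blob: bytes) -> List[str]:
    """Recover the original values of a single column."""
    return zlib.decompress(blob).decode("utf-8").split("\n")


# Hypothetical row group with one repetitive and one mostly distinct column.
row_group = {
    "country": ["US"] * 1000,
    "user_id": [str(i) for i in range(1000)],
}

blobs = compress_row_group(row_group)
for name, blob in blobs.items():
    raw_size = len("\n".join(row_group[name]).encode("utf-8"))
    print(name, raw_size, "->", len(blob), "bytes")
```

The repetitive column shrinks far more than the column of distinct values, which is the effect that storing a column's values together is meant to exploit. Because each column is compressed on its own, a reader can also decompress one column without touching the others.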

Performance benefits

A column-store is more efficient when a query requires only a subset of columns, because a column-store reads only the necessary columns from disk, whereas a row-store reads entire rows.

RCFile combines the merits of row-store and column-store via horizontal-vertical partitioning. With horizontal partitioning, RCFile places all columns of a row on a single machine and thus eliminates the extra network costs of constructing a row. With vertical partitioning, RCFile reads only the columns a query needs from disk and thus eliminates unnecessary local I/O costs. Moreover, within every row group, data can be compressed using the compression algorithms used in column-stores.

For example, a database might have this table:

EmpId Lastname Firstname Salary
10 Smith Joe 40000
12 Jones Mary 50000
11 Johnson Cathy 44000
22 Jones Bob 55000

This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary). This two-dimensional format exists only in theory; in practice, storage hardware requires the data to be serialized into one form or another.

In MapReduce-based systems, data is normally stored on a distributed system, such as the Hadoop Distributed File System (HDFS), and different data blocks might be stored on different machines. Thus, for a column-store on MapReduce, different groups of columns might be stored on different machines, which introduces extra network costs when a query projects columns placed on different machines. For MapReduce-based systems, the merit of a row-store is that there are no extra network costs to construct a row during query processing, and the merit of a column-store is that there are no unnecessary local I/O costs when reading data from disk.
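The saving from vertical partitioning can be illustrated with a small in-memory sketch. It assumes each row group is a dictionary mapping a column name to its list of values, with a row-group size of two rows; the scan function and this representation are hypothetical and only model which columns a query has to touch, not real disk I/O.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

RowGroup = Dict[str, List[object]]


def scan(row_groups: Sequence[RowGroup],
         wanted: Sequence[str]) -> Iterator[Tuple[object, ...]]:
    """Yield rows that contain only the requested columns."""
    for group in row_groups:
        # Only the requested columns are looked up; the rest are never read.
        selected = [group[name] for name in wanted]
        # All columns of a row live in the same row group, so rows can be
        # rebuilt locally without fetching data from another group.
        yield from zip(*selected)


row_groups = [
    {"EmpId": [10, 12], "Lastname": ["Smith", "Jones"],
     "Firstname": ["Joe", "Mary"], "Salary": [40000, 50000]},
    {"EmpId": [11, 22], "Lastname": ["Johnson", "Jones"],
     "Firstname": ["Cathy", "Bob"], "Salary": [44000, 55000]},
]

projected = list(scan(row_groups, ["Lastname", "Salary"]))
print(projected)
```

A query that projects Lastname and Salary touches two of the four columns in each row group, and every projected row is assembled from a single row group.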

Row-oriented systems

The common solution to the storage problem is to serialize each row of data, like this:

001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000; 
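As a minimal sketch, the row-oriented string above can be produced as follows. The function name serialize_rows and the three-digit row identifier format are assumptions chosen to mirror the example, not part of any particular DBMS.

```python
employees = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]


def serialize_rows(rows):
    """Write all fields of a row together, one row after another."""
    parts = []
    for row_id, row in enumerate(rows, start=1):
        fields = ",".join(str(value) for value in row)
        parts.append(f"{row_id:03d}:{fields};")
    return "".join(parts)


row_store = serialize_rows(employees)
print(row_store)
```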

Row-based systems are designed to efficiently return data for an entire row, or an entire record, in as few operations as possible. This matches use-cases where the system is attempting to retrieve all the information about a particular object, say the full information about one contact in a rolodex system, or the complete information about one product in an online shopping system.

Row-based systems are not efficient at performing operations that apply to the entire data set, as opposed to a specific record. For instance, to find all the records in the example table with salaries between 40,000 and 50,000, a row-based system would have to scan the entire data set looking for matching records. While the example table shown above may fit in a single disk block, a table with even a few hundred rows would not, so multiple disk operations would be needed to retrieve the data.
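The cost of such a scan can be sketched against the serialized row format from the example. The helper salaries_between is hypothetical, and the range is assumed to be inclusive at both ends; the point is that every record, including the fields the query does not need, has to be read and parsed.

```python
row_store = ("001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;"
             "003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;")


def salaries_between(serialized, low, high):
    """Return matching row ids and the number of records that had to be read."""
    matches = []
    records_read = 0
    for record in serialized.split(";"):
        if not record:            # skip the empty piece after the final ';'
            continue
        records_read += 1
        row_id, fields = record.split(":", 1)
        emp_id, lastname, firstname, salary = fields.split(",")
        if low <= int(salary) <= high:
            matches.append(row_id)
    return matches, records_read


matches, records_read = salaries_between(row_store, 40000, 50000)
print(matches, records_read)
```

Three of the four records match, but all four are read in full to find them.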

Column-oriented systems

A column-oriented system serializes all of the values of a column together, then the values of the next column. For our example table, the data would be stored in this fashion:

10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004; 
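The column-oriented string above can be generated with a short sketch. The function name serialize_columns is hypothetical, and the value:rowid pairing simply follows the example.

```python
employees = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]


def serialize_columns(rows):
    """Write all values of one column together, then the next column."""
    parts = []
    for column in zip(*rows):     # transpose rows into columns
        entries = ",".join(f"{value}:{row_id:03d}"
                           for row_id, value in enumerate(column, start=1))
        parts.append(entries + ";")
    return "".join(parts)


column_store = serialize_columns(employees)
print(column_store)
```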

The difference can be more clearly seen in this common modification:

...;Smith:001,Jones:002,004,Johnson:003;... 

Two of the records store the same value, "Jones", so the column-oriented system can now store it only once instead of twice. For many common searches, such as "find all the people with the last name Jones", the answer can now be retrieved in a single operation.

Whether or not a column-oriented system will be more efficient in operation depends heavily on the operations being automated. Operations that retrieve data for objects would be slower, requiring numerous disk operations to assemble data from different columns to build up a whole-row record. However, such whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, operations that collect the first names and last names from many rows in order to build a list of contacts are far more common than operations that read the data for a home address.
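The modification amounts to grouping row identifiers by value. A minimal sketch, assuming a plain dictionary as the grouping structure (the name group_row_ids is hypothetical):

```python
def group_row_ids(column):
    """Map each distinct value to the ids of the rows that contain it."""
    index = {}
    for row_id, value in enumerate(column, start=1):
        index.setdefault(value, []).append(f"{row_id:03d}")
    return index


lastnames = ["Smith", "Jones", "Johnson", "Jones"]
index = group_row_ids(lastnames)

# Each value is written once, followed by every row id that holds it.
serialized = ",".join(f"{value}:{','.join(ids)}"
                      for value, ids in index.items()) + ";"
print(serialized)

# "Find all the people with the last name Jones" becomes a single lookup.
print(index["Jones"])
```

The serialized form stores "Jones" once, and the lookup returns both matching row identifiers without visiting the other values.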

Adoption

RCFile has been adopted in real-world systems for big data analytics.

  1. RCFile became the default data placement structure in Facebook's production Hadoop cluster.[2] By 2010 this was the world's largest Hadoop cluster,[3] with 40 terabytes of compressed data added every day.[4] In addition, all the data sets stored in HDFS before RCFile were also converted to RCFile.[2]
  2. RCFile has been adopted in Apache Hive (since v0.4),[5] an open source data store system that runs on top of Hadoop and is widely used by companies around the world,[6] including Internet services such as Facebook, Taobao, and Netflix.[7]
  3. RCFile has been adopted in Apache Pig (since v0.7),[8] another open source data processing system that is widely used in many organizations,[9] including major Web service providers such as Twitter, Yahoo, LinkedIn, AOL, and Salesforce.com.
  4. RCFile became the de facto standard data storage structure in the Hadoop software environment supported by the Apache HCatalog project (formerly known as Howl[10]), which is the table and storage management service for Hadoop.[11] RCFile is supported by the open source Elephant Bird library used at Twitter for daily data analytics.[12]

Over the following years, other Hadoop data formats also became popular. In February 2013, an Optimized Row Columnar (ORC) file format was announced by Hortonworks.[13] A month later, the Apache Parquet format was announced, developed by Cloudera and Twitter.[14]

See also

  • Column (data store)
  • Column-oriented DBMS
  • MapReduce
  • Apache Hadoop
  • Apache Hive
  • Big data

References

  1. ^ Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, and Zhiwei Xu (2011). "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems". IEEE 27th International Conference on Data Engineering. pp. 1200–1208.
  2. ^ a b "Hive integration: HBase and Rcfile__HadoopSummit2010". 2010-06-30.
  3. ^ "Facebook has the world's largest Hadoop cluster!". 2010-05-09.
  4. ^ "Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain". 2011-02-24.
  5. ^ "Class RCFile". Archived from the original on 2011-11-23. Retrieved 2012-07-21.
  6. ^ "PoweredBy - Apache Hive - Apache Software Foundation".
  7. ^ "Hive user group presentation from Netflix (3/18/2010)". 2010-03-19.
  8. ^ "HiveRCInputFormat (Pig 0.17.0 API)".
  9. ^ "PoweredBy - Apache Pig - Apache Software Foundation".
  10. ^ Howl
  11. ^ "HCatalog". Archived from the original on 2012-07-20. Retrieved 2012-07-21.
  12. ^ "Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.: Kevinweil/elephant-bird". GitHub. 2018-12-15.
  13. ^ Alan Gates (February 20, 2013). "The Stinger Initiative: Making Apache Hive 100 Times Faster". Hortonworks blog. Retrieved May 4, 2017.
  14. ^ Justin Kestelyn (March 13, 2013). "Introducing Parquet: Efficient Columnar Storage for Apache Hadoop". Cloudera blog. Retrieved May 4, 2017.

External links

  • RCFile on the Apache Software Foundation website
  • Source Code
  • Hive website
  • Hive page on Hadoop Wiki
