
Apache Pig

Apache Pig[1] is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin.[1] Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.[2] Pig Latin abstracts the Java MapReduce idiom into a higher-level notation, much as SQL does for relational database management systems. Pig Latin can be extended with user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy[3] and then call directly from the language.
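As a rough illustration of what a UDF can look like, the sketch below shows a small Python function of the kind Pig could call. It assumes Pig's convention of annotating Python UDFs with an `outputSchema` decorator; a stand-in for that decorator is defined here so the file runs on its own outside Pig, and the function name `normalize_word` is hypothetical.

```python
def outputSchema(schema):
    """Stand-in for the decorator Pig makes available to Python UDFs.

    Assumption: recording the declared schema on the function is enough
    to illustrate the idea when running outside of Pig.
    """
    def decorate(func):
        func.outputSchema = schema
        return func
    return decorate


@outputSchema("word:chararray")
def normalize_word(word):
    """Hypothetical UDF: trim surrounding whitespace and lower-case a word."""
    if word is None:  # a null field arrives as None
        return None
    return word.strip().lower()


if __name__ == "__main__":
    print(normalize_word("  Hadoop "))
```

Inside a Pig Latin script, a file like this would typically be registered first, after which the function can be used in expressions much like a built-in.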

Apache Pig
Developer(s): Apache Software Foundation, Yahoo Research
Initial release: September 11, 2008
Stable release: 0.17.0 / June 19, 2017
Repository: svn.apache.org/repos/asf/pig/
Operating system: Microsoft Windows, OS X, Linux
Type: Data analytics
License: Apache License 2.0
Website: pig.apache.org

History

Apache Pig was originally[4] developed at Yahoo Research around 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007,[5] it was moved into the Apache Software Foundation.

Version  Original release date  Latest version  Release date[6]  Status
0.1      2008-09-11             0.1.1           2008-12-05       Old version, no longer maintained
0.2      2009-04-08             0.2.0           2009-04-08       Old version, no longer maintained
0.3      2009-06-25             0.3.0           2009-06-25       Old version, no longer maintained
0.4      2009-08-29             0.4.0           2009-08-29       Old version, no longer maintained
0.5      2009-09-29             0.5.0           2009-09-29       Old version, no longer maintained
0.6      2010-03-01             0.6.0           2010-03-01       Old version, no longer maintained
0.7      2010-05-13             0.7.0           2010-05-13       Old version, no longer maintained
0.8      2010-12-17             0.8.1           2011-04-24       Old version, no longer maintained
0.9      2011-07-29             0.9.2           2012-01-22       Old version, no longer maintained
0.10     2012-01-22             0.10.1          2012-04-25       Old version, no longer maintained
0.11     2013-02-21             0.11.1          2013-04-01       Old version, no longer maintained
0.12     2013-10-14             0.12.1          2014-04-14       Old version, no longer maintained
0.13     2014-07-04             0.13.0          2014-07-04       Old version, no longer maintained
0.14     2014-11-20             0.14.0          2014-11-20       Old version, no longer maintained
0.15     2015-06-06             0.15.0          2015-06-06       Old version, no longer maintained
0.16     2016-06-08             0.16.0          2016-06-08       Old version, no longer maintained
0.17     2017-06-19             0.17.0          2017-06-19       Current stable version

Naming

The name Pig was chosen arbitrarily; it stuck because it was memorable, easy to spell, and novel.[7][8][9]

The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something. Off the top of his head, one researcher suggested Pig, and the name stuck. It is quirky yet memorable and easy to spell. While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository.

— Alan Gates, Daniel Dai, "What Is Pig?", Programming Pig, 2nd Edition (November 2017)

Example

Below is an example of a "Word Count" program in Pig Latin:

    input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

    -- Extract words from each line and put them into a pig bag
    -- datatype, then flatten the bag to get one word on each row
    words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

    -- filter out any words that are just white spaces
    filtered_words = FILTER words BY word MATCHES '\\w+';

    -- create a group for each word
    word_groups = GROUP filtered_words BY word;

    -- count the entries in each group
    word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

    -- order the records by count
    ordered_word_count = ORDER word_count BY count DESC;
    STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

The above program will generate parallel executable tasks which can be distributed across multiple machines in a Hadoop cluster to count the occurrences of each word in a dataset such as all the webpages on the internet.
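For readers more familiar with general-purpose languages, the same data flow can be sketched on a single machine in plain Python. This is only an approximation of the Pig Latin steps, not how Pig executes them: the function name `word_count` is an assumption, and whitespace splitting stands in for Pig's TOKENIZE, which also breaks on some punctuation.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror the Pig Latin steps: tokenize, filter, group, count, order."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row
    words = (word for line in lines for word in line.split())
    # FILTER ... BY word MATCHES '\\w+': keep tokens made only of word characters
    filtered = (w for w in words if re.fullmatch(r"\w+", w))
    # GROUP ... BY word, then COUNT the entries in each group
    counts = Counter(filtered)
    # ORDER ... BY count DESC, emitting (count, word) like the Pig script
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    for count, word in word_count(["pig latin pig", "latin pig ??"]):
        print(count, word)
```

Each intermediate name in the Pig script corresponds to one step here; the difference is that Pig compiles those steps into distributed jobs rather than running them in one process.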

Pig vs SQL

In comparison to SQL, Pig

  1. has a nested relational model,
  2. uses lazy evaluation,
  3. uses extract, transform, load (ETL),
  4. is able to store data at any point during a pipeline,
  5. declares execution plans,
  6. supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines.
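The lazy-evaluation point in the list above can be illustrated with a small Python sketch. The `Relation` class is a hypothetical stand-in, not part of Pig: each operation only records a step, and nothing is computed until `store()` is called, loosely analogous to a Pig script doing no work until a STORE or DUMP statement is reached.

```python
class Relation:
    """Hypothetical stand-in for a Pig relation: a source plus pending steps."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def foreach(self, fn):
        # Record the transformation; nothing is computed yet.
        return Relation(self.source, self.steps + (("foreach", fn),))

    def filter(self, pred):
        # Record the predicate; nothing is computed yet.
        return Relation(self.source, self.steps + (("filter", pred),))

    def store(self):
        """Run the recorded plan and return the resulting rows."""
        rows = iter(self.source)
        for kind, fn in self.steps:
            rows = map(fn, rows) if kind == "foreach" else filter(fn, rows)
        return list(rows)


if __name__ == "__main__":
    seen = []

    def double(x):
        seen.append(x)  # lets us observe when the function really runs
        return 2 * x

    plan = Relation([1, 2, 3]).foreach(double).filter(lambda x: x > 2)
    print(seen)          # still empty: the plan has only been described
    print(plan.store())  # the work happens here
```

Because the whole plan is known before anything runs, a system built this way has the opportunity to reorder or combine steps before execution.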

On the other hand, it has been argued that DBMSs are substantially faster than MapReduce systems once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.[10]

Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL, users can specify that data from two tables must be joined, but usually leave the choice of join implementation to the query optimizer. (Some SQL systems do let the writer choose a join implementation, but "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm.") Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways.[11] In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.[12]
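To make the idea of choosing a join implementation concrete, the Python sketch below shows two algorithms that compute the same inner join in different ways. The function names are assumptions chosen for illustration; they loosely correspond to the in-memory ("replicated") and sort-merge strategies that a script author might pick between, and are not Pig's own code.

```python
from collections import defaultdict


def replicated_join(large, small):
    """Hash join: load the small input into memory, stream the large one."""
    index = defaultdict(list)
    for key, value in small:
        index[key].append(value)
    return [(key, lv, sv) for key, lv in large for sv in index.get(key, [])]


def merge_join(left, right):
    """Sort-merge join: both inputs must already be sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows on the right that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row with this key against the whole right-hand run.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], rv) for _, rv in right[j:j_end])
                i += 1
            j = j_end
    return out


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob"), (2, "cy"), (4, "dee")]
    visits = [(2, "/home"), (2, "/faq"), (3, "/x"), (4, "/home")]
    print(replicated_join(users, visits) == merge_join(users, visits))
```

Which one is cheaper depends on the data: the hash variant needs one input to fit in memory, while the merge variant needs both inputs sorted. That is exactly the kind of knowledge the quoted passage says a query writer may or may not have.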

SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.[11]
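A stream split of this kind can be sketched in Python as follows. The helper `split` and the two-branch flow in `process` are hypothetical names used only to show the shape of such a graph: one input feeds two branches, and each branch applies a different operator.

```python
def split(rows, **conditions):
    """Route each row to every named branch whose condition holds."""
    branches = {name: [] for name in conditions}
    for row in rows:
        for name, cond in conditions.items():
            if cond(row):
                branches[name].append(row)
    return branches


def process(counts):
    """Hypothetical two-branch flow over (word, count) rows."""
    parts = split(counts,
                  frequent=lambda r: r[1] >= 10,
                  rare=lambda r: r[1] < 10)
    # Branch 1: order the frequent words by count, highest first.
    top_words = sorted(parts["frequent"], key=lambda r: r[1], reverse=True)
    # Branch 2: aggregate the rare words into a single total.
    rare_total = sum(n for _, n in parts["rare"])
    return top_words, rare_total


if __name__ == "__main__":
    rows = [("pig", 12), ("latin", 3), ("hadoop", 40), ("grunt", 1)]
    print(process(rows))
```

The function returns two results from one input, which is the situation the paragraph describes as awkward for a single SQL query.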

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.[11]

See also

  • Apache Hive
  • Sawzall, a similar tool from Google

References

  1. "Hadoop: Apache Pig". Retrieved Sep 2, 2011.
  2. "[PIG-4167] Initial implementation of Pig on Spark - ASF JIRA". issues.apache.org. Retrieved 2018-12-29.
  3. "Pig user defined functions". Retrieved May 3, 2013.
  4. "Yahoo Blog: Pig - The Road to an Efficient High-level language for Hadoop". Archived from the original on February 3, 2016. Retrieved May 23, 2015.
  5. "Pig into Incubation at the Apache Software Foundation". Archived from the original on February 3, 2016. Retrieved May 23, 2015.
  6. "Apache Pig Releases". Apache. Retrieved 2019-03-13.
  7. "1. What Is Pig? - Programming Pig, 2nd Edition [Book]". www.oreilly.com. Retrieved 2021-08-01.
  8. Gates, Alan (2016). Programming Pig. Daniel Dai (Second ed.). Sebastopol, CA. ISBN 978-1-4919-3706-8. OCLC 964523786.
  9. Gates, Alan (2021-07-27). "Pig mascot questions". Pig User Mailing List (Mailing list). Archived from the original on 1 August 2021. Retrieved 1 August 2021.
  10. "Communications of the ACM: MapReduce and Parallel DBMSs: Friends or Foes?" (PDF). Archived from the original (PDF) on July 1, 2015. Retrieved May 23, 2015.
  11. "Yahoo Pig Development Team: Comparing Pig Latin and SQL for Constructing Data Processing Pipelines". Archived from the original on May 30, 2015. Retrieved May 23, 2015.
  12. "ACM SigMod 08: Pig Latin: A Not-So-Foreign Language for Data Processing" (PDF). Retrieved May 23, 2015.

External links

  • Official website
