Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Original author(s): Doug Cutting, Mike Cafarella
Developer(s): Apache Software Foundation
Stable release (1.x): 1.19 / 22 August 2022[1]
Stable release (2.x): 2.4 / 11 October 2019[1]
Written in: Java
Operating system: Cross-platform
Type: Web crawler
License: Apache License 2.0
Website: nutch.apache.org

Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
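
The plug-in idea can be illustrated with a minimal sketch. The Python snippet below is not Nutch's actual API (Nutch itself is written in Java); the `ParserRegistry` class, its `register` decorator, and the two toy parsers are hypothetical names chosen for illustration. It only shows the general pattern of dispatching content to a parser selected by media type.

```python
from typing import Callable, Dict

# Hypothetical illustration only: none of these names come from Nutch.
Parser = Callable[[bytes], str]


class ParserRegistry:
    """Maps a media (MIME) type to a parser plug-in."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Parser] = {}

    def register(self, media_type: str) -> Callable[[Parser], Parser]:
        """Decorator that registers a parser for one media type."""
        def decorator(func: Parser) -> Parser:
            self._parsers[media_type] = func
            return func
        return decorator

    def parse(self, media_type: str, content: bytes) -> str:
        """Dispatch content to the plug-in registered for its media type."""
        try:
            parser = self._parsers[media_type]
        except KeyError:
            raise ValueError(f"no parser plug-in for {media_type!r}") from None
        return parser(content)


registry = ParserRegistry()


@registry.register("text/plain")
def parse_plain(content: bytes) -> str:
    # Plain text needs no markup handling, only decoding.
    return content.decode("utf-8")


@registry.register("text/html")
def parse_html(content: bytes) -> str:
    # Toy HTML handling: drop everything between '<' and '>'.
    text, in_tag = [], False
    for ch in content.decode("utf-8"):
        if ch == "<":
            in_tag = True
        elif ch == ">":
            in_tag = False
        elif not in_tag:
            text.append(ch)
    return "".join(text)
```

In this sketch, supporting a new format means registering one more function; the dispatch code in `parse` stays unchanged.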

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

History

Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities were later spun out into their own subproject, called Hadoop.

In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[2]

In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl.[3]

While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.[citation needed]

Release history

Release | Date | Description
1.1 | 2010-06-06 | This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes and speedups (e.g., to Fetcher2) have also been included.
1.2 | 2010-10-24 | This release includes several improvements (addition of parse-html as a selectable parser again, configurable per-field indexing), new features (including timing information for all Tool classes and implementation of parser timeouts), and bug fixes (an NPE in distributed search, XML formatting issues in Document fields).
1.3 | 2011-06-07 | This release includes several improvements: improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification, and a source release tarball that is an order of magnitude smaller (only about 2 MB).
1.4 | 2011-11-26 | This release includes several improvements, including allowing Parsers to declare support for multiple MIME types, configurable Fetcher Queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing.
1.5 | 2012-06-07 | This release includes upgrades of several major components, including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements, and a number of new plugins covering blacklisting, filtering and parsing, among others.
2.0 | 2012-07-07 | This release offers users an edition focused on large-scale crawling, which builds on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in-memory data store and various high-profile SQL stores.
1.5.1 | 2012-07-10 | This is a maintenance release of the 1.5.X mainstream version of Nutch, which has been widely adopted within the community.
2.1 | 2012-10-05 | This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive. As well as addressing about 20 bugs, this release offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the option to build indexes in Elasticsearch.
1.6 | 2012-12-06 | This release includes over 20 bug fixes and as many improvements, as well as new functionality including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME type, and functional enhancements to the Indexer API, including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8.
2.2 | 2013-06-08 | This release includes over 30 bug fixes and over 25 improvements and is the third release of the Nutch 2.x series. It features the inclusion of Crawler-Commons, which Nutch now uses for improved robots.txt parsing, and library upgrades to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2 and Automaton 1.11-8.
1.7 | 2013-06-24 | This release includes over 20 bug fixes and as many improvements, most notably a new pluggable indexing architecture which currently supports Apache Solr and Elasticsearch. Following the recent Nutch 2.2 release, parsing of robots.txt is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3.
2.2.1 | 2013-07-02 | This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3. It is predominantly a bug fix for NUTCH-1591 (incorrect conversion of ByteBuffer to String).
1.8 | 2014-03-17 | In addition to library upgrades to Crawler Commons 0.3 and Apache Tika 1.5, this release provides over 30 bug fixes as well as 18 improvements.
2.3 | 2015-01-22 | The Nutch 2.3 release comes packaged with a self-contained Apache Wicket-based web application. The SQL backend for Gora has been deprecated.[4]
1.10 | 2015-05-06 | This release includes library upgrades to Tika 1.6 and provides over 46 bug fixes as well as 37 improvements and 12 new features.[5]
1.11 | 2015-12-07 | This release includes library upgrades to Hadoop 2.X and Tika 1.11 and provides over 32 bug fixes as well as 35 improvements and 14 new features.[6]
2.3.1 | 2016-01-21 | This bug fix release addresses around 40 issues.
1.12 | 2016-06-18 |
1.13 | 2017-04-02 |
1.14 | 2017-12-23 |
1.15 | 2018-08-09 |
1.16 | 2019-10-11 |
2.4 | 2019-10-11 | Expected to be the last release in the 2.X series, as "no committer is actively working on it".[7]
1.17 | 2020-07-02 |
1.18 | 2021-01-24 |

Scalability

IBM Research studied the performance[8] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[9] It found that a scale-out system such as Nutch/Lucene could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[10]
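
To put that rate in perspective, the following Python sketch converts it into hourly and daily totals. It assumes the average of 755.31 documents per second is sustained without interruption, which a real crawl would not do exactly. The helper name `documents_fetched` is hypothetical.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Average crawl rate reported for the ClueWeb09 crawl (documents per second).
DOCS_PER_SECOND = 755.31


def documents_fetched(rate_per_second: float, seconds: int) -> int:
    """Documents fetched in the given time at a constant average rate."""
    return round(rate_per_second * seconds)


docs_per_hour = documents_fetched(DOCS_PER_SECOND, SECONDS_PER_HOUR)
docs_per_day = documents_fetched(DOCS_PER_SECOND, SECONDS_PER_DAY)
print(f"{docs_per_hour:,} documents per hour")
print(f"{docs_per_day:,} documents per day")
```

Under that assumption, the rate corresponds to roughly 2.7 million documents per hour and about 65 million documents per day.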

Related projects

  • Hadoop – Java framework that supports distributed applications running on large clusters.

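Search engines built with Nutch

  • Common Crawl – publicly available internet-wide crawls; started using Nutch in 2014.[3]
  • Creative Commons Search – an implementation of Nutch, used in the period 2004–2006.[11][12][13]
  • DiscoverEd – open educational resources search prototype developed by Creative Commons.
  • Krugle – uses Nutch to crawl web pages for code archives and technically interesting content.
  • mozDex (inactive)
  • Wikia Search – launched 2008, closed down 2009.[14][15]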

References

  1. "Apache Nutch™ - Downloads". Retrieved 27 September 2022.
  2. "Apache Nutch". nutch.apache.org.
  3. "Common Crawl's Move to Nutch – Common Crawl – Blog". blog.commoncrawl.org. Retrieved 2015-10-14.
  4. "Nutch 2.3 Release". Apache Nutch News. The Apache Software Foundation. 22 January 2015. Retrieved 18 January 2016.
  5. "Nutch 1.10 Release Notes". ASF JIRA. The Apache Software Foundation. 6 May 2015. Retrieved 18 January 2016.
  6. "Nutch 1.11 Release Notes". ASF JIRA. The Apache Software Foundation. 7 December 2015. Retrieved 18 January 2016.
  7. "Nutch 2.4 Release". Apache Nutch News. The Apache Software Foundation. 11 October 2019. Retrieved 20 May 2022.
  8. "Scalability of the Nutch search engine" (PDF).
  9. "Base Operating System Provisioning and Bringup for a Commercial Supercomputer" (PDF). Archived from the original (PDF) on December 3, 2008.
  10. "The Sapphire Web Crawler - Crawl Statistics". Boston.lti.cs.cmu.edu. 2008-10-01. Retrieved 2013-07-21.
  11. "Our Updated Search". Creative Commons. 2004-09-03.
  12. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. Archived from the original on 2010-01-07.
  13. "New CC search UI". Creative Commons. 2006-08-02.
  14. "Where can I get the source code for Wikia Search?". Archived from the original on 2011-11-04. Retrieved 2010-02-12.
  15. "Update on Wikia – doing more of what's working | Jimmy Wales". 31 March 2009.

Bibliography

  • Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. p. 350. ISBN 978-1-59059-687-6. Archived from the original on December 2, 2009. Retrieved August 15, 2009.

External links

  • Official website: nutch.apache.org
