
Metasearch engine

A metasearch engine (or search aggregator) is an online information retrieval tool that uses the data of one or more web search engines to produce its own results.[1][2] Metasearch engines take input from a user and immediately query search engines[3] for results. Sufficient data is gathered, ranked, and presented to the user.

Architecture of a metasearch engine

Problems such as spamming reduce the accuracy and precision of results.[4] The process of fusion aims to improve the engineering of a metasearch engine.[5]

Examples of metasearch engines include Skyscanner and Kayak.com, which aggregate search results of online travel agencies and provider websites, and Searx, a free and open-source search engine that aggregates results from internet search engines.

History

The first person to incorporate the idea of meta searching was Daniel Dreilinger of Colorado State University. He developed SearchSavvy, which let users search up to 20 different search engines and directories at once. Although fast, the search engine was restricted to simple searches and thus was not reliable. University of Washington student Eric Selberg released a more "updated" version called MetaCrawler. This search engine improved on SearchSavvy's accuracy by adding its own search syntax behind the scenes and matching that syntax to that of the search engines it was probing. MetaCrawler reduced the number of search engines queried to six, but although it produced more accurate results, it still was not considered as accurate as searching a query in an individual engine.[6]

On May 20, 1996, HotBot, then owned by Wired, was a search engine with search results coming from the Inktomi and Direct Hit databases. It was known for its fast results and as a search engine with the ability to search within search results. Upon being bought by Lycos in 1998, development of the search engine stagnated and its market share fell drastically. After going through a few alterations, HotBot was redesigned into a simplified search interface, with its features being incorporated into Lycos' website redesign.[7]

A metasearch engine called Anvish was developed by Bo Shu and Subhash Kak in 1999; the search results were sorted using instantaneously trained neural networks.[8] This was later incorporated into another metasearch engine called Solosearch.[9]

In August 2000, India got its first meta search engine when HumHaiIndia.com was launched.[10] It was developed by the then 16-year-old Sumeet Lamba.[11] The website was later rebranded as Tazaa.com.[12]

Ixquick is a search engine known for its privacy policy statement. Developed and launched in 1998 by David Bodnick, it is owned by Surfboard Holding BV. In June 2006, Ixquick began to delete private details of its users, following the same process as Scroogle. Ixquick's privacy policy includes no recording of users' IP addresses, no identifying cookies, no collection of personal data, and no sharing of personal data with third parties.[13] It also uses a unique ranking system in which a result is ranked by stars: the more stars a result has, the more search engines agreed on it.

In April 2005, Dogpile, then owned and operated by InfoSpace, Inc., collaborated with researchers from the University of Pittsburgh and Pennsylvania State University to measure the overlap and ranking differences of leading Web search engines in order to gauge the benefits of using a metasearch engine to search the web. Results found that from 10,316 random user-defined queries from Google, Yahoo!, and Ask Jeeves, only 3.2% of first page search results were the same across those search engines for a given query. Another study later that year using 12,570 random user-defined queries from Google, Yahoo!, MSN Search, and Ask Jeeves found that only 1.1% of first page search results were the same across those search engines for a given query.[14]

Advantages

By sending multiple queries to several other search engines, a metasearch engine extends the coverage of the topic and allows more information to be found. Metasearch engines use the indexes built by other search engines, aggregating and often post-processing results in unique ways. A metasearch engine has an advantage over a single search engine because more results can be retrieved with the same amount of effort.[2] It also reduces the work of users, who no longer have to individually type in searches at different engines to look for resources.[2]

Metasearching is also a useful approach if the purpose of the user's search is to get an overview of the topic or to get quick answers. Instead of having to go through multiple search engines like Yahoo! or Google and comparing results, metasearch engines are able to quickly compile and combine results. They can do it either by listing results from each engine queried with no additional post-processing (Dogpile) or by analyzing the results and ranking them by their own rules (IxQuick, Metacrawler, and Vivismo).

A metasearch engine can also hide the searcher's IP address from the search engines queried, thus providing privacy for the search.

Disadvantages

Metasearch engines are not capable of parsing query forms or able to fully translate query syntax. The number of hyperlinks generated by metasearch engines is limited, and they therefore do not provide the user with the complete results of a query.[15]

The majority of metasearch engines do not provide over ten linked files from a single search engine, and generally do not interact with larger search engines for results. Pay per click links are prioritised and are normally displayed first.[16]

Metasearching also gives the illusion that there is more coverage of the topic queried, particularly if the user is searching for popular or commonplace information. It is common to end up with multiple identical results from the queried engines. It is also harder for users to search with advanced search syntax to be sent with the query, so results may not be as precise as when a user uses the advanced search interface of a specific engine. As a result, many metasearch engines use simple searching.[17]

Operation

A metasearch engine accepts a single search request from the user. This search request is then passed on to other search engines' databases. A metasearch engine does not create its own database of web pages but instead generates a federated database system, integrating data from multiple sources.[18][19][20]

Since every search engine is unique and has different algorithms for generating ranked data, duplicates will also be generated. To remove duplicates, a metasearch engine processes this data and applies its own algorithm. A revised list is produced as an output for the user.[citation needed] When a metasearch engine contacts other search engines, these search engines can respond in one of three ways:

  • They will cooperate and provide complete access to the interface for the metasearch engine, including private access to the index database, and will inform the metasearch engine of any changes made to the index database;
  • They will behave in a non-cooperative manner, neither denying nor providing any access to their interface;
  • They will be completely hostile, refusing the metasearch engine any access to their database and, in serious circumstances, seeking legal remedies.[21]
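The fan-out-and-merge flow described above can be sketched as follows. This is a minimal illustration, not a real implementation: the engine functions and URLs are invented stand-ins for HTTP calls to actual search engine APIs, and results are interleaved round-robin with duplicate URLs removed.

```python
# Hypothetical stand-ins for querying two search engines; a real
# metasearch engine would issue HTTP requests to each engine's API.
def engine_a(query):
    return ["https://example.com/a", "https://example.com/b"]

def engine_b(query):
    return ["https://example.com/b", "https://example.com/c"]

def metasearch(query, engines):
    """Fan the query out to every engine, then merge and deduplicate."""
    result_lists = [engine(query) for engine in engines]
    seen = set()
    merged = []
    # Interleave results rank by rank so no single engine dominates the top.
    for rank in range(max(len(r) for r in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

results = metasearch("open source", [engine_a, engine_b])
# results == ['https://example.com/a', 'https://example.com/b',
#             'https://example.com/c']  (the duplicate of /b is dropped)
```

Real systems replace the round-robin interleave with the ranking and fusion algorithms described in the following sections.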

Architecture of ranking

Web pages that are highly ranked on many search engines are likely to be more relevant in providing useful information.[21] However, all search engines have different ranking scores for each website, and most of the time these scores are not the same. This is because search engines prioritise different criteria and methods for scoring; hence a website might appear highly ranked on one search engine and lowly ranked on another. This is a problem because metasearch engines rely heavily on the consistency of this data to generate reliable result lists.[21]

Fusion

Data Fusion Model

A metasearch engine uses the process of fusion to filter data for more efficient results. The two main fusion methods used are Collection Fusion and Data Fusion.

  • Collection Fusion: also known as distributed retrieval, deals specifically with search engines that index unrelated data. To determine how valuable these sources are, Collection Fusion looks at the content and then ranks the data on how likely it is to provide relevant information in relation to the query. From what is generated, Collection Fusion is able to pick out the best resources from the rank. These chosen resources are then merged into a list.[21]
  • Data Fusion: deals with information retrieved from search engines that index common data sets. The process is very similar. The initial rank scores of data are merged into a single list, after which the original ranks of each of these documents are analysed. Data with high scores indicate a high level of relevancy to a particular query and are therefore selected. To produce a list, the scores must be normalized using algorithms such as CombSum, because search engines adopt different scoring policies, which makes their raw scores incomparable.[22][23]
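The data fusion step can be illustrated with a small CombSUM-style sketch. The score dictionaries below are invented for illustration, and min-max normalization is used here as one common choice for making raw engine scores comparable before summing; the article's sources do not prescribe a specific normalization.

```python
def min_max_normalize(scores):
    """Rescale one engine's raw scores to [0, 1] so engines become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # degenerate case: all scores equal
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(ranked_lists):
    """CombSUM: sum each document's normalized scores across all engines."""
    fused = {}
    for scores in ranked_lists:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    # A higher fused score means stronger agreement across engines.
    return sorted(fused, key=fused.get, reverse=True)

# Invented example scores: note the raw scales differ wildly between engines.
engine_a = {"doc1": 0.9, "doc2": 0.5, "doc3": 0.1}
engine_b = {"doc2": 12.0, "doc3": 8.0, "doc4": 2.0}
ranking = comb_sum([engine_a, engine_b])
# ranking == ['doc2', 'doc1', 'doc3', 'doc4']:
# doc2 appears with a decent score in both engines, so it wins.
```

Documents returned by several engines accumulate score from each list, which is why cross-engine agreement pushes a result upward, mirroring the star-based agreement ranking mentioned earlier for Ixquick.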

Spamdexing

Spamdexing is the deliberate manipulation of search engine indexes. It uses a number of methods to manipulate the relevance or prominence of indexed resources in a manner unaligned with the intention of the indexing system. Spamdexing can be very distressing for users and problematic for search engines because the returned contents of searches have poor precision.[citation needed] This will eventually result in the search engine becoming unreliable for the user. To tackle spamdexing, search robot algorithms are made more complex and are changed almost daily to eliminate the problem.[24]

It is a major problem for metasearch engines because it tampers with the Web crawler's indexing criteria, which are heavily relied upon to format ranking lists. Spamdexing manipulates the natural ranking system of a search engine, and places websites higher on the ranking list than they would naturally be placed.[25] There are three primary methods used to achieve this:

Content spam

Content spam comprises techniques that alter the logical view a search engine has of a page's contents. Techniques include:

  • Keyword Stuffing - Calculated placements of keywords within a page to raise the keyword count, variety, and density of the page
  • Hidden/Invisible Text - Unrelated text disguised by making it the same color as the background, using a tiny font size, or hiding it within the HTML code
  • Meta-tag Stuffing - Repeating keywords in meta tags and/or using keywords unrelated to the site's content
  • Doorway Pages - Low quality webpages with little content, but relatable keywords or phrases
  • Scraper Sites - Programs that allow websites to copy content from other websites and create content for a website
  • Article Spinning - Rewriting existing articles as opposed to copying content from other sites
  • Machine Translation - Uses machine translation to rewrite content in several different languages, resulting in illegible text

Link spam

Link spam consists of links between pages that are present for reasons other than merit. Techniques include:

  • Link-building Software - Automating the search engine optimization (SEO) process
  • Link Farms - Pages that reference each other (also known as mutual admiration societies)
  • Hidden Links - Placing hyperlinks where visitors won't or can't see them
  • Sybil Attack - Forging of multiple identities for malicious intent
  • Spam Blogs - Blogs created solely for commercial promotion and the passage of link authority to target sites
  • Page Hijacking - Creating a copy of a popular website with similar content, but redirects web surfers to unrelated or even malicious websites
  • Buying Expired Domains - Buying expiring domains and replacing pages with links to unrelated websites
  • Cookie Stuffing - Placing an affiliate tracking cookie on a website visitor's computer without their knowledge
  • Forum Spam - Websites that can be edited by users to insert links to spam sites

Cloaking

This is an SEO technique in which different materials and information are sent to the web crawler and to the web browser.[26] It is commonly used as a spamdexing technique because it can trick search engines into either visiting a site that is substantially different from the search engine description or giving a certain site a higher ranking.

See also

  • Federated search
  • List of metasearch engines
  • Metabrowsing
  • Multisearch
  • Search aggregator
  • Search engine optimization

References

  1. ^ Berger, Sandy (2005). "Sandy Berger's Great Age Guide to the Internet" (Document). Que Publishing. ISBN 0-7897-3442-7
  2. ^ a b c "Architecture of a Metasearch Engine that Supports User Information Needs". 1999.
  3. ^ Ride, Onion (2021). "How search Engine work". onionride.
  4. ^ Lawrence, Stephen R.; Lee Giles, C. (October 10, 1997). "Patent US6999959 - Meta search engine" – via Google Books.
  5. ^ Voorhees, Ellen M.; Gupta, Narendra; Johnson-Laird, Ben (April 2000). "The collection fusion problem".
  6. ^ "The Meta-search Search Engine History". Archived from the original on 2020-01-30. Retrieved 2014-12-02.
  7. ^ "Search engine rankings on HotBot: a brief history of the HotBot search engine".
  8. ^ Shu, Bo; Kak, Subhash (1999). "A neural network based intelligent metasearch engine". Information Sciences. 120 (4): 1–11. CiteSeerX 10.1.1.84.6837. doi:10.1016/S0020-0255(99)00062-6.
  9. ^ Kak, Subhash (November 1999). "Better Web searches and prediction with instantaneously trained neural networks" (PDF). IEEE Intelligent Systems.
  10. ^ "New kid in town". India Today. Retrieved 2024-03-14.
  11. ^ "What is Metasearch Engine?". GeeksforGeeks. 2020-08-01. Retrieved 2024-03-14.
  12. ^ "www.metaseek.nl". www.metaseek.nl. Retrieved 2024-03-14.
  13. ^ "ABOUT US – Our history".
  14. ^ Spink, Amanda; Jansen, Bernard J.; Kathuria, Vinish; Koshman, Sherry (2006). "Overlap among major web search engines" (PDF). Emerald.
  15. ^ "Department of Informatics". University of Fribourg.
  16. ^ "Intelligence Exploitation of the Internet" (PDF). 2002.
  17. ^ Hennegar, Anne (16 September 2009). "Metasearch Engines Expands your Horizon".
  18. ^ Meng, Weiyi (May 5, 2008). "Metasearch Engines" (PDF).
  19. ^ Selberg, Erik; Etzioni, Oren (1997). "The MetaCrawler architecture for resource aggregation on the Web". IEEE expert. pp. 11–14.
  20. ^ Manoj, M; Jacob, Elizabeth (July 2013). "Design and Development of a Programmable Meta Search Engine" (PDF). Foundation of Computer Science. pp. 6–11.
  21. ^ a b c d Manoj, M.; Jacob, Elizabeth (October 2008). "Information retrieval on Internet using meta-search engines: A review" (PDF). Council of Scientific and Industrial Research.
  22. ^ Wu, Shengli; Crestani, Fabio; Bi, Yaxin (2006). "Evaluating Score Normalization Methods in Data Fusion". Information Retrieval Technology. Lecture Notes in Computer Science. Vol. 4182. pp. 642–648. CiteSeerX 10.1.1.103.295. doi:10.1007/11880592_57. ISBN 978-3-540-45780-0.
  23. ^ Manmatha, R.; Sever, H. (2014). "A Formal Approach to Score Normalization for Meta-search" (PDF). Archived from the original (PDF) on 2019-09-30. Retrieved 2014-10-27.
  24. ^ Najork, Marc (2014). "Web Spam Detection". Microsoft.
  25. ^ Vandendriessche, Gerrit (February 2009). "A few legal comments on spamdexing".
  26. ^ Wang, Yi-Min; Ma, Ming; Niu, Yuan; Chen, Hao (May 8, 2007). "Connecting Web Spammers with Advertisers" (PDF).
