fbpx
Wikipedia

Boolean model of information retrieval

The (standard) Boolean model of information retrieval (BIR)[1] is a classical information retrieval (IR) model and, at the same time, the first and most-adopted one.[2] The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms (a bag-of-words model). Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the boolean conditions described by the query.

Definitions edit

An index term is a word or expression, which may be stemmed, describing or characterizing a document, such as a keyword given for a journal article. Let

 
be the set of all such index terms.

A document is any subset of  . Let

 
be the set of all documents.

A query is a Boolean expression   in normal form:

 
where   is true for   when  . (Equivalently,   could be expressed in disjunctive normal form.)

We seek to find the set of documents that satisfy  . This operation is called retrieval and consists of the following two steps:

1. For each   in  , find the set   of documents that satisfy  :
 
2. Then the set of documents that satisfy Q is given by:
 

Example edit

Let the set of original (real) documents be, for example

 

where

  = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."

  = "Bayesian decision theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility. If one had unlimited time and calculating power with which to make every decision, this procedure would be the best way to make any decision."

  = "Bayesian epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures. A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."

Let the set   of terms be:

 

Then, the set   of documents is as follows:

 

where

 

Let the query   be:

 
Then to retrieve the relevant documents:
  1. Firstly, the following sets   and   of documents   are obtained (retrieved):
     
  2. Finally, the following documents   are retrieved in response to  
     

This means that the original document   (corresponding to  ) is the answer to  .

Obviously, if there is more than one document with the same representation, every such document is retrieved. Such documents are indistinguishable in the BIR (in other words, equivalent).

Advantages edit

  • Clean formalism
  • Easy to implement
  • Intuitive concept
  • If the resulting document set is either too small or too big, it is directly clear which operators will produce respectively a bigger or smaller set.
  • It gives (expert) users a sense of control over the system. It is immediately clear why a document has been retrieved given a query.

Disadvantages edit

  • Exact matching may retrieve too few or too many documents
  • Hard to translate a query into a Boolean expression
  • All terms are equally weighted
  • More like data retrieval than information retrieval
  • Retrieval based on binary decision criteria with no notion of partial matching
  • No ranking of the documents is provided (absence of a grading scale)
  • Information need has to be translated into a Boolean expression, which most users find awkward
  • The Boolean queries formulated by the users are most often too simplistic
  • The model frequently returns either too few or too many documents in response to a user query

Data structures and algorithms edit

From a pure formal mathematical point of view, the BIR is straightforward. From a practical point of view, however, several further problems should be solved that relate to algorithms and data structures, such as, for example, the choice of terms (manual or automatic selection or both), stemming, hash tables, inverted file structure, and so on.[3]

Hash sets edit

Another possibility is to use hash sets. Each document is represented by a hash table which contains every single term of that document. Since hash table size increases and decreases in real time with the addition and removal of terms, each document will occupy much less space in memory. However, it will have a slowdown in performance because the operations are more complex than with bit vectors. On the worst-case performance can degrade from O(n) to O(n2). On the average case, the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient.

Signature file edit

Each document can be summarized by Bloom filter representing the set of words in that document, stored in a fixed-length bitstring, called a signature. The signature file contains one such superimposed code bitstring for every document in the collection. Each query can also be summarized by a Bloom filter representing the set of words in the query, stored in a bitstring of the same fixed length. The query bitstring is tested against each signature.[4][5][6]

The signature file approached is used in BitFunnel.

Inverted file edit

An inverted index file contains two parts: a vocabulary containing all the terms used in the collection, and for each distinct term an inverted index that lists every document that mentions that term.[4][5]

References edit

  1. ^ Lancaster, F.W.; Fayen, E.G. (1973), Information Retrieval On-Line, Melville Publishing Co., Los Angeles, California
  2. ^ "Information Retrieval". MIT Press. Retrieved 2023-12-09.
  3. ^ Wartik, Steven (1992). "Boolean operations". . Prentice-Hall, Inc. ISBN 0-13-463837-9. Archived from the original on 2013-09-28.
  4. ^ a b Justin Zobel; Alistair Moffat; and Kotagiri Ramamohanarao. "Inverted Files Versus Signature Files for Text Indexing".
  5. ^ a b Bob Goodwin; et al. "BitFunnel: Revisiting Signatures for Search". 2017.
  6. ^ Richard Startin. "Bit-Sliced Signatures and Bloom Filters".
  • Lashkari, A.H.; Mahdavi, F.; Ghomi, V. (2009), "A Boolean Model in Information Retrieval for Search Engines", 2009 International Conference on Information Management and Engineering, pp. 385–389, doi:10.1109/ICIME.2009.101, ISBN 978-0-7695-3595-1, S2CID 18147603

boolean, model, information, retrieval, this, article, technical, most, readers, understand, please, help, improve, make, understandable, experts, without, removing, technical, details, june, 2018, learn, when, remove, this, template, message, standard, classi. This article may be too technical for most readers to understand Please help improve it to make it understandable to non experts without removing the technical details June 2018 Learn how and when to remove this template message The standard Boolean model of information retrieval BIR 1 is a classical information retrieval IR model and at the same time the first and most adopted one 2 The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user s query are conceived as sets of terms a bag of words model Retrieval is based on whether or not the documents contain the query terms and whether they satisfy the boolean conditions described by the query Contents 1 Definitions 2 Example 3 Advantages 4 Disadvantages 5 Data structures and algorithms 5 1 Hash sets 5 2 Signature file 5 3 Inverted file 6 ReferencesDefinitions editAn index term is a word or expression which may be stemmed describing or characterizing a document such as a keyword given for a journal article LetT t1 t2 tm displaystyle T t 1 t 2 ldots t m nbsp be the set of all such index terms A document is any subset of T displaystyle T nbsp LetD D1 Dn displaystyle D D 1 ldots D n nbsp be the set of all documents A query is a Boolean expression Q textstyle Q nbsp in normal form Q W1 W2 Wi Wi 1 displaystyle Q W 1 lor W 2 lor cdots land cdots land W i lor W i 1 lor cdots nbsp where Wi textstyle W i nbsp is true for Dj displaystyle D j nbsp when ti Dj displaystyle t i in D j nbsp Equivalently Q textstyle Q nbsp could be expressed in disjunctive normal form We seek to find the set of documents that satisfy Q textstyle Q nbsp This operation is called retrieval and consists of the following two steps 1 For each Wj textstyle W j nbsp in Q textstyle Q nbsp find the set Sj textstyle S j nbsp of documents that satisfy Wj textstyle W j nbsp Sj Di Wj displaystyle S j D i mid W j nbsp 2 Then the set of documents that satisfy Q is given by S1 S2 Si Si 1 displaystyle S 1 cup S 2 cup cdots cap cdots cap S i cup S i 1 cup cdots nbsp Example editLet the set of original real documents be for example O O1 O2 O3 displaystyle O O 1 O 2 O 3 nbsp whereO1 textstyle O 1 nbsp Bayes principle The principle that in estimating a parameter one should initially assume that each possible value has equal probability a uniform prior distribution O2 textstyle O 2 nbsp Bayesian decision theory A mathematical theory of decision making which presumes utility and probability functions and according to which the act to be chosen is the Bayes act i e the one with highest subjective expected utility If one had unlimited time and calculating power with which to make every decision this procedure would be the best way to make any decision O3 textstyle O 3 nbsp Bayesian epistemology A philosophical theory which holds that the epistemic status of a proposition i e how well proven or well established it is is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures A Bayesian epistemologist would use probability to define and explore the relationship between concepts such as epistemic status support or explanatory power Let the set T textstyle T nbsp of terms be T t1 Bayes principle t2 probability t3 decision making t4 Bayesian epistemology displaystyle T t 1 text Bayes principle t 2 text probability t 3 text decision making t 4 text Bayesian epistemology nbsp Then the set D textstyle D nbsp of documents is as follows D D1 D2 D3 displaystyle D D 1 D 2 D 3 nbsp whereD1 probability Bayes principle D2 probability decision making D3 probability Bayesian epistemology displaystyle begin aligned D 1 amp text probability text Bayes principle D 2 amp text probability text decision making D 3 amp text probability text Bayesian epistemology end aligned nbsp Let the query Q textstyle Q nbsp be Q probability decision making displaystyle Q text probability land text decision making nbsp Then to retrieve the relevant documents Firstly the following sets S1 textstyle S 1 nbsp and S2 textstyle S 2 nbsp of documents Di textstyle D i nbsp are obtained retrieved S1 D1 D2 D3 S2 D2 displaystyle begin aligned S 1 amp D 1 D 2 D 3 S 2 amp D 2 end aligned nbsp Finally the following documents Di textstyle D i nbsp are retrieved in response to Q textstyle Q nbsp Q D1 D2 D3 D2 D2 displaystyle Q D 1 D 2 D 3 cap D 2 D 2 nbsp This means that the original document O2 textstyle O 2 nbsp corresponding to D2 textstyle D 2 nbsp is the answer to Q textstyle Q nbsp Obviously if there is more than one document with the same representation every such document is retrieved Such documents are indistinguishable in the BIR in other words equivalent Advantages editClean formalism Easy to implement Intuitive concept If the resulting document set is either too small or too big it is directly clear which operators will produce respectively a bigger or smaller set It gives expert users a sense of control over the system It is immediately clear why a document has been retrieved given a query Disadvantages editExact matching may retrieve too few or too many documents Hard to translate a query into a Boolean expression All terms are equally weighted More like data retrieval than information retrieval Retrieval based on binary decision criteria with no notion of partial matching No ranking of the documents is provided absence of a grading scale Information need has to be translated into a Boolean expression which most users find awkward The Boolean queries formulated by the users are most often too simplistic The model frequently returns either too few or too many documents in response to a user queryData structures and algorithms editFrom a pure formal mathematical point of view the BIR is straightforward From a practical point of view however several further problems should be solved that relate to algorithms and data structures such as for example the choice of terms manual or automatic selection or both stemming hash tables inverted file structure and so on 3 Hash sets edit Main article feature hashing Another possibility is to use hash sets Each document is represented by a hash table which contains every single term of that document Since hash table size increases and decreases in real time with the addition and removal of terms each document will occupy much less space in memory However it will have a slowdown in performance because the operations are more complex than with bit vectors On the worst case performance can degrade from O n to O n2 On the average case the performance slowdown will not be that much worse than bit vectors and the space usage is much more efficient Signature file edit Each document can be summarized by Bloom filter representing the set of words in that document stored in a fixed length bitstring called a signature The signature file contains one such superimposed code bitstring for every document in the collection Each query can also be summarized by a Bloom filter representing the set of words in the query stored in a bitstring of the same fixed length The query bitstring is tested against each signature 4 5 6 The signature file approached is used in BitFunnel Inverted file edit Main article inverted index An inverted index file contains two parts a vocabulary containing all the terms used in the collection and for each distinct term an inverted index that lists every document that mentions that term 4 5 References edit Lancaster F W Fayen E G 1973 Information Retrieval On Line Melville Publishing Co Los Angeles California Information Retrieval MIT Press Retrieved 2023 12 09 Wartik Steven 1992 Boolean operations Information Retrieval Data Structures amp Algorithms Prentice Hall Inc ISBN 0 13 463837 9 Archived from the original on 2013 09 28 a b Justin Zobel Alistair Moffat and Kotagiri Ramamohanarao Inverted Files Versus Signature Files for Text Indexing a b Bob Goodwin et al BitFunnel Revisiting Signatures for Search 2017 Richard Startin Bit Sliced Signatures and Bloom Filters Lashkari A H Mahdavi F Ghomi V 2009 A Boolean Model in Information Retrieval for Search Engines 2009 International Conference on Information Management and Engineering pp 385 389 doi 10 1109 ICIME 2009 101 ISBN 978 0 7695 3595 1 S2CID 18147603 Retrieved from https en wikipedia org w index php title Boolean model of information retrieval amp oldid 1189009264, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.