fbpx
Wikipedia

Object categorization from image search

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

Traditionally, classifiers are trained using sets of images that are labeled by hand. Collecting such a set of images is often a very time-consuming and laborious process. The use of Internet search engines to automate the process of acquiring large sets of labeled images has been described as a potential way of greatly facilitating computer vision research.[1]

Challenges edit

Unrelated images edit

One problem with using Internet image search results as a training set for a classifier is the high percentage of unrelated images within the results. It has been estimated that, when a search engine such as Google images is queried with the name of an object category (such as airplane), up to 85% of the returned images are unrelated to the category.[1]

Intra-class variability edit

Another challenge posed by using Internet image search results as training sets for classifiers is that there is a high amount of variability within object categories, when compared with categories found in hand-labeled datasets such as Caltech 101 and Pascal. Images of objects can vary widely in a number of important factors, such as scale, pose, lighting, number of objects, and amount of occlusion.

pLSA approach edit

In a 2005 paper by Fergus et al.,[1] pLSA (probabilistic latent semantic analysis) and extensions of this model were applied to the problem of object categorization from image search. pLSA was originally developed for document classification, but has since been applied to computer vision. It makes the assumption that images are documents that fit the bag of words model.

Model edit

Just as text documents are made up of words, each of which may be repeated within the document and across documents, images can be modeled as combinations of visual words. Just as the entire set of text words are defined by a dictionary, the entire set of visual words is defined in a codeword dictionary.

pLSA divides documents into topics as well. Just as knowing the topic(s) of an article allows you to make good guesses about the kinds of words that will appear in it, the distribution of words in an image is dependent on the underlying topics. The pLSA model tells us the probability of seeing each word   given the category   in terms of topics  :

 

An important assumption made in this model is that   and   are conditionally independent given  . Given a topic, the probability of a certain word appearing as part of that topic is independent of the rest of the image.[2]

Training this model involves finding   and   that maximizes the likelihood of the observed words in each document. To do this, the expectation maximization algorithm is used, with the following objective function:

 

Application edit

ABS-pLSA edit

Absolute position pLSA (ABS-pLSA) attaches location information to each visual word by localizing it to one of X 揵ins?in the image. Here,   represents which of the bins the visual word falls into. The new equation is:

 

  and   can be solved for in a manner similar to the original pLSA problem, using the EM algorithm

A problem with this model is that it is not translation or scale invariant. Since the positions of the visual words are absolute, changing the size of the object in the image or moving it would have a significant impact on the spatial distribution of the visual words into different bins.

TSI-pLSA edit

Translation and scale invariant pLSA (TSI-pLSA). This model extends pLSA by adding another latent variable, which describes the spatial location of the target object in an image. Now, the position   of a visual word is given relative to this object location, rather than as an absolute position in the image. The new equation is:

 

Again, the parameters   and   can be solved using the EM algorithm.   can be assumed to be a uniform distribution.

Implementation edit

Selecting words edit

Words in an image were selected using 4 different feature detectors:[1]

Using these 4 detectors, approximately 700 features were detected per image. These features were then encoded as Scale-invariant feature transform descriptors, and vector quantized to match one of 350 words contained in a codebook. The codebook was precomputed from features extracted from a large number of images spanning numerous object categories.

Possible object locations edit

One important question in the TSI-pLSA model is how to determine the values that the random variable   can take on. It is a 4-vector, whose components describe the object抯 centroid as well as x and y scales that define a bounding box around the object, so the space of possible values it can take on is enormous. To limit the number of possible object locations to a reasonable number, normal pLSA is first carried out on the set of images, and for each topic a Gaussian mixture model is fit over the visual words, weighted by  . Up to   Gaussians are tried (allowing for multiple instances of an object in a single image), where   is a constant.

Performance edit

The authors of the Fergus et al. paper compared performance of the three pLSA algorithms (pLSA, ABS-pLSA, and TSI-pLSA) on handpicked datasets and images returned from Google searches. Performance was measured as the error rate when classifying images in a test set as either containing the image or containing only background.

As expected, training directly on Google data gives higher error rates than training on prepared data.?[1] In about half of the object categories tested do ABS-pLSA and TSI-pLSA perform significantly better than regular pLSA, and in only 2 categories out of 7 does TSI-pLSA perform better than the other two models.

OPTIMOL edit

OPTIMOL (automatic Online Picture collection via Incremental MOdel Learning) approaches the problem of learning object categories from online image searches by addressing model learning and searching simultaneously. OPTIMOL is an iterative model that updates its model of the target object category while concurrently retrieving more relevant images.[3]

General framework edit

OPTIMOL was presented as a general iterative framework that is independent of the specific model used for category learning. The algorithm is as follows:

  • Download a large set of images from the Internet by searching for a keyword
  • Initialize the dataset with seed images
  • While more images needed in the dataset:
    • Learn the model with most recently added dataset images
    • Classify downloaded images using the updated model
    • Add accepted images to the dataset

Note that only the most recently added images are used in each round of learning. This allows the algorithm to run on an arbitrarily large number of input images.

Model edit

The two categories (target object and background) are modeled as Hierarchical Dirichlet processes (HDPs). As in the pLSA approach, it is assumed that the images can be described with the bag of words model. HDP models the distributions of an unspecified number of topics across images in a category, and across categories. The distribution of topics among images in a single category is modeled as a Dirichlet process (a type of non-parametric probability distribution). To allow the sharing of topics across classes, each of these Dirichlet processes is modeled as a sample from another 損arent?Dirichlet process. HDP was first described by Teh et al. in 2005.[4]

Implementation edit

Initialization edit

The dataset must be initialized, or seeded with an original batch of images which serve as good exemplars of the object category to be learned. These can be gathered automatically, using the first page or so of images returned by the search engine (which tend to be better than the subsequent images). Alternatively, the initial images can be gathered by hand.

Model learning edit

To learn the various parameters of the HDP in an incremental manner, Gibbs sampling is used over the latent variables. It is carried out after each new set of images is incorporated into the dataset. Gibbs sampling involves repeatedly sampling from a set of random variables in order to approximate their distributions. Sampling involves generating a value for the random variable in question, based on the state of the other random variables on which it is dependent. Given sufficient samples, a reasonable approximation of the value can be achieved.

Classification edit

At each iteration,   and   can be obtained from model learned after the previous round of Gibbs sampling, where   is a topic,   is a category, and   is a single visual word. The likelihood of an image being in a certain class, then, is:

 

This is computed for each new candidate image per iteration. The image is classified as belonging to the category with the highest likelihood.

Addition to the dataset and "cache set" edit

In order to qualify for incorporation into the dataset, however, an image must satisfy a stronger condition:

 

Where   and   are foreground (object) and background categories, respectively, and the ratio of constants describes the risk of accepting false positives and false negatives. They are adjusted automatically at every iteration, with the cost of a false positive set higher than that of a false negative. This ensures that a better dataset is collected.

Once an image is accepted by meeting the above criterion and incorporated into the dataset, however, it needs to meet another criterion before it is incorporated into the 揷ache set敆the set of images to be used for training. This set is intended to be a diverse subset of the set of accepted images. If the model were trained on all accepted images, it might become more and more highly specialized, only accepting images very similar to previous ones.

Performance edit

Performance of the OPTIMOL method is defined by three factors:

  • Ability to collect images: OPTIMOL, it is found, can automatically collect large numbers of good images from the web. The size of the OPTIMOL-retrieved image sets surpass that of large human-labeled image sets for the same categories, such as those found in Caltech 101.
  • Classification accuracy: Classification accuracy was compared to the accuracy displayed by the classifier yielded by the pLSA methods discussed earlier. It was discovered that OPTIMOL achieved slightly higher accuracy, obtaining 74.8% accuracy on 7 object categories, as compared to 72.0%.
  • Comparison with batch learning: An important question to address is whether OPTIMOL's incremental learning gives it an advantage over traditional batch learning methods, when everything else about the model is held constant. When the classifier learns incrementally, by selecting the next images based on what it learned from the previous ones, three important results are observed:
    • Incremental learning allows OPTIMOL to collect a better dataset
    • Incremental learning allows OPTIMOL to learn faster (by discarding irrelevant images)
    • Incremental learning does not negatively affect the ROC curve of the classifier; in fact, incremental learning yielded an improvement

Object categorization in content-based image retrieval edit

Typically, image searches only make use of text associated with images. The problem of content-based image retrieval is that of improving search results by taking into account visual information contained in the images themselves. Several CBIR methods make use of classifiers trained on image search results, to refine the search. In other words, object categorization from image search is one component of the system. OPTIMOL, for example, uses a classifier trained on images collected during previous iterations to select additional images for the returned dataset.

Examples of CBIR methods that model object categories from image search are:

  • Fergus et al., 2004 [5]
  • Berg and Forsyth, 2006 [6]
  • Yanai and Barnard, 2006 [7]

References edit

  1. ^ a b c d e Fergus, R.; Fei-Fei, L.; Perona, P.; Zisserman,A. (2005). "Learning Object Categories from Google抯 Image Search" (PDF). Proc. IEEE International Conference on Computer Vision.
  2. ^ Hofmann, Thomas (1999). (PDF). Uncertainty in Artificial Intelligence. Archived from the original (PDF) on 2007-07-10.
  3. ^ Li, Li-Jia; Wang, Gang; Fei-Fei, Li (2007). "OPTIMOL: automatic Online Picture collection via Incremental MOdel Learning" (PDF). Proc. IEEE Conference on Computer Vision and Pattern Recognition.
  4. ^ Teh, Yw; Jordan, MI; Beal, MJ; Blei,David (2006). "Hierarchical Dirichlet Processes" (PDF). Journal of the American Statistical Association. 101 (476): 1566. CiteSeerX 10.1.1.5.9094. doi:10.1198/016214506000000302. S2CID 7934949.
  5. ^ Fergus, R.; Perona,P.; Zisserman,A. (2004). "A visual category filter for Google images" (PDF). Proc. 8th European Conf. on Computer Vision.
  6. ^ Berg, T.; Forsyth, D. (2006). "Animals on the web". Proc. Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2006.57.
  7. ^ Yanai, K; Barnard, K. (2005). "Probabilistic web image gathering". ACM SIGMM workshop on Multimedia information retrieval.

See also edit

object, categorization, from, image, search, this, article, needs, updated, please, help, update, this, article, reflect, recent, events, newly, available, information, september, 2019, computer, vision, problem, object, categorization, from, image, search, pr. This article needs to be updated Please help update this article to reflect recent events or newly available information September 2019 In computer vision the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects using only the images retrieved automatically with an Internet search engine Ideally automatic image collection would allow classifiers to be trained with nothing but the category names as input This problem is closely related to that of content based image retrieval CBIR where the goal is to return better image search results rather than training a classifier for image recognition Traditionally classifiers are trained using sets of images that are labeled by hand Collecting such a set of images is often a very time consuming and laborious process The use of Internet search engines to automate the process of acquiring large sets of labeled images has been described as a potential way of greatly facilitating computer vision research 1 Contents 1 Challenges 1 1 Unrelated images 1 2 Intra class variability 2 pLSA approach 2 1 Model 2 2 Application 2 2 1 ABS pLSA 2 2 2 TSI pLSA 2 3 Implementation 2 3 1 Selecting words 2 3 2 Possible object locations 2 4 Performance 3 OPTIMOL 3 1 General framework 3 2 Model 3 3 Implementation 3 3 1 Initialization 3 3 2 Model learning 3 3 3 Classification 3 3 4 Addition to the dataset and cache set 3 4 Performance 4 Object categorization in content based image retrieval 5 References 6 See alsoChallenges editUnrelated images edit One problem with using Internet image search results as a training set for a classifier is the high percentage of unrelated images within the results It has been estimated that when a search engine such as Google images is queried with the name of an object category such as airplane up to 85 of the returned images are unrelated to the category 1 Intra class variability edit Another challenge posed by using Internet image search results as training sets for classifiers is that there is a high amount of variability within object categories when compared with categories found in hand labeled datasets such as Caltech 101 and Pascal Images of objects can vary widely in a number of important factors such as scale pose lighting number of objects and amount of occlusion pLSA approach editIn a 2005 paper by Fergus et al 1 pLSA probabilistic latent semantic analysis and extensions of this model were applied to the problem of object categorization from image search pLSA was originally developed for document classification but has since been applied to computer vision It makes the assumption that images are documents that fit the bag of words model Model edit Just as text documents are made up of words each of which may be repeated within the document and across documents images can be modeled as combinations of visual words Just as the entire set of text words are defined by a dictionary the entire set of visual words is defined in a codeword dictionary pLSA divides documents into topics as well Just as knowing the topic s of an article allows you to make good guesses about the kinds of words that will appear in it the distribution of words in an image is dependent on the underlying topics The pLSA model tells us the probability of seeing each word w displaystyle w nbsp given the category d displaystyle displaystyle d nbsp in terms of topics z displaystyle displaystyle z nbsp P w d z 1 Z P w z P z d displaystyle displaystyle P w d sum z 1 Z P w z P z d nbsp An important assumption made in this model is that w displaystyle displaystyle w nbsp and d displaystyle displaystyle d nbsp are conditionally independent given z displaystyle displaystyle z nbsp Given a topic the probability of a certain word appearing as part of that topic is independent of the rest of the image 2 Training this model involves finding P w z displaystyle displaystyle P w z nbsp and P z d displaystyle displaystyle P z d nbsp that maximizes the likelihood of the observed words in each document To do this the expectation maximization algorithm is used with the following objective function L d 1 D w 1 W P w d n w d displaystyle displaystyle L prod d 1 D prod w 1 W P w d n w d nbsp Application edit ABS pLSA edit Absolute position pLSA ABS pLSA attaches location information to each visual word by localizing it to one of X 揵ins in the image Here x displaystyle displaystyle x nbsp represents which of the bins the visual word falls into The new equation is P w d z 1 Z P w x z P z d displaystyle displaystyle P w d sum z 1 Z P w x z P z d nbsp P w x z displaystyle displaystyle P w x z nbsp and P d displaystyle displaystyle P d nbsp can be solved for in a manner similar to the original pLSA problem using the EM algorithmA problem with this model is that it is not translation or scale invariant Since the positions of the visual words are absolute changing the size of the object in the image or moving it would have a significant impact on the spatial distribution of the visual words into different bins TSI pLSA edit Translation and scale invariant pLSA TSI pLSA This model extends pLSA by adding another latent variable which describes the spatial location of the target object in an image Now the position x displaystyle displaystyle x nbsp of a visual word is given relative to this object location rather than as an absolute position in the image The new equation is P w x d z 1 Z c 1 C P w x c z P c P z d displaystyle displaystyle P w x d sum z 1 Z sum c 1 C P w x c z P c P z d nbsp Again the parameters P w x c z displaystyle displaystyle P w x c z nbsp and P d displaystyle displaystyle P d nbsp can be solved using the EM algorithm P c displaystyle displaystyle P c nbsp can be assumed to be a uniform distribution Implementation edit Selecting words edit Words in an image were selected using 4 different feature detectors 1 Kadir Brady saliency detector Multi scale Harris detector Difference of Gaussians Edge based operator described in the studyUsing these 4 detectors approximately 700 features were detected per image These features were then encoded as Scale invariant feature transform descriptors and vector quantized to match one of 350 words contained in a codebook The codebook was precomputed from features extracted from a large number of images spanning numerous object categories Possible object locations edit One important question in the TSI pLSA model is how to determine the values that the random variable C displaystyle displaystyle C nbsp can take on It is a 4 vector whose components describe the object抯 centroid as well as x and y scales that define a bounding box around the object so the space of possible values it can take on is enormous To limit the number of possible object locations to a reasonable number normal pLSA is first carried out on the set of images and for each topic a Gaussian mixture model is fit over the visual words weighted by P w z displaystyle displaystyle P w z nbsp Up to K displaystyle displaystyle K nbsp Gaussians are tried allowing for multiple instances of an object in a single image where K displaystyle displaystyle K nbsp is a constant Performance edit The authors of the Fergus et al paper compared performance of the three pLSA algorithms pLSA ABS pLSA and TSI pLSA on handpicked datasets and images returned from Google searches Performance was measured as the error rate when classifying images in a test set as either containing the image or containing only background As expected training directly on Google data gives higher error rates than training on prepared data 1 In about half of the object categories tested do ABS pLSA and TSI pLSA perform significantly better than regular pLSA and in only 2 categories out of 7 does TSI pLSA perform better than the other two models OPTIMOL editOPTIMOL automatic Online Picture collection via Incremental MOdel Learning approaches the problem of learning object categories from online image searches by addressing model learning and searching simultaneously OPTIMOL is an iterative model that updates its model of the target object category while concurrently retrieving more relevant images 3 General framework edit OPTIMOL was presented as a general iterative framework that is independent of the specific model used for category learning The algorithm is as follows Download a large set of images from the Internet by searching for a keyword Initialize the dataset with seed images While more images needed in the dataset Learn the model with most recently added dataset images Classify downloaded images using the updated model Add accepted images to the datasetNote that only the most recently added images are used in each round of learning This allows the algorithm to run on an arbitrarily large number of input images Model edit The two categories target object and background are modeled as Hierarchical Dirichlet processes HDPs As in the pLSA approach it is assumed that the images can be described with the bag of words model HDP models the distributions of an unspecified number of topics across images in a category and across categories The distribution of topics among images in a single category is modeled as a Dirichlet process a type of non parametric probability distribution To allow the sharing of topics across classes each of these Dirichlet processes is modeled as a sample from another 損arent Dirichlet process HDP was first described by Teh et al in 2005 4 Implementation edit Initialization edit The dataset must be initialized or seeded with an original batch of images which serve as good exemplars of the object category to be learned These can be gathered automatically using the first page or so of images returned by the search engine which tend to be better than the subsequent images Alternatively the initial images can be gathered by hand Model learning edit To learn the various parameters of the HDP in an incremental manner Gibbs sampling is used over the latent variables It is carried out after each new set of images is incorporated into the dataset Gibbs sampling involves repeatedly sampling from a set of random variables in order to approximate their distributions Sampling involves generating a value for the random variable in question based on the state of the other random variables on which it is dependent Given sufficient samples a reasonable approximation of the value can be achieved Classification edit At each iteration P z c displaystyle displaystyle P z c nbsp and P x z c displaystyle displaystyle P x z c nbsp can be obtained from model learned after the previous round of Gibbs sampling where z displaystyle displaystyle z nbsp is a topic c displaystyle displaystyle c nbsp is a category and x displaystyle displaystyle x nbsp is a single visual word The likelihood of an image being in a certain class then is P I c i j P x i z j c P z j c displaystyle displaystyle P I c prod i sum j P x i z j c P z j c nbsp This is computed for each new candidate image per iteration The image is classified as belonging to the category with the highest likelihood Addition to the dataset and cache set edit In order to qualify for incorporation into the dataset however an image must satisfy a stronger condition P I c f P I c b gt l A c b l R c b l R c f l A c f P c b P c f displaystyle displaystyle frac P I c f P I c b gt frac lambda Ac b lambda Rc b lambda Rc f lambda Ac f frac P c b P c f nbsp Where c f displaystyle displaystyle c f nbsp and c b displaystyle displaystyle c b nbsp are foreground object and background categories respectively and the ratio of constants describes the risk of accepting false positives and false negatives They are adjusted automatically at every iteration with the cost of a false positive set higher than that of a false negative This ensures that a better dataset is collected Once an image is accepted by meeting the above criterion and incorporated into the dataset however it needs to meet another criterion before it is incorporated into the 揷ache set敆the set of images to be used for training This set is intended to be a diverse subset of the set of accepted images If the model were trained on all accepted images it might become more and more highly specialized only accepting images very similar to previous ones Performance edit Performance of the OPTIMOL method is defined by three factors Ability to collect images OPTIMOL it is found can automatically collect large numbers of good images from the web The size of the OPTIMOL retrieved image sets surpass that of large human labeled image sets for the same categories such as those found in Caltech 101 Classification accuracy Classification accuracy was compared to the accuracy displayed by the classifier yielded by the pLSA methods discussed earlier It was discovered that OPTIMOL achieved slightly higher accuracy obtaining 74 8 accuracy on 7 object categories as compared to 72 0 Comparison with batch learning An important question to address is whether OPTIMOL s incremental learning gives it an advantage over traditional batch learning methods when everything else about the model is held constant When the classifier learns incrementally by selecting the next images based on what it learned from the previous ones three important results are observed Incremental learning allows OPTIMOL to collect a better dataset Incremental learning allows OPTIMOL to learn faster by discarding irrelevant images Incremental learning does not negatively affect the ROC curve of the classifier in fact incremental learning yielded an improvementObject categorization in content based image retrieval editTypically image searches only make use of text associated with images The problem of content based image retrieval is that of improving search results by taking into account visual information contained in the images themselves Several CBIR methods make use of classifiers trained on image search results to refine the search In other words object categorization from image search is one component of the system OPTIMOL for example uses a classifier trained on images collected during previous iterations to select additional images for the returned dataset Examples of CBIR methods that model object categories from image search are Fergus et al 2004 5 Berg and Forsyth 2006 6 Yanai and Barnard 2006 7 References edit a b c d e Fergus R Fei Fei L Perona P Zisserman A 2005 Learning Object Categories from Google抯 Image Search PDF Proc IEEE International Conference on Computer Vision Hofmann Thomas 1999 Probabilistic Latent Semantic Analysis PDF Uncertainty in Artificial Intelligence Archived from the original PDF on 2007 07 10 Li Li Jia Wang Gang Fei Fei Li 2007 OPTIMOL automatic Online Picture collection via Incremental MOdel Learning PDF Proc IEEE Conference on Computer Vision and Pattern Recognition Teh Yw Jordan MI Beal MJ Blei David 2006 Hierarchical Dirichlet Processes PDF Journal of the American Statistical Association 101 476 1566 CiteSeerX 10 1 1 5 9094 doi 10 1198 016214506000000302 S2CID 7934949 Fergus R Perona P Zisserman A 2004 A visual category filter for Google images PDF Proc 8th European Conf on Computer Vision Berg T Forsyth D 2006 Animals on the web Proc Computer Vision and Pattern Recognition doi 10 1109 CVPR 2006 57 Yanai K Barnard K 2005 Probabilistic web image gathering ACM SIGMM workshop on Multimedia information retrieval See also editProbabilistic latent semantic analysis Latent Dirichlet allocation Machine learning Bag of words model Content based image retrieval Retrieved from https en wikipedia org w index php title Object categorization from image search amp oldid 1212828113, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.