How Images May Be Chosen for Search Results
A few years ago, some former Google employees (at least one of whom has since returned) started a search engine named Cuil, which was called a Google-killer when it launched. It became known for showing images with search results, and those images weren’t always well chosen or accurate. For an example of criticism of the images in Cuil’s search results, see this blog post: What’s the deal with Cuil?.
Google has been showing images next to news results for years. How has it avoided the kinds of mistakes that Cuil was making with its images? A patent granted to Google this week discusses some of the things that Google does to make the images that show up in its news results more accurate.
Towards the top of the description for the patent, they tell us about why they show images with news results, and what one of the challenges of doing so is:
In the case of news documents, users may find it beneficial to see an image in association with the news documents. Oftentimes, however, news documents include multiple images some of which may not be related to the topic of the news documents. This makes it difficult to automatically select appropriate images for the news documents.
They give us a summary of the approach they take to try to use images that are accurate and helpful in the news results they show:
According to one aspect consistent with the principles of the invention, a method includes identifying images associated with a document, filtering the images to create a set of candidate images, detecting captions associated with the candidate images, and selecting one of the candidate images to associate with the document based on the detected captions.
This newly granted patent is:
Image selection for news search
Inventors: Hong Zhou, Srdjan Mitrovic, Krishna Bharat, Michael Schmitt, and Michael Curtiss
Assignee: Google Inc.
US Patent 9,613,061
Granted: April 4, 2017
Filed: May 28, 2014
A system identifies a first document that includes several first images, identifies a second document that includes many second images, and forms a cluster based on a relationship between the first document and the second document. The system identifies a first caption associated with one of the first images, identifies a second caption associated with one of the second images, selects one of the first images or one of the second images as a representative image for the cluster based on the first caption or the second caption, and associates the representative image with the cluster.
News Crawling Unit
The patent tells us about the behavior of a “News Crawling Unit,” which sounds a bit like how we might envision a news-oriented Googlebot behaving if it focused primarily upon crawling news documents. It may go on focused crawls of the web that start with URLs that it associates with news sources. It would capture images on those pages to include with news stories:
News crawling unit may also crawl the images based on their extracted addresses and store the images and other information relating to the images. For example, news crawling units may obtain temporal information and reference count information relating to the images. The temporal information may be useful for identifying “stock images” (i.e., images that are used in multiple news documents relating to the same topic). Stock images may qualify as good candidate images. The reference count information may be useful for identifying images that are linked by multiple news documents on the same host but not directly related to the topics of the news documents, such as images of columnists or news source related icons. Images with high reference counts may be determined to not make good candidate images.
So that gives us a start of an idea of how Google may choose the images we see that accompany news stories. The patent goes on to tell us how it may sort good candidate images from images that might not be good choices for showing in search results, including oddly shaped and formatted images or ones that are unrelated to the topic of the source news documents that they are near, such as images related to advertisements or columnists.
The patent also tells us that images below a certain size, or with an extreme aspect ratio (possibly too tall or too narrow), may also be excluded as candidate images (candidates to show in news results).
We are also told that an image that includes a link may be ruled out as a candidate because linked images are often advertisements.
Images that are hosted someplace other than where the news source is hosted may also be ruled out as candidate images because they might be advertisements unless they are from a content delivery network.
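The filtering rules described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the patent does not publish thresholds, so every constant here (`MIN_WIDTH`, `MIN_HEIGHT`, `MAX_ASPECT_RATIO`, `MAX_REFERENCE_COUNT`), the `CDN_HOSTS` allow-list, and the `CrawledImage` record are hypothetical names and values, not anything taken from Google.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Every value below is an illustrative assumption, not a figure from the patent.
MIN_WIDTH = 60
MIN_HEIGHT = 60
MAX_ASPECT_RATIO = 3.0          # longer side divided by shorter side
MAX_REFERENCE_COUNT = 20        # many references on one host suggests an icon or columnist photo
CDN_HOSTS = {"cdn.example.net"}  # hypothetical allow-list of content delivery networks


@dataclass
class CrawledImage:
    url: str
    width: int
    height: int
    is_linked: bool       # True when the image is wrapped in a link
    reference_count: int  # how many documents on the same host reference it


def is_candidate(image: CrawledImage, document_url: str) -> bool:
    """Return True if the image survives all of the filters described above."""
    # Too small to be a useful thumbnail.
    if image.width < MIN_WIDTH or image.height < MIN_HEIGHT:
        return False
    # Oddly shaped: far too tall or far too wide.
    longer = max(image.width, image.height)
    shorter = min(image.width, image.height)
    if longer / shorter > MAX_ASPECT_RATIO:
        return False
    # Linked images are often advertisements.
    if image.is_linked:
        return False
    # Heavily reused images are often logos, icons, or columnist photos.
    if image.reference_count > MAX_REFERENCE_COUNT:
        return False
    # Off-host images may be advertisements, unless they come from a CDN.
    image_host = urlparse(image.url).hostname
    document_host = urlparse(document_url).hostname
    if image_host != document_host and image_host not in CDN_HOSTS:
        return False
    return True
```

For example, a 400 by 300 unlinked image hosted on the same host as the story passes, while the same image served from an unrelated ad host does not.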
When images are crawled, information about the captions of images may be detected because those may be good descriptions of images, and tell whether the image may be related to the topic of the source news document.
When an image and text are captured together within HTML tags, such as within a table cell, that text may be associated with the image. Likewise, the alt text could be associated with the image and used as the alt text for the image when it is shown in a news result.
The patent tells us that the alternative text for an image could be analyzed to see if it contains “poison” words, such as a word that might identify the name of the author of the image or words that are unrelated to the topic of the news document. If the alternative text does not contain poison words, it might then be used as the caption of the image.
If the image is in an HTML container such as a table cell with text, then that text (or text from a neighboring cell) might be used as a caption of the image.
If that text sharing an HTML container exceeds a certain threshold or is too bulky, it might not be considered as a caption because chances are it might be part of the news document.
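Here is a rough sketch of how that caption detection might look. The poison word list, the `MAX_CAPTION_WORDS` limit, the function name `detect_caption`, and the choice to try alt text before container text are all my assumptions; the patent describes the ideas but not a specific word list, threshold, or order.

```python
from typing import Optional

# Hypothetical poison words; the patent does not publish an actual list.
POISON_WORDS = {"photo", "photographer", "credit", "advertisement"}
# Assumed limit: longer container text is probably part of the article body.
MAX_CAPTION_WORDS = 40


def detect_caption(alt_text: Optional[str], container_text: Optional[str]) -> Optional[str]:
    """Pick a caption from alt text or from text sharing the image's HTML container."""
    # Try the alt text first, rejecting it if any poison word appears.
    if alt_text:
        words = {w.strip(".,:;()").lower() for w in alt_text.split()}
        words.discard("")
        if words and not (words & POISON_WORDS):
            return alt_text.strip()
    # Fall back to container text, as long as it is short enough to be a caption.
    if container_text:
        text = container_text.strip()
        if text and len(text.split()) <= MAX_CAPTION_WORDS:
            return text
    return None
```

With these assumed values, alt text such as "Photo credit: Jane" would be skipped in favor of a short sentence found in the same table cell, and a cell holding several paragraphs would yield no caption at all.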
The patent tells us that image scores for each of the candidate images might be created based upon certain factors, such as:
- image size
- distance to the title of the news document
- an overlap between the image caption and the news document centroid
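The three factors above could be combined into a single score in many ways. The sketch below uses a simple weighted sum, which is purely my assumption: the patent names the factors but not a formula. The `REFERENCE_AREA` constant, the weights, the treatment of the document centroid as a plain set of terms, and the function names are all hypothetical.

```python
import re
from typing import Set, Tuple

# Assumed normalization constant: images at or above this area get full size credit.
REFERENCE_AREA = 640 * 480


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def caption_overlap(caption: str, centroid_terms: Set[str]) -> float:
    """Fraction of caption tokens that also appear in the document centroid."""
    tokens = tokenize(caption)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in centroid_terms) / len(tokens)


def image_score(
    width: int,
    height: int,
    distance_to_title: int,
    caption: str,
    centroid_terms: Set[str],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of size, closeness to the title, and caption overlap."""
    size_part = min((width * height) / REFERENCE_AREA, 1.0)
    # Closer to the title is better; distance is measured in arbitrary units.
    closeness_part = 1.0 / (1.0 + distance_to_title)
    overlap_part = caption_overlap(caption, centroid_terms)
    w_size, w_close, w_overlap = weights
    return w_size * size_part + w_close * closeness_part + w_overlap * overlap_part
```

Under this sketch, two images of the same size are separated by how close they sit to the headline and by how many of their caption words match the terms that characterize the story.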
We are also told that some other filters may be used to decide whether an image from a news source should accompany that news story in news results. These could include:
- images that contain text
- images that look more like clip-art, as opposed to photographs
- images that are all the same color
- other criteria
Cluster Level Images
News topics are often broken down into clusters of documents about those topics.
The patent tells us that images may be associated with topic clusters, and the highest-ranked image within a topic cluster might be determined based upon the rank of the source news document within that cluster: the higher the news document is ranked within a cluster, “the more likely its image may be representative of the cluster.”
We are also told that the words in the caption of an image might be looked at: the more often words from the image caption appear in the body of documents in the cluster, the more likely it is that the image is related to the topic of the cluster.
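To make the cluster-level idea concrete, here is a small sketch that combines the two signals just described: the rank of the source document within the cluster and how often caption words show up in the bodies of the cluster's documents. How the two signals are weighed against each other is my assumption (the patent does not say), and `ClusterDocument`, `RANK_WEIGHT`, and `select_cluster_image` are hypothetical names.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

# Assumed weight for the document-rank signal relative to caption word hits.
RANK_WEIGHT = 2.0


@dataclass
class ClusterDocument:
    rank: int        # 1 is the highest-ranked document in the cluster
    body: str
    image_url: str
    caption: str


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_cluster_image(documents: List[ClusterDocument]) -> Optional[str]:
    """Return the image URL that best represents the cluster, or None if empty."""
    # Count how often each word appears across every document body in the cluster.
    body_counts = Counter()
    for doc in documents:
        body_counts.update(tokenize(doc.body))

    best_url, best_score = None, float("-inf")
    for doc in documents:
        # Sum cluster-wide occurrences of each distinct caption word.
        caption_hits = sum(body_counts[t] for t in set(tokenize(doc.caption)))
        # Higher-ranked documents (smaller rank numbers) get a larger bonus.
        score = caption_hits + RANK_WEIGHT / doc.rank
        if score > best_score:
            best_url, best_score = doc.image_url, score
    return best_url
```

A real system would likely ignore very common words before counting, since words like "the" inflate the caption score here. I left that out to keep the sketch short.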
I have seen some patents where one or more sentences near the end of the patent could have more meaning to them than may be expected. There is a sentence like that in this patent, where it tells us:
Further, while described in the context of news searches, systems and methods consistent with the principles of the invention may apply to non-news searches, such as product searches.
It sounds like it wouldn’t be a bad idea to think about how Google might use some of the methods described in the patent to associate images with search results other than just news results. Sort of like Cuil did, but probably better than Cuil did.