What are Information Gain Scores?
A patent application from Google describes a different way of ranking sources of information in response to search requests: information gain scores.
Information gain scores state how much more information one source may bring to a person who has seen other sources on the same topic. Pages with higher information gain scores may be ranked higher than pages with lower information gain scores.
A searcher may find documents about a topic after submitting a search query and receiving pages, or links to pages, that are responsive to it.
Also, a searcher may receive a page based on:
- Interests of the searcher
- Prior viewed pages of the searcher
- Other criteria that may identify and provide a page of interest.
Information from the pages may come through an automated assistant or in results from a search engine.
Information from those pages may be shown to a searcher in response to a search query, or may be served to the searcher automatically, based on continued searching after the searcher has ended a search session.
In some cases, only a subset of information from a document is extracted for presentation to the searcher, for example when a searcher engages in a spoken human-to-computer dialog with an automated assistant software process.
In response to a searcher’s query, some search engines will provide a featured snippet showing summary information from one or more responsive and/or relevant documents, besides or instead of links to those documents.
But Google has filed a patent application to solve a problem that it has identified in this situation. The application tells us that:
…when a set of documents is identified that share a topic, many of the documents may include similar information.
As an example, a searcher may submit a query about a computer problem and may be “provided with multiple documents that include a similar listing of solutions, remedial steps, resources, etc.”
While each of these documents is about the same topic and is relevant to “the request or interest of the user, the user may have less interest in viewing a second document after already viewing the same or similar information in a first document or set of documents.”
Google treats this as a problem that needs solving. The patent application, published in April 2020, describes the approach:
Implementations described herein relate to determining an information gain score for one or more documents of potential interest to the user and presenting information from one or more of those documents that are selected based on their respective information gain scores.
An information gain score for a given document indicates “additional information included by a page beyond the information contained in other pages already presented to the user.”
Information from pages may go to a searcher in various ways, such as:
- Opening the entire document (e.g., in a web browser or another applicable software application)
- Audibly reading the entire content of the document to the user
- Extracting and audibly/visually presenting salient information extracted from the document to the user
How are Information Gain Scores Calculated?
Information gain scores may be determined for one or more pages by applying data indicative of those pages across a machine learning model. That data may be:
- Their entire contents
- Salient extracted information
- A semantic representation (e.g., an embedding, a feature vector, a bag-of-words representation, a histogram generated from words/phrases in the document, etc.)
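To make the "bag-of-words" option concrete, here is a minimal Python sketch of one such representation. The tokenizing rule and the function name bag_of_words are assumptions made for illustration; the patent application does not specify how a representation is built.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Hypothetical helper: count how often each word appears in a page.

    Word order is discarded; only the word counts are kept.
    """
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


page = "Restart the router. If the router still fails, reset the router."
bow = bag_of_words(page)
print(bow["router"], bow["reset"])
```

A histogram or embedding would play the same role: a compact stand-in for the page that can be fed to a model in place of its full text.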
Data indicative of one or more previously presented pages, along with data indicative of one or more yet-to-be-presented (or “new”) pages, may be applied as input across a trained machine learning model to generate an output indicative of an information gain score for the one or more new pages.
Information from one or more new pages may then be provided to the searcher based on those information gain scores, to reflect the information the searcher would likely gain from being presented with the selected pages.
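The patent application relies on a trained machine learning model, and it does not publish that model. As a rough illustration of the idea only, the Python sketch below swaps in a simple stand-in heuristic: the share of a new page's distinct words that the searcher has not already seen. The function names and the heuristic itself are assumptions, not Google's method.

```python
import re


def tokenize(text: str) -> set[str]:
    """Assumed tokenizer: the set of distinct lowercase words in a page."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def information_gain_score(new_page: str, viewed_pages: list[str]) -> float:
    """Stand-in heuristic, NOT the patent's trained model.

    Returns the fraction of the new page's distinct words that do not
    appear in any previously viewed page (0.0 = nothing new, 1.0 = all new).
    """
    new_terms = tokenize(new_page)
    if not new_terms:
        return 0.0
    seen_terms: set[str] = set()
    for page in viewed_pages:
        seen_terms |= tokenize(page)
    return len(new_terms - seen_terms) / len(new_terms)


viewed = ["Restart the router and check the cables."]
print(information_gain_score("Restart the router and check the cables.", viewed))
print(information_gain_score("Update firmware", viewed))
```

A page that repeats what was already viewed scores at the bottom of the range, while a page made up of unseen material scores at the top, which is the behavior the patent application describes for its own scores.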
How Does this Information Gain Score Process Work?
- Identifying a first set of pages that have been displayed to the searcher.
- Identifying the pages in the first set that share a common topic and were previously provided to the searcher
- Identifying a second set of new pages on that topic; for example, the searcher may look for a topic, and one or more pages that are responsive to that query may be returned
- For each new page in the second set of pages, determining an information gain score indicating whether that page includes information not contained in the pages of the first set
- Based on the information gain scores, selecting one or more of the new pages to provide to the searcher, and/or ranking the new pages based on their respective information gain scores
- As the searcher views more pages, re-ranking the remaining pages in the second set based on new information gain scores
- A machine learning approach may view the entire content of those pages, or an alternative representation of each page, such as a semantic feature vector or embedding, a “bag-of-words” representation, etc., may be generated and applied as input across the machine learning model
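The steps above can be sketched as a small loop in Python: score the new pages against what has been viewed, rank them, let the searcher view the top page, then score and rank the rest again. The scoring function here is a simple unseen-word heuristic standing in for the patent's trained model, and the page names, page texts, and function names are all hypothetical.

```python
import re


def terms(text: str) -> set[str]:
    """Assumed tokenizer: distinct lowercase words in a page."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def gain(new_page: str, viewed: list[str]) -> float:
    """Stand-in score (not the patent's model): share of unseen words."""
    new_terms = terms(new_page)
    if not new_terms:
        return 0.0
    seen = set().union(*(terms(p) for p in viewed)) if viewed else set()
    return len(new_terms - seen) / len(new_terms)


def simulate_session(candidates: dict[str, str], viewed: list[str]) -> list[list[str]]:
    """Re-rank the remaining pages each time the searcher views the top one."""
    remaining = dict(candidates)  # copy so the caller's data is untouched
    seen_pages = list(viewed)
    history = []
    while remaining:
        order = sorted(remaining, key=lambda name: gain(remaining[name], seen_pages),
                       reverse=True)
        history.append(order)
        # The searcher views the top page; it joins the first set.
        seen_pages.append(remaining.pop(order[0]))
    return history


first_set = ["Restart the router and check the cables."]
second_set = {
    "page-a": "Restart the router and check the cables again today.",
    "page-b": "Update the router firmware.",
    "page-c": "Update the firmware, then change the wifi channel.",
}
for ranking in simulate_session(second_set, first_set):
    print(ranking)
```

In this toy data, page-b starts out ahead of page-a, but once the searcher has viewed page-c (which already covers the firmware update), page-b has nothing new to add and falls below page-a.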
The patent states that “these search results may be ranked at least in part based on their respective information gain scores.”
This may mean boosting some pages in rankings based on how much information they would add for a searcher, and demoting others if they don’t add much.
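The patent application says only that results may be ranked "at least in part" on information gain, and it does not say how that score is combined with other ranking signals. One possible reading, sketched in Python below, is a weighted blend of a relevance score and an information gain score. The blending formula, the weight of 0.3, and the example numbers are all assumptions made for illustration.

```python
def blended_score(relevance: float, info_gain: float, gain_weight: float = 0.3) -> float:
    """Hypothetical blend: mostly relevance, partly information gain.

    Both inputs are assumed to be on a 0.0 to 1.0 scale.
    """
    return (1.0 - gain_weight) * relevance + gain_weight * info_gain


# (page name, assumed relevance score, assumed information gain score)
results = [
    ("page-a", 0.90, 0.05),  # very relevant, but repeats what was already seen
    ("page-b", 0.80, 0.70),  # slightly less relevant, adds a lot that is new
]

ranked = sorted(results, key=lambda r: blended_score(r[1], r[2]), reverse=True)
for name, relevance, info_gain in ranked:
    print(name, round(blended_score(relevance, info_gain), 3))
```

With these made-up numbers, the page that adds more new information moves ahead of the slightly more relevant page that mostly repeats earlier results.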
This Information Gain Score patent application is at:
Contextual Estimation of Link Information Gain
Publication Number WO2020081082
Publication Date April 23, 2020
International Filing Date October 18, 2018
Applicants: GOOGLE LLC
Inventors: Victor Carbune, Pedro Gonnet Anders
Techniques are described herein for determining an information gain score for one or more documents of interest to the user and present information from the documents based on the information gain score. An information gain score for a given document indicates additional information that is included in the document beyond the information contained in documents that the user previously viewed. In some implementations, the information gain score may be determined for one or more documents by applying data from the documents across a machine learning model to generate an information gain score. Based on the information gain scores of a set of documents, the documents can be provided to the user to reflect the likely information gain that the user can attain if the user were to view the documents.