What are Information Gain Scores?
A patent application from Google describes a different way of ranking sources of information in response to search requests. It is about information gain scores.
Information gain scores indicate how much more information one source may bring to a person who has seen other sources on the same topic. Pages with higher information gain scores may be ranked higher than pages with lower information gain scores.
After submitting a search query, a searcher may be provided with pages, or links to pages, about a topic that are responsive to that query.
Also, a searcher may be provided with a page based on:
- Interests of the searcher
- Previously viewed pages of the searcher
- Other criteria that may be utilized to identify and provide a page of interest.
Information from the pages may be provided through an automated assistant or results from a search engine.
Information from those pages may be provided to a searcher in response to a search query and/or may be automatically served to the searcher based on continued searching after the searcher has ended a search session.
In some cases, a subset of information may be extracted from the document for presentation to the user, for example, when a searcher engages in a spoken human-to-computer dialog with an automated assistant software process.
Some search engines will provide a featured snippet showing summary information from one or more responsive and/or relevant documents, in addition to or instead of links to responsive and/or relevant documents, in response to a searcher’s search query.
But Google has filed a patent application to solve a problem that it has identified in this situation. The application tells us that:
…when a set of documents is identified that share a topic, many of the documents may include similar information.
As an example, a searcher may submit a query about a computer problem and may be “provided with multiple documents that include a similar listing of solutions, remedial steps, resources, etc.”
While each of these documents is about the same topic and is relevant to “the request or interest of the user, the user may have less interest in viewing a second document after already viewing the same or similar information in a first document or set of documents.”
The patent application, published in April 2020, treats this as a problem that should be solved and tells us how it would do that:
Implementations described herein relate to determining an information gain score for one or more documents of potential interest to the user and presenting information from one or more of those documents that are selected based on their respective information gain scores.
An information gain score for a given document indicates “additional information included by a page beyond the information contained in other pages already presented to the user.”
Information from pages may be presented to a searcher in various ways, such as:
- Opening the entire document (e.g., in a web browser or another applicable software application)
- Audibly reading the entire content of the document to the user
- Extracting and audibly/visually presenting salient information extracted from the document to the user
How are Information Gain Scores Calculated?
Information gain scores may be determined for one or more pages by applying data indicative of the pages across a machine learning model to generate an information gain score. That data may be:
- Their entire contents
- Salient extracted information
- A semantic representation (e.g., an embedding, a feature vector, a bag-of-words representation, a histogram generated from words/phrases in the document, etc.)
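To make the idea of a "bag-of-words representation" concrete, here is a minimal Python sketch. The patent application does not specify how the representation is built, so the tokenizer and the function name `bag_of_words` are assumptions made purely for illustration.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Hypothetical helper: lowercase the text, split it into word tokens,
    and count how often each token occurs. Word order is discarded."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Example page text (made up for this sketch).
page = "Restart the router. If the router still fails, reset the router."
bow = bag_of_words(page)

print(bow["router"])  # how many times "router" appears on the page
print(len(bow))       # how many distinct terms the page contains
```

A representation like this is much smaller than the full page and can be compared directly with the representations of other pages, which is presumably why the patent application lists it as an alternative to a page's entire contents.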
Data that is indicative of one or more previously-presented pages, along with data indicative of one or more yet-to-be presented (or “new”) pages, may be applied as an input across a trained machine learning model to generate an output indicative of an information gain score of the one or more new pages.
Based on information gain scores, information from one or more of the new pages may be provided to the searcher in a way reflecting the likely information gain that can be attained by the searcher if the searcher were to be presented information from the selected pages.
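The patent application describes a trained machine learning model that takes previously presented pages and new pages as input, but it does not publish the model itself. As a rough illustration only, the sketch below swaps in a simple stand-in: the share of a new page's distinct terms that appear in none of the pages already seen. The function names and the scoring rule are assumptions, not Google's method.

```python
import re


def terms(text: str) -> set[str]:
    """Hypothetical helper: the set of distinct lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def information_gain_score(new_page: str, seen_pages: list[str]) -> float:
    """Stand-in for the trained model (an assumption for illustration):
    the fraction of the new page's distinct terms found in no seen page.
    Returns a value from 0.0 (nothing new) to 1.0 (everything new)."""
    new_terms = terms(new_page)
    if not new_terms:
        return 0.0
    seen_terms = set().union(*(terms(p) for p in seen_pages))
    return len(new_terms - seen_terms) / len(new_terms)


seen = ["restart the router"]

# A page repeating what was already seen adds nothing.
duplicate_score = information_gain_score("restart the router", seen)

# A page with mostly unseen terms scores higher.
novel_score = information_gain_score("update the firmware", seen)

print(duplicate_score, novel_score)
```

The point of the sketch is the shape of the input and output: data about seen pages and a new page goes in, and a single score for the new page comes out. A real model would work on meaning rather than on exact word overlap.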
How Does this Information Gain Score Process Work?
- The first set of pages displayed to the searcher is identified.
- The pages of the first set share a common topic and can be identified based on being previously provided to the user
- The searcher may look for a topic and one or more pages that are responsive to that query may be returned
- For each new page in a second set of pages that the searcher has not yet been presented, an information gain score is determined indicating whether that page includes information not contained in the pages of the first set
- Based on the information gain scores, one or more of the new pages may be selected to provide to the searcher, and/or the new pages may be ranked based on their respective information gain scores
- As the searcher views more pages, the remaining pages in the second set may be re-ranked based on new information gain scores
- While the entire content of those pages may be applied as input across the machine learning model, an alternative representation of each page, such as a semantic feature vector or embedding, a “bag-of-words” representation, etc., may be generated and applied as input instead
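The steps above can be sketched as a small loop: score the unseen pages against the pages already viewed, rank them, then re-rank once the searcher views another page. The scoring function here is a simple term-overlap stand-in for the trained model in the patent application, and every name and sample page is hypothetical.

```python
import re


def terms(text):
    """Hypothetical helper: distinct lowercase word tokens in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def gain(new_page, seen_pages):
    """Stand-in score (an assumption): share of the new page's terms not yet seen."""
    new_terms = terms(new_page)
    seen_terms = set().union(*(terms(p) for p in seen_pages))
    return len(new_terms - seen_terms) / len(new_terms) if new_terms else 0.0


def rank_by_gain(candidates, seen_pages):
    """Order candidate page names from highest to lowest stand-in score."""
    return sorted(candidates, key=lambda name: gain(candidates[name], seen_pages),
                  reverse=True)


# First set: pages the searcher has already been presented.
first_set = ["restart the router"]

# Second set: new pages on the same topic.
second_set = {
    "a": "restart the router and modem",
    "b": "update the firmware",
    "c": "update the firmware and check the cables",
}

initial = rank_by_gain(second_set, first_set)

# The searcher views the top page, so it moves into the first set.
viewed = initial[0]
first_set.append(second_set.pop(viewed))

# The remaining pages are re-scored against the larger first set.
reranked = rank_by_gain(second_set, first_set)

print(initial)
print(reranked)
```

In this toy data, page "b" starts above page "a", but once the searcher has viewed page "c" (which covers everything in "b"), page "b" has nothing new left to offer and drops below "a". That is the re-ranking behavior the patent application describes.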
The patent clearly states that “these search results may be ranked at least in part based on their respective information gain scores.”
This may mean that some pages may be boosted in rankings based upon how much information they would add for a searcher, and may be demoted if they don’t add much information for a searcher.
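The patent application does not say how an information gain score would be combined with other ranking signals. Purely as a hypothetical illustration of "ranked at least in part," the sketch below blends an assumed relevance score with an assumed information gain score using a made-up weight. None of these numbers or names come from the patent application.

```python
def blended_score(relevance: float, info_gain: float, gain_weight: float = 0.3) -> float:
    """Hypothetical blend: a weighted average of relevance and information gain.
    The 0.3 default weight is an arbitrary value chosen for this sketch."""
    return (1.0 - gain_weight) * relevance + gain_weight * info_gain


# Made-up (relevance, information gain) pairs for two pages.
results = {
    "page_x": (0.90, 0.05),  # very relevant, but repeats what was already seen
    "page_y": (0.80, 0.70),  # slightly less relevant, but adds a lot that is new
}

ranked = sorted(results, key=lambda name: blended_score(*results[name]), reverse=True)
print(ranked)
```

With these made-up numbers, page_y moves ahead of page_x even though page_x has the higher relevance score, which is the kind of boost and demotion described above.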
This Information Gain Score patent application can be found at:
Contextual Estimation of Link Information Gain
Publication Number WO2020081082
Publication Date April 23, 2020
International Filing Date October 18, 2018
Applicants: GOOGLE LLC
Inventors: Victor Carbune, Pedro Gonnet Anders
Techniques are described herein for determining an information gain score for one or more documents of interest to the user and present information from the documents based on the information gain score. An information gain score for a given document is indicative of additional information that is included in the document beyond the information contained in documents that were previously viewed by the user. In some implementations, the information gain score may be determined for one or more documents by applying data from the documents across a machine learning model to generate an information gain score. Based on the information gain scores of a set of documents, the documents can be provided to the user in a manner that reflects the likely information gain that can be attained by the user if the user were to view the documents.