Large Data Sets and Search Rankings
In accordance with one aspect consistent with the principles of the invention, a method for ranking documents is provided. The method may include creating a ranking model that predicts a likelihood that a document will be selected and training the ranking model using a data set that includes tens of millions of instances.
Back in 2011, I wrote about a patent granted to Google in 2007 on building ranking models from a very large amount of data about queries, documents on the Web, and searchers. That post was Google and Large Scale Data Models Like Panda, and the version of the patent I covered then was Ranking documents based on large data sets.
That large data sets patent has now been updated for the third time through a continuation patent. The two earlier continuation patents weren’t granted, but this latest one has been. The description appears to be the same as in the original version, which was filed in 2003. The claims have been rewritten extensively and are worth looking at, because the new ones capture how much effort has gone into this patent. The newest version of the large data sets patent can be found at:
Ranking documents based on large data sets
Inventors: Jeremy Bem, Georges R. Harik, Joshua L. Levenberg, Noam M. Shazeer and Simon Tong
Assignee: Google LLC
US Patent: 10,055,461
Granted: August 21, 2018
Filed: July 31, 2015
A system ranks documents based, at least in part, on a ranking model. The ranking model may be generated to predict the likelihood that a document will be selected. The system may receive a search query and identify documents relating to the search query. The system may then rank the documents based, at least in part, on the ranking model and form search results for the search query from the ranked documents.
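To make the flow in that abstract concrete, here is a minimal sketch of ranking candidate documents with a model that predicts the likelihood of selection. Everything in it is my own illustration: `rank_documents`, `toy_model`, and the sample documents are hypothetical names and data, and the toy scoring function is only a stand-in for a trained model, not anything described in the patent.

```python
def rank_documents(query, documents, model):
    """Order candidate documents by the model's predicted likelihood of selection."""
    scored = [(model(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def toy_model(query, doc):
    # Hypothetical stand-in for a trained model:
    # the fraction of query terms that appear in the document title.
    terms = query.lower().split()
    title_words = doc["title"].lower().split()
    return sum(term in title_words for term in terms) / len(terms)


# Made-up documents "identified" for the query.
docs = [
    {"url": "a.example", "title": "Patent search basics"},
    {"url": "b.example", "title": "Ranking documents with large data sets"},
    {"url": "c.example", "title": "Large dogs"},
]

results = rank_documents("large data sets", docs, toy_model)
ranked_urls = [doc["url"] for doc in results]
```

The point of the sketch is only the shape of the process: identify documents for a query, score each one with the model, and form the results from the sorted scores.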
Updated Claims in the Large Data Sets Patent
It is worth comparing the claims from the earliest version of this large data sets patent to the latest, to get a sense of how much it has changed. Reading my earlier post about the first version can also help in understanding what it covers. I am including the first claim from each version here because they present quite a contrast in what the patents apply to.
In the original version of the patent, the first claim is much shorter and far less detailed:
1. A computer implemented method, comprising: creating a ranking model that predicts a likelihood that a document will be selected by: storing information associated with a plurality of prior searches, determining a prior probability of selection based, at least in part, on the information associated with the prior searches, and generating the ranking model based, at least in part on the prior probability of selection; training the ranking model using a data set that includes approximately tens of millions of instances; identifying documents relating to a search query; scoring the documents based, at least in part, on the ranking model; forming search results for the search query from the scored documents; and outputting the search results.
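One way to picture the "prior probability of selection" in that claim is as a selection rate computed from stored information about prior searches. The sketch below assumes, purely for illustration, that the prior depends only on the position at which a result was shown; the claim itself does not say which information is used, and the function name and log format are hypothetical.

```python
from collections import defaultdict


def prior_selection_probability(log):
    """Estimate a prior probability of selection per result position.

    `log` is an iterable of (position, was_selected) pairs taken from
    prior searches (a made-up format for this sketch).
    """
    shown = defaultdict(int)
    selected = defaultdict(int)
    for position, was_selected in log:
        shown[position] += 1
        if was_selected:
            selected[position] += 1
    return {pos: selected[pos] / shown[pos] for pos in shown}


# Invented log entries: position 1 shown three times and selected twice, etc.
log = [(1, True), (1, False), (1, True), (2, False), (2, True), (3, False)]
priors = prior_selection_probability(log)
```

A ranking model built on top of priors like these would then adjust them using other information about the user, the query, and the document.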
Compare that claim to this one from the latest version of the patent, which is much more detailed:
What is claimed is:
1. A computer-implemented method comprising: receiving, by a distributed search system, a collection of training data comprising a plurality of training instances that each identify a respective first document selected by a particular user when the first document was identified in search results provided by the search system to the particular user in response to particular search query issued by the particular user; partitioning the collection of training data over a plurality of computing devices of the distributed search system; generating, by the distributed search system, a ranking model that produces a likelihood that a particular user will select a particular document when identified by one or more search results provided in response to a particular search query submitted by the particular user, including processing, by each computing device of the plurality of computing devices, training instances assigned to the computing device, including: selecting, by the computing device, a candidate condition, wherein the candidate condition specifies values for one or more user features, one or more query features, and one or more document features, sending, by the computing device, to each other computing device of the plurality of computing devices, a request to compute local statistics for the candidate condition, receiving, by the computing device from each other computing device of one or more other computing devices, respective computed statistics for the candidate condition computed by the other computing device using values of local training instances assigned to the other computing device, computing, by the computing device, a weight for the candidate condition according to the computed statistics received from the one or more other computing devices for the candidate condition; determining, by the computing device, that a new rule comprising the candidate condition and the computed weight should be added to the ranking model, and in response, adding the new 
rule to the ranking model and providing, by the computing device, to each other computing device of the plurality of computing devices, an indication that the new rule comprising the candidate condition and the computed weight should be added to the ranking model; receiving a search query submitted by a first user; obtaining a plurality of search results that satisfy the search query, wherein each search result identifies a respective document of a plurality of documents; determining one or more features of the first user and one or more features of the search query submitted by the first user; using the one or more features of the first user and the one or more features of the search query as input to the ranking model to compute, for each document identified by the search results, a respective likelihood that the first user will select the document when provided in response to the search query; and ranking the plurality of search results based on a respective computed likelihood for each document, the computed likelihood for each document being a likelihood that the first user will select the document when provided in response to the search query.
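The training steps in that claim can be pictured roughly as follows. This is a sketch under my own assumptions, not the patent's implementation: the "computing devices" are simulated as plain lists of training instances, the local statistics are simple counts, the weight is a smoothed log-odds value, and a rule is added only if the condition matches enough instances. The claim does not specify any of those choices, and all names and data here are hypothetical.

```python
import math


def local_stats(partition, condition):
    """Statistics one device computes over its own training instances:
    how many instances match the condition, and how many of those were selected."""
    matches = [
        selected
        for features, selected in partition
        if all(features.get(name) == value for name, value in condition.items())
    ]
    return len(matches), sum(matches)


def compute_weight(stats_list):
    """Combine per-device statistics into a weight (smoothed log-odds, an assumption)."""
    shown = sum(n for n, _ in stats_list)
    clicks = sum(c for _, c in stats_list)
    p = (clicks + 1) / (shown + 2)
    return shown, math.log(p / (1 - p))


def propose_rule(partitions, condition, min_support=3):
    """Gather statistics from every partition and decide whether to add a rule."""
    stats = [local_stats(partition, condition) for partition in partitions]
    shown, weight = compute_weight(stats)
    return (condition, weight) if shown >= min_support else None


# Two simulated devices, each holding (features, selected) training instances.
partitions = [
    [
        ({"user_lang": "en", "query_term": "jaguar", "doc_host": "cars.example"}, 1),
        ({"user_lang": "en", "query_term": "jaguar", "doc_host": "zoo.example"}, 0),
    ],
    [
        ({"user_lang": "en", "query_term": "jaguar", "doc_host": "cars.example"}, 1),
        ({"user_lang": "fr", "query_term": "jaguar", "doc_host": "cars.example"}, 0),
        ({"user_lang": "en", "query_term": "jaguar", "doc_host": "cars.example"}, 1),
    ],
]

# A candidate condition with one user, one query, and one document feature.
condition = {"user_lang": "en", "query_term": "jaguar", "doc_host": "cars.example"}

model = []  # the ranking model as a list of (condition, weight) rules
rule = propose_rule(partitions, condition)
if rule is not None:
    model.append(rule)  # in the claim, the other devices are also told about the new rule
```

In the claim, one device sends the request to the others and later notifies them of the new rule; here a single loop over the partitions stands in for that messaging.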
The claim tells us that the model that rankings are based upon involves features about users, about queries, and about the documents being ranked. These are just some of the features identified in the new claims:
- A language of the first user
- One or more previous queries issued by the first user
- A number of times the first user has accessed a particular document
- A language of the query
- One or more terms of the query
- One or more second documents that the particular user did not select
- Data representing a position of the selected first document in an order of the search results provided in response to the particular query
- A number of documents ranked above the selected first document in the search results provided to the particular user in response to the particular search query
- A location of the first user
Some of the other claims in the newer version of the patent have also become much longer, which makes them worth reading closely.
The first version of the patent does tell us that it pays attention to many different instances of data, each broken into a triple describing a searcher, a query, and a document. As I said in my first post about the original patent:
In the first Google patent, the model being built looked at a combination of data from users, the queries that they used, and the documents that they may or may not have selected. Each of these combinations is referred to as an “instance.” An instance is a “triple” of data: (u, q, d), where u is user information, q is query data from the user, and d is document information relating to pages returned from the query data.
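As a rough picture of what one of those instances might look like as data, here is a minimal sketch of a (u, q, d) triple and a helper that flattens it into named features. The field contents, the feature names, and the `to_features` helper are all hypothetical; the patent only tells us that an instance combines user, query, and document information.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    u: dict  # user information
    q: dict  # query data from the user
    d: dict  # document information


def to_features(instance):
    """Flatten a (u, q, d) triple into one dict with prefixed feature names."""
    features = {}
    for prefix, part in (("u", instance.u), ("q", instance.q), ("d", instance.d)):
        for name, value in part.items():
            features[f"{prefix}.{name}"] = value
    return features


# An invented instance for illustration.
inst = Instance(
    u={"language": "en", "location": "US"},
    q={"language": "en", "terms": ("large", "data")},
    d={"url": "https://example.com/page"},
)
features = to_features(inst)
```

Prefixing the names keeps a user's language and a query's language apart, which matters because the newer claims list both as separate features.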
Takeaways about the Update of the Large Data Sets Patent
Google has recently had a large core update, as described in Google Confirms Broad Core Algorithm Update: The Facts & Advice. We know that Google has been updating its core search algorithms possibly twice a day for a long time. We don’t know when the changes reflected in the new version of this large data sets patent may have been applied. It’s possible that they already have been: as a continuation patent, it would ideally reflect changes to the process behind the patent, which could have been introduced into the algorithm over time. If Google is using this approach to rank pages, it might be considered part of the core search algorithm. This patent considers a very large amount of data involving users, queries, and documents to determine rankings.