Google Phrase-Based Indexing Updated

Posted @ Jun 05 2018


Google has updated one of their most important patents today.

What Phrase-Based Indexing Covers

When a page covers a topic such as “President of the United States”, chances are good that it also includes meaningful phrases that could be said to predict what the page is about, such as “White House,” “Rose Garden,” “Press Conference,” or “Secretary of State.” Phrases like those may be hints about the topic of a page, and that is how Google’s phrase-based indexing works. It is an approach that could be said to use semantic themes to show the meaning of pages, by understanding and indexing the meaningful phrases that co-occur on pages that rank highly for a term.

Matt Cutts published a video about 5 years ago in which he tells us that just because Google has a patent on something, that doesn’t mean they use it.

But, When Google Proceeds to Update A Patent, They May Use It

I look at a lot of patents from Google, and there are some arguments suggesting that they may be using the updated phrase-based indexing patent granted today.

1. There are over 20 related patents granted to Anna Patterson and assigned to Google about processes involving phrase-based indexing.

2. The patent seems to be an important one and one that I once called one of the 10 most important SEO patents of all time:

10 Most Important SEO Patents, Part 5 – Phrase Based Indexing

3. The process behind the patent first came out when it was filed at the patent office back in 2004, and they’ve been adding to the process with at least 20 patents that add on features, such as spam fighting and snippet generation, and tell us details about how it is likely implemented into Google’s index. I first wrote about that patent back in 2006, in the post, Move over PageRank: Google’s looking at phrases?

4. A continuation patent is a version of a patent where the description of the patent hasn’t been changed, but the claims in the patent have been updated, to reflect changes in the process that the patent is aimed at protecting. The date of the filing of the patent remains the date of the original filing, but the ability to exclude others from using the process behind the patent becomes based upon the new claims. The claims in the patent have changed significantly from 2004 to 2018. One significant reason to change those claims is to reflect the actual process in place (if the patent is actually being used) behind the patent.

It is worth comparing the first three claims from the original to the version of the patent that was granted today. Here are the first three claims from the original:

1. A method of selecting documents in a document collection in response to a query, the method comprising: receiving a query; identifying a plurality of phrases in the query, wherein at least one phrase is a multiple word phrase; identifying a phrase extension of at least one of the identified phrases; and selecting documents from the document collection containing at least one phrase from a set including phrases in the query and the phrase extension.

2. The method of claim 1, wherein selecting documents comprises: combining a posting list of an identified phrase and a posting list of the phrase extension of the identified phrase to form a combined posting list, and selecting documents appearing in the combined posting list and in the posting lists of the other identified phrases.

3. A method of selecting documents in a document collection in response to a query, the method comprising: receiving a query; identifying an incomplete phrase in the query; replacing the incomplete phrase with a phrase extension, and selecting documents from the document collection containing the phrase extension.
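The selection step in those original claims can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the toy posting lists in POSTINGS, the EXTENSIONS table, and the select_documents function are all hypothetical names made up for this example. It combines the posting list of a phrase with the posting list of its phrase extension (claim 2), and then keeps only the documents that also appear in the posting lists of the other phrases in the query.

```python
# Hypothetical toy index: phrase -> set of document ids (a "posting list").
POSTINGS = {
    "president of the": {1, 2},
    "president of the united states": {2, 3, 4},
    "white house": {2, 4, 5},
}

# Hypothetical phrase-extension table: incomplete phrase -> longer phrase.
EXTENSIONS = {"president of the": "president of the united states"}


def select_documents(query_phrases):
    """Return ids of documents that match every query phrase or its extension."""
    result = None
    for phrase in query_phrases:
        docs = set(POSTINGS.get(phrase, set()))
        extension = EXTENSIONS.get(phrase)
        if extension is not None:
            # Combine the phrase's posting list with its extension's list.
            docs |= POSTINGS.get(extension, set())
        # Keep only documents that also appear in the other phrases' lists.
        result = docs if result is None else result & docs
    return result or set()


print(select_documents(["president of the", "white house"]))
```

With this toy data, the combined list for “president of the” covers documents 1 through 4, and intersecting it with the list for “white house” leaves documents 2 and 4.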

What appears to be different in the newer claims that follow is that they provide more information on how phrase-based indexing may now rank pages.

What is claimed is:

1. A computer-implemented method comprising: obtaining, from a phrase-based index for an Internet search engine, a list of documents from a collection of documents available via the Internet that contain a first phrase, the first phrase being relevant to a query; for each document in the list: determining, using related phrase information stored in the index for each document in the list of documents, whether the document includes one or more related phrases of the first phrase, where each related phrase has an actual co-occurrence rate of the related phrase and the first phrase in the document collection that exceeds an expected co-occurrence rate of the related phrase and the first phrase in the document collection; ranking the documents in the list based on a quantity of related phrases determined for each document, so that documents with more related phrases are ranked higher than documents with fewer related phrases; and selecting at least some of the highest-ranked documents to include in a result to the query.

2. The method of claim 1, wherein determining whether the document includes one or more related phrases of the first phrase includes: accessing a posting list for the first phrase, the posting list including, for each document identified in the posting list, an indication of the quantity of related phrases present in the document.

3. The method of claim 1, wherein a document with a low frequency of query terms but a plurality of related phrases for the first phrase ranks higher than a document with a higher frequency of query terms but with no related phrases.
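A rough sketch of the ranking idea in the new claim 1 follows. It is a toy illustration, not the patented system: the DOCS collection and the function names are hypothetical, and the claim does not say how the expected co-occurrence rate is calculated, so the code assumes it is what we would see if the two phrases appeared independently of each other (the share of documents with one phrase multiplied by the share with the other). A phrase counts as related when the actual co-occurrence rate exceeds that expected rate, and documents containing the first phrase are ordered by how many related phrases they contain.

```python
# Hypothetical toy collection: document id -> set of phrases found in it.
DOCS = {
    "d1": {"president of the united states", "white house", "rose garden"},
    "d2": {"president of the united states", "white house"},
    "d3": {"president of the united states", "flower show"},
    "d4": {"flower show"},
    "d5": {"flower show"},
    "d6": {"weather report"},
}


def related_phrases(first_phrase, docs):
    """Phrases whose actual co-occurrence with first_phrase beats the expected rate."""
    n = len(docs)
    first_count = sum(first_phrase in phrases for phrases in docs.values())
    candidates = set().union(*docs.values()) - {first_phrase}
    related = set()
    for phrase in candidates:
        phrase_count = sum(phrase in p for p in docs.values())
        both_count = sum(first_phrase in p and phrase in p for p in docs.values())
        # actual = both/n; expected (independence assumption) =
        # (first/n) * (phrase/n). Multiplying both sides by n*n keeps
        # the comparison in whole numbers.
        if both_count * n > first_count * phrase_count:
            related.add(phrase)
    return related


def rank(first_phrase, docs):
    """Order documents containing first_phrase by their count of related phrases."""
    related = related_phrases(first_phrase, docs)
    matches = [d for d, phrases in docs.items() if first_phrase in phrases]
    # More related phrases first; document id breaks ties.
    return sorted(matches, key=lambda d: (-len(docs[d] & related), d))


print(rank("president of the united states", DOCS))
```

In this toy data, “white house” and “rose garden” come out as related phrases while “flower show” does not, because it shows up with the first phrase less often than chance would predict. The page with both related phrases is listed first and the page with none is listed last.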

The updated phrase-based indexing patent can be found at:

Phrase-based searching in an information retrieval system
Inventors: Anna L. Patterson
Assignee: Google LLC
US Patent: 9,990,421
Granted: June 5, 2018
Filed: February 2, 2017


An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.

I wrote a post around a year ago, focusing upon phrase-based indexing, Are You Using Google Phrase-Based Indexing?, which covers a patent that tells us how Google’s inverted index has been updated to include phrases – which would be a very expensive undertaking, but would make the method in this updated first patent on phrase-based indexing work much more effectively.

I also wrote one in 2016 called Thematic Modeling Using Related Words in Documents and Anchor Text. It tells us about how pages may be boosted in search results based upon the use of body hits (related phrases in the text of pages) and anchor hits (related phrases used as anchor text pointing from a page that related phrases have been generated for). With rankings defined in more detail in this first phrase-based indexing patent, we may see updates to other patents about phrase-based indexing as well.

Does phrase-based indexing look intriguing enough to you to test and research more?

  1. Jason BARNARD

    June 06th, 2018 at 4:08 am

    It DOES look interesting enough to test and research more…
    Increasingly tricky to know quite how to test something so conceptual, though.
    Thanks Bill


    • Bill Slawski

      June 06th, 2018 at 7:17 pm

      Hi Jason,

      It is relatively easy to test: find related phrases that co-occur on pages that rank highly for a term, add those related phrases to the content of a page that you are trying to rank highly for that query term, and see whether the addition of the related phrases makes a difference in how that page ranks.

  2. Edward Wilson

    June 06th, 2018 at 4:46 pm

    Bill, always useful stuff.

    Do you see latent semantic indexing and phrase-based indexing as essentially very similar? I am not clear on how they differ, other than by name.

    Thanks for any guidance here.


    • Bill Slawski

      June 06th, 2018 at 7:13 pm

      Phrase-based indexing and Latent Semantic Indexing have nothing to do with each other at all. Latent Semantic Indexing is a process that was developed by researchers at Bell Labs in 1990, which works to index small document collections, such as a set of books. The LSI patent that Bell researchers were awarded uses an example of 9 books in a document collection. It also tells us that if a document collection is indexed under LSI, and more content is added to that data corpus, indexing under LSI needs to happen all over again, which really isn’t useful for a large data collection that changes considerably as new pages are added and removed regularly. Because LSI needs to be used for static data collections (it was developed for enterprise uses, and not the Web), it isn’t a good choice as an indexing method for the Web. LSI does not use co-occurrence the way that phrase-based indexing does, and has nothing in common with it. I wrote this post about how Google likely does not use LSI at all: Does Google Use Latent Semantic Indexing?

