Google Phrase-Based Indexing Updated

Published: June 05, 2018


Google has updated one of its most important patents today.

What Phrase-Based Indexing Covers

When a page covers a topic such as “President of the United States”, chances are good that it includes meaningful phrases that could be said to predict what the page is about, such as “White House” or “Rose Garden” or “Press Conference” or “Secretary of State.” If you see phrases like that on a page, they may be hints about the topic of that page, and that describes how Google’s phrase-based indexing works. It is an approach that could be said to use semantic themes to show the meaning of pages. It does that by understanding and indexing the meaningful phrases that co-occur on pages that rank highly for a term.


Just Because Google Has a Patent…Phrase-Based Indexing Updated

Matt Cutts published this video about 5 years ago, and he tells us in it that just because Google has a patent on something doesn’t mean they use it.

But, When Google Proceeds to Update A Patent, They May Use It

I do look at a lot of patents from Google, and several arguments suggest that they may be using the updated phrase-based indexing patent granted today:

1. There are over 20 related patents granted to Anna Patterson and assigned to Google about processes involving phrase-based indexing.

2. The patent seems to be an important one and one that I once called one of the 10 most important SEO patents of all time:

10 Most Important SEO Patents, Part 5 – Phrase-Based Indexing

3. The process behind the patent first came out when it was filed at the patent office back in 2004, and they’ve been adding to the process with at least 20 patents that add on features, such as spam fighting and snippet generation, and tell us details about how it is likely implemented into Google’s index. I first wrote about that patent back in 2006, in the post, Move over PageRank: Google’s looking at phrases?

4. A continuation patent is a version of a patent where the description of the patent hasn’t been changed, but the claims in the patent have been updated, to reflect changes in the process that the patent is aimed at protecting. The date of the filing of the patent remains the date of the original filing, but the ability to exclude others from using the process behind the patent becomes based upon the new claims. The claims in the patent have changed significantly from 2004 to 2018. One significant reason to change those claims is to reflect the actual process in place (if the patent is being used) behind the patent.

It is worth comparing the first three claims from the original to the version of the patent that was granted today. Here are the first three claims from the original:

1. A method of selecting documents in a document collection in response to a query, the method comprising: receiving a query; identifying a plurality of phrases in the query, wherein at least one phrase is a multiple word phrase; identifying a phrase extension of at least one of the identified phrases, and selecting documents from the document collection containing at least one phrase from a set including phrases in the query and the phrase extension.

2. The method of claim 1, wherein selecting documents comprises: combining a posting list of an identified phrase and a posting list of the phrase extension of the identified phrase to form a combined posting list, and selecting documents appearing in the combined posting list and the posting lists of the other identified phrases.

3. A method of selecting documents in a document collection in response to a query, the method comprising: receiving a query; identifying an incomplete phrase in the query; replacing the incomplete phrase with a phrase extension, and selecting documents from the document collection containing the phrase extension.
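To make the older claims 1 and 2 a little more concrete, here is a minimal Python sketch of combining posting lists. It is only an illustration under stated assumptions, not Google's implementation: the `POSTINGS` and `EXTENSIONS` tables, the document ids, and the `select_documents` function are all hypothetical names made up for this example.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy index: phrase -> set of document ids (a "posting list").
POSTINGS: Dict[str, Set[int]] = {
    "president of the": {1, 2},
    "president of the united states": {2, 3, 4},
    "white house": {2, 3, 5},
}

# Hypothetical table mapping a phrase to its phrase extension.
EXTENSIONS: Dict[str, str] = {
    "president of the": "president of the united states",
}


def select_documents(query_phrases: List[str]) -> Set[int]:
    """Return ids of documents matching every query phrase.

    For a phrase that has an extension, the posting list of the phrase
    and the posting list of its extension are merged (set union) into a
    combined posting list. The final selection keeps only documents that
    appear in the lists of all identified phrases (set intersection).
    """
    result: Optional[Set[int]] = None
    for phrase in query_phrases:
        docs = set(POSTINGS.get(phrase, set()))
        extension = EXTENSIONS.get(phrase)
        if extension is not None:
            # Combined posting list: the phrase OR its extension.
            docs |= POSTINGS.get(extension, set())
        result = docs if result is None else result & docs
    return result if result is not None else set()


print(select_documents(["president of the", "white house"]))
```

In this toy data the combined list for “president of the” covers documents 1 through 4, and intersecting it with the list for “white house” leaves documents 2 and 3.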

What appears to differ between the older claims and the newer ones that follow is that the newer claims provide more information on how phrase-based indexing may now rank pages.

What is claimed is:

1. A computer-implemented method comprising: obtaining, from a phrase-based index for an Internet search engine, a list of documents from a collection of documents available via the Internet that contain a first phrase, the first phrase being relevant to a query; for each document in the list: determining, using related phrase information stored in the index for each document in the list of documents, whether the document includes one or more related phrases of the first phrase, where each related phrase has an actual co-occurrence rate of the related phrase and the first phrase in the document collection that exceeds an expected co-occurrence rate of the related phrase and the first phrase in the document collection; ranking the documents in the list based on a quantity of related phrases determined for each document, so that documents with more related phrases are ranked higher than documents with fewer related phrases; and selecting at least some of the highest-ranked documents to include in a result to the query.

2. The method of claim 1, wherein determining whether the document includes one or more related phrases of the first phrase includes: accessing a posting list for the first phrase, the posting list including, for each document identified in the posting list, an indication of the number of related phrases present in the document.

3. The method of claim 1, wherein a document with a low frequency of query terms but a plurality of related phrases for the first phrase ranks higher than a document with a higher frequency of query terms but with no related phrases.
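The ranking described in these newer claims can be sketched in a few lines of Python. This is a rough illustration, not the patent's actual method: the `DOCS` collection and all function names are hypothetical, and the expected co-occurrence rate is assumed here to be what you would see if the two phrases appeared independently of each other (the claims do not say how it is computed). The sketch also treats each document as a set of phrases, so it does not model the query term frequency that claim 3 mentions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy collection: document id -> phrases found in the document.
DOCS: Dict[str, Set[str]] = {
    "a": {"president", "white house", "rose garden"},
    "b": {"president", "white house"},
    "c": {"president", "recipe"},
    "d": {"recipe", "oven"},
    "e": {"white house"},
    "f": {"recipe", "oven"},
}


def doc_count(phrases: Set[str]) -> int:
    """Number of documents that contain every phrase in `phrases`."""
    return sum(1 for found in DOCS.values() if phrases <= found)


def related_phrases(first: str) -> Set[str]:
    """Phrases whose actual co-occurrence with `first` exceeds the expected rate.

    Assumption: expected rate = rate(first) * rate(other), i.e. independence.
    Counts are compared as integers to avoid floating-point rounding:
    both / n > (a / n) * (b / n)  is the same as  both * n > a * b.
    """
    n = len(DOCS)
    candidates = set().union(*DOCS.values()) - {first}
    return {
        other
        for other in candidates
        if doc_count({first, other}) * n > doc_count({first}) * doc_count({other})
    }


def build_posting_list(first: str) -> List[Tuple[str, int]]:
    """Posting list for `first`: (document id, number of related phrases present)."""
    related = related_phrases(first)
    return [
        (doc_id, len(found & related))
        for doc_id, found in DOCS.items()
        if first in found
    ]


def rank_documents(first: str) -> List[str]:
    """Documents containing `first`, ordered so more related phrases rank higher."""
    posting_list = build_posting_list(first)
    ordered = sorted(posting_list, key=lambda entry: (-entry[1], entry[0]))
    return [doc_id for doc_id, _ in ordered]


print(related_phrases("president"))
print(rank_documents("president"))
```

With this toy data, “white house” and “rose garden” come out as related phrases of “president”, while “recipe” does not, because it co-occurs with “president” less often than chance would suggest. Document “a” ranks first because it contains both related phrases.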

The phrase-based indexing updated patent can be found at:

Phrase-based searching in an information retrieval system
Inventors: Anna L. Patterson
Assignee: Google LLC
US Patent: 9,990,421
Granted: June 5, 2018
Filed: February 2, 2017

Abstract

An information retrieval system uses phrases to index, retrieve, organize, and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.

I wrote a post around a year ago focusing upon phrase-based indexing, Are You Using Google Phrase-Based Indexing? It covers a patent that tells us how Google’s inverted index has been updated to include phrases. That would be a very expensive undertaking, but it would make the method in this updated first patent on phrase-based indexing work much more effectively.

I also wrote one in 2016 called Thematic Modeling Using Related Words in Documents and Anchor Text. It tells us about how pages may be boosted in search results based upon the use of body hits (related phrases in the text of pages) and anchor hits (related phrases used as anchor text pointing from a page that related phrases have been generated for). With rankings defined in more detail in this first phrase-based indexing patent, we may see updates to other patents about phrase-based indexing as well.

Does phrase-based indexing look intriguing enough for you to test and research more?

About Bill Slawski

With more than 26 years of SEO experience and a Juris Doctor Degree, Bill Slawski is the foremost expert on Google’s patents as related to SEO. Patent Exploration is one of the quickest and most detailed ways to find new information about SEO. Bill is the Editor of SEO by the Sea, a prominent search engine optimization blog, where he is the author of over 1,300 posts. Bill’s experience includes Fortune 500 brands and some of the largest websites in the world. Bill is a contributing author for Moz, Search Engine Land, and Search Engine Journal. In 2014-2021, he spoke at industry-leading international conferences about topics including search engine algorithms, universal and blended search, personalization in search, search and social, duplicate content problems, structured data, and schema.
