Phrase-based indexing can help a page become more relevant for specific query terms when phrases related to those queries co-occur on the page, and when anchor text pointing to that page uses related phrases. Google has been working on phrase-based indexing since at least 2004.
When Anna Lynne Patterson wrote the paper Why Writing Your Own Search Engine is Hard, she had not long before created one of the largest search engines to be found on the Web, by the name of Recall, which indexed over 30 Billion pages at the Internet Archive. She ended up joining Google not long afterwards, and started filing patents there on phrase-based indexing. I’ve written about some of the patents she came out with:
02/10/2006 – Move over pagerank: Google’s looking at phrases?
05/19/2006 – Google Aiming at 100 Billion Pages?
12/29/2006 – Phrase Based Information Retrieval and Spam Detection
09/16/2008 – Google Phrase Based Indexing Patent Granted
03/15/2009 – What are the Top Phrases for Your Website?
04/07/2010 – Phrasification and Revisiting Google’s Phrase Based Indexing
12/19/2011 – 10 Most Important SEO Patents, Part 5 – Phrase Based Indexing
08/05/2016 – Thematic Modeling Using Related Words in Documents and Anchor Text
I know that is a lot to throw at you at the start of a blog post. If you want to find out more about this topic, you can come back to the list above and visit some of those earlier posts. I haven’t mentioned yet that Anna Lynne Patterson left Google at one point to start Cuil, a Google competitor which unfortunately failed, and that after Cuil closed down, Google rehired her as a Vice President of Search.
Today, Google was granted a continuation of a patent that was originally filed in 2007. When that patent first came out, it had me convinced that Google had adopted phrase-based indexing, because the way it describes a phrase-based indexing system working makes a lot of sense for indexing something as large and complex as the World Wide Web.
The patent can be found at:
Index server architecture using tiered and sharded phrase posting lists
Inventors: Pei Cao, Nadav Eiron, Soham Mazumdar, Anna L. Patterson, Russell Power, and Yonatan Zunger
Assignee: Google Inc.
US Patent 9,652,483
Granted: May 16, 2017
Filed: November 23, 2015
An information retrieval system uses phrases to index, retrieve, organize, and describe documents. Phrases are extracted from the document collection. Documents are indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases and optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various index servers.
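To make the abstract a little more concrete, here is a minimal sketch of phrase posting lists that are tiered into groups and sharded into partitions. The corpus, the phrase list, the tier thresholds, and the modulo sharding rule are all assumptions made for illustration; the abstract does not say how tiers or shards are assigned, and the substring matching here is far cruder than real phrase extraction.

```python
from collections import defaultdict

# Hypothetical inputs: a tiny corpus and an already-extracted phrase list.
DOCS = {
    1: "the pitcher's mound is near first base",
    2: "concession stands open before first pitch at first base side",
    3: "dark matter and first base have little in common",
}
PHRASES = ["first base", "pitcher's mound", "concession stands", "dark matter"]

# Assumed tiers, checked in order: (minimum posting-list length, shard count).
# Tier 0 holds long lists split four ways; tier 1 holds short lists in one shard.
TIERS = [(3, 4), (0, 1)]


def build_phrase_posting_lists(docs, phrases):
    """Map each phrase to the sorted list of document IDs that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for phrase in phrases:
            if phrase in docs[doc_id]:  # naive substring match
                postings[phrase].append(doc_id)
    return dict(postings)


def tier_for(posting_list):
    """Return the first tier whose minimum length the list satisfies."""
    for tier, (min_len, _) in enumerate(TIERS):
        if len(posting_list) >= min_len:
            return tier


def shard(posting_list, num_shards):
    """Partition a posting list by document ID modulo the shard count."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in posting_list:
        shards[doc_id % num_shards].append(doc_id)
    return shards


def build_index(docs, phrases):
    """Build a phrase -> {tier, shards} index from the corpus."""
    index = {}
    for phrase, plist in build_phrase_posting_lists(docs, phrases).items():
        tier = tier_for(plist)
        index[phrase] = {"tier": tier, "shards": shard(plist, TIERS[tier][1])}
    return index


index = build_index(DOCS, PHRASES)
```

In this toy corpus "first base" appears in all three documents, so it lands in the tier with more shards, while the rarer phrases each stay in a single shard.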
I had started reading patents to get an idea of how search engines worked, and this one describes inverted indexes and posting lists made up of individual terms and then posting lists made up of meaningful phrases. The last post in my list above on “Thematic Modeling” is about a phrase-based indexing patent filed in 2012, titled “Integrated external related phrase information into a phrase-based indexing information retrieval system,” which talks about how the presence of certain phrases on a page can be used to predict the appearance of another phrase. I highly recommend reading this latest patent and its description of how a phrase-based search engine works. There are some challenges in trying to set up a phrase-based index, as the patent tells us here:
The problem here is that conventional systems index documents based on individual terms rather than concepts. Concepts are often expressed in phrases, such as “dark matter,” “President of the United States,” or idioms like “under the weather” or “dime a dozen.” At best, some prior systems will index documents concerning a predetermined and minimal set of “known” phrases, which a human operator typically selects. Indexing phrases is typically avoided because of the perceived computational and memory requirements to identify all possible phrases of, say, three, four, or five or more words. For example, assuming that any five words could constitute a phrase and that a large corpus would have at least 200,000 unique terms, there would be approximately 3.2 × 10^26 possible phrases, clearly more than any existing system could store or otherwise programmatically manipulate. The further problem is that phrases continually enter and leave the lexicon in terms of their usage, much more frequently than new individual words are invented. New phrases are always being generated from technology, arts, world events, and law. Other phrases will decline in usage over time.
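The patent's estimate is easy to check: with 200,000 unique terms and any five words in sequence counting as a possible phrase, the number of combinations is 200,000 raised to the fifth power. The variable names below are just for illustration.

```python
# Figures taken from the patent passage above.
unique_terms = 200_000
phrase_length = 5

# Every position in the phrase can be any one of the unique terms.
possible_phrases = unique_terms ** phrase_length

print(f"{possible_phrases:.1e}")  # 3.2e+26
```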
Some existing information retrieval systems attempt to provide retrieval of concepts by using co-occurrence patterns of individual words. In these systems, a search on one word, such as “President,” will also retrieve documents that have other words that frequently appear with “President,” such as “White” and “House.” While this approach may produce search results having conceptually related documents at the level of individual words, it does not typically capture topical relationships between co-occurring phrases themselves.
Scale is not only a problem for phrase-based indexing. The patent points out that an index built on individual terms runs into problems of its own:
Another problem with existing individual term-based indexing systems lies in the arrangement of the server computers used to access the index. In a conventional indexing system for large-scale corpora like the Internet, the index comprises the posting lists for upwards of 200,000 unique terms. Each term posting list can have hundreds, thousands, and not infrequently, millions of documents. The index is typically divided amongst a large number of index servers, in which each index server will contain an index that includes all of the unique terms, and for each of these terms, some portion of the posting list. A typical indexing system like this may have upwards of 1,000 index servers in this arrangement.
When a given query with some number of terms is processed in such an indexing system, it becomes necessary to access all of the index servers for each query. Thus, even a simple single-word query requires each of the index servers (e.g., 1,000 servers) to determine whether it contains documents containing the word. Because all index servers must process the query, the overall query processing time is limited by the slowest index server.
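A small sketch shows why that fan-out hurts. The latencies, the server count, and the choice of which servers hold a phrase's posting list are all made up for illustration; the only idea taken from the patent is that a query has to wait for every server it contacts.

```python
def fanout_latency(server_latencies_ms, servers_contacted):
    """A query finishes only when the slowest contacted server has answered."""
    return max(server_latencies_ms[s] for s in servers_contacted)


# Hypothetical response times (ms) for 8 index servers; server 5 is a straggler.
latencies = [12, 15, 11, 14, 13, 90, 12, 16]

# Conventional term index: every server holds a slice of every posting list,
# so every query must contact all of them.
term_index_latency = fanout_latency(latencies, range(len(latencies)))

# Assumed phrase-sharded layout: this phrase's posting list lives on
# servers 0 and 3 only, so the straggler is never contacted.
phrase_index_latency = fanout_latency(latencies, [0, 3])

print(term_index_latency, phrase_index_latency)  # 90 14
```

With more servers in the fan-out, the odds that at least one of them is slow on any given query go up, which is the cost the patent is trying to avoid.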
Those are the problems that were perceived to exist when the idea of phrase-based indexing was developed. Yet, if those issues could be resolved, there are potential benefits to using phrase-based indexing. This patent explains how servers can be set up to index and search the web based upon phrases.
The benefits? Imagine a page about baseball stadiums. The chances are good that it would include phrases such as “pitcher’s mound,” “concession stands,” and “first base.” These phrases could be identified as being relevant to that page about baseball stadiums, because they often tend to co-occur on highly ranked pages about baseball stadiums. The patent might score such phrases as described here:
In one aspect, an information retrieval system includes an indexing system and index server architecture based on phrases. Phrases are extracted from a document collection to identify real phrases as used in language by users instead of mere combinations of words. Generally, this is done by collecting a large body of word sequences that are candidate phrases based on the structural features in the documents. Each candidate phrase is given a document phrase score for each document in which it appears, in a manner that reflects its likelihood of being a real phrase based on its position within a document and the extent to which it occurs independently or jointly with other candidate phrases in the document. Also, each candidate phrase is processed to identify any sub phrases therein, which are similarly scored.
Each candidate phrase’s document phrase scores are then combined across the documents in which it appears to create a combined score. The document phrase scores and the combined score for a candidate phrase are evaluated to determine how strongly the document collection supports the usage of the candidate phrase as a real phrase. Generally, a candidate phrase is retained where it is strongly supported by at least one document; for example, the maximum of its document phrase scores exceeds a predetermined threshold. A candidate phrase is also retained where it is moderately supported, as indicated by having a combined phrase score above a second predetermined threshold. This shows that the candidate phrase has sufficient widespread use to be considered a real phrase. Finally, a candidate phrase is also retained where it is broadly supported, as indicated by the phrase receiving a minimum score from some number of documents. As an example, the system can include approximately 100,000 to 200,000 phrases, which will represent real phrases used in documents, rather than mere combinations of words.
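The three retention tests in that passage can be sketched in a few lines. The threshold values below are invented, and treating the combined score as a simple sum of the document phrase scores is my assumption; the patent only says the scores are combined and compared against predetermined thresholds.

```python
def support_reasons(doc_scores, strong=0.8, combined=2.0, min_score=0.2, min_docs=5):
    """List which of the three retention tests a candidate phrase passes.

    doc_scores holds one document phrase score per document containing the
    candidate. All threshold defaults are hypothetical.
    """
    reasons = []
    # Strongly supported: at least one document scores it very highly.
    if doc_scores and max(doc_scores) > strong:
        reasons.append("strong")
    # Moderately supported: combined score (assumed here to be a sum) is high.
    if sum(doc_scores) > combined:
        reasons.append("moderate")
    # Broadly supported: enough documents give it at least a minimum score.
    if sum(1 for s in doc_scores if s >= min_score) >= min_docs:
        reasons.append("broad")
    return reasons


def retain(doc_scores, **thresholds):
    """Keep the candidate if it passes any one of the three tests."""
    return bool(support_reasons(doc_scores, **thresholds))


print(support_reasons([0.9]))         # ['strong']
print(support_reasons([0.5] * 5))     # ['moderate', 'broad']
print(support_reasons([0.25] * 5))    # ['broad']
print(retain([0.1, 0.1]))             # False
```

A candidate only needs one kind of support to survive, which is how a phrase used heavily on a single page and a phrase used lightly on many pages can both end up in the index.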
Using a method like this to identify real phrases, how often they occur, and their locations in documents is complicated. This phrase-based indexing system covers a few different patents, including ones on generating scores based on how often certain phrases appear in different documents and as anchor text pointing to documents, and on identifying spam using phrase-based indexing. There are a good number of related patents that have been developed since 2004. It’s complex enough that it doesn’t get talked about much. Yes, you can look for which phrases tend to appear frequently on the top-ranking pages for specific terms, and that will give you a hint as to what meaningful phrases you should ideally include on your page about those terms.
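If you want to try that kind of check yourself, a rough sketch might look like the following. It is not how the patent identifies phrases; it just counts two-word sequences that show up on more than one page. The sample page text, the function names, and the `min_pages` cutoff are all hypothetical.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the set of n-word sequences in a piece of text (lowercased)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrases(pages, n=2, min_pages=2):
    """Return the n-word phrases that appear on at least min_pages pages."""
    counts = Counter()
    for page in pages:
        # Using a set means each phrase is counted once per page.
        counts.update(ngrams(page, n))
    return sorted(p for p, c in counts.items() if c >= min_pages)


# Hypothetical text from three top-ranking pages for "baseball stadiums".
pages = [
    "Fans line up at the concession stands behind first base",
    "The pitcher's mound is sixty feet from home plate while first base is ninety",
    "Concession stands and first base seats sell out early",
]

print(shared_phrases(pages))  # ['concession stands', 'first base']
```

On real pages you would get plenty of noise such as "of the" and "in a", so expect to filter out common word pairs before the list is useful.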
I wrote about Google’s inverted index and mentioned this post because the patent I wrote about provides an inverted index of phrases on the Web, showing that Google is likely tracking those phrases.