Phrase-Based Indexing can help a page become more relevant for specific query terms when phrases related to those queries co-occur on the page, and when anchor text pointing to that page uses related phrases. Google has been working on Phrase-Based Indexing since at least 2004.
When Anna Lynne Patterson wrote the paper Why Writing Your Own Search Engine is Hard, she had not long before created one of the largest search engines to be found on the Web, by the name of Recall, which indexed over 30 Billion pages at the Internet Archive. She ended up joining Google not long afterwards, and started filing patents there on phrase-based indexing. I’ve written about some of the patents she came out with:
02/10/2006 – Move over pagerank: Google’s looking at phrases?
05/19/2006 – Google Aiming at 100 Billion Pages?
12/29/2006 – Phrase Based Information Retrieval and Spam Detection
09/16/2008 – Google Phrase Based Indexing Patent Granted
03/15/2009 – What are the Top Phrases for Your Website?
04/07/2010 – Phrasification and Revisiting Google’s Phrase Based Indexing
12/19/2011 – 10 Most Important SEO Patents, Part 5 – Phrase Based Indexing
08/05/2016 – Thematic Modeling Using Related Words in Documents and Anchor Text
I know that is a lot to throw at you at the start of a blog post. My hope is that if you want to find out more about this topic, you will come back to the list above and visit some of those earlier posts. One thing I haven’t mentioned yet: Anna Lynne Patterson left Google at one point to start Cuil, a Google competitor that unfortunately failed. After Cuil closed down, Google rehired her as a Vice President of Search.
Today, Google was granted a continuation of a patent that was originally filed in 2007. When the original first came out, it convinced me that Google had adopted phrase-based indexing, because the way a phrase-based indexing system is described as working seems to make a lot of sense as a way to index something as large and complex as the World Wide Web.
The patent can be found at:
Index server architecture using tiered and sharded phrase posting lists
Inventors: Pei Cao, Nadav Eiron, Soham Mazumdar, Anna L. Patterson, Russell Power and Yonatan Zunger
Assignee: Google Inc.
US Patent 9,652,483
Granted: May 16, 2017
Filed: November 23, 2015
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions. Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
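To make the idea of "possible phrasifications" a little more concrete, here is a minimal Python sketch. It is not taken from the patent. The function name `phrasifications`, the tiny `KNOWN_PHRASES` set, and the rule that any single word is always allowed are my own assumptions for illustration. It simply lists every way a query could be split into known phrases or single words.

```python
from typing import List, Set

# Hypothetical phrase list; a real system would extract phrases from the corpus.
KNOWN_PHRASES: Set[str] = {"new york", "york restaurants", "new york restaurants"}


def phrasifications(words: List[str], known: Set[str]) -> List[List[str]]:
    """Return every way to split `words` into known phrases or single words."""
    if not words:
        return [[]]
    results = []
    for end in range(1, len(words) + 1):
        head = " ".join(words[:end])
        # Assumption: single words are always allowed; longer spans must be known.
        if end == 1 or head in known:
            for rest in phrasifications(words[end:], known):
                results.append([head] + rest)
    return results


if __name__ == "__main__":
    for option in phrasifications("new york restaurants".split(), KNOWN_PHRASES):
        print(option)
```

With this toy phrase list, the query "new york restaurants" has four possible phrasifications. A query scheduler would then have to pick among options like these.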
I started reading patents to get an idea of how search engines worked, and this one is a good example: it describes inverted indexes and posting lists made up of individual terms, and then posting lists made up of meaningful phrases. The last post in my list above on “Thematic Modeling” is about a phrase-based indexing patent filed in 2012, titled “Integrated external related phrase information into a phrase-based indexing information retrieval system,” which talks about how the presence of certain phrases on a page can be used to predict the appearance of another phrase. I highly recommend reading this latest patent and its description of how a phrase-based search engine works. There are some challenges in trying to set up a phrase-based index, as the patent tells us here:
The problem here is that conventional systems index documents based on individual terms, rather than on concepts. Concepts are often expressed in phrases, such as “dark matter,” “President of the United States,” or idioms like “under the weather” or “dime a dozen”. At best, some prior systems will index documents with respect to a predetermined and very limited set of “known” phrases, which are typically selected by a human operator. Indexing of phrases is typically avoided because of the perceived computational and memory requirements to identify all possible phrases of say three, four, or five or more words. For example, on the assumption that any five words could constitute a phrase, and that a large corpus would have at least 200,000 unique terms, there would be approximately 3.2 × 10^26 possible phrases, clearly more than any existing system could store or otherwise programmatically manipulate. A further problem is that phrases continually enter and leave the lexicon in terms of their usage, much more frequently than new individual words are invented. New phrases are always being generated, from sources such as technology, arts, world events, and law. Other phrases will decline in usage over time.
Some existing information retrieval systems attempt to provide retrieval of concepts by using co-occurrence patterns of individual words. In these systems a search on one word, such as “President” will also retrieve documents that have other words that frequently appear with “President”, such as “White” and “House.” While this approach may produce search results having documents that are conceptually related at the level of individual words, it does not typically capture topical relationships that inhere between co-occurring phrases themselves.
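The size estimate quoted above is easy to check. If any five words can form a phrase and there are 200,000 unique terms, the number of possible five-word sequences is 200,000 raised to the fifth power. The variable names below are mine, not the patent's.

```python
# Back-of-the-envelope check of the patent's estimate.
unique_terms = 200_000   # "at least 200,000 unique terms"
phrase_length = 5        # "any five words could constitute a phrase"

possible_phrases = unique_terms ** phrase_length

# 200,000^5 = 2^5 * 10^25 = 3.2 * 10^26
print(f"{possible_phrases:.1e}")
```

That is the figure the patent gives, and it only counts phrases of exactly five words.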
Storage and scale problems are not unique to phrase-based indexing; they also show up when you index individual terms:
Another problem with existing individual term based indexing systems lies in the arrangement of the server computers used to access the index. In a conventional indexing system for large scale corpora like the Internet, the index comprises the posting lists for upwards of 200,000 unique terms. Each term posting list can have hundreds, thousands, and not infrequently, millions of documents. The index is typically divided amongst a large number of index servers, in which each index server will contain an index that includes all of the unique terms, and for each of these terms, some portion of the posting list. A typical indexing system like this may have upwards of 1,000 index servers in this arrangement.
When a given query with some number of terms is processed then in such an indexing system, it becomes necessary to access all of the index servers for each query. Thus, even a simple single word query requires each of the index servers (e.g., 1,000 servers) to determine whether it contains documents containing the word. Because all of the index servers must process the query, the overall query processing time is limited by the slowest index server.
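Here is a rough sketch of why that arrangement is costly, and why sharding by phrase could help. The first function mirrors the conventional setup the patent describes, where every server holds a slice of every posting list. The second is only my assumption of how a phrase-sharded layout might route a query: the hashing scheme, the function names, and the `shards_per_phrase` parameter are all hypothetical and are not taken from the patent.

```python
import zlib
from typing import List

NUM_SERVERS = 1000  # the example figure used in the patent text


def servers_for_document_partitioned(query_terms: List[str]) -> List[int]:
    """Conventional layout: each server holds part of every posting list,
    so every server must be asked, whatever the query terms are."""
    return list(range(NUM_SERVERS))


def servers_for_phrase_sharded(query_phrases: List[str],
                               shards_per_phrase: int = 4) -> List[int]:
    """Assumed layout: a phrase's posting list lives on a few servers
    chosen by hashing the phrase, so only those servers are contacted."""
    servers = set()
    for phrase in query_phrases:
        start = zlib.crc32(phrase.encode("utf-8")) % NUM_SERVERS
        for offset in range(shards_per_phrase):
            servers.add((start + offset) % NUM_SERVERS)
    return sorted(servers)


if __name__ == "__main__":
    print(len(servers_for_document_partitioned(["president"])))  # all servers
    print(len(servers_for_phrase_sharded(["dark matter"])))      # a handful
```

In the first layout the slowest of 1,000 servers sets the response time for every query. In the second, only the few servers holding the relevant posting lists are involved.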
Those are the problems that were perceived to exist when the idea of phrase-based indexing was developed. Yet, if those issues could be resolved, there are potential benefits to using phrase-based indexing. This patent explains how servers can be set up to index and search the web based upon phrases.
The benefits? Imagine a page about “baseball stadiums.” Chances are good that it would include phrases such as “pitcher’s mound,” “concession stands,” and “first base.” These phrases could be identified as being relevant to that page about a baseball stadium by how often those phrases tend to co-occur on highly ranked pages about baseball stadiums. The patent might score such phrases as described here:
In one aspect, an information retrieval system includes an indexing system and index server architecture based on phrases. Phrases are extracted from a document collection in a manner that identifies real phrases as used in language by users, as opposed to mere combinations of words. Generally, this is done by collecting a large body of word sequences that are candidate phrases based on the structural features in the documents. Each candidate phrase is given a document phrase score for each document in which it appears, in a manner that reflects its likelihood of being a real phrase based on its position within a document, and the extent to which it occurs independently or jointly with other candidate phrases in the document. In addition, each candidate phrase is processed so as to identify any subphrases therein, which are similarly scored.
Each candidate phrase’s individual document phrase scores are then combined across the documents in which it appears to create a combined score. The document phrase scores and the combined score for a candidate phrase are evaluated to determine how strongly the document collection supports the usage of the candidate phrase as a real phrase. Generally, a candidate phrase is retained where it is strongly supported by at least one document; for example, the maximum of its individual document phrase scores exceeds a predetermined threshold. A candidate phrase is also retained where it is moderately supported, as indicated by having a combined phrase score above a second predetermined threshold. This shows that the candidate phrase has a sufficient widespread use to be considered a real phrase. Finally, a candidate phrase is also retained where it is broadly supported, as indicated by the phrase receiving a minimum score from some number of documents. As an example, the system can include approximately 100,000 to 200,000 phrases, which will represent real phrases used in documents, rather than mere combinations of words.
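The three retention rules in that passage (strongly, moderately, or broadly supported) can be sketched in a few lines of Python. This is only an illustration under my own assumptions. The patent does not say in this passage how scores are combined, so I use a simple sum, and the function name and every threshold value below are made up.

```python
from typing import List


def keep_candidate_phrase(
    doc_scores: List[float],
    strong_threshold: float = 0.9,    # hypothetical first threshold
    combined_threshold: float = 5.0,  # hypothetical second threshold
    min_score: float = 0.2,           # hypothetical minimum per-document score
    min_docs: int = 10,               # hypothetical number of documents
) -> bool:
    """Decide whether a candidate phrase is retained as a real phrase."""
    if not doc_scores:
        return False
    # Strongly supported: at least one document scores it very highly.
    strongly = max(doc_scores) > strong_threshold
    # Moderately supported: the combined score (assumed here to be a sum) is high.
    moderately = sum(doc_scores) > combined_threshold
    # Broadly supported: enough documents give it at least a minimum score.
    broadly = sum(1 for s in doc_scores if s >= min_score) >= min_docs
    return strongly or moderately or broadly
```

A phrase only needs to pass one of the three tests, so a phrase that appears prominently in a single document can be kept alongside one that appears modestly in many.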
Using a method like this to identify real phrases, how often they occur, and where they appear in documents is complicated. Phrase-based indexing spans several patents, which cover generating scores based upon how often certain phrases appear in different documents and as anchor text pointing to documents, as well as identifying spam. A good number of related patents have been filed since 2004. It’s complex enough that it doesn’t get talked about much. Still, you can look at which phrases tend to appear frequently on the top-ranking pages for specific terms, and that will give you a hint as to which meaningful phrases you should ideally include on your page about those terms.
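If you want to try that kind of check yourself, a very rough starting point is to count how many of the top-ranking pages share each word sequence. The sketch below is my own and is far simpler than anything in the patents: it counts raw word pairs, which are exactly the "mere combinations of words" the patent wants to get beyond, and the function name `common_phrases` and the sample pages are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def common_phrases(pages: Iterable[str], n: int = 2,
                   top: int = 5) -> List[Tuple[str, int]]:
    """Count in how many pages each n-word sequence appears."""
    doc_freq: Counter = Counter()
    for text in pages:
        words = text.lower().split()
        # Use a set so a phrase is counted once per page, not once per mention.
        ngrams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(ngrams)
    return doc_freq.most_common(top)


if __name__ == "__main__":
    sample_pages = [
        "first base is ninety feet away",
        "he reached first base safely",
        "concession stands sell snacks",
    ]
    print(common_phrases(sample_pages))
```

You would still need to read through the results and pick out the word pairs that are real, meaningful phrases.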