Google search engine updates are often mysterious events, but sometimes I come across information in Google patents that provides some insight into how something works. One Google update aimed at making Google much faster than it had been was the Caffeine update, which Google officially announced in 2010 with this post: Our new search index: Caffeine.
If you make changes to content on a web page, how long does it take for those changes to make it into Google’s index? If you publish a new page or blog post, when does it become part of Google’s index as something that can be queried? It used to take some time before content added to the Web became searchable. Google would update its index, and a large amount of data that had been added to the Web would become queryable in an event that many people referred to as a Google Dance. Then Google adopted an approach that brought changes to search results more quickly, and Google’s former head of Web Spam, Matt Cutts, referred to it in this video, where he described the “Flux” that was happening in Google Search results:
Google has pushed out updates intended to speed up the indexing of content on the Web. One of those updates was referred to as the Big Daddy update. Another, which took place in 2009, was referred to as the Caffeine update. A slightly different look is available in this newspaper article: Google Caffeine: What it really is
Recently, I came across a patent that shows how Google could make its search index much faster. I decided to share it after seeing Google’s Caffeine update blamed for many changes to how content on the Web has been indexed over the years; sharing this patent might give people a little more understanding of how Google may be indexing pages on the Web. There were actually three related patents filed on the same day, and they provide an interesting look at how Caffeine may operate. What they do is simply this:
The disclosed embodiments relate generally to data processing systems and methods, and in particular to a document repository that supports low latencies from when a document is updated to when the document is available to queries, and that requires little synchronization between query threads and repository update threads.
So, if you’ve wondered how long it takes from the point you publish something on the Web to the time it is added to Google’s index, the answer depends on the indexing and synchronization approach described in those patents.
Google searches what is referred to as an inverted index, which contains all of the words in each document it indexes on the Web, along with pointers to the locations of those words. The patent points out what it refers to as “obstacles” to providing fresh results. These include:
(1) the expense or overhead associated with rebuilding the document index each time the document repository is updated. For example, significant overhead is often associated with building small indexes from new and updated documents and periodically merging the small indexes with a main index, and furthermore such systems typically suffer long latencies between document updates and availability of those documents in the repository index.
(2) the difficulty of continuously processing queries against the document repository while updating the repository, without incurring large overhead. One aspect of this second obstacle is the need to synchronize both the threads that execute queries and the threads that update the document repository with key data structures in the data repository. The need to synchronize the query threads and repository update threads can present a significant obstacle to efficient operation of the document repository if document updates are performed frequently, which in turn is a barrier to maintaining freshness of the document repository.
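To make the first obstacle concrete, here is a minimal sketch of an inverted index: each word maps to the positions where it occurs, as (document id, offset) pairs. The documents and tokenizer below are illustrative only, not anything from the patent; the point is that rebuilding a structure like this for a web-scale repository on every update is what creates the overhead the patent describes.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index

# Two toy documents stand in for a document repository.
docs = {1: "fresh results need fast indexing",
        2: "fast indexing keeps results fresh"}
index = build_inverted_index(docs)
print(index["fresh"])   # → [(1, 0), (2, 4)]
```

Every time a document changes, its postings scattered across the index become stale, which is why naive approaches rebuild small indexes and periodically merge them into the main one, at the cost of the long latencies the patent mentions.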
To get to the patent and read the whole thing, here is a link to it:
Document treadmilling system and method for updating documents in a document repository and recovering storage space from invalidated documents
Inventors: Michael Burrows and Jeffrey A. Dean
Assignee: Google Inc.
US Patent 7,617,226
Granted: November 10, 2009
Filed: February 10, 2006
A tokenspace repository stores documents as a sequence of tokens. The tokenspace repository, as well as the inverted index for the tokenspace repository, uses a data structure that has a first end and a second end and allows for insertions at the second end and deletions from the front end. A document in the tokenspace repository is updated by inserting the updated version into the repository at the second end and invalidating the earlier version. Invalidated documents are not deleted immediately; they are identified in a garbage collection list for later garbage collection. The tokenspace repository is treadmilled to shift invalidated documents to the front end, at which point they may be deleted and their storage space recovered.
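The abstract describes an append-and-invalidate scheme: an updated document is inserted at the back end, and the earlier version is marked invalid and queued for later garbage collection rather than deleted on the spot. Here is a hedged sketch of that idea; the class and method names are my own, not from the patent, and real token storage would be far more compact.

```python
class TokenspaceRepository:
    """Toy model of the patent's append-and-invalidate update scheme."""

    def __init__(self):
        self.cells = []   # (doc_id, tokens, valid) records, front to back
        self.live = {}    # doc_id -> index of its currently valid record
        self.garbage = [] # indexes of invalidated records awaiting GC

    def upsert(self, doc_id, tokens):
        if doc_id in self.live:
            # Invalidate the earlier version in place; do not delete it yet.
            old = self.live[doc_id]
            self.cells[old] = (doc_id, self.cells[old][1], False)
            self.garbage.append(old)
        # Insert the updated version at the back (second) end.
        self.cells.append((doc_id, list(tokens), True))
        self.live[doc_id] = len(self.cells) - 1

repo = TokenspaceRepository()
repo.upsert("page1", ["hello", "web"])
repo.upsert("page1", ["hello", "updated", "web"])
print(len(repo.garbage))   # → 1 stale record awaits later collection
```

Because queries only ever read records flagged valid, an update needs no coordination with query threads beyond the final pointer change, which is how the design keeps synchronization between update and query threads to a minimum.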
When I read through this patent, one of the words that caught my attention was “treadmilling”, which is used to describe how data is handled in Google’s index:
Because deletion can only be done to the data at the front end, periodically the data in the cells are “treadmilled.” That is, valid data at the front end are copied to the back end and the original valid data at the front end are deleted. As valid data from the front end are moved to the back end, data in the cells between the front end and the back end are logically shifted to the front end, where it may be deleted if needed. Thus, treadmilling aids in the recovery of memory space that is occupied by data (sometimes called stale data) that is no longer valid. Further information regarding treadmilling is described below, in relation to FIGS. 13-15.
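The quoted passage can be illustrated with a rough sketch: valid records at the front end are copied to the back end, and then the front, including any stale records, is deleted, recovering the space held by invalidated data. The names below are illustrative, not from the patent.

```python
def treadmill(cells, front_count):
    """Recycle the first `front_count` cells of a repository.

    `cells` is a list of (payload, valid) pairs. Valid entries from the
    front are re-appended at the back end; the front is then deleted,
    freeing the space occupied by stale (invalid) entries.
    """
    front, rest = cells[:front_count], cells[front_count:]
    recycled = [cell for cell in front if cell[1]]  # keep only valid data
    return rest + recycled                          # stale cells dropped

# One stale record and one valid record sit at the front end.
cells = [("old-v1", False), ("doc-a", True), ("doc-b", True)]
cells = treadmill(cells, 2)   # treadmill the first two cells
print(cells)   # → [('doc-b', True), ('doc-a', True)]
```

Restricting deletion to the front end is what makes this cheap: storage behaves like a conveyor belt, and reclaiming space never requires compacting the middle of the repository while queries are running.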
The rest of the patent provides more details on how this indexing system works, and after reading it, I found myself wondering whether it was describing Caffeine and whether Caffeine is still being used by Google. Over the past few days, Google spokesperson Gary Illyes has made a couple of cryptic tweets referring to Google’s indexer Caffeine in ways that seemed to indicate it is still important and still being used by Google: