Investigating Google RankBrain and Query Term Substitutions

Posted @ Oct 26 2015 by

Introducing Google’s RankBrain

A Google patent granted in August looks at how Google might replace terms within queries, and below I walk through some examples of how it rewrites queries. It’s part of my investigation into Google’s newly announced RankBrain deep learning method.

RankBrain was described in a Bloomberg news article published today, Google Turning Its Lucrative Web Search Over to AI Machines. That article names two of the five people who began exploring the use of artificial intelligence to rank web pages, and says that it’s a very popular approach at Google these days. The article tells us:

The roll out of RankBrain represents a yearlong effort by a team that started with about five Google engineers, including search specialist Yonghui Wu, and deep-learning expert Thomas Strohmann.

According to the article, RankBrain is supposedly the third most important signal in how Google ranks pages in search results. Looking at patents on rewriting queries that involve people from this project seemed like a good starting point to me. I found one patent that lists Thomas Strohmann as one of the inventors, and it reminded me of the rewriting approach from Google’s Hummingbird update. The Bloomberg News article tells us:

RankBrain uses artificial intelligence to embed vast amounts of written language into mathematical entities — called vectors — that the computer can understand.

If RankBrain sees a word or phrase it isn’t familiar with, the machine can make a guess as to what words or phrases might have a similar meaning and filter the result accordingly, making it more effective at handling never-before-seen search queries.

A Google patent granted this August describes something like that in a slightly different way. The process from the patent includes:

  • Collecting query term substitution data for one or more query terms that occur in a received query;
  • Collecting query term substitution data for one or more query terms that occur in subsequent queries that include the concept;
  • “Includes the concept” means the concept is adjacent to the one or more query terms in the subsequent queries;
  • Collecting query term substitution data for one or more query terms in a context of the concept;
  • Determining a substitution rule in a context of the concept based on the collected query term substitution data.

The Substitute Engine that helps rewrite queries
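
The collection steps listed above could be sketched roughly like this. This is only my own illustration, not code from the patent: the function name, the sample queries, and the simplification that a candidate substitute is the single term directly after the concept are all assumptions.

```python
from collections import Counter


def collect_substitutions(received, subsequent_queries, concept):
    """Count candidate (original, substitute) pairs seen right after `concept`.

    Hypothetical simplification: only the term immediately following the
    concept is treated as the term that may be substituted.
    """
    concept_terms = concept.lower().split()
    n = len(concept_terms)

    def term_after_concept(query):
        terms = query.lower().split()
        for i in range(len(terms) - n):
            if terms[i:i + n] == concept_terms:
                return terms[i + n]
        return None

    counts = Counter()
    original = term_after_concept(received)
    if original is None:
        return counts
    for query in subsequent_queries:
        candidate = term_after_concept(query)
        if candidate is not None and candidate != original:
            counts[(original, candidate)] += 1
    return counts


# Made-up query log entries, for illustration only.
counts = collect_substitutions(
    "new york times puzzle",
    [
        "new york times crossword",
        "new york times crossword",
        "new york times subscription",
        "los angeles times crossword",
    ],
    "new york times",
)
```

Only the subsequent queries that contain the concept contribute counts here; the last sample query is ignored because “new york times” does not appear in it.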

Advantages of the Process in the Patent

(1) To identify a context in a query, the search system traditionally can track only one or two words around a query term due to computational complexity. But a concept can include more than two words, and this approach enables more complex queries to be rewritten.
(2) A substitute term rule in a specific context identified by a concept can be determined empirically from user interactions with search result data. By extending the formation of a context beyond two words, the search system can determine substitution rules directed to more specific contexts and potentially improve search results.

The patent is:

Using concepts as contexts for query term substitutions
US 9104750 B1
Application number: US 13/650,322
Publication date: Aug 11, 2015
Filing date: Oct 12, 2012
Inventors: Kedar Dhamdhere, Thomas Strohmann, P. Pandurang Nayak, Robert Spalek

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for collecting query term substitution data based on one or more identified concepts.

According to one implementation, a method includes receiving a query that includes at least three sequential query terms; determining that the sequential query terms represent a concept; and in response to determining that the sequential query terms represent a concept, collecting query term substitution data for one or more query terms that occur in queries that include the concept.
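
To make the “at least three sequential query terms” idea a little more concrete, here is a minimal sketch of how a run of terms in a query might be matched against a set of known concepts. The patent does not say how concepts are recognized, so the lookup set `KNOWN_CONCEPTS` and the function `find_concept` are hypothetical stand-ins of mine.

```python
def find_concept(query, known_concepts, min_terms=3):
    """Return the longest run of sequential query terms that is a known concept.

    `min_terms` defaults to 3 to mirror the "at least three sequential
    query terms" wording in the abstract.
    """
    terms = query.lower().split()
    # Try the longest runs first so a longer concept wins over a shorter one.
    for size in range(len(terms), min_terms - 1, -1):
        for start in range(len(terms) - size + 1):
            candidate = " ".join(terms[start:start + size])
            if candidate in known_concepts:
                return candidate
    return None


# Hypothetical concept list; a real system would learn or look these up.
KNOWN_CONCEPTS = {"new york times", "social security tax"}

concept = find_concept("New York Times Puzzle", KNOWN_CONCEPTS)
```

Once a concept is found, the rest of the process only needs to collect substitution data for queries that contain it.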

Similar Behavior in the Past

I’ve written three posts in the past that seem to be related to this one, and they are:

Substitute Terms in Query Rewriting

When someone performs a query at Google, the results that are returned may include additional results beyond those the original query would have returned in the past. This type of revision can be done by adding to the original query substitute terms for terms that appear in the original query.

If you want to learn this process, I recommend that you go through the patent carefully, but I wanted to provide three examples from the patent that illustrate the process behind it. For each of these three, I have the original query, a rewritten one, and a passage from the patent that explains part of the transformation from the original query to the query with substitute terms in it.

New York Times Puzzle Example

1. Original query = “New York Times Puzzle”
2. Revised query = “Puzzle→Crossword (New York Times)”

During state (I), the substitution engine analyzes the aggregated query term substitution data, and determines whether one or more substitution rules may be generated from the analysis. For one example, the substitution engine may determine from the query term substitution data that the term “Crossword” is frequently a substitute term for the term “Puzzle” in the context of the concept “New York Times,” as indicated by a positive indication. In some implementations, the indication may be a quantitative score assigned to the query term substitution data in the query log, and the quantitative score can be analyzed by one or more criteria in the substitution engine’s evaluation of a potential substitute term. For another example, the substitution engine may determine from the query term substitution data that the term “Subscription” is not frequently a substitute term for the term “Puzzle” in the context of the concept “New York Times,” as indicated by a negative indication. Here, the substitution engine determines that the term “Crossword” is frequently a substitute term for the term “Puzzle” in the context of “New York Times”, and sends an indication to the collection of substitution rules to add the substitution rule “Puzzle→Crossword (New York Times :)” to the collection. For subsequent user queries that contain original query terms “New York Times Puzzles”, the substitution engine may then apply the substitution rule “Puzzle→Crossword (New York Times)” and communicate with the query reviser engine to include the substitute term “Crossword” in the revised query.
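
The passage above talks about positive and negative indications and a quantitative score checked against one or more criteria. One way that might look, as a minimal sketch: the patent does not give a formula, so the score (share of positive indications), the thresholds, and the counts below are all assumptions I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubstitutionRule:
    original: str
    substitute: str
    context: str


def derive_rules(observations, min_score=0.5, min_count=3):
    """Turn aggregated indications into substitution rules.

    `observations` maps (original, substitute, context) to a pair of
    (positive, negative) indication counts. The score used here is an
    assumed one: the fraction of indications that were positive.
    """
    rules = []
    for (original, substitute, context), (positive, negative) in observations.items():
        total = positive + negative
        if total < min_count:
            continue  # too little evidence either way
        score = positive / total
        if score >= min_score:
            rules.append(SubstitutionRule(original, substitute, context))
    return rules


# Invented counts, only to show the shape of the data.
observations = {
    ("puzzle", "crossword", "new york times"): (40, 5),
    ("puzzle", "subscription", "new york times"): (2, 30),
}
rules = derive_rules(observations)
```

With these invented counts, only the “crossword” substitution clears the threshold, which matches the outcome the patent describes for this example.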

New York Yankees Stadium Example

1. Original query = “New York Yankees Stadium”
2. Revised query = “Yankees→Baseball (“New York”: “Stadium”)”

The patent shows us some of the substitutions that the search engine might attempt to apply, and here’s one of the stages of that analysis:

Here, the substitution engine determines that the term “Baseball” is frequently a substitute term for the term “Yankees” in the context of combined concepts “New York” and “Stadium”, and sends an indication to the collection of substitution rules to add the substitution rule “Yankees→Baseball (“New York”: “Stadium”)” to the collection. For subsequent user queries that contain original query terms “New York Yankees Stadium”, the substitution engine may then apply the substitution rule “Yankees→Baseball (“New York”: “Stadium”)” and communicate with the query reviser engine to include the substitute term “Baseball” in the revised query.
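
Here is a rough sketch of how a rule with context on both sides of a term might be applied to a later query. The patent only says the substitute term is included in the revised query; writing it as an “OR” alternative, and the rule layout (original, substitute, left context, right context), are my assumptions.

```python
def revise_query(query, rules):
    """Add substitute terms where a rule's left and right contexts both match.

    Each rule is a hypothetical tuple:
    (original, substitute, left_context, right_context).
    """
    terms = query.lower().split()
    revised = []
    for i, term in enumerate(terms):
        alternatives = [term]
        for original, substitute, left, right in rules:
            left_terms, right_terms = left.split(), right.split()
            left_ok = terms[max(0, i - len(left_terms)):i] == left_terms
            right_ok = terms[i + 1:i + 1 + len(right_terms)] == right_terms
            if term == original and left_ok and right_ok:
                alternatives.append(substitute)
        if len(alternatives) == 1:
            revised.append(term)
        else:
            revised.append("(" + " OR ".join(alternatives) + ")")
    return " ".join(revised)


RULES = [("yankees", "baseball", "new york", "stadium")]

revised = revise_query("New York Yankees Stadium", RULES)
```

A query such as “yankees stadium” is left alone by this rule, because the “New York” context on the left is missing.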

‘New York Yankees Stadium’ Query

Social Security Tax Rate Example

1. Original query = “Social Security Tax Rate”
2. Revised query = “Rate→Calculation (“Social Security” “Tax”)”

During state (I), the substitution engine analyzes the aggregated query term substitution data, and determines whether one or more substitution rules may be generated from the analysis. For one example, the substitution engine may determine from the query term substitution data that the term “Calculation” is frequently a substitute term for the term “Rate” in the context of the concepts “Social Security” and “Tax” as indicated by a positive indication. In some implementations, the indication may be a quantitative score assigned to the query term substitution data in the query log, and the quantitative score can be analyzed by one or more criteria in the substitution engine’s evaluation of a potential substitute term. For another example, the substitution engine may determine from the query term substitution data that the term “Benefit” is not frequently a substitute term for the term “Rate” in the context of the concepts “Social Security” and “Tax,” as indicated by a negative indication. Here, the substitution engine determines that the term “Calculation” is frequently a substitute term for the term “Rate” in the context of combined concepts “Social Security” and “Tax”, and sends an indication to the collection of substitution rules to add the substitution rule “Rate→Calculation (“Social Security” “Tax”)” to the collection. For subsequent user queries that contain original query terms “Social Security Tax Rate”, the substitution engine may then apply the substitution rule “Rate→Calculation (“Social Security” “Tax”)” and communicate with the query reviser engine to include the substitute term “Calculation” in the revised query.

“Social Security Tax Rate” Query

Take-Aways

The substitute data may be taken from text seen on the Web, or in query log files. The examples point out:

“the term “Baseball” is frequently a substitute term for the term “Yankees” in the context of combined concepts “New York” and “Stadium”.”

“substitution engine may determine from the query term substitution data that the term “Calculation” is frequently a substitute term for the term “Rate” in the context of the concepts “Social Security” and “Tax” as indicated by a positive indication.”

“the term “Crossword” is frequently a substitute term for the term “Puzzle” in the context of “New York Times”, and sends an indication to the collection of substitution rules to add the substitution rule “Puzzle→Crossword”.”

As these examples show, Google is learning how words work and when words can stand in for each other in certain contexts, which could potentially deliver better search results. According to the interview that the Bloomberg News reporters got their information from, “RankBrain impacts the 15% of Google’s queries a day that its systems have never seen before.”

36 Comments

  1. Olivier Duffez

    October 27th, 2015 at 4:06 am

    Thank you Bill! It’s always a pleasure to learn how these patents work and may be used in Google’s algorithms.

    “According to the interview that the Bloomberg News reporters got their information from, RankBrain impacts 15% of Google’s queries.”
    => that’s not the case, according to the author of this article. See his tweet https://twitter.com/mappingbabel/status/658780659889143812

    • Bill Slawski

      October 27th, 2015 at 4:20 am

      Hi Olivier,

      Thank you for sharing your thoughts. The statement from the Bloomberg interview is “The system helps Mountain View, California-based Google deal with the 15 percent of queries a day it gets which its systems have never seen before, he said.” So the quote from the article is a little different than what he says in the tweet, where he says, “It’s invoked on a very large fraction of queries and is particularly good at dealing with the 15% per day that haven’t been seen.”

  2. Andreas

    October 27th, 2015 at 4:58 am

    Always when I come here (mostly via Google Plus) I read and miss a thumbs-up button!
    I click on tweet instead.

    • Bill Slawski

      October 27th, 2015 at 9:33 am

      Thank you, Andreas.

      That’s much appreciated. 🙂

  3. Rick

    October 27th, 2015 at 8:59 am

    Any mad implications you see coming from this Bill?

    Google losing control of the search results? Can you spam related entities or substitute words?

    Will we see black hat synonym sites popping up?

    • Bill Slawski

      October 27th, 2015 at 9:37 am

      Hi Rick,

      I’m not sure that you can spam related entities or substitute words – Good questions though. I sort of feel like the Web got a little brighter for searchers with the release of RankBrain. There’s some other amazing stuff out there waiting to be released from Google, too.

  4. Grant

    October 27th, 2015 at 9:10 am

    An Artificially Intelligent Hummingbird?

    Feels like the pipe into the Hummingbird update got smarter, allowing better understanding of meaning through concept / vector analysis and learning at significant scale.

    Always amazed at the resources Google has at their disposal… never surprised how well you explain how it all (potentially) works.

    Cheers

    • Bill Slawski

      October 27th, 2015 at 9:40 am

      Hi Grant,

      Yes, my first instinct upon hearing the description for RankBrain was that it was potentially an alternative version of Hummingbird. When I came across this concepts as contexts patent, I was a lot more certain about it. Google’s Web just got filled with a lot more concepts. It should be fun.

    • Ammon Johns

      March 21st, 2016 at 12:51 pm

      I tend to think of this as a natural evolution or development arising from Hummingbird, in that I think Hummingbird showed where simply looking for semantics alone was the wrong approach. Hummingbird wasn’t designed to ‘guess’ or learn, so much, perhaps.

      The patents don’t seem to mention some of the methods of extrapolation I would expect for dealing with those ‘new queries’, such as looking specifically to twitter and social media for similar phrasing or terms.

      For instance, your example, Bill of “New York Times puzzle” might well suddenly surge if a news story had just broken about some extremely unusual and puzzling events in the business behind the publication. If RankBrain can’t do that then I would certainly expect that some other approach we haven’t heard named does so.

      Certainly, looking at other queries that include the terms used, but perhaps add additional words, concepts and contexts seems to be primary area here. That to me says that this is more news for being about the approach (machine learning) than necessarily the actual result.

    • Bill Slawski

      March 21st, 2016 at 1:34 pm

      Hi Ammon,

      Good to see you here, and interesting extrapolation regarding RankBrain. I can think of a couple of places that Google might be mining to find “new queries”. One of those might be the extensive query logs that Google collects daily, which are made up of the query terms that people are using to search for new things. Another might be newer content on the Web that gains some popularity, and is identified as “fresh” based upon a proliferation of new links pointed to it. Those both seem to be good starting sources of input that Google could use to attempt to determine context.

    • Bill Slawski

      March 21st, 2016 at 4:24 pm

      Hi again, Ammon.

      Another place that Google may be looking at involves entities that are in the same categories in a knowledge graph. I wrote a post about that at:

      How Google Might Make Better Synonym Substitutions Using Knowledge Base Categories
      http://www.seobythesea.com/2015/12/how-google-might-make-better-synonym-substitutions-using-knowledge-base-categories/

      Using a knowledge graph in a manner like this takes advantage of the context of a query term being searched for, and understanding words that might be related to them.

  5. Pingback: Rankbrain: itt a Google, új mesterséges intelligencialapú algoritmusa! - ITE.hu

  6. harry

    October 27th, 2015 at 1:39 pm

    Hi Bill, great article and yes have been seeing this for a long time in google serps now.

    One question I wonder if you can help me or point me in the right direction or maybe someone else can help me with.

    Where can I find a good semantic keyword research tool?

    My problem is most crazy keyword tools out there if I put in the word yankees I would get the following results:

    yankee candle advent calendar
    yankee candle sale
    yankee direct
    yankee bet
    yankee stadium
    yankee doodle

    I don’t want my keywords like that. I’d rather have results like this:

    Yankee
    Football
    Bronx
    Baseball

    I have a single word keyword tool, but I really want to find a keyword tool that gives me semantic results for 2 or more word keywords etc…

    I hope someone can point me in the right direction to find a good keyword research tool.

    Thanks for the article again.

  7. Michael Martinez

    October 27th, 2015 at 3:34 pm

    Interesting find but I don’t think this is it. This might represent a second stage process.

    • Bill Slawski

      October 27th, 2015 at 3:53 pm

      Hi Michael,

      I think there are a lot of different stages, and elements to how RankBrain works. The idea of re-writing queries, and learning appropriate substitution rules when certain contexts are present doesn’t sound like it’s secondary after what was described about the algorithm in the Bloomberg interview, but aspects like how learning about concepts and the contexts that certain words may be more likely to appear within is an essential part, and may be covered by other processes that aren’t covered in this particular patent.

  8. Pingback: SearchCap: Bing Halloween, Amazon Echo, AdWords Cross-Device & RankBrain

  9. Jacob Zucchi

    October 28th, 2015 at 9:53 am

    Hi Bill,
    great post!
    In your opinion, assuming that this patent is a way to correctly interpret RankBrain, what is the ‘rank’ part in RankBrain?
    My educated guess is that the AI creates ranks for the ‘concept’, or for combinations of concepts, that are used as signals.

    • Bill Slawski

      October 28th, 2015 at 10:17 am

      Hi Jacob,

      There are supposedly a lot of people at Google who have been working upon this approach. The patent I pointed to describes one way of re-writing queries; but it’s probably not the only innovative new method being used by Google. I thought it was worth looking at and sharing because it does focus upon an aspect of RankBrain that has been emphasized – an analysis of queries received, and then a re-writing of those. There are others at Google working upon methods like Deep Learning, and I agree with you that an AI approach is likely being used to identify rankings for concepts or combinations of them, that might be used as signals. Thanks.

  10. Jason

    October 28th, 2015 at 10:07 am

    I often wonder what I could achieve if I had more time to do it all.

    Now I know what Google would do if it had all the time it wanted, or, in Google’s case, the ability to speed up time through ever faster networks, databases and processors.

    • Bill Slawski

      October 28th, 2015 at 10:21 am

      Hi Jason,

      I remember a few years ago going to a Search Engine Strategies conference session called “Meet the Crawlers” which featured search engineers as panelists, and during the Q&A session, someone asked, “How can you identify all the bad links you do, pointing to a site.” The first answer was the search engineer from Google, who said something like, “We have lots and lots of computers.”

  11. Jacob Zucchi

    October 28th, 2015 at 11:34 am

    I wonder if structured data can be used by RankBrain for the indexing (or, since it is an AI, thinking?) of a ‘concept’.
    Google talked recently about structured data helping them index better. It would make sense to use ontologies from schema to help a learning machine in creating vectors or concepts.

    • Bill Slawski

      October 28th, 2015 at 12:09 pm

      Hi Jacob,

      That definitely sounds like an idea worth considering and investigating. If schema is created by subject matter experts who know a topic well, it would make sense to have it help your pages be better known for concepts it might help them with.

  12. Zach Doty

    October 28th, 2015 at 2:40 pm

    Hey Bill-

    Awesome write-up! Thanks for presenting this into an easy to understand format.

    Looks like search is more important than ever. The dynamic nature of RankBrain and real-time Penguin may make it easier to challenge entrenched competition…However, it’ll make it easier to be challenged as well. 🙂

    Cheers!

    • Bill Slawski

      October 28th, 2015 at 2:46 pm

      Thank you, Zach.

      Search finds ways to change and stay challenging.

  13. Bryan Gray

    October 29th, 2015 at 10:26 pm

    This is good stuff Bill. Nice digging.

    I’ve read over most of what has come out over the past few days and while it is all still clear as mud to me I’ll throw my 2 cents in to hopefully see how it matches up with what you are finding.

    Taking a look at search from Google’s perspective and areas of opportunity with search quality I’d lean towards this AI being aimed, at least partially, at tightening up the QDD “we’re not really sure query” results.

    G has a lot of room for improvement in this area. Using AI on a meta and personalized level would make sense to provide a better result over time.

    I’m sure it’s much bigger than one particular type of query but I can’t help but think that minimizing QDD results plays a major role.

    • Bill Slawski

      October 29th, 2015 at 11:49 pm

      Hi Bryan,

      Thank you. If we look at the QDD algorithm, it likely tries to identify the categories of the results it returns for a query, and then tries to make sure that those are represented in the top results. So let’s say someone searches for [java], and could mean the drink, the island, or the software. Let’s say that 70% of the top 100 results are for the software, 20% are for the island, and 10% are for the drink. Google might show 7 results on the first page that are about programming, 2 results about the island, and 1 result about the drink, giving us diverse results. I wrote about a more complicated approach that Microsoft might use in http://www.seobythesea.com/2007/12/reranking-search-results-based-upon-personalization-and-diversification/ One of the challenges that Google would face with that is in categorizing results pages. I can see Google knowing categories like that being useful to this process, and something that seems related. I suspect that Google might have a greater understanding of information around query spaces under this RankBrain approach.

  14. Pingback: Der SEO-Blog-Wochenrückblick KW 44 | SEO-Trainee

  15. Pingback: Growth Hacks Top 10

  16. Pingback: RankBrain: Can You Beat Artificial Intelligence? | customerbloom

  17. Pingback: How Google Might Make Better Synonym Substitutions Using Knowledge Base Categories - SEO by the Sea

  18. Nikolay Stoyanov

    April 21st, 2016 at 8:07 am

    Great stuff as always Bill! It is a lot to take in 🙂 Do you think that RankBrain will have any impact on on-page optimization? Perhaps people will try to diversify their articles even more with the introduction of new words and slang?

    • Bill Slawski

      April 21st, 2016 at 10:46 am

      Hi Nikolay,

      I don’t think it would surprise me if anyone tried to add new words and more slang to their pages, but I can’t say that would be something that would help in the optimization of a page. RankBrain is supposed to be a query rewriting approach from Google; and it’s questionable whether trying to optimize a page in a manner like that would have any impact at all.

  19. Pingback: How Google Works: Paul Haahr at SMX - Builtvisible

  20. Srinivas

    July 10th, 2016 at 4:48 am

    The competition is gonna be still tough after RankBrain. Google also considers LSI keywords when they query their database. How does RankBrain affect LSI?

    • Bill Slawski

      July 10th, 2016 at 2:55 pm

      Hi Srinivas,

      I have seen someone suggesting that Google uses LSI keywords, but that person misuses and confuses that term to have it just stand for synonyms. Latent Semantic Indexing was something invented by researchers at Bellcore for use with databases of approximately 10,000 documents that didn’t change much (both much smaller than the Web, and a lot less dynamic). That approach (Latent Semantic Indexing) would require that it be run again every time a new document was added to the body of documents, which would be impossible with the amount of churn of new and old documents on the Web. There was a company that was selling what they called “LSI Keywords”, and I’m not certain whether or not they are in business any more, nor am I convinced of the effectiveness of their product (because it just didn’t fit the Web).

      I have seen Google patents in the past that included what they referred to as PLSI or Probabilistic Latent Semantic Indexing but haven’t seen much reference to that in the past few years.

      There are a number of Google patents that discuss how Google might use synonyms and substitute terms; and it’s possible that RankBrain has some aspects of how it works that may be similar to those. Unfortunately, Rankbrain has nothing to do with LSI Keywords though. 🙁
