Did the Groundhog Update Just Take Place at Google?

Posted February 7, 2017


A story that ran at Search Engine Land a few days ago informed us of a possible new algorithm at Google: Unconfirmed Google algorithm update may be better at discounting links and spam. Before I read that post, I had just read a newly granted Google patent, and the post reminded me of it. The patent was granted on January 31, 2017, and it is possible that what it describes is what people were experiencing in the update reported at Search Engine Land.

The algorithm behind the patent is based upon rankings that involve how many resources link to the resource being ranked (like Stanford's PageRank patent). Historically, at Google, a page with a large number of resources linking to it may rank higher than pages with a smaller number of linking resources. But what if Google decided to look closer at those resources and demote some of the ranking weight they pass along? We have seen indications that Google may do something like that in the Reasonable Surfer patent, under which links pass along different amounts of PageRank. Another way to change how much PageRank a link passes along might be based upon the amount of traffic a resource receives from links, and the dwell times of that traffic: whether the visits are short clicks, medium clicks, or long clicks.

This linking approach may also consider other aspects of links, such as the anchor text for a link pointing to a source resource, which it treats as an n-gram, assigning a source score for the anchor text used to link to a page.

This was an interesting statement I ran across the first time I read through the newly granted patent:

Search result rankings can be adjusted based on a search query’s propensity to surface spam-related search results. The weighting of resource link counts in a ranking process can be reduced for search queries that have a high propensity for surfacing spam-related search results to reduce the skew on resource rankings caused by some resources having disproportionately large number of links compared to the number of selections of the links.

The patent describes a number of advantages of the process, including the discounting of some links in the rankings of the pages they point to.

Advantages of this Patented Process

1) Search results for resources can be more accurately ranked using data regarding links to the resources and selections of those links.
2) A seed score can be determined for a resource based on the number of links to the resource contained in other resources and a number of selections of those links.
3) Source resources that include links to resources that have a disproportionate number of links relative to the number of selections, as indicated by the seed scores for those resources, can be identified.
4) The links from these identified source resources can be discounted in a ranking process that ranks resources based on the number of links to the resource.
5) Resources for which data regarding links are unavailable or insufficient can be scored using data regarding resources that include a link to the resource.

The patent I am writing about can be found here, and is worth spending some time with:

Determining a quality measure for a resource
Inventors: Hyung-Jin Kim, Paul Haahr, Kien Ng, Chung Tin Kwok, Moustafa A. Hammad, and Sushrut Karanjkar
Assignee: Google
United States Patent: 9,558,233
Granted: January 31, 2017
Filed: December 31, 2012

Abstract:

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining a measure of quality for a resource. In one aspect, a method includes determining a seed score for each seed resource in a set. The seed score for a seed resource can be based on a number of resources that include a link to the seed resource and a number of selections of the links. A set of source resources is identified. A source score is determined for each source resource. The source score for a source resource is based on the seed score for each seed resource linked to by the source resource. Source-referenced resources are identified. A resource score is determined for each source-referenced resource. The resource score for a source-referenced resource can be based on the source score for each source resource that includes a link to the source-referenced resource.

Demotion based upon a high number of links that don’t produce much traffic

This was another passage from the patent that struck me because it pointed at potentially harmful results for links that didn’t match up to expectations that might be held for them:

A system can determine a measure of quality for a particular web resource based on the number of other resources that link to the particular web resource and the amount of traffic the resource receives. For example, a ranking process may rank a first web page that has a large number of other web pages that link to the first web page higher than a web page having a smaller number of linking web pages. However, a resource may be linked to by a large number of other resources, while receiving little traffic from the links. For example, an entity may attempt to game the ranking process by including a link to the resource on another web page. This large number of links can skew the ranking of the resources. To prevent such skew, the system can evaluate the “mismatch” between the number of linking resources and the traffic generated to the resource from the linking resources. If a resource is linked to by a number of resources that is disproportionate with respect to the traffic received by use of those links, that resource may be demoted in the ranking process.

How might traffic from a link be determined?

The evaluation of resources can be performed by a “pull-push” process. In an example pull-push process, a seed score is determined for each of a set of seed resources for which sufficient link and traffic data is available. The seed score for a particular seed resource is based on the number of source resources that link to the seed resource and the amount of traffic generated to the resource from the source resources. In some implementations, the seed score for a particular resource is the ratio between the number of selections of links to the particular resource and the number of source resources that link to the particular resource.
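The seed-score ratio described in that passage can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation; the function name and the zero-link guard are my own assumptions, since the patent only specifies the ratio of selections to linking resources:

```python
def seed_score(num_selections: int, num_linking_sources: int) -> float:
    """Ratio between the number of selections (clicks) of links to a
    resource and the number of source resources linking to it, per the
    patent's description. Guarding against zero links is an assumption."""
    if num_linking_sources == 0:
        return 0.0
    return num_selections / num_linking_sources

# A page with many links but few selections gets a low seed score,
# flagging the link/traffic mismatch the patent describes.
mismatched = seed_score(50, 10_000)   # 0.005
healthy = seed_score(400, 200)        # 2.0
```

Under this ratio, a link-farmed page with thousands of inbound links but almost no clicks scores far lower than a modestly linked page with real traffic.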

These seed scores are “pulled” up to the source resources and used to determine a source score for each source resource. In some implementations, the source score for a source resource is based on the seed score for each seed resource to which the source resource links. These source scores can be used to classify each source resource as being a “qualified source” or an “unqualified source.”
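A minimal sketch of that "pull" step, assuming a simple mean as the aggregation and an arbitrary qualification threshold (the patent does not specify either; the domain names here are hypothetical):

```python
def source_score(linked_seed_scores: list[float]) -> float:
    # Source score derived from the seed scores of the seed resources
    # this source links to; using the mean is an illustrative choice.
    if not linked_seed_scores:
        return 0.0
    return sum(linked_seed_scores) / len(linked_seed_scores)

def is_qualified(score: float, threshold: float = 0.5) -> bool:
    # Sources scoring below the threshold would be treated as
    # "unqualified", and their outbound links discounted. The
    # threshold value is an assumption.
    return score >= threshold

seeds = {"seed_a": 1.2, "seed_b": 0.05, "seed_c": 0.9}
links_out = {"blog.example": ["seed_a", "seed_c"],
             "linkfarm.example": ["seed_b"]}

for source, targets in links_out.items():
    s = source_score([seeds[t] for t in targets])
    label = "qualified" if is_qualified(s) else "unqualified"
    print(source, round(s, 3), label)
```

In this toy example, the source linking to healthy seeds comes out qualified, while the one linking only to a low-traffic seed does not.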

Links from sources that might be determined to be unqualified might then be discounted.

Spam-Related Resources

Some queries tend to produce more spam than others. The patent points at one group in particular:

For example, publishers of many video sharing web sites attempt to manipulate rankings by creating links to the sites, resulting in a disproportionately large number of links compared to the number of selections, while national news web sites typically do not attempt such manipulation.

For queries that tend to produce higher amounts of spam, selection counts may be given more weight relative to link counts in this calculation:

For queries that have a high propensity for surfacing spam-related web pages, the system can put a higher weight on selection counts for the search results and a lower weight on resource link counts for the search results when ranking the search results. Thus, the system can be said to “trust” the click counts more than the resource link counts for search queries that have a propensity for surfacing spam-related web pages.
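One way to picture that re-weighting. The linear interpolation, the log damping, and the `spam_propensity` parameter are all my assumptions; the patent only says link counts get a lower weight and click counts a higher weight for spam-prone queries:

```python
import math

def blended_rank_signal(link_count: int, click_count: int,
                        spam_propensity: float) -> float:
    # spam_propensity in [0, 1]: higher means the query tends to
    # surface spam, so clicks are "trusted" more than raw link counts.
    link_weight = 1.0 - spam_propensity
    click_weight = spam_propensity
    # log1p dampens the effect of very large raw counts (an assumption).
    return (link_weight * math.log1p(link_count)
            + click_weight * math.log1p(click_count))

# For a spam-prone query, a page with a huge link count but few clicks
# falls behind a page with modest links and real traffic.
farmed = blended_rank_signal(100_000, 10, spam_propensity=0.9)
earned = blended_rank_signal(500, 5_000, spam_propensity=0.9)
```

Flip `spam_propensity` low and the ordering reverses: for queries unlikely to surface spam, the raw link count dominates again.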

The Selection Quality Score May Be Based Upon Dwell Time

Part of the process involved in calculating a quality score for resources involves determining a seed score for a seed resource. This can start with identifying a link resource count for the seed resource. That can be done by looking at the number of resources that include a link to the seed resource.

The next aspect of that involves identifying a selection count for the seed resource. This selection count for the seed resource may be based on a number of times the link(s) to the seed resource that are included in other resources have been selected.

A selection quality score is determined for at least a portion of the selections of the links to the seed resource. The selection quality score for a selection is a measure of quality for the selection and can be used to discount low quality selections when determining the seed score for the seed resource.

This brings back memories of Steven Levy's book In the Plex, in which he noted that one metric people at Google viewed positively was one they referred to as “The Long Click.”

The patent tells us:

The selection quality score may be higher for a selection that results in a long dwell time (e.g., greater than a threshold time period) than the selection quality score for a selection that results in a short dwell time (e.g., less than a threshold time period). As automatically generated link selections are often of a short duration, considering the dwell time in determining the seed score can account for these false link selections.
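A toy version of that discounting, with made-up thresholds and score levels; the patent specifies only that long dwell times score higher than short ones:

```python
SHORT_CLICK_SECONDS = 15    # assumed cutoff for a "short click"
LONG_CLICK_SECONDS = 120    # assumed cutoff for a "long click"

def selection_quality(dwell_seconds: float) -> float:
    # Longer dwell times earn a higher selection quality score;
    # very short selections (often automated) are heavily discounted.
    # The 0.1 / 0.5 / 1.0 levels are illustrative assumptions.
    if dwell_seconds < SHORT_CLICK_SECONDS:
        return 0.1
    if dwell_seconds < LONG_CLICK_SECONDS:
        return 0.5
    return 1.0

# Discounted selection count: sum per-selection quality scores
# rather than counting every raw click as 1.
dwells = [3, 8, 45, 300, 600]
discounted_count = sum(selection_quality(d) for d in dwells)
```

Five raw clicks here collapse to a much smaller discounted count, because the two very short selections contribute almost nothing.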

The patent also tells us that some historic selection behavior might indicate that selections were made by real users rather than some automated process.

Resources with relatively low resource scores may be demoted in rankings, and resources with high resource scores may be boosted in rankings.

Take-Aways

The patent provides much more detail than I have in this post, and it is highly recommended reading. It is the first I can recall that has attempted to set up some kind of quality scores for links that point to pages on the web, and to determine how much weight those links should pass along. The Reasonable Surfer patent was different in that it determined how much weight a link might pass along based upon the probability that it was important, judged from features of how (and where) it was presented on a page.

I mentioned on Twitter that I would be writing about the Search Engine Land post I mentioned at the start of this post, and that I had a guess as to what may have been implemented to produce the algorithmic change at Google that a number of people had noticed. I had a suggestion from Jonathan Hochman that I consider referring to it as the Groundhog Update, considering the timing, and that it seemed to take effect at the beginning of February. This patent was granted on the last day of January, and while it could have been implemented before then, it is possible that it was put into place at the start of February.

Was what took place algorithmically at Google a weighting of linking resources based upon traffic associated with them, or whether or not they were associated with spammy results?

18 Comments

  1. Moty Malkov

    February 08th, 2017 at 2:54 am

    How does G evaluate traffic and behavior coming through links? Does this mean GA data is being used (hard to believe it is)?


    • Bill Slawski

      February 08th, 2017 at 11:47 am

      Hi Moty,

      Google has published other patents where they talk about collecting traffic data and link data, and this patent refers to the possibility that sometimes traffic data is available:

      The evaluation of resources can be performed by a “pull-push” process. In an example pull-push process, a seed score is determined for each of a set of seed resources for which sufficient link and traffic data is available.

      Google does have access to data from Google Chrome, and it’s possible it might use that. The patent doesn’t really tell us though how that data is collected. But it isn’t the first patent from Google that covers traffic data and link data. I’ve written about a couple in the past:

      Google Patents Click-Through Feedback on Search Results to Improve Rankings
      How Google Might Index Link Behavior Information

      Those may contain some hints at how Google might measure traffic (the selections I refer to seem to be reflected in that first one.)

  2. Louis

    February 08th, 2017 at 4:12 am

    Great reading as always Bill!

    We have seen a few clients main keyword pages dip due to lack of link traffic and high bounce rate of the users from the referral sites.

    We need more data but it’s looking like a trend for sites that don’t have stronger user metrics and site authority signals.


    • Bill Slawski

      February 08th, 2017 at 11:48 am

      Hi Louis,

      Thanks for sharing your experience, which seems to reflect what is in this patent from Google.

  3. Pingback: SearchCap: Bing Ads report, Search Console with Data Studio & Pinterest search bar

  4. Wolfgang Jagsch

    February 08th, 2017 at 6:15 pm

Most traffic comes from social media or PR-related sources. So links must have long-term value even if they are only clicked once a year. Imagine you are writing a post about semantic copywriting on a free blog and nobody is reading this minority-target-group content. But one day a university professor reads it, clicks through to your homepage, stays there, and writes you an email telling you how great he found your work on semantic SEO. That means really niche-targeted user action and traffic.


  5. Pingback: Onbevestigde Google Update: Penguin 4 tweak of PBN killer update? • SEOnieuws

  6. Henri

    February 09th, 2017 at 2:21 am

    Great information as always.

    How do you think Google is getting selection of links and dwell time information? Would it be enough if they used only the information they get from Google analytics and Chrome?

    You got a typo in your second to last paragraph, I believe you meant Groundhog, not Goundhog.


  7. Bill Slawski

    February 09th, 2017 at 1:00 pm

    Hi Henri,

    Google likely has some kind of analytics running on their own results pages, and can record which pages were selected for which queries. They don’t need to look at GA or a searcher’s Chrome browser to tell what people may have chosen to visit after searching.


    • Henri

      February 10th, 2017 at 9:16 pm

I was actually thinking more about selection of links and dwell time for links from other sites, not from Google. Since it's their own website, getting the information should be easy, but when it comes to links from other websites it's a lot less trivial.

    • Bill Slawski

      February 10th, 2017 at 9:45 pm

      Hi Henri,

It’s Google using that information to determine rankings of pages based upon the number of links to those pages, how much traffic those pages get, and how often those pages might be selected in Google’s search results.

  8. Pingback: ¿Es el tráfico web un factor SEO en 2017? Analizo la última patente de Google

  9. Pingback: Last Week In Digital Marketing News in 90 seconds

  10. Mikey Jones

    February 15th, 2017 at 12:14 am

Yes, I think something happened. The name doesn't matter; one can call it the Groundhog Update or anything else.


  11. Daniel Delos

    February 19th, 2017 at 9:43 am

    This seems like a logical direction for them to take, to start looking at traffic to linking pages and/or domains to determine the link quality. Because it seems like they’ve hit a brick wall in terms of determining the quality of content itself. So the way around this is to expand the number of supplementary signals they look at in order to indirectly infer the quality of the resource.


    • Bill Slawski

      February 20th, 2017 at 3:30 pm

      Hi Daniel,

      I think that looking at traffic that comes through links is a great way to gauge the quality of the source of those links, and it seems to be a strong assumption that high quality sources of links will bring lots of traffic.

  12. Greg

    February 25th, 2017 at 8:40 pm

    The patent was filed back in 2012.

Companies don’t wait for the patent to be approved before using it in practice (think of all the times you see ‘patent pending’).

    Goog didn’t wait 5 years and suddenly launch it on approval.

    This has been in the algo for years.


    • Bill Slawski

      February 26th, 2017 at 2:14 pm

      Hi Greg

Some companies do wait to implement something they’ve filed a patent for. How often do you see the words “patent pending” on Google Search results?

      We have no idea if Google has been using this, but with people actively using things like Private Blog Networks, it is possible that this hasn’t been in Google’s Algorithm for years.
