A Poll I Ran on Twitter About Rankings of Content
I ran a poll on Twitter asking: if an original author posted a blog post to their own site, and then posted the same post to an online magazine, which version would end up ranking higher? I gave choices of the original author’s site, the online magazine, and an “It depends” option, asking for comments explaining those answers. I received some great comments covering a lot of ground.
The poll returned 556 votes as follows:
An author writes a post at his own blog, and publishes it at an online magazine. Which would Google rank more highly:
— Bill Slawski ⚓ (@bill_slawski) December 6, 2018
There were a lot of opinions about what might cause the author’s post or the online magazine’s post to rank higher, like this one from John Alexander:
Depends on which site has more/better content. As a reader I want to see that article, but also discover related content, so if author rarely posts or just posts brief, not-very-detailed content, I'd rather see magazine. Converse is true if author has lots of good content.
— John Alexander (@CallMeLouzander) December 7, 2018
A number of people suggested that cross-domain canonicalization should ideally be used as well, like Jonah Stein:
In theory, Google should rank the first instance. Also, in theory, the author SHOULD use rel=canonical to point to the first instance. In practice, if the online magazine is on topic and has more authority, it will rank there unless links/social signals point to author's blog
— Jonah Stein (@Jonahstein) December 6, 2018
An author who controls their own site, but not the online magazine’s site, might find it difficult to get the online magazine to point a canonical link element to the author’s site.
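As a rough illustration of what checking for that canonical link element could look like, here is a minimal Python sketch that reads a page's HTML and reports which host its rel=canonical points to. The markup and the host name author-blog.example are made up for the example, and this only shows how the element can be read from a page, not how Google processes it.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated tokens, so split before checking.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def canonical_host(html):
    """Return the host named by the page's canonical link, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonical:
        return None
    return urlparse(finder.canonical).netloc


# Hypothetical markup for the magazine's copy of the post.
magazine_page = """
<html><head>
<title>Syndicated post</title>
<link rel="canonical" href="https://author-blog.example/original-post/">
</head><body>...</body></html>
"""

# True when the magazine's copy points back to the author's own site.
print(canonical_host(magazine_page) == "author-blog.example")
```

If the magazine's copy has no canonical link element, canonical_host returns None, which is the case the author usually cannot fix on their own.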
I did have Vikki Fraser provide me with an example of her article outranking an online magazine for very similar content:
Example using moi: pic.twitter.com/D8LOiuDhPc
— Vikki Fraser (@vikkiorlando) December 7, 2018
I was also asked by Cyrus Shepard about whether or not link inversion would apply:
— Cyrus (@CyrusShepard) December 6, 2018
My response was that as far as I know, Google doesn’t apply link inversion. I explored Link Inversion when I wrote about Google trying to identify the primary version of duplicate pages.
Some people, like Martin McGarry, felt that it depended upon how topically relevant the content may be to each of the places where it is published:
Too many variables so I'll offer one variant example. You blog about a topic you do not operate in. But an industry magazine picks it up. On your blog it has little to no relevance, but on an industry magazine it could be considered authoratative content even if duplicate.
— Martin McGarry (@gamblinginseo) December 6, 2018
Or, as Joshua Levenson noted, it may rely on which version got published first:
Depends what gets crawled first.
— joshua levenson (@josh_levenson) December 6, 2018
The answer from Peter McCarthy seems to match my own experience, and a recently published Google patent application, which I am including information about in this post:
I did this a while back and watched. Anecdotal, though. It changed with the magazine ranking for a while at first (its authority is high) but my original outranking the magazine over time. It also seems to depend on whether the query is more “the topic” vs. “me + the topic”.
— Peter McCarthy (@petermccarthy) December 6, 2018
Reranking Results for an Entity That is an Original Author
A patent application recently came out that tells us about a reranking method for search results when it involves results that have very similar or substantially the same content. It distinguishes those results by saying one of them is from an entity that is not known to produce original content (it either copies or redistributes original content authored by other entities). The other piece of content is associated with a second entity that is known to produce original content.
There is reference to a provisional patent in this patent application that has a very interesting title, and according to this patent has been incorporated into it in its entirety. That patent is U.S. Provisional Patent Application No. 61/648,562 filed on May 17, 2012, entitled “Systems and Methods for Determining a Likelihood that an Entity is an Author of Original Content” (This link is to the WIPO filing of the patent.)
Higher Rankings to an Original Author Regardless of Relevancy Scores
So, this original author patent application tells us that:
…Sometimes it is desirable to rank the search results that correspond to documents that are associated with entities that are authors of original content higher than search results corresponding to documents that are associated with entities that are not authors of original content even though the documents associated with entities that are not authors of original content have higher relevancy scores.
The algorithm behind this original author patent involves:
- (i) submitted content, where the submitted content is identified as being published by an entity
- (ii) a link to a location on a resource hosting the submitted content
- evaluating whether the submitted content is represented in an index of known content to determine whether the submitted content is new relative to the known content
- in response to the evaluating, issuing a request to a search engine to crawl and index the submitted content hosted by the resource associated with the link when the submitted content is new relative to the known content, where the request to the search engine to crawl and index the submitted content hosted by the resource associated with the link is issued responsive to determining that the submitted content is deemed to not be represented in the index of known content
The patent tells us that the request to crawl the new content would go to a high priority crawler of the search engine, which is the first time I’ve heard one of those being mentioned in a Google patent.
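The submission steps described above can be sketched roughly as follows. This is only an illustration under my own assumptions: the Submission and SubmissionHandler names, the exact-match fingerprint, and the queue standing in for a high priority crawler are all hypothetical, and the patent does not spell out an implementation.

```python
import hashlib
from collections import deque
from dataclasses import dataclass


@dataclass
class Submission:
    entity: str   # who the content is identified as being published by
    content: str  # the submitted content itself
    link: str     # location on the resource hosting the content


def fingerprint(text):
    """Hash of case- and whitespace-normalized text (a stand-in for real matching)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SubmissionHandler:
    def __init__(self):
        self.known_content = set()          # hypothetical index of known content
        self.high_priority_queue = deque()  # hypothetical high priority crawl queue

    def handle(self, submission):
        """Request a crawl only when the content is new relative to the index."""
        key = fingerprint(submission.content)
        if key in self.known_content:
            return False
        # Simplification: treat the content as known once a crawl is requested.
        self.known_content.add(key)
        self.high_priority_queue.append(submission.link)
        return True


handler = SubmissionHandler()
first = handler.handle(
    Submission("author", "My new post.", "https://author-blog.example/post"))
second = handler.handle(
    Submission("magazine", "my   NEW post.", "https://magazine.example/post"))
print(first, second)  # True False
```

In this sketch the author's submission arrives first, so only the author's link is queued for crawling; the later copy is recognized as already known.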
The patent also refers to the use of “Shingles” being used to identify content that is similar or substantially the same. More about Shingles in this paper: Identifying and Filtering Near-Duplicate Documents
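For readers who have not run into shingles before, here is a minimal Python sketch of the general idea: break each document into overlapping runs of words (the shingles) and compare the two sets with a Jaccard overlap. The shingle size of four words and the sample sentences are my own assumptions for illustration; the patent does not say which size or similarity threshold Google might use.

```python
def shingles(text, k=4):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle containing every word.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = ("google may rank the original author above a site "
            "that republishes the same post")
near_copy = original + " today"
unrelated = "a completely different article about cooking pasta at home"

print(jaccard(shingles(original), shingles(near_copy)))   # 11 of 12 shingles shared
print(jaccard(shingles(original), shingles(unrelated)))   # no shingles shared
```

Two pages that share most of their shingles are very likely near-duplicates, even when one of them adds or changes a few words.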
It appears that the original author patent will look at other content authored by the entity that may be associated with one of the versions of this content:
In some embodiments, prior to issuing the request to the search engine to crawl and index the submitted content hosted by the resource associated with the link, the method includes determining that the entity is an author of original content based on an evaluation of other submitted content identified as being published by the entity, where the other submitted content identified as being published by the entity is included in the known content; and the request to the search engine to crawl and index the submitted content hosted by the resource associated with the link is issued responsive to determining that the submitted content is deemed to not be represented in the index of known content and determining that the entity is an author of original content.
This patent application makes it sound like Google is keeping track of entities who are authors by collecting shingles of things that they have written.
The patent also describes a verification and registration process which an author could use to verify that he or she is the author of content, and to register as an author.
It also tells us that it may start timestamping content with an identifier for the entity associated with the content, including at least one author of content or one publisher of content, or at least one website.
The patent also tells us that it may determine whether the pieces of content are similar by determining author scores for each of the pieces of content. The author scores would also include a citation score for the entity involved which looks at a frequency at which content from that entity is cited.
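The patent does not give a formula for an author score, so the following Python sketch is purely hypothetical. It shows one way a citation score (how often an entity's content is cited) could be folded into a single number next to a measure of how much of the entity's content is original. The function names, the squashing of the citation rate, and the equal weighting are all my assumptions.

```python
def citation_score(citation_count, document_count):
    """Map average citations per document into the range [0, 1)."""
    if document_count == 0:
        return 0.0
    rate = citation_count / document_count
    return rate / (1.0 + rate)


def author_score(original_count, document_count, citation_count,
                 citation_weight=0.5):
    """Blend the share of original documents with the citation score."""
    if document_count == 0:
        return 0.0
    originality = original_count / document_count
    cited = citation_score(citation_count, document_count)
    return (1.0 - citation_weight) * originality + citation_weight * cited


# Hypothetical entities: a blogger who writes everything they publish,
# and an aggregator that mostly republishes other people's content.
blogger = author_score(original_count=40, document_count=40, citation_count=120)
aggregator = author_score(original_count=5, document_count=200, citation_count=100)
print(blogger > aggregator)  # True
```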
How Reranking Based Upon an Entity being an Original Author Works
…The first search result being ranked higher than the second search result; determine that the first document and the second document satisfy a similarity criterion; determine that the second entity satisfies a predefined authorship differential with respect to the first entity; and responsive to determining that the second entity satisfies the predefined authorship differential with respect to the first entity, swap the second search result and the first search result in the ranked search results to produce re-ranked search results.
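The swap described in that passage can be sketched in a few lines of Python. This is a simplified reading under my own assumptions: the word-overlap similarity, the 0.9 similarity threshold, the 0.2 authorship differential, and the author scores are invented for the example, since the patent only says that a similarity criterion and a predefined authorship differential must be satisfied.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    entity: str
    text: str


def word_overlap(a, b):
    """Crude similarity: Jaccard overlap of the two documents' word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def rerank(results, author_scores, similarity_threshold=0.9, differential=0.2):
    """Swap a lower result above a similar higher one when its entity's
    author score exceeds the higher one's by at least the differential."""
    reranked = list(results)
    for i in range(len(reranked)):
        for j in range(i + 1, len(reranked)):
            first, second = reranked[i], reranked[j]
            if word_overlap(first.text, second.text) < similarity_threshold:
                continue
            gap = (author_scores.get(second.entity, 0.0)
                   - author_scores.get(first.entity, 0.0))
            if gap >= differential:
                reranked[i], reranked[j] = second, first
    return reranked


post = "how search engines choose between duplicate versions of a post"
results = [
    Result("https://magazine.example/post", "magazine", post),
    Result("https://other.example/recipes", "other", "weeknight pasta recipes"),
    Result("https://author-blog.example/post", "author", post),
]
scores = {"magazine": 0.3, "other": 0.5, "author": 0.9}  # hypothetical scores

for result in rerank(results, scores):
    print(result.url)
```

With these made-up scores, the author's copy moves into the position the magazine held, the magazine drops to where the author's copy was, and the unrelated result stays put.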
The patent application can be found at:
(US20180341656) Systems and Methods for Re-Ranking Ranked Search Results
Inventors: Chung Tin Kwok, Lei Zhong and Zhihuan Qiu
Publication Number: 20180341656
Publication Date: November 29, 2018
Applicants: GOOGLE LLC
A system, computer-readable storage medium storing at least one program, and a computer-implemented method for re-ranking ranked search results is presented. Ranked search results satisfying a search query are obtained, where the ranked search results include a first search result corresponding to a first document associated with a first entity and a second search result corresponding to a second document associated with a second entity, and where the first search result is ranked higher than the second search result. The first document and the second document are determined to satisfy a similarity criterion. The second entity is determined to satisfy a predefined authorship differential with respect to the first entity. Responsive to determining that the second entity satisfies the predefined authorship differential with respect to the first entity, the second search result and the first search result in the ranked search results are swapped to produce re-ranked search results.
Some Conclusions about an Original Author
The patent provides many details that are worth spending time looking over if you want to know more. For instance, an author of content is spelled out in painstaking detail as follows:
For example, the respective entity may include an individual author or one of a plurality of co-authors for (or contributors to) content. In some embodiments, an entity is a business organization that produces original or partially original content. In some embodiments, an entity is a news organization. In some implementations, the entity includes at least one publisher of content. For example, the respective entity may be a publisher of books, a publisher of periodicals, a publisher of online content, and/or the like. In some implementations, the respective entity is the author of content on at least one website. For example, the respective entity may contribute original content to a blogging website, a website for a publisher (e.g., news, magazine, etc.) and/or the like. Note that such a website may include a subset of the content within a particular domain. For example, the website may include content in a particular domain (e.g., a top-level domain example.com). In another example, the website includes content in a sub-domain of the particular domain (e.g., a sub-domain blogs.example.com). In another example, the website includes content in a directory of the domain (e.g., www.example.com/johndoe/). In some embodiments, the website includes content in: a plurality of domains (e.g., a network of affiliated websites), a plurality of sub-domains of at least one domain, and/or a plurality of subdirectories of at least one domain. In some embodiments, the content authored by an entity is a blog post, a social network post, or a post in an on-line discussion thread. In some embodiments, the content authored by the entity is any content that has been posted to a location accessible on the Internet such that it is readily ascertainable that the entity posted the content.
The patent reminded me of the Google authorship program under Google+, where you linked to the place that you published as an author with a rel="me" in your link to that site.
Creating an author score that includes a citation score identifying how often an author might be cited elsewhere on the Web is interesting. The idea of using citations as a way of scoring authors reminds me of the use of citations in PageRank as described in The PageRank Citation Ranking: Bringing Order to the Web. As the abstract from that paper tells us:
This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them.