Google Using Website Representation Vectors to Classify with Expertise and Authority

by Bill Slawski · Posted February 21, 2020


Added (2020-02-23) Questions and Answers: I have been asked a few questions about this patent, and have had some things about it pointed out to me, which I wanted to address.

1. The “Medic” update – The update that Barry Schwartz named in August of 2018, because it appeared to affect medical websites, also impacted other types of websites. This patent application, also filed in August of 2018, notes that it covers a range of industries, with health and artificial intelligence sites as examples. It specifically uses an example about authors of health sites: doctors as experts, medical students as apprentices, and laypeople as nonexperts, which is why I asked a Go Fish Digital graphic designer for a masthead image for this post showing one of each of those. The patent covers different industries and different levels of expertise. I purposefully chose an illustration that reflected the “medic” aspect of the process from the patent because I believed it was an accurate reflection of what the patent covers.

2. Quality Scores – The patent explains how it might further classify websites based upon whether they meet thresholds based upon quality scores. The patent does not define specifically what a “quality score” is, but Google has been granted several patents about quality scores for websites. A great source from Google about what a high-quality website includes is a blog post from Amit Singhal: More guidance on building high-quality sites.

3. Rankings of Results – I have been asked how sites might be ranked under the process from this patent. Queries from specific knowledge domains (covering specific topics) might return results from sites that are classified as being from the same knowledge domain. For instance, a medical query such as “what are the symptoms of mononucleosis” may be classified as being from a medical knowledge domain, and best answered by a site that has been classified as being from a medical knowledge domain. The patent also tells us that part of its purpose is to limit possible results to pages, based on classifications involving industry and expertise, that meet sufficient quality thresholds. Those pages would then be ranked based upon relevance and authority scores:

[0024] The search results are ranked based on scores related to the resources identified by the search results, such as information retrieval (“IR”) scores, and optionally a separate ranking of each resource relative to other resources (e.g., an authority score). The search results are ordered according to these scores and provided to the user device according to the order.
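
As a rough illustration of that ordering (my own sketch, not code from the patent), here is how an information retrieval score might be blended with an optional authority score; the 0.7/0.3 weights and field names are placeholder assumptions:

```python
from dataclasses import dataclass

@dataclass
class Result:
    url: str
    ir_score: float         # information retrieval score for the query
    authority_score: float  # ranking of the resource relative to other resources

def rank_results(results: list[Result]) -> list[Result]:
    # Order search results by a blend of IR and authority scores;
    # the weights are made-up placeholders, not from the patent.
    return sorted(
        results,
        key=lambda r: 0.7 * r.ir_score + 0.3 * r.authority_score,
        reverse=True,
    )
```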

Classification of Websites

Google is telling us that they may use Website Representation Vectors to classify sites based upon features found on those sites.

This post is about a new Google patent application filed in August of 2018 and published at the World Intellectual Property Organization (WIPO) last week.

The patent application uses Neural Networks to understand patterns and features behind websites to classify those sites.

This website classification system refers to “a composite-representation, e.g., vector, for a website classification within a particular knowledge domain.”

Those knowledge domains can be topics such as health, finance, and others. Sites classified in specific knowledge domains can have an advantage, because the search engine may use that classification when returning search results in response to a search query.

Those website classifications can be more diverse than representing categories of websites within knowledge domains. The patent breaks the categories down much further:

For instance, the website classifications may include the first category of websites authored by experts in the knowledge domain, e.g., doctors, the second category of websites authored by apprentices in the knowledge domain, e.g., medical students, and a third category of websites authored by laypersons in the knowledge domain.

I am reminded of discussions in the SEO Industry about the Google Quality Raters Guidelines and references in it to E-A-T or Expertise, Authority, and Trustworthiness. The Guidelines point out Health sites with different levels of E-A-T, much like the classifications from this new Google patent application about Website representation vectors:

High E-A-T medical advice should be written or produced by people or organizations with appropriate medical expertise or accreditation. High E-A-T medical advice or information should be written or produced in a professional style and should be edited, reviewed, and updated regularly.

The Guidelines tell us that there are sites created by people with not as much expertise on topics:

It’s even possible to have everyday expertise in YMYL topics. For example, there are forums and support pages for people with specific diseases. Sharing personal experience is a form of everyday expertise. Consider this example. Here, forum participants are telling how long their loved ones lived with liver cancer. This is an example of sharing personal experiences (in which they are experts), not medical advice. Specific medical information and advice (rather than descriptions of life experiences) should come from doctors or other health professionals.

The classifications include an expert level of sites in the health domain, an apprentice level of sites, and a layperson level of sites.

These classifications are based upon different levels of expertise, and the patent tells us that pages are also ranked based on authority. It says nothing about trustworthiness, so it is not ranking sites on E-A-T in full. Still, by capturing two of the three aspects of E-A-T, this process can fulfill part of the aim of the Quality Raters Guidelines: sites that rank well would exhibit high levels of expertise and authority.

Also, if this process limits the number of sites Google returns search results from, based upon which knowledge domain they might be in, it means Google is searching through fewer sites than its entire index of the web to return results. Let’s look at the process behind this patent application in a little more depth.

It classifies many websites into particular knowledge domains and tries to find different levels of sites within those knowledge domains (a minimal code sketch follows the list):

  • Receiving representations of websites and quality scores that represent quality measures of sites relative to other sites
  • Classifying as first websites each of the sites having quality scores below a first threshold, with at least one of the sites having a quality score below that threshold
  • Classifying as second websites each of the sites having quality scores above a second threshold that is greater than the first threshold, with at least one of the sites having a quality score greater than the first threshold
  • Generating a first composite-representation of the websites classified as the first websites
  • Generating a second composite-representation of the websites classified as the second websites
  • Receiving a representation of another website
  • Determining a first measure of difference between the first composite-representation and the representation
  • Determining a second measure of difference between the second composite-representation and the representation
  • Based on the first measure of difference and the second measure of difference, classifying the other website as one of the first websites, the second websites, or as third websites that are not classified as either the first websites or second websites
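
Here is a minimal sketch of those steps (my own, assuming representations are feature vectors, composite-representations are mean vectors, and a distance cutoff for the third category; the patent pins down none of those choices):

```python
import numpy as np

# Placeholder thresholds; the patent only requires that the second
# threshold be greater than the first.
FIRST_THRESHOLD = 0.3
SECOND_THRESHOLD = 0.7

def split_by_quality(sites):
    """sites: list of (representation_vector, quality_score) pairs."""
    first = [v for v, q in sites if q < FIRST_THRESHOLD]
    second = [v for v, q in sites if q > SECOND_THRESHOLD]
    return first, second

def composite(vectors):
    # Composite-representation, assumed here to be the element-wise mean.
    return np.mean(np.stack(vectors), axis=0)

def classify_other_site(representation, first_comp, second_comp, cutoff=1.0):
    d1 = np.linalg.norm(representation - first_comp)   # first difference measure
    d2 = np.linalg.norm(representation - second_comp)  # second difference measure
    if min(d1, d2) > cutoff:  # far from both composites -> third websites
        return "third"
    return "first" if d1 < d2 else "second"
```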

The website classification process

Queries Request Responses from Particular Knowledge Domains

The patent application tells us that its process includes using terms from the query to understand that the query is requesting responsive data from a particular knowledge domain.

It may search for responses from that particular knowledge domain. The process involves (a small sketch follows the list):

  • Generating, from the authoritative data sources, preprocessed responses to future queries
  • Receiving, after generating the preprocessed responses, a query determined to be for the particular knowledge domain
  • In response, responding to the query with one of the preprocessed responses
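
A small sketch of that flow (my own; the domain detector, the stored answer text, and the exact-match lookup are all placeholder assumptions):

```python
# Preprocessed responses generated ahead of time from authoritative
# sources, keyed by knowledge domain and then by query.
preprocessed_responses: dict[str, dict[str, str]] = {
    "medical": {
        "what are the symptoms of mononucleosis":
            "Common symptoms include fatigue, fever, and a sore throat ...",
    },
}

def respond(query: str, detect_domain) -> str | None:
    # detect_domain classifies the query into a knowledge domain, e.g. "medical".
    domain = detect_domain(query)
    # Answer from the preprocessed responses for that knowledge domain.
    return preprocessed_responses.get(domain, {}).get(query.lower().strip())
```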

Advantages of this Website Representation Vectors Approach

The search system may select data, search data, or both, for only websites with a particular classification, reducing the computer resources necessary to find search results, e.g., by not selecting or searching every website irrespective of classification. This can:

  • Reduce the amount of storage needed to store data for potential search results, e.g., may need only data storage for websites with the particular classification
  • Reduce the number of websites analyzed by the search system, e.g., limiting a search to sites with the particular classification
  • Reduce network bandwidth used to provide search results to a requesting device
  • Address potential problems with earlier systems, such as higher use of bandwidth, memory, processor cycles, power, or a combination of two or more of these
  • Improve search results pages generated by a search system by including identification of only sites with a particular classification, e.g., a qualitative classification, in generated search results pages
  • Use characteristics learned from existing sites to classify previously unseen websites without requiring user input for the classification
  • Detect websites that are more likely responsive to queries for a knowledge domain, e.g., are more likely authoritative for the knowledge domain, by classifying previously unseen websites
  • Use a composite-representation based upon existing website classifications, which means the characteristics used by the classification are not limited by human discernible characteristics and can be any characteristic that can be learned by analysis of the website

Note that the patent says this process helps identify sites that are authoritative for different knowledge domains.

This Website Representation Vector Patent Application can be found at:

Website Representation Vector to Generate Search Results and Classify Website
Publication Number: WO/2020/033805
Applicant: GOOGLE LLC
Inventor: Yevgen Tsykynovskyy
Filed: August 10, 2018
Publication Date: February 13, 2020

Abstract:

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for using website representations to generate, store, or both, search results. One of the methods includes receiving data representing each website in the first plurality of websites associated with a first knowledge domain of a plurality of knowledge domains and having a first classification; receiving data representing each website in the second plurality of websites associated with the first knowledge domain and having a second classification; generating a first composite-representation of the first plurality of websites; generating a second composite-representation of the second plurality of websites; receiving a representation of a third website; determining a first difference measure between the first composite-representation and the representation; determining a second difference measure between the second composite-representation and the representation; and based on the first difference measure and the second difference measure, classifying the third website.

How Data From the Web Classification System May be Used

The search engine may use data from this website representation vectors classification system to return search results.

That classification system may use representations for each of many websites A-N and use the representations to determine a classification for each of the many websites A-N.

Choosing Features

The search engine may use a classification for a search query to choose a category of websites with the same, or a similar, classification.

It may return search results from that category of sites.

The classifications of those sites are based upon the features that the sites contain.

How Website Classifications are Generated

This was the part of the patent’s description that I was most interested in finding.

It starts by telling us that this website representation vectors classification system could use any appropriate method to generate classifications, which provides Google with a lot of flexibility.

But then it goes into more detail, telling us that content from websites may be used to generate the representations of those sites on which classification is based.

That content can include:

  • Text from the website
  • Images on the website
  • Other website content, e.g., links
  • Or a combination of two or more of these

The patent then provides details about how a neural network gets involved:

The website classification system may use a mapping that maps the website content for the website A to a vector space that identifies a representation for website A.

For instance, the website classification system may use a neural network, that represents the mapping, to create a feature vector A that represents the website A using the content of the website A as input to the neural network.
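
The patent does not specify the network, so here is a minimal stand-in (my assumption): hash the website text into a bag-of-words vector and project it through one untrained linear layer to produce feature vector A:

```python
import numpy as np

VOCAB_BUCKETS = 1024  # hashed vocabulary size (assumed)
EMBED_DIM = 64        # feature vector size (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(VOCAB_BUCKETS, EMBED_DIM))  # untrained placeholder weights

def website_vector(text: str) -> np.ndarray:
    # Map website content to a bag-of-words count vector.
    counts = np.zeros(VOCAB_BUCKETS)
    for token in text.lower().split():
        counts[hash(token) % VOCAB_BUCKETS] += 1
    # One linear layer plus a nonlinearity maps content into the vector space.
    return np.tanh(counts @ W)
```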

Labels Used in Website Classification

Website classification may be based upon labels (a small sketch of a label record follows the list). The labels:

  • May be alphanumeric, numerical, or alphabetical characters, symbols, or a combination of two or more of these
  • Can indicate a type of entity that had the corresponding website published, such as a non-profit or a for-profit business
  • May indicate an industry described on a site, such as artificial intelligence or education
  • May indicate a type of person who authored a site, such as a doctor, a medical student, or a layperson
  • Could also be scores that represent a website classification
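
As a small illustration (my own; the field names are hypothetical), a label record covering the kinds of information the patent lists might look like:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label record; the patent says labels may be alphanumeric,
# numerical, or alphabetical characters, symbols, or scores.
@dataclass
class WebsiteLabel:
    entity_type: str               # e.g., "non-profit" or "for-profit business"
    industry: str                  # e.g., "artificial intelligence", "education"
    author_type: str               # e.g., "doctor", "medical student", "layperson"
    score: Optional[float] = None  # a label could also be a classification score
```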

The scores for classifications:

  • Could be used to satisfy the different thresholds that define the categories
  • May be specific to a particular knowledge domain
  • Could classify a site as covering more than one knowledge domain
  • Could be used to select sites as responsive to queries for multiple knowledge domains, to indicate the authoritativeness of the respective website for a particular knowledge domain, or both

Input data used to classify sites, and the ways a classification may be selected, include (a sketch of the similarity comparison follows the list):

  • The position of particular words relative to each other, e.g., that the word “artificial” is generally near or adjacent to the word “intelligence”
  • Particular phrases included in the website
  • For each of the classifications A-B, a measure of difference, or a similarity measure, that represents a similarity between the respective classification and the other website
  • The classification A-B that is most similar
  • The classification A-B with the highest similarity measure, or with the shortest distance between the other feature vector and the respective average feature vector A-B, to name a few examples
  • A ratio between two similarity measures to select a classification for the other website
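
A minimal sketch of that comparison (my own; cosine similarity and the ratio cutoff are assumptions, since the patent only mentions similarity measures, distances, and a ratio):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Similarity between a feature vector and an average feature vector.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_classification(vec, avg_a, avg_b, ratio_cutoff=1.1):
    sim_a, sim_b = cosine(vec, avg_a), cosine(vec, avg_b)
    best, other = max(sim_a, sim_b), min(sim_a, sim_b)
    # If the two similarities are nearly equal, decline to classify.
    if other > 0 and best / other < ratio_cutoff:
        return None
    return "A" if sim_a > sim_b else "B"
```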

This website representation vectors patent provides several other ways that input data may be looked at during the classification process.

Quality scores indicating a classification of a site may be measures of:

  • Authoritativeness
  • Responsiveness for a particular knowledge domain
  • Another property of the website
  • Or a combination of two or more of these

Takeaways from this Website Representation Vectors Classification Approach

  • Websites may be classified based upon the text, images, and links within them
  • Quality Scores of Classified Sites may indicate authoritativeness or how responsive a site may be for a particular knowledge domain, or both
  • Labels used to classify sites could include information about the entity behind a site, the industry described in the site, and the type of person who authored a site
  • A site might be classified to cover more than one knowledge domain

Last updated February 23, 2020


    6 Comments

    1. Clint Butler

      February 25th, 2020 at 7:54 am

      Bill,

      I’m a little confused by this:

      “Classifying as first websites each sites having quality scores below a first threshold, at least one of the number of sites having a quality score below the first threshold”

      Why would they be looking for sites that don’t meet the first threshold to be declared top-level sites in the outline hierarchy of authority?


      • Bill Slawski

        February 25th, 2020 at 12:07 pm

        Hi Clint,

        Good to see you. Thanks for asking about that. It was part of the patent that puzzled me, too. The patent doesn’t go into much detail about those quality scores, but it does provide a very quick definition of them as “a quality measure of the website relative to other websites.” I did a search on the patent for mentions of quality scores, and they come up a few times. This was one of those:

        [0060] The website classification system classifies as first websites each of the plurality of websites having a quality score below a first threshold (204). For instance, each website in the plurality of websites may have a score. The score may indicate a classification of the website, such as an authoritativeness, a responsiveness for a particular knowledge domain, another property of the website, or a combination of two or more of these.

        So they are ranking sites within classifications based upon scores that can be based upon a few different aspects of a site.

        The language of the patent regarding those rankings and thresholds is a little confusing. I’m not sure if that is because this is a patent application, rather than a granted patent. This section does seem to have an issue: it calls the set of sites below the first threshold “first sites” and the ones above a second threshold “second sites,” yet requires at least one site to have a score above the first threshold:

        In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of, for each website of a plurality of websites determined to be in a particular knowledge domain, wherein the particular knowledge domain is one of a plurality of knowledge domains that are each different from the other knowledge domains: receiving representations of the website and a quality score representing a quality measure of the website relative to other websites; classifying as first websites each of the plurality of websites having a quality score below a first threshold, at least one of the plurality of websites having a quality score below the first threshold; classifying as second websites each of the plurality of websites having a quality score above a second threshold that is greater than the first threshold, at least one of the plurality of websites having a quality score greater than the first threshold;

        It does feel like that needs to be rewritten – the idea that sites are being ranked based upon whether they meet thresholds based on quality scores does come across there, though.

    2. Pingback: Știri #26 - Lumea SEO PPC

    3. Pingback: Google sait qui a rédigé quels articles - Technologeek

    4. Pingback: Job Search Engine Using Occupation Vectors - SEO by the Sea ⚓

    5. Pingback: ¿Patente de EAT y el Medic Update? Vectores para clasificar sitios por temática y autoridad – My Blog
