Google Speaker Identification Vectors

Published: July 24, 2020


What are Speaker Identification Vectors and What Are They Used for?

Google was granted a patent in February of this year involving speech recognition systems and how those could be used to determine or verify the identity of a speaker.

The patent tells us about previous attempts at speaker identification, which generally involved determining the likely identity based on speech samples from the speaker.


We are told that the more potential speaker identities a system had to select from, the more computation and time were required to identify the correct speaker among them.

Why is Google concerned about speaker identification?

Speaker identification is an area of speech processing that can help with:

Identification accuracy
Fast search in the database of speakers

This patent reminds me of a post I wrote in March called Author Vectors: Google Knows Who Wrote Which Articles.

In that patent, we were told how author vectors might be useful:

“The author vector generated by the author vector system for a given author is a vector of numeric values that characterizes the author.

In particular, depending on the context of the use of the author vector, the author vector can characterize one or more of the communication style of the author, the author’s personality type, the author’s likelihood of selecting certain content items, and other characteristics of the author.”

So Google is interested in what authors are writing and also what speakers are saying.

I recently wrote about a patent application that classified websites based on what Google search engineers called Website Representation Vectors. I posted about that patent in Google Using Website Representation Vectors to Classify with Expertise and Authority.

This kind of identification and classification of authors, speakers, and websites, and the use of features that help identify and classify them, is part of a new trend from Google. It fits in with indexing real-world objects, which was the aim behind Google’s Knowledge Graph, launched in 2012. Google wants to index actual speakers, authors, and websites, treating each as an entity, and understanding and indexing each based upon the features that make it unique.

Speaker Identification Vectors

This is what we are told about speaker vectors:

In one general aspect, a method includes:

Obtaining an utterance vector that is derived from an utterance
Determining hash values for the utterance vector according to multiple different hash functions
Determining a set of speaker vectors from a plurality of hash tables using the hash values, each speaker vector being derived from one or more utterances of a respective speaker
Comparing the speaker vectors in the set with the utterance vector
Selecting a speaker vector based on comparing the speaker vectors in the set with the utterance vector
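
To make those steps more concrete, here is a minimal sketch in Python of how I read them. It is not Google's implementation. The summary above only says "multiple different hash functions", so the random-hyperplane hashing, the cosine comparison, the toy 4-dimensional vectors, and every function name here are my own assumptions for illustration.

```python
import math
import random


def make_hyperplane_hash(dim, num_bits, rng):
    """Build one hash function: the sign pattern of num_bits random projections.

    Random hyperplanes are an assumed choice, not something the patent summary names.
    """
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

    def hash_fn(vec):
        return tuple(
            1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )

    return hash_fn


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(utterance_vec, hash_fns, tables, speaker_vecs):
    """Return the id of the best-matching candidate speaker, or None."""
    # Hash the utterance vector with every hash function and collect the
    # speakers stored in the matching bucket of each hash table.
    candidates = set()
    for hash_fn, table in zip(hash_fns, tables):
        candidates.update(table.get(hash_fn(utterance_vec), ()))
    if not candidates:
        return None
    # Compare only those candidates with the utterance vector; select the closest.
    return max(candidates, key=lambda sid: cosine(speaker_vecs[sid], utterance_vec))


# Toy data: three made-up speaker vectors (hypothetical values).
speaker_vecs = {
    "speaker_a": [0.9, 0.1, -0.3, 0.2],
    "speaker_b": [-0.4, 0.8, 0.5, -0.1],
    "speaker_c": [0.1, -0.7, 0.6, 0.9],
}

rng = random.Random(0)
hash_fns = [make_hyperplane_hash(dim=4, num_bits=3, rng=rng) for _ in range(5)]

# One hash table per hash function, mapping a hash value to speaker ids.
tables = [{} for _ in hash_fns]
for sid, vec in speaker_vecs.items():
    for hash_fn, table in zip(hash_fns, tables):
        table.setdefault(hash_fn(vec), []).append(sid)

# An utterance vector pointing in the same direction as speaker_b's vector.
utterance_vec = [2.0 * v for v in speaker_vecs["speaker_b"]]
match = identify(utterance_vec, hash_fns, tables, speaker_vecs)
print(match)
```

The point of the hash tables is that the utterance vector is compared only with the speaker vectors that share a bucket with it, instead of with every speaker vector in the collection.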

Features involving Speaker Identification Vectors

Different versions of the process involving speaker identification may include many additional features.

For example, the process may include:

Obtaining an utterance i-vector for the utterance, the utterance i-vector comprising parameters determined using multivariate factor analysis of the utterance
Determining the set of speaker vectors from the plurality of hash tables using the hash values includes determining a set of speaker i-vectors from the plurality of hash tables, each speaker i-vector comprising parameters determined using multivariate factor analysis of one or more utterances of a respective speaker
Obtaining the utterance vector includes obtaining an utterance vector comprising parameters determined based on deep neural network activations that occur in response to information about the utterance being provided to the deep neural network
Determining the set of speaker vectors from the plurality of hash tables using the hash values includes determining a set of speaker vectors in which each speaker vector includes parameters determined based on deep neural network activations that occur in response to information about one or more utterances of a respective speaker being provided to the deep neural network

These features focus on the identification of individual speakers:

Accessing data indicating associations between the speaker vectors and respective speakers
Determining, based on the data indicating the associations between the speaker vectors and the respective speakers, a speaker identity corresponding to the selected speaker vector
Outputting data indicating the speaker identity

The process behind the patent can include identifying one or more media items that include utterances from a speaker that correspond to the selected speaker vector, and then outputting data indicating the identified media items.

This method may include:

Determining that the selected speaker vector corresponds to a particular user
Based at least in part on the determining that the selected speaker vector corresponds to a particular user identity, authenticating the particular user
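
As a rough illustration of that authentication step, the sketch below accepts a claimed identity only when the selected speaker vector maps to that same user and the match is strong enough. The similarity threshold, the `speaker_to_user` mapping, and the function name are hypothetical; the patent summary does not specify how the decision is made.

```python
def authenticate(claimed_user, selected_speaker_id, similarity,
                 speaker_to_user, threshold=0.8):
    """Return True only if the selected speaker vector belongs to the claimed user.

    `threshold` is an assumed cut-off on the comparison score, not a value
    taken from the patent.
    """
    # Look up which user the selected speaker vector is associated with.
    user = speaker_to_user.get(selected_speaker_id)
    # Require both the right user and a sufficiently close match.
    return user == claimed_user and similarity >= threshold


# Hypothetical association data between speaker vectors and users.
speaker_to_user = {"vec_17": "alice", "vec_42": "bob"}

accepted = authenticate("alice", "vec_17", 0.93, speaker_to_user)
wrong_user = authenticate("alice", "vec_42", 0.95, speaker_to_user)
weak_match = authenticate("alice", "vec_17", 0.41, speaker_to_user)
print(accepted, wrong_user, weak_match)
```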

Speaker Identification Vectors in More Detail

Each speaker vector corresponds to a different speaker.

The process includes providing data that indicates that the speaker corresponding to the selected speaker vector is the speaker of the utterance.

This process can include obtaining multiple speaker vectors that each indicate characteristics of the speech of a respective speaker; and, for each particular speaker vector of the multiple speaker vectors:

Determining hash values for the particular speaker vector according to each of the multiple different hash functions
Inserting the particular speaker vector into each of the plurality of hash tables based on the hash values
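
Here is a small sketch of that indexing step, with each speaker vector hashed by every hash function and stored in the corresponding table. As before, the random-hyperplane hash functions, the toy vectors, and the names `make_hyperplane_hash` and `build_index` are assumptions of mine, not details from the patent.

```python
import random
from collections import defaultdict


def make_hyperplane_hash(dim, num_bits, rng):
    """One assumed hash function: sign bits of num_bits random projections."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

    def hash_fn(vec):
        return tuple(
            1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )

    return hash_fn


def build_index(speaker_vecs, hash_fns):
    """Insert every speaker vector into every hash table, keyed by its hash value."""
    tables = [defaultdict(list) for _ in hash_fns]
    for speaker_id, vec in speaker_vecs.items():
        for hash_fn, table in zip(hash_fns, tables):
            table[hash_fn(vec)].append(speaker_id)
    return tables


# Made-up speaker vectors, one per speaker.
speaker_vecs = {
    "speaker_a": [0.9, 0.1, -0.3, 0.2],
    "speaker_b": [-0.4, 0.8, 0.5, -0.1],
    "speaker_c": [0.1, -0.7, 0.6, 0.9],
}

rng = random.Random(1)
hash_fns = [make_hyperplane_hash(dim=4, num_bits=3, rng=rng) for _ in range(4)]
tables = build_index(speaker_vecs, hash_fns)

# Every speaker appears exactly once in each table.
entries_per_table = [sum(len(bucket) for bucket in t.values()) for t in tables]
print(entries_per_table)
```

Storing each vector in several tables costs extra memory, but it gives a similar utterance vector several chances to land in the same bucket as the right speaker.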

Collecting Speaker Vector Information

The unique information about a speaker’s characteristics can be collected from videos, including from many different videos.

When obtaining multiple speaker vectors, each of those will indicate unique characteristics of the speech of a respective speaker. Doing that can include:

Accessing a set of multiple video resources
Generating a speaker vector for each of the multiple video resources

In another general aspect, a method includes: obtaining an utterance i-vector for an utterance; determining hash values for the utterance i-vector according to multiple different hash functions; determining a set of speaker i-vectors from a plurality of hash tables using the hash values; comparing the speaker i-vectors in the set with the utterance i-vector, and selecting a speaker i-vector based on comparing the speaker i-vectors in the set with the utterance i-vector.

This Speaker Identification Vectors patent can be found at:

Speaker identification
Inventors: Matthew Sharifi, Ignacio Lopez Moreno, and Ludwig Schmidt
Assignee: Google LLC
US Patent: 10,565,996
Granted: February 18, 2020
Filed: June 1, 2016

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing speaker identification. In some implementations, data identifying a media item, including the speech of a speaker, is received. Based on the received data, one or more other media items that include the speaker’s speech are identified. One or more search results are generated that each reference a respective media item of one or more other media items that include the speaker’s speech. One or more search results are provided for display.

Speaker Identification Vectors Takeaways

The patent provides many more details about the importance of speaker identification, and the process behind it. It starts by telling us that some applications of speaker identification include:

Authentication in security-critical systems
Personalized speech recognition
Searching for speakers in large corpora

I have summarized the summary from the patent, but there is a lot more detail behind this patent that is worth digging into. For example, speech recognition can be used in Web searches and be important to web security.

If you want to learn more, reading through the patent is recommended. There is also a white paper on this topic from the inventors named on this patent that is worth spending time with: Large-Scale Speaker Identification (pdf).

I have had discussions with several SEOs recently about the quality of transcripts at places such as YouTube. The feedback I have been receiving is that transcripts have been improving significantly in quality. I don’t know if that would be because of this Speaker Identification Vectors process, but it could be related.

Earlier this year, I wrote the post Quote Searching Updated at Google to Focus on Videos, which was about Google updating a patent to provide information about quotes by analyzing text from videos instead of relying upon finding information about those quotes in knowledge bases such as Wikipedia. If Google has spent time getting better at analyzing audio in videos (also referred to in this patent), that could explain why transcripts at places such as YouTube have improved in quality.

About Bill Slawski

With more than 26 years of SEO experience and a Juris Doctor Degree, Bill Slawski is the foremost expert on Google’s patents as related to SEO. Patent Exploration is one of the quickest and most detailed ways to find new information about SEO. Bill is the Editor of SEO by the Sea, a prominent search engine optimization blog, where he is the author of over 1,300 posts. Bill’s experience includes Fortune 500 brands and some of the largest websites in the world. Bill is a contributing author for Moz, Search Engine Land, and Search Engine Journal. From 2014 to 2021, he spoke at industry-leading international conferences about topics including search engine algorithms, universal and blended search, personalization in search, search and social, duplicate content problems, structured data, and schema.
