What are Speaker Identification Vectors and What Are They Used for?
Google was granted a patent in February of this year involving speech recognition systems, and how those could be used to determine or verify the identity of a speaker.
The patent tells us about previous attempts at speaker identification, which generally involved determining the likely identity of a speaker based on speech samples from the speaker.
We are told that previously, the more potential speaker identities a system had to select from, the more computation and time were required to identify the correct speaker from among them.
Why is Google concerned about speaker identification?
Speaker identification is an area of speech processing that can help with:
- Identification accuracy
- Fast search in the database of speakers
This patent reminds me of a post I wrote in March called Author Vectors: Google Knows Who Wrote Which Articles.
In that patent, we were told how author vectors may be useful:
“The author vector generated by the author vector system for a given author is a vector of numeric values that characterizes the author.
In particular, depending on the context of the use of the author vector, the author vector can characterize one or more of the communication style of the author, the author’s personality type, the author’s likelihood of selecting certain content items, and other characteristics of the author.”
So Google is interested in what authors are writing, and also what speakers are saying.
I also recently wrote about a patent application that classified websites based on what Google search engineers were calling Website Representation Vectors. I posted about that patent in Google Using Website Representation Vectors to Classify with Expertise and Authority.
This kind of identification and classification of authors, speakers, and websites, and the use of features that help to identify and classify them, is part of a new trend from Google. It fits in with indexing real-world objects, which was the aim behind Google’s Knowledge Graph, launched in 2012. Google wants to index actual speakers, authors, and websites, treating each as an entity, and understanding and indexing each one based upon the features that make it unique.
Speaker Identification Vectors
This is what we are told about speaker vectors:
In one general aspect, a method includes:
- Obtaining an utterance vector that is derived from an utterance
- Determining hash values for the utterance vector according to multiple different hash functions
- Determining a set of speaker vectors from a plurality of hash tables using the hash values, each speaker vector being derived from one or more utterances of a respective speaker
- Comparing the speaker vectors in the set with the utterance vector
- Selecting a speaker vector based on comparing the speaker vectors in the set with the utterance vector
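The lookup steps above can be sketched in Python. This is only a rough illustration under stated assumptions, not the patent's implementation: it assumes random-hyperplane hashing and cosine similarity, neither of which this summary specifies, and the function names, vector size, number of tables, and speaker IDs are all made up for the example.

```python
import math
import random

def make_hyperplane_hash(dim, bits, rng):
    """Build one hash function: the sign pattern of `bits` random projections (assumed hash family)."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
    def hash_fn(vec):
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)
    return hash_fn

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def identify(utterance_vec, hash_fns, tables, speaker_vecs):
    # Hash the utterance vector with every hash function and collect
    # the speaker vectors stored in the matching bucket of each table.
    candidates = set()
    for hash_fn, table in zip(hash_fns, tables):
        candidates.update(table.get(hash_fn(utterance_vec), ()))
    if not candidates:
        return None
    # Compare only the candidates with the utterance vector and select the closest.
    return max(candidates, key=lambda sid: cosine(speaker_vecs[sid], utterance_vec))

rng = random.Random(0)
dim = 16
# Hypothetical enrolled speakers with random stand-in vectors.
speaker_vecs = {f"speaker_{i}": [rng.gauss(0, 1) for _ in range(dim)] for i in range(200)}
hash_fns = [make_hyperplane_hash(dim, bits=6, rng=rng) for _ in range(8)]
tables = []
for hash_fn in hash_fns:
    table = {}
    for sid, vec in speaker_vecs.items():
        table.setdefault(hash_fn(vec), []).append(sid)
    tables.append(table)

# An utterance vector identical to the one enrolled for speaker_42.
utterance = list(speaker_vecs["speaker_42"])
best = identify(utterance, hash_fns, tables, speaker_vecs)
print(best)
```

The point of the design is that only vectors sharing a bucket with the utterance are compared, rather than every enrolled speaker. A real utterance vector would differ slightly from the enrolled one and could land in a different bucket in some tables, which is one reason to use several tables rather than one.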
Features involving Speaker Identification Vectors
Different versions of the process involving speaker identification may include many additional features.
For example, different versions of the process may involve:
- Obtaining an utterance i-vector for the utterance, the utterance i-vector comprising parameters determined using multivariate factor analysis of the utterance
- Determining the set of speaker vectors from the plurality of hash tables using the hash values includes determining a set of speaker i-vectors from the plurality of hash tables, each speaker i-vector comprising parameters determined using multivariate factor analysis of one or more utterances of a respective speaker
- Obtaining the utterance vector includes obtaining an utterance vector comprising parameters determined based on deep neural network activations that occur in response to information about the utterance being provided to the deep neural network
- Determining the set of speaker vectors from the plurality of hash tables using the hash values includes determining a set of speaker vectors in which each speaker vector includes parameters determined based on deep neural network activations that occur in response to information about one or more utterances of a respective speaker being provided to the deep neural network
These features focus on the identification of individual speakers:
- Accessing data indicating associations between the speaker vectors and respective speakers
- Determining, based on the data indicating the associations between the speaker vectors and the respective speakers, a speaker identity corresponding to the selected speaker vector
- Outputting data indicating the speaker identity
The process behind the patent can include identifying one or more media items that include utterances from a speaker who corresponds to the selected speaker vector, and then outputting data indicating the identified media items.
This method may include:
- Determining that the selected speaker vector corresponds to a particular user
- Based at least in part on the determining that the selected speaker vector corresponds to a particular user identity, authenticating the particular user
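As a rough illustration of the authentication step, the sketch below accepts a user only when the selected speaker vector is associated with the claimed identity and is close enough to the utterance vector. The association table, the cosine-similarity test, and the 0.8 threshold are assumptions made for this example; the summary above says only that authentication is based "at least in part" on the match.

```python
import math

# Hypothetical association data: speaker-vector ID -> user identity.
VECTOR_OWNER = {"vec_001": "alice@example.com", "vec_002": "bob@example.com"}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def authenticate(claimed_user, selected_vector_id, selected_vec, utterance_vec, threshold=0.8):
    """Return True only if the selected vector belongs to the claimed user
    and is similar enough to the utterance vector (threshold is an assumption)."""
    if VECTOR_OWNER.get(selected_vector_id) != claimed_user:
        return False
    return cosine(selected_vec, utterance_vec) >= threshold

enrolled = [0.6, 0.8, 0.0]
ok = authenticate("alice@example.com", "vec_001", enrolled, [0.6, 0.7, 0.1])
wrong_user = authenticate("bob@example.com", "vec_001", enrolled, [0.6, 0.7, 0.1])
too_far = authenticate("alice@example.com", "vec_001", enrolled, [0.0, 0.0, 1.0])
print(ok, wrong_user, too_far)
```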
Speaker Identification Vectors in More Detail
Each speaker vector corresponds to a different speaker.
The process includes providing data that indicates that the speaker corresponding to the selected speaker vector is the speaker of the utterance.
This process can include obtaining multiple speaker vectors that each indicate characteristics of the speech of a respective speaker; and, for each particular speaker vector of the multiple speaker vectors:
- Determining hash values for the particular speaker vector according to each of the multiple different hash functions
- Inserting the particular speaker vector into each of the plurality of hash tables based on the hash values
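Building the hash tables might look something like the following sketch. The SpeakerIndex class, its parameters, and the random-hyperplane hash are hypothetical choices for illustration; the summary above says only that each speaker vector is hashed with each hash function and inserted into each table.

```python
import random
from collections import defaultdict

class SpeakerIndex:
    """Hypothetical index: one hash table per hash function."""

    def __init__(self, dim, num_tables=4, bits=8, seed=0):
        rng = random.Random(seed)
        # One independent set of random hyperplanes per table (assumed hash family).
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, table_idx, vec):
        # The bucket key is the sign pattern of the vector's projections.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0
            for plane in self.planes[table_idx]
        )

    def insert(self, speaker_id, vec):
        # The same speaker vector goes into every table, in one bucket per table.
        for i, table in enumerate(self.tables):
            table[self._key(i, vec)].append((speaker_id, vec))

index = SpeakerIndex(dim=4)
index.insert("alice", [0.9, 0.1, -0.3, 0.5])
index.insert("bob", [-0.7, 0.4, 0.2, -0.1])
entries_per_table = [sum(len(bucket) for bucket in t.values()) for t in index.tables]
print(entries_per_table)
```

Each table ends up holding every speaker exactly once, so storage grows with the number of tables. That is the cost paid for a better chance that a similar utterance vector shares a bucket with the right speaker in at least one of them.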
Collecting Speaker Vector Information
Unique information about a speaker’s characteristics is collected from videos, and from multiple videos rather than a single one.
When obtaining multiple speaker vectors, each of those will indicate unique characteristics of the speech of a respective speaker. Doing that can include:
- Accessing a set of multiple video resources
- Generating a speaker vector for each of the multiple video resources
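One simple way to picture generating a speaker vector per video is to average the utterance vectors extracted from that video. That averaging step is an assumption for illustration, not something this summary states, and the sketch below takes the per-utterance vectors as given rather than showing how they are extracted from audio. The video IDs and values are made up.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def speaker_vector(utterance_vecs):
    """Average length-normalized utterance vectors into one speaker vector (assumed method)."""
    normed = [unit(v) for v in utterance_vecs]
    dim = len(normed[0])
    mean = [sum(v[i] for v in normed) / len(normed) for i in range(dim)]
    return unit(mean)

# Hypothetical video resources, each with already-extracted utterance vectors.
videos = {
    "video_a": [[1.0, 0.0], [0.0, 1.0]],
    "video_b": [[3.0, 4.0]],
}
speaker_vecs = {video_id: speaker_vector(utts) for video_id, utts in videos.items()}
print(speaker_vecs)
```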
In another general aspect, a method includes: obtaining an utterance i-vector for an utterance; determining hash values for the utterance i-vector according to multiple different hash functions; determining a set of speaker i-vectors from a plurality of hash tables using the hash values; comparing the speaker i-vectors in the set with the utterance i-vector, and selecting a speaker i-vector based on comparing the speaker i-vectors in the set with the utterance i-vector.
This Speaker Identification Vectors patent can be found at:
Inventors: Matthew Sharifi, Ignacio Lopez Moreno, and Ludwig Schmidt
Assignee: Google LLC
US Patent: 10,565,996
Granted: February 18, 2020
Filed: June 1, 2016
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing speaker identification. In some implementations, data identifying a media item including the speech of a speaker is received. Based on the received data, one or more other media items that include the speech of the speaker are identified. One or more search results are generated that each reference a respective media item of the one or more other media items that include the speech of the speaker. The one or more search results are provided for display.
Speaker Identification Vectors Takeaways
The patent provides many more details about the importance of speaker identification, and the process behind it. It starts by telling us that some applications of speaker identification include:
- Authentication in security-critical systems
- Personalized speech recognition
- Searching for speakers in large corpora
I have summarized the summary from the patent, but there is a lot more detail behind this patent that is worth digging into. Speech recognition can be used in Web search, but can also be important to web security.
If you want to learn more, reading through the patent is recommended. There is also a white paper on this topic from the inventors named on this patent that is worth spending time with: Large-Scale Speaker Identification (pdf).
I have had discussions with several SEOs recently about the quality of transcripts at places such as YouTube, and the feedback I have been receiving is that transcripts have been improving significantly in quality. I don’t know if that would be because of efforts such as this Speaker Identification Vectors process, but it could be related.
Earlier this year, I wrote the post Quote Searching Updated at Google to Focus on Videos, which was about Google updating a patent to provide information about quotes by analyzing text from videos instead of relying upon finding information about those quotes in knowledge bases such as Wikipedia. If Google has spent time getting better at analyzing audio in videos (also referred to in this patent), that could explain why transcripts at places such as YouTube have improved in quality.