Audio Query Disambiguation at Google

Posted September 16, 2021



The next time I have questions about songs, I might play the Ramones or the Mekons and ask Google those questions on my phone. I wonder if it would make a difference. It seems it might.

Patenting Audio Query Disambiguation

Internet search engines provide information about Internet-accessible resources, such as web pages, documents, and images, in response to a searcher’s query.

A searcher can submit an audio query through a microphone on a device such as a mobile telephone.

Sometimes searchers submit audio queries to internet search engines that are ambiguous in that they relate to more than one concept and/or entity.

Disambiguating such audio queries is the problem this patent aims to address.

How Audio Query Disambiguation May Work

One innovative aspect of this audio query disambiguation patent can include the actions of:

  • Receiving a query with background audio
  • Identifying concepts related to the background audio
  • Generating a set of related terms related to the identified concepts
  • Retrieving search results based on the search query and at least one of the terms related to the identified concepts
  • Displaying the search results
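
The steps above can be sketched as a toy pipeline. Everything here — the lookup tables and function names such as `identify_concepts` and `disambiguate` — is a hypothetical stand-in for illustration, not Google’s actual implementation:

```python
# Toy end-to-end sketch of the patent's claimed actions.
# All data and helper names are invented for illustration.

def identify_concepts(background_audio: str) -> list[str]:
    # Stand-in: map a recognized audio snippet to concepts via a lookup table.
    concept_db = {"Rock You Like A Hurricane": ["Scorpions (band)", "rock music"]}
    return concept_db.get(background_audio, [])

def related_terms(concepts: list[str]) -> list[str]:
    # Stand-in conceptual expansion: each concept maps to related terms.
    expansion_db = {"Scorpions (band)": ["Klaus Meine", "Love at First Sting"]}
    terms = []
    for concept in concepts:
        terms.extend(expansion_db.get(concept, []))
    return terms

def disambiguate(query: str, background_audio: str) -> str:
    concepts = identify_concepts(background_audio)
    terms = related_terms(concepts)
    # Supplement the original query with at least one related term.
    return " ".join([query] + terms[:1]) if terms else query

print(disambiguate("scorpions", "Rock You Like A Hurricane"))
# → scorpions Klaus Meine
```

If no concepts are recognized, the sketch falls back to the unmodified query, mirroring the patent’s optional nature of the disambiguation step.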

A system can get configured to perform the actions of these audio query disambiguation methods. Computers can perform particular operations or actions by having software, firmware, hardware, or a combination of them installed on the system that, in operation, causes the system to perform the actions.

Optional features involving audio query disambiguation could include:

  • Providing the query to a search engine
  • Returning the scored results from the search engine
  • Altering a score for a search result containing at least one of the related terms
  • Deciding on an amount to alter the score by using a number of training queries
  • Providing the search results that satisfy a threshold score
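
A minimal sketch of the score-altering and threshold features above, assuming an invented result format (`text`/`score` dictionaries) and arbitrary boost and threshold values:

```python
# Boost results containing a related term, then keep only those above a
# threshold. Data shapes and numeric values are invented for illustration.

def rescore(results: list[dict], related_terms: list[str],
            boost: float = 0.5, threshold: float = 1.0) -> list[dict]:
    kept = []
    for result in results:
        score = result["score"]
        # Alter the score when the result text contains any related term.
        if any(term.lower() in result["text"].lower() for term in related_terms):
            score += boost
        # Provide only results that satisfy the threshold score.
        if score >= threshold:
            kept.append({**result, "score": score})
    return sorted(kept, key=lambda r: r["score"], reverse=True)

results = [
    {"text": "Scorpion arthropods of the desert", "score": 0.9},
    {"text": "Scorpions: Rock You Like A Hurricane", "score": 0.8},
]
print(rescore(results, ["Rock You Like A Hurricane"]))
# keeps only the boosted Scorpions result
```

Note how the arthropod result, though initially higher-scored, falls below the threshold because it contains no related term.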

Receiving search results based on the search query and at least one of the terms related to the identified concepts may include providing the query to a search engine together with the related terms and receiving results from the search engine.

Identifying Concepts Related to Background Audio

Identifying concepts related to the background audio may include recognizing at least a part of the background audio by matching it to an acoustic fingerprint and identifying concepts related to the background audio, including concepts associated with the acoustic fingerprint.
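
The fingerprint-matching step might be sketched like this. Real acoustic fingerprints are derived from spectrogram features; this toy version simply hashes raw sample tuples to show the shape of the lookup, and all data here is invented:

```python
# Toy acoustic-fingerprint lookup: hash an audio window, then match the
# hash against a fingerprint database. Real systems hash spectrogram peaks.
import hashlib

def fingerprint(samples: tuple) -> str:
    # Assumed stand-in for an acoustic fingerprint of an audio window.
    return hashlib.sha256(repr(samples).encode()).hexdigest()

# Invented database entry mapping a fingerprint to a known audio segment.
fingerprint_db = {
    fingerprint((0.1, 0.5, 0.3)): "Hotel California - Eagles",
}

def recognize(samples: tuple):
    # Returns the matched segment's identification, or None if unknown.
    return fingerprint_db.get(fingerprint(samples))

print(recognize((0.1, 0.5, 0.3)))  # → Hotel California - Eagles
```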

Also, generating a set of terms related to the background audio may include generating terms by querying a conceptual expansion database based on the concepts related to the background audio.

Technical Advantages of This Audio Query Disambiguation Approach

This patent can use background audio to clarify an ambiguous search query. Such implementations provide more accurate search results, thus achieving a technical advantage.

This audio query disambiguation patent is at:

Background audio identification for query disambiguation
Inventors: Jason Sanders, John J. Lee, and Gabriel Taubman
Assignee: GOOGLE LLC
US Patent: 11,023,520
Granted: June 1, 2021
Filed: January 10, 2019

Abstract

Implementations relate to techniques for providing context-dependent search results. The techniques can include receiving a query and background audio.

The techniques can also include identifying the background audio, establishing concepts related to the background audio, and obtaining terms related to the concepts related to the background audio.

The techniques can also include obtaining search results based on the query and at least one of the terms.

The techniques can also include providing the search results.

Searchers gain information about internet-accessible resources by submitting a query to a search engine using a client device, such as a mobile telephone equipped with a microphone.

An Example of Audio Query Disambiguation Using Background Audio

Sometimes the user’s query is ambiguous. For example, the user can submit the query “scorpions.”

The search engine cannot determine whether the user wants information on scorpion arthropods or the rock band Scorpions without more information.

However, if the search engine determined that the searcher submitted the query while listening to the Scorpions’ song “Rock You Like A Hurricane,” it could provide search results relevant to the rock band instead of those relevant to the arthropods.

Some implementations perform a search based on the searcher’s query and terms related to background audio around a time of the search query’s input. In this manner, implementations respond to search queries by taking into account the context of background audio. The context of background audio thus allows ambiguous queries to get matched to more relevant results.

The audio query disambiguation patent shows example implementations.

The patent includes a schematic diagram of an example implementation. The searcher provides voice input of a search query to a computer, or can provide search queries using a keyboard or other input device instead of voice input. In addition, a background audio source, for example, a radio or television, provides background audio, which the computer detects. The computer conveys the search query and background audio to the search system through a communications channel.

Both Search Query and Background Audio May Go To The Search Engine

This is how audio query disambiguation may take place. The search system receives both the search query and the background audio through communications channels from the computer. If necessary, the search system uses speech recognition to convert the search query to computer-readable form. The search system then uses the techniques disclosed in the patent to identify the background audio and retrieve terms related to it. For example, if the background audio is a popular song, the search system can retrieve a set of terms that includes other songs by the same artist, the name of the album on which the song appears, the names of the performers, etc.

The search system then executes a search query using at least one term related to the background audio. For example, the search system can supplement the search query with at least one term related to the background audio and submit the supplemented search query to a search engine for processing. As another example, the search system can execute the search using augmented match scores for search results that include terms related to the background audio.

Irrespective of the particular use of the terms related to the background audio, the search system obtains search results, which it conveys through the communications channel back to the computer for display or audio presentation to the searcher.

Background Audio May Permit Much-Improved Searching

The patent also includes a schematic diagram of the search system in an example implementation. The search system includes various components, and these components and their interaction allow improved searching that uses information provided by the background audio.

The operation of the search system involves the receipt of a search query and background audio data from a computer. The search query includes parameters for a search that specify the desired results. However, these parameters can be ambiguous, especially if the search query includes keywords that have multiple meanings.

Background audio data may include audio such as music or the soundtrack to a video program that, when identified, is associated with certain concepts that can help narrow the scope of a search query by reducing ambiguity or otherwise providing information that helps improve the quality of the results obtained by searching with the search query.

The search query may optionally get processed by a voice recognition module, in the case in which it is an audio query and needs to get converted into a textual query, such as a natural language question or a set of keywords. However, if a search query is received as text to begin with, such as from a keyboard, then a voice recognition module is not necessary for that implementation. The voice recognition module can receive audio speech input and convert it to computer-readable data, for example, ASCII or Unicode text, using conventional techniques.

The Audio Query Disambiguation Program Can Look at Background Audio Data

However, background audio data requires processing to help improve results for a search query. A background audio recognizer processes the background audio data, analyzes it, and determines whether it includes audio corresponding to a known audio segment.

One example of a known audio segment is audio from an existing media entity, such as the audio component of a television program or movie, or a piece of music. That said, implementations may generally use any identification drawn from analyzing the background audio.

Simple examples of identified background audio include dialogue from an episode of “The Simpsons” playing in the background, or the song “Penny Lane” playing in the background.

However, other implementations might take advantage of other identifications, such as recognizing participants’ voices in a background conversation or recognizing noises made by a certain type of animal.

Private Background Audio Information

Background audio sources may produce background audio that searchers may want to keep private or otherwise prefer not to have recorded and/or analyzed.

For example, the background audio may include a private conversation or other background audio that the searcher does not want to capture. Even background audio that may seem innocuous, such as a song playing in the background, may divulge information about the searcher that the searcher would prefer not to have made available to a third party.

Affirmative Consent To The Use of Background Audio Information

Because the background audio may include content that the searcher does not wish to have recorded and analyzed, implementations should provide the searcher with a chance to affirmatively consent to the receipt of background audio before receiving or analyzing audio from a background audio source.

Therefore, the searcher may need to indicate that they are willing to allow the implementations to capture background audio before recording background audio. For example, the computer may prompt the searcher with a dialog box or other graphical searcher interface element to alert the searcher with a message that makes the searcher aware that the computer is about to monitor background audio.

For example, the message might state, “Please authorize the use of background audio. Please note that information about background audio may get shared with third parties.” Thus, to ensure that background audio is gathered exclusively from consenting searchers, implementations should notify searchers that gathering background audio is about to begin.

Searchers should know that accumulated background audio information may get shared to draw conclusions based on the background audio. Only after the searcher is alerted to these issues and has affirmatively agreed that they are comfortable recording the background audio will background audio get gathered from a background audio source.

Furthermore, certain implementations may prompt the searcher again to ensure that the searcher is comfortable with recording background audio if the system has remained idle for a period of time, as the idle time may indicate that a new session has begun and prompting again will help ensure that searcher is aware of privacy issues related to gathering background audio and is comfortable having background audio recorded.

Where the systems here collect or may use personal information about searchers, the searchers are provided with an opportunity to control whether programs or features collect personal information (information about a searcher’s social network, social actions or activities, profession, preferences, or current location), or whether and how to receive content from the content server that may be more relevant to them.

In addition, certain data can get anonymized in one or more ways before it is stored or used so that personally identifiable information is removed. For example, a searcher’s identity may get anonymized so that no personally identifiable information can get determined for the searcher, or a searcher’s geographic location can get generalized where location information is obtained (such as to a city, ZIP code, or state level) so that a particular location of a searcher cannot get determined. Thus, the searcher may control how information is collected about them and used by a content server.
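
As a toy illustration of the location-generalization idea, assuming a simple comma-separated address format (the patent does not specify any format; this is purely hypothetical):

```python
# Generalize a location string before storage: keep only the city/state,
# drop the street address. The address format here is an assumption.

def generalize_location(location: str) -> str:
    parts = [p.strip() for p in location.split(",")]
    # Keep the last two components (city and state), discard the rest.
    return ", ".join(parts[-2:])

print(generalize_location("123 Main St, Springfield, IL"))
# → Springfield, IL
```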

Identifying An Audio Sample Using Conventional Techniques

The background audio recognizer is capable of identifying an audio sample using conventional techniques. For example, the background audio recognizer accepts as input data reflecting a query audio sample, uses that information to match the query audio sample to a known audio sample, and outputs an identification of the known audio sample. The background audio recognizer thus includes or is coupled to a database storing data reflecting many audio samples, e.g., songs, television program audio, etc.

Example data reflecting an audio sample can include a spectrogram of the sample or derivations of a spectrogram of the sample, e.g., a hash of part of the spectrogram. The spectrogram can include or get represented by, for example, identified peaks, e.g., local maxima, in a frequency domain.
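
The “identified peaks” idea can be sketched on a toy one-dimensional magnitude spectrum (real spectrograms are two-dimensional, over time and frequency; the data here is invented):

```python
# Find local maxima in a toy magnitude spectrum. Fingerprinting systems
# typically hash constellations of such peaks; this shows only the
# peak-picking step.

def find_peaks(magnitudes: list) -> list:
    # A bin is a peak if it strictly exceeds both neighbors.
    return [i for i in range(1, len(magnitudes) - 1)
            if magnitudes[i] > magnitudes[i - 1]
            and magnitudes[i] > magnitudes[i + 1]]

spectrum = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1]
print(find_peaks(spectrum))  # → [1, 4]
```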

One way a background audio recognizer can recognize background audio data is to use an acoustic fingerprint database. The acoustic fingerprint database may communicate with the background audio recognizer to process background audio data, produce fingerprints that represent features of the data, and match those fingerprints to other fingerprints in the acoustic fingerprint database.

For example, a background audio recognizer may receive background audio data and generate fingerprints based on it. By using those fingerprints as a query into the acoustic fingerprint database, the background audio recognizer can reach a conclusion, such as that an audio snippet of the Eagles’ “Hotel California” is playing in the background.

After the background audio recognizer recognizes the background audio data, it produces recognized background audio. In the next stage, the search system processes the recognized background audio through a conceptual expander.

The Role of the Conceptual Expander in Audio Query Disambiguation

The role of the conceptual expander is to take recognized background audio and use its identification information to produce terms that can influence the query to improve the results. The conceptual expander can return, in response to an identification of an audio sample, terms related to that sample. Thus, the mapping engine can get coupled to a relational database and map an identification of audio samples to terms related to the audio sample in the database.

The conceptual expander may use a variety of approaches to produce terms. For example, the identification of the recognized background audio can get used as a query to search a conceptual expander database, an information repository that maps recognized background audio to related terms. A conceptual expander database may accomplish this task in several ways, depending on its contents.

For example, the conceptual expander database may include various documents, such as articles related to various topics. These documents may get mined for keywords related to concepts based on the background audio. For example, suppose that the search query is “Data Android” and background audio data includes an audio snippet of the theme song from “Star Trek: The Next Generation.”

Background audio recognizers can identify recognized background audio and use this information with a conceptual expander. For example, the conceptual expander database might include an article about “Star Trek: The Next Generation” that could get mined to produce terms indicative of “Data Android” in that context, such as “Brent Spiner” or “Lieutenant Commander,” rather than other senses of the terms “Data” and “Android.”
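
The article-mining idea might look something like this. The article text, stopword list, and term count are all invented for illustration; real expanders would use far richer extraction:

```python
# Rank terms from a conceptual-expander article by frequency after
# removing stopwords. All text and lists here are invented.
from collections import Counter
import re

article = ("Brent Spiner plays Data, an android and Lieutenant Commander "
           "aboard the Enterprise. Data the android is played by Brent Spiner.")
stopwords = {"the", "an", "and", "is", "by", "plays", "played", "aboard"}

# Tokenize, lowercase, and drop stopwords.
words = [w for w in re.findall(r"[A-Za-z]+", article.lower())
         if w not in stopwords]

# The most frequent remaining words become candidate expansion terms.
top_terms = [term for term, _ in Counter(words).most_common(3)]
print(top_terms)
```

A query for “Data Android” could then get supplemented with terms like these, steering results toward the Star Trek sense of the words.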

An Information Repository That Can Serve In The Conceptual Expander’s Role

One example of an information repository that can serve in the conceptual expander’s role is an interconnected network of concepts, for example, a comprehensive collection of real-world entities, such as people, places, things, and concepts, along with the relationships and factual attributes that describe them. Examples of such networks include the Google Knowledge Graph or Wikipedia. These networks describe entities that are related to literals in specific ways. As discussed above, recognized background audio may include information about terms related to background audio.

If the conceptual expander uses such a network of concepts, it becomes possible to use the terms to identify entities and related literals that can get considered in query disambiguation. For example, suppose that the recognized background audio is a clip from the “Men in Black” theme song, sung by Will Smith. The network of concepts may serve in the role of conceptual expander based on this information, suggesting certain entities as relevant based on the recognized background audio. For example, “Will Smith” and “Men in Black” are derived from the recognized background audio.

Based on these entities, the network of concepts can then provide literals related to these entities, defined by a schema. For example, the network of concepts can provide the date “Sep. 25, 1968” as having the “date of birth” relationship to “Will Smith,” or “Tommy Lee Jones” as having a “lead actor” relationship to “Men in Black.” Because the network of concepts is a repository of entities associated with related literals, the network is well-suited, to begin with, entities derived from recognized background audio and suggest related literals as terms that expand the concepts and improve query performance.
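
A minimal stand-in for such a network of concepts, using the Will Smith example above (the schema and data here are illustrative only, not the Knowledge Graph’s actual API):

```python
# Entities mapped to (relationship, literal) pairs, mimicking a network
# of concepts. Data and schema are invented for illustration.

knowledge_graph = {
    "Will Smith": [("date of birth", "Sep. 25, 1968")],
    "Men in Black": [("lead actor", "Tommy Lee Jones")],
}

def expand(entities: list) -> list:
    # Collect literals related to each recognized entity; these become
    # candidate terms for expanding the query.
    return [literal for entity in entities
            for _, literal in knowledge_graph.get(entity, [])]

print(expand(["Will Smith", "Men in Black"]))
# → ['Sep. 25, 1968', 'Tommy Lee Jones']
```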

Once the conceptual expander produces terms, the search query (as recognized by the voice recognition module, if necessary) can get combined with those terms to produce a modified search query, and results can then get obtained using a search engine. This can get done either by concatenating terms to the search query to produce the modified search query, or by filtering the results of a search on the search query as recognized by the voice recognition module. Continuing the previously discussed example, a modified search query might take “Data Android” and add terms to search for “Lieutenant Commander Data Android.”

The Search Engine May Contain A Scoring Engine Too

Additionally, the search engine can filter results from searching on the modified search queries. For example, before the search engine returns results to a computer, it may filter them using the related terms, so that only search results that include “Brent Spiner” might get provided as results. Search engines may include scoring engines in one implementation, and scoring engines may score results based on the relationship between results and terms.
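
The filtering variant can be sketched in a few lines, keeping only results containing at least one related term (“Brent Spiner” in the example above; the result strings are invented):

```python
# Post-search filtering: keep only results containing a related term.
# Result titles here are invented for illustration.

def filter_results(results: list, related_terms: list) -> list:
    return [r for r in results
            if any(term.lower() in r.lower() for term in related_terms)]

hits = ["Brent Spiner interview", "Android phones review"]
print(filter_results(hits, ["Brent Spiner"]))
# → ['Brent Spiner interview']
```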

Thus, the search system includes a search engine. The search engine can include a web-based search engine, a proprietary search engine, a document lookup engine, or a different response engine. The search engine may include an indexing engine, which indexes a large part of the internet, and may further include scoring engines. When the search engine receives a search query, it matches the query against the index to retrieve search results; the scoring engine attributes a score to each such search result, and the search engine ranks, e.g., orders, the search results based on the scores. The search engine is thus capable of returning search results in response to a search query.

The search system conveys the search query to the search engine. If the searcher supplied the search query in computer-readable form, the search system conveys the query directly to the search engine. If the searcher supplied the search query as voice input, the search system conveys it to the voice recognition module to obtain the search query in computer-readable form, and may then provide the computer-readable search query to the search engine.

The search engine processes the search while taking into account at least one term related to the background audio. There are several ways that the system can accomplish this.

In one approach, the search engine supplements the search query with the term, or terms, related to the audio and obtains search results based on this supplemented search query.

In another approach, the search engine matches only the search query terms, but considers both the search query terms and the terms related to the background audio for scoring purposes using the scoring engine. The terms related to the background audio are not used for matching purposes.

In a third approach, the search engine processes the query while considering the terms related to the background audio by augmenting the scores attributed to search results that match the search query and contain the term, or terms, related to the background audio. The scoring engine may add an amount to the scores of such search results.

Regardless of the particular technique the search engine uses to obtain search results, the search system sends the search results back to the computer, which displays them to the searcher or outputs an audio rendering of the search results through a speaker.

The patent includes a schematic diagram of a computer in an example implementation, illustrating various hardware and other resources used in implementations. The computer can be a mobile telephone, a personal digital assistant, a laptop computer, a desktop computer, or another computer or hardware resource. It is communicatively coupled to the search system through a communications channel by way of an interface.

The interface includes components of the computer that allow it to interact with other entities such as the search system. The communications channel can include, in any combination, a cellular channel, the internet, a network, or a wired or wireless data connection.

Thus, a computer user provides a query to the computer using, for example, a microphone or input device. The computer also receives, through a microphone, any background audio.

A Flowchart of Query Disambiguation

A searcher sends a query to the search engine using a microphone. Alternatively, the searcher can supply the query using an input device.

The computer obtains background audio using, for example, a microphone. The computer can gather such background audio while the searcher enters or submits a search query, whether the searcher enters the query as voice input or in computer-readable form. Alternatively, the computer can gather background audio in a time interval after the searcher has submitted the search query, detecting the background audio immediately after the searcher requests that the search query get executed, such as by activating an ENTER key or otherwise providing input that query entry is complete.

The time interval can be any period from 0.1 seconds to 10 seconds. The computer can determine that the searcher has finished entering the search query as voice input by detecting a drop below a threshold volume level. The computer can also gather background audio both before and after the searcher submits the search query.
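
The volume-threshold detection can be sketched as follows, assuming a precomputed list of per-window volume levels (real systems work on streaming audio frames; the values here are invented):

```python
# Detect the end of a spoken query: the first window whose volume drops
# below a threshold marks the end of voice input. Window volumes are
# assumed precomputed; values are invented for illustration.

def end_of_query(volumes: list, threshold: float = 0.2):
    for i, volume in enumerate(volumes):
        if volume < threshold:
            return i  # background-audio gathering could begin here
    return None  # the searcher is still speaking

print(end_of_query([0.8, 0.7, 0.6, 0.1, 0.05]))  # → 3
```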

The background audio recognizer of the voice recognition system identifies concepts associated with the background audio to produce recognized background audio. One way this may occur is for the background audio recognizer to search an acoustic fingerprint database using the background audio data to identify the nature of the background audio and the corresponding related concepts.

The conceptual expander obtains terms related to the recognized background audio. This may occur by using the recognized background audio to search a conceptual expander database that provides terms associated with the concepts from the recognized background audio.

If the background audio is a song, such related terms can include the:

  • Song title
  • Song lyrics
  • Performing artist
  • Composer
  • Album
  • Titles for the other songs on the same album
  • Any other related information

If the background audio is a television program or movie audio segment, such related terms can include the actors, the producers, the title of the program or movie, the network, and any parts from a transcript of the program or movie.

However, these are only example terms, and the conceptual expander database may include other terms suggested based on the identity of recognized background audio.

The search engine obtains search results based on the search query and at least one related term. The number of related terms that the search engine takes into account can vary, for example, from 1 to 15, but there is no absolute limit on the number of related terms. The search engine may execute the search multiple times, each time with an additional related term, and may retain only those search results whose score, as attributed by the scoring engine, exceeds a threshold. There are several ways in which the search engine can obtain search results based on the query.

The search engine may supplement the search query with at least one term related to the background audio and conventionally execute the supplemented query to obtain an ordered set of search results.

The Search Engine Matches The Search Query To Search Results With An Indexing Engine

The search engine may match the query to search results using an indexing engine, requiring all search query terms to be present in each search result. The search engine allows, but does not require, at least one term related to the background audio in the matching search results. All terms, including those in the search query and those related to the background audio, count toward the score attributed to the matching results by the scoring engine.

In other words, the search engine processes the search query together with at least one term related to the background audio, requiring the terms from the search query to be present in the search results while using the term, or terms, related to the background audio as optional search terms. The search engine then orders the results according to score, thus obtaining search results for the search query while taking the background audio into account.

Alternatively, the search engine can augment the scores attributed to results that match the search query and contain at least one term related to the background audio. In implementations that use this technique, the search engine matches the search query to search results using an indexing engine. The scoring engine attributes a score to each such matched search result and then adds an amount to the scores of search results that also contain at least one term related to the background audio.

After adjusting each score as appropriate, the search engine ranks the results by score, thus obtaining search results for the search query while considering how the scoring engine adjusts scores for search results that include the term, or terms, related to the background audio. The amount of the adjustment can get set by fiat or by a learned approach. Several learning techniques are possible.

The increased scores can be learned by obtaining a large set, such as 100-100,000, of ambiguous queries, each followed by a related unambiguous query.

For each query pair, the audio query disambiguation technique calculates the difference between the score of the highest-ranked search result for the unambiguous query and the score attributed to the same search result for the ambiguous query. The audio query disambiguation technique then takes the average of these differences as the amount by which to increase search result scores.
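
This first learning technique reduces to averaging score gaps. A minimal sketch, with invented score pairs (each pair holds the score of the unambiguous query’s top result and the score that same result received under the ambiguous query):

```python
# Average the score difference across query pairs to learn the boost.
# All score values are invented for illustration.

def learn_boost(pairs: list) -> float:
    # Each pair: (score under the unambiguous query,
    #             score of that same result under the ambiguous query)
    diffs = [unambiguous - ambiguous for unambiguous, ambiguous in pairs]
    return sum(diffs) / len(diffs)

pairs = [(0.9, 0.6), (0.8, 0.5), (1.0, 0.6)]
print(learn_boost(pairs))  # ≈ 0.333
```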

Another audio query disambiguation technique for learning the amount to add utilizes a large set, e.g., 100-100,000, of ambiguous/unambiguous query pairs together with background audio for each pair. This audio query disambiguation technique determines the amount of score increase, denoted X, that maximizes the proportion of the large set of query pairs for which the following two conditions are satisfied:

  • (1) the highest-ranked ambiguous query search result does not appear in the top N unambiguous query search results, where N is fixed at any number from 1 to 25
  • (2) adding X to the score of each of the ambiguous query search results that contain terms related to the background audio causes the highest-ranked ambiguous query search result to change as a result of such addition(s)

The technique selects the value of X that maximizes the proportion of the query pairs for which conditions (1) and (2) are satisfied. To select such an X, the technique can use, for example, an exhaustive search, a gradient descent, or another methodology.
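
An exhaustive-search sketch for choosing X, checking only condition (2) for simplicity (condition (1) is assumed already satisfied when building the pair set; all scores, flags, and candidate values are invented):

```python
# Try candidate boosts X and keep the one that flips the most
# ambiguous-query top results. Data is invented for illustration.

def best_boost(pairs: list, candidates: list) -> float:
    def flips(pair, x):
        # Parallel lists for one ambiguous query: result scores, and
        # whether each result contains a background-audio term.
        scores, has_term = pair
        boosted = [s + x if t else s for s, t in zip(scores, has_term)]
        # Did the top-ranked result change after boosting?
        return boosted.index(max(boosted)) != scores.index(max(scores))

    # Exhaustive search: the X flipping the most pairs wins.
    return max(candidates, key=lambda x: sum(flips(p, x) for p in pairs))

# One query: result 0 leads at 0.9; result 1 (0.7) contains a related
# term, so any boost above 0.2 flips the ranking.
pairs = [([0.9, 0.7], [False, True])]
print(best_boost(pairs, [0.1, 0.3]))  # → 0.3
```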

Implementations may rely on learning to set or adjust the score increase amount, and the learning process for audio query disambiguation can become ongoing, for example, as the system receives search queries from searchers.

The search system sends the results to the computer using a communications channel. The computer receives the search results and presents them to the searcher using, for example, a display and speaker.

Systems capable of performing the disclosed audio query disambiguation can take many different forms. Further, the functionality of one part of the system can substitute for another part of the system. Each hardware component can include processors coupled to random access memory operating under control of, or in conjunction with, an operating system.

The search system can include network interfaces to connect with clients through a network. Such interfaces can include servers. Further, each hardware component can include persistent storage, such as a hard drive or drive array, storing program instructions to perform the techniques disclosed herein. That is, such program instructions can serve to perform techniques as disclosed. Other configurations of the search system, computer, associated network connections, and other hardware, software, and service resources are possible.

