Textual Image Queries and Classification of Images at Google

Posted August 14, 2020


Google may categorize images using object recognition to determine what to show of an image or may analyze text that appears in an image to create categories. Those categories can determine what an image may rank for in image search results.

(Make sure you check out the image optimization categorization takeaways at the end of this post.)

Textual Image Queries Based on Classification of Images into Categories

I recently wrote about How Google Might Rank Image Search Results. That post was about Google using a machine learning approach looking at features from images, image search queries, and landing pages for images.

It’s not unusual to see related patents from Google being granted around the same time.

It’s not surprising to see another recent patent from Google focusing on image search, and how Google may classify images and use classifications to generate image search results.

Google does offer a reverse image search where you can search for images using photographs and other images. This new patent is about returning image search results for textual image queries. The patent tells us how Google processes those textual image queries.

This image query processing system classifies images and uses those image categories to decide which images to return in response to a textual image query.

I performed a reverse image search for the Torpedo Factory Art Center, and Google generated a textual search for “torpedo factory art center” with my reverse image search.

textual images queries example

This image query processing system may access a database of images stored on a device. Each image may be associated with a category from many different image categories.

image query processing system

When the image query processing system receives a textual image query from a searcher, it may decide on one or more categories from many different image categories that are likely to include images that are responsive to the textual image query.

The textual image queries processing system may then analyze images in the categories to choose images responsive to the textual image query and display the selected images in search results.

Here’s an overview of part of this category process:

  • Determine, using the textual query, an image category for responsive images and an output type that identifies the type of requested content
  • Select, using data that associates many images with corresponding categories, a subset of the images that each belong to that image category
  • Analyze, using the textual query, data for the images in the subset to determine which images are responsive to the query
  • Determine a response to the textual image query using the responsive images
  • Provide, using the output type, the response to the textual query for presentation
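The steps above can be sketched as a minimal pipeline. This is a hypothetical illustration, not code from the patent; the image database, category names, and matching rules are all invented for the example.

```python
# Minimal sketch of the category-based image query pipeline.
# The database, categories, and matching rules are all hypothetical.

IMAGE_DB = [
    {"id": 1, "category": "receipt", "text": "Luigi's Pizza total $42.10"},
    {"id": 2, "category": "cityscape", "text": "Luigi's Pizza storefront"},
    {"id": 3, "category": "document", "text": "Quarterly report"},
]

def determine_category_and_output(query):
    """Map a textual query to a likely image category and output type."""
    if "cost" in query or "spend" in query:
        return "receipt", "total_cost"
    return "cityscape", "image"

def answer_query(query, db=IMAGE_DB):
    category, output_type = determine_category_and_output(query)
    # Select only the subset of images in the chosen category...
    subset = [img for img in db if img["category"] == category]
    # ...then analyze just that subset against the query terms.
    terms = [t for t in query.lower().split() if len(t) > 4]
    responsive = [img for img in subset
                  if any(t in img["text"].lower() for t in terms)]
    return {"output_type": output_type, "images": [i["id"] for i in responsive]}

print(answer_query("how much did I spend on pizza"))
# {'output_type': 'total_cost', 'images': [1]}
```

Restricting the scan to a category subset, rather than every stored image, is what the patent credits with reducing the amount of image data analyzed per query.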

image categories

The categories may include an image category for images responsive to the textual query.

Responding to the textual query with responsive images may include selecting, for each image, the part of the image that depicts data responsive to the query.

Responding may also include generating instructions for a user interface that emphasizes, for each responsive image, the part of the image that depicts the responsive data.

Providing the response using the output type may include generating instructions for an audible presentation of the responsive data and providing those instructions to a speaker, which then plays the audible presentation.

Using Key Phrases From the Textual Image Queries

The method may involve determining one or more key phrases for the textual query and using both the query and those key phrases to analyze data for the images in the subset and determine which images are responsive to the query.

Selecting the part of the image that depicts data responsive to the textual query may include determining a bounding box for the image that surrounds the data responsive to the textual query, and selecting the part of the image defined by the bounding box.
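A bounding-box selection like the one just described might look like this in code; the coordinate convention (left, top, right, bottom) is an assumption, since the patent does not specify one.

```python
def select_responsive_part(image_size, bbox):
    """Clamp a bounding box (left, top, right, bottom) to the image
    bounds and return the region to present instead of the full image."""
    width, height = image_size
    left, top, right, bottom = bbox
    return (max(0, left), max(0, top), min(width, right), min(height, bottom))

# A box that spills past the image edges is clamped to fit.
print(select_responsive_part((800, 600), (100, -20, 900, 300)))
# (100, 0, 800, 300)
```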

Cropping Content From an Image that is Not Responsive to a Textual Image Query

Selecting the part of the image that depicts data responsive to the textual query may include cropping, for at least one of the images responsive to the textual query, the image to remove content that is not responsive to the textual query.

Cropping the image to remove content that is not responsive to the textual query may include cropping the image so that the data responsive to the query occupies a fixed size or percentage of the cropped image.

The method may include determining the percent of the cropped image using context depicted in the image.

Determining the percent of the cropped image using context depicted in the image may include determining the percent of the cropped image using at least one of the data responsive to the query depicted in the image, text depicted in the image, or a boundary of an object depicted in the image.
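The percentage-based crop reduces to simple arithmetic: if the responsive region should fill fraction p of the crop, the crop's area must be the region's area divided by p, so each side scales by the square root of 1/p. A sketch, where the function name and the symmetric expansion around the box's center are my assumptions:

```python
import math

def crop_for_percent(bbox, percent, image_size):
    """Expand a bounding box so the responsive data covers roughly
    `percent` of the cropped image, clamped to the image bounds."""
    left, top, right, bottom = bbox
    w, h = right - left, bottom - top
    scale = math.sqrt(1.0 / percent)  # area grows by a factor of 1/percent
    new_w, new_h = w * scale, h * scale
    cx, cy = (left + right) / 2, (top + bottom) / 2
    img_w, img_h = image_size
    return (max(0, cx - new_w / 2), max(0, cy - new_h / 2),
            min(img_w, cx + new_w / 2), min(img_h, cy + new_h / 2))

# A 100x100 box cropped so it fills 25% of the result becomes 200x200.
print(crop_for_percent((200, 200, 300, 300), 0.25, (1000, 1000)))
# (150.0, 150.0, 350.0, 350.0)
```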

Determining Responses to Textual Image Queries Based on the Number of Images Responsive to Queries

Generating instructions for the presentation of an image may include determining an output format using a quantity of the images responsive to the textual query or the output type or both, and generating the instructions for the presentation of the user interface using the output format.

Determining the output format may include determining that a single image from an image database depicts data responsive to the textual query; and in response to determining that a single image from the image database depicts data responsive to the textual query, selecting an output format that depicts, in the user interface, only data from the image.

Determining the output format may include determining that many images from the plurality of images depict data responsive to the textual query; and in response to determining that many images from the plurality of images depict data responsive to the textual query, selecting a summary output format that depicts, in the user interface:

  • a) a summary of the data responsive to the textual query from the many images and
  • b) data from each of the many images

Generating the instructions for the presentation of the user interface using the output format may include generating the instructions for the presentation of the user interface that includes

  • a) the summary of the data responsive to the textual query and
  • b) the data from each of the many images

The summary output format may include the summary above the data for each of the many images. The summary may include a list of the data responsive to the textual query from the many images.

The user interface may include navigation control enabling a user to scroll through the presentation of the data from each of the many images.
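The single-image versus summary distinction above amounts to a branch on the number of responsive images. A hedged sketch — the field names and return structure are invented:

```python
def choose_output_format(responsive_images):
    """Pick a presentation format based on how many images are responsive:
    a lone image is shown directly; multiple images get a summary list
    displayed above the per-image data."""
    if len(responsive_images) == 1:
        return {"format": "single", "data": responsive_images[0]}
    return {
        "format": "summary",
        "summary": [img["name"] for img in responsive_images],
        "data": responsive_images,
    }

print(choose_output_format([{"name": "Receipt A"}])["format"])          # single
print(choose_output_format([{"name": "A"}, {"name": "B"}])["summary"])  # ['A', 'B']
```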

Analyzing Image Data Using Object Recognition to Determine an Image Category for an Image

The method may include, for each of two or more images:

  • Analyzing image data for the image using object recognition to determine an initial image category from the available categories
  • Determining whether the initial image category is in a particular group of image categories
  • For an image whose initial category is in that group, using the initial category as the image category for the image
  • For an image whose initial category is not in that group, analyzing the image data using text recognition to determine a second image category, then determining the image category using both the initial and second categories
  • Storing data in a database that associates each image with its image category

The method may also include, for each image, receiving the image data before it is stored in an image database, storing the image data there, and analyzing the image data in response to receiving it.

Providing instructions to cause a display to present the user interface and at least one responsive image may include providing instructions that cause the display to present an answer to the textual query in the user interface.
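The two-stage classification just described — trust the object classifier when its category falls in a known group, otherwise fall back to text recognition — can be sketched as follows. The category group and the stub classifiers are hypothetical stand-ins for real models.

```python
# Categories the object classifier is trusted to assign on its own
# (hypothetical group; the patent does not enumerate it).
OBJECT_CATEGORIES = {"landscape", "cityscape", "buildings", "monuments"}

def classify(image, object_classifier, text_classifier):
    initial = object_classifier(image)
    if initial in OBJECT_CATEGORIES:
        return initial  # the object category alone is enough
    # Otherwise the image likely shows text; refine with text recognition.
    second = text_classifier(image)
    return second if second != "unknown" else initial

# Stub classifiers standing in for real ML models.
obj = lambda img: "text" if img["has_text"] else "cityscape"
txt = lambda img: "receipt" if "total" in img["words"] else "unknown"

print(classify({"has_text": True, "words": ["total", "$12"]}, obj, txt))  # receipt
print(classify({"has_text": False, "words": []}, obj, txt))               # cityscape
```

Because the resulting category is stored with the image, the system need not rerun either model every time a textual query arrives.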

    Generating Image Categories for Images Responsive to a Textual Query

Determining, using a textual query, a category for images responsive to the textual query, and an output type identifying the type of requested content.

Selecting, using data that associates a plurality of images with a corresponding category, a subset of the images that each belong to the image category, where each image in the plurality belongs to one of two or more categories.

    Analyzing, using the textual query, data for the images in the subset of the images to determine images responsive to the textual query.

    Selecting, for each image responsive to the textual query using the output type, a part of the image that depicts data responsive to the textual query.

    Generating instructions for an audible presentation of the data responsive to the textual query.

    Providing instructions to a speaker to cause the speaker to provide the audible presentation of the data responsive to the textual query.

    Generating the instructions may include generating, for at least one of the images responsive to the textual query, instructions for an audible presentation of the data that are responsive to the textual query, and that state a location of the part of the image that depicts the data responsive to the query.

    Providing instructions to the speaker to cause the speaker to provide, for the at least one of the images responsive to the textual query, the audible presentation of the data responsive to the textual query and that state a location of the part of the image that depicts the data responsive to the query.

    The method may include generating instructions for the presentation of a user interface that emphasizes, for each image responsive to the textual query, the part of the image that depicts the data responsive to the textual query; and providing the instructions to a display to cause the display to present the user interface and at least one of the images responsive to the textual query.

    Advantages From Following the Process Described in the Textual Image Queries Patent

The patent points out what the inventors perceive as the advantages of following the process they describe:

1. Using image categories when searching images may reduce the amount of image data analyzed for data responsive to a search query.
2. Presenting part of an image (rather than the entire image) in response to a query can reduce network bandwidth and the amount of content processed.
3. An image query processing system can classify an image with an object recognition process, a text recognition process, or both. Using a single process, when its accuracy satisfies a threshold, may reduce the processing resources used (e.g., computation time). Using both processes can improve classification (e.g., when the accuracy of a single process does not meet the threshold). Presenting image content alongside text-based results may improve the accuracy of the system (e.g., by enabling separate validation of responses).
4. Using a query processing system with both an image category selector and a key phrase device may enable it to determine responsive data that other systems could not.
5. A system that classifies an image with an object recognition process, a text recognition process, or both can classify an image once and does not need to reclassify it each time the system receives a textual query for image data.
6. A system with multiple classifiers (e.g., an object recognition classifier and a text recognition classifier) only needs to update the classifications of images classified by a particular classifier, not all images, when that classifier is updated. For instance, a system that classifies a first set of images with an object classifier and a second set with a text classifier would only need to redetermine classifications for the second set when the text classifier is updated.

    The Textual Image Queries Patent is at:

    Image analysis for results of textual image queries
    Inventors: Gokhan H. Bakir, Marcin Bortnik, Malte Nuh, Kavin Karthik Ilangovan
    Assignee: Google LLC
    US Patent: 10,740,400
    Granted: August 11, 2020
    Filed: August 28, 2018

    Abstract:

    Methods, systems, and apparatus, including computer programs encoded on computer storage media, for analyzing images for generating query responses. One of the methods includes determining, using a textual query, an image category for images responsive to the textual query, and an output type that identifies a type of requested content; selecting, using data that associates a plurality of images with a corresponding category, a subset of the images that each belong to the image category, each image in the plurality of images belonging to one of the two or more categories; analyzing, using the textual query, data for the images in the subset of the images to determine images responsive to the textual query; determining a response to the textual query using the images responsive to the textual query; and providing, using the output type, the response to the textual query for presentation.

    Using Different Classifiers to Create Categories for Images

    An image query processing system associates images with at least one category to determine image search results responsive to a query.

    The image query processing system may classify each image of many images based on content depicted in the images.

    Some image categories may include:

    • Landscape
    • Cityscape
    • Buildings
    • Monuments
    • Outer space
    • Receipts
    • Menu
    • Ticket
    • Presentation
    • Document

    This textual image queries processing system could use more than one classifier to classify images into image categories.

    • Object Classification – An object classification system may analyze an image based on objects in the image to decide on an image category.
    • Text classification – a text classification system analyzes an image based on words depicted in an image.
    • A Combined Classification – The image query processing system may use output from the object classification system, the text classification system, or both to determine a final image category for an image.

    Object Classification of Images

    The object classification system may analyze images using any appropriate process.

    For instance, object classification may use machine learning to detect objects depicted in images and a likely image category for an image based on detected objects within the image.

    The patent provides some examples. When the object classification system decides:

    • An image with office buildings, the object classification system may assign the category of “cityscape” to the image.
    • An image with text, the object classification system may assign a category of “unknown” or “further processing required” or “text” as the image category for the image.

    Text Classification of Images

    The text classification system may use a potential type of image category determined by the object classification system.

    When the image query processing system receives data from the object classification system that an image likely shows text, the image query processing system may cause the text classification system to analyze the image to determine an image category for the image based on the depicted text.

    The text classification system may determine some of the words depicted in the image and use the words to determine the image category.

    In some examples, the text classification system may determine an image category based on a layout of the text depicted in an image.

    The text classification system can use optical character recognition to identify text depicted in an image.

The optical character recognition can use any appropriate process to detect text depicted in an image.

    It can:

    • De-skew the image
    • Convert the image to grayscale
    • Perform character isolation to detect characters depicted in the image.

    Optical character recognition can use the detected characters to determine words, phrases, sentences, or a combination of these, depicted in the image.
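Of these steps, the grayscale conversion is the most concrete: a standard approach is the ITU-R BT.601 luma formula, a weighted sum of the RGB channels. The patent names the step but not the formula, so this particular choice is an assumption.

```python
def to_grayscale(pixel):
    """Convert an (R, G, B) pixel to a gray level using BT.601 luma
    weights, which reflect the eye's differing channel sensitivities."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

print(to_grayscale((255, 255, 255)))  # 255 (white stays white)
print(to_grayscale((255, 0, 0)))      # 76  (pure red is fairly dark)
```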

    The text classification system can use the determined words to select an image category for the image. Examples:

    Receipt – For instance, when the text classification system determines that an image includes many line items, each with a corresponding cost, and a total cost toward the bottom of the image, the text classification system may assign an image category of “receipt” to the image.

    Presentation – When the text classification system determines that a landscape-oriented image or many associated images include page numbers and a presenter’s name, the text classification system may assign “presentation” to the image or images as the image category.
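Heuristics like the receipt and presentation examples above can be expressed as simple rules over the recognized text. This sketch invents concrete thresholds (two price lines, a "total" near the bottom) that the patent does not specify:

```python
import re

def classify_by_text(lines, orientation="portrait"):
    """Rule-of-thumb text classifier mirroring the patent's examples."""
    # Receipt: several "item ... $price" lines plus a total near the bottom.
    price_lines = [l for l in lines if re.search(r"\$\d+\.\d{2}", l)]
    has_total = any("total" in l.lower() for l in lines[-3:])
    if len(price_lines) >= 2 and has_total:
        return "receipt"
    # Presentation: landscape orientation with bare page numbers.
    has_page_numbers = any(re.fullmatch(r"\d+", l.strip()) for l in lines)
    if orientation == "landscape" and has_page_numbers:
        return "presentation"
    return "unknown"

print(classify_by_text(["Coffee  $3.50", "Bagel  $2.25", "Total  $5.75"]))
# receipt
```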

The image query processing system can associate the image with the determined image category using data in a database (e.g., a category database).

    The category database may store images, data that associates images with respective image categories, or both.

For instance, when the image query processing system receives an image (e.g., from a camera included in the device), it can determine an image category for the image (e.g., using the object classification system, the text classification system, or both).

    The image query processing system can then store data in the category database that associates the image with the determined image category.

    The image query processing system may store the image in the category database.

    When the image query processing system receives a query for data from some of the many images, the image query processing system may analyze the query to determine an image category likely to have data responsive to the query, an output type requested by the query, one or more keywords included in the query, or a combination of two or more of these.

For example, at time T0, the image query processing system may receive the query "what restaurant did I go to last Wednesday" from a user.

    The query may be any appropriate type of textual query.

The query may be a spoken query (e.g., one that is converted into text).

The query may be typed (e.g., using touch input, a keyboard, or both).

Other examples of queries include "show the business card from bob," "how many pictures did I take when camping last week?", or "what was the name of the person I interviewed last Friday?" (e.g., when the image query processing system includes an image of the person, their resume, or both).

    The image query processing system can provide the query to a query processing system, included in the image query processing system, that analyzes the query to determine image results responsive to the query.

    The query processing system includes an image category selector that determines one or more image categories for the received query.

    For instance, the image category selector may determine image categories of “receipts,” “cityscapes,” “buildings,” “menu,” or a combination of two or more of these, for the query “what restaurant did I go to last Wednesday?”

    The textual image queries processing system can use the determined image category to select a subset of the images to search for data responsive to the query.

    An output type selector, included in the query processing system, can use the query to determine an output type for the query.

    Some example output types include an image, an annotated image, a total cost, a textual summary, or a combination of two or more of these.

    An annotated image output type can state that the image query processing system should output only a part of an image in response to a query rather than the entire image.

    A total cost output type can state that the image query processing system should output a sum of many different cost values, such as the costs for eating at many different restaurants during a week.

The many different cost values can be depicted in a single image (e.g., a receipt that includes the cost of vegetables purchased this past weekend), in many images (e.g., five receipts for the cost of eating at five restaurants throughout the week), or both.

    A textual summary output type can state that the image query processing system should generate a summary of content depicted in many different images, such as a list of the names of restaurants at which a person ate during a week.
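A "total cost" answer is then just extraction plus a sum over the matching images. The regex and the text layout here are assumptions about what stored OCR output might look like:

```python
import re

def total_cost(receipt_texts):
    """Pull the total line from each receipt's recognized text and sum
    the values (a sketch of the 'total cost' output type)."""
    total = 0.0
    for text in receipt_texts:
        match = re.search(r"total\s*\$?(\d+\.\d{2})", text, re.IGNORECASE)
        if match:
            total += float(match.group(1))
    return round(total, 2)

week = ["Diner ... Total $18.40", "Taqueria ... TOTAL $12.10", "Cafe ... total $7.25"]
print(total_cost(week))  # 37.75
```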

    The query processing system may use the determined output type to select a subset of the images to search for data, to select other processing parameters for determining responsive data or both.

Some examples of processing parameters include the part of an image in which responsive data is likely to be located (e.g., the bottom part of a receipt) and the types of processing to perform on an image (e.g., how to crop it for presentation).

    Based on the example query above, the query processing system may determine that the output type should be a picture of the restaurant and select cityscapes as the image category.

    When the query processing system determines a total cost output type, the query processing system may select receipts as the image category.

    The query processing system may use the determined output type to determine a location within an image to search for responsive data.

    For instance, when the output type is total cost, the query processing system may determine to search a bottom portion of an image for the total cost of dinner at a restaurant.

When the output type is a picture of a restaurant, the query processing system may determine to search the upper third of an image for the name of the restaurant (e.g., because the lower portion of the image would depict the restaurant's storefront).
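That location heuristic is easy to express as a mapping from output type to a vertical search band. The upper-third fraction comes from the text above; the bottom-quarter band for totals and the function shape are my own assumptions:

```python
def search_region(output_type, image_height):
    """Return the (top, bottom) pixel band to search for responsive data,
    based on the output type."""
    if output_type == "total_cost":
        # Totals usually sit near the bottom of a receipt.
        return (int(image_height * 0.75), image_height)
    if output_type == "restaurant_picture":
        # A restaurant's name tends to be on signage in the upper third.
        return (0, image_height // 3)
    return (0, image_height)  # no constraint: search the whole image

print(search_region("total_cost", 1200))          # (900, 1200)
print(search_region("restaurant_picture", 1200))  # (0, 400)
```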

    In some examples, the query processing system may use the determined output type to constrain an area within which to search for responsive data.

    The query processing system may detect objects depicted in an image and search within the boundaries of one or more of those detected objects to detect responsive data within those boundaries.

    The responsive data may be text, depicted objects, or other appropriate data.

    The query processing system may use the detected text to determine data responsive to the query.

    The responsive data may include some of the detected text, one of the detected objects, or both.

For example, with the query "what restaurant did I go to last Wednesday?", the query processing system may detect a sign depicted within an image, determine the text presented on the sign (e.g., the restaurant name), and use the determined text to determine a response to the query.

    The query processing system may determine the text that was previously recognized in the image and search a part of the image for the determined text. When a part of an image was not previously processed to determine the text depicted in the part, the query processing system may determine the text that was recognized in the image after receipt of the query.

    For instance, the query processing system may detect objects depicted in an image, and select one or more of the depicted objects.

The query processing system may analyze the content of the depicted objects (e.g., using an optical character recognition process) to determine the text they include.

    Textual Image Queries, Key Phrases and Query Processing

    A key phrase device, included in the query processing system, may determine one or more key phrases for a query.

The query processing system can use the determined key phrases (e.g., keywords) to select images responsive to the query, in addition to using the determined image categories, the output type, or both.

    For example, when the query is “what restaurant did I go to last Wednesday?”, the query processing system may select “Wednesday” and “restaurant” as key phrases.

The query processing system can select a subset of images from the most recent Wednesday that depict cityscapes (which may include the restaurant's storefront), that depict receipts, or both.

    The query processing system can analyze the images in the selected subset using the key phrases.

    For instance, when the subset includes many images of a cityscape, the query processing system may use the keyword “restaurant” to determine the images of restaurants.

When the subset includes many receipt images, the query processing system may use the keyword "restaurant" to determine which receipts from last Wednesday were for a restaurant rather than another purchase (e.g., coffee or a notebook).
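The key-phrase filtering step might look like this; the image records and their `text` field are hypothetical stand-ins for stored OCR output and labels:

```python
def filter_by_key_phrases(images, key_phrases):
    """Keep only images whose recognized text mentions any key phrase."""
    key_phrases = [k.lower() for k in key_phrases]
    return [img for img in images
            if any(k in img["text"].lower() for k in key_phrases)]

last_wednesday = [
    {"id": "r1", "text": "Luigi's Restaurant receipt"},
    {"id": "r2", "text": "Office supply store receipt"},
]
hits = filter_by_key_phrases(last_wednesday, ["restaurant"])
print([img["id"] for img in hits])  # ['r1']
```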

    The query processing system may use one or more keywords to determine whether an image is more likely responsive than another image.

    The query processing system may use a keyword to determine one or more text types for the keyword.

    The query processing system can analyze an image to determine whether the image includes text that corresponds to the one or more determined text types.

    For instance, the query processing system can use the keyword “restaurant” to determine text types of:

    • Restaurant phone number
    • Restaurant name
    • Restaurant menu types (e.g., breakfast, lunch, dinner)
    • Hours of operation
    • A combination of these

    The textual image queries processing system can analyze an image to determine whether the image includes data for some of the text types, such as a receipt that includes a restaurant phone number, restaurant name, and hours of operation.

    The query processing system can select an image based on the text types for which the image depicts content.

For example, the query processing system may determine a subset that includes three images:

• A picture of the front of the restaurant that includes the restaurant's name
• A menu for the restaurant
• A receipt from the restaurant with the restaurant name, hours of operation, and phone number

    The query processing system can use the number of text types for which the image depicts data to select one of the images from the subset. For instance, the query processing system may select an image with the most, the fewest, or the average number of text types.
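Ranking by the number of matched text types could be sketched like this; the three toy detectors are placeholders for far more robust real extractors:

```python
# Toy detectors for the text types a "restaurant" keyword implies.
TEXT_TYPES = {
    "phone": lambda t: any(c.isdigit() for c in t) and "-" in t,
    "name": lambda t: "restaurant" in t.lower(),
    "hours": lambda t: "am" in t.lower() or "pm" in t.lower(),
}

def count_text_types(image_text):
    """Count how many text types an image's recognized text depicts."""
    return sum(1 for detect in TEXT_TYPES.values() if detect(image_text))

def pick_most_informative(images):
    """Select the image depicting the most text types."""
    return max(images, key=lambda img: count_text_types(img["text"]))

images = [
    {"id": "front", "text": "Luigi's Restaurant"},
    {"id": "receipt", "text": "Luigi's Restaurant 555-0123 open 9am-9pm"},
]
print(pick_most_informative(images)["id"])  # receipt
```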

    Image Optimization Categorization Takeaways

    Google will generate textual image labels when you search for images, and those labels can include image categories. The categories can determine which queries an image might rank for in image search.

    Classification of those categories can be done a number of ways, based on things such as:

Optical Character Recognition, to read text in signs, on documents, and text added to images. Google can read store signs, billboards, and menus, and it is worth experimenting with how much Google may understand. We used to caution people against relying on Google to read address information from images of text in order to index it. I would still suggest showing a business's address as text on a webpage, but Google may now be capable of (and willing to) read business signs.

    Object Recognition, to understand what objects are contained in an image. If there is more than one object, Google could recognize more than one of the objects. When Google has a photo of a bear in a river eating a fish, it can recognize the bear, the river, and the fish – I wrote about this in a post describing how Google may annotate images based on the objects in those.

Keyphrase-based Content in Images. If you are optimizing a page for a harbor, you will want to include images that show things that appear at or near harbors, such as boats, buoys, seagulls, and more. These are related things that people may also search for.
