How Google Processes Queries: Entity Resolution Resources

Posted @ Jul 18 2017 by

The Web is filled with entities – information about people, places, and things. A search engine may collect knowledge about connections between entities. In the presentation How Google Works, Google’s Paul Haahr told us that Google will try to identify entities that appear in queries. His presentation covers more than crawling the Web and finding links on pages, and is well worth watching.

A patent granted to Google on July 4th focuses on entities that appear in queries and on understanding them. Its subject is entity resolution: working out what an entity mentioned in a query might stand for. When I looked at the patent, I was impressed by the number of references the applicants filed along with it, and I wanted to read those. I thought they were worth sharing with others as well, not to prove a particular point or take a particular stance, but to give anyone willing to take the time a look at the latest papers and research involving entity resolution. I have read a few of these, and will be reading more. Some of these papers are co-authored by researchers at Google. If you find anything that surprises you, please share that in the comments.

Working toward an understanding of entities in queries makes a lot of sense, as that process can put those queries in context. Having a sense of how Google might process a query may give you some ideas that go beyond information retrieval scores and PageRank scores for pages. How is Google adjusting for context, and for the presence of entities in a query?

For example, Newcastle may refer to Newcastle upon Tyne, UK, to the football (soccer) club Newcastle United, or to the beverage Newcastle Brown Ale. Context may assist in disambiguating the referring text. For example, if the referring text includes the context of “John plays for Newcastle,” the mention is most likely the football club, while “John was born in Newcastle” most likely refers to the location, etc.

We know that Google was attempting to better understand context in developing keywords, as I wrote about in Google Patents Context Vectors to Improve Search. A Google patent about better understanding the context of entities can add meaning to pages, and to what a search engine knows about them. The focus of this new patent is on building models that can help with entity resolution:

Models predict the probability of some event given observations. Machine learning algorithms can be used to train the parameters of the model. For example, the model may store a set of features and a support score for each of a plurality of different entities. The support score represents a probability score the model has learned, a probability that the feature occurs given the entity. Models used in entity resolution have relied on three components: a mention model, a context model, and a coherency model. The mention model represents the prior belief that a particular phrase refers to a particular entity in the data graph. The context model infers the most likely entity for a mention given the textual context of the mention. In a context model, each feature can represent a phrase that is part of the context for the entity mention. For example, the phrase “president” may have a support score (or a probability score) for the entities of “Barack Obama,” “Bill Clinton,” “Nicolas Sarkozy,” and many others. Similarly, the phrase “plays for” may have a support score for various bands, teams, etc. The context discussed above may be represented by a set of features, or phrases, co-occurring with (e.g., occurring around) the referring text, or entity mention. The coherency model attempts to force all the referring expressions in a document to resolve to entities that are related to each other in the data graph. But a coherency model introduces dependencies between the resolutions of all the mentions in a document and requires that the relevant entity relationships in the data graph be available at inference time, increasing inference and model access costs.
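To make the mention model in the passage above more concrete, here is a minimal sketch in Python. It assumes the prior is estimated by simple relative frequency over labeled (phrase, entity) pairs; the patent text quoted here does not say how the prior is learned, so that estimation method, the function name `build_mention_model`, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def build_mention_model(labeled_mentions):
    """Estimate a prior P(entity | phrase) by relative frequency.

    labeled_mentions is an iterable of (phrase, entity) pairs.
    Relative-frequency estimation is an assumption made for this sketch.
    """
    counts = defaultdict(Counter)
    for phrase, entity in labeled_mentions:
        counts[phrase.lower()][entity] += 1

    model = {}
    for phrase, entity_counts in counts.items():
        total = sum(entity_counts.values())
        model[phrase] = {e: c / total for e, c in entity_counts.items()}
    return model


# Hypothetical toy data, invented for illustration only.
labeled_mentions = [
    ("Newcastle", "Newcastle upon Tyne"),
    ("Newcastle", "Newcastle upon Tyne"),
    ("Newcastle", "Newcastle United"),
    ("Newcastle", "Newcastle Brown Ale"),
]

mention_model = build_mention_model(labeled_mentions)
for entity, prior in mention_model["newcastle"].items():
    print(f"{entity}: {prior:.2f}")
```

With no context at all, a prior like this would always pick the most frequently mentioned entity, which is why the patent pairs it with a context model.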

The patent is:

Additive context model for entity resolution
Inventors: Amarnag Subramanya, Michael Ringgaard, and Fernando Carlos das Neves Pereira
Assignee: Google
US Patent: 9,697,475
Granted: July 4, 2017
Filed: December 23, 2013

Abstract

Systems and methods are disclosed for using an additive context model for entity disambiguation. An example method may include receiving a span of text from a document and a phrase vector for the span. The phrase vector may have a quantity of features and represent a context for the span. The method also includes determining a quantity of candidate entities from a knowledge base that have been referred to by the span. For each of the quantity of candidate entities, the method may include determining a support score for the candidate entity for each feature in the phrase vector, combining the support scores additively, and computing a probability that the span resolves to the candidate entity given the context. The method may also include resolving the span to a candidate entity with a highest probability.
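The steps in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the support scores below are invented, they are treated as log-scale weights, and the summed scores are turned into probabilities with a softmax. The abstract says the scores are combined additively and a probability is computed, but it does not specify the normalization, so the softmax and the names `SUPPORT` and `resolve` are hypothetical.

```python
import math

# Hypothetical support scores: SUPPORT[feature][entity].
# All values are invented for illustration.
SUPPORT = {
    "plays for": {
        "Newcastle United": 2.0,
        "Newcastle upon Tyne": -1.0,
        "Newcastle Brown Ale": -1.5,
    },
    "born in": {
        "Newcastle United": -1.0,
        "Newcastle upon Tyne": 2.5,
        "Newcastle Brown Ale": -2.0,
    },
}


def resolve(candidates, phrase_vector, support=SUPPORT):
    """Resolve a span to the candidate entity with the highest probability.

    candidates: entities the span has been known to refer to.
    phrase_vector: context features (phrases) found around the span.
    """
    # Combine the support scores additively for each candidate.
    totals = {
        entity: sum(support.get(f, {}).get(entity, 0.0) for f in phrase_vector)
        for entity in candidates
    }
    # Assumed normalization: softmax over the summed scores.
    top = max(totals.values())
    exps = {e: math.exp(s - top) for e, s in totals.items()}
    z = sum(exps.values())
    probs = {e: v / z for e, v in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


candidates = ["Newcastle United", "Newcastle upon Tyne", "Newcastle Brown Ale"]
best_team, probs_team = resolve(candidates, ["plays for"])
best_place, probs_place = resolve(candidates, ["born in"])
print(best_team)
print(best_place)
```

Because each candidate is scored only from its own features, the resolution of one mention does not depend on how other mentions in the document resolve, which is the cost the earlier passage attributes to coherency models.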

Applicant References

When I saw all of the papers referred to in this patent, I wanted to read them all, and share links to them. These are papers selected by leaders in the search industry, and having links to them provides a way to dig into some of the latest research on entity resolution. I will be going through these in the weeks to come. I look at it as an opportunity to learn from some of the best sources available. If anything stands out about any of these papers, I would like to hear your thoughts about them.

Chu et al., “Map-Reduce for Machine Learning on Multicore”, In NIPS, 2006, pp. 281-288. cited by applicant.

Friedman et al., “Additive Logistic Regression: A Statistical View of Boosting”, Special Invited Paper, The Annals of Statistics, vol. 28, No. 2, 2000, pp. 337-407. cited by applicant.

“Ambiverse: AIDA: Accurate Online Disambiguation of Named Entities in Text and Tables”, Max Planck Institut Informatik, available online at http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/aida/, 2013, 4 pages. cited by applicant.

Baluja et al., “Video Suggestion and Discovery for YouTube: Taking Random Walks Through the View Graph”, International Conference on World Wide Web (WWW 2008), Apr. 21-25, 2008, 10 pages. cited by applicant.

Bollacker et al., “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge”, Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 9-12, 2008, pp. 1247-1249. cited by applicant.

Bunescu et al., “Using Encyclopedic Knowledge for Named Entity Disambiguation”, Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Apr. 2006, pp. 9-16. cited by applicant.

Cucerzan, Silviu, “Large-Scale Named Entity Disambiguation Based on Wikipedia Data”, Proceedings of Joint Conference on Empirical Methods in [...]

Dredze et al., “Entity Disambiguation for Knowledge Base Population”, Proceedings of the 23rd International Conference on Computational Linguistics, Aug. 2010, pp. 277-285. cited by applicant.

Duchi et al., “Efficient Online and Batch Learning Using Forward Backward Splitting”, Journal of Machine Learning Research, vol. 10, 2009, pp. 2899-2934. cited by applicant.

Ferragina et al., “TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)”, Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Oct. 26-30, 2010, pp. 1625-1628. cited by applicant.

Finin et al., “Using Wikitology for Cross-Document Entity Coreference Resolution”, Association for the Advancement of Artificial Intelligence, 2009, pp. 29-35. cited by applicant.

Finkel et al., “Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling”, Proceedings of the 43rd Annual Meeting of the ACL, Jun. 2005, pp. 363-370. cited by applicant.

Gabrilovich et al., “Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization”, Journal of Machine Learning Research, vol. 8, 2007, pp. 2297-2345. cited by applicant.

Hachey et al., “Evaluating Entity Linking with Wikipedia”, Artificial Intelligence, vol. 194, 2013, pp. 130-150. cited by applicant.

Haghighi et al., “Simple Coreference Resolution with Rich Syntactic and Semantic Features”, Proceedings of Conference on Empirical Methods in Natural Language Processing, Aug. 6-7, 2009, pp. 1152-1161. cited by applicant.

Han et al., “A Generative Entity-Mention Model for Linking Entities with Knowledge Base”, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies–vol. 1, Jun. 19-24, 2011, pp. 945-954. cited by applicant.

Han et al., “An Entity-Topic Model for Entity Linking”, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jul. 12-14, 2012, pp. 105-115. cited by applicant.

Han et al., “Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge”, Proceedings of the 18th ACM Conference on Information and Knowledge Management, Nov. 2-6, 2009, pp. 215-224. cited by applicant.

Hoffart et al., “Robust Disambiguation of Named Entities in Text”, Proceedings of Conference on Empirical Methods in Natural Language Processing, Jul. 27-31, 2011, pp. 782-792. cited by applicant.

Kulkarni et al., “Collective Annotation of Wikipedia Entities in Web Text”, Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining, Jun. 28-Jul. 1, 2009, pp. 457-466. cited by applicant.

Kwiatkowski et al., “Lexical Generalization in CCG Grammar Induction for Semantic Parsing”, Proceedings of Conference on Empirical Methods in Natural Language Processing, Jul. 27-31, 2011, pp. 1512-1523. cited by applicant.

Lin et al., “Entity Linking at Web Scale”, Proc. of the Joint Workshop on Automatic Knowledge Base Construction & Web-scale Knowledge Extraction, Jun. 7-8, 2012, pp. 84-88. cited by applicant.

Mayfield et al., “Cross-Document Coreference Resolution: A Key Technology for Learning by Reading”, Spring Symposium on Learning by Reading and Learning to Read, Mar. 2009, 6 pages. cited by applicant.

Mihalcea et al., “Wikify! Linking Documents to Encyclopedic Knowledge”, Proceedings of the 16th ACM Conference on Information and Knowledge Management, Nov. 6-8, 2007, pp. 233-241. cited by applicant.

Milne et al., “Learning to Link with Wikipedia”, Proceedings of the 17th ACM Conference on Information and Knowledge Management, Oct. 26-30, 2008, pp. 509-518. cited by applicant.

Nigam et al., “Text Classification from Labeled and Unlabeled Documents using EM”, Machine Learning, vol. 39, 2000, pp. 103-134. cited by applicant.

Orr et al., “Learning from Big Data: 40 Million Entities in Context”, available online, Mar. 8, 2013, 6 pages. cited by applicant.

Ratinov et al., “Local and Global Algorithms for Disambiguation to Wikipedia”, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Jun. 19-24, 2011, pp. 1375-1384. cited by applicant.

Sil et al., “Linking Named Entities to Any Database”, Proceedings of Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jul. 12-14, 2012, pp. 116-127. cited by applicant.

Subramanya et al., “Semi-Supervised Learning with Measure Propagation”, Journal of Machine Learning Research, vol. 12, 2011, pp. 3311-3370. cited by applicant.

Talukdar et al., “Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition”, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Jul. 11-16, 2010, pp. 1473-1481. cited by applicant.

Talukdar et al., “New Regularized Algorithms for Transductive Learning”, Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II, 2009, pp. 442-457. cited by applicant.

Talukdar et al., “Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks”, Proceedings of Conference on Empirical Methods in Natural Language Processing, Oct. 2008, pp. 582-590. cited by applicant.

The patent does describe a process for disambiguating entities, but it seemed to me that being able to go through the resources in the patent was really valuable, and that it was worth focusing upon that aspect of the patent. I will be going through them. This may seem like an academic exercise, but entity resolution is now part of how Google handles queries and is worth knowing something about. When Google sees “Newcastle” in a query, it should know whether the ale, the team, or the location is being referred to. How would you show that off to a search engine?

1 Comment

  1. Jack Wieczorek

    July 26th, 2017 at 9:01 am

    Thanks Bill. Another little step to understand how Google works. I’m looking forward to seeing you summarize the reference articles.
