Vocabulary Mismatch: A Weakness of Modern Search Engines
As modern Web search engines become more and more successful, everyday knowledge and most online resources are literally at our fingertips. Search engines seem to be successfully replacing human librarians, and search seems to be a solved problem.
This talk shows that search ranking, retrieval modeling in particular, is far from being solved. Popular retrieval models such as Okapi BM25 or Statistical Language Models are based on simple string matching between query and document terms, and do not effectively model the vocabulary mismatch between query terms and relevant results.
We formally define term mismatch in search, and show that, contrary to common perception, term mismatch happens to most query terms and affects most search tasks. Properly modeling and solving term mismatch can lead to a large potential gain, on the order of 50%-300%. We demonstrate several initial successes in addressing term mismatch, using novel mismatch prediction features and methods together with theoretically motivated retrieval techniques. Promising problems for future research are also outlined.
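As a rough illustration, one common way to formalize term mismatch is the probability that a query term does not occur in a document relevant to the query. This formula is an assumption here, since the abstract does not spell out the definition. The Python sketch below estimates that quantity over a small, made-up set of relevant documents. The function name `term_mismatch`, the whitespace tokenization, and the example data are illustrative choices, not the method presented in the talk.

```python
def term_mismatch(term, relevant_docs):
    """Estimate the fraction of relevant documents that do NOT contain `term`.

    Assumption: documents are plain strings, tokenized by lowercasing
    and splitting on whitespace (no stemming or synonym handling).
    """
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    term = term.lower()
    missing = sum(1 for doc in relevant_docs if term not in doc.lower().split())
    return missing / len(relevant_docs)


# Hypothetical relevant documents for the query below (toy data).
relevant_docs = [
    "cheap flights to boston",
    "low cost airfare boston",
    "budget airline tickets to boston",
    "discount flights boston logan",
]

query = "cheap flights boston"
mismatch = {t: term_mismatch(t, relevant_docs) for t in query.split()}

for t, p in mismatch.items():
    print(f"{t}: mismatch = {p:.2f}")
```

In this toy data, "boston" occurs in every relevant document, while "cheap" is missing from three of the four, because those documents say "low cost", "budget", or "discount" instead. A model that relies on exact string matching would under-score those relevant documents for the term "cheap".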
As a search engine user, after the talk you should be able to recognize the term mismatch problem when it happens to you in search (where it is typically disguised as the emphasis problem). You should also be able to solve the problem in principled ways and become a more effective searcher.
Le is a Software Engineer at Google, working on search quality. He is the owner of www.wikiquery.org. He holds a PhD from the School of Computer Science at Carnegie Mellon University, and ME and BE degrees in Computer Science from Tsinghua University. (The content of the talk is based solely on work done during Le's PhD.) For more details, visit http://www.cs.cmu.edu/~lezhao