The Web has brought together a wide variety of digital information into
publicly accessible media. However, because of the sheer quantity and varying
quality of the available information, users often feel overwhelmed and
disoriented in their pursuit of information.
My research focuses on the monitoring, collection, sharing, mining and
searching of information, in order to help users identify and extract
the information they need quickly and effectively through simple and intuitive methods.
Here is a list of some of the projects that I have worked on:
Searching the Social Web: In this project we study how Social Networks can improve users'
daily experience of searching and accessing information on the Web. We organize our research around the
following thrusts: a) SN Data Management: How can we efficiently manage the constantly
evolving social Web data? b) Social Searches: How can we enable users to save and share
their Web searches with other users? c) Socially-Aware Results: How should we take a
user's Social Network into account during Web searches? d) Social Trust: How can we tell
which users are trustworthy and knowledgeable on a given topic?
WISE 2012 | WebDB 2011 | Technical Report 2012
Hidden-Web Crawling: Search engines employ automated programs called
crawlers to download pages from the Web. Typical crawlers today follow links
from one Web page to another and download every page in their path. However, an
ever-increasing amount of information on the Web is accessible only through
search interfaces; such information is called the Hidden or Deep Web. For
example, in PubMed (www.pubmed.org) users can access pages of high-quality
papers on medical research only after issuing a set of keywords. Since there
are no static links to Hidden-Web pages, current search engines cannot
index them, thus depriving users of access to potentially valuable
information. In my research, I studied how to build an effective Hidden-Web
crawler that can autonomously discover and download pages from the Hidden Web.
JCDL 2005 | extended version
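To make the idea concrete, here is a minimal sketch of a greedy, query-based Hidden-Web crawler: it repeatedly issues a keyword to a site's search form, downloads the result pages, and picks the next keyword from the terms that appear most frequently in the content retrieved so far. The submit_query callable is a hypothetical stand-in for the site's search interface, and the greedy frequency heuristic is a simplification of the adaptive query-selection policies studied in the paper.

```python
import re
from collections import Counter

def crawl_hidden_web(submit_query, seed_keyword, max_queries=50):
    """Greedy keyword-based Hidden-Web crawling (illustrative sketch).

    submit_query(keyword) is a hypothetical callable that fills in the
    site's search form and returns a dict mapping result-page URLs to
    their text content.
    """
    downloaded = {}                     # url -> page text
    issued = {seed_keyword}
    term_counts = Counter()
    keyword = seed_keyword
    for _ in range(max_queries):
        for url, text in submit_query(keyword).items():
            if url not in downloaded:
                downloaded[url] = text
                term_counts.update(re.findall(r"[a-z]{3,}", text.lower()))
        # Next query: a frequent term from the pages seen so far that has
        # not been issued yet (a cheap proxy for high expected coverage).
        keyword = next((t for t, _ in term_counts.most_common()
                        if t not in issued), None)
        if keyword is None:
            break
        issued.add(keyword)
    return downloaded
```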
Indexing Optimizations: Search engines typically create and maintain
large-scale indexes that are used to answer thousands of user queries per
second. Given the vast amount of information available on the Web, such indexes
can easily grow very large and become very costly to operate. In my research, I
proposed and evaluated algorithms for reducing the size of an index without
sacrificing the quality of the results returned to users.
SIGIR 2007
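As an illustration of the flavor of such techniques, below is a toy sketch of static index pruning under one simple policy: for every term, keep only the k postings with the highest precomputed impact scores. The data layout and the top-k policy are assumptions made for illustration; the actual pruning algorithms and their result-quality guarantees are described in the paper.

```python
def prune_index(index, k=1000):
    """Static index pruning sketch: keep only the top-k postings per term.

    `index` maps term -> list of (doc_id, score) postings, where `score`
    is a precomputed impact score such as a BM25 contribution.
    """
    pruned = {}
    for term, postings in index.items():
        # Keep the k highest-scoring postings for this term ...
        top_k = sorted(postings, key=lambda p: p[1], reverse=True)[:k]
        # ... and restore doc_id order so posting lists stay mergeable.
        pruned[term] = sorted(top_k)
    return pruned
```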
Web Spam Detection: One of the most important goals of Web search
engines is to return highly relevant results to the users. However, given the
potential monetary value of the traffic that search engines direct to Web sites,
some Web site operators craft spam Web pages that are useless to human users
and exist for the sole purpose of fooling search engines into ranking them
highly, in the hope of attracting traffic. At Microsoft Research,
we studied the characteristics of Web spam and proposed fast and highly
accurate algorithms for removing spam from search engine results.
WWW 2006
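The flavor of content-based spam detection can be sketched as follows: compute a handful of per-page statistics that tend to separate machine-generated spam from ordinary pages, and feed them to a trained classifier. The features below are an illustrative subset of my choosing, not the paper's exact feature set.

```python
import zlib

def spam_content_features(text):
    """Compute a few content statistics of the kind used to detect spam.

    Illustrative subset only; in practice such features are fed to a
    trained classifier (e.g., a decision tree) rather than hand-tuned.
    """
    words = text.split()
    n = max(len(words), 1)
    raw = text.encode("utf-8")
    return {
        "num_words": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        # Keyword-stuffed, repetitive pages compress unusually well.
        "compression_ratio": len(raw) / max(len(zlib.compress(raw)), 1),
    }
```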
Data Synchronization: Information on the Web is constantly updated.
Therefore, once a search engine's crawler has downloaded pages and stored
them locally, it has to refresh them periodically. In my research,
I performed large-scale experimental studies on several million pages
collected weekly from the Web over a period of one year. From these
observations we induced models that capture the evolution of Web sites and
Web-accessible textual databases, and then used the models to predict when
the crawler should refresh its pages.
Additionally, since the enormous size of the Web limits most crawlers to
downloading only a subset of it, I studied sampling-based
algorithms for determining which subset of the Web a crawler should focus on.
VLDB 2002 | WWW 2004 | ICDE 2005
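A minimal sketch of the underlying idea, assuming the commonly used Poisson model of page changes: estimate each page's change rate from repeated weekly observations, compute the probability that the local copy has gone stale, and refresh the most likely stale pages first. The estimator below is a textbook simplification of the models induced in the papers.

```python
import math

def estimate_change_rate(changes_observed, visits, interval_weeks=1.0):
    """Estimate a page's change rate (lambda) under a Poisson change model
    from repeated visits at a fixed interval (simplified estimator)."""
    p_unchanged = max(1.0 - changes_observed / visits, 1e-9)
    return -math.log(p_unchanged) / interval_weeks

def staleness_probability(lam, weeks_since_refresh):
    """Probability the local copy is out of date t weeks after a refresh."""
    return 1.0 - math.exp(-lam * weeks_since_refresh)

# A crawler could then refresh the pages most likely to be stale first:
# pages.sort(key=lambda p: staleness_probability(p.lam, p.age), reverse=True)
```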
The Infocious Web Search Engine: As part of my research, I worked on the
implementation of a full-fledged commercial Web search engine called Infocious, which blends my research in
crawling, data synchronization and indexing with a variety of natural
language processing (NLP) techniques in order to improve the quality of the results
presented to users. The search engine
performs highly efficient crawling and indexing of Web data, operates in a
distributed fashion over a cluster of commodity machines,
provides failover capabilities that guarantee 24/7 availability of the service,
and gracefully scales to the size of the Web, currently indexing more than 2
WWW 2005
Automatic Web Directory Construction: Web Directories provide an
alternative (to search engines) way of locating relevant information on the
Web. Typically, Web Directories rely on humans who put significant time and
effort into finding important pages on the Web and categorizing them in the
Directory. I studied ways of automating the creation of a Web
Directory by assigning every page in a given collection
to a given subject hierarchy. Our method is based on identifying
important sequences of terms within Web pages (called lexical chains),
which are then used to assign the pages to categories in the hierarchy.
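As a toy illustration of the final assignment step only: suppose each category in the hierarchy is described by a set of representative terms, and each page is summarized by the terms of its lexical chains; the page then goes to the category with the greatest term overlap. The actual chain construction and scoring in our method are considerably richer than this sketch, and both data structures below are illustrative assumptions.

```python
def assign_to_category(chain_terms, hierarchy):
    """Assign a page to the best-matching category (toy overlap score).

    `chain_terms` is the set of terms in the page's lexical chains;
    `hierarchy` maps a category path (e.g. "Health/Medicine") to a set
    of representative terms.
    """
    page = set(chain_terms)
    def overlap(category):
        terms = hierarchy[category]
        return len(page & terms) / max(len(terms), 1)
    return max(hierarchy, key=overlap)
```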
Releasing Search Queries and Clicks Privately: Researchers working within a real Web search engine have the privilege of accessing real query logs containing the queries and clicks that users performed over long periods of time. Although such data is of tremendous value for research in the WWW and Social Networks areas, search engine companies keep it strictly confidential due to privacy concerns. At Microsoft Research, we worked on algorithms that would allow us to release queries and clicks from a real query log to third parties (users or researchers) while rigorously preserving user privacy.
WWW 2009
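The general shape of such a mechanism can be sketched as follows: perturb each query's frequency with Laplace noise and publish only the queries whose noisy counts clear a high threshold, so that rare (and hence potentially identifying) queries are suppressed. The parameter values and noise calibration below are illustrative; the published algorithm's privacy analysis dictates how epsilon, delta and the threshold must actually be set.

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) sample, via the difference of two exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def release_query_counts(query_counts, epsilon=1.0, threshold=100.0):
    """Noise-and-threshold release of a query log (illustrative sketch).

    `query_counts` maps query string -> number of occurrences. The noise
    scale assumes each user affects a single count by at most 1; a real
    deployment must bound per-user contributions and calibrate epsilon,
    delta and the threshold accordingly.
    """
    released = {}
    for query, count in query_counts.items():
        noisy = count + laplace_noise(1.0 / epsilon)
        if noisy >= threshold:       # suppress rare, identifying queries
            released[query] = noisy
    return released
```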