About

From its beginnings as a means to share scientific data, the Web is evolving to capture parts of daily life for millions of users. A large amount of user-generated content on a wide range of topics, along with information about people and their connections to other people and content, is placed online daily. Estimates put the volume of new user-generated content at 10 GB per day or more. Web search technology has traditionally assisted users in locating useful information online. Typically, a user visits a favorite search engine, types a few keywords and explores the returned results. However, within this newly created ecosystem of user-generated content there are two important new trends that call for a paradigm shift in how information is searched and accessed on the Web.

Proliferation of Social Networks: The first trend is the growing use and availability of social media data. Social media data is often structured and contains information such as reviews, rankings, tags, trust scores, etc., linked together into some form of social network based on friendship links and interactions among users. There are currently dozens of social networks serving a multitude of purposes: finding friends (e.g. Facebook, MySpace), sharing media (e.g. YouTube, Flickr), business networking (e.g. LinkedIn, XING), etc. The collective knowledge contained in social networks and online communities can be extremely helpful to users searching for information online. For example, a user interested in finding restaurants in Brussels may ask friends in her social network who have already been to the city for good suggestions. Although this is analogous to the way people ask their friends for suggestions in real life, there is currently no easy way for users to tap into the wisdom present in their online social networks and communities.

Searching Socially: A second and more recent trend is that users are increasingly using the Web to accomplish complex information gathering, planning and coordination tasks that go beyond simple queries and that involve other users. Examples of such tasks are purchasing a car, planning a vacation, or carrying out a home improvement project. Throughout the process of accomplishing such tasks, users search for information in a "social manner", i.e. they coordinate, collaborate, and consult with other users from their broader social network. One user survey found that more than 50% of users have cooperated with others to search the Web, particularly for tasks such as travel planning, shopping, literature search, etc. Despite this growing need, the current search paradigm does not allow for collaboration and sharing of results among users when they look for information online.

These trends indicate that information search on the Web is transforming from a solitary activity into a social one, where users actively involve other users to improve the quality of their findings.

In this project we study how social networks can improve the daily life of users as they search for and access information on the Web. We organize our research around the following thrusts. a) Social Network Data Management: How can we efficiently manage the constantly evolving social Web data? b) Social Searches: How can we enable users to save and share their Web searches with other users? c) Socially-Aware Results: How should we take into account a user's social network during Web searches? d) Social Trust: How can we tell which users are trustworthy and knowledgeable on a given topic?

Acknowledgements

The SocWeb research project is carried out at the Department of Informatics and Telecommunications of the University of Athens and is funded by the European Union through the Marie Curie International Reintegration grant PIRG06-GA-2009-256603.


People

People participating in the research project:

Publications

  • Focused Crawling for the Hidden Web
    Panagiotis Liakos, Alexandros Ntoulas, Alexandros Labrinidis, Alex Delis
    World Wide Web Journal (WWWJ), vol 18, no 3, 2015.

    A constantly growing amount of high-quality information resides in databases and is guarded behind forms that users fill out and submit. The Hidden Web comprises all these information sources that conventional web crawlers are incapable of discovering. In order to excavate and make available meaningful data from the Hidden Web, previous work has focused on developing query generation techniques that aim at downloading all the content of a given Hidden Web site with the minimum cost. However, there are circumstances where only a specific part of such a site might be of interest. For example, a politics portal should not have to waste bandwidth or processing power to retrieve sports articles just because they are residing in databases also containing documents relevant to politics. In cases like this one, we need to make the best use of our resources in downloading only the portion of the Hidden Web site that we are interested in. We investigate how we can build a focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding area. In this regard, we present an approach that progresses iteratively and analyzes the returned results in order to extract terms that capture the essence of the topic we are interested in. We propose a number of different crawling policies and we experimentally evaluate them with data from four popular sites. Our approach is able to download most of the content in search in all cases, using a significantly smaller number of queries compared to existing approaches.

  • Effective Detection of Overlapping Communities
    Panagiotis Liakos, Alex Delis, Alexandros Ntoulas
    (under review)

    Real-life systems entailing interacting objects have been routinely modeled as graphs or networks. Revealing the community structure of such systems is crucial in helping us better understand their complex nature. Networks and the relationships they portray are exploited by proposed community detection techniques that seek to facilitate the discovery of separate, overlapping, nested or fully hierarchical communities. Nevertheless, our perception of what a community is in a network of interacting objects has evolved over the years. In this respect, the decomposition of networks into possibly overlapping organizational groups and our enhanced understanding of their intricate interactions remain open challenges. We address these issues through an agglomerative approach that groups pairs of links and provides a richer hierarchical structure than previous efforts. We attain this objective by exploiting the dispersion of established relationships among objects in the network. Our algorithm measures the similarity of such links as well as the extent of their participation in multiple contexts, to determine the order in which pairs of links should be grouped. Moreover, our technique termed Dispersion-aware Link Communities or DLC can handle both unweighted and weighted networks. Our experimental results with a popular network strongly demonstrate that our approach overcomes issues earlier techniques stumbled upon. Furthermore, we investigate the performance of our algorithm against ground-truth communities for a wide range of networks and show that DLC outperforms state-of-the-art methods.

  • Using temporal IDF for efficient novelty detection in text streams
    Margarita Karkali, Francois Rousseau, Alexandros Ntoulas, Michalis Vazirgiannis
    Journal of Artificial Intelligence Research (under review).

    Novelty detection in text streams is a challenging task that emerges in quite a few different scenarios, ranging from email thread filtering to RSS news feed recommendation on a smartphone. An efficient novelty detection algorithm can save the user a great deal of time and resources when browsing through relevant yet usually previously-seen content. Most of the recent research on detection of novel documents in text streams has been building upon either geometric distances or distributional similarities, with the former typically performing better but being much slower due to the need to compare an incoming document with all the previously-seen ones. In this paper, we propose a new approach to novelty detection in text streams. We describe a resource-aware mechanism that is able to handle massive text streams such as the ones present today thanks to the burst of social media and the emergence of the Web as the main source of information. We capitalize on the historical Inverse Document Frequency (IDF), which is known to capture term specificity well, and we show that it can be used successfully at the document level as a measure of document novelty. This enables us to avoid similarity comparisons with previous documents in the text stream, thus scaling better and leading to faster execution times. Moreover, as the collection of documents evolves over time, we use a temporal variant of IDF not only to maintain an efficient representation of what has already been seen but also to decay the document frequencies as time goes by. We evaluate the performance of the proposed approach on a real-world news articles dataset created for this task. The results show that the proposed method outperforms all of the baselines while managing to operate efficiently in terms of time complexity and memory usage, which are of great importance in a mobile setting.

  • SocWeb: Efficient Monitoring of Social Network Activities
    F. Psallidas, A. Ntoulas, A. Delis
    In Proceedings of the International Conference on Web Information Systems Engineering, 2013, Nanjing, China.

    We demonstrate that contemporary online social networks (OSNs) feature similar, if not identical, baseline structures. To this end, we propose an extensible model termed SocWeb that articulates the essential structural elements of OSNs in wide use today. We introduce a flexible API that enables applications to effectively communicate with designated OSN providers and discuss key design choices for our distributed crawler. Our approach helps attain diverse qualitative and quantitative performance criteria including freshness of facts, scalability, quality of fetched data and robustness. We report on a cross-social media analysis compiled using our extensible SocWeb-based crawler in the presence of Facebook and YouTube.

  • Efficient Online Novelty Detection in News Streams
    M. Karkali, F. Rousseau, A. Ntoulas, M. Vazirgiannis
    In Proceedings of the International Conference on Web Information Systems Engineering, 2013, Nanjing, China.

    In this paper, we propose a new novelty detection algorithm based on the Inverse Document Frequency (IDF) scoring function. Computing novelty based on IDF enables us to avoid similarity comparisons with previous documents in the text stream, thus leading to faster execution times. At the same time, our proposed approach outperforms several commonly used baselines when applied on a real-world news articles dataset.

  • Topic Sensitive Hidden-Web Crawling
    P. Liakos, A. Ntoulas
    In Proceedings of the International Conference on Web Information Systems Engineering, 2012, Paphos, Cyprus.
    (acceptance rate: 18%)

    In this paper, we study how we can build a topically-focused Hidden Web crawler that can autonomously extract topic-specific pages from the Hidden Web by searching only the subset that is related to the corresponding category. To this end, we present query generation techniques that take into account the topic that we are interested in. We propose a number of different crawling policies and we experimentally evaluate them with data from two popular sites.

  • Rank-aware crawling of Hidden-Web sites
    G. Valkanas, A. Ntoulas, D. Gunopulos
    In Proceedings of the SIGMOD 2011 International Workshop on the Web and Databases, Athens, Greece.

    In this paper we present algorithms for crawling a Hidden Web site by taking the ranking of the results into account. Since we do not know all potential queries that may be directed to the Web site in advance, we study how to approximate the site's ranking function so that we can compute the top results based on the data collected so far. We provide a framework for performing ranking-aware Hidden Web crawling and we show experimental results on a real Web site demonstrating the performance of our methods.

  • SocWeb: Search Within Your Social Networks (Technical Report)
    F. Psallidas, A. Ntoulas, A. Delis
    University of Athens Technical Report, October 2012.

    We present the overall architecture of the SocWeb system. We introduce the core concepts of the SocWeb project: the SocWeb Model, which is a generic way to describe arbitrary social network graphs; SWODL (short for SocWeb Object Definition Language), a language that we introduce to describe the object types of social network graphs; and the SocWeb Generic API, which makes use of the SocWeb Model and SWODL templates to provide a very simple interface for retrieving information from these social networks. We then present the architecture of each component of SocWeb. First, the Application level, which provides a simple but rich front-end to the user and triggers specific requests to be handled by the back-end. Second, the Distributed Crawler level, which is considered the most important part of SocWeb and is able to retrieve information. Third, the Storage System, which takes the information extracted by the Crawler, applies several checks and finally stores it in a distributed manner. Fourth, the Indexing level, which reads the output either directly from the Crawler or from the Storage System and creates the final indices that allow users to query efficiently.
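
The IDF-based novelty detection summarized in the abstracts above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions, not the scoring function from the papers: the class name `IdfNoveltyDetector`, the smoothed IDF formula, the per-document multiplicative `decay`, and the use of the mean IDF of a document's distinct terms as its novelty score are all choices made here for illustration.

```python
import math
from collections import Counter


class IdfNoveltyDetector:
    """Illustrative sketch: score a document by the mean IDF of its terms.

    No pairwise comparison with earlier documents is needed; only the
    (optionally decayed) document frequencies are kept in memory.
    """

    def __init__(self, decay=1.0):
        # Hypothetical parameter: every stored count is multiplied by
        # `decay` whenever a new document arrives (1.0 means no decay).
        self.decay = decay
        self.doc_freq = Counter()  # term -> (decayed) number of documents
        self.num_docs = 0.0        # (decayed) number of documents seen

    def score(self, tokens):
        """Return the mean smoothed IDF over the distinct terms of a document."""
        terms = set(tokens)
        if not terms:
            return 0.0
        total = 0.0
        for term in terms:
            # Smoothed IDF (an assumption): unseen terms get log(N + 1),
            # terms present in every document get 0.
            total += math.log((self.num_docs + 1.0) / (self.doc_freq[term] + 1.0))
        return total / len(terms)

    def update(self, tokens):
        """Add a document to the statistics, decaying older counts first."""
        if self.decay < 1.0:
            for term in list(self.doc_freq):
                self.doc_freq[term] *= self.decay
            self.num_docs *= self.decay
        self.num_docs += 1.0
        for term in set(tokens):
            self.doc_freq[term] += 1.0

    def observe(self, tokens):
        """Score a document against the past, then add it to the statistics."""
        novelty = self.score(tokens)
        self.update(tokens)
        return novelty
```

With this sketch, a document made of terms that already occur in many earlier documents scores lower than one made of unseen terms. The very first document scores 0 because no statistics exist yet, so a real system would need a warm-up period or background corpus.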

Downloads

As part of the SocWeb project we are efficiently crawling data from a set of social networks such as Facebook and Twitter. We currently provide a sample of our dataset for other researchers to use, with additional datasets coming up.

  • Complete information of 20,000 user posts from Facebook in XML [.gz]
  • Complete information of 20,000 user post comments from Facebook in XML [.gz]
  • Complete information of 20,000 user post likes from Facebook in XML [.gz]
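
As a starting point for exploring these files, the following Python sketch streams a gzip-compressed XML file and counts how often each element tag occurs. It makes no assumption about the actual schema of the SocWeb datasets, which is not described here; the function name `count_elements` and the file name in the usage note are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(path):
    """Stream a gzip-compressed XML file and count occurrences of each tag."""
    counts = Counter()
    with gzip.open(path, "rb") as handle:
        # iterparse reads the file incrementally, so large dumps do not
        # have to be loaded into memory in one piece.
        for _event, element in ET.iterparse(handle, events=("end",)):
            counts[element.tag] += 1
            element.clear()  # drop the content of elements already counted
    return counts
```

For example, `count_elements("facebook_posts.xml.gz")` (a placeholder file name) returns a `Counter` whose most common tags give a quick overview of the file's structure before any schema-specific parsing is written.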

Contact

If you have questions, comments, or ideas for collaborations, please contact Alexandros Ntoulas or Alex Delis.

Demo

As part of our project we have developed a fully functional prototype showcasing our work. The prototype allows users to log in with their Facebook accounts and search the Web by issuing queries. Users can also save their work and continue it at a later time. The search results within the application are ordered according to relevance and according to which other users have searched for or saved them. Users can also share their overall work with other users. We hope that our application can serve as the basis for identifying new needs for social Web search in the future; it is open to other developers and researchers to use and improve upon.

Screenshots:




Figure 1. List of search projects of a user.



Figure 2. Functionality of persistent search queries.


You can access the current version of the demo here.