Tarantula Crawler Hidden Page!

Tarantula Project
Tarantula is a spider, which is exactly what the Tarantula Project is about. Tarantula is the basic crawler of the KDDLab, developed by George Valkanas, and is meant to crawl several types of sites and harvest web resources in general. It comes in various flavors, one for each type of site that it crawls.

Tarantula is written purely in Java and currently uses Lucene as its indexing platform. The choice of Lucene is mostly a matter of taste, since Tarantula can also be used in conjunction with a database (PostgreSQL has already been tested). The crawler is task-oriented and resembles a workflow execution engine. It is multi-threaded (how could it not be? A spider has 8 legs!), and its components can be replicated for increased performance. Precisely because it resembles a workflow execution engine, the Tarantula backbone is also used to run experiments on the datasets that we have downloaded (using the same software), to play back streams of data, etc.
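
To illustrate the idea of a task-oriented workflow whose stages can be replicated across threads, here is a minimal sketch. It is not Tarantula's actual code: the class name WorkflowSketch, the method runStage, and the stand-in "crawl" and "index" tasks are all hypothetical, and no real network or Lucene calls are made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of a replicated workflow stage; not Tarantula's API. */
final class WorkflowSketch {

    /**
     * Runs one workflow stage: applies `task` to every input, spreading the
     * work over `replicas` worker threads. Outputs keep the input order.
     */
    static <I, O> List<O> runStage(List<I> inputs, Function<I, O> task, int replicas) {
        ExecutorService pool = Executors.newFixedThreadPool(replicas);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I input : inputs) {
                futures.add(pool.submit(() -> task.apply(input)));
            }
            List<O> outputs = new ArrayList<>();
            for (Future<O> future : futures) {
                outputs.add(future.get()); // blocks until this task is done
            }
            return outputs;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("stage failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.org/a", "http://example.org/b");
        // Stage 1, "crawl": a stand-in that fabricates a page body (no network I/O).
        List<String> pages = runStage(urls, url -> "content of " + url, 8);
        // Stage 2, "index": a stand-in that only measures each page.
        List<Integer> sizes = runStage(pages, String::length, 2);
        System.out.println(sizes);
    }
}
```

Each stage takes its own replica count, which mirrors the idea that crawlers and indexers can be scaled independently. A real engine would more likely stream items between stages through queues rather than wait for a whole stage to finish.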

Tarantula takes commands from the Tarantula Hive (also called the Tarantula HQ), a command-line interface that controls its runtime settings, e.g., the number of simultaneous crawlers, the number of simultaneous indexers, crawler politeness, etc. Note that the project is still under heavy development but is already functional in many ways.
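
Crawler politeness is commonly enforced as a minimum delay between two requests to the same host. The sketch below shows one way such a setting could be applied. It is an assumption for illustration only: the class PolitenessGate, its reserve method, and the millisecond-based delay are hypothetical and are not taken from Tarantula or the Hive.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host politeness gate; not Tarantula's API. */
final class PolitenessGate {
    private final long minDelayMillis;
    // For each host, the earliest time at which the next request may start.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next request slot for `host` and returns how many
     * milliseconds the caller should wait before sending the request.
     * The current time is passed in so the logic stays deterministic.
     */
    synchronized long reserve(String host, long nowMillis) {
        long earliest = Math.max(nowMillis, nextAllowed.getOrDefault(host, nowMillis));
        nextAllowed.put(host, earliest + minDelayMillis);
        return earliest - nowMillis;
    }

    public static void main(String[] args) {
        PolitenessGate gate = new PolitenessGate(1000);
        System.out.println(gate.reserve("example.org", 0));   // first request: no wait
        System.out.println(gate.reserve("example.org", 200)); // too soon: must wait
        System.out.println(gate.reserve("example.net", 200)); // other host: no wait
    }
}
```

A crawler thread would call reserve with System.currentTimeMillis() and sleep for the returned duration before fetching. Because the method is synchronized, several simultaneous crawlers can share a single gate.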

Tarantula flavors

  1. TaranTwitter: Tarantula crawler for Twitter (up and running)
  2. TarantWEB: Tarantula crawler for general web resources (up and running)
  3. TaranTube: Tarantula crawler for YouTube (up and running)
  4. TarantIckr: Tarantula crawler for Flickr (under development)
  5. TarantuLinked: Tarantula crawler for LinkedIn (under development)

Several utilities have also been created, to be used alongside the backbone of Tarantula. Some of them are:

  1. Tarantula: The backbone of the workflow mechanism of Tarantula. (up and running)
  2. TarantUtils: General utility classes, e.g., system information, information conversions, etc. (up and running)
  3. TarantuLogger: General purpose logging mechanism (up and running)
  4. TarantDB: Component to interact with the indexer / database (up and running)
  5. TaranText: Text manipulation, splitting, cleaning, etc. (up and running)
  6. TarantuLocation: Manipulation of geographical information, e.g., geocoding, GPS signals, etc. (up and running)

Papers written using Tarantula

  • TaranTwitter: Live Web Events, Events UI
  • TarantWEB: Commodity Location
  • TaranTube: Ranked Crawling of HW