WebRACE

Multithreaded High-Performance User-Driven Crawler

 
 
           
 


WebRaceV4
 

Overview

WebRACE is a prototype HTTP Retrieval, Annotation and Caching Engine developed in Java. It is the WWW Agent-Proxy of eRACE.

WebRACE is an agent-proxy that collects, processes, and caches content from information sources on the WWW that are accessible through HTTP/1.0 and HTTP/1.1.

WebRACE retrieves documents from the Web according to XML-encoded user profiles that determine the urgency and relevance of the collected information. The system then caches and processes the retrieved documents. Processing is guided by pre-defined user queries and consists of keyword searches, title extraction, summarization, classification by relevance to the user queries, estimation of priority and urgency, and so on.
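The profile-driven processing described above can be sketched in Java as follows. This is a minimal illustration, not WebRACE's implementation: the `<profile>`/`<keyword>` schema, the class name `ProfileAnnotator`, and the scoring rule (fraction of profile keywords found in the page) are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch: the profile schema and the scoring rule are assumptions,
// not the format or algorithm actually used by WebRACE.
class ProfileAnnotator {
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Reads every <keyword> element from an (assumed) XML user profile, lowercased. */
    static List<String> readKeywords(String profileXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(profileXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("keyword");
            List<String> keywords = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                keywords.add(nodes.item(i).getTextContent().trim().toLowerCase(Locale.ROOT));
            }
            return keywords;
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed profile XML", e);
        }
    }

    /** Title extraction: returns the text of the first <title> element, or "" if none. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Keyword search: fraction of profile keywords that occur in the page (0.0 to 1.0). */
    static double relevance(String html, List<String> keywords) {
        if (keywords.isEmpty()) {
            return 0.0;
        }
        String text = html.toLowerCase(Locale.ROOT);
        long hits = keywords.stream().filter(text::contains).count();
        return (double) hits / keywords.size();
    }

    public static void main(String[] args) {
        String profile = "<profile><keyword>crawler</keyword><keyword>cache</keyword></profile>";
        String page = "<html><head><title>Web Crawler Notes</title></head>"
            + "<body>A crawler fetches pages.</body></html>";
        List<String> keywords = readKeywords(profile);
        System.out.println(extractTitle(page) + " -> " + relevance(page, keywords));
    }
}
```

A real annotation engine would tokenize the text rather than match substrings, and would also weigh urgency and priority; the substring match here only keeps the sketch short.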

eRACE Components

  • Mini-Crawler
  • Annotation Engine
  • Object Cache

WebRACE receives crawling instructions from the eRACE Request Scheduler.

WebRACE Components

  • URLQueue
  • URLFetcher
  • Extractor & Normalizer
  • Object Cache
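One way the components listed above could fit together is sketched below in Java. It is an illustration under stated assumptions, not WebRACE's code: the class `MiniCrawlSketch`, its method names, and the `fetcher` function (a stand-in for the URLFetcher that returns a page body or `null`) are hypothetical, and the loop is single-threaded for brevity, whereas the real crawler is multithreaded.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names and structure are assumptions, not WebRACE's actual API.
class MiniCrawlSketch {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Normalizer: drop the fragment, resolve against the base URL, collapse dot segments. */
    static String normalize(String baseUrl, String link) {
        int hash = link.indexOf('#');
        String noFragment = (hash >= 0 ? link.substring(0, hash) : link).trim();
        URI base = URI.create(baseUrl);
        if (noFragment.isEmpty()) {
            return base.normalize().toString(); // "#top" refers to the page itself
        }
        return base.resolve(noFragment).normalize().toString();
    }

    /** Extractor: collect the double-quoted href targets of a page as normalized absolute URLs. */
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(normalize(baseUrl, m.group(1)));
        }
        return links;
    }

    /** Breadth-first crawl from the seed; returns the cached pages in visit order. */
    static Map<String, String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Deque<String> urlQueue = new ArrayDeque<>();       // URLQueue: URLs waiting to be fetched
        Set<String> seen = new HashSet<>();                // URLs already queued, to avoid refetching
        Map<String, String> cache = new LinkedHashMap<>(); // Object Cache: URL -> page body
        urlQueue.add(seed);
        seen.add(seed);
        while (!urlQueue.isEmpty() && cache.size() < maxPages) {
            String url = urlQueue.poll();
            String body = fetcher.apply(url);              // URLFetcher stand-in
            if (body == null) {
                continue;                                  // unreachable or unknown page
            }
            cache.put(url, body);
            for (String link : extractLinks(url, body)) {
                if (seen.add(link)) {
                    urlQueue.add(link);
                }
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // An in-memory "site" replaces real HTTP requests so the sketch runs offline.
        Map<String, String> site = Map.of(
            "http://example.org/index.html",
            "<a href=\"a.html\">A</a> <a href=\"b/../a.html#top\">A again</a>",
            "http://example.org/a.html",
            "<a href=\"index.html\">home</a>");
        System.out.println(crawl("http://example.org/index.html", site::get, 10).keySet());
    }
}
```

Both links on the index page normalize to the same URL, so the page is queued only once. A multithreaded variant would typically swap the queue and the seen-set for concurrent collections and run several fetcher threads against them.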


Publications

  • M. Dikaiakos, D. Zeinalipour-Yazti, "A Distributed Middleware Infrastructure for Personalized Services." Computer Communications, September 2004, Vol 27/15, pp. 1464-1480, Elsevier (available online through Elsevier's portal).
  • M. Dikaiakos, "Intermediary Infrastructures for the World-Wide Web." Computer Networks, Volume 45, Issue 4, June 2004, pp. 421-447, Elsevier (available online through Elsevier's portal).
  • D. Zeinalipour-Yazti, M. Dikaiakos, "Design and Implementation of a Distributed Crawler and Filtering Processor." In Proceedings of the Fifth International Workshop on Next Generation Information Technologies and Systems (NGITS'2002), A. Halevy, A. Gal (Eds.), Lecture Notes in Computer Science series, vol. 2382, pages 58-74, Springer, June 2002 (available through the Digital Library of Springer in pdf).
  • M. Dikaiakos, "Intermediaries for the World-Wide Web: Overview and Classification." In Proceedings of the 7th IEEE International Symposium on Computers and Communications, pages 231-238, Taormina, Italy, June 2002 (available in pdf).
  • M. Dikaiakos, D. Zeinalipour-Yazti, "WebRACE: A Distributed WWW Retrieval, Annotation, and Caching Engine." In PADDA01: International Workshop on Performance-oriented Application Development for Distributed Architectures, Munich, Germany, April 2001 (available in pdf).


Presentations

  • "Scheduling Policies for Distributed Crawling", by Eleni Tsiakouri, M.Sc. Thesis Presentation, University of Cyprus, June 2004 (available in pdf).