Clickstream-Based Parallel Focused Web Crawler
Keywords: Crawler · Data acquisition · Indexing · Parallel processing · Search engines · Web mining

Abstract
Because the Web grows larger every day, there is an ever-increasing need for Web crawlers and searching algorithms
that provide efficient, scalable mechanisms for crawling and indexing it. Given the voluminous size of the Web,
single-process crawlers are no longer sufficient, and link-based metrics no longer produce a properly ordered answer
set. Traditional link-based metrics in a parallel crawling environment also incur considerable communication
overhead, which reduces the speed of the crawler and consumes unnecessary bandwidth. The aim of this paper is to
propose an architecture for a parallel crawler that relies on content-based metrics to determine page importance.
Page importance is calculated by assigning each clickstream-derived metric a weight according to its relevance to
the context. The experiment was conducted on a blog, keeping in mind the politeness factor in the crawling process to
ensure that crawling did not affect the normal functioning of the website.
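As an illustrative sketch only (not the paper's implementation), the weighted combination of clickstream metrics
described above might be computed along the following lines; the metric names and weight values here are hypothetical
placeholders introduced for illustration:

```python
# Illustrative sketch, not the paper's actual scoring function.
# Combines normalized clickstream metrics into a page-importance score
# using relevance weights, as the abstract describes at a high level.

# Assumed metric names and weights; the paper's own metrics may differ.
CLICKSTREAM_WEIGHTS = {
    "page_views": 0.40,        # how often the page is viewed
    "avg_time_on_page": 0.35,  # average visit duration, normalized to [0, 1]
    "click_throughs": 0.25,    # outbound clicks recorded for the page
}

def page_importance(metrics: dict) -> float:
    """Weighted sum of normalized clickstream metrics (each assumed in [0, 1])."""
    return sum(weight * metrics.get(name, 0.0)
               for name, weight in CLICKSTREAM_WEIGHTS.items())

# Example: a frequently viewed page with moderate dwell time.
score = page_importance({"page_views": 0.9, "avg_time_on_page": 0.5, "click_throughs": 0.2})
print(round(score, 3))
```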