Clickstream-Based Parallel Focused Web Crawler
Keywords: Crawler · Data acquisition · Indexing · Parallel processing · Search engines · Web mining

Abstract
Because the Web grows larger every day, there is an ever-increasing need for Web crawlers and searching algorithms
that provide efficient, scalable mechanisms for crawling and indexing it. Given the voluminous size of the Web,
single-process crawlers are no longer sufficient, and link-based metrics no longer produce a properly ordered answer
set. Traditional link-based metrics in a parallel crawling environment also incur considerable communication
overhead, which reduces the speed of the crawler and consumes unnecessary bandwidth. The aim of this paper is to
propose an architecture for a parallel crawler that relies on content-based metrics to determine page importance.
Page importance is calculated by assigning each clickstream-derived metric a weight according to its relevance to
the context. The experiment was conducted on a blog, keeping in mind the politeness factor in the crawling process to
ensure that crawling did not affect the normal functioning of the website.
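As an illustrative sketch only (not the paper's implementation), the weighted combination of clickstream metrics
described above might be computed along the following lines; the metric names and weight values here are hypothetical
placeholders introduced for illustration:

```python
# Illustrative sketch, not the paper's actual scoring function.
# Combines normalized clickstream metrics into a page-importance score
# using relevance weights, as the abstract describes at a high level.

# Assumed metric names and weights; the paper's own metrics may differ.
CLICKSTREAM_WEIGHTS = {
    "page_views": 0.40,        # how often the page is viewed
    "avg_time_on_page": 0.35,  # average visit duration, normalized to [0, 1]
    "click_throughs": 0.25,    # outbound clicks recorded for the page
}

def page_importance(metrics: dict) -> float:
    """Weighted sum of normalized clickstream metrics (each assumed in [0, 1])."""
    return sum(weight * metrics.get(name, 0.0)
               for name, weight in CLICKSTREAM_WEIGHTS.items())

# Example: a frequently viewed page with moderate dwell time.
score = page_importance({"page_views": 0.9, "avg_time_on_page": 0.5, "click_throughs": 0.2})
print(round(score, 3))
```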