Abstract
A web crawler can be either a standalone program or a distributed system that downloads webpages from the URLs stored in its queue before processing them. Web crawlers play a vital role in building high-performance search engines. They are expected to parse pages written in as many languages as possible, filter spam, evade crawler traps, and prepare the data for the next round of processing. This research introduces important terminology and background related to web crawlers and distributed systems. In addition, we implement in Python and compare the performance of three methods for web crawling: breadth-first search, Google's PageRank algorithm, and historical PageRank. Our experiments show that the PageRank algorithm is the fastest at processing webpages.
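To illustrate the breadth-first approach mentioned above, the following is a minimal sketch of a queue-based crawler in Python. It is not the thesis implementation; the seed URL, page cap, and the regex used to extract links are placeholder assumptions for illustration only.

    # Minimal breadth-first crawler sketch (illustrative; not the thesis code).
    # The seed URL, page cap, and link-extraction regex are assumptions.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    HREF_RE = re.compile(r'href="(http[^"]+)"')

    def bfs_crawl(seed, max_pages=20):
        """Visit pages in FIFO (breadth-first) order starting from the seed URL."""
        queue = deque([seed])
        seen = {seed}
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except Exception:
                continue  # skip unreachable or non-text pages
            for link in HREF_RE.findall(html):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen

    if __name__ == "__main__":
        print(bfs_crawl("https://example.com"))

The PageRank and historical PageRank crawlers differ only in how URLs are ordered for the next fetch; the download-and-parse loop stays the same.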
Advisor
Guarnera, Drew
Second Advisor
Visa, Sofia
Department
Computer Science
Recommended Citation
Dohri, Salim, "Comparing Three Approaches To Webcrawling" (2022). Senior Independent Study Theses. Paper 9638.
https://openworks.wooster.edu/independentstudy/9638
Publication Date
2022
Degree Granted
Bachelor of Arts
Document Type
Senior Independent Study Thesis
© Copyright 2022 Salim Dohri