Abstract

A web crawler can be either a standalone program or a distributed system that downloads webpages from the URLs stored in its queue before processing them. Web crawlers play a vital role in building high-performance search engines. They are expected to parse pages written in as many languages as possible, filter spam, avoid crawler traps, and prepare the data for the next round of processing. This research introduces important terminology and background related to web crawlers and distributed systems. In addition, we implement in Python and compare the performance of three methods for web crawling, namely breadth-first search, Google's PageRank algorithm, and historical PageRank. Our experiments show that the PageRank-based crawler is the fastest at processing webpages.
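
To illustrate the queue-based crawling described above, the following is a minimal sketch of a breadth-first crawler in Python. It uses only the standard library; the seed URL, page limit, politeness delay, and helper names are illustrative assumptions and do not reflect the thesis's actual implementation.

# Minimal breadth-first-search crawler sketch (standard library only).
# Seed URL, page limit, and delay below are illustrative choices.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import time


class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def bfs_crawl(seed, max_pages=50, delay=1.0):
    """Download pages in FIFO (breadth-first) order starting from `seed`."""
    queue = deque([seed])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip unreachable or non-HTML pages
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).scheme in ("http", "https"):
                queue.append(absolute)
        time.sleep(delay)  # simple politeness delay between requests
    return visited


if __name__ == "__main__":
    pages = bfs_crawl("https://example.com", max_pages=10)
    print(f"Crawled {len(pages)} pages")

The FIFO queue is what makes this breadth-first: pages are fetched in the order their links were discovered. The PageRank and historical PageRank methods compared in the thesis instead prioritize which URL to fetch next, rather than processing the queue strictly in arrival order.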

Advisor

Drew Guarnera

Second Advisor

Sofia Visa

Department

Computer Science

Publication Date

2022

Degree Granted

Bachelor of Arts

Document Type

Senior Independent Study Thesis

© Copyright 2022 Salim Dohri