Abstract
A web crawler can be either a standalone program or a distributed system that downloads webpages from the URLs stored in its queue before processing them. Web crawlers play a vital role in building high-performance search engines. They are expected to parse pages written in as many languages as possible, filter spam, evade crawler traps, and prepare the data for the next round of processing. This research introduces important terminology and background related to web crawlers and distributed systems. In addition, we implement in Python and compare the performance of three methods for web crawling: breadth-first search, Google's PageRank algorithm, and historical PageRank. Our experiments show that the PageRank algorithm is the fastest at processing webpages.
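To illustrate the breadth-first approach mentioned above, the following is a minimal sketch of a queue-based crawler in Python. It is not the thesis implementation; the seed URL, page cap, and the regex used to extract links are placeholder assumptions for illustration only.

    # Minimal breadth-first crawler sketch (illustrative; not the thesis code).
    # The seed URL, page cap, and link-extraction regex are assumptions.
    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    HREF_RE = re.compile(r'href="(http[^"]+)"')

    def bfs_crawl(seed, max_pages=20):
        """Visit pages in FIFO (breadth-first) order starting from the seed URL."""
        queue = deque([seed])
        seen = {seed}
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
            except Exception:
                continue  # skip unreachable or non-text pages
            for link in HREF_RE.findall(html):
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return seen

    if __name__ == "__main__":
        print(bfs_crawl("https://example.com"))

The PageRank and historical PageRank crawlers differ only in how URLs are ordered for the next fetch; the download-and-parse loop stays the same.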
Advisor
Guarnera, Drew
Second Advisor
Visa, Sofia
Department
Computer Science
Recommended Citation
Dohri, Salim, "Comparing Three Approaches To Webcrawling" (2022). Senior Independent Study Theses. Paper 9638.
https://openworks.wooster.edu/independentstudy/9638
Publication Date
2022
Degree Granted
Bachelor of Arts
Document Type
Senior Independent Study Thesis
© Copyright 2022 Salim Dohri