Amazon Interview Question
SDE-3 | Country: United States
Functionality of Web Crawler
1. Take a list of websites (seed URLs) to be crawled.
2. Store all crawled pages.
3. Define a crawl frequency per type of website - new websites should be crawled more frequently.
4. Respect robots.txt to determine what should not be crawled.
5. Detect whether a page has changed; if so, recrawl it.
6. Parse and persist the content.
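Requirement 4 can be handled with Python's standard `urllib.robotparser`. A minimal sketch, assuming a hypothetical robots.txt that has already been fetched from the site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this would be fetched
# from https://example.com/robots.txt before crawling that site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Python's parser applies the first matching rule for the user agent.
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/data"))  # False
```

URLs disallowed for the crawler's user agent are simply never enqueued.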
Need a queue for a BFS-style traversal of the link graph.
Data structures
1. Set: key is the hash of the URL, value is the parsed content.
2. Zset (sorted set): member is the hash of the URL, score is the crawl timestamp.
Queue - FIFO. On dequeue, check whether the URL is already in the Set; if not, store the content in the Set and record the timestamp in the Zset.
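The queue/Set/Zset flow above can be sketched with in-memory stand-ins for the Redis structures (a sketch only; in production these would be redis-py calls such as SISMEMBER, SADD, ZADD, and LPUSH/RPOP):

```python
import hashlib
import time
from collections import deque

frontier = deque()   # FIFO queue of URLs to crawl
seen = {}            # "Set": url hash -> parsed content
crawl_times = {}     # "Zset": url hash -> last-crawl timestamp (the score)

def url_hash(url: str) -> str:
    return hashlib.md5(url.encode()).hexdigest()

def process(url: str, fetch) -> None:
    """Dequeue side of the FIFO: skip URLs already in the Set,
    otherwise store the parsed content and record the crawl time."""
    h = url_hash(url)
    if h in seen:                  # SISMEMBER equivalent
        return
    seen[h] = fetch(url)           # SADD equivalent (store parsed content)
    crawl_times[h] = time.time()   # ZADD equivalent (timestamp as score)

frontier.extend(["https://example.com/a",
                 "https://example.com/a",   # duplicate, will be skipped
                 "https://example.com/b"])
while frontier:
    process(frontier.popleft(), fetch=lambda u: f"<parsed {u}>")

print(len(seen))  # 2 distinct URLs stored
```

The Zset keyed by timestamp is what later drives requirement 3: range queries over the scores pick out pages due for a recrawl.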
Technique
- Bloom filter to quickly tell whether a page is definitely not in storage. This is available out of the box in Redis (the RedisBloom module).
- For detecting page modification, rely on the modification time, an MD5 digest of the content, etc.; this can be persisted as a separate set.
A Hash in Redis can also hold this per-URL metadata.
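The change-detection idea can be sketched by persisting a content digest per URL and recrawling only when it differs (a minimal sketch; a real crawler would also compare the Last-Modified/ETag headers before downloading the body):

```python
import hashlib

# Per-URL MD5 of the last-seen content (the "separate set" mentioned above).
page_digests = {}

def needs_recrawl(url: str, content: bytes) -> bool:
    """Return True when the page changed since the last crawl."""
    digest = hashlib.md5(content).hexdigest()
    if page_digests.get(url) == digest:
        return False               # unchanged: skip re-parsing
    page_digests[url] = digest     # persist the new digest
    return True

print(needs_recrawl("https://example.com", b"v1"))  # True (first sight)
print(needs_recrawl("https://example.com", b"v1"))  # False (unchanged)
print(needs_recrawl("https://example.com", b"v2"))  # True (modified)
```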
Given a seed URL, the crawler needed to auto-discover the value of the missing fields for a particular record. So if a web page didn't contain the information that I was looking for, the crawler needed to follow outbound links, until the information was found.
- justhelping, September 11, 2019

It needed to be some kind of crawler-scraper hybrid, because it had to simultaneously follow outbound links and extract specific information from web pages.
The whole thing needed to be distributed, because there were potentially hundreds of millions of URLs to visit.
The scraped data needed to be stored somewhere, most likely in a database.
The crawler needed to work 24/7, so running it on my laptop wasn't an option.
I didn't want it to cost too much in cloud hosting.
It needed to be coded in Python, my language of choice.
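The core behaviour described — follow outbound links breadth-first until the missing field turns up — can be sketched with an injected fetch function. The page graph and the "field" here are made up for illustration; a real crawler would fetch and parse HTML (e.g. with requests and BeautifulSoup):

```python
from collections import deque

# Hypothetical site: each page has outbound links and maybe the target field.
PAGES = {
    "seed": {"links": ["a", "b"], "field": None},
    "a":    {"links": ["c"],      "field": None},
    "b":    {"links": [],         "field": None},
    "c":    {"links": [],         "field": "phone: 555-0100"},
}

def find_field(seed, fetch):
    """BFS from seed, following outbound links until a page yields the field."""
    frontier, visited = deque([seed]), {seed}
    while frontier:
        page = fetch(frontier.popleft())
        if page["field"] is not None:
            return page["field"]          # scraped the value: stop crawling
        for link in page["links"]:
            if link not in visited:       # never revisit a page
                visited.add(link)
                frontier.append(link)
    return None

print(find_field("seed", PAGES.__getitem__))  # phone: 555-0100
```

Distributing this comes down to sharing the `frontier` and `visited` structures across workers, e.g. a Redis list and set, with each worker running this same loop.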