Amazon Interview Question
SDE-3 | Country: United States
Functionality of Web Crawler
1. Take a list of websites (seed URLs) to be crawled.
2. Store all crawled pages.
3. Define a crawl frequency per type of website - new websites should be crawled more frequently.
4. Respect robots.txt to determine what should not be crawled.
5. Detect whether a page has changed; if so, recrawl it.
6. Parse and persist the content.
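Requirement 4 can be handled with Python's standard `urllib.robotparser`. A minimal sketch, assuming a hypothetical robots.txt that has already been fetched from the site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this would be fetched
# from https://example.com/robots.txt before crawling that site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Python's parser applies the first matching rule for the user agent.
print(parser.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/data"))  # False
```

URLs disallowed for the crawler's user agent are simply never enqueued.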
Need a queue for a BFS-style traversal of the link graph.
Data structures
1. Set: key is the hash of the URL, value is the parsed content.
2. Zset (sorted set): member is the hash of the URL, score is the crawl timestamp.
Queue - FIFO. On dequeue, check whether the URL is already in the Set; if not, store the content in the Set and record the timestamp in the Zset.
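The queue/Set/Zset flow above can be sketched with in-memory stand-ins for the Redis structures (a sketch only; in production these would be redis-py calls such as SISMEMBER, SADD, ZADD, and LPUSH/RPOP):

```python
import hashlib
import time
from collections import deque

frontier = deque()   # FIFO queue of URLs to crawl
seen = {}            # "Set": url hash -> parsed content
crawl_times = {}     # "Zset": url hash -> last-crawl timestamp (the score)

def url_hash(url: str) -> str:
    return hashlib.md5(url.encode()).hexdigest()

def process(url: str, fetch) -> None:
    """Dequeue side of the FIFO: skip URLs already in the Set,
    otherwise store the parsed content and record the crawl time."""
    h = url_hash(url)
    if h in seen:                  # SISMEMBER equivalent
        return
    seen[h] = fetch(url)           # SADD equivalent (store parsed content)
    crawl_times[h] = time.time()   # ZADD equivalent (timestamp as score)

frontier.extend(["https://example.com/a",
                 "https://example.com/a",   # duplicate, will be skipped
                 "https://example.com/b"])
while frontier:
    process(frontier.popleft(), fetch=lambda u: f"<parsed {u}>")

print(len(seen))  # 2 distinct URLs stored
```

The Zset keyed by timestamp is what later drives requirement 3: range queries over the scores pick out pages due for a recrawl.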
Technique
- Bloom filter to quickly tell whether a page is definitely not in storage. This is available out of the box in Redis (the RedisBloom module).
- For detecting page modification, rely on the modification time, an MD5 digest of the content, etc.; this can be persisted as a separate set.
A Hash in Redis can also hold this per-URL metadata.
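The change-detection idea can be sketched by persisting a content digest per URL and recrawling only when it differs (a minimal sketch; a real crawler would also compare the Last-Modified/ETag headers before downloading the body):

```python
import hashlib

# Per-URL MD5 of the last-seen content (the "separate set" mentioned above).
page_digests = {}

def needs_recrawl(url: str, content: bytes) -> bool:
    """Return True when the page changed since the last crawl."""
    digest = hashlib.md5(content).hexdigest()
    if page_digests.get(url) == digest:
        return False               # unchanged: skip re-parsing
    page_digests[url] = digest     # persist the new digest
    return True

print(needs_recrawl("https://example.com", b"v1"))  # True (first sight)
print(needs_recrawl("https://example.com", b"v1"))  # False (unchanged)
print(needs_recrawl("https://example.com", b"v2"))  # True (modified)
```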
Given a seed URL, the crawler needed to auto-discover the value of the missing fields for a particular record. So if a web page didn't contain the information that I was looking for, the crawler needed to follow outbound links, until the information was found.
- justhelping, September 11, 2019

It needed to be some kind of crawler-scraper hybrid, because it had to simultaneously follow outbound links and extract specific information from web pages.
The whole thing needed to be distributed, because there were potentially hundreds of millions of URLs to visit.
The scraped data needed to be stored somewhere, most likely in a database.
The crawler needed to work 24/7, so running it on my laptop wasn't an option.
I didn't want it to cost too much in cloud hosting.
It needed to be coded in Python, my language of choice.
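The core behaviour described — follow outbound links breadth-first until the missing field turns up — can be sketched with an injected fetch function. The page graph and the "field" here are made up for illustration; a real crawler would fetch and parse HTML (e.g. with requests and BeautifulSoup):

```python
from collections import deque

# Hypothetical site: each page has outbound links and maybe the target field.
PAGES = {
    "seed": {"links": ["a", "b"], "field": None},
    "a":    {"links": ["c"],      "field": None},
    "b":    {"links": [],         "field": None},
    "c":    {"links": [],         "field": "phone: 555-0100"},
}

def find_field(seed, fetch):
    """BFS from seed, following outbound links until a page yields the field."""
    frontier, visited = deque([seed]), {seed}
    while frontier:
        page = fetch(frontier.popleft())
        if page["field"] is not None:
            return page["field"]          # scraped the value: stop crawling
        for link in page["links"]:
            if link not in visited:       # never revisit a page
                visited.add(link)
                frontier.append(link)
    return None

print(find_field("seed", PAGES.__getitem__))  # phone: 555-0100
```

Distributing this comes down to sharing the `frontier` and `visited` structures across workers, e.g. a Redis list and set, with each worker running this same loop.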