Facebook Interview Question
Software Engineer / Developers
Country: United States
Interview Type: In-Person
Two-phase processing.
1st phase: elect a server that gathers, from each server, its IP address and its number of hash buckets (which depends on the hash-bucket size and the storage capacity of that server). The elected server then sends this information to every server. Alternatively, you can discover the server information manually and distribute it to all servers. The elected server's IP address is the one exposed to clients.
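The bucket assignment described above can be sketched as a consistent-hash ring. This is a minimal illustration, not the answer's exact scheme: I'm assuming servers with more storage register more buckets (virtual nodes), so they own a larger share of the URL space.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps URLs to servers via consistent hashing.

    Hypothetical sketch of the phase-1 state: the elected server would
    build this ring from the gathered (IP, bucket count) pairs and
    broadcast it to every server.
    """

    def __init__(self):
        self._ring = []  # sorted list of (hash, server_ip)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, ip, num_buckets):
        # More storage => more buckets => larger share of the URL space.
        for b in range(num_buckets):
            bisect.insort(self._ring, (self._hash(f"{ip}#{b}"), ip))

    def server_for(self, url):
        # Walk clockwise to the first bucket at or after the URL's hash.
        h = self._hash(url)
        idx = bisect.bisect(self._ring, (h, ""))
        return self._ring[idx % len(self._ring)][1]
```

With consistent hashing, adding or losing one server only remaps the URLs in that server's buckets, which matters when phase 1 reruns after a failure.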
2nd phase:
- The elected server hashes the top URL into an IP/hash-bucket number using consistent hashing and sends the retrieval request to the mapped server
- Each bot waits for retrieval requests
- If a request is received and the URL has not been fetched yet (or has been updated since), the bot retrieves the URL, parses the returned page, and extracts the linked URLs (external links could be excluded in this phase)
- For each URL found in the page:
- hash it into an IP/hash-bucket number using consistent hashing and send a retrieval request to the mapped server
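The phase-2 loop above can be sketched as a single request handler. This is an assumed shape, not the answer's code: `fetch` and `route` are injected stand-ins for the real HTTP download and the cross-server retrieval request, so the dedup-then-extract-then-forward logic is visible on its own.

```python
import urllib.parse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def handle_retrieval_request(url, seen, fetch, route):
    """One phase-2 step: fetch a URL at most once, extract its links,
    and forward each link to its mapped server via `route`."""
    if url in seen:        # already fetched; ignore duplicate requests
        return
    seen.add(url)
    page = fetch(url)
    parser = LinkExtractor()
    parser.feed(page)
    base_host = urllib.parse.urlparse(url).netloc
    for link in parser.links:
        absolute = urllib.parse.urljoin(url, link)
        # Skip external links, as the answer suggests for this phase.
        if urllib.parse.urlparse(absolute).netloc == base_host:
            route(absolute)  # i.e. send retrieval request to mapped server
```

A real bot would also track freshness (the "or the URL is updated" case), e.g. by storing a fetch timestamp or ETag alongside each entry in `seen`.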
The elected server sends heartbeat messages to keep track of active servers. If a timeout occurs, the elected server restarts the 1st phase.
If a server does not receive a heartbeat message within the timeout, it voluntarily starts the 1st phase itself.
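The failure-detection side of the heartbeat scheme can be sketched as a last-seen table with a timeout cutoff. This is a minimal illustration under assumed names; the clock is injected so the behavior is testable, and the caller would trigger phase 1 whenever `dead_servers()` is non-empty.

```python
import time

class HeartbeatMonitor:
    """Tracks when each server was last heard from.

    Hypothetical sketch: the elected server would call beat() on every
    heartbeat reply and restart phase 1 when dead_servers() is non-empty.
    """

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s
        self.now = now            # injectable clock for testing
        self.last_seen = {}       # server_ip -> last heartbeat time

    def beat(self, server_ip):
        self.last_seen[server_ip] = self.now()

    def dead_servers(self):
        cutoff = self.now() - self.timeout_s
        return [ip for ip, t in self.last_seen.items() if t < cutoff]
```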
Any client can query the elected server for a URL; the elected server hashes the URL, sends a redirect message to the client, and the client re-issues the query to the redirected server.
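The redirect step might look like the handler below. This is an assumed sketch: `ring` stands for any object exposing a `server_for(url)` lookup (e.g. a consistent-hash ring), and the 302-style response dict stands in for a real HTTP redirect.

```python
def handle_client_query(url, ring):
    """Elected server's lookup: hash the URL via the ring and redirect
    the client to the server that holds (or will fetch) it."""
    target_ip = ring.server_for(url)
    return {"status": 302, "location": f"http://{target_ip}/fetch?url={url}"}
```

Redirecting rather than proxying keeps the elected server off the data path, so it only handles small lookup requests.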
The question states a bot, so it should not be a vanilla spider.
- Kumar March 03, 2014

1. We should be familiar with the target site to be downloaded. That would help in classifying/distributing the work among the bots.
2. There'd be a control bot in the network that distributes the payload to the other bots. Any bot can become the control bot in case the current one goes down.
3. The downloaded webpages will remain on the client. Only metadata about what was downloaded is sent to the control bot. Subsequent queries can use this metadata to route each request to the correct bot.
4. The network activity should be covert, so worker bots will operate in small groups to avoid network detection. Also, the payload should be picked from different sections of the website to make the traffic look more normal.