Facebook Interview Question
Software Engineer / Developers
Country: United States
Interview Type: In-Person
Two-phase processing.
1st phase: elect a server that gathers, from each server, its IP address and its number of hash buckets (which depends on the hash-bucket size and the storage capacity of that server). The elected server then sends this information to every server. Alternatively, you can discover the server information manually and distribute it to all servers. The elected server's IP address is the one exposed to clients.
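The bucket assignment described above can be sketched as a consistent-hash ring. This is a minimal illustration, not the answer's exact scheme: I'm assuming servers with more storage register more buckets (virtual nodes), so they own a larger share of the URL space.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps URLs to servers via consistent hashing.

    Hypothetical sketch of the phase-1 state: the elected server would
    build this ring from the gathered (IP, bucket count) pairs and
    broadcast it to every server.
    """

    def __init__(self):
        self._ring = []  # sorted list of (hash, server_ip)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, ip, num_buckets):
        # More storage => more buckets => larger share of the URL space.
        for b in range(num_buckets):
            bisect.insort(self._ring, (self._hash(f"{ip}#{b}"), ip))

    def server_for(self, url):
        # Walk clockwise to the first bucket at or after the URL's hash.
        h = self._hash(url)
        idx = bisect.bisect(self._ring, (h, ""))
        return self._ring[idx % len(self._ring)][1]
```

With consistent hashing, adding or losing one server only remaps the URLs in that server's buckets, which matters when phase 1 reruns after a failure.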
2nd phase:
- The elected server hashes the top URL into an IP/hash-bucket number using consistent hashing and sends the retrieval request to the mapped server
- Each bot waits for retrieval requests
- If a request is received and the URL has not been fetched yet (or has been updated since), the bot retrieves the URL, parses the returned page, and extracts the linked URLs (external links could be excluded in this phase)
- For each URL found in the page:
- hash it into an IP/hash-bucket number using consistent hashing and send a retrieval request to the mapped server
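The phase-2 loop above can be sketched as a single request handler. This is an assumed shape, not the answer's code: `fetch` and `route` are injected stand-ins for the real HTTP download and the cross-server retrieval request, so the dedup-then-extract-then-forward logic is visible on its own.

```python
import urllib.parse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def handle_retrieval_request(url, seen, fetch, route):
    """One phase-2 step: fetch a URL at most once, extract its links,
    and forward each link to its mapped server via `route`."""
    if url in seen:        # already fetched; ignore duplicate requests
        return
    seen.add(url)
    page = fetch(url)
    parser = LinkExtractor()
    parser.feed(page)
    base_host = urllib.parse.urlparse(url).netloc
    for link in parser.links:
        absolute = urllib.parse.urljoin(url, link)
        # Skip external links, as the answer suggests for this phase.
        if urllib.parse.urlparse(absolute).netloc == base_host:
            route(absolute)  # i.e. send retrieval request to mapped server
```

A real bot would also track freshness (the "or the URL is updated" case), e.g. by storing a fetch timestamp or ETag alongside each entry in `seen`.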
The elected server sends heartbeat messages to keep track of active servers. If a timeout occurs, the elected server restarts the 1st phase.
If a server does not receive a heartbeat message within the timeout, it voluntarily starts the 1st phase itself.
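The failure-detection side of the heartbeat scheme can be sketched as a last-seen table with a timeout cutoff. This is a minimal illustration under assumed names; the clock is injected so the behavior is testable, and the caller would trigger phase 1 whenever `dead_servers()` is non-empty.

```python
import time

class HeartbeatMonitor:
    """Tracks when each server was last heard from.

    Hypothetical sketch: the elected server would call beat() on every
    heartbeat reply and restart phase 1 when dead_servers() is non-empty.
    """

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s
        self.now = now            # injectable clock for testing
        self.last_seen = {}       # server_ip -> last heartbeat time

    def beat(self, server_ip):
        self.last_seen[server_ip] = self.now()

    def dead_servers(self):
        cutoff = self.now() - self.timeout_s
        return [ip for ip, t in self.last_seen.items() if t < cutoff]
```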
Any client can query the elected server for a URL; the elected server hashes the URL, sends a redirect message to the client, and the client re-issues the query to the redirected server.
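The redirect step might look like the handler below. This is an assumed sketch: `ring` stands for any object exposing a `server_for(url)` lookup (e.g. a consistent-hash ring), and the 302-style response dict stands in for a real HTTP redirect.

```python
def handle_client_query(url, ring):
    """Elected server's lookup: hash the URL via the ring and redirect
    the client to the server that holds (or will fetch) it."""
    target_ip = ring.server_for(url)
    return {"status": 302, "location": f"http://{target_ip}/fetch?url={url}"}
```

Redirecting rather than proxying keeps the elected server off the data path, so it only handles small lookup requests.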
The question states a bot, so it should not be a vanilla spider.
- Kumar March 03, 2014

1. We should be familiar with the target site to be downloaded. That would help in classifying/distributing the work among the bots.
2. There'd be a control bot in the network that distributes the payload to the other bots. Any bot can become the control bot in case the current one goes down.
3. The downloaded webpages will remain on the client. Only metadata about what was downloaded is sent to the control bot. Subsequent queries can use this metadata to route each request to the correct bot.
4. The network activity should be covert, so worker bots will operate in small groups to avoid network detection. Also, the payload should be picked from different sections of the website to make the traffic look more normal.