Facebook Interview Question for Software Engineer / Developers

Country: United States
Interview Type: In-Person

Comment hidden because of low score. Click to expand.
of 0 vote

I would chunk each page in blocks of text (I can leverage the DOM of the page... the way you optimize this step deeply affect the final result). Then I would apply an hash function on each chunk and represent the page A as the set of that hash numbers A = [h_0, ... h_n].

Now, I can build a inverted index of the hashes so that h_i => {A,B} meaning that the h_i is present both in A and in B. At this point whenever I render the page A I will generate a link to the page B associated to the chunk of text having h_i as hash number.

I will keep updated the index by keep crowling the Web and updating internal representation of the pages and the index like a SE does.

- Claudio October 05, 2012 | Flag Reply
Comment hidden because of low score. Click to expand.
of 0 vote

What constitutes a quote? - A fragment of text in a page constitutes a quote. For now we will only consider a text dump rather than gathering any semantic meaning to it.

How do you find quotes?
If more than one page has common occurrences of fragments of text, then it constitutes a quote. As for which page is going to be quoted can be determined by some page rank or the timestamp of the page that was created.

How do you make it scale to the web?
We crawl the web across to gather documents. These documents would then have to be compared amongst each other. The comparison can be limited to similar domains of the pages.

How do you handle updates?

The way the information would be laid out is, typically there would/could be a many(children) to one (parent) relationship of pages. Lets call this a group.

When a new page comes in, there are three possibilities:
1. It is a page which will be added to an already existing grouping.
2. It is a page for which a grouping doesn't exist yet. And this will be the new parent referring to itself.
3. It is a page that will replace the parent of a group.

Even within a domain there would be many such groupings. To assign some priority, we can have the number of children quoting the parent treated as a rank. So the node with the highest rank gets checked against first.

If there is no match against any of the highest rank nodes, then that means the children will not match as well. So we create a new parent with rank 1 and the page points to itself.

#1It matches one of the high rank document and gets added to the other nodes.

Along side checking against the highest rank i.e. checkIfQuoteFrom(parent,new_page) , we should also do a checkIfQuoteFrom(new_page,parent). If it returns true for the second call, then either a) the pages are the same. b) new_page is the parent of the current parent ( determined by timestamp amongst other metrics).

If it is a) then we can discard it. If it is b) then we replace the parent with new_page and have the parent join the other set of nodes.

How would you arrange the servers?
There would be a set of servers that would be responsible for crawling the web.
Another set of servers who would be doing a comparison of pages to get the quotes.
As soon as the the crawlers would return back with the set of results, it would be written to some cache and not the database. After massaging the data received is when they will be stored in the db with their state ( discussed above).

Even to fetch the results, we would cache the high frequency requests to ensure minimum db reads.

What data structures would you use?

A decreasing order of nodes ( based on rank ) for the pages to scan. Each of these nodes would be referred to by children. An object representing a domain pointing to these nodes.

A tree like ds that represents the reverse domain and each node there in would store the location to access the aforementioned list.

- capt_caveman March 17, 2014 | Flag Reply
Comment hidden because of low score. Click to expand.
of 0 vote

What constitutes a quote?
A fragment text longer than some threshold, say 20 words

How do you find quotes?
Cluster all the web pages. For a web page, compare with all other pages in the cluster. O(mn).
Store quotes in db. Key is url, value is a list of position and url where it is reproduced.

How do you make it scale to the web?
There are 1t web pages over the internet, store all of them in clusters.

How do you handle updates?
Periodically crawl all the web pages and update the quotes. Given a url, check if the quoted positions are updated. If it is updated, then recompute the quotes.

How would you arrange the servers?
Two database, one to store all the clustered web pages, the other stores the quotes.

- ly October 08, 2016 | Flag Reply

Add a Comment

Writing Code? Surround your code with {{{ and }}} to preserve whitespace.


is a comprehensive book on getting a job at a top tech company, while focuses on dev interviews and does this for PMs.

Learn More


CareerCup's interview videos give you a real-life look at technical interviews. In these unscripted videos, watch how other candidates handle tough questions and how the interviewer thinks about their performance.

Learn More

Resume Review

Most engineers make critical mistakes on their resumes -- we can fix your resume with our custom resume review service. And, we use fellow engineers as our resume reviewers, so you can be sure that we "get" what you're saying.

Learn More

Mock Interviews

Our Mock Interviews will be conducted "in character" just like a real interview, and can focus on whatever topics you want. All our interviewers have worked for Microsoft, Google or Amazon, you know you'll get a true-to-life experience.

Learn More