Microsoft Interview Question
There are at least three components required:
1. HTTP Request/Getting page.
2. HTML Parser
3. URL Tracker
The first component requests a given URL and either downloads the page to the machine or just keeps it in memory. (Downloading to disk needs some design for storing the pages so they can be retrieved easily.)
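As a rough sketch of that fetch component in Java (using the standard java.net.http.HttpClient; the class name PageFetcher is just my own, for illustration):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageFetcher {
    private final HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Fetches the page body as a String, kept in memory rather than written to disk.
    public String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}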
HTML Parser - Removes the HTML tags and retains the text of interest (I needed only part of the page, based on some pattern) as well as the URLs found on the current page. A more generic web crawler would also have to save other content such as images and sound.
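A sketch of that parser using jsoup, one of the Java HTML-parsing libraries; the class name PageParser is just for illustration:

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PageParser {
    // Strips the HTML tags and returns the visible text of the page.
    public String extractText(String html, String baseUrl) {
        return Jsoup.parse(html, baseUrl).text();
    }

    // Collects the absolute URLs of all <a href> links on the page.
    public List<String> extractLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("abs:href")); // resolve relative links against baseUrl
        }
        return links;
    }
}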
URL Tracker - Makes sure that no URL is visited twice within a set time frame. (A simple mechanism is a hash table with a user-defined comparator/normalization function, since some URLs may point to the exact same page, e.g. www.abc.com and www.abc.com/index.htm.)
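A sketch of the tracker, assuming a very crude normalization rule (lowercasing plus stripping a trailing "/" or "/index.htm", which is just one possible heuristic) and a configurable revisit interval:

import java.util.HashMap;
import java.util.Map;

public class UrlTracker {
    private final Map<String, Long> lastVisited = new HashMap<>();
    private final long revisitIntervalMillis;

    public UrlTracker(long revisitIntervalMillis) {
        this.revisitIntervalMillis = revisitIntervalMillis;
    }

    // Crude normalization so duplicates like www.abc.com and www.abc.com/index.htm collapse.
    private String normalize(String url) {
        String u = url.toLowerCase();
        if (u.endsWith("/index.htm") || u.endsWith("/index.html")) {
            u = u.substring(0, u.lastIndexOf('/'));
        }
        if (u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u;
    }

    // Returns true (and records the visit) only if the URL has not been seen within the time frame.
    public synchronized boolean shouldVisit(String url) {
        String key = normalize(url);
        long now = System.currentTimeMillis();
        Long last = lastVisited.get(key);
        if (last != null && now - last < revisitIntervalMillis) {
            return false;
        }
        lastVisited.put(key, now);
        return true;
    }
}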
The web crawler should start with a set of URLs.
A simple web crawler is pretty easy to implement. In Java, I know that there are a few libraries that would help you parse HTML pages.
- vodangkhoa February 27, 2007

Given a URL, get all the URLs that are on that page. Then it becomes a breadth-first search or depth-first search traversal, whichever you choose. DFS might consume too much memory in this case.
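Putting it together, a BFS-style crawl loop over the sketches above (seed URLs go into a queue, and the crawl is bounded by a page limit) could look roughly like this:

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class Crawler {
    private final PageFetcher fetcher = new PageFetcher();
    private final PageParser parser = new PageParser();
    private final UrlTracker tracker = new UrlTracker(24L * 60 * 60 * 1000); // one-day revisit window

    // Breadth-first crawl starting from the seed URLs, bounded by maxPages.
    public void crawl(List<String> seeds, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(seeds);
        int visited = 0;
        while (!frontier.isEmpty() && visited < maxPages) {
            String url = frontier.poll();
            if (!tracker.shouldVisit(url)) {
                continue; // already seen within the time frame
            }
            try {
                String html = fetcher.fetch(url);
                visited++;
                // Enqueue newly discovered links; BFS keeps the frontier in a queue instead of on the call stack.
                frontier.addAll(parser.extractLinks(html, url));
            } catch (Exception e) {
                // Skip pages that fail to download or parse.
            }
        }
    }
}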