Microsoft Interview Question
There are at least three components required:
1. HTTP Request/Getting page.
2. HTML Parser
3. URL Tracker
The first component requests a given URL and either downloads the page to the machine or just keeps it in memory. (Downloading to disk needs some design for storing the pages so they can be retrieved easily.)
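As a rough sketch of that fetch component in Java (using the standard java.net.http.HttpClient; the class name PageFetcher is just my own, for illustration):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageFetcher {
    private final HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Fetches the page body as a String, kept in memory rather than written to disk.
    public String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}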
HTML Parser - Removes the HTML tags and retains the text of interest (I needed only part of the page, based on some pattern) as well as the URLs found on the current page. A more generic web crawler would also have to save other content such as images and sound.
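A sketch of that parser using jsoup, one of the Java HTML-parsing libraries; the class name PageParser is just for illustration:

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PageParser {
    // Strips the HTML tags and returns the visible text of the page.
    public String extractText(String html, String baseUrl) {
        return Jsoup.parse(html, baseUrl).text();
    }

    // Collects the absolute URLs of all <a href> links on the page.
    public List<String> extractLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            links.add(a.attr("abs:href")); // resolve relative links against baseUrl
        }
        return links;
    }
}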
URL Tracker - Makes sure that no URL is visited twice within a set time frame. (A simple mechanism is a hash table with a user-defined comparator/normalization function, since some URLs may point to the exact same page, e.g. www.abc.com and www.abc.com/index.htm.)
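A sketch of the tracker, assuming a very crude normalization rule (lowercasing plus stripping a trailing "/" or "/index.htm", which is just one possible heuristic) and a configurable revisit interval:

import java.util.HashMap;
import java.util.Map;

public class UrlTracker {
    private final Map<String, Long> lastVisited = new HashMap<>();
    private final long revisitIntervalMillis;

    public UrlTracker(long revisitIntervalMillis) {
        this.revisitIntervalMillis = revisitIntervalMillis;
    }

    // Crude normalization so duplicates like www.abc.com and www.abc.com/index.htm collapse.
    private String normalize(String url) {
        String u = url.toLowerCase();
        if (u.endsWith("/index.htm") || u.endsWith("/index.html")) {
            u = u.substring(0, u.lastIndexOf('/'));
        }
        if (u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u;
    }

    // Returns true (and records the visit) only if the URL has not been seen within the time frame.
    public synchronized boolean shouldVisit(String url) {
        String key = normalize(url);
        long now = System.currentTimeMillis();
        Long last = lastVisited.get(key);
        if (last != null && now - last < revisitIntervalMillis) {
            return false;
        }
        lastVisited.put(key, now);
        return true;
    }
}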
The web crawler should start with a set of URLs.
A simple web crawler is pretty easy to implement. In Java, I know that there are a few libraries that would help you parse HTML pages.
- vodangkhoa February 27, 2007

Given a URL, get all the URLs that are on that page. Then it becomes a breadth-first search or depth-first search traversal, whichever you choose. DFS might consume too much memory in this case.
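Putting it together, a BFS-style crawl loop over the sketches above (seed URLs go into a queue, and the crawl is bounded by a page limit) could look roughly like this:

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class Crawler {
    private final PageFetcher fetcher = new PageFetcher();
    private final PageParser parser = new PageParser();
    private final UrlTracker tracker = new UrlTracker(24L * 60 * 60 * 1000); // one-day revisit window

    // Breadth-first crawl starting from the seed URLs, bounded by maxPages.
    public void crawl(List<String> seeds, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(seeds);
        int visited = 0;
        while (!frontier.isEmpty() && visited < maxPages) {
            String url = frontier.poll();
            if (!tracker.shouldVisit(url)) {
                continue; // already seen within the time frame
            }
            try {
                String html = fetcher.fetch(url);
                visited++;
                // Enqueue newly discovered links; BFS keeps the frontier in a queue instead of on the call stack.
                frontier.addAll(parser.extractLinks(html, url));
            } catch (Exception e) {
                // Skip pages that fail to download or parse.
            }
        }
    }
}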