Design a web crawler.
Use Lucene and Nutch and get done with it.
At least 3 components are required.
1. HTTP Request/Getting page.
2. HTML Parser
3. URL Tracker
The first component requests a given URL and either downloads the page to the machine or just keeps it in memory. (Downloading needs a storage design that makes the saved pages easy to retrieve.)
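A minimal sketch of the "keep it in memory" variant, using only Python's standard library. The function name `fetch`, the User-Agent string, and the 10 second timeout are my own assumptions for illustration, and a real crawler would also need retries, robots.txt handling and rate limiting.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Request `url` and return the body as text, held in memory only."""
    # The User-Agent value is a made-up placeholder, not a real crawler name.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-crawler/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Writing the returned text to disk instead would be where the storage design mentioned above comes in.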
HTML Parser - Removes the HTML tags and retains the text of interest (I needed only part of the page, based on some pattern) and the URLs in the current page. A more generic web crawler will also have to save other components such as images, sound, etc.
URL Tracker - Makes sure that no URL is visited twice within a set time frame. A simple mechanism is a hash table with a user-defined comparator function, since some URLs may still point to the exact same page, e.g. www.abc.com and www.abc.com/index.htm.
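Here is a rough sketch of such a parser built on Python's standard `html.parser` module. The names `LinkAndTextParser` and `parse_page` are hypothetical, and it keeps all visible text rather than matching a specific pattern, so treat it as a starting point only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects visible text and absolute link URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (text, links) for the given HTML string."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

Saving images or sound would mean also handling tags like `img` and their `src` attributes in `handle_starttag`.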
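One way to sketch the tracker in Python is a dict keyed by a normalized URL, which plays the role of the hash table with a custom comparator. The `normalize` rules below (lowercasing the host, treating `index.htm` and `index.html` as the directory root) are assumptions I picked to cover the example above, not a complete canonicalization, and `UrlTracker` is a hypothetical name.

```python
import time
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Map URLs that likely point to the same page onto one key."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    path = parts.path
    # Assumption: an index page is the same document as its directory.
    for index_name in ("index.htm", "index.html"):
        if path.endswith("/" + index_name):
            path = path[: -len(index_name)]
    if path == "":
        path = "/"
    # Drop the fragment; it never changes which page is fetched.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class UrlTracker:
    """Remembers when each normalized URL was last visited."""

    def __init__(self, revisit_after_seconds, clock=time.time):
        self.revisit_after = revisit_after_seconds
        self.clock = clock  # injectable so the time frame can be tested
        self.last_visit = {}

    def should_visit(self, url):
        """Return True (and record the visit) if the URL is due."""
        key = normalize(url)
        now = self.clock()
        last = self.last_visit.get(key)
        if last is not None and now - last < self.revisit_after:
            return False
        self.last_visit[key] = now
        return True
```

With this, `www.abc.com` and `www.abc.com/index.htm` share one entry, so the second is skipped until the time frame has passed.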
The web crawler should start with a set of URLs.
Think about maintaining a data structure for link counts (fan-in and fan-out). One might also be interested in displaying a graph (forest) of nodes (URLs) and edges (fan-in and fan-out).
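A small sketch of one possible data structure for this: two maps of sets, one per direction, so both counts are cheap to read. `LinkGraph` and its method names are made up for illustration.

```python
from collections import defaultdict


class LinkGraph:
    """Directed graph of pages: an edge means 'source links to target'."""

    def __init__(self):
        self.out_links = defaultdict(set)  # url -> pages it links to
        self.in_links = defaultdict(set)   # url -> pages linking to it

    def add_link(self, source, target):
        # Sets make repeated links between the same pair count once.
        self.out_links[source].add(target)
        self.in_links[target].add(source)

    def fan_out(self, url):
        return len(self.out_links.get(url, ()))

    def fan_in(self, url):
        return len(self.in_links.get(url, ()))

    def edges(self):
        """All (source, target) pairs, sorted, e.g. for drawing the graph."""
        return sorted(
            (source, target)
            for source, targets in self.out_links.items()
            for target in targets
        )
```

The crawler would call `add_link(page_url, link)` for every link the parser finds on a page.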
Hey dude, that info was useful.
A simple web crawler is pretty easy to implement. In Java, I know that there are a few libraries that would help you parse HTML pages.
Given a URL, get all the URLs that are in this page. Then it becomes a Breadth First Search or Depth First Search traversal, whichever you choose. DFS might consume too much memory in this case, since it can follow a chain of links arbitrarily deep.
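To make the BFS option concrete, here is a hedged Python sketch. The `get_links` parameter is a stand-in I introduced for "download the page and extract its URLs", and `max_pages` is an assumed safety limit so the crawl terminates.

```python
from collections import deque


def crawl_bfs(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first from the seeds; return the visit order.

    `get_links(url)` must return the URLs found on that page.
    """
    queue = deque()
    seen = set()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited_order = []
    while queue and len(visited_order) < max_pages:
        url = queue.popleft()  # FIFO order is what makes this BFS
        visited_order.append(url)
        for link in get_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited_order
```

Swapping `queue.popleft()` for `queue.pop()` turns the same loop into an iterative DFS, which makes it easy to compare the two.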