30 Day Risk-Free Guarantee:
100% money back if you're unsatisfied.
Book (308 Pages):
  • 150 Programming Interview Questions and Solutions
  • Five Proven Approaches to Solving Tough Algorithm Questions
  • Ten Mistakes Candidates Make -- And How to Avoid Them
  • Steps to Prepare for Behavioral and Technical Questions
  • Interview War Stories: A View from the Interviewer's Side
  • Book Preview and More Info

Video (One Hour):
  • Watch CareerCup's founder perform a totally unscripted technical interview of a candidate.
  • Overview of what interviewers look for in software engineering jobs.
  • Technical questions with white-boarding coding where the candidate does well - and struggles.
  • Interviewer reviews with what went well and poorly.
  • Video Preview and More Info

Resume Review (24 - 48hr)
  • Get your resume reviewed by a trained engineering interviewer. More Info
All Products / Services


Express Prep Package (Book)
$29.99 for e-book | $32.45 for paperback
Buy E-Book Buy on Amazon


Standard Prep Package (E-Book & Video)
Emailed Instantly | $39.99
Buy Standard

Elite Prep Package (E-Book, Video & Resume)
Emailed Instantly | $99.99
Buy Elite
 



Design a web crawler.

5


vodangkhoa on February 27, 2007 |Edit | Edit

A simple web crawler is pretty easy to implement. In Java, I know that there are a few libraries that would help you parse HTML pages.
Given an URL, get all the the URLs that are in this page. Then it becomes a Breadth First Search or Depth First Search traversals. Whatever you choose. DFS might consume too much memory in this case.

vipul on March 03, 2007 |Edit | Edit

use lucene and much and get done with it

mp on March 07, 2007 |Edit | Edit

There are atleast 3 components that are required.
1. HTTP Request/Getting page.
2. HTML Parser
3. URL Tracker

THe first component is to request a given URL and either download it to the machine or just keep it in memory. (Downloading will need design to store the web page for easy retreival)

HTML Parser - Removes the html tags and retains text of interest (I needed only part of the page based on some pattern) and URL s in the current page. A more generic webcrawler will have to save different components like image/sound etc

URL Tracker - URL tracker makes sure that no URL is visited twice within a set time frame( A simple mechanism is a hash table with a user-defined comparator function, some urls may still point to the exact same page eg www.abc.com and www.abc.com/index.htm)

The web crawler should start with a set of URLs.

peace on November 04, 2009 |Edit | Edit

Think about maintaining a data structure for mainting links (fan-in and fan-out) counts. One might also be interested in displaying a graph(forest) of node(URL) and edges(fan-in and fan-out).

Sriram on October 18, 2009 |Edit | Edit

Hey dude . . That info was useful

Add a Text Comment | Add Runnable Code
Name:
Comment:

Writing Code? Surround your code with {{{ and }}} to preserve whitespace.








Amazon (1033)Bloomberg LP (403)Qualcomm (117)Adobe (88)Goldman Sachs (78)NetApp (49)IBM (43)Morgan Stanley (33)CapitalIQ (30)Sophos (25)Achieve Internet (23)Electronic Arts (19)Motorola (18)Research In Motion (17)Flipkart (16)
Microsoft (867)Google (141)NVIDIA (98)Yahoo (82)Epic Systems (69)Expedia (47)VMWare Inc (43)Apple (32)Cisco Systems (28)Facebook (23)Infosys (22)Agilent Technologies (19)Sage Software (17)Deshaw Inc (16)FlexTrade (15)
More Companies »
Software Engineer / Dev... (1062)Financial Software Deve... (170)Testing / Quality Assur... (56)Analyst (35)Virus Researcher (25)Field Sales (15)Developer Program Engin... (9)Front-end Software Engi... (6)MyJoB (5)area sales manager (4)Assistant (3)Cabin crew (2)Accountant (1)personnel (1)Intern (1)
Software Engineer in Te... (288)Program Manager (65)Development Support Eng... (47)INTERN(MSIDC) (28)Web Developer (18)System Administrator (10)Consultant (10)Production Engineer (5)Associate Technology L2 (5)AcquireKnowledge (4)Product Security Engine... (3)Solutions Architect (2)Gamer (1)mts (1)Fresh graduate interview (0)
More Job Titles »
Algorithm (1073)Terminology & Trivia (294)C (166)Object Oriented Design (159)Java (121)Testing (114)Arrays (101)Operating System (78)Database (70)Linked List (62)String Manipulation (56)Networking / Web / Inte... (44)Threads (36)Linux Kernel (33)PHP (22)
Coding (511)C++ (204)Behavioral (159)Data Structures (155)Experience (116)Brain Teasers (111)Computer Architecture &... (79)General Questions and C... (73)Trees and Graphs (69)Math & Computation (57)Application / UI Design (45)Ideas (38)System Design (35)Puzzles (30)Bit Manipulation (20)
More Topics »
CareerCup Official Interview Book Daily Questions Requests for Help Mock Interviews Video for Cracking the Coding Interview Job Placement Service CareerCup Blog
My Profile Edit Profile & Email Settings Sign Out