Design a web crawler to dump a

Amazon Interview Question for Software Engineer Interns

0

of 0 votes

4
Answers
Design a web crawler to dump all the pages of a given website (URL) onto disk. So basically it saves pages which is related to the website (for instance dump all pages of aws.amazon.com) and do not crawl the links outside the website

I coded it in python and then they asked what is the internal structure of dict in python and why or why not it is fast
- vik September 06, 2013 in United States | Report Duplicate | Flag | PURGE
Amazon Software Engineer Intern Coding Data Structures Python

Email me when people comment.

An error occurred in subscribing you.

Country: United States
Interview Type: In-Person

More Questions from This Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 2 vote

Hi, I'm not sure if this is Exactly the answer to the problem you posted, but this is would be my approach to the problem.
Any inputs or suggestions are most welcome!

import lxml
from lxml import etree
import urllib2
class webCrawler(object):
	self.urlStack = set() #Set to avoid duplicates in the list
	def __init__(self,baseURL):
		"""The init function(can do database connection setup here)"""
		self.urlStack.add(baseURL)

	def checkURL(self,url):
		"""Check if the url already exists in the database and return True or False"""
	def db_call(self,url,html):
		"""Insert into the db/disk the URL and the html content of the URL"""
	
	def crawl(self,url):
		try:
			html = urllib2.urlopen(url).read()
		except:
			print "Unable to retrieve HTML"
			if len(self.urlStack)>0:
				self.crawl(self.urlStack.pop())
		href_xpath = "//a/@href"
		self.db_call(url,html)
		tree = lxml.etree.HTML(html)
		hrefs = tree.xpath(href_xpath)
		for each in hrefs:
			if each.find("amazon.com")<=0 or (self.checkURL(url)):
				hrefs.remove(each)
		for each in hrefs:
			self.urlStack.add(each)
		if len(self.urlStack)>0:
			self.crawl(self.urlStack.pop())

- Ankit September 11, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 1 vote

I haven't tested the your code but logic looks correct, there is one glitch that might cause problems in the interview.

The problem is with using a parser and its time complexity. You dont need to parse the whole html file when all you need are the attributes begining with href. A simple regex should suffice for it.

- Rage October 26, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

One more thing to consider: the existing URLs set assumes pages never change. In reality pages may come with cache control metadata (in the form of response headers) that tell you how soon to expire your cached version, whether you should try to invalidate your cache on each access, etc.

- estebistec June 04, 2015 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

This is mine. Has depth limit so that it won't crawl all of Amazon's pages.

from collections import deque
import urllib2
import re

class crawler: 
    def __init__(self):
        self.visited = set()

    def crawl(self, url, max_depth):
        queue = deque()
        queue.append((url, 0))

        while queue:
            url, depth = queue.popleft()
            if depth < max_depth:
                try:
                    source = urllib2.urlopen(url).read()
                except:
                    continue
                print url
                href_pos = [m.end() for m in re.finditer(r'href=', source)]
                for pos in href_pos:
                    quote = source[pos]
                    end_pos = source.find(quote, pos + 1)
                    href = source[pos + 1 : end_pos]
                    if href not in self.visited and href.find('amazon.com') >= 0:
                        self.visited.add(href)
                        queue.append((href, depth + 1))
        
        return self.visited

- Bogdan September 30, 2013 | Flag Reply

CareerCup

Amazon Interview Question for Software Engineer Interns

Books

Videos

Resume Review

Mock Interviews