Amazon Interview Question
Software Engineer Intern
Country: United States
Interview Type: In-Person
I haven't tested your code, but the logic looks correct; there is one glitch that might cause problems in the interview.
The problem is with using a parser and its time complexity. You don't need to parse the whole HTML file when all you need are the href attributes. A simple regex should suffice.
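To illustrate, here is a minimal sketch of extracting href values with a regex instead of a full parser. The sample HTML and the pattern are assumptions for illustration; a regex like this handles common quoted attributes but is not a fully general HTML parser.

```python
import re

# Hypothetical sample page, just for demonstration.
html = '<a href="https://www.amazon.com/dp/1">x</a> <a href=\'/gp/help\'>y</a>'

# Capture the opening quote character, then everything up to its matching close.
links = re.findall(r'''href=(["'])(.*?)\1''', html)
urls = [u for _, u in links]
print(urls)
```

The backreference `\1` lets the same pattern handle both single- and double-quoted attributes.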
One more thing to consider: the existing URLs set assumes pages never change. In reality pages may come with cache control metadata (in the form of response headers) that tell you how soon to expire your cached version, whether you should try to invalidate your cache on each access, etc.
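As a rough sketch of that idea, a crawler could derive a time-to-live from the Cache-Control response header and store an expiry timestamp alongside each visited URL. The `cache_ttl` helper below is hypothetical; the directive names follow the HTTP caching spec, but defaulting to a TTL of 0 when no directive is present is an assumption of this sketch.

```python
import re
import time

def cache_ttl(headers):
    """Return the number of seconds a fetched page may be reused,
    based on common Cache-Control directives."""
    cc = headers.get('Cache-Control', '')
    # no-store / no-cache: do not reuse without revalidating.
    if 'no-store' in cc or 'no-cache' in cc:
        return 0
    m = re.search(r'max-age=(\d+)', cc)
    return int(m.group(1)) if m else 0

# A visited entry could then record when the cached copy expires:
expires_at = time.time() + cache_ttl({'Cache-Control': 'max-age=300'})
```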
This is mine. It has a depth limit so that it won't crawl all of Amazon's pages.
from collections import deque
import urllib2
import re

class crawler:
    def __init__(self):
        self.visited = set()

    def crawl(self, url, max_depth):
        queue = deque()
        queue.append((url, 0))
        while queue:
            url, depth = queue.popleft()
            if depth < max_depth:
                try:
                    source = urllib2.urlopen(url).read()
                except urllib2.URLError:
                    continue
                print url
                # Positions just past each 'href=' occurrence.
                href_pos = [m.end() for m in re.finditer(r'href=', source)]
                for pos in href_pos:
                    quote = source[pos]
                    if quote not in '"\'':
                        continue  # unquoted or malformed attribute; skip it
                    end_pos = source.find(quote, pos + 1)
                    href = source[pos + 1 : end_pos]
                    if href not in self.visited and href.find('amazon.com') >= 0:
                        self.visited.add(href)
                        queue.append((href, depth + 1))
        return self.visited
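For comparison, here is a hedged Python 3 sketch of the same depth-limited BFS crawl (`urllib2` is Python-2-only). The `fetch` callable is an assumption introduced here so the traversal logic can be exercised without network access; in practice it could wrap `urllib.request.urlopen(url).read()`.

```python
from collections import deque
import re

def crawl(start_url, max_depth, fetch):
    """Breadth-first crawl with a depth limit, mirroring the queue-based
    approach above. `fetch(url)` returns the page source as a string and
    is injected so the function is testable offline."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        try:
            source = fetch(url)
        except Exception:
            continue  # unreachable page; move on
        # Same quoted-href regex idea as discussed earlier in the thread.
        for _, href in re.findall(r'''href=(["'])(.*?)\1''', source):
            if href not in visited and 'amazon.com' in href:
                visited.add(href)
                queue.append((href, depth + 1))
    return visited
```

Injecting the fetcher also makes it easy to swap in a throttled or cached HTTP client later without touching the traversal code.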
Hi, I'm not sure if this is exactly the answer to the problem you posted, but this would be my approach.
Any inputs or suggestions are most welcome!
- Ankit September 11, 2013