Amazon Interview Question for Software Engineer / Developers


Country: United States
Interview Type: In-Person





I think we can use a simple linked list to traverse the URLs, appending new URLs at the end whenever we find them. There is no requirement on the order in which URLs are parsed, so a full graph search may be unnecessary.

As for bottlenecks:

1) Method used for parsing: using a simple regex to pick out the <a> tags in the page may be more efficient than an external HTML-parsing library (see the sketch after this list).

2) Multi-threading: given the large number of URLs, this helps; with n threads the time taken is roughly T/n in the ideal case, where T is the time taken with a single thread.

3) Outsourcing and distributed computing: if the bandwidth or processing power of our server is low, we may want to hand the URLs off to remote machines with superior resources.

4) Language used for the implementation: because of differences in design goals, languages differ in how fast they execute different tasks; a more "network-oriented" language may be more suitable.
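
A minimal sketch of points 1 and 2, assuming Python with only the standard library (urllib for fetching, re for the <a href> regex, and a fixed-size thread pool); the regex, page limit, and thread count are illustrative choices, not a full HTML parser:

import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

HREF_RE = re.compile(r'<a\s+[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)

def fetch_links(url):
    # Download one page and return the URLs found in its <a> tags.
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except Exception:
        return []
    return HREF_RE.findall(html)

def crawl(seeds, max_pages=100, workers=8):
    seen = set(seeds)       # every URL ever appended to the work list
    work = list(seeds)      # the simple "linked list" of URLs, appended at the end
    i = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while i < len(work) and len(work) < max_pages:
            batch = work[i:i + workers]
            i += len(batch)
            for links in pool.map(fetch_links, batch):   # n threads fetch in parallel
                for link in links:
                    if link.startswith("http") and link not in seen:
                        seen.add(link)
                        work.append(link)
    return work

# Example (hypothetical seed): crawl(["https://example.com"])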

- coolraj334 December 03, 2014 | Flag Reply

Consider the web as a graph of related URLs; all you have to do is traverse this graph, which can be done with a simple BFS or DFS.
For parsing the web pages, any parser will do; most of them use an FSM to extract the different components of a page.
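
A minimal BFS sketch over the URL graph, assuming a hypothetical extract_urls(url) helper that returns the outgoing links of a page (e.g. the regex fetcher sketched above):

from collections import deque

def bfs_crawl(seed, extract_urls, limit=1000):
    # Nodes are URLs, edges are links; FIFO queue gives breadth-first order.
    visited = {seed}
    queue = deque([seed])
    while queue and len(visited) < limit:
        url = queue.popleft()
        for next_url in extract_urls(url):
            if next_url not in visited:
                visited.add(next_url)
                queue.append(next_url)
    return visited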

- mBk October 28, 2014 | Flag Reply

One last thing is to generate keywords and store them in some data structure.

You can use different data structures based on your requirements: tries, a hash table, or any balanced tree.
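
A small sketch of the keyword store, assuming a hash table (a Python dict) mapping each keyword to the URLs that contain it; a trie would be the drop-in replacement if prefix lookups are needed:

from collections import defaultdict

index = defaultdict(set)            # keyword -> set of URLs containing it

def add_document(url, text):
    # Naive tokenization; a real crawler would normalize and filter terms.
    for word in text.lower().split():
        index[word].add(url)

def lookup(keyword):
    return index.get(keyword.lower(), set())

add_document("http://example.com", "Amazon interview web crawler design")
print(lookup("crawler"))            # {'http://example.com'}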

- mBk October 28, 2014 | Flag

The web crawler system needs to be distributed to support the workload. There should be a link-discovery component that feeds URLs to a queue on the network (e.g. Amazon SQS), and multiple machines with multiple worker threads to download and parse. One of the challenges is to minimize multiple worker threads writing to the same portion of the index at the same time (read/write blocking is expensive). Another challenge is to manage and optimize an enormous index.
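
A rough sketch of one such worker, assuming a generic queue client with receive/send/delete calls standing in for SQS (the method names are placeholders, not a specific vendor API) and a shared index object with its own lock:

import threading

def worker(queue, index, fetch_and_parse):
    # One download/parse worker pulling URL messages from the shared network queue.
    while True:
        msg = queue.receive()                    # placeholder: blocks until a URL message arrives
        if msg is None:
            break                                # queue drained / shutdown signal
        url = msg.body
        new_urls, terms = fetch_and_parse(url)   # download + parse happen off the queue
        with index.lock:                         # minimize concurrent writes to the same index region
            index.add(url, terms)
        for u in new_urls:
            queue.send(u)                        # feed newly discovered links back to the queue
        queue.delete(msg)                        # acknowledge only after indexing succeeds

def run_workers(queue, index, fetch_and_parse, n_threads=16):
    threads = [threading.Thread(target=worker, args=(queue, index, fetch_and_parse))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()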

- Anonymous November 08, 2014 | Flag Reply

MapReduce ?

Pseudocode:

step.mp (extract URLs from the crawled contents and count them)
INPUT  : url \t contents
OUTPUT : url \t n

map (line:String)
	// apply the link regex to the contents column
	url = regex =~ line.split("\t")[1]
	return url \t 1 \n

reduce (vector:tuple(String, Integer))
	// sum the 1s emitted for each distinct URL
	count = 0
	url = vector[0].0
	foreach tuple in vector
		count++
	return url \t count \n


crawl.mp (download every discovered URL)
INPUT  : url \t n
OUTPUT : url \t contents

map (line:String)
	contents = URL.GET(line.split("\t")[0])
	return line \t contents

reduce (vector:tuple(String, String))
	// identity reduce: pass the (url, contents) pair through
	return vector[0].0 \t vector[0].1


main :

outputCrawl = initialFile

0.upto(n)
	outputStep = step(outputCrawl)
	outputCrawl = crawl(outputStep)
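
A hedged, runnable rendering of the same two passes, simulating the map/reduce rounds in plain Python over in-memory (url, contents) pairs; fetch(url) is a stand-in for URL.GET above:

import re
from collections import Counter

LINK_RE = re.compile(r'href=["\'](http[^"\']+)["\']')

def step(crawled):
    # Map: extract URLs from each page's contents; Reduce: count occurrences per URL.
    counts = Counter()
    for url, contents in crawled.items():
        for found in LINK_RE.findall(contents):
            counts[found] += 1
    return counts                      # url -> n

def crawl(counts, fetch):
    # Map: download each discovered URL; Reduce is the identity.
    return {url: fetch(url) for url in counts}

def main(initial, fetch, rounds=3):
    crawled = initial                  # dict: seed url -> contents
    for _ in range(rounds):
        crawled = crawl(step(crawled), fetch)
    return crawled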

- Anonymous December 11, 2014 | Flag Reply

It seems they just want to crawl the internet to discover URLs.

I don't buy the idea of using a graph to connect the URLs, since it is not part of the requirement, but it is useful if you want to keep the relationships between URLs.

I do not see the point of doing deduplication in memory on the threads that fetch pages; it can be done on a separate thread, since connection speed is more limiting than CPU speed, and a fetching thread's time is better spent on non-blocking requests than on waiting for deduplication.

The best approach is to use some data store as a queue: every time you find a new URL, hash or index it and check for duplicates, and stop if it is a duplicate, because some other thread or machine (or this one) has already crawled it.


Here are the bottlenecks and solutions for them:

Memory: distribute across different machines.

CPU: distribute across different machines.

Disk: distribute the queues or databases, partitioning the data with an evenly distributed hash algorithm.

Connection speed of the crawler: pre-compute as much as possible before issuing the request, e.g. pre-read the queue until memory is full and pre-create the required request objects.

Connection speed of the sites: use a thread pool to handle slow connections so that one slow connection does not block the rest.
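
A short sketch of that last point, assuming Python's concurrent.futures thread pool with a per-request timeout so a slow site cannot stall the other connections:

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

def fetch(url, timeout=5):
    try:
        return url, urlopen(url, timeout=timeout).read()
    except Exception:
        return url, None                      # slow or broken site: give up and move on

def fetch_all(urls, workers=32):
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for f in as_completed(futures):       # fast sites come back first
            url, body = f.result()
            results[url] = body
    return results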

- Nelson Perez December 17, 2014 | Flag Reply

Here is my design:
1. "Download worker": input: URL; output: URL & document.
2. "URL extract worker": input: document; output: URLs.
3. "URL feed": accepts URLs and dispatches them to the workers; you can use a message queue or any similar technology.
4. "URL judgement": after the "URL extract worker" parses the document and outputs URLs, this component looks them up in a DB/memcache/... and decides whether to add all or part of the URLs to the "URL feed".
5. URL key-value storage: the value can include the URL's expiration time, the downloaded document's location, size, ...; we can use memcache, a DB, ... (see the sketch below).
6. Background thread: it checks the URL key-value storage periodically; if a URL has expired, it sends the URL back to the "URL feed" to refresh the document and its URLs.

In this design I don't use DFS or any complex algorithm, but it easily supports millions of URLs; all communication is via asynchronous messages.
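
A small sketch of the "URL judgement" and key-value storage pieces (items 4-6), assuming an in-memory dict standing in for memcache/DB and an assumed one-week refresh interval:

import time

TTL_SECONDS = 7 * 24 * 3600           # assumed refresh interval

store = {}                             # url -> {"expires": ts, "location": ..., "size": ...}

def should_feed(url):
    # URL judgement: feed a URL only if it is unseen or its record has expired.
    record = store.get(url)
    return record is None or record["expires"] < time.time()

def record_download(url, location, size):
    store[url] = {"expires": time.time() + TTL_SECONDS,
                  "location": location, "size": size}

def expired_urls():
    # Background sweep: URLs whose documents should be re-fetched via the "URL feed".
    now = time.time()
    return [u for u, rec in store.items() if rec["expires"] < now]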

- licanhua January 29, 2015 | Flag Reply

interface IParser
	// downloads the url and parses it
	IEnumerable<string> Parse(string url)

interface IRepository
	void SaveToDB(string url)

interface IWebCrawler
	void PerformAction(IEnumerable<string> urls)

WebCrawler : IWebCrawler
	IParser m_parser
	IRepository m_repository
	HashSet<string> m_visited
	Queue<string> m_urls

	public WebCrawler(IParser parser, IRepository repository, HashSet<string> visited) {}

	public void PerformAction(IEnumerable<string> urls)
	{
		// place the urls into the queue

		// call PerformActionImpl
	}

	private void PerformActionImpl()
	{
		// go over the queue

		// if the url is already in m_visited, continue to the next one

		// if not, parse the url and place the resulting urls into the queue

		// if there is also a requirement to do something with the result (like save it), use the repository
	}


WebCrawlerManager

	public int m_threadAmount;
	public HashSet<string> m_visitedUrl;
	public IEnumerable<string> m_urls;

	public WebCrawlerManager(IEnumerable<string> urls)

	Run()
	{
		// create threads according to the thread count passed to the constructor

		// split the urls into m_threadAmount IEnumerable<string> chunks

		// run each WebCrawler with its own IEnumerable<string>
	}

- Feldman.Max March 17, 2015 | Flag Reply

for(int i = 0 ; i  <10 ; i++

- Anonymous December 10, 2014 | Flag Reply

LOL..

- Anonymous December 10, 2014 | Flag

