Amazon Interview Question
SDE-2 | Country: India
Hi,
This seems to be the complete answer, in my opinion.
By using hashing to split the original input into chunks, it is guaranteed that all the duplicates fit into the same chunk, thus any URL that is unique within its chunk is also globally unique.
Forgot to mention this solution is O(N) time complexity, which is optimal in this case.
With your solution you can figure out the unique URLs, but how do you find the "first" unique URL?
As mentioned in the solution above, to find the first unique URL in each chunk file, you build a hashtable which stores the occurrence count and the first line in which each URL occurs.
So when reading a URL from the chunk file, together with its original line number, you update the hashtable as follows:
If the URL is not in the hashtable, you insert it, with a count of 1 and the line number read from the chunk file. If the URL was in the hashtable, you just increase the count.
After finishing reading the chunk file, and building the hashtable, we scan the hashtable to find the URL with a count of 1 and the minimum line number.
If you are using Java, LinkedHashMap is good to use here: while iterating over the map, the first node where you find a count of 1 gives the first unique line number.
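A minimal sketch of this per-chunk counting, assuming the chunk fits in memory. Python dicts preserve insertion order, playing the role of Java's LinkedHashMap:

```python
def first_unique_line(urls):
    """Return the line number of the first URL that occurs exactly once."""
    counts = {}  # url -> [occurrence count, first line number]
    for line_no, url in enumerate(urls):
        if url in counts:
            counts[url][0] += 1
        else:
            counts[url] = [1, line_no]
    # Insertion order means the first entry with count 1 is the first unique URL.
    for url, (count, line_no) in counts.items():
        if count == 1:
            return line_no
    return -1
```

For example, `first_unique_line(["a", "b", "a", "c"])` returns 1, the line of "b".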
If too many records share the same hash key, you'll have a problem: you may not have enough memory to hold the whole hashtable.
Assuming each URL has 20 characters, the file would have 800 GB.
Divide it into chunks of 8 GB (taking care not to cut a URL halfway through) and create one hash table for each on a first run. These hash tables need to be stored to files, naturally.
On the second run, for each URL, look for collisions in each of the 100 hash tables. The first URL with no collisions is the first unique.
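The first-run chunking might look like the sketch below, with in-memory lists standing in for the 8 GB files (here `num_chunks` would be 100):

```python
def partition_by_hash(urls, num_chunks):
    """Split (line_no, url) pairs into chunks by URL hash.

    All duplicates of a URL hash to the same chunk, so each chunk can be
    examined for uniqueness independently.
    """
    chunks = [[] for _ in range(num_chunks)]
    for line_no, url in enumerate(urls):
        chunks[hash(url) % num_chunks].append((line_no, url))
    return chunks
```

On disk, appending to one file per chunk replaces the list `append`.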
Other than if multithreading occurs, what benefit does this have over just having 1 large hashtable for the whole thing?
Well, collisions in a hashtable can occur even if two URLs are different, no?
Think of the size of an 8 GB string hashtable.
The problem is that to "look for collisions in each of the 100 hash tables" you need to either use a distributed system and communicate between the nodes, or load/dump chunks to the hard drive. In either case the cost of communication or disk r/w would be too great, making this impractical.
Skorpius: I divided it into multiple hash tables because one table wouldn't fit into memory. Technically they might be the same thing, since you can read only the desired chunk of the big table file.
Bharat: check whether strings whose hashes collided are indeed equal or not, as every hash table in history does.
gameboy1024: I agree that loading chunks is demanding, but I don't see any alternative. You still need to read the original file one time in the first place.
Instead of using a hash table, we can use a compact data structure for filtering, like a Bloom filter. A Bloom filter is very space efficient and can be stored completely in memory for this kind of data set. You can also perform membership queries in constant time.
It can also be parallelized for heavy workloads, and we don't have to incur unnecessary disk IOs or inter-node communication for distributed systems.
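A minimal illustration of the idea, not production code: the bit-array size and hash count are arbitrary, and repeated salted SHA-256 stands in for whatever hash family a real system would use. Note a Bloom filter answers membership only (with possible false positives); counting occurrences would need a counting variant.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter backed by a Python big integer as the bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k bit positions by salting the hash input with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits >> p & 1 for p in self._positions(item))
```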
You can use distributed hashing to tackle the problem. All the hashes will be stored and maintained in different hash tables, with consistent hashing, on multiple machines (in memory). There will be an IO cost to locate and load a hash value, but it should still be better than storing the hashes in the file system. Memcached works on the same principle. So you have a unique hash key generated for each unique URL; when you do a cache_get($url), if the count is 1 then you have a unique URL.
However, for this method to work you have to iterate through all the URLs twice.
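The two-pass scheme in miniature, with a plain dict standing in for the distributed cache (the cache_get/cache_set calls in the original):

```python
def first_unique(urls):
    """Two passes: count everything, then return the first URL with count 1."""
    counts = {}
    for url in urls:                 # pass 1: count every URL
        counts[url] = counts.get(url, 0) + 1
    for url in urls:                 # pass 2: first URL whose count is 1
        if counts[url] == 1:
            return url
    return None
```

The second pass walks the input in original order, which is what makes "first" well defined.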
How do you guarantee that the hash tables are in order of the input URLs? I mean, if you hash a string, the index can be anywhere in the hash table. So, how do you guarantee that the first url hash that has no duplicates is the first URL in the input file? What am I missing? Please clarify.
I don't see the point in hashing, which is inherently unreliable for keys that are not of fixed size, such as URLs.
The requirement is to find the very first non-repeating URL and stop any further processing.
Do you see the need to map each URL in the data set and update each URL's count?
The key sentence is: "On the second run, for each URL, look for collisions in each of the 100 hash table".
This means that for each URL you have to read through the whole 100 billion URLs (summarized in 100 hashtables) to see if the same URL is repeated later. So this solution is O(N^2), and since N = 100 billion (10^11), N^2 = 10 sextillion (10^22). That is completely unacceptable.
Could you elaborate on what you mean by "unique"?
For example, I can look at the first URL in the list and see that there are no duplicates thus far and it will be considered unique...
Thinking about all the books I buy on Amazon, I knew that each book has a unique ISBN in the URL. Likewise, each non-book product has a unique product code in the URL. The format is something like title/uniqueid/.... So, I would ask if I can use this info. If I could, then I would extract the unique id and create a set of (unique id, file offset, num_entries). When the entire URL file is processed, I would get the earliest file offset with num_entries == 1.
Edit:
Thinking a bit more about the set, I would think we need two sets: one indexed by unique id and storing num_entries, and the other indexed by file offset and storing unique id. If a btree were used to implement the set, then walking file offsets and checking for unique ids with num_entries == 1 would give the earliest unique URL.
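The two-index idea might look like this sketch, where two dicts plus a sorted scan stand in for the btree, and extraction of the unique id from each URL is assumed to have happened upstream:

```python
def first_unique_by_offset(records):
    """records: iterable of (file_offset, unique_id) pairs (illustrative names)."""
    num_entries = {}   # unique id -> occurrence count
    id_by_offset = {}  # file offset -> unique id
    for offset, uid in records:
        num_entries[uid] = num_entries.get(uid, 0) + 1
        id_by_offset[offset] = uid
    # A btree would yield offsets in order for free; sorted() emulates that walk.
    for offset in sorted(id_by_offset):
        if num_entries[id_by_offset[offset]] == 1:
            return offset
    return None
```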
Hello,
You can use a hash table (unordered_map) for unique URLs found so far and another hash table (unordered_set) for those which were repeated.
The table for unique URLs would translate from URL value to its index in the file.
Then you would need to do a run on the whole data set to fill the two tables and another run on the unique URL table to extract the entry that has the smallest index.
Space complexity would be O(N), each entry having (AVG URL length + sizeof(int)) bytes.
Time complexity would be 2 * O(N).
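A sketch of the two-table scheme, with Python's dict and set standing in for unordered_map and unordered_set:

```python
def first_unique_index(urls):
    """Return the index of the first URL that occurs exactly once."""
    unique = {}       # url -> index of first occurrence (the unordered_map)
    repeated = set()  # urls seen more than once (the unordered_set)
    for i, url in enumerate(urls):
        if url in repeated:
            continue
        if url in unique:
            # Second sighting: move it from the unique table to the repeated set.
            del unique[url]
            repeated.add(url)
        else:
            unique[url] = i
    # Second run over the unique table: extract the smallest stored index.
    return min(unique.values()) if unique else -1
```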
Distributed approach: by using a distributed storage, you don't have to hold the whole file into memory, nor the entire hash tables.
Of course, the IO / Network overhead would slow down the process considerably.
If the array grows too big, dump it to a file.
<?php
// Radix-style recursion: group URLs by the character at position $pos,
// recurse into each group, and keep the smallest original index whose
// URL turns out to be unique. Returns -1 if the group has no unique URL.
function findFirstUnique($urls, $pos = 0) {
    if (count($urls) == 1) {
        return key($urls); // lone URL in its group: it is unique
    }
    $groups = [];
    foreach ($urls as $i => $url) {
        // URLs shorter than $pos + 1 characters go into the 'empty' group
        $key = isset($url[$pos]) ? $url[$pos] : 'empty';
        if (!isset($groups[$key])) {
            $groups[$key] = [];
        }
        $groups[$key][$i] = $url;
    }
    $min = -1;
    if (isset($groups['empty'])) {
        if (count($groups['empty']) == 1) {
            $min = key($groups['empty']);
        } else {
            // several identical URLs ended here: none of them is unique
        }
        unset($groups['empty']);
    }
    foreach ($groups as $key => $group) {
        $index = findFirstUnique($group, $pos + 1);
        // ignore groups that contain no unique URL ($index == -1)
        if ($index >= 0 && ($index < $min || $min < 0)) {
            $min = $index;
        }
    }
    return $min;
}
Why wouldn't you do this in one pass? If you're reading the file from top to bottom (even line by line) and you are "bucketing" your URLs in a Map, before inserting you check whether it already exists. If a collision is detected (
myMap.contains(newURL)
returns true), you have found a duplicate. No need to go through the whole file and then all the resulting hash maps.
Storing all 100 billion URLs in RAM (if all are unique) would require around 2 TB of RAM if I assume that an average URL is 20 chars long (which is a very rough underestimate). Even if you only store a clever hash value of the URLs (which needs to be at least a 64-bit value, i.e. 8 bytes), we're talking about hundreds of GBs of RAM.
Find the byte representation of each URL, then perform an AND operation on each subsequent URL. First occurrence of a FALSE means the second URL in the comparison (assuming we compare prev AND next URLs) is the Unique URL we are looking for.
We can partition the file into an external memory/storage for retrieval since this is most likely a question that isn't meant to have the entire file stored on local memory.
This can be done in O(n) time.
Assuming all urls are identical except for one.
From the limited information that's given, and given the fact that you are only looking for the first URL that is unique, the best way I can think of is to first sort the file.
Once you have a sorted file you just scan the file linearly to find the first one that has no duplicate next to it on the list.
before:
google.com
apple.com
careercup.com
bloomberg.com
google.com
apple.com
bloomberg.com
after:
apple.com
apple.com
bloomberg.com
bloomberg.com
careercup.com
google.com
google.com
You can easily pick out the unique one.
The initial cost of sorting is a major overhead for a file that size, for which you can use some of the techniques discussed in the comments above. Merge sort or quicksort should be fine.
Note*
Radix sort or Bucket sort can save a ton of computation if you can convert the URL to IP address.
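The sort-then-scan step in miniature (an external merge sort would replace `sorted()` at this scale). Note it returns a URL with no duplicate anywhere in the file, without regard to original position:

```python
def any_unique_after_sort(urls):
    """Sort, then return the first entry with no equal neighbor."""
    s = sorted(urls)
    n = len(s)
    for i, url in enumerate(s):
        # Duplicates are adjacent after sorting, so it suffices to
        # compare each entry with its immediate neighbors.
        if (i == 0 or s[i - 1] != url) and (i == n - 1 or s[i + 1] != url):
            return url
    return None
```

On the example above it returns "careercup.com".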
I think sorting the file will disturb the order and the first unique url found won't be the same as the first unique one in the actual file.
Since problem specifies large dataset, it must be divided into chunks that fit in memory, and then each chunk is processed, and the results are combined to get the final answer.
- gen-y-s January 21, 2015

Specifically, in this case, first we divide the file into smaller files using a uniform hashing function that produces N/M chunks, each of size approximately M (M = size of main memory). This is done by reading each URL from the input file, hashing it to find which chunk file it should be written into, and then appending it to that file together with its original line number.
Then we read each file into memory and build a hash table for the URLs, which is used to count how many occurrences there are of each URL, and to store the line number of the URL. After the hash table is built, we scan it to find the entry with the lowest line number among the entries which have a count of 1. This is the first unique URL in this chunk file.
We repeat the step above for all chunk files, and after processing each file we compare the line number of the URL we found against the minimum line number found so far.
After processing all the chunk files, we will have the URL which is the first unique URL in all of the input.
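The whole procedure can be sketched in miniature, with in-memory lists standing in for the chunk files and Python's built-in hash standing in for the uniform hashing function:

```python
def first_unique_url(urls, num_chunks=4):
    """Hash-partition, count per chunk, and return the globally first unique URL."""
    # Phase 1: partition (line_no, url) pairs so duplicates share a chunk.
    chunks = [[] for _ in range(num_chunks)]
    for line_no, url in enumerate(urls):
        chunks[hash(url) % num_chunks].append((line_no, url))
    # Phase 2: per chunk, count occurrences and remember each URL's first line;
    # keep the unique URL with the smallest line number across all chunks.
    best = None  # (line_no, url) of best candidate so far
    for chunk in chunks:
        table = {}
        for line_no, url in chunk:
            count, first = table.get(url, (0, line_no))
            table[url] = (count + 1, first)
        for url, (count, first) in table.items():
            if count == 1 and (best is None or first < best[0]):
                best = (first, url)
    return best[1] if best else None
```

For example, `first_unique_url(["a", "b", "a", "c", "b"])` returns "c".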