Amazon Interview Question
SDE-2 | Country: India
Hi,
This seems to be the complete answer, in my opinion.
By using hashing to split the original input into chunks, it is guaranteed that all the duplicates fit into the same chunk, thus any URL that is unique within its chunk is also globally unique.
Forgot to mention this solution is O(N) time complexity, which is optimal in this case.
With your solution you can figure out the unique URLs, but how do you find the "first" unique URL?
As mentioned in the solution above, to find the first unique URL in each chunk file, you build a hashtable which stores the occurrence count and the first line in which each URL occurs.
So when reading a URL from the chunk file, together with its original line number, you update the hashtable as follows:
If the URL is not in the hashtable, you insert it, with a count of 1 and the line number read from the chunk file. If the URL was in the hashtable, you just increase the count.
After finishing reading the chunk file, and building the hashtable, we scan the hashtable to find the URL with a count of 1 and the minimum line number.
If you are using Java, LinkedHashMap is good to use here: while iterating over the map, the first node where you find a count of 1 gives the first unique line number.
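A minimal sketch of this per-chunk counting, assuming the chunk fits in memory. Python dicts preserve insertion order, playing the role of Java's LinkedHashMap:

```python
def first_unique_line(urls):
    """Return the line number of the first URL that occurs exactly once."""
    counts = {}  # url -> [occurrence count, first line number]
    for line_no, url in enumerate(urls):
        if url in counts:
            counts[url][0] += 1
        else:
            counts[url] = [1, line_no]
    # Insertion order means the first entry with count 1 is the first unique URL.
    for url, (count, line_no) in counts.items():
        if count == 1:
            return line_no
    return -1
```

For example, `first_unique_line(["a", "b", "a", "c"])` returns 1, the line of "b".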
If too many records share the same hash key, you'll have a problem: you may not have enough memory to hold the whole hashtable.
Assuming each URL has 20 characters, the file would have 800 GB.
Divide it into chunks of 8 GB (taking care not to cut a URL halfway through) and create one hash table for each on a first run. These hash tables need to be stored to files, naturally.
On the second run, for each URL, look for collisions in each of the 100 hash tables. The first URL with no collisions is the first unique.
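The first-run chunking might look like the sketch below, with in-memory lists standing in for the 8 GB files (here `num_chunks` would be 100):

```python
def partition_by_hash(urls, num_chunks):
    """Split (line_no, url) pairs into chunks by URL hash.

    All duplicates of a URL hash to the same chunk, so each chunk can be
    examined for uniqueness independently.
    """
    chunks = [[] for _ in range(num_chunks)]
    for line_no, url in enumerate(urls):
        chunks[hash(url) % num_chunks].append((line_no, url))
    return chunks
```

On disk, appending to one file per chunk replaces the list `append`.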
Other than if multithreading occurs, what benefit does this have over just having 1 large hashtable for the whole thing?
Well, collisions in a hashtable can occur even if two URLs are different, no?
Think of the size of an 8 GB string hashtable.
The problem is that to "look for collisions in each of the 100 hash tables" you need to either use a distributed system and communicate between the nodes, or load/dump chunks to the hard drive. In either case the cost of communication or disk r/w would be too great, making this impractical.
Skorpius: I divided it into multiple hash tables because one table wouldn't fit into memory. Technically they might be the same thing, since you can read only the desired chunk of the big table file.
Bharat: check whether strings whose hashes collided are indeed equal or not, as every hash table in history does.
gameboy1024: I agree that loading chunks is demanding, but I don't see any alternative. You still need to read the original file one time in the first place.
Instead of using a hash table, we can use a compact data structure for filtering, like a Bloom filter. A Bloom filter is very space efficient and can be stored completely in memory for this kind of data set. You can also perform membership queries in constant time.
It can also be parallelized for heavy workloads, and we don't have to incur unnecessary disk IOs or inter-node communication for distributed systems.
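A minimal illustration of the idea, not production code: the bit-array size and hash count are arbitrary, and repeated salted SHA-256 stands in for whatever hash family a real system would use. Note a Bloom filter answers membership only (with possible false positives); counting occurrences would need a counting variant.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter backed by a Python big integer as the bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k bit positions by salting the hash input with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits >> p & 1 for p in self._positions(item))
```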
You can use distributed hashing to tackle the problem. All the hashes will be stored and maintained in different hash tables, with consistent hashing, on multiple machines (in memory). There will be an IO cost to locate and load a hash value, but it should still be better than storing the hashes in the file system. Memcached works on the same principle. So you have a unique hash key generated for each unique URL; when you do a cache_get($url), if the count is 1 then you have a unique URL.
However, for this method to work you have to iterate through all the URLs twice.
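The two-pass scheme in miniature, with a plain dict standing in for the distributed cache (the cache_get/cache_set calls in the original):

```python
def first_unique(urls):
    """Two passes: count everything, then return the first URL with count 1."""
    counts = {}
    for url in urls:                 # pass 1: count every URL
        counts[url] = counts.get(url, 0) + 1
    for url in urls:                 # pass 2: first URL whose count is 1
        if counts[url] == 1:
            return url
    return None
```

The second pass walks the input in original order, which is what makes "first" well defined.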
How do you guarantee that the hash tables are in order of the input URLs? I mean, if you hash a string, the index can be anywhere in the hash table. So, how do you guarantee that the first url hash that has no duplicates is the first URL in the input file? What am I missing? Please clarify.
I don't see the point in hashing, which is inherently unreliable for keys that are not of fixed size, such as URLs.
The requirement is to find the very first non-repeating URL and stop any further processing.
Do you see the need to map each URL in the data set and update each URL's count?
The key sentence is: "On the second run, for each URL, look for collisions in each of the 100 hash table".
This means that for each URL you have to read through the whole 100 billion URLs (summarized in 100 hashtables) to see if the same URL is repeated later. So this solution is O(N^2), and since N = 100 billion (10^11), N^2 = 10 sextillion (10^22). That is completely unacceptable.
Could you elaborate on what you mean by "unique"?
For example, I can look at the first URL in the list and see that there are no duplicates thus far and it will be considered unique...
Thinking about all the books I buy on Amazon, I knew that each book has a unique ISBN in the URL. Likewise, each non-book product has a unique product code in the URL. The format is something like title/uniqueid/.... So, I would ask if I can use this info. If I could, then I would extract the unique id and create a set of (unique id, file offset, num_entries). When the entire URL file is processed, I would get the earliest file offset with num_entries == 1.
Edit:
Thinking a bit more about the set, I would think we need two sets: one indexed by unique id and storing num_entries, and the other indexed by file offset and storing unique id. If a btree were used to implement the set, then walking file offsets and checking for unique ids with num_entries == 1 would give the earliest unique URL.
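The two-index idea might look like this sketch, where two dicts plus a sorted scan stand in for the btree, and extraction of the unique id from each URL is assumed to have happened upstream:

```python
def first_unique_by_offset(records):
    """records: iterable of (file_offset, unique_id) pairs (illustrative names)."""
    num_entries = {}   # unique id -> occurrence count
    id_by_offset = {}  # file offset -> unique id
    for offset, uid in records:
        num_entries[uid] = num_entries.get(uid, 0) + 1
        id_by_offset[offset] = uid
    # A btree would yield offsets in order for free; sorted() emulates that walk.
    for offset in sorted(id_by_offset):
        if num_entries[id_by_offset[offset]] == 1:
            return offset
    return None
```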
Hello,
You can use a hash table (unordered_map) for unique URLs found so far and another hash table (unordered_set) for those which were repeated.
The table for unique URLs would translate from URL value to its index in the file.
Then you would need to do a run on the whole data set to fill the two tables and another run on the unique URL table to extract the entry that has the smallest index.
Space complexity would be O(N), each entry having (AVG URL length + sizeof(int)) bytes.
Time complexity would be 2 * O(N).
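A sketch of the two-table scheme, with Python's dict and set standing in for unordered_map and unordered_set:

```python
def first_unique_index(urls):
    """Return the index of the first URL that occurs exactly once."""
    unique = {}       # url -> index of first occurrence (the unordered_map)
    repeated = set()  # urls seen more than once (the unordered_set)
    for i, url in enumerate(urls):
        if url in repeated:
            continue
        if url in unique:
            # Second sighting: move it from the unique table to the repeated set.
            del unique[url]
            repeated.add(url)
        else:
            unique[url] = i
    # Second run over the unique table: extract the smallest stored index.
    return min(unique.values()) if unique else -1
```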
Distributed approach: by using a distributed storage, you don't have to hold the whole file into memory, nor the entire hash tables.
Of course, the IO / Network overhead would slow down the process considerably.
If the array grows too big, dump it to a file.
<?php
// Radix-style recursion: group URLs by the character at position $pos,
// recurse into each group, and keep the smallest original index whose
// URL turns out to be unique. Returns -1 if the group has no unique URL.
function findFirstUnique($urls, $pos = 0) {
    if (count($urls) == 1) {
        return key($urls); // lone URL in its group: it is unique
    }
    $groups = [];
    foreach ($urls as $i => $url) {
        // URLs shorter than $pos + 1 characters go into the 'empty' group
        $key = isset($url[$pos]) ? $url[$pos] : 'empty';
        if (!isset($groups[$key])) {
            $groups[$key] = [];
        }
        $groups[$key][$i] = $url;
    }
    $min = -1;
    if (isset($groups['empty'])) {
        if (count($groups['empty']) == 1) {
            $min = key($groups['empty']);
        } else {
            // several identical URLs ended here: none of them is unique
        }
        unset($groups['empty']);
    }
    foreach ($groups as $key => $group) {
        $index = findFirstUnique($group, $pos + 1);
        // ignore groups that contain no unique URL ($index == -1)
        if ($index >= 0 && ($index < $min || $min < 0)) {
            $min = $index;
        }
    }
    return $min;
}
Why wouldn't you do this in one pass? If you're reading the file from top to bottom (even line by line) and you are "bucketing" your URLs in a Map, before inserting you check whether it already exists. If a collision is detected (
myMap.contains(newURL)
returns true), you have found a duplicate. No need to go through the whole file and then all the resulting hash maps.
Storing all 100 billion URLs in RAM (if all are unique) would require around 2 TB of RAM if I assume that an average URL is 20 chars long (which is a very rough underestimate). Even if you only store a clever hash value of the URLs (which needs to be at least a 64-bit value, i.e. 8 bytes), we're talking about hundreds of GBs of RAM.
Find the byte representation of each URL, then perform an AND operation on each subsequent URL. First occurrence of a FALSE means the second URL in the comparison (assuming we compare prev AND next URLs) is the Unique URL we are looking for.
We can partition the file into an external memory/storage for retrieval since this is most likely a question that isn't meant to have the entire file stored on local memory.
This can be done in O(n) time.
Assuming all urls are identical except for one.
From the limited information that's given, and given the fact that you are only looking for the first URL that is unique, the best way I can think of is to first sort the file.
Once you have a sorted file you just scan the file linearly to find the first one that has no duplicate next to it on the list.
before:
google.com
apple.com
careercup.com
bloomberg.com
google.com
apple.com
bloomberg.com
after:
apple.com
apple.com
bloomberg.com
bloomberg.com
careercup.com
google.com
google.com
You can easily pick out the unique one.
The initial cost of sorting is a major overhead for a file that size, for which you can use some of the techniques discussed in the comments above. Merge sort or quicksort should be fine.
Note*
Radix sort or Bucket sort can save a ton of computation if you can convert the URL to IP address.
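The sort-then-scan step in miniature (an external merge sort would replace `sorted()` at this scale). Note it returns a URL with no duplicate anywhere in the file, without regard to original position:

```python
def any_unique_after_sort(urls):
    """Sort, then return the first entry with no equal neighbor."""
    s = sorted(urls)
    n = len(s)
    for i, url in enumerate(s):
        # Duplicates are adjacent after sorting, so it suffices to
        # compare each entry with its immediate neighbors.
        if (i == 0 or s[i - 1] != url) and (i == n - 1 or s[i + 1] != url):
            return url
    return None
```

On the example above it returns "careercup.com".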
I think sorting the file will disturb the order and the first unique url found won't be the same as the first unique one in the actual file.
Since problem specifies large dataset, it must be divided into chunks that fit in memory, and then each chunk is processed, and the results are combined to get the final answer.
- gen-y-s January 21, 2015

Specifically, in this case, first we divide the file into smaller files using a uniform hashing function that produces N/M chunks, each of size approximately M (M = size of main memory). This is done by reading each URL from the input file, hashing it to find which chunk file it should be written into, and then appending it to that file together with its original line number.
Then we read each file into memory and build a hash table for the URLs, which is used to count how many occurrences there are of each URL, and to store the line number of the URL. After the hash table is built, we scan it to find the entry with the lowest line number among the entries which have a count of 1. This is the first unique URL in this chunk file.
We repeat the step above for all chunk files, and after processing each file we compare the line number of the URL we found against the minimum line number found so far.
After processing all the chunk files, we will have the URL which is the first unique URL in all of the input.
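The whole procedure can be sketched in miniature, with in-memory lists standing in for the chunk files and Python's built-in hash standing in for the uniform hashing function:

```python
def first_unique_url(urls, num_chunks=4):
    """Hash-partition, count per chunk, and return the globally first unique URL."""
    # Phase 1: partition (line_no, url) pairs so duplicates share a chunk.
    chunks = [[] for _ in range(num_chunks)]
    for line_no, url in enumerate(urls):
        chunks[hash(url) % num_chunks].append((line_no, url))
    # Phase 2: per chunk, count occurrences and remember each URL's first line;
    # keep the unique URL with the smallest line number across all chunks.
    best = None  # (line_no, url) of best candidate so far
    for chunk in chunks:
        table = {}
        for line_no, url in chunk:
            count, first = table.get(url, (0, line_no))
            table[url] = (count + 1, first)
        for url, (count, first) in table.items():
            if count == 1 and (best is None or first < best[0]):
                best = (first, url)
    return best[1] if best else None
```

For example, `first_unique_url(["a", "b", "a", "c", "b"])` returns "c".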