Goldman Sachs Interview Question
Applications DevelopersCountry: India
Good answer. Just to add some more details to show stronger mastery of the subject you should include a discussion about pathological data sets. This is the idea that a universe of key values is size X (every possible entry) is being hashed down to our HashTable size N which is substantially smaller than universe X. If our hashing function regularly encounters values in X that are divisible by N, then an abnormal amount of collisions can occur and degrade the O(1) performance in our HashTable to O(n). Therefore, when considering hashing functions you should absolutely do 2 steps:
1. Consider the type of input you expect to receive and ensure that it doesn't result in abnormal clustering/collisions when applied to your Hashing function.
2. Monitor the performance of the HashTable during its initial implementation to ensure that it is meeting the anticipated benchmarks for collisions, load, and rehashing needs.
They are asking about resolving collisions (when two keys map to same slot in underlying table) by chaining.
By chaining is where you just put all the stuff that maps to that slot AT that slot in some way.
There are many ways to do this:
1) Use a list at every slot and put all the keys that map to that slot at the list
a) The list can have underlying it a linked list OR
b) A vector / ArrayList (basically an array)
2) Any other non dynamic set data structure (like BST)
So they are comparing two types of chaining by lists:
Let's do the analysis:
----------------------------
What are the operations needed on the list?
1) Insert a new object that mapped to that slot
2) Delete a new object that mapped to that slot
That's it.
For ArrayLists/Vectors, insert is O(n), delete is O(1).
For LinkedLists, insert is O(1), delete is O(n)
So, in most situations, we are doing more inserts than deletes (because we build up a hash table at first then after a point the inserts and deletes even out).
So when you look at the O( ) of the two key operations, since you are probably going to do M inserts and K deletes, where M>K, for linked list the total runtime of doing all M,K operations will be:
O(M + n*K)
for vectors/ArrayLists it will be:
O(M*n + K)
So linked list is better in terms of overall big-O since usually M < K.
[But in practice, experiments show that if the data held in each node is not enormous, vectors/ArrayLists destroy LinkedList even if M<K by a good amount. This is because of the expense of hopping around via pointers in LinkedLists (on deletes), vs. cheap shifting of array elements in vectors/ArrayLists... but the interview didn't know that, they just wanted big-O proof above]
Ooops, it should say:
For ArrayLists/Vectors, insert is O(1), delete is O(n).
For LinkedLists, insert is O(1), delete is O(n)
And in fact the interviewer is wrong, ArrayLists/Vectors are better always.
Because even in Big-O they are equal. Experiments have shown that using arrays for these things KILLS using linked lists.
I think the interviewer was looking for this answer:
ArrayLists have AMORTIZED time for insert: O(1), delete O(n) because occasionally you have to double the array or cut the array in half.
Whereas Linked list always have insert O(1) vs. delete O(n).
BUT this does not matter. Experiments show that vectors are wayyyy fastor (unless the data in each node is big). In this case, each node holds 2 things, a pointer to object and a key. So ArrayLists will smoke Linked Lists in performance.
But historically people assumed LinkedLists were better because they thought resizing the array underlying an ArrayList was a big deal.
Ooops, it should say:
For ArrayLists/Vectors, insert is O(1), delete is O(n).
For LinkedLists, insert is O(1), delete is O(n)
And in fact the interviewer is wrong, ArrayLists/Vectors are better always.
Because even in Big-O they are equal. Experiments have shown that using arrays for these things KILLS using linked lists.
I think the interviewer was looking for this answer:
ArrayLists have AMORTIZED time for insert: O(1), delete O(n) because occasionally you have to double the array or cut the array in half.
Whereas Linked list always have insert O(1) vs. delete O(n).
BUT this does not matter. Experiments show that vectors are wayyyy fastor (unless the data in each node is big). In this case, each node holds 2 things, a pointer to object and a key. So ArrayLists will smoke Linked Lists in performance.
But historically people assumed LinkedLists were better because they thought resizing the array underlying an ArrayList was a big deal.
In the expectation, each list will have only O(1) entries. If a list only has O(1) entries, both insert and delete are O(1) for both arrays and linked lists.
I don't think anyone ever assumed resizing the underlying array was a big deal in terms of the efficiency of the resize operation itself. Keep in mind that even searching a bucket (the most common operation) is proportional to the number of elements in the bucket, so if touching every element in the bucket were a problem, hash tables wouldn't really work well.
Eugene,
Any List ADT anchored at each table bucket would work.
Historically people avoided List's based on arrays (i.e., ListArrays , Vectors) because of misconceptions about resizing, or reaching a max size, or expense of shifting elements.
Recent papers have highlighted this, and proven that Lists ADTs built on top of arrays are much faster in practice (unless the size of stored node is large).
So in general, we really should be using Vectors/ListArrays for HashTables, because nothing much is really stored in each element of the list (a key, a pointer, maybe two pointers, maybe a "count").
Hash (functions) is a transform from a "Key"-space to the subset of natural numbers {0, 1, ..., N - 1} where "N" is the size of the underlying array. Hashtables deploy hashing to provide O(1) insertion and look-ups in the data structure.
- Ehsan September 18, 2013Assume a set of keys "K" and an underlying array "A" with size "N". Name the table "T". Then in calling "T[k]", the program does:
1- Uses the hash function on key F(k)--->n.
2- Looks in "A[n]" and returns the corresponding value with key "k".
If evaluating "F(k)" is O(1), then you have got yourself a DS with O(1) look-up/insertion.
Problem is that things are not quite easy. For any hash function you use, there are many streams of input data which cripple the performance (E.g., assume your key is an unsigned int and the hash function is simply F(k) = k mod N. For k = 1, N + 1, .... you end up at the second location of array.)
To resolve the collisions in hash function you could do the following two:
1- Implement another DS at each array location, e.g., Linked Lists.
2- When the array is almost full (> 50%), double up it size.
For both scenarios there is O(N) time complexity, but I prefer the latter since it is a one time thing and maybe less dependent on how well you have tuned your hashing function to the type of input.
As for the second question why ArrayList is not used, I am not certain but I am not sure even if LinkedLists are common anymore. The first idea is that LL are easy to implement and lower cost and you don't need dedicated memory, e.g., you allocate on-the-go. However, for an AL, there is a fixed-length underlying array which adds to the cost, e.g., your hash table would need exactly "NM" memory where "M" is the size of AL. This is a waste for some bins with lower number of keys.
Also, read the following post on why LL are not common in .NET (Stackoverflow - could not put the link)
"what-data-structure-is-used-to-implement-the-arraylist"