Explain Hash and how hashing w

Goldman Sachs Interview Question for Applications Developers

0

of 0 votes

11
Answers
Explain Hash and how hashing works?
Another question was on why HashMaps stores same hashcode entry in map as linkedlist, why not it is storing as ArrayList?
- java.interviews.questions September 18, 2013 in India | Report Duplicate | Flag | PURGE
Goldman Sachs Applications Developer Algorithm

Email me when people comment.

An error occurred in subscribing you.

Country: India

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 3 vote

Hash (functions) is a transform from a "Key"-space to the subset of natural numbers {0, 1, ..., N - 1} where "N" is the size of the underlying array. Hashtables deploy hashing to provide O(1) insertion and look-ups in the data structure.

Assume a set of keys "K" and an underlying array "A" with size "N". Name the table "T". Then in calling "T[k]", the program does:
1- Uses the hash function on key F(k)--->n.
2- Looks in "A[n]" and returns the corresponding value with key "k".
If evaluating "F(k)" is O(1), then you have got yourself a DS with O(1) look-up/insertion.

Problem is that things are not quite easy. For any hash function you use, there are many streams of input data which cripple the performance (E.g., assume your key is an unsigned int and the hash function is simply F(k) = k mod N. For k = 1, N + 1, .... you end up at the second location of array.)

To resolve the collisions in hash function you could do the following two:
1- Implement another DS at each array location, e.g., Linked Lists.
2- When the array is almost full (> 50%), double up it size.

For both scenarios there is O(N) time complexity, but I prefer the latter since it is a one time thing and maybe less dependent on how well you have tuned your hashing function to the type of input.

As for the second question why ArrayList is not used, I am not certain but I am not sure even if LinkedLists are common anymore. The first idea is that LL are easy to implement and lower cost and you don't need dedicated memory, e.g., you allocate on-the-go. However, for an AL, there is a fixed-length underlying array which adds to the cost, e.g., your hash table would need exactly "NM" memory where "M" is the size of AL. This is a waste for some bins with lower number of keys.

Also, read the following post on why LL are not common in .NET (Stackoverflow - could not put the link)

"what-data-structure-is-used-to-implement-the-arraylist"

- Ehsan September 18, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Thanks @Ehsan for details.

- java.interviews.questions September 19, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

Good answer. Just to add some more details to show stronger mastery of the subject you should include a discussion about pathological data sets. This is the idea that a universe of key values is size X (every possible entry) is being hashed down to our HashTable size N which is substantially smaller than universe X. If our hashing function regularly encounters values in X that are divisible by N, then an abnormal amount of collisions can occur and degrade the O(1) performance in our HashTable to O(n). Therefore, when considering hashing functions you should absolutely do 2 steps:

1. Consider the type of input you expect to receive and ensure that it doesn't result in abnormal clustering/collisions when applied to your Hashing function.
2. Monitor the performance of the HashTable during its initial implementation to ensure that it is meeting the anticipated benchmarks for collisions, load, and rehashing needs.

- masterjaso September 19, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 1 vote

They are asking about resolving collisions (when two keys map to same slot in underlying table) by chaining.

By chaining is where you just put all the stuff that maps to that slot AT that slot in some way.

There are many ways to do this:
1) Use a list at every slot and put all the keys that map to that slot at the list
a) The list can have underlying it a linked list OR
b) A vector / ArrayList (basically an array)
2) Any other non dynamic set data structure (like BST)

So they are comparing two types of chaining by lists:
Let's do the analysis:
----------------------------
What are the operations needed on the list?
1) Insert a new object that mapped to that slot
2) Delete a new object that mapped to that slot
That's it.

For ArrayLists/Vectors, insert is O(n), delete is O(1).
For LinkedLists, insert is O(1), delete is O(n)

So, in most situations, we are doing more inserts than deletes (because we build up a hash table at first then after a point the inserts and deletes even out).

So when you look at the O( ) of the two key operations, since you are probably going to do M inserts and K deletes, where M>K, for linked list the total runtime of doing all M,K operations will be:
O(M + n*K)
for vectors/ArrayLists it will be:
O(M*n + K)

So linked list is better in terms of overall big-O since usually M < K.

[But in practice, experiments show that if the data held in each node is not enormous, vectors/ArrayLists destroy LinkedList even if M<K by a good amount. This is because of the expense of hopping around via pointers in LinkedLists (on deletes), vs. cheap shifting of array elements in vectors/ArrayLists... but the interview didn't know that, they just wanted big-O proof above]

- bigphatkdawg September 19, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Ooops, it should say:

For ArrayLists/Vectors, insert is O(1), delete is O(n).
For LinkedLists, insert is O(1), delete is O(n)

And in fact the interviewer is wrong, ArrayLists/Vectors are better always.

Because even in Big-O they are equal. Experiments have shown that using arrays for these things KILLS using linked lists.

I think the interviewer was looking for this answer:
ArrayLists have AMORTIZED time for insert: O(1), delete O(n) because occasionally you have to double the array or cut the array in half.
Whereas Linked list always have insert O(1) vs. delete O(n).

BUT this does not matter. Experiments show that vectors are wayyyy fastor (unless the data in each node is big). In this case, each node holds 2 things, a pointer to object and a key. So ArrayLists will smoke Linked Lists in performance.

But historically people assumed LinkedLists were better because they thought resizing the array underlying an ArrayList was a big deal.

- bigphatkdawg September 19, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

As i think hash it is attempt to map objects from some range and which has many parametres, maybe even infinite to finite range of values.
HashMap uses linkedlist to store enries for the sake of fast insertion to tail.

- gstepanov@griddynamics.com September 18, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

- bigphatkdawg September 19, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

-1

of 1 vote

In the expectation, each list will have only O(1) entries. If a list only has O(1) entries, both insert and delete are O(1) for both arrays and linked lists.

I don't think anyone ever assumed resizing the underlying array was a big deal in terms of the efficiency of the resize operation itself. Keep in mind that even searching a bucket (the most common operation) is proportional to the number of elements in the bucket, so if touching every element in the bucket were a problem, hash tables wouldn't really work well.

- eugene.yarovoi September 19, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

Eugene,
Any List ADT anchored at each table bucket would work.
Historically people avoided List's based on arrays (i.e., ListArrays , Vectors) because of misconceptions about resizing, or reaching a max size, or expense of shifting elements.

Recent papers have highlighted this, and proven that Lists ADTs built on top of arrays are much faster in practice (unless the size of stored node is large).

So in general, we really should be using Vectors/ListArrays for HashTables, because nothing much is really stored in each element of the list (a key, a pointer, maybe two pointers, maybe a "count").

- bigphatkdawg September 20, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Here is one such analysis:
bulldozer00 dot com slash 2012 slash 02 slash 09 slash vectors-and-lists

What he calls Lists really mean Linked Lists.

- bigphatkdawg September 20, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

LL have insert/delete O(1) and search O(n).
AL have insert/delete O(1)/O(n) and search O(n).

It's quicker to insert/delete with a LL than an AL. Both are equally slow on search.

Hence, LL is preferable.

- Gravometer October 02, 2013 | Flag Reply

CareerCup

Goldman Sachs Interview Question for Applications Developers

Books

Videos

Resume Review

Mock Interviews