## Amazon Interview Question

Research Scientists**Country:**United States

**Interview Type:**Phone Interview

I missed to add. The reviewer further mentioned that I can not sort and that I can use any amount of time for creating the data structure.

Since, I am not allowed to 'sort', I assumed that I can not use BST as I will have to compare numbers while populating the tree. But I didn't clarify it with him; I should have in hindsight.

please edit your post then, adding that information. but that "sort" constraint is weird

Are you sure this BST approach always work?

how you are constructing your BST to be sure search time is O(logn) (balanced BST).

I suppose you need to sort the given set and then build the BST to get O(logn) search time.

If we are ignoring this fact, then yes BST approach will give the correct and optimized solution.

Its a very interesting question I suppose. Now I think the answer is very straight forward unless I am totally mistaken. My code follows:

```
void printNearest()
{
float modulo = -1; //Init to -1 for starter
float array[MAX_SIZE]; // MAX_SIZE = total number of floats
float x; // Given number
loadFloats(array); // Load all the floating point numbers
List<float> closestNums = new ArrayList<>(); // Java 7 style
for(int i=0; i<array.length; i++)
{
float num = array[i];
float moduloDiff = (num-x)<=0 ? -(num-x) : (num-x);
if(modulo == -1)
{
modulo = moduloDiff;
closestNums.add(num);
}
else if(modulo > moduloDiff)
{
modulo = moduloDiff;
closestNums.clear();
closestNums.add(num);
}
else if(modulo == moduloDiff)
{
closestNums.add(num);
}
else
{
continue; //Ignore the number, it is not near
}
}
printAllNums(closetNums); //Print all the numbers
}
```

Furthermore since java flots & decimals are not very accurate I would adopt BigDecimal if that becomes an issue.

All comments are very welcome.

Like you said, that's the very straightforward way to solve the question, answering each query of a point X in O(N) time.

This kind of question ("goal is to efficiently find the number that is closest to 'x' from the fixed set", "what data structure will you actually use for storing the fixed set of floating point numbers to achieve this?") implies that we need to do better than the trivial approach.

Of course this is brute force search, and it has the worst case complexity of O(n) i.e. linear in the total floating point numbers in the set. In contrast BST would be O(log n) and the trie approach would be O(m) where m is the length of the 'longest float in n'.

I think trade off is the key point here. Why do we want a memory intensive data structure approach if the given float number set is small enough/not too big? This would more than suffice. However if you are talking about scaling then I challenge all the BST & other k-dimensional tree would choke the memory axis eventhough it may 'theoretically' fare better than this brute force approach on time axis...

I think the key point is clarification about what could be considered as 'sort', in the context of the problem.

IMO, using a BST is not sorting per se. It's more like structuring the data for searching optimization.

@Itanium, Well I really can't argue with him in the interview as to, what the ideal solution *should* be. My job in the interview is to try and come up with some solution that also satisfies the constraints. And I believe he was definitely looking for an efficient solution that would have smaller time complexity.

Not accusing you but you never mentioned time only efficiency in your question. If you just say efficiency it refers to product of Space & Time efficiencies.

i.e. Efficiency = SxT

And therefore we look for a tradeoff according to the given situation.

@Itanium: Does it? I have never heard anyone define "efficiency" as space complexity * time complexity.

@Eugene.Yaraovi

Why do you think space is so subordinate to time? Had space not been a serious performance metric, sorting algorithms like external sort that operate on huge files would not have been useful at all.

To clarify I meant to say Efficiency is proportional to S x T & not exactly equal..... I admit that was a mistake.... Its not really a textbook definition.

@Itanium: No part of my comment says or implies that space is subordinate to time. I was responding to this part of your post:

"If you just say efficiency it refers to product of Space & Time efficiencies.

i.e. Efficiency = SxT"

You're presenting it as if it were a common definition. I was saying that, as far as I know, this definition is completely made up, and it doesn't exist as a recognized concept. Nor is defining it as "proportional to" space * time a recognized definition.

@successboy123, Good point. However, all I have to do to avoid this issue is to maintain multiple search paths until each one of them can be *safely* deleted.

Thus, in the present case, I'll include paths "1-4-." and "1-5-." and when I reach the next set of nodes I can delete the path "1-4-.-0" and only retain "1-5-.0". And if there was as 13.99, I would also retain "1-3-." until it reaches the next digit, for the given query 14.99. Isn't this correct?

why to use any data structure, a normal loop can do ,here is a simple example ,

tell me if i'm wrong...

```
#include<stdio.h>
#include<math.h>
void main(){
float a[5] = {123.4, 123.46 , 127.65 ,123.45, 120.34 };
int i = 0;
int j = 0;
float num = 123.00;
float temp = fabs(num-a[0]);
for(i = 0; i< 5 ;i++){
if(fabs(num-a[i]) <= temp){
j = i;
temp = fabs(num-a[i]) ;
}
}
printf("%f\n",a[j]);
}
```

Why would you want to sort / BST? That takes O(n ln n). A simple linear scan is easy and takes O(n).

Got to be honest, don't really see how this is an interesting question!

Just a thought. Why not use hash map? Key as floating numbers and value as "difference between floating number and x. find the key with minimum diff .

The idea is to have a fixed set of N points and then have multiple queries with different values of X. So, that won't help

I think we can use a hash function for this. The important point is that it is a floating point. So, first write the floating points a exponent functions (e.g; 1003 = 1.003 * 10^{3}). Then uses nested hashmaps (e.g hashmap(integer*exponent, <hashmap(first_decimal, <haspmap(second_decimal,...)>)>). We can do so until we reach a predetermined depth, and just use brute force on the rest.

Complexity for searching O(M+L); where 'M' is the depth and 'L' is the length of the list of number at the end to search from.

There are many possible options - en.wikipedia.org/wiki/Nearest_neighbor_search

I think we should go for a Space Partitioning method.

I would pick a Kd-Tree (where K is the number of dimensions of the points). This would allow us to answer each query of a point X in O(log N) time, where N is the number of points of the fixed set.

Thanks for the answers. Even the k-d tree or for that matter any space partitioning method requires comparison to identify their order as to which goes to the left and which one to the right. Correct my if I am wrong.

My own answer during the interview was bit unconventional (I am not sure if the interviewer was convinced though).

I opted for a trie data structure for storing the set of floating point numbers, where each node would either store a digit or a decimal point. Thus the height of the trie would be length of the longest floating point number. The traversal at each node would involve finding the node that matches the corresponding digit in the number being tested or finding nodes that are at same distance. I will have to maintain multiple paths during the search until I can safely eliminate all but the closest path.

The problem in this is that in order to get the individual digits, I need to convert the number to string and back to number. While this is not an issue for populating the trie as I don't have to worry about the efficiency here, it will be an issue during retrieval.

Any thoughts about this approach?

Of course, BST or k-d tree would have been straight forward answers if 'sorting' constraint was not there.

I'm not sure that the trie approach ends up being better than a linear scan through the N points. The trie stores the most significant digits first and we don't know the order of magnitude of the points represented in the children of a node. We can have numbers with 3, 4, 10 digits, so we end up traversing all points (we have little potential to cut the search).

The sorting constraint is a bit weird to me. I would like to clarify it a bit better with the interviewer.

If we completely ignore sorting, maybe we should go for a sort of Locality Sensitive Hashing. The goal of the data structure would be to group the points in buckets of points close to each other.

At this point, I would also ask the interviewer if we need 100% correctness or if an approximate answer is ok.

Let's say we group the points in O(sqrt(n)) buckets. To answer a query, we could pick a random sample of each bucket and find the bucket closest to point X. The 100% correctness part is relevant here. Since we group closest points together, we expect that the selected bucket contains the closest point (although it can be wrong a few times).

Then, we could brute force the elements of this bucket to find the closest point to X, yielding O(sqrt(n)) per query.

We can try to have O(log(n)) buckets but the sorting constraint makes it hard to have a good performance checking the n/log(n) elements of a bucket.

Still, i think that grouping the points into proper balanced buckets can be quite hard. In an interview I would go for a straightforward approach. First by going through each point and selecting the ~sqrt(n) available points closest to it. Or trying to binary search the maximum distance between points in a bucket and check how many buckets a given threshold yields.

What do you think?

I don't know if the Trie data Structure would work. For example lets assume that the Trie contains "14.01" and "15.01" and we are searching for "14.99". In the Trie we would proceed as follows "1" -> "4" -> "." Then the only subtree is "01" and Hence the result we get is "14.01" while in reality "14.99" is closer to "15.01".

Reproducing what I said in other post below:

I think trade off is the key point here. Why do we want a memory intensive data structure approach if the given float number set is small enough/not too big? This would more than suffice. However if you are talking about scaling then I challenge all the BST & other k-dimensional tree would choke the memory axis eventhough it may 'theoretically' fare better than this brute force approach on time axis...

Itanium, you're missing the point of the question. We have a fixed set of points, but we want to do several queries with different points X. The trivial approach takes O(Q*N), but we would like it to be O(Q*log(N)) or O(Q*sqrt(N)) at least.

Now i also see that i misread the question. I thought we were talking about 2d or 3d points, not just an array of floats. I still don't see a better approach avoiding sorting.

@Miguel,

I see what you are saying, but don't you think heavy data structures would also degrade performance during high frequency invocations?

While the constant factor might be big for certain data structures, it's still quite worth it due to the O(n) vs O(log(n)) time.

In this case, the data structure does not need to change between queries, I don't see any reason for the performance to degrade.

Using a Binary Search tree can solve this problem. While traversing the BST we can continue to track the "Minumum difference" found till the current Node and return the "Global minumum" found at the end.

Time O(lg n) and constant space.

- successboy123 September 25, 2013