eugene.yarovoi
O(log(n!)) = O(n log n)
 eugene.yarovoi February 12, 2013 @S.Abakumoff: This is actually an optimal solution for a square 2d array, as far as I know. It is O(n) for an n*n matrix. An n*n matrix has n^2 elements, so naive search would be O(n^2). A binary search solution is possible, but it probably wouldn't be as good as you think it would be. You can only get O(n log n) with typical binary search here, or perhaps O(n) with some clever variant.
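In case it helps, here is a sketch of that O(n) approach, the staircase (saddleback) search; class and method names are mine:

```java
// Staircase search: start at the top-right corner of an n*n matrix whose
// rows and columns are sorted ascending. Each step either moves left
// (current value too big) or down (too small), so at most 2n steps total.
public class StaircaseSearch {
    public static boolean contains(int[][] m, int target) {
        int row = 0, col = m.length - 1;
        while (row < m.length && col >= 0) {
            if (m[row][col] == target) return true;
            if (m[row][col] > target) col--;  // everything below in this column is larger
            else row++;                       // everything left in this row is smaller
        }
        return false;
    }
}
```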
 eugene.yarovoi February 12, 2013 No, it's really not that simple. A couple things:
1. There's no need for there to be k different buckets. Suppose the array is all 1s and k=5 and NUM = 5. Then any arrangement satisfies the conditions, even though there's only one bucket.
2. The sum of (a1 % NUM) + (a2 % NUM) + ... + (ak % NUM) need not add up to NUM. It can also add up to a multiple of NUM.
3. If you were to have k buckets, it could not be that the number of elements in each bucket is the same, since N is not necessarily divisible by k.
I recommend reading the discussion at the top of the page. It might shed some light as to why simple round-robin-like approaches to this problem fail.
@Anonymous: yes, that's what the problem sounded like to me.
"Random(n) is given as a tool you can use to generate a single random number between 1-n".
I assumed that Random(n) is O(1) in saying that the overall complexity of the solution discussed here is O(n log n).
A binary search tree is a type of binary tree. Asking to compare the two is like asking to compare dogs to animals.
 eugene.yarovoi February 10, 2013 An AVL tree is a type of binary search tree, which is a type of binary tree. I don't know what it means to compare an AVL tree to a binary tree because an AVL tree is a type of binary tree.
 eugene.yarovoi February 10, 2013 Why couldn't you do it with BFS?
 eugene.yarovoi February 10, 2013 Most of what you said isn't right.
You wouldn't want to use a binary tree for a folder structure, because a folder can have many subfolders, not just 2.
In an AVL tree, the heights of any node's two subtrees differ by at most one.
RB trees have nothing to do with security. They are often used instead of AVL trees because their insertion and deletion times have a lower constant factor.
@stefanos: yes, I think truncating division is the most plausible interpretation of / in the context of this question.
 eugene.yarovoi February 12, 2013 @vik: True. Both approaches are different ways of doing the same thing. I wasn't concerned with optimizing the algorithm as much as possible, but only with giving a reasonable O(n) approach.
 eugene.yarovoi February 08, 2013 @HG: Collections.unmodifiableMap(...) in Java uses this approach to immutability, actually.
 eugene.yarovoi February 06, 2013 @HG: that works too, though I tend not to prefer that kind of approach, largely because illegal uses cannot be detected until runtime (the compiler can't catch them). The technique is very useful if you must be able to accept an ImmutableDate anywhere that you can accept a Date, however.
 eugene.yarovoi February 06, 2013 1) Count the nodes in the list. Then traverse half of them to get to the middle of the list.
2) Starting with the middle of the list, use the standard algorithm for reversing a singly-linked list in place until you reach the end of the list.
3) Get a pointer to the node that used to be the last node before the reversal (and is now the middle node); call it endPointer. Also get a pointer to the start node of the list; call it startPointer.
4) Take turns advancing the two pointers, printing elements as you go. Like, print startPointer; startPointer = startPointer.next; print endPointer; endPointer = endPointer.next. Do this until the endPointer reaches the null pointer (the end of the list after reversal).
5) Restore the original value of endPointer from step 3, and reverse the linked list from that point until the end (null terminator). This restores the list to its original state.
Of course you have to be careful about how you treat lists of even size vs. lists of odd size and watch out for off-by-one bugs, but these are the basic steps.
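A sketch of steps 1-5 in Java (the Node class and all names are mine; it returns the output as a comma-separated string rather than printing, so it's easy to check):

```java
// Prints a1, an, a2, a(n-1), ... in O(n) time and O(1) extra space by
// temporarily reversing the second half of the list, then restoring it.
public class MiddleReverse {
    static class Node {
        int val; Node next;
        Node(int v) { val = v; }
        static Node of(int... vals) {
            Node head = null, tail = null;
            for (int v : vals) {
                Node node = new Node(v);
                if (head == null) head = node; else tail.next = node;
                tail = node;
            }
            return head;
        }
    }

    // Standard in-place reversal of a singly-linked list; returns the new head.
    static Node reverse(Node head) {
        Node prev = null;
        while (head != null) {
            Node next = head.next;
            head.next = prev;
            prev = head;
            head = next;
        }
        return prev;
    }

    static String alternate(Node head) {
        int n = 0;                                    // step 1: count nodes
        for (Node p = head; p != null; p = p.next) n++;
        if (n == 0) return "";
        Node beforeMid = null, mid = head;
        for (int i = 0; i < n / 2; i++) { beforeMid = mid; mid = mid.next; }
        Node endPointer = reverse(mid);               // step 2: reverse 2nd half
        if (beforeMid != null) beforeMid.next = endPointer;
        StringBuilder out = new StringBuilder();      // step 4: interleave
        Node front = head, back = endPointer;
        int fromFront = 0;
        for (int i = 0; i < n; i++) {
            if (i % 2 == 0 && fromFront < n / 2) {
                out.append(front.val); front = front.next; fromFront++;
            } else {
                out.append(back.val); back = back.next;
            }
            if (i < n - 1) out.append(',');
        }
        Node restored = reverse(endPointer);          // step 5: undo the reversal
        if (beforeMid != null) beforeMid.next = restored;
        return out.toString();
    }
}
```

The fromFront counter handles the even/odd caveat from the last step: the first half contributes n/2 elements and the reversed half contributes the rest, so for odd n the middle element correctly comes out last.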
@Sidhu: Not quite. Remember, you said, "if it doesn't then print it else try again." The "else try again" part is quite critical. You must consider how many times you'll need to try again for each element. Every element you print may be the result of several attempts. The analysis above shows that the average number of attempts per element printed in your scheme is O(log n), for a total of O(n log n) hash lookups and O(n log n) total expected complexity.
@fly123: please elaborate
You can definitely factor a number up to 10 billion into prime factors very quickly. Even using a relatively naive method (standard loop to try all factors up to sqrt(n) <= 10^5), you should be able to get all the prime factors in a few milliseconds or so.
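For instance, a minimal trial-division sketch (names are mine; for n up to 10^10 the loop tries at most about 10^5 candidates):

```java
import java.util.ArrayList;
import java.util.List;

// Naive trial division: try 2, then odd candidates up to sqrt(n).
// Whatever remains after dividing out all factors <= sqrt(n) is prime.
public class Factorize {
    public static List<Long> primeFactors(long n) {
        List<Long> factors = new ArrayList<>();
        for (long f = 2; f * f <= n; f += (f == 2 ? 1 : 2)) {
            while (n % f == 0) {
                factors.add(f);
                n /= f;
            }
        }
        if (n > 1) factors.add(n);  // leftover prime factor > sqrt(original n)
        return factors;
    }
}
```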
 eugene.yarovoi February 02, 2013 If you think the problem is asking you to find integers x and y in the array such that x + y = T, you could achieve O(n^2) worst case and, look, O(1) best case by just the most naive method: try summing every pair of numbers, return true as soon as you see a pair summing to T. Why use randomization at all?
 eugene.yarovoi February 02, 2013 @bill: I think the solution is for integers in the range [1, N]. You'd use one extra variable (O(1) space) to sit in place of arr[N] if you wanted to support integers in the range [0, N]. Would have some special cases, but you could still use this same idea.
 eugene.yarovoi February 02, 2013 I don't really like the label "best one". For example, the first time someone asked me this problem, I came up with a solution that was nothing like any of the solutions here so far. Mine also ran in O(n). Was mine any better or worse? Well, the queue solution is very elegant in my opinion; my solution looks more like a hack. My solution (at least to me) is far more natural though, in that I think someone who has never seen this problem solved before would be more likely to come up with it than with the queue solution, but maybe that's just me. I don't know about the constant factor. I'd have to look at it carefully. The point is that even if two solutions are both O(n) there may be tradeoffs between code complexity, complexity of the idea, elegance of the idea, constant factor for the time, locality of memory accesses (will you get lots of cache hits, etc.), and space needed. For instance, my solution was able to be adjusted to use only O(sqrt(windowSize)) space at the expense of doubling the constant factor (still O(n) though, so there are methods that use less space and are still O(n)).
That's generally why people think it's worth discussing different solutions to the same problem.
There's a well known dynamic programming solution too, when the weights are constrained to relatively small numbers. I think that usually the dynamic programming approach is what interviewers look for in problems like this one.
 eugene.yarovoi February 01, 2013 @Bill: why is it out of bounds?
 eugene.yarovoi February 01, 2013 You can fix this just by staying on the same spot after you swap and a[i] still doesn't equal i. In that case, you can do it in one process:
5, 6, 10, 8, 9, 5, 7, 6, 5, 10
Pass 1: 9, 6, 10, 8, 5, 5, 7, 6, 5, 10
Pass 2: 5, 6, 10, 8, 5, 5, 7, 6, 9, 10
Pass 3: -1, 6, 10, 8, 5, 5, 7, 6, 9, 10
Pass 4: -1, 5, 10, 8, 5, 6, 7, 6, 9, 10
Pass 5: -1, -1, 10, 8, 5, 6, 7, 6, 9, 10
Pass 6: -1, -1, -1, 8, 5, 6, 7, 6, 9, 10
Pass 7: -1, -1, -1, 6, 5, 6, 7, 8, 9, 10
Pass 8: -1, -1, -1, -1, 5, 6, 7, 8, 9, 10
The rest of the passes are the same.
Since in each pass we either eliminate an out-of-place element, advance the current index, or both, we know that there are at most 2n steps, each of which takes constant time.
Your solution with this change is actually a slightly different form of the top-voted solution. It's essentially the same thing, give or take optimizations.
It's interesting how the most subtle differences can change the complexity.
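If I'm reading the passes right (each value v gets swapped into its home slot v-1, and a value whose home slot is already correctly filled is a duplicate, which I mark here with -1), the fixed procedure can be sketched as follows; class and method names are mine, and I'm assuming all values lie in [1, n]:

```java
// Settle each value v into slot v-1 by repeated swapping without advancing i
// until a[i] is resolved. Each iteration either settles a value into its home
// slot permanently, marks a duplicate with -1, or advances i, so the loop
// runs at most ~2n times.
public class CycleSettle {
    public static int[] settle(int[] a) {
        int i = 0;
        while (i < a.length) {
            int v = a[i];
            if (v == -1 || v == i + 1) { i++; continue; }  // marked or already home
            if (a[v - 1] == v) {                           // home slot filled: duplicate
                a[i] = -1;
                i++;
            } else {                                       // swap v into its home slot
                a[i] = a[v - 1];
                a[v - 1] = v;
            }
        }
        return a;
    }
}
```

Running it on the example array above reproduces the pass-by-pass states shown.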
I understand if you can't find a proof or counterexample. All I'm saying is that it's important to mention conjectures on which the result depends so that people can proactively validate them, as opposed to catching them hidden within a proof.
The conjecture turns out to be false. The O(log n) upper bound is tight. Consider the following family of sequences:
S_0 = (1)
S_k = (2^k, 2^k - 1, ..., 2^(k-1) + 1) + Reverse(S_(k-1))
where + is sequence concatenation.
So the first sequences are
S_0 = (1)
S_1 = (2) + (1) = (2, 1)
S_2 = (4, 3) + S_1 = (4, 3) + Reverse(2, 1) = (4, 3, 1, 2)
S_3 = (8, 7, 6, 5) + Reverse (4, 3, 1, 2) = (8, 7, 6, 5, 2, 1, 3, 4)
Observe that each S_k is chosen so that applying one pass of your algorithm mirrors it about the center. (2^k, 2^k - 1, ..., 2^(k-1) + 1) + Reverse(S_(k-1)) becomes S_(k-1) + (2^(k-1) + 1, ..., 2^k - 1, 2^k). So the second half has all the elements in their right places, and the first half becomes S_(k-1).
S_1 takes 1 pass to sort. S_k takes 1 more pass than S_(k-1). By induction, it takes k passes to sort S_k. We can also see that S_k has twice the length of S_(k-1). S_1 has length 2^1, so S_k has length 2^k. The number of passes needed can thus be as high as logarithmic in the number of elements. Thus, if you make all the passes you need to ensure correctness, the algorithm will be O(n log n) time overall.
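The recurrence is easy to generate and sanity-check directly (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Builds S_k per the recurrence: S_0 = (1),
// S_k = (2^k, 2^k - 1, ..., 2^(k-1) + 1) + Reverse(S_(k-1)).
public class WorstCaseSequence {
    public static List<Integer> build(int k) {
        if (k == 0) return new ArrayList<>(List.of(1));
        List<Integer> s = new ArrayList<>();
        for (int v = 1 << k; v > (1 << (k - 1)); v--) s.add(v);  // descending run
        List<Integer> prev = build(k - 1);
        Collections.reverse(prev);
        s.addAll(prev);
        return s;
    }
}
```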
I think I'll be able to find one. But why didn't you mention this issue in your analysis, especially if you thought about it and couldn't prove it?
 eugene.yarovoi January 31, 2013 +1. By the way, what's with the function name?
@zyfo2: there is no hash map.
@genys: That's very true, but there are only log(k) elements moved during each heap deletion. If you use a HashMap, you can get expected O(log k) per heap operation overall.
If you implemented the heap not as an array but as a tree with dynamically allocated nodes, you would not have to update pointers in the map. You'd just have to switch around child and parent pointers in the heap itself, achieving deterministic O(log k) time heap operations.
genys is correct. Searching an element in a binary heap (which is required to remove it) provably cannot be faster than O(k) in the worst case where k is the size of the heap, unless the binary heap is augmented with some other data structure. Deleting an element with a specified value takes O(k) in the worst case. The O(log k) removal time is for deleting an element at a known position once you've found it. For example, the min can be found right away in a min-heap, so it can be deleted in O(log k).
That said, you can use a map to associate values with positions in the heap. You can then locate values in the heap in log(k) or better, and cause their removal in log(k). This becomes the O(n * log(k)) solution you were hoping for, but it's now more complicated than the O(n) solutions to this problem.
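A sketch of that heap-plus-map idea (all names are mine; for simplicity it assumes the values stored in the heap are distinct):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A min-heap augmented with a value -> index map so an arbitrary value can
// be located in O(1) expected time and removed with an O(log k) sift.
public class IndexedMinHeap {
    private final List<Integer> heap = new ArrayList<>();
    private final Map<Integer, Integer> pos = new HashMap<>();

    public int min() { return heap.get(0); }
    public boolean contains(int v) { return pos.containsKey(v); }

    public void insert(int v) {
        heap.add(v);
        pos.put(v, heap.size() - 1);
        siftUp(heap.size() - 1);
    }

    // Remove an arbitrary value (assumed present): map lookup + sift.
    public void remove(int v) {
        int i = pos.remove(v);
        int last = heap.remove(heap.size() - 1);
        if (i < heap.size()) {          // removed element wasn't in the last slot
            heap.set(i, last);
            pos.put(last, i);
            siftUp(i);
            siftDown(i);
        }
    }

    private void swap(int i, int j) {
        int a = heap.get(i), b = heap.get(j);
        heap.set(i, b); heap.set(j, a);
        pos.put(b, i); pos.put(a, j);   // keep the map in sync on every move
    }

    private void siftUp(int i) {
        while (i > 0 && heap.get(i) < heap.get((i - 1) / 2)) {
            swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }

    private void siftDown(int i) {
        while (true) {
            int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap.size() && heap.get(l) < heap.get(smallest)) smallest = l;
            if (r < heap.size() && heap.get(r) < heap.get(smallest)) smallest = r;
            if (smallest == i) return;
            swap(i, smallest);
            i = smallest;
        }
    }
}
```

The cost of keeping the map in sync is paid inside swap(), which is why every heap operation stays O(log k) expected rather than O(log k) worst case.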
If there are many squares of the same size, there may be distances that have a lot of pairs of points associated with them. Let's say we have 1M squares of side 4. Then there are at least 4M entries under the key of 4. Now you have to match them up and associate the right ones together to form squares. You haven't said anything about how you will do this efficiently. It's quite possible that you had an algorithm in mind for that, but without describing it, this solution seems incomplete.
 eugene.yarovoi January 30, 2013 @Rage: you can't set the return type of a method to be final.
public final Date foo() { ... } does something very different than what you seem to think. It does not return an immutable object; rather, it seals the method from extension by subclasses. The "final" here is not part of the return type (the return type is still Date). I think that maybe you're equating "final" with "const" in C++, and they are not equivalent.
Your initial answer is not correct because your class is entirely mutable: I can change its state by calling yourObject.getDate().set(myNewTime). Simply creating a defensive copy on construction does not make your object immutable; it only ensures there are no references to the internals of the class at the moment the object is instantiated. If you later leak references to the internals of your class (like your getDate() method does), the class is still mutable. What you need is a defensive copy both on construction and on returning internals. Copy-on-construction ensures that there are no external references to the internals of the class at the time the object is initialized. Copy-on-get ensures that no such references come to exist at a later time.
Assuming Date has a copy constructor, the full solution would be as follows:
public class ImmutableDate
{
    // the final here isn't strictly necessary, because this field isn't
    // externally visible anyway
    private final Date date;

    public ImmutableDate(Date date)
    {
        this.date = new Date(date);
    }

    public Date getDate()
    {
        return new Date(this.date);
    }
}

 eugene.yarovoi January 29, 2013 This isn't a nearest-neighbors problem. You're asked to store the exact data in the distance matrix.
 eugene.yarovoi January 19, 2013 @Bertranreddy: there needs to be far more discussion of what the requirements are of this "like" feature before it makes sense to talk about implementation details. Do we have to add stories to users' pages about things they liked? Or are the likes more like just a counter? Is it important to be able to see who liked a particular thing?
If, after examining the requirements carefully, we determine that an array or linked list would be a reasonable in-memory data structure, we still need to talk about how we're going to persist this data (we want to actually save this data to disk, not just store it in RAM), and then maybe also replicate it throughout some sort of distributed system.
The query does what I indicated in my answer above. I think you're just getting the logic a bit backwards.
You could try running the query to confirm. I've already done that.
5 highest books *with ties*.
 eugene.yarovoi January 15, 2013 It's not quite the 5 most expensive books - it's the 5 most expensive books with ties included. See my answer.
 eugene.yarovoi January 15, 2013 @Son: Given the lengths of AB and BC in a triangle, can you find the length of AC? The answer is no. It might still be possible to use a knowledge of certain distances to compute certain other distances, but it's not as simple as you suggest.
@Vincent: It's constant time, assuming O(1) arithmetic operations. There's nothing in there such that the running time depends on N.
"The problem with the matrix is that it occupies N*N rooms of the memory, where only a small subset of nodepairs (let's say m) have distance information."
I don't understand what you mean by this sentence. The problem statement suggests a distance is defined between every pair of points, not some small subset.
If you know the distance AC and you also know AB, how will you find BC? You cannot infer the length of the third side of a triangle from the length of the two other sides.
 eugene.yarovoi January 15, 2013 This gets the top 5 most expensive books, except the top 5 designation here will admit ties. In other words, if the books are:
A $31.10
B $30.00
C $22.12
D $22.12
E $19.99
F $19.99
G $17.16
This query will select all books A-F because E and F are tied for 5th place. So 6 results will be returned. If, however, we now insert ('H', 22.12), the query will now return A, B, C, D, and H, since E and F will both be pushed to 6th place now.
I would note that this query is likely to be an inefficient way to achieve the result. There are much better ways to do this. In MS SQL Server, for example, there's ready-made syntax to do this:
SELECT TOP 5 WITH TIES title
FROM Books
ORDER BY Price DESC
If you didn't have access to that, a good way to do it would be to select the 5th-highest price, and then query for all titles whose price is >= that price:
SELECT title
FROM Books
WHERE
Price >=
(SELECT TOP 1 Price FROM (SELECT TOP 5 Price FROM Books ORDER BY Price DESC) AS TopPrices ORDER BY Price ASC)
As compared to the approach in this question statement, the advantage here is that the subqueries are independent of any perrow information, which will lead the SQL engine to evaluate the subquery once before beginning the main query, as opposed to evaluating it for each row.
That's because x and y get promoted to int when doing the addition. The same idea would not save integers from overflowing.
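A small illustration of the promotion rules (assuming the code in question added two byte or short values; method names are mine):

```java
public class Promotion {
    // byte operands are promoted to int before the addition, so the sum
    // can safely exceed Byte.MAX_VALUE.
    static int addBytes(byte x, byte y) {
        return x + y;
    }

    // int operands are NOT promoted to long: the addition overflows in
    // int arithmetic before the widening to long happens.
    static long addInts(int x, int y) {
        return x + y;
    }

    // Widening one operand first performs the addition in long arithmetic.
    static long addIntsWidened(int x, int y) {
        return (long) x + y;
    }
}
```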
 eugene.yarovoi January 13, 2013 Agreed with Cerberuz's response to the first question: just interact with the interviewer and explain your thought process as you try to find the solution. Whatever you do, don't panic.
In response to your second question, I'd say that the probability of being asked to code is usually inversely proportional to the difficulty of the problem. If the problem is complex enough that just developing the algorithm takes some time, it's less likely that you'll be asked to code. When coding, a lot of the time it's fine to assume you already have certain standard components (standard libraries, where many common data structures may be found). At times, the interviewer may ask you to implement something yourself instead of using a library - just keep interacting with the interviewer to find out what they want, and try to give them what they're looking for.
Remember to always be honest about what you know and what you don't know. Your interviewer may not be Socrates, but he/she will be considering whether you admit not knowing things that you don't know. After all, an admission of ignorance opens the way to learning, and is thus extremely important to the kind of lifelong learning needed for a software engineer.
It's fine to have certain black boxes in your understanding of some concepts, so long as you have no delusions about which topics you know and which you don't know. Candidates use black boxes in their reasoning all the time; for example, most people don't remember the exact algorithm for balancing BSTs, and yet when constructing algorithms involving BSTs, they assume their trees will be balanced, and they analyze the complexity of their solution accordingly.
As long as you're aware of what components you need to achieve certain results (for example, you must know that to achieve guaranteed O(logN) inserts you need a balanced tree and not just any tree), a lot of the time it's OK to use a black box component. Of course, the more you know, the better. It's great if you can code a suffix tree from scratch. It's much more important, however, to be able to turn an implementation of a data structure into a solution to an actual problem. After all, libraries for most data structures can be found online, but you'll have to write your own business logic.
Yes, I'm aware it's just a definition.
All I'm arguing is that defining C(0) = 1 can be a part of a reasonable method for solving this problem. In fact, making this definition and further defining C(anything negative) = 0 is a solution that has fewer base cases and that also, in some sense, explains the origin of the base cases in skeptical_scientist's initial answer.
There's a reason why nowhere in my reply do I claim that I have given a rigorous mathematical proof demonstrating C(0) = 1. I suppose whether the expression C(0) has any meaning that is mandated by this problem statement is a somewhat philosophical question.
Why is the argument for C(0) = 1 bogus? There is exactly one way to take 0 items. I would argue that the real base cases should be C(0) = 1, C(all negative numbers) = 0. The logic here is that if you're left with no items, there's only one thing to do - nothing - and if you've used more items than you have, then you're in an invalid state and there are 0 ways to proceed from there.
Before you decide that this is unreasonable, consider how you'd even solve this problem if the problem wasn't asking you to choose items in groups of 1, 2, or 3, but in groups of 1, 2, ..., or 100. How would you get all the base cases? C(1) would be 1 because if you take 1 you're left in a valid state (0 items left); take any more and you're in an invalid state. Similarly C(2) would be 2 because taking 1 or 2 leaves you in a valid state. C(3) would be 4 because you'd consider C(0) + C(1) + C(2) - the three possible numbers of remaining items that leave you in a valid state. The C(0) term corresponds to the possibility of taking all 3 items at once. C(4) would then be 8 since it would be C(0) + C(1) + ... + C(3). Again, the C(0) term corresponds to the possibility of taking all 4 items and so should be set to 1 for the equation to make sense.
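The base cases C(0) = 1 and C(negative) = 0 translate directly into a short DP (names are mine; the negative cases are implicit because the inner loop simply skips group sizes larger than the remaining count):

```java
// C(n) with groups of size 1..maxGroup:
// C(n) = C(n-1) + C(n-2) + ... + C(n-maxGroup), C(0) = 1, C(negative) = 0.
public class GroupCount {
    public static long count(int n, int maxGroup) {
        long[] c = new long[n + 1];
        c[0] = 1;  // one way to take 0 items: do nothing
        for (int i = 1; i <= n; i++) {
            for (int g = 1; g <= maxGroup && g <= i; g++) {
                c[i] += c[i - g];  // take a group of g, then C(i - g) ways remain
            }
        }
        return c[n];
    }
}
```

For maxGroup = 3 this reproduces the usual tribonacci-style answers (1, 2, 4, 7, ...), and for maxGroup = 100 it gives the C(4) = 8 described above.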
You get into infinite recursion here if oldColor == newColor. I would add a quick explicit check for that condition.
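For example, a recursive flood fill with that guard (names are mine):

```java
// Flood fill with an explicit guard: without the oldColor == newColor check,
// repainting a region its own color would recurse forever, because painted
// cells would still match oldColor.
public class FloodFill {
    public static void fill(int[][] img, int r, int c, int newColor) {
        int oldColor = img[r][c];
        if (oldColor == newColor) return;  // the quick explicit check
        fill(img, r, c, oldColor, newColor);
    }

    private static void fill(int[][] img, int r, int c, int oldColor, int newColor) {
        if (r < 0 || r >= img.length || c < 0 || c >= img[0].length) return;
        if (img[r][c] != oldColor) return;
        img[r][c] = newColor;
        fill(img, r + 1, c, oldColor, newColor);
        fill(img, r - 1, c, oldColor, newColor);
        fill(img, r, c + 1, oldColor, newColor);
        fill(img, r, c - 1, oldColor, newColor);
    }
}
```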
Are you sure this question is correctly and fully stated? Please revise.
 eugene.yarovoi January 02, 2013 So I have some doubts about this solution: how exactly will we discover minimum subsegments that pass through the center? It seems we're assuming that any minimal subsegment that passes through the middle of the array will have an equal number of elements on both sides, and hence we will be able to discover it efficiently. There's no basis for such an assumption, if in fact you make it. I might just be misunderstanding you though.
 eugene.yarovoi December 30, 2012 That's just not valid. Consider {4, 1} and {2, 3}. The difference would be {2, -2}, the sum of which is 0. Clearly {4, 1} are not the same elements as {2, 3}.
 eugene.yarovoi December 30, 2012 It's the same logic as given in my answer here: www.careercup.com/question?id=12217041
 eugene.yarovoi December 30, 2012 Just a couple ideas off the top of my head:
Compression: compression often improves space usage. It may improve or worsen performance. It can worsen performance because the data has to be decompressed and compressed on the fly, which uses CPU time; it may improve performance because it can cause data to be stored more compactly, causing fewer disk sectors to be read or written to fulfill a given query. Compression is most likely to produce gains in an application that is heavily I/Obound.
Partitioning: sometimes, partitioning a table based on an appropriate key can decrease lock contention and therefore improve performance. For example, suppose you have a table that contains a date column. If you write new dates and delete old dates, partitioning by date may allow the inserts and deletes to happen on different partitions of the table (which translates to separate data structures internally), which means those two operations may be able to proceed in parallel without lock contention. Note that partitioning can also worsen query performance, if a query that previously was able to search an index efficiently now has to search each partition of an index.
I must have posted this comment in the wrong place.
 eugene.yarovoi December 29, 2012
Yes, this approach has a recurrence formula of T(n) = 3 T(n/2) + C where C is a constant. The complexity will therefore be O(n^log_2(3)), which is about O(n^1.58). This is better than the naive solution, and worse than the O(n) answer.
 eugene.yarovoi February 12, 2013
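As a footnote to the comment above on T(n) = 3 T(n/2) + C: one way to see the n^(log_2 3) bound is to count the recursion tree's leaves, since for n = 2^k there are 3^k = n^(log_2 3) of them (a sketch counting calls rather than real work; names are mine):

```java
// Counts the leaves of the recursion tree for T(n) = 3 T(n/2) + C.
// Each level triples the number of subproblems while halving n, so for
// n = 2^k there are 3^k = n^(log_2 3) ~ n^1.58 leaves.
public class RecurrenceCount {
    public static long leaves(long n) {
        if (n <= 1) return 1;
        return 3 * leaves(n / 2);
    }
}
```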