sujoyg
BAN USERBrilliant.
Here is the explanation:
num(n) returns the "index" of n in the series. For example, 1 => 1, 2 => 2, 3 => 3, 4 => 4, 5 => 5, 6 => 6, 8 => 7, 9 => 8, 10 => 9, 12 => 10 and so on.
Now, to find the "nth" number in the series, we find the smallest r whose index is >= to n. The number we want is clearly going to be <= r. It is also going to be >= n, since the nth number in the series is always >= n.
Once r is found, the algorithm performs a binary search of all numbers between n and r to find the number whose index is equal to n.
We assume that the each log file corresponds to a specific date and log files arrive in increasing order of dates. For example, log file for 16th April 2016 follows the one for 15th April 2016. In each log file, there are zero or more entries for a product.
A. In order to keep track of the entries seen for the last 30 days for each product, one of a min-heap, a BST, or a doubly-linked list ordered by date should be maintained for each product. Each node corresponds to a specific date and number of entries for the product seen on that date. If it is a BST, each node should also store the total number of entries in the subtree rooted at that node.
B. There should be a cache, that maintains the total count of all entries for each product in the 30 day period. This will not be necessary if a BST was used in A, since the root of the BST will store the total count anyway.
C. When a new log file comes in, a node with the date and count of entries should be added to data structure in A for each product that appears in the log file. The data structure in B should also be updated by adding this count to the entry for each product.
At the same time, for each product, nodes in A corresponding to dates are earlier than 30 days, should be removed. B should also be updated by subtracting the total count of removed nodes for each product. In the case of BST, the entire subtree rooted at the node for the earlier date can be removed and the total count at the root be updated by subtracting the value in that subtree.
Since the window is 30 days in length, all these operations are constant time O(n=30) or O(log(n=30)), irrespective of the data structure used.
D. To keep track of the top 10 products in the window, a min-heap of size 10 should be maintained, where each node corresponds to a unique product and whose value is the total number of entries for that product in the 30 day window. The heap is organized by this value. This heap should be updated every time the data structure in A (and optionally B) is updated.
Also, for each node that appears in D, there should be a reference to it from the corresponding entry in B (or null, if a product in B is not a top-10 product and hence does not appear in D), so that the node can in D can be efficiently located, removed and reinserted with updated value if necessary, for each product whose count is updated.
E. Also, a product may not appear in some log files within a 30 day period at all. If there are a large number of products, it will be inefficient to scan the data structure in A for each product to determine whether nodes exist corresponding to earlier dates.
This can be mitigated by using a cache where the key is the date and the entry is the list of data structures in B for products that appear in logs files for that date. This way, for example, when the log file for 1st May comes along, the data structures B for products that appeared on 1st April can be efficiently retrieved and updated.
This way, the min-heap in D will always contain the top 10 products that appeared in log files within the last 30 days.
Use a bit trie.
- sujoyg June 07, 2016