JeffD
C++, Linux guy
Given two equal-length arrays of integers, find the pairs, one from each array, that sum to 0.
- JeffD
-- note that one wrinkle of this problem over the more usual form, which is done in a single array, is that you can't use the indexes / iterators crossing each other to know when to stop; rather, their /values/ have to cross (if you're doing it right, at or near 0).
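A minimal sketch of the two-array, two-pointer idea, assuming both arrays may be sorted first; `zeroSumPairs` is a hypothetical name. One pointer climbs the first array while the other descends the second, and the loop stops when the values cross:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: all (a, b) pairs with a from A, b from B, a + b == 0.
// Sorts copies of both arrays, then walks A from the low end and B from the
// high end; the scan ends when the *values* cross, as noted above.
std::vector<std::pair<int,int>> zeroSumPairs(std::vector<int> a, std::vector<int> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::vector<std::pair<int,int>> out;
    std::size_t i = 0;
    std::size_t j = b.size();
    while (i < a.size() && j > 0) {
        long sum = (long)a[i] + b[j - 1];
        if (sum < 0)      ++i;        // a[i] too small, move up
        else if (sum > 0) --j;        // b[j-1] too large, move down
        else {                        // found a zero-sum pair
            out.push_back({a[i], b[j - 1]});
            ++i; --j;                 // step past it (duplicates yield extra pairs)
        }
    }
    return out;
}
```

This is O(n log n) for the sorts plus O(n) for the walk.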
PayPal Software Engineer / Developer Algorithm
What's an efficient way to process a large file with lines of varying length?
- JeffD
-- I said: break it up and process the pieces in parallel, using fseek to divide up the file, then scan backward and forward for the line terminators to decide which chunk a line belongs to. I think the answer he wanted, though, was to memory-map it.
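The fseek-and-scan idea can be sketched like this (using C++ streams rather than raw fseek; `chunkBoundaries` is a hypothetical name). Each nominal boundary at fileSize * k / nChunks is pushed forward to the next newline, so every line lands in exactly one chunk:

```cpp
#include <cstdio>
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch of the chunking idea above: seek to each nominal
// boundary, then scan forward past the partial line so that chunk
// boundaries always fall at line starts.
std::vector<std::streamoff> chunkBoundaries(const std::string& path, int nChunks) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);
    std::streamoff size = in.tellg();
    std::vector<std::streamoff> bounds{0};
    for (int k = 1; k < nChunks; ++k) {
        in.seekg(size * k / nChunks);          // nominal boundary
        std::string skip;
        std::getline(in, skip);                // scan forward to the next '\n'
        std::streamoff pos = in.tellg();
        if (pos < 0 || pos >= size) pos = size;
        if (pos > bounds.back()) bounds.push_back(pos);
    }
    bounds.push_back(size);
    return bounds;
}
```

Each worker would then process the half-open byte range [bounds[i], bounds[i+1]) independently.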
PayPal Software Engineer / Developer Operating System
Hmm; if there are any duplicates, you can send one of each pair to each of your two halves without thinking any further about them, and solve the reduced problem; the remaining numbers are all unique. Not sure how to use that, though. (No, this is junk; e.g. 3, 1, 5, 3 would split as 3, 3 and 1, 5.)
It's like the 0/1 knapsack problem: you have to fill your knapsack with weights (= the values) summing to sum / 2, except that the number of items you can take is constrained too; hmm...
In the unlikely event that the two halves have to be contiguous, you could double the array length, taking partial sums from the beginning, wrapping around the original array while continuing straight ahead in the doubled, partial-sum one. Then walk the doubled array, taking the difference between elements N = length of the original array apart. When you hit a difference of sum / 2, translate back to indexes of the original array. The doubled array is just to avoid fiddling with % N. And yeah, this interpretation is too easy.
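The knapsack view can be sketched as a small DP, under my assumptions that the inputs are non-negative and that we want some n/2 of the numbers to sum to total/2; `splitsInHalf` is a hypothetical name. dp[k][s] records whether some k items sum to s:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical DP for the size-constrained knapsack above: can some n/2 of
// the numbers sum to total/2?  Assumes non-negative integers.
bool splitsInHalf(const std::vector<int>& v) {
    int total = std::accumulate(v.begin(), v.end(), 0);
    int n = (int)v.size();
    if (n % 2 || total % 2) return false;
    int half = n / 2, target = total / 2;
    // dp[k][s] = true if some k items sum to exactly s
    std::vector<std::vector<char>> dp(half + 1, std::vector<char>(target + 1, 0));
    dp[0][0] = 1;
    for (int x : v) {
        if (x < 0 || x > target) continue;        // can't use this item
        for (int k = half; k >= 1; --k)           // descend: each item used at most once
            for (int s = target; s >= x; --s)
                if (dp[k - 1][s - x]) dp[k][s] = 1;
    }
    return dp[half][target] != 0;
}
```

This is O(n^2 * sum) time, standard for the item-count-constrained variant.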
- JeffD March 29, 2014

In regular Dijkstra you choose the next edge as the one nearest the source and not already part of the tree. Well, instead of keeping track of used nodes across the whole tree, keep them per individual path (when you branch in a second direction from a node, you'll have to make a new path). When you check where to extend your edge to, check only that it doesn't touch your current path (grrr - there may be more than one way to get to that node, thus more than one path. No, we got here the shortest way possible, no?). Anyway, first remove the source from all paths, so that it's OK to add it back. When you add a node, see if it's the source, and if so, see whether your path includes all nodes. If so, this should be a shortest path; if not, abandon this path. Not 100% sure this works, though. And it's not really the Dijkstra algorithm anymore, just similar - is this really the question?
- JeffD March 26, 2014

To cope with the 'billions' part you could make a Levenshtein automaton of the pattern string, with as wide an edit distance as you're willing to allow, convert that to an NFA, run the possible matches in parallel (not OS-parallel), and add logic (somehow) to keep count of the longest. It's possible that the branches in the NFA would make it exponential, but I'm sure that with good coding you could make it max out at something less (there should be only so many states). On its best day, this would be O(m = length of input string). Given that we're starting from O(n x m), we've got room to play with. The NFA 'threads' would keep track of their history - the longest you'd replay, with the right characters, '-' and so on.
- JeffD March 03, 2014

Well, but isn't that still O(N^2)? It may be (N/2)^2 if you divide into grids of 2x2, but that's still O(N^2).
I'm thinking about running sums (here, products), somehow.
The query rectangle has to completely enclose a pre-computed GCD square in order to use it. All I can think of is to make some constant number of levels, top-down rather than bottom-up (bottom-up makes it dependent on N) - just pick a number - and make a quadtree of pre-computed GCD squares down to that level. Not the answer, I'm sure.
Nah, top-down doesn't make any difference, any regular grid of cells will be N^2 one way or another.
It's a known-short, i.e. small, range, right? I'd keep the /count of/ each integer in an array (or C++ vector, deque, whatever you want), along with a global min and max of the numbers seen, and a total count (not sum) of the numbers seen. 'Min' acts as an offset telling you what to add to an incoming number to get the index where its count is stored (e.g., if the first number is 5 and you store it at index 0, the offset is -5, i.e. -min). If the number is already present, bump its count (of, e.g., the 5s you've seen). If the number is higher than your max seen, set the new max, zero the array elements between the old max and the new one, and bump the count at the new max. If you have a new min, well, painfully shift the array elements to accommodate it (maybe with some buffer space, eh?), and set min and max accordingly. Now, this 'painful shift' is O(range), and you'll do it, what, 'size-of-known-short-range' / 2 times on average (right?), so it's effectively constant. Everything else in maintaining the array is O(1). To sample the median, walk from min to max, summing the counts as you go, until you hit the bin containing total count / 2. The number of that bin (taking the min offset into account) is the median.
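A minimal sketch of that count-array median, assuming the value range really is small; `RangeMedian` is a hypothetical name, and the 'painful shift' for a new min is the vector insert at the front:

```cpp
#include <cassert>
#include <vector>

// Count-array median for a known-small range, per the comment above.
// counts[v - minSeen] holds how many times v has been seen; a median query
// walks the bins until it passes half the total count.
class RangeMedian {
    std::vector<long> counts;
    int minSeen = 0;
    long total = 0;
public:
    void add(int v) {
        if (total == 0) { minSeen = v; counts.assign(1, 0); }
        if (v < minSeen) {                        // 'painful shift' for a new min
            counts.insert(counts.begin(), minSeen - v, 0);
            minSeen = v;
        } else if (v - minSeen >= (int)counts.size()) {
            counts.resize(v - minSeen + 1, 0);    // grow (zeroed) for a new max
        }
        ++counts[v - minSeen];
        ++total;
    }
    int median() const {                           // lower median for even totals
        long need = (total + 1) / 2, seen = 0;
        for (std::size_t i = 0; i < counts.size(); ++i) {
            seen += counts[i];
            if (seen >= need) return minSeen + (int)i;
        }
        return minSeen;                            // unreachable if total > 0
    }
};
```

Updates are O(1) except for the occasional shift, and the median walk is O(range).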
- JeffD February 18, 2014

You know, I wonder whether GZ_Penny's questions really came from Google. They're certainly not his first-hand experience - there look to be ~50 questions and he doesn't understand them himself - and they're easier than, or seem easier than, the other Google questions I see. He might have just found a bunch of questions and dumped them all at once under 'Google'.
- JeffD February 18, 2014

In English (for the lazy): take the pieces of the query that are before, between, and after the wildcards, i.e. just the pieces that are not the wildcard, and strstr them in order against the body string. If / when you find a match for a piece, continue strstr'ing the next piece from the end of that match. This is O(N).
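A sketch of that piece-by-piece matching in C++ (`string::find` standing in for strstr; `wildcardMatch` is a hypothetical name), under my assumptions that '*' is the only wildcard and that the pieces before the first '*' and after the last one must be anchored at the ends for the in-order search to be correct:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split the pattern on '*', then find each literal piece in order.
// Greedy leftmost matching of the middle pieces is sufficient because
// '*' can absorb any run of characters.
bool wildcardMatch(const std::string& text, const std::string& pattern) {
    std::vector<std::string> pieces;
    std::string cur;
    for (char c : pattern) {
        if (c == '*') { pieces.push_back(cur); cur.clear(); }
        else cur += c;
    }
    pieces.push_back(cur);
    if (pieces.size() == 1) return text == pieces[0];    // no wildcard at all
    const std::string& first = pieces.front();
    const std::string& last  = pieces.back();
    // first piece anchored at the start, last piece at the end
    if (text.compare(0, first.size(), first) != 0) return false;
    if (text.size() < first.size() + last.size()) return false;
    if (text.compare(text.size() - last.size(), last.size(), last) != 0) return false;
    std::size_t pos = first.size();                      // continue after each hit
    for (std::size_t i = 1; i + 1 < pieces.size(); ++i) {
        std::size_t hit = text.find(pieces[i], pos);
        if (hit == std::string::npos) return false;
        pos = hit + pieces[i].size();
    }
    return pos + last.size() <= text.size();             // middles fit before the tail
}
```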
- JeffD February 17, 2014

@robert - no, I'm assuming n = 4. 4 * (4+1) / 2 == 10; the sum of the array is 14, and 14 - 10 = 4, the extra number.
But that's too easy a question; maybe they meant e.g. 1, 2, 3, 3, 5 or 1, 2, 3, 5, 5, i.e. there are n numbers and one of them is a dup of another. I don't think sums help much with that form; I can't think of how to solve it off the top of my head.
Find what 1 .. n should sum to with the formula n * (n + 1) / 2 (see 'summation' on Wikipedia). Now sum the actual array. If I've read the problem right and there's a single extra number, the actual sum minus the theoretical sum gives you the extra number.
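The sum trick above, as a one-liner-ish sketch; `extraNumber` is a hypothetical name, and I assume the array holds 1..n plus one extra value, so n = size - 1:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// With 1..n plus one extra value x, the array sums to n*(n+1)/2 + x,
// so subtracting the closed form recovers x.
int extraNumber(const std::vector<int>& a) {
    long n = (long)a.size() - 1;                  // a holds 1..n plus one extra
    long expected = n * (n + 1) / 2;
    long actual = std::accumulate(a.begin(), a.end(), 0L);
    return (int)(actual - expected);
}
```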
- JeffD May 07, 2013

You have to do egrep <pattern> <filename> | sort | uniq
egrep sees only one line at a time; it cannot know whether a line is unique among the other lines. 'sort' puts duplicates next to each other (as a side effect of sorting), and 'uniq' expects that, efficiently outputting only one of each run of adjacent duplicates.
An idea: quantize / discretize the intervals as coarsely as practical, and keep an array of quanta covering the range of all the intervals. For each person's interval, fill in all the covered quanta; if a quantum is already filled, they've collided with someone. You have to do this twice, because otherwise the first person, e.g., won't know whether he's collided with anyone. It's all still O(n) - if you don't need infinite, split-second precision.
- JeffD January 31, 2012

I was thinking 'it's POD; nothing', but with prompting I guessed the 4: default ctor, copy ctor, operator=, and dtor.
One twist - later I didn't think it would even compile: after all, being a 'class', the default access is private, so the default ctor should be inaccessible to the public, right? Well, it turns out that the implicitly generated default ctor is an exception to that, even if there are data members in the class. The rule does come into effect if you write any ctors of your own (default or otherwise). One more piece of C++ trivia.
Your program may work (haven't tried it), but we're asked to draw a DFA, and you have separate sets of states for the As and the Bs. In a DFA, all information must be combined into a single state; i.e., you'd have to have nodes for oddA_oddB, oddA_evenB, evenA_oddB, and evenA_evenB (plus start, of course; evenA_oddB is an accepting state).
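The four combined parity states can be written as an explicit transition table; a minimal sketch assuming the alphabet is {'a','b'} and, per the comment above, evenA_oddB is the accepting state (`dfaAccepts` is a hypothetical name):

```cpp
#include <cassert>
#include <string>

// States: 0 = evenA_evenB (start), 1 = evenA_oddB (accepting),
//         2 = oddA_evenB, 3 = oddA_oddB.
// Bit 1 of the state is the A-parity, bit 0 the B-parity.
bool dfaAccepts(const std::string& s) {
    static const int onA[4] = {2, 3, 0, 1};   // 'a' flips the A-parity bit
    static const int onB[4] = {1, 0, 3, 2};   // 'b' flips the B-parity bit
    int state = 0;
    for (char c : s) state = (c == 'a') ? onA[state] : onB[state];
    return state == 1;                        // evenA_oddB
}
```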
- JeffD January 28, 2012

> as far as c Or c++. I don't think it really matters
No, C++ is faster: C's 'qsort' function takes void* ptrs to what it's sorting, and those are opaque to the compiler. C++'s std::sort, on the other hand, is templated on the element type, and exposing that type information (e.g. letting the comparison be inlined) lets the compiler speed the code up.
I can only think: if the numbers are of bounded range (with the limits known beforehand!), keep an array with one slot per number, holding a count of the times that number has been seen. Keep a variable 'median index', which is one of the indexes, plus 'how far up' it is within an imaginary pile of the value at that index. E.g. if the median is at 5, and 5 (or the number at 5, if our range is offset) has been seen 6 times, our 'how far up' might be 4 (we're trying to model it as though our input were spread out; '4' means the median is 4/6 of the way through the pile of sightings of '5'). When a number comes in, we increment or decrement 'how far up' by 0.5, and if we have to move off that number, either higher or lower, then we do; that's the new median. This should be O(1), only it requires storage for the full range of every /different/ number we expect to see. (I suppose it could be a sparse linked list, too; for a number not previously seen, you'd have to do a binary search for where to splice it in, which is O(lg number-of-different-numbers); at least it's independent of N itself.)
- JeffD January 23, 2012

Have an array of 128 (it's only ASCII) /structs/, where the struct contains: index of first sighting, and a count (or a tri-state of 'haven't seen', 'have seen once', 'have seen 2+ times').
Walk along the sentence, setting the array elt at the char's ASCII code (e.g. 'A' is 65). If we haven't seen the char before, record the index of first sighting and bump the count / change the state. If we have seen it before, just bump the count or change its state to '2+'. At the end, scan the whole array for the lowest 'first sighting' that's been seen only once. (It's O(n), with initialization + the final scan each costing ~128, i.e. O(1).)
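A minimal sketch of that struct array, with a count standing in for the tri-state; `firstUniqueChar` is a hypothetical name, and I return '\0' when no character is unique:

```cpp
#include <cassert>
#include <string>

// 128-entry array of {first-sighting index, count}, then a final scan for
// the earliest character seen exactly once.
char firstUniqueChar(const std::string& s) {
    struct Slot { int first = -1; int count = 0; } seen[128];
    for (int i = 0; i < (int)s.size(); ++i) {
        unsigned char c = s[i];
        if (seen[c].count == 0) seen[c].first = i;   // first sighting
        ++seen[c].count;
    }
    int best = -1;
    for (int c = 0; c < 128; ++c)
        if (seen[c].count == 1 && (best < 0 || seen[c].first < best))
            best = seen[c].first;
    return best < 0 ? '\0' : s[best];
}
```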
Yes; walk through the array once with a double loop, over row then col, always adding matrix[row][col] to row_sum[row] and to col_sum[col]. If row == col, add the matrix value to the UL -> LR diagonal sum; if row + col == n - 1, add the value to the LL -> UR diagonal sum. Then check whether all the sums are the same. (My 'n' is a dimension of the array, not the total element count.)
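That single double-loop looks like this as a sketch (`isMagicSquare` is a hypothetical name, assuming a square n x n input):

```cpp
#include <cassert>
#include <vector>

// One pass accumulates every row, column, and both diagonals;
// a final pass compares all the sums.
bool isMagicSquare(const std::vector<std::vector<int>>& m) {
    int n = (int)m.size();
    std::vector<long> rowSum(n, 0), colSum(n, 0);
    long diag = 0, anti = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            rowSum[r] += m[r][c];
            colSum[c] += m[r][c];
            if (r == c)         diag += m[r][c];  // UL -> LR diagonal
            if (r + c == n - 1) anti += m[r][c];  // LL -> UR diagonal
        }
    long want = rowSum[0];
    for (int i = 0; i < n; ++i)
        if (rowSum[i] != want || colSum[i] != want) return false;
    return diag == want && anti == want;
}
```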
- JeffD January 22, 2012

Having RAM swap to and from disk is still O(N). It'd be N / some constant related to memory size, times a big constant for swap time, but still O(N). I vote for some sort of hash table that can hold all the encountered ints; for each, keep a flag of whether it's been seen in A, B, or both, and swap to/from disk as necessary. Use a memory-mapped file for ease of use and speed.
Of course, the intended answer is probably something tricky with XOR or summing and subtraction :)
I must say this is an out-of-date question. Nowadays compilers and linkers will inline and not inline whenever they feel like it - even the same function! Herb Sutter's 'Exceptional C++ Style', page 190, has the question: "3) .. What kinds of functions are guaranteed never to be inlined?" And his answer is 'none'. And this book is copyright 2005. (Sutter is both head (I think) of MS's compiler project and of the C++ standards committee.) Now, as for whether the /compiler/ (and not the linker) will inline stuff, I can only think that the limit is when a function is 'extern' to a compilation unit. But this is uninteresting, since the linker can still do it. So the only answer worth knowing is 'no'. Unless you happen to know your compiler's limits and quirks, but then that's not a generic C++ question.
- JeffD January 13, 2012

As fentoyal says, put them into a hash table. Add to each bucket entry an iterator into a doubly-linked list of the items that have been seen exactly once. As you walk down the original list of urls: if a url isn't in the hash table, add it, and add its entry (hash table iterator, or raw url, either way) to the back of the linked list. If you do find it in the hash table, look at the iterator stored there. If it's null, you've already seen the url 2+ times; do nothing. If the iterator is non-null, use it to pull the item out of the linked list (O(1) time), and then set the iterator in the hash table to null.
When you've gone through the list of urls, pick the front-most url in the linked list (the first added that was not subsequently removed).
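A minimal sketch of that hash-table-plus-linked-list scheme, using `once.end()` as the 'null' iterator sentinel (`firstUniqueUrl` is a hypothetical name):

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

// The map stores, per url, an iterator into a list of once-seen urls
// (or end() once the url repeats); the list keeps insertion order, so its
// front is the first url that was added and never removed.
std::string firstUniqueUrl(const std::vector<std::string>& urls) {
    std::list<std::string> once;                           // urls seen exactly once
    std::unordered_map<std::string, std::list<std::string>::iterator> seen;
    for (const auto& u : urls) {
        auto it = seen.find(u);
        if (it == seen.end()) {
            once.push_back(u);                             // first sighting
            seen[u] = std::prev(once.end());
        } else if (it->second != once.end()) {
            once.erase(it->second);                        // second sighting: O(1) removal
            it->second = once.end();                       // 'null' sentinel
        }                                                  // 3rd+ sighting: nothing
    }
    return once.empty() ? std::string() : once.front();
}
```

The whole pass is O(n) expected, since each url does O(1) hash and list work.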
I think this fails on: 2 1 1 1 1 2.
- JeffD August 28, 2015

You want to group the ones with their neighbors, /as evenly as possible/: (2+1+1) * (1+1+2) = 16;
(2 + 4) * 2 = 12
2 1 1 1 1 1 1 1 1 2
4 * 2 * 2 * 4 = 64
8 * 8 = 64.
2 * 2 * 2 * 2 * 2 * 2 = 64. Hum; so far this seems to hold up.
2 1 1 1 2 = 12
3 4 = 12 -- k, really seems to hold up.
edit: also, '(' and ')' are free - they don't take up a space between numbers. They couldn't - what would eg 1(2) mean?