ftfish
BAN USERA lower bound of Omega(nlogn) can be proved by reducing the "element distinctness problem" to this one:
1. Build ar_low for ar and reversed(ar) so that we can calculate the number of elements smaller than or equal to ar[i] in the whole array for every i.
2. Compare the sum of all such numbers to binomial(n, 2)=n*(n-1)/2. If both are different, (that is, if the former is larger,) there must be equal elements in the array.
E.g.:
ar=[5,2,3]
So the number of elements in the whole array that are smaller than or equal to each element is [2,0,1], the sum is 3, which is equal to C(3,2)=3. So there's no equal element.
On the other hand, for ar=[1,5,1], the numbers are [1,2,1], summing to 4 > 3. So there is equal element.
1. sort the three arrays.
2. enumerate all the 6 ordering of a,b,c
WLOG assume that a<=b<=c.
note that |a-b|+|b-c|+|a-c| = (b-a) + (c-b) + (c-a) = 2(c-a), which has nothing to do with the value in the middle (but there must be one between a and c).
because the ordering is fixed, we know that a<=b<=c.
3. scan the first array, keeping 3 pointers pointing to current a,b and c where b is the minimum number in array 2 which is >=a and c is the minimum in array 3 which is >=b.
note that when pointer a moves to right, the pointers to b and c cannot move left. so every time we move pointer a, we: 1) move b to right if necessary; and then 2) move c to right if necessary. more than one steps may be taken, but the total steps taken can't exceed n (the length of the arrays).
the overall algorithm takes O(n) time and O(1) extra space.
It's not clear whether the matrix is entirely sorted (i.e. numbers in every row are larger than the ones in the former) or just every row itself is sorted.
For the former O(log(m*n)) = O(logm+logn) is obviously sufficient.
For the latter O(MlgN) is the most obvious result.
@Frodo Baggins
It's just a basic math formula. But of course, if even you don't understand the idea clearly, it may be confusing.
Your second part is not perfectly true. I assume you know that the Miller–Rabin primality test, which is commonly used, is a probabilistic algorithm. There're counterexamples, but you shouldn't be able to find them directly/easily. And with careful choice of parameters, the probability that your computer hardware is done is larger than the probability that the algorithm gives wrong answer.
But OK, if you prefer suffix tree/array which generally take hundreds lines of code, it's your choice and google may help.
@first Anonymous
Your algorithm is essentially the same as MaYanK's which is also wrong. You two haven't take k into consideration.
Your last sentence "This can easily be customized for any k." is not true.
@second Anonymous
how would you do "calculate their sum for each path."? If you just trivially try all paths, then it's an exponent-time algorithm. If you use DP without considering k AT EACH STAGE, then it's just the same as the first Anonymous which is wrong.
It's good that you two think of graphs. But your graph is essentially the same as DP which uses the topological ordering of the numbers.
the lengths of the suffixes are 1,2,...,n, so the sum is 1+2+..+n=(n+1)*n/2. Then your "average" = O(n).
Using any nlogn-comparision sorting algorithm yields an O(nlogn * n) time complexity, because each comparision (of two suffixes) takes O(n) time.
Note the commented part of my code is just a binary search + trivial bruteforce, which is easy to code.
For the ones who really want to learn, and the ones who want to see the code but expect that I can't write it.
btw, there seems to be something wrong with the commenting system of this site. And my indentations are easily messed up even with the { { { brackets.
//============================================================================
// Name : LongestRepeatedSubstring.cpp
// Author : ftfish
// Version : 1.0
// Description :
// finds the longest substring of a given string which appears
// more than once.
//============================================================================
#include <cstdio>
#include <ctime>
#include <cstring>
#include <cstdlib>
const int maxl = 100000, maxc = 2;
const int maxh = 200000, basen = 3, base[basen] = { 113, 149, 173 };
struct node {
unsigned key[basen];
node *next;
};
void gen_rand_str(char *s, int len) {
srand(time(0));
for (int i = 0; i < len; ++i)
s[i] = 'a' + rand() % maxc;
s[len] = 0;
}
void insert(unsigned v[], node *h[], node *p) {
memcpy(p->key, v, sizeof(p->key));
p->next = h[v[0] % maxh];
h[v[0] % maxh] = p;
}
bool find(unsigned v[], node *h[]) {
for (node *p = h[v[0] % maxh]; p; p = p->next)
if (memcmp(p->key, v, sizeof(p->key)) == 0)
return 1;
return 0;
}
bool valid(char *s, int n, int len, int &pos) {
static node *h[maxh], pool[maxl];
memset(h, 0, sizeof(h));
int pp = 0;
unsigned v[basen], pow[basen];
for (int j = 0; j < basen; ++j)
pow[j] = 1, v[j] = 0;
for (int i = 0; i < len; ++i)
for (int j = 0; j < basen; ++j)
v[j] = v[j] * base[j] + s[i], pow[j] *= base[j];
insert(v, h, pool + (pp++));
for (int i = len; i < n; ++i) {
for (int j = 0; j < basen; ++j)
v[j] = v[j] * base[j] - pow[j] * s[i - len] + s[i];
if (find(v, h)) {
pos = i - len + 1;
return 1;
}
insert(v, h, pool + (pp++));
}
return 0;
}
//
//bool valid2(char *s, int n, int len, int &pos) {
// for (int i = 0; i < n - len; ++i)
// for (int j = i + 1; j < n - len; ++j)
// if (memcmp(s + i, s + j, len) == 0) {
// pos = i;
// return 1;
// }
// return 0;
//}
int main() {
int n = maxl;
static char s[maxl + 1];
gen_rand_str(s, n);
// puts(s);
int start_t = clock();
int lr = 1, rr = n - 1, mid, tpos, anslen = 0, anspos;
while (lr <= rr) {
mid = (lr + rr) >> 1;
if (valid(s, n, mid, tpos))
anslen = mid, anspos = tpos, lr = mid + 1;
else
rr = mid - 1;
}
int end_t = clock();
if (!anslen)
puts("No repeated substring !");
else {
printf("maxlen = %d\n", anslen);
for (int i = 0; i < anslen; ++i)
putchar(s[anspos + i]);
putchar('\n');
}
printf("Time used: %.2lf\n", double(end_t - start_t) / CLOCKS_PER_SEC);
//
// lr = 1, rr = n - 1, anslen = 0;
// while (lr <= rr) {
// mid = (lr + rr) >> 1;
// if (valid2(s, n, mid, tpos))
// anslen = mid, anspos = tpos, lr = mid + 1;
// else
// rr = mid - 1;
// }
// printf("maxlen by bruteforce: %d\n", anslen);
return 0;
}
@Anonymous
I can't exactly understand your first part. But all we have to do is to keep a hash table in which all the hash values of all the substrings of length k are stored and clear that hash table before every phase.
You don't have to store the substrings themselves because it's too costly. Just use multiple "bases" in polynomial hash so that we're confident enough to assert two substrings are the same if they give the same hash value for all the bases. generally 2 or 3 bases are enough.
I still can't get what you exactly want to say.
If you just think I can't implement my algorithm efficiently, that's alright, I won't waste my time trying to prove you're wrong. But I assure you that I can implement most (if not all) of the algorithms I gave on this site efficient enough. But there's always someone better than I, it's true.
If you just want to see the code, just express that clearly so that some one may get you and may have time pasting it here.
If you have a better algorithm, or a normal algorithm with EXTREMELY fast implementation, which you consider as "ultimate", I'm glad to see your post and thanks in advance.
BTW, replying as Anonymous makes me (and many others I think) confused.
We don't know exactly what the length is, but we do know it is between 0 and len(s).
The key insights which allow us to do binary search are:
1. if there's a substring with length k which appears more than once, then there's also a substring with length k-1 which also appears more than once;
2. if there's NO substring with length k which appears more than once, then there's also NO substring with length k+1 which also appears more than once;
for a given length of answer k, just insert all the hash values of all the substrings with length k into a hash table(totally O(len(s)) using polynomial hash) and check if there're two substrings with the same hash value.
so the time complexity is O(len*log(len))
This is the easiest way I know to solve this problem.
Binary search for the length of the answer, then check validity with polynomial hash (so that hash[s[i..j]] can be easily calculated from hash[s[i-1..j-1]], which is also called Robin Karp hash).
O(n*logn)
Better (but more complicated) solutions exist. Try googling suffix tree/array. O(n) can be achieved.
Another way to see this is to reduce the sorting problem to this one.
- ftfish October 14, 2013Note that the number of elements <= ar[i] can be used as the "rank" of ar[i]. We can therefore sort the whole array according to the rank in O(n) time with counting sort, since ranks are integers in range [0..n-1].