Google Interview Question
Software Engineer / Developer
Country: United States
Interview Type: In-Person
I was also thinking of using the sum of squares.
A plain sum has a high probability of colliding for different sets of the same length; the sum of squares collides far less often ;)
But there is still a chance that two different sets give the same result. So we can do a trick here (a cross check)..
NB: Since it's said that the complexity should be less than log(N), I think sorting should not be used, as sorting each line adds the typical log-factor cost.
We can create a data structure for it:

// Represents each line
class FileLine {
    // supposed to be unique for a set of the same length
    long code;            // = sum of squares of all numbers
    Set<Integer> numbers; // the set of numbers
    int length;           // = numbers.size()
}

// As the system reads each line of the file, this set gets populated with new
// entries; hashCode is based on the assigned code (sum of squares) and length.
Set<FileLine> soFarReadLines = ...;
1. Read the file line by line and populate the soFarReadLines set.
2. When the next line is read, calculate its code (sum of squares) and look it up in the set. If a match is found (same code and same length), we need to cross-check whether it is really the same set by comparing elements one by one. If no match is found, the line is unique. If a match is found but the sets differ, the line is also unique. This is the trick that handles the case where the same code comes from a different set of the same length.
3. If the line is a duplicate, there is no need to add it to soFarReadLines.
I think this is a better approach, since it avoids sorting each line.
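The cross-check approach above can be sketched in Python; the function name and structure are my own, not from the post. The (sum of squares, length) pair serves as a cheap hash key, and on a hit the actual sets are compared element by element to rule out collisions.

```python
# Sketch of the sum-of-squares idea (illustrative, not the poster's exact code).
def unique_lines_by_squares(lines):
    # (sum of squares, length) -> list of distinct sets sharing that code
    seen = {}
    out = []
    for line in lines:
        nums = frozenset(int(tok) for tok in line.split())
        code = (sum(n * n for n in nums), len(nums))
        bucket = seen.setdefault(code, [])
        if nums in bucket:       # cross check: same code AND same set -> duplicate
            continue
        bucket.append(nums)      # same code but a different set is still unique
        out.append(line)
    return out
```

For example, {1, 8} and {4, 7} both have sum of squares 65 and length 2, so they land in the same bucket, but the element-by-element cross check keeps them both.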
First, create a large table of prime numbers {P1, P2, P3, ... Pn} (where n is larger than the largest number you will encounter).
Second, when you encounter a line with the numbers {a b c d}, multiply together the primes Pa, Pb, Pc and Pd, and store the product in a hashtable.
When you encounter another line with the numbers {b d c a} and multiply the primes together, you will find that the entry already exists in the hashtable, in constant time.
@aka The point is just that a number formed by a multiplication of primes cannot be composed from any other set of primes (unique factorization). So if you multiply together a bunch of prime numbers, it does not matter in which order you multiplied them; you always get the same number. Since the factors are prime, you cannot get that number in any other way, through any other set of primes.
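A minimal sketch of the prime-signature idea, assuming the numbers on each line are non-negative and small enough to index into a precomputed prime table (the names and the table size are illustrative). Python integers are arbitrary precision, so the product cannot overflow.

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

# Assumption: every number on a line is a valid index into this table.
PRIMES = primes_up_to(10000)

def prime_signature(line):
    sig = 1
    for tok in line.split():
        sig *= PRIMES[int(tok)]  # unique factorization makes the key order-independent
    return sig
```

Because multiplication is commutative and factorization is unique, "3 2 6 1 4" and "1 2 3 4 6" produce the same signature, while any different multiset of numbers produces a different one.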
Maintain a hashmap, where the key is the sorted line.
e.g. if a line is "3 2 6 1 4", it will be treated as "1 2 3 4 6" in the hashtable.
Now, for each line:
1. sort it
2. lookup in hashmap
3. if it is not present, then enter it in the hashmap and output it
4. if it is present, then do nothing and continue to the next line.
Python Code:
Tried using the same logic with the sum, as well as storing every line as a sorted array (so it becomes an n x m array).
f is the path to the file that the code opens.
def printunique_sum(f):
    temp = set()  # sums seen so far (note: plain sums can collide)
    with open(f, 'r') as fp:
        for line in fp:
            x = sum(int(tok) for tok in line.split())
            if x not in temp:
                print(line, end='')
                temp.add(x)

def printunique_sort(f):
    temp = set()  # sorted tuples seen so far (tuples are hashable)
    with open(f, 'r') as fp:
        for line in fp:
            x = tuple(sorted(int(tok) for tok in line.split()))
            if x not in temp:
                print(line, end='')
                temp.add(x)
Two lines are equal if the sum and the product of their numbers both match.
Scan the file line by line; for every line compute and store the (sum, product) pair. If the pair has not been seen before, print the line; otherwise move on to the next line.
Isn't the complexity here O(n), n being the number of lines in the file? But the OP says the solution should be better than O(log(n)), although I doubt we can do better than O(log(n)).
Can you know the data beforehand, without even parsing it?
If not, then it is not possible to give a solution in o(log n)..
You should also compare the lengths of the lists.
In addition, this algorithm would not work if the elements can be 0 or negative. Example:
[1 0 1]
[2 0 0]
[-1 0 -1]
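The collision above can be checked directly; `sum_product_key` is just an illustrative helper, not from the thread. The first two lines share the pair (2, 0), so a (sum, product) key cannot tell them apart.

```python
from math import prod  # Python 3.8+

def sum_product_key(line):
    nums = [int(tok) for tok in line.split()]
    return (sum(nums), prod(nums))

# [1 0 1] and [2 0 0] are different multisets but yield the same key:
# any zero forces the product to 0, leaving only the sum to discriminate.
```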
I guess that will also work.
I am not able to find any example where it fails.
The only thing is that squaring and adding may be a more costly operation than taking the product.
What do you mean by constant time?
- gevorgk July 15, 2013
Just to read a file with N lines and K elements in each line, you have to do O(N*K) operations. Looks like you didn't understand the problem correctly, dude.
Big minus for this.