Ebay Interview Question
Software Engineer / Developer
Country: India
Interview Type: Phone Interview
Use a min-heap of size K. The time complexity is Θ(n*log(K)); the space complexity is O(K).
Right. A min-heap works for max/top-k, since each new element is compared against (and may push out) the current minimum.
Θ for runtime, but not space usage? Why?
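The min-heap approach described above can be sketched as follows (a minimal illustration; the function name `top_k` and the example input are my own, not from the thread):

```python
import heapq

def top_k(stream, k):
    """Keep the k largest items seen so far in a min-heap of size k.

    The heap root is the smallest of the current top k, so each new
    item is compared against (and may replace) that minimum.
    Time: O(n*log(k)); extra space: O(k).
    """
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # pop the current min, push x
    return sorted(heap, reverse=True)

# Example: top_k([5, 1, 9, 3, 7, 8, 2], 3) returns [9, 8, 7]
```

Because the heap never grows past k entries, this also works when the input is a stream read from a file too large for memory.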
When I read "big data file" I understand that it doesn't all fit into memory at once, so a heap over all n elements won't work.
I would use QuickSelect: as each element is compared with the pivot, it is written to one of two separate files (one for the left partition, one for the right). Eventually the files become small enough to fit into memory, but regardless of that, this finds the K-th largest element in expected linear time.
Then go through the initial file again, picking the elements larger than or equal to the K-th largest, which is linear as well.
So the time is expected O(n), and the space stays bounded because you can delete the partition files you no longer need.
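The two-pass idea above, shown in-memory for clarity (the helper names `kth_largest` and `top_k_via_select` are mine; the file-partitioning variant would write each side of the partition to disk instead of swapping in an array):

```python
import random

def kth_largest(nums, k):
    """Quickselect with a 3-way partition: expected O(n) time."""
    target = len(nums) - k          # index of the k-th largest in sorted order
    a = list(nums)
    lo, hi = 0, len(a) - 1
    while True:
        pivot = a[random.randint(lo, hi)]
        lt, gt, i = lo, hi, lo
        while i <= gt:              # partition a[lo..hi] into <, ==, > pivot
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[gt], a[i] = a[i], a[gt]
                gt -= 1
            else:
                i += 1
        if target < lt:
            hi = lt - 1             # answer lies in the left partition
        elif target > gt:
            lo = gt + 1             # answer lies in the right partition
        else:
            return pivot

def top_k_via_select(nums, k):
    """Second linear pass: collect everything above the k-th largest."""
    kth = kth_largest(nums, k)
    out = [x for x in nums if x > kth]
    out += [kth] * (k - len(out))   # pad with ties on the threshold value
    return out
```

Note the padding step: if the K-th largest value is duplicated, a naive "pick elements larger than it" pass can return fewer than K results.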
We can use the following logic:
1. Split the file into N smaller files and assign each one to a separate process - so N processes handle N smaller files
2. In each task, find the largest K values in its file (e.g., using a heap of size K) and write them to a separate output file. Keeping only the single largest value per file is not enough, since several of the overall top K may land in the same split
3. Finally, read the N output files (N*K candidates in total) and pick the largest K values from them
This is very easy to implement using the Hadoop MapReduce framework, because input splitting and assigning splits to tasks are done automatically by the framework.
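The steps above can be sketched without Hadoop as a plain map/reduce over in-memory chunks (the function name `top_k_chunked` is mine, for illustration only):

```python
import heapq

def top_k_chunked(chunks, k):
    """Map: each chunk contributes its own top k candidates.
    Reduce: merge the N*k candidates and take the overall top k.
    Each chunk must emit k values, not just its maximum, or
    answers that cluster in one chunk would be lost.
    """
    candidates = []
    for chunk in chunks:                      # "map" phase, one task per chunk
        candidates.extend(heapq.nlargest(k, chunk))
    return heapq.nlargest(k, candidates)      # "reduce" phase

# Example: top_k_chunked([[5, 1, 9], [3, 7], [8, 2]], 3) returns [9, 8, 7]
```

In an actual Hadoop job, each mapper would emit its local top K and a single reducer would perform the final merge.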
Use a min-heap of size k. O(n*log(k)) complexity
- Anonymous December 29, 2013