Yahoo Interview Question
The simplest way is to use a linked list for the million integers and create another linked list for the 500 integers that we want to store in the middle. It is then easy to splice the newly created list into the middle of the original linked list.
This is the correct approach. You want to maintain a list of offsets; see "Python Solution with Paged Lists" below for a full implementation of this concept. @Geet, if you're doing a large number of inserts, then you might not want a linked list to store the offsets. Better to have an array that allows you to do a binary search to find the pages.
If you're doing this in C, you need to be careful about memory management. You'll want your list of pages to also indicate whether pages were created via malloc() or carved out of a previously malloc'ed page. If you throw deletes into the equation, then you might want to consider a data structure that uses a fairly small page size. Having more pages will slow down random access, but that can be mitigated with a binary search for offsets, and your insertions/deletions will be faster.
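In Python, the binary search over page offsets is a one-liner with the standard bisect module. A minimal sketch (the offsets list here is illustrative):

```python
import bisect

def find_page(page_offsets, i):
    """Return the index of the page holding logical index i.
    page_offsets[k] is the logical index of the first element
    of page k; offsets are sorted ascending and start at 0."""
    # bisect_right finds the first offset strictly greater than i,
    # so the containing page starts one slot earlier.
    return bisect.bisect_right(page_offsets, i) - 1

offsets = [0, 1000, 2500]  # three pages of varying size
```

Insertions and deletions only shift the offsets to the right of the change, so the same lookup keeps working as pages come and go.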
We create a skip list: the data is stored page-wise in a linked list, and above it we build a tree-like index (search for "skip list" for more info). Insertion time is O(log(#pages)) per page.
Insertion procedure:
Each page is kept at least half full.
If the target page has room, we simply insert into it.
If a page ends up less than half full after an insertion, we merge it with a neighboring page.
If neither neighboring page has room to absorb it, we move items over from one of the neighbors so that both pages end up more than half full.
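The split half of that invariant can be sketched in Python (PAGE_SIZE is tiny here for illustration, and merging on deletion is omitted):

```python
PAGE_SIZE = 4  # tiny page size for illustration

def insert_into_pages(pages, page_idx, pos, value):
    """Insert value at position pos within pages[page_idx];
    if the page overflows, split it into two pages that are
    each at least half full, preserving the invariant."""
    page = pages[page_idx]
    page.insert(pos, value)
    if len(page) > PAGE_SIZE:
        mid = len(page) // 2
        pages[page_idx:page_idx + 1] = [page[:mid], page[mid:]]
    return pages
```

Because a split always produces two pages of roughly equal size, every page except possibly the last stays at least half full.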
I'm assuming the requirements break down like this:
1) We're dealing with big data, not huge data, so millions of values, not billions.
2) We obviously need bulk inserts to be fast.
3) I'm gonna assume a read-mostly context, so we want random access to be at least log(N) fast, so a simple linked list won't cut it.
4) I'm gonna assume inserts are always bulk, so we can amortize the cost or live with the slowness for small inserts.
Given these constraints, I'd create a 2-level tree of integers, with roughly 1000 pages of 1000 integers each. The outer list (i.e. the top of the tree) holds the offsets of the inner lists. For random access, you binary search the outer list to find your page and then do an O(1) lookup within the page to get your value. This is essentially how BigTable works, except BigTable has a few more levels.
To do a bulk insert, you find the page where the insertion point belongs, split that page, create a new page (or pages) for your new data, and then update the outer table to accommodate the new pages and recalculate offsets, which has a worst case of about 2000 operations.
If you're doing lots of bulk inserts over time, then you'll eventually want some compaction algorithm to consolidate pages. Also, after you split a page to make room for your insertion, if the first part of the split page is pretty small, consider simply appending your values to it rather than creating a new page. Data structures that use a paging scheme usually have some flexibility on page size: you aim for a certain size, but you can go up to double that size to be lazy about reorganizing the data structure.
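One such compaction pass is a greedy merge of adjacent pages whose combined contents fit in a single page. A sketch (offsets would be recomputed from the result afterwards):

```python
def compact(pages, page_size):
    """Merge adjacent pages whenever their combined length fits
    in one page, shrinking the page count after many inserts."""
    if not pages:
        return []
    out = [list(pages[0])]
    for page in pages[1:]:
        if len(out[-1]) + len(page) <= page_size:
            out[-1].extend(page)  # absorb into previous page
        else:
            out.append(list(page))  # start a fresh page
    return out
```

Running this occasionally keeps the outer table small, which keeps the binary search for random access fast.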
Python Solution with Paged Lists:
def test():
    arr = BigArray()
    chunk_size = 10000
    for i in range(100):
        new_elems = range(i * chunk_size, (i + 1) * chunk_size)
        arr.insert_elems(i * chunk_size, new_elems)
    assert 43999 == arr[43999]
    assert 500000 == arr[500000]
    arr.insert_elems(900000, range(3000))
    assert 0 == arr[900000]
    assert 1001 == arr[901001]
    assert 900001 == arr[903001]
    arr.insert_elems(903001, [1, 2, 3])
    assert 1 == arr[903001]
    assert 2 == arr[903002]
    assert 3 == arr[903003]
    assert 900001 == arr[903004]

class BigArray:
    def __init__(self, page_size=1000):
        # Keep a list of pages, with an empty page
        # at the end for convenience.
        self.page_offsets = [0]
        self.pages = [[]]
        self.page_size = page_size

    def debug(self):
        for i in range(len(self.pages)):
            print(i, self.page_offsets[i], len(self.pages[i]))

    def __getitem__(self, i):
        # page_offsets[-1] is the total element count, so it is
        # the first out-of-range index.
        if i >= self.page_offsets[-1]:
            raise IndexError("index too large")
        page = self._get_page(i)
        page_i = i - self.page_offsets[page]
        return self.pages[page][page_i]

    def insert_elems(self, n, lst):
        # Insert elements from lst into the BigArray,
        # starting at index n.
        if n > self.page_offsets[-1]:
            raise IndexError("n too large")
        if n == self.page_offsets[-1]:
            return self.append_elem(lst)
        # Find the page where insertion starts.
        page = self._get_page(n)
        cnt = len(lst)
        # Split the page at the insertion point. This
        # might create a small page, but we don't worry
        # about it for simplicity's sake. We can refine
        # this later by balancing our new elements better.
        self._split_page(page, n)
        self._update_offsets_for_insertion(page, cnt)
        low = 0
        # See how much room we have in the current page
        # to avoid needlessly creating new pages. We
        # may luck out with a partial page from a previous
        # insert.
        room = self.page_size - len(self.pages[page])
        if room > 0:
            if room > cnt:
                room = cnt
            # Extend our current page.
            self.pages[page].extend(lst[0:room])
            low = room
        # Now create new pages as needed.
        if low >= cnt:
            return
        new_pages = []
        new_offsets = []
        while low < cnt:
            page_size = min(cnt - low, self.page_size)
            new_pages.append(list(lst[low:low + page_size]))
            new_offsets.append(n + low)
            low += page_size
        self.pages[page + 1:page + 1] = new_pages
        self.page_offsets[page + 1:page + 1] = new_offsets

    def append_elem(self, lst):
        # TODO: For now, we always create new pages, but
        # we could try to look at the last nonempty
        # page and see if there's room.
        cnt = len(lst)
        n = self.page_offsets[-1]
        # Remove the empty sentinel page temporarily.
        self.page_offsets.pop()
        self.pages.pop()
        low = 0
        while low < cnt:
            page_size = min(self.page_size, cnt - low)
            self.page_offsets.append(n + low)
            self.pages.append(list(lst[low:low + page_size]))
            low += page_size
        # Put the empty sentinel page back.
        self.page_offsets.append(n + cnt)
        self.pages.append([])

    def _update_offsets_for_insertion(self, page, cnt):
        # Update the offsets of all the pages to the
        # right of the insertion point (if any).
        for i in range(page + 1, len(self.pages)):
            self.page_offsets[i] += cnt

    def _get_page(self, n):
        # This should be a binary search, but we do a
        # linear search for clarity.
        for page in range(len(self.page_offsets) - 1):
            if n < self.page_offsets[page + 1]:
                return page
        # The last page is unbounded, and it's usually
        # an empty page.
        return len(self.page_offsets) - 1

    def _split_page(self, page, n):
        if n >= self.page_offsets[page + 1]:
            return
        self.page_offsets.insert(page + 1, n)
        offset = n - self.page_offsets[page]
        self.pages.insert(page + 1, self.pages[page][offset:])
        self.pages[page] = self.pages[page][:offset]

test()
You may use linked lists.
For example, you have two linked lists, A and B.
Split the data into the lists: read the first 10000 elements into A and insert the remaining numbers into B.
Then you can easily insert the new numbers after the 10000th element in constant time each.
After all the inserts you can merge A and B.
Complexity: O(N) overall (every insertion costs O(1)).
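With Python lists standing in for the two linked lists, the whole scheme is a short sketch:

```python
def insert_in_middle(data, pos, new_elems):
    """Split data at pos into A and B, append the new elements
    to A, then merge A and B back together: O(N) overall."""
    a, b = data[:pos], data[pos:]
    a.extend(new_elems)  # each insertion is O(1) amortized
    a.extend(b)          # merge the tail back on
    return a
```

With real linked lists the split and merge are pointer updates, so no element copying is needed at all.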
@algos, This is a good start, but if you're not told the insertion point in advance, then you'll need to do the bookkeeping on where the insertion happened. Also, while not explicitly stated in the problem, you can probably expect multiple inserts, so you're gonna want a data structure that scales beyond one insert.
We can use arrays for storage plus some data structure to manage the storage. We could load the first million values into an array, int A[1000000]. Since we need to insert some data in the middle, we wouldn't want to resize this big array and shift the elements. We can just create a new array B[500] to store the 500 new members that should be inserted after A[9999]. Then a managing data structure records that the logical sequence reads A[0-9999], then B[0-499], and then continues with A[10000-999999].
Below is the data structure
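One possible managing structure, sketched in Python with lists standing in for the C arrays and small sizes for illustration (the names are hypothetical):

```python
class ManagedStorage:
    """Presents several backing arrays as one logical sequence.
    Each segment is (array, start, length), read in order."""
    def __init__(self):
        self.segments = []

    def add_segment(self, arr, start, length):
        self.segments.append((arr, start, length))

    def __getitem__(self, i):
        # Walk the segments, subtracting each length until the
        # logical index falls inside one of them.
        for arr, start, length in self.segments:
            if i < length:
                return arr[start + i]
            i -= length
        raise IndexError("index out of range")

# A stands in for the big array, B for the newly inserted values.
# Logical order: A[0..9], then all of B, then A[10..19].
A = list(range(20))
B = [100, 101, 102]
storage = ManagedStorage()
storage.add_segment(A, 0, 10)
storage.add_segment(B, 0, 3)
storage.add_segment(A, 10, 10)
```

The linear scan over segments is fine while there are only a few of them; with many inserts you'd keep cumulative offsets and binary search, as discussed above.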
- jtr.hkcr March 15, 2013