Yahoo Interview Question
The simplest way is to use a linked list for the million integers and create another linked list for the 500 integers that we want to store in the middle. It is then easy to splice the newly created list into the middle of the original linked list.
This is the correct approach. You want to maintain a list of offsets; see "Python Solution with Paged Lists" below for a full implementation of this concept. @Geet, if you're doing a large number of inserts, then you might not want a linked list to store the offsets. Better to have an array that allows you to do a binary search to find the pages.
If you're doing this in C, you need to be careful about memory management. You'll want your list of pages to also indicate whether pages were created via malloc() or carved out of a previously malloc'ed page. If you throw deletes into the equation, then you might want to consider a data structure that uses a fairly small page size. Having more pages will slow down random access, but that can be mitigated with a binary search for offsets, and your insertions/deletions will be faster.
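In Python, the binary search over page offsets is a one-liner with the standard bisect module. A minimal sketch (the offsets list here is illustrative):

```python
import bisect

def find_page(page_offsets, i):
    """Return the index of the page holding logical index i.
    page_offsets[k] is the logical index of the first element
    of page k; offsets are sorted ascending and start at 0."""
    # bisect_right finds the first offset strictly greater than i,
    # so the containing page starts one slot earlier.
    return bisect.bisect_right(page_offsets, i) - 1

offsets = [0, 1000, 2500]  # three pages of varying size
```

Insertions and deletions only shift the offsets to the right of the change, so the same lookup keeps working as pages come and go.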
We create a skip list: the data is stored page-wise in a linked list, and above it we build a tree-like index (search for "skip list" for more info). Insertion time is O(log(#pages)) per page.
Insertion procedure:
Each page is kept at least half full.
If the target page has room, we simply insert into it.
If a page ends up less than half full after an insertion, we merge it with a neighboring page.
If neither neighboring page has room to absorb it, we move items over from one of the neighbors so that both pages end up more than half full.
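The split half of that invariant can be sketched in Python (PAGE_SIZE is tiny here for illustration, and merging on deletion is omitted):

```python
PAGE_SIZE = 4  # tiny page size for illustration

def insert_into_pages(pages, page_idx, pos, value):
    """Insert value at position pos within pages[page_idx];
    if the page overflows, split it into two pages that are
    each at least half full, preserving the invariant."""
    page = pages[page_idx]
    page.insert(pos, value)
    if len(page) > PAGE_SIZE:
        mid = len(page) // 2
        pages[page_idx:page_idx + 1] = [page[:mid], page[mid:]]
    return pages
```

Because a split always produces two pages of roughly equal size, every page except possibly the last stays at least half full.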
I'm assuming the requirements break down like this:
1) We're dealing with big data, not huge data, so millions of values, not billions.
2) We obviously need bulk inserts to be fast.
3) I'm gonna assume a read-mostly context, so we want random access to be at least log(N) fast, so a simple linked list won't cut it.
4) I'm gonna assume inserts are always bulk, so we can amortize the cost or live with the slowness for small inserts.
Given these constraints, I'd create a 2-level tree of integers, with roughly 1000 pages of 1000 integers each. The outer list (i.e. the top of the tree) holds the offsets of the inner lists. For random access, you binary search the outer list to find your page and then do an O(1) lookup within the page to get your value. This is essentially how BigTable works, except BigTable has a few more levels.
To do a bulk insert, you find the page where the insertion point belongs, split that page, create a new page (or pages) for your new data, and then update the outer table to accommodate the new pages and recalculate offsets, which has a worst case of about 2000 operations.
If you're doing lots of bulk inserts over time, then you'll eventually want some compaction algorithm to consolidate pages. Also, after you split a page to make room for your insertion, if the first part of the split page is pretty small, consider simply appending your values to it rather than creating a new page. Data structures that use a paging scheme usually have some flexibility on page size: you aim for a certain size, but you can go up to double that size to be lazy about reorganizing the data structure.
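One such compaction pass is a greedy merge of adjacent pages whose combined contents fit in a single page. A sketch (offsets would be recomputed from the result afterwards):

```python
def compact(pages, page_size):
    """Merge adjacent pages whenever their combined length fits
    in one page, shrinking the page count after many inserts."""
    if not pages:
        return []
    out = [list(pages[0])]
    for page in pages[1:]:
        if len(out[-1]) + len(page) <= page_size:
            out[-1].extend(page)  # absorb into previous page
        else:
            out.append(list(page))  # start a fresh page
    return out
```

Running this occasionally keeps the outer table small, which keeps the binary search for random access fast.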
Python Solution with Paged Lists:
def test():
    arr = BigArray()
    chunk_size = 10000
    for i in range(100):
        new_elems = range(i * chunk_size, (i + 1) * chunk_size)
        arr.insert_elems(i * chunk_size, new_elems)
    assert 43999 == arr[43999]
    assert 500000 == arr[500000]
    arr.insert_elems(900000, range(3000))
    assert 0 == arr[900000]
    assert 1001 == arr[901001]
    assert 900001 == arr[903001]
    arr.insert_elems(903001, [1, 2, 3])
    assert 1 == arr[903001]
    assert 2 == arr[903002]
    assert 3 == arr[903003]
    assert 900001 == arr[903004]

class BigArray:
    def __init__(self, page_size=1000):
        # Keep a list of pages, with an empty page
        # at the end for convenience.
        self.page_offsets = [0]
        self.pages = [[]]
        self.page_size = page_size

    def debug(self):
        for i in range(len(self.pages)):
            print(i, self.page_offsets[i], len(self.pages[i]))

    def __getitem__(self, i):
        # page_offsets[-1] is the total element count, so it is
        # the first out-of-range index.
        if i >= self.page_offsets[-1]:
            raise IndexError("index too large")
        page = self._get_page(i)
        page_i = i - self.page_offsets[page]
        return self.pages[page][page_i]

    def insert_elems(self, n, lst):
        # Insert elements from lst into the BigArray,
        # starting at index n.
        if n > self.page_offsets[-1]:
            raise IndexError("n too large")
        if n == self.page_offsets[-1]:
            return self.append_elem(lst)
        # Find the page where insertion starts.
        page = self._get_page(n)
        cnt = len(lst)
        # Split the page at the insertion point. This
        # might create a small page, but we don't worry
        # about it for simplicity's sake. We can refine
        # this later by balancing our new elements better.
        self._split_page(page, n)
        self._update_offsets_for_insertion(page, cnt)
        low = 0
        # See how much room we have in the current page
        # to avoid needlessly creating new pages. We
        # may luck out with a partial page from a previous
        # insert.
        room = self.page_size - len(self.pages[page])
        if room > 0:
            if room > cnt:
                room = cnt
            # Extend our current page.
            self.pages[page].extend(lst[0:room])
            low = room
        # Now create new pages as needed.
        if low >= cnt:
            return
        new_pages = []
        new_offsets = []
        while low < cnt:
            page_size = min(cnt - low, self.page_size)
            new_pages.append(list(lst[low:low + page_size]))
            new_offsets.append(n + low)
            low += page_size
        self.pages[page + 1:page + 1] = new_pages
        self.page_offsets[page + 1:page + 1] = new_offsets

    def append_elem(self, lst):
        # TODO: For now, we always create new pages, but
        # we could try to look at the last nonempty
        # page and see if there's room.
        cnt = len(lst)
        n = self.page_offsets[-1]
        # Remove the empty sentinel page temporarily.
        self.page_offsets.pop()
        self.pages.pop()
        low = 0
        while low < cnt:
            page_size = min(self.page_size, cnt - low)
            self.page_offsets.append(n + low)
            self.pages.append(list(lst[low:low + page_size]))
            low += page_size
        # Put the empty sentinel page back.
        self.page_offsets.append(n + cnt)
        self.pages.append([])

    def _update_offsets_for_insertion(self, page, cnt):
        # Update the offsets of all the pages to the
        # right of the insertion point (if any).
        for i in range(page + 1, len(self.pages)):
            self.page_offsets[i] += cnt

    def _get_page(self, n):
        # This should be a binary search, but we do a
        # linear search for clarity.
        for page in range(len(self.page_offsets) - 1):
            if n < self.page_offsets[page + 1]:
                return page
        # The last page is unbounded, and it's usually
        # an empty page.
        return len(self.page_offsets) - 1

    def _split_page(self, page, n):
        if n >= self.page_offsets[page + 1]:
            return
        self.page_offsets.insert(page + 1, n)
        offset = n - self.page_offsets[page]
        self.pages.insert(page + 1, self.pages[page][offset:])
        self.pages[page] = self.pages[page][:offset]

test()
You may use linked lists.
For example, you have two linked lists, A and B.
Split the data into the lists: read the first 10000 elements into A and insert the remaining numbers into B.
Then you can easily insert the new numbers after the 10000th element in constant time each.
After all the inserts you can merge A and B.
Complexity: O(N) overall (every insertion costs O(1)).
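With Python lists standing in for the two linked lists, the whole scheme is a short sketch:

```python
def insert_in_middle(data, pos, new_elems):
    """Split data at pos into A and B, append the new elements
    to A, then merge A and B back together: O(N) overall."""
    a, b = data[:pos], data[pos:]
    a.extend(new_elems)  # each insertion is O(1) amortized
    a.extend(b)          # merge the tail back on
    return a
```

With real linked lists the split and merge are pointer updates, so no element copying is needed at all.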
@algos, This is a good start, but if you're not told the insertion point in advance, then you'll need to do the bookkeeping on where the insertion happened. Also, while not explicitly stated in the problem, you can probably expect multiple inserts, so you're gonna want a data structure that scales beyond one insert.
We can use arrays for storage plus some data structure to manage the storage. We could load the first million values into an array, int A[1000000]. Since we need to insert some data in the middle, we wouldn't want to resize this big array and shift the elements. We can just create a new array B[500] to store the 500 new members that should be inserted after A[9999]. Then a managing data structure records that the logical sequence reads A[0-9999], then B[0-499], and then continues with A[10000-999999].
Below is the data structure
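One possible managing structure, sketched in Python with lists standing in for the C arrays and small sizes for illustration (the names are hypothetical):

```python
class ManagedStorage:
    """Presents several backing arrays as one logical sequence.
    Each segment is (array, start, length), read in order."""
    def __init__(self):
        self.segments = []

    def add_segment(self, arr, start, length):
        self.segments.append((arr, start, length))

    def __getitem__(self, i):
        # Walk the segments, subtracting each length until the
        # logical index falls inside one of them.
        for arr, start, length in self.segments:
            if i < length:
                return arr[start + i]
            i -= length
        raise IndexError("index out of range")

# A stands in for the big array, B for the newly inserted values.
# Logical order: A[0..9], then all of B, then A[10..19].
A = list(range(20))
B = [100, 101, 102]
storage = ManagedStorage()
storage.add_segment(A, 0, 10)
storage.add_segment(B, 0, 3)
storage.add_segment(A, 10, 10)
```

The linear scan over segments is fine while there are only a few of them; with many inserts you'd keep cumulative offsets and binary search, as discussed above.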
- jtr.hkcr March 15, 2013