langeolli
Anon is correct. The second pass will not help. Take the following situation:
3 servers:
SERVER A:
  URL #1 - #10: 2 counts each
  URL #0:       1 count
SERVER B:
  URL #11 - #20: 2 counts each
  URL #0:        1 count
SERVER C:
  URL #21 - #30: 2 counts each
  URL #0:        1 count
Now the first pass will find URLs #1-#30 but miss URL #0 entirely, even though it is the winning URL with 3 counts (one per server).
Hence the whole strategy is flawed.
Serialize each node recursively as
NODE LEFT RIGHT
using '^' as a sentinel if a node has no right child (a missing left child contributes nothing; the BST keys disambiguate).
A binary search tree

          10
        /    \
       5      18
      / \    /  \
     1   8  13   20
        /     \
       7       14

would become the string '10 5 1 ^ 8 7 ^ ^ 18 13 14 ^ 20 ^' (modulo trailing whitespace)
deserialize:
1. read the stream value by value
2. if the value is the sentinel '^', the current node has no more children
3. if the value is smaller than the current node's key it becomes the left child, otherwise the right child
class NodeDS:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def serialize_tree(node):
    # Emit NODE LEFT RIGHT; a missing left child contributes nothing,
    # a missing right child is marked with the sentinel '^'.
    if not node:
        return ''
    s = '%d ' % node.key
    s += serialize_tree(node.left)
    if node.right is None:
        s += '^ '
    else:
        s += serialize_tree(node.right)
    return s
class TagStream:
    # A simple cursor over the whitespace-separated tags of the input.
    def __init__(self, input_string):
        self.tags = input_string.split()
        self.index = 0

    def next(self):
        tag = self.tags[self.index]
        self.index += 1
        return tag
def _deserialize_tree(stream, last_key):
    node = NodeDS(last_key)
    while True:
        next_key = stream.next()
        if next_key == '^':            # sentinel: no more children here
            break
        next_node = _deserialize_tree(stream, int(next_key))
        if int(next_key) < last_key:
            node.left = next_node      # smaller key: left child, keep reading
        else:
            node.right = next_node     # right child closes this subtree
            break
    return node

def deserialize_tree(text):
    stream = TagStream(text)
    first_key = stream.next()
    return _deserialize_tree(stream, int(first_key))
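
A quick round-trip check on the example tree above, built by hand:

root = NodeDS(10)
root.left, root.right = NodeDS(5), NodeDS(18)
root.left.left, root.left.right = NodeDS(1), NodeDS(8)
root.left.right.left = NodeDS(7)
root.right.left, root.right.right = NodeDS(13), NodeDS(20)
root.right.left.right = NodeDS(14)

s = serialize_tree(root).strip()       # '10 5 1 ^ 8 7 ^ ^ 18 13 14 ^ 20 ^'
copy = deserialize_tree(s)
assert serialize_tree(copy) == serialize_tree(root)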
The Gray code is given by XOR-ing the current bit of the binary code (position i) with the next more significant bit (position i+1). To reverse the conversion one has to compute the binary bit at position i+1 first, to know the mask for position i.
The algorithm is given here on an array of digits such as [1,0,1,1,0] for clarity. It translates easily into operations on the bits of an unsigned number, as shown below the ==== line.
def binary2gray(code, invert_bit=0):
    # invert_bit carries the previous (more significant) binary bit.
    if len(code) == 0:
        return []
    bit = code[0]
    return [bit ^ invert_bit] + binary2gray(code[1:], bit)

def gray2binary(code, invert_bit=0):
    # invert_bit carries the previous *decoded* binary bit.
    if len(code) == 0:
        return []
    bit = code[0]
    return [bit ^ invert_bit] + gray2binary(code[1:], bit ^ invert_bit)
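
For example, most significant digit first:

assert binary2gray([1, 0, 1, 1, 0]) == [1, 1, 1, 0, 1]
assert gray2binary([1, 1, 1, 0, 1]) == [1, 0, 1, 1, 0]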
==============================================
Assuming 32-bit numbers:

def xbinary2gray(x, pos=31, invert_bit=0):
    if pos < 0:
        return x
    mask = 1 << pos
    # XOR bit 'pos' of x with the next more significant bit of x,
    # delivered in invert_bit already shifted down to position 'pos'.
    newx = (mask & (x ^ invert_bit)) | (~mask & x)
    return xbinary2gray(newx, pos - 1, (mask & x) >> 1)

def xgray2binary(x, pos=31, invert_bit=0):
    if pos < 0:
        return x
    mask = 1 << pos
    newx = (mask & (x ^ invert_bit)) | (~mask & x)
    # Pass down the freshly decoded *binary* bit, not the Gray bit.
    return xgray2binary(newx, pos - 1, (mask & newx) >> 1)
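
A quick round-trip check (the value is arbitrary); the forward direction agrees with the well-known one-liner x ^ (x >> 1):

x = 0b10110
g = xbinary2gray(x)
assert g == x ^ (x >> 1)
assert xgray2binary(g) == x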
Rotate an n x n matrix 90 degrees clockwise, in place, by cycling four elements at a time, layer by layer:

def rotate_matrix(matrix):
    n = len(matrix)
    for r in range(0, n // 2):          # r: current layer (ring)
        for o in range(r, n - r - 1):   # o: offset within the layer
            temp = matrix[r][o]
            matrix[r][o] = matrix[n - o - 1][r]
            matrix[n - o - 1][r] = matrix[n - r - 1][n - o - 1]
            matrix[n - r - 1][n - o - 1] = matrix[o][n - r - 1]
            matrix[o][n - r - 1] = temp
    return matrix
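
For example:

m = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
rotate_matrix(m)
# m is now [[7, 4, 1],
#           [8, 5, 2],
#           [9, 6, 3]]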
Sort the integers into buckets using mod R, where R is the known short range. While sorting into the buckets, record the position of the minimum element. This element is then the offset, and the median sits in bucket (R/2 + offset) mod R.
If duplicate numbers are allowed you need to count how many land in each bucket, and you have to actually walk along your R buckets to find the median.
Complexity should be O(N), where N is the number of integers, and O(N + R) if duplicates are allowed.
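
A minimal sketch of the counting variant, assuming the values span at most R consecutive integers (the function and variable names are mine):

def median_mod_r(nums, r):
    # Bucket by residue mod r; the minimum anchors the residue
    # classes back to actual values.
    counts = [0] * r
    lo = min(nums)
    for x in nums:
        counts[x % r] += 1
    # Walk the buckets in value order, starting at lo's bucket,
    # until half of the elements have been passed.
    need = (len(nums) + 1) // 2
    seen = 0
    for i in range(r):
        seen += counts[(lo + i) % r]
        if seen >= need:
            return lo + i

print(median_mod_r([102, 100, 104, 101, 103], 10))   # -> 102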
We partition the work by using a hash of the url. The hash gives the address of an aggregator that is responsible for counting a subset (as defined by the hash) of the urls. Then we crawl through the aggregators and fill the top-ten list. Since each url is mapped to exactly one aggregator, the crawler only has to maintain a map with ten entries.
- langeolli March 07, 2014

Step 1) send out a table with aggregator addresses: say we use 10k aggregators
Step 2) on each individual machine, count all urls
Step 3) on each individual machine, iterate through the counted urls; the hash key modulo 10k
gives the address of the responsible aggregator -> send that url's count to it
Step 4) aggregators accumulate the counts
Step 5) initialize an empty top-ten list, iterate through the 10k aggregator machines,
and maintain the top-ten list
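
A toy single-process sketch of steps 2-5, with the aggregators modeled as in-memory counters (the names and the heap-based top-ten are illustrative choices, not part of the original answer):

from collections import Counter
import heapq

NUM_AGGREGATORS = 10_000

def aggregator_for(url):
    # Step 3: hash key modulo 10k names the responsible aggregator.
    # A real deployment needs a stable hash (e.g. hashlib), since
    # Python's built-in hash() is salted per process.
    return hash(url) % NUM_AGGREGATORS

def top_ten(servers):
    # servers: one list of observed urls per machine.
    aggregators = [Counter() for _ in range(NUM_AGGREGATORS)]
    for urls in servers:
        local = Counter(urls)                               # step 2
        for url, count in local.items():
            aggregators[aggregator_for(url)][url] += count  # steps 3 + 4
    # Step 5: crawl the aggregators with a fixed-size heap of ten entries.
    best = []
    for agg in aggregators:
        for url, count in agg.items():
            heapq.heappush(best, (count, url))
            if len(best) > 10:
                heapq.heappop(best)
    return sorted(best, reverse=True)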