What constitutes a quote?
A fragment of text longer than some threshold, say 20 words.
How do you find quotes?
Cluster all the web pages. For a given web page, compare it with all other pages in its cluster. This is O(mn).
Store quotes in a DB. The key is the URL; the value is a list of (position, URL) pairs recording where the quote is reproduced (see the sketch below).
How do you make it scale to the web?
There are about 1T web pages on the internet; store all of them in clusters.
How do you handle updates?
Periodically re-crawl the web pages and update the quotes. Given a URL, check whether the quoted positions have changed; if so, recompute the quotes.
How would you arrange the servers?
Two databases: one stores all the clustered web pages, the other stores the quotes.
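A minimal sketch of the quote detection and storage idea, assuming 20-word fragments; note the lookup index here is keyed by a fragment hash (the URL-keyed store described above would be filled in from these hits), an in-memory dict stands in for the DB, and all names are illustrative:

# Minimal sketch: hash 20-word fragments and record where each one appears,
# so reproduced fragments (quotes) can be found. The dict stands in for the DB.
from collections import defaultdict

FRAGMENT_WORDS = 20  # the threshold discussed above

def extract_fragments(text):
    """Yield (position, fragment) pairs of FRAGMENT_WORDS consecutive words."""
    words = text.split()
    for i in range(len(words) - FRAGMENT_WORDS + 1):
        yield i, " ".join(words[i:i + FRAGMENT_WORDS])

quote_index = defaultdict(list)  # fragment hash -> list of (url, position)

def index_page(url, text):
    for pos, fragment in extract_fragments(text):
        quote_index[hash(fragment)].append((url, pos))

def find_quotes(url, text):
    """Return fragments of this page that also appear on other pages."""
    hits = []
    for pos, fragment in extract_fragments(text):
        other_urls = [u for u, _ in quote_index[hash(fragment)] if u != url]
        if other_urls:
            hits.append((pos, fragment, other_urls))
    return hits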
Scenario: user posts status, user receives news feed from friends
Service: news feed, friendship
When a user posts a status, write it to the DB.
When a user requests the news feed, fetch the top 100 statuses from each friend, merge them, and return the overall top 100 (sketched below, after the storage schema).
Storage:
Status table (NoSQL): row key = user id, column key = timestamp, value = status.
Friend table (NoSQL): row key = user id, column key = timestamp, value = friend id.
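A minimal sketch of the pull-model read path above, assuming hypothetical helpers get_friend_ids and get_recent_statuses that read the friend and status tables:

# Minimal sketch of the pull-model news feed read.
# get_friend_ids / get_recent_statuses are hypothetical readers of the
# friend and status tables (row key = user id, column key = timestamp).
import heapq
import itertools

TOP_N = 100

def get_news_feed(user_id, get_friend_ids, get_recent_statuses):
    """Merge each friend's latest TOP_N statuses and return the newest TOP_N.

    get_recent_statuses(friend_id, limit) is assumed to return a list of
    (timestamp, status) tuples sorted newest-first.
    """
    per_friend = [get_recent_statuses(fid, TOP_N)
                  for fid in get_friend_ids(user_id)]
    # k-way merge by descending timestamp, then keep only the newest TOP_N
    merged = heapq.merge(*per_friend, key=lambda item: -item[0])
    return list(itertools.islice(merged, TOP_N))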
Disk = 1B DAU * 0.5 posts/day * 100 bytes = 50 GB/day.
Scale
Read QPS = 1B * 5 reads/day / 86,400 s ≈ 60k; round up to 100k.
peak = 200k, growth = 400k
Write QPS = 1B * 0.5 posts/day / 86,400 s ≈ 6k; round up to 10k.
peak = 20k, growth = 40k
Turn on caching since the QPS is high.
Memory = 1B users * 100 statuses * 100 bytes = 10 TB.
So we need at least 100 database servers with 100 GB of RAM each.
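A quick back-of-envelope check of the news feed numbers above (the QPS figures are rounded up generously):

# Back-of-envelope check of the news feed capacity estimates above.
DAU = 1_000_000_000
POSTS_PER_DAY = 0.5
READS_PER_DAY = 5
STATUS_BYTES = 100
SECONDS_PER_DAY = 86_400

disk_per_day = DAU * POSTS_PER_DAY * STATUS_BYTES   # 5e10 B = 50 GB/day
read_qps = DAU * READS_PER_DAY / SECONDS_PER_DAY    # ~58k, rounded up to 100k
write_qps = DAU * POSTS_PER_DAY / SECONDS_PER_DAY   # ~6k, rounded up to 10k
cache_bytes = DAU * 100 * STATUS_BYTES              # 1e13 B = 10 TB of hot statuses
servers = cache_bytes / (100 * 10**9)               # 100 servers with 100 GB RAM each

print(disk_per_day, round(read_qps), round(write_qps), cache_bytes, servers)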
Solution 1: O(n). The input is 10 trillion 32-bit integers distributed across many machines.
Each machine creates an integer array of size 2^16 to count how many of its numbers fall into each of the 2^16 intervals of the 32-bit range (interval j covers [j*2^16, (j+1)*2^16)). The array only takes 2^16 * 4 bytes = 256 KB of memory; summed across all the machines the counts come to about 2.5 GB.
One master machine creates a 64-bit integer array of size 2^16 to merge the counts.
From the merged counts, the master knows the median is the ith number in the jth bucket.
In the second pass, each machine counts the frequency of each number that falls in the jth bucket and sends these counts to the master.
The master then identifies the ith number as the median. If the 2.5 GB of data transferred between servers is the bottleneck, we can do a three-level partition instead, splitting the 32 bits into 2^11, 2^11, and 2^10.
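A minimal single-process sketch of the two-pass bucketing idea above; the per-machine and master roles are collapsed into one function for brevity, and each machine's data is assumed to arrive as a list:

# Minimal sketch of the two-pass bucket median; per-machine and master
# roles are collapsed into one process for brevity.
def distributed_median(machines):
    """machines: list of lists of 32-bit unsigned integers."""
    total = sum(len(m) for m in machines)
    target = total // 2  # 0-based index of the median element

    # Pass 1: each "machine" counts numbers per 2**16-wide interval.
    merged = [0] * (1 << 16)
    for data in machines:
        for x in data:
            merged[x >> 16] += 1

    # Master walks the merged counts to find bucket j and offset i.
    j, seen = 0, 0
    while seen + merged[j] <= target:
        seen += merged[j]
        j += 1
    i = target - seen  # median is the i-th smallest number inside bucket j

    # Pass 2: count frequencies of individual values inside bucket j only.
    fine = [0] * (1 << 16)
    for data in machines:
        for x in data:
            if x >> 16 == j:
                fine[x & 0xFFFF] += 1

    k, seen = 0, 0
    while seen + fine[k] <= i:
        seen += fine[k]
        k += 1
    return (j << 16) | k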
Solution 2: MapReduce.
The merge step can also be done in parallel, which hints at MapReduce.
Use MapReduce to count the frequency of each number.
The output contains at most 4B (integer, frequency) pairs, sorted by integer.
We can then scan the output file to find the median.
If scanning the output file becomes the bottleneck, we can do a multi-level MapReduce similar to Solution 1.
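A minimal sketch of the counting job in plain Python; the mapper and reducer are written as generators, and a real MapReduce framework would handle the shuffle/sort between them:

# Minimal sketch of the counting job; a real MapReduce framework would
# perform the shuffle/sort between mapper and reducer.
from itertools import groupby

def mapper(numbers):
    # Emit (number, 1) for every input integer.
    for x in numbers:
        yield x, 1

def reducer(sorted_pairs):
    # Pairs arrive grouped by key after the shuffle; sum each group's counts.
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

def median_from_counts(pairs, total):
    # Scan the (integer, frequency) output, sorted by integer, for the median.
    target, seen = total // 2, 0
    for value, count in pairs:
        if seen + count > target:
            return value
        seen += count

# Tiny usage example
data = [7, 3, 3, 9, 7, 7, 1]
pairs = list(reducer(sorted(mapper(data))))
print(median_from_counts(pairs, len(data)))  # 7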
message service->channel service<->push service<->client
friendship service
The channel service stores in memory whether a user is online.
When a user visits the Facebook homepage, the message service returns a push server address to the user and tells the channel service that the user is subscribed for push. The user then connects to the push server via a socket.
When the user closes Facebook, the socket is disconnected and the push service tells the channel service to unsubscribe the user.
When a comment is posted, the message service sends the comment to the channel service. The channel service looks up the thread's followers and forwards the comment, through the push service, only to the followers who are online.
Push servers are sharded by user id with consistent hashing.
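A minimal sketch of picking a push server by consistent hashing on user id, assuming a fixed server list and a small number of virtual nodes per server (all names are illustrative):

# Minimal sketch of consistent hashing for choosing a push server by user id.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server) virtual nodes
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, user_id):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(str(user_id))) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["push1", "push2", "push3"])
print(ring.get_server(12345))  # the push server this user connects to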
Search QPS = 1B DAU * 2 searches/day / 86,400 s ≈ 20k/s.
Peak = 40k/s.
One counter server can keep the sum in memory.
All web servers inform the counter when a search request is received.
A second counter does the same thing to prevent a single point of failure.
The slave reports to the master when it hits 1B.
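A minimal sketch of the in-memory counter, assuming web servers call increment() over some RPC layer and a hypothetical notify_master() callback fires at the 1B threshold:

# Minimal sketch of the in-memory search counter. increment() would be
# exposed over RPC to the web servers; notify_master() is a hypothetical
# callback fired when the 1B threshold is reached.
import threading

class SearchCounter:
    THRESHOLD = 1_000_000_000

    def __init__(self, notify_master):
        self._count = 0
        self._lock = threading.Lock()
        self._notify_master = notify_master

    def increment(self):
        """Called by web servers for every search request received."""
        with self._lock:
            self._count += 1
            if self._count == self.THRESHOLD:
                self._notify_master(self._count)

    def value(self):
        with self._lock:
            return self._count

counter = SearchCounter(notify_master=lambda c: print(f"hit {c} searches"))
counter.increment()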