Amazon Interview Question
SDE-2 | Country: India
To scale it, we can deploy servers that serve requests regionally or by domain suffix, e.g. .in vs .com: requests coming in for .in and .com will be served from different servers.
For region-wise deployment, requests are served from a region-wise cache; each cache holds the trending news of a particular region for each topic.
A user subscribes to a number of topics. When the user logs in, trending news for his/her subscribed topics is fetched from the region-wise cache for each topic and served.
Now, we need to update the region-wise cache frequently. A cron job keeps running in the background that fetches news from the news feeds, aggregates news from the different feeds, and updates the region-wise cache.
By breaking the system up region-wise, it is easy to scale, since load is partitioned by region rather than spread across all regions, and each region-wise cluster can be scaled independently.
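A minimal sketch of the suffix-based routing described above. The cluster names (`ap-south-cluster` etc.) and the function name are illustrative assumptions, not part of the original design:

```cpp
#include <string>
#include <unordered_map>

// Hypothetical router: maps a domain suffix (".in", ".com", ...) to the
// regional server cluster that should serve the request.
std::string routeBySuffix(const std::string& host) {
    static const std::unordered_map<std::string, std::string> suffixToRegion = {
        {".in", "ap-south-cluster"},
        {".com", "us-east-cluster"},
    };
    for (const auto& [suffix, region] : suffixToRegion) {
        if (host.size() >= suffix.size() &&
            host.compare(host.size() - suffix.size(), suffix.size(), suffix) == 0) {
            return region;
        }
    }
    return "default-cluster";  // fall back when no known suffix matches
}
```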
With some assumptions, here is a pull-model sketch:
class newsaggregator
{
    User user;
public:
    bool login(std::string userId, std::string password);
    // Holds timers and refreshes the result after each timer expiry;
    // fetches from the server based on user credentials.
    void aggregateResult();
    void addSource(std::string sourceName);
    void addCategory(std::string category, int freqMinutes = 5); // default: 5 min
};
class User
{
    std::list<Source> sources;
public:
    bool validateUser(std::string userId, std::string password);
    void addSource(std::string sourceName);
    void addCategory(std::string category, int freqMinutes);
};
class Source
{
    std::list<std::string> sources;
    double frequencyMinutes; // e.g. 1 min
    std::map<std::string, std::list<std::string>> news; // category -> headlines
public:
    void daemon(); // after timer expiry, query each source and update news
    std::list<std::string> retrieveNews(std::string source, std::string category); // interface for newsaggregator
    void addNewSource(std::string source); // add the source if it is new
};
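To make the Source cache above concrete, here is a runnable sketch of how `retrieveNews` could be backed by a per-source, per-category map that the daemon refreshes on its timer. The class and method names beyond those in the sketch are assumptions, and the data is illustrative:

```cpp
#include <list>
#include <map>
#include <string>

// Sketch of the cache behind Source::retrieveNews: news items are stored
// under a "source|category" key; the timer-driven daemon would call
// update() to refresh these entries.
class SourceCache {
    std::map<std::string, std::list<std::string>> news_;
public:
    void update(const std::string& source, const std::string& category,
                const std::string& headline) {
        news_[source + "|" + category].push_back(headline);
    }
    std::list<std::string> retrieveNews(const std::string& source,
                                        const std::string& category) const {
        auto it = news_.find(source + "|" + category);
        return it == news_.end() ? std::list<std::string>{} : it->second;
    }
};
```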
Good design. A few doubts: 1) Why do we need the User class? We could have add/remove source methods in the newsaggregator class. 2) For scaling, how do you handle multiple machines or servers that aggregate the news and merge it?
These were very naive initial thoughts, which could be made more specific through discussion in an interview. To answer your questions:
1) Just as in Google News, we have authentication, and user preferences are saved for future interactions: whenever we log in, it aggregates news according to our preferences. The newsaggregator class provides the interface only for external users. That interface then calls the User class to save preferences and calls the Source class to add a new source if it doesn't exist already. The Source class will also update its database (or, more technically, learn from its users) with new inputs it did not have initially. So newsaggregator is effectively a single instance per user.
2) For scaling to multiple machines, we can use a distributed hash table (consistent hashing). Whenever there is a request to fetch from source S1, the corresponding hashed server will be asked for source S1.
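A minimal sketch of the consistent-hashing idea, assuming `std::hash` as the hash function (a production ring would use virtual nodes and a stable hash). Server names are illustrative:

```cpp
#include <functional>
#include <map>
#include <string>

// Minimal consistent-hash ring: each server is placed at its hash position
// on a ring; a source key is served by the first server at or after its own
// hash, wrapping around the ring.
class HashRing {
    std::map<std::size_t, std::string> ring_;  // hash position -> server
    std::hash<std::string> hash_;
public:
    void addServer(const std::string& server) { ring_[hash_(server)] = server; }
    std::string serverFor(const std::string& sourceKey) const {
        std::size_t h = hash_(sourceKey);
        auto it = ring_.lower_bound(h);            // first server at or after h
        if (it == ring_.end()) it = ring_.begin(); // wrap around the ring
        return it->second;
    }
};
```

The benefit over plain modulo hashing is that adding or removing one server only remaps the keys between it and its ring neighbor, not the whole keyspace.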
@Saurabh: Any ideas on how to scale this? Say the number of sources has increased manifold.
Also, in the general case, how would a news aggregator ever work on a push mechanism? Is it that each news source would "push" its state/news/information to a preconfigured server/location?
I think it should very much depend on the source's server itself, whether it supports pushing its info to anyone who has registered with it.
In the case of a push model, there must be some kind of event module running on the aggregator that receives notifications from the server and updates its database.
How can we get information if we are not allowed to pull or push (a page crawl is a kind of pull)?
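A sketch of that event module, under the assumption that sources deliver items through a registered callback instead of being polled on a timer. All names and data here are illustrative:

```cpp
#include <string>
#include <vector>

// Push-model aggregator sketch: sources call onNewsPushed() when they have a
// new item, instead of the aggregator pulling on a schedule.
class PushAggregator {
    std::vector<std::string> store_;  // stands in for the aggregator database
public:
    // Invoked by a source (or by a notification listener) on each new item.
    void onNewsPushed(const std::string& source, const std::string& headline) {
        store_.push_back(source + ": " + headline);  // update local state
    }
    std::size_t itemCount() const { return store_.size(); }
};
```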
- SK, January 26, 2014
I can think of only this: catching broadcast information, which could then be aggregated.
Can anyone shed further light?