Amazon Interview Question
Software Engineer / Developers
Country: India
Interview Type: In-Person
The interviewer wanted the job runner(s) to be separate from the worker; the job runners would be responsible for running the jobs.
So the worker is responsible for getting the jobs from the client and returning the results.
The only open question is how a job runner would return the result on completion: would it save the worker's host name, and what would happen if that worker is no longer available?
@CoolGuy: I don't get the difference between job runner and worker in your latest post. However, let's assume the job runner is a job scheduler that distributes the work to workers, so it acts like a load balancer. A typical way to return results in that setup is IP rewriting: the job runner forwards the request to the worker, but the sender IP is actually the client's IP. But your interface is stated to be not synchronous...
- Chris November 27, 2017
Important questions may be:
- size of job
- character of job (CPU bound, I/O bound)
- where's the data for the job (needs to be read from other services, ...)
- notification (really callback, no long polling etc. - firewalls of clients? clients behind a NAT?)
- authentication & authorization model
- criticality of requirements such as processing each job only once, keeping the order of requests (e.g. per client, overall, ...), etc.
A load balancer would round-robin (or distribute more cleverly, e.g. on a hash ring) to workers that execute the jobs. Each worker has two slaves that it notifies when it receives a request. A worker is in turn a slave for other workers. That means it keeps a copy of the jobs it is slave for and can jump in if the owning worker dies. Otherwise it does not work on the jobs it is slave for.
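To make the hash-ring variant concrete, here is a minimal sketch. It assumes one ring position per worker (no virtual nodes) and that a worker's two slaves are simply the next two workers clockwise on the ring; the class name `Ring`, the worker names and the use of SHA-256 are all illustrative choices, not something stated in the thread.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a ring position (stable across processes)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical hash ring: one position per worker, no virtual nodes."""

    def __init__(self, workers, slaves_per_worker=2):
        if len(workers) <= slaves_per_worker:
            raise ValueError("need more workers than slaves per worker")
        self._ring = sorted((_point(w), w) for w in workers)
        self._points = [p for p, _ in self._ring]
        self._slaves_per_worker = slaves_per_worker

    def owner_and_slaves(self, job_id: str):
        """Owner is the first worker clockwise from the job's hash;
        its slaves are the workers that follow it on the ring."""
        n = len(self._ring)
        start = bisect.bisect_left(self._points, _point(job_id)) % n
        names = [self._ring[(start + i) % n][1]
                 for i in range(self._slaves_per_worker + 1)]
        return names[0], names[1:]

    def without(self, dead: str) -> "Ring":
        """Return a new ring with one worker removed (e.g. after a crash)."""
        alive = [w for _, w in self._ring if w != dead]
        return Ring(alive, self._slaves_per_worker)


ring = Ring(["worker-a", "worker-b", "worker-c", "worker-d"])
owner, slaves = ring.owner_and_slaves("alice-1")
```

With this layout, removing a dead owner from the ring makes its first slave the new owner of the same job, so the worker that already holds a copy of the job is the one that takes over.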
If jobs and/or results need to be persisted, the worker would be responsible for storing the job in a database. The database would be sharded by job ID. The job ID would be something like the user_id concatenated with a running per-user job number (assuming the per-user job frequency is at most a few per second and typically a few per hour). To prevent data loss, the database is replicated.
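A small sketch of the job ID and sharding idea follows. The `-` separator, the shard count of 8, the hash-modulo shard choice and the in-memory counter are assumptions for illustration; a real system would keep the per-user counter somewhere durable.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 8  # assumed shard count, not from the thread


class JobIdGenerator:
    """Hypothetical generator: user_id plus a running per-user job number."""

    def __init__(self):
        self._last = defaultdict(int)  # in-memory only; assumption for the sketch

    def new_id(self, user_id: str) -> str:
        self._last[user_id] += 1
        return f"{user_id}-{self._last[user_id]}"


def shard_for(job_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a stable hash of the job ID."""
    digest = hashlib.sha256(job_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


gen = JobIdGenerator()
first = gen.new_id("alice")   # "alice-1"
second = gen.new_id("alice")  # "alice-2"
other = gen.new_id("bob")     # "bob-1"
```

Hashing the whole job ID spreads one user's jobs over all shards. Hashing only the user_id part would instead keep a user's jobs together on one shard, which may matter if per-client ordering is a requirement.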
For notification I would try to notify from the worker itself. Before notifying, I would persist the result of the job if needed, and tell the slaves that I am now notifying, so that if the worker crashes while notifying, one of the slaves can take over.
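The ordering described here (persist, tell the slaves, then notify the client) could look roughly like the sketch below. The dict standing in for the database, the per-slave lists standing in for messages to slaves, and the final "notified" marker are assumptions added for illustration; the post itself only mentions the "now notifying" message.

```python
def finish_job(job_id, result, store, slave_logs, notify_client):
    """Persist first, announce the notification to the slaves, then notify."""
    store[job_id] = result                  # 1. persist the result
    for log in slave_logs:                  # 2. tell each slave we are notifying
        log.append(("notifying", job_id))
    notify_client(job_id, result)           # 3. notify the client (may raise)
    for log in slave_logs:                  # 4. assumption: confirm completion
        log.append(("notified", job_id))


def needs_takeover(slave_log, job_id):
    """A slave would take over if it saw 'notifying' but never 'notified'."""
    return (("notifying", job_id) in slave_log
            and ("notified", job_id) not in slave_log)


store, slave_logs, sent = {}, [[], []], []
finish_job("alice-1", 42, store, slave_logs,
           lambda job_id, result: sent.append((job_id, result)))
```

Because the result is stored before the client is contacted, a slave that takes over can read it back and notify without re-running the job.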
But notification could fail (client not reachable). In this case I would retry from the same worker once immediately, a second time after a few seconds, and again after 1-2 minutes. If it still fails, either I'm on the wrong side of a network partition (which would hopefully be detected by one of my slaves) or the client is gone for some time.
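The retry schedule can be sketched as follows. The concrete delays (5 and 90 seconds) are assumed values for "a few seconds" and "1-2 minutes", and `send` is a hypothetical callable that raises `ConnectionError` when the client is unreachable.

```python
import time

# Delays before each retry, in seconds: immediately, "a few seconds",
# "1-2 minutes". The exact numbers are assumptions.
RETRY_DELAYS = (0, 5, 90)


def notify_with_retries(send, retry_delays=RETRY_DELAYS, sleep=time.sleep):
    """Try once, then once more after each delay.

    Returns (succeeded, attempts_made)."""
    attempts = 0
    for delay in (None, *retry_delays):  # None marks the initial attempt
        if delay:
            sleep(delay)
        attempts += 1
        try:
            send()
            return True, attempts
        except ConnectionError:
            continue
    return False, attempts
```

Passing `sleep` as a parameter keeps the schedule testable without actually waiting. A `False` result is the point where the worker has to decide between "I am partitioned" and "the client is gone", for example by handing the job to a slave or parking the notification for later.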
The fail-over from a worker to a slave is critical and must be coordinated (how strictly depends on how important it is not to process a job twice or notify a client twice in rare cases).