Amazon Interview Question
SDE-2s
Country: United States
Interview Type: Phone Interview
This should be handled using the concept of fault tolerance, which is explained briefly below.
When a server crashes or a hard disk runs out of room in an on-premises datacenter environment, administrators are
notified immediately, because these are noteworthy events that require at least their attention — if not their
intervention as well. The ideal state in a traditional, on-premises datacenter environment tends to be one where failure
notifications are delivered reliably to a staff of administrators who are ready to spring into action in order to solve the
problem. Many organizations are able to reach this state of IT nirvana – however, doing so typically requires extensive
experience, up-front financial investment, and significant human resources.
Check the following:
1. See whether particular hosts in the web service are behaving badly. If so, remove them from the service to mitigate the immediate issue.
2. If all the hosts are impacted, randomly spot-check the hosts' CPU, memory, and performance.
3. Check the web service logs.
4. The load balancer might have been configured to allow more connections than the web service can handle. If that is the case, try reducing the connection limit.
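The first two steps of the checklist above can be sketched roughly as follows. This is only an illustration: the `triage` function, the per-host error-rate input, the 5% threshold, and the action names are all hypothetical and would depend on the monitoring data actually available.

```python
import random


def triage(error_rates, threshold=0.05, sample_size=3, rng=None):
    """Suggest a next action from a dict of host name -> error rate.

    Hypothetical helper: the threshold and action names are assumptions.
    """
    rng = rng or random.Random()
    # Hosts whose error rate exceeds the (assumed) threshold.
    bad = sorted(h for h, rate in error_rates.items() if rate > threshold)

    if not bad:
        # No host stands out, so move on to the logs (step 3).
        return ("check_logs", [])
    if len(bad) < len(error_rates):
        # Only some hosts misbehave: take them out of service (step 1).
        return ("remove_from_service", bad)
    # Every host is impacted: spot-check a random sample (step 2).
    k = min(sample_size, len(bad))
    return ("spot_check", sorted(rng.sample(bad, k)))


if __name__ == "__main__":
    print(triage({"web-1": 0.01, "web-2": 0.40, "web-3": 0.02}))
```

When every host is over the threshold, removing hosts would only shrink the pool further, which is why the sketch falls back to spot-checking instead.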
If a specific web server starts failing, you could remove it from the load balancer's connection pool until the problem is fixed.
- Felipe Cerqueira January 15, 2014
With a cloud infrastructure, it could be easy to start a new server to replace the one with the problem...
If a software problem is happening on all servers, it could be a capacity problem, and the solution could be to add new servers to the pool to reduce the load.
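As a rough illustration of the capacity point, the number of servers to add can be estimated from the peak load and the load one server can sustain. The function names, the requests-per-second figures, and the 25% headroom below are assumptions made for the example, not measured values.

```python
import math


def servers_needed(peak_rps, per_server_rps, headroom=0.25):
    """Servers required to serve peak_rps with some spare capacity."""
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    return math.ceil(peak_rps * (1 + headroom) / per_server_rps)


def servers_to_add(current, peak_rps, per_server_rps, headroom=0.25):
    """How many servers to add to the pool (never negative)."""
    return max(0, servers_needed(peak_rps, per_server_rps, headroom) - current)


if __name__ == "__main__":
    # Assumed numbers: 1000 req/s at peak, 100 req/s per server, 10 servers today.
    print(servers_to_add(current=10, peak_rps=1000, per_server_rps=100))
```

If the result is zero but all servers are still failing, capacity is probably not the cause and the software problem itself needs investigating.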