Amazon Interview Question
Principal Software Engineers
Country: United States
Interview Type: Phone Interview
- How can we make sure the files are in sync? Use a hash function such as SHA-1 to produce a 160-bit digest for each file. You can then compare these small values between the client computer and your back end instead of comparing the files themselves.
- Protocol? Use FTP over SSL. The client program connects to the server and, after authentication, tries to sync the files. If it finds a mismatch, it compares the modification dates and either downloads or uploads the file.
- A good hint for implementation: Since most files do not change frequently, you can keep a database of the files' hash values and query it when you want to compare them. This reduces the traffic spent accessing files on the server; you only touch the files themselves when they are needed. When a file is uploaded to the server, its hash value in the DB has to be updated. You should also put a web service layer between the DB and the client.
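The hash-and-compare idea above could be sketched roughly as follows. This is only an illustration under some assumptions: the names `sha1_of_file` and `plan_sync` are made up for this example, each side is represented as a dict of relative path to `(hash, modification time)`, and a file missing on one side is treated as new rather than deleted.

```python
import hashlib

def sha1_of_file(path, chunk_size=64 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def plan_sync(client, server):
    """Decide what the client should do for each path.

    client and server map relative path -> (hash, mtime).
    Paths whose hashes match on both sides are left out of the result.
    """
    actions = {}
    for path in sorted(set(client) | set(server)):
        c, s = client.get(path), server.get(path)
        if c is None:
            actions[path] = "download"   # only the server has it
        elif s is None:
            actions[path] = "upload"     # only the client has it
        elif c[0] != s[0]:
            # Hash mismatch: the newer modification time wins (assumption).
            actions[path] = "upload" if c[1] > s[1] else "download"
    return actions
```

Only the small hash table has to be exchanged to build the plan; file contents move only for the paths that end up in the result.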
- Cerberuz September 03, 2014
1) Use the rsync protocol for file synchronization between clients and between client and server.
2) Store file blocks of equal size instead of whole files. Have a file block server that keeps track of the file blocks and the SHA-256 hash of each stored block.
3) Use Amazon S3 to store file blocks.
4) Each client should be identified by its unique root namespace.
5) Store client metadata in a DB (MySQL) table. Each row in the table represents a particular file. The table's attributes should be the list of file blocks, the root namespace of the client who owns the file, and the file's path relative to that namespace, which gives the file's location in the client's Dropbox installation folder. Store other user metadata, such as settings, account configuration, and access level, in another table in the same DB.
6) Have a metadata server to fetch results from the DB.
7) Run multiple instances of the metadata server and the file block server to handle a large number of requests. Amazon S3 will handle your file blocks.
8) Use memcache and a load balancer with the metadata server for efficiency.
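As a rough sketch of points 2) and 5), here is how a file could be cut into blocks, hashed with SHA-256, and described by one metadata row. The names `split_into_blocks`, `FileRecord`, and `make_record` are hypothetical, and the dataclass only stands in for a row of the MySQL table. Note that the last block of a file will usually be shorter than the others.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 4 * 1024 * 1024  # assumed block size in bytes

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut data into consecutive blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def block_hash(block: bytes) -> str:
    """Hex SHA-256 digest used as the block's identifier."""
    return hashlib.sha256(block).hexdigest()

@dataclass
class FileRecord:
    """Stand-in for one row of the file metadata table."""
    root_namespace: str   # identifies the owning client
    relative_path: str    # path of the file inside that namespace
    block_hashes: list = field(default_factory=list)  # ordered block list

def make_record(namespace, relative_path, data, block_size=BLOCK_SIZE):
    hashes = [block_hash(b) for b in split_into_blocks(data, block_size)]
    return FileRecord(namespace, relative_path, hashes)
```

Because blocks are keyed by their hash, two files that share a block only need that block stored once.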
To upload a file, the client splits the file into blocks of equal size (4 MB) and sends the metadata server the information about the file to upload (the hashes of its file blocks). If any file block is not already in Amazon S3, the block server tells the metadata server about it. The metadata server tells the client to send those file blocks to the block server, and the client does so. Once the block server has stored the file blocks in Amazon S3, the metadata server adds an entry for the file to the MySQL DB, representing an update in the client's namespace.
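The upload handshake can be mimicked in memory like this. It is a toy sketch, not a real service: `BlockServer`, `MetadataServer`, and `upload` are invented names, and plain dicts play the roles of Amazon S3 and the MySQL table.

```python
import hashlib

class BlockServer:
    """Stand-in for the block server; a dict plays the role of Amazon S3."""
    def __init__(self):
        self.store = {}  # block hash -> block bytes

    def missing(self, hashes):
        return [h for h in hashes if h not in self.store]

    def put(self, block):
        h = hashlib.sha256(block).hexdigest()
        self.store[h] = block
        return h

class MetadataServer:
    """Stand-in for the metadata server; a dict plays the role of the DB table."""
    def __init__(self, block_server):
        self.block_server = block_server
        self.files = {}  # (namespace, relative path) -> ordered block hashes

    def begin_upload(self, hashes):
        # Ask the block server which blocks are not stored yet.
        return self.block_server.missing(hashes)

    def commit(self, namespace, path, hashes):
        if self.block_server.missing(hashes):
            raise ValueError("some blocks have not been stored yet")
        self.files[(namespace, path)] = list(hashes)

def upload(meta, block_srv, namespace, path, data, block_size=4 * 1024 * 1024):
    """Client side of the upload; returns how many blocks were actually sent."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    hashes = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = set(meta.begin_upload(hashes))
    sent = 0
    for h, b in zip(hashes, blocks):
        if h in needed:
            block_srv.put(b)
            needed.discard(h)  # a repeated block is sent only once
            sent += 1
    meta.commit(namespace, path, hashes)
    return sent
```

For example, uploading a second file that shares most of its blocks with an earlier one only sends the blocks that differ, which is the main payoff of hashing blocks rather than whole files.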
To sync the file to a user's other clients, the metadata server informs those clients about the update in the client namespace. The clients ask the metadata server about the newly added file, and the metadata server sends the list of hashes of the new file's blocks. The clients then talk to the block server, give it the list of file block hashes, retrieve the corresponding file blocks, and combine them to regenerate the file.
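On the receiving client, rebuilding the file from the hash list might look like the sketch below. `reassemble` and the `fetch_block` callback are hypothetical; the callback stands for whatever request the client makes to the block server. Re-hashing each block on arrival is an extra check added here, not something the description above requires.

```python
import hashlib

def reassemble(block_hashes, fetch_block):
    """Fetch blocks in order, verify each against its hash, and join them."""
    parts = []
    for expected in block_hashes:
        block = fetch_block(expected)
        if hashlib.sha256(block).hexdigest() != expected:
            raise ValueError(f"block {expected} failed verification")
        parts.append(block)
    return b"".join(parts)
```

The order of the hash list is what defines the file, so the metadata server has to return the hashes in the same order the uploader produced them.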
And this is exactly how Dropbox works.