Apple Interview Question
SDE-3s
Country: United States
Interview Type: In-Person
1. Assume that every line in the input data contains a user id followed by a list of product ids.
In the map phase, we first extract all products purchased by a user, form every pair of them, and emit each pair with a count of 1.
e.g. CUST_123, PROD_1, PROD_2, PROD_3
Result of the map phase:
(PROD_1:PROD_2,1)
(PROD_1:PROD_3,1)
(PROD_2:PROD_3,1)
In the reduce phase, we collect these results from all users, sum the counts for each pair, and return the top 100 pairs.
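The map and reduce phases described above can be sketched in plain Python. This is only a single-process illustration, not a distributed job: the function names `map_phase` and `reduce_phase` and the second input line are made up for the example, and it assumes the comma-separated line format shown above. Products are sorted within each line so that a pair always comes out in the same order regardless of the order in the input.

```python
from collections import Counter
from itertools import combinations


def map_phase(line):
    """Emit ((product_a, product_b), 1) for every unordered pair on one user's line."""
    fields = [f.strip() for f in line.split(",")]
    # fields[0] is the user id; dedupe and sort the rest so each pair has one canonical order
    products = sorted(set(fields[1:]))
    for pair in combinations(products, 2):
        yield pair, 1


def reduce_phase(mapped, top_n=100):
    """Sum the counts per pair and return the top_n most frequent pairs."""
    counts = Counter()
    for pair, count in mapped:
        counts[pair] += count
    return counts.most_common(top_n)


# Hypothetical input: the first line is the example above, the second is invented
lines = [
    "CUST_123, PROD_1, PROD_2, PROD_3",
    "CUST_456, PROD_2, PROD_3",
]
mapped = [kv for line in lines for kv in map_phase(line)]
top_pairs = reduce_phase(mapped)
print(top_pairs)
```

In a real MapReduce or Spark job the shuffle would group identical pairs onto the same reducer; here the `Counter` plays that role.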
This is how to answer the second question in old and boring SQL: join the table with itself on user id (so that each product is paired with every other product bought by the same user). Then remove rows that pair a product with itself and deduplicate mirrored pairs by keeping only the rows where the first product id is lower than the second:
SELECT
    FP.product AS product1,
    T.product AS product2,
    COUNT(1) AS bought_count
FROM Purchases AS FP
-- the < sign in the join so that we keep only 1 pair of (p1,p2) and (p2,p1)
INNER JOIN Purchases AS T
    ON FP.user = T.user AND FP.product < T.product
GROUP BY FP.product, T.product
ORDER BY bought_count DESC
LIMIT 100
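To try the query out, here is a small sketch that runs it against an in-memory SQLite database from Python. The table layout (`user`, `product` as text columns) and the sample rows are assumptions made for the example; other databases may require quoting `user`, since it is a reserved word in some of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per (user, product) purchase
conn.execute("CREATE TABLE Purchases (user TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO Purchases VALUES (?, ?)",
    [
        ("CUST_1", "PROD_1"), ("CUST_1", "PROD_2"), ("CUST_1", "PROD_3"),
        ("CUST_2", "PROD_2"), ("CUST_2", "PROD_3"),
    ],
)

query = """
SELECT
    FP.product AS product1,
    T.product AS product2,
    COUNT(1) AS bought_count
FROM Purchases AS FP
INNER JOIN Purchases AS T
    ON FP.user = T.user AND FP.product < T.product
GROUP BY FP.product, T.product
ORDER BY bought_count DESC
LIMIT 100
"""
rows = conn.execute(query).fetchall()
for product1, product2, bought_count in rows:
    print(product1, product2, bought_count)
```

With this sample data the pair (PROD_2, PROD_3) comes first with a count of 2, because both customers bought both products. Note that the query counts a pair twice if the same user has duplicate purchase rows for a product.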
I have no idea how to do this in Spark, though. The bottleneck is obviously the inner join, but what can we do to optimize it? Maybe the question is about distributing the load proportionally among workers, I don't know.
The problem here is that the same id will get a different partition number each time if the partition is computed as Random()*ID, and hence its records will go to different reducers. Aggregation functions keyed on ID will then produce incorrect results.
- Mayank Jain August 20, 2017
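One common way around the problem raised in the comment above is a two-stage aggregation: the random salt is used only to spread a hot key across reducers for a partial aggregate, and a second pass strips the salt and combines the partials per original key. The sketch below simulates this in plain Python. `NUM_SALTS` and the function names are hypothetical, and the approach as written is only valid for aggregates that can be merged from partial results, such as counts and sums.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical number of salt buckets per key


def salted_partial_counts(records, num_salts=NUM_SALTS, seed=0):
    """Stage 1: aggregate per (key, salt), so one hot key is split over several buckets."""
    rng = random.Random(seed)
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts)  # the same key may land in different buckets
        partial[(key, salt)] += value
    return partial


def merge_partials(partial):
    """Stage 2: drop the salt and combine the partial results per original key."""
    totals = defaultdict(int)
    for (key, _salt), value in partial.items():
        totals[key] += value
    return dict(totals)


# Invented data: one heavily skewed key and one small key
records = [("PROD_1:PROD_2", 1)] * 10 + [("PROD_2:PROD_3", 1)] * 3
partial = salted_partial_counts(records)
totals = merge_partials(partial)
print(totals)
```

Stopping after stage 1 gives exactly the incorrect per-key results the comment warns about; the totals are right only after the second pass regroups by the unsalted key.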