Apple Interview Question for SDE-3s

Apple On-site at Cupertino
Team Data Warehousing

Questions on Hadoop, Hive and Spark

I. Given a table with 1B rows of user IDs and the product IDs those users bought, and another table mapping product ID to product name, we want to find pairs of products that are often purchased together by the same user, such as wine and a bottle opener, or chips and beer. How would you find the top 100 of these co-occurring product pairs? If you go with Hadoop, where is the bottleneck and how would you optimize it?

II. Someone put DISTRIBUTE BY Random()*ID in a Hive script to prevent data skew. What would be the problem here?
- aonecoding May 10, 2017 in United States

Country: United States
Interview Type: In-Person

II. Someone put DISTRIBUTE BY Random()*ID in a Hive script to prevent data skew. What would be the problem here?

The problem is that Random()*ID yields a different distribution key on every row, so rows with the same ID land on different reducers. Any per-reducer aggregation that assumes all rows for an ID arrive at the same reducer will then see only part of the data and produce incorrect results.
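The effect described above can be simulated in a few lines of Python. This is only a rough sketch, not Hive itself: the reducer count, the `partition_random` and `partition_by_id` helpers, and the sample IDs are all hypothetical, and the modulo step merely stands in for whatever hashing Hive applies to the distribution key.

```python
import random
from collections import Counter, defaultdict

NUM_REDUCERS = 4  # hypothetical reducer count for this simulation


def partition_random(ids, rng):
    """Mimic distributing by Random()*ID: the key changes on every row."""
    reducers = defaultdict(list)
    for user_id in ids:
        key = rng.random() * user_id
        reducers[int(key) % NUM_REDUCERS].append(user_id)
    return reducers


def partition_by_id(ids):
    """Distribute by the ID itself: the same ID always maps to one reducer."""
    reducers = defaultdict(list)
    for user_id in ids:
        reducers[user_id % NUM_REDUCERS].append(user_id)
    return reducers


def per_reducer_counts(reducers):
    """Count rows per ID inside each reducer, with no cross-reducer merge."""
    return {r: Counter(ids) for r, ids in reducers.items()}


ids = [7] * 1000 + [3] * 10  # ID 7 is the skewed "hot" key
rng = random.Random(0)

random_counts = per_reducer_counts(partition_random(ids, rng))
id_counts = per_reducer_counts(partition_by_id(ids))

reducers_seeing_7_random = [r for r, c in random_counts.items() if 7 in c]
reducers_seeing_7_by_id = [r for r, c in id_counts.items() if 7 in c]

# A second aggregation stage that merges the partial counts restores the total.
merged = Counter()
for partial in random_counts.values():
    merged.update(partial)
```

With the random key, each reducer holds only a partial count for ID 7. The skew is gone, but the counts are only correct after a second merge step like the `merged` counter above.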

- Mayank Jain August 20, 2017

1. Assume that every line in the input data contains a user ID followed by the list of product IDs that user bought.
In the map phase, we extract all products purchased by a user and emit every pair of them with a count of 1.
e.g. CUST_123, PROD_1, PROD_2, PROD_3

Result of the map phase:
(PROD_1:PROD_2,1)
(PROD_1:PROD_3,1)
(PROD_2:PROD_3,1)

In the reduce phase, we collect these results from all users, sum the counts for each pair, and then return the 100 pairs with the highest counts.
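The map and reduce phases described above could be sketched in plain Python as follows. This is a single-process illustration under the stated input assumption (one comma-separated line per user), not a Hadoop job, and the function names `map_pairs`, `reduce_counts`, and `top_pairs` are hypothetical. Products are sorted before pairing so that (A, B) and (B, A) collapse into one key.

```python
from collections import Counter
from itertools import combinations


def map_pairs(line):
    """Map phase: emit (pair, 1) for every pair of products one user bought."""
    _user_id, *products = [field.strip() for field in line.split(",")]
    unique = sorted(set(products))  # sorted so each pair has one canonical key
    for a, b in combinations(unique, 2):
        yield (f"{a}:{b}", 1)


def reduce_counts(mapped):
    """Reduce phase: sum the counts for each pair across all users."""
    totals = Counter()
    for pair, count in mapped:
        totals[pair] += count
    return totals


def top_pairs(lines, n=100):
    """Run both phases in-process and keep the n most frequent pairs."""
    mapped = (kv for line in lines for kv in map_pairs(line))
    return reduce_counts(mapped).most_common(n)


sample = [
    "CUST_123, PROD_1, PROD_2, PROD_3",
    "CUST_456, PROD_2, PROD_1",
    "CUST_789, PROD_3, PROD_2, PROD_1",
]
result = top_pairs(sample, n=2)
```

Note that a user with n distinct products emits n(n-1)/2 pairs, so the map output (and the shuffle that follows it) is the likely bottleneck. Summing counts locally before the shuffle, as a combiner would, is one common way to reduce it.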

- dhruven91 November 13, 2017

This is how to answer the first question in old and boring SQL: join the table with itself on user ID, so that each product a user bought is paired with every other product that user bought. Then drop rows that pair a product with itself, and deduplicate mirrored pairs by keeping only the rows where the first product ID is lower than the second:

SELECT
    FP.product AS product1,
    T.product AS product2,
    COUNT(1) AS bought_count
FROM Purchases AS FP
-- the < in the join condition keeps only one of (p1, p2) and (p2, p1),
-- and also drops rows that pair a product with itself
INNER JOIN Purchases AS T
    ON FP.user = T.user AND FP.product < T.product
GROUP BY FP.product, T.product
ORDER BY bought_count DESC
LIMIT 100

I have no idea how to do this in Spark, though. The bottleneck is obviously the inner join, but what can we do to optimize it? Maybe the question is about distributing the load evenly among workers; I don't know.
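One modest way to shrink the self-join, offered here only as a sketch, is to deduplicate (user, product) rows before joining, so that repeat purchases neither inflate the join nor the counts. The snippet below runs that variant of the query on an in-memory SQLite database as a stand-in for Hive. The column names `user_id` and `product_id`, the `Deduped` CTE, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Purchases (user_id TEXT, product_id TEXT)")
conn.executemany(
    "INSERT INTO Purchases VALUES (?, ?)",
    [
        ("u1", "wine"), ("u1", "opener"), ("u1", "wine"),  # repeat purchase
        ("u2", "wine"), ("u2", "opener"),
        ("u3", "chips"), ("u3", "beer"),
    ],
)

QUERY = """
WITH Deduped AS (
    -- one row per (user, product), so repeat purchases are not double counted
    SELECT DISTINCT user_id, product_id FROM Purchases
)
SELECT
    FP.product_id AS product1,
    T.product_id AS product2,
    COUNT(*) AS bought_count
FROM Deduped AS FP
INNER JOIN Deduped AS T
    ON FP.user_id = T.user_id AND FP.product_id < T.product_id
GROUP BY FP.product_id, T.product_id
ORDER BY bought_count DESC, product1, product2
LIMIT 100
"""

top_pairs_sql = conn.execute(QUERY).fetchall()
```

Without the `DISTINCT` step, user u1's two wine purchases would each pair with the opener and count that user twice for the same pair.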

- inthecottonfield February 05, 2019
