Amazon Interview Question
Software Engineer / DevelopersLets say that we have two tables:
+ All visited pages, with a unique id for each page.
+ Customer table with days visited for each customer, with information on all pages visited
struct pages_table {
struct page_info[MAX_PAGES];
};
struct day {
int day[MAX_PAGE_CNT];
int page_count;
};
struct cust {
int cust_id;
int num_days_visited;
struct day1;
struct day2;
struct day3;
};
Collect all customers with num_days_visited == 2.
<pre lang="whitespace" line="1" title="CodeMonkey62719" class="run-this">-- Assume that we have access to canonicalized log file with three fields (user, page, date).
A = load 'visits.log' as (user:chararray, page:chararray, date:chararray);
-- Group tuples by user.
B = group A by user;
-- For each user, count distinct pages viewed and dates of visit.
C = foreach B {
pages = distinct(A.page);
dates = distinct(A.date);
generate group as user,
COUNT(pages) as pages,
COUNT(dates) as dates;
};
-- Apply filter to each user's aggregates.
D = filter C by pages > 2 and dates == 2;
-- Project out columns that are no longer needed.
E = foreach D generate user;
-- Dump out all the users who have passed the filter conditions.
dump E;
</pre><pre title="CodeMonkey62719" input="yes">
</pre>
process per day data at a time and create a table
- Anonymous August 02, 2010customer id | page count | toggle bit
day 1: create customer -> page count data -> toggle = 0
day 2: update day 1 table with customer -> page count
if customer[hit] -> update page count, and toggle = !toggle,
so toggle == 1 for 2 day visitors.
day 3: do same as above, update data, toggle bit for customer hits
Go through customer table and output customers with toggle == 1
Can be done with map-reduce, with reduce doing the toggling.