Amazon Interview Question
Data EngineersCountry: India
Interview Type: In-Person
Let me do it using mapreduce instead of spark, I don't know much of spark. I would assume the concept is similar though.
file1: <startDate endDate>
1 5
8 20
file2: <date visitor>
2 5
3 8
10 120
So answer should be 8-20 since in that date range we have max visitor.
Assumption : since file1 just has ranges it should be a small file and can be loaded into the distributed cache of hadoop.
Now execute map code for file2, the code should do following :
1) read file2 and for each date , see which range it belongs to from the distributed cache and increment the counter for that time range.
eg
for 2 increment counter for 1_5
for 3 increment counter for 1_5
for 10 increment counter for 8_20
3) output <range , counter> as map output
4) In reduce add all the counters for every range.
Also - we need to add total order sorting so that overall output of all reducers are sorted.
(spark sql):
%sql
drop table dateRange
create table dateRange(startdate int, enddate int)
insert dateRange values (1,5),(8,20)
drop table dateVisitor
create table dateVisitor(date int , visitors int)
insert dateVisitor values (2,5),(3,8),(10,120)
select * from daterange;
select * from datevisitor
select top 1 sum(visitors) as totalvisitors, startdate,enddate from daterange d join datevisitor v on v.date between d.startdate and d.enddate
group by startdate,enddate order by totalvisitors desc
select distinct top 1 startdate,enddate,sum(visitors) over (partition by startdate,enddate ) as totalvisitors from daterange d join datevisitor v on v.date between d.startdate and d.enddate
order by 3 desc
drop table dateRange
create table dateRange(startdate int, enddate int)
insert dateRange values (1,5),(8,20)
drop table dateVisitor
create table dateVisitor(date int , visitors int)
insert dateVisitor values (2,5),(3,8),(10,120)
select * from daterange;
select * from datevisitor
select top 1 sum(visitors) as totalvisitors, startdate,enddate from daterange d join datevisitor v on v.date between d.startdate and d.enddate
group by startdate,enddate order by totalvisitors desc
select distinct top 1 startdate,enddate,sum(visitors) over (partition by startdate,enddate ) as totalvisitors from daterange d join datevisitor v on v.date between d.startdate and d.enddate
order by 3 desc
Couple of doubts
- Progu November 02, 20181. Is date range non overlapping and sorted?
2. Is visitor/date file has record sorted by date?