Amazon Interview Question for Data Engineers

You have two files in HDFS. One contains date ranges, with two columns: start date and end date. The other has two columns: date and visitors. Write Spark code that uses the two files to find the date range with the maximum number of visitors.
- tokritijain October 30, 2018 in India
Amazon Data Engineer

Country: India
Interview Type: In-Person

A couple of questions:
1. Are the date ranges non-overlapping and sorted?
2. Are the records in the visitor/date file sorted by date?

- Progu November 02, 2018

Let me do it with MapReduce instead of Spark, since I don't know Spark well. I assume the concept is similar, though.

file1: <startDate endDate>
1 5
8 20
file2: <date visitor>
2 5
3 8
10 120

So the answer should be 8-20, since that date range has the most visitors (120, versus 5 + 8 = 13 for 1-5).

Assumption: since file1 only contains ranges, it should be small enough to load into Hadoop's distributed cache.

Now run the map code over file2. It should do the following:
1) Read file2 and, for each date, look up the range it belongs to in the distributed cache and add that date's visitor count to the counter for that range.
e.g.
for date 2, add 5 to the counter for 1_5
for date 3, add 8 to the counter for 1_5
for date 10, add 120 to the counter for 8_20
2) Output <range, counter> as the map output.
3) In the reducer, add up all the counters for each range.

Also, we need to add total-order sorting so that the combined output of all reducers is sorted.
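
The map and reduce steps above can be sketched in plain Python, without Hadoop, to check the logic on the sample data. This is only an illustration under assumptions: the ranges do not overlap, the in-memory `ranges` list stands in for the distributed cache, and the names `find_range`, `map_phase`, and `reduce_phase` are made up for this sketch. Note that it adds each date's visitor count rather than counting records, since the question asks for the maximum number of visitors.

```python
from bisect import bisect_right
from collections import defaultdict

# Stand-in for the distributed cache: the small file1, sorted by start date.
# Assumption: the ranges do not overlap.
ranges = sorted([(1, 5), (8, 20)])
starts = [start for start, _ in ranges]


def find_range(date):
    """Return the (start, end) range containing date, or None if there is none."""
    i = bisect_right(starts, date) - 1
    if i >= 0 and ranges[i][0] <= date <= ranges[i][1]:
        return ranges[i]
    return None


def map_phase(records):
    """Emit (range, visitors) for every file2 record that falls inside a range."""
    for date, visitors in records:
        r = find_range(date)
        if r is not None:
            yield r, visitors


def reduce_phase(pairs):
    """Sum the visitor counts per range."""
    totals = defaultdict(int)
    for r, visitors in pairs:
        totals[r] += visitors
    return dict(totals)


# Sample file2 records: (date, visitors)
file2 = [(2, 5), (3, 8), (10, 120)]
totals = reduce_phase(map_phase(file2))
best_range = max(totals, key=totals.get)
print(best_range, totals[best_range])
```

On the sample data this picks the 8-20 range. The binary search in `find_range` only works because the ranges are assumed to be non-overlapping; with overlapping ranges, a date would have to be emitted once for every range that contains it.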

- kabs November 24, 2018

(Spark SQL):

%sql

-- Sample data for file1 (date ranges)
DROP TABLE IF EXISTS dateRange;
CREATE TABLE dateRange (startdate INT, enddate INT);
INSERT INTO dateRange VALUES (1, 5), (8, 20);

-- Sample data for file2 (visitors per date)
DROP TABLE IF EXISTS dateVisitor;
CREATE TABLE dateVisitor (date INT, visitors INT);
INSERT INTO dateVisitor VALUES (2, 5), (3, 8), (10, 120);

SELECT * FROM dateRange;

SELECT * FROM dateVisitor;

-- Option 1: join each date to its range, total the visitors, keep the top range
SELECT SUM(v.visitors) AS totalvisitors, d.startdate, d.enddate
FROM dateRange d
JOIN dateVisitor v ON v.date BETWEEN d.startdate AND d.enddate
GROUP BY d.startdate, d.enddate
ORDER BY totalvisitors DESC
LIMIT 1;

-- Option 2: same result using a window function
SELECT DISTINCT d.startdate, d.enddate,
       SUM(v.visitors) OVER (PARTITION BY d.startdate, d.enddate) AS totalvisitors
FROM dateRange d
JOIN dateVisitor v ON v.date BETWEEN d.startdate AND d.enddate
ORDER BY totalvisitors DESC
LIMIT 1;
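
The join-and-aggregate logic of the first query can be checked on the sample data without a Spark cluster. The sketch below is an assumption-laden stand-in, not Spark itself: it uses Python's built-in `sqlite3` module with an in-memory database, and keeps the top row with `LIMIT 1`, which Spark SQL also supports.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark SQL (assumption:
# only the query logic is being checked, not Spark behaviour or performance).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dateRange (startdate INT, enddate INT);
    INSERT INTO dateRange VALUES (1, 5), (8, 20);
    CREATE TABLE dateVisitor (date INT, visitors INT);
    INSERT INTO dateVisitor VALUES (2, 5), (3, 8), (10, 120);
""")

# Join each date to the range containing it, total the visitors per range,
# and keep the range with the highest total.
best = conn.execute("""
    SELECT d.startdate, d.enddate, SUM(v.visitors) AS totalvisitors
    FROM dateRange d
    JOIN dateVisitor v ON v.date BETWEEN d.startdate AND d.enddate
    GROUP BY d.startdate, d.enddate
    ORDER BY totalvisitors DESC
    LIMIT 1
""").fetchone()
print(best)
conn.close()
```

The returned row is (startdate, enddate, totalvisitors) for the range with the most visitors. Ranges that contain no visitor dates drop out of the inner join, which is fine here because they cannot be the maximum.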

- Vijay Panchal July 01, 2019

Hi Vijay,

Could you please explain the statements (insert dateRange values (1,5),(8,20)) and (insert dateVisitor values (2,5),(3,8),(10,120))?

Also, are you on LinkedIn?

I hope to hear from you soon.

- Aaron April 24, 2020
