Yahoo Interview Question for Software Engineer / Developers


Country: United States
Interview Type: In-Person





A two-level hashing scheme should be sufficient.
a) The outer map is keyed by personId:
Map<Integer, Map<Integer, Integer>> persons = new HashMap<>();
b) The second HashMap maps a pageId to its frequency; it is created lazily inside the update.

void incrementFrequency(int personId, int pageId) {
    // computeIfAbsent creates a person's inner map on first visit;
    // merge initializes a page's count to 1 or adds 1 to the existing count.
    Map<Integer, Integer> pageMap = persons.computeIfAbsent(personId, k -> new HashMap<>());
    pageMap.merge(pageId, 1, Integer::sum);
}

(Java generics need Integer rather than the primitive int, and computeIfAbsent/merge cover the null checks that would otherwise be required.)

- Learner July 21, 2013 | Flag Reply

What about hundreds of millions of users? You certainly can't fit them all in the first HashMap.

- Y! engineer July 21, 2013

In that case, it seems we need something like the "paging" we have in operating systems: an indexed mechanism to determine the machine (since it will be a distributed file system), a secondary index to retrieve the person's data on that machine, and then an update of the frequency for the page.

It also raises another interesting question: given a page, find the number of distinct users who have accessed it (a time frame could also be specified).
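A minimal sketch of the first-level "determine the machine" index, assuming integer user ids and a fixed shard count (both assumptions, not from the thread):

// Hypothetical routing helper: map a personId to one of numShards machines.
// A real deployment would likely use consistent hashing so shards can be
// added without remapping every user; plain modulo keeps the sketch short.
int shardFor(int personId, int numShards) {
    return Math.floorMod(Integer.hashCode(personId), numShards);
}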

- Learner July 22, 2013

Yep, distribution of users across machines is the key here. But why go down to such a low level as a file system? How about a distributed hash table (maybe a NoSQL database) to store the user records on a large cluster of shards (e.g., tablets in Google Bigtable), with the page-frequency map stored either as a single column, or with each page mapped to a column name and the frequency stored in the column value, in a column-oriented database like HBase?
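A rough sketch of the second layout against the HBase client API; the table name, column family, and connection setup are assumptions, not from the comment:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Row key = userId; column family "p"; qualifier = pageId; value = visit count.
// HBase increments are atomic per cell, so concurrent visits don't lose updates.
void incrementPageCount(Connection conn, long userId, String pageId) throws IOException {
    try (Table table = conn.getTable(TableName.valueOf("user_page_counts"))) {
        table.incrementColumnValue(Bytes.toBytes(userId), Bytes.toBytes("p"),
                                   Bytes.toBytes(pageId), 1L);
    }
}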

- Y! engineer July 22, 2013

@Y! engineer:
Create a hash function that gives a unique value for each page each user has visited, and map those hash values to the userId.
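A minimal sketch of that idea, assuming ids fit in 32 bits (an assumption; any collision-free combination of the two ids works):

import java.util.HashMap;
import java.util.Map;

// Pack (userId, pageId) into one 64-bit key so a single flat map holds all counts.
Map<Long, Integer> counts = new HashMap<>();

void increment(int userId, int pageId) {
    long key = ((long) userId << 32) | (pageId & 0xFFFFFFFFL); // unique per (user, page)
    counts.merge(key, 1, Integer::sum);
}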

- pirate July 22, 2013

@Learner & Y! engineer
Why not just create a HashMap<Person, Page>? Why is the pageId so important here?

- Chih.Chiu.19 July 22, 2013

@Chih.Chiu

We need two levels of hashing either way: first on the basis of users, which is where distributing the data across nodes comes into the picture, and second on the basis of page views per user.

- Anonymous July 23, 2013

I certainly agree with the code above, but if fast response and memory are not issues, a hash of ("userId" + "pageId") would probably be better.

But if you want to track a large number of users and a large number of pages, as some of the folks here are asking, the solution requires some mechanism to find a user and a page fast.

If the workload is not that heavy transactionally, in real life I would probably use a SQL-type database with 3 tables:
- users - indexed on whatever we want to hash (userId or username)
- pageNames - contains every page any user has visited
- userPageCount - contains the relationship between users and pages and the respective view count

If it is mildly heavy transactionally, I would take the same approach but add a cache service that keeps the currently used data in memory and asynchronously folds it into the stored counts every now and then, locking the userPageCount record to keep the update thread safe.

If it is heavily transactional, I'd probably go with a no-SQL solution with a single blob table, indexed by userName + pageName, containing the count. When updating, I would find the record based on "userName + pageName" and apply a thread-safe add operation.
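For the SQL variant, the increment can be a single atomic upsert. A sketch using JDBC with MySQL's ON DUPLICATE KEY UPDATE; the table and column names are assumptions, and userPageCount needs a composite primary key on (userId, pageId):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Insert the first view, or atomically bump the existing count, in one statement.
void recordView(Connection conn, int userId, int pageId) throws SQLException {
    String sql = "INSERT INTO userPageCount (userId, pageId, viewCount) VALUES (?, ?, 1) "
               + "ON DUPLICATE KEY UPDATE viewCount = viewCount + 1";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setInt(1, userId);
        ps.setInt(2, pageId);
        ps.executeUpdate();
    }
}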

- Nelson Perez August 04, 2013

All of a user's usage will be tracked in a separate file per user; the file name could be userID.log. Inside the file, the data is stored as a serialized HashMap.

// <URL, Count>
HashMap<String, Integer> getUsageStats(String userID)
        throws IOException, ClassNotFoundException {
    if (userID == null || userID.isEmpty()) return null;

    File file = new File(userID + ".log");
    try (ObjectInputStream objFile = new ObjectInputStream(new FileInputStream(file))) {
        return (HashMap<String, Integer>) objFile.readObject();
    }
}
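The matching write path (an addition, not part of the original comment) would serialize the updated map back to the same per-user file:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Overwrites the previous snapshot of this user's page counts.
void saveUsageStats(String userID, HashMap<String, Integer> usageMap) throws IOException {
    try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(userID + ".log"))) {
        out.writeObject(usageMap);
    }
}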

- a_coder July 21, 2013

You would probably need to explain how you would manage a billion separate files. Odds are you won't be able to store them all on one machine; you would need a distributed file system.

- some guy July 21, 2013

A B-tree-like structure would be good for storing users, where each node can live on a separate machine.

For instance, machine 1 might hold the routing info for all names: if your name leads off with the letter A, it forwards you to machine 2; with B, to machine 3; and so on. You might need many copies of machine 1 to handle all the requests, since machine 1 is essentially the root of the tree. At machine 2, you might forward based on the second letter of the username: AB -> machine 4, AC -> machine 5, etc. The amount of splitting depends on the distribution of naming patterns, and the number of machine clones you need depends on the level in the tree: nodes further down need fewer clones, while nodes at the top need more to handle more traffic. This makes the number of hops to find a user depend on the length of the username rather than on the number of users. Obviously, the same principle works if there is a mapping function from username to a numerical ID, where a B-tree is even more intuitive.

Once we get a match for a specific user, we can access that user's usage-frequency structure, stored as a hash map. A user is unlikely to have visited a huge number of different websites, so we don't have to worry as much about a huge structure or slow lookups/updates. We should probably cache the location of the user-specific frequency map so that we don't have to dig through hundreds of millions of users every time a user clicks on a new page, and let cache entries decay over time, since hits tend to bunch up temporally.
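A minimal sketch of the prefix-routing idea (all names assumed): each node either owns a prefix outright or forwards on the next letter of the username:

import java.util.HashMap;
import java.util.Map;

class RouteNode {
    Integer machineId;                                    // non-null once this prefix maps to one machine
    Map<Character, RouteNode> children = new HashMap<>();

    // Walk the username letter by letter until some node claims the prefix.
    // Assumes the trie covers every reachable prefix, as in the description above.
    int route(String username, int depth) {
        if (machineId != null) return machineId;
        return children.get(username.charAt(depth)).route(username, depth + 1);
    }
}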

I think an important aspect of these problems is scaling and parallel/distributed processing. Obviously, many web-scale solutions will never be able to run on a single machine, so the algorithms have to be adapted.

- Some guy July 21, 2013

Use a User object to contain the page details.
Something like:

class User {
    int userId;
    Map<Integer, Integer> pageVisits = new HashMap<>(); // <pageId, frequency>
}

Now create a hash table for storing the User objects, as sketched below.
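A minimal sketch of that table, keyed by userId (the recordVisit helper is an assumption, not from the comment):

Map<Integer, User> users = new HashMap<>();

void recordVisit(int userId, int pageId) {
    User u = users.computeIfAbsent(userId, id -> {
        User nu = new User();
        nu.userId = id;
        return nu;
    });
    u.pageVisits.merge(pageId, 1, Integer::sum);        // bump this page's frequency
}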

- bibbsey July 28, 2013

