Google Interview Question for Software Developers


Team: Maps
Country: United States





We should start by decomposing the problem:
1. We only store data for the last 30 days, in a rolling window.
2. We keep 3 (or perhaps one?) max-heap data structures: one for the 10 most frequent words in an hour, one for a day, and one for a month.
3. A separate system maintains and stores that data.

- celeritas August 07, 2018 | Flag Reply

How do we approach this problem?

- Anonymous August 07, 2018 | Flag Reply

Interesting approach with the heap, but I was thinking of a different one (the same structure is used for 1H, 1D, etc.).
Maintain a sorted array and a map from word to its position in the array.
On each insert, increment the word's count and move it to the right in the array. To expire values, decrement counts and move words to the left.
Problems:
1) Many words share the same count, so in the worst case a move costs O(N).
Solution: keep another map from each count to the index where its run starts, so a move becomes an O(1) swap.
2) Thousands of words arrive at the same time.
We can use optimistic locking: keep track of the words we are modifying right now, and as long as we are more than 2 counts away from any word currently being processed, we can safely modify the array (no collisions). Otherwise we wait until the +1 or -1 change finishes.
(We can also do some batch processing if we don't need real time.)
Interested in what you think about this approach.

- save.the.w. August 13, 2018 | Flag Reply

Keep a map with the word as the KEY and, as the VALUE, two arrays plus three running sums. The first array has size 31 (one slot per day of the month) and the second has size 60 (one slot per minute of the hour).

When a word comes in, check whether it is already in the map (adding it if not). Increment the count at the index for the current date and at the index for the current minute, and update the Sum30, SumDay and SumHour variables, resetting them on rollover.

Compare these sums with the global max_30, max_day and max_hour records, and update those records when the new sums exceed them.

- Upendra D August 14, 2018 | Flag Reply

Use a map from word to frequency, plus three min heaps (one each for the hour, day and month) holding the 10 most frequent words. When a new word is processed, increment its count in the map and compare that count with the count at the top of the heap; if it is greater, replace the top of the heap with the new count and key.

Worst-case insertion time is O(log 10), i.e. constant.

- Anonymous August 16, 2018 | Flag Reply

For every tweet we generate multiple tuples of the form <TID, time, count>, for example:
<TweetId, Time, +1>, <TweetId, Time + Hour, -1>, <TweetId, Time + Day, -1>, <TweetId, Time + Month, -1>.
We keep three hashmaps (for hour, day, month). Each hashmap's key is a tweet id and its value is a pointer into a linked list that stores, for each count, a hashtable of all the tweets with that count.
When we get a (+1) event, if the tweet is already in one of the main hash tables we find its node, remove it from the linked list's current count bucket, and add it to the (count + 1) bucket; otherwise, if this is the first time we see it, we add it to the list and the corresponding hashmaps with count = 1. A (-1) event works the same way in reverse (we move it into a lower count bucket).
Query: at any given time, we traverse the relevant list (hour, day, or month) from the largest count to the smallest, collecting the K most frequent items along the way.

- Anonymous August 16, 2018 | Flag Reply

The solution differs depending on what we should optimize for. We can do something like this, using a circular buffer, hash maps, and a min heap:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

using namespace std;

class WordFrequency {
    public:
        WordFrequency(int period) : period_(period)
        {
            slots_.resize(period_);
            words_.push_back("");
            head_ = 0;
            tail_ = 0;
        }

        void AddWord(uint32_t time, const string &word)
        {
            uint64_t &word_id = word_to_word_id_[word];
            if (word_id == 0) {
                word_id = words_.size();
                words_.push_back(word);
            }
            TrimTail(time);
            ++slots_[time % period_][word_id];
            ++period_words_[word_id];
            head_ = time;
        }

        vector<string> GetTopWords(uint32_t time, int n)
        {
            vector<string> out;
            if (time < head_)
            {
                return out;
            }
            TrimTail(time);
            for (auto it = period_words_.begin(); it != period_words_.end(); ++it)
            {
                uint64_t word_id = it->first;
                uint64_t count = it->second;
                if ((int)pq.size() < n)
                {
                    pq.push(PQElement(word_id, count));
                }
                else if (pq.top().count_ < count)
                {
                    pq.pop();
                    pq.push(PQElement(word_id, count));
                }
            }
            while (!pq.empty())
            {
                out.push_back(words_[pq.top().word_id_]);
                pq.pop();
            }
            reverse(out.begin(), out.end());
            return out;
        }

    private:
        class PQElement {
            public:
                PQElement(uint64_t word_id, uint64_t count) : word_id_(word_id), count_(count)
                {}
                bool operator<(const PQElement &other) const
                {
                    if (count_ == other.count_) {
                        return word_id_ > other.word_id_;
                    }
                    return count_ > other.count_;
                }
                uint64_t word_id_;
                uint64_t count_;
        };

        void TrimTail(uint32_t time)
        {
            int new_tail = time - (period_ - 1);
            if (new_tail > tail_)
            {
                int tail_augment = new_tail - tail_;
                if (tail_augment > period_)
                {
                    tail_augment = period_;
                }
                for (int i = 0; i < tail_augment; ++i)
                {
                    DecrementWordCounts(tail_);
                    ++tail_;
                }
            }
        }

        void DecrementWordCounts(uint32_t slot_idx)
        {
            unordered_map<uint64_t, uint64_t> &slot_words = slots_[slot_idx %= period_];
            for (auto it = slot_words.begin(); it != slot_words.end(); ++it)
            {
                uint64_t word_id = it->first;
                uint64_t count = it->second;
                period_words_[word_id] -= count;
                if (period_words_[word_id] == 0)
                {
                    period_words_.erase(word_id);
                }
            }
            slot_words.clear();
        }

        int period_;
        unordered_map<string, uint64_t> word_to_word_id_;
        vector<string> words_;
        vector<unordered_map<uint64_t, uint64_t>> slots_;
        unordered_map<uint64_t, uint64_t> period_words_;
        uint32_t head_;
        int tail_;
        priority_queue<PQElement> pq;
};

void Print(const vector<string> &words)
{
    for (auto w : words)
    {
        cout << w << "\n";
    }
    cout << "---\n";
}

int main()
{
    WordFrequency wf(60);

    wf.AddWord(1, "cat");
    wf.AddWord(1, "cat");
    wf.AddWord(10, "rat");
    wf.AddWord(15, "cat");
    wf.AddWord(15, "bat");
    wf.AddWord(20, "rat");
    wf.AddWord(21, "dog");
    wf.AddWord(21, "dog");

    Print(wf.GetTopWords(21, 3));

    wf.AddWord(21, "dog");
    wf.AddWord(41, "rat");
    wf.AddWord(41, "rat");

    Print(wf.GetTopWords(41, 3));

    wf.AddWord(55, "dot");
    wf.AddWord(61, "net");
    wf.AddWord(100, "let");

    Print(wf.GetTopWords(100, 3));

    return 0;
}

- Alex August 19, 2018 | Flag Reply

My approach: say you get a stream of words in a List<String> every hour. For every word in the list, search the trie and insert it if absent; this significantly reduces insertion cost. Also keep a HashMap<String, Integer>: when you see a new string, add it to the map with count 1; on a duplicate, update its count.

Then keep a big List<HashMap<String, Integer>> holding these hourly maps: on a timer, every hour a new HashMap is appended. For the past hour's words take list.get(list.size() - 1); for the past day take list.subList(list.size() - 24, list.size()); for the past month compute (30 or 31 days, depending on the month) * 24 hours and take that many maps.

Once you have the relevant maps, the rest is easy: take all the values (hashMap.values()), add them to an ordered set to drop duplicates and sort them from lowest to highest, keep only the last 10 values from this set, and put them in a final set. Then, for each key (word) in the hashmaps, add it to the resulting top-10 list only if its value (the word's frequency) is in that set. Stop when the resulting list has 10 entries and return it.

- Nitish September 14, 2018 | Flag Reply

Use a data-stream (probabilistic) algorithm like Misra-Gries' Frequent algorithm:
- keep a hash table with at most 10 items at all times
- if the element is in the hash table, increment its count by 1
- if the element is not in the hash table:
-- if the hash table has fewer than 10 elements, insert the element with count 1
-- else decrement the count of every element in the hash table by 1, removing any element whose count reaches 0 or that falls outside the time window t

So you need 3 hash tables (for 1 hr, 1 day and 1 month).

Why does this work probabilistically? Decrementing every counter by 1 for each stream element knocks out the less frequent items, while the most frequent items survive.

- shanthikp September 15, 2018 | Flag Reply

Can an approximate solution be considered, or is an exact solution required?

- adr August 07, 2018 | Flag Reply

