Google Interview Question for Software Developers


Team: Maps
Country: United States





We should start by decomposing the problem:
1. We only store data for the last 30 days, in a rolling window.
2. We keep 3 (or perhaps one?) max-heap data structures: one for the 10 most frequent words in an hour, one for a day, and one for a month.
3. A separate system maintains and stores that data.

- celeritas August 07, 2018 | Flag Reply

How do we approach this problem?

- Anonymous August 07, 2018 | Flag Reply

Interesting approach with the heap, but I was thinking of a different one (the same structure is used for 1H, 1D, etc.).
Maintain a sorted array and a map from word to its position in the array.
On each insert, increment the word's count and move it to the right in the array. To expire values, decrement counts and move words to the left.
Problems:
1) Many words share the same count, so in the worst case a move costs O(N).
Solution: keep another map from each count to the index where its run starts, so a move becomes an O(1) swap.
2) Thousands of words arrive at the same time.
We can use optimistic locking: keep track of the words we are modifying right now, and as long as we are more than 2 counts away from any word currently being processed, we can safely modify the array (no collisions). Otherwise we wait until the +1 or -1 change finishes.
(We can also do some batch processing if we don't need real time.)
Interested in what you think about this approach.

- save.the.w. August 13, 2018 | Flag Reply

Keep a map with the word as the KEY and, as the VALUE, two arrays plus three running sums. The first array has size 31 (one slot per day of the month) and the second has size 60 (one slot per minute of the hour).

When a word comes in, check whether it is already in the map (adding it if not). Increment the count at the index for the current date and at the index for the current minute, and update the Sum30, SumDay and SumHour variables, resetting them on rollover.

Compare these sums with the global max_30, max_day and max_hour records, and update those records when the new sums exceed them.

- Upendra D August 14, 2018 | Flag Reply

Use a map from word to frequency, plus three min heaps (one each for the hour, day and month) holding the 10 most frequent words. When a new word is processed, increment its count in the map and compare that count with the count at the top of the heap; if it is greater, replace the top of the heap with the new count and key.

Worst-case insertion time is O(log 10), i.e. constant.

- Anonymous August 16, 2018 | Flag Reply

For every tweet we generate multiple tuples of the form <TID, time, count>, for example:
<TweetId, Time, +1>, <TweetId, Time + Hour, -1>, <TweetId, Time + Day, -1>, <TweetId, Time + Month, -1>.
We keep three hashmaps (for hour, day, month). Each hashmap's key is a tweet id and its value is a pointer into a linked list that stores, for each count, a hashtable of all the tweets with that count.
When we get a (+1) event, if the tweet is already in one of the main hash tables we find its node, remove it from the linked list's current count bucket, and add it to the (count + 1) bucket; otherwise, if this is the first time we see it, we add it to the list and the corresponding hashmaps with count = 1. A (-1) event works the same way in reverse (we move it into a lower count bucket).
Query: at any given time, we traverse the relevant list (hour, day, or month) from the largest count to the smallest, collecting the K most frequent items along the way.

- Anonymous August 16, 2018 | Flag Reply

The solution differs depending on what we should optimize for. We can do something like this, using a circular buffer, hash maps, and a min heap:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

using namespace std;

class WordFrequency {
    public:
        WordFrequency(int period) : period_(period)
        {
            slots_.resize(period_);
            words_.push_back("");
            head_ = 0;
            tail_ = 0;
        }

        void AddWord(uint32_t time, const string &word)
        {
            uint64_t &word_id = word_to_word_id_[word];
            if (word_id == 0) {
                word_id = words_.size();
                words_.push_back(word);
            }
            TrimTail(time);
            ++slots_[time % period_][word_id];
            ++period_words_[word_id];
            head_ = time;
        }

        vector<string> GetTopWords(uint32_t time, int n)
        {
            vector<string> out;
            if (time < head_)
            {
                return out;
            }
            TrimTail(time);
            for (auto it = period_words_.begin(); it != period_words_.end(); ++it)
            {
                uint64_t word_id = it->first;
                uint64_t count = it->second;
                if ((int)pq.size() < n)
                {
                    pq.push(PQElement(word_id, count));
                }
                else if (pq.top().count_ < count)
                {
                    pq.pop();
                    pq.push(PQElement(word_id, count));
                }
            }
            while (!pq.empty())
            {
                out.push_back(words_[pq.top().word_id_]);
                pq.pop();
            }
            reverse(out.begin(), out.end());
            return out;
        }

    private:
        class PQElement {
            public:
                PQElement(uint64_t word_id, uint64_t count) : word_id_(word_id), count_(count)
                {}
                bool operator<(const PQElement &other) const
                {
                    if (count_ == other.count_) {
                        return word_id_ > other.word_id_;
                    }
                    return count_ > other.count_;
                }
                uint64_t word_id_;
                uint64_t count_;
        };

        void TrimTail(uint32_t time)
        {
            int new_tail = time - (period_ - 1);
            if (new_tail > tail_)
            {
                int tail_augment = new_tail - tail_;
                if (tail_augment > period_)
                {
                    tail_augment = period_;
                }
                for (int i = 0; i < tail_augment; ++i)
                {
                    DecrementWordCounts(tail_);
                    ++tail_;
                }
            }
        }

        void DecrementWordCounts(uint32_t slot_idx)
        {
            unordered_map<uint64_t, uint64_t> &slot_words = slots_[slot_idx %= period_];
            for (auto it = slot_words.begin(); it != slot_words.end(); ++it)
            {
                uint64_t word_id = it->first;
                uint64_t count = it->second;
                period_words_[word_id] -= count;
                if (period_words_[word_id] == 0)
                {
                    period_words_.erase(word_id);
                }
            }
            slot_words.clear();
        }

        int period_;
        unordered_map<string, uint64_t> word_to_word_id_;
        vector<string> words_;
        vector<unordered_map<uint64_t, uint64_t>> slots_;
        unordered_map<uint64_t, uint64_t> period_words_;
        uint32_t head_;
        int tail_;
        priority_queue<PQElement> pq;
};

void Print(const vector<string> &words)
{
    for (auto w : words)
    {
        cout << w << "\n";
    }
    cout << "---\n";
}

int main()
{
    WordFrequency wf(60);

    wf.AddWord(1, "cat");
    wf.AddWord(1, "cat");
    wf.AddWord(10, "rat");
    wf.AddWord(15, "cat");
    wf.AddWord(15, "bat");
    wf.AddWord(20, "rat");
    wf.AddWord(21, "dog");
    wf.AddWord(21, "dog");

    Print(wf.GetTopWords(21, 3));

    wf.AddWord(21, "dog");
    wf.AddWord(41, "rat");
    wf.AddWord(41, "rat");

    Print(wf.GetTopWords(41, 3));

    wf.AddWord(55, "dot");
    wf.AddWord(61, "net");
    wf.AddWord(100, "let");

    Print(wf.GetTopWords(100, 3));

    return 0;
}

- Alex August 19, 2018 | Flag Reply

My approach: say you get a stream of words in a List<String> every hour. For every word in the list, search the trie and insert it if absent; this significantly reduces insertion cost. Also keep a HashMap<String, Integer>: when you see a new string, add it to the map with count 1; on a duplicate, update its count.

Then keep a big List<HashMap<String, Integer>> holding these hourly maps: on a timer, every hour a new HashMap is appended. For the past hour's words take list.get(list.size() - 1); for the past day take list.subList(list.size() - 24, list.size()); for the past month compute (30 or 31 days, depending on the month) * 24 hours and take that many maps.

Once you have the relevant maps, the rest is easy: take all the values (hashMap.values()), add them to an ordered set to drop duplicates and sort them from lowest to highest, keep only the last 10 values from this set, and put them in a final set. Then, for each key (word) in the hashmaps, add it to the resulting top-10 list only if its value (the word's frequency) is in that set. Stop when the resulting list has 10 entries and return it.

- Nitish September 14, 2018 | Flag Reply

Use a data-stream (probabilistic) algorithm like Misra-Gries' Frequent algorithm:
- keep a hash table with at most 10 items at all times
- if the element is in the hash table, increment its count by 1
- if the element is not in the hash table:
-- if the hash table has fewer than 10 elements, insert the element with count 1
-- else decrement the count of every element in the hash table by 1, removing any element whose count reaches 0 or that falls outside the time window t

So you need 3 hash tables (for 1 hr, 1 day and 1 month).

Why does this work probabilistically? Decrementing every counter by 1 for each stream element knocks out the less frequent items, while the most frequent items survive.

- shanthikp September 15, 2018 | Flag Reply

Can an approximate solution be considered, or is an exact solution required?

- adr August 07, 2018 | Flag Reply

