Google Interview Question
Country: United States
I agree with this one. There are several ways to solve this problem:
1. Count each character in S, then scan the dictionary and compare counts for each word. However, when the dictionary is big there will be a lot of repeated counting, e.g. the dictionary may contain both "cat" and "cats".
2. Building a trie is a good option, but not easy to pull off in an interview.
3. The problem statement clearly says S has only 4 characters, so generating permutations is not expensive at all. After that, all you need to do is build a hash map over all permutations and scan the dictionary.
This is not too hard (I think). First, we need to define the data structure of the dictionary in memory. We build a hash table of the dictionary where the key is the word's letters in sorted order and the value is a list of all words that can be formed from those letters,
i.e. 'dgo' is the key for the list of words 'dog' and 'god', and 'aet' is the key for the list of words 'eat', 'ate' and 'tea'.
Then, for the input string, we compute all permutations of its letters, and for each permutation we form all the ways of splitting it into groups,
i.e. abcd is one possible permutation, and we create every possible split such as {'a', 'bcd'}, {'a', 'b', 'cd'}, {'a', 'b', 'c', 'd'}, {'abc', 'd'}, {'a', 'bc', 'd'}. For each split, we use each element as a key to search our hash table dictionary.
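A minimal sketch of the sorted-key hash table (names like `buildIndex` and `findWords` are my own; it assumes lowercase words, and because the keys are sorted, no permutations are needed, only sub-multisets of the sorted input):

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

// Build the dictionary index: sorted letters -> all words with those letters.
std::map<std::string, std::vector<std::string>>
buildIndex(const std::vector<std::string>& dict) {
    std::map<std::string, std::vector<std::string>> index;
    for (const std::string& w : dict) {
        std::string key = w;
        std::sort(key.begin(), key.end());
        index[key].push_back(w);
    }
    return index;
}

// Look up every non-empty sub-multiset of s (a bitmask over its sorted
// letters) in the index.
std::vector<std::string>
findWords(const std::string& s,
          const std::map<std::string, std::vector<std::string>>& index) {
    std::string sorted = s;
    std::sort(sorted.begin(), sorted.end());
    std::set<std::string> out;  // a set: duplicate letters give duplicate masks
    int n = (int)sorted.size();
    for (int mask = 1; mask < (1 << n); ++mask) {
        std::string key;
        for (int i = 0; i < n; ++i)
            if (mask & (1 << i)) key += sorted[i];
        auto it = index.find(key);
        if (it != index.end())
            out.insert(it->second.begin(), it->second.end());
    }
    return std::vector<std::string>(out.begin(), out.end());
}
```

With the example dictionary, `findWords("ogeg", idx)` would yield "go", "egg" and "ego" but not "god" (no 'd' available).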
Nice idea as well. The only thing I don't like is creating every permutation of the string, as mentioned above. So far both the trie with sorted keys and the hash table with sorted keys seem like valid solutions to this problem.
I still have the impression that there must be something better.
hi autoboli:
Thanks for your comments. In fact, I thought about that, and I actually do not need to create permutations of the string. All I need is, for each input string, to create all possible subsets, including the empty set. But of course "set" is used loosely here: in the string 'aaab', the first 'a' is different from the second 'a', so I cannot reduce it to the set {'a', 'b'}; I still need to treat it as the multiset {'a', 'a', 'a', 'b'}.
Well, it is more complicated I think. How about this:
OFFLINE:
1) Take only <=4 char words from dictionary
2) Sort each word, keep unsorted word as well
3) Put it to the trie
If a char node in the trie is the final char of a word, it contains a reference to the list of unsorted dictionary words, e.g.
a-b-c --> {abc, cab, bac}
a-a-f-l --> {alfa, fala}
ONLINE:
1) Sort 's'
2) Search in trie and return corresponding list of words.
It should be O(1) time. Well... no, it is not gonna work :( The problem is that you would have to search for every sorted substring of 's'. Hence, it does not sound like an elegant idea...
If a word contains 2 a's, it can potentially be matched by 2, 3 or 4 a's, but not fewer than 2.
If a word contains 0 b's, it can potentially be matched by 0 b's, 1 b, etc. (Interesting observation - it cannot be matched by 4 b's, because that leaves no letters for the rest of the word. If it contains two other letters, it cannot be matched by 3 b's.)
Just thinking about intersecting sets with given properties to get a very small set to work with.
Seems pretty easy: just define an array to hold the counts of each letter in the string s (remember we can index into an array by converting an ASCII char to its int representation). Doing so requires a linear scan of s.
Now do the same for each word in the dictionary (you can even reuse the array as you go through the words), and compare the count of each character in the word against the count of that character in s. Specifically, every letter's count in the array for s must be at least its count in the array for the given word. If we define n as the total sum of the lengths of all the strings, the algorithm runs in O(n) time and O(1) space.
For the follow-up, simply write the results to a file once you hit the memory limit and free the corresponding memory; rinse and repeat.
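The counting comparison above can be sketched very compactly (`canForm` is my own name; one word is checked at a time, so the dictionary can be streamed rather than held in memory):

```cpp
#include <string>

// True if every letter of word occurs in s at least as often as in word.
bool canForm(const std::string& word, const std::string& s) {
    int counts[256] = {0};
    for (unsigned char c : s) counts[c]++;  // letter counts of s
    for (unsigned char c : word)
        if (--counts[c] < 0) return false;  // word needs more c's than s has
    return true;
}
```

For s = "ogeg", this accepts "go", "egg" and "ego" but rejects "god" and "ggg".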
1. First, filter out all words that can't be created using the characters of S, i.e. any word containing a character not in S: iterate over every word in the dictionary, check whether it contains only characters from S, and discard those that don't.
2. From the result of 1 create a Trie from all remaining words (the ones you can form from S)
3. Traverse the Trie printing all leafs(resulting words)
For English you have 26 files, each responsible for a particular character. If a word contains that character, it appears in that file: "good" appears in files g, o and d. Each record also has a field for how many times the character occurs in the word.
For "good": g - 1, o - 2, d - 1. Each file keeps its words in sorted order. So when you receive a request, you find the intersection of the files for the first two characters in O(n), and so on. You also check the count of each character in the words.
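A small in-memory sketch of this inverted index (names are my own; the per-character "files" are replaced by sorted vectors, which preserves the O(n) merge-intersection idea):

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// One "file" per character: the sorted list of words containing it.
std::map<char, std::vector<std::string>>
buildFiles(std::vector<std::string> dict) {
    std::sort(dict.begin(), dict.end());
    std::map<char, std::vector<std::string>> files;
    for (const std::string& w : dict)
        for (char c : std::set<char>(w.begin(), w.end()))
            files[c].push_back(w);  // each word listed once per distinct char
    return files;
}

// Merge-intersect two sorted word lists in linear time.
std::vector<std::string> intersect(const std::vector<std::string>& a,
                                   const std::vector<std::string>& b) {
    std::vector<std::string> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

For instance, intersecting the 'g' and 'o' lists of {"good", "go", "dog", "egg"} leaves {"dog", "go", "good"}; a count check would then filter further.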
I'll use Dictionary in a Hash table format where Keys are in alphabetically sorted order , and values are all possible dictionary strings that can be made from key:
example :
[e] => {}
[g] => {}
[eg] => {egg /* plus any other dictionary strings containing only the letters e and g */ }
[o] => {}
[go] => {go}
[ego] => {ego}
1. Now sort the string s /* s = "ogeg" */ and remove duplicates, hence s' = "ego", with character counts g = 2, e = 1, o = 1
2. generate all subsets of s' (each value of i is a bitmask choosing characters)
string permutations[];
for( i = 1 ; i < 2^(length(s')) ; i++ ){
    string temp = "";
    j = i;
    index = 0;
    while( j != 0 ){
        if( j & 1 ){ /* low bit set: include s'[index] */
            temp = s'[index] + "" + temp;
        }
        index++;
        j = j >> 1; /* advance to the next bit */
    }
    permutations[i] = temp;
}
3. for all permutations find all possible strings from dictionary , taking all permutations as dictionary key
we have variables : g = 2, e = 1, o = 1
- for all the permutations get hashtable[key] values
- loop through values , if isvalid() then print it.
isvalid(){
/* if character count in value = character count in permutation string */
}
I like this approach, but I have a huge improvement to your map. If the map used the counts of letters in 's' and mapped to a collection of valid words produced by those keys, the number of possible entries in the map would be significantly reduced and computational complexity of searching the map would be reduced from O(n!) to O(n^2) where n is the length of 's'. An off-the-cuff example of what I mean is something like
'e1g2' -> { 'egg', 'geg', 'gge' } //if all those strings are valid.
'e1g1o1' -> { 'ego', 'eog', 'geo', 'goe', 'oeg', 'oge' }
etc
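A quick sketch of building such a count-signature key (`signature` is my own helper; an ordered map keeps the letters sorted, so anagrams collapse to the same key):

```cpp
#include <map>
#include <string>

// Build a count signature like "e1g2" from a word's letters.
std::string signature(const std::string& word) {
    std::map<char, int> counts;  // ordered map: letters come out sorted
    for (char c : word) counts[c]++;
    std::string sig;
    for (const auto& kv : counts)
        sig += kv.first + std::to_string(kv.second);
    return sig;
}
```

Both 'egg' and 'geg' map to "e1g2", and 'ego' maps to "e1g1o1", matching the example keys above.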
@ct
I don't think so. I mean, you COULD do that, but it would then make it harder to generate the map because you'd have to relate all the collections that could be subsets.
@ct - i agree with zortlord!
I do have one more idea that will improve the efficiency of the algorithm:
instead of saving the related dictionary words in an array of strings against a hash key, we can form a trie!!
The hash key becomes a trie root, with the related dictionary strings as trie child nodes.
Hence, if we have 10 hash keys in the dictionary, we form one trie per key, i.e. 10 tries.
Assumptions:
1. This function is only going to be called once. If this were to be called multiple times, it should use preprocessing like building a trie
2. valid characters are ASCII (8 bit representation)
This function will operate in O(n) complexity and O(n) memory (only O(n) to store the results before returning them) where n is the number of words:
public static ArrayList<String> getValidWords(String str, String[] dict){
if(str == null || dict == null){
throw new NullPointerException();
}
int[] strSig = getSig(str);
ArrayList<String> results = new ArrayList<String>();
for(String word : dict){
if(validSig(strSig, word)){
results.add(word);
}
}
return results;
}
private static int[] getSig(String str){
int[] sig = new int[256];
for(char c : str.toCharArray()){
sig[cToI(c)]++;
}
return sig;
}
private static int cToI(char c){
return (int)c;
}
private static boolean validSig(int[] sig, String word){
if(word.length() > 4){
return false;
}
int[] usableSig = new int[256];
for(char c : word.toCharArray()){
int i = cToI(c);
if(usableSig[i] == sig[i]){
return false;
}
usableSig[i]++;
}
return true;
}
If the dictionary file cannot fit in memory, then I would stream the file and pull each individual word out for comparison. This change would only affect the first method, and only really around the for loop.
I like to do that just in case I decide to support limited alphabets (ie: only 26 letters vs the entire 256 set).
I am a bit confused by the if condition
if(usableSig[i] == sig[i]){
return false;
}
doesn't this mean that, for words "ant" and "apple", you return false right away on the first character and never check the rest?
And when you return true, you don't check anything else - just making it out of the for loop means the word can be formed from the given characters?
Please explain, I might be overlooking something here.
#include <stdio.h>
#include <string.h>

int main(void)
{
    char s[10];
    int i = 0;
    int n;
    char dictionary[10][20];
    printf("create a dictionary\n");
    while (i < 10 && scanf("%19s", dictionary[i]) != EOF)
    {
        i++;
    }
    n = i;
    printf("enter the string to be searched\n");
    scanf("%9s", s);
    /* print every dictionary word that starts with s */
    for (i = 0; i < n; i++)
    {
        if (strncmp(s, dictionary[i], strlen(s)) == 0)
        {
            printf("%s\n", dictionary[i]);
        }
    }
    return 0;
}
Assumption: the dictionary is a text file containing all words separated by spaces.
Complexity:
Can be loaded in memory : O(1)
Cannot be loaded in memory: O(n)
So, we'd just sort the input word and make sorted strings of all possible lengths, i.e. from 2 to 4, e.g. ogeg = oegg = oe, og, gg, oeg, ogg, egg. This can be done in constant time.
While building the dictionary from the text file, we can take each word, sort it and add it to a HashMap<String, ArrayList<String>> that stores all the words with a given set of letters. Since this is pre-processing, it can also be treated as constant. All you gotta do now, homie, is look up those sorted strings in the hashtable. So overall constant.
If we cannot load the dictionary into memory, the problem clearly becomes a string-matching problem, which can be solved in O(n) using the Z-algorithm. Just make up all the possible permutations (4! + 4C3*3! + 4C2*2!) and search 'em up!
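The "sorted strings of all possible lengths" step can be sketched like this (`sortedKeys` is my own helper; the set drops duplicates such as the two 'g's in "ogeg" producing the same key twice):

```cpp
#include <algorithm>
#include <set>
#include <string>

// All distinct non-empty sorted sub-multisets of s's letters.
std::set<std::string> sortedKeys(std::string s) {
    std::sort(s.begin(), s.end());
    std::set<std::string> keys;
    int n = (int)s.size();
    for (int mask = 1; mask < (1 << n); ++mask) {
        std::string k;
        for (int i = 0; i < n; ++i)
            if (mask & (1 << i)) k += s[i];  // indices ascend, so k stays sorted
        keys.insert(k);                      // set collapses duplicate masks
    }
    return keys;
}
```

For "ogeg" (sorted: "eggo") this yields 11 distinct keys, each of which is one HashMap lookup.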
Create sorted key to exact match map for keys length 1, 2, 3, 4.
For instance, aab={baa,aba} but not ba
Create a multi-root tree where 4-letter keys are roots, and up to 4 children are generated by removing one of the letters (could be fewer, because in aabc each a is removed only once) - and similarly for the other children, down to one-letter leaves.
Note that this is not really a tree but a graph, because abc is a child of both aabc and abcz, so the number of nodes will not exceed one million.
Walk it from the 4-letter key down to one letter and collect the set of words.
Why is it better than just computing all of the words for each of the 4-letter keys, including subsets? The thing is, the 3-, 2- and 1-letter keys are shared across paths, so there is less repetition. And they can be loaded/unloaded on demand, per memory requirements (for instance, unload the least frequently walked node). Note that the set of words for an exact 4-letter match will be small; it is the subsets that generate a lot, and they are shared.
Maybe this solution is too simple, but from the original text, I assume that:
- The alphabet can only have 26 characters (english lowercase)
- The dictionary is not sorted (so we will need at least O(n) to know its content).
- There is no relation between each dictionary word and the input sequence.
#include <list>
#include <string>
#include <iostream>
#include <fstream>
#include <stdint.h>
class WordPattern
{
public:
// here "useCountLimit" indicates that a dictionary word can not have more than the characters in the input
WordPattern (const std::string &s, bool useCountLimit) : _useCountLimit(useCountLimit)
{
for (int i(0); i < _alphabetSize; ++ i)
_charsCount[i] = 0;
int len = s.size();
for (int i(0); i < len; ++ i)
_charsCount[(s[i]-'a')] ++;
}
inline bool canFormWord (const std::string &s) const
{
uint8_t charsCount [_alphabetSize];
for (int i(0); i < _alphabetSize; ++ i)
charsCount[i] = _charsCount[i];
int len = s.size();
for (int i(0); i < len; ++ i)
{
uint8_t charVal (s[i]-'a');
if (_useCountLimit)
{
if (charsCount[charVal] > 0)
charsCount[charVal]--;
else
return false;
}
else
if (charsCount[charVal] == 0)
return false;
}
return true;
}
private:
static const uint8_t _alphabetSize = 26;
uint8_t _charsCount [_alphabetSize];
bool _useCountLimit;
};
void checkFromFile (const std::string &fileName, const std::string &input, bool useCountLimit = false)
{
WordPattern pattern (input, useCountLimit);
std::ifstream file (fileName.c_str(), std::ifstream::in);
for (std::string line; std::getline(file, line); )
{
if (pattern.canFormWord(line))
std::cout << line << std::endl;
}
file.close();
}
void checkFromList (const std::list<std::string> &words, const std::string &input, bool useCountLimit = false)
{
WordPattern pattern (input, useCountLimit);
std::list<std::string>::const_iterator it;
for (it = words.begin(); it != words.end(); it ++)
{
if (pattern.canFormWord(*it))
std::cout << (*it) << std::endl;
}
}
int main()
{
std::list <std::string> words;
words.push_back ("go");
words.push_back ("egg");
words.push_back ("ego");
words.push_back ("the");
words.push_back ("god");
words.push_back ("movie");
words.push_back ("google");
checkFromList (words, "ogel");
//checkFromFile ("dict.txt", "ogel");
return 0;
}
We can create a hash map of the letters of the original word and their counts, and then for each word in the dictionary check that it consists only of letters from this map, in no greater amounts. This takes constant time per letter.
#include <unordered_map>
#include <string>
#include <vector>
#include <iostream>
using namespace std;
class StringPermutation{
private:
unordered_map<char, int> origLetters;
StringPermutation(){};
public:
StringPermutation(string orig){
for (auto c: orig){
if (origLetters.find(c) != origLetters.end())
origLetters[c]++;
else
origLetters.insert(make_pair(c, 1));
}
}
bool check(string s)
{
unordered_map<char, int> lettersHere(origLetters);
for (auto c: s)
{
if (lettersHere.find(c) != lettersHere.end())
{
lettersHere[c]--;
if (lettersHere[c] == 0)
lettersHere.erase(c);
}else
return false;
}
return true;
}
};
int main()
{
vector<string> dict{"gg", "ego", "ed", "egg", "hhg"};
StringPermutation sp("ogeg");
for (auto s: dict)
{
if (sp.check(s))
cout << s << endl;
}
}
It can be interesting to note that there are only 14950 versions of sorted s: 26!/(4!22!). Knowing Google: no matter what your solution is, I am certain they would want to hear this before you even attempt to discuss the efficiency of your algorithm. Also, knowing them, they gave the 4 letters in the condition for a reason.
The only reason I am saying this: review and practice combinatorics in preparation for the interview. I think they want to see you approximate it on the whiteboard: 26!/(4!22!) = 26*25*23, roughly 30*25*20, which is easy to calculate in your head.
You have the number of possible words wrong, because letters can be repeated. The formula for this is (n + k - 1)!/((k - 1)!n!). In this case n = 4 and k = 26, so there are 29!/(25!4!) = 29*28*27*26/(4*3*2) = 23,751 words. Agree with everything else you said though!
Oops, the formula I gave above assumes the words have length of exactly 4. It didn't account for words of length 3, 2 or 1. If you do account for those there are 27,404 possibilities.
(First rule of correcting someone on the Internet: double check your correction!)
Oh, thanks. I always confuse the Pascal's triangle row with the row in the DP counting matrix. This is part of my practice to eventually stop confusing them.
BTW, the problem did say s is exactly 4 characters, no fewer.
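All three counts above are easy to verify in code (a small sketch; `binom` is my own helper, computing C(n, k) with the usual incremental product so intermediate values stay integral):

```cpp
// Binomial coefficient C(n, k); the number of multisets of size k drawn
// from an alphabet of 26 letters is C(26 + k - 1, k).
long long binom(int n, int k) {
    long long r = 1;
    for (int i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;  // exact: each prefix is C(n-k+i, i)
    return r;
}
```

This reproduces 14950 distinct-letter keys (C(26,4)), 23751 length-4 multisets (C(29,4)), and 27404 multisets of length 1 through 4.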
public static void printDicWords(String input, ArrayList<String> dic){
    if(input == null || dic == null) return;
    getCombination(dic, new StringBuffer(), input, new boolean[input.length()]);
}
public static void getCombination(ArrayList<String> dic, StringBuffer sb, String input, boolean[] used){
    for(int i = 0; i < input.length(); i++){
        if(used[i]) continue; // each position may be used at most once
        used[i] = true;
        sb.append(input.charAt(i));
        if(dic.contains(sb.toString())){
            System.out.println("dic from : " + sb.toString());
        }
        if(sb.length() < input.length()){
            getCombination(dic, sb, input, used);
        }
        sb.deleteCharAt(sb.length() - 1);
        used[i] = false;
    }
}
Some of the other answers here have mentioned building a trie. I think I'd build 2 of them: one out of all possible permutations of the 4 input letters, the other out of all dictionary words up to length 4. Assuming a 26-letter alphabet, there are up to 64 possible input words of length 1, 2, 3 or 4 (not all of them distinct if there are repeated letters), and up to 27,404 possible dictionary words of length 1, 2, 3 or 4. Both tries are small enough that they're guaranteed to fit in memory on any realistic device, regardless of whether the full dictionary is too large.
To get the matching words, you do a preorder traversal of both tries simultaneously, only moving to a child c if it's present in both, and yielding every word you reach.
Building the dictionary trie is O(D) where D is the number of words in the dictionary. We have to visit each word in the dictionary once to build the trie and, because we're only interested in words below a fixed length, insertion into the trie is O(1) for each word.
Building the input trie is O(1) - but with a large constant factor. The number of insert operations we perform is proportional to the number of permutations of the input letters, which is a constant; and each insert operation is O(1) too.
Likewise, matching the input trie against the dictionary trie is O(1) because the size of each trie has a constant upper bound and the time taken is proportional to the smaller of the two tries.
I've written some very elegant code demonstrating this solution to the problem, but unfortunately this margin is too small to contain it. :-)
Solution according to @deepanshu's suggested answer. Generates all permutations and then does binary search over chunks of the dictionary (default chunk size: 50 lines). The dictionary is assumed to contain only lowercase Latin words, sorted and without repetitions.
This solution is also convenient for a large dictionary (when the dictionary can't fit in memory) because of its buffering nature.
#include <fstream>
#include <algorithm>
#include <string>
#include <vector>
using namespace std;
const int buffer_size = 50;
vector<string> chck(const vector<string> &v, const vector<string> &n, int w){
int vsz = int(v.size());
int nsz = int(n.size());
vector<string> res;
if (vsz > 0) {
// as long as current needle is potentially in this segment, try to find it
while (w < nsz && n[w] <= v[vsz-1]) {
// do binary seach over partial dictionary:
// 1) since needle is lexicographically before last word, lower bound exists
// 2) assume no duplicate words in the dictionary
auto lb = lower_bound(v.begin(), v.end(), n[w]);
if (*lb == n[w]) {
res.push_back(n[w]);
}
// go to the next needle (skip repeated)
do {
++w;
} while (w < nsz && n[w-1] == n[w]);
}
}
return res;
}
int main() {
ifstream fin("dict");
ofstream fout("words.txt");
// sort letters
string letters = "ogeg";
int lsz = int(letters.size());
sort(letters.begin(), letters.end());
// generate words to be searched, (1) - init
vector<string> needles;
for (int i = 0; i < int(letters.size()); ++i) {
needles.push_back(string(1, letters[i]));
}
// generate words to be searched, (2) - BFS
// max. 64 words for 4 letters (there are repetitions if letters repeat)
for (int p = 0; p < int(needles.size()); ++p) {
string cn = needles[p];
int cnl = int(cn.size());
// mark all used letters
vector<bool> used(lsz, false);
for (int i = 0; i < cnl; ++i) {
for (int j = 0; j < lsz; ++j) {
if (!used[j] && cn[i] == letters[j]) {
used[j] = true;
break;
}
}
}
// create new needles by appending unused letters to the current needle
for (int j = 0; j < lsz; ++j) {
if (!used[j]) {
needles.push_back(cn + letters[j]);
}
}
}
// sort needles
sort(needles.begin(), needles.end());
// next needle to be searched for
int next_needle = 0;
while (true) {
vector<string> pdict;
string word;
int cnt = 0;
// read in maximum 50 lines and do binary search over them
for ( ; cnt < buffer_size && getline(fin, word); ++cnt) {
pdict.push_back(word);
}
vector<string> picked = chck(pdict, needles, next_needle);
for (auto it = picked.begin(); it != picked.end(); ++it) {
fout << *it << endl;
}
if (cnt < buffer_size) {
break;
}
}
}
Here is a question: why do we need to put the whole dictionary into memory? The goal is to output all words that can be formed from s. From the example, we know that the order doesn't matter. So we only need to construct a hash map and store all four characters of s in it, then read the dictionary from the stream and compare each word with the characters in the hash map. I used another hash map for the current word. Since there are at most four characters in both strings, we don't need much extra space, do we?
However, if the question is, how to store the output words in an efficient data structure, I guess a trie should be a good one.
public static void validWords(String s, String dict) throws FileNotFoundException, IOException{
BufferedReader br = new BufferedReader(new FileReader(dict));
String currLine;
Map<Character, Integer> sMap = new HashMap<Character, Integer>();
for(int i = 0; i < s.length(); i++){
char c = s.charAt(i);
if(!sMap.containsKey(c))
sMap.put(c, 1);
else
sMap.put(c, sMap.get(c) + 1);
}
Map<Character, Integer> word = new HashMap<Character, Integer>();
while((currLine = br.readLine()) != null){
if(currLine.length() > 4)
continue;
word.clear();
boolean valid = true;
for(int i = 0; i < currLine.length(); i++){
char c = currLine.charAt(i);
if(!sMap.containsKey(c)){
valid = false;
break;
}
if(!word.containsKey(c))
word.put(c, 1);
else
word.put(c, word.get(c) + 1);
if(word.get(c) > sMap.get(c)){
valid = false;
break;
}
}
if(valid)
System.out.println(currLine);
}
br.close();
}
What we can do is create permutations of all the strings which can be formed using the input string 'S':
- deepanshu, January 08, 2015
Let's say the string 's' is "ab"
Permutations:
a, b, ab, ba
Now just search for each permutation in the dictionary in log(n) time (binary search, since it's a dictionary and sorted); this takes (number of permutations)*log(n). If the length of s is fixed at 4, then there are only a constant number of permutations, and hence this runs in O(log n) time.
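This approach can be sketched as follows (`allArrangements` and `lookup` are my own helpers; the sorted dictionary is a plain vector, and std::binary_search plays the role of the log(n) lookup):

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// All distinct permutations of every non-empty subset of s's letters.
std::set<std::string> allArrangements(std::string s) {
    std::set<std::string> out;
    std::sort(s.begin(), s.end());
    int n = (int)s.size();
    for (int mask = 1; mask < (1 << n); ++mask) {
        std::string sub;
        for (int i = 0; i < n; ++i)
            if (mask & (1 << i)) sub += s[i];
        // sub is sorted, so next_permutation cycles through all orderings
        do {
            out.insert(sub);
        } while (std::next_permutation(sub.begin(), sub.end()));
    }
    return out;
}

// Binary-search each candidate in the sorted dictionary.
std::vector<std::string> lookup(const std::string& s,
                                std::vector<std::string> dict) {
    std::sort(dict.begin(), dict.end());
    std::vector<std::string> found;
    for (const std::string& w : allArrangements(s))
        if (std::binary_search(dict.begin(), dict.end(), w))
            found.push_back(w);
    return found;
}
```

For s = "ab" and a dictionary containing "ab", "ba", "b" and "cab", the candidates are {a, b, ab, ba} and the lookup returns "ab", "ba" and "b".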