Google Interview Question
Software EngineersCountry: United States
Interview Type: In-Person
Let's say we are dealing with ASCII characters. Keep matrix 258x258. 256 for each possible character and 2 special ones: beginning of the string and the end of the string.
As you walk each string from beginning to the end, note every adjacent pair of characters, including the imaginary character before the string - beginning - and after - end.
Increment corresponding cell in the matrix [int representation of first char][int representation of second char]
Generating:
Look at the row corresponding to beginning symbol and chose a cell randomly based on its weight, which is a cell value divided by total in the row. Its column is next char.
Repeat until encounter end symbol.
String[] strings = { "Read a set of training strings", "For each character in any of the strings", "calculate the probability of seeing that character" };
int[] counts = new int[27];
int total = 0;
for (String s : strings) {
total += s.length();
for (int i=0; i<s.length(); i++) {
int index = (int)Character.toLowerCase(s.charAt(i))-97;
index = index < 0 ? 26 : index;
counts[index]++;
}
}
for (int i=1; i<counts.length; i++) counts[i] += counts[i-1];
int n = 25;
StringBuilder sb = new StringBuilder();
for (int i=0; i<n; i++) {
int r = (new Random()).nextInt(total);
for (int j=0; j<counts.length; j++) {
if (r > counts[j]) continue;
int value = j == 26 ? ' ' : j+97;
sb.append((char)(value));
break;
}
}
System.out.println(sb.toString());
- Keep a sorted list with the chars in frequency order and the number of occurrences
- Generate a random number between 0 and total number of characters
- Go through the list by decrementing the number above with the number of occurrences
- When negative - the current characters is the randomly generated one
C++:
#include <iostream>
#include <cstdlib>
using namespace std;
struct Node {
Node *next;
int data, count;
Node(int d, int c = 1) : data(d), count(c) {}
};
Node* addData(Node *list, int data) {
if (!list) {
return new Node(data);
}
Node *cur = list, *last = 0, *l=0;
while (cur) {
if (cur->data == data) {
cur->count++;
if (last && last->count < cur->count) {
int v = last->data; last->data = cur->data; cur->data = v;
v = last->count; last->count = cur->count; cur->count = v;
}
return list;
}
if (!last || last->count != cur->count) {
last = cur;
}
l = cur;
cur = cur->next;
}
// Not found - append
l->next = new Node(data);
return list;
}
char getRandChar(Node *list, int total) {
int r = rand()%total;
while (r >= list->count) {
r -= list->count;
list = list->next;
}
return (char)list->data;
}
int main() {
int n = 10, count = 0;
Node *list = 0;
string strings[] = {"akdjfsfdf", "sdfsdfsdf", "sdfsdfsdfdsf", "aaaasasasaa"};
for (string s : strings) {
for (char c : s) {
list = addData(list, c);
count++;
}
}
srand (time(NULL));
while (n--) {
cout << getRandChar(list, count);
}
}
I used an array of 256 strings to store the frequencies. When character C(cur) is followed by C(next), add C(next) to the string Array[C(cur)].
When creating the output for part 2, just choose a random character in the current string to find the next one.
void main(){
while (1){
std::string freqencies[256], input;
int cnt;
std::cout << "Enter a string\n";
std::getline(std::cin, input);
input += ' ';
std::cout << "Enter output count\n";
std::cin >> cnt; std::cin.ignore();
char cur = ' ';
for (char c : input){
freqencies[cur] += tolower(c);
cur = tolower(c);
}
cur = ' ';
for (int i = 0; i < cnt; i++) {
cur = freqencies[cur].at(rand() % freqencies[cur].size());
std::cout << cur;
}
std::cout << '\n';
}
}
output:
Enter a string
This is Great! Isn't it? I think it is a great solution! That guy tjcbs2 deserves an upvote for his amazing effort.
Enter output count
100
hisort. teamazis fonk in! hit gupvot ang in is sn is erthison't guy t t thazi upvortjcbs foresn
Hi, your solution is very elegant, but not scalable ... The total size of the 256 strings = the size of the input strings. And we may expect a lot of input strings, for the probabilities to be statistically significant. It makes more sense to store only the counters, or the probabilities as proposed by the other solutions.
class CharcaterProb {
static CharcaterProb[] container = new CharcaterProb[26];
static CharcaterProb getChar(char c) {
return container[c - 'a'];
}
static int count = 0;
char c;
int frequency;
CharcaterProb(char c) {
this.c = c;
container[c - 'a'] = this;
touch();
}
static void touch(char c) {
this.frequency ++;
count++;
}
}
void precalc(Stirng[] set) {
for (String entry: set) {
for (char c : entry.toCharArray()) {
if ( CharcaterProb.getCar(c) == null ) {
new CharacterProb(c);
} else {
CharcaterProb.getCar(c).touch();
}
}
}
}
generateRandomString(int n) {
char[] container = new char[n];
int l = 0;
for (char c = 'a'; c <= 'z'; c++) {
int freq = CharacterProb.getChar(c);
while (freq--) {
container[l++] = c;
}
}
char[] ans = new char[n];
for ( int i =0; i < n; i++ ) {
int rnd = rand.nextInt(0, CharcaterProb.count - 1);
ans[i] = container[rnd];
}
}
Construct a directed graph that starts with a space (" ") node. As you read through the training data, insert a new node for each new character (or update existing nodes). If you have the word "bat", then the graph would be " " -> "b" -> "a" -> "t" -> " ".
- JW March 26, 2015After you're done processing the training data, your graph will be a state machine where given any node/state, you have the probabilities to reach any other reachable states.
Ex. If you are at node corresponding to letter "c", and there are 2 links to "a", 3 links to "e", and 1 link to " ", then you know that the next letter will be P("a") = 2/6, P("e") = 3/6, P(" ") = 1/6. Calculate the one to choose, then move to it.
Note: If you have multiple links between two nodes, you wouldn't actually store the same links copied over and over. Instead store a count for each link.
This mechanism has the benefit of finding the next letter based on the current letter you're on rather than a straight up probability of overall letter occurrences, thereby better approximating the behaviour of the training data.