Facebook Interview Question
Software EngineersCountry: United States
//Levenshtein Algo
public static void main(String [] args) {
String a [] = {"tasneem@gmail.com", "tas@hotmail.com", "abc@gmail.com", "abc@hotmail.com", "apc@yahoo.com"};
HashMap<Integer, ArrayList<String>> similarElementsMap = new HashMap<>();
for (int i = 0 ; i < a.length - 1; i ++) {
int d = distance(a[i], a[i+1]);
if (similarElementsMap.containsKey(d)) {
ArrayList<String> list = similarElementsMap.get(d);
list.add(a[i]);
list.add(a[i+1]);
} else {
ArrayList<String> list = new ArrayList<>();
list.add(a[i]);
list.add(a[i+1]);
similarElementsMap.put(d, list);
}
}
for (Entry<Integer, ArrayList<String>> map : similarElementsMap.entrySet()) {
System.out.println("For distance : "+map.getKey());
for (int i = 0 ; i < map.getValue().size(); i ++) {
System.out.println(map.getValue().get(i));
}
}
}
private static int distance(String a, String b) {
a = a.substring(0, a.indexOf('@')).toLowerCase();
b = b.substring(0, b.indexOf('@')).toLowerCase();
// base case
if (a.length() == 0) return b.length();
if (b.length() == 0)return a.length();
int cost[] = new int[b.length() + 1];
// Initialise cost with each indexes initialy
for (int i = 0 ; i < b.length(); i ++) {
cost[i] = i;
}
// Iterate through a and b string and update cost array with each a iteration
for (int i = 1; i <= a.length(); i ++) {
cost[0] = i;
int k = i - 1;
for (int j =1; j <= b.length(); j ++) {
int c = Math.min(1 + Math.min(cost[j], cost[j-1]), a.charAt(i-1) == b.charAt(j-1) ? k : k + 1);
// Swap the min value with cost and k
k = cost[j];
cost[j] = c;
}
}
return cost[b.length()];
}
Weighted Quick Union + Path Compression
Complexity: N + M lg* N. M unions (or similarities) on N Emails.
We can use the Levenshtein distance or edit distance algorithm. More the levenshtein distance, more dissimilar the email IDs are. We have to make sure, we do not compare the TLDs i.e. ignore @gmail.com and @hotmail.com
For instance, "hello" and "jello" have LD as 1. So, quite similar. However, "good" and "goodbye" will have LD as 3, so slight dissimilar.
//Levenshtein Algo
public static void main(String [] args) {
String a [] = {"tasneem@gmail.com", "tas@hotmail.com", "abc@gmail.com", "abc@hotmail.com", "apc@yahoo.com"};
HashMap<Integer, ArrayList<String>> similarElementsMap = new HashMap<>();
for (int i = 0 ; i < a.length - 1; i ++) {
int d = distance(a[i], a[i+1]);
if (similarElementsMap.containsKey(d)) {
ArrayList<String> list = similarElementsMap.get(d);
list.add(a[i]);
list.add(a[i+1]);
} else {
ArrayList<String> list = new ArrayList<>();
list.add(a[i]);
list.add(a[i+1]);
similarElementsMap.put(d, list);
}
}
for (Entry<Integer, ArrayList<String>> map : similarElementsMap.entrySet()) {
System.out.println("For distance : "+map.getKey());
for (int i = 0 ; i < map.getValue().size(); i ++) {
System.out.println(map.getValue().get(i));
}
}
}
private static int distance(String a, String b) {
a = a.substring(0, a.indexOf('@')).toLowerCase();
b = b.substring(0, b.indexOf('@')).toLowerCase();
// base case
if (a.length() == 0) return b.length();
if (b.length() == 0)return a.length();
int cost[] = new int[b.length() + 1];
// Initialise cost with each indexes initialy
for (int i = 0 ; i < b.length(); i ++) {
cost[i] = i;
}
// Iterate through a and b string and update cost array with each a iteration
for (int i = 1; i <= a.length(); i ++) {
cost[0] = i;
int k = i - 1;
for (int j =1; j <= b.length(); j ++) {
int c = Math.min(1 + Math.min(cost[j], cost[j-1]), a.charAt(i-1) == b.charAt(j-1) ? k : k + 1);
// Swap the min value with cost and k
k = cost[j];
cost[j] = c;
}
}
return cost[b.length()];
}
Given people already are massively overthought this - and they did not think what should be thought about : en.wikipedia.org/wiki/Similarity_measure.
Take a pause and think this.
Is it transitive? That is :
A similar_to B and B similar_to C ==>(implies) A similar_to C
does it hold true? 99.999% of the real practical cases, it won't be.
Take name for example.
ram : ok, male
rama : ok, male again perhaps
ramaa : ambiguous - perhaps it is a female name ? Now thinking backward,
ramaa --> rama :: hmm. See the problem there?
what about roma? ramaa and roma are close but is ram and roma close? Well.
That... pose a problem.
How did we get it? By working in system - where close name matching may point to same individual - facebook is another classic example. Same problem persists in the Indian addresses.
Thus, we need to ask, is the similarity function transitive? That would have clear impact on any algorithm we try to apply.
I assumed that traistive rule doesnt work in this case (As NoOne mentioned it earlier). So that means I have to compare each emailaddress for similarity. so the brute force takes the complexity of O(n^2)
Here is a sample code, Is there any better solution
List<HashSet<string>> allSets = new List<HashSet<string>>();
for(int i = 0; i < emailIds.Length; ++i)
{
HashSet<string> currentSet = new HashSet<string>();
currentSet.Add(emailIds[i]);
for (int j = 0; j < emailIds.Length; ++j)
{
if(i == j)
{
//same check
continue;
}
if(true == IsSimilar(emailIds[i], emailIds[j]))
{
currentSet.Add(emailIds[j]);
}
}
if(currentSet.Count == 1)
{
// there is no similar email id to this, so let just remove it
currentSet.Remove(emailIds[i]);
}
allSets.Add(currentSet);
}
Let's suppose the interviewer says it /is/ transitive, i.e. A=B, B=C --> A=C. And let's say they only want sets with multiple emails, i.e. no singleton sets need be kept.
LinkedList<String> input = ...;
List<HashSet<String>> output = new LinkedList<>();
while (!input.isEmpty()) {
// pop one item
String current = input.remove(0);// (if array list, take last instead)
HashSet<String> equal = new HashSet<>();// (or ArrayList)
// transfer any similar emails
Iterator<String> it = input.iterator();
while (it.hasNext()) {
String s = it.next();
if (isSimilar(current,s)) {
equal.add(s);
it.remove();// fast for LinkedList (otherwise careful)
}
}
// only output non-singleton sets
if (!equal.isEmpty()) {
equal.add(current);//sic: add only if not empty
output.add(equal);
}
}
Let's suppose the interviewer says it /is/ transitive, i.e. A=B, B=C --> A=C. And let's say they only want sets with multiple emails, i.e. no singleton sets need be kept.
LinkedList<String> input = ...;
List<HashSet<String>> output = new LinkedList<>();
while (!input.isEmpty()) {
// pop one item
String current = input.remove(0);// (if array list, take last instead)
HashSet<String> equal = new HashSet<>();// (or ArrayList)
// transfer any similar emails
Iterator<String> it = input.iterator();
while (it.hasNext()) {
String s = it.next();
if (isSimilar(current,s)) {
equal.add(s);
it.remove();// fast for LinkedList
}
}
// only output non-singleton sets
if (!equal.isEmpty()) {
equal.add(current);//sic: add only if not empty
output.add(equal);
}
}
object Solution {
import scala.collection._
type Email = String
def readInput(): List[Email] = {
val sc = new java.util.Scanner (System.in)
var n = sc.nextInt()
var emails = new Array[String](n)
for(i <- 0 until n) {
emails(i) = sc.next()
}
emails.toList
}
def isSimilar(email1: Email, email2: Email): Boolean = {
val maxDistance = 2
def countify(str: String): Map[Char, Int] = {
str.groupBy(identity).mapValues(_.length)
}
def distance(str1: String, str2: String): Int = {
val (c1, c2) = (countify(str1), countify(str2))
val (s1, s2) = (c1.keySet, c2.keySet)
val all = (s1 | s2).toList
all.map(x => math.abs(c1.getOrElse(x, 0) - c2.getOrElse(x, 0))).sum / 2
}
distance(email1, email2) <= maxDistance
}
def groupBySimilarity(emails: List[Email], similarFunc: (Email, Email) => Boolean): List[Set[Email]] = {
val res = emails.foldLeft(Map.empty[Email, mutable.Set[Email]])( (map, email) => {
val key = map.keys.filter(k => similarFunc(k, email)).headOption
if (key.isDefined) {
map(key.get) += email
map
} else {
val set = new mutable.HashSet[Email]()
set += email
map + (email -> set)
}
})
res.mapValues(_.toSet).values.toList
}
def main(args: Array[String]) {
val emails = readInput()
val grouped = groupBySimilarity(emails, isSimilar)
grouped.foreach(g => println(g.mkString(" ")))
}
}
I think y'all are overthinking this; it says the similarity function is a given. This question is about how you manage building the similarity dictionary (whoops I gave it away).
- Honcho March 29, 2017