Google Interview Question for Software Engineer / Developers


Country: United States





We can do this in two passes.
1. Create a new map with 'Content' as key and a list of 'Url's as value. Every time we find the same 'content', we add the URL to the list.

<html>a</html> => a.com, d.com, e.com
<html>b</html> => b.com
<html>c</html> => c.com

2. Now iterate over the map we just generated. Read the value lists: take one URL from each list, make it the key, and make the rest of the URLs the set of values.

a.com => d.com, e.com
b.com =>
c.com =>
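A minimal Python sketch of the two passes (the function and variable names are my own):

```python
def dedupe(table):
    # pass 1: content -> list of URLs serving that content
    by_content = {}
    for url, content in table.items():
        by_content.setdefault(content, []).append(url)
    # pass 2: the first URL of each list becomes the key, the rest are its duplicates
    return {urls[0]: urls[1:] for urls in by_content.values()}
```

Assuming the [Url, content] pairs are inserted in the order shown, this reproduces the output above.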

- bafna.iitr January 07, 2015

How about the constraint of the key being content? What if the page content is huge? Your table will consume a lot of memory and won't scale. I believe you should build some sort of hash value from the content to serve as the key; pages with exactly the same content will then always resolve to the same unique hash string.
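For example, a fixed-size digest can replace the raw content as the key; this is only a sketch, and `content_key` is a name I made up:

```python
import hashlib

def content_key(content):
    # a 64-hex-char digest, regardless of how large the page content is
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
```

Identical contents always map to the same key, so the grouping logic is unchanged; only the memory per key becomes constant.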

- guilhebl January 07, 2015

Content as a key: come on, this can be large and will require time proportional to its length for every hash operation; hashing will no longer be O(1) but the size of Tolstoy's War and Peace, if that is what the website is displaying.

How about a trie with every content trimmed after the first unique character? (The full content will need to be temporarily retrieved when inserting more URLs.)

- CT January 07, 2015

1. Use two tables:

a. [Url, set of duplicate Urls]
b. [contentHashStringKey, original Url]

2. For every new page:

fetch the page content as a plain string and apply a transformation function such as MD5 to create a key for table b.

3. Check whether that key is present in table b:

- if it is, get the original URL via the hash key and append the new URL to that URL's list of duplicates in table a

- if not, insert a new entry into both tables.
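A Python sketch of the two-table scheme above, using hashlib's MD5 (the function and table names are my own):

```python
import hashlib

def add_page(url, content, duplicates, originals):
    # duplicates: original URL -> list of duplicate URLs   (table a)
    # originals:  content hash -> original URL             (table b)
    key = hashlib.md5(content.encode("utf-8")).hexdigest()
    if key in originals:
        duplicates[originals[key]].append(url)
    else:
        originals[key] = url
        duplicates[url] = []
```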

- Anonymous January 07, 2015

You can use a trie: whenever you insert a content string, when it ends, mark the terminal node to keep track of which URL ended there.

Add all the contents to the trie, then traverse the trie and find the nodes where contents end. If two contents end at the same node, their URLs are duplicates.
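A small Python sketch of this idea, using nested dicts as trie nodes and a special `'$end'` key as the terminal marker (all names are mine):

```python
def build_trie(table):
    root = {}
    for url, content in table.items():
        node = root
        for ch in content:
            node = node.setdefault(ch, {})
        # terminal marker: every URL whose content ends at this node
        node.setdefault("$end", []).append(url)
    return root

def url_groups(node, out):
    # URLs listed together at one terminal node share identical content
    if "$end" in node:
        out.append(node["$end"])
    for ch, child in node.items():
        if ch != "$end":
            url_groups(child, out)
    return out
```

Any group with more than one URL is a set of duplicates.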

- whatever January 07, 2015

I didn't use LINQ, to keep the answer clean. With LINQ it could be even more concise...

using System.Collections.Generic;

namespace UrlGames
{
	public class DistinctContent
	{
		public Dictionary<string, IList<string>> GetDistinctContent(Dictionary<string, string> urlContentPairs)
		{
			var duplicates = new Dictionary<string, List<string>>();
			foreach(var item in urlContentPairs)
			{
				if (duplicates.ContainsKey(item.Value))
                    duplicates[item.Value].Add(item.Key);
				else
                    duplicates.Add(item.Value, new List<string>{item.Key});
			}
			
            var result = new Dictionary<string, IList<string>>();
			foreach(var item in duplicates)
			{
				var newKey = item.Value[0];
				item.Value.RemoveAt(0);
				result.Add(newKey, item.Value);
			}
			return result;
		}
	}
}

- YourThirdLife January 07, 2015

One can use a symbol table (map) with KEY = 'Url' and VAL = a list of 'content'. Given a new [Url, content] pair:

Check whether the KEY is present in the table:
a) Not present: create an empty VAL list, add the content to it, and put the KEY, VAL pair into the symbol table.
b) Present: check whether the VAL list contains the 'content'; if yes, do nothing; if no, add the 'content' to the VAL list.

Depending on how critical the application is, you can use different symbol-table implementations, e.g. a hash table (O(1) amortized, assuming a good hash function) or a red-black BST (O(log N) guaranteed).
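A literal Python sketch of the insert step as described (KEY = URL, VAL = list of contents; `insert` is my name for it):

```python
def insert(symtab, url, content):
    if url not in symtab:
        # a) key not present: start a new list holding this content
        symtab[url] = [content]
    elif content not in symtab[url]:
        # b) key present: add the content only if the list doesn't already hold it
        symtab[url].append(content)
```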

- autoboli January 07, 2015

How about O(n) time and O(n) memory:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

public static Map<String, Set<String>> dupesMap(Map<String, String> table){
	// content -> first URL seen with that content
	HashMap<String, String> keyMap = new HashMap<String, String>();
	// first URL -> set of duplicate URLs
	HashMap<String, Set<String>> resultsMap = new HashMap<String, Set<String>>();
	for(Entry<String, String> entry : table.entrySet()){
		String oldKey = entry.getKey();
		String oldVal = entry.getValue();
		String resultKey = keyMap.get(oldVal);
		if(resultKey == null){
			keyMap.put(oldVal, oldKey);
			resultsMap.put(oldKey, new HashSet<String>());
		}
		else{
			resultsMap.get(resultKey).add(oldKey);
		}
	}
	return resultsMap;
}

- zortlord January 07, 2015

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map.Entry;
import java.util.Stack;



public class distinct_content_extractor {
	
	private HashMap<String, String> original_content;
	
	public distinct_content_extractor(){
		original_content = new HashMap<String, String>();
	}
	
	
	
	public void generate_distinct_content(String url_adder, String content_adder){
		StringBuilder url_generator = new StringBuilder();
		StringBuilder content_generator = new StringBuilder();
		for(int i = 0; i < url_adder.length(); i++){
			url_generator.append(url_adder.charAt(i));
			url_generator.append(".com");
			content_generator.append("<html>");
			content_generator.append(content_adder.charAt(i));
			content_generator.append("</html>");
			original_content.put(url_generator.toString(), content_generator.toString());
			url_generator = new StringBuilder();
			content_generator = new StringBuilder();
		}
	}
	
	
	public void display_table_original(){
		System.out.println("ORIGINAL CONTENT");
		StringBuilder table_row = new StringBuilder();
		for (Entry<String, String> cell : original_content.entrySet()) {
			table_row.append(cell.getKey());
			table_row.append(" => ");
			table_row.append(cell.getValue());
			System.out.println(table_row.toString());
			table_row = new StringBuilder();
		}
	}
	
	public void display_table_distinct(HashMap<String, List<String>> distinct_table){
		System.out.println("DISTINCT CONTENT");
		StringBuilder row_sb = new StringBuilder();
		for(Entry<String,List<String>> row: distinct_table.entrySet()){
			row_sb.append(row.getKey());
			row_sb.append(" => [");
			for(int i = 0; i < row.getValue().size(); i++){
					if(i > 0){
						row_sb.append(", ");
					}
					row_sb.append(row.getValue().get(i));
			}
			row_sb.append("]");
			System.out.println(row_sb.toString());
			row_sb = new StringBuilder();
		}
	}
	
	
	public HashMap<String, List<String>> get_distinct_content_generator(HashMap<String, String> original){
		HashMap<String, List<String>> distinct_table = new HashMap<String, List<String>>();
		StringBuilder extractor = new StringBuilder();
		List<String> list_holder = new ArrayList<String>();
		String extractor_string = null;
		
		//For each cell 
		for(Entry<String,String> cell:original.entrySet()){
			extractor.append(cell.getValue().substring(6, cell.getValue().length()-7));
			extractor.append(".com");
			extractor_string = extractor.toString();
			if(cell.getKey().equals(extractor_string)){
				if(!(distinct_table.containsKey(extractor_string))){
					distinct_table.put(extractor_string, list_holder);
				}
			}else{
				if(distinct_table.containsKey(extractor_string)){
					distinct_table.get(extractor_string).add(cell.getKey());
				}else{
					list_holder.add(cell.getKey());
					distinct_table.put(extractor_string, list_holder);
				}
			}
			extractor = new StringBuilder();
			extractor_string = null;
			list_holder = new ArrayList<String>();
		}
		return distinct_table;
	}
	public static void main(String[] args) {
		String url_adder = "abcde";
		String content_adder = "abcaa";
		distinct_content_extractor d = new distinct_content_extractor();
		d.generate_distinct_content(url_adder, content_adder);
		d.display_table_original();
		HashMap<String, List<String>> distinct_content = d.get_distinct_content_generator(d.original_content);
		d.display_table_distinct(distinct_content);
	}

}

- GivingItAShot January 07, 2015

/* package whatever; // don't place package name! */

import java.util.*;
import java.lang.*;
import java.io.*;

/* Name of the class has to be "Main" only if the class is public. */
class Ideone
{
	
	public static Map<String,List<String>> generateDuplicateUrls(Map<String,String> table)
	{
		Map<String,List<String>> duplicateEntries = new HashMap<String,List<String>>();
		
		for(Map.Entry<String, String> entry : table.entrySet())
		{
			String val = entry.getValue();
			if(!duplicateEntries.containsKey(val))
			{
				duplicateEntries.put(val,new ArrayList<String>());
			}
			else
			{
				List<String> dupUrls = duplicateEntries.get(val);
				dupUrls.add(entry.getKey());
			}
		}
		return duplicateEntries;
	}
	public static void main (String[] args) throws java.lang.Exception
	{
		// your code goes here
		Map<String,String> table = new HashMap<String,String>();
		table.put("a.com","a");
		table.put("b.com","b");
		table.put("c.com","c");
		table.put("d.com","a");
		table.put("e.com","a");
		
		Map<String,List<String>> duplicate = generateDuplicateUrls(table);
		
		System.out.println(duplicate.toString());
	}
}

- Himanshu Jain January 08, 2015

public HashMap<String, List<String>> rebuild(HashMap<String, String> map) {
		HashMap<String, String> found = new HashMap<>();       // content -> first URL seen
		HashMap<String, List<String>> result = new HashMap<>();
		for (Map.Entry<String, String> entry : map.entrySet()) {
			if (!found.containsKey(entry.getValue())) {
				// first time we see this content: record this URL as the original
				found.put(entry.getValue(), entry.getKey());
				result.put(entry.getKey(), new ArrayList<String>());
			} else {
				// duplicate content: append this URL to the original's list
				result.get(found.get(entry.getValue())).add(entry.getKey());
			}
		}
		return result;
}

- Anonymous January 08, 2015

URLs as well as content can be rather big (of course the content will be much bigger), but in this implementation I avoid full-content comparisons as much as possible by using a checksum.
For every new [Url, Content] pair I compute a simple checksum; only if that checksum has been seen before do I start comparing the new content with the stored contents. Using a stronger checksum further reduces the number of content comparisons required.

#include <map>
#include <list>
#include <string>
#include <iostream>

struct Data
{
	Data (const std::string &u, const std::string &c) : content(c), urls(1, u){}
	Data (const Data &another) : content(another.content), urls(another.urls){}
	
	std::string content;
	std::list <std::string> urls;
};

unsigned int contentChecksum (const std::string &content) // fletcher-32 (for simplicity)
{
	unsigned int sum1(0), sum2(0);
	unsigned short word(0);
	
	unsigned long contentSize (content.size());
	for (unsigned long i(0); i < contentSize; i += 2)
	{
		word = ((unsigned char)content[i]) << 8;
		if ((i+1) < contentSize)
			word += ((unsigned char)content[i + 1]);
		
		sum1 = (sum1 + word) % 65536;
		sum2 = (sum2 + sum1) % 65536;
	}
	return (sum2 << 16) | sum1;
}

void filterDuplicates (const std::map <std::string, std::string> &inputMap, 
					   std::map <std::string, std::list<std::string> > &filteredUrls)
{
	std::map <unsigned int, std::list<Data> > filteredMap;
	std::map <unsigned int, std::list<Data> >::iterator filteredIt;
	std::list<Data>::iterator contentIt;
	
	std::map <std::string, std::string>::const_iterator mapIt;
	for (mapIt = inputMap.begin(); mapIt != inputMap.end(); mapIt ++)
	{
		unsigned int checksum = contentChecksum (mapIt->second);
		filteredIt = filteredMap.find (checksum);
		
		if (filteredIt != filteredMap.end()) // we need to check the full content 
		{
			bool notExisting (true);
			for (contentIt = filteredIt->second.begin(); 
				 notExisting && (contentIt != filteredIt->second.end()); contentIt ++)
			{
				if (mapIt->second == (*contentIt).content)
				{
					(*contentIt).urls.push_back (mapIt->first);
					notExisting = false; // the content was found
				}
			}
			
			if (notExisting)
				filteredIt->second.push_back (Data(mapIt->first, mapIt->second));
		}
		else
		{
			filteredMap[checksum] = std::list<Data>(1, Data(mapIt->first, mapIt->second));
		}
	}
	
	for (filteredIt = filteredMap.begin(); filteredIt != filteredMap.end(); filteredIt ++)
		for (contentIt = filteredIt->second.begin(); contentIt != filteredIt->second.end(); contentIt ++)
		{
			std::string firstUrl = (*contentIt).urls.front();
			(*contentIt).urls.pop_front();			
			filteredUrls[firstUrl] = (*contentIt).urls;
		}
}

int main()
{
	std::map <std::string, std::string> inputMap;
	inputMap["a.com"] = "<html>a</html>";
	inputMap["b.com"] = "<html>b</html>";
	inputMap["c.com"] = "<html>c</html>";
	inputMap["d.com"] = "<html>a</html>";
	inputMap["e.com"] = "<html>a</html>";

	std::map <std::string, std::list<std::string> > filteredUrls;
	filterDuplicates (inputMap, filteredUrls);
	
	std::map <std::string, std::list<std::string> >::iterator mapIt;
	for (mapIt = filteredUrls.begin(); mapIt != filteredUrls.end(); mapIt ++)
	{
		std::cout << mapIt->first << " => [";
		std::list<std::string> & tempList = mapIt->second;
		std::list<std::string>::iterator listIt;
		
		for (listIt = tempList.begin(); listIt != tempList.end(); listIt ++)
		{
			if (listIt != tempList.begin())
				std::cout << ", ";
			std::cout << *listIt;
		}
		std::cout << "]" << std::endl;
	}
	
	return 0;
}

- Anonymous January 09, 2015

f = open('input', 'r')

content = {}

# parsing the input file; lines are assumed to look like: a.com => <html>a</html>
for line in f:
    print line
    value = line.split(' ')[0]
    key = line.strip().split(' ')[2].replace('<html>', "").replace('</html>', "")

    if key in content:
        content[key].append(value)
    else:
        content[key] = [value]


for element in content:
    if len(content[element]) == 1:
        print str(content[element][0]) + " => []"
    else:
        print str(content[element][0]) + " => " + str(content[element][1:])

- Mar January 19, 2015

import java.util.ArrayList;
import java.util.Hashtable;


public class Main {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		
		Hashtable<String, String> urlsContents = new Hashtable<String,String>();
		urlsContents.put("a.com", "<html>a</html>");
		urlsContents.put("b.com", "<html>b</html>");
		urlsContents.put("c.com", "<html>c</html>");
		urlsContents.put("d.com", "<html>b</html>");
		urlsContents.put("e.com", "<html>a</html>");
		urlsContents.put("f.com", "<html>a</html>");
		
		System.out.println(range(urlsContents));
	}
	
	public static Hashtable<String, ArrayList<String>> range(Hashtable<String, String> urlsContents)
	{
			Hashtable<String, ArrayList<String>> map = new Hashtable<String, ArrayList<String>>();

			while(urlsContents.size() != 0)
			{
				// keySet() and values() of the same Hashtable iterate in the same order,
				// so the indices of keys and values correspond
				ArrayList<String> keys = new ArrayList<String>(urlsContents.keySet());
				ArrayList<String> values = new ArrayList<String>(urlsContents.values());
				ArrayList<String> sameUrls = new ArrayList<String>();
				map.put(keys.get(0), sameUrls);

				for(int i = 1; i < keys.size(); i++)
				{
					if(values.get(i).equals(values.get(0)))
					{
						sameUrls.add(keys.get(i));
						urlsContents.remove(keys.get(i));
					}
				}
				urlsContents.remove(keys.get(0));
			}

			return map;
	}

}

- Slaheddine January 22, 2015

