Amazon Interview Question
Country: India
Assume the file to be a static file with well formed XML tags. Suppose a tag is <Colour>Red</Colour>, then corresponding to this tage we should get all the values.
OK. Since an XML document can be represented as a tree, we can do a level-by-level traversal using 2 queues(or stacks).
One queue would hold the parent nodes and as a parent node is de-queued, the other queue is populated with it's children. Once the queue holding the parents is emptied, it will start holding the children of the nodes from the queue holding the children.
While adding to the queue(or removing from it) if a node matches the given tag, print its value.
Continue this till the bottom level of the XML tree is reached. Time complexity is O(n) since each node is visited only once. Space complexity is also O(n) because of the additional queues
@Dumbo, Can u elaborate more about how u will construct the tree also how tag wise check will be applied while adding to the queue.
@Nascent
Your question provoked me to think further. At first, I thought of using a stack to build a tree - when you see a beginning tag push it on to the stack and when you see the ending tag, pop it out of the stack. Before pushing on to the stack, if the stack has already any elements present in it, the top most element would be the parent element of the current element. This method should build the tree.
However, it seems there is no need to build a tree at all. As mentioned in other posts, a SAX parser should do the job just fine. I too realized it after giving it a little more thought. Some SAX parsers, like that of PHP, provide a mechanism to invoke callback functions on entering and exiting a node/tag. A callback function that checks whether the current tag matches the given tag and prints the value should work fine.
Preparing DOM is only suitable for a complex structured XML, though it's a complex structure. Still I won't recommend to use DOM because DOM only useful when full fledged analysis is required for a XML.
Here I would I like to use SAX Parsing style from my own.
Suppose a given tag of which reference to be found from large XML file.
I like to split into multiple chunks as much as possible with standard numbers of line (say for example 512, 1024 .. may be some standard value) and assign those to a thread to analyze (assumed tags don't end in two lines instead one line)
Now each thread analysis code is like,
first letter of tag followed by < found then create another thread to get occurrence and original thread continues (from found index + tag length) with scanning and do the same thing if probable matches found.
A shared variable is updated across all thread(s) when a tag begins then increment and if tag ends then decrease.
It's a kind of map reduce kind of design, I think it could a better approach ;)
Since the question says tags with millions of entries, it will not be possible to create a huge tree and store the entire XML file in-memory.
So may be reading the xml file in chunks will be a good option.
In the chunk read into memory, parse it token-wise and search for tags between <> brackets.
For the matching tag print all the tokens till the closing of the tag</>.
Now this single thread can be made into multi-threaded, with each thread assigned a different chunk of the file.
As there are millions of entries, in memory search is out of question. SAX parsing is one way as Debasis mentioned.
Secondly in most XMLpayloads, tags are limited and they are repeated so we can safely assume #tags would be much less than # entries. We can index this file on basis of tag names. So another file is to prepared as a pre processing step in which every entry would be somewhat like this
<<tag name>> index1,index2,index3...........
These entries can further be sorted on basis of tag name so that binary search can be done for a given tag. once we reach on the record containing tag name, it is easy to search on all the mentioned indexes.
Search Time complexity would be O(log(n)*t * (string matching complexity))
n = number of tags
t = average number of indexes for a particular tag
PreProcessing complexity -
n = number of tags
O(nlog(n))
@Nascent
- Murali Mohan July 02, 2013Are the entries specified using an XPath?