Given a XML file with tags hav

Amazon Interview Question

0

of 0 votes

13
Answers
Given a XML file with tags having millions of entry. Design a suitable algorithm for searching all the entries pertaining to a particular tag in the file. Stress was on optimization.
- Nascent July 02, 2013 in India | Report Duplicate | Flag | PURGE
Amazon

Email me when people comment.

An error occurred in subscribing you.

Country: India

More Questions from This Interview

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 0 vote

@Nascent

Are the entries specified using an XPath?

- Murali Mohan July 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

Assume the file to be a static file with well formed XML tags. Suppose a tag is <Colour>Red</Colour>, then corresponding to this tage we should get all the values.

- Nascent July 02, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

OK. Since an XML document can be represented as a tree, we can do a level-by-level traversal using 2 queues(or stacks).

One queue would hold the parent nodes and as a parent node is de-queued, the other queue is populated with it's children. Once the queue holding the parents is emptied, it will start holding the children of the nodes from the queue holding the children.

While adding to the queue(or removing from it) if a node matches the given tag, print its value.

Continue this till the bottom level of the XML tree is reached. Time complexity is O(n) since each node is visited only once. Space complexity is also O(n) because of the additional queues

- Murali Mohan July 02, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

@Dumbo, Can u elaborate more about how u will construct the tree also how tag wise check will be applied while adding to the queue.

- Nascent July 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 2 votes

@Nascent

Your question provoked me to think further. At first, I thought of using a stack to build a tree - when you see a beginning tag push it on to the stack and when you see the ending tag, pop it out of the stack. Before pushing on to the stack, if the stack has already any elements present in it, the top most element would be the parent element of the current element. This method should build the tree.

However, it seems there is no need to build a tree at all. As mentioned in other posts, a SAX parser should do the job just fine. I too realized it after giving it a little more thought. Some SAX parsers, like that of PHP, provide a mechanism to invoke callback functions on entering and exiting a node/tag. A callback function that checks whether the current tag matches the given tag and prints the value should work fine.

- Murali Mohan July 03, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@Dumbo. Yeah the callback functions of the SAX parser should do it.

- subahjit July 03, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 votes

@dumbo. not only SAX parser , lot of parser available you can use but here question is how the parser logic implemented to do this task. i.e you want to make such parser then?

- bimlesh July 08, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Preparing DOM is only suitable for a complex structured XML, though it's a complex structure. Still I won't recommend to use DOM because DOM only useful when full fledged analysis is required for a XML.

Here I would I like to use SAX Parsing style from my own.
Suppose a given tag of which reference to be found from large XML file.
I like to split into multiple chunks as much as possible with standard numbers of line (say for example 512, 1024 .. may be some standard value) and assign those to a thread to analyze (assumed tags don't end in two lines instead one line)

Now each thread analysis code is like,

first letter of tag followed by < found then create another thread to get occurrence and original thread continues (from found index + tag length) with scanning and do the same thing if probable matches found.
A shared variable is updated across all thread(s) when a tag begins then increment and if tag ends then decrease.

It's a kind of map reduce kind of design, I think it could a better approach ;)

- Debasis Jana July 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 votes

I think finding a tag in a separate thread is not a good option.
Original thread should do this by first last search algo

- Debasis Jana July 03, 2013 | Flag

Comment hidden because of low score. Click to expand.

of 0 vote

Since the question says tags with millions of entries, it will not be possible to create a huge tree and store the entire XML file in-memory.
So may be reading the xml file in chunks will be a good option.
In the chunk read into memory, parse it token-wise and search for tags between <> brackets.
For the matching tag print all the tokens till the closing of the tag</>.
Now this single thread can be made into multi-threaded, with each thread assigned a different chunk of the file.

- subahjit July 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

As there are millions of entries, in memory search is out of question. SAX parsing is one way as Debasis mentioned.
Secondly in most XMLpayloads, tags are limited and they are repeated so we can safely assume #tags would be much less than # entries. We can index this file on basis of tag names. So another file is to prepared as a pre processing step in which every entry would be somewhat like this
<<tag name>> index1,index2,index3...........
These entries can further be sorted on basis of tag name so that binary search can be done for a given tag. once we reach on the record containing tag name, it is easy to search on all the mentioned indexes.

Search Time complexity would be O(log(n)*t * (string matching complexity))
n = number of tags
t = average number of indexes for a particular tag

PreProcessing complexity -
n = number of tags
O(nlog(n))

- arpit.gautam July 02, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

of 0 vote

doing a DFS over DOM tree will solve the problem. complexity will be O(N) and memory consumed will be O(depth). and you will be having reference to parent node also. so you just need to store the parent nodes which will be having child as a given tag .

- Sourabh July 03, 2013 | Flag Reply

Comment hidden because of low score. Click to expand.

CareerCup

Amazon Interview Question

Books

Videos

Resume Review

Mock Interviews