Goldman Sachs Interview Question for Software Engineer / Developers

0

of 0 votes

1
Answer
What is an index?
- M January 17, 2007 | Report Duplicate | Flag | PURGE
Goldman Sachs Software Engineer / Developer Database

Email me when people comment.

An error occurred in subscribing you.

Email me when people comment.

An error occurred in subscribing you.

Comment hidden because of low score. Click to expand.

of 1 vote

Database Indexes
*****************
Any type of data structure that allows for (potentially) faster access can be considered an index. Let’s look at some.
Hash Indexes

Take the same example from above, finding all people with a last name of ‘Smith.’ One solution would be to create a hash table. The keys of the hash would be based off of the last_name field and the values would be pointers to the database row.

This type of index is called, unsurprisingly, a “hash index.” Most databases support them but they’re generally not the default type. Why?

Well, consider a query like this: “Find all people who are younger than 45.” Hashes can deal with equality but not inequality. That is, given the hashes of two fields, there’s just no way for me to tell which is greater than the other, only whether they’re equal or not.

B-tree Indexes
***************

The data structure most commonly used for database indexes are B-trees, a specific kind of self-balancing tree. A picture’s worth a thousand words, so here’s an example. B-tree

The main benefit of a B-tree is that it allows logarithmic selections, insertions, and deletions in the worst case scenario. And unlike hash indexes it stores the data in an ordered way, allowing for faster row retrieval when the selection conditions include things like inequalities or prefixes.

For example, using the tree above, to get the records for all people younger than 13 requires looking at only the left branch of the tree root.

What is a b-tree?
*****************
In a tree, records are stored in locations called leaves. The starting point is called the root. The maximum number of children per node is called the order of the tree. The maximum number of access operations required to reach the desired leaf (data stored on the leaf) is called the depth (level). Oracle indexes are balanced b-trees; the order is the same at every node and the depth is the same for every leaf.

How does a b-tree help with database access?
*********************************************
Most indexes are too large to fit into memory, which means that they are going to be stored on disk. Since I/O is usually the most expensive thing you can do in a computer system, these indexes need to be stored in an I/O efficient way.

A b-tree is a good way to do this. If we make the nodes the size of a physical I/O block, it would take one I/O to move to a lower depth in the tree. In the example below, an index was created on a first name kind of field.

How good is the index?
**********************
Now back to the original point I was trying to make-- low cardinality fields make bad indexes. Why is this the case? The answer here is really about selectivity.

unique index values
selectivity = -----------------------
total number records

A primary key is highly selective. If there are 1000 rows, there will be 1000 unique keys in the index. Eacy unique key will return at most 1 row. The index will be 100% selective (1000/1000).. the best you can get.

Now let's say we have an index on a low cardinality thing like gender. If we had 1000 records, the selectivity is in the database is 2/1000 =.2%. Said in another way, 500 records come back per unique key (1000 records / 2 uniques).

Note: this seems to assume an even distribution of data (e.g. 500 male, 500 female). Things might be different if you had 999 males, and 1 female.

********
R-TREE
********

R-trees are tree data structures that are similar to B-trees, but are used for spatial access methods i.e., for indexing multi-dimensional information; for example, the (X, Y) coordinates of geographical data. A common real-world usage for an R-tree might be: "Find all museums within 2 kilometres (1.2 mi) of my current location".

The data structure splits space with hierarchically nested, and possibly overlapping, minimum bounding rectangles (MBRs, otherwise known as bounding boxes, i.e. "rectangle", what the "R" in R-tree stands for).

Each node of an R-tree has a variable number of entries (up to some pre-defined maximum). Each entry within a non-leaf node stores two pieces of data: a way of identifying a child node, and the bounding box of all entries within this child node.

The insertion and deletion algorithms use the bounding boxes from the nodes to ensure that "nearby" elements are placed in the same leaf node (in particular, a new element will go into the leaf node that requires the least enlargement in its bounding box). Each entry within a leaf node stores two pieces of information; a way of identifying the actual data element (which, alternatively, may be placed directly in the node), and the bounding box of the data element.

Similarly, the searching algorithms (for example; intersection, containment, nearest) use the bounding boxes to decide whether or not to search inside a child node. In this way, most of the nodes in the tree are never "touched" during a search. Like B-trees, this makes R-trees suitable for databases, where nodes can be paged to memory when needed.

Each page in an R-tree index structure is a physical disk page. The R-tree index is designed to minimize the number of pages that need to be fetched from disk during the execution of a query, since disk I/O is often the most costly part.

A bitmap index is a special kind of database index that uses bitmaps.
********************************************************************

Bitmap indexes have traditionally been considered to work well for data such as gender, which has a small number of distinct values, e.g., male and female, but many occurrences of those values. This would happen if, for example, you had gender data for each resident in a city. However, some researchers argue that Bitmap indexes are also useful for unique valued data which is not updated frequently.[1] Bitmap indexes have a significant space and performance advantage over other structures for such data. Bitmap indexes use bit arrays (commonly called "bitmaps") and answer queries by performing bitwise logical operations on these bitmaps.

- Anonymous July 03, 2009 | Flag Reply

CareerCup

Goldman Sachs Interview Question for Software Engineer / Developers

Books

Videos

Resume Review

Mock Interviews