Merikanto

一簫一劍平生意,負盡狂名十五年

Database Indices - How They Work

A database index is an auxiliary data structure, which allows for faster retrieval of data stored in the database. They are keyed off of a specific column so that queries like “Give me all people with a last name of ‘Smith’” are fast.

Reference:

Database Indices

B+ Trees

Composite Indices


Technically the size of an index is going to be proportional to the cardinality of the column being indexed.


Database tables, at least conceptually, look something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
id	age	last_name	hometown
-- -- -- --
1 10 Johnson San Francisco, CA
2 27 Smith San Joe, CA
3 15 Rose Palo Alto, CA
4 64 Farmer Mill Valley, CA
5 55 Pauling San Francisco, CA
6 17 Smith Oakland, CA
... ... ... ...
100 49 Meyer Berkeley, CA
101 30 Wayne Monterey, CA
102 18 Schwartz San Francisco, CA
104 6 Johnson San Francisco, CA
... ... ... ...
10000 41 Fetterman Mountain View, CA
10001 25 Breyer Redwood City, CA

That is, a table is a collection of tuples. If we have a file like this sitting on disk how do we get all records that have a last name of ‘Smith?’

The code would wind up looking something like this:

1
2
3
4
results = []
for row in rows:
if row[2] == 'Smith':
results.append[row]

Finding the appropriate records requires checking the conditions for each row. This is linear in the number of rows which, for many databases, could be millions or billions of rows. Bad news.

How can we make it faster?


Database Indexes

Any type of data structure that allows for faster access can be considered an index.


Hash Indexes

Take the same example from above, finding all people with a last name of ‘Smith.’ One solution would be to create a hash table. The keys of the hash would be based off of the last_name field and the values would be pointers to the database row.

This type of index is called, unsurprisingly, a “hash index“. Most databases support them but they’re generally not the default type. Why?

Well, consider a query like this: “Find all people who are younger than 45.” Hashes can deal with equality but not inequality. That is, given the hashes of two fields, there’s just no way for me to tell which is greater than the other, only whether they’re equal or not.


B-tree Indexes

The data structure most commonly used for database indexes are B-trees, a specific kind of self-balancing tree.

01

The main benefit of a B-tree is that it allows logarithmic selections, insertions, and deletions in the worst case scenario. And unlike hash indexes it stores the data in an ordered way, allowing for faster row retrieval when the selection conditions include things like inequalities or prefixes.

For example, using the tree above, to get the records for all people younger than 13 requires looking at only the left branch of the tree root.


Other Indexes

Hash indexes and B-tree indexes are the most common types of database indexes, but there are others, too. MySQL supports R-tree indexes, which are used to query spatial data, e.g., “Show me all cities within ten miles of San Francisco, CA.”

There are also bitmap indexes, which allow for almost instantaneous read operations but are expensive to change and take up a lot of space. They are best for columns which have only a few possible values.


Nuances

Performance

Indexes don’t come for free. What you gain for in retrieval speed you lose in insertion and deletion speed because every time you alter a table the indexes must be updated accordingly. If your table is updating frequently it’s possible that having indexes will cause overall performance of your database to suffer.

There is also a space penalty, as the indexes take up space in memory or on disk. A single index is smaller than the table because it doesn’t contain all the data, only pointers to the data, but in general the larger the table the larger the index.


Design

Nodes in a B-tree contain a value and a number of pointers to children nodes. For database indexes the “value” is really a pair of values: the indexed field and a pointer to a database row. That is, rather than storing the row data right in the index, you store a pointer to the row on disk.

For example, if we have an index on an age column, the value in the B-tree might be something like (34, 0x875900). 34 is the age and 0x875900 is a reference to the location of the data, rather than the data itself.

This often allows indexes to be stored in memory even for tables that are so large they can only be stored on disk.

Furthermore, B-tree indexes are typically designed so that each node takes up one disk block. This allows each node to be read in with a single disk operation.

Many databases use B+ trees rather than classic B-trees for generic database indexes. InnoDB’s BTREE index type is closer to a B+ tree than a B-tree, for example.


Summary

Database indexes are auxiliary data structures that allow for quicker retrieval of data. The most common type of index is a B-tree index because it has very good general performance characteristics and allows a wide range of comparisons, including both equality and inequalities.

The penalty for having a database index is the cost required to update the index, which must happen any time the table is altered. There is also a certain amount of space overhead, although indexes will be smaller than the table they index.

For specific data types, different indexes might be better suited than a B-tree. For example, R-trees allow quicker retrieval of spatial data. For fields with only a few possible values, bitmap indexes might be appropriate.