Thursday, November 24, 2011

Choose the Database Index: Selectivity and Cardinality

It is a known fact that most of the database performance tuning is achieved by building a suitable index. In most database engines, such as Oracle and MySQL, the most popular indexes types are either B*Tree or Bitmap. This article describes how to choose the appropriate index.

Determining the right index is all about cardinality and selectivity.

What is cardinality? Cardinality is the number of distinct values you have in a field. For example if  you have a table with a field Country, there are around 250 possible countries. Thus the cardinality of the Country field is 250.

What is selectivity? Selectivity is what we actually need to gauge the type of index to use. Selectivity can be determined by the following formula:

Selectivity =             Cardinality
                      --------------------
                      Total number of Rows



Taking the previous case, if I have 2500 customers, the selectivity is 250/2500 = 0.1%.

Selectivity is important because it helps us to determine whether to use a Balanced Tree (B*Tree) Index versus a Bitmap index. In general you have to follow these golden rules:

If Selectivity > 4%, use B*Tree index
If Selectivity < 4%, use Bitmap index

In Oracle one can do an audit of the currently used indexes and determine whether they are good or not by using the following query:


SELECT
  INDEX_NAME "NAME",
  DISTINCT_KEYS / NUM_ROWS * 100 "SELECTIVITY %",
  NUM_ROWS,
  DISTINCT_KEYS "DISTINCT",
  LEAF_BLOCKS,
  CLUSTERING_FACTOR,
  BLEVEL "LEVEL",
  AVG_LEAF_BLOCKS_PER_KEY "ALFBPKEY"
FROM
  DBA_INDEXES
WHERE
  DISTINCT_KEYS / NUM_ROWS < .1  AND
  NUM_ROWS > 0
ORDER BY "SELECTIVITY %" DESC;



This will list all indexes which have a selectivity lower than 0.1%. These indexes should be Bitmap, otherwise they are not used by the Cost Based Optimizer.

To determine the selectivity of  a particular index, one can use the following query:

SELECT
  INDEX_NAME "NAME",
  DISTINCT_KEYS / NUM_ROWS * 100 "SELECTIVITY %",
  NUM_ROWS,
  DISTINCT_KEYS "DISTINCT",
  LEAF_BLOCKS,
  CLUSTERING_FACTOR,
  BLEVEL "LEVEL",
  AVG_LEAF_BLOCKS_PER_KEY "ALFBPKEY"
FROM
  DBA_INDEXES
WHERE
 index_name = 'INX_TRAN_FT_EXT_DATE'
ORDER BY "SELECTIVITY %" DESC;



One might argue that the Selectivity model above holds only if there are duplicates within the user record set. Of course it is true for small tables - infact in small tables with low cardinality, one might not even use indexes at all. However one can also use the Birthday Paradox to further predict if there are going to be duplicates.

The Birthday Paradox states that for every 23 persons, a pair will share the same birthday, or written mathematically, there is a 99% probability that there is a duplicate if the following is satisfied:


N = 3.0 * sqrt (cardinality)



, where N is the position within the table which returns the first duplicate. For the example above, for a cardinality of 203, you are 99% certain that you will encounter a duplicate on 3.0 * sqrt (203) = 43rd record.This means that even in a table of 100 rows you will most likely have a duplicate even if the country set is 250!
Post a Comment