Statistics

Before we see how a database optimizes a query we need to speak about statistics because without thema database is stupid. If you don’t tell the database to analyze its own data, it will not do it and it will make (very) bad assumptions.

But what kind of information does a database need?

I have to (briefly) talk about how databases and Operating systems store data. They’re using a minimum unit called apage or a block (4 or 8 kilobytes by default). This means that if you only need 1 Kbytes it will cost you one page anyway. If the page takes 8 Kbytes then you’ll waste 7 Kbytes.

Back to the statistics! When you ask a database to gather statistics, it computes values like:

  • The number of rows/pages in a table
  • For each column in a table:
    • distinct data values
    • the length of data values (min, max, average)
    • data range information (min, max, average)
  • Information on the indexes of the table.

These statistics will help the optimizer to estimate thedisk I/O, CPU and memory usages of the query.

The statistics for each column are very important. For example if a table PERSON needs to be joined on 2 columns: LAST_NAME, FIRST_NAME. With the statistics, the database knows that there are only 1 000 different values on FIRST_NAME and 1 000 000 different values on LAST_NAME. Therefore, the database will join the data on LAST_NAME, FIRST_NAME instead of FIRST_NAME,LAST_NAME because it produces way less comparisons since the LAST_NAME are unlikely to be the same so most of the time a comparison on the 2 (or 3) first characters of the LAST_NAME is enough.

But these are basic statistics. You can ask a database to compute advanced statistics called histograms. Histograms are statistics that inform about the distribution of the values inside the columns. For example

  • the most frequent values
  • the quantiles

These extra statistics will help the database to find an even better query plan. Especially for equality predicate (ex: WHERE AGE = 18 ) or range predicates (ex: WHERE AGE > 10 and AGE <40 ) because the database will have a better idea of the number rows concerned by these predicates (note: the technical word for this concept is selectivity).

The statistics are stored in the metadata of the database. For example you can see the statistics for the (non-partitioned) tables:

  • in USER/ALL/DBA_TABLES and USER/ALL/DBA_TAB_COLUMNS for Oracle
  • in SYSCAT. TABLES and SYSCAT.COLUMNS for DB2 .

The statistics have to be up to date. There is nothing worse than a database thinking a table has only 500 rows whereas it has 1 000 000 rows. The only drawback of the statistics is that it takes time to compute them. This is why they’re not automatically computed by default in most databases. It becomes difficult with millions of data to compute them. In this case, you can choose to compute only the basics statistics or to compute the stats on a sample of the database.

For example, when I was working on a project dealing with hundreds of millions rows in each tables, I chose to compute the statistics on only 10%, which led to a huge gain in time. For the story it turned out to be a bad decision because occasionally the 10% chosen by Oracle 10G for a specific column of a specific table were very different from the overall 100% (which is very unlikely to happen for a table with 100M rows). This wrong statistic led to a query taking occasionally 8 hours instead of 30 seconds; a nightmare to find the root cause. This example shows how important the statistics are.

Note: Of course, there are more advanced statistics specific for each database. If you want to know more, read the documentations of the databases. That being said, I’ve tried to understand how the statistics are used and the best official documentation I found was the one from PostgreSQL.

results matching ""

    No results matching ""