Incremental Statistics – A Mixed Blessing

Incremental statistics are an interesting feature of Oracle 11g to reduce the time for gathering global statistics. But are they a feasible option in large Data Warehouse environments?

Gathering optimizer statistics on a partitioned table usually needs a lot of time. To reduce this time, typically only a small percentage of the table and each partition is used for the calculation of the statistics. A value for the parameter estimate_percent of 1% or even less is common on partitioned tables. Additionally, the statistics can be gathered in parallel. For this, the parameter degree is available in dbms_stats. Let’s make a little test on a table with 5 partitions and 1 million rows:

(ownname          => USER     ,tabname          => 'PARTITION_TEST'       ,estimate_percent => 1       ,degree           => 8       ,granularity      => 'AUTO'); END; /
SELECT table_name, partition_name, num_rows, sample_size
FROM user_tab_statistics
WHERE table_name = 'PARTITION_TEST';
--------------- --------------- ---------- -----------
PARTITION_TEST                      989800        9898
PARTITION_TEST  PT_0                198651        5078
PARTITION_TEST  PT_1                200756        5034
PARTITION_TEST  PT_2                196224        5073
PARTITION_TEST  PT_3                201382        5017
PARTITION_TEST  PT_4                200884        4943

As you can see in the columns SAMPLE_SIZE of the query above, only a small percentage of the rows was used for the statistics calculation (1% on table level, 2.5% on partition level). This is feasible for large tables – and partitioned tables are typically large. But even with this small percentage, the statistics calculation would take a lot of time on a table with many partitions because the global statistics, i.e. the statistics on table level, requires a full table scan on all partitions. That’s why the following approach is used in many Data Warehouses: During the ETL jobs, only the statistics on partition level are calculated (with the parameter granularity set to ‘PARTITION’). The global statistics on table level are gathered asynchronous with an additional job, e.g. once a week or even less (with the parameter granularity set to ‘GLOBAL’). This helps to reduce the elapsed time for the ETL jobs dramatically, but an additional long-running job is needed to gather the global statistics periodically.

Oracle 11g allows to gather the global statistics “on the fly”: When statistics are gathered on partition level, additional information is stored in several synopsis tables in the SYSAUX tablespace. This information is used to calculate the statistics on table level without reading all other partitions of the table. These so called “incremental statistics” are a really cool feature for large tables. At least that’s what I thought before I wanted to use it in a customer project.

To enable incremental statistics, the following command must be executed once for the partitioned table:

(ownname => USER ,tabname => 'PARTITION_TEST' ,pname => 'incremental' ,pvalue => 'true'); END; /

After that, the statistics must be gathered with the parameter granularity set to ‘AUTO’ (or ‘GLOBAL AND PARTITION’). When we do this for our test table and check the result after the execution of the gather procedure used above, we will see a little surprise:

--------------- --------------- ---------- -----------
PARTITION_TEST                     1000000     1000000
PARTITION_TEST  PT_0                200000      200000
PARTITION_TEST  PT_1                200000      200000
PARTITION_TEST  PT_2                200000      200000
PARTITION_TEST  PT_3                200000      200000
PARTITION_TEST  PT_4                200000      200000

Although only 1% was specified for the parameter estimate_percent, dbms_stats used 100% of the rows for the statistic calculation! What does this mean? When we decide to enable incremental statistic to avoid an asynchronous job for gathering the global statistics, we need much more time during the ETL jobs because 100% of the current partitions have to be scanned to calculate the partition level statistics. The advantage of incremental statistics is to reduce the time for statistics calculation, and now we need more time than ever before.

The document “Collect incremental statistics for a large partitioned table in versions 10.2 and 11” (Doc ID 1319225.1) on “My Oracle Support” (aka “Metalink”) and the “Oracle Database PL/SQL Packages and Types Reference” of the Oracle 11.2 documentation describe the following preconditions for incremental statistics:

  • The INCREMENTAL value for the partitioned table is true.
  • The PUBLISH value for the partitioned table is true.
  • The user specifies AUTO_SAMPLE_SIZE for ESTIMATE_PERCENT and AUTO for GRANULARITY when gathering statistics on the table.

Other values for estimate_percent seem not to be supported. Another interesting information can be found in the document “Collect statistics for a large partitioned table takes a lot when incremental is used” (Doc ID 1302628.1 on “My Oracle Support”). A customer describes more or less the same situation as in our test case. But the problem from Oracle’s point of view seems to be different: “incremental is used even though user has specified an estimate_percent other than AUTO“. The workaround described in the document is: “turn off incremental if user wants to gather stats using sampling“.

In other words: We have to decide to either use a small sample size (which is the only feasible option in many DWH environments) or use incremental statistics. Both at the same time seems not to be possible. Perhaps in Oracle 12c?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s