Incremental statistics are an interesting feature of Oracle 11g to reduce the time for gathering global statistics. But are they a feasible option in large Data Warehouse environments?
Gathering optimizer statistics on a partitioned table usually needs a lot of time. To reduce this time, typically only a small percentage of the table and each partition is used for the calculation of the statistics. A value for the parameter estimate_percent of 1% or even less is common on partitioned tables. Additionally, the statistics can be gathered in parallel. For this, the parameter degree is available in dbms_stats. Let’s make a little test on a table with 5 partitions and 1 million rows:
(ownname => USER ,tabname => 'PARTITION_TEST' ,estimate_percent => 1 ,degree => 8 ,granularity => 'AUTO'); END; /
SELECT table_name, partition_name, num_rows, sample_size FROM user_tab_statistics WHERE table_name = 'PARTITION_TEST';
TABLE_NAME PARTITION_NAME NUM_ROWS SAMPLE_SIZE --------------- --------------- ---------- ----------- PARTITION_TEST 989800 9898 PARTITION_TEST PT_0 198651 5078 PARTITION_TEST PT_1 200756 5034 PARTITION_TEST PT_2 196224 5073 PARTITION_TEST PT_3 201382 5017 PARTITION_TEST PT_4 200884 4943
As you can see in the columns SAMPLE_SIZE of the query above, only a small percentage of the rows was used for the statistics calculation (1% on table level, 2.5% on partition level). This is feasible for large tables – and partitioned tables are typically large. But even with this small percentage, the statistics calculation would take a lot of time on a table with many partitions because the global statistics, i.e. the statistics on table level, requires a full table scan on all partitions. That’s why the following approach is used in many Data Warehouses: During the ETL jobs, only the statistics on partition level are calculated (with the parameter granularity set to ‘PARTITION’). The global statistics on table level are gathered asynchronous with an additional job, e.g. once a week or even less (with the parameter granularity set to ‘GLOBAL’). This helps to reduce the elapsed time for the ETL jobs dramatically, but an additional long-running job is needed to gather the global statistics periodically.
Oracle 11g allows to gather the global statistics “on the fly”: When statistics are gathered on partition level, additional information is stored in several synopsis tables in the SYSAUX tablespace. This information is used to calculate the statistics on table level without reading all other partitions of the table. These so called “incremental statistics” are a really cool feature for large tables. At least that’s what I thought before I wanted to use it in a customer project.
To enable incremental statistics, the following command must be executed once for the partitioned table:
(ownname => USER ,tabname => 'PARTITION_TEST' ,pname => 'incremental' ,pvalue => 'true'); END; /
After that, the statistics must be gathered with the parameter granularity set to ‘AUTO’ (or ‘GLOBAL AND PARTITION’). When we do this for our test table and check the result after the execution of the gather procedure used above, we will see a little surprise:
TABLE_NAME PARTITION_NAME NUM_ROWS SAMPLE_SIZE --------------- --------------- ---------- ----------- PARTITION_TEST 1000000 1000000 PARTITION_TEST PT_0 200000 200000 PARTITION_TEST PT_1 200000 200000 PARTITION_TEST PT_2 200000 200000 PARTITION_TEST PT_3 200000 200000 PARTITION_TEST PT_4 200000 200000
Although only 1% was specified for the parameter estimate_percent, dbms_stats used 100% of the rows for the statistic calculation! What does this mean? When we decide to enable incremental statistic to avoid an asynchronous job for gathering the global statistics, we need much more time during the ETL jobs because 100% of the current partitions have to be scanned to calculate the partition level statistics. The advantage of incremental statistics is to reduce the time for statistics calculation, and now we need more time than ever before.
The document “Collect incremental statistics for a large partitioned table in versions 10.2 and 11” (Doc ID 1319225.1) on “My Oracle Support” (aka “Metalink”) and the “Oracle Database PL/SQL Packages and Types Reference” of the Oracle 11.2 documentation describe the following preconditions for incremental statistics:
- The INCREMENTAL value for the partitioned table is true.
- The PUBLISH value for the partitioned table is true.
- The user specifies AUTO_SAMPLE_SIZE for ESTIMATE_PERCENT and AUTO for GRANULARITY when gathering statistics on the table.
Other values for estimate_percent seem not to be supported. Another interesting information can be found in the document “Collect statistics for a large partitioned table takes a lot when incremental is used” (Doc ID 1302628.1 on “My Oracle Support”). A customer describes more or less the same situation as in our test case. But the problem from Oracle’s point of view seems to be different: “incremental is used even though user has specified an estimate_percent other than AUTO“. The workaround described in the document is: “turn off incremental if user wants to gather stats using sampling“.
In other words: We have to decide to either use a small sample size (which is the only feasible option in many DWH environments) or use incremental statistics. Both at the same time seems not to be possible. Perhaps in Oracle 12c?