Online Statistics Gathering in Oracle 12c

Since Oracle 12c, table statistics are gathered automatically for some bulk operations. This inconspicuous new feature called “Online Statistics Gathering for Bulk Loads” is very practical for ETL jobs in Data Warehouses.

I remember an interesting discussion during Oracle OpenWorld 2009 in the Thirsty Bear Pub in San Francisco. Thanks to my Trivadis colleague Chris Antognini, we were allowed to join an “Oracle ClosedWorld” session of OakTable members. There, they discussed about statistics gathering, and one idea was to gather the table statistics “on the fly” during a CREATE TABLE AS SELECT statement.

Four year later, Oracle 12.1.0.1 was released. One of the nice little features of the 12c database is exactly what was discussed several years ago. In Oracle 12c, table statistics are gathered automatically for the following statements:

  • CREATE TABLE AS SELECT (CTAS)
  • Direct-Path INSERT on an empty table

Because these types of bulk operations are often used in load processes of Data Warehouses, it’s worth to have a detailed look at this feature. Data from different sources – e.g. OLTP databases or flat files – is typically loaded into a Staging Area and transformed in several steps until it is loaded into the target table of a Core DWH. It is common to TRUNCATE these intermediate tables in Staging Area and Cleansing Area at the beginning of an ETL job and then load the data into the tables with a Direct-Path INSERT. The same approach can be used for initial loads for Data Marts from a Core Data Warehouse. For further details about the different DWH layers see my very first blog post.

For our tests, we just load data one to one from the source table SALES into a stage table STG_SALES with the same structure.

CREATE TABLE AS SELECT

First, we create the stage table STG_SALES with a CREATE TABLE AS SELECT (CTAS) statement and load one partition of data into the table. The partition SALES_Q1_1998 contains 43687 rows. What happens to the statistics?

CREATE TABLE stg_sales AS
SELECT * FROM sh.sales PARTITION (sales_q1_1998);
 
Table created.
 
SELECT table_name, num_rows, last_analyzed
  FROM user_tab_statistics
 WHERE table_name = 'STG_SALES';
 
TABLE_NAME NUM_ROWS LAST_ANALYZED
--------------- ---------- -------------------
STG_SALES 43687 22.12.2015 18:39:01 

 

As we can see in the data dictionary view USER_TAB_STATISTICS, the table statistics were gathered during the CTAS and contain the exact number of rows. Further analysis on USER_TAB_COL_STATISTICS shows that the column statistics are also complete and correct. Because all data is read for the load anyway, it is no overhead for Oracle to calculate exact statistics instead of estimate only a percentage based on a subset of rows. The only restriction is that no histograms can be gathered, the number of histogram buckets (NUM_BUCKETS) is always 1. An interesting detail: The column NOTES contains a comment STATS_ON_LOAD when the statistics were gathered during a CTAS or Direct-Path INSERT. This information is only available on column level, but not for the table.

SELECT column_name, num_distinct, num_buckets, sample_size, notes
  FROM user_tab_col_statistics
 WHERE table_name = 'STG_SALES';
 
COLUMN_NAME     NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE NOTES
--------------- ------------ ----------- ----------- ---------------
AMOUNT_SOLD              396           1       43687 STATS_ON_LOAD
QUANTITY_SOLD              1           1       43687 STATS_ON_LOAD
PROMO_ID                   2           1       43687 STATS_ON_LOAD
CHANNEL_ID                 4           1       43687 STATS_ON_LOAD
TIME_ID                   90           1       43687 STATS_ON_LOAD
CUST_ID                 3203           1       43687 STATS_ON_LOAD
PROD_ID                   60           1       43687 STATS_ON_LOAD

 

Direct-Path INSERT

Now, let’s do a Direct-Path INSERT and see what happens to the statistics. A Direct-Path INSERT is an INSERT with either an append or a parallel hint. For our example, we load another partition (64186 rows) of the SALES table into the stage table STG_SALES.

INSERT /*+ append */ INTO stg_sales
SELECT * FROM sh.sales PARTITION (sales_q1_1999);
 
64186 rows created.
 
COMMIT;
 
Commit complete.
 
SELECT table_name, num_rows, last_analyzed
  FROM user_tab_statistics
 WHERE table_name = 'STG_SALES';
 
TABLE_NAME      NUM_ROWS   LAST_ANALYZED
--------------- ---------- -------------------
STG_SALES            43687 22.12.2015 18:39:01

 

The query on USER_TAB_STATISTICS shows that nothing happened to the statistics. We still have the statistics of the previous CTAS statement. This is the case because the target table was not empty. In other words: Statistics are only gathered during a Direct-Path INSERT if the table was either just created (without data) or if a TRUNCATE TABLE was executed before (By the way: a TRUNCATE statement does not reset the statistics). On a stage table, we usually perform a TRUNCATE statement before the load. In this case, the online statistic gathering works as expected.

TRUNCATE TABLE stg_sales;
 
Table truncated.
 
INSERT /*+ append */ INTO stg_sales
SELECT * FROM sh.sales PARTITION (sales_q1_1999);
 
64186 rows created.
 
COMMIT;
 
Commit complete.
 
SELECT table_name, num_rows, last_analyzed
  FROM user_tab_statistics
 WHERE table_name = 'STG_SALES';
 
TABLE_NAME      NUM_ROWS   LAST_ANALYZED
--------------- ---------- -------------------
STG_SALES            64186 22.12.2015 18:40:49

 

The statistics were now recalculated, as we can see from the number of rows (NUM_ROWS) and the time of the last gathering (LAST_ANALYZED). As in the CTAS example before, the table and column statistics were gathered exactly, but no histograms are calculated.

How about Index Statistics?

The table and column statistics are gathered automatically, but what happens with the index statistics? For the CTAS statement, it is obvious that no index statistics are gathered. When a new table is created, no indexes are available at this time.

But what happens in case of a Direct-Path INSERT into a table that contains indexes? Are the index statistics gathered as well? On stage tables, there is usually no need for indexes, so we don’t have to worry about index statistics. But if indexes are created (e.g. a primary key constraint on a cleanse table), we have to take care about the index statistics.

For our test case, we create some indexes on the stage table STG_SALES (after a TRUNCATE statement to see the effect on the statistics):

ALTER TABLE stg_sales ADD (id NUMBER INVISIBLE GENERATED ALWAYS AS IDENTITY);
ALTER TABLE stg_sales ADD CONSTRAINT stg_sales_pk PRIMARY KEY (id);
CREATE INDEX stg_sales_cust_id_idx ON stg_sales (cust_id);
CREATE BITMAP INDEX stg_sales_prod_id_bix ON stg_sales (prod_id);

 

We add a primary key constraint on a generated ID column. For the primary key constraint, an implicit unique index is created. Additionally, we create a non-unique index on column CUST_ID and a bitmap index on PROD_ID. (Note: Please do not use such an index strategy in your Staging Area! It’s just for the test case.). Since Oracle 10g, index statistics are gathered automatically in a CREATE INDEX statement, so we will have statistics on the (empty) indexes. Now, we run the Direct-Path INSERT statement and check the index statistics after the load:

INSERT /*+ append */ INTO stg_sales
SELECT * FROM sh.sales PARTITION (sales_q1_1999);
 
64186 rows created.
 
COMMIT;
 
Commit complete.
 
SELECT index_name, blevel, leaf_blocks, distinct_keys
  FROM user_ind_statistics
 WHERE table_name = 'STG_SALES’;
 
INDEX_NAME             BLEVEL LEAF_BLOCKS DISTINCT_KEYS
---------------------- ------ ----------- -------------
STG_SALES_PROD_ID_BIX       0           0             0
STG_SALES_CUST_ID_IDX       0           0             0
STG_SALES_PK                0           0             0

 

The index statistics are still in the same state as before. Number of branch levels, leaf blocks and distinct keys of all indexes are 0, i.e. the indexes seem to be empty. During the Direct-Path INSERT, no online gathering of index statistics is done. To calculate useful index statistics we have two options:

  • Gathering the index statistics explicitly after the load with a call of dbms_stats.gather_index_stats for each available index.
  • Set all indexes to UNUSABLE before the load an rebuild them after the load. This is a recommended approach for large data sets anyway. A nice side effect is that the index statistics are gathered during the index rebuild.

The index of a primary key or unique constraint cannot be set to UNUSABLE. This would cause an ORA-26026 error during the load. So, we disable the primary key constraint and set all other indexes to UNUSABLE. After the load, the primary key is enabled (which creates a new unique index automatically) and all other indexes are rebuilt.

ALTER TABLE stg_sales DISABLE CONSTRAINT stg_sales_pk;
ALTER INDEX stg_sales_cust_id_idx UNUSABLE;
ALTER INDEX stg_sales_prod_id_bix UNUSABLE;
 
TRUNCATE TABLE stg_sales;
 
INSERT /*+ append */ INTO stg_sales
SELECT * FROM sh.sales PARTITION (sales_q1_1999);
 
COMMIT;
 
ALTER TABLE stg_sales ENABLE CONSTRAINT stg_sales_pk;
ALTER INDEX stg_sales_cust_id_idx REBUILD;
ALTER INDEX stg_sales_prod_id_bix REBUILD;

 

Now, the index statistics are up-to-date because the were calculated during the index creation:

SELECT index_name, blevel, leaf_blocks, distinct_keys
  FROM user_ind_statistics
 WHERE table_name = 'STG_SALES';
 
INDEX_NAME             BLEVEL LEAF_BLOCKS DISTINCT_KEYS
---------------------- ------ ----------- -------------
STG_SALES_PROD_ID_BIX       1           2            71
STG_SALES_CUST_ID_IDX       1         136          2398
STG_SALES_PK                1         133         64186

 

Recommendation for Data Warehouses

Online gathering of table statistics is very useful for tables that are truncated and reloaded in every load job. This is usually the case in intermediate tables of DWH layers like Staging Area or Cleansing Area (or whatever layers you have in your Data Warehouse). If there are indexes on these tables, it is recommended to create or rebuild them after the load, at least for large data sets. This is not only faster than to load the data into a table with indexes, but also calculates the index statistics automatically.

For Core tables that are loaded incrementally or are even updated, it is still recommended to gather table and index statistics after the load using dbms_stats. This is also the case for dimension and fact tables of the Data Marts, even if they are loaded from scratch. To have more flexibility for user queries, we probably want to use histograms or extended statistics. This is not possible with the online gathering during the insert.

Be aware that online statistics gathering for bulk loads is only possible with ETL tools that support Direct-Path INSERT (e.g. Oracle Data Integrator). For ETL tools that run Conventional INSERT statements, Bulk Loads with multiple COMMITs per load job or even single row INSERTs, no statistics are gathered automatically. In all these tools it is strongly recommended to gather the statistics at the end of the load job with a dbms_stats call.

4 thoughts on “Online Statistics Gathering in Oracle 12c

  1. As described here, we do truncate insert and then gather stats. My question is about extended stats … are they helpful in this type of a data warehouse environment? The extended stats SYS_ columns are causing an issue in some of our custom programming; perhaps we should disable extended stats instead of coding around the SYS_ columns?

    Like

    • Hi Eileen
      Extended statistics are rarely used in the background layers of a Data Warehouse (at least from my point of view). I used them only in Data Marts till know. But if you have good reasons where extended statistics help to improve the ETL performance or ad-hoc queries on specific DWH tables, use them. Just be aware that they are not calculated with Online Statistics Gathering in Oracle 12c.

      Like

  2. Hi,
    Unfortunately Bulk insert doesn’t work with partitioned tables.
    And Truncate updates the STALE_STATS col from NO to YES, and the daily stats job picks up all the partitions in the table

    SQL> create table t(x int,y char(1),z date) partition by range(x) interval(1) (partition dummy values less than (1));
    Table created.

    SQL> insert into t select level,’a’,sysdate from dual connect by level exec dbms_stats.gather_table_stats(ownname=>user,tabname=>’T’,estimate_percent => dbms_stats.auto_sample_size,cascade=>false);
    PL/SQL procedure successfully completed.

    SQL> select table_name,partition_name,num_rows,last_analyzed,stale_stats from user_tab_statistics where table_name=’T’;

    TABLE_NAME PARTITION_ NUM_ROWS LAST_ANAL STALE_STATS
    ———- ———- ———- ——— ———–
    T 5 24-MAY-19 NO
    T SYS_P15347 1 24-MAY-19 NO
    T SYS_P15345 1 24-MAY-19 NO
    T DUMMY 0 24-MAY-19 NO
    T SYS_P15346 1 24-MAY-19 NO
    T SYS_P15344 1 24-MAY-19 NO
    T SYS_P15348 1 24-MAY-19 NO

    7 rows selected.

    SQL> create table t_backup as select * from t;
    Table created.

    SQL> truncate table t;
    Table truncated.

    Immediately after truncate the STALE_STATS col is updated to YES.

    SQL> select table_name,partition_name,num_rows,last_analyzed,stale_stats from user_tab_statistics where table_name=’T’;

    TABLE_NAME PARTITION_ NUM_ROWS LAST_ANAL STALE_STATS
    ———- ———- ———- ——— ———–
    T 5 24-MAY-19 YES
    T SYS_P15345 1 24-MAY-19 YES  Though the stats are preserved, stale_stats is updated to ‘YES’.
    T SYS_P15347 1 24-MAY-19 YES
    T DUMMY 0 24-MAY-19 YES
    T SYS_P15346 1 24-MAY-19 YES
    T SYS_P15348 1 24-MAY-19 YES
    T SYS_P15344 1 24-MAY-19 YES

    7 rows selected.

    Is there a way to disable the STALE_STATS for this table during the truncate/re-insert operation, as we know that truncate doesn’t delete stats, and the data is same, as it is just a re-insert of same data.

    Problem Scenario: STALE Statistics
    —————-
    We have a scenario where we need to truncate huge partitioned table and re-insert the data.

    Stale Percent
    By default stale_percent it 10, when I do a truncate, 100% of data is modified and stale_stats are updated to YES

    SQL> select dbms_stats.get_prefs(ownname=>user,tabname=>’T’,pname=>’STALE_PERCENT’) stale_percent from dual;

    STALE_PERCENT
    ————–

    10

    We have a daily gather stats task, which picks up all the STALE STATS objects to be gathered,
    Now because of the truncate, all the partitions(total 86366) of this table are picked up.
    Gather stats on this table is taking days to complete, that too with CASCADE=>FALSE option.
    As we are re-creating the indexes, we are using cascade=>false to ignore the indexes, as the index stats are gathered at the time of creation.

    Thanks,
    Rahmat Ali

    Like

    • Hi Rahmat

      Thanks for your long comment. I try to answer your questions in a short reply:

      1. Online statistics gathering works for partitioned tables, but only for global statistics. Statistics on partition level must either be gathered with dbms_stats, or you use partition exchange to load the data.
      2. Disable STATE_STATS is not possibe. For your use case, I would recommend to lock the statistics on the huge partitioned table. In this case, the default statistics job ignores this table.
      3. Instead of TRUNCATE and re-insert the data I would highly recommend to use partition exchange to load individual partitions. In this case, the statistics for the partition are replaced by the table statistics of the exchange table, and the statistics will not become stale.

      Hope these answers help to solve your problem.
      Regards, Dani

      Like

Leave a comment