Which data warehouse cloud platform is faster: Oracle Autonomous Data Warehouse or Snowflake Cloud Data Warehouse? The short answer: it depends. For a more detailed answer, read this blog post.
Skiing is not recommended at the moment. Because of the current pandemic situation, it’s better to stay at home. So, I do a different kind of winter sports this year and play around with Snowflake. No, I’m not writing about building a snowman or anything like that. Currently, I’m doing some tests with Snowflake, a cloud-based data warehousing platform, that runs on Amazon AWS, Microsoft Azure and Google Cloud. I never worked before with Snowflake databases, and my main focus is to understand how this data warehousing solution works, what is similar to Oracle (which I know a little bit better) and where are the differences. As a first test, I wanted to find out how fast the Cloud platform of Snowflake works, compared to an Autonomous Data Warehouse in the Oracle Cloud.
Another type of a Snowflake table – but not the one I used for my performance tests
Of course, it is not possible to compare the two cloud solutions directly. Although Snowflake looks surprisingly familiar for me as an Oracle developer, it uses a different technical architecture and other performance features than Oracle. The results of any performance tests also depend on the type of queries. So, my experiments may not be meaningful for all kind of data warehouses. But for me, it was important to get a first impression of Snowflake with my own use case, not only with the demo schemas of Snowflake Inc.
Use Case: Craft Beer Data Vault
Example Data Vault schema for my performance tests
In 2017, I created a Data Vault example model for a video tutorial about Data Vault modeling. The use case for the demo was the small microbrewery of my wife and me. One year later, I used the same example schema for ETL tests with the Oracle Autonomous Data Warehouse. For the presentation Craft Beer in the Cloud: A Data Warehouse Case Study I reused and extended the data model to show how Virtual Data Marts can be built with views on top of the Data Vault tables. For this, a PIT (point-in-time) table and some history views are created for every hub and its satellite tables. On top of the history view layer, I created some dimension and fact views to provide two star schemas. The basic principle for this approach I explained in the presentation and in a previous blog post. For the situation here, it is just important to know that each query on a star schema needs to join many tables in the Data Vault schema.
To show this more clearly, here an example: In a Snowflake workbook, I executed a query that joins a fact view with two dimension views (slowly changing dimension type 2).
SQL queries can be executed in a browser-based Snowflake worksheet
To get the result for this query, Snowflake must scan and join 27 tables of the Data Vault schema (6 hubs, 6 PIT tables, 3 links, 12 satellites).
An excerpt of the execution plan for the query above in Snowflake
The execution plan in Oracle shows that only 13 tables (6 PIT tables, 3 links, 4 satellites) must be scanned and joined because of Join Elimination. This is just one of many differences between Snowflake and Oracle. Both platforms have different ways to optimize query performance, as you will read in this post.
An excerpt of the execution plan in Oracle for the same query
Running the Performance Tests
My performance tests contain four star schema queries. All of them join a fact view with one or two dimension views, using different types of Slowly Changing Dimensions (SCD). The queries answer the following questions:
- How many bottles were delivered per month and beer name? (SCD type 1)
- Which beer style was ordered how often in 2019 in which city? (SCD type 1)
- Which beer style was ordered how often in 2019 in which city? (SCD type 2)
- Which beer style was ordered how often in which year and city? (SCD type 2)
To run the queries in an Oracle database, I use SQL Developer on an Autonomous Data Warehouse in the Oracle Cloud. As an Oracle ACE Director, I’m lucky to have an account with more resources than the usual “Always Free” account, so I can scale up and down the ADW database with different number of CPUs. This is important, as you will see later. I ran the queries on the ADW with 1, 2, 4 and 8 CPU.
Oracle Autonomous Data Warehouse: Scale Up/Down the number of CPUs
For the Snowflake environment, I use a trial account in the Amazon AWS cloud. To run the queries and check the results, runtimes measures and execution plans, I use the browser-based web interface of Snowflake. This is the most common way to work with Snowflake platform. Since yesterday I know that it is also possible to connect to Snowflake with SQL Developer. My Trivadis colleague Philipp Salvisberg provided a JDBC proxy for this, as he described in his blog post Accessing Snowflake from SQL Developer.
Scaling is also possible in Snowflake by using different Virtual Warehouses to run the SQL statements. The size of a Warehouse defines how many servers are used in the Compute Layer: X-Small (1 server), Small (2 servers), Medium (4 servers), Large (8 servers). It is either possible to resize a Warehouse or to switch to another Warehouse. Both can be done in the web interface or on command line. I used four different warehouses and switched between them.
Snowflake: The Warehouse size can be changed to scale the Compute Layer
In both environments, I executed the queries multiple times. The test results are the average of all executions, except the first one (usually an outlier because no data is chached). So the preconditions should be similar for both platforms.
The original demo schema contains around 360’000 orders with 460’000 order items. This is much more beer than we sell in our real microbrewery, but it is a small data set for perfomance tests in a Cloud database. In both environments, all the queries finished in a few seconds. So, I installed to more schemas with my brewery data and increased the number of sales orders and beer deliveries by factor 10 resp. 100. One of the new demo schemas contain 3.6 million orders and 4.6 million order items, the other one 36 million orders and 46 million order items. Can we still call this a “microbrewery”?
Findings of the Performance Comparison
With the larger data sets it is possible to see differences in the query performance, but it is not easy to compare them directly. In the first version of this blog post, I made some misinterpretations when comparing the execution times of the queries. I compared apples with oranges. The technical architecture of Snowflake and Oracle is different, and they use different features to improve performance. So, if a specific query is faster on one platform, it is not guaranteed that this is the case for other queries. Additionally, software providers don’t like to see direct comparisons of performance tests in public. So, you will not see any test results here:
For these reasons, I changed the focus of this blog post. More meaningful than performance results for some queries is to understand, why a direct comparison of response times is not feasible. Here some remarks about the similarities and differences of the performance features of Snowflake and Oracle databases.
- Result caching is used by default on both platforms. Because all subsequent executions of a SQL statement would run in milliseconds, I disabled it in both environments. In Oracle with the hint no_result_cache, in Snowflake with a session command alter session set use_cached_result=false;
- Physical reads are reduced by the usage of the Database Buffer Cache (Oracle) or the Data Cache (Snowflake). This is why I excluded the first execution of the statements in both environments, But even then, a direct comparison is not possible because of the different implementation of data caching. The Database Buffer Cache in Oracle is stored in memory and used with an LRU (least recently used) algorithm. So, a high percentage of the data may be available in memory. In Snowflake, the Data Cache contains all data of the previous executed queries. So, after the first execution, 100% of the data can be read from the cache. But this cache is not in memory, it is stored on SSD disks. So even with data caching, physical reads are required in Snowflake.
- Both platforms can be scaled. In Oracle by the number of CPUs (1, 2, 4, 8), in Snowflake with the Warehouse size (XS, S, M, L), which is related to the number of nodes that are used. It was interesting for me to see that scaling on Oracle ADW worked as expected: The more CPUs are available, the faster the queries are executed. In Snowflake, some of the queries were faster with a warehouse of size XS (one none) that with warehouses with multiple nodes. This was caused by additional overhead for network traffic and synchronization between the nodes for bigger warehouse sizes. At least for my queries, this was the case, but for queries that contain more calculations, it might be different.
- Oracle ADW and Snowflake both use Hybrid Columnar Compression to store the table data. Snowflake additionally uses an automatic “micro-partition pruning” to split the data into many small partitions. Oracle allows several methods of partitioning of the tables. For this use case, I did not use the Oracle Partitioning features.
- Indexes are not supported in Snowflake and not recommended in Oracle ADW. The only useful purpose of indexes in ADW are primary key constraints and unique constraints. For this test case, I implemented referential integrity with reliable constraints in Oracle. Therefore, no indexes are created. The primary/foreign key constraints are required by the optimizer for Join Elimination in Oracle. In Snowflake, I also defined all primary and foreign key constraints, but they are neither checked nor used by the optimizer.
I hope to show you with this blog post, that there is no simple answer to the question “which data warehouse platform is faster”. Many details have an impact on the query performance: technical architecture and configuration of the Cloud platform, amount and distribution of the data, type and complexity of the queries. The good news is: Both platforms are scaleable, so good response times are possible with the right size of hardware.
Sounds good, but how about the price? Allocating more hardware also means higher costs – on both platforms. But a direct comparison of pricing is also very difficult. Oracle calculates the prices per OCPU and disk space, Snowflake charges for running warehouses, depending on the size. Because these warehouses are usually suspended when no queries are running, the price calculation is rather different to an Oracle ADW.
If somebody asks me: “Which Cloud platform is faster, Oracle or Snowflake?”, my answer will still be: “It depends”.
And if somebody asks me about the number of sales order of our real microbrewery, I would reply: “For this we don’t need a Cloud Data Warehouse”. We have around 50 orders per year.