Keep Your SQL Simple (and Fast)

Complex SQL statements are hard to understand – not only for us, but for the query optimizer, too. Sometimes, it is a good idea to write a simpler SQL query to do the same work. Here a nice story about a performance improvement by rewriting the SQL statement.

In my job as consultant, I often see complex SQL queries with nested inline views, subqueries, function calls, complex expressions, and much more. Of course, SQL is a powerful query language with many features and functions. The good thing is that it allows to write complex data requirements in one query. The bad thing is that such queries are often hard to understand – and hard to estimate by the query optimizer.

An approach that I often use to improve the query performance is to reduce complexity by rewriting the SQL statement, or by splitting into several pieces, e.g. with subquery factoring clauses (see blog post 10 Tips to Improve ETL Performance). Of course, this is not always possible, but it may be more efficient and sustainable than to add hints, create SQL profiles, or create additional indexes.

A Current Example

A few days ago, I got an e-mail from a customer:

“We have a problem with the following SQL query. Is it possible to improve it?”

Attached was a SQL statement and the screenshot of an execution plan. The SQL query was a hierarchical query with CONNECT BY on an inline view with some analytical functions. The following statement is more or less what I received from the customer, except that I renamed the table and column names to “anonymize” the SQL code:

 

SELECT (CASE

           WHEN v.code = ‘ABC’ THEN 1

           WHEN v.code = ‘DEF’ THEN 2

           WHEN v.code = ‘GHI’ THEN 3

           WHEN v.code = ‘YYZ’ THEN 4

           ELSE 5

        END)         AS code,

        v.fk_column  AS fk_column,

        v.fk_list    AS fk_list,

        v.grp_id     AS grp_id

  FROM (SELECT S1.grp_id,

               S1.code,

               MIN (S1.fk_column) AS fk_column,

               LTRIM (MAX (SYS_CONNECT_BY_PATH (S1.fk_column, ‘;’))

                        KEEP (DENSE_RANK LAST ORDER BY S1.curr),

                        ‘;’) AS fk_list

          FROM (SELECT grp_id,

                       code,

                       fk_column,

                       ROW_NUMBER ()

                          OVER (PARTITION BY grp_id, code

                                ORDER BY fk_column) AS curr,

                       ROW_NUMBER ()

                          OVER (PARTITION BY grp_id, code

                                ORDER BY fk_column) – 1 AS prev

                  FROM quite_a_big_table

                 WHERE grp_id > 0) S1

        GROUP BY S1.grp_id, S1.code

        CONNECT BY S1.prev = PRIOR S1.curr

               AND S1.grp_id = PRIOR S1.grp_id

        START WITH curr = 1) v

ORDER BY grp_id, code

 

The execution plan didn’t look too bad – as far as I can judge without knowing the application and without access to the database. It contained two full table scans on table QUITE_A_BIG_TABLE and a hash join to do a self-join on the table. The table is quite big (as the name says in this example) and has more than 150 million rows. In the OEM screenshot of the execution plan, I saw that the statement was running for more than half an hour, but did not finish yet.

So, my first answer to the customer was:

“I’m sure it is possible to improve the statement, but I don’t have a fast solution for your problem because I have no remote access to the database. I can come to your office next week if you need some assistance”.

After replying the e-mail, I was not happy with my answer and wanted to understand what this SQL query does. It took me some time to find out that the purpose of the statement is, and after rebuilding the same query on the good old SCOTT.EMP table, I realized that the statement is used to extract a semicolon-separated list of foreign keys. Do I really need a CONNECT BY clause for this? I don’t think so! With the LISTAGG function it is much easier to get the same result:

 

SELECT (CASE

           WHEN v.code = ‘ABC’ THEN 1

           WHEN v.code = ‘DEF’ THEN 2

           WHEN v.code = ‘GHI’ THEN 3

           WHEN v.code = ‘YYZ’ THEN 4

           ELSE 5

        END)         AS code,

        v.fk_column  AS fk_column,

        v.fk_list    AS fk_list,

        v.grp_id     AS grp_id

  FROM (SELECT grp_id,

               code,

               MIN (fk_column) AS fk_column,

               LISTAGG(fk_column, ‘;’)

                  WITHIN GROUP (ORDER BY fk_column) AS fk_list

          FROM quite_a_big_table

         WHERE grp_id > 0

        GROUP BY grp_id, code) v

ORDER BY grp_id, code

 

Of course, I could not do any performance tests with my example query on the small EMP table. But because the table has to be scanned only once and no self-join is required, the same statement on QUITE_A_BIG_TABLE should definitely be faster. I rewrote the original query and sent a second e-mail to my customer:

I cannot guarantee that my query is executable and actually returns the same results. I adapted your example to the table SCOTT.EMP, then rewrote it, and finally replaced the columns with the original attribute names. I don’t know if I succeeded without errors. But it would be worth a try to execute the query.

The answer I got a few day later from the customer showed me that the rewritten SQL statement was successful. The e-mail contained many smileys and exclamation marks:

We tested your statement and it’s not only much the hell faster but also returns the same results 😉
It already ran on production and we are happy!!!!! 😊 😊

Why do I write this story here? One reason is that I like this kind of customer feedback, and it makes me happy as well. Another reason is that this example shows once more how important it is to know the different options to write a SQL statement. Often, there are many ways leading to the goal, but not all of them are simple and efficient.

Conclusion

It would be a hard rule to say: The simpler a SQL statement is, the faster it runs. But as I can see in many projects and performance tuning sessions, it is often the case that a SQL statement that can be simplified or written in a more elegant way is easier to optimize than a complex query. So, I recommend a special form of the KISS approach (“keep it simple and stupid”): Keep your SQL simple and fast.

 

Leave a comment