Thursday, July 2, 2020

Midsummer Midsomer SQL Winner

In June I launched the Midsummer Midsomer Murder SQL Challenge and Book Competition.

I promised to announce a winner within a week of the deadline - that was yesterday, so I'm afraid I'm one day late, sorry about that but better late than never 😇.

Very interesting entries, so it was not an easy choice - but in the end I was able to pick the best SQL detective 😎.

On LiveSQL you will find these answers to the challenge:

The entire setup and all answers is also available in this script.

All answers met the correctness criteria.

I ran all answers several times in SQL*Plus with autotrace and picked a representative autotrace output (usually third run.)

Here's the autotrace of the first statement (murder statistics) of each answer:

Ref. Answer Answer 1 Answer 2 Answer 3 Answer 4 Answer 5
elapsed 0.07 0.12 0.14 0.3 0.31 0.51
recursive calls 0 1223 115 230 0 0
db block gets 0 0 0 0 0 0
consistent gets 2 2 7 7 7 2
physical reads 13 7 13 13 13 13
redo size 0 0 0 0 0 0
bytes sent via SQL*Net to client 2005 2004 2005 2006 2005 2005
bytes received via SQL*Net from client 510 510 510 510 510 510
SQL*Net roundtrips to/from client 3 3 3 3 3 3
sorts (memory) 3 2 3 118 0 3220
sorts (disk) 0 0 0 0 0 0
rows processed 23 23 23 23 23 23

And here's the autotrace of the second statement (the lists of persons) of each answer:

Ref. Answer Answer 1 Answer 2 Answer 3 Answer 4 Answer 5
elapsed 0.08 0.13 0.28 0.35 0.31 0.47
recursive calls 0 1223 121 115 0 0
db block gets 0 0 5846 0 0 0
consistent gets 2 2 4743 7 7 2
physical reads 13 7 13 13 13 13
redo size 0 0 0 0 0 0
bytes sent via SQL*Net to client 20645 20644 34186 32338 34186 20645
bytes received via SQL*Net from client 510 510 27110 24650 27110 510
SQL*Net roundtrips to/from client 3 3 140 128 140 3
sorts (memory) 3 2 118 3 0 3219
sorts (disk) 0 0 0 0 0 0
rows processed 23 23 23 23 23 23

The complete autotrace output you can find here.

Some comments on the different answers:

  • The reference answer:
    •  I use relatively simple XMLTABLE to pick out all the <td> elements of the XML with each act (murder, suicide, etc) being a column.
    • Then I turn those columns into rows using UNPIVOT.
    • Those that have multiple delimited names I turn into more rows using APEX_STRING.SPLIT.
    • And then it's plain SQL to do GROUP BY ROLLUP with COUNT or LISTAGG respectively.

  • Answer 1:
    • Stew also use XMLTABLE, but with an XQUERY FLWR expression to completely parse the XML and do the splitting of delimited names using ora:tokenize.
    • The transformed XML output from XQUERY is then straightforward for XMLTABLE to read as columns.
    • And finally Stew also uses plain GROUP BY ROLLUP.
    • On the negative side, Stew neglects to use ORDER BY to ensure deterministically getting the desired output.
    • Performance is slower than the reference answer, but not by much. It does use more recursive calls.

  • Answer 2:
    • Philipp here creates a fairly advanced object type, which simultaneously implements an ODCI userdefined aggregate as well as some member functions to return counts and delimited text.
    • In this object type, the constructor function splits the delimited name lists into collections. The aggregation implementation aggregates the collections. The member functions then return data from the aggregated collections.
    • On the plus side, Philipp returns CLOBs that can handle long strings (but here we don't actually need it since the database in question uses extended strings up to 32K rather than the standard 4000 bytes.)
    • On the negative side, the CLOB handling when generating the csv output is rather slow, since Philipp does a whole lot of DBMS_LOB.APPEND operations (almost two per name), which could have been much reduced by buffering into a VARCHAR2 variable and only writing to the LOB once for every almost 4000 bytes.
    • The statistics answer is practically same speed as Answer 1, but the names list answer uses twice the time because of the many LOB operations.

  • Answer 3:
    • In this one Philipp works almost exclusively with XML, transforming the original data using XQUERY FLWR expression inspired by Stew.
    • But after getting transformed raw data, Philipp gets aggregate results (counts and name lists) also using XML functions.
    • On the plus side again Philipp can handle more data using CLOBs.
    • On the negative side the exhaustive use of XML functions does appear to take more time. This answer is about 2½ times slower than Answer 1.

  • Answer 4:
    • In his third attempt Philipp goes all in on XQUERY and solves everything with FLWR expression, so the XMLTABLE simply outputs the result. SQL does nothing here, XQUERY does everything.
    • CLOBs again make this answer able to handle larger lists of names if needed (though not needed for these data.)
    • Performancewise this is quite close to Answer 3.

  • Answer 5:
    • Mathias goes another way by simply retrieving all <td> elements as rows using XMLTABLE.
    • Plain MOD and FLOOR then shows which rows become what columns in a PIVOT operations to put 2nd, 9th, 16th, ... <td> into column "Title", and so on.
    • Clever and tricky use of LATERAL with CONNECT BY and analytic LEAD function splits the delimited name lists.
    • And plain SQL GROUP BY ROLLUP deals with the aggregates. Mathias here shows ON OVERFLOW TRUNCATE for LISTAGG, which for the PERSONS_MURDER column would be needed if the database used max_string_size=standard (but not needed on LiveSQL that uses extended.)
    • The plus side is that often SQL is so optimized, that it pays off to "get out" of the specialty formats quickly and "get into" SQL to let SQL to the heavy lifting.
    • On the negative side, the heavy lifting SQL here is not as optimal. The clever trick with LATERAL and LEAD can be seen in the autotrace to use a lot of memory sorts (many WINDOW SORT operations in the explain plan.)
    • This makes it the slowest answer of all.

The lesson that I see is, that using relatively simple XML operations and then some more SQL seems to pay off - but that the SQL needs to be simple and efficient for this to work out. Witness the difference between my reference answer and Mathias' Answer 5. The use of APEX_STRING.SPLIT appears to be a winner compared to a do-it-yourself pure-SQL text-splitting solution.

The XQUERY solutions are very clever and does demonstrate how you can really make extremely advanced XML code right inside the database. I can imagine more complex cases where solutions like these could be awesome. For this case, however, they look like a bit overkill and seems to require a bit more time to run.

But performance was not the only criteria I set for choosing a winner - also how clever it was, what methods used and how well I liked it. Based on those criteria, picking a winner was hard, since each answer had different clever methods.

In the end I decided that I agreed with Philipp, who stated about Stew's solution: "And the implementation is elegant. This idea alone should make you the winner of this competition."

So therefore I announce:

Congratulations, Stew! I'll get a book sent your way when I get back home from vacation.

No comments:

Post a Comment