Thursday, July 30, 2015

Row pattern matching nested within hierarchy

I've been playing around with MATCH_RECOGNIZE - the data pattern matching extension to SELECT that was introduced in version 12.

Most examples I've seen have used the default AFTER MATCH SKIP PAST LAST ROW as most often the logic dictates, that when we have found a match in a group of rows, we want to search for further matches after those rows to avoid unwanted "double" matches.

But can there be uses where we want overlapping or even nested matches? Well, I found at least one case where I think it makes perfect sense...

Suppose for the employees in SCOTT.EMP we want the employee hierarchy and for each employee we wish to show how many subordinates that employee has (employees below him in the hierarchy.)

First let's do a classic hierarchial query on SCOTT.EMP:

select empno
     , lpad(' ', (level-1)*2) || ename as ename
     , (
         select count(*)
           from emp sub
          start with sub.mgr = emp.empno
          connect by sub.mgr = prior sub.empno
       ) subs
  from emp
 start with mgr is null
 connect by mgr = prior empno
 order siblings by empno
/

EMPNO ENAME        SUBS
----- ------------ ----
 7839 KING           13
 7566   JONES         4
 7788     SCOTT       1
 7876       ADAMS     0
 7902     FORD        1
 7369       SMITH     0
 7698   BLAKE         5
 7499     ALLEN       0
 7521     WARD        0
 7654     MARTIN      0
 7844     TURNER      0
 7900     JAMES       0
 7782   CLARK         1
 7934     MILLER      0

The outer query shows the employee hierarchy, the scalar subquery finds how many subordinates each employee has.

Simple and easy to understand. Now for the MATCH_RECOGNIZE method:

with hierarchy as (
   select lvl, empno, ename, rownum as rn
   from (
      select level as lvl, empno, ename
        from emp
       start with mgr is null
       connect by mgr = prior empno
       order siblings by empno
   )
)
select empno
     , lpad(' ', (lvl-1)*2) || ename as ename
     , subs
  from hierarchy
match_recognize (
   order by rn
   measures
      strt.rn as rn
    , strt.lvl as lvl
    , strt.empno as empno
    , strt.ename as ename
    , count(higher.lvl) as subs
   one row per match
   after match skip to next row
   pattern (
      strt higher*
   )
   define
      higher as higher.lvl > strt.lvl
)
 order by rn
/

EMPNO ENAME        SUBS
----- ------------ ----
 7839 KING           13
 7566   JONES         4
 7788     SCOTT       1
 7876       ADAMS     0
 7902     FORD        1
 7369       SMITH     0
 7698   BLAKE         5
 7499     ALLEN       0
 7521     WARD        0
 7654     MARTIN      0
 7844     TURNER      0
 7900     JAMES       0
 7782   CLARK         1
 7934     MILLER      0

Exactly same result, but at the cost of longer code, true. Let's take a closer look at the code and see if it really is that hard...

We place the base query in a WITH clause just for clarity, so the pattern matching query is easier to read (it could have been an inline view instead if we wanted.)

To preserve the hierarchial ordering, the CONNECT BY query with its ORDER BY clause is wrapped in an inline view, so we can save ROWNUM as rn, and use RN for ordering both in the MATCH_RECOGNIZE clause and the final ORDER BY.

Using the PATTERN clause in line 26 we look for a pattern in the rows that consists of first exactly one STRT row followed by zero or more HIGHER rows. The syntax here is very similar to regular expressions, we are just not matching characters, we are matching rows.

But how does it then know what is a STRT row and what is a HIGHER row? By the DEFINE clause, where we in line 29 define that a HIGHER row is defined as a row whose LVL value is greater than the LVL value of the STRT row.

The STRT row has no definition, so any row can match a STRT row.

This means that the database takes a look a the first row. Can it match STRT? Yes, any row can do that. OK, then it tries second row, can it match HIGHER? Yes, the LVL value is higher than LVL of the STRT row. How about the third? Yes, that too. And so on, actually on to the end, since everybody has a LVL value higher than King...

So the first match has been found with one STRT row (the first one) followed by 13 HIGHER rows. In line 23 we state we only want a single row returned from this match containing the values we define in the MEASURES clause, which is values from the STRT row plus a COUNT of the HIGHER rows. That count is actually the number of subordinates.

So once the first match has been found, line 24 state SKIP TO NEXT ROW. That means NEXT ROW after the first row of the match. So the database moves on to Jones (even though he was a HIGHER row in the first match.)

Can Jones match STRT? Yes, any row can do that. OK, can Scott match HIGHER? Yes, Scott has higher LVL than Jones. And so on until Blake - can Blake match HIGHER? No, Blake does not have a higher LVL than Jones. So the match stops here. Even though there are more rows further down that has a higher LVL than Jones, the pattern defines zero or more consecutive HIGHER rows.

So the second match contains one STRT row (Jones) followed by 4 HIGHER rows (Scott to Smith), and therefore is output a row with Jones having 4 subordinates (the count of HIGHER rows in the match.)

And so it continues for all 14 rows, each of them will start their own match, since they match STRT row, and then each will have the count of HIGHER rows immediately following the STRT row.

We can see in more details how this works if we change from ONE ROW PER MATCH to ALL ROWS PER MATCH:

with hierarchy as (
   select lvl, empno, ename, rownum as rn
   from (
      select level as lvl, empno, ename
        from emp
       start with mgr is null
       connect by mgr = prior empno
       order siblings by empno
   )
)
select mn
     , rn
     , empno
     , lpad(' ', (lvl-1)*2) || ename as ename
     , roll
     , subs
     , cls
     , stno
     , stname
     , hino
     , hiname
  from hierarchy
match_recognize (
   order by rn
   measures
      match_number() as mn
    , classifier() as cls
    , strt.empno as stno
    , strt.ename as stname
    , higher.empno as hino
    , higher.ename as hiname
    , count(higher.lvl) as roll
    , final count(higher.lvl) as subs
   all rows per match
   after match skip to next row
   pattern (
      strt higher*
   )
   define
      higher as higher.lvl > strt.lvl
)
 order by mn, rn
/

 MN  RN EMPNO ENAME        ROLL SUBS CLS     STNO STNAME  HINO HINAME
--- --- ----- ------------ ---- ---- ------ ----- ------ ----- ------
  1   1  7839 KING            0   13 STRT    7839 KING
  1   2  7566   JONES         1   13 HIGHER  7839 KING    7566 JONES
  1   3  7788     SCOTT       2   13 HIGHER  7839 KING    7788 SCOTT
  1   4  7876       ADAMS     3   13 HIGHER  7839 KING    7876 ADAMS
  1   5  7902     FORD        4   13 HIGHER  7839 KING    7902 FORD
  1   6  7369       SMITH     5   13 HIGHER  7839 KING    7369 SMITH
  1   7  7698   BLAKE         6   13 HIGHER  7839 KING    7698 BLAKE
  1   8  7499     ALLEN       7   13 HIGHER  7839 KING    7499 ALLEN
  1   9  7521     WARD        8   13 HIGHER  7839 KING    7521 WARD
  1  10  7654     MARTIN      9   13 HIGHER  7839 KING    7654 MARTIN
  1  11  7844     TURNER     10   13 HIGHER  7839 KING    7844 TURNER
  1  12  7900     JAMES      11   13 HIGHER  7839 KING    7900 JAMES
  1  13  7782   CLARK        12   13 HIGHER  7839 KING    7782 CLARK
  1  14  7934     MILLER     13   13 HIGHER  7839 KING    7934 MILLER
  2   2  7566   JONES         0    4 STRT    7566 JONES
  2   3  7788     SCOTT       1    4 HIGHER  7566 JONES   7788 SCOTT
  2   4  7876       ADAMS     2    4 HIGHER  7566 JONES   7876 ADAMS
  2   5  7902     FORD        3    4 HIGHER  7566 JONES   7902 FORD
  2   6  7369       SMITH     4    4 HIGHER  7566 JONES   7369 SMITH
  3   3  7788     SCOTT       0    1 STRT    7788 SCOTT
  3   4  7876       ADAMS     1    1 HIGHER  7788 SCOTT   7876 ADAMS
  4   4  7876       ADAMS     0    0 STRT    7876 ADAMS
  5   5  7902     FORD        0    1 STRT    7902 FORD
  5   6  7369       SMITH     1    1 HIGHER  7902 FORD    7369 SMITH
  6   6  7369       SMITH     0    0 STRT    7369 SMITH
  7   7  7698   BLAKE         0    5 STRT    7698 BLAKE
  7   8  7499     ALLEN       1    5 HIGHER  7698 BLAKE   7499 ALLEN
  7   9  7521     WARD        2    5 HIGHER  7698 BLAKE   7521 WARD
  7  10  7654     MARTIN      3    5 HIGHER  7698 BLAKE   7654 MARTIN
  7  11  7844     TURNER      4    5 HIGHER  7698 BLAKE   7844 TURNER
  7  12  7900     JAMES       5    5 HIGHER  7698 BLAKE   7900 JAMES
  8   8  7499     ALLEN       0    0 STRT    7499 ALLEN
  9   9  7521     WARD        0    0 STRT    7521 WARD
 10  10  7654     MARTIN      0    0 STRT    7654 MARTIN
 11  11  7844     TURNER      0    0 STRT    7844 TURNER
 12  12  7900     JAMES       0    0 STRT    7900 JAMES
 13  13  7782   CLARK         0    1 STRT    7782 CLARK
 13  14  7934     MILLER      1    1 HIGHER  7782 CLARK   7934 MILLER
 14  14  7934     MILLER      0    0 STRT    7934 MILLER

In the MEASURES clause here we use MATCH_NUMBER to show us a unique number for each match that the MATCH_RECOGNIZE clause has found. And CLASSIFIER tells us which part of the PATTERN that each row of the match has matched.

So we can see that match number 1 consists of RN=1 to 14, where RN=1 is a STRT row and the 13 rest of the rows are HIGHER rows. Match number 2 consists of RN=2 to 6, where RN=2 is a STRT row and the other 4 are HIGHER rows. And so on.

Notice that when we use ALL ROWS instead of ONE ROW, the expression COUNT(HIGHER.LVL) becomes a rolling count. It now works similar to the analytic ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. To get the total count, we here need to add the FINAL keyword, which makes it work similar to analytic ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

In all rows of the match, using STRT.xxx in the MEASURES clause gives us the values of the single STRT row. Using HIGHER.xxx shows us values from the actual HIGHER row, therefore NULL for the STRT rows in the output.

We can visualize the rows of each of the 14 matches by pivoting:

with hierarchy as (
   select lvl, empno, ename, rownum as rn
   from (
      select level as lvl, empno, ename
        from emp
       start with mgr is null
       connect by mgr = prior empno
       order siblings by empno
   )
)
select rn, empno, ename
     , case "1"  when 1 then 'XX' end "1" 
     , case "2"  when 1 then 'XX' end "2" 
     , case "3"  when 1 then 'XX' end "3" 
     , case "4"  when 1 then 'XX' end "4" 
     , case "5"  when 1 then 'XX' end "5" 
     , case "6"  when 1 then 'XX' end "6" 
     , case "7"  when 1 then 'XX' end "7" 
     , case "8"  when 1 then 'XX' end "8" 
     , case "9"  when 1 then 'XX' end "9" 
     , case "10" when 1 then 'XX' end "10"
     , case "11" when 1 then 'XX' end "11"
     , case "12" when 1 then 'XX' end "12"
     , case "13" when 1 then 'XX' end "13"
     , case "14" when 1 then 'XX' end "14"
  from (
   select mn
        , rn
        , empno
        , lpad(' ', (lvl-1)*2) || ename as ename
     from hierarchy
   match_recognize (
      order by rn
      measures
         match_number() as mn
      all rows per match
      after match skip to next row
      pattern (
         strt higher*
      )
      define
         higher as higher.lvl > strt.lvl
   )
  )
 pivot (
   count(*)
   for mn in (
      1,2,3,4,5,6,7,8,9,10,11,12,13,14
   )
 )
 order by rn
/

 RN EMPNO ENAME        1  2  3  4  5  6  7  8  9  10 11 12 13 14
--- ----- ------------ -- -- -- -- -- -- -- -- -- -- -- -- -- --
  1  7839 KING         XX
  2  7566   JONES      XX XX
  3  7788     SCOTT    XX XX XX
  4  7876       ADAMS  XX XX XX XX
  5  7902     FORD     XX XX       XX
  6  7369       SMITH  XX XX       XX XX
  7  7698   BLAKE      XX                XX
  8  7499     ALLEN    XX                XX XX
  9  7521     WARD     XX                XX    XX
 10  7654     MARTIN   XX                XX       XX
 11  7844     TURNER   XX                XX          XX
 12  7900     JAMES    XX                XX             XX
 13  7782   CLARK      XX                                  XX
 14  7934     MILLER   XX                                  XX XX

14 columns showing which of the rows are part of each of the 14 matches.

Even if we go back to using ONE ROW PER MATCH, we can still include some information from the detail rows in an "aggregate" manner:

with hierarchy as (
   select lvl, empno, ename, rownum as rn
   from (
      select level as lvl, empno, ename
        from emp
       start with mgr is null
       connect by mgr = prior empno
       order siblings by empno
   )
)
select empno
     , lpad(' ', (lvl-1)*2) || ename as ename
     , subs
     , hifrom
     , hito
     , himax
  from hierarchy
match_recognize (
   order by rn
   measures
      strt.rn as rn
    , strt.lvl as lvl
    , strt.empno as empno
    , strt.ename as ename
    , count(higher.lvl) as subs
    , first(higher.ename) as hifrom
    , last(higher.ename) as hito
    , max(higher.lvl) as himax
   one row per match
   after match skip to next row
   pattern (
      strt higher*
   )
   define
      higher as higher.lvl > strt.lvl
)
 order by rn
/

EMPNO ENAME        SUBS HIFROM HITO   HIMAX
----- ------------ ---- ------ ------ -----
 7839 KING           13 JONES  MILLER     4
 7566   JONES         4 SCOTT  SMITH      4
 7788     SCOTT       1 ADAMS  ADAMS      4
 7876       ADAMS     0
 7902     FORD        1 SMITH  SMITH      4
 7369       SMITH     0
 7698   BLAKE         5 ALLEN  JAMES      3
 7499     ALLEN       0
 7521     WARD        0
 7654     MARTIN      0
 7844     TURNER      0
 7900     JAMES       0
 7782   CLARK         1 MILLER MILLER     3
 7934     MILLER      0

FIRST and LAST can be applied to HIGHER.xxx to give us the from and to of the HIGHER rows of the match.

And just like we have used aggregate function COUNT to get number of subordinates, we can also use aggregate function MAX to find the highest LVL of the subordinates.

Suppose we only want to see those employees that have subordinates? In other words those where SUBS > 0.

Sure, we could do it by wrapping the query above in an inline view and use a WHERE SUBS > 0 clause. But there's a simpler way with the PATTERN clause as alternative:

with hierarchy as (
   select lvl, empno, ename, rownum as rn
   from (
      select level as lvl, empno, ename
        from emp
       start with mgr is null
       connect by mgr = prior empno
       order siblings by empno
   )
)
select empno
     , lpad(' ', (lvl-1)*2) || ename as ename
     , subs
  from hierarchy
match_recognize (
   order by rn
   measures
      strt.rn as rn
    , strt.lvl as lvl
    , strt.empno as empno
    , strt.ename as ename
    , count(higher.lvl) as subs
   one row per match
   after match skip to next row
   pattern (
      strt higher+
   )
   define
      higher as higher.lvl > strt.lvl
)
 order by rn
/

EMPNO ENAME        SUBS
----- ------------ ----
 7839 KING           13
 7566   JONES         4
 7788     SCOTT       1
 7902     FORD        1
 7698   BLAKE         5
 7782   CLARK         1

By replacing HIGHER* with HIGHER+ we change definition from zero or more to one or more HIGHER rows, which explicitly states we only want to match where a STRT row is followed by at least one HIGHER row, meaning the employee has at least one subordinate.

But is all this really worth it when we have such a simple query as the one at the top of this page using a scalar subquery?

Well, let's try to create a slightly bigger employee table - 14.000 subordinates + 1 at the very top of the pyramid:

create table bigemp
as
select 1 empno
     , 'LARRY' ename
     , cast(null as number) mgr
  from dual
union all
select dum.dum*10000+empno empno
     , ename || '#' || dum.dum ename
     , coalesce(
          dum.dum*10000+mgr
        , 1
       ) mgr
  from emp
 cross join (
   select level dum
     from dual
   connect by level <= 1000
 ) dum
/

create index bigemp_mgr_eno on bigemp (mgr, empno)
/

begin
   dbms_stats.gather_table_stats(USER,'BIGEMP',cascade=>true);
end;
/

So let's try the pattern matching method:

set autotrace traceonly
set timing on

with hierarchy as (
   select lvl, empno, ename, rownum as rn
   from (
      select level as lvl, empno, ename
        from bigemp
       start with mgr is null
       connect by mgr = prior empno
       order siblings by empno
   )
)
select empno
     , lpad(' ', (lvl-1)*2) || ename as ename
     , subs
  from hierarchy
match_recognize (
   order by rn
   measures
      strt.rn as rn
    , strt.lvl as lvl
    , strt.empno as empno
    , strt.ename as ename
    , count(higher.lvl) as subs
   one row per match
   after match skip to next row
   pattern (
      strt higher*
   )
   define
      higher as higher.lvl > strt.lvl
)
 order by rn
/

14001 rows selected.

Elapsed: 00:00:00.35

Execution Plan
----------------------------------------------------------
Plan hash value: 443729768

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                                             | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                                      |        |     3 |   237 |    20  (15)| 00:00:01 |
|   1 |  SORT ORDER BY                                        |        |     3 |   237 |    20  (15)| 00:00:01 |
|   2 |   VIEW                                                |        |     3 |   237 |    19  (11)| 00:00:01 |
|   3 |    MATCH RECOGNIZE SORT DETERMINISTIC FINITE AUTOMATON|        |     3 |   198 |    19  (11)| 00:00:01 |
|   4 |     VIEW                                              |        |     3 |   198 |    18   (6)| 00:00:01 |
|   5 |      COUNT                                            |        |       |       |            |          |
|   6 |       VIEW                                            |        |     3 |   159 |    18   (6)| 00:00:01 |
|*  7 |        CONNECT BY NO FILTERING WITH START-WITH        |        |       |       |            |          |
|   8 |         TABLE ACCESS FULL                             | BIGEMP | 14001 |   300K|    17   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   7 - access("MGR"=PRIOR "EMPNO")
       filter("MGR" IS NULL)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         55  consistent gets
          0  physical reads
          0  redo size
     435280  bytes sent via SQL*Net to client
      10763  bytes received via SQL*Net from client
        935  SQL*Net roundtrips to/from client
          4  sorts (memory)
          0  sorts (disk)
      14001  rows processed

Nice, we process all 14.001 rows in less than half a second with a single full table access of BIGEMP.

So how about the scalar subquery method:

select empno
     , lpad(' ', (level-1)*2) || ename as ename
     , (
         select count(*)
           from bigemp sub
          start with sub.mgr = bigemp.empno
          connect by sub.mgr = prior sub.empno
       ) subs
  from bigemp
 start with mgr is null
 connect by mgr = prior empno
 order siblings by empno
/

14001 rows selected.

Elapsed: 00:00:11.61

Execution Plan
----------------------------------------------------------
Plan hash value: 2938165771

----------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name           | Rows  | Bytes | Cost (%CPU)| Time  |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                |     3 |   198 |    27   (4)| 00:00:01 |
|   1 |  SORT AGGREGATE                         |                |     1 |    26 |            |       |
|*  2 |   CONNECT BY WITH FILTERING             |                |       |       |            |       |
|*  3 |    INDEX RANGE SCAN                     | BIGEMP_MGR_ENO |     2 |    24 |     2   (0)| 00:00:01 |
|   4 |    NESTED LOOPS                         |                |     5 |   125 |     4   (0)| 00:00:01 |
|   5 |     CONNECT BY PUMP                     |                |       |       |            |       |
|*  6 |     INDEX RANGE SCAN                    | BIGEMP_MGR_ENO |     2 |    24 |     1   (0)| 00:00:01 |
|*  7 |  CONNECT BY NO FILTERING WITH START-WITH|                |       |       |            |       |
|   8 |   TABLE ACCESS FULL                     | BIGEMP         | 14001 |   300K|    17   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("SUB"."MGR"=PRIOR "SUB"."EMPNO")
   3 - access("SUB"."MGR"=:B1)
   6 - access("SUB"."MGR"="connect$_by$_pump$_009"."prior sub.empno        ")
   7 - access("MGR"=PRIOR "EMPNO")
       filter("MGR" IS NULL)

Note
-----
   - this is an adaptive plan


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     465005  consistent gets
          0  physical reads
          0  redo size
     435280  bytes sent via SQL*Net to client
      10763  bytes received via SQL*Net from client
        935  SQL*Net roundtrips to/from client
      37008  sorts (memory)
          0  sorts (disk)
      14001  rows processed

Well, well... More than 11 seconds - 30 times slower. One full table access and then a lot of CONNECT BY with INDEX RANGE SCAN in the scalar subquery.

And notice 465.005 consistent gets versus 55 - and 37.008 sorts versus 3 !

I think the numbers speak for themselves - I will go with MATCH_RECOGNIZE for this...

Details on the syntax can be found in the chapter on SQL for Pattern Matching in the 12c Data Warehousing Guide.

No comments:

Post a Comment