ROWS versus default RANGE in analytic window clause
I have talked at KScope about the difference between ROWS and RANGE in the analytic window clause, but haven't yet blogged about it. Recently, while working out a quiz for the PL/SQL Challenge, I discovered yet another reason to primarily use ROWS and only use RANGE when the problem actually requires it.
From my KScope presentation examples, here is a simple case of a rolling sum of salaries:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000      10875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       3450
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.
Scott and Ford get the same sum, and similarly Martin and Ward. Why? Because when the window clause is omitted, it defaults to RANGE BETWEEN:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
          range between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000      10875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       3450
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means that, for Scott, the sum covers the salaries of all employees in department 20 that have a salary (the ORDER BY column) equal to or smaller than Scott's. So it is the sum of everyone in deptno 20 with salary <= 3000, which includes Ford! When calculating SUM_SAL for Ford, the result is the same.
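To see the peer behaviour directly, here is a little sketch (not one of my KScope examples) that counts the rows in each window. With RANGE, both Scott and Ford get a five-row window, since they are peers on SAL = 3000 (the alias ROWS_IN_WINDOW is just made up for illustration):
select ename
     , sal
     , count(*) over (
          partition by deptno
          order by sal
          range between unbounded preceding and current row
       ) rows_in_window
  from scott.emp
 where deptno = 20
 order by sal
/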
We can get the rolling sum we probably expect by using ROWS BETWEEN instead:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 MILLER           1300       1300
        10 CLARK            2450       3750
        10 KING             5000       8750
        20 SMITH             800        800
        20 ADAMS            1100       1900
        20 JONES            2975       4875
        20 SCOTT            3000       7875
        20 FORD             3000      10875
        30 JAMES             950        950
        30 MARTIN           1250       2200
        30 WARD             1250       3450
        30 TURNER           1500       4950
        30 ALLEN            1600       6550
        30 BLAKE            2850       9400

14 rows selected.
The effect of ROWS BETWEEN rather than the default RANGE BETWEEN is that the sum does not take the value of the ORDER BY column into account, but simply adds up the salaries of Scott's row and the preceding rows. Note: since the ordering is non-unique, it is indeterminate whether Scott or Ford comes first. You might run this another day and see Ford first with SUM_SAL=7875 and Scott right after with SUM_SAL=10875. Generally I recommend adding something unique (for example ORDER BY sal, empno) to make the ordering deterministic, so the same data always gives the same output.
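For example, a deterministic version of the ROWS query could look like this sketch, adding EMPNO as tie-breaker in both the analytic ORDER BY and the final ORDER BY (Scott, empno 7788, then always sums before Ford, empno 7902):
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by sal, empno
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , sal
     , empno
/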
But what if we order by EMPNO rather than SAL, so that the ordering is unique and we are certain that RANGE BETWEEN never encounters duplicate values? Can't we allow ourselves to save some typing and just write:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by empno
       ) sum_sal
  from scott.emp
 order by deptno
     , empno
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 CLARK            2450       2450
        10 KING             5000       7450
        10 MILLER           1300       8750
        20 SMITH             800        800
        20 JONES            2975       3775
        20 SCOTT            3000       6775
        20 ADAMS            1100       7875
        20 FORD             3000      10875
        30 ALLEN            1600       1600
        30 WARD             1250       2850
        30 MARTIN           1250       4100
        30 BLAKE            2850       6950
        30 TURNER           1500       8450
        30 JAMES             950       9400

14 rows selected.
Since we have a unique ordering, the default RANGE BETWEEN in this case gives the same result as when we add the ROWS BETWEEN clause:
select deptno
     , ename
     , sal
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal
  from scott.emp
 order by deptno
     , empno
/
    DEPTNO ENAME             SAL    SUM_SAL
---------- ---------- ---------- ----------
        10 CLARK            2450       2450
        10 KING             5000       7450
        10 MILLER           1300       8750
        20 SMITH             800        800
        20 JONES            2975       3775
        20 SCOTT            3000       6775
        20 ADAMS            1100       7875
        20 FORD             3000      10875
        30 ALLEN            1600       1600
        30 WARD             1250       2850
        30 MARTIN           1250       4100
        30 BLAKE            2850       6950
        30 TURNER           1500       8450
        30 JAMES             950       9400

14 rows selected.
So why should we bother to type that long ROWS BETWEEN clause when we have something unique?
Well, consider this example of using two analytic functions - one within an inline view. (This example is a bit contrived on the EMP table, but I use similar constructs many times in my daily work.) The inline view creates a rolling sum; the outer select filters on the result and builds a new rolling sum over the leftover rows:
select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
       ) sum_sal_2
  from (
         select deptno
              , empno
              , ename
              , sal
              , sum(sal) over (
                   partition by deptno
                   order by empno
                ) sum_sal
           from scott.emp
       ) s
 where sum_sal > 5000
 order by deptno
     , empno
/
    DEPTNO      EMPNO ENAME             SAL    SUM_SAL  SUM_SAL_2
---------- ---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000       7450       5000
        10       7934 MILLER           1300       8750       6300
        20       7788 SCOTT            3000       6775       3000
        20       7876 ADAMS            1100       7875       4100
        20       7902 FORD             3000      10875       7100
        30       7698 BLAKE            2850       6950       2850
        30       7844 TURNER           1500       8450       4350
        30       7900 JAMES             950       9400       5300

8 rows selected.
The interesting part comes when examining the autotrace output for that SQL:
Execution Plan
----------------------------------------------------------
Plan hash value: 3842450409

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    14 |   826 |     4  (25)| 00:00:01 |
|   1 |  WINDOW BUFFER       |      |    14 |   826 |     4  (25)| 00:00:01 |
|*  2 |   VIEW               |      |    14 |   826 |     4  (25)| 00:00:01 |
|   3 |    WINDOW SORT       |      |    14 |   238 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   238 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SUM_SAL">5000)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        583  bytes sent via SQL*Net to client
        364  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
          8  rows processed
The inner SUM() uses a WINDOW SORT operation and the outer SUM() uses WINDOW BUFFER. Two memory sorts are used.
Let us add the ROWS BETWEEN clause:
select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal_2
  from (
         select deptno
              , empno
              , ename
              , sal
              , sum(sal) over (
                   partition by deptno
                   order by empno
                   rows between unbounded preceding and current row
                ) sum_sal
           from scott.emp
       ) s
 where sum_sal > 5000
 order by deptno
     , empno
/
Giving us the exact same output:
    DEPTNO      EMPNO ENAME             SAL    SUM_SAL  SUM_SAL_2
---------- ---------- ---------- ---------- ---------- ----------
        10       7839 KING             5000       7450       5000
        10       7934 MILLER           1300       8750       6300
        20       7788 SCOTT            3000       6775       3000
        20       7876 ADAMS            1100       7875       4100
        20       7902 FORD             3000      10875       7100
        30       7698 BLAKE            2850       6950       2850
        30       7844 TURNER           1500       8450       4350
        30       7900 JAMES             950       9400       5300

8 rows selected.
But there are a couple of interesting differences in the autotrace output:
Execution Plan
----------------------------------------------------------
Plan hash value: 1485958224

-----------------------------------------------------------------------------
| Id  | Operation            | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |      |    14 |   826 |     4  (25)| 00:00:01 |
|   1 |  WINDOW NOSORT       |      |    14 |   826 |     4  (25)| 00:00:01 |
|*  2 |   VIEW               |      |    14 |   826 |     4  (25)| 00:00:01 |
|   3 |    WINDOW SORT       |      |    14 |   238 |     4  (25)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| EMP  |    14 |   238 |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("SUM_SAL">5000)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        587  bytes sent via SQL*Net to client
        364  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          8  rows processed
The inner SUM() uses a WINDOW SORT operation like before, but the outer SUM() uses WINDOW NOSORT this time! And just one memory sort is used.
So when we are lazy and save typing by not writing a window specification, we get the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With RANGE, the optimizer cannot skip the second sort operation in this example (probably because RANGE needs to "look ahead" in the row source), but when we explicitly use the ROWS BETWEEN clause, the optimizer knows it can reuse the ordering it has already produced and save some work.
The same is true if the example had used different analytic functions in the inline view and the outer select (for example COUNT() as the inner analytic function and SUM() as the outer analytic function.) What matters is that they use the same PARTITION BY and ORDER BY. In real life I often use different functions in constructs like this, and the WINDOW NOSORT still applies - see the sketch below.
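For example, such a mixed construct could look like this sketch (the CNT_EMP alias and the filter value are just made up for illustration); since both functions share the same PARTITION BY and ORDER BY with an explicit ROWS BETWEEN, the outer SUM() can still get WINDOW NOSORT:
select s.*
     , sum(sal) over (
          partition by deptno
          order by empno
          rows between unbounded preceding and current row
       ) sum_sal_2
  from (
         select deptno
              , empno
              , ename
              , sal
              , count(*) over (
                   partition by deptno
                   order by empno
                   rows between unbounded preceding and current row
                ) cnt_emp
           from scott.emp
       ) s
 where cnt_emp > 2
 order by deptno
     , empno
/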
So this is yet another argument for my rule of thumb when I use analytic functions:
If I am applying an analytic function to the entire partition (or the entire result set), then I skip the window clause (that is, I do not write ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):
sum(sal) over ( partition by deptno )
But once I have an ORDER BY, I generally do not rely on the default window clause; instead I explicitly write either ROWS or RANGE BETWEEN:
sum(sal) over ( partition by deptno order by empno rows between unbounded preceding and current row )
Analytic functions that only support the order-by clause and not a windowing clause - like ROW_NUMBER(), for example - are of course exempt from this rule of thumb. But if a function supports both the order-by clause and the windowing clause, then I do not write an order-by clause without adding the windowing clause.
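So for ROW_NUMBER() I just write, for example:
row_number() over ( partition by deptno order by empno )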
I use this rule of thumb because in my experience 95% of the time I really want ROWS BETWEEN - only rarely do I have a true case for using RANGE BETWEEN. In 80% of the cases I have a unique ORDER BY clause, which actually makes RANGE BETWEEN give the same result, but as we saw above, at the cost of potentially doing much more sorting work. Explicitly writing ROWS or RANGE also makes the intention clear to other developers - if the windowing clause is omitted, I cannot tell whether RANGE was meant on purpose, or a lazy programmer really meant ROWS :-)
So please be explicit in your windowing clauses. If you need to save typing, then create an auto-replace in TOAD or whatever IDE you use :-)