Saturday, August 19, 2017

Partitioning external tables in 12.2

One of the new features in Oracle 12.2 is partitioning of external tables - a quite useful feature if you have multiple identical files of data, for example from multiple sources.

But how can you partition external files, you ask? Well, you don't partition the files, but each file can be considered a partition by the database. I'll show you an example.



Like always, to use external tables we need a DIRECTORY and our user needs privileges on the directory. So as a DBA user we do this:

create directory ext_table_dir as '/u01/extdata/orcl'
/

grant read, write on directory ext_table_dir to hr
/

In that directory we have 4 files:

[oracle@vbgeneric orcl]$ pwd
/u01/extdata/orcl
[oracle@vbgeneric orcl]$ ls -l *.csv
-rw-r--r-- 1 oracle oinstall  59 Aug 18 15:43 amen.csv
-rw-r--r-- 1 oracle oinstall  61 Aug 18 15:44 ames.csv
-rw-r--r-- 1 oracle oinstall 106 Aug 18 15:49 asoc.csv
-rw-r--r-- 1 oracle oinstall 103 Aug 18 15:46 emea.csv

Those CSV files contain sales data per region and country. We're getting those files from 4 offices around the world.

Region AMER (Americas) we get in two files - amen.csv is AMER North, ames.csv is AMER South. File emea.csv is data from region EMEA (Europe, Middle-East and Africa.) And file asoc.csv has data from two regions - ASIA and OCEA (Oceania.)

[oracle@vbgeneric orcl]$ head *.csv
==> amen.csv <==
REGION;COUNTRY;SALES
AMER;Canada;1000000
AMER;USA;2000000


==> ames.csv <==
REGION;COUNTRY;SALES
AMER;Brasil;750000
AMER;Uruguay;250000


==> asoc.csv <==
REGION;COUNTRY;SALES
ASIA;China;2750000
ASIA;Korea;1100000
OCEA;Australia;900000
OCEA;New Zealand;700000


==> emea.csv <==
REGION;COUNTRY;SALES
EMEA;Denmark;1500000
EMEA;Germany;1750000
EMEA;Turkey;300000
EMEA;Nigeria;100000

So, as user HR, let's use these files in an external table to be able to query those sales data.

First we try it without partitioning so we can observe the difference:

create table sales_ext (
   region      varchar2(4)
 , country     varchar2(20)
 , sales       number
) organization external (
   type oracle_loader
   default directory ext_table_dir
   access parameters (
      records delimited by newline
      field names all files
      fields terminated by ';'
      missing field values are null
      reject rows with all null fields
   )
   location ('amen.csv', 'ames.csv', 'asoc.csv', 'emea.csv')
)
/

We put all the files in the LOCATION clause, that way when we select from the external table, all the files will be read.

So we try to select all the data:

select /*+ gather_plan_statistics */ *
  from sales_ext
 order by region, country
/

REGI COUNTRY                   SALES
---- -------------------- ----------
AMER Brasil                   750000
AMER Canada                  1000000
AMER USA                     2000000
AMER Uruguay                  250000
ASIA China                   2750000
ASIA Korea                   1100000
EMEA Denmark                 1500000
EMEA Germany                 1750000
EMEA Nigeria                  100000
EMEA Turkey                   300000
OCEA Australia                900000
OCEA New Zealand              700000

12 rows selected.

Yup, all the rows from all 4 files are shown, just like we want it.

If we look at the plan used, we see a full table scan of the external table.

select * from table(dbms_xplan.display_cursor(format=>'IOSTATS PARTITION LAST'))
/

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------
SQL_ID  0wax5hfmwn32u, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ *   from sales_ext  order by
region, country

Plan hash value: 1357455879

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |     12 |00:00:00.12 |     486 |
|   1 |  SORT ORDER BY              |           |      1 |    408K|     12 |00:00:00.12 |     486 |
|   2 |   EXTERNAL TABLE ACCESS FULL| SALES_EXT |      1 |    408K|     12 |00:00:00.12 |     486 |
---------------------------------------------------------------------------------------------------

Let's query one of the regions - EMEA:

select /*+ gather_plan_statistics */ *
  from sales_ext
 where region = 'EMEA'
 order by region, country
/

REGI COUNTRY                   SALES
---- -------------------- ----------
EMEA Denmark                 1500000
EMEA Germany                 1750000
EMEA Nigeria                  100000
EMEA Turkey                   300000

Looking at the plan, again we see full table access, this time with a filter on the region:

---------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |           |      1 |        |      4 |00:00:00.11 |     486 |
|   1 |  SORT ORDER BY              |           |      1 |   4084 |      4 |00:00:00.11 |     486 |
|*  2 |   EXTERNAL TABLE ACCESS FULL| SALES_EXT |      1 |   4084 |      4 |00:00:00.11 |     486 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("REGION"='EMEA')

So all the files have been read, and then those rows that weren't from EMEA was discarded. A bit of a wasted effort, since we happen to know that the data for region EMEA only exists in one of the 4 files.

So we drop this approach and try it a bit differently.

drop table sales_ext
/

Again we create the external table, but this time instead of one global LOCATION clause with all 4 files, we use the PARTITION clause to setup LIST partitioning on the external table:

create table sales_ext (
   region      varchar2(4)
 , country     varchar2(20)
 , sales       number
) organization external (
   type oracle_loader
   default directory ext_table_dir
   access parameters (
      records delimited by newline
      field names all files
      fields terminated by ';'
      missing field values are null
      reject rows with all null fields
   )
)
partition by list (region)
(
   partition sales_ext_part_amer values ('AMER') location ('amen.csv', 'ames.csv')
 , partition sales_ext_part_asoc values ('ASIA', 'OCEA') location ('asoc.csv')
 , partition sales_ext_part_emea values ('EMEA') location ('emea.csv')
)
/

We create three partitions, where we tell the database that region AMER is in two files, regions ASIA and OCEA both are in one file, and region EMEA is in the last file.

So we try to select everything again:

select /*+ gather_plan_statistics */ *
  from sales_ext
 order by region, country
/

REGI COUNTRY                   SALES
---- -------------------- ----------
AMER Brasil                   750000
AMER Canada                  1000000
AMER USA                     2000000
AMER Uruguay                  250000
ASIA China                   2750000
ASIA Korea                   1100000
EMEA Denmark                 1500000
EMEA Germany                 1750000
EMEA Nigeria                  100000
EMEA Turkey                   300000
OCEA Australia                900000
OCEA New Zealand              700000

12 rows selected.

No change in the output, of course.

Looking at the plan, we can see that it is still full table access, but partitioning information is included and we see it scans all three partitions:

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |       |       |     12 |00:00:00.14 |     486 |
|   1 |  SORT ORDER BY               |           |      1 |  24507 |       |       |     12 |00:00:00.14 |     486 |
|   2 |   PARTITION LIST ALL         |           |      1 |  24507 |     1 |     3 |     12 |00:00:00.14 |     486 |
|   3 |    EXTERNAL TABLE ACCESS FULL| SALES_EXT |      3 |  24507 |     1 |     3 |     12 |00:00:00.14 |     486 |
--------------------------------------------------------------------------------------------------------------------

Then the interesting thing happens when we select data only for region EMEA:

select /*+ gather_plan_statistics */ *
  from sales_ext
 where region = 'EMEA'
 order by region, country
/

REGI COUNTRY                   SALES
---- -------------------- ----------
EMEA Denmark                 1500000
EMEA Germany                 1750000
EMEA Nigeria                  100000
EMEA Turkey                   300000

The plan this time changes to PARTITION LIST SINGLE and in Pstart/Pstop is shown that only the third partition is accessed. This means that only file emea.csv has been read.

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |       |       |      4 |00:00:00.32 |     486 |
|   1 |  PARTITION LIST SINGLE       |           |      1 |     82 |   KEY |   KEY |      4 |00:00:00.32 |     486 |
|   2 |   SORT ORDER BY              |           |      1 |     82 |       |       |      4 |00:00:00.32 |     486 |
|   3 |    EXTERNAL TABLE ACCESS FULL| SALES_EXT |      1 |     82 |     3 |     3 |      4 |00:00:00.32 |     486 |
--------------------------------------------------------------------------------------------------------------------

Note that there is no filter here, it is not necessary as the database knows, that partition 3 contains all the data for region EMEA and nothing but data for region EMEA. This is about as efficient as it can get.

Our second partition (file asoc.csv) had data from two regions in it, ASIA and OCEA. Let's see what happens if we query region OCEA:

select /*+ gather_plan_statistics */ *
  from sales_ext
 where region = 'OCEA'
 order by region, country
/

REGI COUNTRY                   SALES
---- -------------------- ----------
OCEA Australia                900000
OCEA New Zealand              700000

Again a PARTITION LIST SINGLE and Pstart/Pstop shows we scan the second partition.

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |       |       |      2 |00:00:00.16 |     486 |
|   1 |  PARTITION LIST SINGLE       |           |      1 |     82 |   KEY |   KEY |      2 |00:00:00.16 |     486 |
|   2 |   SORT ORDER BY              |           |      1 |     82 |       |       |      2 |00:00:00.16 |     486 |
|*  3 |    EXTERNAL TABLE ACCESS FULL| SALES_EXT |      1 |     82 |     2 |     2 |      2 |00:00:00.16 |     486 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("REGION"='OCEA')

But since the partition contains two regions, this time a FILTER is needed. Still it is better than without partitioning, because we only read the single file necessary and only discard ASIA rows, rather than reading all files and discarding almost all rows.

But one thing you must beware of is, that for external tables the partitioning is not enforced - it cannot be. The database trusts you and believes you are telling it the truth.

So what happens if you lie to the database? Let's try to edit the emea.csv file:

[oracle@vbgeneric orcl]$ vi emea.csv

We add a line in region AMER (that should have been in file ames.csv), as well as a line in region MARS (that doesn't exist in the list partioning definition.)

REGION;COUNTRY;SALES
EMEA;Denmark;1500000
EMEA;Germany;1750000
EMEA;Turkey;300000
EMEA;Nigeria;100000
AMER;Argentina;80000
MARS;Red Planet;55555

So let's query region EMEA again after adding those rows:

select /*+ gather_plan_statistics */ *
  from sales_ext
 where region = 'EMEA'
 order by region, country
/

REGI COUNTRY                   SALES
---- -------------------- ----------
AMER Argentina                 80000
EMEA Denmark                 1500000
EMEA Germany                 1750000
EMEA Nigeria                  100000
EMEA Turkey                   300000
MARS Red Planet                55555

What? The output shows two rows that doesn't satisfy the WHERE clause? How did that happen?

Well, the plan is just like it was before - full access of the EMEA partition and no FILTER operation:

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |       |       |      6 |00:00:00.22 |     506 |
|   1 |  PARTITION LIST SINGLE       |           |      1 |     82 |   KEY |   KEY |      6 |00:00:00.22 |     506 |
|   2 |   SORT ORDER BY              |           |      1 |     82 |       |       |      6 |00:00:00.22 |     506 |
|   3 |    EXTERNAL TABLE ACCESS FULL| SALES_EXT |      1 |     82 |     3 |     3 |      6 |00:00:00.22 |     506 |
--------------------------------------------------------------------------------------------------------------------

So the database trusted us when we told it that file emea.csv contained only EMEA region data, therefore it didn't do any superfluous checking. That we lied to the database is our problem, so we're to blame for the "wrong" output.

We can also try to query region AMER:

select /*+ gather_plan_statistics */ *
  from sales_ext
 where region = 'AMER'
 order by region, country
/

REGI COUNTRY                   SALES
---- -------------------- ----------
AMER Brasil                   750000
AMER Canada                  1000000
AMER USA                     2000000
AMER Uruguay                  250000

And the row for Argentina is not shown in the output even though it has region AMER?

Again the reason is the plan accesses partition 1 only, because the database is certain that region AMER data only exists in files amen.csv and ames.csv:

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |       |       |      4 |00:00:00.11 |     486 |
|   1 |  PARTITION LIST SINGLE       |           |      1 |     82 |   KEY |   KEY |      4 |00:00:00.11 |     486 |
|   2 |   SORT ORDER BY              |           |      1 |     82 |       |       |      4 |00:00:00.11 |     486 |
|   3 |    EXTERNAL TABLE ACCESS FULL| SALES_EXT |      1 |     82 |     1 |     1 |      4 |00:00:00.11 |     486 |
--------------------------------------------------------------------------------------------------------------------

Similarly we cannot query region MARS even though a row with region MARS exists in the emea.csv file:

select /*+ gather_plan_statistics */ *
  from sales_ext
 where region = 'MARS'
 order by region, country
/

no rows selected

This time the plan directly tells us that we are querying a region that doesn't exist according to the metadata we have given the database:

----------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name      | Starts | E-Rows | Pstart| Pstop | A-Rows |   A-Time   |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |      1 |        |       |       |      0 |00:00:00.01 |
|   1 |  PARTITION LIST EMPTY        |           |      1 |     82 |INVALID|INVALID|      0 |00:00:00.01 |
|   2 |   SORT ORDER BY              |           |      0 |     82 |       |       |      0 |00:00:00.01 |
|*  3 |    EXTERNAL TABLE ACCESS FULL| SALES_EXT |      0 |     82 |INVALID|INVALID|      0 |00:00:00.01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("REGION"='MARS')

So in all, partitioning external tables is very handy to avoid unnecessarily reading files that we know doesn't contain the data we are querying. We just have to be certain the data really is distributed in the files matching what we describe in the list partitioning clauses - if it isn't, we risk wrong output.

UPDATE:

Besides this partitioning and constraints allowed and such documented new 12.2 features on external tables, there's some very cool other new 12.2 features that just happened to be forgotten in the documentation. Read here about runtime parameter overriding:

https://blogs.oracle.com/datawarehousing/the-first-really-hidden-gem-in-oracle-database-12c-release-2:-runtime-modification-of-external-table-parameters

No comments:

Post a Comment