Tuesday, September 23, 2014

CSV, XML and JSON parsing - a comparison over versions

Via Steven Feuerstein I was asked to give advice on the fastest way to parse CSV data in PL/SQL. The case was that they had a different database that could be set up to deliver data as a webservice serving either CSV, XML or JSON, and they wished to let their APEX application use PL/SQL to retrieve data from that other database via such webservices. They were not yet on 12.1.0.2.0, so they did not have JSON_TABLE, and they seemed to find that do-it-yourself parsing of CSV data was the fastest. Did we have any idea of a faster way?

I would normally expect built-in functions to be the fastest in general, so XMLTABLE would be my preference. But there might be use cases where the overhead in bytes of XML creates too much network traffic. Or there might possibly be cases where XMLTABLE is slower because it must be able to generically parse very complex XML, and so might have to set up complex internal structures that aren't really needed for very simple XML that merely represents rows of a table/view/query.

So maybe the manual CSV parsing could be faster. And since this would have to be used for maybe 10-20 calls to form a single APEX page, even shaving off milliseconds might make a difference in the total performance. I looked at the PL/SQL code example they had worked out and spotted at least one thing that could help.

The code would loop through the CSV data in the CLOB with a construct like this:

   loop
      l_rec_len := dbms_lob.instr(p_clob, l_eol, 1, l_rownr) - l_offset;
      l_buffer  := dbms_lob.substr(p_clob, l_rec_len, l_offset);
      exit when l_buffer is null;

      -- ...
      
      l_offset := l_offset + l_rec_len + 1;
      l_rownr := l_rownr + 1;
   end loop;

That uses the "occurrence" parameter of DBMS_LOB.INSTR (the fourth parameter), so the loop first asks for the position of the first EOL, then for the second EOL, and so on. When asked for the second EOL, DBMS_LOB.INSTR has to find the first EOL again before it can find the second. When asked for the third, it finds the first, the second, and then the third. So for N rows, the first EOL is actually found N times, the second EOL N-1 times, and so on - roughly N²/2 EOL scans in total, which means the work grows quadratically with the number of rows.

That code I could easily improve by changing the construct slightly:

   loop
      l_crpos2 := dbms_lob.instr(p_clob, l_eol, l_crpos1+1);
      l_buffer := dbms_lob.substr(p_clob, l_crpos2-l_crpos1-1, l_crpos1+1);
      exit when l_buffer is null;

      -- ...
      
      l_crpos1 := l_crpos2;
      l_rownr := l_rownr + 1;
   end loop;

Keep track of the positions so that each call to DBMS_LOB.INSTR asks for the first occurrence of EOL from the position of the last one + 1.

Similarly, each column was parsed by using the "occurrence" parameter of INSTR to get the Nth occurrence of a semicolon for column N:

      -- ...
      
      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,4);
      l_row.col4 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,5);
      l_row.col5 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );
      
      -- ...

And that could also quite easily be changed into code that keeps track of positions, so that INSTR never has to pass over the same parts of the string again and again.
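Applied to the column parsing, the position-tracking idea could look something like this sketch (the variable names and the l_cols collection are mine, purely for illustration):

```sql
   -- assumes: l_buffer holds one CSV record, l_cols is a varchar2 collection
   l_pos2 := 0;
   for i in 1 .. 20 loop
      l_pos1 := l_pos2;
      -- next semicolon after the previous one, or end of buffer for the last column
      l_pos2 := coalesce(
                   nullif(instr(l_buffer, ';', l_pos1 + 1), 0)
                 , length(l_buffer) + 1
                );
      l_cols(i) := substr(l_buffer, l_pos1 + 1, l_pos2 - l_pos1 - 1);
   end loop;
```

Each INSTR call starts scanning at l_pos1 + 1, so no part of the buffer is ever scanned twice.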

This change improved the performance quite a bit - particularly for many rows. I decided to test it against XMLTABLE and saw that the improved manual CSV parsing seemed to be faster - at least in the 11.2 instance I was using. But then I wanted to test against JSON_TABLE and tried it in a 12.1.0.2.0 instance - JSON_TABLE was quite fast, and there was a surprise when I tried XMLTABLE in 12.1.

So I decided to try it out more rigorously in three different versions I have available to me...

First I create a function that can return some rows of test data in any of the three formats. This emulates the results of the webservice call.

create or replace function create_testdata(
   p_rows   number
 , p_type   varchar2
)
   return clob
is
   l_eol    varchar2(2) := chr(13);
   l_clob   clob;
begin
   if p_type = 'CSV' then

      dbms_lob.createtemporary(l_clob,false);
      for r in 1..p_rows loop
         dbms_lob.append(l_clob,
               'Row '||r||' Column 1 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 2 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 3 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 4 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 5 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 6 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 7 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 8 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 9 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 10 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 11 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 12 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 13 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 14 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 15 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 16 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 17 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 18 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 19 The quick brown fox jumped over the lazy dog'
            ||';Row '||r||' Column 20 The quick brown fox jumped over the lazy dog'
            ||l_eol
         );
      end loop;

   elsif p_type = 'XML' then

      select xmlelement(
               "Rows"
             , xmlagg(
                  xmlelement(
                     "Row"
                   , xmlforest(
                        'Row '||level||' Column 1 The quick brown fox jumped over the lazy dog' as "Column1"
                      , 'Row '||level||' Column 2 The quick brown fox jumped over the lazy dog' as "Column2"
                      , 'Row '||level||' Column 3 The quick brown fox jumped over the lazy dog' as "Column3"
                      , 'Row '||level||' Column 4 The quick brown fox jumped over the lazy dog' as "Column4"
                      , 'Row '||level||' Column 5 The quick brown fox jumped over the lazy dog' as "Column5"
                      , 'Row '||level||' Column 6 The quick brown fox jumped over the lazy dog' as "Column6"
                      , 'Row '||level||' Column 7 The quick brown fox jumped over the lazy dog' as "Column7"
                      , 'Row '||level||' Column 8 The quick brown fox jumped over the lazy dog' as "Column8"
                      , 'Row '||level||' Column 9 The quick brown fox jumped over the lazy dog' as "Column9"
                      , 'Row '||level||' Column 10 The quick brown fox jumped over the lazy dog' as "Column10"
                      , 'Row '||level||' Column 11 The quick brown fox jumped over the lazy dog' as "Column11"
                      , 'Row '||level||' Column 12 The quick brown fox jumped over the lazy dog' as "Column12"
                      , 'Row '||level||' Column 13 The quick brown fox jumped over the lazy dog' as "Column13"
                      , 'Row '||level||' Column 14 The quick brown fox jumped over the lazy dog' as "Column14"
                      , 'Row '||level||' Column 15 The quick brown fox jumped over the lazy dog' as "Column15"
                      , 'Row '||level||' Column 16 The quick brown fox jumped over the lazy dog' as "Column16"
                      , 'Row '||level||' Column 17 The quick brown fox jumped over the lazy dog' as "Column17"
                      , 'Row '||level||' Column 18 The quick brown fox jumped over the lazy dog' as "Column18"
                      , 'Row '||level||' Column 19 The quick brown fox jumped over the lazy dog' as "Column19"
                      , 'Row '||level||' Column 20 The quick brown fox jumped over the lazy dog' as "Column20"
                     )
                  )
               )
             ).getclobval()
        into l_clob
        from dual
      connect by level <= p_rows;

   elsif p_type = 'JSON' then

      dbms_lob.createtemporary(l_clob,false);
      dbms_lob.append(l_clob,'{ "rows" : [');
      for r in 1..p_rows loop
         dbms_lob.append(l_clob,
            case r when 1 then '' else ',' end
            ||'{ column1 : "Row '||r||' Column 1 The quick brown fox jumped over the lazy dog"'
            ||', column2 : "Row '||r||' Column 2 The quick brown fox jumped over the lazy dog"'
            ||', column3 : "Row '||r||' Column 3 The quick brown fox jumped over the lazy dog"'
            ||', column4 : "Row '||r||' Column 4 The quick brown fox jumped over the lazy dog"'
            ||', column5 : "Row '||r||' Column 5 The quick brown fox jumped over the lazy dog"'
            ||', column6 : "Row '||r||' Column 6 The quick brown fox jumped over the lazy dog"'
            ||', column7 : "Row '||r||' Column 7 The quick brown fox jumped over the lazy dog"'
            ||', column8 : "Row '||r||' Column 8 The quick brown fox jumped over the lazy dog"'
            ||', column9 : "Row '||r||' Column 9 The quick brown fox jumped over the lazy dog"'
            ||', column10 : "Row '||r||' Column 10 The quick brown fox jumped over the lazy dog"'
            ||', column11 : "Row '||r||' Column 11 The quick brown fox jumped over the lazy dog"'
            ||', column12 : "Row '||r||' Column 12 The quick brown fox jumped over the lazy dog"'
            ||', column13 : "Row '||r||' Column 13 The quick brown fox jumped over the lazy dog"'
            ||', column14 : "Row '||r||' Column 14 The quick brown fox jumped over the lazy dog"'
            ||', column15 : "Row '||r||' Column 15 The quick brown fox jumped over the lazy dog"'
            ||', column16 : "Row '||r||' Column 16 The quick brown fox jumped over the lazy dog"'
            ||', column17 : "Row '||r||' Column 17 The quick brown fox jumped over the lazy dog"'
            ||', column18 : "Row '||r||' Column 18 The quick brown fox jumped over the lazy dog"'
            ||', column19 : "Row '||r||' Column 19 The quick brown fox jumped over the lazy dog"'
            ||', column20 : "Row '||r||' Column 20 The quick brown fox jumped over the lazy dog"'
            ||'} '
         );
      end loop;
      dbms_lob.append(l_clob,'] }');

   end if;
   
   return l_clob;
end create_testdata;
/

The manual parsing of CSV data will be done with a table function, which needs a row type and a table type to return:

create type t_test_row as object (
   rownr integer
 , col1  varchar2(100)
 , col2  varchar2(100)
 , col3  varchar2(100)
 , col4  varchar2(100)
 , col5  varchar2(100)
 , col6  varchar2(100)
 , col7  varchar2(100)
 , col8  varchar2(100)
 , col9  varchar2(100)
 , col10 varchar2(100)
 , col11 varchar2(100)
 , col12 varchar2(100)
 , col13 varchar2(100)
 , col14 varchar2(100)
 , col15 varchar2(100)
 , col16 varchar2(100)
 , col17 varchar2(100)
 , col18 varchar2(100)
 , col19 varchar2(100)
 , col20 varchar2(100)
)
/

create type t_test_tab as table of t_test_row
/

In the first function I use the original method that was sent to me in the question, which utilizes the "occurrence" parameter of both the DBMS_LOB.INSTR and INSTR functions:

create or replace function parse_csv1(
   p_clob   clob
)
   return t_test_tab pipelined
is
   l_row       t_test_row := t_test_row(null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null);
   l_buffer    varchar2(32767);
   l_eol       varchar2(2) := chr(13);
   l_rownr     number := 1;
   l_offset    number := 1;
   l_rec_len   number;
   l_pos1      number;
   l_pos2      number;
begin
   loop
      l_rec_len := dbms_lob.instr(p_clob, l_eol, 1, l_rownr) - l_offset;
      l_buffer  := dbms_lob.substr(p_clob, l_rec_len, l_offset);
      exit when l_buffer is null;

      l_row.rownr := l_rownr;  

      l_pos1 := instr(l_buffer,';',1);
      l_row.col1 := substr( l_buffer,1, l_pos1-1 );

      l_pos1 := instr(l_buffer,';',1,1);
      l_pos2 := instr(l_buffer,';',1,2);
      l_row.col2 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,3);
      l_row.col3 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,4);
      l_row.col4 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,5);
      l_row.col5 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,6);
      l_row.col6 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,7);
      l_row.col7 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,8);
      l_row.col8 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,9);
      l_row.col9 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,10);
      l_row.col10 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,11);
      l_row.col11 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,12);
      l_row.col12 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,13);
      l_row.col13 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,14);
      l_row.col14 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,15);
      l_row.col15 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,16);
      l_row.col16 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,17);
      l_row.col17 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,18);
      l_row.col18 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := instr(l_buffer,';',1,19);
      l_row.col19 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      l_pos1 := l_pos2;
      l_pos2 := length(l_buffer)+1;
      l_row.col20 := substr( l_buffer,l_pos1+1, l_pos2-l_pos1-1 );

      pipe row(l_row);

      l_offset := l_offset + l_rec_len + 1;
      l_rownr := l_rownr + 1;
   end loop;

   return;
exception
   when no_data_needed then
      null;
end parse_csv1;
/

The second function is my improved version, which keeps track of positions as it goes along, so it always asks for the first occurrence but starting from the position where the previous one was found:

create or replace function parse_csv2(
   p_clob   clob
)
   return t_test_tab pipelined
is
   l_row       t_test_row := t_test_row(null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null);
   l_buffer    varchar2(32767);
   l_eol       varchar2(2) := chr(13);
   l_rownr     number := 1;
   l_offset    number := 1;
   l_rec_len   number;
   l_pos1      number := 0;
   l_pos2      number := 0;
   l_crpos1    number := 0;
   l_crpos2    number;
   function lf_getcol return varchar2
   is
   begin
      l_pos1 := l_pos2;
      l_pos2 := coalesce(
                   nullif(instr(l_buffer, ';', l_pos1+1), 0)
                 , length(l_buffer)+1
                );
      return(substr(l_buffer, l_pos1+1, l_pos2-l_pos1-1));
   end;
begin
   loop
      l_crpos2 := dbms_lob.instr(p_clob, l_eol, l_crpos1+1);
      l_buffer := dbms_lob.substr(p_clob, l_crpos2-l_crpos1-1, l_crpos1+1);
      exit when l_buffer is null;

      l_row.rownr := l_rownr;  

      l_pos1 := 0;
      l_pos2 := 0;
      
      l_row.col1  := lf_getcol;
      l_row.col2  := lf_getcol;
      l_row.col3  := lf_getcol;
      l_row.col4  := lf_getcol;
      l_row.col5  := lf_getcol;
      l_row.col6  := lf_getcol;
      l_row.col7  := lf_getcol;
      l_row.col8  := lf_getcol;
      l_row.col9  := lf_getcol;
      l_row.col10 := lf_getcol;
      l_row.col11 := lf_getcol;
      l_row.col12 := lf_getcol;
      l_row.col13 := lf_getcol;
      l_row.col14 := lf_getcol;
      l_row.col15 := lf_getcol;
      l_row.col16 := lf_getcol;
      l_row.col17 := lf_getcol;
      l_row.col18 := lf_getcol;
      l_row.col19 := lf_getcol;
      l_row.col20 := lf_getcol;

      pipe row(l_row);

      l_crpos1 := l_crpos2;
      l_rownr := l_rownr + 1;
   end loop;

   return;
exception
   when no_data_needed then
      null;
end parse_csv2;
/

I create a table to log the timing results of the tests:

create table test_log (
   run         number
 , testtype    varchar2(4)
 , num_rows    number
 , parse_time  interval day to second(6)
)
/

And then the block that executes the various tests 3 times for 1, 10, 100, 1000 and 10000 rows:

declare
   l_num_rows  number;
   l_clob      clob;
   l_starttime timestamp(6);
   l_endtime   timestamp(6);
begin
   for l_run in 1..3 loop
      for l_power in 0..4 loop
         l_num_rows := power(10, l_power);
         
         l_clob := create_testdata(l_num_rows, 'CSV');
         
         l_starttime := systimestamp;
         
         for parsed_rows in (
            select *
              from table(parse_csv1(l_clob))
         ) loop
            null;
         end loop;
         
         l_endtime := systimestamp;
         
         insert into test_log values (
            l_run, 'CSV1', l_num_rows, l_endtime-l_starttime
         );
         commit;

         --
         
         l_starttime := systimestamp;
         
         for parsed_rows in (
            select *
              from table(parse_csv2(l_clob))
         ) loop
            null;
         end loop;
         
         l_endtime := systimestamp;

         insert into test_log values (
            l_run, 'CSV2', l_num_rows, l_endtime-l_starttime
         );
         commit;

         --
         
         l_clob := create_testdata(l_num_rows, 'XML');
         
         l_starttime := systimestamp;
         
         for parsed_rows in (
            select *
              from xmltable(
               '/Rows/Row'
               passing xmltype(
                  l_clob
               )
               columns
                  rownr for ordinality
                , col1  varchar2(100) path 'Column1'
                , col2  varchar2(100) path 'Column2'
                , col3  varchar2(100) path 'Column3'
                , col4  varchar2(100) path 'Column4'
                , col5  varchar2(100) path 'Column5'
                , col6  varchar2(100) path 'Column6'
                , col7  varchar2(100) path 'Column7'
                , col8  varchar2(100) path 'Column8'
                , col9  varchar2(100) path 'Column9'
                , col10 varchar2(100) path 'Column10'
                , col11 varchar2(100) path 'Column11'
                , col12 varchar2(100) path 'Column12'
                , col13 varchar2(100) path 'Column13'
                , col14 varchar2(100) path 'Column14'
                , col15 varchar2(100) path 'Column15'
                , col16 varchar2(100) path 'Column16'
                , col17 varchar2(100) path 'Column17'
                , col18 varchar2(100) path 'Column18'
                , col19 varchar2(100) path 'Column19'
                , col20 varchar2(100) path 'Column20'
              ) xml_rows
         ) loop
            null;
         end loop;
         
         l_endtime := systimestamp;

         insert into test_log values (
            l_run, 'XML', l_num_rows, l_endtime-l_starttime
         );
         commit;

         --
         
         -- Comment out this section if database version is less than 12.1.0.2.0
         -- >>>
         
         l_clob := create_testdata(l_num_rows ,'JSON');
         
         l_starttime := systimestamp;

         for parsed_rows in (
            select *
              from json_table(
               l_clob
              ,'$.rows[*]' columns (
                  rownr for ordinality
                , col1  varchar2(100) path '$.column1'
                , col2  varchar2(100) path '$.column2'
                , col3  varchar2(100) path '$.column3'
                , col4  varchar2(100) path '$.column4'
                , col5  varchar2(100) path '$.column5'
                , col6  varchar2(100) path '$.column6'
                , col7  varchar2(100) path '$.column7'
                , col8  varchar2(100) path '$.column8'
                , col9  varchar2(100) path '$.column9'
                , col10 varchar2(100) path '$.column10'
                , col11 varchar2(100) path '$.column11'
                , col12 varchar2(100) path '$.column12'
                , col13 varchar2(100) path '$.column13'
                , col14 varchar2(100) path '$.column14'
                , col15 varchar2(100) path '$.column15'
                , col16 varchar2(100) path '$.column16'
                , col17 varchar2(100) path '$.column17'
                , col18 varchar2(100) path '$.column18'
                , col19 varchar2(100) path '$.column19'
                , col20 varchar2(100) path '$.column20'
              )) json_rows
         ) loop
            null;
         end loop;

         l_endtime := systimestamp;

         insert into test_log values (
            l_run, 'JSON', l_num_rows, l_endtime-l_starttime
         );
         commit;
         
         -- <<<
         
      end loop;
   end loop;
end;
/

I can then compare the results by querying the log per test type:

select *
  from (
   select run
        , testtype
        , num_rows
        , to_char(extract(minute from parse_time),'FM09')||':'||
          to_char(extract(second from parse_time),'FM09D999999') parse_time
        , round(
             1000 * (extract(minute from parse_time)*60+extract(second from parse_time)) / num_rows 
           , 2
          ) ms_per_row
     from (
      select *
        from test_log
       union all
      select null run
           , testtype
           , num_rows
           , numtodsinterval(avg(
                extract(minute from parse_time)*60+extract(second from parse_time)
             ),'second') parse_time
        from test_log
       group by testtype, num_rows
     )
  )
 pivot (
   max(parse_time) parse
 , max(ms_per_row) per_row
   for testtype in (
      'CSV1' as csv1
    , 'CSV2' as csv2
    , 'XML'  as xml
    , 'JSON' as json
   )
 )
 order by run, num_rows
/

My results in 3 different versions/environments are shown below (a blank RUN is the average of the three runs):

  • Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
This environment is an OEL server in VMWare used by several people for testing.

RUN  NUM ROWS    CSV1 PARSE  CSV1/ROW    CSV2 PARSE  CSV2/ROW     XML PARSE  XML/ROW
  1         1  00:00.005215      5,22   00:00.00468      4,68  00:00.008071     8,07
  1        10  00:00.007828      0,78  00:00.006784      0,68  00:00.044712     4,47
  1       100  00:00.173038      1,73  00:00.086039      0,86  00:00.492416     4,92
  1      1000  00:11.750444     11,75  00:00.980171      0,98  00:11.752243    11,75
  1     10000  19:48.409493    118,84  00:10.779098      1,08  17:00.178783   102,02
  2         1  00:00.002452      2,45   00:00.01609     16,09  00:00.007291     7,29
  2        10  00:00.012376      1,24  00:00.013253      1,33  00:00.047517     4,75
  2       100  00:00.342004      3,42  00:00.247234      2,47  00:00.518944     5,19
  2      1000  00:16.065279     16,07  00:00.954751      0,95   00:12.12573    12,13
  2     10000  20:54.745519    125,47  00:21.760463      2,18   17:20.21969   104,02
  3         1  00:00.001244      1,24  00:00.010238     10,24  00:00.005864     5,86
  3        10  00:00.007862      0,79    00:00.0059      0,59  00:00.044587     4,46
  3       100  00:00.174661      1,75  00:00.089375      0,89  00:00.501092     5,01
  3      1000  00:11.369612     11,37   00:00.96306      0,96  00:12.223152    12,22
  3     10000  20:07.233167    120,72  00:09.922741      0,99   17:35.55329   105,56
            1   00:00.00297      2,97  00:00.010336     10,34  00:00.007075     7,08
           10  00:00.009355      0,94  00:00.008646      0,86  00:00.045605     4,56
          100  00:00.229901       2,3  00:00.140883      1,41  00:00.504151     5,04
         1000  00:13.061778     13,06  00:00.965994      0,97  00:12.033708    12,03
        10000   20:16.79606    121,68  00:14.154101      1,42  17:18.650588   103,87

Here in the first set of tests we spot that CSV1 grows slower and slower per row as the number of rows increases. It does not scale, because of the described problem with the "occurrence" parameter of INSTR. CSV2 does not have that problem and scales well as the number of rows increases. At 10.000 rows CSV2 is about 85 times faster per row than CSV1.

It is interesting that XMLTABLE seems to suffer a scaling problem almost as bad as the CSV1 method - it looks like XMLTABLE in this version also grows progressively slower as the number of rows increases. For the original requirement, do-it-yourself parsing of a CSV webservice seems a much better choice here than the built-in XMLTABLE on an XML webservice.

  • Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
This environment is a VM on a PC with a lot of SSD, so the times cannot be compared directly with the previous table. The interesting part is rather the comparison between the different methods.

RUN  NUM ROWS    CSV1 PARSE  CSV1/ROW    CSV2 PARSE  CSV2/ROW     XML PARSE  XML/ROW
  1         1  00:00.004626      4,63  00:00.004612      4,61  00:00.047893    47,89
  1        10  00:00.002274      0,23  00:00.002644      0,26  00:00.002777     0,28
  1       100  00:00.040193       0,4  00:00.027838      0,28  00:00.013524     0,14
  1      1000  00:02.092844      2,09  00:00.269647      0,27  00:00.138578     0,14
  1     10000  03:27.278202     20,73  00:02.828744      0,28  00:01.584584     0,16
  2         1  00:00.000512      0,51  00:00.000576      0,58  00:00.001642     1,64
  2        10  00:00.002183      0,22  00:00.002663      0,27  00:00.003235     0,32
  2       100  00:00.042229      0,42  00:00.028084      0,28  00:00.015719     0,16
  2      1000  00:02.122371      2,12  00:00.276936      0,28  00:00.195269      0,2
  2     10000  03:28.612211     20,86  00:02.866212      0,29  00:01.641526     0,16
  3         1  00:00.001379      1,38  00:00.001012      1,01  00:00.001801      1,8
  3        10  00:00.002225      0,22  00:00.002936      0,29  00:00.002867     0,29
  3       100  00:00.039881       0,4  00:00.027729      0,28  00:00.013694     0,14
  3      1000  00:02.110584      2,11  00:00.277426      0,28  00:00.133827     0,13
  3     10000  03:27.172959     20,72  00:02.839838      0,28  00:01.630668     0,16
            1  00:00.002172      2,17  00:00.002067      2,07  00:00.017112    17,11
           10  00:00.002227      0,22  00:00.002748      0,27   00:00.00296      0,3
          100  00:00.040768      0,41  00:00.027884      0,28  00:00.014312     0,14
         1000    00:02.1086      2,11   00:00.27467      0,27  00:00.155891     0,16
        10000  03:27.687791     20,77  00:02.844931      0,28  00:01.618926     0,16

Again CSV1 suffers scalability problems, which CSV2 does not. At 10.000 rows CSV2 is about 75 times faster per row than CSV1.

The really interesting result is that XMLTABLE appears to have improved massively in this version. Once XMLTABLE is "warmed up" it scales well and is about twice as fast as the otherwise good CSV2 method. Now the built-in beats the do-it-yourself, so in 12.1.0.1.0 XMLTABLE on an XML webservice looks set to outperform manual code on a CSV webservice.

  • Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
This environment is also an OEL server on VMWare, but only used by me. Again the times cannot be compared directly to the other versions, but the relative performance of the methods is interesting.

RUN  NUM ROWS    CSV1 PARSE  CSV1/ROW    CSV2 PARSE  CSV2/ROW     XML PARSE  XML/ROW    JSON PARSE  JSON/ROW
  1         1  00:00.004476      4,48  00:00.004232      4,23  00:00.002809     2,81  00:00.000414      0,41
  1        10  00:00.001474      0,15  00:00.000824      0,08  00:00.002234     0,22  00:00.000723      0,07
  1       100  00:00.053043      0,53  00:00.009351      0,09  00:00.010603     0,11  00:00.006555      0,07
  1      1000  00:04.797509       4,8  00:00.111597      0,11  00:00.104581      0,1  00:00.050972      0,05
  1     10000  08:10.989516      49,1  00:01.220601      0,12  00:01.050176     0,11  00:00.623944      0,06
  2         1  00:00.000408      0,41  00:00.000206      0,21  00:00.001499      1,5  00:00.000253      0,25
  2        10  00:00.001412      0,14  00:00.000997       0,1  00:00.002317     0,23  00:00.000746      0,07
  2       100   00:00.06044       0,6  00:00.009602       0,1  00:00.009898      0,1  00:00.005458      0,05
  2      1000  00:04.840011      4,84   00:00.15369      0,15  00:00.148956     0,15  00:00.053851      0,05
  2     10000  08:11.884956     49,19  00:01.141171      0,11  00:01.049298      0,1  00:00.564004      0,06
  3         1  00:00.000295       0,3   00:00.00022      0,22  00:00.001325     1,33  00:00.000258      0,26
  3        10  00:00.001406      0,14  00:00.000855      0,09  00:00.002049      0,2   00:00.00068      0,07
  3       100  00:00.054669      0,55  00:00.009505       0,1  00:00.010099      0,1  00:00.005421      0,05
  3      1000  00:04.821541      4,82  00:00.112043      0,11  00:00.098209      0,1  00:00.051282      0,05
  3     10000  08:08.371391     48,84  00:01.213267      0,12  00:01.072971     0,11  00:00.534615      0,05
            1  00:00.001726      1,73  00:00.001553      1,55  00:00.001878     1,88  00:00.000308      0,31
           10  00:00.001431      0,14  00:00.000892      0,09    00:00.0022     0,22  00:00.000716      0,07
          100  00:00.056051      0,56  00:00.009486      0,09    00:00.0102      0,1  00:00.005811      0,06
         1000  00:04.819687      4,82  00:00.125777      0,13  00:00.117249     0,12  00:00.052035      0,05
        10000  08:10.415288     49,04   00:01.19168      0,12  00:01.057482     0,11  00:00.574188      0,06

As before CSV1 does not scale but CSV2 does. At 10.000 rows CSV2 is about 445 times faster per row than CSV1.

XMLTABLE again scales consistently well. It is no longer twice as fast as CSV2 but approximately identical. My guess is that what we are seeing here is not a degradation of XMLTABLE, but rather that the PL/SQL method used in CSV2 is compiled a bit more optimally in 12.1.0.2.0 - note that CSV2 beats CSV1 by a much larger factor here in 12.1.0.2.0 than in 12.1.0.1.0 and 11.2.0.3.0.

Really interesting is that JSON_TABLE also scales well and is about twice as fast as both XML and CSV2. Presumably that is because JSON path expressions are simpler than XQuery, so JSON_TABLE needs to support less complex scenarios. It may also help that no special datatype is involved, just a plain CLOB. Or JSON_TABLE may simply be better written internally - who knows.

The lesson learned is that built-in functions can improve from version to version. Don't trust that what you once found to be the fastest way still is - it may change with an upgrade. Keep up to date on new functionality and consider whether it may be even faster.

For the original question, where they could choose between letting the internal webservices of the other database publish CSV, XML or JSON and then using the appropriate parsing method in Oracle, it definitely matters what they pick.

If they could use version 12.1.0.2.0, they would get the fastest parsing method and at the same time gain the advantages of JSON_TABLE parsing over coding it manually, while spending a little more network bandwidth than CSV but less than XML.

If they have version 12.1.0.1.0, XMLTABLE gives them a fast method and simpler code than manual parsing, but at the cost of the network bandwidth overhead of XML.

And if they are on version 11.2.0.3.0, the performance they need probably justifies the overhead of more code for manual CSV parsing, especially for the sake of scalability.

If they are on 11.2.0.4.0 or a different version, they should run the test scripts (available here) themselves and see whether the XMLTABLE scalability problem is only a problem in 11.2.0.3.0 ;-)
