Monday, October 5, 2015

Scraping and querying Oracle Database Developer Choice Awards votes - Part 2

In Part 1 I demonstrated how to "scrape" data from the live webpages with the votes of the Oracle Database Developer Choice Awards. Unfortunately those webpages are not ideal for scraping, so I promised a workaround. Here I do a semi-manual scraping af data by using a browser to retrieve the actual HTML, save it to a file, and then let the code parse out the data from the file.



In the live version, the first thing I had to do was privileges to let the database perform http calls. That is not necessary when doing it this way. So in cases where you might not get permission from your sysadmin to open the firewall for database callouts, this can also be a method.

Instead of http privileges, I need to create a directory object. This should point to a directory on the database server (or at least a network share that is mounted on a path on the database server.) I grant privileges on the directory to SCOTT user.

create or replace directory odevchoice as '/home/oracle/odevchoice'
/
grant read, write on directory odevchoice to scott
/

In SCOTT schema I create the types and table just like I did in Part 1. Only thing I have different here is I added a DATE column CACHED to the table. This is on request from Steven Feuerstein, so we have a history of voting data.

create type oddca_nominee_type as object (
   category varchar2(10)
 , idea_id  integer
 , name     varchar2(30)
 , score    integer
 , votes    integer
)
/

create type oddca_nominee_table_type as table of oddca_nominee_type
/

create table oddca_nominee_cache (
   cached   date
 , category varchar2(10)
 , idea_id  integer
 , name     varchar2(30)
 , score    integer
 , votes    integer
)
/

The package specification is unchanged from Part 1.

create or replace package oddca
as
   subtype category_type is varchar2(10);
   
   CATEGORY_SQL       constant category_type := 'sql';
   CATEGORY_PLSQL     constant category_type := 'plsql';
   CATEGORY_ORDS      constant category_type := 'ords';
   CATEGORY_APEX      constant category_type := 'apex';
   CATEGORY_DBDESIGN  constant category_type := 'db-design';

   function nominees(
      p_category category_type
   )
      return oddca_nominee_table_type pipelined;

   procedure populate_cache(
      p_category category_type
   );
end oddca;
/

In the body I need a few changes. The major change is the get_html function, where I open a BFILE and use DBMS_LOB.LOADCLOBFROMFILE. This retrieves the HTML from file instead of HTTP call, but still returns the HTML as a CLOB.

create or replace package body oddca
as
   
   function get_html(
      p_category category_type
   )
      return clob
   is
      html          clob;
      src           bfile;
      dest_offset   integer := 1;
      src_offset    integer := 1;
      lang_context  integer := dbms_lob.default_lang_ctx;
      warning       integer;
  begin
      if p_category not in (
         CATEGORY_SQL
       , CATEGORY_PLSQL
       , CATEGORY_ORDS
       , CATEGORY_APEX
       , CATEGORY_DBDESIGN
      ) then
         raise_application_error(-20000, 'Invalid award category');
      end if;

      dbms_lob.createtemporary(html, false, DBMS_LOB.CALL);

      src := bfilename('DEVCHOICE', p_category||'-voting.html');
      dbms_lob.open(src, dbms_lob.file_readonly);
      dbms_lob.loadclobfromfile(
         html
       , src
       , dbms_lob.lobmaxsize
       , dest_offset
       , src_offset
       , nls_charset_id('UTF8')
       , lang_context
       , warning
      );

      return html;
   end get_html;
   

The second helper function needs no change. The content of the HTML is no different when reading it from a file rather than HTTP.

   function html_to_xml(
      p_html clob
   )
      return xmltype
   is
      cutout  clob;
      pos1    integer;
      pos2    integer;
      data    xmltype;
   begin
      pos1 := dbms_lob.instr(p_html, '<div class="display-idea-list"', 1);
      pos2 := dbms_lob.instr(p_html, '<span id="jive-whats-new-more">', pos1+1);
       
      if pos1 = 0 or pos2 = 0 or pos1 = null or pos2 = null then
         raise_application_error(-20001, 'Not expected HTML content');
      end if;

      dbms_lob.createtemporary(cutout, false, DBMS_LOB.CALL);
      
      dbms_lob.copy(cutout, p_html, pos2-pos1, 1, pos1);
      data := xmltype(
                 '<doc>'
              || replace(replace(replace(replace(
                    cutout
                  , '&'||'rsquo;', ' ')
                  , '&'||'hellip;', ' ')
                  , '&'||'nbsp;', ' ')
                  , '<br>', '<br/>')
              || '</doc>'
              );
      
      dbms_lob.freetemporary(cutout);

      return data;
   end html_to_xml;


And the table function is also unchanged, since the XML returned from html_to_xml is unchanged.

   function nominees(
      p_category category_type
   )
      return oddca_nominee_table_type pipelined
   is
      html    clob;
      data    xmltype;
   begin
      html := get_html(p_category);
      data := html_to_xml(html);

      for nominees in (
         select oddca_nominee_type(
                   p_category
                 , to_number(regexp_replace(ideaid,'[^[:digit:]]')) /* 'ideaListDiv1234' => 1234 */
                 , trim(name)
                 , to_number(regexp_replace(score,'[^[:digit:]]'))
                 , to_number(regexp_replace(votes,'[^[:digit:]]')) /* '123 votes' => 123 */
                ) nominee
           from xmltable(
             '/doc/*'
             passing data
             columns
                ideaid varchar2(100) path '@id'
              , name   varchar2(100) path 'div/div[@class="jive-content"]/div[@class="jive-content-header"]/div[@class="jive-content-title"]/h2/a'
              , score  varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/div/strong'
              , votes  varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/span[@class="jive-score-meta font-color-meta"]/span[@class="idea-vote-count"]'
           )
      ) loop
         pipe row( nominees.nominee );
      end loop;
   end nominees;


The procedure to cache the data is a little bit changed from Part 1. This time I do not delete the cache content but only insert. The insert I give a timestamp in column CACHED. Because the data should be used internationally, I decided to save the timestamp in UTC. I only save the timestamp to "hour" precision, which of course only works if I don't populate the cache more often than maximum once an hour. As I expect to cache the voting history once a day, this is OK. But saving the hour as well gives me an idea of whether there was 28 hours or 20 hours since last time, as I cannot be certain to do this at the same hour every day. Not saving the entire timestamp with second precision is simply me being lazy to make querying with hour precision easier, and I don't want minute or second precision.

   procedure populate_cache(
      p_category category_type
   )
   is
   begin
      insert into oddca_nominee_cache
      select trunc(sys_extract_utc(systimestamp),'HH24')
           , n.category
           , n.idea_id
           , n.name
           , n.score
           , n.votes
        from table(oddca.nominees(p_category)) n;
   end populate_cache;
end oddca;
/

So now I can open these URLs in a browser:
https://community.oracle.com/community/database/awards/sql-voting
https://community.oracle.com/community/database/awards/plsql-voting
https://community.oracle.com/community/database/awards/ords-voting
https://community.oracle.com/community/database/awards/apex-voting
https://community.oracle.com/community/database/awards/db-design-voting

Each of those 5 pages I right-click, Save as, and save as HTML only to the directory I created, calling the files sql-voting.html, plsql-voting.html, etc.

And then I call my cache population procedure:

begin
   oddca.populate_cache(oddca.CATEGORY_SQL);
   oddca.populate_cache(oddca.CATEGORY_PLSQL);
   oddca.populate_cache(oddca.CATEGORY_ORDS);
   oddca.populate_cache(oddca.CATEGORY_APEX);
   oddca.populate_cache(oddca.CATEGORY_DBDESIGN);
   commit;
end;
/

That I can perform about once a day or how often I wish to save the voting history.

Having voting history, I can now study how much each nominee increases his/her votes over time:

select to_char(cached, 'Mon dd, HH24"h"') hour
     , category
     , name
     , score
     , votes
     , negative
     , positive
     , round(100 * positive / nullif(votes,0), 1) pos_pct
     , dense_rank() over (partition by cached, category order by score desc) rnk_score
     , dense_rank() over (partition by cached, category order by positive desc) rnk_pos
     , score - lag(score) over (partition by category, name order by cached) inc_score
     , positive - lag(positive) over (partition by category, name order by cached) inc_pos
  from (
   select cached
        , category
        , name
        , score
        , votes
        , (votes - score / 10) / 2 as negative
        , score / 10 + (votes - score / 10) / 2 as positive
     from oddca_nominee_cache
  ) s1
 order by cached desc, category, score desc
/

Notice how my pal Erik van Roon jumps up the ladder from an 8th place to a 5th place between October 2nd and 3rd? :-)

HOUR        CATEGORY  NAME               SCORE VOTES NEGATIVE POSITIVE POS_PCT RNK_SCORE RNK_POS INC_SCORE INC_POS
----------- --------- ------------------ ----- ----- -------- -------- ------- --------- ------- --------- -------
Oct 05, 04h apex      Jari Laine           680    80        6       74    92.5         1       1         0       0
Oct 05, 04h apex      Morten Braten        660    76        5       71    93.4         2       2         0       0
Oct 05, 04h apex      Juergen Schuster     510    55        2       53    96.4         3       3         0       0
Oct 05, 04h apex      Kiran Pawar          400    50        5       45    90.0         4       4         0       0
Oct 05, 04h apex      Karen Cannell        260    30        2       28    93.3         5       5        10       1
Oct 05, 04h apex      Paul MacMillan       130    29        8       21    72.4         6       6         0       0
Oct 05, 04h apex      Trent Schafer        110    21        5       16    76.2         7       7         0       0
Oct 05, 04h db-design Heli Helskyaho       850   103        9       94    91.3         1       1         0       0
Oct 05, 04h db-design Michelle Kolbe       580    68        5       63    92.6         2       2         0       0
Oct 05, 04h db-design Rob Lockard          410    51        5       46    90.2         3       3         0       0
Oct 05, 04h db-design Mark Hoxey            90    23        7       16    69.6         4       4        10       1
Oct 05, 04h ords      Dietmar Aust        1170   125        4      121    96.8         1       1         0       0
Oct 05, 04h ords      Dimitri Gielis       790    91        6       85    93.4         2       2         0       0
Oct 05, 04h ords      Morten Braten        420    56        7       49    87.5         3       3         0       0
Oct 05, 04h ords      Kiran Pawar          340    42        4       38    90.5         4       4         0       0
Oct 05, 04h ords      Anton Nielsen        240    32        4       28    87.5         5       5         0       0
Oct 05, 04h ords      Tim St. Hilaire      130    19        3       16    84.2         6       6         0       0
Oct 05, 04h plsql     Adrian Billington    970   103        3      100    97.1         1       2        20       2
Oct 05, 04h plsql     Roger Troller        870   115       14      101    87.8         2       1         0       0
Oct 05, 04h plsql     Sean Stuber          750    87        6       81    93.1         3       3        10       1
Oct 05, 04h plsql     Patrick Barel        650    85       10       75    88.2         4       4         0       0
Oct 05, 04h plsql     Morten Braten        620    74        6       68    91.9         5       5         0       0
Oct 05, 04h plsql     Kim Berg Hansen      410    55        7       48    87.3         6       6         0       0
Oct 05, 04h plsql     Bill Coulam          380    50        6       44    88.0         7       7        10       1
Oct 05, 04h sql       Emrah Mete          2980   366       34      332    90.7         1       1         0       1
Oct 05, 04h sql       Sayan Malakshinov   1750   239       32      207    86.6         2       2        30       3
Oct 05, 04h sql       Sean Stuber          670   103       18       85    82.5         3       3        10       1
Oct 05, 04h sql       Kim Berg Hansen      620    98       18       80    81.6         4       4        10       1
Oct 05, 04h sql       Erik Van Roon        300    60       15       45    75.0         5       5        30       3
Oct 05, 04h sql       Matthias Rogel       280    46        9       37    80.4         6       7        10       1
Oct 05, 04h sql       Justin Cave          270    55       14       41    74.5         7       6         0       0
Oct 05, 04h sql       Stew Ashton          250    49       12       37    75.5         8       7         0       0
Oct 03, 13h apex      Jari Laine           680    80        6       74    92.5         1       1         0       0
Oct 03, 13h apex      Morten Braten        660    76        5       71    93.4         2       2         0       0
Oct 03, 13h apex      Juergen Schuster     510    55        2       53    96.4         3       3         0       0
Oct 03, 13h apex      Kiran Pawar          400    50        5       45    90.0         4       4         0       0
Oct 03, 13h apex      Karen Cannell        250    29        2       27    93.1         5       5        10       1
Oct 03, 13h apex      Paul MacMillan       130    29        8       21    72.4         6       6         0       0
Oct 03, 13h apex      Trent Schafer        110    21        5       16    76.2         7       7         0       0
Oct 03, 13h db-design Heli Helskyaho       850   103        9       94    91.3         1       1         0       0
Oct 03, 13h db-design Michelle Kolbe       580    68        5       63    92.6         2       2         0       0
Oct 03, 13h db-design Rob Lockard          410    51        5       46    90.2         3       3         0       0
Oct 03, 13h db-design Mark Hoxey            80    22        7       15    68.2         4       4        10       1
Oct 03, 13h ords      Dietmar Aust        1170   125        4      121    96.8         1       1        10       1
Oct 03, 13h ords      Dimitri Gielis       790    91        6       85    93.4         2       2        10       1
Oct 03, 13h ords      Morten Braten        420    56        7       49    87.5         3       3         0       0
Oct 03, 13h ords      Kiran Pawar          340    42        4       38    90.5         4       4         0       0
Oct 03, 13h ords      Anton Nielsen        240    32        4       28    87.5         5       5         0       0
Oct 03, 13h ords      Tim St. Hilaire      130    19        3       16    84.2         6       6        20       2
Oct 03, 13h plsql     Adrian Billington    950   101        3       98    97.0         1       2        10       1
Oct 03, 13h plsql     Roger Troller        870   115       14      101    87.8         2       1         0       0
Oct 03, 13h plsql     Sean Stuber          740    86        6       80    93.0         3       3        30       3
Oct 03, 13h plsql     Patrick Barel        650    85       10       75    88.2         4       4         0       0
Oct 03, 13h plsql     Morten Braten        620    74        6       68    91.9         5       5         0       0
Oct 03, 13h plsql     Kim Berg Hansen      410    55        7       48    87.3         6       6         0       0
Oct 03, 13h plsql     Bill Coulam          370    49        6       43    87.8         7       7         0       0
Oct 03, 13h sql       Emrah Mete          2980   364       33      331    90.9         1       1       100      10
Oct 03, 13h sql       Sayan Malakshinov   1720   236       32      204    86.4         2       2        30       3
Oct 03, 13h sql       Sean Stuber          660   102       18       84    82.4         3       3        30       3
Oct 03, 13h sql       Kim Berg Hansen      610    97       18       79    81.4         4       4        10       1
Oct 03, 13h sql       Erik Van Roon        270    57       15       42    73.7         5       5       100      10
Oct 03, 13h sql       Justin Cave          270    55       14       41    74.5         5       6         0       0
Oct 03, 13h sql       Matthias Rogel       270    45        9       36    80.0         5       8         0       0
Oct 03, 13h sql       Stew Ashton          250    49       12       37    75.5         6       7         0       0
Oct 02, 08h apex      Jari Laine           680    80        6       74    92.5         1       1
Oct 02, 08h apex      Morten Braten        660    76        5       71    93.4         2       2
Oct 02, 08h apex      Juergen Schuster     510    55        2       53    96.4         3       3
Oct 02, 08h apex      Kiran Pawar          400    50        5       45    90.0         4       4
Oct 02, 08h apex      Karen Cannell        240    28        2       26    92.9         5       5
Oct 02, 08h apex      Paul MacMillan       130    29        8       21    72.4         6       6
Oct 02, 08h apex      Trent Schafer        110    21        5       16    76.2         7       7
Oct 02, 08h db-design Heli Helskyaho       850   103        9       94    91.3         1       1
Oct 02, 08h db-design Michelle Kolbe       580    68        5       63    92.6         2       2
Oct 02, 08h db-design Rob Lockard          410    51        5       46    90.2         3       3
Oct 02, 08h db-design Mark Hoxey            70    21        7       14    66.7         4       4
Oct 02, 08h ords      Dietmar Aust        1160   124        4      120    96.8         1       1
Oct 02, 08h ords      Dimitri Gielis       780    90        6       84    93.3         2       2
Oct 02, 08h ords      Morten Braten        420    56        7       49    87.5         3       3
Oct 02, 08h ords      Kiran Pawar          340    42        4       38    90.5         4       4
Oct 02, 08h ords      Anton Nielsen        240    32        4       28    87.5         5       5
Oct 02, 08h ords      Tim St. Hilaire      110    17        3       14    82.4         6       6
Oct 02, 08h plsql     Adrian Billington    940   100        3       97    97.0         1       2
Oct 02, 08h plsql     Roger Troller        870   115       14      101    87.8         2       1
Oct 02, 08h plsql     Sean Stuber          710    83        6       77    92.8         3       3
Oct 02, 08h plsql     Patrick Barel        650    85       10       75    88.2         4       4
Oct 02, 08h plsql     Morten Braten        620    74        6       68    91.9         5       5
Oct 02, 08h plsql     Kim Berg Hansen      410    55        7       48    87.3         6       6
Oct 02, 08h plsql     Bill Coulam          370    49        6       43    87.8         7       7
Oct 02, 08h sql       Emrah Mete          2880   354       33      321    90.7         1       1
Oct 02, 08h sql       Sayan Malakshinov   1690   233       32      201    86.3         2       2
Oct 02, 08h sql       Sean Stuber          630    99       18       81    81.8         3       3
Oct 02, 08h sql       Kim Berg Hansen      600    96       18       78    81.3         4       4
Oct 02, 08h sql       Justin Cave          270    55       14       41    74.5         5       5
Oct 02, 08h sql       Matthias Rogel       270    45        9       36    80.0         5       7
Oct 02, 08h sql       Stew Ashton          250    49       12       37    75.5         6       6
Oct 02, 08h sql       Erik Van Roon        170    47       15       32    68.1         7       8

96 rows selected.

So this is the alternative scraping method, for cases where either the database is not allowed to call out, or the webpages to be scraped are so much designed for interactive browser use that scraping http calls are not possible or prohibitively difficult. For those case retrieving the HTML via the browser can work, and then the data parsing code can work on the saved HTML files.

If you found this helpful, why don't you consider voting for me in the Oracle Database Developer Choice Awards? :-)

No comments:

Post a Comment