Monday, October 5, 2015

Scraping and querying Oracle Database Developer Choice Awards votes - Part 1

You know about the Oracle Database Developer Choice Awards voting, right? If not, then go vote now, please, and then come back and read on :-)

I want to be able to query the voting status, who is ranked what in each category and so on. But I don't want to type in votes in a table, especially not when they change all the time. Instead I'd like to query the live data in the HTML webpages. Can that be done? Yes it can...



But unfortunately a slight caveat for this. It appears that when I do these HTTP calls from the database to the voting webpages on http://community.oracle.com , something happens that often make the URLs give me "An unexpected error occurred" rather than the HTML I want. I suspect (but not sure) that it may have to do with either that these HTTP calls do not log in or that they fail to accept a cookie acceptance popup window or that they look like robot crawler and get blocked or something.

Whatever the reason, the code I show in this blog post works only sporadically on the Oracle Community website. But I will show it anyway, as the technique can be very useful to other websites and URLs too. And then in a Part 2 blogpost I will show a workaround I did for making this work anyway, just with a little bit of manual work involved.

So here's a recipe for querying the HTML of a live webpage. Just beware that the particular website I use as an example is not the best of examples ;-)

First I setup a wallet with the certificates of http://community.oracle.com website. I just followed the instructions by Tim Hall: https://oracle-base.com/articles/misc/utl_http-and-ssl

I have done this both in 11.2 and 12.1 instances, and note in particularly that in 12c you should not add the sites own certificate to the wallet, but only the higher certificates (Geotrust SSL and Geotrust Global.) I thank Tony Reed for this StackOverflow question that gave me the solution for making it work on 12c: http://stackoverflow.com/questions/19380116/using-utl-http-wallets-on-12c-certificate-validation-failure

Within Oracle I then login as SYSTEM (or similar privileged user) and setup connection privileges, both to the schema I use (SCOTT) as well as the APEX schema (in this case APEX_050000.)

begin
   dbms_network_acl_admin.create_acl(
      acl => 'oddca_demo.xml'
    , description => 'ODDCA nominee voting ACL'
    , principal => 'SCOTT'
    , is_grant => true
    , privilege => 'connect'
   );
   dbms_network_acl_admin.add_privilege(
      acl => 'oddca_demo.xml'
    , principal => 'APEX_050000'
    , is_grant => true
    , privilege => 'connect'
   );
   dbms_network_acl_admin.assign_acl(
      acl => 'oddca_demo.xml'
    , host => '*.oracle.com'
   );
   commit;
end;
/

In schema SCOTT I then do my actual work. First I create an object type, collection type and a table. The object and collection type enables me to make a table function I will use later, the table then can be used to cache the results of the table function to avoid excessive HTTP calls.

create type oddca_nominee_type as object (
   category varchar2(10)
 , idea_id  integer
 , name     varchar2(30)
 , score    integer
 , votes    integer
)
/

create type oddca_nominee_table_type as table of oddca_nominee_type
/

create table oddca_nominee_cache (
   category varchar2(10)
 , idea_id  integer
 , name     varchar2(30)
 , score    integer
 , votes    integer
)
/

And now for the package that will do all the work. A few constants, the table function that will retrieve live data, and a procedure to populate the cache table from the live data.

create or replace package oddca
as
   subtype category_type is varchar2(10);
   
   CATEGORY_SQL       constant category_type := 'sql';
   CATEGORY_PLSQL     constant category_type := 'plsql';
   CATEGORY_ORDS      constant category_type := 'ords';
   CATEGORY_APEX      constant category_type := 'apex';
   CATEGORY_DBDESIGN  constant category_type := 'db-design';

   function nominees(
      p_category category_type
   )
      return oddca_nominee_table_type pipelined;

   procedure populate_cache(
      p_category category_type
   );
end oddca;
/

The body has a couple private constants for wallet path and password. In production you might consider safer ways of storing these credentials.

create or replace package body oddca
as
   c_wallet_path  constant varchar2(100) := 'file:/home/oracle/wallets/testwallet';
   c_wallet_pwd   constant varchar2(100) := 'myTest!PSW123';
   

Then a helper function to execute the HTTP call for retrieving the HTML from the 5 different category webpages. As I need HTTPS (SSL) support, I can't use my favorite method, HttpUriType. Instead I can use UTL_HTTP, or in this case I chose APEX_WEB_SERVICE.MAKE_REST_REQUEST for simplicity. Even though meant for REST requests, it can retrieve any webpage and simple return it as CLOB. If I had used UTL_HTTP I would have had to read the HTML in a loop and build the CLOB myself.

   function get_html(
      p_category category_type
   )
      return clob
   is
   begin
      if p_category not in (
         CATEGORY_SQL
       , CATEGORY_PLSQL
       , CATEGORY_ORDS
       , CATEGORY_APEX
       , CATEGORY_DBDESIGN
      ) then
         raise_application_error(-20000, 'Invalid award category');
      end if;
      
      return apex_web_service.make_rest_request(
         p_url         => 'https://community.oracle.com/community/database/awards/'||p_category||'-voting'
       , p_http_method => 'GET'
       , p_wallet_path => c_wallet_path
       , p_wallet_pwd  => c_wallet_pwd
      );
   end get_html;
   

Second helper function is to transform the HTML into XML. A really well-formed HTML document is actually very close to being XML, but a few things I have to do here to make it work as XML.

  • First I cutout the "middle" (and relevant) part of the original HTML document; that gets rid of a lot of Javascript and other things that would make it invalid XML.
  • Secondly as the "cutout" part then contains a <div> element for each nominee, I need to wrap it in a root <doc> element (I could name it anything, but doc seems appropriate.)
  • Third the HTML contains a lot of <br> which traditionally in HTML is a simple newline, but in XML would be an opening <br> tag that misses a closing tag - new HTML standards state you should use <br/> which also works in XML, so I do a replace here.
  • And last there are some HTML entities that the database XML parser does not recognize as valid entities, but as they are not part of the text I am interested in, I simply replace them with spaces to get rid of them.

Note also that in case the HTML parameter does not contain the specific <div> and <span> tags I am looking for, then I know my XML parsing will fail, so I raise an explicit error saying that this HTML is not in the expected format.

   function html_to_xml(
      p_html clob
   )
      return xmltype
   is
      cutout  clob;
      pos1    integer;
      pos2    integer;
      data    xmltype;
   begin
      pos1 := dbms_lob.instr(p_html, '<div class="display-idea-list"', 1);
      pos2 := dbms_lob.instr(p_html, '<span id="jive-whats-new-more">', pos1+1);
       
      if pos1 = 0 or pos2 = 0 or pos1 = null or pos2 = null then
         raise_application_error(-20001, 'Not expected HTML content');
      end if;

      dbms_lob.createtemporary(cutout, false, DBMS_LOB.CALL);
      
      dbms_lob.copy(cutout, p_html, pos2-pos1, 1, pos1);
      data := xmltype(
                 '<doc>'
              || replace(replace(replace(replace(
                    cutout
                  , '&'||'rsquo;', ' ')
                  , '&'||'hellip;', ' ')
                  , '&'||'nbsp;', ' ')
                  , '<br>', '<br/>')
              || '</doc>'
              );
      
      dbms_lob.freetemporary(cutout);

      return data;
   end html_to_xml;


My table function starts by retrieving the HTML and transforming it to XML via the two helper functions, and then it extracts the information I want using XMLTABLE function. I use '/doc/*' to retrieve all the <div> that are nominees. For each <div> I specify the XPATH expression of the "position" in the XML of the 4 pieces of information I desire. But those 4 pieces contain superfluous text and whitespace, so I perform some trimming on name and for the 3 numeric expression I use regexp_replace to remove anything that is not a numeric digit. Each retrieved nominee from the XML is then piped out, so the results can be selected using the table function.

   function nominees(
      p_category category_type
   )
      return oddca_nominee_table_type pipelined
   is
      html    clob;
      data    xmltype;
   begin
      html := get_html(p_category);
      data := html_to_xml(html);

      for nominees in (
         select oddca_nominee_type(
                   p_category
                 , to_number(regexp_replace(ideaid,'[^[:digit:]]')) /* 'ideaListDiv1234' => 1234 */
                 , trim(name)
                 , to_number(regexp_replace(score,'[^[:digit:]]'))
                 , to_number(regexp_replace(votes,'[^[:digit:]]')) /* '123 votes' => 123 */
                ) nominee
           from xmltable(
             '/doc/*'
             passing data
             columns
                ideaid varchar2(100) path '@id'
              , name   varchar2(100) path 'div/div[@class="jive-content"]/div[@class="jive-content-header"]/div[@class="jive-content-title"]/h2/a'
              , score  varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/div/strong'
              , votes  varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/span[@class="jive-score-meta font-color-meta"]/span[@class="idea-vote-count"]'
           )
      ) loop
         pipe row( nominees.nominee );
      end loop;
   end nominees;


Calling the table function above involves an HTTP call. To avoid doing that over and over, I finally have this small procedure for populating my cache table. It very simply clears the cache for the invoked category and inserts rows from the table function into the cache.

   procedure populate_cache(
      p_category category_type
   )
   is
   begin
      delete oddca_nominee_cache
       where category = p_category;
      
      insert into oddca_nominee_cache
      select *
        from table(oddca.nominees(p_category));
   end populate_cache;
end oddca;
/

So having made the table function, I can actually now query the webpage by this simple select statement:

select *
  from table(oddca.nominees('sql'))
 order by score desc
/

But to avoid lots of HTTP calls when I start analysing the data, I populate my cache for all 5 categories.

begin
   oddca.populate_cache(oddca.CATEGORY_SQL);
   oddca.populate_cache(oddca.CATEGORY_PLSQL);
   oddca.populate_cache(oddca.CATEGORY_ORDS);
   oddca.populate_cache(oddca.CATEGORY_APEX);
   oddca.populate_cache(oddca.CATEGORY_DBDESIGN);
   commit;
end;
/

And now I just query my populated cache and try to see how the voting goes and who is in the lead.

select s1.*
     , round(100 * positive / nullif(votes,0), 1) pos_pct
     , dense_rank() over (partition by category order by score desc) rnk_score
     , dense_rank() over (partition by category order by positive desc) rnk_pos
  from (
   select category
        , name
        , score
        , votes
        , (votes - score / 10) / 2 as negative
        , score / 10 + (votes - score / 10) / 2 as positive
     from oddca_nominee_cache
  ) s1
 order by category, score desc
/

I've shamelessly highlighted my own results so far:

CATEGORY  NAME               SCORE VOTES NEGATIVE POSITIVE POS_PCT RNK_SCORE RNK_POS
--------- ------------------ ----- ----- -------- -------- ------- --------- -------
apex      Jari Laine           680    80        6       74    92.5         1       1
apex      Morten Braten        660    76        5       71    93.4         2       2
apex      Juergen Schuster     510    55        2       53    96.4         3       3
apex      Kiran Pawar          400    50        5       45    90.0         4       4
apex      Karen Cannell        260    30        2       28    93.3         5       5
apex      Paul MacMillan       130    29        8       21    72.4         6       6
apex      Trent Schafer        110    21        5       16    76.2         7       7
db-design Heli Helskyaho       850   103        9       94    91.3         1       1
db-design Michelle Kolbe       580    68        5       63    92.6         2       2
db-design Rob Lockard          410    51        5       46    90.2         3       3
db-design Mark Hoxey            90    23        7       16    69.6         4       4
ords      Dietmar Aust        1170   125        4      121    96.8         1       1
ords      Dimitri Gielis       790    91        6       85    93.4         2       2
ords      Morten Braten        420    56        7       49    87.5         3       3
ords      Kiran Pawar          340    42        4       38    90.5         4       4
ords      Anton Nielsen        240    32        4       28    87.5         5       5
ords      Tim St. Hilaire      130    19        3       16    84.2         6       6
plsql     Adrian Billington    970   103        3      100    97.1         1       2
plsql     Roger Troller        870   115       14      101    87.8         2       1
plsql     Sean Stuber          750    87        6       81    93.1         3       3
plsql     Patrick Barel        650    85       10       75    88.2         4       4
plsql     Morten Braten        620    74        6       68    91.9         5       5
plsql     Kim Berg Hansen      410    55        7       48    87.3         6       6
plsql     Bill Coulam          380    50        6       44    88.0         7       7
sql       Emrah Mete          2980   366       34      332    90.7         1       1
sql       Sayan Malakshinov   1750   239       32      207    86.6         2       2
sql       Sean Stuber          670   103       18       85    82.5         3       3
sql       Kim Berg Hansen      620    98       18       80    81.6         4       4
sql       Erik Van Roon        300    60       15       45    75.0         5       5
sql       Matthias Rogel       280    46        9       37    80.4         6       7
sql       Justin Cave          270    55       14       41    74.5         7       6
sql       Stew Ashton          250    49       12       37    75.5         8       7

32 rows selected.

So, this is a nice way to query live webpages, or perform "scraping" as it also is called sometimes.

Unfortunately sometimes webpages are not very good for calling in this way. The page might include something interactive like popups that need to be accepted, or it might be dynamically built with Ajax so the first call only retrieves an empty "shell" of HTML and some Javascript that populates the shell. In such cases it can be hard to produce a "scraping" code like this. And the Oracle Community webpages actually do something I haven't yet identified, so that they often return "An unexpected error occured" and my code raises 'Not expected HTML content' exception.

So what to do in those cases where a webpage is hard to get without actually using a browser? Well, one option is to use a browser manually, save the output as raw HTML from the browser, and then let the "scraping" code read the HTML from file instead of using HTTP. That is the subject of Part 2.

Was this useful to you? If so, can I ask you to consider to go and vote for me in the Oracle Database Developer Choice Awards SQL category? I would be eternally grateful :-)

4 comments:

  1. Thanks, Kim. This is great stuff. Unfortunately I cannot vote for you. I work for Oracle. But I will USE what you've done and tweet these totals, to encourage others to vote!

    ReplyDelete
    Replies
    1. Thanks for the initiative in the first place - without that, there wouldn't *be* an Oracle Database Developer Choice Award ;-)

      Delete
  2. You beat me to blogging!

    I've been doing something similar but just finally got around to posting my method.

    https://seanstuber.wordpress.com/2015/10/06/a-day-late-but-still-fun/

    I also got the "Unexpected error" messages but corrected them with a simple, albeit silly, fix.

    Add a trailing "/" to the end of the url.

    It's been working for me flawlessly, and the same fix seems to work using using APEX call in your code as well.

    I hope that helps.

    ReplyDelete
    Replies
    1. Interesting, sometimes the trailing slash doesn't work. Strangely though, removing it works then for a while and when it fails, putting the slash back on fixes it again. Maybe the problem is some kind of caching and changing the url, even trivially, resets it.

      I get the same fix/break cycle in my version too.

      Delete