Scraping and querying Oracle Database Developer Choice Awards votes - Part 1
You know about the Oracle Database Developer Choice Awards voting, right? If not, then go vote now, please, and then come back and read on :-)
I want to be able to query the voting status, who is ranked what in each category and so on. But I don't want to type in votes in a table, especially not when they change all the time. Instead I'd like to query the live data in the HTML webpages. Can that be done? Yes it can...
But unfortunately a slight caveat for this. It appears that when I do these HTTP calls from the database to the voting webpages on http://community.oracle.com , something happens that often make the URLs give me "An unexpected error occurred" rather than the HTML I want. I suspect (but not sure) that it may have to do with either that these HTTP calls do not log in or that they fail to accept a cookie acceptance popup window or that they look like robot crawler and get blocked or something.
Whatever the reason, the code I show in this blog post works only sporadically on the Oracle Community website. But I will show it anyway, as the technique can be very useful to other websites and URLs too. And then in a Part 2 blogpost I will show a workaround I did for making this work anyway, just with a little bit of manual work involved.
So here's a recipe for querying the HTML of a live webpage. Just beware that the particular website I use as an example is not the best of examples ;-)
First I setup a wallet with the certificates of http://community.oracle.com website. I just followed the instructions by Tim Hall: https://oracle-base.com/articles/misc/utl_http-and-ssl
I have done this both in 11.2 and 12.1 instances, and note in particularly that in 12c you should not add the sites own certificate to the wallet, but only the higher certificates (Geotrust SSL and Geotrust Global.) I thank Tony Reed for this StackOverflow question that gave me the solution for making it work on 12c: http://stackoverflow.com/questions/19380116/using-utl-http-wallets-on-12c-certificate-validation-failure
Within Oracle I then login as SYSTEM (or similar privileged user) and setup connection privileges, both to the schema I use (SCOTT) as well as the APEX schema (in this case APEX_050000.)
In schema SCOTT I then do my actual work. First I create an object type, collection type and a table. The object and collection type enables me to make a table function I will use later, the table then can be used to cache the results of the table function to avoid excessive HTTP calls.
And now for the package that will do all the work. A few constants, the table function that will retrieve live data, and a procedure to populate the cache table from the live data.
The body has a couple private constants for wallet path and password. In production you might consider safer ways of storing these credentials.
Then a helper function to execute the HTTP call for retrieving the HTML from the 5 different category webpages. As I need HTTPS (SSL) support, I can't use my favorite method, HttpUriType. Instead I can use UTL_HTTP, or in this case I chose APEX_WEB_SERVICE.MAKE_REST_REQUEST for simplicity. Even though meant for REST requests, it can retrieve any webpage and simple return it as CLOB. If I had used UTL_HTTP I would have had to read the HTML in a loop and build the CLOB myself.
Second helper function is to transform the HTML into XML. A really well-formed HTML document is actually very close to being XML, but a few things I have to do here to make it work as XML.
Note also that in case the HTML parameter does not contain the specific <div> and <span> tags I am looking for, then I know my XML parsing will fail, so I raise an explicit error saying that this HTML is not in the expected format.
My table function starts by retrieving the HTML and transforming it to XML via the two helper functions, and then it extracts the information I want using XMLTABLE function. I use '/doc/*' to retrieve all the <div> that are nominees. For each <div> I specify the XPATH expression of the "position" in the XML of the 4 pieces of information I desire. But those 4 pieces contain superfluous text and whitespace, so I perform some trimming on name and for the 3 numeric expression I use regexp_replace to remove anything that is not a numeric digit. Each retrieved nominee from the XML is then piped out, so the results can be selected using the table function.
Calling the table function above involves an HTTP call. To avoid doing that over and over, I finally have this small procedure for populating my cache table. It very simply clears the cache for the invoked category and inserts rows from the table function into the cache.
So having made the table function, I can actually now query the webpage by this simple select statement:
But to avoid lots of HTTP calls when I start analysing the data, I populate my cache for all 5 categories.
And now I just query my populated cache and try to see how the voting goes and who is in the lead.
I've shamelessly highlighted my own results so far:
So, this is a nice way to query live webpages, or perform "scraping" as it also is called sometimes.
Unfortunately sometimes webpages are not very good for calling in this way. The page might include something interactive like popups that need to be accepted, or it might be dynamically built with Ajax so the first call only retrieves an empty "shell" of HTML and some Javascript that populates the shell. In such cases it can be hard to produce a "scraping" code like this. And the Oracle Community webpages actually do something I haven't yet identified, so that they often return "An unexpected error occured" and my code raises 'Not expected HTML content' exception.
So what to do in those cases where a webpage is hard to get without actually using a browser? Well, one option is to use a browser manually, save the output as raw HTML from the browser, and then let the "scraping" code read the HTML from file instead of using HTTP. That is the subject of Part 2.
Was this useful to you? If so, can I ask you to consider to go and vote for me in the Oracle Database Developer Choice Awards SQL category? I would be eternally grateful :-)
I want to be able to query the voting status, who is ranked what in each category and so on. But I don't want to type in votes in a table, especially not when they change all the time. Instead I'd like to query the live data in the HTML webpages. Can that be done? Yes it can...
But unfortunately a slight caveat for this. It appears that when I do these HTTP calls from the database to the voting webpages on http://community.oracle.com , something happens that often make the URLs give me "An unexpected error occurred" rather than the HTML I want. I suspect (but not sure) that it may have to do with either that these HTTP calls do not log in or that they fail to accept a cookie acceptance popup window or that they look like robot crawler and get blocked or something.
Whatever the reason, the code I show in this blog post works only sporadically on the Oracle Community website. But I will show it anyway, as the technique can be very useful to other websites and URLs too. And then in a Part 2 blogpost I will show a workaround I did for making this work anyway, just with a little bit of manual work involved.
So here's a recipe for querying the HTML of a live webpage. Just beware that the particular website I use as an example is not the best of examples ;-)
First I setup a wallet with the certificates of http://community.oracle.com website. I just followed the instructions by Tim Hall: https://oracle-base.com/articles/misc/utl_http-and-ssl
I have done this both in 11.2 and 12.1 instances, and note in particularly that in 12c you should not add the sites own certificate to the wallet, but only the higher certificates (Geotrust SSL and Geotrust Global.) I thank Tony Reed for this StackOverflow question that gave me the solution for making it work on 12c: http://stackoverflow.com/questions/19380116/using-utl-http-wallets-on-12c-certificate-validation-failure
Within Oracle I then login as SYSTEM (or similar privileged user) and setup connection privileges, both to the schema I use (SCOTT) as well as the APEX schema (in this case APEX_050000.)
begin dbms_network_acl_admin.create_acl( acl => 'oddca_demo.xml' , description => 'ODDCA nominee voting ACL' , principal => 'SCOTT' , is_grant => true , privilege => 'connect' ); dbms_network_acl_admin.add_privilege( acl => 'oddca_demo.xml' , principal => 'APEX_050000' , is_grant => true , privilege => 'connect' ); dbms_network_acl_admin.assign_acl( acl => 'oddca_demo.xml' , host => '*.oracle.com' ); commit; end; /
In schema SCOTT I then do my actual work. First I create an object type, collection type and a table. The object and collection type enables me to make a table function I will use later, the table then can be used to cache the results of the table function to avoid excessive HTTP calls.
create type oddca_nominee_type as object ( category varchar2(10) , idea_id integer , name varchar2(30) , score integer , votes integer ) / create type oddca_nominee_table_type as table of oddca_nominee_type / create table oddca_nominee_cache ( category varchar2(10) , idea_id integer , name varchar2(30) , score integer , votes integer ) /
And now for the package that will do all the work. A few constants, the table function that will retrieve live data, and a procedure to populate the cache table from the live data.
create or replace package oddca as subtype category_type is varchar2(10); CATEGORY_SQL constant category_type := 'sql'; CATEGORY_PLSQL constant category_type := 'plsql'; CATEGORY_ORDS constant category_type := 'ords'; CATEGORY_APEX constant category_type := 'apex'; CATEGORY_DBDESIGN constant category_type := 'db-design'; function nominees( p_category category_type ) return oddca_nominee_table_type pipelined; procedure populate_cache( p_category category_type ); end oddca; /
The body has a couple private constants for wallet path and password. In production you might consider safer ways of storing these credentials.
create or replace package body oddca as c_wallet_path constant varchar2(100) := 'file:/home/oracle/wallets/testwallet'; c_wallet_pwd constant varchar2(100) := 'myTest!PSW123';
Then a helper function to execute the HTTP call for retrieving the HTML from the 5 different category webpages. As I need HTTPS (SSL) support, I can't use my favorite method, HttpUriType. Instead I can use UTL_HTTP, or in this case I chose APEX_WEB_SERVICE.MAKE_REST_REQUEST for simplicity. Even though meant for REST requests, it can retrieve any webpage and simple return it as CLOB. If I had used UTL_HTTP I would have had to read the HTML in a loop and build the CLOB myself.
function get_html( p_category category_type ) return clob is begin if p_category not in ( CATEGORY_SQL , CATEGORY_PLSQL , CATEGORY_ORDS , CATEGORY_APEX , CATEGORY_DBDESIGN ) then raise_application_error(-20000, 'Invalid award category'); end if; return apex_web_service.make_rest_request( p_url => 'https://community.oracle.com/community/database/awards/'||p_category||'-voting' , p_http_method => 'GET' , p_wallet_path => c_wallet_path , p_wallet_pwd => c_wallet_pwd ); end get_html;
Second helper function is to transform the HTML into XML. A really well-formed HTML document is actually very close to being XML, but a few things I have to do here to make it work as XML.
- First I cutout the "middle" (and relevant) part of the original HTML document; that gets rid of a lot of Javascript and other things that would make it invalid XML.
- Secondly as the "cutout" part then contains a <div> element for each nominee, I need to wrap it in a root <doc> element (I could name it anything, but doc seems appropriate.)
- Third the HTML contains a lot of <br> which traditionally in HTML is a simple newline, but in XML would be an opening <br> tag that misses a closing tag - new HTML standards state you should use <br/> which also works in XML, so I do a replace here.
- And last there are some HTML entities that the database XML parser does not recognize as valid entities, but as they are not part of the text I am interested in, I simply replace them with spaces to get rid of them.
Note also that in case the HTML parameter does not contain the specific <div> and <span> tags I am looking for, then I know my XML parsing will fail, so I raise an explicit error saying that this HTML is not in the expected format.
function html_to_xml( p_html clob ) return xmltype is cutout clob; pos1 integer; pos2 integer; data xmltype; begin pos1 := dbms_lob.instr(p_html, '<div class="display-idea-list"', 1); pos2 := dbms_lob.instr(p_html, '<span id="jive-whats-new-more">', pos1+1); if pos1 = 0 or pos2 = 0 or pos1 = null or pos2 = null then raise_application_error(-20001, 'Not expected HTML content'); end if; dbms_lob.createtemporary(cutout, false, DBMS_LOB.CALL); dbms_lob.copy(cutout, p_html, pos2-pos1, 1, pos1); data := xmltype( '<doc>' || replace(replace(replace(replace( cutout , '&'||'rsquo;', ' ') , '&'||'hellip;', ' ') , '&'||'nbsp;', ' ') , '<br>', '<br/>') || '</doc>' ); dbms_lob.freetemporary(cutout); return data; end html_to_xml;
My table function starts by retrieving the HTML and transforming it to XML via the two helper functions, and then it extracts the information I want using XMLTABLE function. I use '/doc/*' to retrieve all the <div> that are nominees. For each <div> I specify the XPATH expression of the "position" in the XML of the 4 pieces of information I desire. But those 4 pieces contain superfluous text and whitespace, so I perform some trimming on name and for the 3 numeric expression I use regexp_replace to remove anything that is not a numeric digit. Each retrieved nominee from the XML is then piped out, so the results can be selected using the table function.
function nominees( p_category category_type ) return oddca_nominee_table_type pipelined is html clob; data xmltype; begin html := get_html(p_category); data := html_to_xml(html); for nominees in ( select oddca_nominee_type( p_category , to_number(regexp_replace(ideaid,'[^[:digit:]]')) /* 'ideaListDiv1234' => 1234 */ , trim(name) , to_number(regexp_replace(score,'[^[:digit:]]')) , to_number(regexp_replace(votes,'[^[:digit:]]')) /* '123 votes' => 123 */ ) nominee from xmltable( '/doc/*' passing data columns ideaid varchar2(100) path '@id' , name varchar2(100) path 'div/div[@class="jive-content"]/div[@class="jive-content-header"]/div[@class="jive-content-title"]/h2/a' , score varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/div/strong' , votes varchar2(100) path 'div/div[@class="jive-ideas-list-scoreblock"]/span[@class="jive-score-meta font-color-meta"]/span[@class="idea-vote-count"]' ) ) loop pipe row( nominees.nominee ); end loop; end nominees;
Calling the table function above involves an HTTP call. To avoid doing that over and over, I finally have this small procedure for populating my cache table. It very simply clears the cache for the invoked category and inserts rows from the table function into the cache.
procedure populate_cache( p_category category_type ) is begin delete oddca_nominee_cache where category = p_category; insert into oddca_nominee_cache select * from table(oddca.nominees(p_category)); end populate_cache; end oddca; /
So having made the table function, I can actually now query the webpage by this simple select statement:
select * from table(oddca.nominees('sql')) order by score desc /
But to avoid lots of HTTP calls when I start analysing the data, I populate my cache for all 5 categories.
begin oddca.populate_cache(oddca.CATEGORY_SQL); oddca.populate_cache(oddca.CATEGORY_PLSQL); oddca.populate_cache(oddca.CATEGORY_ORDS); oddca.populate_cache(oddca.CATEGORY_APEX); oddca.populate_cache(oddca.CATEGORY_DBDESIGN); commit; end; /
And now I just query my populated cache and try to see how the voting goes and who is in the lead.
select s1.* , round(100 * positive / nullif(votes,0), 1) pos_pct , dense_rank() over (partition by category order by score desc) rnk_score , dense_rank() over (partition by category order by positive desc) rnk_pos from ( select category , name , score , votes , (votes - score / 10) / 2 as negative , score / 10 + (votes - score / 10) / 2 as positive from oddca_nominee_cache ) s1 order by category, score desc /
I've shamelessly highlighted my own results so far:
CATEGORY NAME SCORE VOTES NEGATIVE POSITIVE POS_PCT RNK_SCORE RNK_POS --------- ------------------ ----- ----- -------- -------- ------- --------- ------- apex Jari Laine 680 80 6 74 92.5 1 1 apex Morten Braten 660 76 5 71 93.4 2 2 apex Juergen Schuster 510 55 2 53 96.4 3 3 apex Kiran Pawar 400 50 5 45 90.0 4 4 apex Karen Cannell 260 30 2 28 93.3 5 5 apex Paul MacMillan 130 29 8 21 72.4 6 6 apex Trent Schafer 110 21 5 16 76.2 7 7 db-design Heli Helskyaho 850 103 9 94 91.3 1 1 db-design Michelle Kolbe 580 68 5 63 92.6 2 2 db-design Rob Lockard 410 51 5 46 90.2 3 3 db-design Mark Hoxey 90 23 7 16 69.6 4 4 ords Dietmar Aust 1170 125 4 121 96.8 1 1 ords Dimitri Gielis 790 91 6 85 93.4 2 2 ords Morten Braten 420 56 7 49 87.5 3 3 ords Kiran Pawar 340 42 4 38 90.5 4 4 ords Anton Nielsen 240 32 4 28 87.5 5 5 ords Tim St. Hilaire 130 19 3 16 84.2 6 6 plsql Adrian Billington 970 103 3 100 97.1 1 2 plsql Roger Troller 870 115 14 101 87.8 2 1 plsql Sean Stuber 750 87 6 81 93.1 3 3 plsql Patrick Barel 650 85 10 75 88.2 4 4 plsql Morten Braten 620 74 6 68 91.9 5 5 plsql Kim Berg Hansen 410 55 7 48 87.3 6 6 plsql Bill Coulam 380 50 6 44 88.0 7 7 sql Emrah Mete 2980 366 34 332 90.7 1 1 sql Sayan Malakshinov 1750 239 32 207 86.6 2 2 sql Sean Stuber 670 103 18 85 82.5 3 3 sql Kim Berg Hansen 620 98 18 80 81.6 4 4 sql Erik Van Roon 300 60 15 45 75.0 5 5 sql Matthias Rogel 280 46 9 37 80.4 6 7 sql Justin Cave 270 55 14 41 74.5 7 6 sql Stew Ashton 250 49 12 37 75.5 8 7 32 rows selected.
So, this is a nice way to query live webpages, or perform "scraping" as it also is called sometimes.
Unfortunately sometimes webpages are not very good for calling in this way. The page might include something interactive like popups that need to be accepted, or it might be dynamically built with Ajax so the first call only retrieves an empty "shell" of HTML and some Javascript that populates the shell. In such cases it can be hard to produce a "scraping" code like this. And the Oracle Community webpages actually do something I haven't yet identified, so that they often return "An unexpected error occured" and my code raises 'Not expected HTML content' exception.
So what to do in those cases where a webpage is hard to get without actually using a browser? Well, one option is to use a browser manually, save the output as raw HTML from the browser, and then let the "scraping" code read the HTML from file instead of using HTTP. That is the subject of Part 2.
Was this useful to you? If so, can I ask you to consider to go and vote for me in the Oracle Database Developer Choice Awards SQL category? I would be eternally grateful :-)
Thanks, Kim. This is great stuff. Unfortunately I cannot vote for you. I work for Oracle. But I will USE what you've done and tweet these totals, to encourage others to vote!
ReplyDeleteThanks for the initiative in the first place - without that, there wouldn't *be* an Oracle Database Developer Choice Award ;-)
DeleteYou beat me to blogging!
ReplyDeleteI've been doing something similar but just finally got around to posting my method.
https://seanstuber.wordpress.com/2015/10/06/a-day-late-but-still-fun/
I also got the "Unexpected error" messages but corrected them with a simple, albeit silly, fix.
Add a trailing "/" to the end of the url.
It's been working for me flawlessly, and the same fix seems to work using using APEX call in your code as well.
I hope that helps.
Interesting, sometimes the trailing slash doesn't work. Strangely though, removing it works then for a while and when it fails, putting the slash back on fixes it again. Maybe the problem is some kind of caching and changing the url, even trivially, resets it.
DeleteI get the same fix/break cycle in my version too.