Scraping web pages that contain JavaScript with Perl

schulung — Thu, 18 Nov 2010 21:30:19 +0000

Sometimes you want to scrape web pages that contain JavaScript and therefore resist being scraped with Web::Scraper or the like. Imagine some JavaScript code like the following, which disguises an email address:
function mail() {
    var name = "mail";
    var domain = "example.com";
    var mailto = 'mailto:' + name + '@' + domain;
    document.write(mailto);
}
mail();

One could use something elaborate like Selenium to execute the code within a browser and then extract the address with “conventional” means. There are cases where this isn’t sufficient.
Enter JavaScript::SpiderMonkey, which allows you to execute JavaScript code on the console without a browser. The only remaining problem is that the console lacks some of the properties and methods a browser provides, so you have to define them yourself. This happens in lines 11 to 14, where we define the “document” object and its “write” method. The rest of the code is pretty self-explanatory.

000: use strict;
001: use warnings;
002:
003: use Slurp;
004: use JavaScript::SpiderMonkey;
005:
006: my $js = JavaScript::SpiderMonkey->new();
007: my $code = slurp('mailto.js');
008:
009: $js->init();
010:
011: my $obj = $js->object_by_path("document");
012:
013: my @write;
014: $js->function_set("write", sub { push @write, @_ }, $obj);
015:
016: my $rc = $js->eval(
017:     $code
018: );
019:
020: printf "document.write:\n%s\n", join "\n", @write;
021: printf "Error: %s\n", $@;
022: printf "Return Code: %s\n", $rc;
023:
024: $js->destroy();

The output is:
document.write:
mailto:mail@example.com
Error:
Return Code: 1
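
The stubbing idea is not specific to Perl. As a rough sketch, here is the same trick in plain JavaScript, assuming Node.js and its built-in vm module rather than JavaScript::SpiderMonkey. The names "written" and "sandbox" are made up for this illustration, and the script is inlined instead of being read from mailto.js:

```javascript
// Sketch only: runs the obfuscation snippet in a Node.js sandbox
// instead of a browser. Assumes Node.js and its built-in `vm` module.
const vm = require('vm');

// The page's script, inlined here for the sake of a self-contained example.
const code = `
function mail() {
    var name = "mail";
    var domain = "example.com";
    var mailto = 'mailto:' + name + '@' + domain;
    document.write(mailto);
}
mail();
`;

// Collects everything the script passes to document.write.
const written = [];

// Minimal stand-in for the browser's `document` object:
// only the one method the script actually calls is defined.
const sandbox = {
    document: {
        write: (...args) => { written.push(...args); },
    },
};

// Execute the script with the sandbox acting as its global object.
vm.runInNewContext(code, sandbox);

console.log(written.join('\n'));
```

As in the Perl version, only the parts of the browser environment that the script touches need to exist. A script that also reads, say, window or navigator properties would need those stubbed in the same way.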
