<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>SQL, perl und Unix/Linux Schulungen in und um Wien &#187; JavaScript</title>
	<atom:link href="http://www.trust-box.at/category/javascript/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.trust-box.at</link>
	<description>SQL, perl und Unix/Linux Schulungen in und um Wien</description>
	<pubDate>Tue, 22 Feb 2011 10:24:16 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Scraping web pages in JavaScript with Perl</title>
		<link>http://www.trust-box.at/2010/11/18/scraping-web-pages-in-javascript-with-perl/</link>
		<comments>http://www.trust-box.at/2010/11/18/scraping-web-pages-in-javascript-with-perl/#comments</comments>
		<pubDate>Thu, 18 Nov 2010 21:30:19 +0000</pubDate>
		<dc:creator>schulung</dc:creator>
		
		<category><![CDATA[JavaScript]]></category>

		<category><![CDATA[Perl]]></category>

		<guid isPermaLink="false">http://www.trust-box.at/?p=102</guid>
		<description><![CDATA[Sometimes you want to scrape web pages that contain JavaScript and therefore resist being scraped with Web::Scraper or the like. Imagine some JavaScript code like the following, written to disguise an email address.

function mail() {
    var name = "mail";
    var domain = "example.com";
    var mailto = 'mailto:' + [...]]]></description>
			<content:encoded><![CDATA[<p>Sometimes you want to scrape web pages that contain JavaScript and therefore resist being scraped with Web::Scraper or the like. Imagine some JavaScript code like the following, written to disguise an email address.<br />
<code><br />
function mail() {<br />
    var name = "mail";<br />
    var domain = "example.com";<br />
    var mailto = 'mailto:' + name + '@' + domain;<br />
    document.write(mailto);<br />
}<br />
mail();<br />
</code></p>
<p>One could use something elaborate like <a href="http://seleniumhq.org/">Selenium</a> to execute the code within a browser and then extract the address by &#8220;conventional&#8221; means, but there are cases where this isn&#8217;t sufficient.<br />
Enter <a href="http://search.cpan.org/~tbusch/JavaScript-SpiderMonkey/SpiderMonkey.pm">JavaScript::SpiderMonkey</a>, which lets you execute JavaScript code on the console without a browser. The only remaining problem is that the console lacks some of the objects and methods a browser provides, so you have to define them yourself. This happens in lines 11-14, where we define the &#8220;document&#8221; object and its &#8220;write&#8221; method. The rest of the code is pretty self-explanatory. </p>
<p><code><br />
000:  use strict;<br />
001:  use warnings;<br />
002:<br />
003:  use Slurp;<br />
004:  use JavaScript::SpiderMonkey;<br />
005:<br />
006:  my $js = JavaScript::SpiderMonkey->new();<br />
007:  my $code = slurp('mailto.js');<br />
008:<br />
009:  $js->init();<br />
010:<br />
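011:  my $obj = $js->object_by_path("document");  # get (or create) the "document" object<br />
012:<br />
013:  my @write;  # collects everything passed to document.write<br />
014:  $js->function_set("write", sub { push @write, @_ }, $obj);  # bind write() to a Perl callback<br />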
015:<br />
016:  my $rc = $js->eval(<br />
017:    $code<br />
018:  );<br />
019:<br />
020:  printf "document.write:\n%s\n", join "\n", @write;<br />
021:  printf "Error: %s\n", $@;<br />
022:  printf "Return Code: %s\n", $rc;<br />
023:<br />
024:  $js->destroy();<br />
</code></p>
<p>The output is:<br />
document.write:<br />
mailto:mail@example.com<br />
Error:<br />
Return Code: 1</p>
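<p>The stubbing idea from lines 11-14 can also be sketched in plain JavaScript. This is only an illustration, not the JavaScript::SpiderMonkey API: it assumes a browserless runtime such as Node.js, passes a hand-made &#8220;document&#8221; object into the script via the Function constructor, and uses the hypothetical helper names makeDocumentStub and runWithStub.</p>

```javascript
// Minimal sketch: supply a stand-in for the missing browser object.
// makeDocumentStub and runWithStub are hypothetical names used for illustration.
function makeDocumentStub() {
  const written = [];
  return {
    written,
    // Collect every argument passed to document.write,
    // much like the Perl callback pushes onto @write.
    write: function (...args) {
      written.push(...args);
    },
  };
}

function runWithStub(code) {
  const doc = makeDocumentStub();
  // The script refers to a free variable named "document";
  // binding it as a function parameter makes the stub visible to it.
  new Function("document", code)(doc);
  return doc.written;
}

// The obfuscation script from the post, as a string.
const script = [
  "function mail() {",
  "    var name = \"mail\";",
  "    var domain = \"example.com\";",
  "    var mailto = 'mailto:' + name + '@' + domain;",
  "    document.write(mailto);",
  "}",
  "mail();",
].join("\n");

const result = runWithStub(script);
console.log(result.join("\n")); // prints mailto:mail@example.com
```

<p>As in the Perl version, only the one method the script actually calls is stubbed. A page that touches other browser objects would need each of them defined in the same way.</p>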
]]></content:encoded>
			<wfw:commentRss>http://www.trust-box.at/2010/11/18/scraping-web-pages-in-javascript-with-perl/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

