Does your ISP block Web Crawling?


By Art Reisman


Editor’s note: Art Reisman is the CTO of APconnections. APconnections designs and manufactures the popular NetEqualizer bandwidth shaper.

About one year ago I got the idea to see if I could build a Web crawler (robot) with the specific mission of finding references to our brand name on the Internet.

I admit to being a complete amateur at the art of writing a Web crawler, and it certainly might make more sense to do a Google search on “NetEqualizer”, but I wanted to see if there were any occurrences out there in cyberspace that Google ignored or missed.

If you are a hacker and want to try this for yourself, I have included my beta Web crawler source code below.

Back on topic: does your ISP block Web crawling?

First, a little background on how my Web crawler works:

1) It takes a seed, a set of Web pages to start on.

2) It systematically reads those seed pages, looking for URLs within them.

3) When it finds a URL, it reads that page as text, looking for additional URLs within it.

4) It ranks a URL as interesting if it finds certain keywords (a list I created) in the text of the page.

5) The more interesting a URL, the more likely it is to get read, and so on.

6) If no keywords at all are found on a page, it tosses that page out as not to be searched. (I need to double-check this.)

7) Ultimately it stops when it finds “NetEqualizer” or loops a large number of times without finding any new keywords, whichever comes first.
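The ranking idea in steps 4 through 6 boils down to counting keyword hits per page. Here is a minimal sketch of that scoring in Perl; the keyword list and sample text are hypothetical examples of my own, not the actual list the crawler used:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical keyword list; the real script takes its keywords from @ARGV.
my @keywords = ("bandwidth", "shaper", "network");

# Score a page: one point per keyword found (case-insensitive).
# A score of zero means the page gets tossed out and not searched further.
sub score_page {
   my ($page_text) = @_;
   my $score = 0;
   foreach my $kw (@keywords) {
      $score++ if $page_text =~ m/\Q$kw\E/i;
   }
   return $score;
}

my $sample = "A bandwidth shaper smooths traffic on a congested network link.";
print score_page($sample), "\n"; # prints 3
```

Higher-scoring pages get their links followed first, which is what the $kick variable in the full script below implements.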

So you can imagine that when this thing is running, it sucks bandwidth as fast as it can read pages, and it hits random Web pages faster than humanly possible; after all, it is a crawler.

I only ran this script two or three times in its present form, because each time I ran it, within an hour or so my Internet service would crash and stop altogether. It may just be a coincidence that I was having problems with my line at the time, as within the next month I did have to have the external cable to the pole replaced by my provider. So honestly, I am not positive that my provider shut me down, but I think so.

At the time I had not really given it much thought, but if my provider had any big-brother watchdog metric keeping tabs on me, surely this thing would have set off a code red at the main office. I would assume that residential Internet accounts that start scanning the Web at high speed are considered infected with a virus. Is there a formal clause from my provider that says they can shut me down if I write a crawler? I don’t know, as I did not push the issue.

Below is the code. It did start with a Perl program written by somebody else, but critical pieces seemed to be omitted (specific Perl calls in the original), so I stripped it way down and then built it back up to crawl. I honestly have no idea where I got the original code, as it was over a year ago. Apologies for not giving credit.

See also a generic flow diagram of a Web Crawler.

Sorry about the formatting in the blog.

Use at your own risk, etc.

#!/usr/bin/perl -w
##
# spider.pl Set tabstops to 3.
#
$| = 1;

if (scalar(@ARGV) < 2) {
   print "Usage: $0 <fully-qualified-seed-URL> <search-phrase> [keyword1 [keyword2 [keyword3]]]\n";
   exit 1;
}

# Initialize.
%URLqueue = ();
chop($client_host = `hostname`);
$been = 0;
$search_phrase = $ARGV[1];
if (scalar(@ARGV) > 2) { $kicker1 = $ARGV[2]; }
if (scalar(@ARGV) > 3) { $kicker2 = $ARGV[3]; }
if (scalar(@ARGV) > 4) { $kicker3 = $ARGV[4]; }

# Load the queue with the first URL to hit.
$URLqueue{$ARGV[0]} = 0;

# While there's a URL in our queue which we haven't looked at ...
$total_sites = 0;
while ($total_sites < 10000)
{
   $x = `echo total sites loop $total_sites >> visited`;
   # Progress report / hard stop.
   if ($total_sites > 1000) { exit 1; }
   for ($sites = 0; $sites < 200; ) # up to 200 pages per pass in this beta version
   {
      $x = `echo sites loop $sites >> visited`;
      while (($key, $value) = each(%URLqueue)) {
         if ($URLqueue{$key} < 0) {
            if ($URLqueue{$key} == -1) {
               delete $URLqueue{$key}; # garbage collection
            }
            next; # already been there
         }
         if ($sites > 50 && $value < 1) { $sites++; next; }
         if ($sites > 100 && $value < 2) { $sites++; next; }
         if ($sites > 50)
         {
            $x = `echo primo sites $sites value $value site $key`;
         }
         ($protocol, $rest) = $key =~ m|^([^:/]*):(.*)$|;

         # If the protocol is http, fetch the page and process it.
         if (!defined($protocol)) { next; }
         if ($protocol eq "http") {
            $URLqueue{$key} = -1; # mark as visited
            $sites++;
            $total_sites++;
            # Split out the hostname, port and document.
            # ($server_host, $port, $document) =
            #    $rest =~ m|^//([^:/]*):*([0-9]*)/*([^:]*)$|;
            print "getting $key\n";
            $x = `cd /tmp; wget -nd -Q 10000 --timeout=2 --tries=1 $key`;
            print "done wget\n";
            $x = `echo $key >> ./visited`;
            $page_text = `cat /tmp/* 2> /dev/null`;
            $x = `rm /tmp/* 2> /dev/null`;

            $page_text =~ tr/\r\n//d;
            $page_text =~ s|<!--[^>]*-->||g; # strip HTML comments
            # Report if our search string is found here.
            $kick = 0;
            if ($page_text =~ m|$search_phrase|i) {
               print "found phrase $key $search_phrase, total sites $total_sites\n";
               exit;
            }
            if (defined $kicker1) {
               if ($page_text =~ m|$kicker1|i) {
                  # rank this page higher if it has this keyword
                  $x = `echo found kicker $key $kicker1 total sites $total_sites >> visited`;
                  $kick++;
               }
            }
            if (defined $kicker2) {
               if ($page_text =~ m|$kicker2|i) {
                  # rank this page higher if it has this keyword
                  $x = `echo found kicker $key $kicker2, total sites $sites >> visited`;
                  $kick++;
               }
            }
            if (defined $kicker3) {
               if ($page_text =~ m|$kicker3|i) {
                  # rank this page higher if it has this keyword
                  print "found kicker $key $kicker3, total sites $sites\n";
                  $kick++;
               }
            }

            # Find anchors in the HTML and update our list of URLs.
            (@anchors) = $page_text =~ m|<A[^>]*HREF\s*=\s*"([^">]*)"|gi;
            foreach $anchor (@anchors) {
               $newURL = &fqURL($key, $anchor);
               if (exists $URLqueue{$newURL}) {
                  $URLqueue{$newURL} = $URLqueue{$newURL} - 1;
                  # don't garbage collect low numbers
                  print "duplicate $newURL\n";
               }
               else {
                  print "new anchor $newURL\n";
                  if ($kick > 0) {
                     $x = `echo kick $kick $key $newURL >> ./anchors`;
                  }
                  $URLqueue{$newURL} = $kick; # new URL added to queue
               }
            }
         } # end of http block
         else {
            delete $URLqueue{$key}; # not http: drop it from the queue
         }
      } # end of while over %URLqueue
   } # end of sites loop
} # end of total_sites loop

sub fqURL
{
   local($thisURL, $anchor) = @_;
   local($has_proto, $has_lead_slash, $currprot, $currhost, $newURL);

   # Strip anything following a number sign '#', because it's
   # just a reference to a position within a page.
   $anchor =~ s|#.*$||;

   # Examine the anchor to see what parts of the URL are specified.
   $has_proto = 0;
   $has_lead_slash = 0;
   $has_proto = 1 if ($anchor =~ m|^[^/:]+:|);
   $has_lead_slash = 1 if ($anchor =~ m|^/|);

   if ($has_proto == 1) {
      # If a protocol is specified, assume the anchor is fully qualified.
      $newURL = $anchor;
   }
   elsif ($has_lead_slash == 1) {
      # If the document has a leading slash, it just needs protocol and host.
      ($currprot, $currhost) = $thisURL =~ m|^([^:/]*):/+([^:/]*)|;
      $newURL = $currprot . "://" . $currhost . $anchor;
   }
   else {
      # The anchor must be a relative pathname, so append it to the current URL.
      ($newURL) = $thisURL =~ m|^(.*)/[^/]*$|;
      $newURL .= "/" if (!($newURL =~ m|/$|));
      $newURL .= $anchor;
   }
   return $newURL;
}
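As a quick sanity check of the three resolution cases fqURL handles (fully qualified anchors, host-absolute paths, and relative paths), here is a standalone, simplified restatement I wrote for illustration only; the URLs are hypothetical:

```perl
#!/usr/bin/perl -w
use strict;

# Simplified restatement of fqURL's three cases, for illustration only.
sub resolve {
   my ($base, $anchor) = @_;
   $anchor =~ s|#.*$||;                       # drop any #fragment
   return $anchor if $anchor =~ m|^[^/:]+:|;  # already fully qualified
   if ($anchor =~ m|^/|) {                    # absolute path: keep protocol and host
      my ($prot, $host) = $base =~ m|^([^:/]*):/+([^:/]*)|;
      return "$prot://$host$anchor";
   }
   my ($dir) = $base =~ m|^(.*)/[^/]*$|;      # relative path: append to base directory
   return "$dir/$anchor";
}

print resolve("http://example.com/docs/index.html", "/about.html"), "\n";
# http://example.com/about.html
print resolve("http://example.com/docs/index.html", "faq.html"), "\n";
# http://example.com/docs/faq.html
```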
The disclaimers:

Use this code at your own risk. I am not even sure it follows the moral and ethical standards that the major players who crawl the Web for a living abide by; but since I was only doing this as a weekend experiment, I did not worry too much about the standards.

In other words, it is experimental and not for commercial use. Do not walk away and leave it running unattended, lest you get censured and blacklisted from the Internet.
