LinkChecker

Motivation

I blogged about the motivation for this script here: Interesting website activities makes me worry about Drive-by Downloads

Background

I needed to check every link on the site to see which ones might point to spam sites or, worse, drive-by download sites.

I found an existing script, checklinks.pl by Jim Weirich, and modified it heavily. The modified version:
  • Traverses documents by MIME type rather than by the .html extension
  • Obeys the robots.txt file if one exists
  • Installs signal handlers that dump the progress so far
  • Uses a configurable user agent
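
The four changes listed above can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the attached script; every name in it (USER_AGENT, should_traverse, build_robot_rules, progress, dump_progress, install_signal_handlers) is hypothetical, and the choice of SIGUSR1 as the "dump progress" signal is an assumption.

```python
import signal
import sys
from urllib import robotparser

# Hypothetical, configurable user agent string (an assumption for this sketch).
USER_AGENT = "LinkCheckerSketch/0.1"

# Shared state that the signal handler reports on.
progress = {"checked": [], "broken": []}


def should_traverse(content_type):
    """Decide from the Content-Type header, not the file extension,
    whether a fetched document should be parsed for further links."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in ("text/html", "application/xhtml+xml")


def build_robot_rules(robots_txt_lines):
    """Parse the lines of a robots.txt file into a rule set that can be
    queried with rules.can_fetch(USER_AGENT, url)."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt_lines)
    return rules


def dump_progress(signum=None, frame=None, out=None):
    """Write a summary of the work done so far. The (signum, frame)
    parameters match what the signal module passes to a handler."""
    if out is None:
        out = sys.stderr
    out.write("checked: %d, broken: %d\n"
              % (len(progress["checked"]), len(progress["broken"])))
    for url in progress["broken"]:
        out.write("  broken: %s\n" % url)


def install_signal_handlers():
    """Register dump_progress for SIGUSR1 where the platform has it."""
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, dump_progress)
```

In a crawl loop built on this sketch, a URL would be fetched only if can_fetch allows it, followed further only if should_traverse accepts its Content-Type, and recorded in progress either way. On a Unix-like system, sending SIGUSR1 to the running process (kill -USR1 with its process ID) would then print the summary without stopping the run.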

The file is attached to this page:

Theory of Operations


References

Similar ideas I've seen.
Stop spam flood attack with postfix and iptables


Perl/Apache: Parsing Apache HTTPD Logs with Perl Patterns
apache-tools
WWW::RobotRules - Perl module for obeying robot rules
Contributors to this page: michael.
Page last modified on Thursday 30 of April, 2009 02:21:28 CDT by michael.