email, link and text protector - proven to reduce the incidence of spam

The Email, Link and Text Protector encodes any text into entity syntax. The output is meant to be used in a web page to make it more difficult for spam-harvesting spiders to extract the original content. The encoder was originally designed for email addresses. Harvesting emails encoded with this method is not impossible, just slightly more difficult.

The principle is simple. Characters are encoded using the entity syntax &#xxx; where xxx is the numeric ASCII value for a character. For example, abc would encode to &#97;&#98;&#99;. You can find more information about character encoding at w3.org.
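The principle can be sketched in a few lines of Perl. This is only an illustration, not the script behind this page; the helper name encode_all is an assumption made for the example.

```perl
use strict;
use warnings;

# Encode every character of a string as a decimal numeric character
# reference. encode_all is a hypothetical name used only for this sketch.
sub encode_all {
    my ($text) = @_;
    return join '', map { sprintf('&#%d;', ord($_)) } split //, $text;
}

print encode_all('abc'), "\n";   # prints &#97;&#98;&#99;
```

A browser renders the output as abc, while the page source contains only the numeric references.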

This script does the same thing as the email encoder (which I discovered through the web site of Heidi, Jim Kent's wife). Another such service is provided by PCNet. The email protector here adds the following enhancements.

With a hide ratio less than 1, some characters are left unencoded. The resulting partially encoded text may foil harvesting spiders that look only for fully encoded text. Switching the encoding between hex and dec for each token makes parsing more difficult.
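A minimal sketch of how a hide ratio and mixed bases could be applied to a whole string follows. The helper name encode_partial, its argument order and the sample address are assumptions for illustration; the actual script may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: encode each character with probability $ratio and
# leave the rest in the clear. If $mixbase is true, each encoded character
# is written in hex or decimal at random.
sub encode_partial {
    my ($text, $ratio, $mixbase) = @_;
    my $out = '';
    for my $char (split //, $text) {
        if (rand() < $ratio) {
            my $fmt = ($mixbase && rand() < 0.5) ? '&#x%x;' : '&#%d;';
            $out .= sprintf($fmt, ord($char));
        }
        else {
            $out .= $char;   # left unencoded
        }
    }
    return $out;
}

# The output differs from run to run because of rand().
print encode_partial('user@example.com', 0.7, 1), "\n";
```

Because rand() returns a value below 1, a ratio of 1 encodes every character and a ratio of 0 encodes none. Characters with special meaning in HTML, such as & and <, are probably best encoded regardless of the ratio.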


Spider Catcher - generalized email and web page faker

If you are interested in fighting spam and have your own web site, check out my Spider Catcher, the generalized email and web page faker designed to trap spiders and pollute their email databases. The Spider Catcher uses Markov chains and a babelizer to display realistic text while producing bogus, but authentic-looking, email addresses.

encoded text

text to encode

hide ratio (fraction of characters to be encoded)

randomize between dec and hex encoding

randomize encoding padding length


Encoding Results

original string
http://www.bcgsc.ca
encoded text (seen by browser)
http://www.bcgsc.ca
encoded text (used in a link)
http://www.bcgsc.ca
encoded text (raw)
http://www.bcgs c.ca
Perl subroutine to encode a single character.
sub encode {
    my $token      = shift;   # single character to encode
    my $randompad  = shift;   # true: randomly zero-pad the numeric value
    my $randombase = shift;   # true: randomly switch between dec and hex
    my $MAXPAD = 3;           # int(rand($MAXPAD)) yields 0, 1 or 2
    my $ord = ord($token);
    my $format;
    my $format_prefix = "&#";
    my $format_digit  = "%d";
    my $format_suffix = ";";
    # random padding: zero-pad to a width of 4 or 5, or leave unpadded
    if ($randompad) {
        my $thispad = int(rand($MAXPAD));
        $format_digit = sprintf("%%0%dd",3+$thispad) if $thispad;
    }
    # switch base
    if ($randombase && rand() < 0.5) {
        # we want hex encoding now
        # (HTML accepts &#X or &#x; XML requires the lowercase form)
        $format_digit =~ s/d/x/;
        $format_prefix = "&#X";
    }
    $format = sprintf("%s%s%s",$format_prefix,$format_digit,$format_suffix);
    return sprintf($format,$ord);
}
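
To check that every form produced by the subroutine above (plain or zero-padded, decimal or hex) resolves to the same character, the references can be decoded again. The sketch below is an assumption-laden illustration, with decode_entities a hypothetical helper name; it also shows why the encoding only slows harvesters down rather than stopping them.

```perl
use strict;
use warnings;

# Hypothetical helper: resolve numeric character references back to text
# in a single pass, roughly as a browser or a determined harvester would.
sub decode_entities {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/defined $1 ? chr(hex($1)) : chr($2)/ge;
    return $text;
}

# Decimal, zero-padded decimal, hex and zero-padded hex forms.
print decode_entities('&#104;&#0116;&#X74;&#X0070;'), "\n";   # prints http
```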