Blog spam (blam?)

Anyone who runs a blog knows about blog spam (or comment spam). Here on this site I run SpamKarma 2, which keeps out 99% of the spam and puts the other 1% in moderation for me to check. But when the number of spam comments reaches into the hundreds per day, it starts to skew the logs too much. Flattering though it is at first glance to see hundreds of eager visitors each day, closer inspection shows many of them to be comment spammers.
So, I started to look at ways to stop them before they got to leave a comment in the first place.

Back in December, a typical day would see between 150 and 250 spam comments. As I mentioned, SpamKarma catches the vast majority. One of the main reasons they get flagged is that they are obviously posted by robots: the comment appears within a few seconds of the page being loaded (sometimes the page isn’t loaded at all), which is far too fast for a mere human. So a first step in stopping the robots would seem to be renaming the comment script.

Renaming the script

At about 08:00 (Mar 2) I changed the name of the comment posting script (from wp-commentpost.php to my-comment-post.php). Here is the log of the first visitor to suss it out:

66.207.205.130 - - [01/Mar/2007:05:14:15 -0500] "POST /wp-comment-post.php HTTP/1.1" 302 - "http://codeworks.gnomedia.com/wp-comment-post.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
66.207.205.130 - - [01/Mar/2007:13:44:34 -0500] "GET /?p=3 HTTP/1.1" 200 20051 "-" "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
66.207.205.130 - - [01/Mar/2007:14:30:59 -0500] "POST /wp-commentpost.php HTTP/1.1" 403 296 "http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite" "Opera/9.0 (Windows NT 5.1; U; en)"
66.207.205.130 - - [01/Mar/2007:19:28:20 -0500] "POST /wp-comments-post.php HTTP/1.1" 302 - "http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite/" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
66.207.205.130 - - [02/Mar/2007:03:24:29 -0500] "GET /archives/2005/programming/wordpress/wordpress-lite/ HTTP/1.1" 200 16559 "http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite/" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1)"
66.207.205.130 - - [02/Mar/2007:03:24:11 -0500] "GET /archives/2005/programming/wordpress/wordpress-lite/ HTTP/1.1" 200 16603 "http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite/" "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1)"
66.207.205.130 - - [02/Mar/2007:03:51:43 -0500] "POST /wp-comments-post.php HTTP/1.1" 302 - "http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
66.207.205.130 - - [02/Mar/2007:09:39:53 -0500] "GET /archives/2005/general/wordpress-lite-update/ HTTP/1.1" 200 15732 "http://codeworks.gnomedia.com/archives/2005/general/wordpress-lite-update/" "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
66.207.205.130 - - [02/Mar/2007:09:40:04 -0500] "POST /my-comment-post.php HTTP/1.1" 403 296 "http://codeworks.gnomedia.com/archives/2005/general/wordpress-lite-update/" "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

and, strangely, this one:

59.94.249.163 - - [02/Mar/2007:09:24:10 -0500] "POST /my-comment-post.php HTTP/1.1" 403 296 "http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite" "Opera/9.0 (Windows NT 5.1; U; en)"

but there is only one mention of that IP address, so clearly the first bot had already passed around the correct access address.

and here is another one:

84.78.86.1 - - [02/Mar/2007:09:47:09 -0500] "GET /archives/2005/programming/wordpress/wordpress-lite/ HTTP/1.0" 200 16556 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
84.78.86.1 - - [02/Mar/2007:09:47:13 -0500] "POST /my-comment-post.php HTTP/1.0" 200 42 "http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

This last one shows 4 seconds between loading the page and sending in the comment, so it is clearly an automated script. It is also the only one so far that has received a 200 (the others got 403s for other reasons).
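The timing check is easy to automate. What follows is only a rough sketch in Python, not anything that runs on this site: it assumes the log is in the Apache combined format shown above, and the function names (parse, fast_posts) and the 10 second threshold are my own inventions for illustration.

```python
import re
from datetime import datetime

# Matches the start of an Apache "combined" log line like the ones quoted above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def parse(line):
    """Return (ip, timestamp, method, path, status), or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    # %b expects English month abbreviations (the default C locale).
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("ip"), ts, m.group("method"), m.group("path"), int(m.group("status"))


def fast_posts(lines, threshold_seconds=10):
    """Yield (ip, path, delay) for each POST that follows a GET from the same IP too quickly.

    threshold_seconds is an arbitrary, hypothetical cut-off.
    """
    last_get = {}
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ip, ts, method, path, _status = rec
        if method == "GET":
            last_get[ip] = ts
        elif method == "POST" and ip in last_get:
            delay = (ts - last_get[ip]).total_seconds()
            if 0 <= delay <= threshold_seconds:
                yield ip, path, delay


SAMPLE = [
    '84.78.86.1 - - [02/Mar/2007:09:47:09 -0500] '
    '"GET /archives/2005/programming/wordpress/wordpress-lite/ HTTP/1.0" 200 16556 '
    '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '84.78.86.1 - - [02/Mar/2007:09:47:13 -0500] '
    '"POST /my-comment-post.php HTTP/1.0" 200 42 '
    '"http://codeworks.gnomedia.com/archives/2005/programming/wordpress/wordpress-lite/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

if __name__ == "__main__":
    print(list(fast_posts(SAMPLE)))
    # [('84.78.86.1', '/my-comment-post.php', 4.0)]
```

A POST with no preceding GET at all (the "page isn’t loaded" case) is ignored by this sketch; flagging those as well would be a one-line change.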

Offer a choice

So, at 10:14:30 I uploaded a comment.php form with three form tags, to see which of them were picked up by the bots. The form tags referenced my-commentsa-post.php, my-comments-post.php and my-commentsb-post.php. The first and the last tag are commented out; the middle tag is the one that is ‘real’.

And here is the first customer, six minutes later:

218.30.84.110 - - [02/Mar/2007:10:20:39 -0500] "GET /archives/2005/general/wordpress-lite-update/ HTTP/1.1" 200 15942 "http://codeworks.gnomedia.com/archives/2005/general/wordpress-lite-update/" "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
218.30.84.110 - - [02/Mar/2007:10:20:48 -0500] "POST /my-commenta-post.php HTTP/1.1" 404 18868 "http://codeworks.gnomedia.com/archives/2005/general/wordpress-lite-update/" "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

Since then most of the bots seem to stop at the first form tag they find, even if it is commented out. A few have taken the last one and a very few (half a dozen a day) make it to the real tag.

Also interesting… in January, when I first started this experiment, I would get maybe 200 attempts a day. Since I started the experiment above, the number of attempts has slowly dropped, possibly because 80% of the bots are receiving a 404 and slowly the IP drops off the lists.

Numbers for the last week show about 160 getting a 404 because they use the wrong address and 35 picking the correct address (although SpamKarma still catches them at that point).
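Counts like these can be pulled straight from the access log. Here is a minimal sketch in Python, assuming the same combined log format as the excerpts above; tally_posts is a made-up name, and the real_path you pass in is whatever the live comment script is called (the example uses /my-comment-post.php as it appears in the log lines earlier).

```python
import re
from collections import Counter

# Finds the request part of a combined-format log line when the method is POST.
POST_RE = re.compile(r'"POST (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def tally_posts(lines, real_path):
    """Count POSTs to the real script, POSTs that got a 404, and everything else."""
    counts = Counter()
    for line in lines:
        m = POST_RE.search(line)
        if m is None:
            continue  # not a POST request
        if m.group("path") == real_path:
            counts["real"] += 1
        elif m.group("status") == "404":
            counts["wrong_address_404"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '218.30.84.110 - - [02/Mar/2007:10:20:48 -0500] '
        '"POST /my-commenta-post.php HTTP/1.1" 404 18868 "-" "-"',
        '84.78.86.1 - - [02/Mar/2007:09:47:13 -0500] '
        '"POST /my-comment-post.php HTTP/1.0" 200 42 "-" "-"',
    ]
    print(tally_posts(sample, real_path="/my-comment-post.php"))
```

Run over a week of logs, the "wrong_address_404" and "real" buckets correspond to the two figures quoted above.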

Keeping track

Of course, we now know that anyone trying to submit a comment using wp-comment-post.php (the original WordPress comment script) or my-commenta-post.php or my-commentc-post.php is a spambot, and it is a simple matter to replace those three pages with a PHP script that could do all kinds of things. In my case I keep track of all the IPs that make use of the script, and I could then use that record to block them from other sites and other blogs that I use or administer. I haven’t got around to it yet, but it might be interesting when I find the time.
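For anyone who would rather not write a trap script at all, much the same list of IPs can be recovered from the access log after the fact. This is a sketch in Python under a few assumptions: the trap paths are copied from the paragraph above, the log is in the combined format quoted earlier, and trapped_ips, blocklist and min_hits are hypothetical names of my own. It is not the script I actually run.

```python
import re
from collections import Counter

# Script names as given in the text above; adjust to whatever decoys are in use.
TRAP_PATHS = {
    "/wp-comment-post.php",
    "/my-commenta-post.php",
    "/my-commentc-post.php",
}

# Captures the client IP and the POST target from a combined-format log line.
POST_RE = re.compile(r'^(?P<ip>\S+) .*?"POST (?P<path>[^\s"]+) ')


def trapped_ips(lines, trap_paths=TRAP_PATHS):
    """Return a Counter mapping each IP to the number of POSTs it made to a trap path."""
    hits = Counter()
    for line in lines:
        m = POST_RE.match(line)
        if m is None:
            continue
        path = m.group("path").split("?", 1)[0]  # ignore any query string
        if path in trap_paths:
            hits[m.group("ip")] += 1
    return hits


def blocklist(hits, min_hits=1):
    """Return the IPs with at least min_hits trap POSTs, sorted as plain strings."""
    return sorted(ip for ip, n in hits.items() if n >= min_hits)


if __name__ == "__main__":
    sample = [
        '66.207.205.130 - - [01/Mar/2007:05:14:15 -0500] '
        '"POST /wp-comment-post.php HTTP/1.1" 302 - "-" "-"',
        '218.30.84.110 - - [02/Mar/2007:10:20:48 -0500] '
        '"POST /my-commenta-post.php HTTP/1.1" 404 18868 "-" "-"',
    ]
    for ip in blocklist(trapped_ips(sample)):
        print(ip)
```

Raising min_hits trades a slower reaction for fewer accidental blocks, which matters if the list is going to be shared across several sites.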

Other techniques

Blog spam is a common problem and there is a raft of sites out there which detail various ways of dealing with it. A lot of techniques involve making it (slightly) more difficult for commenters to post, and I don’t feel that is a good idea. The point of blogs and blogging is to foster communication, not hinder it. And I don’t think this is a social problem either; it is a technical problem, and it’s best to use a technical solution that is invisible to the end user.

If you want to take some of these ideas further, these links may be of interest:
SimonG explores some similar themes (also using WordPress).
jeremy.zawodny has an interesting (if somewhat dated) article linking to other articles that explore the theme in some depth.