Monday, April 27, 2009

Spam and schizophrenia: the hidden link.

I've been getting emails at school from some poor fellow who mined my address, along with about a hundred others, from some listing of UCSC students somewhere. As I understand the content, he's certain there's a conspiracy between the CIA, the Mormon church, and possibly Mossad and NOAA (I'm a little fuzzy on the latter two) to wipe out the east coast. Or black people. Or something. As an example of his writing style, I submit the following:

Is it better to protect faculty, staff, and students from spammers (some of whom will be CIA agent provocateurs in an attempt to discourage colleges/universities from posting email addresses), or, is it better to protect faculty, staff, and students and their communities from another attack on America (and more CIA mind-controlled suicide shooters on campus)?

CIA agent Matt Bakker and CIA agent Matt Bakerman (not their real undercover names) and some of the rest of the CIA agents here in Brooklyn (including some who live in Brooklyn Heights pretending to be Jehovah's Witnesses who also do not go along with the Mormon church's satanic hidden agenda), asked me to write this note to you to let you know that it is absolutely imperative that you authorize webmasters to include, at college websites, student email addresses, as well as, of course, all faculty and staff email addresses, so I can send emails to let them, informing them of some things some of the CIA agents in this area and in Brooklyn Heights who're pretending to be Jehovah's Witnesses, asked me to tell them.


Crazy people on the internet aren't exactly news, or even interesting, and I usually delete these emails out of hand. Occasionally, though, I take a peek, just out of morbid curiosity, before I bin one. How does this relate to NLP in any way? Consider the following, from a missive last week:

Subject: http://XXXXXXX.blogspot.com/ - ACTION REQUIRED
Date: 4/22/2009 11:15:31 P.M. Eastern Daylight Time
From: no-reply@google.com
To: XXXXXXX@aol.com
Sent from the Internet (Details)

Hello,

Your blog at: http://XXXXXXX.blogspot.com/ has been identified as a potential spam blog. To correct this, please request a review by filling out the form at ...

Your blog will be deleted in 20 days if it isn't reviewed, and your readers will see a warning page during this time. After we receive your request, we'll review your blog and unlock it within two business days. Once we have reviewed and determined your blog is not spam, the blog will be unlocked and the message in your Blogger dashboard will no longer be displayed. If this blog doesn't belong to you, you don't have to do anything, and any other blogs you may have won't be affected.

We find spam by using an automated classifier. Automatic spam detection is inherently fuzzy, and occasionally a blog like yours is flagged incorrectly. We sincerely apologize for this error. By using this kind of system, however, we can dedicate more storage, bandwidth, and engineering resources to bloggers like you instead of to spammers. For more information, please see Blogger Help: ...

Thank you for your understanding and for your help with our spam-fighting efforts.

Sincerely,

The Blogger Team

P.S. Just one more reminder: Unless you request a review, your blog will be deleted in 20 days. Click this link to request the review: ...
(Email to me from Google Blogger.com, April 23, 2009)

Is spam permitted on Blogger?
What Are Spam Blogs?
As with many powerful tools, blogging services can be both used and abused. The ease of creating and updating webpages with Blogger has made it particularly prone to a form of behavior known as link spamming. Blogs engaged in this behavior are called spam blogs, and can be recognized by their irrelevant, repetitive, or nonsensical text, along with a large number of links, usually all pointing to a single site.
(Google, Blogger Help)


In other words, judging by that description, the classifier they're using looks for obsessive linking and incoherent text. Great for most of us, but not so good for your average schizophrenic trying to get the good word out about potential hazards to the purity of our bodily fluids. In general, generated text tends towards the same looseness of association that schizophrenic writing does. For a brilliant example, try the SCIgen program out of MIT. It's admittedly a random context-free grammar, and they've handcrafted all the sentences, but even so, the output reads like the ramblings of a madman.
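To make the "random context-free grammar" idea concrete, here's a minimal sketch in Python. The grammar, the vocabulary, and the function names (expand, generate) are all made up for illustration; this is not SCIgen's code or its rules, just the general technique of picking productions at random until only words are left.

```python
import random

# Hypothetical toy grammar, invented for this example (not SCIgen's rules).
# Keys are non-terminals; each maps to a list of possible productions.
GRAMMAR = {
    "S": [["NP", "VP", "."]],
    "NP": [["the", "N"], ["the", "ADJ", "N"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["in", "NP"]],
    "ADJ": [["scalable"], ["robust"], ["probabilistic"]],
    "N": [["framework"], ["algorithm"], ["methodology"]],
    "V": [["refines"], ["visualizes"], ["obviates"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:
        # Anything without a rule is a terminal: emit it as-is.
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def generate(rng=None):
    """Generate one random sentence starting from the symbol S."""
    rng = rng or random.Random()
    words = expand("S", rng)
    # Attach the final period to the last word instead of spacing it out.
    return " ".join(words[:-1]) + words[-1]


if __name__ == "__main__":
    for _ in range(3):
        print(generate())
```

Every sentence that comes out is grammatical, but nothing ties one sentence to the next, which is roughly where the loose-association feel comes from.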

Just another interesting variant of the Turing test: can a computer generate text good enough to pose as an *insane* human? And, perhaps more practically, can we build spam filters that can distinguish computer-generated greed from the merely mad?