Rice University’s Department of Computer Science has been researching the trackback spam problem, and has released a technical report explaining how their “Validator” system can reduce trackback spam:
The TrackBack protocol, conceived as a way to automatically link together web sites which reference one another, has become a new vector for spammers wishing to divert web surfers to their sites. A site which supports TrackBack allows any entity to inject arbitrary HTML code, plus the URL of the sender, into its pages; an attacker need only follow the TrackBack protocol to exploit the system and leverage such a site in a link farm. Current approaches to combating TrackBack spam are limited to content-based filters (of the sort currently used against email and weblog comment spam). In this paper, we propose a way to identify TrackBack spam by considering the relationship between the sender’s URL and the site under attack. In particular, we observe that, for spam TrackBacks, the page at the given URL does not link to the page to which the TrackBack was sent. We have developed software for weblog authors that rejects TrackBacks from sources lacking this reciprocal link. Data collected from our users demonstrates that this test is 100% accurate at identifying and separating spam from legitimate TrackBacks.
The concept is to validate a trackback by checking whether the page at the sender’s URL actually contains the URL of its recipient, sort of like reverse trackback autodiscovery… It would certainly help to identify trackback spam and spammers. But for larger blog providers, or blogs that receive a lot of trackback spam, checking back on every incoming link could be a pretty big job in terms of both CPU and bandwidth…
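To make the reciprocal-link idea concrete, here is a minimal Python sketch of the check. It is not the Rice team’s Validator code; the names `LinkCollector` and `links_back` are made up for illustration, and it assumes the sender’s page has already been fetched into a string (the fetch is where the CPU and bandwidth cost would come from).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.add(absolute.rstrip("/"))


def links_back(sender_html, sender_url, permalink):
    """Return True if the sender's page links to the permalink.

    Hypothetical helper: compares URLs after removing fragments and
    trailing slashes, which is an assumption, not a rule from the paper.
    """
    collector = LinkCollector(sender_url)
    collector.feed(sender_html)
    target, _fragment = urldefrag(permalink)
    return target.rstrip("/") in collector.links
```

A trackback whose sender page yields `False` here would be treated as spam under the paper’s test. A real deployment would also need to deal with redirects, timeouts, and pages that link through a tracking URL, none of which this sketch attempts.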
An intermediary to handle this kind of thing might be useful… Maybe a service with an API: feed it an XML-RPC query with two URLs, the permalink and the trackback URL, and it answers with a response code. 0 means I know that the permalink is not contained at the trackback URL, 1 means I know it is contained at the trackback URL, and 2 means I haven’t yet checked the trackback URL but will add it to the queue… TrackbackValidator.com, perhaps? Couldn’t help myself, had to register the domain (as Dave Winer often notes, not all of the good domains are already taken)… Now I just need some help in building it.
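As a rough starting point, the three response codes could be modelled like this in Python. Everything here is hypothetical: the class name `ValidatorService`, the `query` and `process_one` methods, and the in-memory storage are placeholders for illustration, and the actual link check is passed in as a callable rather than implemented.

```python
from collections import deque

# Response codes as described in the post.
NOT_LINKED = 0  # permalink is known NOT to appear at the trackback URL
LINKED = 1      # permalink is known to appear at the trackback URL
QUEUED = 2      # not checked yet; added to the queue


class ValidatorService:
    """In-memory sketch of the proposed lookup service (hypothetical)."""

    def __init__(self):
        self.results = {}     # (permalink, trackback_url) -> bool
        self.queue = deque()  # pairs waiting to be checked

    def query(self, permalink, trackback_url):
        """Answer immediately from known results, or queue the pair."""
        key = (permalink, trackback_url)
        if key in self.results:
            return LINKED if self.results[key] else NOT_LINKED
        if key not in self.queue:
            self.queue.append(key)
        return QUEUED

    def process_one(self, check):
        """Check the oldest queued pair using check(permalink, trackback_url).

        Returns False when the queue is empty, True otherwise.
        """
        if not self.queue:
            return False
        key = self.queue.popleft()
        self.results[key] = bool(check(*key))
        return True
```

Because `query` takes two strings and returns an integer, it could be exposed over XML-RPC with the standard library’s `xmlrpc.server.SimpleXMLRPCServer` and its `register_function` method, with a background worker calling `process_one` to drain the queue. How long a result should be cached before rechecking is left open.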
One other thing of note in this paper which needs some exploring:
We also considered the textual content of TrackBack spam. Many independent victims often received very similar TrackBacks spams, including similar or identical text and URLs (including typographical errors and other “chaff” designed to thwart content-filters), from disparate IP addresses. This leads us to believe that, just as email spammers do, TrackBack spammers rely on botnets: innocent PCs around the Internet under control of the spammer due to virus infection or other remote security exploit. Attempts to estimate the number of active TrackBack spammers based on these recurring spam profiles are beyond the scope of this paper.
Patterns of trackback spam I’ve seen would lead me to believe that trackback spammers have indeed been using this technique for some time…