Web 2.0, which basically means allowing users to publish content, created two problems:

  • Scaling your website is harder - you can no longer just take a feed from a news provider, generate some static HTML files, put them on a hard drive and serve them on request. The pages are much more dynamic.
  • A massive spam protection problem - if users are able to submit content to the site, spammers can write bots to cover your site with spam.

Any site that supports UGC has to have spam protection. The protection strategy should consist of three systems working together:

  • Machine spam detection - build a series of filters that each generate a feature score for the text. For example: a known-bad-URL filter, a filter that counts matches against known spam words, and a spam-topic filter. The spam score of the text is a polynomial that takes these individual feature scores as input. Tools such as Akismet or Spam Karma do this, or you could build your own.
  • Human moderators - provide a tool that allows people employed by your company to quickly review a piece of content and give a judgment on whether it is spam. Depending on who is doing this human review work, it could cost you $1 per judgment.
  • Community protection - use UGC against the spammers by allowing your users to report content as abusive and give a reason why. The problem is that spammers will write bots to click this button - so you need to protect against that.
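
To make the machine detection idea concrete, here is a minimal sketch in Python of two feature filters and a score that combines them. Everything in it is an assumption for illustration: the domain list, the word list, the weights and the function names are made up, and the combining polynomial is the simplest possible one (degree one, a weighted sum). It is not how Akismet or Spam Karma work internally.

```python
import re

# Hypothetical data, for illustration only. A real system would load
# these from a maintained blocklist and a trained model.
KNOWN_BAD_DOMAINS = {"cheap-pills.example", "win-big.example"}
SPAM_WORDS = {"viagra", "casino", "lottery"}

# Captures the host part of http/https links.
URL_RE = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)


def bad_url_score(text):
    """Return 1.0 if any link points at a known bad domain, else 0.0."""
    domains = {d.lower() for d in URL_RE.findall(text)}
    return 1.0 if domains & KNOWN_BAD_DOMAINS else 0.0


def spam_word_score(text):
    """Return the fraction of words that are known spam words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SPAM_WORDS)
    return hits / len(words)


# Assumed weights. In practice you would tune these on labelled comments.
WEIGHTS = {"bad_url": 0.6, "spam_words": 0.4}


def spam_score(text):
    """Combine the feature scores with a degree-one polynomial (weighted sum)."""
    features = {
        "bad_url": bad_url_score(text),
        "spam_words": spam_word_score(text),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())
```

A site could then pick thresholds on `spam_score`: reject anything with a very high score automatically and send the borderline cases to the human moderators, which keeps the per-judgment cost down.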

If you blog, you have probably received a "thanks for posting this" comment that was written by a bot. More recently, spammers have been writing bots that look at the content of the conversation and then post something that appears to be on topic.

From Carsten: "The biatch is just faking it again. It took her one minute to figure out what you are talking about in your post to be able to then respond with a message that relates directly to the subject of the post, but really does not add very much, if any, helpful information to the discussion."

UGC stands for "user-generated content".