Tuesday, November 07, 2006

The Blogosphere and Splogs

Just read Technorati's latest "State of the Blogosphere, October, 2006", presented with in-depth visual stats on the 57 million blogs they're currently tracking and, yes, all the splogs they're fighting to filter. It's worth taking your time to go through the post, and you may also be interested in finding out why my ROI from blogging is so positive these days.

"As we’ve said in the past, some of the new blogs in our index are Spam blogs or 'splogs'. The good news is Technorati has gotten much better at preventing these kinds of blogs from getting into our indexes in the first place, which may be a factor in the slight slowing in the average of new blogs created each day.

The spikes in red on the chart above shows the increased activity that occurs when spammers create massive numbers of fake blogs and try to get them into our indexes. As the chart shows, we’ve done a much better job over the last quarter at nearly eliminating those red spikes. While last quarter I reported about 8% of new blogs that get past our filters and make it into the index are splogs, I’m happy to report that that number is now more like 4%. As always, we’ll continue to be hyper-focused on making sure that new attacks are spotted and eliminated as quickly as possible.

My gut feeling is that since we're better at dealing with Spam now, even some of the blue areas in last quarter's graph were probably accountable to spam, which would mean that rather than the bumpy ride shown above, we're actually seeing a steady increased (but slower) growth of the blogosphere. Hopefully we'll be able to have a more detailed analysis of these issues next quarter."

Meanwhile, the splogfighter is doing an amazing job of analyzing and coming up with exact splog URLs, and a week ago came up with 150,000 splogs. I'm reposting so that third parties with a particular interest who are reading here take notice. Notice the dominating blogging platform? Blogspot all the way!

"I see that Google has been deleting quite a large number of splogs but even then they are on average about 20% effective. What that means is if a single spammer creates 1000 splogs, Google will eventually delete at most about 200 of them leaving 800 alone. Obvously this is rather poor percentage and hopefully my efforts will bump up that figure close to 90% and above.

20061030_1.txt - 19401 splogs
20061030_2.txt - 4332 splogs
20061030_3.txt - 8936 splogs
20061030_4.txt - 8794 splogs
20061030_5.txt - 18912 splogs
20061030_6.txt - 5158 splogs
20061030_7.txt - 70755 splogs
20061030_8.txt - 1182 splogs
20061030_9.txt - 11410 splogs
20061030_10.txt - 968 splogs
20061030_11.txt - 1584 splogs
Here is a tarball of all splog list files listed above: 20061030.tar.gz"
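
A quick sanity check on the numbers quoted above, as a minimal Python sketch. The per-file counts are copied from the list; the names `counts` and `surviving` are mine, and the 90% figure is the splogfighter's stated target, not a measured result.

```python
# Per-file splog counts, copied from the 20061030 list quoted above.
counts = {
    "20061030_1.txt": 19401,
    "20061030_2.txt": 4332,
    "20061030_3.txt": 8936,
    "20061030_4.txt": 8794,
    "20061030_5.txt": 18912,
    "20061030_6.txt": 5158,
    "20061030_7.txt": 70755,
    "20061030_8.txt": 1182,
    "20061030_9.txt": 11410,
    "20061030_10.txt": 968,
    "20061030_11.txt": 1584,
}

total = sum(counts.values())
largest = max(counts, key=counts.get)
largest_share = 100 * counts[largest] / total


def surviving(created, removal_rate):
    """Splogs left online after a fraction removal_rate of them is deleted."""
    return created - round(created * removal_rate)


print(f"total: {total} splogs in {len(counts)} files")
print(f"largest: {largest} ({largest_share:.1f}% of the total)")
print("left of 1000 at 20% removal:", surviving(1000, 0.20))
print("left of 1000 at 90% removal:", surviving(1000, 0.90))
```

The files add up to 151,432 splogs, which matches the "150,000" figure, and a single file (20061030_7.txt) accounts for nearly half of them.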

Obviously, spammers are exploiting Blogspot's signup process, and I really feel it's about time Google started tolerating more errors from users having trouble reading a sophisticated CAPTCHA, rather than sticking with its current one, which is too user-friendly and easily defeated. They can surely strike a balance. Something else to consider: take, for example, the splogs collected for May. While the splogfighter is pointing out the engineered 404s and Google's efforts in removing them, I was able to verify that over 200 of the splogs reported back then still return content, cigar-accessories-2008.blogspot.com for instance. Anyone up for crawling the lists and clustering the results? Once the signup process is flawed, not even the wisdom of crowds flagging splogs can help you.
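
For anyone who does want to go through the lists, a first step before any crawling could be grouping the entries by hosting platform. This is only a rough sketch under assumptions of mine: that each list file holds one hostname or URL per line (I'm not asserting that is the actual file format), and that the last two labels of the hostname identify the platform. The function names and every sample entry except cigar-accessories-2008.blogspot.com are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def hosting_platform(entry):
    """Return the last two hostname labels (e.g. 'blogspot.com'), or None.

    Naive on purpose: it would mislabel hosts under suffixes like 'co.uk'.
    """
    entry = entry.strip()
    if not entry:
        return None
    # urlsplit only fills in the hostname when a '//' separator is present,
    # so bare hostnames get one prepended.
    if "//" not in entry:
        entry = "//" + entry
    host = urlsplit(entry).hostname
    if not host:
        return None
    return ".".join(host.lower().split(".")[-2:])


def tally_platforms(entries):
    """Count entries per hosting platform, skipping blank or unparsable lines."""
    platforms = (hosting_platform(e) for e in entries)
    return Counter(p for p in platforms if p)


# Hypothetical sample; only the first entry comes from the post.
sample = [
    "cigar-accessories-2008.blogspot.com",
    "http://made-up-one.blogspot.com/",
    "made-up-two.blogspot.com",
    "made-up-three.example.org",
    "",
]

for platform, n in tally_platforms(sample).most_common():
    print(platform, n)
```

To run it on a real list, something like `tally_platforms(open("20061030_1.txt"))` should work if the one-entry-per-line assumption holds. Checking which entries still return content, and clustering them, would come after this.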

Another recommended and very recent analysis, "Characterizing the Splogosphere", is also full of juicy details and statistical info on the emerging problem. Spammers are anything but old-fashioned.