Constantly Crawling Millions and Millions of Blogs

News about Edgeio is trickling out. Business Week says the following about the upcoming service:

…Edgeio is doing just what its tagline says: gathering “listings from the edge”–classified-ad listings in blogs, and even online product content in newspapers and Web stores, and creating a new metasite that organizes those items for potential buyers.

The way Edgeio works is that bloggers post items they want to sell right on their blogs, tagging them with the word “listing” (and eventually other descriptive tags). Edgeio then plucks those posts as it constantly crawls millions of blogs looking for the “listing” tag, and indexes them on Edgeio.com.

Sounds great. Exciting and cool, in fact. But reread that last line: constantly crawls millions of blogs looking for the “listing” tag. When will the weight of all these search engines indexing blogs start to affect the price of blogging?
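Mechanically, that kind of crawl is simple enough. Here’s a minimal sketch of what plucking “listing”-tagged items out of a single feed might look like; the feedparser library and the feed URL are my assumptions, not anything Edgeio has published:

```python
# A sketch of pulling "listing"-tagged items out of one blog feed.
# Not Edgeio's code -- feedparser and the URL are assumptions.
import feedparser

def find_listings(feed_url):
    feed = feedparser.parse(feed_url)
    listings = []
    for entry in feed.entries:
        # feedparser exposes an entry's categories as entry.tags
        tags = {t.term.lower() for t in entry.get("tags", [])}
        if "listing" in tags:
            listings.append({"title": entry.get("title"),
                             "link": entry.get("link")})
    return listings

print(find_listings("http://example.com/blog/feed"))
```

Now multiply that one fetch by millions of blogs, polled constantly.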

Yesterday, on this obscure blog, 15% of accesses came from RSS readers and aggregators and 28% from search engine robots; 18 different crawlers visited yesterday alone. More than a few of these robots come in daily and hit 60–80 pages whether anything has been updated or not, and I’m sure there are bloggers seeing even higher ratios of robot and crawler traffic.
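Numbers like those come straight out of the access log. A rough sketch of tallying robot hits from an Apache combined-format log; the bot-detection substrings are illustrative, not an exhaustive list:

```python
# Tally hits per crawler from an Apache combined-format access log.
# The user agent is the last quoted field on each line; the bot
# substrings below are illustrative, not an exhaustive list.
import re
from collections import Counter

UA_RE = re.compile(r'"([^"]*)"\s*$')          # last quoted field
BOT_HINTS = ("bot", "crawler", "spider", "slurp")

def count_bots(logfile):
    hits = Counter()
    for line in open(logfile):
        m = UA_RE.search(line)
        if m and any(h in m.group(1).lower() for h in BOT_HINTS):
            hits[m.group(1)] += 1
    return hits

for agent, n in count_bots("access.log").most_common():
    print("%6d  %s" % (n, agent))
```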

If, as Dave Sifry of Technorati says, the blogosphere is doubling in size every 5–6 months, will services requiring blog indexing grow at the same rate? Will 70+ crawlers be visiting this site daily in a year’s time? 300 crawlers a day two years from now?
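The back-of-the-envelope math behind those guesses, assuming crawler counts double on the same schedule as the blogosphere:

```python
# Projecting crawler counts if they double every 6 months,
# starting from the 18 that visited yesterday.
crawlers_today = 18
for months in (12, 24):
    doublings = months / 6
    print("%d months: ~%d crawlers/day" % (months, crawlers_today * 2 ** doublings))
# 12 months: ~72 crawlers/day  (the "70+" above)
# 24 months: ~288 crawlers/day (roughly 300)
```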

Bloggers pay the cost of the bandwidth consumed by all of this search engine indexing, either directly or indirectly. Bandwidth is not free: a blog hosting provider has to pay for it, and must recoup that cost (along with the other costs of blog hosting) through subscription fees or by placing advertising on the blogs it hosts.

Doesn’t it seem inevitable that the explosion of blog indexing services will eventually have an effect on the price of blog hosting services?

Maybe there’s a better way to do this… Bob Wyman said the following last year in this post:

I’m hoping that Yahoo!’s support for the FeedMesh will convince folk that services that might otherwise compete can see clear advantage in cooperating to ensure that the task of discovering blogs and updates to blogs is shared among all parties. We’ll still compete… It’s just that we’ll compete based on the quality of the services we provide rather than just on how many blogs we monitor.

If this idea were extended to cover not only the discovery of blogs and updates but the content of those updates, perhaps the bandwidth pressure on blogs could be alleviated. What if there were a mirroring service, a Blog Cacher, that monitored the FeedMesh for update notifications and stored a copy of the blog pages and feeds for use exclusively by the blog search services?
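The core of such a Blog Cacher would be small. A rough sketch, with the big caveat that FeedMesh’s actual notification interface isn’t modeled here; assume handle_update() is called with a URL each time an update notification arrives:

```python
# Sketch of the Blog Cacher core. FeedMesh's real interface is not
# modeled; assume handle_update() fires once per update notification.
import hashlib
import os
import urllib.request

CACHE_DIR = "cache"

def cache_path(url):
    # one cache file per URL, keyed by a hash of the URL
    return os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())

def handle_update(url):
    # fetch the updated page or feed exactly once...
    data = urllib.request.urlopen(url).read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_path(url), "wb") as f:
        f.write(data)
    # ...and let every subscribing search engine read the cached copy
    # instead of each one hitting the blog itself
```

One fetch per update, no matter how many indexers sit downstream; that’s the whole bandwidth argument in miniature.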

Access to the cached or mirrored copy would be restricted to blog indexing services, ensuring that the general public only ever sees the “original”. Make it opt-in: let the blog owner request that search engines use the cached copy, perhaps via a simple file uploaded to the root directory of the blog, robots.txt style.
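The opt-in check could work just like robots.txt. A sketch, where the blogcache.txt file name and its directive are hypothetical conventions made up for illustration:

```python
# Check whether a blog owner has opted in to cached crawling.
# "blogcache.txt" and the "use-cache" directive are hypothetical --
# no such convention exists; this mirrors how robots.txt is fetched.
import urllib.error
import urllib.parse
import urllib.request

def opted_in(blog_url):
    optin_url = urllib.parse.urljoin(blog_url, "/blogcache.txt")
    try:
        body = urllib.request.urlopen(optin_url).read().decode("utf-8", "replace")
    except urllib.error.URLError:
        return False  # no file at the blog root means no opt-in
    return "use-cache: yes" in body.lower()
```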

And how would this Blog Cacher service pay for itself… how would it monetize? Hmmm… That’s a good question. I can see a few different models, and I’m sure you can too… I’d be surprised if we didn’t see such a service by the end of the year.

 

Update: Blog Cacher sounded pretty cool; I couldn’t resist registering blogcacher.com. 😉

Update 2005/02/24: Looks like there’s been some work done on an API called the RSS Cloud interface, which would allow updates to an RSS feed to be sent to “interested parties.” It would be a great place to start for a blog caching service…
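For reference, an RSS 2.0 feed advertises its cloud as an endpoint plus a registerProcedure (typically “pleaseNotify”). A sketch of subscribing from Python; the notification parameters here are an approximation, so check the spec before leaning on it:

```python
# Subscribe to a feed's <cloud> endpoint so it pings us on updates.
# The pleaseNotify parameter order is an approximation of the
# rssCloud spec -- verify against the spec before using.
import xmlrpc.client

def subscribe(cloud, feed_url, my_port, my_path):
    # cloud: the feed's <cloud> element as a dict, e.g.
    # {"domain": "rpc.example.com", "port": 80, "path": "/RPC2",
    #  "registerProcedure": "pleaseNotify", "protocol": "xml-rpc"}
    endpoint = "http://%s:%s%s" % (cloud["domain"], cloud["port"], cloud["path"])
    server = xmlrpc.client.ServerProxy(endpoint)
    notify = getattr(server, cloud["registerProcedure"])
    # ask the cloud to call us back at my_port/my_path when feed_url changes
    notify("notifyUpdate", my_port, my_path, "xml-rpc", [feed_url])
```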
