Thursday, June 19

The Digg Paradigm: Community Spirit vs. Active Promotion

Everyone knows that having an article hit the front page of a popular social bookmarking site is an instant traffic boon that can't be ignored, but how do you get to the front page? Learn more about the issues involved.


Wednesday, June 4

Republishing, crawling and linking in 2008

There's an ongoing conversation on the nextNY mailing list regarding linking and republishing of RSS feeds.

An interesting discussion on robots.txt has sprouted as well.
Ed Costello wrote a post well worth reading. A few highlights:

"Why might a site owner block a crawler?
  • on analysis of the site's server logs they realize that a section of their site has been crawled that they had no intention of making crawlable (this doesn't mean that the content is supposed to be "private", but perhaps it's the output of their inventory control system and they just don't want it in the world's search engine caches)
  • abusive behaviour by crawlers, pounding away at CGI scripts for example, or systematically posting content to forms found in retrieved pages
  • misrepresentation of a site's content
  • pretty much any reason at the discretion of the site's owners. It's their site, they set the rules."
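
To make the first reason concrete, here is a minimal sketch of how a site owner might keep an inventory section out of crawlers' reach, and how a crawler would read that rule. The `/inventory/` path, the `example.com` URLs and the `ExampleBot` name are made up for illustration; the parsing uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to keep
# inventory-system output out of search engine caches.
ROBOTS_TXT = """\
User-agent: *
Disallow: /inventory/
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/about.html"))
print(allowed("http://example.com/inventory/list.cgi"))
```

A real crawler would fetch each site's own robots.txt (for example with `set_url()` and `read()`) rather than embedding the rules in a string.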

"A spider is not one spider on its own, it's one of potentially thousands hitting popular sites, sucking up bandwidth and other resources away from other users of the site."

"My advice:
  • be absolutely certain you comply with robots.txt,
  • make sure your spider is well behaved (take a breather between requests to a single site, or do a bunch of requests at a time, then move onto another site for awhile)
  • make sure your spider responds to weird server behavior by backing off (the number of spiders which don't process redirects correctly is sadly high)
  • include a URL to your spider's description/info page in the User-Agent field of the request (suggest using a tinyurl or something comparable to keep the number of bytes down, no need to waste bandwidth)
  • make sure your spider's IP address(es) reverse resolve to your domain
  • make sure someone's actually monitoring the spider as it runs, or gets alerted if something bad or strange happens."
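
A few of these points (honoring robots.txt, pausing between requests to one site, backing off when the server misbehaves, and naming an info page in the User-Agent) can be sketched in a small bookkeeping class. This is only an illustration under assumptions, not Ed's code: the `PoliteGate` name, the two-second default delay, the doubling backoff capped at 64x, and the `ExampleBot` User-Agent string are all invented here. It does no fetching itself; it only answers "may I request this URL, and how long should I wait first?"

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent that points site owners at an info page.
USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot)"


class PoliteGate:
    """Tracks robots.txt rules and per-host request spacing."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed base pause between requests to one host
        self.clock = clock          # injectable so the logic can be tested
        self.robots = {}            # host -> RobotFileParser
        self.next_ok = {}           # host -> earliest time of the next request
        self.penalty = {}           # host -> current backoff multiplier

    def set_robots(self, host, robots_txt):
        """Register the robots.txt text previously downloaded for `host`."""
        rp = RobotFileParser()
        rp.parse(robots_txt.splitlines())
        self.robots[host] = rp

    def wait_time(self, url):
        """Seconds to wait before fetching `url`, or None if it must be skipped.

        Hosts whose robots.txt has not been registered are skipped too,
        which is a deliberately conservative choice.
        """
        host = urlparse(url).netloc
        rp = self.robots.get(host)
        if rp is None or not rp.can_fetch(USER_AGENT, url):
            return None
        return max(0.0, self.next_ok.get(host, 0.0) - self.clock())

    def record(self, url, status):
        """Note a response; double the pause after 429 or 5xx, else reset it."""
        host = urlparse(url).netloc
        if status == 429 or status >= 500:
            self.penalty[host] = min(self.penalty.get(host, 1) * 2, 64)
        else:
            self.penalty[host] = 1
        self.next_ok[host] = self.clock() + self.min_delay * self.penalty[host]
```

In a crawl loop you would call `wait_time()` before each request, sleep for the returned number of seconds (or skip the URL on `None`), send the request with `USER_AGENT` in its headers, and then pass the response status to `record()`. Reverse DNS and monitoring are operational matters that a sketch like this cannot cover.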

Finally, perhaps most importantly:
"If someone complains, don't quibble, stop crawling their site first then work with them to address their concerns"

Long story short: don't be abusive. Crawl others as you would wish to be crawled. Respect published instructions. In this internetworked world, populated by an ever-increasing amount of content and plumbed by an ever-wider variety of open APIs, simple rules of etiquette still apply.