Should scrAPIs be open source?

Posted by Thor on March 3, 2006. Filed under Random.

Some people have asked how important it is that scrAPIs be open source. Put simply, a scrAPI is a screen scraper with an open API. But maintaining a scrAPI of any complexity means parsing pages that may change frequently, so it should ideally harness open source-style collaboration among the developers who use it.

When individuals are responsible for scrapers, they are likely to miss the inevitable parsing hiccups caused by changes on the data provider’s site. Communities of users are far more likely to catch these problems as they emerge.

There’s nothing better than a gaggle of geeks to keep a scrAPI running smoothly.

About this entry:

This entry was posted on Friday, March 3rd, 2006 at 11:09 pm and is filed under Random.
3 Comments to “Should scrAPIs be open source?”
  1. Labnotes » Blog Archive » What exactly is a scrAPI? Says:

    […]’s also raising the point that there could be an ecosystem of scrAPIs, including unifying services and open source lib […]

  2. seanohagan Says:

    I’ve always wanted to implement an automated way for a scraper to catch site changes and to notify the developer.

    Each time a site is scraped, certain static features of the site should always be checked. These could include text or images that always occur in the same place; text that always has the same format (i.e. a date, a time, a price, etc.); other static HTML markers; etc.

    As soon as the scraper detects that the site’s “skeleton” has changed, it alerts the developer so that the proper modifications can be made.

    Besides being run each time a scrape is executed, this check could be made hourly or at some other regular interval.

    Perhaps this is already a standard feature of most scrapers?

  3. Thor Says:

    Sean,
    Good scrapers should have error checking just as you describe, but given how quick-and-dirty so many scrapers are, I’d guess that most don’t. One of the advantages of a scrAPI initiative would be to popularize these best practices and make reusable code readily available. We need to remove all barriers to doing this.
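
The "skeleton" check seanohagan describes in comment 2 could be sketched roughly as follows. This is only a minimal illustration in Python, not code from any existing scrAPI: the names `SkeletonCheck`, `find_broken_checks` and `scrape_with_skeleton_check`, the example patterns, and the imaginary product page are all assumptions made up for the example. Each check is a regular expression for a static feature that should appear on every scrape; if any of them stops matching, the developer is alerted before the page is parsed.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SkeletonCheck:
    """One static feature the scraper expects to find on every scrape."""
    name: str
    pattern: str  # regular expression that must match somewhere in the page


def find_broken_checks(html: str, checks: List[SkeletonCheck]) -> List[str]:
    """Return the names of the checks whose pattern no longer matches."""
    return [c.name for c in checks if re.search(c.pattern, html) is None]


def scrape_with_skeleton_check(
    html: str,
    checks: List[SkeletonCheck],
    alert: Callable[[str], object],
) -> bool:
    """Verify the page skeleton; call alert() and return False if it changed."""
    broken = find_broken_checks(html, checks)
    if broken:
        alert("Page skeleton changed; failed checks: " + ", ".join(broken))
        return False
    return True


# Hypothetical checks for an imaginary product page (assumed for illustration).
CHECKS = [
    SkeletonCheck("price format", r"\$\d+\.\d{2}"),
    SkeletonCheck("date format", r"\d{4}-\d{2}-\d{2}"),
    SkeletonCheck("title marker", r'<h1 class="product-title">'),
]

if __name__ == "__main__":
    page = '<h1 class="product-title">Widget</h1><p>$19.99</p><p>2006-03-03</p>'
    # Here print stands in for whatever notification the developer prefers.
    if scrape_with_skeleton_check(page, CHECKS, print):
        print("Skeleton intact; safe to parse.")
```

The `alert` argument is deliberately just a callable, so it could send an email or write to a log instead of printing. The same function could also be run hourly, or at some other regular interval, against a freshly fetched copy of the page, independent of the scrapes themselves.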