Here come the scrAPIs!

Let’s be honest with ourselves. It’s the crappy technology no one cares about that so often changes everything. MP3s, for instance, sound pretty awful compared to the CD tracks they’re ripped from. But as a format MP3 was good enough to blow a big smoking hole in the music industry…once there was an easy way for people to share them with each other. From the beginning artists and labels have tried to control file sharing, but now the successful artists are actually harnessing it.

We can learn from this. Screen scrapers, like the MP3 format, are anything but ideal, but dammit they get the job done. scrAPIs are a means of turning these unfortunate little hacks into something remarkable.

The way to do this? We turn them into APIs, lay down some coding conventions and collectively share their maintenance. We apply open source collaboration and guerrilla marketing tactics to make data as free as can be. With a bit of duct tape and chewing gum we have the opportunity to do for data what Napster and Kazaa did for music: change the game entirely.

But I’m getting ahead of myself. scrAPIs are mostly just a concept today. Let’s talk about why we should make them a reality.

The explosion in mashups is demonstrating the ease and power of combining services and data from multiple sources. We’re seeing how much more valuable data becomes when it is presented optimally for the specific task at hand–usually quite differently than the data providers display it. But it’s also awakened us to the reality that there are only a few hundred open APIs to draw from amongst a countless number of data providers. After all, the Web has made every organization, company, and individual a de facto data provider. The information we need is too often obscured by arcane forms, unusable screen designs, and workflow constraints.

APIs are the gold standard for data access. They’re managed as critical pieces of a larger application, and it shows, with their usually reliable performance, solid documentation and adherence to best practices. But just ask Stewart Butterfield of Flickr: there are real costs associated with developing, supporting and maintaining APIs. Most businesses, even Internet and software firms, have no plans to open up their APIs, and for good reason.

So we scrape. We build little collections of tortured code that slice and dice HTML, text files, PDFs and other documents to pull out the structured data that our apps need. We have no idea whether we’re the only ones in the world parsing for a particular set of data, or if there are hundreds of others duplicating the effort.
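For anyone who hasn’t written one, here is roughly what such a scraper looks like: a minimal sketch in Python using only the standard library. The sample page, the `scrape_listings` name and the `title`/`price` fields are all made up for illustration (loosely modeled on a used book listing); a real scraper would fetch live HTML and cope with far messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:     # header rows use <th>, so they end up empty
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


def scrape_listings(html):
    """Turn a (hypothetical) listings table into a list of dicts."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return [{"title": r[0], "price": r[1]} for r in parser.rows if len(r) >= 2]


# Invented sample page standing in for a real data source.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Used paperback</td><td>$4.50</td></tr>
  <tr><td>Signed hardcover</td><td>$30.00</td></tr>
</table>
"""

listings = scrape_listings(SAMPLE_HTML)
```

Notice how much of the code is knowledge about one page’s layout. That is exactly the part that gets rewritten, separately, by everyone who scrapes the same source.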

And of course, we end up maintaining all these scrapers. This can be a huge chore, such as when a data source makes a minor tweak to its HTML that throws our scraper into a tailspin. Too often we discover the scraper is broken after our app has been crippled for a week or two, and we scramble to get it fixed.
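One cheap defense is to have the scraper complain loudly the moment its output stops looking right, rather than quietly feeding junk to the app. The sketch below is one way to do that in Python, not a standard technique from any library: `ScraperBroken`, `validate_records` and the `title`/`price` fields are hypothetical names, and the price pattern assumes US-style dollar amounts.

```python
import re


class ScraperBroken(Exception):
    """Raised when scraped output no longer looks like the data we expect."""


# Assumption: prices look like "$4.50". Adjust for the real data source.
PRICE_PATTERN = re.compile(r"^\$\d+\.\d{2}$")


def validate_records(records, min_count=1):
    """Return records unchanged, or raise ScraperBroken if they look wrong."""
    if len(records) < min_count:
        raise ScraperBroken(
            f"expected at least {min_count} records, got {len(records)}"
        )
    for index, record in enumerate(records):
        for field in ("title", "price"):
            if not record.get(field):
                raise ScraperBroken(f"record {index} is missing '{field}'")
        if not PRICE_PATTERN.match(record["price"]):
            raise ScraperBroken(
                f"record {index} has an odd price: {record['price']!r}"
            )
    return records
```

Run on every scrape, or on a schedule against the live source, a check like this turns “broken for a week or two” into an alert on the day the markup changes.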

ScrAPIs are different. Sure, they’re still scrapers, but they’re a whole lot more. Let’s take a look:

Scrapers                                | scrAPIs
Run within a closed application         | Deliver public access to a scraper’s functionality
Operated locally by direct invocation   | Provide a REST interface
Maintained by an application developer  | Maintained by a community
Built with custom code                  | Built with reusable code components
No documentation                        | API documentation

We’re seeing the model begin to emerge with projects like Ontok’s Wikipedia web service and XMLTV (though XMLTV does not provide an API, just XML files).

Most importantly, scrAPIs act as open APIs for data sources that don’t have them. We don’t have to wait an eternity for the great majority of sites to get around to creating APIs of their own, if they ever do. And the recipe is simple: Scrapers + web services + open source collaboration = scrAPIs.
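To make the recipe concrete, here is a rough sketch of the “scraper + web service” half in Python, using only the standard library. Everything named here is an assumption for illustration: the `/listings` resource, the `q` parameter, the JSON response format, and the canned `scrape_listings` function standing in for a real scraper. The open source collaboration half is, of course, not something code can show.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def scrape_listings(query):
    """Stand-in for a real scraper: canned records filtered by title."""
    records = [
        {"title": "Used paperback", "price": "$4.50"},
        {"title": "Signed hardcover", "price": "$30.00"},
    ]
    return [r for r in records if query.lower() in r["title"].lower()]


def route(url):
    """Map a request URL to (status code, JSON-serializable body)."""
    parsed = urlparse(url)
    if parsed.path != "/listings":
        return 404, {"error": "unknown resource"}
    query = parse_qs(parsed.query).get("q", [""])[0]
    return 200, {"results": scrape_listings(query)}


class ScrapiHandler(BaseHTTPRequestHandler):
    """Expose the scraper over plain HTTP GET requests."""

    def do_GET(self):
        status, body = route(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the service locally until interrupted."""
    HTTPServer(("127.0.0.1", port), ScrapiHandler).serve_forever()
```

Calling `serve()` and requesting `http://127.0.0.1:8000/listings?q=signed` would return the matching records as JSON. Keeping `route` separate from the HTTP handler means the scraping logic can be swapped or tested without touching the web service layer, which matters if many people are going to maintain it.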

The environment is ripe for this vision to emerge. We have all the makings of a perfect storm:

  • A critical mass of structured data on the Web
    It’s taken ten years, but virtually every organization or business is publishing some kind of structured information via the Web. When it comes to data, everything is valuable to somebody.
  • Popular demand for API-level access to all this data
    How many new Google Maps mashups have you seen lately? The sheer number of professionals and amateurs using open APIs is exploding, driving more usage and spawning new data providers.
  • Best practices in Web API design
    We know what works for the masses: simple signup, REST calls, ready-to-go developer toolkits, a liberal usage policy. This is the recipe for shake-n-bake API access.
  • Economic benefits to useful mashups
    The cost/effort of integrating a Web service into an app is minor, and the benefits are recognized.
  • A large number of ad hoc efforts to repurpose web data
    Not only are there a tremendous number of scrapers being built and maintained around the Web, but a new generation of scraping tools has emerged to increase the ease of scraping. No need to be a regular expressions god to get rolling.
  • Best practices in online collaboration
    The open source movement has demonstrated how individuals acting in self-interest can come together to create tangible work of value, without outright ownership. Platforms like SourceForge have been instrumental in laying the infrastructure.

Put it all together and we have a real opportunity on our hands.

So let’s get started, shall we? It’s time to stop building scrapers and start building scrAPIs.

10 Responses to “Here come the scrAPIs!”

  1. Labnotes » Blog Archive » What exactly is a scrAPI? Says:

    [...] ormation, you’ve created a contacts scrAPI. Thor makes compelling arguments for his idea. He’s also raising the point that the [...]

  2. netpositive | Another definition for scrAPI Says:

    [...] efinition for scrAPI
    March 5th, 2006

    When a group of us initially discussed the idea of scrAPIs at Mashup Camp it was in the context of turning our scra [...]

  3. Annoyingly Cheerful » Blog Archive » scrAPI? Says:

    [...] find about microformats, and today I stumbled over a concept that is somewhat related: the scrAPI. I already knew about screen scraping, of course: it’s where you [...]

  4. ProgrammableWeb.com » Blog Archive » scrAPIs Says:

    [...] What’s a scrAPI? A scrAPI, which at this point is more of an idea than a thing, was recently described by Thor Muller in his blog as a type of community-built API that provides a programming layer above web sites that don’t otherwise have an API. This intermediate layer, which exists independently of the destination web site, in turn does the dirty work of screen-scraping of raw HTML from the source and returns just the relevant data in some cleaner XML format. Thus a collaboratively built and maintained set of code for data access from any source. [...]

  5. seanohagan Says:

    Sounds like a great idea! I’ve created two web scrapers for myself over the last few years: one to scrape a specific used book site, and more recently a very generic one as a command (called scrape) at YubNub which allows almost any site to be scraped (although only for a consecutive block of text).

    I’d be very interested in seeing a project like this get started and would like to contribute if I could.

  6. Thor Says:

    Thanks for your enthusiasm, Sean. I’ll be posting suggestions (and launching a wiki) for a specific grassroots initiative around scrAPIs in the next week or so. It’ll be managed by the community of users, so it should be a great vehicle for the great scraping work you’re doing.

  7. Richard K Miller dot coooooooooom » scrAPIs Says:

    [...] Sources: ThorMuller.com ProgrammableWeb.com [...]

  8. THINK / Musings » Blog Archive » Scraping data and API’s Says:

    [...] I was wondering how easy it would be to build a generic approach to opening up APIs on web sites that didn’t formally publish them, and then last night I saw this post about scrAPIs. Great stuff; I would like to be able to cut and paste data sources and mix them together myself. I find myself doing this manually too often today (e.g., the other night I was cutting and pasting Rotten Tomatoes reviews vs. a movie database). So many mashups today are based on geolocation data; it’s like my one-year-old who has six or seven words, and most everything is at some point “hot”. Latitude and longitude are just the easiest and first data source to be mined; things are going to get a lot more interesting as the data sources become increasingly diverse. I look forward to Muller’s coming posts on the business and legal issues regarding scrAPIng. Posted by John. Filed in think, building blocks, API’s [...]

  9. tijs.org » Blog Archive » Program all of the web Says:

    [...] Although most modern web applications have an API, many do not. For the rest we have site scrapers. Thor Muller suggests (follow up here) we open up our scrapers with their own scrAPI to get those now hard-to-reach sources of info into the mashup mix. [...]

  10. Breyten’s Dev Blog » Blog Archive » links for 2006-05-11 Says:

    [...] netpositive | Here come the scrAPIs! “It’s time to stop building scrapers and start building scrAPIs.” (tags: api xml scrapi mashups web2.0) [...]
