Let’s be honest with ourselves. It’s the crappy technology no one cares about that so often changes everything. MP3s, for instance, sound pretty awful compared to the CD tracks they’re ripped from. But as a format MP3 was good enough to blow a big smoking hole in the music industry…once there was an easy way for people to share them with each other. From the beginning, artists and labels have tried to control file sharing, but now the successful artists are actually harnessing it.
We can learn from this. Screen scrapers, like the MP3 format, are anything but ideal, but dammit they get the job done. scrAPIs are a means of turning these unfortunate little hacks into something remarkable.
The way to do this? We turn them into APIs, lay down some coding conventions and collectively share their maintenance. We apply open source collaboration and guerrilla marketing tactics to make data as free as can be. With a bit of duct tape and chewing gum we have the opportunity to do for data what Napster and Kazaa did for music: change the game entirely.
But I’m getting ahead of myself. scrAPIs are mostly just a concept today. Let’s talk about why we should make them a reality.
The explosion in mashups is demonstrating the ease and power of combining services and data from multiple sources. We’re seeing how much more valuable data becomes when it is presented optimally for the specific task at hand–usually quite differently than the data providers display it. But it’s also awakened us to the reality that there are only a few hundred open APIs to draw from amongst a countless number of data providers. After all, the Web has made every organization, company, and individual a de facto data provider. The information we need is too often obscured by arcane forms, unusable screen designs, and workflow constraints.
APIs are the gold standard for data access. They’re managed as critical pieces of a larger application, and it shows, with their usually reliable performance, solid documentation and adherence to best practices. But just ask Stewart Butterfield of Flickr: there are real costs associated with developing, supporting and maintaining APIs. Most businesses, even Internet and software firms, have no plans to open up their APIs, and for good reason.
So we scrape. We build little collections of tortured code that slice and dice HTML, text files, PDFs and other documents to pull out the structured data that our apps need. We have no idea whether we’re the only ones in the world parsing for a particular set of data, or if there are hundreds of others duplicating the effort.
And of course, we end up maintaining all these scrapers. This can be a huge chore, such as when a data source makes a minor tweak to its HTML that throws our scraper into a tailspin. Too often we discover the scraper is broken after our app has been crippled for a week or two, and we scramble to get it fixed.
ScrAPIs are different. Sure, they’re still scrapers, but they’re a whole lot more. Let’s take a look:
| Scrapers | scrAPIs |
| --- | --- |
| Run within a closed application | Deliver public access to a scraper’s functionality |
| Operated locally by direct invocation | Provide a REST interface |
| Maintained by an application developer | Maintained by a community |
| Built with custom code | Built with reusable code components |
| No documentation | API documentation |
We’re seeing the model begin to emerge with projects like Ontok’s Wikipedia web service and XMLTV (though XMLTV does not provide an API, just XML files).
Most importantly, scrAPIs act as open APIs for data sources that don’t have them. We don’t have to wait an eternity for the great majority of sites to get around to creating APIs of their own, if they ever do. And the recipe is simple: Scrapers + web services + open source collaboration = scrAPIs.
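To make the recipe a little more concrete, here is a minimal sketch of the "scraper + web service" half of it, using only the Python standard library. Everything specific in it is an assumption made up for illustration: the sample HTML, the `show` and `time` field names, the `/listings` path, and the port. A real scrAPI would fetch the live page from its data source and would need the community maintenance and documentation described above.

```python
import json
from html.parser import HTMLParser
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sample page standing in for a real data source's HTML.
SAMPLE_HTML = """
<table id="listings">
  <tr><td class="show">Morning News</td><td class="time">07:00</td></tr>
  <tr><td class="show">Late Movie</td><td class="time">23:30</td></tr>
</table>
"""


class ListingParser(HTMLParser):
    """The scraper: turns each <tr> into a dict keyed by its <td> class names."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # class of the <td> currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
        elif tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        text = data.strip()
        if self._field and self.rows and text:
            self.rows[-1][self._field] = text

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None


def scrape_listings(html):
    """Return the structured rows found in the given HTML, skipping empty ones."""
    parser = ListingParser()
    parser.feed(html)
    return [row for row in parser.rows if row]


class ScrapiHandler(BaseHTTPRequestHandler):
    """The web service: exposes the scraper's output as JSON over plain GET."""

    def do_GET(self):
        if self.path == "/listings":  # hypothetical resource path
            body = json.dumps(scrape_listings(SAMPLE_HTML)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, "Unknown resource")


def serve(port=8000):
    """Run the sketch locally; the port number is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), ScrapiHandler).serve_forever()
```

Calling `serve()` and then requesting `http://127.0.0.1:8000/listings` should return the scraped rows as JSON. Keeping the scraper (`scrape_listings`) separate from the HTTP handler is a deliberate choice here: when the source site tweaks its markup, only the parser has to change, and anyone in the community can patch it without touching the interface that apps depend on.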
The environment is ripe for this vision to emerge. We have all the makings of a perfect storm:
- A critical mass of structured data on the Web
It’s taken ten years, but virtually every organization or business is publishing some kind of structured information via the Web. When it comes to data, everything is valuable to somebody.
- Popular demand for API-level access to all this data
How many new Google Maps mashups have you seen lately? The sheer number of professionals and amateurs using open APIs is exploding, driving more usage and spawning new data providers.
- Best practices in Web API design
We know what works for the masses: simple signup, REST calls, ready-to-go developer toolkits, a liberal usage policy. This is the recipe for shake-n-bake API access.
- Economic benefits to useful mashups
The cost/effort of integrating a Web service into an app is minor, and the benefits are recognized.
- A large number of ad hoc efforts to repurpose web data
Not only are there a tremendous number of scrapers being built and maintained around the Web, but a new generation of scraping tools has emerged to increase the ease of scraping. No need to be a regular expressions god to get rolling.
- Best practices in online collaboration
The open source movement has demonstrated how individuals acting in self-interest can come together to create tangible work of value, without outright ownership. Platforms like SourceForge have been instrumental in laying the infrastructure.
Put it all together and we have a real opportunity on our hands.
So let’s get started, shall we? It’s time to stop building scrapers and start building scrAPIs.