Another definition for scrAPI
When a group of us initially discussed the idea of scrAPIs at Mashup Camp, it was in the context of turning our scrapers into proper APIs and collaborating on their maintenance. However, Assaf points out that there is another way of defining a scrAPI: the implicit API in the HTML patterns that contain the data. As he explains:
A scrAPI uses HTTP transport, HTML parsing and some custom code for making sense of the data. Each scrAPI has its own custom code, depending on the service being used and what data you’re looking for.
In short, there’s an API, it just requires a little bit of scraping.
Flickr has a scrAPI, as does WordPress, MoveableType, Blogger, MetaFilter and a whole set of other sites.
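The three parts named in the quote (HTTP transport, HTML parsing and custom code) can be sketched roughly as follows in Python, using only the standard library. The markup pattern (`<h2 class="entry-title">`) and the names `TitleScraper`, `fetch` and `scrape_titles` are assumptions invented for this illustration; they do not describe any particular site's markup.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleScraper(HTMLParser):
    """Custom code: collect the text of every <h2 class="entry-title">.

    The tag and class name are hypothetical; a real scrAPI would target
    whatever pattern the scraped site happens to use.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "entry-title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text, including text nested inside links.
        if self._in_title:
            self.titles[-1] += data


def fetch(url):
    """HTTP transport: download a page and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_titles(html):
    """HTML parsing: run the custom parser over a page and return the titles."""
    parser = TitleScraper()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Made-up sample page, so the sketch runs without touching the network.
sample = (
    '<h2 class="entry-title"><a href="/p/1">Another definition</a></h2>'
    "<h2>Sidebar</h2>"
)
print(scrape_titles(sample))
```

In real use, `scrape_titles(fetch(url))` would tie the pieces together, and the parser would break as soon as the site changed its markup, which is exactly the maintenance burden discussed here.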
When he first explained this idea to me, it didn’t strike me as the most useful of definitions. After all, simply calling the scrapeable code something different doesn’t make scraping any easier. It seemed more beneficial to popularize the definition that enables a transformation in how we share scrapers and collaborate on their maintenance.
But his essay makes the very important point that propagating the idea of HTML as a de facto API might encourage Web developers to make their sites easier to scrape by investing in semantic markup, or at least to avoid the kinds of code changes they know will break scrapers.
This confusion over the definition for scrAPI is not new, as it turns out. The original coiner of the term, Paul Bausch, seems to agree with my definition in his initial mention of it, back in 2002:
I’ve been thinking about turning my Amazon scraping scripts into an XML API to their book information (I call these SCRAPIs), but it could never be as reliable as Amazon offering their own API. Plus I’d have to keep up with their page design changes. It’s fun to think about rogue APIs to web sites, though.
For some reason, in a post that same month, Peter Lindberg paraphrases Paul’s description, creating a counter-definition identical to Assaf’s:
I saw the term “SCRAPI” for the API that all websites have in the form of HTML, that you can use by writing apps that request pages and “scrape” them for information. It seems it was coined by Paul Bausch.
Both definitions make sense. However, I believe the benefits of popularizing API versions of scrapers, per my original post, are what we ought to be focused on. “scrAPIs” should describe what we’re building, not what we’re scraping. With something approaching a collective effort, it will be easier to promote Assaf’s goal of getting data providers to help us out rather than block us. As operators of scrAPIs in the clear light of day, we can insist on ethical and legal uses by our users, representing the interests of our data providers.
Then not only will we be their API providers, but their usage watchdogs as well. They oughtta love that.
March 8th, 2006 at 4:53 pm
Interesting points in this and the last entry on the topic. Recently, I began considering my options for developing my own mashup API [TrainCheck], which gives people the freedom to check public transit train times using only cell phones equipped with email or SMS. Currently, TrainCheck’s information system uses data “scraped” either by using Beautiful Soup [check out Rubyful Soup for the Rubesters], or by manually reproducing timetables published by the metros.
Two points:
1. It would be great if, as Assaf puts it, the originators would agree to supply data in a scrapable format at the very least, to help ensure the constant integrity of our data against theirs. After all, re-scraping is easy once you have a ScrAPI.
2. With the data provided in a “scrAPI-friendly” format, i.e. /(X)?HTML/, ambitious coders could then work on a product for developing ScrAPIs. Literally, the open source item in question could be a facade for modular ScrAPI development. A gateway to scraping, if you will.
For example, the most austere ScrAPI [as Thor describes it] may provide a quick and dirty REST URI that will retrieve scraped data from a site once a developer registers their ScrAPI module. How would that work?
1. Chuck wants to scrape gonefishin.com for tomorrow’s prime fishing times. He wants to post the times on his site for your average fisherman.
2. Chuck visits ghostscrape.com [the scrAPI] and creates a “scrapster” profile for his scraper. He loads the gonefishin site in a “scriframe”, then “magically” picks and chooses the info he wants to scrape right out of the site he’s targeting.
3. The scrAPI will let him customize the representational state transfer variables in the query string as well as the format of the returned “scrapola” in the HTTP response.
4. The scrAPI will give Chuck a URI for the API [everyone who uses the system will get the same base URI plus the query string for their requested info and a token of some sort to identify the "scrapster" that Chuck profiled in step 2.]
e.g. http://scrapi.ghostscrape.com?token=09_KFDJLK098KL&scrape01=f_times&template=XML
5. The ScrAPI returns the stuff Chuck needs without any confusion or hullabaloo. Whatever comes back, it’s up to Chuck to write the parser that’ll make it hit his page. However, all the configuration, XML templates and everything else will be parsed and rendered to a response by the ScrAPI app.
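To make steps 3 and 4 a little more concrete, here is a minimal Python sketch of how a client might assemble and take apart the kind of URI shown in the example. The function names `build_scrapi_url` and `parse_scrapi_url` are invented for illustration, and the `scrape01`, `scrape02`, ... numbering is only a guess extrapolated from the single `scrape01` parameter in the example URI; ghostscrape.com itself is a hypothetical service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URI, taken from the example above.
BASE_URI = "http://scrapi.ghostscrape.com"


def build_scrapi_url(token, fields, template="XML", base=BASE_URI):
    """Build a request URI: shared base + scrapster token + requested fields.

    The scrapeNN parameter naming is an assumption based on the one
    example parameter; a real service could name these differently.
    """
    params = [("token", token)]
    for index, field in enumerate(fields, start=1):
        params.append((f"scrape{index:02d}", field))
    params.append(("template", template))
    return base + "?" + urlencode(params)


def parse_scrapi_url(url):
    """Recover the query parameters, as the service side might do."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


url = build_scrapi_url("09_KFDJLK098KL", ["f_times"])
print(url)
print(parse_scrapi_url(url))
```

Everyone shares the same base URI under this scheme, so the token is the only thing that ties a request back to the “scrapster” profile created in step 2.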
Back to public transit. So I’m facing the possibility of talking to the people responsible for DC’s transit infrastructure about designing their API. The more I think about it, the less I want to help them design an API for themselves, but instead build an API like ghostscrape [but geared entirely toward transportation systems]. Sure, first I’ll suggest that they give me the freedom to build a “live” API for all their GPS-enabled systems. Let’s assume they tell me it would cost them too much [bs, for sure]. Then, at the very least, I could look forward to asking them to simply get their static data into HTML tables that I could scrape with ease.
And for starters, that doesn’t sound all that bad.
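For a sense of why plain HTML tables would be “scraped with ease”, here is a small Python sketch using only the standard library. The table layout, the station names and the departure times are all made up for this example, and `TableScraper` and `scrape_timetable` are hypothetical names; this is not how TrainCheck actually works.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the cell text of every table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_timetable(html):
    """Turn a table with one header row into a list of dicts."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample data standing in for a published timetable.
sample = """
<table>
  <tr><th>Station</th><th>Departs</th></tr>
  <tr><td>Station A</td><td>08:05</td></tr>
  <tr><td>Station B</td><td>08:12</td></tr>
</table>
"""
print(scrape_timetable(sample))
```

As long as the header row stays put, a scraper like this keeps working even when rows are added or removed, which is about the least a data provider could do to help.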
Thanks Thor!
So let’s make it happen: that’s a challenge to anyone willing to wrap your mind around this very useful concept.
March 14th, 2006 at 1:10 pm
http://simile.mit.edu/solvent/
Very cool. Solvent is a Firefox extension that helps you write JavaScript screen scrapers for Piggy Bank.
March 21st, 2006 at 10:25 pm
[...] In this follow-up post Thor notes that the original coiner of the term was Paul Bausch back in 2002, in reference to scraping Amazon data. Interestingly, it was just this sort of scraping that was a key driver in leading Amazon to subsequently build a real API: people are going to do it anyway, so let’s formalize and leverage it. | March 21st, 2006 | Posted by John Musser in News [...]