Rubyred Labs in the Chron

Jonathan was interviewed by Dan Fost of the SFChronicle at SXSW.

“Do we want to make money? Absolutely. I want to make big stuff,” said Jonathan Grubb, 26, a co-founder and design director of Ruby Red Labs, a San Francisco Web and mobile product design firm. “But I’m tired of making stuff that’s the most profitable thing.”

Will he take venture capital when it comes calling? No way. “I don’t want money from some guy who wants to be my boss.”

Later in the article, Chris Messina is quoted: “I prefer sustainability over profitability.”

So perhaps it comes down to this: In Boom 1.0 we raised venture capital so we didn’t have to make money. In Boom 2.0 we’re NOT taking venture capital so we don’t have to make money.

Another definition for scrAPI

When a group of us initially discussed the idea of scrAPIs at Mashup Camp it was in the context of turning our scrapers into proper APIs and collaborating in their maintenance. However, Assaf points out that there is another way of defining a scrAPI: the implicit API in the html patterns that contain the data. As he explains:

A scrAPI uses HTTP transport, HTML parsing and some custom code for making sense of the data. Each scrAPI has its own custom code, depending on the service being used and what data you’re looking for.

In short, there’s an API, it just requires a little bit of scraping.

Flickr has a scrAPI, as does WordPress, MoveableType, Blogger, MetaFilter and a whole set of other sites.

When he first explained this idea to me it didn’t strike me as the most useful of definitions. After all, simply calling the scrapeable code something different doesn’t make scraping any easier. It seemed more beneficial to popularize the definition that enables a transformation in how we share scrapers and collaborate on their maintenance.

But his essay makes the very important point that propagating the idea of HTML as a de facto API might encourage Web developers to make their pages easier to scrape by investing in semantic markup, or at least to avoid the kinds of code changes that will knowingly break scrapers.
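To make the "HTML as a de facto API" idea concrete, here is a minimal sketch of scraping a page that uses semantic class names. The sample markup, the `post-title` class, and the `scrape_posts` helper are all assumptions made up for illustration; they are not taken from Assaf's essay or from any real site.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the class names are assumptions for illustration.
SAMPLE_HTML = """
<ul>
  <li class="post"><a class="post-title" href="/posts/1">First post</a></li>
  <li class="post"><a class="post-title" href="/posts/2">Second post</a></li>
</ul>
"""


class PostTitleParser(HTMLParser):
    """Collects (href, title) pairs from <a class="post-title"> elements."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_title = False  # True while inside a matching <a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "post-title" in classes:
            self._in_title = True
            self._href = attr_map.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_title:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            self.posts.append((self._href, "".join(self._text).strip()))
            self._in_title = False


def scrape_posts(html):
    """Return a list of (href, title) tuples found in the given HTML."""
    parser = PostTitleParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Because the parser keys on a class name rather than on the position of tags in the page, a redesign that keeps the class intact would leave this scraper working. That is the kind of stability semantic markup could buy.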

This confusion over the definition for scrAPI is not new, as it turns out. The original coiner of the term, Paul Bausch, seems to agree with my definition in his initial mention of it, back in 2002:

I’ve been thinking about turning my Amazon scraping scripts into an XML API to their book information (I call these SCRAPIs), but it could never be as reliable as Amazon offering their own API. Plus I’d have to keep up with their page design changes. It’s fun to think about rogue APIs to web sites, though.

For some reason, in a post that same month, Peter Lindberg paraphrases Paul’s description, creating a counter-definition identical to Assaf’s:

I saw the term “SCRAPI” for the API that all websites have in the form of HTML, that you can use by writing apps that request pages and “scrape” them for information. It seems it was coined by Paul Bausch.

Both definitions make sense. However, I believe that popularizing API versions of scrapers, per my original post, is what we ought to be focused on. “scrAPIs” should describe what we’re building, not what we’re scraping. With something approaching a collective effort, it will be easier to promote Assaf’s goal of getting data providers to help us out rather than blocking us. As the operators of scrAPIs in the clear light of day we can insist on ethical and legal uses by our users, representing the interests of our data providers.

Then not only will we be their API providers, but their usage watchdogs as well. They oughtta love that.

Should scrAPIs be open source?

Some people have asked how important it is that scrAPIs be open source. Put simply, a scrAPI is a screen scraper with an open API. But because maintaining a scrAPI of any complexity means parsing pages that may change with some frequency, it should ideally harness open source-style collaboration by the developers who use it.

When individuals are responsible for scrapers, they are likely to miss the inevitable parsing hiccups caused by changes on the data provider’s site. Communities of users are far more likely to catch these problems as they emerge.

There’s nothing better than a gaggle of geeks to keep a scrAPI running smoothly.

Here come the scrAPIs!

Let’s be honest with ourselves. It’s the crappy technology no one cares about that so often changes everything. MP3s, for instance, sound pretty awful compared to the CD tracks they’re ripped from. But as a format MP3 was good enough to blow a big smoking hole in the music industry…once there was an easy way for people to share them with each other. From the beginning artists and labels have tried to control file sharing, but now the successful artists are actually harnessing it.

We can learn from this. Screen scrapers, like the MP3 format, are anything but ideal, but dammit they get the job done. scrAPIs are a means of turning these unfortunate little hacks into something remarkable.

The way to do this? We turn them into APIs, lay down some coding conventions and collectively share their maintenance. We apply open source collaboration and guerilla marketing tactics to make data as free as can be. With a bit of duct tape and chewing gum we have the opportunity to do for data what Napster and Kazaa did for music: change the game entirely.

But I’m getting ahead of myself. scrAPIs are mostly just a concept today. Let’s talk about why we should make them a reality.

The explosion in mashups is demonstrating the ease and power of combining services and data from multiple sources. We’re seeing how much more valuable data becomes when it is presented optimally for the specific task at hand–usually quite differently than the data providers display it. But it’s also awakened us to the reality that there are only a few hundred open APIs to draw from amongst a countless number of data providers. After all, the Web has made every organization, company, and individual a de facto data provider. The information we need is too often obscured by arcane forms, unusable screen designs, and workflow constraints.

APIs are the gold standard for data access. They’re managed as critical pieces of a larger application, and it shows, with their usually reliable performance, solid documentation and adherence to best practices. But just ask Stewart Butterfield of Flickr: there are real costs associated with developing, supporting and maintaining APIs. Most businesses, even Internet and software firms, have no plans to open up their APIs, and for good reason.

So we scrape. We build little collections of tortured code that splice and dice html, text files, PDFs and other documents to pull out the structured data that our apps need. We have no idea whether we’re the only ones in the world parsing for a particular set of data, or if there are hundreds of others duplicating the effort.

And of course, we end up maintaining all these scrapers. This can be a huge chore, such as when a data source makes a minor tweak to their html that throws our scraper into a tailspin. Too often we discover the scraper is broken after our app has been crippled for a week or two, and we scramble to get it fixed.
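One way to shorten that week or two of silent breakage is to have the scraper check its own results and fail loudly. The sketch below is only an illustration: the price-cell markup, the `ScraperBrokenError` name, and the `minimum_expected` threshold are all hypothetical.

```python
import re


class ScraperBrokenError(Exception):
    """Raised when the page no longer matches the pattern the scraper expects."""


# Hypothetical pattern for a price cell; the markup is an assumption.
PRICE_PATTERN = re.compile(r'<td class="price">\$(\d+\.\d{2})</td>')


def extract_prices(html, minimum_expected=1):
    """Return all prices found, or raise if fewer than expected were matched."""
    prices = [float(match) for match in PRICE_PATTERN.findall(html)]
    if len(prices) < minimum_expected:
        # An empty result usually means the markup changed, not that the
        # data vanished, so surface it instead of returning an empty list.
        raise ScraperBrokenError(
            f"expected at least {minimum_expected} price(s), found {len(prices)}"
        )
    return prices
```

A check like this only catches the case where the pattern stops matching at all. It does not catch a tweak that makes the pattern match the wrong data, which is where more eyes on the scraper help.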

ScrAPIs are different. Sure, they’re still scrapers, but they’re a whole lot more. Let’s take a look:

Scrapers                               | scrAPIs
Run within a closed application        | Delivers public access to a scraper’s functionality
Operated locally by direct invocation  | Provides a REST interface
Maintained by an application developer | Maintained by a community
Built with custom code                 | Built with reusable code components
No documentation                       | API documentation

We’re seeing the model begin to emerge with projects like Ontok’s Wikipedia web service and XMLTV (though XMLTV does not provide an API, just xml files).

Most importantly, scrAPIs act as open APIs for data sources that don’t have them. We don’t have to wait an eternity for the great majority of sites to get around to creating APIs of their own, if they ever do. And the recipe is simple: Scrapers + web services + open source collaboration = scrAPIs.
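As a rough sketch of the "scraper + web service" part of that recipe, the example below wraps a toy scraper in a small REST-style endpoint using only Python's standard library. Everything specific here is an assumption for illustration: the `/listings` resource, the `max_rent` parameter, the sample markup, and the function names. A real scrAPI would fetch the page over HTTP instead of reading a canned string.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Stand-in for a fetched page; the markup is hypothetical.
SAMPLE_LISTINGS_HTML = """
<div class="listing"><span class="title">Sunny studio</span><span class="rent">$1200</span></div>
<div class="listing"><span class="title">Loft near park</span><span class="rent">$2100</span></div>
"""

LISTING_PATTERN = re.compile(
    r'<span class="title">(.*?)</span><span class="rent">\$(\d+)</span>'
)


def scrape_listings(html):
    """The scraper: turn listing markup into a list of dicts."""
    return [
        {"title": title, "rent": int(rent)}
        for title, rent in LISTING_PATTERN.findall(html)
    ]


def route(path):
    """Map a request path to (status_code, JSON-serializable body)."""
    parsed = urlparse(path)
    if parsed.path != "/listings":
        return 404, {"error": "unknown resource"}
    listings = scrape_listings(SAMPLE_LISTINGS_HTML)
    max_rent = parse_qs(parsed.query).get("max_rent")
    if max_rent:
        try:
            limit = int(max_rent[0])
        except ValueError:
            return 400, {"error": "max_rent must be an integer"}
        listings = [item for item in listings if item["rent"] <= limit]
    return 200, {"listings": listings}


class ScrapiHandler(BaseHTTPRequestHandler):
    """Serves the scraper's output as JSON over HTTP GET."""

    def do_GET(self):
        status, body = route(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the scrAPI locally; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), ScrapiHandler).serve_forever()
```

Calling `serve()` and requesting `/listings?max_rent=1500` would return the scraped data as JSON. Keeping the routing in a plain function, separate from the HTTP handler, makes the scraper logic easy for collaborators to test and patch without starting a server.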

The environment is ripe for this vision to emerge. We have all the makings of a perfect storm:

  • A critical mass of structured data on the Web
    It’s taken ten years, but virtually every organization or business is publishing some kind of structured information via the Web. When it comes to data, everything is valuable to somebody.
  • Popular demand for API-level access to all this data
    How many new Google Maps mashups have you seen lately? The sheer number of professionals and amateurs using open APIs is exploding, driving more usage and spawning new data providers.
  • Best practices in Web API design
    We know what works for the masses: simple signup, REST calls, ready-to-go developer toolkits, a liberal usage policy. This is the recipe for shake-n-bake API access.
  • Economic benefits to useful mashups
    The cost/effort of integrating a Web service into an app is minor, and the benefits are recognized.
  • A large number of ad hoc efforts to repurpose web data
    Not only are there a tremendous number of scrapers being built and maintained around the Web, but a new generation of scraping tools has emerged to increase the ease of scraping. No need to be a regular expressions god to get rolling.
  • Best practices in online collaboration
    The open source movement has demonstrated how individuals acting in self-interest can come together to create tangible work of value, without outright ownership. Platforms like SourceForge have been instrumental in laying the infrastructure.

Put it all together and we have a real opportunity on our hands.

So let’s get started, shall we? It’s time to stop building scrapers and start building scrAPIs.

Co.mments does conversation tracking right

Co.mments gets it right
I just discovered Co.mments and am excited to talk about it. The obvious comparison is another web app with a similar name, Cocomment, which allows users to track the conversations around comments they leave on other blogs. It received a lot of attention last month from the blogosphere, thanks to a now-famous wining and dining of A-list bloggers. That app itself is well-done and useful, I grant.

But Co.mments is even more useful, and just as well executed by a lone technologist, Assaf Arkin. Rather than focusing on tracking your own comments, Co.mments is all about tracking any conversation you find interesting. Given how often the comment thread is more interesting than the original post, this has huge benefits.

Assaf’s system takes the comment thread and turns it into its own feed, which you can subscribe to like any blog feed. This unlocks a whole new dimension to blog content, allowing users to filter information at a more granular level. For starters, this suggests that our existing feed readers have outlived their usefulness. Why? Because they assume that all subscriptions are equal, while conversation subscriptions decrease in importance as the conversation dies down. If this catches on, it means a high level of churn through “disposable” subscriptions.
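To illustrate what "turning a comment thread into its own feed" can look like, here is a minimal sketch that builds an RSS 2.0 document from a list of comments. This is not how Co.mments is implemented, which the post does not describe. The `comments_to_rss` function, the comment dictionary keys, and the `#comment-<id>` link format are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET


def comments_to_rss(post_title, post_url, comments):
    """Build an RSS 2.0 feed (as a string) with one item per comment.

    Each comment is a dict with hypothetical keys: "id", "author", "text".
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Comments on: {post_title}"
    ET.SubElement(channel, "link").text = post_url
    ET.SubElement(channel, "description").text = (
        f"Conversation following {post_title}"
    )
    for comment in comments:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"Comment by {comment['author']}"
        # Assumed permalink scheme; real blogs vary.
        ET.SubElement(item, "link").text = f"{post_url}#comment-{comment['id']}"
        ET.SubElement(item, "description").text = comment["text"]
    return ET.tostring(rss, encoding="unicode")
```

A feed reader subscribed to this output would see each new comment as a new item, which is what makes a single conversation trackable like a blog.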

Changing the world one screen scraper at a time

Several of us from Rubyred attended Mashup Camp last week (in fact, we were the official grapefruit sponsor of the event), and we were fortunate to meet some of the real pioneers of mashdom, all showing off what can be done with an API or two, some regular expressions and a bit of vision. People like Paul Rademacher from Housingmaps, Adrian Holovaty of Chicagocrime.org, Taylor McKnight of PodBop, and Bartosz Solowiej/Frank Harris from Traincheck.

The most exciting thing for me about Mashup Camp was seeing clearly the contours of an emergent phenomenon now in its earliest stages. We have APIs for only a minuscule portion of the data providers out there, and this is unlikely to change anytime soon. But we are starting to see a new breed of home-brewed APIs built on top of the screen scrapers we’ve all been writing and maintaining for years: scrapers that pull crime stats from police blotters, address data from Craigslist apartment listings, MP3s from web sites.

It’s one thing to build a scraper for your own app; it’s another to provide open access to it, enabling other developers to call its methods as if it were an open API from the data provider itself. For instance, Ontok is providing just such an interface for grabbing data from Wikipedia.

I proposed at the conference that we call this new incarnation “scrAPIs”, not realizing that the term had been coined back in 2002 by Paul Bausch. Paul was seeing the need for this back before Amazon opened up its API, and is one of the Godfathers of the modern mashup.

The concept of the scrAPI is potentially huge. Rather than waiting years for data providers to build their own APIs, we can build them today by leveraging and sharing the work we’ve already done on scrapers. Given the intellectual property issues, there are some tricks to doing this on the right side of the law. Consolidating scrAPI efforts as collaborative projects has huge benefits (it’s open source!), but how can we structure this particular kind of effort? I have some ideas about how to get things rolling, and I know others do as well. For starters, I’ll be posting a piece every day this week, each exploring an aspect of the scrAPI.

Coming up next: What is a scrAPI, and why do we need them?

Rubyred at Mashup Camp
Rubyred brings the grapefruit

Measuremap scores!

Back in December I was asked by Liz Gannes of Red Herring who I thought would be the next Web 2.0 company to get bought by a portal. Without much deliberation I replied: “Measure Map. It’s just the right size, right timing and right space for a tidy acquisition.” It also had all the right buzz, cultivated through the careful seduction of A-list bloggers like Michael Arrington. Liz ran the quote in her article (“Hungry Hungry Yahoo”), as a counterpoint to the consensus that Technorati would be the first in line for a buyout.

Sure enough, after quiet speculation by gossip-mongers like Valleywag, Jeff Veen today appeared on the official Google blog announcing that Measure Map is now one of the family. In the background, Technorati suffered a noticeable brain drain with the loss of Niall Kennedy, Jason DeFillippo and Derek Powazek. Talk about turnabout!

But it does make sense for Google, if for no other reason than they need Jeff and his leadership. I have no doubt that he’ll raise the stature of user-centered design in that engineering-centric organization. Go Jeff, Go!

Logo Wars Pt. 2

Courtesy of Will Glass, here’s the next logo war, pitting Rubyred’s beloved half grapefruit against, well, a whole planet.

Logo wars

Chris Messina, Flock’s chief designer, recently unveiled his new company t-shirt with a special “Flockstar” emblem. It’s based on the favorites button on the Flock browser. Looks nice. But it sure does remind me of Rubyred’s logo:

The face-off

What do you think: friends or foes?

It’s funny because it’s true.



Morning

Originally uploaded by Neven Mrgan.

In defense of irrational exuberance