Posted on October 23, 2008
Why you should learn to love data
Learn a Web framework, Matt Waite urged my journalism students on Tuesday. He recommended either Django or Ruby on Rails, and he characterized arguments over which one is better as “nerd Pepsi vs. Coke.”
Matt is the concept and programming brains behind both Neighborhood Watch (“porn” for middle-class homeowners in a couple of Florida counties) and Politifact (“porn” for all U.S. politics junkies, everywhere). He likes to make the point that programming was never in his background or training when he was a college student: “I majored in journalism and minored in journalism and journalism, that is, sociology and English,” he said. (I love that line.)
Cutting his teeth as a cops reporter in Little Rock, Arkansas, Matt was like a lot of folks who choose journalism — he didn’t have a big career plan; he just followed a path from one reporting job to a better one. He moved into computer-assisted reporting as part of doing investigative stories. And then, somewhere along the way, he realized that a lot of what journalists do is repetitive and could very well be automated.
When he explains structured data, he sounds exactly like Adrian Holovaty — but that’s no surprise, because Adrian has inspired a lot of people. This is Matt’s version, paraphrased:
Every murder, for example, has a day, a time, a location. Every murder has a victim, and (we hope) a suspect. Both the victim and the suspect have age, race, and gender. Every murder also has a weapon, a motive, and a relationship.
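To make Matt's description concrete, here is a minimal sketch of what that structure might look like as a Python record. The class names, field names, and sample values are all my own invention for illustration; nothing here comes from Matt's actual code or from any real case.

```python
import datetime as dt
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Person:
    # The victim and the suspect share the same three attributes.
    age: Optional[int] = None
    race: Optional[str] = None
    gender: Optional[str] = None


@dataclass
class Homicide:
    # Hypothetical field names, one per fact listed in the paraphrase above.
    day: dt.date
    time_of_day: dt.time
    location: str
    victim: Person
    suspect: Optional[Person] = None   # "(we hope)": there may be no suspect yet
    weapon: Optional[str] = None
    motive: Optional[str] = None
    relationship: Optional[str] = None  # between victim and suspect


# Placeholder values only; this is not a real case.
example = Homicide(
    day=dt.date(2008, 1, 1),
    time_of_day=dt.time(23, 30),
    location="100 block of Example St.",
    victim=Person(age=30, race="unknown", gender="male"),
    weapon="handgun",
)

# Because every record has the same shape, it converts cleanly to a plain
# dictionary, ready for a database row or a template.
row = asdict(example)
```

Once every record has the same fields, a brief no longer has to be typed from scratch; it can be filtered, counted, mapped, or poured into a template.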
I used to copy-edit on the Metro desk at The Washington Post, during the years when Washington was known as the murder capital of the United States. Every night, one of us copy editors was assigned the briefs, and the briefs were mostly a litany of who had been shot, whether they had died, and, usually, the fact that no arrests had been made. For the copy editors (and I know this sounds heartless), working on the briefs meant you had to come up with a lot of different ways to write shot, died, and killed for the headlines. For the reporter, it meant you had to write almost the same story again and again, with only the street address, day, and time being different. I don’t mean to belittle these deaths; it was a terrible, terrible time. Each of these so-called stories was about 150 words long, and they weren’t doing anything other than relaying the barest facts.
The point about structured data is, when all you’re doing is pressing the facts into a pre-set mold, it’s not really “writing.” It’s not something you need to outsource to India, either.
So you look at a “story” like Prices down, sales up in Gulf Harbors, at the Neighborhood Watch site. (That’s the headline today, anyway — six months from now, it may be different.) The story below the headline is actually this:
The median sale price of a single family house in [place] was [up/down] by [amount] or [number] percent from the [first/second] half [year] to the [first/second] half [year], according to a Tampabay.com analysis of sales records. The median sale value in [first/second] half [year] was [amount], compared with [amount] in [first/second] half [year]. Sales were up by [number] sales or [number] percent from the [first/second] half [year] to the [first/second] half [year]. In the [first/second] half [year], [number] houses sold, compared with [number] in [first/second] half [year].
Yeah, totally bland — but after the database fills in the numbers and dates automatically, you have details about every single neighborhood in two highly populated counties. You have hundreds of stories, and no one had to write a single one of them! Even the headline is written by code.
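For the curious, here is a rough sketch of how a fill-in-the-blanks story and its headline could be generated. It uses plain Python rather than Django or Rails, and it is not the code behind Neighborhood Watch: the `HalfYear` class, the function names, and the numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class HalfYear:
    half: str          # "first" or "second"
    year: int
    median_price: int  # dollars
    sales: int         # number of houses sold


def direction(change):
    """Return the word that fills an [up/down] slot."""
    return "up" if change >= 0 else "down"


def percent(change, base):
    """Size of the change as a percentage of the earlier value (base must not be 0)."""
    return round(abs(change) / base * 100, 1)


def headline(place, old, new):
    prices = direction(new.median_price - old.median_price)
    sales = direction(new.sales - old.sales)
    return f"Prices {prices}, sales {sales} in {place}"


def story(place, old, new):
    price_change = new.median_price - old.median_price
    sales_change = new.sales - old.sales
    return (
        f"The median sale price of a single family house in {place} was "
        f"{direction(price_change)} by ${abs(price_change):,} or "
        f"{percent(price_change, old.median_price)} percent from the "
        f"{old.half} half {old.year} to the {new.half} half {new.year}, "
        f"according to a Tampabay.com analysis of sales records. "
        f"The median sale value in {new.half} half {new.year} was "
        f"${new.median_price:,}, compared with ${old.median_price:,} in "
        f"{old.half} half {old.year}. "
        f"Sales were {direction(sales_change)} by {abs(sales_change)} sales or "
        f"{percent(sales_change, old.sales)} percent from the "
        f"{old.half} half {old.year} to the {new.half} half {new.year}. "
        f"In the {new.half} half {new.year}, {new.sales} houses sold, "
        f"compared with {old.sales} in {old.half} half {old.year}."
    )


# Invented numbers, for illustration only.
earlier = HalfYear("second", 2007, 200000, 40)
later = HalfYear("first", 2008, 180000, 50)

print(headline("Gulf Harbors", earlier, later))
print(story("Gulf Harbors", earlier, later))
```

Run the same two functions in a loop over every neighborhood in the database and you get the hundreds of stories. A Web framework adds the parts this sketch leaves out: the database layer, the URLs, and the page templates.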
The point of all this is NOT to get out of doing work. The point is to provide information that the audience is really happy to get in a frequently updated fashion — and to do so in a way that is easy for people to use and pleasant enough for them to consume (like this). But having real humans sit down and write this is not necessary — and so, the Web framework.
Matt knew what to do for Neighborhood Watch because he had previously done this manually — had actually written every story. These had run in a print issue of the St. Petersburg (Fla.) Times, and the issue sold out. The readership loved it. When his editors asked Matt to do the same thing again, a couple years later, he asked if he could create a data-driven version instead. It was either that, or shoot himself — there was no way he was ever going to write all those boring little “stories” again!
Clarification (from Matt, via e-mail @ 11:43 a.m.): “In 2004, I shared bylines on many of the stories that ran — more than 30 in two days — but I didn’t have to write all of them (thank God). Reporters from around the company did them with me — I provided data analysis, tips, direction, custom maps, etc. I just wrote the 1A mainbar after doing all the data analysis and helping with graphics and the Web presentation.”
Another point that Matt likes to drive home is that by building this solution in-house (as he did), the news organization spends money once (paying Matt and one graphic designer while they create and test Neighborhood Watch), and everything afterward is pure profit.
In contrast, news organizations that instead choose to use a vendor such as Caspio to host their “data ghetto” will have to pay fees monthly, forever, to keep those data online. Stop paying the vendor, and the whole thing is erased, gone. Moreover, you will pay more if your number of records grows larger. It’s kind of a short-sighted solution: Because you didn’t hire a guy like Matt, you will be paying over and over and over again for the same thing — while Matt will be building new things for his news organization.