Why you should learn to love data

Learn a Web framework, Matt Waite urged my journalism students on Tuesday. He recommended either Django or Ruby on Rails, and he characterized arguments over which one is better as “nerd Pepsi vs. Coke.”

Matt is the concept and programming brains behind both Neighborhood Watch ( “porn” for middle-class homeowners in a couple of Florida counties) and Politifact ( “porn” for all U.S. politics junkies, everywhere). He likes to make the point that programming was never part of his background or training when he was a college student — “I majored in journalism and minored in journalism and journalism — that is, sociology and English,” he said. (I love that line.)

Matt recommends this site as a good place for you to start learning. It offers tutorials on Excel, on mapping, on Web applications, etc.

Cutting his teeth as a cops reporter in Little Rock, Arkansas, Matt was like a lot of folks who choose journalism — he didn’t have a big career plan; he just followed a path from one reporting job to a better one. He moved into computer-assisted reporting as part of doing investigative stories. And then, somewhere along the way, he realized that a lot of what journalists do is repetitive and could very well be automated.

When he explains structured data, he sounds exactly like Adrian Holovaty — but that’s no surprise, because Adrian has inspired a lot of people. This is Matt’s version, paraphrased:

Every murder, for example, has a day, a time, a location. Every murder has a victim, and (we hope) a suspect. Both the victim and the suspect have age, race, and gender. Every murder also has a weapon, a motive, and a relationship.
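Matt’s example translates directly into a structured record. Here is a minimal sketch in Python (the field names are my own, not Matt’s actual schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Person:
    """A victim or suspect: the attributes every murder record shares."""
    age: int
    race: str
    gender: str

@dataclass
class Murder:
    """One murder as a structured record rather than a prose story."""
    occurred_at: datetime      # the day and the time
    location: str
    victim: Person
    suspect: Optional[Person]  # we hope there is one
    weapon: str
    motive: str
    relationship: str          # between victim and suspect, if known
```

Once every record shares these fields, the same query or template works for all of them, which is exactly what makes the kind of automation Matt describes possible.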

I used to copy-edit on the Metro desk at The Washington Post, during the years when Washington was known as the murder capital of the United States. Every night, one of us copy editors was assigned the briefs, and the briefs were mostly a litany of who had been shot, whether they had died, and the fact that, usually, no arrests had been made. For the copy editors (and I know this sounds heartless), working on the briefs meant you had to come up with a lot of different ways to write shot, died, and killed — for the headlines. For the reporter, it meant you had to write almost the same story again and again, with only the street address, day, and time being different. I don’t mean to belittle these deaths — it was a terrible, terrible time. Each of these so-called stories was about 150 words long — they weren’t doing anything other than relaying the barest facts.

The point about structured data is, when all you’re doing is pressing the facts into a pre-set mold, it’s not really “writing.” It’s not something you need to outsource to India, either.

So you look at a “story” like Prices down, sales up in Gulf Harbors, at the Neighborhood Watch site. (That’s the headline today, anyway — six months from now, it may be different.) The story below the headline is actually this:

The median sale price of a single family house in [place] was [up/down] by [amount] or [number] percent from the [first/second] half [year] to the [first/second] half [year], according to a Tampabay.com analysis of sales records. The median sale value in [first/second] half [year] was [amount], compared with [amount] in [first/second] half [year]. Sales were up by [number] sales or [number] percent from the [first/second] half [year] to the [first/second] half [year]. In the [first/second] half [year], [number] houses sold, compared with [number] in [first/second] half [year].

Yeah, totally bland — but after the database fills in the numbers and dates automatically, you have details about every single neighborhood in two highly populated counties. You have hundreds of stories, and no one had to write a single one of them! Even the headline is written by code.
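Under the hood, this is just string templating over database rows. A minimal sketch in Python (the field names and sample figures are invented for illustration, not Tampabay.com’s actual code):

```python
TEMPLATE = (
    "The median sale price of a single family house in {place} was {direction} "
    "by ${change:,} or {pct:.1f} percent from the {old_period} to the "
    "{new_period}, according to an analysis of sales records."
)

def write_story(row):
    """Render one neighborhood 'story' from a database row (here, a dict)."""
    change = row["new_median"] - row["old_median"]
    return TEMPLATE.format(
        place=row["place"],
        direction="up" if change >= 0 else "down",
        change=abs(change),
        pct=abs(change) / row["old_median"] * 100,
        old_period=row["old_period"],
        new_period=row["new_period"],
    )

# One row in, one "story" out; loop over every neighborhood for hundreds more.
row = {
    "place": "Gulf Harbors",
    "old_median": 200_000,
    "new_median": 185_000,
    "old_period": "first half of 2007",
    "new_period": "first half of 2008",
}
print(write_story(row))
```

Wrap that loop in a Web framework’s view layer and you get hundreds of auto-written stories, one per neighborhood, every time the sales data updates.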

The point of all this is NOT to get out of doing work. The point is to provide information that the audience is really happy to get in a frequently updated fashion — and to do so in a way that is easy for people to use and pleasant enough for them to consume (like this). But having real humans sit down and write this is not necessary — and so, the Web framework.

Matt knew what to do for Neighborhood Watch because he had previously done this manually — had actually written every story. These had run in a print issue of the St. Petersburg (Fla.) Times, and the issue sold out. The readership loved it. When his editors asked Matt to do the same thing again, a couple years later, he asked if he could create a data-driven version instead. It was either that, or shoot himself — there was no way he was ever going to write all those boring little “stories” again!

Clarification (from Matt, via e-mail @ 11:43 a.m.): “In 2004, I shared bylines on many of the stories that ran — more than 30 in two days — but I didn’t have to write all of them (thank God). Reporters from around the company did them with me — I provided data analysis, tips, direction, custom maps, etc. I just wrote the 1A mainbar after doing all the data analysis and helping with graphics and the Web presentation.”

Another point that Matt likes to drive home is that by building this solution in-house (as he did), the news organization spends money once (paying Matt and one graphic designer while they create and test Neighborhood Watch), and everything afterward is pure profit.

In contrast, news organizations that instead choose to use a vendor such as Caspio to host their “data ghetto” will have to pay fees monthly, forever, to keep those data online. Stop paying the vendor, and the whole thing is erased, gone. Moreover, you will pay more if your number of records grows larger. It’s kind of a short-sighted solution: Because you didn’t hire a guy like Matt, you will be paying over and over and over again for the same thing — while Matt will be building new things for his news organization.

13 Comments on “Why you should learn to love data”

  1. Pingback: links for 2008-10-23 – Innovation in College Media

  2. Wow, “Pepsi,” “Coke,” “Rails,” “Django” and “Porn” twice, I want to see your search engine traffic stats after this one. 😉

  3. Pingback: Teaching Online Journalism » When to choose data over stories

  4. “Every murder, for example, has a day, a time, a location. Every murder has a victim, and (we hope) a suspect. Both the victim and the suspect have age, race, and gender. Every murder also has a weapon, a motive, and a relationship.”

    Every murder also has a story, about a life ruined (two, really, if you count the killer), about a puzzle to be solved by the police, about a neighborhood, about a city.
    Maybe if we told those instead of covering crime through a computer-generated litany of the day’s shootings (or a string of cop briefs), we’d have more readers.
    Or maybe we do both, no harm in that. But let’s not forget the real stories in all the gee-whiz data applications.

  5. Who is forgetting the “real stories” as opposed to what, fake ones? Who said there’s only one way to handle information? Who said you can’t take the same basic information you’d gather for a “real” story and use it for an app that gives a reader a lot more insight into violence in your community in addition to your so-called real story? Do you see your false choices here? And does your newsroom have all the people it needs to write every one of these “real stories”?

  6. Good posting, Mindy. The question this leaves hanging is WHY use a framework rather than a CMS for data-driven stories, or for various types of evergreen content built on top of databases? (I know the answer, but would love to see the framework gospel spelled out for newsrooms that read your site.) Maybe in a future post?

  7. Also, I think the Django/Rails = Pepsi/Coke analogy is limiting – makes it sound like the choice is one of simple preference. But there are important differences between them that should be weighed carefully when choosing.

    Nutshell: Rails has a larger installed base, and more developers. Django is easier to work with and faster to develop in. Most importantly, Django automatically creates a back-end for data entry as soon as the data model has been defined. That’s the key to the Django tagline “The web application framework for journalists with deadlines.”

  8. Scot: I’ve learned some Django and use Rails every day, and it absolutely is a matter of preference. And I’m not sure where you’re getting the idea that one is “faster to develop in” than the other, but that’s just nonsense.

    They both have their strengths and weaknesses, their quirks and qualities. Generally, if you like Python, you’ll love Django. The language and the framework share many of the same traits. If you like Ruby, you’ll like Rails. Same deal there.


  9. Aron – The main quantitative difference is that Rails doesn’t have anything like Django’s auto-generated, production-ready admin interface. Define your data model and journalists can begin entering content immediately, while the dev continues work on the rest of the site. That’s a *huge* time saver, and the main draw over Rails.

    The site development time advantages of Django over Rails have been tested and documented several times – here’s one example report:


    Tough stuff to quantify, but I have yet to see a comparison showing faster dev times for Rails over Django!

  10. Pingback: Teaching Online Journalism » Data visualization resources

  11. Pingback: Teaching Online Journalism » MVPs for October 2008
