Data journalism: A simple tip to get started

As a young reporter, I went to a National Institute for Computer-Assisted Reporting conference in 1998. I’ll admit that I found the pivot tables in Excel of 1998 a bit daunting. A lot has changed since then, and when I do data journalism training now, one of the things that I stress is how the tools have gotten so much easier, especially in the last few years. Even pivot tables, which can be hard to wrap your head around, are really simple in Google spreadsheets and the current versions of Excel. With Google Drive (Docs), almost any journalist can be trained to produce simple graphs and charts in a day.

Aron Pilhofer, the Interactive News editor at the New York Times, has put it this way:

I teach and have taught for years basic computer-assisted reporting and I do it in this one-day class. Nobody believes me, but it’s totally true: In one day – ONE DAY – we can teach you the skills that if mastered would allow you to do 80 percent of all the computer-assisted reporting that has ever been done. This is importing a spreadsheet, doing some basic math, knowing what a sum is, what a mode, a median, what an average is. I mean, being able to take a dataset, to do some basic count. I mean, this is not rocket science, for the most part.

I did data journalism back in the 1990s when it was called computer-assisted reporting (CAR) in the US, and it was only when Simon Rogers launched The Guardian data blog after one of our internal hack days that I got a chance to return to it. Thanks, Simon.

And I was reminded of how easy it is to get your start again today with an interview by WAN-IFRA with Steve Doig, truly one of the CAR pioneers. What does he use for data journalism?

His toolbox has five items: a web browser; the ability to access public records; Excel; in rare cases, a heavier programme such as Microsoft Access to bring different tables together; and a geo mapping tool.

Those are tools that most journalists use every day, apart from Access, and it really is that easy for you to dip your toe into data journalism. To be honest, I haven’t used Microsoft Access in years, and you can do most things just by using Excel, Google spreadsheets or Google’s Fusion Tables.

As a matter of fact, here’s one tip to get you going. If you want to find the sum, average or count of a column of numbers in Google spreadsheets, simply highlight the column of numbers and look in the lower right-hand corner. You’ll see a small box. Click on it, and you’ll see a summary of the sum, the biggest number (max), the smallest number (min), the average and the number of numbers (count numbers) or of words (count). It’s that easy. You can start right now.

It’s just a simple little feature, but over time, it can be a huge time saver. Of course, sometimes this quick little summary won’t be the end of the process but just the beginning. However, it’s the first step in interrogating data. Sometimes it will give you the answer you’re looking for, and other times, it will uncover a key question for your story, and that’s when data journalism really gets exciting.
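
If you would rather script that first look than click for it, the same quick summary can be sketched in a few lines of Python. This is only an illustration, not how Google computes it: the `summarise` function and the sample column are made up for the example, and I am assuming that "count" means non-empty cells while "count numbers" means numeric cells only.

```python
def summarise(column):
    """Quick summary of one spreadsheet column (hypothetical helper)."""
    # Cells holding real numbers; bool is excluded because Python treats it as an int.
    numbers = [c for c in column
               if isinstance(c, (int, float)) and not isinstance(c, bool)]
    # Assumption: "count" covers every non-empty cell, words included.
    non_empty = [c for c in column if c is not None and c != ""]
    if not numbers:
        return {"count": len(non_empty), "count_numbers": 0}
    return {
        "sum": sum(numbers),
        "average": sum(numbers) / len(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "count_numbers": len(numbers),
        "count": len(non_empty),
    }


# Made-up column: four numbers, one blank cell and one text cell.
sample_column = [120, 85, "", 240, "n/a", 55]
summary = summarise(sample_column)
for label, value in summary.items():
    print(f"{label}: {value}")
```

As with the spreadsheet box, the point is a fast first look: if the max or the count looks odd, that is usually the question worth chasing.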

Hacks-Hackers London: Coder-journalists or hybrid teams?

Finally, after months of being busy and missing Hacks/Hackers, I was thrilled to make it to last Wednesday’s instalment, which focused on Big Data in Financial Journalism. Congratulations to Jo Geary of The Guardian for organising another great event and Marianne Bouchart of Bloomberg for being such a great host. 

Emily Cadman, the head of interactive at the Financial Times, had a great presentation, along with her colleague and my friend, Martin Stabe. Emily also had one of the best provocations of the evening. She challenged the idea that journalists should become coders. Instead of journalists learning how to code, she suggested that news organisations should build hybrid teams of crack coders and journalists and editors who can work with and speak to coders. That being said, she said that if you do find a coder who values journalism and can think editorially, then do everything you can to hold onto them. For organisations the size of the Financial Times or even for small and medium-size papers part of a larger group, I couldn’t agree more.

It reminds me of my early days in digital journalism, back in the mid-90s. I was working at a regional news website on special projects. I spent about an hour editing an image. One of my graphic design colleagues said that, while she appreciated my initiative, what took me an hour would have taken her minutes. I have still picked up a range of skills, but I have tried to focus on things where I can really add value and not on areas where a specialist like a designer or a developer has spent as much time building their expertise as I have in journalism.

Cadman said that it took time and effort, and I’m sure a fair bit of astute application of political capital, to build her team. These types of hybrid teams don’t get created overnight. I am not familiar with the history of the team at the FT, but I know that Aron Pilhofer at the New York Times has spent years building up his team and figuring out its best composition and organisational positioning.

Data and visual journalism on a shoestring budget

The FT, the New York Times and the BBC have all developed hybrid teams like this, and I’m sure that for a lot of smaller news organisations having the resources for such a team seems simply unattainable, especially for regional publishers in the UK or metro publishers in the US reeling under economic pressures. However, I would say two things. First, there is a lot that can be done at a group level, creating projects that can easily be replicated across markets and use local data. Good designers can create projects that can easily reflect the style of individual local sites.

There is another way to develop great interactive data projects, and that is to rely on the myriad web services that exist. At journalism.co.uk’s last news:rewired conference, Paul Rowland, deputy head of online content at Media Wales, had a great presentation on how Wales Online does data-driven visual journalism while facing the same challenges that almost every regional publisher in the UK faces. He outlined the challenges as:

• limited resources.
• a lack of cash.
• no dedicated developers.
• a hefty newspaper legacy.

He gave a rundown of his favourite services that should be in every digital editor’s toolkit no matter how small your organisation. As I always say when I work with news organisations and MDLF’s clients, interactive journalism is a lot like the iPhone. If there is a story-telling technique that you want to try, there’s a web app for that. 

I am more technical than most journalists but I’ve never learned how to code. Instead, I’ve always referred to myself as a “cut-and-paste” coder. I have always tried to keep on top of the kind of services that Rowland highlighted, and cutting-and-pasting an embed code from a third-party service is something that almost anyone who has embedded a YouTube video can do. 

Last Wednesday was really inspiring, and I think that Cadman showed that we’re not just breaking new ground in terms of using data in journalism, but we’re finally starting to get a handle on the best ways to organise the new newsroom, one that doesn’t expect everyone to be a jack-of-all-trades but recognises the role of specialisation and of editors who have the digital and traditional experience to work with these kinds of digital teams.

Digital Directions 11: Josh Hatch of Sunlight Foundation

Josh Hatch, until recently an interactive director at USAToday.com and now with the Sunlight Foundation, talked about how the organisation loves data. The transparency organisation uses data to show context and relationships. He highlighted how much money Google gave to candidates. Sunlight’s Influence Explorer showed that contributions from the company’s employees, their family members, and its political action committee went overwhelmingly to Barack Obama.

The Influence Explorer is also part of another tool that Sunlight has, Poligraft. It is an astoundingly interesting tool in terms of surfacing information about political contributions in the US. You can enter the URL of any political story or just the text, and Poligraft will analyse the story and show you the donors for every member of Congress mentioned in the story. It will highlight details about the donors, donations from organisations and US government agencies. It’s an amazingly powerful application, and I think that it points the way to easily adding context to stories. It does rely on the gigabytes of data that the US government publishes, but it’s a great argument for government data publishing and a demonstration of how to use that data. Poligraft is powerful and it scales well.

Josh showed a huge range of data visualisations, and he’ll post the presentation online. I’ll link to it once he’s done.

Opportunities from the data deluge

There are huge opportunities for journalism and data. However, taking advantage of these opportunities will require not only a major rethinking of the editorial and commercial strategies that underpin current journalism organisations but also a major retooling. Apart from a few business news organisations such as Dow Jones, The Economist and Thomson-Reuters, there really aren’t that many general interest news organisations that have this competency. Most smaller organisations won’t be able to afford it on an individual level, but it leaves room for a number of companies to provide services for this space.

Neil Perkin outlines the challenge and the opportunity in a wonderful column that he’s cross-posted from Marketing Week. (Tip of the blogging hat to Adam Tinworth, who flagged this up on Twitter and on his blog.) In our advanced information economies, we’re generating exabytes of data. While we’re just getting used to terabyte disk drives, this is an exabyte:

1 EB = 1,000,000,000,000,000,000 B = 10^18 bytes = 1 billion gigabytes = 1 million terabytes
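
To check that line for yourself, here is a small Python sketch of the arithmetic. It assumes the decimal (SI) prefixes used above, where a gigabyte is 10^9 bytes and a terabyte is 10^12 bytes; the constant names are my own.

```python
# Decimal (SI) prefixes, as in the line above; the names are illustrative.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_TERABYTE = 10**12
BYTES_PER_EXABYTE = 10**18

# Integer division is exact here because each unit is a power of ten.
gigabytes_per_exabyte = BYTES_PER_EXABYTE // BYTES_PER_GIGABYTE
terabytes_per_exabyte = BYTES_PER_EXABYTE // BYTES_PER_TERABYTE

print(f"1 EB = {BYTES_PER_EXABYTE:,} bytes")
print(f"1 EB = {gigabytes_per_exabyte:,} gigabytes")
print(f"1 EB = {terabytes_per_exabyte:,} terabytes")
```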

To put this in perspective, I’ll use an oft-quoted practical example from Caltech researcher Roy Williams. All the words ever spoken by human beings could be stored in about 5 exabytes. Neil quotes Google CEO Eric Schmidt to show the challenge (and opportunity) that the data deluge is creating:

Between the dawn of civilisation and 2003, five exabytes of information were created. In the last two days, five exabytes of information have been created, and that rate is accelerating.

All the words spoken since the dawn of language in 5 exabytes or the amount of information created in the last two days helps illustrate the acceleration of information creation. Those mind-melting numbers wash over most people, especially in our arithmophobic societies. However, there is a huge opportunity here, which Neil states as this:

The upside of the data explosion is that the more of it there is, the better digital based services can get at delivering personal value.

And journalists can and definitely should play a role in helping make sense of this. However, we’re going to have to overcome not only the tyranny of chronology but also the tyranny of narrative, especially narratives that privilege anecdote over data. Too often, to sell stories, we focus on outliers because they shock, not because outliers are in any way representative of reality.

From a process point of view, journalists are going to need to start getting smarter about data. I think data crunching services will be one way that journalism organisations can subsidise the public service mission that they fulfil, but as I have said, it’s a capacity that will need to be built up.

Helping journalists ‘scale up what they do’

It’s not just raw data-crunching that needs to improve, but we’re starting to see a lot of early semantic tools that will help more traditional narrative-driven journalists do their jobs. In talking about how he wanted to help journalists at AOL overcome their technophobia, CEO Tim Armstrong talked about why these tools were necessary. Journalists have not been included in corporate technology upgrades (and often not included in creation of tools for their work). Armstrong said at a conference in June:

Journalists I met were often the only people in the room who never had access to a lot of info, except what they already knew.

It’s not technology for technology’s sake but tools to open up more information and help them make sense of it. Other industries have often implemented data tools to help them do their jobs, but it’s rare in journalism (outside of computer-assisted reporting or database journalism circles). Armstrong said:

You can pretty much go to any professional industry, and there’s some piece of data system that helps people scale what they do.

Journalists are being asked to do more with less as cuts go deep in newsrooms, and we’re going to have to work smarter because I know that there are some journalists now working to the breaking point.

There have been times in the last few years when I was testing the limits of my endurance. Last summer, filling in behind my colleague Jemima Kiss, I was working from 7 am until 11 pm five days a week and then usually five or six hours on the weekends. I could do it for a while because it was a limited 10-week assignment. Even for 10 weeks, it was limiting the amount of time I had with my wife and was negatively affecting my health.

I’m doing a lot of thinking about services that can help journalists deal with masses of information and also help audiences more easily put stories into context. We’re going to need new tools and techniques for this new period in the age of information. The opportunities are there. Linked data and tools to analyse, sort and contextualise will lead to a new revolution in news and information services. Several companies are already in this space, but we’re just at the beginning of this revolution. We live in exciting times.

The value of data for readers and the newsroom

When I was at the BBC, a very smart producer, Gill Parker, approached me about pulling together a massive amount of data and information she was collecting with Frank Gardner while trying to unravel the events that led to the 11 September 2001 attacks in the US. Not only had Gill worked on the BBC’s flagship current affairs programme Newsnight and on ABC’s Nightline in the US, she also had worked in the technology industry. They were interviewing law enforcement and security sources all around the world and collecting masses of information, all of which they had in Microsoft Word files. She knew that they needed something else to help them connect the dots, and speaking with me in Washington where I was working as BBCNews.com’s Washington correspondent at the time, she asked if I could help her get some database help.

I thought it was a great idea. My view was that by helping her organise all of the information that they were collecting, the News website could use the resulting database to develop info-graphics and other interactives that would help our audience better understand the complex story. We could help show relationships between all of the main actors in al Qaeda as well as walk people through an interactive timeline of events. I had a vision of displaying the information on a globe. People could move through time and see various events with key actors in the story. This was a bit beyond the technology of the time. Google Earth was still a few years away, and it would have required significant development for some of the visualisations. However, on a story like this, I thought we could justify the effort, and frankly, we didn’t need to go that far. Bottom line: Organising the data would have huge benefits for BBC journalists and also for our audiences.

Unfortunately, it was the beginning of several years of cuts at the BBC, and the News website was coming under pressure. It was beyond the scope of what I had time to do or could do in my position, and we didn’t have database developers at the website who could be spared, I was told.

A few years later, as Google Earth developed, Declan Butler at Nature used data on the global spread of the H5N1 virus to achieve something like the vision I had in terms of showing events over time and distance.

It is great to see my friend and former Guardian colleague Simon Rogers move forward with this thinking of data as a resource both internally to help journalists and also externally to help explain a complex story in his work on the Wikileaks War Logs story. Simon wrote about it on the Guardian Datablog:

we needed to make the data easier to use for our team of investigative reporters: David Leigh, Nick Davies, Declan Walsh, Simon Tisdall, Richard Norton-Taylor. We also wanted to make it simpler to access key information for you, out there in the real world – as clear and open as we could make it.

As the digital research editor at The Guardian, I saw data as key to many of my ideas (before I left this March to pursue my own projects). I even thought that data could become a source of revenue for The Guardian. Data and analysis are something that people are willing to pay for. Ben Ayers, the head of social media and community at ITV.com (speaking for himself, not ITV), said to me on Twitter:

Brilliant. I’d pay for that stuff. Surely the kind of value that could be, er, charged for. Just sayin’ … just an example of where, if people expect great interpretation of data as part of the package, the Guardian could charge subs

As I replied to Ben, I wouldn’t advocate charging for data for the War Logs, but I would suggest charging for data about media, business and sports. That could become an important source of income to help subsidise the cost of investigations like the War Logs. Data wrangling can be time intensive, as I know from my experience in developing the media job cuts series that I wrote at the end of 2009 for The Guardian. However, the data can be a great resource for journalists writing stories as well as developing interactive graphics like the media job cuts map or the IED attack map for the War Logs story. Data drives traffic, as the Texas Tribune in the US has found, and I believe that certain datasets could be developed into new commercial products for news organisations.