Checklist to bulletproof your data work
Updated on 6/23: I’ve included, marked by an asterisk, several fantastic checklist suggestions offered by David Donald at the IRE ’12 conference in Boston.
Much like surgeons, journalists who work with data can always use help ensuring they’re getting things right. I compiled this checklist of questions I ask myself whenever I work with a data set. It’s evolved into a tool that The Star-Ledger data team uses on every story we write.
- Where did the data come from?
- Who created it? Is this the best source for this data? What was the methodology behind its collection?
- What documentation came with it?
- Did you read the documentation?
- Is there a record layout or data dictionary?
- Did you save an original copy of the database in case you need to retrace your steps?
- Are your field headings accurate? Do you have the columns labeled correctly?
- Are you looking at the correct tab? Is there additional data in other tabs in the worksheet?
- How many records are in your table or database? How many should be there? Are there any missing? Too many? Any chance you maxed out the number of rows or fields the software can handle?
- Are there duplicates in the records? *
- Examine each column of data one at a time. Is it formatted in the appropriate manner? Are there any gaps in the column? Are any data missing?
- Have names and entities been cleaned and standardized? *
- Check the aggregation. How was it done? *
- When joining tables, are you certain the join worked? How many records are there now? Too many? Not enough?
- When pasting tables with common columns side-by-side in Excel, do they line up? Any chance two records might be switched?
- What calculation are you trying to do? Are you using the appropriate formula or function for this task?
- Do your cell/column references in functions point to the right places? Are any columns transposed? Can you trace the formula’s path? Have you pasted the correct formula all the way down the column to the bottom of the table?
- When sorting, did you sort the entire table together and not omit any columns?
- When pasting, should you use Paste Special > Values to strip out any pesky underlying formulas that may become misdirected?
- When moving columns, were there any $ anchors that you pasted that now point to the wrong column or row of data and throw off your calculations?
- Does your calculation make sense?
- Are you looking at outliers or values in the middle of the pack? *
- Is there an expert, either an agency data guru or an academic, who has vetted your process or calculations?
- Have you verified your records against the original records or a sample thereof? *
- Have you spot-checked individual records to ensure the numbers match up? Did you ask a colleague to assist by reading off numbers to be CQ’d?
- Have you asked an uninvolved colleague to look at your process, review the numbers and poke holes in the methodology?
- Can you reproduce your calculations from the beginning or explain step-by-step how you arrived at your conclusion?
- What numbers are you using in the story? Have you circled them and reviewed them in the database? Is the description of the numbers accurate? Does it omit any pertinent details that might mislead an unwary reader?
- Are the numbers here based on counting, estimates, projections or guesses? *
- Are there any graphics with the story? Have you checked them? Can you verify every number? Are the introduction and column heads accurate? Are there numbers missing from the graphic that are important to the story?
- Is there a methodology/nerd box to run with the story? *
- How did you get here, and does it make sense?
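A few of the questions above (record counts, duplicates, gaps in a column, whether a join worked) can be scripted rather than eyeballed. What follows is only a minimal sketch in Python, assuming the data is available as CSV text. The function names and the sample salary and department records are hypothetical, made up for illustration, and not part of any tool our team uses.

```python
import csv
import io
from collections import Counter


def load_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def find_duplicates(rows, key_fields):
    """Return key values that appear in more than one record."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def count_missing(rows):
    """Count blank or whitespace-only cells in each column."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if field is None:
                continue  # extra cells beyond the header row
            if value is None or not value.strip():
                missing[field] += 1
    return dict(missing)


def check_join(left, right, key):
    """List keys found on only one side, before trusting a join's row count."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return {
        "only_in_left": sorted(left_keys - right_keys),
        "only_in_right": sorted(right_keys - left_keys),
    }


# Hypothetical sample data, invented for illustration only.
SALARIES = """id,name,salary
1,Ann Lee,52000
2,Bob Ruiz,
3,Cy Park,61000
3,Cy Park,61000
"""
DEPARTMENTS = """id,department
1,Parks
2,Police
4,Fire
"""

salaries = load_rows(SALARIES)
departments = load_rows(DEPARTMENTS)

print("records:", len(salaries))
print("duplicates:", find_duplicates(salaries, ["id"]))
print("missing:", count_missing(salaries))
print("join check:", check_join(salaries, departments, "id"))
```

A script like this only flags candidates for a closer look. It does not replace checking the numbers against the original records or asking a colleague to poke holes in the methodology.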
Print a copy of the checklist.
How do you bulletproof your data work? At the 2012 NICAR conference in St. Louis, Tom Johnson and Cheryl Phillips led a seminar discussing their tips for validating data. I’d really like to hear any advice or ideas you’re willing to share. Leave a comment here with your techniques or tweet at me.
A * indicates questions added after hearing David Donald speak about bulletproofing CAR work at IRE ’12 in Boston.