Scraping with Python
in the Newsroom

Day One - GIJC 2015, Lillehammer

By Tom Meagher / @ultracasual
and Tommy Kaas / @tbkaas

Why should you code?

Cover more ground, faster

Helps document reporting

Makes analysis replicable


We won't learn everything

The Command Line

Git or GitHub

pip or virtualenvs


News apps

Journalism > "Development"

This will not be nuanced, idiomatic Python.

Some programmers may be saddened by this code.

But you know what? If it works, and it works on deadline, that's what matters for us today.

The goal

To start thinking about
how to break problems down
into the smallest tasks
that can be programmed.

If you know Excel

You can learn to program.

  • "AZ Arizona" is a string.
  • "A2" is a variable.
  • "=left()" is a function.
  • A2 and 2 are function arguments.
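
The same four ideas carry over to Python almost one to one. A minimal sketch, assuming cell A2 holds "AZ Arizona". The names a2, left and state_code are made up for illustration, and left is a hypothetical helper that imitates Excel's function, not something built into Python.

```python
# A string: text wrapped in quotes.
# A variable: a name that points at a value, like cell A2 in Excel.
a2 = "AZ Arizona"


def left(text, num_chars):
    """Hypothetical stand-in for Excel's =LEFT(): the first num_chars characters."""
    return text[:num_chars]


# a2 and 2 are the arguments passed to the function.
state_code = left(a2, 2)
print(state_code)
```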

Why Python?

Easy to learn.


Mature and well-documented.

Strong PythonJournos community of support.

Prep your workspace

To set up your machine, you'll need to have Python 2.7, pip, virtualenv and virtualenvwrapper installed.

#create and activate a sandbox to work in
mkvirtualenv gijc15    

#clone the code repo from GitHub
git clone

#install the dependencies: requests, beautifulsoup4, unicodecsv
pip install -r requirements.txt   

#launch the interactive interpreter


Strings are ordered sequences of characters wrapped in quotes.

var1 = "This class is at GIJC in Lillehammer."
var2 = "&You!_123 Four"
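
A few things you can do with those two strings, as a rough sketch. The variable names on the left are made up for illustration; len, slicing, upper, split and in are all standard Python.

```python
var1 = "This class is at GIJC in Lillehammer."
var2 = "&You!_123 Four"

length = len(var1)           # count the characters
first_word = var1[:4]        # slice out the first four characters
shouted = var1.upper()       # a new, upper-cased copy; var1 is unchanged
words = var2.split(" ")      # break the string on spaces into a list
has_gijc = "GIJC" in var1    # check whether one string contains another

print(first_word)
print(words)
```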

Follow along at home here.
If you want to cheat, the answers are here.

Integers and Floats

Numbers that you can do math on.

Integers are whole numbers.
Floats are decimals.
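
A short sketch of math on both types. The variable names are made up. One thing to watch: in Python 2.7, dividing one integer by another with / drops the decimal part, so this example divides by a float to behave the same on Python 2.7 and Python 3.

```python
whole = 7        # an integer
decimal = 2.5    # a float

total = whole + 3            # integer math stays an integer
product = whole * decimal    # mixing the two gives a float
quotient = whole / 2.0       # divide by a float to keep the decimal part
floored = whole // 2         # floor division throws away the remainder

print(total, product, quotient, floored)
```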


Lists are ordered collections of objects, wrapped in square brackets.

my_list = [1, 2, "Liberty Bell"]
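
A rough sketch of working with that list. The appended value and the variable names first, last and count are made up for illustration.

```python
my_list = [1, 2, "Liberty Bell"]

first = my_list[0]     # counting starts at zero
last = my_list[-1]     # negative numbers count from the end

my_list.append("Lillehammer")   # add an item to the end (made-up value)
count = len(my_list)            # how many items the list holds now

# Step through the list one item at a time.
for item in my_list:
    print(item)
```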


Dictionaries are collections of named keys and their associated values, wrapped in curly braces.

my_dict = {'Fruit': 'Orange', 'Weight': 10}
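
A rough sketch of reading from and adding to that dictionary. The 'Color' and 'Price' keys are made up for illustration.

```python
my_dict = {'Fruit': 'Orange', 'Weight': 10}

fruit = my_dict['Fruit']      # look up a value by its key
my_dict['Color'] = 'Orange'   # add a new key and value (made-up example)

# .get() returns a fallback instead of raising an error when the key is missing.
price = my_dict.get('Price', 'unknown')

keys = sorted(my_dict.keys())
print(keys)
```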


Conditionals are logic that can trigger other operations,
similar to Excel's IF function.

In Excel:

In Python:

score = 1
if score > 2:
    print("Win")
else:
    print("Lose")


For more practice with the basics,
try this tutorial from PyCAR, or this one.

Hour Two

The Problem

How to go from this... to this?

The Reporting Phase

  • Find a website
  • Before you do anything else, ask for it.
  • If that doesn't work...
  • Does it look like a data table?
  • "View Source"
  • Is there a table tag?
  • Does it follow a predictable pattern, like
    body >> div >> table >> tr >> td ?
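
The "predictable pattern" check can also be done in code. This is only a sketch: the page source is made up, and it uses the HTMLParser class from Python's standard library rather than the BeautifulSoup package this class installs, so it runs with nothing extra installed. PathFinder is a hypothetical name.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2.7

# Made-up page source, standing in for what "View Source" would show.
SAMPLE_HTML = """
<html><body>
  <div>
    <table>
      <tr><td>Oslo</td><td>12</td></tr>
      <tr><td>Bergen</td><td>7</td></tr>
    </table>
  </div>
</body></html>
"""


class PathFinder(HTMLParser):
    """Record the chain of open tags that leads to every <td> cell."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []     # tags opened but not yet closed
        self.cell_paths = []    # one path string per <td> found

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        if tag == "td":
            self.cell_paths.append(" >> ".join(self.open_tags))

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


finder = PathFinder()
finder.feed(SAMPLE_HTML)
finder.close()

# If every cell shares one path, the page follows a predictable pattern.
unique_paths = sorted(set(finder.cell_paths))
print(unique_paths)
```

If unique_paths holds a single entry, every cell sits at the same place in the page, which is a good sign the table can be scraped with a simple loop.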

The Writing Phase

  • Make an HTTP request to the site
  • Collect the text content of the response.
  • Parse the text and step through it, tag by tag
  • Find a tag, assign its content to a variable
  • Store those variables in lists or dicts
  • Make any additional requests to other pages and repeat above.
  • Loop through the collected list or dict and write it to a file
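
The steps above can be sketched end to end. This is not the class's scraper: it is written for Python 3 and uses only the standard library (html.parser and csv) in place of requests, BeautifulSoup and unicodecsv, so it runs offline. The response text, the RowCollector class and the column names are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up response text; in a real scraper this comes back from the HTTP request.
RESPONSE_TEXT = """
<table>
  <tr><td>Oslo</td><td>12</td></tr>
  <tr><td>Bergen</td><td>7</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Step through the markup tag by tag and store the text of each row's cells."""

    def __init__(self):
        super().__init__()
        self.rows = []            # finished rows, one list per <tr>
        self.current_row = None   # cells of the row being read
        self.in_cell = False      # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag == "td" and self.current_row is not None:
            self.in_cell = True
            self.current_row.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.current_row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None


collector = RowCollector()
collector.feed(RESPONSE_TEXT)
collector.close()

# Loop through the collected rows and write them out as CSV.
# io.StringIO stands in for a real file such as open("results.csv", "w", newline="").
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["city", "count"])   # hypothetical column names
for row in collector.rows:
    writer.writerow(row)

csv_text = output.getvalue()
print(csv_text)
```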

Let's go

Open your text editor of choice and a terminal window.
Write a line or two of code under each comment,
save the text file and then try to run:



Keep learning

Excellent post on ethics of scraping

"Web Scraping With Python"

Python Journos

More resources for learning Python



GitHub, StackOverflow, Google


Clone the source code
Check out day two's exercises.

Email me or ping me on Twitter