Mining the Social Web, 2nd Edition

Appendix C: Python and IPython Notebook Tips & Tricks

This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from Mining the Social Web (2nd Edition). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.

In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, you can find the full source code repository here.

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the Simplified BSD License that governs its use.

Try tinkering around with Python to test out IPython Notebook...

As you work through these notebooks, it's assumed that you'll be executing each cell in turn, because some cells will define variables that cells below them will use. Here's a very simple example to illustrate how this all works...

In []:
# Execute this cell to define this variable
# Either click Cell => Run from the menu or type
# ctrl-Enter to execute. See the Help menu for lots
# of useful tips. Help => IPython Help and
# Help => Keyboard Shortcuts are especially
# useful.

message = "I want to mine the social web!"
In []:
# The variable 'message' is defined here. Execute this cell to see for yourself
print message
In []:
# The variable 'message' is defined here, but we'll delete it
# after displaying it to illustrate an important point...
print message
del message
In []:
# The variable 'message' is no longer defined, either in this cell or in the
# cell two cells above. Try executing this cell or that cell to see for yourself.
print message
In []:
# Try typing in some code of your own!

Python Idioms

This section of the notebook introduces a few Python idioms that are used widely throughout the book and that you might find helpful to review. This section is not intended to be a Python tutorial. It is intended to highlight some of the fundamental aspects of Python that will help you to follow along with the source code, assuming you have a general programming background. If you are looking for a gentle introduction to Python as a programming language, spend a couple of hours working through Sections 1 through 8 of the Python Tutorial.

Python Data Structures are like JSON

If you come from a web development background, a good starting point for understanding Python data structures is JSON. If you don't have a web development background, think of JSON as a simple but expressive specification for representing arbitrary data structures using strings, numbers, lists, and dictionaries. Execute the following cell, which illustrates these fundamental data types, to follow along.

In []:
an_integer = 23
print an_integer, type(an_integer)
print

a_float = 23.0
print a_float, type(a_float)
print

a_string = "string"
print a_string, type(a_string)
print

a_list = [1,2,3]
print a_list, type(a_list)
print a_list[0] # access the first item
print

a_dict = {'a' : 1, 'b' : 2, 'c' : 3}
print a_dict, type(a_dict)
print a_dict['a'] # access the item with key 'a'

Assuming you've followed along with these fundamental data types, consider the possibilities for arbitrarily composing them to represent more complex structures:

In []:
contacts = [
    {
      'name'      : 'Bob',
      'age'       : 23,
      'married'   : False,
      'height'    : 1.8, # meters
      'languages' : ['English', 'Spanish'],
      'address'   : '123 Maple St.',
      'phone'     : '(555) 555-5555'
    },
    
    {'name'      : 'Sally',
     'age'       : 26,
     'married'   : True,
     'height'    : 1.5, # meters
     'languages' : ['English'],
     'address'   : '456 Elm St.',
     'phone'     : '(555) 555-1234'
    }              
]

for contact in contacts:
    print "Name:", contact['name']
    print "Married:", contact['married']
    print

As alluded to previously, the data structures very much lend themselves to constructing JSON in a very natural way. This is often quite convenient for web application development that involves using a Python server process to send data back to a JavaScript client. The following cell illustrates the general idea.

In []:
import json

print contacts
print type(contacts) # list

# json.dumps (dumps stands for "dump string") takes a Python data structure
# that is serializable to JSON and dumps it as a string
jsonified_contacts = json.dumps(contacts, indent=2) # indent is used for pretty-printing

print type(jsonified_contacts) # str
print jsonified_contacts
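
The reverse direction works much the same way. The following is a minimal sketch, assuming a small stand-in list named people (hypothetical data, not from the book) so that the cell is self-contained. It uses single-argument print(...) calls so that it runs under both Python 2 and Python 3.

```python
import json

# A small stand-in for the contacts list above (hypothetical data)
people = [
    {'name': 'Bob', 'age': 23, 'married': False, 'languages': ['English', 'Spanish']}
]

# Serialize the Python data structure to a JSON string...
as_json = json.dumps(people)

# ...and parse the string back into Python data structures with
# json.loads (loads stands for "load string")
round_tripped = json.loads(as_json)

print(type(round_tripped))
print(round_tripped[0]['languages'])
print(round_tripped == people)  # True
```

Note that under Python 2, the strings that come back from json.loads are unicode objects, so they may display with a u prefix even though they compare equal to the originals.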

A couple of additional types that you'll run across regularly are tuples and the special None type. Think of a tuple as an immutable list and None as a special value that indicates an empty value, which is neither True nor False.

In []:
a_tuple = (1,2,3)

an_int = (1) # This is just the integer 1. You must include a trailing comma when only one item is in the tuple

a_tuple = (1,)

a_tuple = (1,2,3,) # Trailing commas are ok in tuples and lists 

none = None

print none == None   # True
print none == True   # False
print none == False  # False

print

# In general, you'll see the special 'is' operator used when comparing a value to 
# None, but most of the time, it works the same as '=='

print none is None  # True
print none is True  # False
print none is False # False
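
To make "immutable list" concrete, here is a minimal sketch of what immutability means in practice. The variable names (mutable, immutable, distances) are made up for illustration.

```python
mutable = [1, 2, 3]
immutable = (1, 2, 3)

# A list can be modified in place
mutable[0] = 99
print(mutable)  # [99, 2, 3]

# A tuple cannot; attempting item assignment raises a TypeError
try:
    immutable[0] = 99
except TypeError as e:
    print("TypeError: " + str(e))

# Because tuples are immutable, they can serve as dictionary keys
distances = {('a', 'b'): 5}
print(distances[('a', 'b')])  # 5
```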

As indicated in the python.org tutorial, None is often used as a default value for function arguments. Functions are defined with the keyword def.

In []:
def square(x):
    return x*x

print square(2) # 4
print

# The default value for L is only created once and shared amongst
# calls

def f1(a, L=[]):
    L.append(a)
    return L

print f1(1) # [1]
print f1(2) # [1, 2]
print f1(3) # [1, 2, 3]
print

# Each call creates a new value for L

def f2(a, L=None):
    if L is None:
        L = []
    L.append(a)
    return L

print f2(1) # [1]
print f2(2) # [2]
print f2(3) # [3]

List and String Slicing

For lists and strings, you'll often want to extract a particular selection using a starting and ending index. In Python, this is called slicing. The syntax involves using square brackets just as when you extract a single value, but you separate the starting and ending indices with a colon to indicate the boundaries of the slice. The item at the ending index is excluded.

In []:
a_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

print a_list[0]     # a
print a_list[0:2]   # ['a', 'b']
print a_list[:2]    # Same as above. The starting index is implicitly 0
print a_list[3:]    # ['d', 'e', 'f', 'g'] Ending index is implicitly the length of the list
print a_list[-1]    # g Negative indices start at the end of the list
print a_list[-3:-1] # ['e', 'f'] From the third-to-last item up to the last item. (The index after the colon is still excluded)
print a_list[-3:]   # ['e', 'f', 'g']  The last three items in the list
print a_list[:-4]   # ['a', 'b', 'c'] Everything up to the last 4 items

a_string = 'abcdefg'

# String slicing works the very same way

print a_string[:-4] # abc
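
Slices also accept an optional third value, a step, which the cell above does not cover. The following is a small sketch of that extension, using a list named letters that is made up for illustration.

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# A step of 2 takes every second item
every_other = letters[::2]
print(every_other)  # ['a', 'c', 'e', 'g']

# Start, end, and step can be combined
print(letters[1:6:2])  # ['b', 'd', 'f']

# A step of -1 walks backwards, which yields a reversed copy
reversed_letters = letters[::-1]
print(reversed_letters)  # ['g', 'f', 'e', 'd', 'c', 'b', 'a']

# Strings work the very same way
print('abcdefg'[::-1])  # gfedcba
```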

List Comprehensions

Think of Python's list comprehension idiom as a concise and efficient way to create lists. You'll often see list comprehensions used as an alternative to for loops for a common set of problems. Although they may take some getting used to, you'll soon find them to be a natural expression. See the section entitled "Loops" from Python Performance Tips for more details on why list comprehensions may be more performant than loops or functions like map in various situations.

In []:
# One way to create a list containing 0..9:

a_list = []
for i in range(10):
    a_list.append(i)
print a_list    
    
# How to do it with a list comprehension

print [ i for i in range(10) ]


# But what about a nested loop like this one, which
# even contains a conditional expression in it:

a_list = []
for i in range(10):
    for j in range(10, 20):
        if i % 2 == 0:
            a_list.append(i)

print a_list

# You can use a nested list comprehension to achieve
# the very same result. When written with highly readable
# indentation like below, note the striking similarity to
# the equivalent code as presented above.

print [ i
        for i in range(10)
            for j in range(10, 20)
                if i % 2 == 0
      ]
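
A list comprehension can also transform each item as it filters. The following is a minimal sketch that uses a made-up list named words to compare the loop and comprehension forms side by side.

```python
words = ['mining', 'the', 'social', 'web']

# Loop version: collect the lengths of words longer than three characters
lengths = []
for w in words:
    if len(w) > 3:
        lengths.append(len(w))

# Equivalent list comprehension: the expression comes first,
# followed by the loop and then the condition
lengths_lc = [len(w) for w in words if len(w) > 3]

print(lengths)     # [6, 6]
print(lengths_lc)  # [6, 6]
```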

Dictionary Comprehensions

In the same way that you can concisely construct lists with list comprehensions, you can concisely construct dictionaries with dictionary comprehensions. The underlying concept and the syntax are very similar to those of list comprehensions. The following example illustrates a few different ways to create the same dictionary and introduces dictionary construction syntax.

In []:
# Literal syntax

a_dict = { 'a' : 1, 'b' : 2, 'c' : 3 }
print a_dict
print

# Using the dict constructor

a_dict = dict([('a', 1), ('b', 2), ('c', 3)])
print a_dict
print

# Dictionary comprehension syntax

a_dict = { k : v for (k,v) in [('a', 1), ('b', 2), ('c', 3)] }
print a_dict
print

# A more appropriate circumstance to use dictionary comprehension would 
# involve more complex computation

a_dict = { k : k*k for k in xrange(10) } # {0: 0, 1: 1, 2: 4, 3: 9, ..., 9: 81}
print a_dict
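
Two further uses that come up often are inverting a dictionary and filtering one. The following is a small sketch that uses made-up variable names and assumes the values are unique, so that they can safely become keys.

```python
letter_to_number = {'a': 1, 'b': 2, 'c': 3}

# Swap keys and values (assumes the values are unique)
number_to_letter = {v: k for (k, v) in letter_to_number.items()}
print(number_to_letter[2])  # b

# Keep only the entries whose value is odd
odd_only = {k: v for (k, v) in letter_to_number.items() if v % 2 == 1}
print(sorted(odd_only.keys()))  # ['a', 'c']
```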

Enumeration

While iterating over a collection such as a list, it's often handy to know the index for the item that you are looping over in addition to its value. While a reasonable approach is to maintain a looping index, the enumerate function spares you the trouble.

In []:
lst = ['a', 'b', 'c']

# You could opt to maintain a looping index...
i = 0
for item in lst:
    print i, item
    i += 1

# ...but the enumerate function spares you the trouble of maintaining a loop index
for i, item in enumerate(lst):
    print i, item
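
The enumerate function also accepts an optional starting value for the index, which is handy for numbering items from 1. The following is a minimal sketch with a made-up list named items.

```python
items = ['a', 'b', 'c']

# Pass a second argument to start counting from 1 instead of 0
numbered = []
for i, item in enumerate(items, 1):
    numbered.append(str(i) + ". " + item)

print(numbered)  # ['1. a', '2. b', '3. c']

# enumerate yields (index, value) tuples
pairs = list(enumerate(items))
print(pairs)  # [(0, 'a'), (1, 'b'), (2, 'c')]
```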

*args and **kwargs

Conceptually, Python functions accept a list of positional arguments that can be followed by additional keyword arguments. A common idiom that you'll see when calling functions is to unpack a list or dictionary with the asterisk or double-asterisk, respectively, so that its contents satisfy the function's parameters.

In []:
def f(a, b, c, d=None, e=None):
    print a, b, c, d, e

f(1, 2, 3)              # 1 2 3 None None
f(1, 2, 3, d=4)         # 1 2 3 4 None
f(1, 2, 3, d=4, e=5)    # 1 2 3 4 5

args = [1,2,3]
kwargs = {'d' : 4, 'e' : 5}

f(*args, **kwargs)      # 1 2 3 4 5
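
The same asterisk syntax can appear in a function definition, where it collects any extra arguments instead of unpacking them. The following is a minimal sketch that uses a hypothetical function named describe.

```python
def describe(*args, **kwargs):
    # args is a tuple of the positional arguments;
    # kwargs is a dictionary of the keyword arguments
    return "args=" + repr(args) + " kwargs=" + repr(sorted(kwargs.items()))

print(describe(1, 2, 3))              # args=(1, 2, 3) kwargs=[]
print(describe(1, d=4))               # args=(1,) kwargs=[('d', 4)]
print(describe(*[1, 2], **{'e': 5}))  # args=(1, 2) kwargs=[('e', 5)]
```

Functions written this way are useful as wrappers that pass their arguments through to another function unchanged.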

String Substitutions

It's often clearer in code to use string substitution than to concatenate strings, although both options can get the job done. The string type's built-in format function is also very handy and adds to the readability of code. The following examples illustrate some of the common string substitutions that you'll regularly encounter in the code.

In []:
name1, name2 = "Bob", "Sally"

print "Hello, " + name1 + ". My name is " + name2

print "Hello, %s. My name is %s" % (name1, name2,)

print "Hello, {0}. My name is {1}".format(name1, name2)
print "Hello, {0}. My name is {1}".format(*[name1, name2])
names = [name1, name2]
print "Hello, {0}. My name is {1}".format(*names)


print "Hello, {you}. My name is {me}".format(you=name1, me=name2)
print "Hello, {you}. My name is {me}".format(**{'you' : name1, 'me' : name2})
names = {'you' : name1, 'me' : name2}
print "Hello, {you}. My name is {me}".format(**names)
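
Both substitution styles also accept format specifiers that control details such as precision and padding. The following is a small sketch with made-up values. The thousands separator requires Python 2.7 or later.

```python
name, height = "Bob", 1.8

# Two decimal places with format and with the % operator
with_format = "{0} is {1:.2f} meters tall".format(name, height)
with_percent = "%s is %.2f meters tall" % (name, height)
print(with_format)   # Bob is 1.80 meters tall
print(with_percent)  # Bob is 1.80 meters tall

# Right-align within a field that is 8 characters wide
padded = "[{0:>8}]".format(name)
print(padded)  # [     Bob]

# Insert a thousands separator
with_commas = "{0:,}".format(1234567)
print(with_commas)  # 1,234,567
```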

Unicode in Python 2.x

XXX - to be written.

In []:

Login to the Virtual Machine with SSH

Although great care has been taken to try and ensure that you won't have to use a secure shell to login to the virtual machine, there is great value in learning how to use an SSH client to login to a machine and be productive in a terminal session, and there will inevitably be times when you may need to troubleshoot on the virtual machine despite our best efforts. If you are already comfortable using a secure shell, you'll simply type vagrant ssh from the vagrant folder where your Vagrantfile is located, and you'll get automatically logged in if you have an SSH client already installed.

If you are a Mac or Linux user, you will already have an SSH client installed and your remote login should "just work", but if you are a Windows user, you will likely not have an SSH client installed unless you're already using a tool like PuTTY or Git for Windows. In any event, if you've used an SSH client before, you'll have no troubles using vagrant ssh to perform a remote login. If you have not used an SSH client before, you won't necessarily need to learn to use one, but it might be in your overall best interest to learn how to use one at your leisure.

Even without an SSH client, however, there are a couple of creative ways that you can use a Python script and IPython Notebook to give you much of the same functionality that you would achieve by performing a remote login to gain terminal access. The following sections introduce some of the possibilities.

Interact with the Virtual Machine without an SSH Client Using Python

Although you're probably better off configuring and using an SSH client to log in to the virtual machine most of the time, you can also interact with the virtual machine almost as though you were working in a terminal session by using IPython Notebook and the envoy package, which wraps the subprocess package in a highly convenient way that allows you to run arbitrary commands and see the results. The following script shows how to run a few commands on the virtual machine (where this IPython Notebook server would be running if you are using the virtual machine). Even in situations where you are running a Python program locally, this package can be of significant convenience.

In []:
import envoy # pip install envoy

# Run a command just as you would in a terminal on the virtual machine
r = envoy.run('ps aux | grep ipython') # show processes containing 'ipython'

# Print its standard output
print r.std_out

# Print its standard error
print r.std_err

# Print the working directory for the IPython Notebook server
print envoy.run('pwd').std_out

# Try some commands of your own...
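
If envoy isn't installed, the standard library's subprocess package that it wraps can do the same job with a little more ceremony. The following is a minimal sketch, assuming a Unix-like system where the echo command is available.

```python
import subprocess

# Run a command and capture its standard output and standard error
p = subprocess.Popen(['echo', 'hello from a subprocess'],
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)
out, err = p.communicate()

# The captured output is a byte string, so decode it before displaying it
print(out.decode('utf-8'))

# An exit status of 0 conventionally indicates success
print(p.returncode)  # 0
```

Note that passing the command as a list runs a single program. A pipeline such as the 'ps aux | grep ipython' example above would need either shell=True or two chained Popen calls.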

Bash Cell Magic

An alternative to using the envoy package to interact with the virtual machine is what is called "Bash Cell Magic" in IPython Notebook. The way it works is that if you write %%bash on the first line of a cell, IPython Notebook will automatically take the remainder of the cell and execute it as a Bash script on the machine where the server is running. In case you come from a Windows background or are a Mac user who hasn't yet encountered Bash, it's the name of the default shell on most Linux systems, including the virtual machine that runs the IPython Notebook server.

Assuming that you are using the virtual machine, this means that you can essentially write bash scripts in IPython Notebook cells and execute them on the server. The following script demonstrates some of the possibilities, including how to use a command like wget to download a file.

In []:
%%bash
# Print the working directory
pwd 

# Display the date
date

# View the first 10 lines of a manual page for wget
man wget | head -10

# Download a webpage to /tmp/foo.html
wget -O /tmp/foo.html http://ipython.org/notebook.html

# Search for 'ipython' in the webpage
grep ipython /tmp/foo.html

Using Bash Cell Magic to Update Your Source Code

Since Bash cell magic works just as though you were executing commands in a terminal, you can use it to easily manage your source code by executing commands like "git status" and "git pull".

In []:
%%bash
ls ../
# Displays the status of the local repository
git status

# Execute "git pull" to perform an update

Serving Static Content

IPython Notebook has some handy features for interacting with the web browser that you should know about. A few of the features that you'll see in the source code are embedding inline frames, and serving static content such as images, text files, JavaScript files, etc. The ability to serve static content is especially handy if you'd like to display an inline visualization for analysis, and you'll see this technique used throughout the notebook.

The following cell illustrates creating and embedding an inline frame and serving the static source file for this notebook, which is serialized as JSON data.

In []:
from IPython.display import IFrame
from IPython.core.display import display

# IPython Notebook can serve files relative to the location of
# the working notebook into inline frames. Prepend the path 
# with the 'files' prefix

static_content = 'files/resources/appc-pythontips/hello.txt'

display(IFrame(static_content, '100%', '600px'))

Shared Folders

The Vagrant virtual machine maps the top level directory of your GitHub checkout (the directory containing README.md) on your host machine to its /vagrant folder and automatically synchronizes files between the guest and host environments as an incredible convenience to you. This mapping and synchronization enables IPython Notebooks you are running on the guest machine to access files that you can conveniently manage on your host machine and vice-versa. For example, many of the scripts in IPython Notebooks may write out data files and you can easily access those data files on your host environment (should you desire to do so) without needing to connect into the virtual machine with an SSH session. On the flip side, you can provide data files to IPython Notebook, which is running on the guest machine by copying them anywhere into your top level GitHub checkout.

In effect, the top level directory of your GitHub checkout is automatically synchronized between the guest and host environments so that you have access to everything that is happening and can manage your source code, modified notebooks, and everything else all from your host machine. See Vagrantfile for more details on how synchronized folders can be configured.

The following code snippet illustrates how to access files. Keep in mind that the code that you execute in this cell writes data to the guest (virtual machine) environment, and it's Vagrant that automatically synchronizes it back to your host environment. It's a subtle but important detail.

In []:
import os

# The absolute path to the shared folder on the VM
shared_folder="/vagrant"

# List the files in the shared folder
print os.listdir(shared_folder)
print

# How to read and display a snippet of the README.md file in the shared folder...
README = os.path.join(shared_folder, "README.md")
txt = open(README).read()
print txt[:200]

# Write out a file to the guest but notice that it is available on the host
# by checking the contents of your GitHub checkout
f = open(os.path.join(shared_folder, "Hello.txt"), "w")
f.write("Hello. This text is written on the guest but synchronized to the host by Vagrant")
f.close()
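
The same read and write pattern is often written with Python's with statement, which closes the file for you. The following is a minimal sketch that uses a temporary directory named demo_folder as a stand-in for /vagrant (an assumption for illustration) so that the cell runs anywhere.

```python
import os
import shutil
import tempfile

# Stand-in for the "/vagrant" shared folder (assumption for illustration)
demo_folder = tempfile.mkdtemp()
path = os.path.join(demo_folder, "Hello.txt")

# The with statement closes the file automatically,
# even if an error occurs while writing
with open(path, "w") as f:
    f.write("Hello from the guest")

with open(path) as f:
    contents = f.read()

print(contents)  # Hello from the guest

listing = os.listdir(demo_folder)
print(listing)   # ['Hello.txt']

# Clean up the temporary directory
shutil.rmtree(demo_folder)
```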

Monitoring and Debugging Memory Usage with Vagrant and IPython Notebook

IPython Notebook's server kernels are ordinary Python processes that will use as much memory as they need to perform the computation that you give them. Most of the examples in this book do not entail consuming very large amounts of memory (the only notable exception could be Example 7-12 that builds a fairly large in-memory graph), but it's worth noting that most Vagrant boxes, including the default configuration for this book's virtual machine, have fairly low amounts of memory allocated to them by default.

The "precise64" box that's used for these notebooks, for example, ships with a default of 334MB of memory allocated to it. Consequently, if any of your IPython Notebooks ever consumes close to 334MB of memory, you'll reach the limit that's allocated to the guest virtual machine, and the operating system kernel will kill the offending Python process. Unfortunately, IPython Notebook isn't able to provide a clear indication of what's gone wrong in this case, so your only clue as to what may have happened is that you'll no longer see a "Kernel Busy" message in the upper-right corner of the screen, and your variables in the notebook become undefined because the IPython Notebook kernel that's servicing the notebook has effectively restarted.

From within IPython Notebook, you can check the operating system's kernel log to see if there is any evidence that it killed your process with the following Bash command, which displays the last 100 lines of /var/log/kern.log:

In []:
%%bash
tail -n 100 /var/log/kern.log

You can increase the amount of memory that's available to your Vagrant box (and therefore to IPython Notebook) by including the following code snippet in your Vagrantfile as a provider-specific configuration option that specifies that the box should be allocated 768MB of memory:

config.vm.provider :virtualbox do |vb|
  vb.customize ["modifyvm", :id, "--memory", "768"]
end

While a long-running cell is executing, you can also take advantage of IPython Notebook's Bash cell magic to monitor the memory usage of processes by opening up a different notebook and executing the following code in its own cell. In the example below, we are specifying that we'd like to see a list of the top 10 processes and the amount of memory (in KB) that each process is using.

In []:
%%bash
ps -e -orss=,args= | sort -b -k1,1n | pr -TW$COLUMNS | tail -n 10
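
A notebook can also check its own memory usage from Python. The following is a minimal sketch that uses the standard library's resource module, which is available only on Unix-like systems such as the virtual machine. It assumes Linux, where ru_maxrss is reported in kilobytes (other platforms may use different units).

```python
import resource

# Resource usage for the current process, which in a notebook
# is the kernel that executes this cell
usage = resource.getrusage(resource.RUSAGE_SELF)

# Peak resident set size (reported in kilobytes on Linux)
peak = usage.ru_maxrss
print("Peak resident set size: " + str(peak) + " KB")
```

Running this before and after a memory-hungry cell gives a rough sense of how close the kernel is getting to the limit allocated to the virtual machine.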

Again, the guidance offered here is fairly advanced advice that you may never need to use, but in the event that you suspect that you have a memory problem with a notebook, it'll be helpful.

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the following Simplified BSD License (also known as "FreeBSD License") that governs its use. Basically, you can do whatever you want with the code so long as you retain the copyright notice.

Copyright (c) 2013, Matthew A. Russell All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of the FreeBSD Project.