Homework 4

Due Date: Wednesday April 23 at noon EST


Overview

The goals of this homework are for you to:
  • become comfortable with acquiring data from a website using Python,
  • familiarize yourself with Tableau,
  • and generate a graph visualization of a real-world network.


Part 1: Collecting Data with Python (40 points)

In this part of the assignment you will write a Python program to scrape information from websites and create an output.txt file. This output.txt file will contain a body of text which you will upload to Many Eyes and visualize as a Tag Cloud.

This task is meant to be fairly open-ended --- you may use any text you'd like to make your output.txt file. You will turn in the following for this part:

  • an output.txt file,
  • a folder called DataCollection with the code that does the data-mining,
  • in your write-up, a description of how to run your code (it's fine if this is something simple like "run python myscript.py at the command line"),
  • and also in your write-up, an embedded link to the Many Eyes Tag Cloud that is produced from your output.txt file.

There are a few rules you must follow:

  • No copy-pasting text into your output.txt file. This must be done programmatically.
  • You must gather text from at least 5 different URLs. More is encouraged, but let this be a minimum.
  • You don't need to do this all in one script (although it is encouraged for simplicity). However, you must explain how exactly you generated output.txt in your write-up, especially if there are several steps.

Getting Started

You may use any language you'd like, but the course staff will only support Python. If you want to try something else, feel free to, but we may not be able to help. Feel free to post to the message board, though, as some of your classmates may have answers about other languages.

If you have Unix or Mac OSX, Python should already be installed (try typing "python" into your terminal application). If you have Windows, you may need to get it here (you probably want Python 2.5.2 Windows Installer).

Next, download the following zip file from Thomas's lecture on Python. This has the BeautifulSoup.py library and a handy util.py library. It also has the code written during lecture, which is a good place to get started (to remind yourself of how scraping works).

The regular expression checker used in class is here. It may be useful as you start to write your own regular expressions.

Scraping

Pick several websites from which you want to collect data. Using the code in the scraper.zip file you downloaded, modify or write your own script to scrape the interesting data from the websites you chose, and output the data to output.txt. Note: from the command line, you can use the ">" operator to redirect anything your script prints to the terminal into a data file (for example, "python myscript.py > output.txt").
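As a rough illustration of the scraping step, here is a minimal sketch that fetches a list of pages, strips the markup, and writes the remaining text to output.txt. It is not the code from lecture: it assumes a current Python 3 install and uses only the standard library instead of BeautifulSoup.py and util.py, and the names TextExtractor, extract_text, fetch_html, and scrape are made up for this example.

```python
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of a page, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string as one space-separated line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url):
    """Download a page and decode it (assumes UTF-8; bad bytes are replaced)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls, out_path="output.txt", fetch=fetch_html):
    """Write the extracted text of every URL to out_path, one page per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(extract_text(fetch(url)) + "\n")
```

Calling scrape() with a list of at least five URLs of your choice would produce output.txt directly. In practice you will probably want to narrow extract_text down to the part of each page you care about (for example, only the lyrics), which is where regular expressions or BeautifulSoup come in.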

Here are a few ideas of places you could find your text. Feel free to do something else, though, if you have other interesting ideas!

  • What words does your favorite musical artist use most often? Use azlyrics or ohhla or another lyrics site to find a list of their lyrics and scrape it!
  • What words appear most often in Shakespeare's plays? Use gutenberg to find out.

Next, upload your output.txt file to the Many Eyes site, and create a Tag Cloud visualization of your data. Include the visualization in your write-up (i.e. click on "share this visualization", and copy-and-paste the link into your write-up). Note: There is a 5MB data limit for uploading to Many Eyes. If you go over that limit, you are allowed to manually delete text from your output.txt file so that you can make your Tag Cloud. Here is an example visualization of Beatles lyrics scraped from here.

A big thanks to Thomas Carriero for putting together this great Python exercise!


Part 2: Exploring Data with Tableau (20 points)

INFO 424 at the University of Washington is a visualization course by Maureen Stone and Polle Zellweger. In the fall of 2007, the class filled out a survey about themselves to use as a dataset in Tableau. The survey questions and answers (raw and cleaned) can be found here. For this part of the assignment, you will load this survey data in Tableau and explore the data.

You will do the following:

  • Install Tableau. Instructions can be found here.
  • Load the (cleaned) data into Tableau and look at the survey questions. Play with the data in Tableau and look at several visualizations. Based on your understanding of the data, come up with 4 interesting questions and create visualizations to answer them. Try to make them as content-rich and interesting as possible.

For each question, in your write-up include: the question, the answer, and a visualization that supports your answer. Try to find the best (concise, clear) visualization for each question.

Each visualization is worth a maximum of 5 points. The most interesting questions and visualizations will be presented in class.


Part 3: Networks, Graphs, and the T (40 points)

For this part of the assignment you will be using data about Boston's subway network system, the T, to create a graph visualization with Processing. The T can be modeled as an undirected graph (Stops are nodes/vertices, and Connections are edges).

Did you know?... the Red Line is so named because its northernmost station used to be at Harvard University, whose school color, as you all know, is crimson.

To learn more about the history of the T, you can go here and here.

3.1) Getting the Tools in Hand

First, download this zip file containing the first two example sketches from Chapter 8 of Fry's book. Place the sketches in your Processing sketchbook directory.

Second, read carefully through the first part of Chapter 8 (pp. 220 - 242, Subsections: Simple Graph Demo, A More Complicated Graph and Approaching Network Problems) and work through the two example sketches. Both sketches are named according to their subsection title in the chapter.

3.2) The T

In this part of the assignment you will apply your Processing skills to visualize a real world network: the T. We'll focus not only on visualizing the network, but also on data acquisition from the web (preparation for the final projects).

3.2.1) Setting Up

Create a new folder named "HW4" in your sketchbook directory and download the following Framework sketch. Copy the unzipped sketch folder "Framework" to your newly created "HW4" directory.

Open the sketch and familiarize yourself with the code. Run the sketch. You can drag Nodes by pressing and holding down the left mouse button. If you press "p" the next call to draw() writes everything into a pdf file "output.pdf" and places it into your sketch folder.

Did you know?... The PDF file format is a vector format. This allows the image to be resized without loss of resolution.

Save a copy of the "Framework" sketch as "Ex_3_2_1" in the "HW4" directory.

You will now add a field col to the Edge class, which allows us to specify its color. Add a fourth argument col of type String to the constructor of the Edge class. Assign an edge color based on the first letter of the argument col. We only allow four different colors:

col        first letter    (r, g, b) triple
"red"      'r'             (230, 19, 16)
"green"    'g'             (1, 104, 66)
"blue"     'b'             (0, 48, 140)
"orange"   'o'             (255, 131, 5)

The addEdge() routine in the main tab should also take a fourth argument col. Change the code accordingly. You also have to make changes to the draw() method of the Edge class.

Add color to the edges of the graph. They are also a little bit too thin -- increase their stroke weight by one.

3.2.2) Acquiring the Data

Next, you'll acquire all the needed data by parsing the following webpage: data. We collected station and connection data from the MBTA website. (Note: Data for the "Minutes" column of the "Connections" table was not available for all lines -- some values may differ from the real amount of time it would take to travel between the two corresponding stations.)

Your task is now to write two Python scripts (or any other language that you used for the first part of this assignment): "stations.py" and "connections.py". The first extracts the list of stations from the first table of data. The second reads the connection data.

The output of "stations.py" should be a list of station names, with one station name per row. Write the list to a file "stations.csv". The file extension ".csv" stands for comma-separated values. It's similar to the ".tsv" file format we were using before. The only difference is that the tabs are replaced by commas. (We introduce it here because it's a widely used format to store data and may be useful for your final projects.)

The information about the connections should be stored into a file named "connections.csv". The row format there is defined as follows: "From", "To", "Color", "Minutes".
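To make the table-to-CSV step concrete, here is a minimal sketch that pulls every table out of an HTML string and turns a list of rows into CSV text. It assumes the data page uses plain, non-nested table, tr, and td/th markup, which you should check against the actual page. The names TableParser, parse_tables, and rows_to_csv are invented for this example, and it again uses only the Python 3 standard library.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects each <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Trim surrounding white space so names match exactly later on.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def parse_tables(html):
    """Return all tables found in an HTML string, in document order."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


def rows_to_csv(rows):
    """Format a list of rows as comma-separated text, one row per line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Under these assumptions, "stations.py" would write rows_to_csv() of the first table to "stations.csv", and "connections.py" would do the same for the second table and "connections.csv". Whether to keep or drop the header row is left to you. Using the csv module rather than joining strings by hand means a station name that happens to contain a comma is quoted correctly.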

Store copies of your Python scripts in a folder called "TheTDataCollection" in your "HW4" directory. Include instructions in your write-up about how to run the scripts.

Now we are ready to define the Node locations on the screen. Download the Using_Your_Own_Data sketch and place the unzipped sketch folder into your "HW4" directory. We used this sketch in HW2 to define locations on a map. We can reuse this code to define the locations of the stations now.

Save a copy of "Using_Your_Own_Data" as "Ex_3_2_2". Add the following MBTA map to the sketch. We'll use it to define the 2D locations of our T stops. Change the code so that it reads and shows station names. Use the Table class to read in the "stations.csv" file. Store the station names together with their x and y coordinates to a file "locations.csv". (Row format: "Station Name", "x", "y".)

Remark: If you change the argument "TAB" of the split command to ',' in the Table class, it is able to read comma-separated files. By adding another line of code, we make sure that the white space at the beginning and end of each string is removed:
Add: for (int j = 0; j < pieces.length; j++) { pieces[j] = trim(pieces[j]); }
For reading and displaying the station names in the above example it doesn't really matter, but it will be helpful/important in the next couple of tasks.

Whew! Take a deep breath as the painful part is done -- the data is acquired.

3.2.3) Visualization of the Network

We'll now visualize the network. Before you start, save "Ex_3_2_1" as "Ex_3_2_3". Add the Table class to the sketch (changes from the above remark).

Load the two files "connections.csv" and "locations.csv" into tables. Load this data by making changes to the loadData() and the addNode() routines. If you now run the sketch you will see the T network.

Next, add code so that the station name is displayed in the upper right corner of the screen when the mouse rolls over a station.

3.2.4) Shortest Path

Save the previous sketch as "Ex_3_2_4".

What is a shortest path in a network? That's pretty simple. Let us assume the following scenario:
You are late for a meeting at the Government Center. You are still at Harvard. At which stations do you have to change to get there as fast as possible?

You do not have to code a shortest path algorithm yourself --- we've done that for you. Download the ShortestPath file and add it to your sketch. It is (hopefully!) straightforward to use:

1. Add two global arrays of type "boolean", one named activeNodes and the other named activeEdges. Initialize them by adding initializeActiveDataStructures(); to the end of the setup function. If activeNodes[nodeIndex] evaluates to true, then the node with index nodeIndex is part of the shortest path. The same holds for the edges: if activeEdges[edgeIndex] evaluates to true, then the edge with index edgeIndex is part of the shortest path.
2. Add initializeAdjacencyMatrix(); to the end of the setup function. This initializes a couple of other data structures. You do not have to worry about them. You only have to make sure that they are initialized!
3. Now let us assume that you would like to compute the shortest path between a node A and a node B. The following call shortestPath(A.getIndex(), B.getIndex()) updates the arrays activeNodes and activeEdges and returns the corresponding travel time. That's all you have to do!

Those who are interested in the underlying algorithm may find the article about Dijkstra's Shortest Path algorithm a good starting point.
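For the curious, here is a minimal Python sketch of Dijkstra's algorithm on an undirected graph whose edge weights are travel times. It is not the supplied ShortestPath code (which you should use in your sketch), and the function name, the edge-list format, and the stop names "A" to "D" in the usage note are made up for illustration.

```python
import heapq


def shortest_path(edges, source, target):
    """Return (total_minutes, list_of_stops) for the fastest route.

    edges is a list of (from_stop, to_stop, minutes) tuples; every
    connection can be travelled in both directions.
    """
    # Build an adjacency list for the undirected graph.
    graph = {}
    for a, b, minutes in edges:
        graph.setdefault(a, []).append((b, minutes))
        graph.setdefault(b, []).append((a, minutes))

    dist = {source: 0.0}     # best known travel time to each stop
    prev = {}                # predecessor of each stop on its best route
    heap = [(0.0, source)]   # stops to visit, ordered by travel time
    done = set()             # stops whose travel time is final

    while heap:
        d, stop = heapq.heappop(heap)
        if stop in done:
            continue  # stale heap entry; a faster route was already found
        done.add(stop)
        if stop == target:
            break
        for neighbor, minutes in graph.get(stop, []):
            candidate = d + minutes
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = stop
                heapq.heappush(heap, (candidate, neighbor))

    if target not in dist:
        return float("inf"), []  # no route between the two stops

    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path
```

With the hypothetical edges [("A", "B", 2), ("B", "C", 2), ("A", "C", 5), ("C", "D", 1)], the route from "A" to "D" goes through "B" and "C" and takes 5 minutes, one minute less than the direct "A" to "C" connection would allow. The stops on the returned path play the same role as the true entries of activeNodes.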

Ok. Now it's your turn. Add four global variables: A and B of type "Node", numOfNodes of type "int" and numOfMinutes of type "float". Add code to the mousePressed() method so that the following requirements are met:

1. If you click on a node with the right mouse button it should assign the corresponding node to the global variable A and increment the numOfNodes variable.
2. If you now click on another node with the right mouse button it should assign it to the global variable B and increment the numOfNodes variable. Further, it should compute the shortest path (don't forget numOfMinutes). Also make changes to the draw() routine in the main tab, so that only nodes and edges on the shortest path are drawn.
3. If you then right click somewhere on the screen you should clear A and B, you should set the variables numOfNodes and numOfMinutes to zero again and you should set all the nodes and edges to active again.

Run the sketch and try it... there is still something missing. You do not display the time it takes to travel from A to B! Add code to the draw() routine in the main tab so that it writes the shortest path information in the upper left corner:

Screen Shot

That's it.

3.2.5) Color effect

Save the previous sketch as "Ex_3_2_5".

Wouldn't it be nice to keep the nonactive nodes and edges in the background instead of hiding them, for example by coloring the edges in gray? We could animate this change in color.

Screen Shot

To do so, download the Integrator class and add it to the sketch.

Use the Integrator class to animate the change in color. It's better to do this interpolation in the HSB color space. Leave the hue the same, but change the "target" saturation and "target" brightness for all the nonactive edges to 0 and 200, respectively. Set this new "target" whenever a new shortest path is going to be computed. If the display changes back to showing all the nodes, change the target edge color to its original color again:

Screen Shot
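To see why keeping the hue fixed and easing only saturation and brightness gives a smooth fade to gray, here is a small Python sketch of the idea. It is not the Integrator class: the ease function is a simple stand-in that moves a value a fixed fraction of the way toward its target each frame. It also assumes HSB components on a 0 to 255 scale, and all function names are made up for this example.

```python
import colorsys


def rgb_to_hsb255(r, g, b):
    """Convert 0-255 RGB values to hue, saturation, brightness on a 0-255 scale."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 255.0, s * 255.0, v * 255.0


def ease(value, target, rate=0.2):
    """Move value a fixed fraction of the remaining distance toward target."""
    return value + rate * (target - value)


def fade_frames(rgb, target_s=0.0, target_b=200.0, frames=30, rate=0.2):
    """Return the (h, s, b) color of an edge for each frame of the fade."""
    h, s, b = rgb_to_hsb255(*rgb)
    result = []
    for _ in range(frames):
        s = ease(s, target_s, rate)  # saturation drifts toward gray
        b = ease(b, target_b, rate)  # brightness drifts toward light gray
        result.append((h, s, b))     # hue is never touched
    return result
```

Running fade_frames((230, 19, 16)) on the red edge color starts from a fully saturated red and ends very close to saturation 0 and brightness 200, a light gray. Setting the targets back to the original saturation and brightness reverses the fade.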


Extra Credit: The New Trip Planner (20 points)

Choose your favorite city and visualize its public transportation system. Boston your favorite? Add the bus lines to the above network. You like Chicago more? Map out the L and the associated commuter rails.

You must acquire the data (stations, connections, travel time) from the corresponding public transportation websites (automatically if possible). Add at least one new feature to your visualization, such as a query to an online time-table that checks when the next train is leaving and displays this information near the station. You could add a ranked list of your favorite coffee shops and a function that determines the highest ranked one you can grab a cup of joe from before the start of class. What else would you love to coordinate?

Feel free to change the design, features, and city. We will award points based on the amount of information you portray, and the helpfulness of your visualization.


Submission Instructions

To submit your homework, create a folder named lastname_firstinitial_hw4, and place your write-up, Python directory, and Processing sketches into the directory. Compress the folder and send it as an email attachment to miriah@seas.harvard.edu.