As I mentioned during my talk at Defrag 2009, the best corpus of sample email data we have is the email data dump that the federal government released after Enron’s collapse. The corpus includes over 400,000 real email messages from Enron employees, and it’s ripe for analysis.
The data is available on the web in a number of different formats, but none of them are especially conducive to just picking up and starting analysis, especially on Windows – the original data is posted with filenames that end in periods, so it's effectively unusable for the 93% of us who are on Windows. It took me about two days of solid work to get something running before I could actually start analyzing the data.
I’ve created a VirtualBox image that includes the Enron data, the tools to gather a large sample of Twitter data, and some sample Python scripts that take almost all of the work out of accessing the data for analyzing the email or Twitter data. I think this is by far the easiest way to start analyzing the Enron email corpus, and a pretty darned easy way to get started collecting and analyzing Twitter data.
It should take you less than half an hour of hands-on time (plus a little time for zip files to extract) to go from nothing to running the sample scripts and generating histograms of message length. Good luck, enjoy, and feel free to email me or leave a comment with questions.
- Download and install Sun’s VirtualBox.
- Download my VirtualBox Image file, which runs Xubuntu (a streamlined Ubuntu installation optimized for slower hardware – perfect for a virtual machine!)
- Extract the VirtualBox Image file into a directory you can remember.
- Run VirtualBox and create a new virtual machine. Name it whatever you like; the OS type is Linux and the version is Ubuntu.
- Set the memory to at least 512 MB (if you have 3+ GB, I recommend 1.5 GB so that you can load the entire enron messages table into memory).
- Leave Boot Hard Disk checked and choose Use an Existing Hard Disk. Click the folder icon next to the dropdown, click Add, and navigate to the EnronTwitter.vdi file extracted from the download link above.
- Highlight the EnronTwitter.vdi file and click Select.
- Click Finish. Select the new VM and click Start. Wait for the image to boot.
Everything you need to get started is in your home directory, in the data folder. Double-click the Home icon on the desktop, then double-click the data folder inside that directory. To edit the files, right-click and choose Open With Mousepad, or use a text editor of your choice.
- To get to a Terminal: double-click the Terminal icon on the Desktop, or click Applications at the top left, then Accessories, then Terminal. You will need to type cd data to get into the directory with all the sample scripts.
Linux login info: enron/enr0n
MySQL login info: root/enr0n
There are two sample scripts in the directory – enron.py and enronrecpients.py. The enron.py script generates a histogram of the message lengths of all of the emails in the corpus. enronrecpients.py counts how many emails from the corpus are multi-recipient.
One caveat – these scripts load the entire result set into memory before they run. For that reason, enron.py is currently set up to run on only the first 200,000 messages. If you allocated more than 1 GB of memory to the VM, you should be able to load the full set of messages; just remove the LIMIT 200000 clause from the SQL command.
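To illustrate the kind of work enron.py does, here's a minimal sketch of bucketing message lengths into a histogram. The database, table, and column names in the commented-out MySQL section are my assumptions for illustration; check the actual schema in the image before relying on them.

```python
from collections import Counter

def length_histogram(lengths, bucket=100):
    """Bucket message lengths into bucket-sized bins (0-99, 100-199, ...)."""
    return dict(Counter((n // bucket) * bucket for n in lengths))

# Inside the VM, the lengths would come from MySQL. The database and
# column names below are assumptions -- verify them with the MySQL client:
#
#   import MySQLdb
#   db = MySQLdb.connect(user="root", passwd="enr0n", db="enron")
#   cur = db.cursor()
#   cur.execute("SELECT LENGTH(body) FROM messages LIMIT 200000")
#   lengths = [row[0] for row in cur.fetchall()]

if __name__ == "__main__":
    print(length_histogram([42, 150, 175, 90, 600]))
```

Dropping the LIMIT 200000 clause from the query is all it takes to process the full corpus, provided the VM has enough memory to hold the result set.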
To run the script, open a terminal, cd into the data directory, then type:
python enron.py
You can modify these scripts to analyze additional message data. The comments explain what each part does, and also provide instructions for figuring out what else is inside the Enron data using the MySQL client.
The relevant Python data analysis script is twitter.py and the relevant data collection script is datacollector.php. The twitter script uses simplejson to access the fields in the Twitter JSON stream and counts the number of multi-reply (@ to multiple people) as well as multi-retweet (multiple RT in one tweet) messages in the sample data.
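For reference, here's a hedged sketch of the kind of counting twitter.py does. The whitespace-splitting and "RT "-substring heuristics are my assumptions for illustration; the script in the image (which uses simplejson, the predecessor of the stdlib json module) may define multi-reply and multi-retweet differently.

```python
import json  # twitter.py uses simplejson; the stdlib json module is equivalent here

def count_multi(lines):
    """Count tweets that @-mention multiple users and tweets with multiple RTs.

    Each line is expected to be one JSON-encoded tweet with a "text" field,
    which is how the streaming API delivers tweets: one JSON object per line.
    """
    multi_reply = multi_rt = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line).get("text", "")
        mentions = [w for w in text.split() if w.startswith("@")]
        if len(mentions) > 1:
            multi_reply += 1
        if text.count("RT ") > 1:
            multi_rt += 1
    return multi_reply, multi_rt

if __name__ == "__main__":
    sample = [
        '{"text": "@alice @bob thanks!"}',
        '{"text": "RT breaking RT again"}',
        '{"text": "just a tweet"}',
    ]
    print(count_multi(sample))  # -> (1, 1)
```

Since the function just iterates over lines, a combined corpus file can be processed in one pass with count_multi(open('tweetcorpus.txt')).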
To run the Python analysis script, open a terminal, cd into the data directory, and type:
python twitter.py
There’s only a tiny amount of Twitter data in the image as-is. You’ll need to run the datacollector.php script to pull data from Twitter’s streaming API. The script reads the “spritzer” stream, a random, undirected sample that’s a pretty good way to get a bunch of data fast (Twitter also offers a medium-volume feed called the “gardenhose”). I got this script from this streaming API tutorial. You’ll get about 25,000 tweets per hour.
To run it, you will need to open datacollector.php and replace twitterusername with your Twitter account’s user name and twitterpassword with your Twitter password.
Then open a terminal, cd into the data directory, and type:
php datacollector.php
After you’ve run the script long enough to get all the data you want, I recommend that you cat the files together into a single file so the Python script can digest them in one pass. Do this by typing:
cat 20*.txt > tweetcorpus.txt
There’s a lot more information about customizing the stream coming out of the Twitter streaming API, including using search on the front end to restrict the stream at Twitter’s Streaming API Documentation page.
Getting Data out of the Virtual Machine (into Windows)
VirtualBox helpfully provides the ability to share a folder between the guest OS (the Xubuntu image) and the host OS (whatever you’re running – in my case, Windows). To do that as of 11/23/09, click the Devices menu at the top of the VirtualBox window and select Shared Folders. Click the Add button on the right, click the dropdown under Folder Path, and choose Other. Select the folder you want to share and give it a name (I shared my Desktop, so I just called it Desktop). Click OK on both dialog boxes.
Now you need to mount the shared folder, so you can access it in the guest OS. Open a Terminal and type the following (replacing Desktop with the name you chose for the folder you shared):
> sudo mount.vboxsf Desktop /media/windows-share
Now, double-click the File System icon on the Desktop in the VirtualBox image, and double-click the media folder. The shared folder you selected will appear there as windows-share, and you can exchange data with your computer’s regular file system using that folder.
I already set up the VirtualBox image with all the scripts and data you should need to get started, but here are some links in case you need to replicate some of these steps or want to find the original Enron source data.
- My Slides from the Defrag 2009 Conference (Slide 15 links to a bunch of great papers)
- How to load a MySQL dump into a database
- VirtualBox Shared Folder information
- Accessing MySQL from Python (annoying 4-page format, but great tutorial)
- Original Enron Source Data (don’t even bother if you’re running Windows; the folks who put this data up ended all the filenames with a period, so you won’t be able to open any of them)
- Enron Data as a MySQL dump (this is what I loaded into the image)
- Enron Data as an Outlook .PST file (thanks to Pete Warden, founder of Mailana, one of Baydin’s fellow TechStars companies that is also focused on email)
- A bunch of other Enron Data links
December 30, 2009
Thanks for the post and additional useful tools. I thought I might note that as it happens, the EDRM group (http://www.edrm.net) recently released a comprehensive source of the Enron email data including attachments. Folder structures seem well preserved, although generally rich text was lost from these data a long time ago and that’s also true here. It’s rare to be able to grab the corpus with attachments however. These data are available in the form of a 19GB set of downloadable zipped .PST files that becomes about 43GB of data – perfect for Windows users – right here: http://edrm.net/activities/projects/data-set
Group Product Manager – Enterprise Vault
January 12, 2010
That’s fantastic, Nick! Thanks for the link.