Craigslist Personals Part II: A closer look at my methods

My last post was rather vague regarding what I did to scrape the data. I’ll fix that here. I’ll go through the hardware, software, and R scripts/methods/packages I needed to get where I am now. This will serve as a good way to kill time as I wait for data to come in.

Hardware

One of the most important pieces of the puzzle here–the piece that enabled me to start this project–is my server space on Digital Ocean. I pay $10/month for a “droplet” with 1GB of RAM and a 30GB SSD. This little droplet of mine allows me to constantly log data as it is posted, almost in real time.

I also have my personal computer, a late-2011 MacBook Pro, which I use to run my analyses.

Software

My droplet is running Ubuntu 14.04, with WordPress, R, RStudio Server, Shiny Server, Dropbox, and the RSS logger, “canto”. I’ve been using Ubuntu on my MacBook for a while now, so I’m comfortable at the console, which made setup easy. I chose to use Nginx instead of Apache2 because the awesome setup tutorial I was following told me to, but that doesn’t affect any aspect of this project.

Dropbox has native support for Ubuntu, which is really cool. It means I don’t have to use SFTP every time I want to pull files off the server. All my data and code syncs to my personal laptop automatically (and vice versa), so I can start up where I left off, regardless of whether I’m working on my laptop or in RStudio server via a web browser.

Getting canto set up was fairly easy. The online documentation is good enough to get you off and running. It took a while to figure out how to tell canto to never discard your feed items (you have to add the line “never_discard("unread")” to the top of your conf.py configuration file), although my revised scraping methods don’t really need this option.

Scripts/methods

Getting the data – canto

The first part of the task was getting canto to log all the RSS feeds I wanted. It’s pretty easy. Create a file called “conf.py” in the ~/.canto/ directory. That file houses all your configuration options. An excerpt of mine is below.

As I mentioned in my previous post, you can turn any CL search query into an RSS feed by appending “&format=rss” to the URL of the search query.
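
My conf.py ends up looking something like the excerpt below (trimmed way down; the feed names and cities here are just examples, and the real file has one add_feed() line per city/category I follow):

    # ~/.canto/conf.py (excerpt)
    # helpers like never_discard() live in canto.extra
    from canto.extra import *

    # don't throw items away once they fall out of the feed
    never_discard("unread")

    # one feed per city/category; any CL search URL becomes a feed once format=rss is in the query string
    add_feed("sfbay-m4w", "http://sfbay.craigslist.org/search/m4w?format=rss")
    add_feed("sfbay-w4m", "http://sfbay.craigslist.org/search/w4m?format=rss")
    add_feed("newyork-mis", "http://newyork.craigslist.org/search/mis?format=rss")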

The documentation suggests that the canto-daemon logs all new posts and updates the log every minute, but that didn’t seem to be working for me. Every time I ran canto (type “canto” in the console and hit “enter”), it would start the GUI and update the feeds, but I needed it to update constantly–or at least often–without opening the GUI. The utility canto-fetch is what actually grabs the feeds when you call canto, so to fetch RSS data every 10 minutes, just run canto-fetch every 10 minutes. This is easily accomplished with the magic of crontab. I added something like the line below to my crontab and was up and running.
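
    # run canto-fetch every 10 minutes (the path is wherever your system installed it)
    */10 * * * * /usr/bin/canto-fetch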

Awesome.

Processing the data – R

I was initially only reading in metadata from the cryptic text files canto keeps in the ~/.canto/feeds/ directory, but realizing that I could get so much more from the website itself led me to the download.file function and Duncan Temple Lang’s wonderfully useful and woefully confusing XML package.

I had the pleasure of taking a class from DTL during my last year at UC Davis. Although it was by far the most enjoyable class I’ve taken in my life, it had the unfortunate (but not unexpected) side effect of enabling my most debilitating addiction: R.

Data scraping/processing manifested itself in three steps (and thus, three functions).

  1. Pull the links from the RSS metadata
  2. Download the linked pages (if they haven’t already been downloaded)
  3. Convert HTML nastiness into data.frame sweetness

I’ll go through each method individually.
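
1. Pull the links

The idea is simple: everything canto fetches piles up as cryptic text files in ~/.canto/feeds/, and all I need from them is the post URLs. A rough sketch is below (the function name and the URL regex are placeholders, purely for illustration):

    # Pull craigslist post links out of canto's feed files.
    # The URL pattern is a rough guess at what CL post links looked like at the time.
    pull.links <- function(feed.dir = "~/.canto/feeds/") {
      files <- list.files(feed.dir, full.names = TRUE)
      links <- unlist(lapply(files, function(f) {
        txt <- readLines(f, warn = FALSE)
        unlist(regmatches(txt, gregexpr("http://[a-z]+\\.craigslist\\.org/[a-z0-9/]+\\.html", txt)))
      }))
      unique(links)
    }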

2. Download content

This part was also easy. The first three calls to gsub pull the city, post type, and posting ID (that neat string of 10 digits at the end of every post’s URL) out of the URL. I construct a standardized filename for each post based on that info, and then download the file if it doesn’t already exist in my target folder.
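
Stripped down, it looks something like the sketch below (the function name, the output folder, and the filename scheme are placeholders):

    # Download a post's HTML if we don't already have it.
    download.post <- function(url, out.dir = "~/Dropbox/cl/posts") {
      city <- gsub("^http://([a-z]+)\\.craigslist\\.org/.*$", "\\1", url)   # e.g. "sfbay"
      type <- gsub("^.*/([a-z0-9]+)/[0-9]+\\.html$", "\\1", url)            # category right before the ID
      id   <- gsub("^.*/([0-9]+)\\.html$", "\\1", url)                      # the 10-digit posting ID
      dest <- file.path(path.expand(out.dir), paste0(city, "_", type, "_", id, ".html"))
      if (!file.exists(dest)) {
        download.file(url, destfile = dest, quiet = TRUE)
      }
      invisible(dest)
    }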

3. Grab that data!

This is the longest function of the bunch. It isn’t that long, but it requires some explaining. The comments should help. I hope. The bulk of the function is just using HTML tags and attributes to pull out the data I wanted from the downloaded page. There is also some to-do about selections from the “section” section, but that can mostly be ignored. I was just bored.

There was a helper function in there (crg.time) that I skipped over. It is used to figure out what the time zone is and then inform the date of that rather forcefully.
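
To give a sense of its shape, here is a bare-bones sketch. The XPath selectors are stand-ins for whatever tags and attributes CL’s post pages actually use, and the real function also runs the posting date through crg.time:

    library(XML)

    # Turn one downloaded post into a one-row data.frame.
    # The XPath expressions here are placeholders for CL's actual markup.
    scrape.post <- function(file) {
      doc    <- htmlParse(file)
      title  <- xpathSApply(doc, "//h2[@class='postingtitle']", xmlValue)
      body   <- xpathSApply(doc, "//section[@id='postingbody']", xmlValue)
      posted <- xpathSApply(doc, "//time", xmlGetAttr, "datetime")
      data.frame(title  = if (length(title))  title[1]  else NA,
                 body   = if (length(body))   body[1]   else NA,
                 posted = if (length(posted)) posted[1] else NA,
                 stringsAsFactors = FALSE)
    }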

Putting it all together

Now that we’ve got canto running every 10 minutes to pull our data, and we’ve got our functions to scrape that data, let’s make our Digital Ocean droplet start doing something useful for once. I wrote up a single R script that loads the packages needed, pulls in the links, creates some data.frames, combines them into one data.frame, saves that data, and then deletes the used feeds.
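
A sketch of that script, using the placeholder function names from the sketches above (the file paths are made up too):

    # scrape_script.R -- run by cron every 10 minutes
    library(XML)
    source("~/Dropbox/cl/cl_functions.R")   # pull.links(), download.post(), scrape.post()

    # pull the post links out of canto's feed files and download anything new
    links <- pull.links("~/.canto/feeds/")
    files <- sapply(links, download.post)

    # scrape each page into a one-row data.frame and stack them up
    new.posts <- do.call(rbind, lapply(files, scrape.post))

    # append to what's already been collected, dropping duplicates
    out <- "~/Dropbox/cl/cl_data.rds"
    posts <- if (file.exists(out)) rbind(readRDS(out), new.posts) else new.posts
    posts <- posts[!duplicated(posts), ]
    saveRDS(posts, out)

    # delete the used feed files so the next run starts clean
    file.remove(list.files("~/.canto/feeds/", full.names = TRUE))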

Cool. I’ll name that script something like scrape_script.R, put it in my home directory (or better yet, keep it in my Dropbox and make a symlink to my home directory), and R CMD BATCH that shit.
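
Which is to say, roughly (the Dropbox path here is made up):

    ln -s ~/Dropbox/cl/scrape_script.R ~/scrape_script.R
    R CMD BATCH ~/scrape_script.R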

But I want to do all of that every 10 minutes, so I edit my crontab to look something like this:
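
    # fetch the feeds, nap for a second, then scrape (binary paths are placeholders)
    */10 * * * * /usr/bin/canto-fetch; sleep 1; R CMD BATCH ~/scrape_script.R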

Now canto fetches my feeds, takes a 1-second nap, and then scrapes the links in those feeds for fun data. I just get to sit back, relax, and write dumb blog posts while the data rolls in…

Closing rants:

I set canto to grab stuff every ten minutes. This might seem like overkill since the RSS feed hangs around longer than that, but I did it for two reasons. The first is, I don’t know how the internet works. I didn’t want to put undue strain on the CL servers by barraging them with requests every 12 hours. Maybe they wouldn’t even notice. But if they did, would they wonder why the same IP address is viewing 60,000 CL posts/day? I would certainly hope they would. So I guess I wanted to err on the side of caution. Also, the CL RSS feeds only maintain 150 posts at a time, so every time I pull links from the feeds, I only get 150/city/category. I’d rather pull often and then delete any duplicates than pull less often and miss some posts.

As I’m writing this, I’m starting to worry that 10 minutes isn’t often enough for the big cities. Am I missing stuff? Maybe I should drop it down to 5 minutes. If I start grabbing data more often and the amount of data I get increases, that means I’m missing data now, and the time series will be useless because every day will have the same number of posts. Shit. I’m going to increase the frequency. Of course, now I can’t use the past week as part of the time series, because I might be altering the number of posts coming in, and that will mess everything up. I’ll look into this more.