I’m a bit lonely in my new city and I was thinking about the best ways to meet new people. Craigslist did not cross my mind as a good (or safe or socially acceptable) way of meeting people with interests similar to mine, but I did start to think about the wealth of data available there and what could be pulled from that data. So instead of stepping out of my comfort zone and trying to meet new people, I decided to hole up and work on making my sparse blog a little less sparse (and a little more interesting).
My first thought for the data was obviously to see if I could find trends regarding when certain types of posts increase in frequency. For example, does the frequency of m4m casual encounter posts increase when the Republican National Convention is in town? My money says that it does, but we’ll have to wait until 2016 to find that out.
Other ideas started popping into my head fairly quickly and needless to say, I will have a lot to look at. This will probably end up being a series of posts that get more interesting as I log more data. For now we have to start small.
Getting the data
Craigslist doesn’t have an API for their data, presumably because they don’t want spammers flooding their pages with posts about cheap bottles of Viagra. But that makes it a bit difficult to scrape their data as well. They also don’t seem to have a formal RSS feed. I guess. But I know nothing about RSS. Some googling turned up the nice bit of info that you can append “?&format=rss” to the end of any search query and you will get a nice log of everything posted in the last couple minutes. So I guess that means they do have an RSS feed, they just don’t publicize it. A useful little command line tool, canto, allows one to save specified RSS feeds to a folder. So I set that up on my server over at DigitalOcean to grab all the posts I wanted for these cities:
- Little Rock, AR
- Phoenix, AZ
- Sacramento, CA
- Philadelphia, PA
- Chicago, IL
- Portland, OR
- Madison, WI
- Washington DC
- Austin, TX
- Dallas, TX
- San Diego, CA
I’m from Sacramento, my girlfriend is from Little Rock, I currently live in Madison, and everything else was chosen somewhat at random. I wanted a mix of large- and medium-sized cities that would have conferences, big name music stars, and festivals coming through regularly. San Diego is in there because Comic-Con will be hosted there July 9th – 12th in 2015.
The posts I’m following are:
- Casual Encounters
- Strictly Platonic
- Sporting Goods for sale
- Computers for sale
- Musical instruments for sale
- Tools for sale
- Misc. services for hire
Casual encounters was chosen as the main subject of my analyses because those posts are absolutely hilarious (no offense to the readers who frequent that section) and a rather interesting foray into a world that most people have never experienced. Strictly platonic was my next feed to follow because I wanted to see the cities that had the highest proportion of lonely people (with the assumption that people who go on craigslist for platonic relationships really have hit rock bottom, once again no offense to readers who frequent that section as well).
The remaining “for sale” categories and miscellaneous services for hire are of absolutely no interest to me, but I need some controls. My hope is that, in aggregate, the for sale sections won’t fluctuate in post frequency by season, so they will give a baseline for non-social CL transactions. The misc. services will hopefully also not fluctuate in frequency too much by season and may give a baseline for quasi-social transactions whose motivations are not sex or friendship.
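For the curious, the grid of feeds could be generated with something like the sketch below. The “?&format=rss” suffix is the trick described earlier; the subdomains, the /search/ path, and the three-letter category codes are assumptions for illustration and should be checked against the actual site.

```r
# Hypothetical subdomains and category codes; verify them before relying on this.
cities <- c("sacramento", "littlerock", "madison")
categories <- c("cas", "stp", "sga")

# Append the RSS query string to a search URL.
as_rss <- function(search_url) paste0(search_url, "?&format=rss")

# One row per city/category pair, with the feed URL alongside.
grid <- expand.grid(city = cities, category = categories,
                    stringsAsFactors = FALSE)
grid$url <- as_rss(sprintf("http://%s.craigslist.org/search/%s",
                           grid$city, grid$category))
```

Each row of grid is then one feed that a reader such as canto could be pointed at.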
Formatting the data
Raw text data from the canto RSS feeds is available, but it looks disgusting. This is a small excerpt of what a raw text file looks like (I edited out a few things to keep it G rated).
Of course I can read it in with readLines, so I read in a list of every raw feed in the ~/.canto/feeds/ directory and lapplyed that shit until I had a nice list of a bunch of ridiculous nonsense. Nestled in that nonsense are time-stamps of posting, which can be pulled out with a simple grep. Before you know it, I have a big list of POSIXct fun. It took a bit of juggling with time zones to get everything lined up (I had to resort to using the lubridate package, which is actually quite nice. Go figure).
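The time-stamp step might look roughly like this. It is a sketch in base R rather than lubridate, and it assumes the feeds carry ISO 8601 stamps such as 2015-05-17T14:32:10-05:00. The sample lines (including the dc:date tag) are invented stand-ins for what readLines would return from a real feed file.

```r
# Invented stand-in for the lines readLines() would return from one feed file.
raw <- c(
  "<title>some post</title>",
  "<dc:date>2015-05-17T14:32:10-05:00</dc:date>",
  "<dc:date>2015-05-17T12:05:00-07:00</dc:date>"
)

# Assumed stamp layout: ISO 8601 with a numeric UTC offset.
pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[+-][0-9]{2}:[0-9]{2}"
hits <- grep(pattern, raw, value = TRUE)
stamps_chr <- regmatches(hits, regexpr(pattern, hits))

# strptime's %z expects "-0500", not "-05:00", so drop the colon in the offset.
stamps_chr <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", stamps_chr)
stamps <- as.POSIXct(stamps_chr, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")

# Everything now sits on one clock (UTC); local time is a display choice.
format(stamps, "%Y-%m-%d %H:%M", tz = "America/Chicago", usetz = TRUE)
```

Parsing everything to UTC first and converting to local time only at the end keeps posts from different cities comparable.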
It took a bit more juggling to get it into a usable dataset, but long story short, I pulled out the city and posting type from the file name, made those into columns of a data.frame along with the time-stamp for each feed, and then used the magic of rbind to bind them all together.
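A minimal sketch of that stacking step is below. The "<city>_<type>" naming scheme and the sample time-stamps are hypothetical; the real canto file names may be laid out differently.

```r
# Hypothetical: one vector of parsed time-stamps per feed, named "<city>_<type>".
feeds <- list(
  madison_cas = as.POSIXct(c("2015-05-17 01:10:00", "2015-05-17 02:45:00"),
                           tz = "UTC"),
  chicago_stp = as.POSIXct("2015-05-17 03:05:00", tz = "UTC")
)

# Split the feed name into city and posting type, one row per time-stamp.
feed_to_df <- function(name, stamps) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  data.frame(city = parts[1], type = parts[2], posted = stamps,
             stringsAsFactors = FALSE)
}

# Build one data.frame per feed, then stack them with rbind.
posts <- do.call(rbind, unname(Map(feed_to_df, names(feeds), feeds)))
rownames(posts) <- NULL
```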
I have another dataset that is wider than that and just lists the frequency of posts for each date/city/post type. Those two datasets will probably be all I really need to jump in for now I think.
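The frequency dataset could be derived from the long one along these lines. This is only a sketch with made-up rows, and the column names are assumptions.

```r
# Made-up long-format data: one row per logged post.
posts <- data.frame(
  city = c("madison", "madison", "chicago", "madison"),
  type = c("cas", "cas", "stp", "stp"),
  posted = as.POSIXct(c("2015-05-17 01:10:00", "2015-05-17 02:45:00",
                        "2015-05-17 03:05:00", "2015-05-18 09:00:00"),
                      tz = "UTC"),
  stringsAsFactors = FALSE
)

# Collapse the time-stamps to calendar dates, then count posts per
# date/city/type combination.
posts$date <- as.Date(posts$posted, tz = "UTC")
posts$n <- 1L
counts <- aggregate(n ~ date + city + type, data = posts, FUN = sum)
```

Note that aggregate only returns combinations that actually occur, so days with zero posts would need to be filled in separately.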
Some initial exploring
In the first 24 hours or so of data collection, I had logged 34,022 different posts, from 4 posting types and 10 cities. Since then I’ve added San Diego into the mix, as well as tools, instruments, and computers for sale as further controls.
The majority of posts at this time were from casual encounters (66.77%, n=22715), followed by sporting goods (15.36%, n=5225), misc. services for hire (12.55%, n=4269), and last but not least, strictly platonic (5.33%, n=1813). This huge discrepancy between the number of casual encounter posts and everything else is what motivated me to grab more control data. One reason for this large discrepancy, besides the fact that people want anonymous sex, is that I started logging casual encounters a bit before everything else. That being said, if you look at a period where everything is being logged, casual encounters stays ahead by a fair margin.
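Those shares follow directly from the counts quoted above; the short check below just redoes the arithmetic (the vector names are mine).

```r
# Post counts by type from the first ~24 hours of logging.
counts <- c(casual_encounters = 22715, sporting_goods = 5225,
            misc_services = 4269, strictly_platonic = 1813)

total <- sum(counts)                   # 34022 posts in all
pct <- round(100 * counts / total, 2)  # 66.77, 15.36, 12.55, 5.33
```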
To start getting a handle on the data, I looked at pure counts of the number of posts logged in the 24 hours of data that I had. Figure 1, below, shows the number of CL posts to either casual encounters or strictly platonic by city.
This plot doesn’t tell us very much because raw counts don’t take into account the sheer population differences in these cities. So the next course of action was to pull in population sizes and plot the percentage of the population making casual encounter and strictly platonic posts. See Figure 2, below.
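The normalization itself is a one-liner. The sketch below uses placeholder numbers only (neither the post counts nor the populations are real), and it treats every post as a separate person, which overcounts anyone who posts more than once.

```r
# Placeholder values for illustration; not real counts or populations.
city_counts <- data.frame(
  city = c("city_a", "city_b"),
  posts = c(1200, 300),
  population = c(2000000, 200000)
)

# Posts as a percentage of population.
city_counts$pct_posting <- 100 * city_counts$posts / city_counts$population
```

In this toy example the smaller city has a quarter of the posts but a higher per-capita rate, which is exactly the kind of reversal raw counts hide.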
It is hard to know what to make of this. Just a day’s worth of data isn’t enough to determine whether Little Rock, AR really is almost as kinky as Philly and Chicago. But I’d really like to think so. I’ll be able to determine that conclusively once more data rolls in.
The next step is to know what time of day people generally post their nonsense. I’m not a huge fan of ggplot2 (I feel that lattice plots look more professional), but I thought a violin plot was more appropriate and concise than a bunch of histograms, so ggplot2::geom_violin it was. I explored the use of the vioplot package but it didn’t work quite the way I wanted. Anyway, I lined up each city by its local time and took a look at just posts from 8pm May 16th – 8pm May 17th 2015. I’m really excited about what I found. See Figure 3, below.
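A rough sketch of the local-time alignment and the violin plot follows. The posts are made up, the column names and the local_hour helper are hypothetical, and the plot is only built if ggplot2 is installed.

```r
# Hour of day (as a decimal) on the local clock of a given time zone.
local_hour <- function(stamp, tz) {
  lt <- as.POSIXlt(stamp, tz = tz)
  lt$hour + lt$min / 60
}

# Made-up posts stored in UTC, each tagged with its city's time zone.
posts <- data.frame(
  city = c("chicago", "chicago", "portland", "portland"),
  tz = c("America/Chicago", "America/Chicago",
         "America/Los_Angeles", "America/Los_Angeles"),
  posted = as.POSIXct(c("2015-05-17 18:30:00", "2015-05-17 20:15:00",
                        "2015-05-17 21:00:00", "2015-05-17 23:45:00"),
                      tz = "UTC"),
  stringsAsFactors = FALSE
)

# Convert zone by zone, since a POSIXlt conversion takes a single tz.
posts$hour <- NA_real_
for (z in unique(posts$tz)) {
  rows <- posts$tz == z
  posts$hour[rows] <- local_hour(posts$posted[rows], z)
}

# One violin per city, with local hour of day along the value axis.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(posts, ggplot2::aes(x = city, y = hour)) +
    ggplot2::geom_violin() +
    ggplot2::coord_flip()
}
```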
How cool is that?? A very similar pattern in posting times across cities and post type. Sorry about the color discrepancies between the barplots and the violin plots. I’ll work on it.
I started working on these plots on May 17th around 7pm and finished them up around midnight. I decided to limit the plot just to 24 hours because in another 24 hours, I could plot two days worth of data and by midnight, I had started to suspect that the pattern would repeat. I was fidgeting in my chair all day at work on the 18th. Couldn’t wait to get home and see what happened in my absence! I might be a data nerd. And once I made it home…
Away to my room I flew with a fervor, tore open my laptop case, and logged into my server! A dialogue box appeared, showing me emails unchecked, but I clicked that away, once my ssh connected. When what, to my wondering eyes should materialize, but 18 more hours worth of data for me to analyze. With a few little keystrokes, so lively and quick, I downloaded the data, just a couple of clicks. Slower than molasses, my connection became, so I grumbled and pouted and cursed ISPs by name: No Comcast! No AT&T! No Verizon!, or AOL! Don’t throttle my connection I shouted like hell! I leaned back my head, and was researching Linux when a little notification, told me my download had finished!
I said not a word, but went straight to my script, booted up RStudio, and useless lines of data, I stripped. And laying down two fingers, one on return, one on control, I submitted my code to the console.
And what do you know, Christmas came early! The same pattern I had seen the day before had repeated itself (Figure 4). You’ll also notice that I added in some new cities. One can never have enough data, you know?
This is exciting! There are some consistent patterns here. It looks like people really don’t like posting at 5am. How bizarre… 1-5pm seems like the most common time to post about how lonely or desperate one might be. Keep that in mind, dear reader.
Note: Data collection in the manner described in this post has ceased. As excited as I am about analyzing meta-data, I realized that at the same time I could be collecting some real data as well. The RSS feeds contain the link to each post, so I’ve switched gears: I’m downloading the full HTML from each post with download.file and with the wonderfully confusing XML package I’m grabbing the post title, date, category and subcategory, posting gender, target gender, and the full text of the post. As of May 23rd, 2015, I have 63,616 posts logged from the 11 cities. I’ve added missed connections and miscellaneous romance to my collections and removed tools, instruments, and computers for sale. I can’t wait to start doing some real digging! I can still start off with my meta data analyses, but now I can do some good old fashioned text mining too! (Among many other fun things)
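As a rough idea of what that parsing looks like, here is a sketch that runs on a made-up page instead of a real download. The postingbody id is a guess at the markup, not something confirmed here, and the block only does anything if the XML package is installed.

```r
# Made-up page standing in for a file saved with download.file(url, destfile).
html <- paste0(
  "<html><head><title>Example title</title></head>",
  "<body><div id='postingbody'>Example body text</div></body></html>"
)

post <- NULL
if (requireNamespace("XML", quietly = TRUE)) {
  # Parse the HTML string, then pull out fields with XPath queries.
  doc <- XML::htmlParse(html, asText = TRUE)
  post <- list(
    title = XML::xpathSApply(doc, "//title", XML::xmlValue),
    # Hypothetical id for the element holding the post text.
    body = XML::xpathSApply(doc, "//div[@id='postingbody']", XML::xmlValue)
  )
}
```

The other fields (date, category, genders) would need their own XPath expressions once the real page structure is known.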
To be continued…