Web scraping with R

Ian Kyle

July 15th, 2015

Outline

www.ikkyle.com/webscraping_with_r.html

What is webscraping?

In real life, we are constantly collecting data

No packages on CRAN that can access our memories (yet)

On the internet, everything is recorded and saved

Because it is recorded and saved, with the right tools, we can access it and quantify it

What can be scraped?

“If you can see it, you can scrape it”

Anything in a webpage:

What can’t be scraped?

Server side code and databases

That’s it.

(Unless you have an API)

Format of web data

HTML

HTML displays data

<ul>
    <li>Item 1</li>
    <li>Item 2</li>
</ul>

looks like:

  • Item 1
  • Item 2

The tag “<ul>” indicates an unordered list and “<li>” indicates a list item. A closing tag, which starts with a forward slash (“</ul>”, “</li>”), ends the most recently opened tag of that type.

HTML gets more complicated, but understanding tags is enough to get started with scraping.
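To make the idea of tags concrete, here is a minimal sketch that parses the list above from a string and pulls out the text of each list item. It assumes the rvest package is installed; the variable names `doc` and `items` are illustrative only.

```r
library(rvest)

# Parse the example list from an in-memory string (no network access needed)
doc <- read_html("<ul>
    <li>Item 1</li>
    <li>Item 2</li>
</ul>")

# Select every <li> element and keep the text between its opening and closing tags
items <- html_text(html_nodes(doc, "li"))
items
```

Each `<li>` tag becomes one element of the resulting character vector.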

XML

XML stores data

You can navigate through webpage HTML and XML via nodes and the tags, attributes, and classes associated with each node

<my_tag id="my_tag_id" class = "my_tag_class" this_is_an_attribute="attribute_value">Tag value</my_tag>
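As a rough illustration of what "tags, attributes, and values" means in practice, the sketch below parses the node above with the XML package and reads each part back. It assumes the XML package is installed; `doc` and `node` are illustrative names.

```r
library(XML)

# Parse the example node from a string rather than from a file or URL
doc <- xmlParse('<my_tag id="my_tag_id" class="my_tag_class" this_is_an_attribute="attribute_value">Tag value</my_tag>',
                asText = TRUE)
node <- xmlRoot(doc)

tag_name  <- xmlName(node)            # the tag itself
tag_id    <- xmlGetAttr(node, "id")   # a single attribute, looked up by name
tag_attrs <- xmlAttrs(node)           # all attributes as a named character vector
tag_value <- xmlValue(node)           # the text between the opening and closing tags
```

Note that `id` and `class` are ordinary attributes as far as the parser is concerned; they are singled out only because selectors have shorthand for them.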

HTML/XML navigation

More on this later

JSON

Used to store data too complex (for example, nested records) to fit into HTML tables

JSON is fairly easy to understand and navigate with the jsonlite package

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ],
  "gender": {
    "type": "male"
  }
}
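A minimal sketch of navigating this kind of record with jsonlite, assuming the package is installed. The JSON is a shortened copy of the example above, and `json_text`, `person`, and `fax_number` are illustrative names.

```r
library(jsonlite)

json_text <- '{
  "firstName": "John",
  "address": {"city": "New York", "state": "NY"},
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax",  "number": "646 555-4567"}
  ]
}'

person <- fromJSON(json_text)

person$firstName      # a top-level value
person$address$city   # nested objects become nested lists
person$phoneNumber    # an array of objects is simplified to a data frame

# Ordinary data frame indexing then works on the simplified array
fax_number <- person$phoneNumber$number[person$phoneNumber$type == "fax"]
```

By default `fromJSON()` simplifies arrays of similar objects into data frames, which is usually the most convenient form for analysis.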

How do we scrape it?

Types of scraping tools:

Packages to grab

Interfaces to libcurl

RCurl

curl

httr

Using httr to get some data

1. Get a response object

library(httr)
r <- GET('http://www.r-project.org/')
r # Status: 200 indicates a successful connection
## Response [http://www.r-project.org/]
##   Date: 2015-07-09 23:10
##   Status: 200
##   Content-Type: text/html
##   Size: 4.84 kB
## <!DOCTYPE html>
## <html lang="en">
##   <head>
##     <meta charset="utf-8">
##     <meta http-equiv="X-UA-Compatible" content="IE=edge">
##     <meta name="viewport" content="width=device-width, initial-scale=1">
##     <title>R: The R Project for Statistical Computing</title>
## 
##     <link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="3...
##     <link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="1...
## ...
str(r)
## List of 9
##  $ url        : chr "http://www.r-project.org/"
##  $ status_code: int 200
##  $ headers    :List of 9
##   ..$ date            : chr "Fri, 10 Jul 2015 04:10:03 GMT"
##   ..$ server          : chr "Apache/2.2.22 (Debian)"
##   ..$ last-modified   : chr "Wed, 01 Jul 2015 13:10:02 GMT"
##   ..$ etag            : chr "\"b211c5-12e6-519d00a6a1cbe\""
##   ..$ accept-ranges   : chr "bytes"
##   ..$ vary            : chr "Accept-Encoding"
##   ..$ content-encoding: chr "gzip"
##   ..$ content-length  : chr "1825"
##   ..$ content-type    : chr "text/html"
##   ..- attr(*, "class")= chr [1:2] "insensitive" "list"
##  $ all_headers:List of 1
##   ..$ :List of 3
##   .. ..$ status : int 200
##   .. ..$ version: chr "HTTP/1.1"
##   .. ..$ headers:List of 9
##   .. .. ..$ date            : chr "Fri, 10 Jul 2015 04:10:03 GMT"
##   .. .. ..$ server          : chr "Apache/2.2.22 (Debian)"
##   .. .. ..$ last-modified   : chr "Wed, 01 Jul 2015 13:10:02 GMT"
##   .. .. ..$ etag            : chr "\"b211c5-12e6-519d00a6a1cbe\""
##   .. .. ..$ accept-ranges   : chr "bytes"
##   .. .. ..$ vary            : chr "Accept-Encoding"
##   .. .. ..$ content-encoding: chr "gzip"
##   .. .. ..$ content-length  : chr "1825"
##   .. .. ..$ content-type    : chr "text/html"
##   .. .. ..- attr(*, "class")= chr [1:2] "insensitive" "list"
##  $ cookies    : list()
##  $ content    : raw [1:4838] 3c 21 44 4f ...
##  $ date       : POSIXct[1:1], format: "2015-07-09 23:10:03"
##  $ times      : Named num [1:6] 0 0.509 0.683 0.683 0.831 ...
##   ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
##  $ request    :List of 5
##   ..$ handle:List of 2
##   .. ..$ handle:Formal class 'CURLHandle' [package "RCurl"] with 1 slot
##   .. .. .. ..@ ref:<externalptr> 
##   .. ..$ url   :List of 9
##   .. .. ..$ scheme  : chr "http"
##   .. .. ..$ hostname: chr "www.r-project.org"
##   .. .. ..$ port    : NULL
##   .. .. ..$ path    : chr ""
##   .. .. ..$ query   : NULL
##   .. .. ..$ params  : NULL
##   .. .. ..$ fragment: NULL
##   .. .. ..$ username: NULL
##   .. .. ..$ password: NULL
##   .. .. ..- attr(*, "class")= chr "url"
##   .. ..- attr(*, "class")= chr "handle"
##   ..$ writer:List of 1
##   .. ..$ buffer: NULL
##   .. ..- attr(*, "class")= chr [1:2] "write_memory" "write_function"
##   ..$ method: chr "GET"
##   ..$ opts  :List of 7
##   .. ..$ followlocation: logi TRUE
##   .. ..$ maxredirs     : int 10
##   .. ..$ encoding      : chr "gzip"
##   .. ..$ useragent     : chr "curl/7.35.0 Rcurl/1.95.4.6 httr/0.6.1"
##   .. ..$ httpheader    : Named chr "application/json, text/xml, application/xml, */*"
##   .. .. ..- attr(*, "names")= chr "Accept"
##   .. ..$ customrequest : chr "GET"
##   .. ..$ url           : chr "http://www.r-project.org/"
##   .. ..- attr(*, "class")= chr "config"
##   ..$ body  : NULL
##  - attr(*, "class")= chr "response"
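Rather than reading the status off the printed response, the check can be done in code. The sketch below is one possible way to do that with httr; `get_ok` is a hypothetical helper name, not something from the slides or from httr itself.

```r
library(httr)

# Hypothetical helper: request a URL and stop with an R error unless it succeeded
get_ok <- function(url) {
  r <- GET(url)
  stop_for_status(r)   # turns 4xx/5xx responses into R errors
  r
}

# Usage (requires network access, so it is left commented out here):
# r <- get_ok("http://www.r-project.org/")
# status_code(r)                    # the numeric HTTP status
# headers(r)[["content-type"]]      # a single response header
```

Failing early like this is helpful when looping over many pages, since a bad response is caught before any parsing is attempted.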

2. Get the webpage content

This gives us the full text of the webpage (“what we see”) in a character vector of length 1

page <- content(r, 'text')
cat(page)
## <!DOCTYPE html>
## <html lang="en">
##   <head>
##     <meta charset="utf-8">
##     <meta http-equiv="X-UA-Compatible" content="IE=edge">
##     <meta name="viewport" content="width=device-width, initial-scale=1">
##     <title>R: The R Project for Statistical Computing</title>
## 
##     <link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32" />
##     <link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16" />
## 
##     <!-- Bootstrap -->
##     <link href="/css/bootstrap.min.css" rel="stylesheet">
##     <link href="/css/R.css" rel="stylesheet">
## 
##     <!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
##     <!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
##     <!--[if lt IE 9]>
##       <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
##       <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
##     <![endif]-->
##   </head>
##   <body>
##     <div class="container page">
##       <div class="row">
##         <div class="col-xs-12 col-sm-offset-1 col-sm-2 sidebar" role="navigation">
## <div class="row">
## <div class="col-xs-6 col-sm-12">
## <p><a href="/"><img src="/Rlogo.png" width="100" height="78" alt = "R" /></a></p>
## <p><small><a href="/">[Home]</a></small></p>
## <h2>Download</h2>
## <p><a href="http://cran.r-project.org/mirrors.html">CRAN</a></p>
## <h2>R Project</h2>
## <ul>
## <li><a href="/about.html">About R</a></li>
## <li><a href="/contributors.html">Contributors</a></li>
## <li><a href="/news.html">What’s New?</a></li>
## <li><a href="/mail.html">Mailing Lists</a></li>
## <li><a href="http://bugs.R-project.org">Bug Tracking</a></li>
## <li><a href="/conferences.html">Conferences</a></li>
## <li><a href="/search.html">Search</a></li>
## </ul>
## </div>
## <div class="col-xs-6 col-sm-12">
## <h2>R Foundation</h2>
## <ul>
## <li><a href="/foundation/">Foundation</a></li>
## <li><a href="/foundation/board.html">Board</a></li>
## <li><a href="/foundation/members.html">Members</a></li>
## <li><a href="/foundation/donors.html">Donors</a></li>
## <li><a href="/foundation/donations.html">Donate</a></li>
## </ul>
## <h2>Documentation</h2>
## <ul>
## <li><a href="http://cran.r-project.org/manuals.html">Manuals</a></li>
## <li><a href="http://cran.r-project.org/faqs.html">FAQs</a></li>
## <li><a href="http://journal.r-project.org">The R Journal</a></li>
## <li><a href="/doc/bib/R-books.html">Books</a></li>
## <li><a href="/certification.html">Certification</a></li>
## <li><a href="/other-docs.html">Other</a></li>
## </ul>
## <h2>Links</h2>
## <ul>
## <li><a href="http://www.bioconductor.org">Bioconductor</a></li>
## <li><a href="/other-projects.html">Related Projects</a></li>
## </ul>
## </div>
## </div>
##         </div>
##         <div class="col-xs-12 col-sm-7">
##         <h1>The R Project for Statistical Computing</h1>
## <h2 id="getting-started">Getting Started</h2>
## <p>R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To <strong><a href="http://cran.r-project.org/mirrors.html">download R</a></strong>, please choose your preferred <a href="http://cran.r-project.org/mirrors.html">CRAN mirror</a>.</p>
## <p>If you have questions about R like how to download and install the software, or what the license terms are, please read our <a href="http://cran.R-project.org/faqs.html">answers to frequently asked questions</a> before you send an email.</p>
## <h2 id="news">News</h2>
## <ul>
## <li><p><a href="http://journal.r-project.org"><strong>The R Journal Volume 7/1</strong></a> is available.</p></li>
## <li><p><a href="http://cran.r-project.org/src/base/R-3"><strong>R version 3.2.1 (World-Famous Astronaut)</strong></a> has been released on 2015-06-18.</p></li>
## <li><p><a href="http://cran.r-project.org/src/base/R-3"><strong>R version 3.1.3 (Smooth Sidewalk)</strong></a> has been released on 2015-03-09.</p></li>
## <li><p><strong><a href="http://www.r-project.org/useR-2015">useR! 2015</a></strong>, will take place at the University of Aalborg, Denmark, June 30 - July 3, 2015.</p></li>
## <li><p><strong><a href="http://www.r-project.org/useR-2014">useR! 2014</a></strong>, took place at the University of California, Los Angeles, USA June 30 - July 3, 2014.</p></li>
## </ul>
## <!--- (Boilerplate for release run-in)
## -   [**R 3.1.3 (Smooth Sidewalk) prerelease versions**](http://cran.r-project.org/src/base-prerelease) will appear starting February 28. Final release is scheduled for 2015-03-09. 
## -->
##         </div>
##       </div>
##       <div class="raw footer">
##         &copy; The R Foundation.
##       </div>
##     </div>
##     <!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
##     <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
##     <!-- Include all compiled plugins (below), or include individual files as needed -->
##     <script src="/js/bootstrap.min.js"></script>
##   </body>
## </html>

Packages to extract

XML

rvest

XML and rvest: use case

What is the latest R news?

</div>
</div>
<div class="col-xs-12 col-sm-7">
   <h1>The R Project for Statistical Computing</h1>
   <h2 id="getting-started">Getting Started</h2>
   <p>R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To <strong><a href="http://cran.r-project.org/mirrors.html">download R</a></strong>, please choose your preferred <a href="http://cran.r-project.org/mirrors.html">CRAN mirror</a>.</p>
   <p>If you have questions about R like how to download and install the software, or what the license terms are, please read our <a href="http://cran.R-project.org/faqs.html">answers to frequently asked questions</a> before you send an email.</p>
   <h2 id="news">News</h2>
   <ul>
      <li>
         <p><a href="http://journal.r-project.org"><strong>The R Journal Volume 7/1</strong></a> is available.</p>
      </li>
      <li>
         <p><a href="http://cran.r-project.org/src/base/R-3"><strong>R version 3.2.1 (World-Famous Astronaut)</strong></a> has been released on 2015-06-18.</p>
      </li>
      <li>
         <p><a href="http://cran.r-project.org/src/base/R-3"><strong>R version 3.1.3 (Smooth Sidewalk)</strong></a> has been released on 2015-03-09.</p>
      </li>
      <li>
         <p><strong><a href="http://www.r-project.org/useR-2015">useR! 2015</a></strong>, will take place at the University of Aalborg, Denmark, June 30 - July 3, 2015.</p>
      </li>
      <li>
         <p><strong><a href="http://www.r-project.org/useR-2014">useR! 2014</a></strong>, took place at the University of California, Los Angeles, USA June 30 - July 3, 2014.</p>
      </li>
   </ul>
   <!--- (Boilerplate for release run-in)
      -   [**R 3.1.3 (Smooth Sidewalk) prerelease versions**](http://cran.r-project.org/src/base-prerelease) will appear starting February 28. Final release is scheduled for 2015-03-09. 
      -->
</div>
</div>
<div class="raw footer">
   &copy; The R Foundation.
</div>
</div>
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="/js/bootstrap.min.js"></script>
</body>
</html>

Using rvest

library(rvest)

page_text <- read_html('http://www.r-project.org/') # read_html() replaces html(), which newer rvest versions deprecate
r_news <- html_node(page_text, "h2#news ~ ul")
# "h2#news ~ ul" matches unordered lists that follow the <h2> element with an ID of "news";
# html_node() keeps only the first match
# This is an example using CSS selectors

r_news <- html_text(r_news)
cat(r_news)
## The R Journal Volume 7/1 is available.
## R version 3.2.1 (World-Famous Astronaut) has been released on 2015-06-18.
## R version 3.1.3 (Smooth Sidewalk) has been released on 2015-03-09.
## useR! 2015, will take place at the University of Aalborg, Denmark, June 30 - July 3, 2015.
## useR! 2014, took place at the University of California, Los Angeles, USA June 30 - July 3, 2014.
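The same selector can be tried offline on a small string, which is a convenient way to experiment. The sketch below uses a cut-down copy of the page source shown earlier and extends the selector with `li` so that each news item comes back as its own element; `html_snippet`, `doc`, and `news_items` are illustrative names, and the first list is added only to show that it is not matched.

```r
library(rvest)

html_snippet <- '
<h2 id="getting-started">Getting Started</h2>
<ul><li>Not a news item</li></ul>
<h2 id="news">News</h2>
<ul>
  <li><p>The R Journal Volume 7/1 is available.</p></li>
  <li><p>R version 3.2.1 (World-Famous Astronaut) has been released on 2015-06-18.</p></li>
</ul>'

doc <- read_html(html_snippet)

# "h2#news ~ ul li": every <li> inside a <ul> that comes after <h2 id="news">
# at the same level of the document
news_items <- html_text(html_nodes(doc, "h2#news ~ ul li"))
news_items
```

Because `~` only looks at siblings that come later in the document, the list under "Getting Started" is skipped.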

Using XML

library(XML)

page_text <- htmlTreeParse("http://www.r-project.org/", useInternalNodes=TRUE)
r_news <- xpathSApply(page_text, '//h2[@id="news"]/following-sibling::ul[1]', xmlValue)
# Again, select the first unordered list after the <h2> element with an ID of "news"
# ([1] keeps only the nearest following <ul>)
# This is an example using XPath selectors

cat(r_news)
## The R Journal Volume 7/1 is available.
## R version 3.2.1 (World-Famous Astronaut) has been released on 2015-06-18.
## R version 3.1.3 (Smooth Sidewalk) has been released on 2015-03-09.
## useR! 2015, will take place at the University of Aalborg, Denmark, June 30 - July 3, 2015.
## useR! 2014, took place at the University of California, Los Angeles, USA June 30 - July 3, 2014.
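For comparison, here is an offline sketch of the XPath approach on a small string, assuming the XML package is installed. The snippet is a cut-down copy of the page source with one extra list appended purely to show what `[1]` does; `html_snippet`, `doc`, and `news_items` are illustrative names.

```r
library(XML)

html_snippet <- '
<h2 id="news">News</h2>
<ul>
  <li><p>The R Journal Volume 7/1 is available.</p></li>
  <li><p>R version 3.2.1 (World-Famous Astronaut) has been released on 2015-06-18.</p></li>
</ul>
<h2 id="other">Other</h2>
<ul><li>Not a news item</li></ul>'

# asText = TRUE tells the parser the argument is HTML content, not a file name or URL
doc <- htmlParse(html_snippet, asText = TRUE)

# following-sibling::ul[1] is the nearest <ul> after the heading; /li steps into its items
news_items <- xpathSApply(doc, '//h2[@id="news"]/following-sibling::ul[1]/li', xmlValue)
news_items
```

Without the `[1]`, the expression would also match the list under the second heading.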

Useful tools - view source

Chrome: right click > view source

Useful tools - chrome developer tools

Developer tools

Useful tools - SelectorGadget

SelectorGadget

Benchmark: rvest vs. XML

Extract text from 1000 Craigslist webpages

library(XML)

# strp() is not defined in the slides; it is assumed here to flatten a result
# to character and trim surrounding whitespace
strp <- function(x) gsub('^\\s+|\\s+$', '', unlist(x))

read.cl <- function(file){

  xsite <- htmlTreeParse(file, useInternalNodes=TRUE)
  root <- xmlRoot(xsite)

  # id is not defined in the slides; it is assumed here to be the file name
  id <- basename(file)

  # Posting body: turn periods, newlines, and tabs into spaces, then drop
  # leading/trailing spaces and any remaining punctuation
  msgbody <- xpathApply(root, '//section[@id="postingbody"]', xmlValue)
  msgbody <- sapply(msgbody, function(x){
    x <- gsub('\\.|\\n|\\t', ' ', x)
    gsub("^( )*|( )*$|[^[:alnum:]///' ]", "", x)
  })

  posting_date <- xpathApply(root, '//time', xmlAttrs)[[1]]

  caption <- xpathApply(root, '//title', xmlValue)[[1]]

  sub_location <- strp(xpathApply(root,
                                  '//span[@class="postingtitletext"]/small',
                                  xmlValue))

  # Title without dashes or the "m4w"-style code
  post_title <- gsub('-|[mwt]{1,2}4[mwt]{1,2}', '', caption)

  # Breadcrumb navigation: area, section, and category
  location <- strp(xpathApply(root, '//form/nav/div/ul/li[@class="crumb area"]', xmlValue))
  section <- strp(xpathApply(root, '//form/nav/div/ul/li[@class="crumb section"]', xmlValue))
  subsection <- strp(xpathApply(root, '//form/nav/div/ul/li[@class="crumb category"]', xmlValue))

  # Count images by counting "{" in the script(s) that list the image URLs
  scpts <- xpathApply(root, '//script', xmlValue)
  img_slot <- vapply(scpts, function(x) grepl('http://images\\.craigslist\\.org', x),
                     logical(1))
  img_txt <- unlist(scpts[img_slot])
  n_images <- if (length(img_txt) == 0) 0 else
    sum(lengths(regmatches(img_txt, gregexpr('\\{', img_txt))))

  # Split the "m4w"-style code in the title into its parts
  situation <- gsub('^.*([mwt]{1,2})4([mwt]{1,2}).*$', '\\14\\2',
                    caption, perl=TRUE)

  posting_gender <- gsub('^.*([mwt]{1,2})4([mwt]{1,2}).*$', '\\1',
                         caption, perl=TRUE)

  target_gender <- gsub('^.*([mwt]{1,2})4([mwt]{1,2}).*$', '\\2',
                        caption, perl=TRUE)

  dat <- data.frame(id, location, sub_location, posting_date,
    post_title, section, subsection, situation, posting_gender,
    target_gender, n_images, msgbody)

  return(dat)
}

XML took 54 seconds

rvest took 49 seconds

Certain parts of the webpage I couldn’t figure out how to extract with rvest
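For readers who want to run a comparison of their own, the sketch below shows one way such a timing could be set up with `system.time()`. It is not the benchmark from the slides: the page is a tiny made-up string, only 100 copies are parsed, and `extract_xml` and `extract_rvest` are hypothetical helper names, so the timings will not resemble the numbers above.

```r
library(XML)
library(rvest)

# A tiny stand-in page; real pages would be read from disk or the web
page  <- '<html><body><section id="postingbody">Example posting text</section></body></html>'
pages <- rep(page, 100)

# Extract the posting body with the XML package (XPath selector)
extract_xml <- function(txt) {
  doc <- htmlParse(txt, asText = TRUE)
  xpathSApply(doc, '//section[@id="postingbody"]', xmlValue)
}

# Extract the posting body with rvest (CSS selector)
extract_rvest <- function(txt) {
  html_text(html_node(read_html(txt), "section#postingbody"))
}

time_xml   <- system.time(
  res_xml <- vapply(pages, extract_xml, character(1), USE.NAMES = FALSE))
time_rvest <- system.time(
  res_rvest <- vapply(pages, extract_rvest, character(1), USE.NAMES = FALSE))

time_xml["elapsed"]
time_rvest["elapsed"]
```

Checking that both approaches return the same text before comparing times helps ensure the comparison is like for like.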

HTML tables