I’m tickled pink to announce the release of rvest 1.0.0. rvest is designed to make it easy to scrape (i.e. harvest) data from HTML web pages.
You can install it from CRAN with:
install.packages("rvest")
This release includes two major improvements that make it easier to extract text and tables. I also took this opportunity to tidy up the interface to better match the tidyverse standards that have emerged since rvest was created in 2012. This is a major release that marks rvest as stable. That means we promise to avoid breaking changes as much as possible, and where they are needed, we will provide a significant deprecation cycle.
You can see a full list of changes in the release notes.
New features
It’s been a while since I took a good look at rvest, and the GitHub issues suggested that there were two sources of long-standing frustration with rvest: html_text() and html_table().
html_text() was a source of frustration because it extracts raw text from the underlying HTML. It ignores HTML’s line breaks (i.e. <br>) but preserves non-significant whitespace, making it a pain to use:
library(rvest)

html <- minimal_html(
  "<p>
    This is a paragraph.
    This another sentence.<br>This should start on a new line
  </p>"
)
html %>% html_text() %>% writeLines()
#>
#>     This is a paragraph.
#>     This another sentence.This should start on a new line
#>
The new html_text2() is inspired by JavaScript’s innerText property and uses a handful of heuristics to generate more useful output:
html %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line
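html_text2() is also vectorised over a set of elements, which is how it tends to be used in practice. The following is a minimal sketch rather than anything from the release itself; the list markup is made up for illustration.

```r
library(rvest)

# Hypothetical markup: one item with messy whitespace, one with a <br>
html <- minimal_html("
  <ul>
    <li>  apples   and
      pears </li>
    <li>plums<br>cherries</li>
  </ul>
")

items <- html_elements(html, "li")

# Raw text: whitespace is kept as-is and the <br> is dropped
raw <- html_text(items)

# Browser-like text: whitespace is collapsed and <br> becomes a newline
clean <- html_text2(items)

writeLines(clean)
```

Both functions return one string per matched element, so the results can go straight into a data frame column.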
html_table() was frustrating because it failed on many tables that used row or column spans. I’ve now re-written it from scratch, closely following the algorithm that browsers use. This means that there are far fewer tables for which it fails to produce useful output, and I have deprecated the fill argument because it’s no longer needed.
Here’s a little example with row span, column span, and a missing cell:
html <- minimal_html("<table>
<tr><th>A</th><th>B</th><th>C</th></tr>
<tr><td colspan='2' rowspan='2'>1</td><td>2</td></tr>
<tr><td rowspan='2'>3</td></tr>
<tr><td>4</td></tr>
</table>")
html %>%
  html_element("table") %>%
  html_table()
#> # A tibble: 3 x 3
#>       A     B     C
#>   <int> <int> <int>
#> 1     1     1     2
#> 2     1     1     3
#> 3     4    NA     3
html_table() now returns a tibble rather than a data frame (to be more compatible with the rest of the tidyverse), and its performance has been considerably improved (10x for the motivating example). It also gains new na.strings and convert arguments to better control how NAs and strings are processed. See the docs for more details.
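To give a feel for the two new arguments, here is a small sketch. The table, its column names, and the use of "-" as a missing-value marker are all invented for illustration.

```r
library(rvest)

# Hypothetical table: ids with leading zeros and "-" standing in for missing
html <- minimal_html("<table>
  <tr><th>id</th><th>score</th></tr>
  <tr><td>001</td><td>12</td></tr>
  <tr><td>002</td><td>-</td></tr>
</table>")

tbl <- html_element(html, "table")

# Default: columns are type-converted, so the ids lose their leading zeros
# and score stays character because "-" is not recognised as missing
default_tbl <- html_table(tbl)

# Treat "-" as NA, which lets score be converted to a number
na_tbl <- html_table(tbl, na.strings = c("NA", "-"))

# Skip conversion entirely and keep every cell as a string
raw_tbl <- html_table(tbl, convert = FALSE)
```

Setting convert = FALSE is probably the safer choice when a column holds identifiers such as postal codes, since you can then convert the remaining columns yourself.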
While it’s not a major feature, it’s worth noting that rvest is now much smaller (~100 Kb vs ~1 Mb) thanks to a rewrite of vignette("rvest") and making the SelectorGadget article web-only.
API changes
Since this is the 1.0.0 release, I included a large number of API changes to make rvest more compatible with current tidyverse conventions. Older functions have been deprecated, so existing code will continue to work (albeit with a few new warnings).
- rvest now imports xml2 rather than depending on it. This is cleaner because it avoids attaching all the xml2 functions that you’re probably not going to use. To reduce the chance of breakage, rvest re-exports the xml2 functions read_html() and url_absolute(); if you use other functions, your code will now need an explicit library(xml2).
- html_form() now returns an object with class rvest_form. Fields within a form now have class rvest_field, instead of a variety of classes that were lacking the rvest_ prefix. All functions for working with forms have a common html_form_ prefix, e.g. set_values() became html_form_set().
- html_node() and html_nodes() have been superseded in favor of html_element() and html_elements() since they (almost) always return elements, not nodes. This vocabulary will better match what you’re likely to see when learning about HTML.
- html_session() is now session() and returns an object of class rvest_session. All functions that work with session objects now have a common session_ prefix.
- The long-deprecated html(), html_tag(), and xml() functions have been removed.
- minimal_html() (which doesn’t appear to be used by any other package) has had its arguments flipped to make it more intuitive.
- guess_encoding() has been renamed to html_encoding_guess() to avoid a clash with stringr::guess_encoding(). repair_encoding() was deprecated because it doesn’t appear to have ever worked.
- pluck() is no longer exported to avoid a clash with purrr::pluck(); if you need it, use purrr::map_chr() and friends instead.
- xml_tag(), xml_node(), and xml_nodes() have been formally deprecated in favour of their html_ equivalents.
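As a rough illustration of how a few of these renames look in practice, here is a small offline sketch. The markup, the form id, and the field name q are invented for the example; only the rvest and xml2 function names come from the list above.

```r
library(rvest)

# Hypothetical page with one form and two paragraphs
html <- minimal_html('
  <form id="search" method="post" action="#">
    <input type="text" name="q" />
  </form>
  <p class="note">First</p>
  <p class="note">Second</p>
')

# html_element() (was html_node()) returns the first match;
# html_elements() (was html_nodes()) returns every match
first_note <- html %>% html_element("p.note") %>% html_text2()
all_notes <- html %>% html_elements("p.note") %>% html_text2()

# xml2 is no longer attached by library(rvest), so other xml2 functions
# need an explicit namespace (or a library(xml2) call)
tag <- xml2::xml_name(html_element(html, "p.note"))

# html_form() on a document returns a list of rvest_form objects
form <- html_form(html)[[1]]

# html_form_set() replaces set_values()
filled <- html_form_set(form, q = "rvest")
filled$fields$q$value
```

The session functions follow the same pattern (session() instead of html_session()), but they need a live URL so they are left out of this sketch.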
Acknowledgements
A big thanks to all the folks who helped make this release possible through their issues, comments, and pull requests 😄
@13kay, @adam52, @AgnieszkaTomczyk, @ahaseemkunjucl, @akshaynagpal, @AlanMex1990, @alex23lemm, @amjiuzi, @antoine-lizee, @arilamstein, @artemklevtsov, @batpigandme, @bbrewington, @bedantaguru, @bramtayl, @brshallo, @charleswg, @christopherhastings, @chuchu89, @conjugateprior, @cpsievert, @craigcitro, @cranknasty, @cungbac, @curtisalexander, @cwickham, @data-steve, @dbuijs, @Deleetdk, @dholstius, @DiegoKoz, @dmi3kno, @dpprdan, @englianhu, @etabeta78, @ethanbsmith, @flpezet, @garrettgman, @georgevbsantiago, @geotheory, @ghost, @gokceneraslan, @gunawebs, @hadley, @happyshows, @hauj12123, @HBossier, @hemans, @higgi13425, @himanshudhingra, @hsancen, @ignotus0001, @ilarischeinin, @IndrajeetPatil, @iProcrastinate, @jaanos, @JackWilb, @JakeRuss, @jamjaemin, @javrucebo, @jeffisabelle, @jeroen, @jeroenjanssens, @jgilfillan, @jimhester, @jjchern, @jl5000, @jlewis91, @jmgirard, @johncollins, @JohnMount, @jonathan-g, @Jonathanyni, @joranE, @joshualeond, @jpmarindiaz, @jrnold, @jrosen48, @juba, @jubjubbc, @jullybobble, @kendonB, @kevin199011, @kevinrue, @kiernann, @kjschaudt, @ktaylora, @ktmud, @kurtis14, @leoluyi, @LeslieTse, @lifan0127, @litao1105, @magic-lantern, @MarcinKosinski, @markdanese, @MichaelChirico, @mikegros, @mikemc, @MislavSag, @mitchelloharawild, @mobcdi, @Monduiz, @moodymudskipper, @mrchypark, @MrFlick, @msberends, @msgoussi, @myliserta, @mzorgdrager, @nalimilan, @neilfws, @NicolasRuth, @nitishgupta4291, @noamross, @np2201, @npjc, @oguzhanogreden, @OmarGonD, @oNIenSis, @oriolmirosa, @Osc2wall, @petermeissner, @petrbouchal, @PritishDsouza, @PriyaShaji, @pssguy, @qpmnguyen, @r2evans, @rafaminos, @ramnathv, @renkun-ken, @rentrop, @richierocks, @rjpat, @romainfrancois, @rpalsaxena, @salauer, @SamoPP, @san1289, @sco-lo-digital, @seasmith, @sfirke, @sillasgonzaga, @slowkow, @smach, @smbache, @stenevang, @StephaneKazmierczak, @stevecondylios, @swiftsam, @swishderzy, @targeteer, @tbates, @The-Janitor, @thomasd2, @tomasbarcellos, @TyGu1, @wbuchanan, @WHardyPL, 
@WilDoane, @wldnjs, @yogesh1612, @yrochat, @yutannihilation, and @zheguzai100.