URL encoding and decoding is an essential prerequisite to proper web interaction
and data analysis around things like server-side logs. The
relevant IETF RFC (RFC 3986) mandates the percent-encoding of characters outside a small
unreserved set, including reserved characters such as slashes when they are not serving their reserved purpose.
Base R provides URLdecode and URLencode, which handle
URL encoding - in theory. In practice, they have a set of substantial problems
that the urltools implementation solves:
No vectorisation: Both base R functions operate on single URLs, not vectors of URLs.
This means that, when confronted with a vector of URLs that need encoding or
decoding, your only option is to loop over it from within R, which becomes
computationally very costly on large datasets. url_encode and url_decode are
implemented in C++ and entirely vectorised, allowing for a substantial
performance improvement.
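As a rough illustration of the difference (a minimal sketch, with example URLs invented for the purpose), encoding a vector with base R means wrapping URLencode in an explicit loop, while url_encode takes the whole vector at once:

```r
# A minimal sketch, assuming the urltools package is installed; the URLs
# below are made up for illustration.
library(urltools)

urls <- c("https://example.com/search?q=cafe au lait",
          "https://example.com/path with spaces/index.html")

# Base R: URLencode() works on a single URL, so a vector has to be looped
# over explicitly (here with vapply).
base_encoded <- vapply(urls, URLencode, character(1))

# urltools: url_encode() is vectorised (and implemented in C++), so the
# whole vector is handled in one call.
urltools_encoded <- url_encode(urls)
```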
No scheme recognition: encoding the slashes in, say, http://, is a good way
of making sure your URL no longer works. Because of this, unless you refuse to
encode reserved characters at all, the only thing you can safely pass to URLencode
is a partial URL with the scheme already stripped off, which means extra pre- and
post-processing and more complexity in your encoding or decoding code. url_encode
detects the protocol and silently splits it off, leaving it unencoded to ensure
that the resulting URL is valid.
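A small sketch of the scheme problem, again with a made-up URL: asking URLencode to encode reserved characters mangles the scheme, while url_encode leaves it alone:

```r
# A minimal sketch; the URL is invented for illustration.
library(urltools)

url <- "http://example.com/some path/with spaces"

# Base R: with reserved = TRUE, URLencode() also encodes the ':' and '/'
# in "http://", leaving a URL that no longer resolves.
URLencode(url, reserved = TRUE)

# urltools: url_encode() recognises the scheme, splits it off internally,
# and leaves it unencoded, so the result is still a usable URL.
url_encode(url)
```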
ASCII NULs: Server-side data can get very messy and sometimes includes out-of-range
characters. Unfortunately, URLdecode's response to these characters is to convert
them to NULs, which R can't handle, at which point your URLdecode call breaks.
url_decode simply ignores them.
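To see the difference in behaviour, a minimal sketch (the %00 here is just a stand-in for the kind of out-of-range data that turns up in real logs):

```r
# A minimal sketch; "%00" stands in for out-of-range bytes from messy logs.
library(urltools)

messy <- "https://example.com/?q=broken%00value"

# Base R: URLdecode() decodes %00 to a NUL byte, which R cannot store in a
# character string, so the call errors; try() keeps the script running.
try(URLdecode(messy))

# urltools: url_decode() ignores the NUL, so the rest of the string is
# still decoded.
url_decode(messy)
```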