This function can be used to download a file from the Internet.
download.file(url, destfile, method, quiet = FALSE, mode = "w",
              cacheOK = TRUE,
              extra = getOption("download.file.extra"),
              headers = NULL, ...)
url: a character string (or longer vector, e.g. for the "libcurl" method) naming the URL of a resource to be downloaded.
destfile: a character string (or vector, see url) with the name where the downloaded file is saved. Tilde-expansion is performed.
method: Method to be used for downloading files. Current download methods are "internal", "wininet" (Windows only), "libcurl", "wget" and "curl", and there is a value "auto": see ‘Details’ and ‘Note’. The method can also be set through the option "download.file.method": see options().
quiet: If TRUE, suppress status messages (if any), and the progress bar.
mode: character. The mode with which to write the file. Useful values are "w", "wb" (binary), "a" (append) and "ab". Not used for methods "wget" and "curl". See also ‘Details’, notably about using "wb" for Windows.
cacheOK: logical. Is a server-side cached value acceptable?
extra: character vector of additional command-line arguments for the "wget" and "curl" methods.
headers: named character vector of HTTP headers to use in HTTP requests. It is ignored for non-HTTP URLs. The User-Agent header, taken from the HTTPUserAgent option (see options), is automatically used as the first header.
...: allow additional arguments to be passed, unused.
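As an illustration of the headers argument, the sketch below passes a named character vector. The header names and values, the helper as_file_url, and the temporary file standing in for a remote resource are all assumptions made for this example. Because headers is ignored for non-HTTP URLs, the file:// download only demonstrates the calling convention, and it assumes an R version whose download.file has the headers argument.

```r
# Hypothetical helper (not part of R): turn a local path into a file:// URL.
as_file_url <- function(path) {
  p <- normalizePath(path, winslash = "/")
  if (startsWith(p, "/")) paste0("file://", p) else paste0("file:///", p)
}

# Illustrative headers; the names and values are made up for this sketch.
hdrs <- c(Accept = "text/plain", "X-Example-Header" = "demo")

# A temporary local file stands in for a remote resource.
src <- tempfile(fileext = ".txt")
writeLines("header demo", src)
dest <- tempfile(fileext = ".txt")

# headers must be a named character vector; it has no effect on file:// URLs.
status <- download.file(as_file_url(src), dest, quiet = TRUE, headers = hdrs)
```

With an http:// or https:// URL the same call would send these headers after the automatically added User-Agent header.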
An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.
What happens to the destination file(s) in the case of error depends on the method and R version. Currently the "internal", "wininet" and "libcurl" methods will remove the file if the URL is unavailable, except when mode specifies appending, in which case the file should be unchanged.
For the Windows-only method "wininet", the ‘Internet Options’ of the system are used to choose proxies and so on; these are set in the Control Panel and are those used for Internet Explorer.
The next two paragraphs apply to the internal code only.
Proxies can be specified via environment variables. Setting no_proxy to * stops any proxy being tried. Otherwise the setting of http_proxy or ftp_proxy (or failing that, the all upper-case version) is consulted and if non-empty used as a proxy site. For FTP transfers, the username and password on the proxy can be specified by ftp_proxy_user and ftp_proxy_password. The form of http_proxy should be http://proxy.dom.com/ or http://proxy.dom.com:8080/ where the port defaults to 80 and the trailing slash may be omitted. For ftp_proxy use the form ftp://proxy.dom.com:3128/ where the default port is 21. These environment variables must be set before the download code is first used: they cannot be altered later by calling Sys.setenv.
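A minimal sketch of setting a proxy from within R, assuming it runs at the very start of a session (for example from a startup file) so that the variable is in place before the download code is first used. The host proxy.dom.com is the placeholder from the text above, not a real proxy.

```r
# Run early in the session, before any download is attempted.
# proxy.dom.com is a placeholder host taken from the text, not a real proxy.
Sys.setenv(http_proxy = "http://proxy.dom.com:8080/")

# Read the value back to confirm what the download code will see
# when it is first used.
proxy <- Sys.getenv("http_proxy")
```

Setting no_proxy to "*" in the same way would stop any proxy being tried.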
Usernames and passwords can be set for HTTP proxy transfers via environment variable http_proxy_user in the form user:passwd. Alternatively, http_proxy can be of the form http://user:pass@proxy.dom.com:8080/ for compatibility with wget. Only the HTTP/1.0 basic authentication scheme is supported.
Under Windows, if http_proxy_user is set to ask then a dialog box will come up for the user to enter the username and password. NB: you will be given only one opportunity to enter this, but if proxy authentication is required and fails there will be one further prompt per download.
Much the same scheme is supported by method = "libcurl", including no_proxy, http_proxy and ftp_proxy, and for the last two a value of the form [user:password@]machine[:port] where the parts in brackets are optional. See http://curl.haxx.se/libcurl/c/libcurl-tutorial.html for details.
Methods which access https:// and ftps:// URLs should try to verify the site certificates. This is usually done using the CA root certificates installed by the OS (although we have seen instances in which these got removed rather than updated). For further information see http://curl.haxx.se/docs/sslcerts.html.
This is an issue for method = "libcurl" on Windows, where the OS does not provide a suitable CA certificate bundle, so by default on Windows certificates are not verified. To turn verification on, set environment variable CURL_CA_BUNDLE to the path to a certificate bundle file, usually named ca-bundle.crt or curl-ca-bundle.crt. (This is normally done for a binary installation of R, which installs R_HOME/etc/curl-ca-bundle.crt and sets CURL_CA_BUNDLE to point to it if that environment variable is not already set.) For an updated certificate bundle, see http://curl.haxx.se/docs/sslcerts.html. Currently one can download a copy from https://raw.githubusercontent.com/bagder/ca-bundle/master/ca-bundle.crt and set CURL_CA_BUNDLE to the full path to the downloaded file.
Note that the root certificates used by R may or may not be the same as used in a browser, and indeed different browsers may use different certificate bundles (there is typically a build option to choose either their own or the system ones).
ftp: URLs are accessed using the FTP protocol which has a number of variants. One distinction is between ‘active’ and ‘(extended) passive’ modes: which is used is chosen by the client. The "internal" and "libcurl" methods use passive mode, and that is almost universally used by browsers. Prior to R 3.2.3 the "wininet" method used active mode: nowadays it first tries passive and then active.
Setting the method should be left to the end user. Neither the wget nor the curl command is widely available: you can check if one is available via Sys.which, and should do so in a package or script.
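One way to perform this check is sketched below. The helper find_external_downloader is hypothetical; it relies only on Sys.which, which returns an empty string for a command that is not found.

```r
# Hypothetical helper: return the name of the first external download tool
# found on the search path, or NA if none of the candidates is available.
find_external_downloader <- function(candidates = c("curl", "wget")) {
  paths <- Sys.which(candidates)           # named vector, "" when not found
  found <- names(paths)[nzchar(paths)]
  if (length(found)) found[[1L]] else NA_character_
}

tool <- find_external_downloader()
if (is.na(tool)) {
  message("neither curl nor wget was found; leaving method unset")
} else {
  message("external tool available: ", tool)
}
```

A package or script would fall back to leaving method unset when the result is NA, rather than assuming an external program exists.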
If you use download.file in a package or script, you must check the return value, since it is possible that the download will fail with a non-zero status but not an R error. (This was more likely prior to R 3.4.0.)
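A minimal sketch of such a check follows. The wrapper safe_download and the helper as_file_url are hypothetical names; the wrapper treats both an R error and a non-zero status as failure. A temporary local file served through a file:// URL stands in for a remote resource so that the example runs offline.

```r
# Hypothetical helper (not part of R): turn a local path into a file:// URL.
as_file_url <- function(path) {
  p <- normalizePath(path, winslash = "/")
  if (startsWith(p, "/")) paste0("file://", p) else paste0("file:///", p)
}

# Hypothetical wrapper: TRUE on success, FALSE on an error or non-zero status.
safe_download <- function(url, destfile, ...) {
  status <- tryCatch(
    download.file(url, destfile, quiet = TRUE, ...),
    error = function(e) {
      message("download raised an R error: ", conditionMessage(e))
      1L
    }
  )
  if (!identical(as.integer(status), 0L)) {
    warning("download of ", url, " failed with status ", status)
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

src <- tempfile(fileext = ".txt")
writeLines("hello", src)
dest <- tempfile(fileext = ".txt")
ok <- safe_download(as_file_url(src), dest)
```

Callers can then branch on the logical result instead of relying on an error being thrown.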
The supported methods do change: method libcurl was introduced in R 3.2.0 and is still optional on Windows: use capabilities("libcurl") in a program to see if it is available.
The function download.file can be used to download a single file as described by url from the internet and store it in destfile.
The url must start with a scheme such as http://, https://, ftp:// or file://.
If method = "auto" is chosen (the default), the behavior depends on the platform:
On a Unix-alike, method "libcurl" is used, except that "internal" is used for file:// URLs. Method "libcurl" uses the library of that name (http://curl.haxx.se/libcurl/).
On Windows, the "wininet" method is used, apart from ftps:// URLs, for which "libcurl" is tried. The "wininet" method uses the WinINet functions (part of the OS).
Support for method "libcurl" is optional on Windows: use capabilities("libcurl") to see if it is supported on your build. It uses an external library of that name (http://curl.haxx.se/libcurl/) against which R can be compiled.
When method "libcurl" is used, it provides (non-blocking) access to https:// and (usually) ftps:// URLs. There is support for simultaneous downloads, so url and destfile can be character vectors of the same length greater than one (but the method has to be specified explicitly and not via "auto"). For a single URL and quiet = FALSE a progress bar is shown in interactive use.
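The sketch below illustrates simultaneous downloads with vectors for url and destfile. It assumes a build with libcurl support whose libcurl can read file:// URLs; the temporary local files stand in for remote resources and the helper as_file_url is hypothetical.

```r
# Hypothetical helper (not part of R): turn a local path into a file:// URL.
as_file_url <- function(path) {
  p <- normalizePath(path, winslash = "/")
  if (startsWith(p, "/")) paste0("file://", p) else paste0("file:///", p)
}

# Two temporary source files stand in for two remote resources.
srcs <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("first", srcs[1])
writeLines("second", srcs[2])
urls <- vapply(srcs, as_file_url, character(1), USE.NAMES = FALSE)
dests <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))

if (capabilities("libcurl")[["libcurl"]]) {
  # The method must be given explicitly: "auto" does not accept vectors.
  status <- download.file(urls, dests, method = "libcurl", quiet = TRUE)
} else {
  status <- NA_integer_
  message("this build of R has no libcurl support")
}
```

The two vectors must have the same length, with each element of dests receiving the corresponding element of urls.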
For methods "wget" and "curl" a system call is made to the tool given by method, and the respective program must be installed on your system and be in the search path for executables. They will block all other activity on the R process until they complete: this may make a GUI unresponsive.
cacheOK = FALSE is useful for http:// and https:// URLs: it will attempt to get a copy directly from the site rather than from an intermediate cache. It is used by available.packages.
The "libcurl" and "wget" methods follow http:// and https:// redirections to any scheme they support: the "internal" method follows http:// to http:// redirections only. (For method "curl" use argument extra = "-L". To disable redirection in wget, use extra = "--max-redirect=0".)
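As a sketch of passing extra to the "curl" method, the hypothetical wrapper below always adds "-L". It is only exercised when a curl executable is found on the search path, and it uses a temporary local file through a file:// URL (which involves no redirection) purely so that the call can run offline.

```r
# Hypothetical helper (not part of R): turn a local path into a file:// URL.
as_file_url <- function(path) {
  p <- normalizePath(path, winslash = "/")
  if (startsWith(p, "/")) paste0("file://", p) else paste0("file:///", p)
}

# Hypothetical wrapper: use the external curl program and follow redirections.
download_with_curl <- function(url, destfile) {
  download.file(url, destfile, method = "curl", extra = "-L", quiet = TRUE)
}

src <- tempfile(fileext = ".txt")
writeLines("via curl", src)
dest <- tempfile(fileext = ".txt")

if (nzchar(Sys.which("curl"))) {
  status <- download_with_curl(as_file_url(src), dest)
} else {
  status <- NA_integer_
  message("the curl command is not available on this system")
}
```

For the "wget" method the same pattern applies with extra = "--max-redirect=0" when redirection should be disabled.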
The "wininet" method supports some redirections but not all. (For method "libcurl", messages will quote the endpoint of redirections.)
Note that https:// URLs are not supported by the "internal" method but are supported by the "libcurl" method and the "wininet" method on Windows.
See url for how file:// URLs are interpreted, especially on Windows. The "internal" and "wininet" methods do not percent-decode file:// URLs, but the "libcurl" and "curl" methods do: method "wget" does not support them.
Most methods do not percent-encode special characters such as spaces in URLs (see URLencode), but it seems the "wininet" method does.
The remaining details apply to the "internal", "wininet" and "libcurl" methods only.
The timeout for many parts of the transfer can be set by the option timeout which defaults to 60 seconds.
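A sketch of raising the timeout for the duration of one operation and restoring it afterwards. The helper with_timeout is a hypothetical name; the getOption call stands in for the place where download.file would be called.

```r
# Hypothetical helper: evaluate expr with a temporary value of the timeout
# option, restoring the previous value on exit (even if an error occurs).
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)
  on.exit(options(old), add = TRUE)
  force(expr)   # expr is evaluated lazily, i.e. after the option is set
}

before <- getOption("timeout")

# In real use the second argument would be a download.file() call.
observed <- with_timeout(300, getOption("timeout"))
```

Restoring the option avoids changing the behaviour of later downloads in the same session.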
The level of detail provided during transfer can be set by the quiet argument and the internet.info option: the details depend on the platform and scheme. For the "internal" method setting option internet.info to 0 gives all available details, including all server responses. Using 2 (the default) gives only serious messages, and 3 or more suppresses all messages. For the "libcurl" method values of the option less than 2 give verbose output.
A progress bar tracks the transfer in a platform-specific way:
On Windows: If the file length is known, the full width of the bar is the known length. Otherwise the initial width represents 100 Kbytes and is doubled whenever the current width is exceeded. (In non-interactive use this uses a text version. If the file length is known, an equals sign represents 2% of the transfer completed: otherwise a dot represents 10Kb.)
On a Unix-alike: If the file length is known, an equals sign represents 2% of the transfer completed: otherwise a dot represents 10Kb.
The choice of binary transfer (mode = "wb" or "ab") is important on Windows, since unlike Unix-alikes it does distinguish between text and binary files and for text transfers changes \n line endings to \r\n (aka CRLF).
On Windows, if mode is not supplied (missing()) and url ends in one of .gz, .bz2, .xz, .tgz, .zip, .rda, .rds or .RData, mode = "wb" is set such that a binary transfer is done to help unwary users.
Code written to download binary files must use mode = "wb" (or "ab"), but the problems incurred by a text transfer will only be seen on Windows.
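A short sketch of a binary download with an explicit mode = "wb". A temporary .rds file served through a file:// URL stands in for a remote binary file, and the helper as_file_url is hypothetical.

```r
# Hypothetical helper (not part of R): turn a local path into a file:// URL.
as_file_url <- function(path) {
  p <- normalizePath(path, winslash = "/")
  if (startsWith(p, "/")) paste0("file://", p) else paste0("file:///", p)
}

# Create a binary file to act as the resource being downloaded.
src <- tempfile(fileext = ".rds")
saveRDS(mtcars, src)
dest <- tempfile(fileext = ".rds")

# Always request a binary transfer explicitly for binary content.
status <- download.file(as_file_url(src), dest, mode = "wb", quiet = TRUE)

# The object read back should match the original exactly.
round_trip_ok <- identical(readRDS(dest), mtcars)
```

Giving mode explicitly keeps the code correct on every platform rather than relying on the file-extension rule applied on Windows.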
options to set the HTTPUserAgent, timeout and internet.info options used by some of the methods.
url for a finer-grained way to read data from URLs.
url.show, available.packages, download.packages for applications.
Contributed package RCurl provides more comprehensive facilities to download from URLs.