tm.plugin.webmining (version 1.3)

getLinkContent: Get main content for corpus items, specified by links.

Description

getLinkContent downloads and extracts content from web links for Corpus objects. Typically it is called as a post-processing function (stored in the $postFUN field) of most WebSource objects. getLinkContent downloads content in chunks, which has proven to be a more stable approach for large content requests.

Usage

getLinkContent(corpus, links = sapply(corpus, meta, "origin"),
  timeout.request = 30, chunksize = 20,
  verbose = getOption("verbose"),
  curlOpts = curlOptions(verbose = FALSE, followlocation = TRUE,
    maxconnects = 5, maxredirs = 20, timeout = timeout.request,
    connecttimeout = timeout.request, ssl.verifyhost = FALSE,
    ssl.verifypeer = FALSE, useragent = "R", cookiejar = tempfile()),
  retry.empty = 3, sleep.time = 3, extractor = ArticleExtractor,
  .encoding = integer(), ...)

Arguments

corpus
object of class Corpus for which link content should be downloaded
links
character vector specifying the links to be used for download, defaults to sapply(corpus, meta, "origin")
timeout.request
timeout (in seconds) to be used for connections/requests, defaults to 30
chunksize
Size of download chunks to be used for parallel retrieval, defaults to 20
verbose
Specifies if retrieval info should be printed, defaults to getOption("verbose")
curlOpts
curl options to be passed to getURL
retry.empty
Specifies number of times empty content sites should be retried, defaults to 3
sleep.time
Sleep time to be used between chunked download, defaults to 3 (seconds)
extractor
Extractor to be used for content extraction, defaults to ArticleExtractor
.encoding
encoding to be used for getURL, defaults to integer() (=autodetect)
...
additional parameters passed to getURL
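
A custom curlOpts value can be built with RCurl's curlOptions() and passed in place of the default; a sketch, assuming you want a longer timeout and a browser-like user agent (the option names follow RCurl's conventions):

```r
library(RCurl)

# Sketch: curl options with a longer timeout and a custom user agent.
opts <- curlOptions(followlocation = TRUE,
                    maxredirs = 30,
                    timeout = 60,
                    connecttimeout = 60,
                    useragent = "Mozilla/5.0 (compatible; R)")

# corpus <- getLinkContent(corpus, curlOpts = opts)
```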

Value

the input corpus, with the downloaded link content stored in its items

See Also

WebSource, getURL, Extractor
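
Examples

A minimal sketch of the typical workflow, assuming a working internet connection and that a feed source such as GoogleNewsSource is available in your package version. getLinkContent is normally invoked automatically as the source's $postFUN, but it can also be re-run manually, e.g. to retry documents whose download came back empty:

```r
library(tm.plugin.webmining)

# Build a WebCorpus; getLinkContent runs automatically as postFUN
corpus <- WebCorpus(GoogleNewsSource("Microsoft"))

# Re-run content retrieval manually, retrying empty documents
corpus <- getLinkContent(corpus, retry.empty = 5, sleep.time = 1,
                         verbose = TRUE)
```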