FactivaSource: Factiva Source

Description

Construct a source for an input containing a set of articles exported from Factiva in the XML or HTML formats.

Usage

FactivaSource(x, encoding = "UTF-8",
                format = c("auto", "XML", "HTML"))

Value

An object of class XMLSource, which extends the class Source, representing a set of articles from Factiva.

Arguments

x: Either a character identifying the file or a connection.
encoding: A character string giving the encoding of x. It applies to HTML files only, and is ignored unless the HTML input lacks encoding information, which should normally not happen with files exported from Factiva.
format: The format of the file or connection identified by x (see “Details”).

Author

Milan Bouchet-Valat

Details

This function can be used to import both XML and HTML files. If format is set to “auto” (the default), the file extension is used to guess the format: if the file name ends with “.xml” or “.XML”, XML is assumed; else, the file is assumed to be in the HTML format.
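The extension rule described above can be sketched in a few lines of base R. This is only an illustration of the documented behavior, not the package's actual implementation: the helper name guess_factiva_format is hypothetical, and it assumes x is a file path rather than a connection.

```r
# Hypothetical helper (not part of tm.plugin.factiva): mimics the documented
# rule for format = "auto". Assumes `path` is a file name, not a connection.
guess_factiva_format <- function(path, format = c("auto", "XML", "HTML")) {
    format <- match.arg(format)

    # An explicitly requested format always wins
    if (format != "auto")
        return(format)

    # Only ".xml" and ".XML" are treated as XML; anything else is HTML
    if (grepl("\\.(xml|XML)$", path)) "XML" else "HTML"
}

guess_factiva_format("export.xml")
guess_factiva_format("export.html")
guess_factiva_format("export.dat", format = "XML")
```

Under this sketch, a file with an unusual extension falls back to HTML, so passing format = "XML" explicitly is the safer choice when an XML export has been renamed.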

It is advised to export articles from Factiva in the XML format rather than in HTML when possible, since the latter does not provide completely clean information. In particular, dates are not guaranteed to be parsed correctly if the machine from which the HTML file was exported uses a locale different from that of the machine where it is read.

The following screencast illustrates how to export articles in the correct HTML format from the Factiva website: http://rtemis.hypotheses.org/files/2017/02/Factiva-animated-tutorial.gif. Note that if you do not follow this procedure, you will obtain an HTML file that cannot be imported by this package.

This function imports the body of the articles, but also sets several meta-data variables on individual documents:

datetimestamp: The publication date.
heading: The title of the article.
origin: The newspaper the article comes from.
edition: The (local) variant of the newspaper.
section: The part of the newspaper containing the article.
subject: One or several keywords defining the subject.
company: One or several keywords identifying the covered companies.
industry: One or several keywords identifying the covered industries.
infocode: One or several Information Provider Codes (IPC).
infodesc: One or several Information Provider Descriptions (IPD).
coverage: One or several keywords identifying the covered regions.
page: The number of the page on which the article appears (if applicable).
wordcount: The number of words in the article.
publisher: The publisher of the newspaper.
rights: The copyright information associated with the article.
language: The language of the article. It is set automatically if readerControl = list(language = NA) is passed (see the example below). Otherwise, the manually specified language is set for all articles; if none is specified, the default, "en", is used.

Examples


if (FALSE) {
    ## For an XML file
    library(tm)
    file <- system.file("texts", "reut21578-factiva.xml",
                        package = "tm.plugin.factiva")
    source <- FactivaSource(file)
    corpus <- Corpus(source, readerControl = list(language = NA))

    # See the contents of the documents
    inspect(corpus)

    # See meta-data associated with first article
    meta(corpus[[1]])
}

    ## For an HTML file
    library(tm)
    file <- system.file("texts", "factiva_test.html", 
                        package = "tm.plugin.factiva")
    source <- FactivaSource(file)
    corpus <- Corpus(source, readerControl = list(language = NA))

    # See the contents of the documents
    inspect(corpus)

    # See meta-data associated with first article
    meta(corpus[[1]])

    # Check that texts with non-ASCII characters are properly marked as UTF-8,
    # as bugs in XML have created issues in the past
    stopifnot(all(Encoding(content(corpus[[1]])[1]) == "UTF-8"))
