get_nexis_html

Extracts headings, body texts and metadata (date, byline, length,
section, edition) from items in HTML files downloaded by the scraper.

internal

Functions for importing and handling text files and formatted text
files with additional metadata, including '.csv', '.tab', '.json', '.xml',
'.html', '.pdf', '.doc', '.docx', '.rtf', '.xls', '.xlsx', and others.

Kenneth Benoit

readtext

Import and Handling for Plain and Formatted Text Files

Adam Obeng

Kohei Watanabe

Akitaka Matsuo

Paul Nulty

Stefan Müller

get_nexis_html function

<dl><dt>path</dt>
<dd>path to either an HTML file or a directory that contains HTML
files</dd>
<dt>paragraph_separator</dt>
<dd>a character to separate paragraphs in body texts</dd>
<dt>verbosity</dt>
<dd><ul>
<li>0: output errors only</li>
<li>1: output errors and warnings (default)</li>
<li>2: output a brief summary message</li>
<li>3: output detailed file-related messages</li>
</ul></dd>
<dt>...</dt>
<dd>only to trap extra arguments</dd></dl>
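
A minimal usage sketch of the arguments above, assuming the function is reached through the `readtext` namespace as an internal helper (hence `:::`). The location `nexis_path` is hypothetical, and the call is guarded so the snippet does nothing when the package or the files are absent. The last lines only illustrate, in base R, what joining paragraphs with a separator looks like; they are not the package's implementation.

```r
# Hypothetical location of HTML files downloaded from Nexis (an assumption).
nexis_path <- file.path(tempdir(), "nexis")

# Call the internal helper only if the package and the files are available.
if (requireNamespace("readtext", quietly = TRUE) && file.exists(nexis_path)) {
  items <- readtext:::get_nexis_html(
    path = nexis_path,          # a single HTML file or a directory of them
    paragraph_separator = "|",  # placed between paragraphs of each body text
    verbosity = 1               # output errors and warnings
  )
  str(items)                    # inspect whatever structure is returned
}

# Base-R illustration of joining paragraphs with a separator.
paragraphs <- c("First paragraph.", "Second paragraph.")
body <- paste(paragraphs, collapse = "|")
```

Choosing a separator that does not occur in the articles themselves makes it possible to split the body texts back into paragraphs later, for example with `strsplit(body, "|", fixed = TRUE)`.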

Arguments

get_nexis_html: extract texts and metadata from Nexis HTML files

