tm.plugin.webmining (version 1.3)

extractContentDOM: Extract Main HTML Content from DOM

Description

Extracts the main content of an HTML document using its Document Object Model (DOM). The approach rests on the observation that the main content of an HTML document usually sits in a subnode of the DOM tree with a high text-to-tag ratio. Internally, this function calls assignValues, calcDensity, getMainText and removeTags.
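
The text-to-tag ratio idea can be illustrated with a small base R sketch. This is not the package's implementation: the helper text_to_tag_ratio and the two sample fragments are hypothetical, and the ratio used here (visible characters divided by tag count) is only one plausible way to score a fragment.

```r
# Illustrative sketch only, NOT the code used by extractContentDOM.
# text_to_tag_ratio() is a hypothetical helper: it scores an HTML fragment
# as (number of visible text characters) / (number of tags).
text_to_tag_ratio <- function(fragment) {
  # Locate every tag-like token such as <p>, </a> or <a href='/'>.
  tags <- gregexpr("<[^>]+>", fragment)[[1]]
  n_tags <- if (tags[1] == -1) 0L else length(tags)

  # Strip the tags, collapse whitespace and count what is left.
  text <- gsub("<[^>]+>", "", fragment)
  n_chars <- nchar(gsub("\\s+", " ", trimws(text)))

  # Avoid division by zero for fragments without any tags.
  n_chars / max(n_tags, 1L)
}

# Two made-up fragments: a link-heavy navigation list and a text paragraph.
fragments <- c(
  nav  = "<ul><li><a href='/'>Home</a></li><li><a href='/news'>News</a></li></ul>",
  main = "<p>The quick brown fox jumps over the lazy dog, and then it does so again for good measure.</p>"
)

ratios <- vapply(fragments, text_to_tag_ratio, numeric(1))

# The fragment with the highest ratio is the main-content candidate.
best <- names(which.max(ratios))
```

In this toy input the navigation list has many tags and little text, so it scores far lower than the paragraph, and best ends up naming the paragraph fragment.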

Usage

extractContentDOM(url, threshold, asText = TRUE, ...)

Arguments

url
character, url or filename
threshold
threshold for extraction; documented default is 0.5, although the usage above lists no default value
asText
logical, specifies whether url should be interpreted as a character string of HTML content rather than as a URL or filename
...
additional parameters passed to htmlTreeParse

References

http://www.elias.cn/En/ExtMainText
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
Gupta et al., DOM-based Content Extraction of HTML Documents, http://www2003.org/cdrom/papers/refereed/p583/p583-gupta.html

See Also

xmlNode