htmltidy (version 0.5.0)

tidy_html.response: Tidy or "Pretty Print" HTML/XHTML Documents

Description

Pass in HTML content as plain text, a raw vector, a parsed document (from either the XML or xml2 packages), or an httr response object, along with an options list that specifies how the content will be tidied. The function returns tidied content of the same object type as the one passed in.

Usage

# S3 method for response
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for default
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for character
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for raw
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for xml_document
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for HTMLInternalDocument
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

# S3 method for connection
tidy_html(content, options = list(TidyXhtmlOut = TRUE), verbose = FALSE)

Arguments

content

accepts a character vector, a raw vector, parsed content from the xml2 or XML packages, an httr response object, or a connection.

options

named list of options

verbose

output document errors? (default: FALSE)

Value

Tidied HTML/XHTML content. The object type will be the same as that of the input type except when it is a connection, then a character vector will be returned.
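
A minimal sketch of the type-preserving behavior described above, assuming htmltidy is installed. The variable names (html_txt, out_chr, out_raw) are illustrative only, and the classes noted in the comments are what the description leads one to expect rather than verified output.

```r
library(htmltidy)

# A small, deliberately sloppy document (the <p> is never closed)
html_txt <- "<html><head><title>t</title></head><body><p>Test</body></html>"

# Character vector in; a character vector is expected back
out_chr <- tidy_html(html_txt)
class(out_chr)

# Raw vector in; a raw vector is expected back
out_raw <- tidy_html(charToRaw(html_txt))
class(out_raw)
```

Passing a connection such as url(...) is the documented exception: the result should then be a character vector rather than a connection.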

Details

The default option TidyXhtmlOut will convert the input content to XHTML.

Currently supported options:

  • Ones taking a logical value: TidyAltText, TidyBodyOnly, TidyBreakBeforeBR, TidyCoerceEndTags, TidyDropEmptyElems, TidyDropEmptyParas, TidyFixBackslash, TidyFixComments, TidyGDocClean, TidyHideComments, TidyHtmlOut, TidyIndentContent, TidyJoinClasses, TidyJoinStyles, TidyLogicalEmphasis, TidyMakeBare, TidyMakeClean, TidyMark, TidyOmitOptionalTags, TidyReplaceColor, TidyUpperCaseAttrs, TidyUpperCaseTags, TidyWord2000, TidyXhtmlOut

  • Ones taking a character value: TidyDoctype, TidyInlineTags, TidyBlockTags, TidyEmptyTags, TidyPreTags

  • Ones taking an integer value: TidyIndentSpaces, TidyTabSize, TidyWrapLen

File an issue if there are other libtidy options you'd like supported.

It is likely that the most used options will be:

  • TidyXhtmlOut (logical),

  • TidyHtmlOut (logical) and

  • TidyDocType which should be one of "omit", "html5", "auto", "strict" or "loose".

You can clean up Microsoft Word (2000) and Google Docs HTML via logical settings for TidyWord2000 and TidyGDocClean, respectively.

It may also be advantageous to remove all comments with TidyHideComments.
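
As a rough illustration of combining several of the options listed above, the sketch below requests HTML output, strips comments, and indents the result. It assumes htmltidy is installed; the input string and the names messy, opts and tidied are made up for the example, and the exact formatting of the output depends on the bundled libtidy version.

```r
library(htmltidy)

# Hypothetical untidy input: one comment and two unclosed paragraphs
messy <- paste0(
  "<html><head><title>Demo</title></head>",
  "<body><!-- draft note --><p>Hello<p>World</body></html>"
)

# Passing an options list replaces the default list(TidyXhtmlOut = TRUE)
opts <- list(
  TidyHtmlOut = TRUE,        # logical: emit HTML rather than XHTML
  TidyHideComments = TRUE,   # logical: drop comments from the output
  TidyIndentContent = TRUE,  # logical: indent nested content
  TidyWrapLen = 80           # integer-valued: wrap lines at this width
)

tidied <- tidy_html(messy, options = opts)
cat(tidied)
```

Because the supplied list replaces the default rather than merging with it, TidyXhtmlOut = TRUE has to be included explicitly if XHTML output is still wanted alongside other options.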

References

See http://api.html-tidy.org/tidy/quickref_5.1.25.html and https://github.com/htacg/tidy-html5/blob/master/include/tidyenum.h for definitions of the options supported above, and https://www.w3.org/People/Raggett/tidy/ for an explanation of what "tidy" HTML is and some canonical examples of what it can do.

Examples

# NOT RUN {
opts <- list(
  TidyDocType="html5",
  TidyMakeClean=TRUE,
  TidyHideComments=TRUE,
  TidyIndentContent=TRUE,
  TidyWrapLen=200
)

txt <- paste0(
  c("<html><head><style>p { color: red; }</style><body><!-- ===== body ====== -->",
"<p>Test</p></body><!--Default Zone --> <!--Default Zone End--></html>"),
  collapse="")

cat(tidy_html(txt, options=opts))

# }
# NOT RUN {
library(httr)
res <- GET("https://rud.is/test/untidy.html")

# look at the original, un-tidy source
cat(content(res, as="text", encoding="UTF-8"))

# see the tidied version
cat(tidy_html(content(res, as="text", encoding="UTF-8"),
              list(TidyDocType="html5", TidyWrapLen=200)))

# but, you could also just do:
cat(tidy_html(url("https://rud.is/test/untidy.html")))
# }
