xpath - R and xpathApply: removing duplicates from nested HTML tags
I have edited the question for brevity and clarity.
My goal is to find an XPath expression that will result in "test1"..."test8" being listed separately.
I am working with xpathApply to extract text from web pages. Due to the layout of the various pages the information is pulled from, I need to extract the XML values from both <font> and <p> HTML tags. The problem I run into is when one type is nested within the other, which results in partial duplicates when I use the following xpathApply expression with an "or" condition.
require(XML)

html <- '<!DOCTYPE html>
<html lang="en">
<body>
<p>test1</p>
<font>test2</font>
<p><font>test3</font></p>
<font><p>test4</p></font>
<p>test5<font>test6</font></p>
<font>test7<p>test8</p></font>
</body>
</html>'

work <- htmlTreeParse(html, useInternalNodes = TRUE, encoding = 'UTF-8')
table <- xpathApply(work, "//p|//font", xmlValue)
table
It should be easy to see the type of issue that comes with the nesting: because some <font> and <p> tags are nested and some aren't, I can't ignore either of them, but searching for both gives me partial dupes. For other reasons, I prefer the text pieces to be broken up rather than aggregated (that is, taken from the lowest level/furthest nested tag).
The reason I am not just doing two separate searches and appending them after removing duplicate strings is that I need to preserve the ordering of the text as it appears in the HTML.
Thanks for reading!
Okay, I figured it out (entirely due to the post here: http://www.r-bloggers.com/htmltotext-extracting-text-from-html-via-xpath/).
The answer for me was to take all of the text within the HTML and clean out the stuff that is not needed, like this:
table <- xpathApply(work, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
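For anyone who only wants the text that sits under <p> or <font> (rather than all text on the page), here is a minimal sketch of a narrower variant of the same idea. It is not taken from the linked post: the ancestor filter and the variable names doc and pieces are my own assumptions for illustration. It selects text nodes instead of elements, so each piece of text is returned once, in document order, no matter how the tags are nested.

```r
library(XML)

# Same sample markup as in the question.
html <- '<!DOCTYPE html>
<html lang="en">
<body>
<p>test1</p>
<font>test2</font>
<p><font>test3</font></p>
<font><p>test4</p></font>
<p>test5<font>test6</font></p>
<font>test7<p>test8</p></font>
</body>
</html>'

# htmlParse returns an internal document that XPath can be run against.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Select text nodes (not elements) that have a <p> or <font> ancestor,
# skipping whitespace-only nodes. Each text node is matched once, so
# nested tags do not produce partial duplicates.
pieces <- xpathSApply(
  doc,
  "//text()[ancestor::p or ancestor::font][normalize-space()]",
  xmlValue
)

print(pieces)
```

On the sample HTML this should list "test1" through "test8" separately and in order. On real pages the ancestor list would probably need more tags added to it (for example ancestor::td), which is where the "take all text and exclude script/style" approach above may end up being simpler.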