xpath - R and xpathApply -- removing duplicates from nested html tags -

- February 15, 2015

I have edited the question for brevity and clarity.

My goal is to find an XPath expression that results in "test1" ... "test8" being listed separately.

I am working with xpathApply to extract text from web pages. Due to the layout of the various pages the information is pulled from, I need to extract the XML values from both <font> and <p> HTML tags. The problem I run into is when one type is nested within the other, resulting in partial duplicates when I use the following xpathApply expression with an "or" condition.

require(XML)

html <- '<!DOCTYPE html>
<html lang="en">
  <body>
    <p>test1</p>
    <font>test2</font>
    <p><font>test3</font></p>
    <font><p>test4</p></font>
    <p>test5<font>test6</font></p>
    <font>test7<p>test8</p></font>
  </body>
</html>'

# Parse the string into an internal document so XPath queries can be run on it
work <- htmlTreeParse(html, useInternalNodes = TRUE, encoding = 'UTF-8')

# Select every <p> and every <font> element and return its text value
table <- xpathApply(work, "//p|//font", xmlValue)
table

It should be easy to see the type of issue that comes with the nesting: because some <font> and <p> tags are nested and some aren't, I can't ignore either of them, but searching for both gives me partial dupes. For other reasons, I prefer the text pieces broken up rather than aggregated (that is, taken from the lowest level/furthest nested tag).
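To make the nesting problem visible, here is a minimal sketch that labels each match with the name of the tag it came from. It uses htmlParse and xpathSApply from the XML package in place of the calls above; the variable names doc and res are only illustrative.

```r
library(XML)

html <- '<!DOCTYPE html>
<html lang="en">
  <body>
    <p>test1</p>
    <font>test2</font>
    <p><font>test3</font></p>
    <font><p>test4</p></font>
    <p>test5<font>test6</font></p>
    <font>test7<p>test8</p></font>
  </body>
</html>'

# Parse the HTML string into an internal document
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Label each match with its tag name so nested matches are easy to spot.
# The query returns one value per <p> and per <font> element, so a nested
# pair contributes its text more than once.
res <- xpathSApply(doc, "//p|//font", function(node) {
  paste0("<", xmlName(node), "> ", xmlValue(node))
})
print(res)
```

Because xmlValue returns all the text underneath an element, an outer tag repeats the text of whatever is nested inside it, which is where the partial duplicates come from.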

The reason I am not just doing two separate searches and appending them after removing duplicate strings is that I need to preserve the order in which the text appears in the HTML.

Thanks for reading!

Okay, I figured it out (entirely due to the post here: http://www.r-bloggers.com/htmltotext-extracting-text-from-html-via-xpath/).

The answer for me was to take all the text within the HTML and clean out the stuff that is not needed, like this:

table <- xpathApply(work, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
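For completeness, here is a self-contained sketch of that approach applied to the sample HTML from the question. The helper name extract_text_pieces is hypothetical, and the trimming step is my own assumption: selecting text nodes may also pick up whitespace-only nodes between tags, so those are dropped.

```r
library(XML)

# Hypothetical helper: returns the trimmed, non-empty text nodes of an HTML
# string in document order, skipping script/style/noscript content.
extract_text_pieces <- function(html) {
  doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
  xpath <- paste0(
    "//text()",
    "[not(ancestor::script)]",
    "[not(ancestor::style)]",
    "[not(ancestor::noscript)]"
  )
  pieces <- xpathSApply(doc, xpath, xmlValue)
  # Strip surrounding whitespace, then drop pieces that are now empty
  pieces <- trimws(as.character(pieces))
  pieces[nzchar(pieces)]
}

html <- '<!DOCTYPE html>
<html lang="en">
  <body>
    <p>test1</p>
    <font>test2</font>
    <p><font>test3</font></p>
    <font><p>test4</p></font>
    <p>test5<font>test6</font></p>
    <font>test7<p>test8</p></font>
  </body>
</html>'

pieces <- extract_text_pieces(html)
print(pieces)
```

Since each text node belongs to exactly one place in the tree, every piece should show up once and in document order, no matter how the <p> and <font> tags are nested. On pages with text outside those two tags, adding a predicate such as [ancestor::p or ancestor::font] would be one way to narrow the selection.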
