I'm parsing a page that has a structure like this:

    <pre class="asdf">content a</pre>
    <pre class="asdf">content b</pre>

    # returns
    content a
    content b

And I'm using the following XPath to get the content: "//pre[@class='asdf']/text()"

It works well, except when there are elements nested inside the <pre> tag, in which case it doesn't concatenate them:

    <pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
    <pre class="asdf">content b</pre>

    # returns
    content
    content b

If I use this XPath instead, I get the output that follows: "//pre[@class='asdf']//text()"

    content
    a
    content b

I don't want either of those. I want all the text inside a <pre>, even if it has children. I don't care whether the tags are stripped or not, but I want it concatenated together.

How do I do this? I'm using lxml.html.xpath in Python 2, but I don't think it matters. This answer to another question makes me think that maybe child:: has something to do with the answer.

Here's some code that reproduces it.

    from lxml import html

    tree = html.fromstring("""
    <pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
    <pre class="asdf">content b</pre>
    """)

    # /text() selects only the text nodes that are direct children of each match
    for row in tree.xpath("//*[@class='asdf']/text()"):
        print("row: ", row)

.text_content() is what you should use:

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

    # Select the elements themselves, then ask each one for its full text
    for row in tree.xpath("//*[@class='asdf']"):
        print("row: ", row.text_content())

Demo:

    >>> from lxml import html
    >>>
    >>> tree = html.fromstring("""
    ... <pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
    ... <pre class="asdf">content b</pre>
    ... """)
    >>> for row in tree.xpath("//*[@class='asdf']"):
    ...     print("row: ", row.text_content())
    ...
    ('row: ', 'content a')
    ('row: ', 'content b')
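
If .text_content() is not available, for example when parsing with the standard library's xml.etree.ElementTree instead of lxml.html, a rough equivalent is to join the strings produced by itertext(). The sketch below is only an illustration under a few assumptions: the SAMPLE markup is the question's snippet wrapped in a <div> so that it parses as well-formed XML, and joined_text is a hypothetical helper name, not part of any library. lxml elements expose an itertext() method as well, so the same helper should carry over.

```python
import xml.etree.ElementTree as ET

# Assumption: the markup from the question, wrapped in a <div> so that it
# forms a single well-formed XML document (ElementTree is an XML parser).
SAMPLE = """
<div>
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
</div>
"""


def joined_text(element):
    """Concatenate the element's own text with the text of all its descendants.

    Hypothetical helper: itertext() yields every text fragment inside the
    element in document order, so joining them strips the markup.
    """
    return "".join(element.itertext())


root = ET.fromstring(SAMPLE)

# ElementTree supports a limited XPath subset, including attribute predicates.
rows = [joined_text(pre) for pre in root.iterfind(".//pre[@class='asdf']")]

for row in rows:
    print("row:", row)
```

Selecting the elements first and concatenating afterwards mirrors the .text_content() approach: the XPath expression only finds the <pre> elements, and the text is gathered per element rather than as separate text nodes.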