I'm parsing a page that has a structure like this:
<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>

# returns
content a
content b

And I'm using the following XPath to get the content: "//pre[@class='asdf']/text()"
It works well, except if there are any elements nested inside the <pre> tag, in which case it doesn't concatenate them:
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>

# returns
content
content b

If I use this XPath instead, I get the output that follows: "//pre[@class='asdf']//text()"
content
a
content b
I don't want either of those. I want to get all the text inside a <pre>, even if it has children. I don't care whether the tags are stripped or not, but I want it concatenated together.
How do I do this? I'm using lxml.html.xpath in Python 2, but I don't think it matters. This answer to another question makes me think that maybe child:: has something to do with the answer.
Here's some code that reproduces it.
from lxml import html

tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")
for row in tree.xpath("//*[@class='asdf']/text()"):
    print("row: ", row)
You should use .text_content():
.text_content():
Returns the text content of the element, including the text content of its children, with no markup.
for row in tree.xpath("//*[@class='asdf']"):
    print("row: ", row.text_content())
Demo:
>>> from lxml import html
>>>
>>> tree = html.fromstring("""
... <pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
... <pre class="asdf">content b</pre>
... """)
>>> for row in tree.xpath("//*[@class='asdf']"):
...     print("row: ", row.text_content())
...
('row: ', 'content a')
('row: ', 'content b')
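If you would rather stay closer to XPath, or you are working with plain lxml.etree elements that may not have .text_content(), two alternatives should give the same concatenated text. This is only a sketch using the sample markup from the question; the variable names (elements, via_string, via_itertext) are made up for illustration.

```python
from lxml import html

# Same sample markup as in the question.
tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")

# Select the elements themselves rather than their text nodes.
elements = tree.xpath("//*[@class='asdf']")

# XPath string() evaluated relative to each element returns its string value,
# which is all of its descendant text nodes concatenated in document order.
via_string = [el.xpath("string()") for el in elements]

# itertext() yields each text fragment inside the element; join them by hand.
via_itertext = ["".join(el.itertext()) for el in elements]

print(via_string)
print(via_itertext)
```

Both approaches select the element first and only then collect its text, which is the same idea as .text_content(): the concatenation happens per element rather than per text node.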