Python - proper XPath to roll up text of children


I'm parsing a page that has a structure like this:

<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>

# returns
content a
content b

And I'm using the following XPath to get the content: "//pre[@class='asdf']/text()"

It works well, except when there are elements nested inside the <pre> tag, in which case it doesn't concatenate them:

<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>

# returns
content
content b

If I use this XPath instead, I get the output that follows: "//pre[@class='asdf']//text()"

content
a
content b

I don't want either of those. I want all the text inside a <pre>, even if it has children. I don't care whether the tags are stripped or not, but I want the text concatenated together.

How do I do this? I'm using lxml.html.xpath in Python 2, but I don't think that matters. This answer to another question makes me think that maybe child:: has something to do with the answer.

Here's some code that reproduces it.

from lxml import html

tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")
for row in tree.xpath("//*[@class='asdf']/text()"):
    print("row: ", row)

.text_content() is what you should use:

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

for row in tree.xpath("//*[@class='asdf']"):
    print("row: ", row.text_content())
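To make the "roll up" idea concrete, here is a minimal sketch of the same concatenation done by hand with itertext(), which yields every text fragment under an element in document order. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed (lxml elements expose an itertext() method as well). The wrapping <div> root and the helper name rolled_up_text are assumptions made for this illustration only, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Assumption: the sample is wrapped in a hypothetical <div> root so that it
# parses as well-formed XML with the standard library parser.
SAMPLE = """<div>
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
</div>"""


def rolled_up_text(element):
    """Join every text fragment under `element`, in document order."""
    return "".join(element.itertext())


root = ET.fromstring(SAMPLE)

# Select the <pre class="asdf"> elements themselves, not their text nodes,
# then concatenate the text of each element and its descendants.
rows = [
    rolled_up_text(pre)
    for pre in root.iter("pre")
    if pre.get("class") == "asdf"
]

for row in rows:
    print("row: ", row)
```

The point in both versions is to select the element rather than its text nodes and then let the library join the pieces.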

Demo:

>>> from lxml import html
>>>
>>> tree = html.fromstring("""
... <pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
... <pre class="asdf">content b</pre>
... """)
>>> for row in tree.xpath("//*[@class='asdf']"):
...     print("row: ", row.text_content())
...
('row: ', 'content a')
('row: ', 'content b')
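If you would rather stay inside XPath, the string() function is another option worth trying. This is a sketch under the assumption that lxml is installed. It evaluates string() relative to each matched element, because string() applied to a whole node-set only uses the first node in document order. The variable names rows and first_only are illustrative.

```python
from lxml import html

tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")

# string() evaluated on one element returns its string-value: the text of the
# element and all of its descendants, concatenated in document order.
rows = [
    str(pre.xpath("string()"))
    for pre in tree.xpath("//pre[@class='asdf']")
]

# Pitfall: string() over a node-set only converts the first matching node,
# so this yields the text of the first <pre> and silently drops the second.
first_only = str(tree.xpath("string(//pre[@class='asdf'])"))

for row in rows:
    print("row: ", row)
```

For HTML parsed with lxml.html, .text_content() is probably still the simpler choice. The string() form is mainly useful when the tree comes from a parser whose elements do not have that method.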
