python - Adding spaces on word boundaries in text extraction with lxml
Here is an example from the lxml.html documentation:
>>> from lxml import html
>>> root = html.fragment_fromstring('<p>hello<br>world!</p>')
>>> html.tostring(root, method='text')
'helloworld!'
My question: is there an easy (or "right") way of producing the string 'hello world!' instead?
You can try this approach, which adds a space to the tail text of every <br> element before extracting the text:
from lxml import html

doc = html.document_fromstring('<p>hello<br>world!</p>')
# Prepend a space to the text that follows each <br>
for br in doc.xpath("*//br"):
    br.tail = " " + br.tail if br.tail else " "
doc.text_content()
This returns:
'hello world!'
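
The same idea can be wrapped in a small helper for reuse. This is only a sketch: the function name `text_with_spaces` is hypothetical (it is not part of lxml), and it handles only <br> elements, not other tags that may also separate words.

```python
from lxml import html


def text_with_spaces(markup):
    """Return the text of an HTML string, with a space at each <br>.

    Hypothetical helper for illustration; not part of lxml itself.
    """
    doc = html.document_fromstring(markup)
    for br in doc.iter("br"):
        # The text following a <br> lives in its .tail attribute,
        # which is None when nothing follows it.
        br.tail = " " + (br.tail or "")
    return doc.text_content()


print(text_with_spaces("<p>hello<br>world!</p>"))  # hello world!
```

Editing `tail` changes only the text stored in the parsed tree, so the helper parses a fresh document on each call and leaves the caller's markup string untouched.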