Python regex: find two groups


    >>> text = '<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nanother diversion: softmax output function [7 min]</a>'
    >>> import re
    >>> re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
    [('47', ''), ('', 'another diversion: softmax output function [7 min]')]

How can I extract the data out of this, so that I get:

    ['47', 'another diversion: softmax output function [7 min]']

I think there should be a smarter regex expression for this.

It is not recommended to parse HTML with regular expressions. You can give the xml.dom.minidom module a try:

    from xml.dom.minidom import parseString

    xml = parseString('<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nanother diversion: softmax output function [7 min]</a>')
    # Take the first <a> element in the parsed document
    anchor = xml.getElementsByTagName("a")[0]
    # The link text is the first child (a text node); strip() drops its leading newline
    print(anchor.getAttribute("data-lecture-id"), anchor.childNodes[0].data.strip())
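If you would rather stay with a regex for a small, known snippet like this one, here is a minimal sketch of two options. It assumes the `data-lecture-id` attribute always comes before the link text and that no attribute value contains a `>`. The names `pattern`, `result` and `flat` are just illustrative. The first option puts both groups into a single match instead of an alternation. The second keeps your original pattern and simply drops the empty strings from the `findall` tuples.

```python
import re

text = (
    '<a data-lecture-id="47"\n'
    '   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
    '   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
    '   data-modal=".course-modal-frame"\n'
    '   rel="lecture-link"\n'
    '   class="lecture-link">\n'
    'another diversion: softmax output function [7 min]</a>'
)

# Option 1: one pattern, two groups in the same match.
# [^>]* skips the remaining attributes (a negated class also matches newlines),
# \s* drops the newline after '>', and (.*?) lazily captures the link text.
pattern = re.compile(r'data-lecture-id="(\d+)"[^>]*>\s*(.*?)</a>', re.DOTALL)
match = pattern.search(text)
result = list(match.groups()) if match else []
print(result)
# ['47', 'another diversion: softmax output function [7 min]']

# Option 2: keep the original alternation and flatten the tuples,
# discarding the empty string left by whichever alternative did not match.
pairs = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
flat = [group for pair in pairs for group in pair if group]
print(flat)
# ['47', 'another diversion: softmax output function [7 min]']
```

With several links in one page, `pattern.findall(text)` would return one `(id, title)` tuple per link, which is usually easier to work with than the alternation output.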
