>>> text = '<a data-lecture-id="47"\n data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n data-modal=".course-modal-frame"\n rel="lecture-link"\n class="lecture-link">\nanother diversion: softmax output function [7 min]</a>'
>>> import re
>>> re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
[('47', ''), ('', 'another diversion: softmax output function [7 min]')]
How can I extract the data so that the result looks like this:
['47', 'another diversion: softmax output function [7 min]']
I think there should be a smarter regular expression for this.
It is not recommended to parse HTML with regular expressions. You can give the xml.dom.minidom module a try:
from xml.dom.minidom import parseString

# Parse the snippet as XML; this works only because the tag is well-formed.
xml = parseString('<a data-lecture-id="47"\n data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n data-modal=".course-modal-frame"\n rel="lecture-link"\n class="lecture-link">\nanother diversion: softmax output function [7 min]</a>')
anchor = xml.getElementsByTagName("a")[0]
# The text node starts with a newline, so strip() removes the surrounding whitespace.
print(anchor.getAttribute("data-lecture-id"), anchor.childNodes[0].data.strip())
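If a regular expression is still preferred for a small, known snippet like this one, a single pattern with two capture groups returns both values from one match, without the empty-string placeholders that the alternation in the question produces. This is only a minimal sketch: it assumes the lecture id attribute comes before the link text and that no attribute value contains a ">" character. The names pattern, match and result are illustrative.

```python
import re

# Sample anchor tag from the question; the \n escapes are real newlines.
text = ('<a data-lecture-id="47"\n'
        ' data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n'
        ' href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n'
        ' data-modal=".course-modal-frame"\n'
        ' rel="lecture-link"\n'
        ' class="lecture-link">\n'
        'another diversion: softmax output function [7 min]</a>')

# Group 1: the digits of data-lecture-id.
# [^>]* skips the remaining attributes (newlines included) up to the end of the opening tag.
# Group 2: the link text, with surrounding whitespace left outside the group.
pattern = re.compile(r'data-lecture-id="(\d+)"[^>]*>\s*(.*?)\s*</a>', re.DOTALL)

match = pattern.search(text)
# Fall back to an empty list when the snippet does not match (assumed behavior).
result = list(match.groups()) if match else []
print(result)
```

For pages with many links, pattern.findall(text) would give one (id, title) tuple per anchor under the same assumptions, but a real parser is still the safer choice for arbitrary HTML.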