Sorry, I'm new to Python and Scrapy, and I'm trying to learn them by trial and error.
Regarding SgmlLinkExtractor, I see that everybody (at least on this site) is proficient at finding the right code to represent the right path. Where/how can I learn that? For example, (allow=[r'page/\d+']) or allow=[r'series-\d{1}-episode-\d{2}.'], etc.
I am trying to scrape a website for the content in story.html. The link format is this:
http://www.example.com/folder/category/description/1234567/story.html
*Note: 1234567 is a changing 7-digit number.
My start URL is http://www.example.com/folder/
I am trying to use SgmlLinkExtractor and define the path as follows. I want to include whatever is in the description portion of the URL and in the 7-digit portion. I want to make sure the URL ends with story.html:
Rule(SgmlLinkExtractor(allow=(r'category1/././story\.html',)), callback='parse_item', follow=True),
But /././ does not let me skip the 2 sublevels before story.html.
What is the right way to write the SgmlLinkExtractor?
Try this:
Rule(SgmlLinkExtractor(allow=(r'category1/description/\d+/story\.html',)), callback='parse_item', follow=True)
But I recommend using only the /description/\d+/story\.html part, because it is unique enough to crawl the categories.
In the rules you pass a regex, so you need to learn regex. There are a bunch of online regex tester tools available.
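If you would rather test a pattern locally than in an online tool, here is a minimal sketch using only Python's re module. It assumes the allow patterns are applied to the URL with search semantics (a match anywhere in the URL counts). The extra sample URLs and the "flexible" pattern are hypothetical and are only there to illustrate the point: "." matches exactly one character, so /././ only matches one-character path segments, while [^/]+ matches a whole segment.

```python
import re

# Sample URLs: the first follows the format from the question,
# the others are made up for illustration.
urls = [
    "http://www.example.com/folder/category/description/1234567/story.html",
    "http://www.example.com/folder/category/other-text/7654321/story.html",
    "http://www.example.com/folder/category/description/1234567/index.html",
    "http://www.example.com/folder/category/a/b/story.html",
]

# "." matches exactly one character, so this only accepts
# one-character segments between "category/" and "story.html".
single_char = re.compile(r"category/././story\.html")

# Hypothetical alternative: any non-slash text, then exactly
# 7 digits, then story.html at the very end of the URL.
flexible = re.compile(r"/[^/]+/\d{7}/story\.html$")


def matching(pattern, candidates):
    """Return the candidates that the pattern matches anywhere."""
    return [url for url in candidates if pattern.search(url)]


single_char_hits = matching(single_char, urls)
flexible_hits = matching(flexible, urls)

for url in urls:
    print(url, "->", bool(flexible.search(url)))
```

If the description part of your URLs really varies, something like the second pattern could go into allow=(...) in place of the literal "description", but check it against your own URLs first.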