Sorry, I am new to Python and Scrapy and am trying to learn them by trial and error.
Regarding SgmlLinkExtractor, I see that everybody (at least on this site) is proficient at finding the right pattern to represent the right path. Where or how can I learn that? For example, allow=[r'page/\d+'] or allow=[r'series-\d{1}-episode-\d{2}.'], and so on.
I am trying to scrape a website whose content is in story.html. The link format is this:
http://www.example.com/folder/category/description/1234567/story.html
(Note: 1234567 is a 7-digit number that changes from story to story.)
My start URL is http://www.example.com/folder/
I am trying to use SgmlLinkExtractor and have defined the path as follows. I want to include whatever is in the description portion of the URL and in the 7-digit portion, and I want to make sure the URL ends with story.html:
Rule(SgmlLinkExtractor(allow=(r'category1/././story\.html',)), callback='parse_item', follow=True)
But /././ does not let me skip the 2 sublevels before story.html. What is the right way to write the SgmlLinkExtractor?
Try this:
Rule(SgmlLinkExtractor(allow=(r'category1/description/\d+/story\.html',)), callback='parse_item', follow=True)
But I recommend using only the /description/\d+/story\.html part, because it is unique enough to crawl the categories.
In the rules you pass a regex, so you need to learn regex. There are a bunch of online regex tester tools available.
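You can also check a pattern locally with Python's re module before putting it in a rule. The sketch below is only an illustration: the sample URLs and the segment names in them are made up, and the second pattern is a hypothetical variant (not the one given above) that accepts any description segment followed by exactly 7 digits. It uses re.search because, as far as I know, Scrapy's link extractors search for the allow pattern anywhere in the URL. It also shows why /././ fails: each . matches exactly one character, not a whole path segment.

```python
import re

# Sample URLs; the category and description names are made up for illustration.
urls = [
    "http://www.example.com/folder/category1/some-description/1234567/story.html",
    "http://www.example.com/folder/category1/other/7654321/story.html",
    "http://www.example.com/folder/category1/some-description/1234567/index.html",
    "http://www.example.com/folder/category1/",
]

# Pattern from the question: each "." matches exactly one character,
# so "/././" only matches single-character path segments.
original = re.compile(r"category1/././story\.html")

# Hypothetical alternative: [^/]+ matches one whole path segment,
# \d{7} matches exactly seven digits, and $ anchors the end of the URL.
flexible = re.compile(r"category1/[^/]+/\d{7}/story\.html$")


def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere (re.search)."""
    return [u for u in candidates if pattern.search(u)]


original_hits = matching(original, urls)
flexible_hits = matching(flexible, urls)

print(original_hits)   # URLs accepted by the pattern from the question
print(flexible_hits)   # URLs accepted by the hypothetical alternative
```

If the alternative behaves the way you want on your own sample URLs, the same raw string can be passed in the allow tuple of the rule.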