python - Where to learn about scrapy SgmlLinkExtractor?


Sorry, I am new to Python and Scrapy, and I am trying to learn them by trial and error.

Regarding SgmlLinkExtractor, I see that everybody (at least on this site) is proficient in finding the right pattern to represent the right path. Where and how can I learn that? For example, allow=[r'page/\d+'] or allow=[r'series-\d{1}-episode-\d{2}.'], etc.

I am trying to scrape the content of the story.html pages off a website. The link format is this:

http://www.example.com/folder/category/description/1234567/story.html

*Note: 1234567 is a changing 7-digit number.

My start URL is http://www.example.com/folder/

I am trying to use SgmlLinkExtractor and define the path as follows. I want to include whatever is in the description portion of the URL and the 7-digit portion. I also want to make sure the URL ends with story.html:

Rule(SgmlLinkExtractor(allow=(r'category1/././story\.html',)), callback='parse_item', follow=True),

But /././ does not let me skip the two sublevels before story.html.

What is the right way to write the SgmlLinkExtractor rule?

Try this:

Rule(SgmlLinkExtractor(allow=(r'category1/description/\d+/story\.html',)), callback='parse_item', follow=True)

But I recommend using only the /description/\d+/story\.html part, because it is unique enough to crawl all categories.

The rules take regular expressions, so you need to learn regex. There are plenty of online regex tester tools available.
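Instead of an online tester, the patterns can also be tried locally with Python's re module. The following is only a rough sketch: the URLs are made up for illustration, the GENERIC pattern is my own variant (not from the answer above) for the case where the description segment varies, and it assumes the link extractor applies the allow patterns as a search against the full URL. It also shows why /././ fails: each . matches exactly one character, not a whole path segment.

```python
import re

# The asker's original attempt and the two patterns suggested in the answer.
ORIGINAL = r'category1/././story\.html'
FULL = r'category1/description/\d+/story\.html'
SHORT = r'/description/\d+/story\.html'

# Hypothetical variant (an assumption, not from the answer): any single path
# segment, then exactly 7 digits, and the URL must end with story.html.
GENERIC = r'category1/[^/]+/\d{7}/story\.html$'

# Made-up URLs for illustration only.
urls = [
    'http://www.example.com/folder/category1/description/1234567/story.html',
    'http://www.example.com/folder/category2/description/7654321/story.html',
    'http://www.example.com/folder/category1/some-title/2345678/story.html',
    'http://www.example.com/folder/category1/description/1234567/index.html',
]


def matching(pattern, candidates):
    """Return the candidate URLs in which the pattern is found anywhere."""
    return [url for url in candidates if re.search(pattern, url)]


for name, pattern in [('ORIGINAL', ORIGINAL), ('FULL', FULL),
                      ('SHORT', SHORT), ('GENERIC', GENERIC)]:
    print(name)
    for url in matching(pattern, urls):
        print('   ', url)
```

With these sample URLs, ORIGINAL matches nothing, FULL matches only the first URL, SHORT also picks up the category2 URL, and GENERIC accepts any description segment under category1 as long as it is followed by 7 digits and story.html.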

