I am practicing web scraping with BeautifulSoup. Below are my code and a screenshot of the web page and its elements. I am trying to get the title of each post on reddit.com.
Code:

    import urllib2
    from bs4 import BeautifulSoup

    url = 'https://www.reddit.com/'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')

    posttitles = soup.find_all("div", {"class", "thing"})
    for title in posttitles:
        tclass = title.find("div", {"class", "entry"})
        posttitle = tclass.find("a", {"class", "title"})
        print posttitle
        print "\n\n"

Error:

    Traceback (most recent call last):
      File "scrapingtest.py", line 21, in <module>
        posttitle = tclass.find("a", {"class", "title"})
    AttributeError: 'NoneType' object has no attribute 'find'
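Reason for the error

    Traceback (most recent call last):
      File "scrapingtest.py", line 21, in <module>
        posttitle = tclass.find("a", {"class", "title"})
    AttributeError: 'NoneType' object has no attribute 'find'

You are getting this error because the value of tclass is None. find() returns None when no matching element exists, so at least one "thing" block had no "entry" div inside it, and you cannot call find on None. That is what the error message states.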
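A minimal sketch of guarding against that None, written for Python 3 with bs4. The HTML snippet is made up for illustration and is not Reddit's real markup, and the helper name extract_titles is hypothetical. It passes the class filter as a dict ({"class": "thing"}), which is the documented form of the attrs argument.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; the second "thing" has no "entry" div.
SAMPLE_HTML = """
<div class="thing"><div class="entry"><a class="title" href="/a">First post</a></div></div>
<div class="thing"><p>no entry here</p></div>
"""


def extract_titles(html):
    """Return the text of every a.title found inside a div.entry inside a div.thing."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for thing in soup.find_all("div", {"class": "thing"}):
        entry = thing.find("div", {"class": "entry"})
        if entry is None:
            # find() returns None when nothing matches, so skip this block
            # instead of calling .find on None.
            continue
        link = entry.find("a", {"class": "title"})
        if link is not None:
            titles.append(link.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))
```

With this guard, a block that lacks the expected structure is skipped rather than raising AttributeError.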
Debugging

Please print out the value of soup and check the HTML response you actually get. Reddit may block repeated requests and send a simple message instead of the usual listings.
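One way to run that check, sketched in Python 3 with bs4: summarise what came back before trying to parse listings. The helper name describe_response and the sample "blocked" page are made up for illustration; the real message Reddit sends may look different.

```python
from bs4 import BeautifulSoup


def describe_response(html):
    """Summarise a page: its <title> text and how many div.thing blocks it has."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    things = soup.find_all("div", {"class": "thing"})
    return {"title": title, "thing_count": len(things)}


# Made-up stand-in for a page served instead of the listings.
BLOCKED_HTML = (
    "<html><head><title>Too Many Requests</title></head>"
    "<body>slow down</body></html>"
)

info = describe_response(BLOCKED_HTML)
print(info)
if info["thing_count"] == 0:
    print("No listing blocks found; inspect the raw HTML before parsing further.")
```

If thing_count is zero for the page you downloaded, the selectors are not the problem; the response itself is not the listing page.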
Possible workaround

Use a proper User-Agent and other request headers to simulate the behaviour of a human being browsing Reddit in a browser.
You might also want to try doing this with Selenium.
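A small sketch of the User-Agent suggestion, using Python 3's urllib.request (in Python 2 the equivalent class is urllib2.Request, which accepts the same headers argument). The User-Agent string and the helper names build_request and fetch are placeholders chosen for illustration, and sending the header is no guarantee that Reddit will serve the listings.

```python
import urllib.request

# Placeholder User-Agent string (an assumption); substitute one that
# identifies your own client.
USER_AGENT = "Mozilla/5.0 (compatible; scraping-practice/0.1)"


def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes (performs a real network request)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("https://www.reddit.com/")
print(req.get_header("User-agent"))
```

The bytes returned by fetch(url) can be passed to BeautifulSoup(..., 'html.parser') in place of the page object from the original code.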
Alternatives

Reddit provides APIs for collecting data and building bots. I have never tried them, so I am not sure what is allowed and what is not. You might look at the APIs to see if they match your needs.
