I am practicing scraping with BeautifulSoup. Below are my code and a screenshot of the webpage and its elements. I am trying to get the title of each post on reddit.com.
Code:

    import urllib2
    from bs4 import BeautifulSoup

    url = 'https://www.reddit.com/'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')

    posttitles = soup.find_all("div", {"class", "thing"})
    for title in posttitles:
        tclass = title.find("div", {"class", "entry"})
        posttitle = tclass.find("a", {"class", "title"})
        print posttitle
        print "\n\n"
Error:

    Traceback (most recent call last):
      File "scrapingtest.py", line 21, in <module>
        posttitle = tclass.find("a", {"class", "title"})
    AttributeError: 'NoneType' object has no attribute 'find'
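Reason for the error

    Traceback (most recent call last):
      File "scrapingtest.py", line 21, in <module>
        posttitle = tclass.find("a", {"class", "title"})
    AttributeError: 'NoneType' object has no attribute 'find'

You are getting this error because the value of tclass is None, so you cannot call find on it. That is what the error message states. find returns None when no matching element exists, so for at least one of the divs the lookup for the entry div found nothing.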
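A minimal sketch of guarding against that None, assuming Python 3 and an inline HTML sample instead of a live request. The sample markup, the extract_titles helper and the post titles are all made up for illustration, and the attribute filters are written as dictionaries here:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the second "thing" has no entry div.
SAMPLE_HTML = """
<div class="thing"><div class="entry"><a class="title" href="#">First post</a></div></div>
<div class="thing"><p>placeholder block without an entry div</p></div>
<div class="thing"><div class="entry"><a class="title" href="#">Second post</a></div></div>
"""


def extract_titles(soup):
    """Collect title text, skipping any div where a lookup returns None."""
    titles = []
    for thing in soup.find_all("div", {"class": "thing"}):
        entry = thing.find("div", {"class": "entry"})
        if entry is None:
            # find() found nothing; calling entry.find() here would raise AttributeError.
            continue
        link = entry.find("a", {"class": "title"})
        if link is not None:
            titles.append(link.get_text(strip=True))
    return titles


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
titles = extract_titles(soup)
print(titles)
```

Skipping the unmatched div avoids the crash, but it does not explain why the element is missing, so it is still worth inspecting the response itself.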
Debugging

Please print out the value of soup and check the HTML response you actually get. Reddit may block repeated requests and send a simple message instead of the usual listings, in which case there are no entry elements to find.
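One way to make that check repeatable is a small report on the response body. This is only a sketch under assumptions: summarize_response is a hypothetical helper, the regular expression is a rough heuristic for "a class attribute containing the word thing", and both sample bodies are invented rather than real Reddit output.

```python
import re

# Heuristic: a class attribute whose value contains the word "thing".
THING_CLASS = re.compile(r'class="[^"]*\bthing\b[^"]*"')


def summarize_response(html, preview_chars=200):
    """Return a few facts that help tell a listing page from a block page."""
    return {
        "length": len(html),
        "has_listings": THING_CLASS.search(html) is not None,
        "preview": html[:preview_chars],
    }


# Hypothetical bodies for illustration only.
listing_html = '<div class=" thing link "><div class="entry"><a class="title">Hi</a></div></div>'
blocked_html = "<html><body><p>placeholder: request refused</p></body></html>"

listing_report = summarize_response(listing_html)
blocked_report = summarize_response(blocked_html)
print(listing_report["has_listings"], blocked_report["has_listings"])
```

If has_listings comes back False for the real response, reading the preview usually shows what the server sent instead.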
Possible workaround

Use a proper User-Agent header and other request settings to simulate the behaviour of a human browsing Reddit in a browser.

You might also want to try doing this with Selenium.
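As a sketch of the User-Agent part, assuming Python 3 (where urllib2.Request became urllib.request.Request): the header value and the build_request helper below are made-up placeholders, and whether any particular value is accepted by Reddit is not something this example can promise.

```python
import urllib.request

# Hypothetical User-Agent string; pick one that honestly describes your client.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) scraping-practice/0.1"


def build_request(url, user_agent=USER_AGENT):
    """Build a request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://www.reddit.com/")
# Request stores header names capitalized, hence "User-agent" in the lookup.
print(request.get_header("User-agent"))

# To actually fetch the page (not run here, needs network access):
# page = urllib.request.urlopen(request, timeout=10)
```

The returned page object can be passed to BeautifulSoup the same way as in the question. In Python 2 the equivalent would be urllib2.Request(url, headers={...}) followed by urllib2.urlopen(request).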
Alternatives

Reddit provides APIs for collecting data and building bots. I have never tried them, so I am not sure what is allowed and what is not. You might look at the APIs to see if they match your needs.