I am trying to obtain the data-pid and price from Craigslist using BeautifulSoup. I have written separate code that gives me the file clallsites.txt. In this code I am trying to grab each of the sites from that txt file and get the PIDs of all the entries in the first 10 pages. My code is:
    from bs4 import BeautifulSoup
    from urllib2 import urlopen
    readfile = open("clallsites.txt")
    product = "mcy"
    while 1:
        u = ""
        count = 0
        line = readfile.readline()
        commaposition = line.find(',')
        site = line[0:commaposition]
        location = line[commaposition+1:]
        site_filename = location + '.txt'
        f = open(site_filename, "a")
        while (count < 10):
            sitenow = site + "\\" + product + "\\" + str(u)
            html = urlopen(str(sitenow))
            soup = BeautifulSoup(html)
            postings = soup('p',{"class":"row"})
            for post in postings:
                y = post['data-pid']
                print y
            count = count +1
            index = count*100
            u = "index" + str(index) + ".html"
        if not line: break
        pass
My clallsites.txt looks like this:
craigslist site, location
(Stack Overflow does not allow posting Craigslist links, so I cannot show the text. I can try to attach the text file if that helps.)
When I run the code I get the following error:
Traceback (most recent call last):
  File "reading.py", line 16, in <module>
    html = urlopen(str(sitenow))
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)
urllib2.URLError:
Any ideas what I am doing wrong?
I don't know the content of sitenow, but it looks like an invalid URL. Note that URLs use slashes, not backslashes, so the statement should be similar to sitenow = site + "/" + product + "/" + str(u).
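To make the slash fix concrete, here is a minimal sketch of building the page URLs separately from the download step, so they can be printed and checked before calling urlopen. The helper names parse_site_line and page_urls and the host http://example.org are made up for illustration. The index100.html paging scheme is copied from the question, and I have not verified that the site actually uses it. The sketch also assumes each site entry in clallsites.txt includes the scheme (http://), which urlopen needs.

```python
def parse_site_line(line):
    """Split one 'site, location' line into (site, location).

    Hypothetical helper: strips surrounding whitespace, including the
    trailing newline that readline() leaves on the location part.
    """
    site, _, location = line.partition(",")
    return site.strip(), location.strip()


def page_urls(site, product, pages=10, step=100):
    """Yield listing URLs for the first `pages` pages of a category.

    Uses forward slashes only. The paging scheme (index100.html,
    index200.html, ...) is taken from the question, not verified.
    """
    base = site.rstrip("/") + "/" + product + "/"
    for page in range(pages):
        # The first page is the bare category URL; later pages add a suffix.
        suffix = "" if page == 0 else "index%d.html" % (page * step)
        yield base + suffix


# Example with a placeholder host (an assumption, not a real listing site).
site, location = parse_site_line("http://example.org, somewhere\n")
urls = list(page_urls(site, "mcy"))
```

Printing urls before fetching anything should show quickly whether each address is well formed. Stripping the line also keeps the newline out of location, which would otherwise end up in the output filename.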