I coded a simple crawler. In the settings.py file, following the Scrapy documentation, I used:

    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
If I stop the crawler and then restart it, it scrapes the duplicate URLs again. Am I doing something wrong?
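For context, here is a minimal sketch of what the relevant part of settings.py might look like; the project name myproject is a placeholder, and the DUPEFILTER_CLASS line mirrors the one above:

    # settings.py -- minimal sketch; 'myproject' is a placeholder name.
    BOT_NAME = 'myproject'
    SPIDER_MODULES = ['myproject.spiders']
    NEWSPIDER_MODULE = 'myproject.spiders'

    # RFPDupeFilter is already Scrapy's default, so this line changes nothing
    # by itself. Newer Scrapy versions spell the module 'scrapy.dupefilters'.
    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'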
I believe you are looking for "persistence support", i.e. pausing and resuming crawls. The default RFPDupeFilter keeps its fingerprints of seen requests in memory, so they are lost when the crawler stops; that is why duplicates are scraped again after a restart. To enable persistence you can run:
    scrapy crawl somespider -s JOBDIR=crawls/somespider-1
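You can then stop the spider safely at any time (for example with Ctrl-C) and resume it later by issuing the same command with the same JOBDIR:

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

With JOBDIR set, RFPDupeFilter persists its seen-request fingerprints to a requests.seen file inside that directory, so a resumed crawl skips URLs it has already visited. Note that the directory must not be shared between different spiders, or between separate jobs of the same spider.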
You can read more in the Scrapy documentation on jobs (pausing and resuming crawls).