I'm writing a Scrapy spider that scrapes an API for all its items. The API does not provide the total count of results, so I go through all the pages in sequence until a page returns zero results. When it does, my spider currently exits.
Instead, I would like the spider to wait for 30 minutes, then try the same page again. Based on the previous question Scrapy: non-blocking pause, I tried the following code:
from twisted.internet import defer, reactor

def parse(self, response):
    items = json.loads(response.text)
    for item in items:
        yield scrapy.Request(f'{self.settings.get("API_URL")}/{item["id"]}',
                             callback=self.parse_item,
                             headers=self.settings.get('API_HEADERS')
                             )
    if len(items) == 0:
        self.logger.info('No new items found. Waiting for 30 mins...')
        d = defer.Deferred()
        reactor.callLater(60.0*30.0, d.callback, self.page_request())
        return d
but I get an error twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed.
Since I am not familiar with Twisted, and just learning Scrapy, I wonder if anyone has a suggestion how to make progress. Thanks!
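One non-blocking way to get that 30-minute wait, sketched here under the assumption of Scrapy 2.7+ with TWISTED_REACTOR set to the asyncio reactor in settings.py, is to turn parse into a coroutine and await asyncio.sleep(), so the reactor keeps running while the spider waits:

import asyncio

# assumes in settings.py:
# TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

async def parse(self, response):
    items = json.loads(response.text)
    for item in items:
        yield scrapy.Request(f'{self.settings.get("API_URL")}/{item["id"]}',
                             callback=self.parse_item,
                             headers=self.settings.get('API_HEADERS'))
    if not items:
        self.logger.info('No new items found. Waiting for 30 mins...')
        await asyncio.sleep(60 * 30)   # non-blocking pause: other scheduled work keeps running
        yield self.page_request()      # then retry the same page via the page_request() helper from the question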
Related
I'm working on a scraping project with Scrapy and I ended up with 37 spiders. I want to set up a cron job for these spiders, but first I want to group all 37 spiders under a main spider, so that I can run a single cron job on the main spider instead of 37 separate cron jobs.
Do you have any ideas?
Why not create a script that runs all these spiders and use cron to schedule that?
See the Scrapy documentation on running spiders from a script.
Here is an example snippet from one of my projects:
from scrapy.crawler import CrawlerRunner
from twisted.internet import defer, reactor

def run_spider_crawler(self):
    # .. other code here..
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # each yield waits for the previous crawl to finish before starting the next one
        yield runner.crawl(spider1)
        yield runner.crawl(spider2)
        yield runner.crawl(spider3)
        yield runner.crawl(spider4)
        yield runner.crawl(spider5)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished
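If listing 37 spider classes by hand gets unwieldy, a possible variant (just a sketch; it assumes the script lives inside the Scrapy project so that get_project_settings() can find your settings) is to let the runner's spider loader discover every spider in the project:

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import defer, reactor

runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    for spider_name in runner.spider_loader.list():   # every spider registered in the project
        yield runner.crawl(spider_name)
    reactor.stop()

crawl()
reactor.run()

cron then only has to schedule this one script.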
I am running a Scrapy spider to export some football data, using the scrapy-splash plugin.
For development I am running the spider against cached results, so as not to hit the website too much. The strange thing is that I am consistently missing some items in the export when running the spider with multiple start_urls. The number of missing items differs slightly each time.
However, when I comment out all but one of the start_urls and run the spider for each one separately, I get all the results. I am not sure whether this is a bug in Scrapy or whether I am missing something about it, as this is my first project with the framework.
Here is my caching configuration:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [403]
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
These are my start_urls:
start_urls = [
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2017',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2018',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2019',
    'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2020',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2017',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2018',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2019',
    'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2020'
]
I have a standard setup with an export pipeline, and my spider yields Splash requests multiple times for each relevant URL on the page from a parse method. Each parse method either fills the same item passed via cb_kwargs or creates a new one with data from the passed item.
Please let me know if further code from my project, like the spider, pipelines or item loaders might be relevant to the issue and I will edit my question here.
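For reference, here is a rough sketch of the cb_kwargs pattern described above; the callback names, selectors and item fields are hypothetical, since the actual spider is not shown. One thing worth checking in a setup like this is whether the same dict object is shared between concurrent callbacks; passing a copy avoids that:

from scrapy_splash import SplashRequest

def parse_league(self, response):
    for match_url in response.css('a.match::attr(href)').getall():
        item = {'league_url': response.url}          # data collected so far on this page
        yield SplashRequest(
            response.urljoin(match_url),
            callback=self.parse_match,
            args={'wait': 0.5},
            cb_kwargs={'item': dict(item)},          # pass a copy so parallel callbacks do not mutate one shared dict
        )

def parse_match(self, response, item):
    item['match_url'] = response.url
    yield item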
Is there a code example showing the minimal structure of a broad crawl with Scrapy?
Some desirable requirements:
crawl in BFO order; (DEPTH_PRIORITY?)
crawl only from URLs that follow certain patterns; and (LinkExtractor?)
URLs must have a maximum depth. (DEPTH_LIMIT)
I am starting with:
import scrapy
from scrapy import Request
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class WebSpider(scrapy.Spider):
    name = "webspider"

    def __init__(self):
        super().__init__()
        # follow only links whose URLs match this pattern
        self.link_extractor = LxmlLinkExtractor(allow=r"\.br/")
        self.collection_file = open("collection.jsonl", 'w')

    start_urls = [
        "https://www.uol.com.br/"
    ]

    def parse(self, response):
        # store the raw HTML of the page, then follow every extracted link
        data = {
            "url": response.request.url,
            "html_content": response.body
        }
        self.collection_file.write(f"{data}\n")
        for link in self.link_extractor.extract_links(response):
            yield Request(link.url, callback=self.parse)
Is that the correct approach to crawl and store the raw HTML on disk?
How do I stop the spider once it has collected n pages?
How can I show some stats (pages/min, errors, number of pages collected so far) instead of the standard log?
It should work, but you are writing the file manually instead of letting Scrapy's feed exports do it for you.
CLOSESPIDER_PAGECOUNT. Mind that the spider only starts shutting down once it reaches the specified number of pages, and it will still finish processing some pages that are already in progress.
I suspect you just want to set LOG_LEVEL to something else.
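Putting those three answers together, a minimal settings.py for this kind of broad crawl could look like the sketch below; the concrete values are illustrative assumptions, and the FEEDS setting (which lets Scrapy write the yielded items instead of the manual file handling above) requires Scrapy 2.1 or later:

# settings.py (sketch)
DEPTH_PRIORITY = 1                                      # with the FIFO queues below, approximates BFO order
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
DEPTH_LIMIT = 3                                         # maximum crawl depth
CLOSESPIDER_PAGECOUNT = 1000                            # stop after roughly this many responses
LOG_LEVEL = 'INFO'                                      # periodic LogStats lines (pages/min, item counts) without DEBUG noise
FEEDS = {
    'collection.jsonl': {'format': 'jsonlines'},        # export items yielded by the spider
}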
I have a complex scraping application in Scrapy that runs in multiple stages (each stage is a function that calls the next stage of scraping and parsing). The spider tries to download multiple targets, and each target consists of a large number of files. What I need is to call some function once all the files of a target have been downloaded: it cannot process them partially, it needs the whole set of files for the target at the same time. Is there a way to do this?
If you cannot wait until the whole spider is finished, you will have to write some logic in an item pipeline that keeps track of what you have scraped and executes a function at the right moment.
Below is some logic to get you started: it keeps track of the number of items you scraped per target, and when it reaches 100 it executes the target_complete method. Note that you will have to fill in the 'target' field of the item yourself, of course.
from collections import Counter


class TargetCountPipeline(object):
    def __init__(self):
        self.target_counter = Counter()
        self.target_number = 100

    def process_item(self, item, spider):
        # count the items scraped for this target and trigger processing once the threshold is reached
        target = item['target']
        self.target_counter[target] += 1
        if self.target_counter[target] >= self.target_number:
            self.target_complete(target)
        return item

    def target_complete(self, target):
        # execute something here when you have reached the target
        pass
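To have Scrapy actually run it, the pipeline needs to be registered in settings.py; the module path below is just an assumption about your project layout:

ITEM_PIPELINES = {
    'myproject.pipelines.TargetCountPipeline': 300,
}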
I've implemented a web application that triggers Scrapy spiders using the scrapyd API (the web app and scrapyd are running on the same server).
My web application is storing job ids returned from scrapyd in DB.
My spiders are storing items in DB.
Question is: how can I link, in the DB, the job id issued by scrapyd with the items produced by the crawl?
I could trigger my spider with an extra parameter - let's say an ID generated by my web application - but I'm not sure it is the best solution. After all, there is no need to create that ID if scrapyd already issues one...
Thanks for your help
The question should be phrased as "How can I get a job id of a scrapyd task in runtime?"
When scrapyd runs a spider, it actually passes the job id to the spider as an argument.
It should always be the last argument in sys.argv.
Also,
os.environ['SCRAPY_JOB'] should do the trick.
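For illustration, a minimal sketch of the environment-variable approach (the spider name and item fields here are made up):

import os
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def parse(self, response):
        # SCRAPY_JOB is set by scrapyd in the crawl process it launches; it is absent when running `scrapy crawl` locally
        job_id = os.environ.get('SCRAPY_JOB')
        yield {'job_id': job_id, 'url': response.url}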
In the spider constructor (inside __init__), add the line:
self.jobId = kwargs.get('_job')
Then, in the parse function, pass it along in the item:
def parse(self, response):
    data = {}
    # ... fill in the rest of the scraped fields ...
    data['jobId'] = self.jobId
    yield data
In the pipeline, add this:
def process_item(self, item, spider):
    self.jobId = item['jobId']
    # ... store the item together with its job id ...
    return item