Use scrapyd job id in scrapy pipelines - scrapy

I've implemented a web application that triggers scrapy spiders using the scrapyd API (the web app and scrapyd are running on the same server).
My web application is storing job ids returned from scrapyd in DB.
My spiders are storing items in DB.
Question is: how can I link, in the DB, the job id issued by scrapyd with the items produced by the crawl?
I could trigger my spider with an extra parameter, say an ID generated by my web application, but I'm not sure that is the best solution. After all, there is no need to create that ID if scrapyd already issues one...
Thanks for your help

The question should be phrased as "How can I get the job id of a scrapyd task at runtime?"
When scrapyd runs a spider it actually passes the job id to the spider as an argument.
It should always be the last argument in sys.argv.
Also, os.environ['SCRAPY_JOB'] should do the trick.
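For example, a minimal sketch of a spider that picks the job id up in its constructor and falls back to the environment variable (the spider name here is only illustrative):

    import os
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'my_spider'

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # scrapyd passes the job id as the '_job' spider argument and also
            # exposes it to the spider process as the SCRAPY_JOB environment variable
            self.job_id = kwargs.get('_job') or os.environ.get('SCRAPY_JOB')

        def parse(self, response):
            self.logger.info('running as scrapyd job %s', self.job_id)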

In the spider constructor (inside __init__), add the line:

    self.jobId = kwargs.get('_job')

then in the parse function put it on the item:

    def parse(self, response):
        data = {}
        # ... fill the other fields ...
        data['jobId'] = self.jobId
        yield data

and in the pipeline add this:

    def process_item(self, item, spider):
        self.jobId = item['jobId']
        # ...
        return item
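To actually link the job and the items in the database, the pipeline can then persist the job id alongside each item. A minimal sketch with sqlite3 (the database file, table and column names are placeholders, not part of the original answer):

    import json
    import sqlite3

    class JobIdDbPipeline(object):

        def open_spider(self, spider):
            self.conn = sqlite3.connect('items.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS items (job_id TEXT, data TEXT)')

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

        def process_item(self, item, spider):
            # item['jobId'] was filled in by the spider from the '_job' argument
            self.conn.execute(
                'INSERT INTO items (job_id, data) VALUES (?, ?)',
                (item['jobId'], json.dumps(dict(item))))
            return item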

Related

Regroup multiple spiders in a main spider

I'm currently working on a scraping project using scrapy and I ended up with 37 spiders. I want to set up a cron job for these spiders, but first I want to regroup all 37 spiders under a single main spider. That way I would run a single cron job on the main spider instead of 37 cron jobs.
Do you have any ideas?
Why not create a script that runs all these spiders and use cron to schedule that?
See the documentation for creating a script.
Here is an example snippet from one of my projects:
    def run_spider_crawler(self):
        # .. other code here..
        runner = CrawlerRunner()

        @defer.inlineCallbacks
        def crawl():
            yield runner.crawl(spider1)
            yield runner.crawl(spider2)
            yield runner.crawl(spider3)
            yield runner.crawl(spider4)
            yield runner.crawl(spider5)
            reactor.stop()

        crawl()
        reactor.run()  # the script will block here until the last crawl call is finished
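If the goal is simply to run every spider in the project from one cron entry, a rough sketch along these lines (not taken from the project above) might also work; it uses Scrapy's SpiderLoader to discover all spiders registered in the project and assumes the script is run from inside the Scrapy project directory:

    from scrapy.crawler import CrawlerProcess
    from scrapy.spiderloader import SpiderLoader
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    process = CrawlerProcess(settings)

    # schedule every spider registered in the project (all 37 of them)
    for spider_name in SpiderLoader.from_settings(settings).list():
        process.crawl(spider_name)

    process.start()  # blocks until the last crawl is finished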

Scrapy spider on an API to wait until new items are available

I'm writing a Scrapy spider that scrapes an API for all its items. The API does not provide the total count of results, so I go through all the pages in sequence until a page returns zero results. When it does, my spider currently exits.
Instead, I would like the spider to wait for 30 minutes, then try the same page again. Based on the previous question Scrapy: non-blocking pause, I tried the following code:
    def parse(self, response):
        items = json.loads(response.text)
        for item in items:
            yield scrapy.Request(f'{self.settings.get("API_URL")}/{item["id"]}',
                                 callback=self.parse_item,
                                 headers=self.settings.get('API_HEADERS')
                                 )
        if len(items) == 0:
            self.logger.info('No new items found. Waiting for 30 mins...')
            d = defer.Deferred()
            reactor.callLater(60.0*30.0, d.callback, self.page_request())
            return d
but I get an error twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed.
Since I am not familiar with Twisted, and just learning Scrapy, I wonder if anyone has a suggestion how to make progress. Thanks!

Missing results in export when running scrapy spider with multiple start_urls

I am running a scrapy spider to export some football data and I am using the scrapy-splash plugin.
For development I am running the spider against cached results, so as not to hit the website too much. The strange thing is that I am consistently missing some items in the export when running the spider with multiple start_urls. The number of missing items differs slightly each time.
However, when I comment out all but one of the start_urls and run the spider for each one separately, I get all the results. I am not sure if this is a bug in scrapy or if I am missing something about scrapy here, as this is my first project with the framework.
Here is my caching configuration:
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = [403]
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
These are my start_urls:
    start_urls = [
        'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2017',
        'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2018',
        'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2019',
        'https://www.transfermarkt.de/1-bundesliga/startseite/wettbewerb/L1/plus/?saison_id=2020',
        'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2017',
        'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2018',
        'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2019',
        'https://www.transfermarkt.de/2-bundesliga/startseite/wettbewerb/L2/plus/?saison_id=2020'
    ]
I have a standard setup with an export pipeline and my spider yielding splash requests multiple times for each relevant url on the page from a parse method. Each parse method fills the same item passed via cb_kwargs or creates a new one with data from the passed item.
Please let me know if further code from my project, like the spider, pipelines or item loaders might be relevant to the issue and I will edit my question here.

Run multiple processes of a Scrapy Spider

I have a Scrapy project which reads 1 million product IDs from a database and then starts scraping product details from a website based on those IDs.
My Spider is fully working.
I want to run 10 instances of Spider with each assigned an equal number of product IDs.
I can do it like this:
SELECT COUNT(*) FROM product_ids, then divide the count by 10, and then do
SELECT * FROM product_ids LIMIT 0, N and so on.
I have an idea that I could do it in the terminal by passing a LIMIT in the scrapy command, like scrapy crawl my_spider scrape=1000 and so on.
But I want to do it in the Spider itself, so that I run my Spider only once and it then launches 10 other processes of the same spider from within the spider.
One way to do this is using CrawlerProcess helper class or CrawlerRunner class.
    import scrapy
    from scrapy.crawler import CrawlerProcess

    class Spider1(scrapy.Spider):
        # Your first spider definition
        ...

    process = CrawlerProcess()
    process.crawl(Spider1)
    process.crawl(Spider1)
    process.start()
Note that this runs multiple spiders in the same process, not in multiple processes.
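If separate OS processes are really needed, one possible sketch is to split the ID range in a parent script and start one CrawlerProcess per chunk with the multiprocessing module. The spider name and the offset/limit arguments below are assumptions about your project, not part of the answer above:

    import multiprocessing
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_chunk(offset, limit):
        # each child process gets its own Twisted reactor
        process = CrawlerProcess(get_project_settings())
        # the spider is assumed to accept offset/limit and build its own
        # "SELECT ... LIMIT offset, limit" query from them
        process.crawl('my_spider', offset=offset, limit=limit)
        process.start()

    if __name__ == '__main__':
        total = 1_000_000   # e.g. the result of SELECT COUNT(*) FROM product_ids
        workers = 10
        chunk = total // workers
        procs = [multiprocessing.Process(target=run_chunk, args=(i * chunk, chunk))
                 for i in range(workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()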

scrapyd multiple spiders writing items to same file

I have a scrapyd server with several spiders running at the same time; I start the spiders one by one using the schedule.json endpoint. All spiders write their contents to a common file using a pipeline:
    import json
    import os

    class JsonWriterPipeline(object):

        def __init__(self, json_filename):
            # self.json_filepath = json_filepath
            self.json_filename = json_filename
            self.file = open(self.json_filename, 'wb')

        @classmethod
        def from_crawler(cls, crawler):
            save_path = '/tmp/'
            json_filename = crawler.settings.get('json_filename', 'FM_raw_export.json')
            completeName = os.path.join(save_path, json_filename)
            return cls(
                completeName
            )

        def process_item(self, item, spider):
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            return item
Once the spiders are running I can see that they are collecting data correctly; items are stored in files XXXX.jl and the spiders work correctly. However, the crawled contents are not reflected in the common file. The spiders seem to work well, but the pipeline is not doing its job and is not collecting data into the common file.
I also noticed that only one spider writes to the file at a time.
I don't see any good reason to do what you do :) You can change the json_filename setting by setting arguments on your scrapyd schedule.json request. Then you can make each spider generate slightly different files that you merge with post-processing or at query time. You can also write JSON files similar to what you have by just setting the FEED_URI value (example). If you write to a single file simultaneously from multiple processes (especially when you open it in 'wb' mode), you're asking for corrupt data.
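For example, something along these lines when scheduling each spider; this is a sketch using the requests library, scrapyd's schedule.json endpoint accepting setting parameters, and the spider and file names here are only illustrative:

    import requests

    for spider in ['spider1', 'spider2']:
        requests.post('http://localhost:6800/schedule.json', data={
            'project': 'myproject',
            'spider': spider,
            # give each job its own output file instead of a shared one
            'setting': 'FEED_URI=/tmp/%s_items.jl' % spider,
        })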
Edit:
After understanding a bit better what you need: in this case it's scrapyd starting multiple crawls running different spiders, where each one crawls a different website, and a consumer process monitoring a single file continuously.
There are several solutions, including:
- Named pipes: relatively easy to implement and ok for very small Items only (see here)
- RabbitMQ or some other queueing mechanism: great solution but might be a bit of an overkill
- A database, e.g. an SQLite-based solution: nice and simple but likely requires some coding (custom consumer)
- An inotifywait-based or other filesystem monitoring solution: nice and likely easy to implement
The last one seems like the most attractive option to me. When the scrapy crawl finishes (spider_closed signal), move, copy or create a soft link for the FEED_URI file to a directory that you monitor with a script like this. mv or ln is an atomic unix operation so you should be fine. Hack the script to append the new file to the tmp file that you feed once to your consumer program.
This way, you use the default feed exporters to write your files. The end solution is so simple that you don't need a pipeline; a simple Extension should fit the bill.
On an extensions.py in the same directory as settings.py:
    from scrapy import signals
    from scrapy.exceptions import NotConfigured

    class MoveFileOnCloseExtension(object):

        def __init__(self, feed_uri):
            self.feed_uri = feed_uri

        @classmethod
        def from_crawler(cls, crawler):
            # instantiate the extension object
            feed_uri = crawler.settings.get('FEED_URI')
            ext = cls(feed_uri)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            # return the extension object
            return ext

        def spider_closed(self, spider):
            # Move the file to the proper location, e.g.:
            # os.rename(self.feed_uri, ... destination path...)
            pass
On your settings.py:
    EXTENSIONS = {
        'myproject.extensions.MoveFileOnCloseExtension': 500,
    }