I have multiple Scrapy spiders that I need to run at the same time every 5 minutes. The issue is that they take anywhere from 30 seconds to 1 minute to start.
It seems that they each start their own Twisted reactor, and that is what takes so much time.
I've looked into different ways to run multiple spiders at the same time (see Running Multiple spiders in scrapy for 1 website in parallel?), but I need a separate log for each spider and a process per spider to integrate well with Airflow.
I've looked into scrapyd, but it doesn't seem to share a Twisted reactor across multiple spiders. Is that correct?
Are there other ways I could achieve my goals?
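For reference, sharing a single Twisted reactor is what CrawlerProcess does when several spiders run in one process. A minimal sketch, assuming hypothetical spiders SpiderOne and SpiderTwo in a project called myproject; note it gives per-spider log files only for messages emitted via self.logger, and it runs everything in one OS process, which may not satisfy the process-per-spider requirement:

    # A sketch only: one reactor, started once, shared by all crawls.
    import logging

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders import SpiderOne, SpiderTwo  # hypothetical spiders

    def attach_file_log(spider_cls):
        # Captures what the spider logs through self.logger; Scrapy's core
        # log lines still go to the shared root logger.
        handler = logging.FileHandler(f"{spider_cls.name}.log")
        logging.getLogger(spider_cls.name).addHandler(handler)

    process = CrawlerProcess(get_project_settings())
    for spider_cls in (SpiderOne, SpiderTwo):
        attach_file_log(spider_cls)
        process.crawl(spider_cls)
    process.start()  # blocks until all spiders finish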
I am fairly new to the world of distributed Scrapy crawls, but I found out about scrapy-redis and have been using it. I am using it on a Raspberry Pi to scrape a large number of URLs that I push to Redis. What I have been doing is creating multiple SSH sessions into the Pi, where I then run
scrapy crawl myspider
to have the spider "wait". I then open another SSH session and run redis-cli lpush with my links. The crawlers then run, although I'm not sure how concurrently they are actually running.
I'm hoping this is clear; if not, please let me know and I can clarify. I'm really just looking for a "next step" after implementing this barebones version of scrapy-redis.
Edit: I based my starting point on this answer: Extract text from 200k domains with scrapy. The answerer said he spun up 64 spiders using scrapy-redis.
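For context, the barebones setup described above looks roughly like the sketch below, using the RedisSpider base class documented by scrapy-redis; the spider name, redis key and settings values are placeholders:

    # spiders/myspider.py -- a minimal scrapy-redis spider sketch
    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = "myspider"
        redis_key = "myspider:start_urls"  # the Redis list you lpush URLs into

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

    # settings.py (per the scrapy-redis README):
    # SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # REDIS_URL = "redis://localhost:6379"

The spider then idles until something like redis-cli lpush myspider:start_urls "https://example.com" feeds it a URL.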
What is the point of creating multiple SSH sessions? Concurrency?
If that is the goal, I believe Scrapy itself can handle all of the URLs at once with whatever concurrency you configure, and it will give accurate feedback about how the crawl went.
In that case you will only need one Scrapy spider.
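To make the single-spider route concrete, these are the usual concurrency knobs; a minimal settings sketch with placeholder values to tune, not recommendations:

    # settings.py -- concurrency knobs for a single spider
    CONCURRENT_REQUESTS = 64             # total requests in flight
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap
    DOWNLOAD_DELAY = 0                   # raise this to be gentler on each site
    RETRY_TIMES = 2                      # failed/retried requests show up in the crawl stats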
On the other hand, if the idea is to utilise multiple instances anyway, I suggest you take a look at Frontera (https://github.com/scrapinghub/frontera).
I've seen people use Airflow to schedule hundreds of scraping jobs through Scrapyd daemons. However, one thing that's missing in Airflow is monitoring of long-running jobs like scraping: the number of pages and items scraped so far, the number of URLs that have failed so far or were retried without success.
What are my options for monitoring the current status of long-running jobs? Is there something already available, or do I need to resort to external solutions like Prometheus and Grafana and instrument the Scrapy spiders myself?
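For concreteness, "instrumenting the spiders myself" would look roughly like the sketch below: a Scrapy extension that periodically logs counters from the built-in stats collector. The interval, the extension path and the exact stat keys are assumptions to adjust; pushing the same numbers to Prometheus/Grafana would replace the log call.

    # extensions.py -- a minimal progress-reporting sketch, not a drop-in solution
    import logging

    from scrapy import signals
    from twisted.internet import task

    logger = logging.getLogger(__name__)

    class ProgressMonitor:
        def __init__(self, crawler, interval=60.0):
            self.crawler = crawler
            self.interval = interval
            self.loop = None

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls(crawler)
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            self.loop = task.LoopingCall(self.report, spider)
            self.loop.start(self.interval)

        def spider_closed(self, spider):
            if self.loop and self.loop.running:
                self.loop.stop()

        def report(self, spider):
            # Built-in counters maintained by Scrapy's stats collector.
            stats = self.crawler.stats.get_stats()
            logger.info(
                "%s progress: pages=%s items=%s retries_exhausted=%s",
                spider.name,
                stats.get("response_received_count", 0),
                stats.get("item_scraped_count", 0),
                stats.get("retry/max_reached", 0),
            )

    # settings.py (hypothetical path):
    # EXTENSIONS = {"myproject.extensions.ProgressMonitor": 500}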
We've had better luck keeping our Airflow jobs short and sweet.
With long-running tasks, you risk running into queue back-ups, and we've found the parallelism limits are not quite intuitive. Check out this post for a good breakdown.
In a case like yours, where there's a lot of work to orchestrate and potentially retry, we reach for Kafka. The Airflow DAGs pull messages off a Kafka topic and then report success/failure via a Kafka wrapper.
We end up with several overlapping Airflow tasks running in "micro-batches", each reading a tunable number of messages off Kafka, with the goal of keeping each Airflow task under a certain run time.
By keeping the work in each Airflow task small, we can easily scale the number of messages up or down to balance the overall task run time against the overall parallelism of the Airflow workers.
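A minimal sketch of that micro-batch pattern, assuming kafka-python, a hypothetical topic named scrape-jobs and a hypothetical process_scrape_job() helper; this is not the wrapper described above, just an illustration of "read a bounded batch, commit, tune the batch size against a time budget":

    # micro_batch.py -- what one Airflow task run might execute
    import time

    from kafka import KafkaConsumer

    BATCH_SIZE = 50       # tunable number of messages per task run
    TIME_BUDGET_S = 240   # target run time for each Airflow task

    def process_scrape_job(payload):
        # Hypothetical placeholder: kick off or check a scraping job.
        print("processing", payload)

    def run_micro_batch():
        consumer = KafkaConsumer(
            "scrape-jobs",                      # hypothetical topic
            bootstrap_servers="localhost:9092",
            group_id="airflow-scrapers",
            enable_auto_commit=False,           # only commit after the work is done
        )
        started = time.time()
        records = consumer.poll(timeout_ms=5000, max_records=BATCH_SIZE)
        for _, messages in records.items():
            for msg in messages:
                process_scrape_job(msg.value)
        consumer.commit()                       # report progress back to Kafka
        consumer.close()
        elapsed = time.time() - started
        if elapsed > TIME_BUDGET_S:
            print(f"batch took {elapsed:.0f}s; consider lowering BATCH_SIZE")

    if __name__ == "__main__":
        run_micro_batch()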
Seems like you could explore something similar here?
Hope that helps!
I am new to scraping and I am running different jobs on Scrapinghub, started via their API. The problem is that starting and initializing the spider takes too much time, around 30 seconds. When I run it locally, the spider finishes in up to 5 seconds, but on Scrapinghub it takes 2:30 minutes. I understand that closing a spider after all requests are finished takes a little more time, but that is not the problem. My problem is that between the moment I call the API to start the job (it appears in the running jobs instantly, but takes too long to make the first request) and the moment the first request is made, I have to wait far too long. Any idea how I can make it run as quickly as it does locally? Thanks!
I already tried setting AUTOTHROTTLE_ENABLED = False, as suggested in another question on Stack Overflow.
According to the Scrapy Cloud docs:
Scrapy Cloud jobs run in containers. These containers can be of different sizes defined by Scrapy Cloud units.
A Scrapy Cloud unit provides: 1 GB of RAM, 2.5 GB of disk space, 1x CPU and 1 concurrent crawl slot.
Resources available to the job are proportional to the number of units allocated.
It means that allocating more Scrapy Cloud units can solve your problem.
Is there any way to run scrapy as part of a bash script, and only run it for a certain amount of time?
Perhaps by simulating a Ctrl-C + Ctrl-C after X hours?
You can do this with the GNU timeout command.
For example, to stop the crawler after 1 hour:
timeout 3600 scrapy crawl spider_name
Scrapy provides the CLOSESPIDER_TIMEOUT option to stop crawling after a specified time period.
It is not a hard limit though: Scrapy will still process all requests it is already downloading, but it won't fetch new requests from the scheduler; in other words, CLOSESPIDER_TIMEOUT emulates a single Ctrl-C, not Ctrl-C + Ctrl-C, and tries to stop the spider gracefully. That is usually not a bad idea, because killing the spider may e.g. leave the exported data file broken.
How much extra time the spider stays alive depends on the website and on the retry & concurrency settings. The default DOWNLOAD_TIMEOUT is 180s; a request can be retried up to 2 times, meaning each request may take ~10 minutes to finish in the worst case (up to 3 attempts of 180s each). CONCURRENT_REQUESTS is 16 by default, so there can be up to 16 requests in the downloader at once, but whether they are all downloaded in parallel depends on what you're crawling. AutoThrottle or the CONCURRENT_REQUESTS_PER_DOMAIN option may limit the number of requests executed in parallel for a single domain.
So in the absolute worst case (sequential downloading, all requests unresponsive and retried 2 times) the spider may hang on for ~3 hours with default settings. In practice this time is usually much shorter, a few minutes. So you can set CLOSESPIDER_TIMEOUT to a value e.g. 20 minutes less than your X hours, and then use an additional supervisor (like the GNU timeout suggested by @lufte) to implement a hard timeout and kill the spider if its shutdown takes unreasonably long.
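Putting the two together, a minimal settings sketch, assuming a hard limit X of 24 hours; the 20-minute margin is just the example value from above:

    # settings.py -- soft limit inside Scrapy, hard limit enforced outside
    HARD_LIMIT_S = 24 * 3600                        # your X hours, enforced externally
    CLOSESPIDER_TIMEOUT = HARD_LIMIT_S - 20 * 60    # ask Scrapy to wind down ~20 min earlier

    # Then wrap the crawl with the hard limit, e.g.:
    #   timeout 86400 scrapy crawl spider_name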
I need to test whether my website can handle 100 or more concurrent users. Is it possible to do that with CasperJS? If yes, how would I do it?
No, not really.
CasperJS runs on PhantomJS (or SlimerJS). So you can only have one session per CasperJS script. You could use multiple casper instances if your site doesn't need sessions (i.e. doesn't have a login), but since PhantomJS is single threaded, that won't get you much concurrency.
You would need to start 100 CasperJS processes with the same script to simulate that many users, but then you run into hardware problems. Let's say one CasperJS process takes 50 MB of RAM: your machine would then need at least 5 GB of memory. On top of that, the context switching between so many processes means the load wouldn't be very concurrent in practice.
You would need a cluster of machines with at most 16 CasperJS processes each. You would then need to synchronize them all (i.e. with the webserver module).
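For a rough sense of what "starting N CasperJS processes" looks like, a minimal sketch using Python's subprocess module, assuming a casperjs binary on the PATH and a hypothetical test.js script; it only illustrates the mechanics and inherits all the memory and scheduling limits described above:

    # launch_casper.py -- spawn N independent CasperJS sessions in parallel
    import subprocess

    N_USERS = 16  # keep this at or below what one machine can realistically run

    procs = [
        subprocess.Popen(["casperjs", "test.js"])  # same script per simulated user
        for _ in range(N_USERS)
    ]
    for p in procs:
        p.wait()  # each process is one single-threaded browser session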