Is there any way to run scrapy as part of a bash script, and only run it for a certain amount of time?
Perhaps by simulating a Ctrl-C + Ctrl-C after X hours?
You can do this with the GNU timeout command.
For example, to stop the crawler after 1 hour:
timeout 3600 scrapy crawl spider_name
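By default, timeout sends SIGTERM. To mirror the Ctrl-C behaviour the question asks about, a sketch (assuming GNU coreutils timeout; spider_name is from the question) is to send SIGINT instead and add a hard-kill fallback:

```shell
# Send SIGINT (like a single Ctrl-C) after 1 hour so Scrapy can shut down
# gracefully; if the process is still alive 60 s later, SIGKILL it.
timeout --signal=INT --kill-after=60 1h scrapy crawl spider_name
```

Scrapy treats the first SIGINT like Ctrl-C and begins a graceful shutdown, so exported data files are more likely to be left in a usable state.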
Scrapy provides the CLOSESPIDER_TIMEOUT option to stop crawling after a specified time period.
It is not a hard limit though - Scrapy will still finish the requests it is already downloading, but it won't fetch new requests from the scheduler; in other words, CLOSESPIDER_TIMEOUT emulates a single Ctrl-C, not Ctrl-C + Ctrl-C, and tries to stop the spider gracefully. That is usually not a bad idea, because killing the spider may e.g. leave an exported data file broken.
How much extra time the spider stays alive depends on the website and on the retry & concurrency settings. The default DOWNLOAD_TIMEOUT is 180s, and a request can be retried up to 2 times, meaning each request may take ~10 min to finish in the worst case. CONCURRENT_REQUESTS is 16 by default, so there can be up to 16 requests in the downloader, but whether they are actually downloaded in parallel depends on what you're crawling. AutoThrottle or the CONCURRENT_REQUESTS_PER_DOMAIN option may limit the number of requests executed in parallel for a single domain.
So in the absolute worst case (sequential downloading, all requests unresponsive and retried 2 times) the spider may hang for ~3 hours with default settings. In practice this time is usually much shorter, a few minutes. So you can set CLOSESPIDER_TIMEOUT to a value e.g. 20 minutes less than your X hours, and then use an additional supervisor (like the GNU timeout suggested by #lufte) to implement a hard timeout and kill the spider if its shutdown takes too long.
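Putting the two together, a sketch for a 4-hour budget (my_spider and the exact numbers are placeholders; CLOSESPIDER_TIMEOUT is in seconds and can be passed on the command line with -s):

```shell
# Ask Scrapy to begin a graceful shutdown after 3h40m (13200 s), and let
# GNU timeout hard-kill the process if it is still running at the 4h mark.
timeout --kill-after=60 4h scrapy crawl my_spider -s CLOSESPIDER_TIMEOUT=13200
```

This gives the spider a 20-minute window to drain in-flight requests before the supervisor steps in.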
I have to run a load test for 32000 users with a duration of 15 minutes, and I ran it in command-line mode: threads 300, ramp-up 100, loop count 1. But after outputting some data it freezes, so I can't get the full HTML report. I can't even run it for 50 users. How can I get past this? Please let me know.
From 0 to 2147483647 threads, depending on various factors including but not limited to:
Hardware specifications of the machine where you run JMeter
Operating system limitations (if any) of the machine where you run JMeter
JMeter Configuration
The nature of your test (protocol(s) in use, the size of request/response, presence of pre/post processors, assertions and listeners)
Application response time
Phase of the moon
etc.
There is no answer like "on my MacBook I can have about 3000 threads", as it varies from test to test: for GET requests returning small amounts of data the number will be higher; for POST requests uploading huge files and receiving huge responses it will be lower.
The approach is the following:
Make sure to follow JMeter Best Practices
Set up monitoring of the machine where you run JMeter (CPU, RAM, Swap usage, etc.), if you don't have a better idea you can go for JMeter PerfMon Plugin
Start your test with 1 user and gradually increase the load while watching resource consumption
When the consumption of any monitored resource starts exceeding a reasonable threshold, e.g. 80% of the maximum available capacity, stop your test and note how many users were active at that point. That is how many users you can simulate from this particular machine for this particular test.
For another machine or another test - repeat from the beginning.
Most probably for 32000 users you will have to go for distributed testing
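A distributed run is driven from a controller in non-GUI mode; a sketch using JMeter's standard CLI flags (the plan filename and load-generator IPs below are hypothetical):

```shell
# Non-GUI distributed run: -R lists the remote JMeter server machines,
# -l writes raw results, -e -o generates the HTML dashboard at the end.
jmeter -n -t test_plan.jmx -R 10.0.0.11,10.0.0.12 -l results.jtl -e -o report/
```

Each machine in the -R list must be running jmeter-server and have the same plugins and test data available.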
If your test "hangs" even for a smaller number of users (300 can be simulated even with default JMeter settings, and maybe even in GUI mode):
take a look at jmeter.log file
take a thread dump and see what threads are doing
I am trying to perform a load test, and according to our stats (that I can't disclose) we expect peaks of 300 users per minute, uploading files of different sizes to our system.
Now, I created a jmeter test, which works fine, but what I don't know how to fine tune is - aim for certain throughput.
I created a test with 150 users and 100 loops, expecting it to simulate 150 users coming and going and to upload 15000 files in total, but that never happened, because at a certain point the tests started failing.
Looking at our New Relic monitoring, it seems that I somehow reached 1600 requests in a single minute. I am testing a microservice running 12 instances, so that might play a role in the higher number of requests, but even so I expected the tests to pass. My uploaded file was 600 kB. In the end, I had a 98% failure rate.
I reduced the file size to 13 kB; at that point, I got 17% failure.
So there's obviously something about the time needed to upload the bigger file, but I don't understand what causes 150 threads/users in X loops to become 1600 at the same time. I'd expect JMeter to never start a new loop with the same thread unless the original user is finished. That being said, I'd expect at most 150 users in a given minute.
Any clarification on how to get the exact number of users/threads running at the same time is well appreciated.
I tried to play with the KeepAlive checkbox, and I tried setting the request lifetime to 10 seconds (all the uploads get a response earlier than that) - but then JMeter finished the thread, and I had only 150 runs, no loops.
Thanks!
By default JMeter executes Samplers as fast as it can so there are 2 main factors which define the actual throughput (number of requests per unit of time):
JMeter configuration
Application under test response time
So if you're following JMeter Best Practices and JMeter has enough headroom to operate in terms of CPU, RAM, etc., you are only limited by your application's response time, as JMeter waits for the previous request to finish before starting a new one.
If you need to "slow down" your test execution, consider adding e.g. a Constant Throughput Timer to your Test Plan, where you can define the desired number of requests per minute.
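For the 300-uploads-per-minute target above, one sketch is to drive the timer from a JMeter property so it can be changed per run (the plan filename is hypothetical; -J and the __P function are standard JMeter):

```shell
# In the Constant Throughput Timer, set "Target throughput" to
# ${__P(throughput,300)}; then the value can be overridden per run:
jmeter -n -t upload_test.jmx -Jthroughput=300 -l results.jtl
```

Note the timer can only delay threads to slow them down; it cannot make the test run faster than the application's response time allows.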
I am new to scraping and I am running different jobs on Scrapinghub. I run them via their API. The problem is that starting and initializing the spider takes too much time, like 30 seconds. When I run it locally, it takes up to 5 seconds for the spider to finish, but on Scrapinghub it takes 2:30 minutes. I understand that closing a spider after all requests are finished takes a little more time, but that is not the problem. My problem is that between the moment I call the API to start the job (I see it appear in the running jobs instantly) and the moment the first request is made, I have to wait too long. Any idea how I can make it start as quickly as it does locally? Thanks!
I already tried setting AUTOTHROTTLE_ENABLED = False, as I saw in another question on Stack Overflow.
According to the Scrapy Cloud docs:
Scrapy Cloud jobs run in containers. These containers can be of different sizes defined by Scrapy Cloud units.
A Scrapy Cloud unit provides: 1 GB of RAM, 2.5 GB of disk space, 1x CPU and 1 concurrent crawl slot.
Resources available to the job are proportional to the number of units allocated.
It means that allocating more Scrapy Cloud units can solve your problem.
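Since the question starts jobs via the HTTP API, a sketch of scheduling with more units (the project ID and spider name are placeholders, and the units parameter is an assumption based on the run.json scheduling endpoint):

```shell
# Hypothetical project ID and spider name; 'units' here is an assumed
# parameter of Scrapy Cloud's run.json job-scheduling endpoint.
curl -u "$SHUB_APIKEY:" https://app.scrapinghub.com/api/run.json \
     -d project=12345 -d spider=my_spider -d units=2
```

Check your plan's unit allowance first: you cannot request more units than your organization has available.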
I have multiple Scrapy spiders that I need to run at the same time every 5 minutes. The issue is that they take almost 30 sec to 1 minute to start.
They all seem to start their own Twisted reactor, and so it takes a lot of time.
I've looked into different ways to run multiple spiders at the same time (see Running Multiple spiders in scrapy for 1 website in parallel?), but I need to have a log for each spider and a process per spider to integrate well with Airflow.
I've looked into scrapyd, but it doesn't seem to share a Twisted reactor between multiple spiders - is that correct?
Are there other ways I could achieve my goal?
In JMeter
I have a test plan with 100 virtual users. If I set the ramp-up time to 100, the whole test takes 100 seconds to complete, which means one thread starts per second, one virtual user at a time; in other words, each thread is started after the previous one.
Problem: I need 100 users accessing the website at the same time, concurrently and simultaneously. I read about CSV, but that still acts stepwise, doesn't it? Or maybe I am not being clear about it - please enlighten me.
You're running into the "classic" situation described in the Max Users is Lower than Expected article.
JMeter acts as follows:
Threads are started according to the ramp-up time. If you put 1 there, all threads will start immediately. If you put 100 threads and a 100-second ramp-up time, 1 thread will start initially and the next thread will be kicked off each second.
Threads execute samplers from top to bottom (or according to the Logic Controllers)
When a thread has no more samplers to execute and no more loops to iterate, it is shut down.
So I would suggest adding more loops at the Thread Group level, so that threads kicked off earlier keep looping while the others are starting, and you finally have 100 threads working at the same time. You can configure the test execution time either in the Thread Group "Scheduler" section or via a Runtime Controller.
Another good option is the Ultimate Thread Group, available via JMeter Plugins, which provides an easy way of configuring your load scenario.
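For example, with the Thread Group loop count set to "forever" and the Scheduler enabled, a sketch of controlling the run length from the command line (the plan filename is hypothetical; -J and ${__P(duration,600)} in the Scheduler's Duration field are standard JMeter):

```shell
# Keep the 100 looping threads running for 10 minutes (600 s), then stop.
jmeter -n -t load_test.jmx -Jduration=600 -l results.jtl
```

After the ramp-up finishes, all 100 threads then stay active and loop concurrently until the duration elapses.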