scrapinghub starting job too slow - scrapy

I am new to scraping and I am running different jobs on Scrapinghub, starting them via their API. The problem is that starting and initializing the spider takes too much time, around 30 seconds. When I run it locally, the spider takes up to 5 seconds to finish; on Scrapinghub it takes 2:30 minutes. I understand that closing a spider after all requests are finished takes a bit more time, but that is not the problem. My problem is the delay between the moment I call the API to start the job (I see that it appears in the running jobs list instantly, but it takes too long to make the first request) and the moment the first request is made. Any idea how I can make it start as quickly as it does locally? Thanks!
I already tried setting AUTOTHROTTLE_ENABLED = False, as suggested in another Stack Overflow question.

According to the Scrapy Cloud docs:
Scrapy Cloud jobs run in containers. These containers can be of different sizes defined by Scrapy Cloud units.
A Scrapy Cloud unit provides: 1 GB of RAM, 2.5 GB of disk space, 1x CPU and 1 concurrent crawl slot.
Resources available to the job are proportional to the number of units allocated.
This means that allocating more Scrapy Cloud units may solve your problem.
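If your plan has spare units, you can request them per job. Here is a minimal sketch using the python-scrapinghub client; the API key, project id and spider name are placeholders, and I'm assuming the units argument is available on your subscription:

# Minimal sketch with the python-scrapinghub client; the API key,
# project id and spider name below are placeholders.
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient("YOUR_API_KEY")
project = client.get_project(12345)  # your numeric project id
# Request 2 Scrapy Cloud units for this job instead of the default 1,
# i.e. double the RAM/CPU/disk quoted from the docs above.
job = project.jobs.run("my_spider", units=2)
print(job.key)  # e.g. "12345/1/1"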

Related

How to setup Jmeter test to have a certain throughput?

I am trying to perform a load test, and according to our stats (that I can't disclose) we expect peaks of 300 users per minute, uploading files of different sizes to our system.
Now, I created a JMeter test, which works fine, but what I don't know is how to fine-tune it to aim for a certain throughput.
I created a test with 150 users and 100 loops, expecting it to simulate 150 users coming and going and to upload 15,000 files in total, but that never happened, because at a certain point the tests started failing.
Looking at our New Relic monitoring, it seems that somehow I reached 1600 requests in a single minute. I am testing a microservice running 12 instances, so that might play a role in the higher number of requests, but even so I expected the tests to pass. My uploaded file was 600 kB. In the end, I had a 98% failure rate.
I reduced the file size to 13 kB; at that point, I got a 17% failure rate.
So there's obviously something about the time needed to upload the bigger file, but I don't understand what causes 150 threads/users in X loops to become 1600 at the same time. I'd expect JMeter to never start a new loop with the same thread unless the original user is finished. That being said, I'd expect at most 150 users in a given minute.
Any clarification on how to get an exact number of users/threads running at the same time is much appreciated.
I tried to play with the KeepAlive checkbox, and I tried setting the request lifetime to 10 seconds (all the uploads get a response earlier than that) - but then JMeter finished the threads, and I had only 150 runs, no loops.
Thanks!
By default JMeter executes Samplers as fast as it can so there are 2 main factors which define the actual throughput (number of requests per unit of time):
JMeter configuration
Application under test response time
So if you're following JMeter Best Practices and JMeter has enough headroom to operate in terms of CPU, RAM, etc., you are limited only by your application's response time, as JMeter waits for the previous request to finish before starting a new one.
If you need to "slow down" your test execution, consider adding e.g. a Constant Throughput Timer to your Test Plan, where you will be able to define the desired number of requests per minute.
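For intuition, here is roughly what a Constant Throughput Timer does, sketched in Python rather than JMeter. TARGET_RPM, WORKERS and send_request are illustrative placeholders, not JMeter internals:

# Not JMeter itself -- a minimal Python sketch of constant-throughput
# pacing: each worker sleeps just long enough that the whole pool
# converges on a target number of requests per minute.
import time

TARGET_RPM = 300                      # desired requests per minute
WORKERS = 150                         # concurrent users in the pool
DELAY = 60.0 * WORKERS / TARGET_RPM   # pacing window per worker, in seconds

def paced_loop(send_request, loops):
    for _ in range(loops):
        started = time.monotonic()
        send_request()                          # the actual upload/sampler
        elapsed = time.monotonic() - started
        # Sleep off the remainder of the pacing window, if any.
        time.sleep(max(0.0, DELAY - elapsed))

With 150 workers and a target of 300 requests per minute, each worker paces itself to one request every 30 seconds, so the pool as a whole stays at the target rate regardless of how fast individual uploads complete.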

AWS ECS - Fargate task submission pending time

We have an ECS Fargate cluster that we have just created, and when testing we noticed that the submission of a new task takes about 2-3 minutes (PENDING to RUNNING).
Since we run a new task there every minute, that's not good enough for us.
Is there any way to optimize the PENDING to RUNNING time?
This is largely dependent on the size of your container. For example, I use Go from-scratch containers heavily, so they are only about 15 MB, and I get launch times from nothing to running in roughly 15-20 seconds.
The biggest thing you can do right now to reduce launch times is to reduce the size of your container.

Speed up scrapy spiders initialisation time

I have multiple Scrapy spiders that I need to run at the same time every 5 minutes. The issue is that they take almost 30 seconds to 1 minute to start.
It seems that they all start their own Twisted reactor, which takes a lot of time.
I've looked into different ways to run multiple spiders at the same time (see Running Multiple spiders in scrapy for 1 website in parallel?, sketched below), but I need a log for each spider and a process per spider to integrate well with Airflow.
I've looked into scrapyd, but it doesn't seem to share a Twisted reactor across multiple spiders; is that correct?
Are there other ways I could achieve my goals?
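For reference, the shared-reactor approach from the linked question looks roughly like this: one process, one Twisted reactor, several spiders. Note that everything runs (and logs) in a single process, which is exactly what clashes with the one-log-and-one-process-per-spider requirement for Airflow. SpiderOne, SpiderTwo and the myproject module are placeholders:

# Sketch of the shared-reactor approach from the linked question.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import SpiderOne, SpiderTwo  # hypothetical spiders

process = CrawlerProcess(get_project_settings())  # one reactor for all crawls
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()  # blocks until every scheduled crawl has finished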

running scrapy for X hours in script?

Is there any way to run scrapy as part of a bash script, and only run it for a certain amount of time?
Perhaps by simulating a Ctrl-C + Ctrl-C after X hours?
You can do this with the GNU timeout command.
For example, to stop the crawler after 1 hour:
timeout 3600 scrapy crawl spider_name
Scrapy provides the CLOSESPIDER_TIMEOUT option to stop crawling after a specified time period.
It is not a hard limit though - Scrapy will still process all requests it is already downloading, but it won't fetch new requests from the scheduler; in other words, CLOSESPIDER_TIMEOUT emulates Ctrl-C, not Ctrl-C + Ctrl-C, and tries to stop the spider gracefully. That is usually not a bad idea, because killing the spider may e.g. leave an exported data file broken.
How much extra time the spider stays alive depends on the website and on the retry & concurrency settings. The default DOWNLOAD_TIMEOUT is 180s, and a request can be retried up to 2 times, meaning each request may take ~10 minutes to finish in the worst case (3 attempts × 180s ≈ 9 minutes). CONCURRENT_REQUESTS is 16 by default, so there can be up to 16 requests in the downloader, and they may be downloaded in parallel depending on what you're crawling. Autothrottle or the CONCURRENT_REQUESTS_PER_DOMAIN option may limit the number of requests executed in parallel for a single domain.
So in the absolute worst case (sequential downloading, all requests unresponsive and retried 2 times) the spider may hang for ~3 hours with default settings. In practice this time is usually much shorter, a few minutes. So you can set CLOSESPIDER_TIMEOUT to a value e.g. 20 minutes less than your X hours, and then use an additional supervisor (like the GNU timeout suggested by #lufte) to implement a hard timeout and kill the spider if its shutdown takes too long; see the sketch below.
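Putting the two answers together, here is a sketch of the graceful-then-hard shutdown scheme in your project's settings.py. X_HOURS and the 20-minute slack are the illustrative values from above; keep the GNU timeout wrapper from the previous answer, set to the full X hours, as the hard limit:

# settings.py -- graceful shutdown 20 minutes before the hard limit.
X_HOURS = 4  # illustrative total time budget
CLOSESPIDER_TIMEOUT = X_HOURS * 3600 - 20 * 60  # stop fetching new requests
DOWNLOAD_TIMEOUT = 180  # default value, shown for reference
RETRY_TIMES = 2         # default value, shown for reference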

How to load test with simultaneous users?

In jMeter
I have a test plan with 100 virtual users. If I set the ramp-up time to 100, then the whole test takes 100 seconds to complete the whole set. That means each thread takes 1 second per virtual user, i.e. the threads are carried out step by step, each one after the completion of the previous one.
Problem: I need 100 users accessing the website at the same time, concurrently and simultaneously. I read about CSV, but it still acts step-wise, doesn't it? Or maybe I am not clear about it. Please enlighten me.
You're running into the "classic" situation described in the Max Users is Lower than Expected article.
JMeter acts as follows:
Threads are started according to the ramp-up time. If you set it to 1, all threads will start immediately. If you set 100 threads and a 100-second ramp-up, 1 thread will start initially and the next thread will be kicked off each second.
Threads execute samplers top to bottom (or according to the Logic Controllers).
When a thread has no more samplers to execute and no more loops to iterate, it is shut down.
So I would suggest adding more loops at the Thread Group level, so that threads kicked off earlier keep looping while the others are starting; that way you can end up with 100 threads working at the same time. You can configure the test execution time either in the Thread Group's "Scheduler" section or via the Runtime Controller.
Another good option is the Ultimate Thread Group, available via JMeter Plugins, which provides an easy way of configuring your load scenario.
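To see the difference between a ramped start and truly simultaneous users, here is a small Python illustration (not JMeter): a barrier releases all 100 "users" at the same instant, the way a fully ramped, looping Thread Group behaves. hit_website is a placeholder for your sampler:

# Python sketch of 100 truly simultaneous, looping users.
import threading

USERS = 100
LOOPS = 10
barrier = threading.Barrier(USERS)  # all users rendezvous here

def hit_website():
    pass  # placeholder for the actual HTTP request/sampler

def user():
    barrier.wait()          # block until all 100 users are ready
    for _ in range(LOOPS):  # looping keeps concurrency at 100
        hit_website()

threads = [threading.Thread(target=user) for _ in range(USERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()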