Is it possible to set dynamic download delay in scrapy? - scrapy

I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
however, if I set the delay to 2s it is not efficient enough. If I set the DOWNLOAD_DELAY = 0.
The crawler is able to crawl about 10 pages. after that, the target page will return something like " you are requesting too frequently ".
What I want to do is the keep the download_delay to 0. once the "requesting too frequently" msg is found in the html. it change the delay to 2s. After a while it switch back to zero.
is there any module can do this? or any other better idea to handle such case?
Update:
I found that is a extension call AutoThrottle
but is it able to customize some logic like this??
if (requesting too frequently) is found
increase the DOWNLOAD_DELAY

If right after you get anti-spider page, then in 2 seconds you can get data page, then what you are asking probably requires writing a downloader middleware
that checks for anti-spider page, reset all scheduled requests to a renew-queue, start a looping call when spider is idle to get request from the renew-queue, (the looping interval is your hack for a new download delay), and try to decide when the download delay is not necessary again (requires some tests), then stop the looping and reschedule all the requests in renew-queue to scrapy scheduler. You will need to use redis queue in case of distributed crawl.
With download delay set to 0, in my experience throughput can go easily above 1000 items/min. If anti-spider page pops up after 10 responses, then it is not worth the effort.
Instead maybe you can try to find out how fast does your target server allow, may be 1.5s, 1s, 0.7s, 0.5s etc. Then maybe redesign your product takes into consideration the throughput your crawler can achieve.

You can use Auto Throttle extension now. It is turned off by default. You can add these parameters in your project's settings.py file to enable it.
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True

Yes, You can use the time module to set the dynamic delay.
import time
for i in range(10):
*** Operations 1****
time.sleep( i )
*** Operations 2****
Now you can see the delay between Operations 1 and Operations 2.
Note:
the variable 'i' is in the form of seconds.

Related

Catchpoint pause vs. waitForNoRequest - What's the difference?

I have a test that was alerting because it was taking extra time for an asset to load. We changed from waitForNoRequest to a pause (at Catchpoint's suggestion). That did not seem to have the expected effect of waiting for things to load. We increased the pause from 3000 to 12000 and that helped to allow the page to load and stop the alert. We noticed some more alerts, so I tried to increase the pause to something like 45000 and it would not allow me to pause for that long.
So the main question here is - what functionality does both of these different features provide? What do I gain by pausing instead of waiting, if anything?
Here's the test, data changed to protect company specific info. Step 3 is where we had some failures and we switched between pause and wait.
// Step - 1
open("https://website.com/")
waitForNoRequest("2000")
click("//*[#id=\"userid\"]")
type("//*[#id=\"userid\"]", "${username}")
setStepName("Step1-Login-")
// Step - 2
clickMouseAndWait("//*[#id=\"continue\"]")
waitForVisible("//*[#id=\"challenge-password\"]")
click("//*[#id=\"challenge-password\"]")
type("//*[#id=\"challenge-password\"]", "${password}")
setStepName("Step2-Login-creds")
// Step - 3
clickMouseAndWait("//*[#id=\"signIn\"]")
setStepName("Step3-dashboard")
waitForTitle("Dashboard")
waitForNoRequest("3000")
click("//*[#id=\"account-header-wrapper\"]")
waitForVisible("//*[#id=\"logout-link\"]")
click("//*[#id=\"logout-link\"]")
// Step - 4
clickAndWait("//*[text()=\"Sign Out\"]")
waitForTitle("Login - ")
verifyTextPresent("You have been logged out.")
setStepName("Step5-Logout")
Rachana here, I’m a member of the Technical Service Team here at Catchpoint, I’ll be happy to answer your questions.
Please find the differences below between waitForNoRequest and Pause commands:
Pause
Purpose: This command pauses the script execution for a specified amount of time, whether there are HTTP/s requests downloading or not. Time value is provided in milliseconds, it can range between 100 to 30,000 ms.
Explanation: This command is used when the agent needs to wait for a set amount of time and this is not impacted by the way the requests are loaded before proceeding to the next step or command. Only a parameter is required for this action.
WaitForNoRequest
Purpose: This commands waits for a specified amount of time, when there was no HTTP/s requests downloading. The wait time parameter can range between 1,000 to 5,000 ms.
Explanation: The only parameter for this action is a wait time. The agent will wait for that specified amount of time before moving onto the next step/command. Which will, in return, allow necessary requests more time to load after document complete.
For instance when you add waitforNoRequest(5000), initially agent waits 5000 ms after doc complete for any network activity. During that period if there is any network activity, then the agent waits another 5000 ms for the next network activity to end and the process goes on until no other request loads within the specified timeframe(5000 ms).
A pause command with 12000 ms, gives exactly 12 seconds to load the page. After 12 seconds the script execution will continue to next command no matter the page is loaded or not.
Since waitForNoRequest has a max time value of 5000 ms, you can tell the agent to wait for a gap of 5 seconds when there is no network activity. In this case, the page did not have any network activity for 3 seconds and hence proceeded to the next action. The page was not loaded completely and the script failed.
I tried to increase the pause to something like 45000 and it would not allow me to pause for that long.
We allow a maximum of 30 seconds pause time hence 45 seconds will not work.
Please reach out to our support team and we’ll be glad to connect you with our scripting SMEs and help you with any scripting needs you might have.

scrapy timeout not controlling twisted timeout

I keep getting this when I run my scrapy spider raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.exampletest.com/test took longer than 190 seconds..
I have set the following settings but didn't help
'AUTOTHROTTLE_ENABLED':False,
'DOWNLOAD_TIMEOUT':20,
'RETRY_ENABLED': False,
How can I control if the website doesn't respond in under 30 sec to just pass or ignore it.
190 is a weird default, so I’ll go ahead and assume that you are using scrapy-crawlera.
If that is the case, know that scrapy-crawlera ignores DOWNLOAD_DELAY because Crawlera requires higher timeout values, as requests through Crawlera can take much longer.
If you want to decrease the timeout value nonetheless, change CRAWLERA_DOWNLOAD_TIMEOUT instead.

Podio Create Item rate limit after 25 calls

I have to create items in podio using the api. When i let my program go full speed i noticed that after 5 - 6 items I get an error response from podio saying:
{
"error_propagate":false,
"error":"rate_limit",
"error_description":"You have hit the rate limit. Please wait 300 seconds before trying again",
"request":{
"url":"http://api.podio.com/oauth/token",
"query_string":"",
"method":"POST"
}
}
I tought the rate limit was 5000 calls/H and I get this error after 25 calls...
I added a thread.sleep in my code, and now it seems to be better, but even when I let the thread sleep for 10s I still get this error, I have now set the thread.sleep to 20 sec and it seems to work.
Is there a hidden rate limit to the number off calls per second ?
I think you are using Username password authentication here. The token request endpoint have lower limit from my experience. So the best way to solve this is to store and reuse the access tokens, instead of re-authenticating every time your program runs.
Podio API client libraries provide convenience methods to do this. See this links:
http://podio.github.io/podio-dotnet/sessions/
http://podio.github.io/podio-php/sessions
The rate limit is 1000 calls/H. so you can put sleep accordingly.

JMeter HTTP Request Post Body from File

I am trying to send an HTTP request via JMeter. I have created a thread group with a loop count of 25. I have a ramp up period of 120 and number of threads set to 30. Within the thread group, I have 20 HTTP Requests. I am a little confused as to how JMeter runs these requests. Do each of the 20 requests within a thread group run in a single thread, and each loop over a thread group runs concurrently on a different thread? Or do each of the 20 requests run in different threads as and when they are available.
My other question is, Over each loop, I want to vary the body of the post data that is being sent via the HTTP request. Is it possible to pass the post data body via a file instead of inserting the data into the JMeter Body Data Tab as show below:
However, instead of doing that, I want to define some kind of variable that picks a file based on iteration of the threadgroup that is running, for example, if it is looping over the thread group the second time, i want to call test2.txt, if the third time test3.txt etc and these text files will contain different post data. Could anyone tell me if this is possible with JMeter please and if so, how would I go about doing this.
Point 1 - JMeter concurrency
JMeter starts with 1 thread and spawns more threads as per ramp-up set. In your case (30 threads and 120 seconds ramp-up) another thread is being added each 4 seconds. Each thread executes 20 requests and if there is another loop - starts over, if there is no loop - the threads shuts down. To control load and concurrency JMeter provides 2 options:
Synchronizing Timer - pause all threads till specified threshold is reached and then release all of them at the same time
Constant Throughput Timer - to specify the load in requests per minute.
Point 2 - Send file instead of text
You can replace your request body with __fileToString function. If you want to parametrize it you can use nested function to provide current iteration - see below.
Point 3 - adding iteration as a parameter
JMeter provides 2 options on how you can increment a counter each loop
Counter config element - starts from specified value and gets incremented by specified value each time it's called.
__counter function - start from 1 and gets incremented by 1 each time it's being called. Can be "per-user" or "global"
See How to Use JMeter Functions post series for comprehensive information on above and more JMeter functions.

Force a thread to use same input line when using CSV Data Set Config

I am trying to build a Jmeter test plan that can make http calls to a server. Each thread in the thread group will read 2 parameters from a CSV file and make the http call with the params, and continue to make the same call with same parameters for lets say 1000 times with a delay of 10s between each thread execution.
The http call looks like
/service/method?param1=${param1}&param2=${param2}
The CSV is like this:
1,2
3,4
5,6
7,8
I have the test plan set up that works for the most part except the single issue. I want each thread to use the same parameters (same line of input) whenever the thread executes. Currently the only way to do it is to set Recycle on EOF = true, but the threads randomly pick the values. Param1 and Param2 can be randomly generated values as long as they stick with the same thread throughout the execution.
Is there anyway I can achieve this?
Thanks!
I'm not really sure I understand your issue right (you can possibly describe it more explicitly or using an example) but the schema below should implement your test-plan description:
Test Plan
Thread Group
Number of Threads: N
. . .
While Controller
Condition: ${__javaScript("${param2"!="<EOF>",)} - read csv-file until the EOF
CSV Data Set Config
Filename: [path to your file with test-data]
Variable Names: param1,param2
Recycle on EOF? False
Stop thread on EOF? True
Sharing mode: Current thread group
Loop Controller
Loop Count = 1000 - number of loops for each thread, with the same params
HTTP Request - your http call
Test Action
Target = Current Thread
Action = Pause
Duration (ms) = 10000 - pause between calls
. . .
In case if you need that each of N threads reads and uses single and unique line from csv-file you have to set Sharing mode: Current thread group for CSV Data Set Config (number of csv-entries should be in this case the sane as threads number, or Recycle on EOF? False should be set otherwise).
In case if you need that each of N threads reads and uses all lines from csv-file you have to set Sharing mode: Current thread for CSV Data Set Config.
If that's not what you want please describe your issue a bit more clear.
I was able to find sort of a hack. Basically I just put a constant timer for each thread and used the thread number ${__threadNum} as the parameter to fit my constraint of having the same parameter to be used by the same thread.
I would still prefer a way to read the params from a csv file.