Scrapy timeout not controlling Twisted timeout

I keep getting this error when I run my Scrapy spider:
raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://www.exampletest.com/test took longer than 190 seconds..
I have set the following settings, but they didn't help:
'AUTOTHROTTLE_ENABLED':False,
'DOWNLOAD_TIMEOUT':20,
'RETRY_ENABLED': False,
How can I make the spider simply pass over or ignore a site if it doesn't respond within 30 seconds?

190 is a weird default, so I'll go ahead and assume that you are using scrapy-crawlera.
If that is the case, know that scrapy-crawlera ignores DOWNLOAD_TIMEOUT because Crawlera requires higher timeout values, as requests through Crawlera can take much longer.
If you want to decrease the timeout value nonetheless, change CRAWLERA_DOWNLOAD_TIMEOUT instead.
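For example, in settings.py (an untested sketch; this assumes scrapy-crawlera is the middleware enforcing the 190-second timeout, and 30 is just the cutoff you asked for):
# settings.py
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your key>'
# scrapy-crawlera uses this instead of DOWNLOAD_TIMEOUT; its default is 190 seconds
CRAWLERA_DOWNLOAD_TIMEOUT = 30
RETRY_ENABLED = False  # don't re-queue the requests that time out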

Related

Catchpoint pause vs. waitForNoRequest - What's the difference?

I have a test that was alerting because it was taking extra time for an asset to load. We changed from waitForNoRequest to a pause (at Catchpoint's suggestion). That did not seem to have the expected effect of waiting for things to load. We increased the pause from 3000 to 12000 and that helped to allow the page to load and stop the alert. We noticed some more alerts, so I tried to increase the pause to something like 45000 and it would not allow me to pause for that long.
So the main question here is: what functionality do these two commands provide? What do I gain by pausing instead of waiting, if anything?
Here's the test, data changed to protect company specific info. Step 3 is where we had some failures and we switched between pause and wait.
// Step - 1
open("https://website.com/")
waitForNoRequest("2000")
click("//*[#id=\"userid\"]")
type("//*[#id=\"userid\"]", "${username}")
setStepName("Step1-Login-")
// Step - 2
clickMouseAndWait("//*[#id=\"continue\"]")
waitForVisible("//*[#id=\"challenge-password\"]")
click("//*[#id=\"challenge-password\"]")
type("//*[#id=\"challenge-password\"]", "${password}")
setStepName("Step2-Login-creds")
// Step - 3
clickMouseAndWait("//*[#id=\"signIn\"]")
setStepName("Step3-dashboard")
waitForTitle("Dashboard")
waitForNoRequest("3000")
click("//*[#id=\"account-header-wrapper\"]")
waitForVisible("//*[#id=\"logout-link\"]")
click("//*[#id=\"logout-link\"]")
// Step - 4
clickAndWait("//*[text()=\"Sign Out\"]")
waitForTitle("Login - ")
verifyTextPresent("You have been logged out.")
setStepName("Step5-Logout")
Rachana here; I'm a member of the Technical Service Team at Catchpoint, and I'll be happy to answer your questions.
Please find the differences below between waitForNoRequest and Pause commands:
Pause
Purpose: This command pauses script execution for a specified amount of time, whether or not HTTP/s requests are still downloading. The time value is provided in milliseconds and can range from 100 to 30,000 ms.
Explanation: Use this command when the agent needs to wait for a set amount of time before proceeding to the next step or command, regardless of how requests are loading. The wait time is the only parameter for this action.
WaitForNoRequest
Purpose: This command waits until there have been no HTTP/s requests downloading for a specified amount of time. The wait time parameter can range from 1,000 to 5,000 ms.
Explanation: The only parameter for this action is the wait time. The agent waits for that quiet period before moving on to the next step/command, which in turn allows necessary requests more time to load after document complete.
For instance, when you add waitForNoRequest(5000), the agent initially waits 5000 ms after document complete for any network activity. If there is network activity during that period, the agent waits another 5000 ms for the next burst of activity to end, and the process repeats until no other request loads within the specified timeframe (5000 ms).
A pause command with 12000 ms gives the page exactly 12 seconds to load. After 12 seconds, script execution continues to the next command whether or not the page has finished loading.
Since waitForNoRequest has a maximum value of 5000 ms, the longest quiet gap you can ask the agent to wait for is 5 seconds. In this case, the page had no network activity for 3 seconds, so the script proceeded to the next action; the page was not completely loaded and the script failed.
I tried to increase the pause to something like 45000 and it would not allow me to pause for that long.
We allow a maximum pause time of 30 seconds, hence 45 seconds will not work.
Please reach out to our support team and we’ll be glad to connect you with our scripting SMEs and help you with any scripting needs you might have.

Podio Create Item rate limit after 25 calls

I have to create items in Podio using the API. When I let my program go full speed, I noticed that after 5 - 6 items I get an error response from Podio saying:
{
"error_propagate":false,
"error":"rate_limit",
"error_description":"You have hit the rate limit. Please wait 300 seconds before trying again",
"request":{
"url":"http://api.podio.com/oauth/token",
"query_string":"",
"method":"POST"
}
}
I thought the rate limit was 5000 calls/hour, yet I get this error after about 25 calls...
I added a Thread.Sleep to my code and it now seems better, but even when I let the thread sleep for 10 s I still got this error; I have now set the sleep to 20 s and it seems to work.
Is there a hidden rate limit on the number of calls per second?
I think you are using username/password authentication here. In my experience the token request endpoint has a lower limit. So the best way to solve this is to store and reuse the access tokens, instead of re-authenticating every time your program runs.
The Podio API client libraries provide convenience methods to do this. See these links:
http://podio.github.io/podio-dotnet/sessions/
http://podio.github.io/podio-php/sessions
The rate limit is 1000 calls/hour, so you can set your sleeps accordingly.
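If you are not using one of the client libraries, here is a rough Python sketch of the idea of caching the token between runs (the file name, helper name and safety margin are made up; the api.podio.com/oauth/token endpoint comes from your error message and the fields are the standard OAuth password-grant ones):
import json
import os
import time
import requests

TOKEN_FILE = "podio_token.json"  # hypothetical on-disk cache

def get_access_token(client_id, client_secret, username, password):
    # Reuse the cached token if it has not expired yet,
    # so /oauth/token is only hit when actually needed.
    if os.path.exists(TOKEN_FILE):
        with open(TOKEN_FILE) as f:
            cached = json.load(f)
        if cached.get("expires_at", 0) > time.time() + 60:
            return cached["access_token"]

    # Authenticate once and cache the result for subsequent runs.
    resp = requests.post("https://api.podio.com/oauth/token", data={
        "grant_type": "password",
        "client_id": client_id,
        "client_secret": client_secret,
        "username": username,
        "password": password,
    })
    resp.raise_for_status()
    token = resp.json()
    token["expires_at"] = time.time() + token["expires_in"]
    with open(TOKEN_FILE, "w") as f:
        json.dump(token, f)
    return token["access_token"]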

Setting a timeout on webservice consumer built with org.apache.axis.client.Call and running on Domino

I'm maintaining an antediluvian Notes application which connects to an SAP back-end via a hand-rolled 'web service'.
The server is running Domino Release 7.0.4FP2 HF97.
The web service is not the more recent Web Service Consumer design element, but a large Java agent that uses the Apache soap.jar (org.apache.soap). Below is an example of the calling code.
private Call setupSOAPCall() {
    Call call = new Call();
    SOAPHTTPConnection conn = new SOAPHTTPConnection();
    call.setSOAPTransport(conn);
    call.setEncodingStyleURI(Constants.NS_URI_SOAP_ENC);
    // ... target object URI, method name and parameters are set elsewhere in the agent ...
    return call;
}
There has been a change in the SAP system which is now taking 8 minutes to complete (verified by SAP Team).
I'm getting an error message as follows:
[SOAPException: faultCode=SOAP-ENV:Client; msg=For input string: "906 "; targetException=java.lang.NumberFormatException: For input string: "906 "]
I found a blog article describing the error message quite closely:
https://thejavablog.wordpress.com/category/jmeter/
and I've come to the hypothesis that a timeout message is being returned to my Call object and that this timeout message is being incorrectly parsed, hence the NumberFormatException.
Looking at my logs I can see that there is a time difference of 62 seconds between my call and the response.
I recommended that the server setting in the server document, tab Internet Protocols/HTTP/Timeouts/Request timeouts be changed from 60 seconds to 600 seconds, and the http task restarted with
tell http restart
I've re-run the tests and I am getting the same error, and the time difference is still slightly more than 60 seconds, which is not what I was expecting.
I read Michael Ruhnau's blog entry
http://www.mruhnau.net/2014/06/how-to-overcome-domino-webservice.html
which points to this APR
http://www-01.ibm.com/support/docview.wss?uid=swg1LO48272
but I'm not convinced that this would apply in this case, since there is no way that IBM would know that my Java agent is in fact making a Soap call.
My current hypothesis is that I have to use the setTimeout() method on either
org.apache.axis.client.Call
https://axis.apache.org/axis/java/apiDocs/org/apache/axis/client/Call.html
or org.apache.soap.transport.http.SOAPHTTPConnection
https://docs.oracle.com/cd/B13789_01/appdev.101/b12024/org/apache/soap/transport/http/SOAPHTTPConnection.html
and that the timeout value is an Apache default, not something that is controlled by the Domino server.
I'd be grateful for any help.
I understand your approach, and I hope this is the correct one to solve your problem.
Add some debug output (a console write would be fine) that displays the default timeout, then try to increase it to 10 minutes:
SOAPHTTPConnection conn = new SOAPHTTPConnection();
System.out.println("time out is :" + conn.getTimeout());
conn.setTimeout(600000);//10 min in ms
System.out.println("after setting it, time out is :" + conn.getTimeout());
call.setSOAPTransport(conn);
Now keep in mind that Domino also has a maximum LotusScript/Java execution time; check this value and (at least as a test) change it: http://www.ibm.com/support/knowledgecenter/SSKTMJ_9.0.1/admin/othr_servertasksagentmanagertab_r.html (it's the version 9 help, but this part should be identical).
I've since discovered that it wasn't my code generating the error; the default timeout for the Apache SOAP SOAPHTTPConnection is 0, i.e. no timeout.

ServerXmlHttpRequest hanging sometimes when doing a POST

I have a job that periodically does some work involving ServerXmlHttpRequest to perform an HTTP POST. The job runs every 60 seconds.
And normally it runs without issue. But there's about a 1 in 50,000 chance (every two or three months) that it will hang:
IXMLHttpRequest http = new ServerXmlHttpRequest();
http.open("POST", deleteUrl, false, "", "");
http.send(stuffToDelete); <---hang
When it hangs, not even the Task Scheduler (with the option enabled to kill the job if it takes longer than 3 minutes to run) can end the task. I have to connect to the remote customer's network, get on the server, and use Task Manager to kill the process.
And then it's good for another month or three.
Eventually I started using Task Manager to create a process dump,
so I could analyze where the hang is. After five crash dumps (over the last 11 months or so) I get a consistent picture:
ntdll.dll!_NtWaitForMultipleObjects#20()
KERNELBASE.dll!_WaitForMultipleObjectsEx#20()
user32.dll!MsgWaitForMultipleObjectsEx()
user32.dll!_MsgWaitForMultipleObjects#20()
urlmon.dll!CTransaction::CompleteOperation(int fNested) Line 2496
urlmon.dll!CTransaction::StartEx(IUri * pIUri, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4453 C++
urlmon.dll!CTransaction::Start(const wchar_t * pwzURL, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4515 C++
msxml3.dll!URLMONRequest::send()
msxml3.dll!XMLHttp::send()
Contoso.exe!FrobImporter.TFrobImporter.DeleteFrobs Line 971
Contoso.exe!FrobImporter.TFrobImporter.ImportCore Line 1583
Contoso.exe!FrobImporter.TFrobImporter.RunImport Line 1070
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.HandleFrobImport Line 433
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.CoreExecute Line 71
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.Execute Line 84
Contoso.exe!Contoso.Contoso Line 167
kernel32.dll!#BaseThreadInitThunk#12()
ntdll.dll!__RtlUserThreadStart()
ntdll.dll!__RtlUserThreadStart#8()
So I do a ServerXmlHttpRequest.send, and it never returns. It will sit there for days (causing the system to miss financial transactions), until come Sunday night I get a call that it's broken.
It's probably of no help unless someone knows how to debug at this level, but the registers in the stalled thread at the time of the dump are:
EAX 00000030
EBX 00000000
ECX 00000000
EDX 00000000
ESI 002CAC08
EDI 00000001
EIP 732A08A7
ESP 0018F684
EBP 0018F6C8
EFL 00000000
Windows Server 2012 R2
Microsoft IIS/8.5
Default timeouts of ServerXmlHttpRequest
You can use serverXmlHttpRequest.setTimeouts(...) to configure the four classes of timeouts:
resolveTimeout: The value is applied to mapping host names (such as "www.microsoft.com") to IP addresses; the default value is infinite, meaning no timeout.
connectTimeout: A long integer. The value is applied to establishing a communication socket with the target server, with a default timeout value of 60 seconds.
sendTimeout: The value applies to sending an individual packet of request data (if any) on the communication socket to the target server. A large request sent to a server will normally be broken up into multiple packets; the send timeout applies to sending each packet individually. The default value is 30 seconds.
receiveTimeout: The value applies to receiving a packet of response data from the target server. Large responses will be broken up into multiple packets; the receive timeout applies to fetching each packet of data off the socket. The default value is 30 seconds.
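As a sketch of wiring those four values up (shown here in Python via pywin32 purely for brevity; the URL and values are placeholders, and setTimeouts takes resolve, connect, send, receive in milliseconds on the COM object regardless of which language hosts it):
import win32com.client  # pywin32

# All four timeouts are in milliseconds: resolve, connect, send, receive.
http = win32com.client.Dispatch("MSXML2.ServerXMLHTTP.6.0")
http.setTimeouts(5000, 60000, 30000, 30000)
http.open("POST", "https://example.com/delete", False)  # placeholder URL, synchronous
http.send("stuffToDelete")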
The KB305053 (a server that decides to keep the connection open will cause serverXmlHttpRequest to wait for the connection to close) seems like it plausibly could be the issue. But the 30 second default timeout would have taken care of that.
Possible workaround - Add myself to a Job
The Windows Task Scheduler is unable to terminate the task, even though the option to do so is enabled.
I will look into using the Windows Job API to add my own process to a job, and use SetInformationJobObject to set a time limit on my process:
CreateJobObject
AssignProcessToJobObject
SetInformationJobObject
to limit my process to three minutes of execution time:
PerProcessUserTimeLimit
If LimitFlags specifies JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process user-mode execution time limit, in 100-nanosecond ticks. Otherwise, this member is ignored.
The system periodically checks to determine whether each process associated with the job has accumulated more user-mode time than the set limit. If it has, the process is terminated.
If the job is nested, the effective limit is the most restrictive limit in the job chain.
Although, since Task Scheduler itself uses Job objects to limit a task's time, I'm not hopeful that a Job object will be able to limit it either.
Edit: Job objects cannot limit a process by elapsed time, only by user-mode execution time. And a process sitting idle waiting on an object will not accumulate any user time - certainly not three minutes' worth.
Bonus Reading
How can a ServerXMLHTTP GET request hang? (GET, not POST)
KB305053: ServerXMLHTTP Stops Responding When You Send a POST Request (which says the timeout should expire; where mine does not)
MS Forums: oHttp.Send - Hangs (HEAD, not POST)
MS Forums: ASP to test SOAP WebService using MSXML2.ServerXMLHTTP Send hangs
CC to MS Support Forums
Consider switching to a newer, supported API.
msxml6.dll using MSXML2.ServerXMLHTTP.6.0
winhttpcom.dll using WinHttp.WinHttpRequest.5.1.
The msxml3.dll library is no longer supported and is only kept around for compatibility reasons. Plus, there were a number of security and stability improvements included with msxml4.dll (and newer) that you are missing out on.
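A sketch of the WinHTTP alternative, again in Python via pywin32 purely for illustration (placeholder URL and values; the ProgID and SetTimeouts method are the documented ones):
import win32com.client  # pywin32

# WinHttpRequest exposes the same four timeout classes, in milliseconds.
req = win32com.client.Dispatch("WinHttp.WinHttpRequest.5.1")
req.SetTimeouts(5000, 60000, 30000, 30000)  # resolve, connect, send, receive
req.Open("POST", "https://example.com/delete", False)  # placeholder URL, synchronous
req.Send("stuffToDelete")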

Is it possible to set dynamic download delay in scrapy?

I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
However, if I set the delay to 2 s it is not efficient enough, and if I set DOWNLOAD_DELAY = 0,
the crawler is able to crawl about 10 pages; after that, the target site returns something like "you are requesting too frequently".
What I want to do is keep download_delay at 0; once the "requesting too frequently" message is found in the HTML, change the delay to 2 s, and after a while switch it back to zero.
Is there any module that can do this, or any better idea to handle such a case?
Update:
I found there is an extension called AutoThrottle,
but can it be customized with logic like this?
if (requesting too frequently) is found
increase the DOWNLOAD_DELAY
If, right after you get the anti-spider page, you can get the data page again within 2 seconds, then what you are asking for probably requires writing a downloader middleware
that checks for the anti-spider page, moves all scheduled requests into a renew-queue, starts a looping call when the spider is idle to pull requests from the renew-queue (the looping interval is your hack for a new download delay), tries to decide when the download delay is no longer necessary (this requires some testing), then stops the loop and reschedules all the requests in the renew-queue back onto the Scrapy scheduler. You will need to use a Redis queue in the case of a distributed crawl.
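A full renew-queue implementation is beyond a forum post, but here is a minimal sketch of the simpler idea from the question: detect the ban text in a downloader middleware, raise the per-domain delay, and let it decay back on clean responses. The ban text and the numbers are placeholders, and it reaches into Scrapy's internal downloader slots (which can change between versions), so treat it as a starting point rather than production code:
# middlewares.py - hypothetical adaptive-delay downloader middleware
class AdaptiveDelayMiddleware:
    BAN_TEXT = b"requesting too frequently"  # placeholder: match the real ban page

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def _slot(self, request, spider):
        # Scrapy keeps one Slot per domain; slot.delay is the effective download delay.
        # _get_slot_key and slots are internal APIs and may differ between versions.
        downloader = self.crawler.engine.downloader
        key = downloader._get_slot_key(request, spider)
        return downloader.slots.get(key)

    def process_response(self, request, response, spider):
        slot = self._slot(request, spider)
        if slot is None:
            return response
        if self.BAN_TEXT in response.body:
            slot.delay = 2.0  # back off
            return request.replace(dont_filter=True)  # retry the banned request
        slot.delay = max(0.0, slot.delay - 0.5)  # drift back towards zero
        return response
Enable it with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.AdaptiveDelayMiddleware': 543} in settings.py (the module path and priority are examples).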
With the download delay set to 0, in my experience throughput can easily go above 1000 items/min. If the anti-spider page pops up after 10 responses, then it is not worth the effort.
Instead, maybe you can try to find out how fast your target server allows you to go (maybe 1.5 s, 1 s, 0.7 s, 0.5 s, etc.), and then redesign your product to take into consideration the throughput your crawler can actually achieve.
You can use the AutoThrottle extension now. It is turned off by default. You can add these parameters to your project's settings.py file to enable it:
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
Yes, you can use the time module to add a dynamic delay:
import time

for i in range(10):
    # ... operation 1 ...
    time.sleep(i)  # pause for i seconds
    # ... operation 2 ...
Now you can see the delay between operation 1 and operation 2.
Note: the value of i is interpreted as seconds.