Scrapy response.status not 100% accurate?

I'm using l.add_value('http_status', response.status) to capture the response status of each domain and store it in a SQL database but some domains do not have a response status (null). At first, I thought that they might be blocking scrapy but when I run scrapy again on those domains I get back a status of 200. Before I implement a second check using urllib, I thought I'd ask here to see if anyone has experienced this before or has any advice.

There are a few reasons why a request might not receive a response:
1) The DNS lookup did not finish in time, so the domain was never resolved (increase DNS_TIMEOUT).
2) The server took too long to respond (increase DOWNLOAD_TIMEOUT).
3) The response was a very large file, over 1 GB (increase DOWNLOAD_MAXSIZE).
4) There was an Internet connection problem on your side (DNS resolved, but you lost the connection afterwards).
5) The web server was temporarily down.
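
As a rough sketch, the three settings named above could be raised in a project's settings.py like this. The values are illustrative only, not recommendations; as far as I know the Scrapy defaults are 60 seconds for DNS_TIMEOUT, 180 seconds for DOWNLOAD_TIMEOUT and 1 GiB for DOWNLOAD_MAXSIZE. The two RETRY_* lines are an addition that is not part of the answer above; they are included on the assumption that a retry could help with the temporary failures in points 4 and 5.

```python
# settings.py (fragment); all values are illustrative assumptions
DNS_TIMEOUT = 120                          # seconds to wait for DNS resolution
DOWNLOAD_TIMEOUT = 300                     # seconds to wait for the server to respond
DOWNLOAD_MAXSIZE = 2 * 1024 * 1024 * 1024  # largest response accepted, in bytes (2 GiB)

# Not mentioned in the answer above: retrying may recover from temporary
# connection problems or a briefly unavailable server.
RETRY_ENABLED = True
RETRY_TIMES = 3
```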

Related

What would make expressjs 503?

I use expressjs for my server. A user hits a proxy service that hits mine. The proxy service is telling me that mine has given a 503. In this case, I can't really be logging that it happened since it seems express just error'ed out. All I know is that this other service receives a 503. Nowhere in my code do I set this up. I couldn't find anything when searching around, and if it weren't the proxy I wouldn't know the 503s are happening. In hundreds of thousands of requests, this happened just under 100 times.
I'm not expecting anyone to have the direct answer, however any clues would be much appreciated and any clue that gets to the answer will be marked as the answer with my comment below. Hoping some comments will help me give this question better information!

Well, one thing that comes to mind is server overload. If the route(s) you're checking perform synchronous tasks and you're sending a lot of requests at once, you might end up blocking your server. Express recommends that every route handler be async, so the server won't get blocked even when a lot of requests arrive at once.
I don't know if that's your case, but it's worth checking.

The solution relates to Envoy & Node. A coworker suggested server timeouts and I found https://github.com/envoyproxy/envoy/issues/1979.
In my case I added this: server.keepAliveTimeout = 0;
You can see docs over at https://nodejs.org/docs/latest-v10.x/api/http.html#http_server_keepalivetimeout.
I don't work on the infrastructure side of things, so it took me a while, working with the people on the teams who set up Envoy, to figure out that the problem was in that layer.
When Envoy gets ECONNRESET, it sends back a 503.

How to handle the application if connection breaks in between a web service call

In several interviews I have been asked about handling connections, web service calls, server responses and so on. Even now I am not clear about many things. Could you please help me get a better idea about the following scenarios?
1. What is the advantage of using NSURLSessionDataTask instead of NSURLConnection? My understanding is that data loss will not happen with NSURLSessionDataTask even if the connection breaks, which is not true of the latter. But how does that work?
2. If the connection breaks after sending the request to a server, or while connecting to the server, how can we handle that in our code for NSURLConnection and NSURLSessionDataTask? My idea is to use the Reachability classes and check when the device comes back online.
3. The data we sent was updated on the server side, but we did not get the response from the server. What can we do on our side to handle this situation? Is increasing timeoutInterval the only thing we can do?
Please help me with these scenarios. Thank you very much in advance!

That's multiple questions, really, but I'll try to answer them all briefly.
Most failure handling is the same between NSURLConnection and NSURLSession. The main advantages of the latter are support for background downloads and cancelling groups of related requests.
That said, if you're doing a large download that you think might fail, NSURLSession does provide download tasks that let you resume the download if your network connection fails, similar to what NSURLDownload used to do on OS X (never available on iOS). This only helps for downloading large files, though, not for large uploads (which require significant server-side support to resume) or other requests.
Your intuition is correct. When a connection fails, create a reachability object monitoring that particular hostname to see when it would be a good time to try the request again. Then, try the request again.
You might also display some sort of advisory UI to say that you have no Internet connection. (By advisory, I mean something that the user doesn't have to click on and that does not impact offline use of the app any more than necessary; look at the Facebook app for a great example.)
Provide a unique identifier when you make the request, and store that on the server along with the server's response until the client acknowledges receipt of the response (or purge it anyway after some reasonable number of days). When the upload finishes, the server gives you back its response if it can.
If something goes wrong, the client asks the server to resend the response associated with that unique identifier. Once your client has the data, it acknowledges receipt and the server deletes the response. If you ask the server for the response and it doesn't have one, then the upload didn't really complete.
With some additional work, this approach can make it possible to support long-running uploads more reliably. If an upload fails, ask the server how much data it got for that identifier, then tell the server that you're going to upload new data starting at the next byte. On the server side, overwrite the old data starting at that byte (just in case some data was still being written when you asked for the length).
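
The unique-identifier scheme described above can be sketched in a few lines. This is a minimal, in-memory illustration written in Python rather than Objective-C; the class and method names (ResponseStore, handle_upload, resend, acknowledge) are hypothetical, and a real server would persist the stored responses instead of keeping them in a dictionary.

```python
import uuid


class ResponseStore:
    """Stand-in for the server side: remembers responses by request identifier."""

    def __init__(self):
        self._responses = {}

    def handle_upload(self, request_id, payload):
        # Replaying the same identifier returns the stored response
        # instead of processing the upload a second time.
        if request_id not in self._responses:
            self._responses[request_id] = f"stored {len(payload)} bytes"
        return self._responses[request_id]

    def resend(self, request_id):
        # None means the server has no response, so the upload never completed.
        return self._responses.get(request_id)

    def acknowledge(self, request_id):
        # The client confirmed receipt, so the response can be deleted.
        self._responses.pop(request_id, None)


server = ResponseStore()
request_id = str(uuid.uuid4())

# The upload reaches the server, but pretend the response is lost on the way back.
server.handle_upload(request_id, b"hello")

# The client asks again with the same identifier, then acknowledges receipt.
recovered = server.resend(request_id)
server.acknowledge(request_id)
```

After the acknowledgement, a further resend for the same identifier returns None, which is the signal the answer describes for "the upload didn't really complete" (or, here, "already acknowledged").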
Hope that helps.

About losing HTTP Requests

I have a server to which my client sends a HTTP GET request with some values. The server on its end simply stores these values to a database.
Now, I am observing that sometimes I do not observe these values in the database. One of the following could have happened:
1. The client never sent it
2. The server never received it
3. The server failed in writing to the database
My strongest suspicion is that the reason is 2, but I am unable to explain it completely. Since this is an HTTP request (which means there is TCP underneath), reliable delivery of the GET request should be guaranteed, right? Is it possible that even though I send a GET request to the server, it was never received by the server? If yes, what is TCP doing there?
Or, can I confidently assert that if the server is up and running and everything sent to the server is written to the database, then the absence of the details of the GET request in the database means the client never sent it?
Not sure if the details will help - but I am running a tomcat server and I am just sending a name-value pair through the get request.

There are a few things you seem to be missing. First of all, yes, if TCP finishes successfully, you pretty much have a guarantee that your message (i.e. the TCP payload) has reached the other side: TCP takes care of lost packets and of the order in which packets arrive. However, this is not universally failproof, as there are still things beyond the powers of TCP (think of a physical disconnect caused by cutting through an Ethernet cable). There is also no guarantee regarding the syntactic correctness of the protocol "above." Any check beyond delivering a bit-perfect copy is simply not TCP's concern.
So, there is a chance that the requests issued by your client are faulty, or that they are correct but not parsed correctly by your server. The former strikes me as more likely than the latter, as Tomcat is a very mature piece of software. I think it would help tremendously if you recorded and analysed some of your generated traffic with a tool such as Wireshark.
You do not really mention which database you are using. Some databases sacrifice ACID compliance in favour of faster writes. By their nature, you can never be quite sure whether something actually got written to disk or is still sitting in a buffer in memory. If you happen to use such a database, that would be another line of investigation.
Programmatically, I advise you take the following steps when dealing with HTTP traffic:
Did writing to the socket finish without error?
Could a response be read from the socket?
Does the response carry a code in the 2xx range (indicating a successful operation)?
If any of these fail, you should really log something.
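
The three checks above can be sketched on the client side as follows. This is only an illustration in Python using the standard library; the function name send_value and the opener parameter are assumptions made for the example (the parameter exists only so the function can be exercised without a network). Note that urllib.request.urlopen raises HTTPError for 4xx and 5xx responses, which is why that case is handled as an exception.

```python
import http.client
import logging
import urllib.error
import urllib.request

log = logging.getLogger("client")


def send_value(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True only if the request was sent, a response was read,
    and the status code is in the 2xx range. Log every failure."""
    try:
        # Checks 1 and 2: sending the request and reading the response.
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            resp.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        log.error("server answered %s for %s", exc.code, url)
        return False
    except (OSError, http.client.HTTPException) as exc:
        # DNS failures, timeouts, connection resets, truncated responses.
        log.error("request to %s failed: %s", url, exc)
        return False
    # Check 3: the status code indicates success.
    if not 200 <= status < 300:
        log.error("unexpected status %s for %s", status, url)
        return False
    return True
```

Logging on the client this way makes it possible to tell "the client never sent it" apart from "the server never received it" when a value is missing from the database.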
On a related note, what you are doing there does not call for the GET method but for POST, as you are changing application state. Consider it a nice-to-have ;)

Without knowing the specifics, you can break it down into two parts: the HTTP request and the DB write. The client will receive a 200 OK response from the server when its GET request has been acknowledged. I've written code under Tomcat that connects to a MySQL DB using a DAO; in the case of a failure, an exception would be thrown and logged. Whichever method you're using, you'll want to figure out how failures are logged.

What Is Meant By Server Response Time

I'm doing website optimisations using Google's Pagespeed Insights to test improvements. Among the high-priority fix suggestions, is this:
Reduce server response time
In our test, your server responded in 2.1 seconds.
I read the 'helpful' doc linked in this section, and now I'm really confused.
Is the server response time the DNS response, the time to first-byte, or a combination? Is it purely a server-side thing, or could this be affected by, for example, a slow JavaScript resource or ready events in the DOM?

My first guess would have been that it's the time taken from the moment the request was issued, to the 1st byte received from the server, however Google's definition is not quite that:
(from this page https://developers.google.com/speed/docs/insights/Server)
Server response time measures how long it takes to load the necessary
HTML to begin rendering the page from your server, subtracting out the
network latency between Google and your server. There may be variance
from one run to the next, but the differences should not be too large.
In fact, highly variable server response time may indicate an
underlying performance issue.
Taking 2.1 seconds would suggest to me that your application or web server is buffering its output, so all your server-side processing happens before any content is sent. If you don't buffer, the HTML can start being sent to the browser sooner, which may help; however, you lose the ability to do things like change response headers late in your logic.

Rails 3: Return large amount of data to user via API

My app has an API that users can request data from. Sometimes that data takes a long time to process, and this is breaking my code.
I need a solution for this and I was thinking of using delayed_job, but I'm not sure how this works. If the user makes a request, I need to give them an answer. Even if I process the data in the background, the call still needs to wait until the job returns.
What is the solution for this? I am not sure how to do it.
Thanks

Heroku has a 30-second timeout, which is why your requests are failing (probably H12 or H13 in your Heroku logs).
There are three methods to work around this.
1. Keep the connection open by sending blank data.
You'll need to respond within the first 30 seconds and every 55 seconds after that. Use the time in between to process the data. Sending spaces should not affect the ability of the browser to read the response.
2. Callback
Have the user provide a callback URL in the initial request. When you finish processing the data, hit the callback URL with your response.
3. Polling
As suggested by Codeglot, you can provide the user with a key. To check on their request, they can ping your server with that key.
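
The polling approach can be sketched independently of any framework. The example below is in Python rather than Rails and is only meant to show the shape of the idea: start_job would sit behind the initial API call and return a key immediately, and check_job would sit behind the endpoint the user pings with that key. All names here (start_job, check_job, slow_report, jobs) are hypothetical, and an in-process thread pool stands in for a real background queue such as delayed_job.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # key -> Future; a real app would keep job state in a database


def slow_report(n):
    # Placeholder for the expensive data processing.
    return sum(range(n))


def start_job(n):
    """Queue the work and return a key right away instead of blocking."""
    key = uuid.uuid4().hex
    jobs[key] = executor.submit(slow_report, n)
    return key


def check_job(key):
    """What the polling endpoint would return for a given key."""
    future = jobs.get(key)
    if future is None:
        return {"status": "unknown"}
    if not future.done():
        return {"status": "pending"}
    return {"status": "done", "result": future.result()}
```

The initial request returns as soon as the key is generated, well inside the 30-second limit, and the client repeats check_job until the status is "done".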

Tell the user that their data is being processed and will be available shortly. YouTube, Vimeo, Facebook, Twitter: they all do this.
