Scrapy - get depth level of failed requests with no response

When parsing scraped pages I also save the depth each request was scraped from, using response.meta['depth'].
I recently started using errback to log all failed requests to a separate file, and having the depth there would help me a lot. I believe I could use failure.value.response.meta['depth'] for pages that actually got a response but failed due to, e.g., an HTTP status error like 403; however, when an error like a TCP timeout is encountered, there is no response.
Is it possible to get the depth level of a failed request with no response?
EDIT1: I tried failure.request.meta['depth'], but that gives an error. meta can be found, but it has no 'depth' key.
EDIT2: The issue seems to be that the 'depth' key in failure.request.meta is created only when the first response is received. So, the way I understand it, if the first request (a start_url) doesn't receive a response, the 'depth' key has not been created yet, and accessing it throws an exception.
I'm going to experiment with this as per the depth middleware:
if 'depth' not in response.meta:
    response.meta['depth'] = 0

Yep, the issue turns out to be exactly as I described it in EDIT2. This is how I fixed it:
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, errback=self.my_errback)

def my_errback(self, failure):
    if 'depth' not in failure.request.meta:
        failure.request.meta['depth'] = 0
    depth = failure.request.meta['depth']
    # do something with depth...
Big thanks to @Galecio, who pointed me in the right direction!
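For completeness, here is a minimal sketch (not from the original post; the class name, start URL, and log format are my own) of an errback that recovers the depth both when a response exists and when it doesn't, using the HttpError check from Scrapy's errback documentation:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class DepthLoggingSpider(scrapy.Spider):
    name = 'depth_logger'  # hypothetical spider name
    start_urls = ['https://example.com']  # hypothetical start URL

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, errback=self.my_errback)

    def parse(self, response):
        pass  # normal parsing goes here

    def my_errback(self, failure):
        if failure.check(HttpError):
            # The request got a response (e.g. 403), so DepthMiddleware
            # has already stamped the depth on response.meta.
            depth = failure.value.response.meta.get('depth', 0)
        else:
            # No response (e.g. TCP timeout): fall back to request.meta,
            # defaulting to 0 for start requests.
            depth = failure.request.meta.get('depth', 0)
        self.logger.error('Failed: %s (depth %d)', failure.request.url, depth)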

Sometimes the Google Geocoding API returns a 500 server error [closed]

Sometimes the Google Maps Geocoding API returns a 500 server error response for German postal codes, and I cannot understand why.
I hope it is specific enough.
Any ideas?
https://maps.googleapis.com/maps/api/geocode/json?key={api_key}&address={postal_code}&language=de&region=de&components=country:DE&sensor=false
Since you specify that the problem is not a particular address but seemingly "random" behavior, this may fall under a behavior documented for other well-known Google APIs.
As in those cases, the recommended strategy for the Geocoding API is exponential backoff, which basically means retrying after increasingly long delays.
In case the above link goes down or changes, I'm quoting the article:
Exponential Backoff
In rare cases something may go wrong serving your request; you may receive a 4XX or 5XX HTTP response code, or the TCP connection may simply fail somewhere between your client and Google's server. Often it is worthwhile retrying the request, as the follow-up request may succeed when the original failed. However, it is important not to simply loop, repeatedly making requests to Google's servers. This looping behavior can overload the network between your client and Google, causing problems for many parties.
A better approach is to retry with increasing delays between attempts. Usually the delay is increased by a multiplicative factor with each attempt, an approach known as Exponential Backoff.
For example, consider an application that wishes to make this request to the Google Maps Time Zone API:
https://maps.googleapis.com/maps/api/timezone/json?location=39.6034810,-119.6822510&timestamp=1331161200&key=YOUR_API_KEY
The following Python example shows how to make the request with exponential backoff:
import json
import time
import urllib
import urllib2

def timezone(lat, lng, timestamp):
    # The maps_key defined below isn't a valid Google Maps API key.
    # You need to get your own API key.
    # See https://developers.google.com/maps/documentation/timezone/get-api-key
    maps_key = 'YOUR_KEY_HERE'
    timezone_base_url = 'https://maps.googleapis.com/maps/api/timezone/json'

    # This joins the parts of the URL together into one string.
    url = timezone_base_url + '?' + urllib.urlencode({
        'location': "%s,%s" % (lat, lng),
        'timestamp': timestamp,
        'key': maps_key,
    })

    current_delay = 0.1  # Set the initial retry delay to 100ms.
    max_delay = 3600     # Set the maximum retry delay to 1 hour.

    while True:
        try:
            # Get the API response.
            response = str(urllib2.urlopen(url).read())
        except IOError:
            pass  # Fall through to the retry loop.
        else:
            # If we didn't get an IOError then parse the result.
            result = json.loads(response.replace('\\n', ''))
            if result['status'] == 'OK':
                return result['timeZoneId']
            elif result['status'] != 'UNKNOWN_ERROR':
                # Many API errors cannot be fixed by a retry, e.g. INVALID_REQUEST
                # or ZERO_RESULTS. There is no point retrying these requests.
                raise Exception(result['error_message'])

        if current_delay > max_delay:
            raise Exception('Too many retry attempts.')
        print 'Waiting', current_delay, 'seconds before retrying.'
        time.sleep(current_delay)
        current_delay *= 2  # Increase the delay each time we retry.

tz = timezone(39.6034810, -119.6822510, 1331161200)
print 'Timezone:', tz
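The quoted sample is Python 2 (urllib2, print statements). For what it's worth, here is a rough Python 3 sketch of the same backoff loop, using the requests library and the geocoding parameters from your question (the function name, return value, and delay cap are my own choices):

import time
import requests  # assumption: the requests library is installed

def geocode(postal_code, api_key, max_delay=60):
    url = 'https://maps.googleapis.com/maps/api/geocode/json'
    params = {'key': api_key, 'address': postal_code, 'language': 'de',
              'region': 'de', 'components': 'country:DE'}
    delay = 0.1  # initial retry delay: 100 ms
    while True:
        try:
            resp = requests.get(url, params=params, timeout=10)
            resp.raise_for_status()  # turn HTTP 4xx/5xx into an exception
            result = resp.json()
            if result['status'] == 'OK':
                return result['results']
            if result['status'] != 'UNKNOWN_ERROR':
                # INVALID_REQUEST, ZERO_RESULTS etc. won't be fixed by a retry.
                raise RuntimeError(result.get('error_message', result['status']))
        except requests.RequestException:
            pass  # network error or HTTP error status: fall through and retry
        if delay > max_delay:
            raise RuntimeError('Too many retry attempts.')
        time.sleep(delay)
        delay *= 2  # exponential backoff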
Of course this will not resolve the "false responses" you mention; I suspect that depends on data quality and does not happen randomly.

WSO2 API Manager returning RunTime Error

I have an API in WSO2. When I try to test it in the store with valid parameters via GET, it returns the following error message:
<am:fault xmlns:am="http://wso2.org/apimanager">
  <am:code>101504</am:code>
  <am:type>Status report</am:type>
  <am:message>Runtime Error</am:message>
  <am:description>Send timeout</am:description>
</am:fault>
I have already searched and tried a lot, but with no success; it always returns the same error. I don't know if it helps, but the API I am trying to access is a PHP file.
Any ideas?
EDIT:
I have APIs identical to this one, differing only in the response, that work properly. Even if I delete the PHP file the API points to, the error keeps coming.
EDIT:
I changed the code in Management Console, in Metadata>List>APIs:
{"production_endpoints":
{"url":"http://site/myapi.php","config":
{"format":"leave-as-is","optimize":"leave-as-
is","actionSelect":"fault","actionDuration":30000}},
"sandbox_endpoints":
{"url":"http://site/myapi.php","config":
{"format":"leave-as-is","optimize":"leave-as-
is","actionSelect":"fault","actionDuration":30000}},
"implementation_status":"managed","endpoint_type":"http"}
to this:
{"production_endpoints":
{"url":"http://10.20.40.189/ConsultaAutorizacaoCadPos.php","config":null},
"sandbox_endpoints":{"url":"http://10.20.40.189/ConsultaAutorizacaoCadPos.php","config":null},
"implementation_status":"managed","endpoint_type":"http"}
And this error message vanishes. But this new one appears:
<am:fault xmlns:am="http://wso2.org/apimanager">
  <am:code>303001</am:code>
  <am:type>Status report</am:type>
  <am:message>Runtime Error</am:message>
  <am:description>Currently , Address endpoint : [ Name : admin--myapi_APIproductionEndpoint_0 ] [ State : SUSPENDED ]</am:description>
</am:fault>
EDIT: This error message appears only sometimes. Most of the time the request takes a long time to run and then returns nothing (no content).
Any ideas how to solve it?

Splunk rex query does not return desired result

I am looking to search for error types in my Splunk logs. A typical error log looks like this:
ERROR 2016/03/16 22:13:55 Program exited with error Calling service: Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: read unix #->/var/run/program.sock: use of closed network connection (Client.Timeout exceeded while awaiting headers)
Note that the common part is "Program exited with error". I am looking to capture the part that follows this common portion of the error message. I tried a couple of rex expressions; both returned different results. Importantly, neither captured the error type shown above. I am giving the one that worked better here.
* | rex "Program exited with error\s+(?<reason>.+)" | top reason
An example of a log line it matched:
Unable to get program status, Get http://192.168.0.2:2774/program/v1/status: net/http: timeout awaiting response headers
However, it did not match logs of the form:
initial ZK connection failed, stat /var/program/f47aae5c-ea42-11e5-8975-fc15b40f4cc4/srcheck/started: no such file or directory
Calling service: Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: read unix #->/var/run/program.sock: use of closed network connection (Client.Timeout exceeded while awaiting headers)
Could someone help me understand what's wrong with my rex expression and what the right one would be so I get all possible error types?
This recipe:
"ERROR.*Program exited with error.*:.*:.*:\s+(?<reason>.+)"
will yield:
use of closed network connection (Client.Timeout exceeded while awaiting headers)
I don't have enough sample data to know if this will hold up or not. For example, I'm counting on exactly three colons to get me to the interesting part. Also, I don't know if you care about other things like the hostname, the fact that it's a Post, etc. But based on your sample of one, this answer should do the trick.
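As an aside, if you want to sanity-check such a pattern outside Splunk, a small Python sketch (my own, not from the answer; note that Python spells the named group (?P<reason>...) where Splunk's rex accepts (?<reason>...)) can replay it against the sample line:

import re

# Sample line from the question, wrapped here only for display.
sample = ("ERROR 2016/03/16 22:13:55 Program exited with error Calling service: "
          "Post http://hostname/v1.21/resource/create?name=/60b80cf9-ebc4-11e5-a9cb-3c4a92db9491-2: "
          "read unix #->/var/run/program.sock: use of closed network connection "
          "(Client.Timeout exceeded while awaiting headers)")

# The answer's recipe, translated to Python's named-group syntax.
pattern = re.compile(r"ERROR.*Program exited with error.*:.*:.*:\s+(?P<reason>.+)")

match = pattern.search(sample)
print(match.group("reason") if match else "NO MATCH")
# -> use of closed network connection (Client.Timeout exceeded while awaiting headers)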

HTTPSConnectionPool Max retries exceeded

I've got a django app in production running on nginx/uwsgi. We recently started using SSL for all our connections. Since moving to SSL, I often get the following message:
HTTPSConnectionPool(host='foobar.com', port=443):
Max retries exceeded with url: /foo/bar
Essentially what happens is that the browser communicates with Django server code, which then uses the requests library to call an API. It's the connection to the API that generates the error. Also, I've moved all our requests into one session (a requests session, that is), but this hasn't helped.
I've bumped up the number of uwsgi listeners since I thought that could be the problem, but our load isn't that high. Also, we never had this problem before SSL. Does anyone have some advice as to how to solve this problem?
Edit
Here is a code snippet of how I call the API, posted (mostly) verbatim. Note it's not this code that actually fails; the requests library throws the exception when calling self.session.post.
def save_answer(self):
    logger.info("Saving answer to question")
    url = "%s1.0/exam/learneranswer/" % self.api_url
    response = {'success': False}
    data = {'questionorder': self.request.POST.get('questionorder'),
            'paper': self.request.POST.get('paper')}
    data['answer'] = ",".join(self.request.POST.getlist('answer'))
    r = self.session.post(url, data=simplejson.dumps(data))
    if r.status_code == 201:
        logger.info("Answer saved successfully")
        response['success'] = True
    elif r.status_code == 400:
        if r.text == "Paper expired":
            logger.warning("Timer has expired")
            response['message'] = 'Your time has run out'
        if r.text == "Question locked":
            response['message'] = \
                'This question is locked and cannot be answered anymore'
        else:
            logger.error("Unknown error")
            self.log_error(r, "Unknown Error while saving answer")
    else:
        logger.error("Internal error")
        self.log_error(r, "Internal error in api while saving answer")
    return simplejson.dumps(response)
I've found that this error happens when some item in one of my views throws an exception. For example, when using the requests library to post data to another URL:
r = requests.post(url, data=json.dumps(payload), headers=headers, timeout=5)
The downrange server was having connection issues, which threw an exception that bubbled up and produced the error you have above. I replaced the call with this:
try:
    r = requests.post(url, data=json.dumps(payload), headers=headers, timeout=5)
except requests.exceptions.ConnectionError as e:
    r = "No response"
And that fixed it (of course, I'd suggest adding in more error handling, but the above is the relevant subset).
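If the underlying failures are transient, another option (a sketch of mine, not from the original answers) is to let requests retry automatically with backoff at the transport level, via urllib3's Retry and an HTTPAdapter:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=5,             # give up after 5 attempts
                backoff_factor=0.5,  # exponential backoff between attempts
                status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

# Subsequent calls through this session retry automatically, e.g.:
# r = session.post(url, data=json.dumps(payload), headers=headers, timeout=5)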
One suggested workaround is to disable certificate validation:
requests.get('https://google.com', verify=False)
A better fix is to specify your CA bundle instead (requests accepts a path, e.g. verify='/path/to/ca-bundle.crt').
This error can also occur when a Python script tries to connect to a service (in my case, an IBM service) before your Wi-Fi or Ethernet connection is established. Use a try/except to handle it, or, if running as a service, start the service only after the network is up.

Rails 3.2.2 log files unordered, requests intertwined

I recall getting log files that were nicely ordered, so that you could follow one request, then the next, and so on.
Now the log files are, as my 4-year-old says, "all scroggled up", meaning they are no longer separate, distinct chunks of text. Log lines from two requests get intertwined/mixed up.
For instance:
Started GET /foobar
...
Completed 200 OK in 2ms (Views: 0.4ms | ActiveRecord: 0.8ms)
Patient Load (wait, that's from another request that has nothing to do with foobar!)
[ blank space ]
Something else
This is maddening, because I can't tell what's happening within one single request.
This is running on Passenger.
I tried to search for an answer but couldn't find any good info. I'm not sure whether you should fix this in the server or in the Rails code.
If you want more info about the issue, here is the commit that removed the old way of logging: https://github.com/rails/rails/commit/04ef93dae6d9cec616973c1110a33894ad4ba6ed
If you value production log readability over everything else, you can use the
PassengerMaxInstancesPerApp 1
configuration. It might cause some scaling issues. Alternatively you could stuff something like this in application.rb:
process_log_filename = Rails.root + "log/#{Rails.env}-#{Process.pid}.log"
log_file = File.open(process_log_filename, 'a')
Rails.logger = ActiveSupport::BufferedLogger.new(log_file)
Yep! They have made some changes to ActiveSupport::BufferedLogger so it no longer waits until the request has ended to flush the logs:
http://news.ycombinator.com/item?id=4483390
https://github.com/rails/rails/commit/04ef93dae6d9cec616973c1110a33894ad4ba6ed
But they have added ActiveSupport::TaggedLogging, which is very handy: you can stamp every log line with any kind of mark you want.
In your case could be good to stamp the logs with the request UUID like this:
# config/application.rb
config.log_tags = [:uuid]
Then even if the logs are messed up you still can follow which of them correspond to the request you are following up.
You can do more clever things with this feature to help you study your logs:
How to log user_name in Rails?
http://zogovic.com/post/21138929607/running-time-in-rails-logs
Well, for me the TaggedLogging solution is a no-go. I can live with some logs getting lost if the server crashes badly, but I want my logs perfectly ordered. So, following advice from the issue comments, I'm applying this to my app:
# lib/sequential_logs.rb
module ActiveSupport
  class BufferedLogger
    def flush
      @log_dest.flush
    end

    def respond_to?(method, include_private = false)
      super
    end
  end
end

# config/initializers/sequential_logs.rb
require 'sequential_logs.rb'
Rails.logger.instance_variable_get(:@logger).instance_variable_get(:@log_dest).sync = false
As far as I can tell this hasn't affected my app; it is still running, and now my logs make sense again.
They should add some quasi-random request ID and write it on every line belonging to a single request. That way you wouldn't get confused.
I haven't used it, but I believe Lumberjack's unit_of_work method may be what you're looking for. You call:
Lumberjack.unit_of_work do
  yield
end
All logging done either in that block or in the yielded block is tagged with a unique ID.