Sometimes the Google Geocoding API returns a 500 server error [closed] - api

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Improve this question
Sometimes the Google Maps API returns a 500 server error response according to German postal codes and i cannot understand why.
I hope it is specific enough.
Any ideas?
https://maps.googleapis.com/maps/api/geocode/json?key={api_key}&address={postal_code}&language=de&region=de&components=country:DE&sensor=false

Since you specify that the problem is not a given address but a seemingly "random" behavior, this may fall under a documented behavior of other "famous" API.
As for other cases, the recommended strategy is Exponential backoff for the Geocoding API, which basically means that you have to retry after a certain delay.
In case the above link goes down or changes, I'm quoting the article:
Exponential Backoff
In rare cases something may go wrong serving your request; you may receive a 4XX or 5XX HTTP response code, or the TCP connection may simply fail somewhere between your client and Google's server. Often it is worthwhile re-trying the request as the followup request may succeed when the original failed. However, it is important not to simply loop repeatedly making requests to Google's servers. This looping behavior can overload the network between your client and Google causing problems for many parties.
A better approach is to retry with increasing delays between attempts. Usually the delay is increased by a multiplicative factor with each attempt, an approach known as Exponential Backoff.
For example, consider an application that wishes to make this request to the Google Maps Time Zone API:
https://maps.googleapis.com/maps/api/timezone/json?location=39.6034810,-119.6822510&timestamp=1331161200&key=YOUR_API_KEY
The following Python example shows how to make the request with exponential backoff:
import json
import time
import urllib
import urllib2
def timezone(lat, lng, timestamp):
# The maps_key defined below isn't a valid Google Maps API key.
# You need to get your own API key.
# See https://developers.google.com/maps/documentation/timezone/get-api-key
maps_key = 'YOUR_KEY_HERE'
timezone_base_url = 'https://maps.googleapis.com/maps/api/timezone/json'
# This joins the parts of the URL together into one string.
url = timezone_base_url + '?' + urllib.urlencode({
'location': "%s,%s" % (lat, lng),
'timestamp': timestamp,
'key': maps_key,
})
current_delay = 0.1 # Set the initial retry delay to 100ms.
max_delay = 3600 # Set the maximum retry delay to 1 hour.
while True:
try:
# Get the API response.
response = str(urllib2.urlopen(url).read())
except IOError:
pass # Fall through to the retry loop.
else:
# If we didn't get an IOError then parse the result.
result = json.loads(response.replace('\\n', ''))
if result['status'] == 'OK':
return result['timeZoneId']
elif result['status'] != 'UNKNOWN_ERROR':
# Many API errors cannot be fixed by a retry, e.g. INVALID_REQUEST or
# ZERO_RESULTS. There is no point retrying these requests.
raise Exception(result['error_message'])
if current_delay > max_delay:
raise Exception('Too many retry attempts.')
print 'Waiting', current_delay, 'seconds before retrying.'
time.sleep(current_delay)
current_delay *= 2 # Increase the delay each time we retry.
tz = timezone(39.6034810, -119.6822510, 1331161200)
print 'Timezone:', tz
Of course this will not resolve the "false responses" you mention; I suspect that depends on data quality and does not happen randomly.

Related

Scrapy - get depth level of failed requests with no response

When parsing scraped pages I also save the depth the request was scraped from using response.meta['depth'].
I recently started using errback to log all failed requests into a separate file and having depth there would help me a lot. (I believe) I could use failure.value.response.meta['depth'] for those pages which actually got a response but failed due to ie a http status error like 403 etc., however when an error like TCPTimeout is encountered there is no response.
Is it possible to get the depth level of a failed request with no response?
EDIT1: Tried failure.request.meta['depth'] but that gives an error. Meta seems that can be found but it has no depth key.
EDIT2: The issue seems to be that failure.request.meta['depth'] is created only when the first response is received. So the way I understand is that if the first request, a start_url doesn't receive a response, the depth key is not yet created and hence throws an exception.
I'm going to experiment with this as per the depth middleware:
if 'depth' not in response.meta:
response.meta['depth'] = 0
Yep, the issue turns out to be exactly how I described it in EDIT2. This is how I fixed it:
def start_requests(self):
for u in self.start_urls:
yield scrapy.Request(u, errback=self.my_errback)
def my_errback(self, failure):
if 'depth' not in failure.request.meta:
failure.request.meta['depth'] = 0
depth = failure.request.meta['depth']
# do something with depth...
Big thanks to mr #Galecio who pointed me in the right direction!

Is it possible to set dynamic download delay in scrapy?

I know that a constant delay can be set in
settings.py
DOWNLOAD_DELAY = 2
however, if I set the delay to 2s it is not efficient enough. If I set the DOWNLOAD_DELAY = 0.
The crawler is able to crawl about 10 pages. after that, the target page will return something like " you are requesting too frequently ".
What I want to do is the keep the download_delay to 0. once the "requesting too frequently" msg is found in the html. it change the delay to 2s. After a while it switch back to zero.
is there any module can do this? or any other better idea to handle such case?
Update:
I found that is a extension call AutoThrottle
but is it able to customize some logic like this??
if (requesting too frequently) is found
increase the DOWNLOAD_DELAY
If right after you get anti-spider page, then in 2 seconds you can get data page, then what you are asking probably requires writing a downloader middleware
that checks for anti-spider page, reset all scheduled requests to a renew-queue, start a looping call when spider is idle to get request from the renew-queue, (the looping interval is your hack for a new download delay), and try to decide when the download delay is not necessary again (requires some tests), then stop the looping and reschedule all the requests in renew-queue to scrapy scheduler. You will need to use redis queue in case of distributed crawl.
With download delay set to 0, in my experience throughput can go easily above 1000 items/min. If anti-spider page pops up after 10 responses, then it is not worth the effort.
Instead maybe you can try to find out how fast does your target server allow, may be 1.5s, 1s, 0.7s, 0.5s etc. Then maybe redesign your product takes into consideration the throughput your crawler can achieve.
You can use Auto Throttle extension now. It is turned off by default. You can add these parameters in your project's settings.py file to enable it.
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 300
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
Yes, You can use the time module to set the dynamic delay.
import time
for i in range(10):
*** Operations 1****
time.sleep( i )
*** Operations 2****
Now you can see the delay between Operations 1 and Operations 2.
Note:
the variable 'i' is in the form of seconds.

How to make pooling HTTP connection with twisted?

I wirite a very simple spider program to fetch webpages from single site.
Here is a minimized version.
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)
def onHeader(response, i):
deferred = readBody(response)
deferred.addCallback(onBody, i)
deferred.addErrback(errorHandler)
return response
def onBody(body, i):
print('Received %s, Length %s' % (i, len(body)))
def errorHandler(err):
print('%s : %s' % (reactor.seconds() - startTimeStamp, err))
def requestFactory():
for i in range (start, end):
deferred = agent.request('GET', baseUrl + str(i))
deferred.addCallback(onHeader, i)
deferred.addErrback(errorHandler)
print('Generated %s' % i)
reactor.iterate(1)
print('All requests has generated, elpased %s' % (reactor.seconds() - startTimeStamp))
startTimeStamp = reactor.seconds()
reactor.callWhenRunning(requestFactory)
reactor.run()
For a few requests, like 100, it works fine. But for massive requests, it will failed.
I expect all of the requests(around 3000) should be automatically pooled, scheduled and pipelined, since I use HTTPConnectionPool, set maxPersistentPerHost, create an Agent instance with it and incrementally create the connections.
But it doesn't, the connections are not keep-alive nor pooled.
In this programm, it did establish the connections incrementally, but the connections didn't pooled, each connecction will close after body received, and later requests never wait in the pool for an available connecction.
So it will take thousands of sockets, and finally failed due to timeout, because the remote server has a connection timeout set to 30s. Thousands of requests can't be done within 30s.
Could you please give me some help on this?
I have tried my best on this, here is my finds.
Error occured exactly 30s after reactor start runing, won't be influenced by other things.
Let the spider fetch my server, I find something interesting.
The HTTP protocol version is 1.1 (I check the twisted document, the default HTTPClient is 1.0 rather than 1.1)
If I didn't add any explicit header(just like the minimized version), the request header didn't contain Connection: Keep-Alive, either do response header.
If I add explicit header to ensure it is a keep-alive connection, the request header did contain Connection: Keep-Alive, but the response header still not. (I am sure my server behave correctly, other stuff like Chrome, wget did receive Connection: Keep-Alive header.)
I check /proc/net/sockstat during running, it increase rapidly at first, and decrease rapidly later. (I have increase the ulimit to support plenty of sockets)
I write a similar program with treq, a twisted based request library). The code is almost the same, so not paste here.
Link: https://gist.github.com/Preffer/dad9b1228fcd75cebd75
It's behavior is almost the same. Not pooling. It is expected to be pooling as described in treq's feature list.
If I have add explicit header on it, Connection: Keep-Alive never appear in response header.
Based on all of the above, I am quite suspicious about the quirk Connection: Keep-Alive header ruin the program. But this header is part of HTTP 1.1 standard, and it did report as HTTP 1.1. I am completely puzzled on this.
I solved the problem myself, with help from IRC and another question in stackoverflow, Queue remote calls to a Python Twisted perspective broker?
In summary, the agent's behavior is very different from that in Nodejs(I have some experience in Nodejs). As it described on Nodejs doc
agent.requests
An object which contains queues of requests that have not yet been assigned to sockets.
agent.maxSockets
By default set to 5. Determines how many concurrent sockets the agent can have open per origin. Origin is either a 'host:port' or 'host:port:localAddress' combination.
So, here is the difference.
Twisted:
There is no doubt that Agent could queue requests if construct with a HTTPConnectionPool instance.
But if a new request is issued after connections in pool has run out, the agent will still create a new connection and perform the request, rather than put it in a queue.
Actually, it will lead to drop a connection in the pool, and push the newly generated connection into the pool, keep the connections count still equal to maxPersistentPerHost
Nodejs:
By default, agent will queue the requests with a implicit connection pool, which have a size of 5 connections.
If a new request is issued after connections in pool has run out, the agent will queue the requests into agent.requests variable, waiting for available connection.
The agent's behavior in twisted lead to a result that the agent is able to queue the requests, but actually it doesn't.
Follow our intuition, once assign a connection pool to an agent, it is in line with the intuition that agent will only use the connections in the pool, and wait for available connection if the pool has run out. That is exactly match with the agent in Nodejs.
Personally, I think it is a buggy behavior in twisted, or at least, could make an improvement to provide an option to set agent's behavior.
According to this, I have to use DeferredSemaphore to manually schedule the requests.
I raise a issue to treq project on github, and get similar solution. https://github.com/dreid/treq/issues/71
Here is my solution.
#!/usr/bin/env python
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
from twisted.internet.defer import DeferredSemaphore
baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500
count = end - start
concurrency = 10
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = concurrency
agent = Agent(reactor, pool=pool)
sem = DeferredSemaphore(concurrency)
done = 0
def onHeader(response, i):
deferred = readBody(response)
deferred.addCallback(onBody, i)
deferred.addErrback(errorHandler, i)
return deferred
def onBody(body, i):
sem.release()
global done, count
done += 1
print('Received %s, Length %s, Done %s' % (i, len(body), done))
if(done == count):
print('All items fetched')
reactor.stop()
def errorHandler(err, i):
print('[%s] id %s: %s' % (reactor.seconds() - startTimeStamp, i, err))
def requestFactory(token, i):
deferred = agent.request('GET', baseUrl + str(i))
deferred.addCallback(onHeader, i)
deferred.addErrback(errorHandler, i)
print('Request send %s' % i)
#this function it self is a callback emit by reactor, so needn't iterate manually
#reactor.iterate(1)
return deferred
def assign():
for i in range (start, end):
sem.acquire().addCallback(requestFactory, i)
startTimeStamp = reactor.seconds()
reactor.callWhenRunning(assign)
reactor.run()
Is it right? Beg for pointing out my error and improvements.
For a few requests, like 100, it works fine. But for massive requests,
it will failed.
This is either a protection against web crawlers or a server protection against DoS/DDoS, because you are sending too much requests from the same IP in a short time, so the Firewall or the WSA will block your future request. Just modify your script to make request in batch spaced by some time. you can use callLater() with some time after each X request.

How to avoid Hitting the 10 sec limit per user

We run multiple short queries in parallel, and hit the 10 sec limit.
According to the docs, throttling might occur if we hit a limit of 10 API requests per user per project.
We send a "start query job", and then we call the "getGueryResutls()" with timeoutMs of 60,000, however, we get a response after ~ 1 sec, we look for JOB Complete in the JSON response, and since it is not there, we need to send the GetQueryResults() again many times and hit the threshold, that is causing an error, not a slowdown. the sample code is below.
our questions are as such:
1. What is a "user" is it an appengine user, is it a user-id that we can put in the connection string or in the query itslef?
2. Is it really per API project of BigQuery?
3. What is the behavior?we got an error: "Exceeded rate limits: too many user/method api request limit for this user_method", and not a throttling behavior as the doc say and all of our process fails.
4. As seen below in the code, why we get the response after 1 sec & not according to our timeout? are we doing something wrong?
Thanks a lot
Here is the a sample code:
while (res is None or 'jobComplete' not in res or not res['jobComplete']) :
try:
res = self.service.jobs().getQueryResults(projectId=self.project_id,
jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
except HTTPException:
if independent:
raise
Are you saying that even though you specify timeoutMs=60000, it is returning within 1 second but the job is not yet complete? If so, this is a bug.
The quota limits for getQueryResults are actually currently much higher than 10 requests per second. The reason the docs say only 10 is because we want to have the ability to throttle it down to that amount if someone is hitting us too hard. If you're currently seeing an error on this API, it is likely that you're calling it at a very high rate.
I'll try to reproduce the problem where we don't wait for the timeout ... if that is really what is happening it may be the root of your problems.
def query_results_long(self, jobId, maxResults, res=None):
start_time = query_time = None
while res is None or 'jobComplete' not in res or not res['jobComplete']:
if start_time:
logging.info('requested for query results ended after %s', query_time)
time.sleep(2)
start_time = datetime.now()
res = self.service.jobs().getQueryResults(projectId=self.project_id,
jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
query_time = datetime.now() - start_time
return res
then in appengine log I had this:
requested for query results ended after 0:00:04.959110

GWT-RPC, Apache, Tomcat server data size checking

Following up on this GWT-RPC question (and answer #1) re. field size checking, I would like to know the right way to check pre-deserialization for max data size sent to server, something like if request data size > X then abort the request. Valuing simplicity and based on answer on aforementioned question/answer, I am inclined to believe checking for max overall request size would suffice, finer grained checks (i.e., field level checks) could be deferred to post-deserialization, but I am open to any best-practice suggestion.
Tech stack of interest: GWT-RPC client-server communication with Apache-Tomcat front-end web-server.
I suppose a first step would be to globally limit the size of any request (LimitRequestBody in httpd.conf or/and others?).
Are there finer-grained checks like something that can be set per RPC request? If so where, how? How much security value do finer grain checks bring over one global setting?
To frame the question more specifically with an example, let's suppose we have the two following RPC request signatures on the same servlet:
public void rpc1(A a, B b) throws MyException;
public void rpc2(C c, D d) throws MyException;
Suppose I approximately know the following max sizes:
a: 10 kB
b: 40 kB
c: 1 M B
d: 1 kB
Then I expect the following max sizes:
rpc1: 50 kB
rpc2: 1 MB
In the context of this example, my questions are:
Where/how to configure the max size of any request -- i.e., 1 MB in my above example? I believe it is LimitRequestBody in httpd.conf but not 100% sure whether it is the only parameter for this purpose.
If possible, where/how to configure max size per servlet -- i.e., max size of any rpc in my servlet is 1 MB?
If possible, where/how to configure/check max size per rpc request -- i.e., max rpc1 size is 50 kB and max rpc2 size is 1 MB?
If possible, where/how to configure/check max size per rpc request argument -- i.e., a is 10 kB, b is 40 kB, c is 1 MB, and d is 1 kB. I suspect it makes practical sense to do post-deserialization, doesn't it?
For practical purposes based of cost/benefit, what level of pre-deserialization checking is generally recommended -- 1. global, 2. servlet, 3. rpc, 4. object-argument? Stated differently, what is roughly the cost-complexity on one hand and the added value on the other hand of each of the above pre-deserialization level checks?
Thanks much in advance.
Based on what I have learned since I asked the question, my own answer and strategy until someone can show me better is:
First line of defense and check is Apache's LimitRequestBody set in httpd.conf. It is the overall max for all rpc calls across all servlets.
Second line of defense is servlet pre-deserialization by overriding GWT AbstractRemoteServiceServlet.readContent. For instance, one could do it as shown further below I suppose. This was the heart of what I was fishing for in this question.
Then one can further check each rpc call argument post-deserialization. One could conveniently use the JSR 303 validation both on the server and client side -- see references StackOverflow and gwt r.e. client side.
Example on how to override AbstractRemoteServiceServlet.readContent:
#Override
protected String readContent(HttpServletRequest request) throws ServletException, IOException
{
final int contentLength = request.getContentLength();
// _maxRequestSize should be large enough to be applicable to all rpc calls within this servlet.
if (contentLength > _maxRequestSize)
throw new IOException("Request too large");
final String requestPayload = super.readContent(request);
return requestPayload;
}
See this question in case the max request size if > 2GB.
From a security perspective, this strategy seems quite reasonable to me to control the size of data users send to server.