I have a piece of code like this
hanzi = "一丁七万丈三上下不与"
try:
all_participants = await client.get_participants(my_channel, None,search=hanzi, aggressive=True)
except errors.FloodWaitError as e:
print('Have to sleep', e.seconds, 'seconds')
print("")
exit
# except errors.common.MultiError:
# print("")
# exit
When searching for "hanzi" with too much text, he will prompt a flood error. What I can think of is to split with an array and loop the request, but that doesn't solve the actual problem. Is there a way to make each request delay a few seconds before proceeding.
Related
I have this piece of code that only executes the first yield's callback and not the next one. I have tried reordering them and it gives the same result:
Only the first yield callback gets executed.
for j in range(totalOrderPages): # the code gets in the loop
productURI = feedUrl % (productId, j + 1)
print "Got in the loop" # this gets printed
yield response.follow(productURI, self.parse_orders, meta={'pid': productId, 'categories': categories})
yield response.follow(first_page, self.parse_product, meta={'pid': productId, 'categories': categories})
Is there anything in Python or scrapy that prevents 2 consecutive yields?
Second question:
I'm trying to debug this using pdb.set_trace() but when I try to execute yield from the debugging console, it give the yield outside function error.
Does anyone know how can we debug yields?
Thank you.
Without knowing more details, like the redirection behaviour of the specific site or the contents of the variables (feedUrl, productURI, first_page, etc), I would say that some requests are being dropped by the Dupefilter (https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class).
I'd recommend you to enable the DEBUG logging level and setting DUPEFILTER_DEBUG=True, and check the logs to see if that's the case.
You can force requests to bypass the Dupefilter by adding dont_filter=True when calling response.follow.
If this doesn't solve your issue, please share your crawl logs so we can have more information to debug the issue. Happy scraping!
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
Improve this question
Sometimes the Google Maps API returns a 500 server error response according to German postal codes and i cannot understand why.
I hope it is specific enough.
Any ideas?
https://maps.googleapis.com/maps/api/geocode/json?key={api_key}&address={postal_code}&language=de®ion=de&components=country:DE&sensor=false
Since you specify that the problem is not a given address but a seemingly "random" behavior, this may fall under a documented behavior of other "famous" API.
As for other cases, the recommended strategy is Exponential backoff for the Geocoding API, which basically means that you have to retry after a certain delay.
In case the above link goes down or changes, I'm quoting the article:
Exponential Backoff
In rare cases something may go wrong serving your request; you may receive a 4XX or 5XX HTTP response code, or the TCP connection may simply fail somewhere between your client and Google's server. Often it is worthwhile re-trying the request as the followup request may succeed when the original failed. However, it is important not to simply loop repeatedly making requests to Google's servers. This looping behavior can overload the network between your client and Google causing problems for many parties.
A better approach is to retry with increasing delays between attempts. Usually the delay is increased by a multiplicative factor with each attempt, an approach known as Exponential Backoff.
For example, consider an application that wishes to make this request to the Google Maps Time Zone API:
https://maps.googleapis.com/maps/api/timezone/json?location=39.6034810,-119.6822510×tamp=1331161200&key=YOUR_API_KEY
The following Python example shows how to make the request with exponential backoff:
import json
import time
import urllib
import urllib2
def timezone(lat, lng, timestamp):
# The maps_key defined below isn't a valid Google Maps API key.
# You need to get your own API key.
# See https://developers.google.com/maps/documentation/timezone/get-api-key
maps_key = 'YOUR_KEY_HERE'
timezone_base_url = 'https://maps.googleapis.com/maps/api/timezone/json'
# This joins the parts of the URL together into one string.
url = timezone_base_url + '?' + urllib.urlencode({
'location': "%s,%s" % (lat, lng),
'timestamp': timestamp,
'key': maps_key,
})
current_delay = 0.1 # Set the initial retry delay to 100ms.
max_delay = 3600 # Set the maximum retry delay to 1 hour.
while True:
try:
# Get the API response.
response = str(urllib2.urlopen(url).read())
except IOError:
pass # Fall through to the retry loop.
else:
# If we didn't get an IOError then parse the result.
result = json.loads(response.replace('\\n', ''))
if result['status'] == 'OK':
return result['timeZoneId']
elif result['status'] != 'UNKNOWN_ERROR':
# Many API errors cannot be fixed by a retry, e.g. INVALID_REQUEST or
# ZERO_RESULTS. There is no point retrying these requests.
raise Exception(result['error_message'])
if current_delay > max_delay:
raise Exception('Too many retry attempts.')
print 'Waiting', current_delay, 'seconds before retrying.'
time.sleep(current_delay)
current_delay *= 2 # Increase the delay each time we retry.
tz = timezone(39.6034810, -119.6822510, 1331161200)
print 'Timezone:', tz
Of course this will not resolve the "false responses" you mention; I suspect that depends on data quality and does not happen randomly.
I have the following code
# -*- coding: utf-8 -*-
# 好
##########################################
import time
from twisted.internet import reactor, threads
from twisted.web.server import Site, NOT_DONE_YET
from twisted.web.resource import Resource
##########################################
class Website(Resource):
def getChild(self, name, request):
return self
def render(self, request):
if request.path == "/sleep":
duration = 3
if 'duration' in request.args:
duration = int(request.args['duration'][0])
message = 'no message'
if 'message' in request.args:
message = request.args['message'][0]
#-------------------------------------
def deferred_activity():
print 'starting to wait', message
time.sleep(duration)
request.setHeader('Content-Type', 'text/plain; charset=UTF-8')
request.write(message)
print 'finished', message
request.finish()
#-------------------------------------
def responseFailed(err, deferred):
pass; print err.getErrorMessage()
deferred.cancel()
#-------------------------------------
def deferredFailed(err, deferred):
pass; # print err.getErrorMessage()
#-------------------------------------
deferred = threads.deferToThread(deferred_activity)
deferred.addErrback(deferredFailed, deferred) # will get called indirectly by responseFailed
request.notifyFinish().addErrback(responseFailed, deferred) # to handle client disconnects
#-------------------------------------
return NOT_DONE_YET
else:
return 'nothing at', request.path
##########################################
reactor.listenTCP(321, Site(Website()))
print 'starting to serve'
reactor.run()
##########################################
# http://localhost:321/sleep?duration=3&message=test1
# http://localhost:321/sleep?duration=3&message=test2
##########################################
My issue is the following:
When I open two tabs in the browser, point one at http://localhost:321/sleep?duration=3&message=test1 and the other at http://localhost:321/sleep?duration=3&message=test2 (the messages differ) and reload the first tab and then ASAP the second one, then the finish almost at the same time. The first tab about 3 seconds after hitting F5, the second tab finishes about half a second after the first tab.
This is expected, as each request got deferred into a thread, and they are sleeping in parallel.
But when I now change the URL of the second tab to be the same as the one of the first tab, that is to http://localhost:321/sleep?duration=3&message=test1, then all this becomes blocking. If I press F5 on the first tab and as quickly as possible F5 on the second one, the second tab finishes about 3 seconds after the first one. They don't get executed in parallel.
As long as the entire URI is the same in both tabs, this server starts to block. This is the same in Firefox as well as in Chrome. But when I start one in Chrome and another one in Firefox at the same time, then it is non-blocking again.
So it may not neccessarily be related to Twisted, but maybe because of some connection reusage or something like that.
Anyone knows what is happening here and how I can solve this issue?
Coincidentally, someone asked a related question over at the Tornado section. As you suspected, this is not an "issue" in Twisted but rather a "feature" of web browsers :). Tornado's FAQ page has a small section dedicated to this issue. The proposed solution is appending an arbitrary query string.
Quote of the day:
One dev's bug is another dev's undocumented feature!
I'm using this logic in order to catch BigQuery jobs
is success or not but sometimes i'm successful job id even due the job run
query but did't insert rows.
it's happen mainly with queries to table.
i used the code that i saw in google document and little bit add some logs for me.
it will be great if someone can tell me what i'm doing wrong.
def _wait_for_response(self, bq_api, insert_response, max_wait_time=3600):
"""get bigQuery job status. wait for DONE and check for errors.
if errors exist - raise an exception"""
start_time = time.time()
logstr.info(current_module='bq_session',
current_func='_wait_for_response')
# sleep interval between retries
# first, try 8 times every 1 second, then double sleep time until
# 30 seconds (and stay on 30 until max_wait_time is reached)
sleep = itertools.chain(itertools.repeat(1, 8), xrange(2, 30, 3),
itertools.repeat(30))
while time.time() - start_time < max_wait_time:
try:
job = bq_api.jobs().get(
projectId=insert_response['jobReference']['projectId'],
jobId=insert_response['jobReference']['jobId']).execute()
# on job end
if job['status']['state'] == 'DONE':
# if job failed raise error(s)
if 'errors' in job['status'].keys() and\
job['status']['errors']:
raise Exception(','.join(
[err['message']
for err in job['status']['errors']]))
else:
return job
except apiclient.errors.HttpError, error:
status = int(error.resp.get('status', 0))
if status >= 500:
pass
# raise Exception(
# global_messages.BQ_SERVER_ERROR.format(err=error))
elif status == 404:
raise Exception(
global_messages.BQ_JOB_NOT_FOUND.format(
jobid=insert_response['jobReference']['jobId']))
else:
raise Exception(
global_messages.BQ_ERROR_GETTING_JOB_STATUS.format(
err=error))
time.sleep(sleep.next())
raise Exception(global_messages.BQ_TIMEOUT.format(
time=max_wait_time,
jobid=insert_response['jobReference']['jobId']))
These lines of your script cause control to fall through if the status is greater than 500.
if status >= 500:
pass
# raise Exception(
# global_messages.BQ_SERVER_ERROR.format(err=error))
That could be preventing you from seeing the expected exception.
I am trying to implement an IRC Bot on a local server. The bot that I am using is identical to the one found at Eric Florenzano's Blog. This is the simplified code (which should run)
import sys
import re
from twisted.internet import reactor
from twisted.words.protocols import irc
from twisted.internet import protocol
class MomBot(irc.IRCClient):
def _get_nickname(self):
return self.factory.nickname
nickname = property(_get_nickname)
def signedOn(self):
print "attempting to sign on"
self.join(self.factory.channel)
print "Signed on as %s." % (self.nickname,)
def joined(self, channel):
print "attempting to join"
print "Joined %s." % (channel,)
def privmsg(self, user, channel, msg):
if not user:
return
if self.nickname in msg:
msg = re.compile(self.nickname + "[:,]* ?", re.I).sub('', msg)
prefix = "%s: " % (user.split('!', 1)[0], )
else:
prefix = ''
self.msg(self.factory.channel, prefix + "hello there")
class MomBotFactory(protocol.ClientFactory):
protocol = MomBot
def __init__(self, channel, nickname='YourMomDotCom', chain_length=3,
chattiness=1.0, max_words=10000):
self.channel = channel
self.nickname = nickname
self.chain_length = chain_length
self.chattiness = chattiness
self.max_words = max_words
def startedConnecting(self, connector):
print "started connecting on {0}:{1}"
.format(str(connector.host),str(connector.port))
def clientConnectionLost(self, connector, reason):
print "Lost connection (%s), reconnecting." % (reason,)
connector.connect()
def clientConnectionFailed(self, connector, reason):
print "Could not connect: %s" % (reason,)
if __name__ == "__main__":
chan = sys.argv[1]
reactor.connectTCP("localhost", 6667, MomBotFactory('#' + chan,
'YourMomDotCom', 2, chattiness=0.05))
reactor.run()
I added the startedConnection method in the client factory, which it is reaching and printing out the proper address:host. It then disconnects and enters the clientConnectionLost and prints the error:
Lost connection ([Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
]), reconnecting.
If working properly it should log into the appropriate channel, specified as the first arg in the command (e.g. python module2.py botwar. would be channel #botwar.). It should respond with "hello there" if any one in the channel sends anything.
I have NGIRC running on the server, and it works if I connect from mIRC or any other IRC client.
I am unable to find a resolution as to why it is continually disconnecting. Any help on why would be greatly appreciated. Thank you!
One thing you may want to do is make sure you will see any error output produced by the server when your bot connects to it. My hunch is that the problem has something to do with authentication, or perhaps an unexpected difference in how ngirc handles one of the login/authentication commands used by IRCClient.
One approach that almost always applies is to capture a traffic log. Use a tool like tcpdump or wireshark.
Another approach you can try is to enable logging inside the Twisted application itself. Use twisted.protocols.policies.TrafficLoggingFactory for this:
from twisted.protocols.policies import TrafficLoggingFactory
appFactory = MomBotFactory(...)
logFactory = TrafficLoggingFactory(appFactory, "irc-")
reactor.connectTCP(..., logFactory)
This will log output to files starting with "irc-" (a different file for each connection).
You can also hook directly into your protocol implementation, at any one of several levels. For example, to display any bytes received at all:
class MomBot(irc.IRCClient):
def dataReceived(self, bytes):
print "Got", repr(bytes)
# Make sure to up-call - otherwise all of the IRC logic is disabled!
return irc.IRCClient.dataReceived(self, bytes)
With one of those approaches in place, hopefully you'll see something like:
:irc.example.net 451 * :Connection not registered
which I think means... you need to authenticate? Even if you see something else, hopefully this will help you narrow in more closely on the precise cause of the connection being closed.
Also, you can use tcpdump or wireshark to capture the traffic log between ngirc and one of the working IRC clients (eg mIRC) and then compare the two logs. Whatever different commands mIRC is sending should make it clear what changes you need to make to your bot.