Twisted deferreds block when URI is the same (multiple calls from the same browser) - twisted

I have the following code
# -*- coding: utf-8 -*-
# 好
##########################################
import time
from twisted.internet import reactor, threads
from twisted.web.server import Site, NOT_DONE_YET
from twisted.web.resource import Resource
##########################################
class Website(Resource):
    def getChild(self, name, request):
        return self

    def render(self, request):
        if request.path == "/sleep":
            duration = 3
            if 'duration' in request.args:
                duration = int(request.args['duration'][0])
            message = 'no message'
            if 'message' in request.args:
                message = request.args['message'][0]
            #-------------------------------------
            def deferred_activity():
                print 'starting to wait', message
                time.sleep(duration)
                request.setHeader('Content-Type', 'text/plain; charset=UTF-8')
                request.write(message)
                print 'finished', message
                request.finish()
            #-------------------------------------
            def responseFailed(err, deferred):
                print err.getErrorMessage()
                deferred.cancel()
            #-------------------------------------
            def deferredFailed(err, deferred):
                pass  # print err.getErrorMessage()
            #-------------------------------------
            deferred = threads.deferToThread(deferred_activity)
            deferred.addErrback(deferredFailed, deferred)  # will get called indirectly by responseFailed
            request.notifyFinish().addErrback(responseFailed, deferred)  # to handle client disconnects
            #-------------------------------------
            return NOT_DONE_YET
        else:
            return 'nothing at %s' % (request.path,)
##########################################
reactor.listenTCP(321, Site(Website()))
print 'starting to serve'
reactor.run()
##########################################
# http://localhost:321/sleep?duration=3&message=test1
# http://localhost:321/sleep?duration=3&message=test2
##########################################
My issue is the following:
When I open two tabs in the browser, point one at http://localhost:321/sleep?duration=3&message=test1 and the other at http://localhost:321/sleep?duration=3&message=test2 (the messages differ), and reload the first tab and then the second one as quickly as possible, they finish almost at the same time: the first tab about 3 seconds after hitting F5, the second about half a second after the first.
This is expected, as each request got deferred into a thread, and they are sleeping in parallel.
But when I now change the URL of the second tab to be the same as the one of the first tab, that is to http://localhost:321/sleep?duration=3&message=test1, then all this becomes blocking. If I press F5 on the first tab and as quickly as possible F5 on the second one, the second tab finishes about 3 seconds after the first one. They don't get executed in parallel.
As long as the entire URI is the same in both tabs, the server starts to block. This is the same in Firefox as well as in Chrome. But when I load one tab in Chrome and the other in Firefox at the same time, it is non-blocking again.
So it may not necessarily be related to Twisted, but perhaps to some kind of connection reuse or something like that.
Does anyone know what is happening here and how I can solve this issue?

Coincidentally, someone asked a related question over at the Tornado section. As you suspected, this is not an "issue" in Twisted but rather a "feature" of web browsers :). Tornado's FAQ page has a small section dedicated to this issue. The proposed solution is appending an arbitrary query string.
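For example, making the two URLs distinct with a throwaway query parameter (the name nocache here is arbitrary) is enough to stop the browser from serializing the requests:
http://localhost:321/sleep?duration=3&message=test1&nocache=1
http://localhost:321/sleep?duration=3&message=test1&nocache=2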
Quote of the day:
One dev's bug is another dev's undocumented feature!

Related

GtkTreeView stops updating unless I change the focus of the window

I have a GtkTreeView object that uses a GtkListStore model that is constantly being updated as follows:
Get new transaction
Feed data into numpy array
Convert numbers to formatted strings, store in pandas dataframe
Add updated token info to the GtkListStore via GtkListStore.set(titer, liststore_cols, liststore_data), where liststore_data is the updated info and liststore_cols holds the column positions (both are lists).
Here's the function that updates the ListStore:
# update ListStore
titer = ls_full.get_iter(row)
liststore_data = [df.at[row, col] for col in my_vars['ls_full'][3:]]
# check for NaN values (NaN != NaN), substituting a " " placeholder if necessary
for i in range(3, len(liststore_data)):
    if liststore_data[i] != liststore_data[i]:
        liststore_data[i] = " "
liststore_cols = [my_vars['ls_full'].index(col) + 1
                  for col in my_vars['ls_full'][3:]]
ls_full.set(titer, liststore_cols, liststore_data)
Class that gets the messages from the websocket:
class MyWebsocketClient(cbpro.WebsocketClient):
    # class extensions to WebsocketClient
    def on_open(self):
        # sets up ticker symbol, subscriptions for socket feed
        self.url = "wss://ws-feed.pro.coinbase.com/"
        self.channels = ['ticker']
        self.products = list(cbp_symbols.keys())

    def on_message(self, msg):
        # gets latest message from socket, sends off to be processed
        if "best_ask" in msg and "time" in msg:
            # checks to see if token price has changed before updating
            update_needed = parse_data(msg)
            if update_needed:
                update_ListStore(msg)
        else:
            print(f'Bad message: {msg}')
When the program first starts, the updates are consistent: each time a new transaction comes in, the screen reflects it, updating the proper token. However, after a random amount of time (I've seen anywhere from 5 minutes to over an hour), the screen stops updating unless I change the focus of the window (either activating or deactivating it). Even then it only lasts long enough to update the screen once. No other errors are reported, and memory usage is not spiking (constant at 140 MB).
How can I troubleshoot this? I'm not even sure where to begin. The data back-ends seem to be OK (the data is never corrupted and never lags behind).
Since you've said in the comments that the websocket client runs in a separate thread, I'd suggest wrapping your "update liststore" function in GLib.idle_add:
from gi.repository import GLib
GLib.idle_add(update_liststore)
I've had similar issues in the past and this fixed things. Sometimes updating the liststore is fine; sometimes it will randomly spew errors.
Basically, only one thread should update the GUI at a time. By wrapping the call in GLib.idle_add() you make sure your background thread does not interfere with the main thread updating the GUI.
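As a minimal sketch of how that fits your code (reusing parse_data and update_ListStore from the question), the websocket callback would hand the update off to the main loop instead of calling it directly:
from gi.repository import GLib

def on_message(self, msg):
    # runs on the websocket thread; never touch GTK widgets directly here
    if "best_ask" in msg and "time" in msg:
        if parse_data(msg):
            # idle_add schedules update_ListStore(msg) on the GTK main loop;
            # extra arguments after the callback are passed through to it
            GLib.idle_add(update_ListStore, msg)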

Two consecutive yields, only the first work

I have this piece of code that only executes the first yield's callback and not the next one. I have tried reordering them and it gives the same result:
Only the first yield callback gets executed.
for j in range(totalOrderPages):  # the code gets in the loop
    productURI = feedUrl % (productId, j + 1)
    print "Got in the loop"  # this gets printed
    yield response.follow(productURI, self.parse_orders,
                          meta={'pid': productId, 'categories': categories})
yield response.follow(first_page, self.parse_product,
                      meta={'pid': productId, 'categories': categories})
Is there anything in Python or Scrapy that prevents two consecutive yields?
Second question:
I'm trying to debug this using pdb.set_trace(), but when I try to execute yield from the debugging console, it gives a "yield outside function" error.
Does anyone know how we can debug yields?
Thank you.
Without knowing more details, like the redirection behaviour of the specific site or the contents of the variables (feedUrl, productURI, first_page, etc.), I would say that some requests are being dropped by the Dupefilter (https://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class).
I'd recommend enabling the DEBUG logging level, setting DUPEFILTER_DEBUG=True, and checking the logs to see if that's the case.
You can force requests to bypass the Dupefilter by adding dont_filter=True when calling response.follow.
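For example, a sketch using the names from your loop:
yield response.follow(productURI, self.parse_orders,
                      meta={'pid': productId, 'categories': categories},
                      dont_filter=True)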
If this doesn't solve your issue, please share your crawl logs so we can have more information to debug the issue. Happy scraping!

How to make pooling HTTP connection with twisted?

I wrote a very simple spider program to fetch web pages from a single site.
Here is a minimized version.
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody

baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)

def onHeader(response, i):
    deferred = readBody(response)
    deferred.addCallback(onBody, i)
    deferred.addErrback(errorHandler)
    return response

def onBody(body, i):
    print('Received %s, Length %s' % (i, len(body)))

def errorHandler(err):
    print('%s : %s' % (reactor.seconds() - startTimeStamp, err))

def requestFactory():
    for i in range(start, end):
        deferred = agent.request('GET', baseUrl + str(i))
        deferred.addCallback(onHeader, i)
        deferred.addErrback(errorHandler)
        print('Generated %s' % i)
        reactor.iterate(1)
    print('All requests have been generated, elapsed %s' % (reactor.seconds() - startTimeStamp))

startTimeStamp = reactor.seconds()
reactor.callWhenRunning(requestFactory)
reactor.run()
For a few requests, like 100, it works fine. But with a massive number of requests, it fails.
I expected all of the requests (around 3000) to be automatically pooled, scheduled and pipelined, since I use HTTPConnectionPool, set maxPersistentPerHost, create an Agent instance with the pool, and create the connections incrementally.
But it doesn't work that way; the connections are neither kept alive nor pooled.
The program does establish the connections incrementally, but they aren't pooled: each connection closes after the body is received, and later requests never wait in the pool for an available connection.
So it ends up using thousands of sockets, and finally fails with timeouts, because the remote server has a connection timeout of 30s. Thousands of requests can't be completed within 30s.
Could you please give me some help on this?
I have tried my best on this; here are my findings.
The error occurs exactly 30s after the reactor starts running, and isn't influenced by anything else.
When I let the spider fetch my own server, I found something interesting:
The HTTP protocol version is 1.1 (I checked the Twisted documentation; the default HTTPClient is 1.0 rather than 1.1).
If I didn't add any explicit headers (as in the minimized version), the request headers didn't contain Connection: Keep-Alive, and neither did the response headers.
If I added explicit headers to ensure a keep-alive connection, the request headers did contain Connection: Keep-Alive, but the response headers still didn't. (I am sure my server behaves correctly; other clients like Chrome and wget do receive the Connection: Keep-Alive header.)
I watched /proc/net/sockstat while the program ran: the socket count increases rapidly at first, and decreases rapidly later. (I had increased the ulimit to support plenty of sockets.)
I wrote a similar program with treq (a Twisted-based request library). The code is almost the same, so I don't paste it here.
Link: https://gist.github.com/Preffer/dad9b1228fcd75cebd75
Its behavior is almost the same: no pooling, even though pooling is described in treq's feature list.
If I added explicit headers with it, Connection: Keep-Alive never appeared in the response headers.
Based on all of the above, I strongly suspect that the quirky Connection: Keep-Alive header is ruining the program. But this header is part of the HTTP 1.1 standard, and the requests do report as HTTP 1.1. I am completely puzzled by this.
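For reference, "adding explicit headers" above means something like the following sketch; Headers is Twisted's header class, and its values are lists:
from twisted.web.http_headers import Headers

deferred = agent.request('GET', baseUrl + str(i),
                         Headers({'Connection': ['Keep-Alive']}))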
I solved the problem myself, with help from IRC and another question on Stack Overflow, Queue remote calls to a Python Twisted perspective broker?
In summary, the agent's behavior is very different from that of the agent in Node.js (I have some experience with Node.js). As described in the Node.js docs:
agent.requests
An object which contains queues of requests that have not yet been assigned to sockets.
agent.maxSockets
By default set to 5. Determines how many concurrent sockets the agent can have open per origin. Origin is either a 'host:port' or 'host:port:localAddress' combination.
So, here is the difference.
Twisted:
There is no doubt that Agent can queue requests if it is constructed with an HTTPConnectionPool instance.
But if a new request is issued after the connections in the pool have run out, the agent will still create a new connection and perform the request, rather than putting it in a queue.
Actually, it will drop a connection from the pool and push the newly created connection into the pool, keeping the connection count equal to maxPersistentPerHost.
Node.js:
By default, the agent queues requests with an implicit connection pool, which has a size of 5 connections.
If a new request is issued after the connections in the pool have run out, the agent queues it in the agent.requests variable, waiting for an available connection.
The agent's behavior in Twisted leads to a situation where the agent looks able to queue requests, but actually doesn't.
Intuitively, once a connection pool is assigned to an agent, one expects the agent to use only the connections in the pool, and to wait for an available connection when the pool has run out. That is exactly how the agent in Node.js behaves.
Personally, I think this is buggy behavior in Twisted, or at least an improvement could be made by providing an option to configure the agent's behavior.
Because of this, I have to use DeferredSemaphore to manually schedule the requests.
I raised an issue with the treq project on GitHub and got a similar solution: https://github.com/dreid/treq/issues/71
Here is my solution.
#!/usr/bin/env python
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
from twisted.internet.defer import DeferredSemaphore
baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500
count = end - start
concurrency = 10
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = concurrency
agent = Agent(reactor, pool=pool)
sem = DeferredSemaphore(concurrency)
done = 0
def onHeader(response, i):
deferred = readBody(response)
deferred.addCallback(onBody, i)
deferred.addErrback(errorHandler, i)
return deferred
def onBody(body, i):
sem.release()
global done, count
done += 1
print('Received %s, Length %s, Done %s' % (i, len(body), done))
if(done == count):
print('All items fetched')
reactor.stop()
def errorHandler(err, i):
print('[%s] id %s: %s' % (reactor.seconds() - startTimeStamp, i, err))
def requestFactory(token, i):
deferred = agent.request('GET', baseUrl + str(i))
deferred.addCallback(onHeader, i)
deferred.addErrback(errorHandler, i)
print('Request send %s' % i)
#this function it self is a callback emit by reactor, so needn't iterate manually
#reactor.iterate(1)
return deferred
def assign():
for i in range (start, end):
sem.acquire().addCallback(requestFactory, i)
startTimeStamp = reactor.seconds()
reactor.callWhenRunning(assign)
reactor.run()
Is this right? Please point out any errors and suggest improvements.
For a few requests, like 100, it works fine. But with a massive number of
requests, it fails.
This is either a protection against web crawlers or a server protection against DoS/DDoS: you are sending too many requests from the same IP in a short time, so the firewall or the WSA blocks your future requests. Just modify your script to make requests in batches spaced out over time. You can use callLater() to add a delay after every X requests.
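A minimal sketch of that idea, reusing the names from the question's script (the batch size and delay here are arbitrary assumptions):
BATCH_SIZE = 50
BATCH_DELAY = 5  # seconds between batches

def sendBatch(i):
    # issue one batch of requests, then schedule the next batch for later
    for j in range(i, min(i + BATCH_SIZE, end)):
        deferred = agent.request('GET', baseUrl + str(j))
        deferred.addCallback(onHeader, j)
        deferred.addErrback(errorHandler)
    if i + BATCH_SIZE < end:
        reactor.callLater(BATCH_DELAY, sendBatch, i + BATCH_SIZE)

reactor.callWhenRunning(sendBatch, start)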

Twisted IRC Bot connection lost repeatedly to localhost

I am trying to implement an IRC bot on a local server. The bot that I am using is identical to the one found at Eric Florenzano's blog. This is the simplified code (which should run):
import sys
import re
from twisted.internet import reactor
from twisted.words.protocols import irc
from twisted.internet import protocol

class MomBot(irc.IRCClient):
    def _get_nickname(self):
        return self.factory.nickname
    nickname = property(_get_nickname)

    def signedOn(self):
        print "attempting to sign on"
        self.join(self.factory.channel)
        print "Signed on as %s." % (self.nickname,)

    def joined(self, channel):
        print "attempting to join"
        print "Joined %s." % (channel,)

    def privmsg(self, user, channel, msg):
        if not user:
            return
        if self.nickname in msg:
            msg = re.compile(self.nickname + "[:,]* ?", re.I).sub('', msg)
            prefix = "%s: " % (user.split('!', 1)[0],)
        else:
            prefix = ''
        self.msg(self.factory.channel, prefix + "hello there")

class MomBotFactory(protocol.ClientFactory):
    protocol = MomBot

    def __init__(self, channel, nickname='YourMomDotCom', chain_length=3,
                 chattiness=1.0, max_words=10000):
        self.channel = channel
        self.nickname = nickname
        self.chain_length = chain_length
        self.chattiness = chattiness
        self.max_words = max_words

    def startedConnecting(self, connector):
        print "started connecting on {0}:{1}".format(
            str(connector.host), str(connector.port))

    def clientConnectionLost(self, connector, reason):
        print "Lost connection (%s), reconnecting." % (reason,)
        connector.connect()

    def clientConnectionFailed(self, connector, reason):
        print "Could not connect: %s" % (reason,)

if __name__ == "__main__":
    chan = sys.argv[1]
    reactor.connectTCP("localhost", 6667, MomBotFactory('#' + chan,
        'YourMomDotCom', 2, chattiness=0.05))
    reactor.run()
I added the startedConnecting method to the client factory; it is reached and prints the proper host:port. The bot then disconnects, enters clientConnectionLost, and prints the error:
Lost connection ([Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
]), reconnecting.
If working properly, it should log into the appropriate channel, specified as the first argument on the command line (e.g. python module2.py botwar would join channel #botwar). It should respond with "hello there" whenever anyone in the channel sends anything.
I have NGIRC running on the server, and it works if I connect from mIRC or any other IRC client.
I am unable to figure out why it keeps disconnecting. Any help would be greatly appreciated. Thank you!
One thing you may want to do is make sure you will see any error output produced by the server when your bot connects to it. My hunch is that the problem has something to do with authentication, or perhaps an unexpected difference in how ngirc handles one of the login/authentication commands used by IRCClient.
One approach that almost always applies is to capture a traffic log. Use a tool like tcpdump or wireshark.
Another approach you can try is to enable logging inside the Twisted application itself. Use twisted.protocols.policies.TrafficLoggingFactory for this:
from twisted.protocols.policies import TrafficLoggingFactory
appFactory = MomBotFactory(...)
logFactory = TrafficLoggingFactory(appFactory, "irc-")
reactor.connectTCP(..., logFactory)
This will log output to files starting with "irc-" (a different file for each connection).
You can also hook directly into your protocol implementation, at any one of several levels. For example, to display any bytes received at all:
class MomBot(irc.IRCClient):
    def dataReceived(self, bytes):
        print "Got", repr(bytes)
        # Make sure to up-call - otherwise all of the IRC logic is disabled!
        return irc.IRCClient.dataReceived(self, bytes)
With one of those approaches in place, hopefully you'll see something like:
:irc.example.net 451 * :Connection not registered
which I think means... you need to authenticate? Even if you see something else, hopefully this will help you narrow in more closely on the precise cause of the connection being closed.
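If registration does turn out to be the problem, note that IRCClient has a password attribute that, when set, is sent as a PASS command during login; a minimal sketch (the value is a placeholder):
class MomBot(irc.IRCClient):
    password = "your-server-password"  # sent as PASS before NICK/USER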
Also, you can use tcpdump or wireshark to capture the traffic between ngirc and one of the working IRC clients (e.g. mIRC) and then compare the two logs. Whatever commands mIRC sends differently should make it clear what changes you need to make to your bot.

apache on mod_wsgi Daemon Mode not yielding small string

While testing a few things using Django + Apache2 + mod_wsgi 3.3, I found two different results for periodic yielding of results between embedded and daemon mode.
In embedded mode, i.e. with no WSGIDaemonProcess or WSGIProcessGroup directives used, the function below yields its results one after the other, with every digit printed in the browser view after 2 seconds of sleep.
import time

from django.http import HttpResponse

def yielder(request):
    gen = testYielding()
    return HttpResponse(gen)

def testYielding():
    yield "3"
    time.sleep(2)
    yield "4"
    time.sleep(2)
    yield "5"
    time.sleep(2)
    yield "6"
    time.sleep(2)
    yield "7"
With daemon mode on, though, this view returns the data only after collating the complete response, about 8 seconds later, with all the digits printed together rather than yielded one after the other.
Is this behavior correct? And is there a way to make daemon mode yield responses like embedded mode does?
The flush that occurs in the daemon process isn't transferred across to the Apache child worker process that does the proxying. Whether the output is passed back to the client immediately therefore depends in part on which Apache output filters you have registered. If you have output filters that try to buffer up response data before flushing, you will see this issue.
You should therefore look closely at which Apache output filters are in place. If you can't change these, you have no choice but to use embedded mode.
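On the Django side, it can also help to make the streaming intent explicit. A minimal sketch, assuming a Django version that provides StreamingHttpResponse (this does not bypass Apache's output filters, but it keeps Django itself from buffering the generator):
from django.http import StreamingHttpResponse

def yielder(request):
    # stream the generator instead of wrapping it in a buffered HttpResponse
    return StreamingHttpResponse(testYielding(), content_type="text/plain")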