How to make pooling HTTP connection with twisted? - twisted

I wirite a very simple spider program to fetch webpages from single site.
Here is a minimized version.
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 10
agent = Agent(reactor, pool=pool)
def onHeader(response, i):
deferred = readBody(response)
deferred.addCallback(onBody, i)
deferred.addErrback(errorHandler)
return response
def onBody(body, i):
print('Received %s, Length %s' % (i, len(body)))
def errorHandler(err):
print('%s : %s' % (reactor.seconds() - startTimeStamp, err))
def requestFactory():
for i in range (start, end):
deferred = agent.request('GET', baseUrl + str(i))
deferred.addCallback(onHeader, i)
deferred.addErrback(errorHandler)
print('Generated %s' % i)
reactor.iterate(1)
print('All requests has generated, elpased %s' % (reactor.seconds() - startTimeStamp))
startTimeStamp = reactor.seconds()
reactor.callWhenRunning(requestFactory)
reactor.run()
For a few requests, like 100, it works fine. But for massive requests, it will failed.
I expect all of the requests(around 3000) should be automatically pooled, scheduled and pipelined, since I use HTTPConnectionPool, set maxPersistentPerHost, create an Agent instance with it and incrementally create the connections.
But it doesn't, the connections are not keep-alive nor pooled.
In this programm, it did establish the connections incrementally, but the connections didn't pooled, each connecction will close after body received, and later requests never wait in the pool for an available connecction.
So it will take thousands of sockets, and finally failed due to timeout, because the remote server has a connection timeout set to 30s. Thousands of requests can't be done within 30s.
Could you please give me some help on this?
I have tried my best on this, here is my finds.
Error occured exactly 30s after reactor start runing, won't be influenced by other things.
Let the spider fetch my server, I find something interesting.
The HTTP protocol version is 1.1 (I check the twisted document, the default HTTPClient is 1.0 rather than 1.1)
If I didn't add any explicit header(just like the minimized version), the request header didn't contain Connection: Keep-Alive, either do response header.
If I add explicit header to ensure it is a keep-alive connection, the request header did contain Connection: Keep-Alive, but the response header still not. (I am sure my server behave correctly, other stuff like Chrome, wget did receive Connection: Keep-Alive header.)
I check /proc/net/sockstat during running, it increase rapidly at first, and decrease rapidly later. (I have increase the ulimit to support plenty of sockets)
I write a similar program with treq, a twisted based request library). The code is almost the same, so not paste here.
Link: https://gist.github.com/Preffer/dad9b1228fcd75cebd75
It's behavior is almost the same. Not pooling. It is expected to be pooling as described in treq's feature list.
If I have add explicit header on it, Connection: Keep-Alive never appear in response header.
Based on all of the above, I am quite suspicious about the quirk Connection: Keep-Alive header ruin the program. But this header is part of HTTP 1.1 standard, and it did report as HTTP 1.1. I am completely puzzled on this.

I solved the problem myself, with help from IRC and another question in stackoverflow, Queue remote calls to a Python Twisted perspective broker?
In summary, the agent's behavior is very different from that in Nodejs(I have some experience in Nodejs). As it described on Nodejs doc
agent.requests
An object which contains queues of requests that have not yet been assigned to sockets.
agent.maxSockets
By default set to 5. Determines how many concurrent sockets the agent can have open per origin. Origin is either a 'host:port' or 'host:port:localAddress' combination.
So, here is the difference.
Twisted:
There is no doubt that Agent could queue requests if construct with a HTTPConnectionPool instance.
But if a new request is issued after connections in pool has run out, the agent will still create a new connection and perform the request, rather than put it in a queue.
Actually, it will lead to drop a connection in the pool, and push the newly generated connection into the pool, keep the connections count still equal to maxPersistentPerHost
Nodejs:
By default, agent will queue the requests with a implicit connection pool, which have a size of 5 connections.
If a new request is issued after connections in pool has run out, the agent will queue the requests into agent.requests variable, waiting for available connection.
The agent's behavior in twisted lead to a result that the agent is able to queue the requests, but actually it doesn't.
Follow our intuition, once assign a connection pool to an agent, it is in line with the intuition that agent will only use the connections in the pool, and wait for available connection if the pool has run out. That is exactly match with the agent in Nodejs.
Personally, I think it is a buggy behavior in twisted, or at least, could make an improvement to provide an option to set agent's behavior.
According to this, I have to use DeferredSemaphore to manually schedule the requests.
I raise a issue to treq project on github, and get similar solution. https://github.com/dreid/treq/issues/71
Here is my solution.
#!/usr/bin/env python
from twisted.internet import epollreactor
epollreactor.install()
from twisted.internet import reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody
from twisted.internet.defer import DeferredSemaphore
baseUrl = 'http://acm.zju.edu.cn/onlinejudge/showProblem.do?problemCode='
start = 1001
end = 3500
count = end - start
concurrency = 10
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = concurrency
agent = Agent(reactor, pool=pool)
sem = DeferredSemaphore(concurrency)
done = 0
def onHeader(response, i):
deferred = readBody(response)
deferred.addCallback(onBody, i)
deferred.addErrback(errorHandler, i)
return deferred
def onBody(body, i):
sem.release()
global done, count
done += 1
print('Received %s, Length %s, Done %s' % (i, len(body), done))
if(done == count):
print('All items fetched')
reactor.stop()
def errorHandler(err, i):
print('[%s] id %s: %s' % (reactor.seconds() - startTimeStamp, i, err))
def requestFactory(token, i):
deferred = agent.request('GET', baseUrl + str(i))
deferred.addCallback(onHeader, i)
deferred.addErrback(errorHandler, i)
print('Request send %s' % i)
#this function it self is a callback emit by reactor, so needn't iterate manually
#reactor.iterate(1)
return deferred
def assign():
for i in range (start, end):
sem.acquire().addCallback(requestFactory, i)
startTimeStamp = reactor.seconds()
reactor.callWhenRunning(assign)
reactor.run()
Is it right? Beg for pointing out my error and improvements.

For a few requests, like 100, it works fine. But for massive requests,
it will failed.
This is either a protection against web crawlers or a server protection against DoS/DDoS, because you are sending too much requests from the same IP in a short time, so the Firewall or the WSA will block your future request. Just modify your script to make request in batch spaced by some time. you can use callLater() with some time after each X request.

Related

is there a way to setup timeout in grpc server side?

Unable to timeout a grpc connection from server side. It is possible that client establishes a connection but kept on hold/sleep which is resulting in grpc server connection to hang. Is there a way at server side to disconnect the connection after a certain time or set the timeout?
We tried disconnecting the connection from client side but unable to do so from server side. In this link Problem with gRPC setup. Getting an intermittent RPC unavailable error, Angad says that it is possible but unable to define those parameters in python.
My code snippet:
def serve():
server = grpc.server(thread_pool=futures.ThreadPoolExecutor(max_workers=2), maximum_concurrent_rpcs=None, options=(('grpc.so_reuseport', 1),('grpc.GRPC_ARG_KEEPALIVE_TIME_MS', 1000)))
stt_pb2_grpc.add_ListenerServicer_to_server(Listener(), server)
server.add_insecure_port("localhost:50051")
print("Server starting in port "+str(50051))
server.start()
try:
while True:
time.sleep(60 * 60 * 24)
except KeyboardInterrupt:
server.stop(0)
if __name__ == '__main__':
serve()
I expect the connection should be timed out from grpc server side too in python.
In short, you may find context.abort(...) useful, see API reference. Timeout a server handler is not supported by the underlying C-Core API of gRPC Python. So, you have to implement your own timeout mechanism in Python.
You can try out some solution from other StackOverflow questions.
Or use a simple-but-big-overhead extra threads to abort the connection after certain length of time. It might look like this:
_DEFAULT_TIME_LIMIT_S = 5
class FooServer(FooServicer):
def RPCWithTimeLimit(self, request, context):
rpc_ended = threading.Condition()
work_finished = threading.Event()
def wrapper(...):
YOUR_ACTUAL_WORK(...)
work_finished.set()
rpc_ended.notify_all()
def timer():
time.sleep(_DEFAULT_TIME_LIMIT_S)
rpc_ended.notify_all()
work_thread = threading.Thread(target=wrapper, ...)
work_thread.daemon = True
work_thread.start()
timer_thread = threading.Thread(target=timer)
timer_thread.daemon = True
timer_thread.start()
rpc_ended.wait()
if work_finished.is_set():
return NORMAL_RESPONSE
else:
context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, 'RPC Time Out!')

Twisted IRC Bot connection lost repeatedly to localhost

I am trying to implement an IRC Bot on a local server. The bot that I am using is identical to the one found at Eric Florenzano's Blog. This is the simplified code (which should run)
import sys
import re
from twisted.internet import reactor
from twisted.words.protocols import irc
from twisted.internet import protocol
class MomBot(irc.IRCClient):
def _get_nickname(self):
return self.factory.nickname
nickname = property(_get_nickname)
def signedOn(self):
print "attempting to sign on"
self.join(self.factory.channel)
print "Signed on as %s." % (self.nickname,)
def joined(self, channel):
print "attempting to join"
print "Joined %s." % (channel,)
def privmsg(self, user, channel, msg):
if not user:
return
if self.nickname in msg:
msg = re.compile(self.nickname + "[:,]* ?", re.I).sub('', msg)
prefix = "%s: " % (user.split('!', 1)[0], )
else:
prefix = ''
self.msg(self.factory.channel, prefix + "hello there")
class MomBotFactory(protocol.ClientFactory):
protocol = MomBot
def __init__(self, channel, nickname='YourMomDotCom', chain_length=3,
chattiness=1.0, max_words=10000):
self.channel = channel
self.nickname = nickname
self.chain_length = chain_length
self.chattiness = chattiness
self.max_words = max_words
def startedConnecting(self, connector):
print "started connecting on {0}:{1}"
.format(str(connector.host),str(connector.port))
def clientConnectionLost(self, connector, reason):
print "Lost connection (%s), reconnecting." % (reason,)
connector.connect()
def clientConnectionFailed(self, connector, reason):
print "Could not connect: %s" % (reason,)
if __name__ == "__main__":
chan = sys.argv[1]
reactor.connectTCP("localhost", 6667, MomBotFactory('#' + chan,
'YourMomDotCom', 2, chattiness=0.05))
reactor.run()
I added the startedConnection method in the client factory, which it is reaching and printing out the proper address:host. It then disconnects and enters the clientConnectionLost and prints the error:
Lost connection ([Failure instance: Traceback (failure with no frames):
<class 'twisted.internet.error.ConnectionDone'>: Connection was closed cleanly.
]), reconnecting.
If working properly it should log into the appropriate channel, specified as the first arg in the command (e.g. python module2.py botwar. would be channel #botwar.). It should respond with "hello there" if any one in the channel sends anything.
I have NGIRC running on the server, and it works if I connect from mIRC or any other IRC client.
I am unable to find a resolution as to why it is continually disconnecting. Any help on why would be greatly appreciated. Thank you!
One thing you may want to do is make sure you will see any error output produced by the server when your bot connects to it. My hunch is that the problem has something to do with authentication, or perhaps an unexpected difference in how ngirc handles one of the login/authentication commands used by IRCClient.
One approach that almost always applies is to capture a traffic log. Use a tool like tcpdump or wireshark.
Another approach you can try is to enable logging inside the Twisted application itself. Use twisted.protocols.policies.TrafficLoggingFactory for this:
from twisted.protocols.policies import TrafficLoggingFactory
appFactory = MomBotFactory(...)
logFactory = TrafficLoggingFactory(appFactory, "irc-")
reactor.connectTCP(..., logFactory)
This will log output to files starting with "irc-" (a different file for each connection).
You can also hook directly into your protocol implementation, at any one of several levels. For example, to display any bytes received at all:
class MomBot(irc.IRCClient):
def dataReceived(self, bytes):
print "Got", repr(bytes)
# Make sure to up-call - otherwise all of the IRC logic is disabled!
return irc.IRCClient.dataReceived(self, bytes)
With one of those approaches in place, hopefully you'll see something like:
:irc.example.net 451 * :Connection not registered
which I think means... you need to authenticate? Even if you see something else, hopefully this will help you narrow in more closely on the precise cause of the connection being closed.
Also, you can use tcpdump or wireshark to capture the traffic log between ngirc and one of the working IRC clients (eg mIRC) and then compare the two logs. Whatever different commands mIRC is sending should make it clear what changes you need to make to your bot.

Twisted big files transfer

I write client-server application like this:
client(c#) <-> server (twisted; ftp proxy and additional functional) <-> ftp server
Server has two classes: my own class-protocol inherited from LineReceiever protocol and FTPClient from twisted.protocols.ftp.
But when client sends or gets big files (10 Gb - 20 Gb) server catches MemoryError. I don't use any buffers in my code. It happens when after call transport.write(data) data appends to inner buffer of reactor's writers (correct me if I wrong).
What should I use to avoid this problem? Or should I change approach to the problem?
I found out that for big streams, I should use IConsumer and IProducer interfaces. But finally it will invoke transfer.write method and effect will be the same. Or am I wrong?
UPD:
Here is logic of file download/upload (from ftp through Twisted server to client on Windows):
Client sends some headers to Twisted server and after that begins send of file. Twisted server receive headers and after that (if it needs) invoke setRawMode(), open ftp connection and recieves/sends bytes from/to client and after all close connections. Here is a part of code that uploads files:
FTPManager class
def _ftpCWDSuccees(self, protocol, fileName):
self._ftpClientAsync.retrieveFile(fileName, FileReceiver(protocol))
class FileReceiver(Protocol):
def __init__(self, proto):
self.__proto = proto
def dataReceived(self, data):
self.__proto.transport.write(data)
def connectionLost(self, why = connectionDone):
self.__proto.connectionLost(why)
main proxy-server class:
class SSDMProtocol(LineReceiver)
...
After SSDMProtocol object (call obSSDMProtocol) parse headers it invoke method that open ftp connection (FTPClient from twisted.protocols.ftp) and set object of FTPManager field _ftpClientAsync and call _ftpCWDSuccees(self, protocol, fileName) with protocol = obSSDMProtocol and when file's bytes recieved invokes dataReceived(self, data) of FileReceiver object.
And when self.__proto.transport.write(data) invoked, data appends to inner buffer faster than sending back to client, therefore memory runs out. May be I can stop reading when the buffer reaches a certain size and resume reading after buffer will be all send to client? or something like that?
If you're passing a 20 gigabyte (gigabit?) string to transport.write, you're going to need at least 20 gigabytes (gigabits?) of memory - probably more like 40 or 60 due to the extra copying necessary when dealing with strings in Python.
Even if you never pass a single string to transport.write that is 20 gigabytes (gigabits?), if you repeatedly call transport.write with short strings at a rate faster than your network can handle, the send buffer will eventually grow too large to fit in memory and you'll encounter a MemoryError.
The solution to both of these problems is the producer/consumer system. The advantage that using IProducer and IConsumer gives you is that you'll never have a 20 gigabyte (gigabit?) string and you'll never fill up a send buffer with too many shorter strings. The network will be throttled so that bytes are not read faster than your application can deal with them and forget about them. Your strings will end up on the order of 16kB - 64kB, which should easily fit in memory.
You just need to adjust your use of FileReceiver to include registration of the incoming connection as a producer for the outgoing connection:
class FileReceiver(Protocol):
def __init__(self, outgoing):
self._outgoing = outgoing
def connectionMade(self):
self._outgoing.transport.registerProducer(self.transport, streaming=True)
def dataReceived(self, data):
self._outgoing.transport.write(data)
Now whenever self._outgoing.transport's send buffer fills up, it will tell self.transport to pause. Once the send buffer empties out, it will tell self.transport to resume. self.transport nows how to undertake these actions at the TCP level so that data coming into your server will also be slowed down.

Extend existing Twisted Service with another Socket/TCP/RPC Service to get Service informations

I'm implementing a Twisted-based Heartbeat Client/Server combo, based on this example. It is my first Twisted project.
Basically it consists of a UDP Listener (Receiver), who calls a listener method (DetectorService.update) on receiving packages. The DetectorService always holds a list of currently active/inactive clients (I extended the example a lot, but the core is still the same), making it possible to react on clients which seem disconnected for a specified timeout.
This is the source taken from the site:
UDP_PORT = 43278; CHECK_PERIOD = 20; CHECK_TIMEOUT = 15
import time
from twisted.application import internet, service
from twisted.internet import protocol
from twisted.python import log
class Receiver(protocol.DatagramProtocol):
"""Receive UDP packets and log them in the clients dictionary"""
def datagramReceived(self, data, (ip, port)):
if data == 'PyHB':
self.callback(ip)
class DetectorService(internet.TimerService):
"""Detect clients not sending heartbeats for too long"""
def __init__(self):
internet.TimerService.__init__(self, CHECK_PERIOD, self.detect)
self.beats = {}
def update(self, ip):
self.beats[ip] = time.time()
def detect(self):
"""Log a list of clients with heartbeat older than CHECK_TIMEOUT"""
limit = time.time() - CHECK_TIMEOUT
silent = [ip for (ip, ipTime) in self.beats.items() if ipTime < limit]
log.msg('Silent clients: %s' % silent)
application = service.Application('Heartbeat')
# define and link the silent clients' detector service
detectorSvc = DetectorService()
detectorSvc.setServiceParent(application)
# create an instance of the Receiver protocol, and give it the callback
receiver = Receiver()
receiver.callback = detectorSvc.update
# define and link the UDP server service, passing the receiver in
udpServer = internet.UDPServer(UDP_PORT, receiver)
udpServer.setServiceParent(application)
# each service is started automatically by Twisted at launch time
log.msg('Asynchronous heartbeat server listening on port %d\n'
'press Ctrl-C to stop\n' % UDP_PORT)
This heartbeat server runs as a daemon in background.
Now my Problem:
I need to be able to run a script "externally" to print the number of offline/online clients on the console, which the Receiver gathers during his lifetime (self.beats). Like this:
$ pyhb showactiveclients
3 clients online
$ pyhb showofflineclients
1 client offline
So I need to add some kind of additional server (Socket, Tcp, RPC - it doesn't matter. the main point is that i'm able to build a client-script with the above behavior) to my DetectorService, which allows to connect to it from outside. It should just give a response to a request.
This server needs to have access to the internal variables of the running detectorservice instance, so my guess is that I have to extend the DetectorService with some kind of additionalservice.
After some hours of trying to combine the detectorservice with several other services, I still don't have an idea what's the best way to realize that behavior. So I hope that somebody can give me at least the essential hint how to start to solve this problem.
Thanks in advance!!!
I think you already have the general idea of the solution here, since you already applied it to an interaction between Receiver and DetectorService. The idea is for your objects to have references to other objects which let them do what they need to do.
So, consider a web service that responds to requests with a result based on the beats data:
from twisted.web.resource import Resource
class BeatsResource(Resource):
# It has no children, let it respond to the / URL for brevity.
isLeaf = True
def __init__(self, detector):
Resource.__init__(self)
# This is the idea - BeatsResource has a reference to the detector,
# which has the data needed to compute responses.
self._detector = detector
def render_GET(self, request):
limit = time.time() - CHECK_TIMEOUT
# Here, use that data.
beats = self._detector.beats
silent = [ip for (ip, ipTime) in beats.items() if ipTime < limit]
request.setHeader('content-type', 'text/plain')
return "%d silent clients" % (len(silent),)
# Integrate this into the existing application
application = service.Application('Heartbeat')
detectorSvc = DetectorService()
detectorSvc.setServiceParent(application)
.
.
.
from twisted.web.server import Site
from twisted.application.internet import TCPServer
# The other half of the idea - make sure to give the resource that reference
# it needs.
root = BeatsResource(detectorSvc)
TCPServer(8080, Site(root)).setServiceParent(application)

How to diagnose "the operation has timed out" HttpException

I am calling 5 external servers to retrieve XML-based data for each request for a particular webpage on my IIS 6 server. Present volume is between 3-5 incoming requests per second, meaning 15-20 outgoing requests per second.
99% of the outgoing requests from my server (the client) to the external servers (the server) work OK but about 100-200 per day end up with a "The operation has timed out" exception.
This suggests I have a resource problem on my server - some shortage of sockets, ports etc or a thread lock but the problem with this theory is that the failures are entirely random - there are not a number of requests in a row that all fail - and two of the external servers account for the majority of the failures.
My question is how can I further diagnose these exceptions to determine if the problem is on my end (the client) or on the other end (the servers)?
The volume of requests precludes putting an analyzer on the wire - it would be very difficult to capture these few exceptions. I have reset CONNECTIONS and THREADS in my machine.config and the basic code looks like:
Dim hRequest As HttpWebRequest
Dim responseTime As String
Dim objWatch As New Stopwatch
Try
' calculate time it takes to process transaction
objWatch.Start()
hRequest = System.Net.WebRequest.Create(url)
' set some defaults
hRequest.Timeout = 5000
hRequest.ReadWriteTimeout = 10000
hRequest.KeepAlive = False ' to prevent open HTTP connection leak
hRequest.SendChunked = False
hRequest.AllowAutoRedirect = True
hRequest.MaximumAutomaticRedirections = 3
hRequest.Accept = "text/xml"
hRequest.Proxy = Nothing 'do not waste time searching for a proxy
hRequest.ServicePoint.Expect100Continue = False
Dim feed As New XDocument()
' use *Using* to auto close connections
Using hResponse As HttpWebResponse = DirectCast(hRequest.GetResponse(), HttpWebResponse)
Using reader As XmlReader = XmlReader.Create(hResponse.GetResponseStream())
feed = XDocument.Load(reader)
reader.Close()
End Using
hResponse.Close()
End Using
objWatch.Stop()
' Work here with returned contents in "feed" document
Return XXX' some results here
Catch ex As Exception
objWatch.Stop()
hRequest.Abort()
Return Nothing
End Try
Any suggestions?
By default, HttpWebRequest limits you to 2 connections per HTTP/1.1 server. So, if your requests take time to complete, and you have incoming requests queuing up on the server, you will run out of connection and thus get timeouts.
You should change the max outgoing connections on ServicePointManager.
ServicePointManager.DefaultConnectionLimit = 20 // or some big value.
You said that you are doing 5 outgoing request for each incoming request to the ASP page. Is that 5 different servers, or the same server?
DO you wait for the previous request to complete, before issuing the next one? Is the timeout happening while it is waiting for a connection, or during the request/response?
If the timeout is happening during the request/response then it means that the target server is under stress. The only way to find out if this is the case, is to run wireshark/netmon on one of the machines, and look at the network trace to see if the request from the app is even making it through to the server, and if it is, whether the target server is responding within the given timeout.
If this is a thread starvation issue, then one of the ways to diagnose it is to attach windbg.exe debugger to w3wp.exe process, when you start getting timeout. Then load the sos.dll debugging extension. And run the !threads command, followed by !threadpool command. It will show you how many Worker threads and completion port threads are utilized/remaining. If the #completionport threads or worker threads are low, then that will contribute to the timeout.
Alternatively, you can monitor ASP.NET and System.net perf counters. See if the ASP.NET request queue is increasing monotonically - this might indicate that your outgoing requests are not completing fast enough.
Sorry, there are no easy answers here. THere is a lot of avenues you will need to explore. If I were you, I would start off by attaching windbg.exe to w3wp when you start getting timeouts and do what I described earlier.