I created a simple client/server pair using Asio (non-Boost) and am using it for a simple test of data transfer speed.
Server:
Create buffer (4 MB)
When a client connects (callback from async_accept)
Start timer
Send buffer to client (async_write)
Wait for response from client (callback from async_read)
Repeat the send/wait steps 100 times
Stop timer
Calculate transfer speed (100 * buffer size * 8 / time)
Client:
Connect to server
Wait for data from server (callback from async_read)
Send a single byte back to server (async_write)
Repeat
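A minimal sketch of that client loop, assuming standalone Asio over plain TCP and an arbitrary port of 9000 (the TLS variant wraps the socket in an asio::ssl::stream), might look like this:

#include <asio.hpp>
#include <functional>
#include <vector>

int main() {
    asio::io_context io;
    asio::ip::tcp::resolver resolver(io);
    asio::ip::tcp::socket socket(io);
    asio::connect(socket, resolver.resolve("127.0.0.1", "9000"));

    std::vector<char> block(4 * 1024 * 1024);   // must match the server's 4 MB buffer
    char ack = 1;

    std::function<void()> wait_for_block = [&] {
        asio::async_read(socket, asio::buffer(block),
            [&](std::error_code ec, std::size_t /*bytes*/) {
                if (ec) return;                  // server closed the connection
                // Acknowledge with a single byte, then wait for the next block.
                asio::async_write(socket, asio::buffer(&ack, 1),
                    [&](std::error_code ec2, std::size_t) {
                        if (!ec2) wait_for_block();
                    });
            });
    };
    wait_for_block();
    io.run();
}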
I have implemented this both with and without SSL/TLS encryption. The non-secure version achieves speeds of around 15.0 Gbps over localhost, but the encrypted version slows way down to about 0.3 Gbps.
Is this expected? If not, any ideas what could be causing this?
The task has become CPU bound. You can easily verify that using a task manager.
Also compare with netcat vs openssl s_server/s_client to see the same effects. E.g. for data.bin being 32MiB of random data, I get:
$ for a in {1..100}; do cat data.bin; done | pv | openssl enc -e -kfile server.pem -pass test -out data.bin.crypt
3,12GiB 0:00:08 [ 392MiB/s]
That's just the time required for the server side to encrypt the data.
Is this expected?
No. TLS should not run at less than about 1/3 the speed of plaintext over a sufficiently long transfer. I tested this extensively ten or more years ago, and computers have become a lot faster since then.
If not, any ideas what could be causing this?
You are probably using inadequate buffering between your application and the TLS layer. For example, if you send one byte at a time to the TLS layer, there can be a data explosion of up to 45 times.
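With Asio's ssl::stream the important thing is to hand the whole block to the TLS layer in one call rather than looping over individual bytes. A rough sketch (assuming standalone Asio with OpenSSL and an already-handshaken stream):

#include <asio.hpp>
#include <asio/ssl.hpp>
#include <vector>

// Write one large block over an established TLS stream in a single call,
// so OpenSSL can build full-size (~16 KB) records instead of paying the
// per-record and per-syscall overhead for every byte.
void send_block(asio::ssl::stream<asio::ip::tcp::socket>& tls,
                const std::vector<char>& block) {
    asio::async_write(tls, asio::buffer(block),
        [](const std::error_code& ec, std::size_t /*bytes*/) {
            // On success, start the async_read for the 1-byte acknowledgement.
        });
}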
Related
I've got a short-lived client process that talks to a server over SSL. The process typically runs for less than a second and is intended to be used as part of a shell script that performs larger tasks, so it may be invoked pretty frequently.
The SSL handshaking it performs each time it starts up is showing up as a significant performance bottleneck in my tests and I'd like to reduce this if possible.
One thing that comes to mind is taking the session ID and storing it somewhere (kind of like a cookie) and then re-using it on the next invocation; however, this makes me feel uneasy, as I think there would be some security concerns around doing it.
So, I've got a couple of questions:
Is this a bad idea?
Is this even possible using OpenSSL?
Are there any better ways to speed up the SSL handshaking process?
After the handshake, you can get the SSL session information from your connection with SSL_get_session(). You can then use i2d_SSL_SESSION() to serialise it into a form that can be written to disk.
When you next want to connect to the same server, you can load the session information from disk, then unserialise it with d2i_SSL_SESSION() and use SSL_set_session() to set it (prior to SSL_connect()).
The on-disk SSL session should be readable only by the user that the tool runs as, and stale sessions should be overwritten and removed frequently.
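A rough sketch of that save/restore cycle in C (error handling and size checks omitted; the cache-file path is just an illustration):

#include <openssl/ssl.h>
#include <stdio.h>

/* After SSL_connect() has succeeded: serialise the session to disk. */
static void save_session(SSL *ssl, const char *path) {
    SSL_SESSION *sess = SSL_get_session(ssl);
    unsigned char buf[16384], *p = buf;
    int len = i2d_SSL_SESSION(sess, &p);          /* DER-encode into buf */
    FILE *f = fopen(path, "wb");                  /* should be created mode 0600 */
    fwrite(buf, 1, (size_t)len, f);
    fclose(f);
}

/* Before SSL_connect() on the next invocation: ask to resume the session. */
static void load_session(SSL *ssl, const char *path) {
    unsigned char buf[16384];
    FILE *f = fopen(path, "rb");
    if (!f) return;                               /* no cached session yet */
    size_t len = fread(buf, 1, sizeof buf, f);
    fclose(f);
    const unsigned char *p = buf;
    SSL_SESSION *sess = d2i_SSL_SESSION(NULL, &p, (long)len);
    if (sess != NULL) {
        SSL_set_session(ssl, sess);               /* request resumption */
        SSL_SESSION_free(sess);                   /* the SSL object keeps its own reference */
    }
}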
You should be able to use a session cache securely (which OpenSSL supports); see the documentation for SSL_CTX_set_session_cache_mode, SSL_set_session and SSL_session_reused for more information on how this is achieved.
Could you perhaps use a persistent connection, so the setup is a one-time cost?
You could abstract away the connection logic so your client code still thinks it's doing a connect/process/disconnect cycle.
Interestingly enough I encountered an issue with OpenSSL handshakes just today. The implementation of RAND_poll, on Windows, uses the Windows heap APIs as a source of random entropy.
Unfortunately, due to a "bug fix" in Windows 7 (and Server 2008), the heap enumeration APIs (which are debugging APIs, after all) can now take over a second per call once the heap is full of allocations. This means that both SSL connects and accepts can take anywhere from a second to more than a few minutes.
The ticket contains some good suggestions on how to patch OpenSSL to achieve far, far faster handshakes.
I want to fetch an asset into R2 and at the same time return the response to the client.
So simultaneously streaming into R2 and to the client too.
Related code fragment:
const originResponse = await fetch(request);
const originResponseBody = originResponse.body!.tee();
ctx.waitUntil(
  env.BUCKET.put(objectName, originResponseBody[0], {
    httpMetadata: originResponse.headers
  })
);
return new Response(originResponseBody[1], originResponse);
I tested the download of a 1 GB asset with both a slower and a faster internet connection.
In theory, the outcome (success or not) of the R2 put should be the same in both cases, because it is independent of the client's internet connection speed.
However, when I tested both scenarios, the R2 write was successful with the fast connection, and failed with the slower connection. That means that the ctx.waitUntil 30 second timeout was exceeded in case of the slower connection. It was always an R2 put "failure" when the client download took more than 30 sec.
It seems like the R2 put (the reading of that stream) is backpressured to the speed of the slower consumer, namely the client download.
Is this because otherwise the worker would have to enqueue the already read parts from the faster consumer?
Am I missing something? Could someone confirm this or clarify this? Also, could you recommend a working solution for this use-case of downloading larger files?
EDIT:
The Cloudflare worker implementation of the tee operation is clarified here: https://community.cloudflare.com/t/why-the-faster-stream-waits-the-slower-one-when-using-the-tee-operator-to-fetch-to-r2/467416
It explains the behaviour I observed.
However, a stable solution for the problem is still missing.
Cloudflare Workers limits the flow of a tee to the slower stream because otherwise it would have to buffer data in memory.
For example, say you have a 1GB file, the client connection can accept 1MB/s while R2 can accept 100MB/s. After 10 seconds, the client will have only received 10MB. If we allowed the faster stream to go as fast as it could, then it would have accepted all 1GB. However, that leaves 990MB of data which has already been received from the origin and needs to be sent to the client. That data would have to be stored in memory. But, a Worker has a memory limit of 128MB. So, your Worker would be terminated for exceeding its memory limit. That wouldn't be great either!
With that said, you are running into a bug in the Workers Runtime, which we noticed recently: waitUntil()'s 30-second timeout is intended to start after the response has finished. However, in your case, the 30-second timeout is inadvertently starting when the response starts, i.e. right after headers are sent. This is an unintended side effect of an optimization I made: when Workers detects that you are simply passing through a response body unmodified, it delegates pumping the stream to a different system so that the Worker itself doesn't need to remain in memory. However, this inadvertently means that the waitUntil() timeout kicks in earlier than expected.
This is something we intend to fix. As a temporary work-around, you could write your worker to use streaming APIs such that it reads each chunk from the tee branch and then writes it to the client connection in JavaScript. This will trick the runtime into thinking that you are not simply passing the bytes through, but trying to perform some modifications on them in JavaScript. This forces it to consider your worker "in-use" until the entire stream completes, and the 30-second waitUntil() timeout will only begin at that point. (Unfortunately this work-around is somewhat inefficient in terms of CPU usage since JavaScript is constantly being invoked.)
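A sketch of that work-around, reusing the names from the fragment above (untested; the point is only that each chunk passes through JavaScript):

const originResponse = await fetch(request);
const originResponseBody = originResponse.body!.tee();

// R2 still reads its branch via waitUntil, exactly as before.
ctx.waitUntil(
  env.BUCKET.put(objectName, originResponseBody[0], {
    httpMetadata: originResponse.headers
  })
);

// Instead of returning the second branch directly, copy it chunk by chunk
// into a TransformStream so the runtime sees JavaScript doing the pumping.
const { readable, writable } = new TransformStream();
const pump = async () => {
  const reader = originResponseBody[1].getReader();
  const writer = writable.getWriter();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    await writer.write(value);
  }
  await writer.close();
};
pump(); // deliberately not awaited; the response below streams from `readable`

return new Response(readable, originResponse);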
Is there a way in Erlang to find out exactly how much memory an SSL connection takes?
Right now I'm kinda guessing by dividing the whole beam.smp size (minus the init size) in memory by the number of connected clients...
I'm using R15B01
The SSL connection is handled by a gen_server. Calling
process_info(spawn(Fun), memory).
gives me, after a garbage collection:
{memory,2108}
This clearly does not include the memory used by the SSL connection itself.
The thing is that even to handle a single SSL connection, Erlang starts several separate processes (certificate db, ssl manager, ssl session, etc.), and each of those processes may keep its own storage for its data. Thus it is hard to give a definitive answer as to how much memory each connection takes, as there are quite a few places that keep bookkeeping information about the connection.
If you need an estimate, this is what I would do:
Start an SSL server and an SSL client as described at http://pdincau.wordpress.com/2011/06/22/a-brief-introduction-to-ssl-with-erlang/
Save TotalMemory1 = proplists:get_value(total, memory()). in the server session.
Open 99 more client connections from a separate client session.
Calculate TotalMemory2 = proplists:get_value(total, memory()).
Find the amortized amount of memory a single connection takes by dividing (TotalMemory2 - TotalMemory1) by 99, as sketched below.
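In the server's Erlang shell, that measurement looks roughly like this:

%% Before the 99 extra connections are opened:
TotalMemory1 = proplists:get_value(total, memory()).
%% ... open 99 more client connections from the separate client session ...
TotalMemory2 = proplists:get_value(total, memory()).
%% Amortized bytes per connection:
PerConnection = (TotalMemory2 - TotalMemory1) / 99.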
I have a bunch of WCF REST services hosted on Azure that access a SQL Azure database. I see that ServicePointManager.UseNagleAlgorithm is set to true. I understand that setting this to false would speed up calls (inserts of records < 1460 bytes) to table storage - the following link talks about it.
My Question - Would disabling the Nagle Algorithm also speed up my calls to SQL Azure?
Nagle's algorithm is all about buffering TCP-level data into a smaller number of packets, and is not tied to record size. You could be writing rows of, say, 1300 bytes of data to Table Storage, but once you include TCP header info, content serialization, etc., the data transmitted could be larger than the 1460-byte threshold.
In any case, the net result is that you could be seeing write delays of up to 500 ms when the algorithm is enabled, as data is buffered, resulting in fewer TCP packets over the wire.
It's possible that disabling Nagle's algorithm would help with your access to SQL Azure, but you'd probably need to do some benchmarking to see if your throughput is being affected, based on the type of reads/writes you're doing. It's possible that the calls to SQL Azure, with the requisite SQL command text, result in large-enough packets that disabling Nagle wouldn't make a difference.
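If you want to benchmark it, the setting itself is a one-liner; set it early in the role's startup (e.g. OnStart), before any connections are created, and compare throughput with and without it:

// Disable Nagle for all subsequently created ServicePoints (System.Net).
System.Net.ServicePointManager.UseNagleAlgorithm = false;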
I have a lot of clients (around 4000).
Each client pings my server every 2 seconds.
Can these ping requests put a load on the server and slow it down?
How can I monitor this load?
Right now the server responds slowly, but the processor is almost idle and free memory is fine.
I'm running Apache on Ubuntu.
Assuming you mean a UDP/ICMP ping just to see if the host is alive, 4000 hosts probably isn't much load, and it is fairly easy to calculate. CPU- and memory-wise, ping is handled by your kernel and should be optimized not to take many resources, so you need to look at network resources. The most critical point is if you have a half-duplex link: because all of your hosts are chatty, you'll cause a lot of collisions and retransmissions (and dropped pings). If the links are all full duplex, let's calculate the actual amount of bandwidth required at the server.
4000 clients, one ping every 2 seconds.
Each ping is 74 bytes on the wire (32 bytes data + 8 bytes ICMP header + 20 bytes IP header + 14 bytes Ethernet). You might have some additional overhead if you use VLAN tagging or UDP-based pings.
If we assume the pings are randomly distributed in time, that averages 2000 pings per second * 74 bytes = 148,000 bytes per second.
Multiply by 8 to get bits per second: 1,184,000 bps, or about 1.2 Mbps.
On a 100 Mbps LAN, this would be about 1.2% utilization just for the pings.
If this is a LAN environment, I'd say this is basically no load at all; if it's going across a T1, then it's an immense amount of load. So you should run the same calculation for any network links that might be a bottleneck.
Lastly, if you're not using ICMP pings to check the host but have an application-level ping, you will have all the overhead of whatever protocol you are using, the ping will need to go all the way up the protocol stack, and your application needs to respond. Again, this could be a very minimal load or it could be immense, depending on the implementation details and the network speed. If the host is idle, I doubt this is a problem for you.
Yes, they can. A single ping request does not create much load on its own, but it certainly takes up bandwidth and a nominal amount of CPU.
If you want to monitor this, you might use either tcpdump or wireshark, or perhaps set up a firewall rule and monitor the number of packets it matches.
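For example, on the server you could watch the echo requests directly or count them with a firewall rule (the interface name is an assumption):

$ sudo tcpdump -ni eth0 'icmp[icmptype] == icmp-echo'
$ sudo iptables -I INPUT -p icmp --icmp-type echo-request -j ACCEPT
$ sudo iptables -vnL INPUT    # watch the packet/byte counters on that rule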
The other problem, apart from bandwidth, is the CPU. If a ping has to be passed up to the CPU for processing, thousands of them can put a load on any CPU. It's worth monitoring - but as you said, yours is almost idle, so it can probably cope. Worth keeping in mind, though.
Depending on the clients, ping packets can be different sizes - their payload could be just "aaaaaaaaa", but some may be "thequickbrownfoxjumpedoverthelazydog" - which obviously adds further bandwidth requirements.