In a Cloudflare Worker, why does the faster stream wait for the slower one when using tee() to fetch into R2?

I want to fetch an asset into R2 and at the same time return the response to the client.
So simultaneously streaming into R2 and to the client too.
Related code fragment:
const originResponse = await fetch(request);
const originResponseBody = originResponse.body!.tee();
ctx.waitUntil(
  env.BUCKET.put(objectName, originResponseBody[0], {
    httpMetadata: originResponse.headers,
  })
);
return new Response(originResponseBody[1], originResponse);
I tested downloading a 1 GB asset over both a slower and a faster internet connection.
In theory, the outcome (success or failure) of the R2 put should be the same in both cases, because it is independent of the client's internet connection speed.
However, when I tested both scenarios, the R2 write succeeded with the fast connection and failed with the slower one. That means the ctx.waitUntil 30-second timeout was exceeded with the slower connection: the R2 put always failed when the client download took more than 30 seconds.
It seems like the R2 put (the reading of that stream) is backpressured to the speed of the slower consumer, namely the client download.
Is this because otherwise the worker would have to enqueue the already read parts from the faster consumer?
Am I missing something? Could someone confirm this or clarify this? Also, could you recommend a working solution for this use-case of downloading larger files?
EDIT:
The Cloudflare worker implementation of the tee operation is clarified here: https://community.cloudflare.com/t/why-the-faster-stream-waits-the-slower-one-when-using-the-tee-operator-to-fetch-to-r2/467416
It explains the behaviour I observed.
However, a stable solution for the problem is still missing.

Cloudflare Workers limits the flow of a tee to the slower stream because otherwise it would have to buffer data in memory.
For example, say you have a 1GB file, the client connection can accept 1MB/s while R2 can accept 100MB/s. After 10 seconds, the client will have only received 10MB. If we allowed the faster stream to go as fast as it could, then it would have accepted all 1GB. However, that leaves 990MB of data which has already been received from the origin and needs to be sent to the client. That data would have to be stored in memory. But, a Worker has a memory limit of 128MB. So, your Worker would be terminated for exceeding its memory limit. That wouldn't be great either!
With that said, you are running into a bug in the Workers Runtime, which we noticed recently: waitUntil()'s 30-second timeout is intended to start after the response has finished. However, in your case, the 30-second timeout is inadvertently starting when the response starts, i.e. right after headers are sent. This is an unintended side effect of an optimization I made: when Workers detects that you are simply passing through a response body unmodified, it delegates pumping the stream to a different system so that the Worker itself doesn't need to remain in memory. However, this inadvertently means that the waitUntil() timeout kicks in earlier than expected.
This is something we intend to fix. As a temporary work-around, you could write your worker to use streaming APIs such that it reads each chunk from the tee branch and then writes it to the client connection in JavaScript. This will trick the runtime into thinking that you are not simply passing the bytes through, but trying to perform some modifications on them in JavaScript. This forces it to consider your worker "in-use" until the entire stream completes, and the 30-second waitUntil() timeout will only begin at that point. (Unfortunately this work-around is somewhat inefficient in terms of CPU usage since JavaScript is constantly being invoked.)
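A rough sketch of that work-around (not an official recipe; it assumes the same request, env, ctx, and objectName bindings as in the question, and the branch names are just illustrative) could look like this: instead of handing the tee branch directly to the Response, the Worker pumps it chunk by chunk through an identity TransformStream in JavaScript.

const originResponse = await fetch(request);
const [r2Branch, clientBranch] = originResponse.body!.tee();

// The R2 upload still runs in the background, as in the original fragment.
ctx.waitUntil(
  env.BUCKET.put(objectName, r2Branch, {
    httpMetadata: originResponse.headers,
  })
);

// Pump the client branch through JavaScript so the runtime considers the
// Worker "in-use" until the whole body has been forwarded.
const { readable, writable } = new TransformStream();
ctx.waitUntil((async () => {
  const reader = clientBranch.getReader();
  const writer = writable.getWriter();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      await writer.write(value);
    }
    await writer.close();
  } catch (err) {
    await writer.abort(err);
  }
})());

return new Response(readable, originResponse);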

Related

Web application hangs after multiple requests

The application uses Apache HTTP Server as the web server and Tomcat as the application server.
Operations/requests can be triggered from the UI that take time to return from the server, as they do processing like fetching data from the database and performing calculations on it. The time taken depends on the amount of data in the database and on the time span of the data being processed; it can range from about 2 minutes up to 30 minutes or an hour, depending on the parameters.
Apart from this, there are other calls which fetch a small amount of data from the database and return immediately.
Now, when I have several of these long, heavy calls running concurrently, say 4 or 5, and I make a call that is supposed to be small and return immediately, that call also hangs: it never even reaches my controller.
I am unable to find a way to debug this issue or a resolution. Please let me know if you happen to know how to proceed with it.
I am using Spring, with Hibernate and c3p0 connection pooling.
So I figured out what was wrong, and thought I would share it in case someone somewhere faces the same issue. It turns out nothing was wrong with the application server or the web server; technically speaking, it was the browser's fault.
I found out that a browser can only have a limited number of concurrent open connections to a domain; for the latest version of Chrome at the time of writing, the limit is 6. This is something all browsers do to prevent DDoS attacks.
Since the HTTP calls in my application take a long time to return while the calculations complete, several of them accumulate concurrently. As a result, the browser stops sending any further requests after the 6th concurrent call, and the application feels unresponsive. You can read about the maximum number of concurrent calls per browser on SO.
A possible solution I have thought of is polling, or better yet, long polling. I would have used WebSockets, but that would require a lot of changes.
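A rough browser-side sketch of what the polling approach could look like (the /long-task endpoints and the 2-second interval are made up for illustration):

// Hypothetical endpoints: POST /long-task starts the work and returns { id };
// GET /long-task/<id> returns { done, result }.
async function runLongTask(params) {
  const startResponse = await fetch('/long-task', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(params),
  });
  const { id } = await startResponse.json();

  // Each poll is a short-lived request, so it does not hold one of the
  // browser's ~6 connections per domain open for the whole computation.
  while (true) {
    const status = await (await fetch(`/long-task/${id}`)).json();
    if (status.done) return status.result;
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
}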

Possible redis data corruption bug

I have seen a data problem happening in redis and I am wondering if my diagnosis is correct. Essentially when I'm doing a lot of writing to a server and reading using a Jedis client, I am seeing timeouts followed by incorrect data being returned by get() operations - the data makes sense but it's for a different key.
Here is what I think is happening:
Master is put under a lot of write load
Slave does a periodic bgsave
Slave tries to catch up to the master but it's gotten too far behind so it does a full re-sync
To serve the full re-sync, master does a bgsave of a 10GB+ data set while handling lots of reads and writes
Jedis client get() call times out before the data comes back from the server
The next get() call done on the same client reads the data that has been written in response to the one that timed out (since it actually arrives in the socket buffer after the timeout but before the next call)
From now on, every get() call returns the data intended for the previous one
My solution, which seems to work, is to close and reopen the connection every time a timeout exception is thrown.
Does this seem like a plausible explanation for what I am seeing?
What you are describing would not be a Redis bug but a Jedis one, as the offset reads would be happening in the client.
In this case a workaround to reconnect on timeout would be reasonable and should work. I'd also recommend submitting it as a bug to Jedis.
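To make that work-around concrete, here is a rough JavaScript-style sketch of the pattern (the question is about Jedis, but the control flow is the same in any client; connect(), get(), and close() here are a hypothetical client API, not a real library):

// Hypothetical client API: connect() opens a connection, get() may throw a
// timeout error, close() tears the socket down.
let client = await connect(redisConfig);

async function safeGet(key) {
  try {
    return await client.get(key);
  } catch (err) {
    if (err.name === 'TimeoutError') {
      // The reply for this request may still arrive later on the same socket
      // and be mistaken for the reply to the next command, so never reuse
      // the connection: close it and open a fresh one.
      await client.close();
      client = await connect(redisConfig);
    }
    throw err;
  }
}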

Real Time screen grabbing and streaming with libav-tools

For my school project I have to stream a screen grab from one station (i.e. the server) to another (i.e. the client) in real time, both running Linux (Ubuntu).
I'm using libav-tools (avconv as the encoder on the server side and avplay as the player on the client side).
avconv uses the x11grab format to grab from the screen.
My problem is that avconv needs a few seconds before it starts outputting the encoded video; that wait is too long for real time.
I've tried streaming to localhost to avoid network influence on speed, it still seems that avconv is responsible for the long wait.
Also, streaming a video file seems to be much faster; it starts almost immediately.
The project is implemented in C++ and executes avconv in a fork.
Any suggestions as to shortening the procedure?
This is most likely due to internal buffering. There is often a buffer that is far too big by default, because low delay is not the primary concern of most software; it is more concerned with coping with bad connections and that sort of problem, which is what buffers are for.
See https://libav.org/avconv.html, search for "nobuffer" or "-analyzeduration" or "-rtbufsize" or "-max_delay" or "-fpsprobesize" or "rtmp_buffer" (if you use rtmp) or others and try your luck.
There will always be a noticeable delay, especially if you use an encoding like H.264 for the transfer. But it does not need to be a few seconds in a controlled environment; you should be able to bring it down to fractions of a second.
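As an untested starting point, a low-latency invocation could combine several of those options with x264's zero-latency tuning (the resolution, display, and addresses are placeholders, and exact option support depends on your libav version):

avconv -fflags nobuffer -analyzeduration 0 -probesize 32 \
  -f x11grab -s 1280x720 -i :0.0 \
  -c:v libx264 -preset ultrafast -tune zerolatency \
  -f mpegts udp://CLIENT_IP:1234

and on the client side, also with reduced buffering:

avplay -fflags nobuffer -analyzeduration 0 udp://0.0.0.0:1234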

how to deal with read() timeout in Redis client?

Assume that my client sends an INCR command to the Redis server, but the response packet is lost, so my client's read() times out. The client cannot tell whether the INCR operation has been performed by the server.
What should it do next: resend the INCR, or continue with the next command? If the client resends the INCR but Redis had already carried it out on the server side, the key will be incremented twice, which is not what we want.
This is not a problem specific to Redis: it applies to any other data store as well (including transactional ones). There is no solution to this problem; you can only hope to minimize the issue.
For instance, some people tend to set very aggressive timeout values, thinking that Redis is supposed to be a soft real-time data store. Redis is fast, but you also need to consider the network, and the system itself. Network-related problems may generate high latencies. If the system starts swapping, it will very seriously impact Redis response times.
I tend to think that a timeout under 2 seconds is nonsense on any Unix/Linux system, and if a network is involved, I am much more comfortable with 10 seconds. People set very low values because they want to avoid having their application block: it is a mistake. Rather than setting very low timeouts and keeping the application synchronous, they should design the application to be asynchronous and set sensible timeouts.
After a timeout, a client should never "continue" with the next command. It should close the connection, and try to open a new one. If a reply (or a query) has been lost, it is unlikely that the client and the server can resynchronize. It is safer to close the connection.
Should you try to issue the INCR again after the reconnection? It is really up to you. But if a read timeout has just been triggered, there is a good chance the reconnection will time out as well. Redis being single-threaded, when it is slow for one connection, it is slow for all connections simultaneously.

How to control number of running worker processes for MongoDB?

Well, the question mostly explains itself, but let me clear it up a little more.
I am running MongoDB primarily for read-only purposes at back-end. My crons do the writes and they don't really push it hard when they are triggered. Some updates, some new documents etc.
The thing is, requests usually do not even hit the application level, because full-page caching is handled in Memcached by Nginx. So the application doesn't query the database for another hour per page.
But as far as I can see in my process list, there are 21 MongoDB worker processes that are using none of the CPU but a reasonably large amount of memory because of the previous queries.
I checked the configuration settings and googled around but couldn't find any answer. So, is there any way to limit those processes, or at least to tell MongoDB to reduce/empty its memory usage after a while?
Workers are used for talking to the config server and to other replica set members as well, apart from just serving user requests. This is documented here.
You can limit net.maxIncomingConnections, as per the recommendation on that page, to limit the number of workers processing user requests. But this should be used with caution: setting the number too low and then sending more concurrent calls will result in some calls being queued.
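For reference, a sketch of that setting in a YAML mongod.conf (the value is purely illustrative; size it to your expected concurrency):

net:
  maxIncomingConnections: 200

The same limit can also be given on the mongod command line as --maxConns.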