Background: I'm trying to measure common metrics (e.g. firstContentfulPaint) using selenium with chromium over satellite links under different loss and delay parameters.
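For context, a minimal sketch of how such a metric can be read from a Selenium-driven Chromium session via the Paint Timing API (the URL and the --enable-quic flag are illustrative assumptions, not the exact setup):

# Sketch: read firstContentfulPaint from a Selenium-driven Chromium session.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--enable-quic")  # assumption: QUIC explicitly enabled for the test
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    fcp = driver.execute_script(
        "const e = performance.getEntriesByType('paint')"
        ".find(p => p.name === 'first-contentful-paint');"
        "return e ? e.startTime : null;"
    )
    print("firstContentfulPaint (ms):", fcp)
finally:
    driver.quit()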
Problem: When running with high delay / loss rates, chromium aborts the handshake relatively early with this error message: No recent network activity after 4001103us. (The exact number of microseconds varies, but it always corresponds to between 4 and 5 seconds.)
Found out so far:
This error message seems to be generated here: https://source.chromium.org/chromium/chromium/src/+/main:net/third_party/quiche/src/quic/core/quic_connection.cc;l=6323?q=%22No%20recent%20network%20activity%20after%22&ss=chromium
I believe the relevant variable is: quic_max_idle_time_before_crypto_handshake_seconds
and is set here: https://source.chromium.org/chromium/chromium/src/+/main:net/third_party/quiche/src/quic/core/quic_constants.h;l=139;drc=1b6e5b6710b7c002de308d7195326fc84d6e9b33
Question / TL;DR:
How do I change the timeout used by chromium during the handshake stage in HTTP/3 / QUIC? (without building chromium from scratch)
Thanks a lot!
Lately numerous network requests with Alamofire made from our iOS device fail with the following error:
Error Domain=NSPOSIXErrorDomain Code=28 "No space left on device"
UserInfo={_NSURLErrorFailingURLSessionTaskErrorKey=LocalDataTask .<3>,
_kCFStreamErrorDomainKey=1,
_NSURLErrorRelatedURLSessionTaskErrorKey=( "LocalDataTask .<3>" ),
_kCFStreamErrorCodeKey=28}
Our app has a mechanism to send a network request if the user has moved +- 10 meters. This is checked every 5 seconds, so in theory every five seconds a call can be made. The network request fails occasionally with this message, returning no status code and the above error.
The message implies the error has to do with available disk/memory space on the device. However, after checking both, there is no link to be found, since there is plenty of space available. Also, the error occurs on multiple devices, all running iOS 14.4 or higher.
Is there information available regarding error code 28 and what could be the culprit on iOS devices? Even better: how can this error be prevented?
To answer the occurrence of the error itself:
NSPOSIXErrorDomain Code=28 "No space left on device"
With logs in the Xcode terminal:
2021-05-07 15:56:50.873428+0200 MYAPP[21757:7406020] [] nw_path_evaluator_create_flow_inner NECP_CLIENT_ACTION_ADD_FLOW 05CD829A-810D-412F-B86E-7524369359E8 [28: No space left on device]
2021-05-07 15:56:50.877243+0200 MYAPP[21757:7400322] Task <5504BCDF-7DFE-4045-BD4B-E75054636D5B>.<1> finished with error [28] Error Domain=NSPOSIXErrorDomain Code=28 "No space left on device" UserInfo={_NSURLErrorFailingURLSessionTaskErrorKey=LocalUploadTask <5504BCDF-7DFE-4045-BD4B-E75054636D5B>.<1>, _kCFStreamErrorDomainKey=1, _NSURLErrorRelatedURLSessionTaskErrorKey=(
"LocalUploadTask <5504BCDF-7DFE-4045-BD4B-E75054636D5B>.<1>"
), _kCFStreamErrorCodeKey=28}
It appears to be thrown when too many NSURLSessions are created (reaching a limit of, in our tests, 600-700 sessions) that are not maintained or closed properly. The error started to be thrown as of iOS 14, so it would be interesting to know whether a limit was introduced there.
Linked below is a GitHub issue reporting the same problem on the Ktor microservices framework by JetBrains, pointing in the same direction and mentioning the invalidation of sessions to prevent this issue:
https://github.com/ktorio/ktor/issues/1341
In our own project, the origin of the problem turned out to be our use of the StarScream websocket library. This might not be relevant to the issues others are having, but it is explained anyway to give a complete picture of the problem; it is the cause and fix of our specific situation.
At first we assumed it had something to do with the URLSession created by Alamofire (the networking library we use), since POST requests started to get cancelled and killing the app seemed the only way to make requests again.
However, we also use websocket connections through the StarScream library, which attempts to connect to a socket and, if that fails, retries every two seconds for a maximum of two hours. That means that for two hours, every two seconds, we connect to the socket -> receive a connection failure -> disconnect the socket -> connect again. Because the socket was a singleton and only initiated once, we assumed multiple URLSessions could not be created. However, every call to connect created a new nw_connection object, since the library did not handle the disconnect properly.
(Image: NWConcrete_nw_connection objects generated in the socket connection)
This was validated using the Instruments app to check for the creation of new nw_connection objects, which were logged there as a "memory leak". The solution was to make sure we disconnect the socket (invalidate the session) properly before connecting again.
I hope this answers a big part of the issue, and I will mark my own question as answered since this was the solution to the problem at hand. I think Apple should consider reporting accurately that a limit on the number of created objects has been reached, instead of returning a "No space left on device" error.
Just wanted to chime in with more info, since we're experiencing the same issue.
Based on our analytics, this issue only started happening since iOS 14. We've verified it happening on 14.2, 14.4 and 14.5. Naturally the most straightforward cause for this error would be low memory or disk storage. We've excluded this option with additional logging, as you seem to have done as well.
A possibly related SO post has attributed the issue to a network inspecting framework that was enabled in their release build. It's worth checking if you use a similar tool.
Another report of this issue, this time on the Github of AFNetworking (predecessor to the Alamofire library you use), says they were able to fix it by limiting the creation of URLSession objects.
For us personally, neither of these did the trick. We created a support ticket with Apple, but this hasn't led to a solution. They requested a small sample project that reproduces the issue, but the error only manifested after 7 days of continuous use in our app. If you have a faster way to reproduce this, it may be worth it to submit your own support ticket.
Hopefully this helps you find a solution, if you do please add this to your post to help others!
Using this API: https://developer.mozilla.org/en-US/docs/Web/API/Network_Information_API
You can run navigator.connection in a browser console to see the various values describing your network connection.
However, the downlink attribute maxes out at 10 (i.e. 10 Mbps). Why is it capped there? That doesn't really help me, since I'm deciding whether a client can handle HD video, which may very well require more than 10 Mbps. Thanks.
I found the answer in the comments to this answer: https://stackoverflow.com/a/47511842/3973137
Turns out Chrome caps it at 10 Mbps to prevent fingerprinting
I'm working on a backoff strategy for a robot that connects through the Twitter Streaming API. The API documentation states:
Back off linearly for TCP/IP level network errors. These problems are generally temporary and tend to clear quickly. Increase the delay in reconnects by 250ms each attempt, up to 16 seconds.
I understand these errors to occur when – for whatever reason – the client cannot communicate with the server (i.e. no Internet service). However, I'm not sure whether HTTP status codes equal to or greater than 500 should be treated as TCP/IP-level network errors too (e.g. 503 Service Unavailable), because in order to receive these error codes, a successful connection between client and server must already have happened.
Could someone please help me understand this?
Thanks.
Since this post earned me a "Tumbleweed" badge :-), I decided to post the reply I received at Twitter Developers:
#kurrik
I think your intuition is correct in that a HTTP status code is not a TCP/IP error and you should use the exponential backoff. The slower backoff for these kinds of errors is so that your connection does not get rate limited. A 5XX error is a bit unusual, as it indicates an error which may have happened before or after the connection attempt was logged by the rate limiter. To be safe, I'd say use exponential backoff for this case (although most 503 issues should be cleared after the first reconnect attempt).
I ended up using this strategy (a code sketch follows the list):
Disconnections and 500 errors: linear backoff starting at 0.25s and adding 0.25s each attempt, up to 16s.
All errors greater than 500: exponential backoff starting at 5s and doubling each time, up to 5 minutes.
Rate limited (420 / 429): exponential backoff starting at 1 minute and doubling each time.
No reconnection: all other 4XX errors (those different from the rate-limit errors).
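A minimal sketch of that schedule in Python (illustrative only; the attempt counter and status handling are my own framing, not Twitter's code):

# Return the delay in seconds before reconnect attempt `attempt` (1-based),
# or None if no reconnection should be made. `status` is the HTTP status code,
# or None for a plain TCP/IP disconnection.
def backoff_seconds(attempt, status=None):
    if status is None or status == 500:
        # Disconnections and 500 errors: linear, +0.25s per attempt, capped at 16s
        return min(0.25 * attempt, 16.0)
    if status > 500:
        # Errors greater than 500: exponential, starting at 5s and doubling, capped at 5 minutes
        return min(5.0 * (2 ** (attempt - 1)), 300.0)
    if status in (420, 429):
        # Rate limited: exponential, starting at 1 minute and doubling
        return 60.0 * (2 ** (attempt - 1))
    if 400 <= status < 500:
        # All other 4XX errors: no reconnection
        return None
    return None  # statuses outside the ranges above are not part of this scheme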
I am new to the JMeter tool. Can anyone help me with the best way to analyse JMeter reports?
Simply a list of related links you may find useful:
Native graphs:
JMeter Report Dashboard
Real-time plotting with 3rd party real-time series database like influxdb
Free Open source solutions for automated graphs:
JMeter Plugins - look into the custom graphs in this package; some of them provide better results reporting out of the box than JMeter's original ones;
JMeter Result Analysis Plugin
JWeter, a tool for log analysis & visualization
Recipes with custom development:
JMeter Wiki: Suggestions and Recipes for Log Analysis
Better JMeter Graphs
Plotting your load test with JMeter
3rd party solutions:
Blazemeter Sense
Tricentis flood.io
RedLine13
JAnalyser: browser based results analysis tool
Update:
Please find, use, and feel free to extend this Awesome JMeter collection, continued as a GitHub repo.
There are 3 tests that are a must when doing performance testing: there should always be a baseline test, a peak test, and a stress test. These tests relate to each other because of Little's law: the long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the time a customer spends in the system, W; or expressed algebraically: L = λW.
JMeter already provides means to check these values: the standard plugins provide plots for response times, hits, and throughput. There is no way to directly tell how many users were active on the system (concurrent users are not the same as active users). The plugins are enough to produce the reports, but they do not allow much control over the presentation, so I will use some plots produced with Python (they add labels and have 2 y-axes).
Baseline Test:
This is a special case of the law: the number of active users is constant and equal to one, so:
L = λW
1 = λW
1/W = λ
If the application runs the same piece of code, the response time will stabilize over time, so the arrival rate will be constant over time too.
There is a service that does nothing other than wait for some time to go by:
2-second service: the arrival rate was 1/2 TPS.
3-second service: the arrival rate was 1/3 TPS.
Peak Test:
This is another special case: load increases until it surpasses the system throughput, and because the load is greater than the throughput, the response times increase. During the test the number of threads should increase fast enough to compensate for the long response times.
This time, instead of running a peak, I will stress the system with more load than it is able to handle during the whole test, to control the service throughput.
The active transactions are those that have left the injector but haven't received a response yet; they are transactions queued somewhere within the system.
λ(t) = c, T(t) = k; both the load and the throughput are constant over time.
L = Σλ - ΣT = ct - kt; the number of active transactions is the difference between the cumulative load and the cumulative throughput.
L = (c - k)t
λW= (c - k)t
cW(t) = (c - k)t
W(t) = t(c - k)/c
Because response times grow as the number of active users grows, we will need the injector to create new threads as fast as new connections are required; most of the pool threads are going to be busy.
2TPS arrival rate, 1 TPS throughput:
The response time function is W(t) = t/2.
The injector stresses the system for 300 seconds.
The test lasts 600 seconds.
4TPS arrival rate, 1 TPS throughput:
The response time function is W(t) = 3t/4.
The injector stresses the system for 300 seconds.
The test lasts 1200 seconds.
6TPS arrival rate, 5 TPS throughput:
The response time function is W(t) = t/6.
The injector stresses the system for 300 seconds.
The test lasts 360 seconds.
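The whole model can be checked with a few lines of Python; this is a sketch of the algebra above (not of JMeter itself), with the numbers mirroring the examples:

# Little's law sketches: c = arrival rate (TPS) from the injector, k = service throughput (TPS).
def baseline_arrival_rate(response_time_s):
    # Baseline, single active user: 1 = lambda * W  =>  lambda = 1 / W
    return 1.0 / response_time_s

def overload_response_time(t, c, k):
    # Peak/stress while the injector keeps c > k: W(t) = t * (c - k) / c
    return t * (c - k) / c

print(baseline_arrival_rate(2.0))        # 0.5 TPS for the 2-second service
print(baseline_arrival_rate(3.0))        # ~0.33 TPS for the 3-second service
print(overload_response_time(300, 2, 1)) # 150 s after 300 s at 2 TPS in, 1 TPS out (W = t/2)
print(overload_response_time(300, 4, 1)) # 225 s (W = 3t/4)
print(overload_response_time(300, 6, 5)) # 50 s  (W = t/6)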
In simple words, if you want to analyze your JMeter report...
Start with server CPU and RAM utilization. When you run a performance test on your server, see how much CPU and RAM are utilized by the current test.
Issue the following command on the server hosting the site; it will create a log file of CPU usage.
while true; do
( echo "%CPU %MEM ARGS $(date)" &&
ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 |
tail ) >> ps.log
sleep 1
done
Look at the overall response time; it should not exceed your expected response time criteria.
See the image below. My expectation is that the response time should not go above 525 microseconds, but some requests are crossing it. Find the kinds of requests that are taking the most time.
Overall Response Times:
Look at Transactions per second: how many transactions are made per second, and is there any drop during the test time frame?
Inspect the summary report, Average time, and max time to see which requests are taking the most time.
Many listeners are currently available in JMeter, as add-ons or built in, but these are the major things to look at in order to get a proper picture of what's going on; you can use the other reports in the same way.
Follow my blog for more details https://softwaretesterfriend.blogspot.in/
Starting with version 3.0, JMeter includes a dynamic HTML report that can be generated either at the end of a load test or from a result file.
See generating-dashboard
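For example, the dashboard can be generated from an existing results file using JMeter's -g and -o options; a minimal sketch driving that from Python (file and folder names are placeholders, and jmeter is assumed to be on the PATH):

import subprocess

# Generate the HTML dashboard from an existing results file.
# -g: existing JTL/CSV results file, -o: output folder (must be empty or not yet exist).
subprocess.run(["jmeter", "-g", "results.jtl", "-o", "dashboard"], check=True)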
In order to analyze your JMeter results, you can use
Listeners in JMeter
Blazemeter Sense
Reports Dashboard
In addition to all the other answers: there is a nice site from BlazeMeter where you can upload your test result file (.jtl) and it will generate all kinds of (interactive) reports for it. It even analyzes it for you and points out when the first error occurs, what the saturation point is, etc. https://sense.blazemeter.com/gui/
If you have a Graphite/Grafana infrastructure, I can recommend adding the Backend Listener to the project. It will send real-time metrics to the Graphite server, and you can monitor the test in Graphite (or Grafana).
If you are new to JMeter, understanding JMeter listeners and other components will help you. Check the tutorial:
- https://www.youtube.com/watch?v=FfDVIklNjgw
I have a request that takes more than 30 seconds and it breaks.
What is the solution for this? I am not sure whether adding more dynos will fix it.
Thanks
You should probably read the Heroku Dev Center article regarding this, as the information there will be more helpful; here's a small summary:
To answer the timeout question:
Cedar supports long-polling and streaming responses. Your app has an initial 30 second window to respond with a single byte back to the client. After each byte sent (either received from the client or sent by your application) you reset a rolling 55 second window. If no data is sent during the 55 second window your connection will be terminated.
(That is, if you had Cedar instead of Aspen or Bamboo you could send a byte every thirty seconds or so just to trick the system. It might work.)
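A rough sketch of that idea, assuming a Python/Flask app (the framework, route, and timings are assumptions for illustration, not from the Heroku docs):

import time
from flask import Flask, Response

app = Flask(__name__)

@app.route("/slow")
def slow():
    def generate():
        # Simulate a long task: emit one whitespace byte every ~30 seconds so the
        # rolling 55-second window never expires, then send the real payload.
        for _ in range(10):
            time.sleep(30)
            yield " "
        yield "done\n"
    return Response(generate(), mimetype="text/plain")

if __name__ == "__main__":
    app.run()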
To answer your dynos question:
Additional concurrency is of no help whatsoever if you are encountering request timeouts. You can crank your dynos to the maximum and you'll still get a request timeout, since it is a single request that is failing to serve in the correct amount of time. Extra dynos increase your concurrency, not the speed of your requests.
(That is, don't bother adding more dynos.)
On request timeouts:
Check your code for infinite loops, and see whether you're doing something big. If so, you should move this heavy lifting into a background job which can run asynchronously from your web request. See Queueing for details.
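A minimal illustration of that pattern in Python with the RQ library (the queue library, job function, and URL are assumptions for the example; the Heroku article describes the general approach):

from redis import Redis
from rq import Queue

q = Queue(connection=Redis())

# In the web request handler: enqueue the slow work and return immediately,
# instead of doing it inline and hitting the 30-second request timeout.
# "tasks.fetch_report" is a placeholder for any importable function.
job = q.enqueue("tasks.fetch_report", "https://example.com/big-export")
print(job.id)  # poll for the result or notify the client later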