We have TIMEOUT 3600 used across many execute_process calls.
Some of our slow servers often exceed the timeout and builds fail as expected.
However, this causes problems often enough that we keep increasing the timeout values.
Now I'm considering removing the timeout completely.
Is there a reason for using the TIMEOUT option other than making the build fail?
Why should we have TIMEOUT at all (e.g., none of the add_custom_target commands has this feature)? Why does only execute_process support it (this question is not about CTest)?
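For reference, the pattern in question looks roughly like this (the command and variable names are placeholders, not our real ones):

# The external command is killed after 3600 s; RESULT_VARIABLE then holds an
# error string instead of an exit code, so the check below fails the configure step.
execute_process(
  COMMAND ${SLOW_TOOL} --do-work
  RESULT_VARIABLE slow_tool_result
  TIMEOUT 3600
)
if(NOT slow_tool_result EQUAL 0)
  message(FATAL_ERROR "slow tool failed or timed out: ${slow_tool_result}")
endif()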
Turning my comment into an answer
Generally speaking, I would remove the individual timeouts.
I have a timeout for the complete build process running on a Jenkins server. There it calculates an average over the last builds, and you can give a percentage by which your build may differ (I use a "Timeout as a percentage of recent non-failing builds" of 400%, so only the really hanging tasks get terminated).
Related
I have some tests that, on my local workstation, produce the following Bazel WARNING:
WARNING: //okapi/rendering/sensor:film_test: Test execution time
(25.1s excluding execution overhead) outside of range for MODERATE
tests. Consider setting timeout="short" or size="small".
When I change the size of the test to small, e.g. buildozer 'set size small' //okapi/rendering/sensor:film_test,
my CI job fails with a timeout:
//okapi/rendering/sensor:film_test
TIMEOUT in 60.1s
/home/runner/.cache/bazel/_bazel_runner/7ace7ed78227d4f92a8a4efef1d8aa9b/execroot/de_vertexwahn/bazel-out/k8-fastbuild/testlogs/okapi/rendering/sensor/film_test/test.log
My CI job runs on GitHub via GitHub-hosted runners; those runners are slower than my local dev box.
What is the best practice here? Always choose the test size according to CI and ignore the Bazel warnings on my local machine? Get a better CI?
Get a better CI?
One of the main purposes of software testing is to simulate the software's behavior in an environment that reliably represents the production environment. The better the representation, the easier it is to spot possible issues and fix them before the software is deployed. My opinion is that you are more qualified than any of us to say whether the CI you are currently using is a reliable representation of the production environment of your software.
//okapi/rendering/sensor:film_test: Test execution time (25.1s excluding execution overhead)
You can always recheck whether your test target is packed correctly, i.e. ask yourself whether all of these tests really belong to a single test target. What would be gained/lost if they were divided into several test targets?
My CI Job is running on GitHub via GitHub-hosted runners - those runners are slower than my local dev box.
Have you tried employing test sharding?
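With sharding, each shard runs as its own test action with its own timeout budget, so a slow runner only has to fit a slice of the test into the limit (assuming the test framework supports Bazel's sharding protocol). For example, the shard count is just another attribute; the value here is purely illustrative:

buildozer 'set shard_count 4' //okapi/rendering/sensor:film_test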
Size vs timeout
When it comes to test targets and their execution, for Bazel it is a question of the underlying resources (CPU, RAM, ...) and their utilization.
For that purpose, Bazel exposes two main test attributes: size and timeout.
The size attribute is mainly used to define how many resources are needed to execute the test, but Bazel also uses the size attribute to determine the default timeout of a test. The timeout value can be overridden by specifying the timeout attribute.
When using the timeout attribute you are specifying both a minimum and a maximum execution time for a test (exceeding the maximum kills the test, while falling below the minimum only triggers a warning like the one you quoted). In Bazel 6.0.0 those values in seconds are:
0 <= short <= 60
30 <= moderate <= 300
300 <= long <= 900
900 <= eternal <= 3600
Since at the time of writing this answer the BUILD file is not shown, I'm guessing that your test target has at least one of these (note that size = "medium" is the default setting if the attribute is not specified):
size = "medium" or timeout = "moderate"
"All combinations of size and timeout labels are legal, so an "enormous" test may be declared to have a timeout of "short". Presumably it would do some really horrible things very quickly."
There is another option that I don't see used very often, but it might be helpful in your case: specifying a --test_timeout value, as written here:
"The test timeout can be overridden with the --test_timeout bazel flag when manually running under conditions that are known to be slow. The --test_timeout values are in seconds. For example, --test_timeout=120 sets the test timeout to two minutes."
One last option is, just as you wrote, to simply ignore the Bazel test timeout warnings (the ones produced by --test_verbose_timeout_warnings).
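For example, you could keep the size that matches your local machine and only relax the limit on the slow CI runners (the 120 s value is just an illustration):

# one-off on the command line
bazel test --test_timeout=120 //okapi/rendering/sensor:film_test

# or in a bazelrc file that only the CI job imports
test --test_timeout=120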
I'm trying to write a policy extension in Lua for Citrix NetScaler that calculates the Base64 of a string and adds it to a header. Most of the time the function works just fine, but more than a few times I see in ns.log that its execution was terminated with the following message:
terminating execution, function exceeded time limit.
Nowhere in the docs could I find what exactly this time limit is (from what I saw it's about 1 ms, which makes no sense to me) or how to configure it.
So my question is: is this property configurable, and if so, how?
Why do you need to go Lua? With policy expressions you can do text.b64encode or text.b64decode. I am not answering your question, but you might not be aware of the built-in encoders/decoders in NetScaler.
Although I don't have any official document, I believe that the max execution time is 10 ms for any policy. This is also somewhat confirmed by the counters. Execute the following command on the shell:
nsconmsg -K /var/nslog/newnslog -d stats -g 10ms -g timeout
You will see all counters with these names. While your script is executing you can run
nsconmsg -d current -g 10ms -g timeout
This will let you see the counters in real time. When it fails you will see the value incrementing.
I would say that you can print the time to ns.log while your script runs to confirm this. I don't know exactly what you are doing, but keep in mind that NetScaler policies are supposed to be very brief executions; after all, you are working on a packet, and the time scale for packets is on the order of nanoseconds.
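A minimal timing sketch in plain Lua, assuming os.clock() is available inside the NetScaler sandbox and using print() as a stand-in for whatever actually reaches ns.log in your setup:

-- Wrap the existing encode step and log how long it takes.
local function timed_b64(input)
    local start = os.clock()
    local encoded = my_b64encode(input)  -- stand-in for your existing Base64 code
    print(string.format("b64 encode took %.3f ms", (os.clock() - start) * 1000))
    return encoded
end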
I have a query that sometimes takes seconds to get a key from Redis.
Redis INFO shows used_memory is 2 times larger than used_memory_rss, and the OS starts to use swap.
After cleaning up the useless data, used_memory is lower than used_memory_rss and everything is fine again.
What confuses me is: if a query took around 10 seconds and blocked other queries to Redis, that would cause serious problems for other parts of the app, but the app seems fine.
And I cannot find any of these long queries in the slow log, so I checked the Redis SLOWLOG command documentation, and it says:
The execution time does not include I/O operations like talking with the client, sending the reply and so forth, but just the time needed to actually execute the command (this is the only stage of command execution where the thread is blocked and can not serve other requests in the meantime)
So does this mean the execution of the query is normal and not blocking any other queries? What happens to a query when memory is not enough, and what leads to such a long query time? Which part of these queries takes so long, given that the time to "actually execute the command" is not long enough to get into the slowlog?
Thanks!
When memory is not enough, Redis will definitely slow down, as it will start swapping. You can use INFO to report the amount of memory Redis is using, and you can even set a maximum limit on memory usage, using the maxmemory option in the config file, to put a cap on the memory Redis can use. If this limit is reached, Redis will start to reply with an error to write commands (but will continue to accept read-only commands).
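For example (the 2gb value is only an illustration):

redis-cli INFO memory                    # compare used_memory with used_memory_rss
redis-cli CONFIG SET maxmemory 2gb       # or put "maxmemory 2gb" in redis.conf
redis-cli CONFIG GET maxmemory-policy    # with the default noeviction, writes get errors once the limit is hit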
I've recently been reading up on messaging systems and have specifically looked at both RabbitMQ and NServiceBus. As I understand it, if a message fails for some reason it is tried again immediately a number of times. Both systems then offer the possibility to try again later, for example in 5 seconds. When the five seconds have passed, the message is sent again a number of times.
I quote Vaughn Vernon in Implementing Domain-Driven Design (p.502):
The other way to handle this is to simply retry the send until it succeeds, perhaps using a Capped Exponential Back-off. In the case of RabbitMQ, retries could fail for quite a while. Thus, using a combination of message NAKs and retries could be the best approach. Still, if our process retries three times every five minutes, it could be all we need.
For NServiceBus, this is called second level retries, and when the retry happens, it happens multiple times.
Why does it need to happen multiple times? Why does it not retry once every five minutes? What is the chance that the first retry after five minutes fails and the second retry, probably just milliseconds later, should succeed?
And if it does not need to because of some configuration (does it?), why do all the examples I have found use multiple retries?
My background is NServiceBus so my answer may be couched in those terms.
First level retries are great for very transient errors. Deadlocks are a perfect example of this. You try to change the database, and your transaction is chosen as the deadlock victim. In these cases, a first level retry is perfect. Most of the time, one first level retry is all you need. If there is a lot of contention in the database, maybe 2 or 3 retries will be good enough.
Second level retries are for your less transient errors. Think about things like a web service being down for 10 seconds, or a SQL Server database in a failover cluster switching over, which can take 30-60 seconds. If you retry a few milliseconds later, it's not going to do you any good, but 10, 20, 30 seconds later you might have a good shot.
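In code terms, the delayed retries boil down to something like the capped exponential back-off loop below (illustrative values and names, not the NServiceBus or RabbitMQ API):

using System;
using System.Threading.Tasks;

static class DelayedRetry
{
    // Retries trySend with delays of 5s, 10s, 20s, 40s, then capped at 60s,
    // and lets the exception propagate once maxAttempts is exhausted.
    public static async Task SendWithBackoffAsync(Func<Task> trySend, int maxAttempts = 8)
    {
        var delay = TimeSpan.FromSeconds(5);
        var cap = TimeSpan.FromSeconds(60);
        for (var attempt = 1; ; attempt++)
        {
            try { await trySend(); return; }
            catch when (attempt < maxAttempts)
            {
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(Math.Min(delay.Ticks * 2, cap.Ticks));
            }
        }
    }
}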
However, the crux of the question is after 5 first level retries and then a delay, why try again 5 times before an additional delay?
First, on your first second-level retry, it's still possible that you could get a deadlock or other very transient error. After all, the goal is usually not to make as slow a system as possible, so it would be preferable not to have to wait an additional delay before retrying if the problem is truly transient. Of course, there's no way for the infrastructure to know just how transient the problem is.
The second reason is that it's just easier to configure if they're all the same. X levels of retry and Y tries per level = X*Y total tries and only 2 numbers in the configuration file. In NServiceBus, it's these 2 values plus the back-off time span, so the config looks like this:
<SecondLevelRetriesConfig Enabled="true" TimeIncrease="00:00:10" NumberOfRetries="3" />
<TransportConfig MaxRetries="3" />
That's fairly simple. Try 3 times. Wait 10 seconds. Try 3 times. Wait 20 seconds. Try 3 times. Wait 30 seconds. Try 3 times. Then you're done and you move on to an error queue.
Configuring different values for each level would require a much more complex config story.
First Level Retries exist to compensate for quick issues like networking and database locks. This is configurable in NSB, so if you don't want them, you can turn them off. Second Level Retries are to compensate for longer outages. For example we use SLRs to compensate for a database that recycles every night at the same time.
The OOTB functionality increases the duration between SLRs because it assumes that if it didn't work the previous time, you will need more time to fix it. There exists a Retry Policy that is overridable, so you can change how the SLRs work.
In NSB, the FLRs always come first, and SLRs don't come into play unless the transaction is still failing after the FLRs. In addition, you can disable SLRs altogether and build your own custom Fault Manager, which has additional functionality. We have a process where a Fault Manager sends issues to a staffed help desk, as that is the only way to solve a particular subset of issues.
On a client, a "Timeout" error is raised when running some commands against the database.
My first attempt at a fix is to increase the CommandTimeout to 99999 ... but I am afraid that this treatment will generate further problems.
Have you experienced this ...?
I wonder whether my question is relevant, and/or whether there is a more robust and elegant correction.
You are correct to assume that upping the timeout is not the correct approach. Typically, I look for long-running queries that are running around the timeout threshold. They will typically stand out in the areas of duration and reads.
Then I'll work to reduce the query run time using this method:
https://www.simple-talk.com/sql/performance/simple-query-tuning-with-statistics-io-and-execution-plans/
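In short, that approach turns on I/O and time statistics for the suspect statement so you can see where the reads and the time actually go, roughly like this (the query itself is a placeholder):

-- Replace the SELECT with the statement that is hitting the timeout.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT *
FROM dbo.Orders            -- hypothetical table
WHERE OrderDate >= '2014-01-01';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;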
If it's a report causing issues and you can't get it running faster, you may need to start thinking about setting up a reporting database.
CommandTimeout is the time that the client waits for a response from the server. If the query runs in the main VCL thread, then the whole application is "frozen" and might be marked "not responding" by Windows. So, would you expect your users to wait at a frozen app for 99999 seconds?
Generally, leave the timeout values at their defaults and rather concentrate on tuning the queries, as Sam suggests. If you happen to have long-running queries (i.e. some background data movement, calculations etc. in stored procedures), set the CommandTimeout to 0 (= INFINITE) but run them in a separate thread.
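A minimal sketch of that last suggestion, assuming ADO components (the connection string handling and the stored procedure name are placeholders):

uses
  System.Classes, Winapi.ActiveX, Data.Win.ADODB;

procedure RunLongQueryInBackground(const ConnStr: string);
begin
  TThread.CreateAnonymousThread(
    procedure
    var
      Conn: TADOConnection;
      Qry: TADOQuery;
    begin
      CoInitialize(nil);                    // ADO needs COM initialized per thread
      Conn := TADOConnection.Create(nil);
      Qry := TADOQuery.Create(nil);
      try
        Conn.ConnectionString := ConnStr;
        Conn.LoginPrompt := False;
        Conn.Open;
        Qry.Connection := Conn;
        Qry.CommandTimeout := 0;            // 0 = wait indefinitely
        Qry.SQL.Text := 'EXEC dbo.NightlyRecalc';  // hypothetical long-running procedure
        Qry.ExecSQL;
      finally
        Qry.Free;
        Conn.Free;
        CoUninitialize;
      end;
    end).Start;
end;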