Our DBA team (me) recently took over our Redis infrastructure: a cluster with 8 primary and 8 secondary servers, along with 20-30 other standalone and Sentinel instances.
As a paranoid DBA, one of the first things I wanted to do was set up a scheduled backup, using a simple script I wrote that runs SAVE and then moves the resulting RDB file to a new location. Since scheduling this script to run each night at 8 pm, we have seen multiple nodes in the cluster fail over to their secondaries, which I then have to fail back manually. There are also automatic saves happening via the configured save settings, so I may just abandon explicit backups altogether, but I'm curious what's going on. Would doing BGSAVE instead solve the problem? A sketch of what that would look like follows the log excerpt below.
October 13th 2021, 20:00:07.176 * DB saved on disk
October 13th 2021, 20:00:11.179 # Client id=374070 addr=10.100.0.151:48416 fd=25 name= age=14 idle=0 flags=P db=0 sub=15 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=6 omem=101109904 events=rw cmd=subscribe scheduled to be closed ASAP for overcoming of output buffer limits.
October 13th 2021, 20:00:16.181 * FAIL message received from 6c4a3e7c46d31d123dd5b55b662d323d51467109 about 866744a3afd0c83cbb91234313ff6ae3c4a9dfec
October 13th 2021, 20:00:18.182 * Clear FAIL state for node 866744a3afd0c83cbb91234313ff6ae3c4a9dfec: replica is reachable again.
October 13th 2021, 20:00:18.182 * FAIL message received from c4f0c574aaaa1309b6ec514a9645a5f6e3238e34 about 06de5562d113d2c36977cbf581b013487d19ee4e
October 13th 2021, 20:00:20.183 * Clear FAIL state for node 06de5562d113d2c36977cbf581b013487d19ee4e: replica is reachable again.
October 13th 2021, 20:00:20.183 * FAIL message received from b9f16fd70418a297796cb0a5ad6cad312aa6d784 about 8dd6dc1ff3105fbcd49e6c541c605f2b9d364952
October 13th 2021, 20:00:20.183 # Cluster state changed: fail
October 13th 2021, 20:00:21.183 # Cluster state changed: ok
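For reference, a sketch of what the BGSAVE variant of the backup script might look like; the host, port, and paths are placeholders for whatever the real script uses:

#!/bin/sh
# Remember the timestamp of the last completed save, then start a background
# save; BGSAVE forks, so the node keeps serving traffic and cluster pings.
LAST=$(redis-cli -h 127.0.0.1 -p 6379 LASTSAVE)
redis-cli -h 127.0.0.1 -p 6379 BGSAVE

# Wait until LASTSAVE advances, i.e. the background save has finished.
while [ "$(redis-cli -h 127.0.0.1 -p 6379 LASTSAVE)" = "$LAST" ]; do
    sleep 5
done

# Copy (rather than move) the finished RDB so the live dump.rdb stays in place.
cp /var/lib/redis/dump.rdb "/backups/dump-$(date +%F).rdb"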
The client-output-buffer-limit message above seems like the likely issue. I will try updating the client output buffer settings and see what happens.
See: Redis replication and client-output-buffer-limit
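A minimal sketch of that change, assuming it is the pubsub class that needs more headroom (the client killed in the log above was a subscriber); the limits shown are placeholders, not recommendations:

# Raise the pubsub output buffer limits at runtime (class, hard limit,
# soft limit, soft-limit seconds), then persist the change to redis.conf.
redis-cli CONFIG SET client-output-buffer-limit "pubsub 256mb 128mb 60"
redis-cli CONFIG REWRITE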
I am using our enterprise's Splunk forwarder, which seems to be logging events into Splunk like this, making the Splunk logs a bit difficult to read.
{"log":"[https-jsse-nio-8443-exec-5] 19 Jan 2021 15:30:57,237+0000 UTC INFO rdt.damien.services.CaseServiceImpl CaseServiceImpl :: showCase :: Case Created \n","stream":"stdout","time":"2021-01-19T15:30:57.24005568Z"}
However, other orgs in our sibling enterprise log to Splunk like the example below, which is far more readable. (There is no technical relationship between us and them, so we can't lean on their support to triage this.)
[http-nio-8443-exec-7] 15 Jan 2021 21:08:49,511+0000 INFO DaoOImpl [{applicationSystemCode=dao-app, userId=ANONYMOUS, webAnalyticsCorrelationId=|}]: This is a sample log
Please note the difference in logs (mine vs other):
{"log":"[https-jsse-nio-8443-exec-5]..
vs
[http-nio-8443-exec-7]...
Our enterprise team is struggling to determine what causes this. I checked my app.log (written with Log4j), which looks fine and doesn't have the aforementioned {"log": ...} wrapper:
[https-jsse-nio-8443-exec-5] 19 Jan 2021 15:30:57,237+0000 UTC INFO rdt.damien.services.CaseServiceImpl CaseServiceImpl :: showCase :: Case Created
Could someone guide me as to where the problem or configuration might lie that is causing the Splunk forwarder to send logs to Splunk in the {"log": ...} format? I thought it had something to do with JSON versus RAW input types, but I don't understand whether that is the cause and, if it is, which configs drive it.
Over the course of the investigation, I found that it is not Splunk doing this but rather the Docker container. Docker defaults to the json-file logging driver, which writes container output to the /var/lib/docker/containers folder in files with a *-json.log suffix, and those files wrap every event in the {"log": <EVENT NAME>} format.
I need to figure out how to change the Docker logging driver so that the logs are written in a non-JSON format.
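For what it's worth, a rough sketch of the two usual ways to switch the driver; the choice of journald and the image name below are just examples, not a recommendation, and whatever reads the logs (the forwarder) then has to pick them up from the new location:

# Daemon-wide default: set the logging driver in /etc/docker/daemon.json
# (if the file already exists, merge this key in rather than overwriting it),
# then restart Docker; existing containers must be recreated to pick it up.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "journald"
}
EOF
sudo systemctl restart docker

# Or override the driver for a single container at run time:
docker run --log-driver=journald my-app-image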
I am aiming to GPU-mine Ethereum on a Windows 10 PC with two Radeon RX 590 cards.
geth version is
1.9.9-stable-01744997
cmd call to start geth:
geth --rpc --syncmode "fast" --cache 4096 --etherbase [ADR] --datadir "[MyDataDir]" --mine --minerthreads 0
The blockchain is up to date and everything seems fine on the geth side.
The miner used is
Claymore's Dual GPU Miner - v15.0
cmd to start miner:
EthDcrMiner64.exe -epool http://127.0.0.1:8545 -mode 1 -tt 75
Now the miner starts and seems to begin mining. The GPUs show they are under heavy load.
Once the miner is running, it only ever outputs something like this (plus some GPU info every once in a while):
ETH: 12/21/19-15:46:33 - New job from 127.0.0.1:8545
ETH - Total Speed: 21.345 Mh/s, Total Shares: 0, Rejected: 0, Time: 45:52
ETH: GPU0 10.665 Mh/s, GPU1 10.680 Mh/s
So this looks good.
In the geth console meanwhile I get this output:
INFO [12-21|15:46:35.446] Imported new chain segment blocks=1 txs=74 mgas=9.921 elapsed=159.999ms mgasps=62.007 number=9141165 hash=05972d…032349 dirty=1019.58MiB
INFO [12-21|15:46:35.459] Commit new mining work number=9141166 sealhash=35129c…59de27 uncles=0 txs=0 gas=0 fees=0 elapsed=999.3µs
INFO [12-21|15:46:35.720] Commit new mining work number=9141166 sealhash=3788e2…df83fc uncles=0 txs=39 gas=9922304 fees=0.0347883012 elapsed=261.998ms
WARN [12-21|15:46:36.032] Served eth_submitHashrate conn=127.0.0.1:54083 reqid=6 t=0s err="the method eth_submitHashrate does not exist/is not available"
INFO [12-21|15:46:38.548] Commit new mining work number=9141166 sealhash=7451f4…69a431 uncles=0 txs=72 gas=9911680 fees=0.04369322037 elapsed=89.942ms
WARN [12-21|15:46:41.120] Served eth_submitHashrate conn=127.0.0.1:54083 reqid=6 t=0s err="the method eth_submitHashrate does not exist/is not available"
There is this warning/error message:
err="the method eth_submitHashrate does not exist/is not available"
But it also states "Commit new mining work".
I am quite unsure now: am I actually mining, or am I only wasting electricity because the work is never committed?
You have not connected to any mining pools, only to your own geth node, meaning that you are mining alone and competing against the whole world. When mining solo, you have no shares. You either mine a whole block or get nothing. It is extremely hard to mine all alone, thus it is advised to join a mining pool. Claymore-Dual-Miner (CDM) has a list of available mining pool alternatives.
Also, when mining solo with CDM, you can miss mined messages because in this mode, it uses HTTP protocol instead of the Stratum pool protocol. You can manually check your balance on etherscan at any time, though.
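If you do point CDM at a pool, the invocation looks roughly like this; the pool host, wallet address, and worker name are placeholders, not a recommendation:

:: Stratum connection to a pool instead of the local geth HTTP endpoint.
EthDcrMiner64.exe -epool stratum+tcp://eu1.ethermine.org:4444 -ewal 0xYourEthAddress.rig1 -epsw x -mode 1 -tt 75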
Use PhoenixMiner 5e with the -rate 2 option. It will stop showing this error.
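A sketch of what that looks like against the same local geth node; apart from -rate 2, the flags simply mirror the setup in the question and are placeholders:

:: -rate 2 switches the hashrate-reporting method, which (per the note above) stops the error.
PhoenixMiner.exe -pool http://127.0.0.1:8545 -rate 2 -tt 75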
During a transition, our S3 costs jumped a lot due to ListBucket and HeadObject calls. We are trying to figure out how to debug the sudden increase. We made some changes that should NOT have affected it, but the major changes seem to be:
10-20X increase in HeadObject calls
Sudden appearance of ListBucket calls
I have attached a chart showing the jump between April 10, 2018 and April 14, 2018. On the dates in between, we made the following changes:
Changed from (debian 8) S3FS v1.61 (super old from 2012, not even in Github) to v1.84 (latest)
https://github.com/s3fs-fuse/s3fs-fuse
Moved from the N. Virginia region to the N. California region (10% higher cost)
The giant yellow bars show the files being moved with the AWS CLI (April 11 to 13).
To try to calm this down, we added the following options to the mount entry in /etc/fstab:
noatime,stat_cache_expire=3600,enable_noobj_cache
The bars that look uneven starting Apr 14 are now stable around $25/day
The options below were already there from the start (no change); the combined entry is shown after them:
_netdev,allow_other,use_cache=/tmp,umask=0000,use_path_request_style,ensure_diskfree=10240
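For reference, the combined /etc/fstab entry now looks roughly like this; the bucket name and mount point are placeholders:

s3fs#my-bucket /mnt/s3fs fuse _netdev,allow_other,use_cache=/tmp,umask=0000,use_path_request_style,ensure_diskfree=10240,noatime,stat_cache_expire=3600,enable_noobj_cache 0 0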
We have done the following to try to debug this:
Enabled S3 Logging
Dumped the logs into Athena and then CSV export into MySQL
These logs are just one day's worth.
Screenshot "query 1" shows that there is 4.8m hits into a path ... basically, we think it is traversing the entire directory tree (with most like about 100k files) looking for a file if it exists
Screenshot "query 2" shows the same thing (kind of) where it is also doing down a path
We're not really sure what else to do. Our normal bill of about $5/day (including other services) is now about $25/day (a 5x increase). With the /etc/fstab changes it is down to $13/day, but we're still trying to get it back to $5/day, which means getting back to zero ListBucket calls and about 20% of the current HeadObject calls.
Any ideas on what to try greatly appreciated.
The ListBucket and HeadObject API calls were being made by updatedb (and locate).
Solution: Add your mount point (in my case /mnt/s3fs) to PRUNEPATHS in /etc/updatedb.conf so updatedb does not include it when it scans.
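For example (the existing paths below are only illustrative of a stock updatedb.conf; the point is appending the s3fs mount point):

# /etc/updatedb.conf
PRUNEPATHS="/tmp /var/spool /media /mnt/s3fs"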
https://linux.die.net/man/5/updatedb.conf
https://github.com/s3fs-fuse/s3fs-fuse/issues/193#issuecomment-109617253
I have a job that periodically does some work involving ServerXmlHttpRequest to perform an HTTP POST. The job runs every 60 seconds.
And normally it runs without issue. But there's about a 1 in 50,000 chance (every two or three months) that it will hang:
IXMLHttpRequest http = new ServerXmlHttpRequest();
http.open("POST", deleteUrl, false, "", "");
http.send(stuffToDelete); <---hang
When it hangs, not even the Task Scheduler (with the option enabled to kill the job if it takes longer than 3 minutes to run) can end the task. I have to connect to the remote customer's network, get on the server, and use Task Manager to kill the process.
And then it's good for another month or three.
Eventually I started using Task Manager to create a process dump so I could analyze where the hang is. After five crash dumps (over the last 11 months or so) I get a consistent picture:
ntdll.dll!_NtWaitForMultipleObjects#20()
KERNELBASE.dll!_WaitForMultipleObjectsEx#20()
user32.dll!MsgWaitForMultipleObjectsEx()
user32.dll!_MsgWaitForMultipleObjects#20()
urlmon.dll!CTransaction::CompleteOperation(int fNested) Line 2496
urlmon.dll!CTransaction::StartEx(IUri * pIUri, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4453 C++
urlmon.dll!CTransaction::Start(const wchar_t * pwzURL, IInternetProtocolSink * pOInetProtSink, IInternetBindInfo * pOInetBindInfo, unsigned long grfOptions, unsigned long dwReserved) Line 4515 C++
msxml3.dll!URLMONRequest::send()
msxml3.dll!XMLHttp::send()
Contoso.exe!FrobImporter.TFrobImporter.DeleteFrobs Line 971
Contoso.exe!FrobImporter.TFrobImporter.ImportCore Line 1583
Contoso.exe!FrobImporter.TFrobImporter.RunImport Line 1070
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.HandleFrobImport Line 433
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.CoreExecute Line 71
Contoso.exe!CommandLineProcessor.TCommandLineProcessor.Execute Line 84
Contoso.exe!Contoso.Contoso Line 167
kernel32.dll!#BaseThreadInitThunk#12()
ntdll.dll!__RtlUserThreadStart()
ntdll.dll!__RtlUserThreadStart#8()
So I do a ServerXmlHttpRequest.send, and it never returns. It will sit there for days, causing the system to miss financial transactions, until come Sunday night I get a call that it's broken.
It probably won't help unless someone knows how to debug at this level, but the registers in the stalled thread at the time of the dump are:
EAX 00000030
EBX 00000000
ECX 00000000
EDX 00000000
ESI 002CAC08
EDI 00000001
EIP 732A08A7
ESP 0018F684
EBP 0018F6C8
EFL 00000000
Windows Server 2012 R2
Microsoft IIS/8.5
Default timeouts of ServerXmlHttpRequest
You can use serverXmlHttpRequest.setTimeouts(...) to configure the four classes of timeouts (a sketch follows the list):
resolveTimeout: The value is applied to mapping host names (such as "www.microsoft.com") to IP addresses; the default value is infinite, meaning no timeout.
connectTimeout: A long integer. The value is applied to establishing a communication socket with the target server, with a default timeout value of 60 seconds.
sendTimeout: The value applies to sending an individual packet of request data (if any) on the communication socket to the target server. A large request sent to a server will normally be broken up into multiple packets; the send timeout applies to sending each packet individually. The default value is 30 seconds.
receiveTimeout: The value applies to receiving a packet of response data from the target server. Large responses will be broken up into multiple packets; the receive timeout applies to fetching each packet of data off the socket. The default value is 30 seconds.
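A sketch in the same pseudo-code style as the snippet in the question, setting explicit timeouts before the send; the millisecond values are examples only:

IXMLHttpRequest http = new ServerXmlHttpRequest();
http.setTimeouts(5000, 10000, 30000, 30000);  // resolve, connect, send, receive (milliseconds)
http.open("POST", deleteUrl, false, "", "");
http.send(stuffToDelete);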
KB305053 (a server that decides to keep the connection open will cause serverXmlHttpRequest to wait for the connection to close) seems like it could plausibly be the issue. But the 30-second default receive timeout should have taken care of that.
Possible workaround - Add myself to a Job
The Windows Task Scheduler is unable to terminate the task, even though the option to do so is enabled.
I will look into using the Windows Job API to add my own process to a job, and use SetInformationJobObject to set a time limit on my process:
CreateJobObject
AssignProcessToJobObject
SetInformationJobObject
to limit my process to three minutes of execution time:
PerProcessUserTimeLimit
If LimitFlags specifies JOB_OBJECT_LIMIT_PROCESS_TIME, this member is the per-process user-mode execution time limit, in 100-nanosecond ticks. Otherwise, this member is ignored. The system periodically checks to determine whether each process associated with the job has accumulated more user-mode time than the set limit. If it has, the process is terminated. If the job is nested, the effective limit is the most restrictive limit in the job chain.
Although, since Task Scheduler itself uses Job objects to limit a task's run time, I'm not hopeful that a Job object can limit it either.
Edit: Job objects cannot limit a process by process time - only user time. And with a process idle waiting for an object, it will not accumulate any user time - certainly not three minutes worth.
Bonus Reading
How can a ServerXMLHTTP GET request hang? (GET, not POST)
KB305053: ServerXMLHTTP Stops Responding When You Send a POST Request (which says the timeout should expire; where mine does not)
MS Forums: oHttp.Send - Hangs (HEAD, not POST)
MS Forums: ASP to test SOAP WebService using MSXML2.ServerXMLHTTP Send hangs
CC to MS Support Forums
Consider switching to a newer, supported API.
msxml6.dll using MSXML2.ServerXMLHTTP.6.0
winhttpcom.dll using WinHttp.WinHttpRequest.5.1.
The msxml3.dll library is no longer supported and is only kept around for compatibility reasons. Plus, there were a number of security and stability improvements included with msxml4.dll (and newer) that you are missing out on.
I have a process that tries to make an SSL connection after start up, but that fails if the clock has not yet been set (the dates don't match the effective dates on the certificates). Is it possible to configure upstart to only start the process after the internal clock is set?
The default setting for the clock is 2010-01-01, so perhaps something like date >= 2014 is sufficient (obviously not legit upstart syntax, but the concept holds).
The best I could figure out was to start after NTP has started, but that doesn't necessarily mean the clock has been set, as the network connection may be delayed or unavailable for a while.
The simple solution is probably to just poll the date and wait 500ms or whatever before trying again if the date isn't sane yet.
Here's what I ended up doing:
start on started connman
stop on runlevel [016]
script
    # Wait until the clock reports a sane year before starting the service.
    YEAR=$(date +'%Y')
    until [ "$YEAR" -ge 2014 ]; do
        sleep 5
        YEAR=$(date +'%Y')
    done
    python access_point.py
end script
I wait until the connection manager is running and then I check the year every 5 seconds until the year is 2014 or greater.