Why can't the security context be found? - Ignite

There are 6 server nodes and 4 client nodes in the cluster. When the cluster is first started, servers 5 and 6 cannot find the security context of client 4. After the cluster restarts, server 6 cannot find the security context of client 2.
This is the only kind of exception in the log; there are no others. Why can't the security context be found?
All nodes are restarted sequentially. The problem occurs in the production environment and cannot be reproduced in the test environment.
2022 Aug 09 20:53:51:378 GMT +08 cep-data-010.ds-cache6 ERROR [sys-stripe-41-#42%cep-data-010.ds-cache6%] - [org.apache.ignite] Failed to obtain a security context.
java.lang.IllegalStateException: Failed to find security context for subject with given ID : be1fded5-1450-4fc6-b16f-1c580899db2f
at org.apache.ignite.internal.processors.security.IgniteSecurityProcessor.withContext(IgniteSecurityProcessor.java:167)
at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1908)
at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1530)
at org.apache.ignite.internal.managers.communication.GridIoManager.access$5300(GridIoManager.java:243)
at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1423)
at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:637)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
at java.base/java.lang.Thread.run(Thread.java:834)
2022 Aug 09 20:53:51:383 GMT +08 cep-data-010.ds-cache6 INFO [disco-event-worker-#351%cep-data-010.ds-cache6%] - [org.apache.ignite] Added new node to topology: TcpDiscoveryNode [id=be1fded5-1450-4fc6-b16f-1c580899db2f, consistentId=cep-master-017.ds-realtrail2, addrs=ArrayList [192.168.229.9], sockAddrs=HashSet [cep-master-017/192.168.229.9:0], discPort=0, order=16, intOrder=16, lastExchangeTime=1660049631321, loc=false, ver=2.13.0#20220420-sha1:551f6ece, isClient=true]
2022 Aug 09 20:53:51:389 GMT +08 cep-data-010.ds-cache6 ERROR [sys-stripe-41-#42%cep-data-010.ds-cache6%] - [org.apache.ignite] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Failed to find security context for subject with given ID : be1fded5-1450-4fc6-b16f-1c580899db2f]]

Related

Restoring rabbitmq-server database to a different host

I'm running rabbitmq-server-3.6.6-1.el7.noarch on a CentOS 7.4.1708 server.
The /var/lib/ directory, on an ext4 LVM partition, ran out of space, which required us to extend the LVM online to free up room.
This seemed to fix the problem at the time, but a restart of the rabbitmq-server service was needed, and when attempted, it hung.
The service never started again.
To get RabbitMQ working again, the old mnesia directory was backed up and a new one was created.
To recover the messages from the broken service, I moved the old mnesia directory to a new server, added NODENAME=rabbit@oldserver to /etc/rabbitmq/rabbitmq-env.conf on the new server, and tried to start it, but it keeps failing to start.
How can I start the old RabbitMQ database on the new host?
[root@newserver]# cat /etc/rabbitmq/rabbitmq-env.conf
NODENAME=rabbit@oldserver
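For context, moving a RabbitMQ mnesia database between hosts usually takes more than setting NODENAME. A minimal sketch of the usual requirements, assuming default CentOS paths (hostnames here are placeholders):
# 1. The old data must live where the new node will look for it:
rsync -a oldserver:/var/lib/rabbitmq/mnesia/ /var/lib/rabbitmq/mnesia/
# 2. NODENAME must match the directory name under mnesia/
#    (e.g. mnesia/rabbit@oldserver -> NODENAME=rabbit@oldserver), and the
#    host part must resolve from the new machine (an /etc/hosts alias works).
echo 'NODENAME=rabbit@oldserver' > /etc/rabbitmq/rabbitmq-env.conf
# 3. The Erlang cookie must match the one from the old host, or rabbitmqctl
#    cannot talk to the broker (compare the cookie hash in the log below):
scp oldserver:/var/lib/rabbitmq/.erlang.cookie /var/lib/rabbitmq/
chown -R rabbitmq:rabbitmq /var/lib/rabbitmq
chmod 400 /var/lib/rabbitmq/.erlang.cookie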
When I try to start the service on the new server:
[root@newserver rabbitmq]# systemctl status rabbitmq-server.service -l
● rabbitmq-server.service - RabbitMQ broker
Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2018-05-07 16:25:23 WAT; 1h 32min ago
Process: 6484 ExecStop=/usr/sbin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
Process: 5997 ExecStart=/usr/sbin/rabbitmq-server (code=exited, status=1/FAILURE)
Main PID: 5997 (code=exited, status=1/FAILURE)
Status: "Exited."
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: * epmd reports: node 'rabbit' not running at all
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: other nodes on oldserver: ['rabbitmq-cli-20']
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: * suggestion: start the node
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: current node details:
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: - node name: 'rabbitmq-cli-20@newserver'
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: - home dir: .
May 07 16:25:23 newserver.dom.local rabbitmqctl[6484]: - cookie hash: edMXQlaNlKXH72ZvAXFhbw==
May 07 16:25:23 newserver.dom.local systemd[1]: Failed to start RabbitMQ broker.
May 07 16:25:23 newserver.dom.local systemd[1]: Unit rabbitmq-server.service entered failed state.
May 07 16:25:23 newserver.dom.local systemd[1]: rabbitmq-server.service failed.
The RabbitMQ log shows:
=ERROR REPORT==== 7-May-2018::16:16:43 ===
** Generic server <0.145.0> terminating
** Last message in was {'$gen_cast',
{submit_async,
#Fun<rabbit_queue_index.32.103862237>}}
** When Server state == undefined
** Reason for termination ==
** {function_clause,
[{rabbit_queue_index,parse_segment_entries,
[<<1,23,0,255,54,241,95,251,201,20,69,202,0,0,0,0,0,0,0,0,0,0,0,
240,0,0,1,176,131,104,6,100,0,13,98,97,115,105,99,95,109,101,
115,115,97,103,101,104,4,100,0,8,114,101,115,111,117,114,99,101,
109,0,0,0,8,47,98,105,108,108,105,110,103,100,0,8,101,120,99,
104,97,110,103,101,109,0,0,0,7,98,105,108,108,105,110,103,108,0,
0,0,1,109,0,0,0,13,98,105,108,108,105,110,103,95,113,117,101,
117,101,106,104,6,100,0,7,99,111,110,116,101,110,116,97,60,100,
0,4,110,111,110,101,109,0,0,0,7,48,0,0,0,0,0,2,100,0,25,114,97,
98,98,105,116,95,102,114,97,109,105,110,103,95,97,109,113,112,
95,48,95,57,95,49,108,0,0,0,1,109,0,0,0,240,123,34,115,104,111,
114,116,67,111,100,101,34,58,34,52,50,54,95,109,101,110,117,115,
34,44,34,116,105,109,101,115,116,97,109,112,34,58,34,50,48,49,
56,45,48,52,45,49,56,84,49,48,58,53,57,58,50,55,46,57,51,53,90,
34,44,34,109,115,105,115,100,110,34,58,34,50,51,52,57,48,57,51,
52,52,57,50,57,51,34,44,34,105,100,34,58,34,50,55,50,49,56,95,
50,51,52,57,48,57,51,52,52,57,50,57,51,95,49,53,50,52,48,52,57,
48,52>>,
------snip-------goes-on-forever----
100,0,4,116,114,117,101>>},
no_del,no_ack},
undefined,undefined,undefined,undefined,undefined,
undefined,undefined,undefined},
10,10,10,10},
100,100,100,100,100},
1000,1000,1000,1000,1000,1000,1000},
10000,10000,10000,10000,10000,10000,10000,10000,10000}},
8988}],
[{file,"src/rabbit_queue_index.erl"},{line,1067}]},
{rabbit_queue_index,'-recover_journal/1-fun-0-',1,
[{file,"src/rabbit_queue_index.erl"},{line,863}]},
{lists,map,2,[{file,"lists.erl"},{line,1224}]},
{rabbit_queue_index,segment_map,2,
[{file,"src/rabbit_queue_index.erl"},{line,989}]},
{rabbit_queue_index,recover_journal,1,
[{file,"src/rabbit_queue_index.erl"},{line,856}]},
{rabbit_queue_index,scan_segments,3,
[{file,"src/rabbit_queue_index.erl"},{line,676}]},
{rabbit_queue_index,queue_index_walker_reader,2,
[{file,"src/rabbit_queue_index.erl"},{line,664}]},
{rabbit_queue_index,'-queue_index_walker/1-fun-0-',2,
[{file,"src/rabbit_queue_index.erl"},{line,645}]}]}

Weblogic doesn't cache LDAP

I have a web application built with JSF 2.1 and JEE 6, running on a WebLogic 12.1.2 server with an OpenLDAP server for authentication. I've noticed that loading any page in the app causes multiple BIND requests to LDAP – every single time!
I've read through much of the material and have configured the LDAP provider in WebLogic so that just about every cache I could find is activated. In particular I've set:
[x] Cache Enabled
Cache Size: 10240
Cache TTL: 300
GUID Attribute: entryUUID
I've also double-checked that the entryUUID attribute exists. I'm not too knowledgeable about either WebLogic or LDAP, but I've read just about every page on configuring the cache, and there are still just as many requests to the LDAP server (yes, I've restarted the servers after changes).
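Worth noting: entryUUID is an operational attribute in OpenLDAP, so a plain search does not return it unless it is requested explicitly. A quick way to double-check (bind DN borrowed from the slapd log below, password prompted):
ldapsearch -x -H ldap://server.org -W \
  -D "tpid=NQ00000013,ou=people,dc=de,dc=foobiz,dc=com" \
  -b "ou=people,dc=de,dc=foobiz,dc=com" "(tpid=NQ00000013)" entryUUID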
I'd appreciate any help, insights or wild guesses as to what may be the cause or how I can debug this further. I'm not sure which config files to attach, but I'm happy to provide anything that's needed.
The LDAP requests all look like this:
# journalctl -u slapd
# … many of these …
Sep 16 23:06:03 server.org slapd[15038]: daemon: read active on 13
Sep 16 23:06:03 server.org slapd[15038]: daemon: epoll: listen=7 active_threads=0 tvp=zero
Sep 16 23:06:03 server.org slapd[15038]: daemon: epoll: listen=8 active_threads=0 tvp=zero
Sep 16 23:06:03 server.org slapd[15038]: conn=1109 op=32 BIND anonymous mech=implicit ssf=0
Sep 16 23:06:03 server.org slapd[15038]: conn=1109 op=32 BIND dn="tpid=NQ00000013,ou=people,dc=de,dc=foobiz,dc=com" method=128
Sep 16 23:06:03 server.org slapd[15038]: conn=1109 op=32 BIND dn="tpid=NQ00000013,ou=people,dc=de,dc=foobiz,dc=com" mech=SIMPLE ssf=0
Sep 16 23:06:03 server.org slapd[15038]: conn=1109 op=32 RESULT tag=97 err=0 text=
Sep 16 23:06:03 server.org slapd[15038]: daemon: activity on 1 descriptor
Sep 16 23:06:03 server.org slapd[15038]: daemon: activity on:
I have figured out the issue, and WebLogic isn't at fault at all. Our application uses a rather broken approach to calling remote EJBs: it creates its own proxy, stores the JNDI information, and performs a JNDI lookup on every method invocation.
Caching the bean therefore wouldn't help. Of course this bypasses every caching mechanism and thus results in multiple LDAP binds on every request.

NetworkManager kills apache when going offline

When I turn off my WiFi, NetworkManager kills apache2. This can be seen in /var/log/apache/error_log:
[Sun Mar 01 13:25:55 2015] [notice] caught SIGTERM, shutting down
However, this does not happen if I turn off the WiFi manually with
sudo ifconfig wlan0 down
It seems NetworkManager goes to the 'inactive' state when I disconnect it from the WiFi.
These are some of the contents of /var/log/messages from around the time I turn off the WiFi:
Mar 1 13:25:52 raven NetworkManager[22393]: <info> (wlan0): device state change: activated -> disconnected (reason 'user-requested') [100 30 39]
Mar 1 13:25:52 raven NetworkManager[22393]: <info> (wlan0): deactivating device (reason 'user-requested') [39]
Mar 1 13:25:52 raven dhcpcd[350]: received SIGTERM, stopping
Mar 1 13:25:52 raven dhcpcd[350]: wlan0: removing interface
Mar 1 13:25:53 raven NetworkManager[22393]: <info> (wlan0): canceled DHCP transaction, DHCP client pid 350
Mar 1 13:25:53 raven NetworkManager[22393]: <info> NetworkManager state is now DISCONNECTED
Mar 1 13:25:53 raven dbus[16077]: [system] Activating service name='org.freedesktop.nm_dispatcher' (using servicehelper)
Mar 1 13:25:53 raven NetworkManager[22393]: <warn> (pid 350) unhandled DHCP event for interface wlan0
Mar 1 13:25:53 raven NetworkManager[22393]: <warn> Connection disconnected (reason -3)
Mar 1 13:25:53 raven NetworkManager[22393]: <info> (wlan0): supplicant interface state: completed -> disconnected
Mar 1 13:25:53 raven NetworkManager[22393]: <warn> Connection disconnected (reason -3)
Is there a way to "uncouple" apache2 from NetworkManager so it is not killed when going offline?
I'm on Gentoo 3.10.7-gentoo-r1 using OpenRC (not systemd), with NetworkManager 0.9.8.8 and Apache 2.2.25.
Same problem here (Gentoo user for years).
Very quick solution:
As root, just run apache2 directly; this starts Apache with the same configs as the init script /etc/init.d/apache2 does. The only difference is that it will not check for a started network.
The reason it is stopped when NetworkManager is stopped is this part of the init script:
depend() {
    need net
    use mysql dns logger netmount postgresql
    after sshd
}
The "need net" part tells the init system that this service requires the network to be up. On machines where NetworkManager is used and there is no network connection, whether over WiFi or wired networks, this condition is not met and the service is stopped automatically.
So as a second (still hacky) solution, you could just comment out that line.
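A less invasive variant of the same idea, assuming your OpenRC version supports per-service dependency overrides in conf.d (most versions from that era do), is to override the dependency instead of editing the init script:
# /etc/conf.d/apache2
# Drop the init script's "need net" dependency so that NetworkManager
# going offline no longer takes apache2 down with it.
rc_need=""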

How to solve race condition in etcd leader election?

While testing a CoreOS cluster with three nodes, after successfully adding and removing a few additional nodes, I encountered the following problem, presumably caused by a race condition during the etcd leader election.
Checking the new leader gives:
$ curl -L http://127.0.0.1:4001/v2/stats/leader
{"errorCode":300,"message":"Raft Internal Error","index":629006}
Journalctl for each machine in the cluster gives:
$ journalctl -r -u etcd
-- Logs begin at Wed 2014-11-12 15:09:01 UTC, end at Mon 2014-11-24 10:47:34 UTC. --
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.307 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5221 started.
Nov 24 10:47:34 node-1 etcd[56576]: [etcd] Nov 24 10:47:34.306 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:33 node-1 etcd[56576]: [etcd] Nov 24 10:47:33.098 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: term #5219 started.
Nov 24 10:47:32 node-1 etcd[56576]: [etcd] Nov 24 10:47:32.081 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'candidate' to 'follower'.
Nov 24 10:47:31 node-1 etcd[56576]: [etcd] Nov 24 10:47:31.962 INFO | 965d12d38a4a4b2c807bd232fb7b0db7: state changed from 'follower' to 'candidate'.
And listing the machines with fleet fails:
$ fleetctl list-machines
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2014/11/24 10:56:19 ERROR client.go:200: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 200ms
2014/11/24 10:56:19 INFO client.go:278: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
Listing the machines in the cluster gives:
$ curl -L http://127.0.0.1:7001/v2/admin/machines
[{"name":"","state":"follower","clientURL":"http://100.72.62.35:4001","peerURL":"http://100.72.62.35:7001"},
{"name":"555cca74216644fea48990673b3d539c","state":"follower","clientURL":"http://100.72.62.59:4001","peerURL":"http://100.72.62.59:7001"},
{"name":"965d12d38a4a4b2c807bd232fb7b0db7","state":"follower","clientURL":"http://100.72.20.153:4001","peerURL":"http://100.72.20.153:7001"},
{"name":"a1b566dedb194c259f7eb2ffde5595b1","state":"follower","clientURL":"http://100.72.62.2:4001","peerURL":"http://100.72.62.2:7001"},
{"name":"a45efba827754b5f93c38b751a0ae273","state":"follower","clientURL":"http://100.72.62.31:4001","peerURL":"http://100.72.62.31:7001"},
{"name":"d041738235a9483cb814d37ca7fa4b6d","state":"follower","clientURL":"http://100.72.20.18:4001","peerURL":"http://100.72.20.18:7001"}]
but only three machines are currently running. I tried to add more machines to reach the quorum, to no avail.
I'm running the following version:
$ etcdctl -v
etcdctl version 0.4.6
for which, as mentioned here https://coreos.com/docs/distributed-configuration/etcd-api/#cluster-config, the leader module that allowed forcing a leader has been removed. The ugly part is that, since there is no quorum, I'm not able to remove the machines that are not currently running from the list, using for example:
$ curl -L -XDELETE http://127.0.0.1:7001/v2/admin/machines/2abbf47a9e644bc69652a986d796d7a6
which has no effect. Is there any way to save the cluster?
As I understand it, you can save the cluster, but it isn't worth it.
The cluster is not accepting new machines because it needs a quorum to add them, and there is no quorum among the existing machines. The same goes for removing machines and deleting keys.
If you can bring up enough of the machines listed as cluster members and have them successfully rejoin as members, you will have a quorum and will have saved the cluster.
From what I can see, you have six machines listed as cluster members, so you need at least four of them running for the existing cluster to operate.
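As a quick sanity check against the admin endpoint shown above, you can count the registered members and derive the quorum yourself (the grep pattern is just a rough sketch):
$ curl -s http://127.0.0.1:7001/v2/admin/machines | grep -o '"name"' | wc -l
6
With 6 registered members the quorum is floor(6/2)+1 = 4, which is why three running machines cannot make progress.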

ERROR: Initialization failure: Cannot create configuration

I'm trying to get CloudBees working. Here's the error I get when running:
C:\cloudbees-sdk-1.5.0>bees init --proxyHost=localhost --proxyPort=8008 (or 8080)
You have not created a CloudBees configuration profile, let's create one now...
Enter your default CloudBees API end point [us | eu]: eu
Enter your CloudBees account email address: abs@abs.com
Enter your CloudBees account password:
Jul 18, 2013 1:32:09 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
Jul 18, 2013 1:32:09 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: Retrying request
Jul 18, 2013 1:32:10 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
Jul 18, 2013 1:32:10 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: Retrying request
Jul 18, 2013 1:32:11 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
Jul 18, 2013 1:32:11 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: Retrying request
ERROR: Initialization failure: Cannot create configuration
Can anyone tell what's causing this error?
It looks like the SDK can't establish an Internet connection to the CloudBees website. If you are running behind a proxy, you will need to use the proxy flags to connect.
bees init --proxyHost=YOUR_PROXY_HOST --proxyPort=YOUR_PROXY_PORT
This is covered in the CloudBees SDK docs: Running behind a proxy
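Note that --proxyHost=localhost --proxyPort=8008 only helps if a proxy is actually listening on that port; the repeated "Connection refused" above suggests nothing is. A quick check from the Windows prompt used in the question (port number taken from the command above):
netstat -an | find "8008"
If that prints nothing, point the flags at your real proxy host and port instead.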
It helped me to set the system time back exactly six hours (to US time).
Also, the CloudBees documentation says that you should create a c:\Users\Your_User\.bees\bees.config file on your file system (under Windows 7), containing the following line (if you want to use the CloudBees EU server):
bees.api.url=https\://api-eu.cloudbees.com/api
but that didn't actually help in my case (maybe an outdated version).