Authentication failures in Cassandra when 1 of 16 nodes is down

I have a Cassandra cluster running:
Cassandra 2.0.11.83 | DSE 4.6.0 | CQL spec 3.1.1 | Thrift protocol 19.39.0
The cluster has 18 nodes, split among 3 datacenters, 6 in each. My system_auth keyspace has the following replication defined:
replication = {
'class': 'NetworkTopologyStrategy',
'DC1': '4',
'DC2': '4',
'DC3': '4'}
and my authenticator/authorizer are set to:
authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer
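For reference, the system_auth replication above would presumably have been set with something like the following (a sketch matching the settings shown, not the exact statement I ran):
-- run once from cqlsh
ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC1': '4',
  'DC2': '4',
  'DC3': '4'};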
This morning I brought down one of the nodes in DC1 for maintenance. Within a few seconds to a minute, client applications started logging exceptions like this:
"User my_application_user has no MODIFY permission on or any of its parents"
Running 'LIST ALL PERMISSIONS of my_application_user' on one of the other nodes shows that user to have SELECT and MODIFY on the keyspace xxxxx, so I am rather confused. Do I have a setup issue? Is this a bug of some sort?

Re-posting this as the answer, as BrianC suggested above.
So this is resolved... Here's the sequence of events that seems to have fixed it:
Add 18 more nodes
Run cleanup on original nodes (this was part of the original plan)
Run a scrub on 1 table, since it was throwing exceptions on cleanup
Run a repair on the system_auth KS on the original troubled node
Wait for repair service to complete a full pass on all keyspaces
Decommission the original 18 nodes.
Honestly, I don't know what fixed it. The system_auth repair makes the most sense, but what doesn't make sense is that the repair service had already completed many passes before, so I don't know why it worked now. I hope this at least helps someone.
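For anyone trying the system_auth repair step on its own, it can be limited to that keyspace (a command sketch; run it on the node that was having trouble):
# on the troubled node, repair only the auth keyspace
nodetool repair system_auth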

Related

Janusgraph not using index in production

Problem
When performing queries in my production environment, the index is not being used and a full scan is performed, but my development environment works fine and uses the index.
After looking deeper at the problem in production, it also seems that the index information is being saved to the storage backend, but the data is not; it is being stored locally instead. I have no idea why this is...
I will explain the architecture now:
Environments
The following describes my two environments. It is important to note that the index in question is a composite index, and as such uses the storage backend, but I still included the index backend (Elasticsearch) in the architecture description.
Both the local and production environment versions are the same, i.e.:
Janusgraph: 0.5.2
ScyllaDB: 0.5.2
Elasticsearch: 7.13.1
Local Environment
Services are running in docker-compose, consisting of a single JanusGraph instance, a single ScyllaDB instance, and a single Elasticsearch instance.
Production Environment
Running on AWS in a Kubernetes cluster managed with EKS, I have multiple JanusGraph deployments, which connect to a ScyllaDB cluster (in the same k8s cluster, set up via Scylla For Kubernetes, https://operator.docs.scylladb.com/stable/) and to an Elasticsearch cluster.
Setup
The following is the simplest example I can give that reproduces the problems I describe.
I pre-create the indexes with the JanusGraph management system, for example:
# management.groovy
import org.janusgraph.graphdb.database.management.ManagementSystem
// remote connection for traversals, embedded graph instance for schema management
cluster = Cluster.open("/opt/janusgraph/my_scripts/gremlin.yaml")
client = cluster.connect()
graph = JanusGraphFactory.open("/opt/janusgraph/my_scripts/env.properties")
g = graph.traversal().withRemote(DriverRemoteConnection.using(client, "g"))
// define the property key and vertex label, then build a composite index restricted to that label
m = graph.openManagement()
uid_property = m.makePropertyKey("uid").dataType(String).make()
user_label = m.makeVertexLabel("User").make()
m.buildIndex("index::User::uid", Vertex.class).addKey(uid_property).indexOnly(user_label).buildCompositeIndex()
m.commit()
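For completeness, after m.commit() one would typically wait for the index to reach the ENABLED state before importing data. A sketch using the management API (the SchemaStatus import is shown explicitly in case it is not pre-imported in the console):
import org.janusgraph.core.schema.SchemaStatus
// block until the composite index reports ENABLED
ManagementSystem.awaitGraphIndexStatus(graph, "index::User::uid").status(SchemaStatus.ENABLED).call()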
Upon inspection with m.printSchema() I can see that the indexes are ENABLED in both my local environment and my production environment.
I proceed to import all the data that needs to exist in the graph; both the local and production environments import it without problems.
Performing Queries
The following outlines what happens when I run a query.
Local Environment
What we see here is a simple lookup just to check that the query is using the index:
gremlin> g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 1.837 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=true
\_query=multiKSQ[1]#4000
\_index=index::User::uid
\_orders=[]
\_isOrdered=true
optimization 0.038
optimization 0.497
backend-query 1 0.901
\_query=index::User::uid:multiKSQ[1]#4000
\_limit=4000
>TOTAL - - 1.837 -
Production Environment
Again, we run the query to see if it is using the index (it is not):
g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 11296.568 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=false
\_query=[]
\_orders=[]
\_isOrdered=true
optimization 0.025
optimization 0.102
scan 0.000
\_condition=VERTEX
\_query=[]
\_fullscan=true
>TOTAL - - 11296.568 -
What Happened? So far my best guess:
The storage backend is NOT being used for storing data, but is being used for storing information about the indexes
Update (Aug 16 2021): after digging around some more, I found out something interesting.
It is now clear that the data is actually not being saved to the storage backend at all.
In my local environment I set the storage.directory variable to /var/lib/janusgraph/data, which mounts onto an empty directory, and this directory remains empty. Any vertex/edge updates get saved to the ScyllaDB storage backend, and the data persists between JanusGraph instance restarts.
In my production environment, this directory (/var/lib/janusgraph/data) is populated with files:
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.lck
-rw-r--r-- 1 janusgraph janusgraph 9650 Aug 16 05:46 je.config.csv
-rw-r--r-- 1 janusgraph janusgraph 450 Aug 16 05:46 je.info.0
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.info.0.lck
drwxr-xr-x 2 janusgraph janusgraph 118 Aug 16 05:46 .
-rw-r--r-- 1 janusgraph janusgraph 7533 Aug 16 05:46 00000000.jdb
drwx------ 1 janusgraph janusgraph 75 Aug 16 05:53 ..
-rw-r--r-- 1 janusgraph janusgraph 19951 Aug 16 06:09 je.stat.csv
Any subsequent updates on the graph seem to be reflected here; the updates do not get written to the storage backend, and other JanusGraph instances on Kubernetes cannot see changes made by other instances, leading me to the conclusion that the storage backend is not being used for storing data.
The domain names used for storage.hostname and index.hostname both resolve to IP addresses, confirmed using nslookup.
The endpoints must also work, as the keyspace janusgraph is created with the replication factor that I defined, and it retains the index information across restarts of the JanusGraph instances.
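One way to check whether graph data is actually landing in that keyspace is to look at the tables JanusGraph's CQL adapter creates (a sketch; the table names assume JanusGraph's defaults, and a full COUNT can be slow or time out on a large graph):
-- from cqlsh against the ScyllaDB cluster
USE janusgraph;
DESCRIBE TABLES;                 -- expect edgestore, graphindex, etc.
SELECT COUNT(*) FROM edgestore;  -- should grow as vertices/edges are written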
Idea 1 (Index is not enabled)
This was disproved by running m.printSchema(), which shows that all the indexes are ENABLED.
Idea 2 (Storage backends have different data)
I looked at the data stored in ScyllaDB and got a summary with nodetool cfstats; this does show something different:
# Local
Keyspace : janusgraph
Read Count: 1688328
Read Latency: 2.5682805710738673E-5 ms
Write Count: 1055210
Write Latency: 1.702409946835227E-5 ms
...
Memtable cell count: 126411
Memtable data size: 345700491
Memtable off heap memory used: 480247808
# Production
Keyspace : janusgraph
Read Count: 6367
Read Latency: 2.1203078372860058E-5 ms
Write Count: 21
Write Latency: 0.0 ms
...
Memtable cell count: 4
Memtable data size: 10092
Memtable off heap memory used: 131072
Although I don't know how to explain the difference, it is clear that both backends contain all the data, verified with various count() queries over labels, such as g.V().hasLabel("User").count(), for which both environments report the same result.
Idea 3 (Elasticsearch Warnings)
When launching a gremlin console session, there is a difference in that the production environment shows:
07:27:09 WARN org.elasticsearch.client.RestClient - request [PUT http://*******<i_removed_the_domain>******:9200/_cluster/settings] returned 3 warnings: [299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.data] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.master] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.ml] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."]
But as my problem involves composite indexes, I believe we can disregard the Elasticsearch warnings.
Idea 4 (ScyllaDB cluster node resources)
Another idea I had was increasing the node resources; even with 7 GB of RAM, the problem still persists.
Finally...
I don't know what to try next to solve this problem. This is my first time pushing JanusGraph into production, and perhaps I have missed something important. I have been stuck on this problem for quite a while, hence asking the community here for help.
Thank you very much for reading this far, and hopefully helping me to solve this problem.
I solved the problem myself. I realised that the K8s Deployment .yaml file I use for deploying needed all environment variables to have the prefix janusgraph.; without it, the JanusGraph server was starting with all default settings rather than my selected ones.
Every time I created a gremlin shell session (which connected to its localhost server), although I was specifying all the correct endpoints and configuration, it was still saving the data according to the default JanusGraph settings. Even in this case, I don't know why the indexes were successfully created on my specified backend.
Nonetheless, the solution was to make sure the environment variables have the prefix janusgraph.
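For anyone hitting the same thing, the corrected Deployment ends up with env entries shaped roughly like this (a sketch; the index name "search" and the service hostnames here are illustrative, not my real ones):
# excerpt from the k8s Deployment spec for the janusgraph container
env:
  - name: janusgraph.storage.backend
    value: "cql"
  - name: janusgraph.storage.hostname
    value: "scylla-client.scylla.svc"          # hypothetical ScyllaDB service name
  - name: janusgraph.index.search.backend
    value: "elasticsearch"
  - name: janusgraph.index.search.hostname
    value: "elasticsearch-master.es.svc"       # hypothetical Elasticsearch service name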

RabbitMQ delete a corrupted queue after node crash

RabbitMQ version: 3.7.21
Erlang version: 21.3.8.10
My team had 2 nodes hit the memory watermark last night and so I rebuilt the bad nodes but it left some queues in a bad state. I want to clear them out so that we can recreate them.
The stats show NaN for Ready, Unacked, and Total.
It looks like the queue's node is one that no longer exists, so unfortunately I can't access it. It's completely gone.
I have tried the following commands:
rabbitmqctl eval 'Q = rabbit_misc:r(<<"/">>, queue, <<"QUEUE">>), rabbit_amqqueue:internal_delete(Q).'
rabbitmqctl eval 'Q = {resource, <<"/">>, queue, <<"QUEUE">>}, rabbit_amqqueue:internal_delete(Q).'
but get this error:
{:undef, [{:rabbit_amqqueue, :internal_delete, [{:resource, "/", :queue, "QUEUE"}], []}, {:erl_eval, :do_apply, 6, [file: 'erl_eval.erl', line: 680]}, {:rpc, :"-handle_call_call/6-fun-0-", 5, [file: 'rpc.erl', line: 197]}]}
I assume this means it's trying to make an RPC call to a node that no longer exists, and it fails. This seems crazy to me because the node is not just gone, it has been forgotten from the cluster, yet a couple of queues still remain.
Looks like there are 3 options:
Comb through the Mnesia tables and delete the corrupted ones
Fully rebuild the cluster and migrate to a new cluster
Rename your queues and ignore corrupted ones
We're going to go with Option 3 for now. I'm sure there will eventually be a breaking change in RabbitMQ that makes Option 2 more appealing, but for now the quick fix is best for me.
According to https://groups.google.com/g/rabbitmq-users/c/VSjzvOUfS3s/m/q8OmFTqACAAJ, the internal_delete function in 3.7.x takes two arguments:
In 3.7.x rabbit_amqqueue:internal_delete takes two arguments (acting user name is the second one).
Therefore, the next time you need to delete a queue in a bad state, try
rabbitmqctl eval 'Q = {resource, <<"/">>, queue, <<"QUEUE">>}, rabbit_amqqueue:internal_delete(Q, <<"CLI">>).'
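If it helps, you can confirm afterwards that the queue is actually gone (a sketch; adjust the vhost as needed):
# list remaining queues in the default vhost
rabbitmqctl list_queues -p / name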

Inconsistent behavior of Quartz2 scheduler in Apache Camel

I have an Apache Camel project that is using Quartz2 as the scheduler. The requirement is to make it a cluster. The code is deployed to WebLogic 12c, and Quartz is configured as per many samples, with clustering enabled.
This is my properties file (without the datasource):
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
When I deploy and start both nodes, I see that the QRTZ_SCHEDULER_STATE table has an extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
And I am guessing that because of that, one node is triggered only once in a while, while the other node is triggered all the time (so occasionally both nodes are invoked at the same time).
I have tried a clean restart of the WebLogic nodes, but the issue is still there.
This is what my route(s) look like:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
.routeId("createUsersRB")
.log("**** starting check for create users");
//where
//create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
//expecting one node being called by the scheduler at a time..
I figured out what caused the issue. Apparently there were orphan WebLogic processes running on one (or even both) of the nodes - why this was such a mess would be a question for our tech archs. ps was showing two WebLogic servers running on a node: one that I started recently and one that had been there for, say, a month.
Expecting this would never happen in the production environment, I assume the issue has been resolved.
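For reference, a quick way to spot the duplicate/orphan server JVMs described above (a sketch; WebLogic server JVMs run the weblogic.Server main class):
# look for multiple WebLogic server JVMs on the same machine
ps -ef | grep weblogic.Server | grep -v grep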

julia on PBS cluster: what to give to addprocs()?

I'm trying to set up a cluster across machines on a PBS-managed cluster. I'm perfectly able to compute within one node by saying julia -p 12 (after having reserved one node with 12 CPUs).
I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE); on this one, something is going wrong.
You can see everything I'm doing, including submit scripts etc, on this branch of a github repo.
To get a list of machines, I parse the PBS_NODEFILE, which, for a submit script with the option
#PBS -l nodes=2:ppn=12 # give me 2 nodes with 12 processors each
looks something like this:
red0004
red0004
...
red0004
red0347
...
red0347
I parse this file with bind_pe_procs() in sge.jl in the repo and give a vector of machine names to addprocs. When I submit this, I get an error; I put up a gist with the resulting SSH error. I don't know what it means.
Does this have to do with a system setting, i.e. do I have to talk to the sysadmin about SSH between machines? What are the right questions to ask?
I am also unsure about what exactly I have to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself?), so I exclude ENV["HOST"] = node001 from my list. But what about all the processors with the same name, node002? Do I list all of those
machines = [ "red0347" for i=1:12]
or just once
machines = ["red0347"]
in addprocs(machines)
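For context, here is a minimal sketch of how the PBS_NODEFILE could be collapsed into (host, count) specs for addprocs, assuming passwordless SSH between nodes already works (the function name is illustrative, not the actual bind_pe_procs() from the repo):
using Distributed   # on recent Julia; addprocs lived in Base on older versions

# count how many slots each host provides and drop the master's own host
function pbs_machine_specs(nodefile = ENV["PBS_NODEFILE"])
    counts = Dict{String,Int}()
    for host in eachline(nodefile)
        counts[host] = get(counts, host, 0) + 1
    end
    delete!(counts, gethostname())          # don't SSH back into the master node
    return [(host, n) for (host, n) in counts]
end

addprocs(pbs_machine_specs())               # e.g. [("red0347", 12)]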
thanks!

Voldemort replication and failover details

I'm evaluating Voldemort and encountered some confusing stuff related to replication and failover. I tried to make a simple 2-node cluster configuration where each node is a backup for the other. So data written to node 1 should be replicated to node 2 and vice versa. If node 1 fails, the second node should serve client requests, and after node 1 recovers the data should be transferred back to it. I think it's a very common and clear case. So I made the following configuration:
<cluster>
  <name>perf_cluster</name>
  <server>
    <id>0</id>
    <host>10.50.3.156</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>0, 1, 2, 3</partitions>
    <zone-id>0</zone-id>
  </server>
  <server>
    <id>1</id>
    <host>10.50.3.157</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>4, 5, 6, 7</partitions>
    <zone-id>0</zone-id>
  </server>
</cluster>
<stores>
  <store>
    <name>perftest</name>
    <persistence>memory</persistence>
    <description>Performance Test store</description>
    <owners>owner</owners>
    <routing>client</routing>
    <replication-factor>2</replication-factor>
    <required-reads>1</required-reads>
    <required-writes>1</required-writes>
    <key-serializer>
      <type>string</type>
    </key-serializer>
    <value-serializer>
      <type>java-serialization</type>
    </value-serializer>
  </store>
</stores>
I perform the following test:
Start both nodes;
Connect cluster via shell using 'bin/voldemort-shell.sh perftest tcp://10.50.3.156:6666';
Put the key-value "1" "a";
Perform 'preflist "1"', which returns 'Node 1' 'Node 0', so I assume that a 'get' request will be sent to Node 1 first;
Crash Node 1;
Get key "1". I see some errors related to loss of connectivity, but it finally returns the correct value;
Start Node 1;
Get key "1". It says that Node 1 is available but returns 'null' instead of the value. So I assume Node 1 didn't get the data back from Node 0, and since my required-reads = 1 it doesn't ask Node 0 and returns null.
Crash Node 0;
Key "1" is lost forever because it wasn't replicated to Node 1.
I'm more than sure that I misunderstand something in the configuration or the cluster replication details. Could you clarify why the data doesn't replicate back from Node 0 to Node 1 after recovery? And am I right that replication is a client responsibility, not the server's? If so, how should the data be replicated after node recovery?
Thanks in advance.
I don't know if you've already solved the problem, but take a look at: http://code.google.com/p/project-voldemort/issues/detail?id=246
Remember that the memory store is only for testing (JUnit) purposes; you should use the readonly or bdb stores.
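A sketch of the same store definition with the bdb engine suggested above (everything else is left exactly as in the question; whether required-reads/required-writes should also be raised so that reads + writes exceed the replication factor is a separate consistency consideration):
<store>
  <name>perftest</name>
  <persistence>bdb</persistence>   <!-- persistent engine instead of the test-only memory store -->
  <description>Performance Test store</description>
  <owners>owner</owners>
  <routing>client</routing>
  <replication-factor>2</replication-factor>
  <required-reads>1</required-reads>
  <required-writes>1</required-writes>
  <key-serializer>
    <type>string</type>
  </key-serializer>
  <value-serializer>
    <type>java-serialization</type>
  </value-serializer>
</store>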