Cassandra: MigrationStage cannot keep up - schema

We have 4 nodes, and running nodetool tpstats shows a big backlog for MigrationStage on every node; the queue never drains over time. For example:
Pool Name                    Active   Pending   Completed   Blocked   All time blocked
MigrationStage                    1      3946       17766         0                  0
I never see this number go down, and the other 3 nodes each have about 300 pending requests.
Is there a way to speed this up? Or is it possible to stop the schema migration, since it is most likely trying to migrate old keyspaces?
P.S. I tried dropping keyspaces to reduce this (there are about 200 keyspaces), but the DROP statement always times out (SELECT works). I assume this backlog is also blocking schema DDL statements.
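As an aside, the DROP KEYSPACE timeouts may simply be the client-side request timeout (10 seconds by default in the DataStax Python driver), which schema DDL can easily exceed on a backlogged cluster. A minimal sketch of raising it per statement; the contact point and keyspace name are hypothetical:

from cassandra.cluster import Cluster

# Hypothetical contact point; adjust for your cluster.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect()

# Raise the per-request timeout (in seconds) for this one DDL statement.
# This does not fix the MigrationStage backlog itself; it only gives the
# DROP more time to complete.
session.execute("DROP KEYSPACE old_keyspace_001", timeout=120.0)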

Lock a Redis key when two processes are accessing the same key

Application 1 sets a value in Redis.
We have two running instances of application 2, and we would like only one instance to read this value from Redis (note that application 2 takes around 30 seconds to 1 minute to process the data).
Can instance 1 of application 2 acquire a lock on the Redis key created by application 1, so that instance 2 of application 2 will not read it and perform the same operation?
No, there is no concept of a record lock in Redis. If you need to achieve some sort of locking, you have to use other data structures to mimic that behavior. For example:
List: you can use a list and have each consumer POP an item from it, or...
Redis Stream: use a Redis Stream with a consumer group, so that each consumer in your group only sees a portion of the data that needs to be processed; this guarantees that when an item is delivered to one consumer, it will not be delivered to another.
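For illustration, a minimal sketch of the stream approach with the redis-py client; the stream name jobs, group name workers, and consumer name instance-1 are hypothetical:

import redis

r = redis.Redis()

def process(fields):
    ...  # placeholder for the 30 s - 1 min of work from the question

# Application 1: publish the value as a stream entry instead of a plain key.
r.xadd("jobs", {"value": "some-payload"})

# One-time setup: create the consumer group (ignore the error if it already exists).
try:
    r.xgroup_create("jobs", "workers", id="0", mkstream=True)
except redis.ResponseError:
    pass

# Application 2 (each instance uses its own consumer name): XREADGROUP
# delivers each entry to exactly one consumer in the group.
for stream, messages in r.xreadgroup("workers", "instance-1", {"jobs": ">"}, count=1, block=5000):
    for msg_id, fields in messages:
        process(fields)
        r.xack("jobs", "workers", msg_id)  # acknowledge once processing is done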

Storing time intervals efficiently in Redis

I am trying to track server uptimes using Redis.
The approach I have chosen is as follows:
Server xyz keeps sending my service a ping indicating that it was alive and working during the last 30 seconds.
My service stores a list of all time intervals during which the server was active, as a list of {startTime, endTime} entries in Redis, keyed by the server name (xyz).
Depending on the user query, I use this list to generate server uptime metrics, such as the percentage of downtime between times (T1, T2).
Example:
Assume the current time is T.
at T+30, server sends a ping.
xyz:["{start:T end:T+30}"]
at T+60, server sends another ping
xyz:["{start:T end:T+30}", "{start:T+30 end:T+60}"]
and so on for all pings.
This works fine, but over a long time period the list accumulates a lot of elements. To avoid this, on each ping I currently pop the last element of the list and check whether it can be merged with the new time interval. If it can, I coalesce them and push a single interval back onto the list; if not, I push two intervals.
So with this, my list after step 2 becomes: xyz:["{start:T end:T+60}"]
Some problems I see with this approach are:
The merging is done in my service, not in Redis.
If my service is distributed, the list ordering might get corrupted by multiple readers and writers.
Is there a more efficient/elegant way to handle this, such as handling the merging of time intervals in Redis itself?
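For illustration, one way to address both concerns is to run the pop/merge/push step as a Lua script, since Redis executes scripts atomically; concurrent writers then cannot interleave between the pop and the push. A minimal sketch via redis-py, where the start:end interval encoding, integer timestamps, and the 30-second ping window are all assumptions:

import redis

r = redis.Redis()

# Pops the last interval; extends it if it touches the new one, otherwise
# pushes both back. Runs atomically inside Redis.
COALESCE = r.register_script("""
local last = redis.call('RPOP', KEYS[1])
local start_t = tonumber(ARGV[1])
local end_t = tonumber(ARGV[2])
if last then
    local s, e = last:match('(%d+):(%d+)')
    if tonumber(e) >= start_t then
        redis.call('RPUSH', KEYS[1], s .. ':' .. end_t)
        return
    end
    redis.call('RPUSH', KEYS[1], last)
end
redis.call('RPUSH', KEYS[1], start_t .. ':' .. end_t)
""")

def record_ping(server, now, window=30):
    # On each ping, merge the interval [now - window, now] into the list.
    COALESCE(keys=[server], args=[now - window, now])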

Work queue providing retries with increasing delays and a maximum number of attempts. Is a pure RabbitMQ solution possible?

I have repetitive tasks that I want to process with a number of workers (i.e., the competing consumers pattern). The probability of failure during a task is fairly low, so in case of such rare events I would like to try again after a short period of time, say 1 second.
A sequence of consecutive failures is even less probable, but still possible, so for the first few retries I would like to stick to a 1-second delay.
However, if the sequence of failures reaches a certain length, then most likely there is some external cause behind them, so from that point on I would like to start extending the delay.
Let's say that the desired distribution of delays looks like this (see the sketch after the list):
first appearance in the queue - no delay
retry 1 - 1 second
retry 2 - 1 second
retry 3 - 1 second
retry 4 - 5 seconds
retry 5 - 10 seconds
retry 6 - 20 seconds
retry 7 - 40 seconds
retry 8 - 80 seconds
retry 9 - 160 seconds
retry 10 - 320 seconds
another retry - drop the message
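For reference, this schedule can be written as a small function (a sketch; the doubling from 5 seconds and the drop after retry 10 mirror the list above):

def retry_delay(attempt):
    # Seconds to wait before the given retry attempt, or None to drop.
    if attempt <= 0:
        return 0                       # first appearance: no delay
    if attempt <= 3:
        return 1                       # retries 1-3: fixed 1 second
    if attempt <= 10:
        return 5 * 2 ** (attempt - 4)  # retries 4-10: 5, 10, 20, ... 320 s
    return None                        # retry 11 and beyond: drop the message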
I have found a lot of information about DLXes (Dead Letter Exchanges) that can partially solve the problem. It appears easy to achieve an infinite number of retries with the same delay, but I haven't found a way to increase the delay or to stop after a certain number of retries.
I'm looking for the purest RabbitMQ solution possible. However, I'm interested in anything that works.
There is a plugin available for this, and I think you can use it to achieve what you need.
I've used it in a similar fashion to handle custom retries with dynamic delays.
RabbitMQ Delayed Message Plugin
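For reference, with the plugin enabled you declare an exchange of type x-delayed-message and set the per-message delay in an x-delay header; a minimal sketch with pika, where the exchange and queue names are hypothetical:

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# The plugin adds the "x-delayed-message" exchange type; the underlying
# routing behavior is set via the "x-delayed-type" argument.
ch.exchange_declare(
    exchange="retry-exchange",
    exchange_type="x-delayed-message",
    arguments={"x-delayed-type": "direct"},
)
ch.queue_declare(queue="tasks")
ch.queue_bind(queue="tasks", exchange="retry-exchange", routing_key="tasks")

def publish_with_delay(body, delay_ms):
    # The per-message "x-delay" header (milliseconds) tells the plugin how
    # long to hold the message before routing it to the queue.
    ch.basic_publish(
        exchange="retry-exchange",
        routing_key="tasks",
        body=body,
        properties=pika.BasicProperties(headers={"x-delay": delay_ms}),
    )

publish_with_delay(b"task-payload", 5000)  # redeliver after 5 seconds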
Using a combination of DLXes and expire/TTL times, you can accomplish this, except for the case where you want to change the redelivery time, for instance to implement an exponential backoff.
The only way I could make it work with a pure RabbitMQ approach is to set the expire time to the smallest delay needed, then use the x-death header array to figure out how many times the message has been killed, and reject (i.e., DLX it again) or ack the message accordingly.
Let's say you set the expire time to 1 minute and you need to back off 1 minute the first time, then 5 minutes, and then 30 minutes. This translates to processing the message when x-death.count is 1, then 5, then 30; at any other count you just reject the message again.
Note that this can create a lot of churn if you have many retried messages. But if retries are rare, go for it.
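For illustration, a consumer along these lines would inspect the x-death header and either process the message, bounce it through the DLX/TTL wait queue again, or drop it; a sketch with pika, assuming a tasks queue whose rejects dead-letter into a wait queue with a 1-minute TTL that routes back to tasks (handle_task is a hypothetical handler):

import pika

RETRY_AT = {1, 5, 30}  # x-death counts at which we attempt processing
MAX_COUNT = 30         # past this, give up and drop the message

def handle_task(body):
    ...  # hypothetical: the actual work

def on_message(ch, method, properties, body):
    deaths = (properties.headers or {}).get("x-death", [])
    count = int(deaths[0]["count"]) if deaths else 0
    if count > MAX_COUNT:
        ch.basic_ack(method.delivery_tag)  # drop: max attempts exceeded
    elif count == 0 or count in RETRY_AT:
        try:
            handle_task(body)
            ch.basic_ack(method.delivery_tag)
        except Exception:
            # Reject without requeue so the DLX routes it to the TTL wait
            # queue; x-death.count increments on each cycle.
            ch.basic_reject(method.delivery_tag, requeue=False)
    else:
        # Not yet time to retry: send it around the wait loop once more.
        ch.basic_reject(method.delivery_tag, requeue=False)

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.basic_consume(queue="tasks", on_message_callback=on_message)
ch.start_consuming()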

Dataflow Apache Beam Python job stuck at Group by step

I am running a Dataflow job that reads from BigQuery, scanning around 8 GB of data and producing more than 50,000,000 records. At the group-by step I want to group on a key and concatenate one column. After concatenation, the concatenated column grows beyond 100 MB, which is why I have to do that group-by in the Dataflow job: it cannot be done at the BigQuery level due to BigQuery's 100 MB row size limit.
The Dataflow job scales well when reading from BigQuery but gets stuck at the group-by step. I have 2 versions of the Dataflow code, and both get stuck there. The Stackdriver logs show messages like "processing stuck at lull for more than 1010 sec" and "Refusing to split GroupedShuffleReader <dataflow_worker.shuffle.GroupedShuffleReader object at 0x7f618b406358>".
I expect the group-by stage to complete within 20 minutes, but it stays stuck for more than an hour and never finishes.
I figured this out myself.
Below are the 2 changes I made in my pipeline:
1. I added a Combine function just after the Group by Key (see screenshot).
2. Since the Group by Key, when running on multiple workers, does a lot of network traffic exchange, and by default the network we use does not allow inter-worker communication, I had to create a firewall rule allowing traffic from one worker to another (i.e., opening the workers' IP range for network traffic).
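For illustration, one common Beam pattern for this is to replace the raw GroupByKey plus concatenation with a CombinePerKey, so values are pre-combined on each worker before the shuffle; a minimal sketch in the Beam Python SDK, with toy element types and a comma separator assumed:

import apache_beam as beam

class ConcatFn(beam.CombineFn):
    # Combines all values for a key into one comma-separated string.
    def create_accumulator(self):
        return []

    def add_input(self, acc, value):
        acc.append(value)
        return acc

    def merge_accumulators(self, accs):
        merged = []
        for acc in accs:
            merged.extend(acc)
        return merged

    def extract_output(self, acc):
        return ",".join(acc)

with beam.Pipeline() as p:
    (p
     | beam.Create([("k1", "a"), ("k1", "b"), ("k2", "c")])
     | beam.CombinePerKey(ConcatFn())
     | beam.Map(print))  # k1 -> "a,b", k2 -> "c"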

Azure SQL high wait time on "VDI_CLIENT_OTHER"

We're benchmarking our app with different scales of an Azure SQL database, and we're having a hard time saturating the db. Among other things, we've executed this query:
SELECT *
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC
The top row of the result was something like
wait_type          waiting_tasks_count   wait_time_ms   max_wait_time_ms   signal_wait_time_ms
VDI_CLIENT_OTHER                 19560      409007428              60016                 37281
What is this wait time? What exactly have we been waiting for during those 409000 seconds (almost 5 days)? Google doesn't seem to know what VDI_CLIENT_OTHER is.
VDI_CLIENT_OTHER is used in the case of new replica seeding or any other user-initiated workflow that triggers copies, such as updating the service tier or setting up a geo-replication link. A high wait time likely just means that seeding took place and the task remained running, waiting for additional work items that aren't arriving.