Gerrit replication is always rescheduling

There are thousands of log entries like this:
[2021-08-18 20:21:39,663] Rescheduling replication to git://xxx to avoid collision with the in-flight push [6e810ffc]. [CONTEXT PLUGIN="replication" RECEIVE_ID="xxxx" project="xxx" pushOneId="2d04c11f" ]
[2021-08-18 20:21:42,663] Rescheduling replication to git://xxx to avoid collision with the in-flight push [6e810ffc]. [CONTEXT PLUGIN="replication" RECEIVE_ID="xxxx" project="xxx" pushOneId="2d04c11f" ]
And the replication work never seems to finish...
My gerrit version is 3.2.3.
How can I fix this problem? Many thanks.

My fix is one of the following two actions:
1. Wait for the conflicting task to finish.
2. cd into the repo folder, such as project.git, and run 'git push --mirror ssh://user@RemoteHost:project.git'. The rescheduled Gerrit task will disappear soon after.

The reason for rescheduling is to avoid a collision with an in-flight push to the same URI, as you can see from the logs. The replication is rescheduled according to your configuration in an attempt to avoid that collision.
You should work out why you have so many concurrent replications to the same URI creating the collisions.
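For reference, the knobs that control this rescheduling live in the replication plugin's replication.config. A minimal sketch, assuming a single mirror remote (the remote name and URL are placeholders; check the option names against the plugin documentation for your Gerrit version):
# $GERRIT_SITE/etc/replication.config
[remote "mirror"]
  url = git://xxx/${name}.git
  replicationDelay = 15   # seconds to coalesce changes before a push is scheduled
  rescheduleDelay = 15    # seconds to wait before retrying a push that collided with an in-flight one
  threads = 2             # worker threads for this remote
Raising rescheduleDelay only spaces out the retries; the in-flight push to that URI still has to complete (or be cleared) before the rescheduled task can run.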

Related

Why is "await Publish<T>" hanging / not completing / not finishing

The following piece of code has been working for some time and it has suddenly stopped returning:
await availableChangedPublishEndpoint
    .Publish<IAvailableStockChanged>(
        AvailableStockCounter.ConvertSkuQtyToAvailableStockChangedEvent(
            newAvailable,
            absMessage.Warehouse)
    );
There is nothing clever in ConvertSkuQtyToAvailableStockChangedEvent - it just maps one simple class to another.
We added logs before and after this code and it's definitely just stopping at this point. Other systems are publishing fine, and other messages are being sent from this application (e.g. logs are actually sent via RabbitMQ). We have redeployed and we have upgraded to the latest MassTransit version. We are seeing that the messages are being published (possibly multiple times), but this Publish method never returns.
We had a broken RabbitMQ node and a clean service restart on one node fixed it. I appreciate there might be other reasons for this behaviour, but this was our problem.
systemctl restart rabbitmq-server
Looking further into RabbitMQ, we saw that some of the empty queues connected to this exchange were not synchronized, and when we tried to synchronize them it wouldn't work.
We also couldn't delete some of these unsynchronized queues.
We believe an unexpected shutdown of one of the nodes caused this problem, but it left most queues/exchanges completely OK.
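If you need to check this from the command line, something like the following shows which classic mirrored queues have unsynchronised mirrors and triggers a manual sync (the queue name is a placeholder; for quorum queues the relevant fields differ):
rabbitmqctl cluster_status
rabbitmqctl list_queues name state slave_pids synchronised_slave_pids
rabbitmqctl sync_queue my-queue    # my-queue is a placeholder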

redis-cli FLUSHALL and FLUSHDB return ok but do nothing after Hubot restores redis

On Ubuntu 16.04, interacting with a local Redis instance via redis-cli, working with a Node Hubot script that uses Redis as its primary data store.
When I type keys * I get a single key, hubot:storage.
So I run FLUSHALL and get an OK response. But if Hubot is running, or as soon as it starts, it restores the value of that key immediately, so I can never delete it.
I've used the INFO command to try to see if the data is persisting on some other Redis instance, and I've cleared all backup files from /var/redis. Basically, I can't figure out where this data is being stored so that it keeps getting restored.
Any advice regarding how I could clear this out or where Hubot may be caching this?
It seems to be related to this code: https://github.com/hubotio/hubot-redis-brain/blob/master/src/redis-brain.js specifically the chunk at line 49 is what gets called before each restore.
Steps
Stop hubot
Flush redis (important that this is done while hubot is NOT running)
Start hubot
The reasoning is that Hubot keeps an in-memory representation of the brain and writes it back out to Redis at intervals. A nicer solution, which would help during script development, would be a command that empties the brain and saves that, but I can't see an obvious API for it in either robot.brain or hubot-redis-brain.
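A minimal shell sketch of the steps above, assuming Hubot runs as a systemd service called hubot (adapt to however you actually run it, e.g. pm2 or an init script):
sudo systemctl stop hubot       # stop the bot first so it cannot write its in-memory brain back
redis-cli FLUSHALL              # careful: wipes every key in the instance, not just hubot:storage
sudo systemctl start hubot
redis-cli keys 'hubot:*'        # hubot:storage should now contain only a freshly saved, empty brain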

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the ZooKeeper logs and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other checks we can do?
We've experienced this error under 2 conditions.
Condition 1
On a single ZooKeeper host there was an orphaned ZooKeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node could not be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the ZooKeeper node that holds the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds", you will need to restart the Solr hosts as well to resolve the problem.
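To confirm you are in this condition, you can compare the election entries against ZooKeeper's live sessions. A rough sketch (host, port and paths are placeholders):
/path/to/zookeeper/bin/zkCli.sh -server zk-host:2181
# inside zkCli: list the election ephemerals Solr has registered
ls /overseer_elect/election
# back in the shell, against the ZooKeeper leader: dump live sessions and their
# ephemeral nodes, then look for an election entry whose session id has no live session
echo dump | nc zk-host 2181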
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point two services were started. Because the second service cannot actually start (they can't both listen on the same port), it either sits there hanging in a failed state or fails to start the process but interferes with the other Solr process, possibly by overwriting temporary clusterstate files locally.
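A sketch of the relevant parts of a unit file (the paths, user and pid-file name are assumptions for a typical bin/solr install; adjust them to your layout):
# /etc/systemd/system/solr.service (sketch)
[Unit]
Description=Apache Solr
After=network.target

[Service]
Type=forking
# bin/solr writes solr-<port>.pid under SOLR_PID_DIR
PIDFile=/var/solr/solr-8983.pid
ExecStart=/opt/solr/bin/solr start
ExecStop=/opt/solr/bin/solr stop
User=solr

[Install]
WantedBy=multi-user.target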
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But this node was completely missing on the cluster where the duplicate Solr instances were attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or java, if you know it's safe) and restart the solr service.
We figured it out!
The issue was that Jetty didn't really stop, so we had two running processes; for whatever reason this was fine for reading but not for writing.
Killing the older Java process solved the issue.
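If you hit the same thing, a quick way to spot the duplicate and decide which process is older (a sketch; the grep pattern assumes 'solr' appears somewhere in the Java command line):
# show pid, start time and elapsed time for every java process mentioning solr
ps -eo pid,lstart,etime,cmd | grep -i '[s]olr'
# kill the older (larger etime) process, then confirm exactly one is left
kill <older-pid>
ps -eo pid,etime,cmd | grep -i '[s]olr'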

How Can I Totally Reset Resque?

My Rails app runs on Heroku using Resque backed by RedisCloud.
Somehow Resque has gotten totally hosed. A few days ago it stopped showing what was currently working (the Overview and Working tabs on resque-web). I tried to solve it by "cleaning up" what appeared to me to be stale keys in Redis.
Bad move! Now failed jobs have stopped showing up on the Failed tab (and there is no failed key in Redis).
At this point I would like to do a clean reset, but how? I basically wiped all keys, following the process of this gem: resque-reset. What else can be done? Where is state kept besides Redis? Of course I also did heroku restart.
But it's still hosed. That is, resque web never shows what's working or what's failed.
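One thing worth checking: Resque keeps all of its state under the resque:* namespace in Redis, so inspecting that namespace on the instance the dynos actually use shows what survived the wipe. A sketch (the app name is a placeholder; REDISCLOUD_URL is the add-on's standard config var):
export REDISCLOUD_URL="$(heroku config:get REDISCLOUD_URL -a your-app)"
redis-cli -u "$REDISCLOUD_URL" keys 'resque:*'     # e.g. resque:queues, resque:workers, resque:failed
redis-cli -u "$REDISCLOUD_URL" del resque:workers  # example only: drop stale worker registrations
heroku restart -a your-app                         # workers should re-register themselves on boot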

post commit hook fail

I have a master/slave setup on Win2k8R with SVN 1.6.9 and TortoiseSVN 1.6.7.
Access is through Apache over HTTP.
Everything works, but when I commit I get the following message:
Error: post-commit hook failed (exit code 1) with output:
Error: The process cannot access the file because it is being used by another process.
This happens when using multiple TortoiseSVN dialogs to commit files in rapid succession. If I use one TortoiseSVN dialog and wait until the commit reply comes back, I don't see the problem. In other words, committing one at a time causes no issue.
The post-commit script output is logged.
Even though I get the above error, when I check the master and slave repositories the files have been replicated okay with no issue.
I am wondering how this issue can be solved.
You should vote for this as an issue in the Subversion issue tracker.
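In the meantime, if the failures come from overlapping hook runs writing to the same log file (an assumption, not confirmed in this thread), one workaround is to have the hook send each run's output to a revision-specific file. A post-commit.bat sketch (the paths and the final command are placeholders):
rem post-commit.bat sketch: %1 = repository path, %2 = revision number
set REPOS=%1
set REV=%2
rem ... your existing replication/notification command goes here ...
rem append to a per-revision log so rapid overlapping commits never share a file handle:
echo post-commit finished for r%REV% >> "%REPOS%\hooks\logs\post-commit-%REV%.log"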