How to properly restart a Kafka S3 sink connector? - amazon-s3

I started a Kafka S3 sink connector (the bundled connector from the Confluent package) on 1 May. It worked fine until 8 May. Checking the status showed that an AWS exception had crashed the connector. This should not be a big problem, so I wanted to restore it.
I tried the following steps:
I sent POST /connectors/s3sink/restart. The connector then showed as RUNNING, but the task was still FAILED.
Then I sent POST /connectors/s3sink/tasks/0/restart. Now the task was RUNNING too.
But when I tailed the log, I found that it had started rewriting old data, such as the 3 May data, and it messed up the old data!
So, does the Connect restart REST API reset the offset? I thought it would save the offset and resume from the offset at which it failed.
And how do I restart a failed connector task correctly? By deleting the pods (we use Kubernetes), or through the REST call /tasks/0/restart? When should I use /connectors/s3sink/restart?

/connectors/:name/restart is a rolling restart operation on the worker leader that needs to propagate to all worker server tasks asynchronously. So you need to ensure there is network connectivity between the leader worker and all the others.
/connectors/:name/tasks/:num/restart sends the request straight to that worker, restarting the thread.
A restart should not reset the offsets, since they are stored in the consumer offsets topic for that Connect cluster. If anything, the tasks were not able to commit offsets back to the __consumer_offsets topic, but you should see logs for that.
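
As a rough illustration of the task-level restart described above, here is a minimal Python sketch that reads the connector status and restarts only the tasks reported as FAILED. The base URL (http://localhost:8083) and the helper names are assumptions made for this example; the connector name s3sink comes from the question.

```python
import json
import urllib.request


def failed_task_ids(status):
    """Return the ids of tasks reported as FAILED in a /status response body."""
    return [t["id"] for t in status.get("tasks", []) if t.get("state") == "FAILED"]


def task_restart_url(base_url, connector, task_id):
    """Build the task restart URL for one task of one connector."""
    return f"{base_url.rstrip('/')}/connectors/{connector}/tasks/{task_id}/restart"


def restart_failed_tasks(base_url, connector):
    """Restart every FAILED task of the connector; returns the restarted task ids.

    urlopen raises HTTPError on non-2xx replies (for example while the
    Connect cluster is rebalancing), so the caller may want to retry.
    """
    status_url = f"{base_url.rstrip('/')}/connectors/{connector}/status"
    with urllib.request.urlopen(status_url) as resp:
        status = json.load(resp)

    restarted = []
    for task_id in failed_task_ids(status):
        req = urllib.request.Request(
            task_restart_url(base_url, connector, task_id),
            data=b"",  # the restart call carries no body
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req).close()
        restarted.append(task_id)
    return restarted
```

Calling restart_failed_tasks("http://localhost:8083", "s3sink") would then leave healthy tasks untouched. It does not look at offsets, so it is still worth checking the worker logs for offset commit failures before restarting.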

Related

GridGain Server partition loss

We have a 3-node GridGain server cluster and 3 client nodes deployed in GCP Kubernetes Engine. Native persistence is enabled on the cluster, and the shutdown policy is set with <property name="shutdownPolicy" value="GRACEFUL"/>. There is one backup for each cache. After an automatic cluster restart we get partition loss and need to reset the lost partitions by executing control commands.
Can you suggest a proper solution for this? We have around 60 GB of persistent data.
<property name="shutdownPolicy" value="GRACEFUL"/> is supposed to protect from partition loss if certain conditions are met:
1. The caches must be either PARTITIONED with backups > 0 or REPLICATED. Check your configs. The default cache config in Ignite is PARTITIONED with backups = 0 (for historical reasons), so the defaults won't work.
2. There must be more than one baseline node (only baseline nodes store data!). Here is the doc.
3. You must stop the nodes in a graceful way. This is a bit tricky since you don't always control this.
   - If you stop the process with kill, make sure it uses SIGTERM and not SIGKILL, because the latter always kills the process immediately.
   - If you stop with Ignite.close(), this should just work.
   - If you stop with Java System.exit(), it'll work, but if you use Runtime.getRuntime().halt(), it won't (because halt() is not graceful).
   - If you use orchestrators such as Kubernetes, you need to make sure they'll stop the nodes gracefully. For example, in Kubernetes you normally have to set terminationGracePeriodSeconds to a high value so that Kubernetes waits for the nodes to finish graceful shutdown instead of killing them.
   - If you use custom startup scripts, you need to make sure they forward signals to the Ignite process.
To debug this, check the points above. I would normally start by looking at the server logs (with IGNITE_QUIET=false!) to see if "Invoking shutdown hook" message is there. If it isn't there then your shutdown hook isn't getting called, and the problem is one of the points under 3. Otherwise, there should be other log messages explaining the situation.
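
To illustrate the point about custom startup scripts, here is a minimal Python sketch of a wrapper that starts a child process and forwards SIGTERM and SIGINT to it, so the child's shutdown hook gets a chance to run. The function name is made up for this example, and the command you pass in (for instance the path to your Ignite start script) is an assumption that depends on your installation. It is meant for POSIX systems and must be called from the main thread.

```python
import signal
import subprocess
import sys


def run_forwarding_signals(cmd, signals=(signal.SIGTERM, signal.SIGINT)):
    """Run cmd as a child process and forward the given signals to it.

    Returns the child's exit code (negative signal number if the child
    was terminated by a signal).
    """
    proc = subprocess.Popen(cmd)

    def forward(signum, _frame):
        # Hand the signal to the child instead of exiting ourselves,
        # so the child can perform its own graceful shutdown.
        if proc.poll() is None:
            proc.send_signal(signum)

    # Install the forwarding handlers, remembering the previous ones.
    previous = {sig: signal.signal(sig, forward) for sig in signals}
    try:
        return proc.wait()
    finally:
        # Restore the original handlers once the child has exited.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

A wrapper script would end with something like sys.exit(run_forwarding_signals(sys.argv[1:])). In a shell script the equivalent is usually to exec the Java process so that it receives the signals directly.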

Manage In-memory cache in multiple servers in aws

Once or twice a day some files are uploaded to an S3 bucket. On every S3 upload, I want the in-memory data on each server to be refreshed with the uploaded data.
Note that there are multiple servers running and I want to store the same data on all of them. Also, the servers scale based on traffic (new servers come up and older ones go down, so the server instances will not always be the same).
In short, I want to keep up-to-date data in the cache.
I want to build an architecture that supports auto-scaling of the servers. I came across the fan-out architecture on AWS, which uses SNS and multiple SQS queues that different servers can poll.
How can we handle auto-scaling of the queues with respect to the servers?
Or is there any other way to handle this scenario?
PS: I'm totally new to the AWS environment.
Any reference would be a great help.
To me there are a few things that you need to have to make this work. These are opinions and, as with most architectural designs, there is certainly more than one way to handle this.
I start with the assumption that you've got an application running on an EC2 of some sort (Elastic Beanstalk, Fargate, Raw EC2s with auto scaling, etc.) and that you've solved for having the application installed and configured when a scale-up event occurs.
Conceptually I'd have this diagram:
The setup involves having the S3 bucket publish events, most likely s3:ObjectCreated events, to the SNS topic. These events will be published when an object in the bucket is created or updated.
Next:
During startup your application will pull the current data from S3.
As part of application startup, create a queue named after the instance ID of the EC2 instance (see here for some examples). The queue needs to be subscribed to the SNS topic. If the queue already exists, that's not an error.
Your application would have a background thread or process that polls the SQS queue for messages.
If you get a message on the queue then that needs to tell the application to refresh the cache from S3.
When an instance is shut down there is an event from at least Elastic Beanstalk and the load balancers that your instance will be shut down. Remove the SQS queue tied to the instance at that time.
The only issue might be that a hard crash of an environment would leave orphan queues. It may be advisable to either manually clean these up or have a periodic task clean them up.
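
As a sketch of the polling side, the Python below shows how a background thread might decide what to refresh once it has received a message from its queue. It assumes the default SNS-to-SQS delivery, where the S3 event arrives as a JSON string in the "Message" field of the SNS envelope. The helper names and the queue-name prefix are hypothetical, and the actual queue creation, subscription and polling (which need an AWS SDK) are left out.

```python
import json
from urllib.parse import unquote_plus


def queue_name_for_instance(instance_id, prefix="cache-refresh-"):
    """Per-instance queue name; the prefix is a hypothetical naming choice."""
    return f"{prefix}{instance_id}"


def updated_objects(sqs_body):
    """Return (bucket, key) pairs for created objects named in one SQS message body."""
    envelope = json.loads(sqs_body)
    # Without raw message delivery, SNS wraps the S3 event as a JSON
    # string under "Message"; with raw delivery the body is the event itself.
    event = json.loads(envelope["Message"]) if "Message" in envelope else envelope

    objects = []
    for record in event.get("Records", []):
        if record.get("eventName", "").startswith("ObjectCreated:"):
            s3 = record["s3"]
            # Object keys in S3 event notifications are URL-encoded.
            objects.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return objects
```

If updated_objects returns a non-empty list, the application would reload those objects (or simply the whole data set) from S3 and then delete the message from the queue.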

Hangfire jobs stuck in processing state

We are using Hangfire 1.7.6 to download data from Azure. However, after running for some time, Hangfire appears to deadlock and gets stuck processing a job. We have to restart the service to get it working again.
There is a recurring job that adds jobs to the other background server. The jobs mostly get stuck while downloading a big file.
Has anyone faced this type of problem of hangfire jobs stuck in processing?
Please let me know if any further information is required. Any help/guidance is appreciated.
Could this be caused by the length of time it takes to complete the download from Azure?
You could try testing this with large files and see how it handles them.
Also, like @jbl asked, how is your Hangfire Server hosted? If it is hosted in IIS, remember that the Hangfire Server may lose its heartbeat if IIS shuts down the application process after it has been idle for a given period of time.
I came across this issue in the past and ended up running the application as a process on the server.
IIS is optimised to save resources, so it will shut down processes that aren't being used. When a request is made to your application, it fires the process back up. This will also cause any scheduled background jobs not to fire.

How to use rotate_logs on a log file that is 80+ GB for RabbitMQ on Windows Server

I need to run rabbitmqctl rotate_logs on a RabbitMQ log file that is over 80 GB in size. When I tried to run this the first time, it froze RabbitMQ and no messages could be received. The freeze lasted 20 minutes before I had to kill the command and restart the RabbitMQ server.
This is a production server, so ideally this would be completed quickly without losing messages or killing the broker.
Would it be possible to shut down the service, move the current log file to another location, restart the service and then run the rotate_logs command?
I'm fairly new to RabbitMQ and I am not sure what the best way to handle this would be.
This is installed on a Windows 2008 server as a service for a heavy-traffic production site (however, the message queue has a small load and only affects the administrative side of things).
Any help or insight would be appreciated.
I ran into a similar situation, but with only about 4 GB of log file instead of 80.
The workaround I used was pretty much what you suggested: stop the service, move the log file and restart the service as quickly as possible.
For me, specifically, instead of moving the file while the service was stopped, I just renamed it. I also wrote a command-line script to do the work for me.
This allowed me to stop the service, rename the file and restart the service in a matter of seconds.
Once the service was back up and running, I was free to move, rename or otherwise deal with the large log file as needed.
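
The original command-line script is not shown, so the following is only a Python sketch of the same stop, rename, start sequence. It assumes the Windows service is registered under the name RabbitMQ and is controlled with net stop / net start; the function name, the timestamp suffix and the injectable run parameter are choices made for this example.

```python
import os
import subprocess
import time


def swap_out_log(log_path, service="RabbitMQ", run=subprocess.run):
    """Stop the service, rename the log file aside, then start the service again.

    Returns the new path of the renamed log file.
    """
    rotated = f"{log_path}.{time.strftime('%Y%m%d-%H%M%S')}"
    run(["net", "stop", service], check=True)
    try:
        # A rename on the same volume does not copy data, so it is fast
        # even for a very large file.
        os.rename(log_path, rotated)
    finally:
        # Bring the broker back up even if the rename failed.
        run(["net", "start", service], check=True)
    return rotated
```

The script needs to run with administrator rights. The renamed file can be compressed, moved or deleted afterwards without affecting the running broker.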

Jenkins removing queued and running build on restart

I have a Jenkins instance to which I send build requests programmatically through the API. My server is restarted once a day.
I have observed that when the Jenkins server is restarted, Jenkins does not keep track of queued jobs and running jobs. We lose the jobs that were due to be triggered.
I also wanted to monitor programmatically whether a queued build was actually executed. But when we restart Jenkins, the queue IDs start again from one.
Is there any way (or any plugin) to persist the queued builds and continue executing them after a restart in the same order in which they were queued?
I would also like the queue numbering to continue from where it was before the restart.
According to this and this, /safeRestart should be enough for what you need.
Or you can use the Naginator plugin to restart builds that failed because Jenkins went down.