I'm using an Apache Flink + RabbitMQ stack. I know it's possible to manually trigger savepoints and restore jobs from them, but the problem is that Flink acknowledges messages after a successful checkpoint, so if you take a savepoint and restore state from it, you lose all data between the last successful savepoint and the last successful checkpoint. Is there a way to restore a job from a checkpoint? That would solve the problem of losing data with non-replayable data sources (like RabbitMQ). By the way, if we have checkpoints with all their overhead anyway, why not let users use them?
Conceptually, a savepoint is nothing more than a checkpoint plus a bit of metadata. In both cases (savepoint and checkpoint), Flink creates a consistent snapshot of the state of all operators, sources, and sinks.
Checkpoints are considered an internal mechanism for failure recovery. However, checkpoints can be configured as externalized checkpoints. Externalized checkpoints are not automatically cleaned up when a job terminates and can be used to manually restart a program.
Your problem with the RabbitMQ source is that it somewhat violates Flink's checkpointing semantics: by acking messages on checkpoint, it pushes state to an external system, and those acks cannot be undone.
Would a mechanism that triggers a savepoint and immediately shuts the job down afterwards solve your problem? That would prevent a checkpoint from being triggered after the savepoint was taken.
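For what it's worth, later Flink releases added exactly such a mechanism: the `stop` CLI command takes a savepoint and stops the job in one operation, so no checkpoint can slip in between. A rough sketch of the workflow (flag names and the savepoint path are illustrative and vary by Flink version, so check the CLI help for your release):

```
# Take a savepoint, then stop the job, as a single operation
flink stop --savepointPath hdfs:///flink/savepoints <jobId>

# Later, resume the job from that savepoint
flink run -s hdfs:///flink/savepoints/savepoint-XXXX your-job.jar
```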
I'm hoping to get some opinions on what could be the cause of strange checkpoint behaviour in SQL Server.
I have a database which is in the SIMPLE recovery model and starts at 10 GB in size. The database is on a SQL Server 2017 instance and is configured for Indirect Checkpoints with target_recovery_time_in_seconds set to 60.
We have alerts that trigger at 70% transaction log usage, which is typically when an internal CHECKPOINT would occur. We then continued to receive alerts as the transaction log grew and eventually registered 99% full, after which no further growth occurred.
The log_reuse_wait_desc column in sys.databases showed ACTIVE_TRANSACTION as the reason why the last attempted log truncation failed. I confirmed, using just about every relevant DMV, that there were no active transactions running.
Issuing a CHECKPOINT manually cleared the wait_desc and truncated the log.
My theory is that the database had an active transaction at the time log truncation was last attempted, either when the 70% log usage threshold was breached or later when the target number of dirty buffers to be flushed to disk was reached. In either case, an active transaction at that point prevented log truncation. Since that last checkpoint there was minimal activity, so no further checkpoint was attempted because the dirty buffer threshold was not reached. Therefore, even though there was by then no active transaction, log truncation could not take place until a CHECKPOINT was issued.
I intend to turn on Trace Flag 3502 to see the checkpoint activity while such a transaction is running.
Has anyone ever encountered this behaviour, or does anyone know whether SQL Server is configured to back off from running checkpoints when above 70% transaction log usage, even as the log continues to fill?
Many thanks!
As pointed out by @sepupic, the checkpoint issued at 70% log space usage is a characteristic of automatic checkpoints and not internal checkpoints (see comments on the question).
The simple reason for the observed behaviour is that the indirect checkpoints would have responded to dirty page threshold breaches while the active transaction continued to execute. The active transaction prevented those checkpoints from truncating the log, and so the transaction log continued to grow.
After the last indirect checkpoint ran, and once the previously active transaction (the one preventing log truncation) completed, there were not enough dirty pages to trigger another indirect checkpoint.
That is why log_reuse_wait_desc still showed ACTIVE_TRANSACTION even though no active transaction was found upon investigation, and why the log file usage was immediately cleared by issuing a manual CHECKPOINT command.
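For anyone chasing the same symptom, a minimal diagnostic sequence looks like this (the database name is a placeholder; run against your own database):

```sql
-- Check why log truncation is being held up
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = N'YourDatabase';

-- Confirm there really are no open transactions
SELECT * FROM sys.dm_tran_active_transactions;

-- If log_reuse_wait_desc shows a stale ACTIVE_TRANSACTION with no
-- open transaction, a manual checkpoint clears it and lets truncation run
USE YourDatabase;
CHECKPOINT;
```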
As per apache ignite spring data documentation, there are two method to save the data in ignite cache:
1. org.apache.ignite.springdata.repository.IgniteRepository.save(key, value)
and
2. org.apache.ignite.springdata.repository.IgniteRepository.save(Map<ID, S> entities)
So, I just want to understand the transaction behavior of the 2nd method. Suppose we save 100 records using the save(Map<ID, S>) method, and for some reason some nodes go down after 70 records. In this case, will it roll back all 70 records?
Note: As per the 1st method's behavior, if we use @Transactional at the method level, then it will roll back the save of that particular entity.
First of all, you should read about the transaction mechanism used in Apache Ignite. It is described very well in the articles here:
https://apacheignite.readme.io/v1.0/docs/transactions#section-two-phase-commit-2pc
The most interesting parts for you are "Backup Node Failures" and "Primary Node Failures":
Backup Node Failures
If a backup node fails during either "Prepare" phase or "Commit" phase, then no special handling is needed. The data will still be committed on the nodes that are alive. GridGain will then, in the background, designate a new backup node and the data will be copied there outside of the transaction scope.
Primary Node Failures
If a primary node fails before or during the "Prepare" phase, then the coordinator will designate one of the backup nodes to become primary and retry the "Prepare" phase. If the failure happens before or during the "Commit" phase, then the backup nodes will detect the crash and send a message to the Coordinator node to find out whether to commit or rollback. The transaction still completes and the data within distributed cache remains consistent.
In your case, all updates for all values in the map should be committed in one transaction or rolled back together. I hope these articles answer your question.
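To make the quoted protocol concrete, here is a toy two-phase-commit sketch (this is an illustration of the idea, not Ignite's actual implementation; the `Node` class and function names are made up). If any node votes "no" during the prepare phase, every node rolls back its staged writes, so a batch of 100 entries is either applied everywhere or nowhere:

```python
class Node:
    """A toy participant that stages writes, then commits or rolls back."""

    def __init__(self, fail_on_prepare=False):
        self.committed = {}              # durable state
        self.staged = {}                 # writes held during "prepare"
        self.fail_on_prepare = fail_on_prepare

    def prepare(self, entries):
        """Phase 1: stage the writes and vote yes/no."""
        if self.fail_on_prepare:
            return False                 # vote "no" (e.g. the node crashed)
        self.staged = dict(entries)
        return True                      # vote "yes"

    def commit(self):
        """Phase 2: make the staged writes durable."""
        self.committed.update(self.staged)
        self.staged = {}

    def rollback(self):
        """Phase 2 (failure path): discard the staged writes."""
        self.staged = {}


def two_phase_commit(nodes, entries):
    """Commit entries on every node, or on none of them."""
    if all(node.prepare(entries) for node in nodes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.rollback()
    return False
```

With one failing participant, none of the 100 entries land on any node; with all participants healthy, the whole batch commits everywhere.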
What is a checkpoint in a SQL Server transaction, and what are the different types of checkpoint?
A checkpoint writes the current in-memory modified pages (known as dirty pages) and transaction log information from memory to disk and also records information about the transaction log.
Automatic
Issued automatically in the background to meet the upper time limit suggested by the recovery interval server configuration option. Automatic checkpoints run to completion. Automatic checkpoints are throttled based on the number of outstanding writes and whether the Database Engine detects an increase in write latency above 20 milliseconds.
Indirect
Issued in the background to meet a user-specified target recovery time for a given database. The default is 0, which indicates that the database will use automatic checkpoints, whose frequency depends on the recovery interval setting of the server instance.
Manual
Issued when you execute a Transact-SQL CHECKPOINT command. The manual checkpoint occurs in the current database for your connection. By default, manual checkpoints run to completion. Throttling works the same way as for automatic checkpoints. Optionally, the checkpoint_duration parameter specifies a requested amount of time, in seconds, for the checkpoint to complete.
Internal
Issued by various server operations such as backup and database-snapshot creation to guarantee that disk images match the current state of the log.
A checkpoint creates a known good point from which the SQL Server Database Engine can start applying changes contained in the log during recovery after an unexpected shutdown or crash.
While doing batch delete operations, forcing a CHECKPOINT helped the deletion complete faster.
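The manual and indirect variants above map directly to Transact-SQL (the database name is a placeholder):

```sql
-- Manual checkpoint, asking SQL Server to try to finish within 10 seconds
CHECKPOINT 10;

-- Switch a database to indirect checkpoints with a 60-second target
-- recovery time (setting it back to 0 reverts to automatic checkpoints)
ALTER DATABASE YourDatabase
SET TARGET_RECOVERY_TIME = 60 SECONDS;
```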
I have a bunch of summary nodes (scalars, histograms, etc) that are constantly writing to the log. Checkpointing is not as frequent, and so I often have situations in which I'm recovering from a checkpoint that is earlier than the events that have been written to the log. When I resume from the checkpoint and start writing to the log again, what exactly happens? Do the old events get overwritten? The documentation is not very clear on this. Looking in TensorBoard, it appears as if the "future" events are still there. Ideally I'd like to flush everything ahead of the current global_step and just start over.
TensorBoard does have logic to handle this case - it looks for restart events, and tries to purge everything with a global_step greater than the restart step. See this code. If you are still seeing the orphaned events, that means something isn't working - maybe the SessionLog.START event isn't being written when your job restarts from checkpoint?
Can you create a simple repro of this and file an issue on GitHub?
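The purge-on-restart behaviour described above can be modelled in a few lines (this mirrors the idea, not TensorBoard's actual code; the event representation and function names are invented for illustration): when a restart marker arrives with a given step, everything already seen at or after that step is discarded as orphaned.

```python
SESSION_START = "session_start"   # stands in for a SessionLog.START event


def apply_event(kept, event):
    """Fold one event into the list of kept events.

    An event is a (kind, step, value) tuple, where kind is "scalar" or
    SESSION_START. On a restart marker, events whose step is >= the
    restart step are dropped as orphaned "future" events.
    """
    kind, step, _value = event
    if kind == SESSION_START:
        return [e for e in kept if e[1] < step]
    return kept + [event]


def load(events):
    """Replay an event stream the way a purging reader would."""
    kept = []
    for event in events:
        kept = apply_event(kept, event)
    return kept
```

A run that wrote steps 0 through 4, restarted from a checkpoint at step 2, and then continued ends up with no duplicate steps: the pre-restart values for steps 2 to 4 are purged and only the post-restart ones survive.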
We are using HSQLDB (2.2.8) as an in-memory database for our web application, which runs on the Tomcat web server (6.19). We have set the maximum size of the log file with hsqldb.log_size=200 (i.e. 200 MB),
but on some instances of our production environment the log file (~/tomcat/work/hypersonic/localDB.log) grows way beyond that limit (40 GB).
Looking further into the logs, we found that the DB stopped performing the CHECKPOINT operation. What is the default behavior of HSQLDB for performing periodic CHECKPOINT operations? Is there any way we can stop this log file from growing?
After the size of the .log file reaches its limit, the CHECKPOINT operation is performed once all connections to the database have committed. You may have a connection with an uncommitted transaction.
You can check the INFORMATION_SCHEMA.SYSTEM_SESSIONS table and see if there is a session in transaction. You can reset such sessions with the ALTER SESSION statement.
http://www.hsqldb.org/doc/2.0/guide/sessions-chapt.html
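A minimal version of that check, per the sessions chapter linked above (the session id 42 is a placeholder taken from the query's output, and ALTER SESSION requires admin rights):

```sql
-- Look for a session sitting in an open (uncommitted) transaction
SELECT * FROM INFORMATION_SCHEMA.SYSTEM_SESSIONS;

-- Roll back that session's transaction so the CHECKPOINT can proceed
ALTER SESSION 42 RELEASE;
```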