Currently, the WAL logs in the folder “/engine-rocksdb/journals” are filling up the disk.
When does ArangoDB do a cleaning run of these logs and delete them automatically, and how can I trigger this cleaning run earlier? My ArangoDB 3.10 runs in single-server mode and in a virtual environment (cloud with network storage).
The log files are growing very fast for me because there are many writes to the DB. What is the best way to handle this, any idea?
What I have done so far:
If I set the value “rocksdb.wal-archive-size-limit”, it does delete the logs when the set limit is reached, but it shows errors like this in the logfile:
2022-09-27T17:53:04Z [898948] WARNING [d9793] {engines} forcing removal of RocksDB WAL file '/archive/813371.log' with start sequence 5387062892 because of overflowing archive. configured maximum archive size is 1073741824, actual archive size is: 75401520
However, I still don't understand the meaning of the logfile output: "configured maximum archive size is 1073741824, actual archive size is: 75401520". The "actual archive size" is smaller?
But what are the consequences of lowering the "wal-archive-size-limit" value? Is it possible to switch off the WAL archive completely? What exactly is it for? As I understand it, ArangoDB needs it for transaction security (i.e. in case of power loss), right?
In general, yes, this is a good thing, but how can I get ArangoDB to a) limit this WAL archive (without error messages) and b) do a cleaning run faster?
thx :-)
When does ArangoDB do a cleaning run of these logs and delete them automatically and how to trigger this cleaning run earlier?
ArangoDB uses RocksDB underneath, and RocksDB will move WAL files (.log files) into its archive as soon as possible. In order to do so, all data from the WAL file needs to have been safely stored in the column families' .sst files and flushed to disk.
ArangoDB will delete files from the WAL archive (and only from there) once it can ensure that an archived WAL file is not used anymore. It will not remove files from the archive that are, or may be, in current use.
There are a few reasons why ArangoDB may keep archived WAL files for some time:
when server-to-server replication is used: while a follower replicates data, it may read from the leader's WAL. Deleting the WAL file on the leader may make the replication fail.
when arangodump is used to create a database dump, it will create a snapshot of data on the server, and the WAL files for that snapshot will be kept around until the snapshot isn't needed anymore (i.e. until arangodump finishes).
for the first 180 seconds after server start, all WAL files are intentionally kept, for forensic reasons and to allow followers to replay events from a leader's WAL after a restart. The value of 180 seconds can be changed by adjusting the startup option --rocksdb.wal-file-timeout-initial (see the sketch after this list).
there can be some background processing of changes that may refer to data from WAL files. For example, each insert into a collection will need to increase the collection's count() value by 1. To save an extra write into RocksDB on each insert, the count() value is only written to the storage engine by a background thread, ideally only once every X insert operations. However, this may lead to WAL files being around for a bit longer, especially if the background thread cannot keep up with the insert workload.
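If the initial retention window is the relevant factor in your setup, it can be shortened. The following is only an illustrative sketch; the 60-second value is an assumption, not a recommendation:
arangod --rocksdb.wal-file-timeout-initial 60
Lowering this value means archived WAL files become eligible for deletion sooner after a restart, at the cost of a smaller window for followers to resume replication from the leader's WAL.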
There is the startup option --rocksdb.wal-archive-size-limit to put a hard limit on the cumulative size of the WAL files in the archive. From your question, it appears that you are currently using ArangoDB version 3.10.
From the warning message you posted, it seems that the WAL archive cleanup somehow applies the wrong limit values.
It turns out that there has been a recent bugfix, released in ArangoDB versions 3.10.1, 3.9.4, and 3.8.8, that should rectify this behavior. So upgrading to one of these or later versions may actually help when using the WAL archive size limit.
I shared your question in the Speedb hive on Discord, and here is what we got for you:
"By default, ArangoDB sets max_wal_size to 1G; the value of rocksdb.wal-archive-size-limit must be set to at least twice this number (otherwise you may end up with a single WAL file and the delete will fail)."
Hope this helps. If it doesn't, or you have follow-up questions, please join the Speedb Discord and we will be happy to help.
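Putting that advice into numbers: with a 1 GB (1073741824 bytes) max_wal_size, the archive limit should be at least 2 × 1073741824 = 2147483648 bytes. A purely illustrative invocation (treat the exact value as an assumption to adapt, not a verified recommendation):
arangod --rocksdb.wal-archive-size-limit 2147483648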
The Redis docs say:
Redis is very data backup friendly since you can copy RDB files while the database is running: the RDB is never modified once produced, and while it gets produced it uses a temporary name and is renamed into its final destination atomically using rename(2) only when the new snapshot is complete.
But what happens when I copy the snapshot to another location for backup purposes while Redis is performing the rename to replace the snapshot? Is my backup broken, or is there something I'm missing? Do I need a "safe" timeframe to copy the RDB snapshot, during which Redis does not write a new snapshot, or is this something I don't have to care about?
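For illustration only (the paths below are assumptions), one conservative way I could take a copy is to trigger a fresh snapshot, wait until it has finished, and only then copy the file:
redis-cli BGSAVE
redis-cli INFO persistence | grep rdb_bgsave_in_progress
cp /var/lib/redis/dump.rdb /backup/dump.rdb
Here the INFO check would be repeated until rdb_bgsave_in_progress reports 0 before the copy is made.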
I know that a Hive map-side join uses memory.
Can I use an SSD instead of memory?
I want to do a map-side join by putting the dimension table on the SSD.
Is it possible?
I will try to answer your question by explaining the Hadoop distributed cache:
DistributedCache is a facility provided by the Map-Reduce framework to cache files (in your case, the Hive table you want to join) needed by applications.
The DistributedCache assumes that the files specified via URLs are already present on the FileSystem (this is your SSD or HDD) at the path specified by the URL and are accessible by every machine in the cluster.
So, ironically, it is the Hadoop framework that decides whether to put the map file in memory (RAM/YARN) or on SSD/HDD, depending on the map file's size.
Although, by default, the maximum size of a table to be used in a map join (as the small table) is 1,000,000,000 bytes (about 1 GB), you can also increase this manually via Hive set properties, for example:
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=2000000000;
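For completeness, a hedged sketch of how this might look end to end (the table and column names below are made up for illustration); hive.auto.convert.join also needs to be enabled so Hive converts the join to a map join automatically when the small table fits under the configured size:
set hive.auto.convert.join=true;
SELECT f.order_id, d.customer_name
FROM fact_orders f
JOIN dim_customers d ON f.customer_id = d.customer_id;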
The framework will copy the necessary files onto the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job, and from the ability to cache archives, which are un-archived on the slaves.
You can find more about the distributed cache at these links:
https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/filecache/DistributedCache.html
https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html
What will happen to my Redis data if no expiry is set and no DEL command is used?
Will it be removed after some default time?
One more question:
How does Redis store data? Is it in any file format? I ask because I can access data even after restarting the computer. So which files are created by Redis, and where?
Thanks.
Redis is an in-memory data store, meaning all your data is kept in RAM (i.e. it is volatile). So, theoretically, your data will live as long as you don't turn the power off.
However, it also provides persistence in two modes:
RDB mode, which takes snapshots of your dataset and saves them to disk in a file called dump.rdb. This is the default mode.
AOF mode, which records every write operation executed by the server in an append-only file and then replays it on startup, thus reconstructing the original data (a minimal example configuration is sketched after this list).
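A minimal, hedged redis.conf sketch showing how the two modes are typically enabled (the exact thresholds are assumptions; check your version's defaults):
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfsync everysec
The save lines control when RDB snapshots are taken (e.g. after 900 seconds if at least 1 key changed), while appendonly yes turns on AOF logging.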
Redis persistence is explained very well here and here by the creator of Redis himself.
I am setting up Postgres 9.0 binary replication and archive backup.
For archiving on the Master, I have these options:
a) Copy the archive to its local backup location
b) Copy to a shared network location also accessible by the Standby
c) rsync archive files to the Standby server
Due to a hard disk space problem on the Master, I gave up on (a). Since I don't want to redo the base backup too often, the archive grows huge in size over time.
For (b) and (c), with PostgreSQL 9.0 binary replication, I understand that they give this benefit: if the Standby can't keep up with the Master, the Standby can always recover itself from the archive (which is created by the Master) by having restore_command in recovery.conf. But I don't prefer (b) or (c) due to the network complexity. Also, to avoid the problem of the Standby going out of sync with the Master, I set *wal_keep_segments* to a huge value.
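For reference, a hedged sketch of what such a recovery.conf on the Standby could look like in 9.0 (host, user, and paths are placeholders, not values from my setup):
standby_mode = 'on'
primary_conninfo = 'host=master_host port=5432 user=replicator'
restore_command = 'cp /mnt/wal_archive/%f %p'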
But in order to have a complete backup system, I still need archiving. I would prefer to enable archiving on the Standby server. Is it possible to do this, given that the Standby is in recovery mode (there is always a recovery.conf file present)?