Clean up and prevent excessive data accumulation in a MobileFirst Analytics 8.0 environment

Our analytics data is taking up almost 100% of the disk space on the file system. How do we remove the older data and prevent this situation from occurring again?

You can follow the tutorial at https://mobilefirstplatform.ibmcloud.com/tutorials/en/foundation/8.0/installation-configuration/production/server-configuration/#setting-up-jndi-properties-for-mobilefirst-server-web-applications to set up JNDI properties for the MobileFirst Server web applications. Set the TTL values based on your business requirements and keep them as short as possible, so that such data accumulation does not occur again; a sketch of the corresponding server.xml entries follows the steps below. To clean up the existing data, perform the following:
Set up the Analytics server with the JNDI properties for TTL and any other required configuration.
Stop the Analytics server.
Delete the contents of the /analyticsData directory to discard the accumulated data (nothing has been collected under the new configuration yet, so nothing of value is lost; after this step there should be no directories left inside analyticsData). Note: /analyticsData is the default location; refer to http://mobilefirstplatform.ibmcloud.com/tutorials/en/foundation/8.0/installation-configuration/production/analytics/configuration/ to verify the actual value in your environment.
Restart the Analytics server. The indices are now created from scratch with the TTL settings in effect, so data is purged properly from then on.
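For reference, a minimal sketch of what the TTL entries can look like in a WebSphere Liberty server.xml, assuming the analytics/TTL_* JNDI property names described in the linked tutorial; the property names and duration values below are placeholders, so verify the exact names and accepted value format against the documentation for your version:

<!-- Illustrative TTL settings for the Analytics service; keep them as short as the business allows -->
<jndiEntry jndiName="analytics/TTL_AppSession" value="30d"/>
<jndiEntry jndiName="analytics/TTL_NetworkTransactions" value="30d"/>
<jndiEntry jndiName="analytics/TTL_MfpAppLogs" value="14d"/>

With short TTLs in place, expired documents are purged automatically and the index size stays bounded instead of growing until the disk is full.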

Related

WAL log files fill up quickly - how to prevent this?

Currently the logs in the folder "/engine-rocksdb/journals" are filling up the disk (WAL logs).
When does ArangoDB do a cleaning run of these logs and delete them automatically, and how can I trigger this cleaning run earlier? My ArangoDB 3.10 runs in single mode and in a virtual environment (cloud with network storage).
The log files are growing very fast for me because there are many writes to the DB. What is the best way to handle this, any idea?
What I have done so far:
If I set the value "rocksdb.wal-archive-size-limit", it does delete the logs when the set limit is reached, but it shows errors in the logfile:
2022-09-27T17:53:04Z [898948] WARNING [d9793] {engines} forcing removal of RocksDB WAL file '/archive/813371.log' with start sequence 5387062892 because of overflowing archive. configured maximum archive size is 1073741824, actual archive size is: 75401520
However, I still don't understand the meaning of the log output: "configured maximum archive size is 1073741824, actual archive size is: 75401520". The "actual archive size" is smaller?
But what are the consequences of lowering the "wal-archive-size-limit" value? Is it possible to switch off the WAL archive completely? What exactly is it for? As I understand it, ArangoDB needs it for transaction safety (i.e. in case of power loss), right?
In general, yes, this is a good thing, but how can I get ArangoDB to a) limit this WAL archive (without error messages) and b) do a cleaning run faster?
thx :-)
When does ArangoDB do a cleaning run of these logs and delete them automatically and how to trigger this cleaning run earlier?
ArangoDB uses RocksDB underneath, and RocksDB will move WAL files (.log files) into its archive as soon as possible. In order to do so, all data from a WAL file needs to be safely stored in the column families' .sst files and flushed to disk.
ArangoDB will delete files from the WAL archive (and only from there) once it can ensure that an archived WAL file is not used anymore. It will not remove files from the archive that are, or may be, in current use.
There are a few reasons why ArangoDB may keep archived WAL files for some time:
when server-to-server replication is used: while a follower replicates data, it may read from the leader's WAL. Deleting the WAL file on the leader may make the replication fail
when arangodump is used to create a database dump, it will create a snapshot of data on the server, and the WAL files for that snapshot will be kept around until the snapshot isn't needed anymore (i.e. arangodump finishes).
for the first 180 seconds after server start, all WAL files are intentionally kept, for forensic reasons and to allow followers to replay events from a leader's WAL when it is restarted. The value of 180 seconds can be changed by adjusting the startup option --rocksdb.wal-file-timeout-initial.
there can be some background processing of changes that may refer to data from WAL files. For example, each insert into a collection will need to increase the collection's count() value by 1. To save an extra write into RocksDB on each insert, the count() value is only written to the storage engine by a background thread, ideally only once every X insert operations. However, this may lead to WAL files being around for a bit longer, especially if the background thread cannot keep up with the insert workload.
There is the startup option --rocksdb.wal-archive-size-limit to put a hard limit on the cumulative size of the WAL files in the archive. From your question, it appears that you are currently using ArangoDB version 3.10.
From the warning message you posted, it seems that the WAL archive cleanup somehow applies the wrong limit values.
It turns out that there has been a recent bugfix, released in ArangoDB versions 3.10.1, 3.9.4, and 3.8.8, that should rectify this behavior. So upgrading to one of these or a later version may actually help when using the WAL archive size limit.
We shared your question in the Speedb hive on Discord, and here is what we got for you:
"By default, ArangoDB set the max_wal_size to 1G the value of rocksdb.wal-archive-size-limit must be set to at least twice this number (otherwise you may end up with a single WAL file and the delete will fail)."
Hope this helps; if it doesn't, or if you have follow-up questions, please join the Speedb Discord and we will be happy to help.
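Putting the above together, a minimal sketch of the relevant settings in arangod.conf, assuming the usual layout where --rocksdb.* options go under a [rocksdb] section; only the two options discussed here are shown, and the 2 GiB value simply follows the "at least twice the 1 GiB max_wal_size" rule of thumb quoted above, so adjust it to your workload:

[rocksdb]
# Hard cap on the cumulative size of archived WAL files, in bytes (2 * 1073741824 = 2 GiB)
wal-archive-size-limit = 2147483648
# Optional: keep all WAL files for only 60 seconds after startup instead of the default 180
wal-file-timeout-initial = 60

The same options can also be passed on the command line as --rocksdb.wal-archive-size-limit and --rocksdb.wal-file-timeout-initial.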

Foundry Data Storage Optimization

Hi, I have a general question about pipeline optimization in order to lower storage usage.
Does deleting trashed datasets help alleviate disk storage? For example, removing obsolete datasets: a) based on business knowledge and utilization, and b) datasets in the trash.
Also, we'd like to manage the copies of datasets that are stored when a schedule runs. We believe that if we ever had to fall back to a previous version, we would only need to reference the latest one, as opposed to keeping multiple copies.
Does this affect storage? And is there a way to manage configuration on this?
Deleting trashed datasets (in typical setups) will result in their underlying files being deleted, but typically a larger driver of storage consumption is the set of previous dataset views kept by default.
You can control the length of time these files and views are kept using the Foundry Retention service. I'd recommend you consult with platform documentation and your support team for configuration of this service.
Retention will compute and mark files matching your configuration for deletion and periodically delete them, thus reducing your storage consumption.

Ceph Object Gateway: what is the best backup strategy?

I have a Ceph cluster managed by Rook with a single RGW store on top of it. We are trying to figure out the best backup strategy for this store. We are considering the following options: using rclone to back up objects via an S3 interface, using s3fs-fuse (we haven't tested it yet, but s3fs-fuse is known to be not reliable enough), and using NFS-Ganesha to re-export the RGW store as an NFS share.
We are going to have quite a lot of RGW users and quite a lot of buckets, so all three solutions do not scale well for us. Another possibility is to perform snapshots of RADOS pools backing the RGW store and to backup these snapshots, but the RTO will be much higher in that case. Another problem with snapshots is that it does not seem possible to perform them consistently across all RGW-backing pools. We never delete objects from the RGW store, so this problem does not seem to be that big if we start snapshotting from the metadata pool - all the data it refers to will remain in place even if we create a snapshot on the data pool a bit later. It won’t be super consistent but it should not be broken either. It’s not entirely clear how to restore single objects in a timely manner using this snapshotting scheme (to be honest, it’s not entirely clear how to restore using this scheme at all), but it seems to be worth trying.
What other options do we have? Am I missing something?
We're planning to implement Ceph in 2021.
We don't expect a large number of users and buckets, initially.
While waiting for https://tracker.ceph.com/projects/ceph/wiki/Rgw_-_Snapshots, I successfully tested the following solution to protect the object store, taking advantage of a multisite configuration + sync policy (https://docs.ceph.com/en/latest/radosgw/multisite-sync-policy/) in the "Octopus" version.
Assuming you have all zones in the Prod site zone-synced to the DRS:
Create a zone in the DRS, e.g. "backupZone", that is not zone-synced from or to any of the other Prod or DRS zones; the endpoints for this backupZone are on 2 or more DRS cluster nodes.
Using rclone (https://rclone.org/s3/), write a bash script that, for each "bucket" in the DRS zones, creates a version-enabled "bucket"-p in the backupZone and schedules a sync, e.g. twice a day, from "bucket" to "bucket"-p (a sketch of such a script follows the notes below).
Protect access to the backupZone endpoints so that no ordinary user (or integration) can reach them; they should only be accessible from the other nodes in the cluster (obviously) and from the server running the rclone-based script.
When there is a failure, just recover all the objects from the *-p buckets, once again using rclone, to the original buckets or to a filesystem.
This protects from the following failures:
Infra:
Bucket or pool failure;
Pervasive object corruption;
Loss of a site.
Human error:
Deletion of versions or objects;
Removal of buckets;
Elimination of entire pools.
Notes:
Only the latest version of each object is synced to the protected (*-p) bucket, but if the script runs several times you have the latest states of the objects through time.
When an object is deleted in the prod bucket, rclone just flags the object with a DeleteMarker upon sync.
This does not scale!! As the number of buckets increases, the time to sync becomes untenable.
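A minimal sketch of the rclone-based script mentioned above, assuming two remotes named "drs" (the DRS zone endpoints) and "backup" (the backupZone endpoints) are already configured in rclone.conf, and that versioning on the *-p buckets is enabled separately; remote and bucket names are placeholders:

#!/usr/bin/env bash
# Sync every bucket in the DRS zone to its version-enabled twin in the backupZone.
set -euo pipefail

# List bucket names on the "drs" remote (assumes bucket names contain no spaces).
for bucket in $(rclone lsd drs: | awk '{print $NF}'); do
    # Create the protected bucket on the first run (no-op if it already exists).
    rclone mkdir "backup:${bucket}-p"
    # One-way sync; deletions in prod only become DeleteMarkers on the versioned target.
    rclone sync "drs:${bucket}" "backup:${bucket}-p"
done

# Recovery after a failure is the reverse direction, e.g.:
# rclone sync "backup:somebucket-p" "drs:somebucket"

Running it, e.g. twice a day, is then just a cron entry on the host that runs the script.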

Updating OpenFlow group table bucket list in OpenDaylight

I have a mininet (v2.2.2) network with openvswitch (v2.5.2), controlled by OpenDaylight Carbon. My application is an OpenDaylight karaf feature.
The application creates a flow (for multicasts) to a group table (type=all) and adds/removes buckets as needed.
To add/remove buckets, I first check if there is an existing group table:
// Build the path to the group under the node's flow-capable augmentation.
InstanceIdentifier<Group> groupIid = InstanceIdentifier.builder(Nodes.class)
        .child(Node.class, new NodeKey(NodId))
        .augmentation(FlowCapableNode.class)
        .child(Group.class, grpKey)
        .build();
// Read the group from the operational datastore with a read-only transaction.
ReadOnlyTransaction roTx = dataBroker.newReadOnlyTransaction();
Future<Optional<Group>> futOptGrp = roTx.read(LogicalDatastoreType.OPERATIONAL, groupIid);
If it doesn't find the group table, it is created (SalGroupService.addGroup()). If it does find the group table, it is updated (SalGroupService.updateGroup()).
The problem is that it takes some time after the RPC call add/updateGroup() to see the changes in the data model. Waiting for the Future<RPCResult<?>> doesn't guarantee that the data model has the same state as the device.
So, how do I read the group table and bucket list from the data model and make sure that I am indeed reading the same state as the current state of the device?
I know that
Add/UpdateGroupInputBuilder has setTransactionUri()
DataBroker gives transactions to read/write
you should use transaction chaining
But I cannot figure out how to combine these.
Thank you
EDIT: Or do I have to use write transactions instead of RPC calls?
I dropped using RPC calls for writing flows and switched to using writes to the config datastore. It will still take some time to see the changes appear in the actual device and in the operational datastore but that is ok as long as I use the config datastore for both reads and writes.
However, I have to keep in mind that it is not guaranteed that changes to the config datastore will always make it to the actual device. My flows are not that complicated in the sense that conflicts are unlikely to happen. Still, I will probably check consistency between operational and configuration datastore.
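For reference, a minimal sketch of what such a config-datastore write can look like with the Carbon-era MD-SAL binding API (error handling trimmed; groupIid and group are assumed to be built the same way as in the question):

// Put (create or replace) the group in the CONFIG datastore; the OpenFlow
// plugin pushes it to the switch and reflects the result in OPERATIONAL later.
WriteTransaction wTx = dataBroker.newWriteOnlyTransaction();
wTx.put(LogicalDatastoreType.CONFIGURATION, groupIid, group, true);
Futures.addCallback(wTx.submit(), new FutureCallback<Void>() {
    @Override
    public void onSuccess(Void result) {
        // The write reached the config datastore; the device and the
        // operational datastore may still lag behind for a short while.
    }
    @Override
    public void onFailure(Throwable t) {
        // Log or retry the failed commit.
    }
});

Reads for the existence check then go against the same CONFIGURATION datastore, which avoids racing the switch; a separate comparison against OPERATIONAL can still be done when consistency needs to be verified.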

gun db storage model for large centrally stored data, tiny collaborative clients

Use Case:
Say I wanted to create a realtime-collaborative document editing system.
In this scenario many users could create and collaborate on many documents.
Due to client-device constraints, it's not possible for any client to keep a replica of all documents, only a handful.
There needs to be a central storage server where all documents always live, and this server is always backed up.
Each client can 'subscribe' to any document, and all clients subscribed can see realtime changes of all other clients subscribed/editing the same document.
Questions:
Since each client can't store all documents, there needs to be a way to remove the replicas of 'old' documents from the client, without deleting the document from the central store, ideally based on an automatic least-recently-used approach. How is this handled in gun?
In gun, how can a document be deleted from the central store, so it's then effectively permanently removed from, and no longer accessible to, all clients?
When a document is deleted from the central store, how is the physical storage space ever actually reclaimed for later use?
Great questions, #user2672083. Here is the current layout:
Collaborative realtime document editing is possible with gun. Here is a quick prototype I recorded a long time ago; however, there are no full pre-built examples/implementations yet.
Not all data is stored on every client by default. The browser only keeps the data it requests/gets/subscribes to.
The default server already acts as a backup. I recommend using the S3 storage adapter, because then you do not have to worry about running out of disk space.
Removing old replicas: currently, if I want the server to act as a central "master", I just put a localStorage.clear() at the top of my browser code. This forces the browser to always load the latest from the server. This is not ideal, though; an LRU-specific feature is coming soon, according to the roadmap.
Permanently removing data and reclaiming space: while this would be easy for a purely central setup, gun is P2P by default, so it uses a technique called tombstoning to delete data. Given a lot of requests (like yours) for LRU/TTL/GC/deleting, there will be better support for this in the future. Currently, you have to use a mix of rm data.json, localStorage.clear() and 30-day lifecycles on S3 to get this to work. This will be more integrated/easier in the future.
Now a question for you: What are you working on, and how can I help? Many of the things you asked about are possible (with some work) now, but slated to be the focus of the next version of gun - I'd love to get your feedback as we build this out.
All peers reply to requests for data (#2), meaning that localStorage and the server will both reply. Because localStorage is physically closer to the user, it will reply first/fastest, and then the replies from the server will be merged. GUN does not try each peer "in sequence" doing try/catch cascades; GUN gets replies from all peers in parallel.
GUN has swappable storage and transport interfaces, so yes, it is easy to build other layers on top or into it.