How to add a new repository outside graphdb.home - graphdb

I am cloning a large public triplestore for local development of a client app.
The data is too large to fit on the SSD partition where <graphdb.home>/data lives. How can I create a new repository at a different location to host this data?

On startup, GraphDB reads the value of the graphdb.home.data parameter. By default it points to ${graphdb.home}/data. You have two options:
Move all repositories to the big non-SSD partition
You need to start GraphDB with ./graphdb -Dgraphdb.home.data=/mnt/big-drive/ or edit the value of the graphdb.home.data parameter in ${graphdb.home}/conf/graphdb.properties.
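For example (the exact target path under /mnt/big-drive is just illustrative), the two options look like this:

# Option 1: pass the property on the command line when starting GraphDB
./graphdb -Dgraphdb.home.data=/mnt/big-drive/graphdb-data

# Option 2: set it permanently in ${graphdb.home}/conf/graphdb.properties
graphdb.home.data = /mnt/big-drive/graphdb-data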
Move a single repository to a different location
GraphDB does not allow creating a repository if its directory already exists. The easiest way to work around this is to create a new empty repository bigRepo, initialize it by making at least one request to it, and then shut down GraphDB. Then move the directory ${graphdb.home}/data/repositories/bigRepo/storage/ to your new big drive and create a symbolic link in its place on the file system, e.g. ln -s /mnt/big-drive/bigRepo/storage data/repositories/bigRepo/storage (see the sketch below).
The same technique also works for moving individual files only.
Make sure that all permissions are set correctly by starting GraphDB with the same user.
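Put together, a minimal sketch of the whole sequence looks like this; the target directory under /mnt/big-drive and the graphdb user/group name are assumptions, so adjust them to your setup:

# assuming GRAPHDB_HOME points to your graphdb.home directory and
# GraphDB has been stopped after initialising the empty bigRepo repository
mkdir -p /mnt/big-drive/bigRepo
mv "$GRAPHDB_HOME"/data/repositories/bigRepo/storage /mnt/big-drive/bigRepo/storage
ln -s /mnt/big-drive/bigRepo/storage "$GRAPHDB_HOME"/data/repositories/bigRepo/storage
# the user that starts GraphDB must own the moved data (user/group name is an assumption)
chown -R graphdb:graphdb /mnt/big-drive/bigRepo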

Related

Pentaho - Migrating from Database repository to File Repository

I am in the process of migrating Pentaho from a database repository to a file repository.
I have exported the database repository into an XML file, then created a file repository and imported the repository...
The first issue I saw after importing is that all my database connections are stored in the .ktr and .kjb files. This is going to be a big issue if I update a connection string, for example a password: I have more than a hundred sub-transformations and jobs. Do I have to update this in all of those files?
Is there any way to ignore the password and other connection settings stored in the .ktr and .kjb files and instead use the repository connection or specify them in the kettle.properties file?
The other issue I face is that when I try to run the master job via Kitchen on the command line, it does not recognize the sub-transformations and jobs. However, when I change the transformation root to ${Internal.Entry.Current.Directory}, the sub-transformations are recognized and processed. As I mentioned, I have more than 100 sub-transformations and jobs - is there any way to update this root for all jobs and transformations at once?
Kitchen.bat /file:"C:\pentaho-8-1\Dev_Repo\home\jobs\MainProcess\MasterJob.kjb" /level:Basic /logfile:"C:\pentaho-8-1\logs\my-job.txt"
This fails with an error (.ktr is not a file or the repository is not defined).
However, when I change the root directory to ${Internal.Entry.Current.Directory}, it works!
For the database connections, you can make .kdb files in the repository and enter variables for all the properties (Host, Port, Schema, User, etc.) and define them in kettle.properties or another properties file.
This works like a more convenient version of JNDI files, with one properties file per environment. You can easily inspect the current values by opening kettle.properties from within the Spoon client (don't edit it there or it will mess up the layout!), and you can also put Kettle "encrypted" passwords in the properties file.
PDI will still save copies of the connections into all the .kjb and .ktr files (and should in theory update them from the .kdb or shared.xml when opening them), but since the contents are just generic variable names (${STAGING_DB_HOST} etc.) you will almost never run into problems with this.
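A hedged example of what that can look like - the variable names are just the ones used in this answer, and the host, port and credentials are placeholders - is a kettle.properties (or per-environment properties file) such as:

# staging environment
STAGING_DB_HOST=staging-db.example.com
STAGING_DB_PORT=5432
STAGING_DB_SCHEMA=staging
STAGING_DB_USER=etl_user
# placeholder value produced with the Encr tool, e.g. encr.bat -kettle <password>
STAGING_DB_PASSWORD=Encrypted 2be98afc86aa7f2e4cb79ce10df90acde

together with a .kdb connection whose Host Name, Port, Database Name, User Name and Password fields are set to ${STAGING_DB_HOST}, ${STAGING_DB_PORT}, and so on.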
For the transformation filenames, a good text search-and-replace tool should fix most of your transformations in one go. Include some of the surrounding XML tag to prevent replacing too much.
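If you would rather script that than use an editor, a rough sketch is below. It assumes the jobs currently reference ${Internal.Job.Filename.Directory} inside <filename> tags - check a couple of files first and keep backups, since your files may use a different internal variable:

# run from the root of the file repository
grep -rl 'Internal.Job.Filename.Directory' --include='*.kjb' --include='*.ktr' . \
  | xargs sed -i 's|<filename>${Internal.Job.Filename.Directory}|<filename>${Internal.Entry.Current.Directory}|g'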

How to migrate MlFlow experiments from one Databricks workspace to another with registered models?

Unfortunately, we have to redeploy our Databricks workspace, in which we use the MLflow functionality with experiments and the registering of models.
However, if you export the user folder where the experiment is saved as a DBC and import it into the new workspace, the experiments are not migrated and are just missing.
So the easiest solution did not work. The next thing I tried was to create a new experiment in the new workspace, copy all the experiment data from the DBFS of the old workspace to the new one (with dbfs cp -r dbfs:/databricks/mlflow source, and then the same again to upload it to the new workspace), and then point the experiment at the location of that data (screenshot not reproduced here).
This is also not working: no run is visible, although the path already exists.
The next idea was that the registered models are the most important part, so at least those should be there and accessible. For that I used the documentation here: https://www.mlflow.org/docs/latest/model-registry.html.
With the following code you get a list of the registered models on the old workspace, with references to the run_id and location.
from pprint import pprint
from mlflow.tracking import MlflowClient

client = MlflowClient()
for rm in client.list_registered_models():
    pprint(dict(rm), indent=4)
And with this code you can add models to a model registry with a reference to the location of the artifact data (on the new workspace):
# first the general model must be defined
client.create_registered_model(name='MyModel')

# and then the run of the model you want to register is added to the model as version one
client.create_model_version(
    name='MyModel',
    run_id='9fde022012046af935fe52435840cf1',
    source='dbfs:/databricks/mlflow/experiment_id/run_id/artifacts/model',
)
But that also did not work out. If you go into the Model Registry, you get an error message (screenshot not reproduced here).
And I really checked: at the given path (the source), the data really is uploaded and a model exists.
Do you have any ideas on how to migrate those models in Databricks?
There is no official way to migrate experiments from one workspace to another. However, leveraging the MLflow API, there is an "unofficial" tool that can migrate experiments minus the notebook revision associated with a run.
mlflow-tools
As an addition to Andre's answer,
you can also check mlflow-export-import from the same developer:
mlflow-export-import
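If you would rather script it yourself against the plain MLflow client API (which is what those tools build on), a rough sketch follows. The Databricks CLI profile names are placeholders, and it assumes the artifact paths you copied over are readable from the new workspace:

from pprint import pprint
from mlflow.tracking import MlflowClient

old_client = MlflowClient(tracking_uri="databricks://old-workspace-profile")
new_client = MlflowClient(tracking_uri="databricks://new-workspace-profile")

for rm in old_client.list_registered_models():
    pprint(dict(rm), indent=4)
    new_client.create_registered_model(name=rm.name)
    for mv in old_client.search_model_versions(f"name='{rm.name}'"):
        new_client.create_model_version(
            name=rm.name,
            # replace with the path the artifacts were copied to in the new workspace
            source=mv.source,
            run_id=None,  # the original run does not exist in the new workspace
        )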

When AEM is configured to use an S3 data store, will it make blue-green deployments faster?

Background
We know it's possible to set up a DevOps pipeline that deploys updates to AEM via a blue/green approach by using crx2oak to migrate the content from the old to the new environment. Why is out of scope of this question.
The problem with this approach is that the content copy operation can take significant time as the amount of content in the JCR grows. Other ideas to mitigate this are appreciated.
We also know that AEM can have an S3 datastore that offloads the binary content into an S3 bucket, which would not be rebuilt during a blue/green deployment, as per:
https://helpx.adobe.com/experience-manager/6-3/sites/deploying/using/storage-elements-in-aem-6.html#OverviewofStorageinAEM6
What is unclear from Adobe's documentation is whether the same S3 bucket can be shared across AEM instances (i.e. blue/green instances). Maybe it's just my google fu that has failed...
Question(s)
When a new AEM instance is configured to use an S3 datastore that already has content in it from the old instance, and crx2oak is used to migrate content, will the new instance be able to access the existing content?
Are there any articles/blogs that describe what the potential time savings of this approach would be?
Yes, I could do an experiment, and may do so in the future to answer my own question, but I'm looking for information from anyone who has already done this. I'm an engineer, so I won't reinvent the wheel if someone else has already done so.
You can certainly share the same S3 bucket between instances - in fact, this is commonly used along with binary-less replication from author->publisher(s) and is a tried and true configuration.
It's even possible to share the same bucket between completely different environments (e.g. DEV/STAGE, or BLUE/GREEN in your case). The main "gotcha" to be aware of is with regard to DataStore Garbage Collection (DSGC) because it's very possible that there will be blobs which are referenced by only some of the instances sharing the bucket and so when purging unused blobs this needs to be taken into account.
This is all part of the design though, and there is a flag designed specifically for this purpose which tells DSGC to only execute the first phase (the "mark" phase) of GC, and skip the 2nd "sweep" phase, until all instances have marked which blobs they wish to keep/discard. Once all instances have done so the sweep phase can be run to purge blobs not needed by any instances using the bucket.
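As a hedged sketch of how that coordination is usually driven on AEM (the MBean name and console path below are assumptions - verify them against your AEM/Oak version's documentation):

# 1. On every instance sharing the bucket, run DSGC in mark-only mode,
#    e.g. via the JMX console at /system/console/jmx by invoking
#    startDataStoreGC(markOnly=true) on the RepositoryManagement MBean.
# 2. Once every instance has recorded its references, run the full
#    mark-and-sweep (markOnly=false) on one instance to actually purge
#    blobs that no instance still references.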
For a more detailed explanation see the Oak docs:
https://jackrabbit.apache.org/oak/docs/plugins/blobstore.html#Shared_DataStore_Blob_Garbage_Collection_Since_1.2.0
I find it helps to understand that pretty much all of the datastore implementations are done such that blobs are stored according to their checksum, so the same file uploaded twice will only have one copy stored in the datastore, and there will be two segment store records referencing that same blob. In the same way, multiple AEM instances sharing the same bucket will be able to find a given blob regardless of which instance put it there in the first place.
You can see this in action easily with FileDataStore by finding a blob and sha256'ing it - e.g. (this example is on OS X; the checksum command on Linux/Windows will be slightly different):
$ shasum -a256 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002 crx-quickstart/repository/datastore/0c/9e/40/0c9e405fc8d0f0405930cd0044611cfbf014938a1837ae0cfaa266d7732d1002
There you can see that a) the filename is the checksum, and b) it's nested using the first 3 pairs of characters from that checksum, so you can locate the file by just knowing the hash and if you store the same binary, even if the name or JCR metadata is different, the blob referenced will be the same literal file on disk.
From memory, the S3 datastore uses prefixes rather than directory nesting because this performs better, but the principle is the same.
Finally, a couple of things to consider are:
1) S3 storage is relatively cheap (and practically unlimited) so there is an argument to be made that it's not as necessary to perform regular DSGC unless you're really trying to pinch pennies.
2) If you do run DSGC you need to think about how this will work with whatever backup strategy you're using for the AEM instances. For instance, if you roll back a segment store shortly after running DSGC you'll likely have to recover some of those purged blobs. You can use versioning and/or lifecycle rules to help with this, but it can add significant additional complexity and time to your restore process.
If you opt to simply skip DSGC and leave the blobs there indefinitely it's a good idea to make sure the access key or IAM roles AEM is using doesn't have the DeleteObject permission for the bucket, just to be sure a rogue GC process can't delete anything.
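As a sketch of that last point (the bucket name is a placeholder, and how you attach the policy depends on how you manage IAM), an explicit deny on s3:DeleteObject for the user/role the AEM S3 datastore connector uses looks like:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "NoBlobDeletesFromAEM",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-aem-datastore-bucket/*"
    }
  ]
}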
Hope this helps.
Edit
In all that I forgot to actually answer your question - yes it will save some time in cloning in most cases. You'll still need to sync the segment store (obviously) and there are various approaches for this. crx2oak is certainly one - you'll see in the documentation there are specific options for using it w/ S3 where you supply a configuration file (basically a serialised .config file like you'd use with Felix/OSGi).
You can also use something like rsync to simply copy the TAR files over (with at least the target AEM stopped; Oak is generally atomic, so a hot copy from the source can work in theory, but YMMV).
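For example, something along these lines - the paths are assumptions for a segment-tar (TarMK) repository, and the target AEM must be stopped:

rsync -a --delete \
  source-host:/opt/aem/crx-quickstart/repository/segmentstore/ \
  /opt/aem/crx-quickstart/repository/segmentstore/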
Finally, you could obviously use Mongo and cluster the segment store that way, but all the usual cost/complexity/performance issues with doing so apply.
Another interesting development on the horizon for blue/green-type deployments is the CompositeNodeStore - there is a good talk from the 2017 adaptTo() conference that covers this:
https://adapt.to/2017/en/schedule/zero-downtime-deployments-for-the-sling-based-apps-using-docker.html
An external datastore will help a lot, as most of the space is usually used by binary assets; the pure content typed in by real people is much smaller.
On my current project (quite small, but the proportions should be typical):
Repository 4,8 GB total (4.1 GB Segment Store, 780 MB Index)
File DataStore 222 GB total
If you want to do it, I have the following remarks:
There are different datastores available. For testing I would start with the File DataStore.
In my view, the S3 DataStore only makes sense if you are hosting on Amazon AWS anyway. Adobe Managed Services does this, so S3 makes sense for them - but even there, only if you have more than 500 GB of assets.
If you use the green/blue approach, be careful with DataStore garbage collection (just run it manually). The shared DataStore is meant for several publishers that have the same content. As an example, you could have the following situation: your editors delete some assets, you run the DataStore GC, and finally you roll back your environment. That means the assets are still in the content repository, but the binaries have been cleaned out of the DataStore.
In order to use a shared file datastore, you need to do the following (a condensed command sketch follows the steps):
Unpack the Quickstart: java -jar AEM_6.3_Quickstart.jar -unpack
Create a directory for the file datastore (anywhere outside of the crx-quickstart folder)
Create a directory called install inside the extracted crx-quickstart folder
Create a file called org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.cfg inside this install folder
This file contains just one line: path=<path to file datastore> (see https://jackrabbit.apache.org/oak/docs/osgi_config.html)
Place a reference.key file inside the datastore directory. The first time, it will be created automatically. But if you always use the same key, the same hash values are used across all datastores in all your environments. This is also a prerequisite for a feature called "binary-less replication" (so a binary is only replicated the first time between author and publisher).
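Condensed, the above looks roughly like this; the /mnt/shared/datastore path and the location you keep the shared reference.key in are just examples:

java -jar AEM_6.3_Quickstart.jar -unpack
mkdir -p /mnt/shared/datastore crx-quickstart/install
echo "path=/mnt/shared/datastore" > crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.cfg
# reuse the same reference.key in every environment (it is generated on first startup)
cp /backup/datastore/reference.key /mnt/shared/datastore/reference.key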
kind regards,
Alex

Is there a way to clone data from CrateDB into Crate running on a new container?

I currently have one container which runs Crate and stores all its data in the /data/ directory. I am trying to create a clone of this container for debugging purposes -- ideally, the clone would be running Crate (which I can query) using the exact same data. I've tried mounting the same data directory into the /data/ directory of the cloned container and starting Crate, but when I run any queries, I notice that Crate shows 0 tables (that is, it doesn't recognize the data in the folder as database tables). How do I get around this? I know I can export and import data using COPY TO and COPY FROM, but I have so many tables that it would be quite cumbersome to write.
I'm wondering a little why you want to use the same data directory for debugging purposes, since you would then be modifying data that you probably don't want to change. Also, the two instances would overwrite each other's data when using the same data directory at the same time. That's the reason why this is not working.
What you can still do is simply copy the folder in your file system and mount the second (debugging) node on the cloned folder, for example:
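Here the container names, host paths, and the host port are placeholders; the crate image and its /data directory are as in your setup:

# stop the original container so the data directory is not being written to
docker stop crate-original
# copy the data directory on the host
cp -a /srv/crate/data /srv/crate/data-debug
# start a second, debug container on the copy, exposed on a different port
docker run -d --name crate-debug -p 4201:4200 \
  -v /srv/crate/data-debug:/data crate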
Another solution would be to create a cluster containing both nodes as documented here: https://crate.io/docs/crate/guide/best_practices/docker.html.
Hope that helps.

Adding auxiliary DB data during deployment

My app consists of two containers: the app itself and a database. I'm planning to wrap the app into a chart, thus paving the way for easy, reproducible deployment.
Apart from setting/reading environment variables (which Helm + Kubernetes seems to handle really well), part of the app's configuration is:
making sure the database is pre-filled with special auxiliary data (e.g. the admin user exists, the user role names required to create new users are there, etc.).
I like the idea of having readable YAML files hold the entire configuration in a human-readable format. However, at a glance it doesn't seem that Helm would help in any way with this kind of configuration (DB records).
That being said, what is the best place to put code/configuration ensuring that the DB contains certain auxiliary records? A config YAML file? A container init script written in Bash?
You are right: Kubernetes and Helm cannot help with preparing your pre-filled database records/schema.
You should probably have your application initialize that pre-filled data. If you don't want to put this logic into your application, you can ship an initialization script and configure an init container with Kubernetes.
Kubernetes makes sure that every time your application container is restarted, the init container runs first. In the init container, you can execute a Bash/Python/... script that makes sure the records you want are there, for example:
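A minimal sketch of what that can look like in a Deployment - the images, the "db" service hostname, the credentials secret, and the seed.sql ConfigMap are all placeholders, assuming a Postgres database here:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      initContainers:
        - name: seed-db
          image: postgres:15
          # the seed script should be idempotent (e.g. INSERT ... ON CONFLICT DO NOTHING),
          # because the init container runs on every pod restart
          command: ["psql", "-h", "db", "-U", "admin", "-d", "appdb", "-f", "/seed/seed.sql"]
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
          volumeMounts:
            - name: seed
              mountPath: /seed
      containers:
        - name: app
          image: myapp:latest
      volumes:
        - name: seed
          configMap:
            name: myapp-seed-sql

The seed.sql itself can stay a readable, version-controlled file shipped with the chart (e.g. templated into a ConfigMap), which keeps the auxiliary records in the same human-readable form as the rest of the configuration.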