How to migrate MLflow experiments from one Databricks workspace to another with registered models? - migration

Unfortunately we have to redeploy our Databricks workspace, in which we use the MLflow functionality with experiments and model registration.
However, if you export the user folder where the experiment is saved as a DBC archive and import it into the new workspace, the experiments are not migrated and are simply missing.
So the easiest solution did not work. The next thing I tried was to create a new experiment in the new workspace, copy all the experiment data from the DBFS of the old workspace to the new one (with dbfs cp -r dbfs:/databricks/mlflow source, and then the same again to upload it to the new workspace), and then point the experiment at the location of the copied data, as in the following picture:
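(For reference, the step from the picture can also be expressed in code: MLflow lets you create an experiment whose artifact location points at an existing DBFS path. This is only a sketch; the experiment name and path below are placeholders, not the exact values from the screenshot.)
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Hypothetical name and path: point the new experiment at the copied artifact data.
experiment_id = client.create_experiment(
    name='/Users/someone@example.com/migrated-experiment',
    artifact_location='dbfs:/databricks/mlflow/experiment_id'
)
print(experiment_id)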
This also does not work: no run is visible, although the path already exists.
The next idea was that the registered models are the most important part, so at least those should be available and accessible. For that I used the documentation here: https://www.mlflow.org/docs/latest/model-registry.html.
With the following code you get a list of the registered models on the old workspace, including the run_id and artifact location they reference.
from pprint import pprint
from mlflow.tracking import MlflowClient

client = MlflowClient()
for rm in client.list_registered_models():
    pprint(dict(rm), indent=4)
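To actually re-create the models in the new workspace you also need each version's run_id and artifact source, so it can help to dump the versions as well. A small sketch extending the loop above (search_model_versions is part of the same MlflowClient API, filtered by model name here):
from pprint import pprint
from mlflow.tracking import MlflowClient

client = MlflowClient()
for rm in client.list_registered_models():
    # for every registered model, list its versions with their run_id and source location
    for mv in client.search_model_versions("name='{}'".format(rm.name)):
        pprint(dict(mv), indent=4)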
And with this code you can add models to the model registry of the new workspace, with a reference to the location of the artifact data:
# first the registered model itself must be created
client.create_registered_model(name='MyModel')
# then the run of the model you want to register is added to it as version one
client.create_model_version(
    name='MyModel',
    run_id='9fde022012046af935fe52435840cf1',
    source='dbfs:/databricks/mlflow/experiment_id/run_id/artifacts/model'
)
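A quick way to sanity-check whether the new registry entry actually resolves to the copied artifacts is to load it back through the models:/ URI (a sketch; 'MyModel' and version 1 are just the placeholders from the snippet above):
import mlflow.pyfunc

# If the registry entry points at a valid MLflow model under the copied DBFS path,
# this load should succeed; otherwise it fails with an artifact error.
model = mlflow.pyfunc.load_model('models:/MyModel/1')
print(type(model))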
But that did not work out either: if you go into the Model Registry, you get an error message.
And I really did check: at the given path (the source), the data is actually uploaded and a model exists.
Do you have any new ideas to migrate those models in Databricks?

There is no official way to migrate experiments from one workspace to another. However, leveraging the MLflow API, there is an "unofficial" tool that can migrate experiments minus the notebook revision associated with a run.
mlflow-tools
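For what it's worth, the core idea behind such a tool can be sketched with the plain MLflow client API: read the runs from the old workspace's tracking server and re-log their parameters, metrics and tags into an experiment in the new one (artifacts still have to be copied separately, e.g. via DBFS). The profile names and experiment paths below are placeholders, not something the tool prescribes:
from mlflow.tracking import MlflowClient

# Two clients, one per workspace; "old-ws" and "new-ws" are hypothetical
# Databricks CLI profiles configured for the two workspaces.
old = MlflowClient(tracking_uri='databricks://old-ws')
new = MlflowClient(tracking_uri='databricks://new-ws')

src = old.get_experiment_by_name('/Users/someone@example.com/my-experiment')
dst_id = new.create_experiment('/Users/someone@example.com/my-experiment-migrated')

for run in old.search_runs([src.experiment_id]):
    # re-create each run in the destination experiment and copy its metadata
    copy = new.create_run(dst_id, tags=run.data.tags)
    for k, v in run.data.params.items():
        new.log_param(copy.info.run_id, k, v)
    for k, v in run.data.metrics.items():
        new.log_metric(copy.info.run_id, k, v)
    new.set_terminated(copy.info.run_id)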

As an addition to @Andre's answer,
you can also check mlflow-export-import from the same developer
mlflow-export-import

Related

Specify the GCP project in tf.io.GFile()

Is there a way to specify the GCP project when downloading objects using tf.io.gfile.GFile? I know it can be used like this:
import tensorflow as tf
with tf.io.gfile.GFile("gs://<bucket>/<path>") as f:
    f.read()
but this does not have any parameter for the project. I know you can select the active project using the CLI tools, but I want to download data from different projects. Is that possible, or do I need to use some other GCS client? If so, which one is most compatible with TF and can most easily be used inside tf.function?
Bucket names are globally unique across projects, so although the https://console.cloud.google.com/storage/browser?project=<project>&prefix=&forceOnObjectsSortingFiltering=false page only shows the buckets created as part of one project, you can query a bucket regardless of which project it belongs to. As long as you have access to it, you can read the data without specifying a project.
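In code that simply means the same call works for buckets owned by different projects, as long as the credentials TensorFlow picks up (e.g. GOOGLE_APPLICATION_CREDENTIALS or the active gcloud login) can read them. The bucket names below are made-up placeholders:
import tensorflow as tf

# Bucket names are globally unique, so no project argument is needed anywhere.
for path in ['gs://bucket-in-project-a/data.txt', 'gs://bucket-in-project-b/data.txt']:
    with tf.io.gfile.GFile(path, 'r') as f:
        print(path, len(f.read()))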

dbt : Database Error Insufficient Permission

In my current dbt project, I run everything on the same Google Cloud project (let's say project dataA). Since the number of datasets has grown a lot, I decided to split the setup into two projects: the current project for importing raw data, and a new project (for example dataB) as the production environment where I keep all the data marts.
I use a service account to manage reading and editing the data sources for both projects, and I am sure there are no issues with permissions. The profile settings are quite similar to my current settings, which work fine.
But I am experiencing Database Error issues from dbt saying that I have insufficient permissions.
Does anyone have an idea what the reason for the issue is, and how to fix it?
Many thanks!

how to add a new repository outside graphdb.home

I am cloning a large public triplestore for local development of a client app.
The data is too large to fit on the ssd partition where <graphdb.home>/data lives. How can I create a new repository at a different location to host this data?
On startup, GraphDB reads the value of the graphdb.home.data parameter. By default it points to ${graphdb.home}/data. You have two options:
Move all repositories to the big non-SSD partition
Start GraphDB with ./graphdb -Dgraphdb.home.data=/mnt/big-drive/ or edit the value of the graphdb.home.data parameter in ${graphdb.home}/conf/graphdb.properties.
Move a single repository to a different location
GraphDB does not allow creating a repository if its directory already exists. The easiest way to work around this is to create a new empty repository bigRepo, initialize it by making at least one request to it, and then shut down GraphDB. Then move the directory ${graphdb.home}/data/repositories/bigRepo/storage/ to your new big drive and create a symbolic link on the file system in its place: ln -s /mnt/big-drive/ ${graphdb.home}/data/repositories/bigRepo/storage
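If you prefer to script that step, a rough Python equivalent of the move-and-symlink looks like this (the graphdb.home and big-drive paths are placeholders; it assumes GraphDB is stopped and the drive is already mounted):
import shutil
from pathlib import Path

# Placeholder locations: the repository storage under graphdb.home and the big drive.
storage = Path('/opt/graphdb/data/repositories/bigRepo/storage')
target = Path('/mnt/big-drive/bigRepo-storage')

# Move the storage directory to the big drive and leave a symlink behind,
# so GraphDB still finds the repository at its original location.
shutil.move(str(storage), str(target))
storage.symlink_to(target, target_is_directory=True)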
You can also apply the same technique to move only individual files.
Please make sure that all permissions are correctly set by using the same user to start GraphDB.

Can openrdf-sesame recover the unseen data in openrdf-workbench?

Somehow I found that some of the repositories (physics_base, for example) are missing in the workbench, but when I looked into the data path /.aduna/openrdf-sesame/repositories/physics_base, I found that the data still exists.
So does this mean that there is a way to recover my data in workbench?
Thank you for any suggestion.
Yes that's possible. Just create a new repository of the same type in the workbench, open it once, shut down sesame, then copy over the contents of the directory to the data directory of your new repository.
It might even be possible to just create a repository of the same name and have it automatically pick up the existing data, but I can't test that right now (in a pub), so if you want to try that, make a backup first.

Is there a way to clone data from CrateDB into Crate running on a new container?

I currently have one container which runs Crate, and stores all its data in the /data/ directory. I am trying to create a clone of this container for debugging purposes -- ideally, the clone would be running Crate (which I can query) using the exact same data. I've tried mounting the same data directory into the /data/ directory of the cloned container and starting Crate, but when I run any queries, I notice that Crate shows 0 tables (that is, it doesn't recognize the data in the folder as database tables). How do I get around this? I know I can export and import data using COPY TO and COPY FROM, but I have so many tables that that would be quite cumbersome to write.
I'm wondering a little why you want to use the same data directory for debugging purposes, since you would then modify data that you probably don't want to change. Also, the two instances would overwrite each other's data when using the same data directory at the same time. That's the reason why this is not working.
What you can still do is simply copy the folder in your file system and mount the second (debugging) node onto the cloned folder.
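As a sketch of that approach with the Docker SDK for Python (the image tag, host path, and port mapping below are placeholders, not taken from your setup):
import docker

client = docker.from_env()

# Start a second CrateDB container that mounts the *copied* data directory
# (not the directory the original container is still using) on a different port.
debug_node = client.containers.run(
    'crate:latest',
    name='crate-debug',
    detach=True,
    volumes={'/path/to/cloned-data': {'bind': '/data', 'mode': 'rw'}},
    ports={'4200/tcp': 4201},   # admin UI of the clone on localhost:4201
)
print(debug_node.name)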
Another solution would be to create a cluster containing both nodes as documented here: https://crate.io/docs/crate/guide/best_practices/docker.html.
Hope that helps.