Specify the GCP project in tf.io.GFile() - tensorflow

Is there a way to specify the GCP project when downloading objects with tf.io.gfile.GFile? I know it can be used like this:
import tensorflow as tf

with tf.io.gfile.GFile("gs://<bucket>/<path>") as f:
    f.read()
but this does not have any parameter for the project. I know you can select the active project using the CLI tools, but I want to download data from different projects. Is it possible, or do I need to use some other GCS client? If so, which one is most compatible with TF and can be used most easily inside a tf.function?

Bucket names are globally unique across projects, so although the https://console.cloud.google.com/storage/browser?project=<project>&prefix=&forceOnObjectsSortingFiltering=false page only shows the buckets created as part of that project, you can read a bucket regardless of which project owns it. As long as your credentials have access to the bucket, you can access the data without specifying a project.
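As a minimal sketch of what that looks like in practice, assuming your active credentials (for example via GOOGLE_APPLICATION_CREDENTIALS) are authorized to read both buckets; the bucket and object names below are hypothetical:

import tensorflow as tf

# The bucket name alone identifies the data; no project argument is needed,
# as long as the active credentials can read the bucket.
with tf.io.gfile.GFile("gs://bucket-in-project-a/data/train.tfrecord", "rb") as f:
    data_a = f.read()

with tf.io.gfile.GFile("gs://bucket-in-project-b/data/eval.tfrecord", "rb") as f:
    data_b = f.read()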

Related

How to migrate MlFlow experiments from one Databricks workspace to another with registered models?

So, unfortunately, we have to redeploy our Databricks workspace, in which we use the MLflow functionality for experiments and for registering models.
However, if you export the user folder where the experiment is saved as a DBC and import it into the new workspace, the experiments are not migrated and are simply missing.
So the easiest solution did not work. The next thing I tried was to create a new experiment in the new workspace, copy all the experiment data from the DBFS of the old workspace (with dbfs cp -r dbfs:/databricks/mlflow source, and then the same again to upload it to the new workspace), and then point the experiment at the location of that data (shown in a screenshot in the original post).
This did not work either: no runs are visible, even though the path exists.
The next idea was that the registered models are the most important part, so at least those should be there and accessible. For that I used the documentation here: https://www.mlflow.org/docs/latest/model-registry.html.
With the following code you get a list of the registered models on the old workspace, with references to the run_id and location.
from pprint import pprint
from mlflow.tracking import MlflowClient

client = MlflowClient()
for rm in client.list_registered_models():
    pprint(dict(rm), indent=4)
And with this code you can add models to a model registry with a reference to the location of the artifact data (on the new workspace):
# first, the registered model itself must be created
client.create_registered_model(name='MyModel')
# then the run of the model you want to register is added to it as version one
client.create_model_version(name='MyModel', run_id='9fde022012046af935fe52435840cf1', source='dbfs:/databricks/mlflow/experiment_id/run_id/artifacts/model')
But that did not work out either: if you go into the Model Registry, you get an error message (attached as a screenshot in the original post).
And I really did check: at the given path (the source), the data is actually uploaded and a model does exist.
Do you have any new ideas to migrate those models in Databricks?
There is no official way to migrate experiments from one workspace to another. However, leveraging the MLflow API, there is an "unofficial" tool that can migrate experiments minus the notebook revision associated with a run.
mlflow-tools
As an addition to Andre's answer,
you can also check out mlflow-export-import from the same developer:
mlflow-export-import
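If you prefer to script the copy yourself with the plain MLflow client API instead of the tools above, a rough sketch could look like the following. It assumes two Databricks CLI profiles, old-workspace and new-workspace, are already configured, and that the model artifacts have already been copied to a path that exists in the new workspace; all names are placeholders:

from mlflow.tracking import MlflowClient

# One client per workspace, addressed via Databricks CLI profiles.
old_client = MlflowClient(tracking_uri="databricks://old-workspace")
new_client = MlflowClient(tracking_uri="databricks://new-workspace")

for rm in old_client.list_registered_models():
    # Recreate the registered model in the new workspace ...
    new_client.create_registered_model(name=rm.name)
    # ... and register its latest versions, pointing at artifacts that
    # must already exist in the new workspace's DBFS.
    for mv in old_client.get_latest_versions(rm.name):
        new_client.create_model_version(
            name=rm.name,
            run_id=mv.run_id,   # note: the run itself is not migrated by this
            source=mv.source,
        )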

Loading Keras Model in [Google App Engine]

Use-case:
I am trying to load a pre-trained Keras model stored as an .h5 file in Google App Engine. I am running App Engine with the Python 3.7 runtime in the Standard Environment.
Issue:
I tried using the load_model() Keras function. Unfortunately, load_model() requires a file path, and I failed to load the model from the Google App Engine file explorer. Furthermore, Google Cloud Storage does not seem to be an option, as a Cloud Storage URL is not recognized as a file path.
Questions:
(1) How can I load a pretrained model (e.g. .h5) into Google App Engine (without saving it locally first)?
(2) Maybe there is a way to load the model.h5 into Google App Engine from Google Cloud Storage that I have not thought of, e.g. by using another function (other than tf.keras.models.load_model()) or another format?
I just want to read the model in order to make predictions. Writing or training the model is not required.
I finally managed to load the Keras model in Google App Engine, overcoming four challenges:
Solution:
First challenge: As of today, Google App Engine only provides TensorFlow version 2.0.0x. Hence, make sure to pin the correct version in your requirements.txt file. I ended up using 2.0.0b1 for my project.
Second challenge: In order to use a pre-trained model, make sure the model has been saved with the same TensorFlow version that is running on Google App Engine.
Third challenge: Google App Engine does not allow you to read from disk. The only places to read or store data are memory and the /tmp folder (as correctly pointed out by user bhito). I ended up connecting to my GCS bucket and downloading the model.h5 file as a blob into the /tmp folder.
Fourth challenge: By default, the instance class of Google App Engine is limited to 256 MB. Due to my model size, I needed to increase the instance class accordingly.
In summary: yes, tf.keras.models.load_model() does work on App Engine reading from Cloud Storage, provided you use the right TF version and the right instance class (with enough memory).
I hope this helps future folks who want to use Google App Engine to deploy their ML models.
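A condensed sketch of this approach, assuming hypothetical bucket and object names and that google-cloud-storage plus a matching TensorFlow version are listed in requirements.txt:

import tensorflow as tf
from google.cloud import storage

MODEL_BUCKET = "my-model-bucket"   # hypothetical bucket name
MODEL_BLOB = "models/model.h5"     # hypothetical object path
LOCAL_PATH = "/tmp/model.h5"       # /tmp is the only writable location on App Engine

def load_model_from_gcs():
    # Download the .h5 file into /tmp, then load it with Keras.
    client = storage.Client()
    bucket = client.bucket(MODEL_BUCKET)
    bucket.blob(MODEL_BLOB).download_to_filename(LOCAL_PATH)
    return tf.keras.models.load_model(LOCAL_PATH)

model = load_model_from_gcs()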
You will have to download the file first before using it; Cloud Storage paths can't be used to access objects directly. There is a sample in the documentation on how to download objects:
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
And then write the file to the /tmp temporary folder, which is the only writable one available in App Engine. But you have to take into consideration that once the instance using the file is deleted, the file will be deleted as well.
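For example, a call with a hypothetical bucket and object name would place the model file in /tmp:

download_blob("my-model-bucket", "models/model.h5", "/tmp/model.h5")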
Being more specific to your question: to load a Keras model, it can be useful to store it as a pickle, as this tutorial shows:
import pickle
from google.cloud import storage

# MODEL_BUCKET and MODEL_FILENAME are defined elsewhere in the tutorial.
MODEL = None

def _load_model():
    global MODEL
    client = storage.Client()
    bucket = client.get_bucket(MODEL_BUCKET)
    blob = bucket.get_blob(MODEL_FILENAME)
    s = blob.download_as_string()
    MODEL = pickle.loads(s)
I was also able to find an answer in another Stack Overflow post that covers what you're actually looking for.

Joining ADLS files created with Append and ConcurrentAppend

We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files, but can ConcurrentAppend write to a file that has been outputted from a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, after the join operation no append operations can be executed on the resulting file, so appends to a concatenated file will not work. We are currently working on a feature to remove this limitation.

Merge two Endeca Servers (Endeca 3.1) into one. Including their current data

Let me explain in more detail:
1st: I'm running Endeca 3.1, so Endeca Server here refers to 3.0's Data Domain.
I'm required to use an Endeca Server currently present on Endeca (downloaded as a demo VM). All the info on it, including groups, attributes, and data, must be merged into our Endeca Server. (It could also be the other way around; I could merge my Endeca Server into this one.)
So far, I've tried to do the following:
1) Clone the Endeca Server
2) Use the putCollection sconfig operation to create a collection on it with the same name I have on mine.
3) Load the configurations using the LoadCollection & LoadAttributes graphs from the OEID POC Template 3.1. I point to the new collection in the Configuration.xls file.
This is where I encounter an issue: the LoadAttributes graph gets a timeout message from the server's web service, and then the config WSDL becomes inaccessible for a while. I can't get beyond this point.
I've been able to load data into the collection, but I need to load the attributes first.
Thanks in advance for your replies.
Regards
There are a few techniques.
Have you tried exporting the data domain and then importing it?
You can use the endeca-cmd tools to export to a file and then import from that file. This would enable you to add two data stores into one server.
If you want to combine two data stores, then that is a different question.
The simplest approach in 3.1, if the data collections are small: extract them as CSV (via a data-table), convert to XLS, and add them via self-provisioning into separate collections within a single data store. If you are running in the VM, this is potentially the easiest approach.
This can also be done using Integrator.
You don't need to load the attributes unless you are using multi-value types. You can call the conversation web service to extract the data and then load it using 'bulk-load'. I would not worry too much about creating the attributes unless this becomes essential due to their type or complexity. If you cannot call the conversation web service, then again extract as CSV and load using Integrator.

Google BigQuery, unable to load data into shared datasets

I created a project on Google BigQuery and enabled billing.
I went on to create a few datasets that were shared with my team members (with Can EDIT permissions).
However, my teammates are unable to load data into the respective datasets shared with them. Whenever they try, it says billing is not enabled for this project.
I am able to load data into the datasets but not my team.
It's been more than 24 hours.
Thanks in advance
Note that in order to load data, they need to run a load job, and that load job needs to be run in a project. Perhaps billing is not enabled on the project they are using?
You can give your team members read access to the project (or greater) to allow them to run jobs in your own billing-enabled project.
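As a sketch with the google-cloud-bigquery Python client (which postdates the original question), the project passed to the client is the one that runs, and is billed for, the load job, while the destination dataset can live in another project; all project, bucket, and table names below are placeholders:

from google.cloud import bigquery

# The load job runs (and is billed) in this project, so it must have billing
# enabled and the user must be allowed to run jobs in it.
client = bigquery.Client(project="billing-enabled-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# The destination table can be in a dataset shared from another project.
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "shared-project.shared_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish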
You can share a BigQuery project at the project level and at the dataset level.
See https://developers.google.com/bigquery/access-control.
I assume you are sharing at the dataset level. Can you try sharing the project instead with your team members? (here: https://cloud.google.com/console/project)
Please report back!