Azure ML: error "The SSL connection could not be established, see inner exception." while creating a Tabular Dataset from an Azure Blob Storage file

I have a new error using Azure ML, possibly due to the Ubuntu upgrade to 22.04 I did yesterday.
I have an Azure ML workspace created through the portal, and I can access it without any issue with the Python SDK:
from azureml.core import Workspace
ws = Workspace.from_config("config/config.json")
ws.get_details()
Output:
{'id': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.MachineLearningServices/workspaces/azml_lk',
'name': 'azml_lk',
'identity': {'principal_id': 'XXXXX',
'tenant_id': 'XXXXX',
'type': 'SystemAssigned'},
'location': 'westeurope',
'type': 'Microsoft.MachineLearningServices/workspaces',
'tags': {},
'sku': 'Basic',
'workspaceid': 'XXXXX',
'sdkTelemetryAppInsightsKey': 'XXXXX',
'description': '',
'friendlyName': 'azml_lk',
'keyVault': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.Keyvault/vaults/azmllkXXXXX',
'applicationInsights': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.insights/components/azmllkXXXXX',
'storageAccount': '/subscriptions/XXXXX/resourceGroups/gr_louis/providers/Microsoft.Storage/storageAccounts/azmllkXXXXX',
'hbiWorkspace': False,
'provisioningState': 'Succeeded',
'discoveryUrl': 'https://westeurope.api.azureml.ms/discovery',
'notebookInfo': {'fqdn': 'ml-azmllk-westeurope-XXXXX.westeurope.notebooks.azure.net',
'resource_id': 'XXXXX'},
'v1LegacyMode': False}
I then use this workspace ws to upload a file (or a directory) to Azure Blob Storage like so:
from azureml.core import Dataset

ds = ws.get_default_datastore()
Dataset.File.upload_directory(
    src_dir="./data",
    target=ds,
    pattern="*dataset1.csv",
    overwrite=True,
    show_progress=True
)
which again works fine and outputs
Validating arguments.
Arguments validated.
Uploading file to /
Filtering files with pattern matching *dataset1.csv
Uploading an estimated of 1 files
Uploading ./data/dataset1.csv
Uploaded ./data/dataset1.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Creating new dataset
{
"source": [
"('workspaceblobstore', '//')"
],
"definition": [
"GetDatastoreFiles"
]
}
My file is indeed uploaded to Blob Storage, and I can see it either in the Azure portal or in Azure ML studio (ml.azure.com).
The error comes up when I try to create a tabular dataset from the uploaded file. The following code doesn't work:
from azureml.core import Dataset

data1 = Dataset.Tabular.from_delimited_files(
    path=[(ds, "dataset1.csv")]
)
and it gives me the error:
ExecutionError:
Error Code: ScriptExecution.DatastoreResolution.Unexpected
Failed Step: XXXXXX
Error Message: ScriptExecutionException was caused by DatastoreResolutionException.
DatastoreResolutionException was caused by UnexpectedException.
Unexpected failure making request to fetching info for Datastore 'workspaceblobstore' in subscription: 'XXXXXX', resource group: 'gr_louis', workspace: 'azml_lk'. Using base service url: https://westeurope.experiments.azureml.net. HResult: 0x80131501.
The SSL connection could not be established, see inner exception.
| session_id=XXXXXX
After some research, I assumed it might be due to the OpenSSL version (which is now 1.1.1), but I am not sure, and I surely don't know how to fix it... any ideas?

According to the documentation, there is no direct procedure to convert a file dataset into a tabular dataset. Instead, when we create a workspace it provisions two storage options (blob storage, which is the default, and file storage), and SSL is taken care of by the workspace.
We can create a datastore in the workspace and connect it to the blob storage.
Follow this procedure to do the same:
Create a workspace.
If we want, we can create a dataset; it can be created from local files or from a datastore.
To choose a datastore, we first need to have a file in that datastore.
Go to Datastores and click on Create dataset. Observe that the default datastore is named workspaceblobstore (Default).
Fill in the details and make sure the dataset type is Tabular.
In the path, we will have the local file path, and under "Select or create a datastore" we can check that the default storage is blob.
After uploading, we can see the new name in this section as a datastore-backed tabular dataset.
Finally, in the storage account of the workspace you created, check whether public network access is Disabled or Enabled. If it is disabled, access is not allowed and the SSL connection cannot be established. After enabling it, use the same procedure as above; a minimal SDK sketch is shown below.
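A minimal sketch of the SDK path, assuming public network access on the workspace's storage account has been re-enabled as described above (the dataset name "dataset1" passed to register() is a hypothetical choice):
from azureml.core import Dataset, Workspace

ws = Workspace.from_config("config/config.json")
ds = ws.get_default_datastore()

# Build the tabular dataset from the CSV uploaded earlier and register it,
# so it also shows up in Azure ML studio.
data1 = Dataset.Tabular.from_delimited_files(path=[(ds, "dataset1.csv")])
data1 = data1.register(workspace=ws, name="dataset1", create_new_version=True)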

Related

Bigquery LoadJobConfig Delete Source Files After Transfer

When creating a BigQuery Data Transfer Service job manually through the UI, I can select an option to delete source files after transfer. When I try to use the CLI or the Python client to create on-demand Data Transfer Service jobs, I do not see an option to delete the source files after transfer. Do you know if there is another way to do so? Right now, my source URI is gs://<bucket_path>/*, so it's not trivial to delete the files myself.
This snippet works for me (replace YOUR-... with your data):
from google.cloud import bigquery_datatransfer
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR-CRED-FILE-PATH"

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

destination_project_id = "YOUR-PROJECT-ID"
destination_dataset_id = "YOUR-DATASET-ID"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=destination_dataset_id,
    display_name="YOUR-TRANSFER-NAME",
    data_source_id="google_cloud_storage",
    params={
        "data_path_template": "gs://PATH-TO-YOUR-DATA/*.csv",
        "destination_table_name_template": "YOUR-TABLE-NAME",
        "file_format": "CSV",
        "skip_leading_rows": "1",
        "delete_source_files": True
    },
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(destination_project_id),
    transfer_config=transfer_config,
)

print(f"Created transfer config: {transfer_config.name}")
In this example, table YOUR-TABLE-NAME must already exist in BigQuery, otherwise the transfer will crash with error "Not found: Table YOUR-TABLE-NAME".
I used these packages:
google-cloud-bigquery-datatransfer>=3.4.1
google-cloud-bigquery>=2.31.0
Pay attention to the delete_source_files attribute in params. From the docs:
Optional param delete_source_files will delete the source files after each successful transfer. (Delete jobs do not retry if the first effort to delete the source files fails.) The default value for the delete_source_files is false.
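Since the question is about on-demand jobs: once the config exists, a run can also be triggered manually instead of waiting for a schedule. A minimal sketch, reusing transfer_client and transfer_config from the snippet above (the requested run time is simply "now"):
import datetime

from google.cloud import bigquery_datatransfer
from google.protobuf.timestamp_pb2 import Timestamp

# Request a single manual run of the transfer config created above.
now = datetime.datetime.now(datetime.timezone.utc)
response = transfer_client.start_manual_transfer_runs(
    bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=transfer_config.name,
        requested_run_time=Timestamp(seconds=int(now.timestamp())),
    )
)
print(f"Started runs: {[run.name for run in response.runs]}")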

Why don't I have the file saved_model.pbtxt or saved_model.pb in my bucket? [Cloud Storage, BigQuery ML, AI Platform]

I am currently following this tutorial: https://cloud.google.com/architecture/predicting-customer-propensity-to-buy
to create a predictive model of customer behavior in BigQuery and BigQuery ML. I then need to export my model to Cloud Storage and send it to AI Platform to make predictions online.
My problem is this: the "gcloud ai-platform versions create" step (STEP 5) does not work.
While searching, I noticed that during the extract, the file that Cloud Shell is asking for is missing from my bucket.
Here is the error from my shell. The file in question is saved_model.pb.
name@cloudshell:~ (name_account_analytics)$ gcloud ai-platform versions create --model=model_test V_1 --region=us-central1 --framework=tensorflow --python-version=3.7 --runtime-version=1.15 --origin=gs://bucket_model_test/V_1/ --staging-bucket=gs://bucket_model_test
Using endpoint [https://us-central1-ml.googleapis.com/]
ERROR: (gcloud.ai-platform.versions.create) FAILED_PRECONDITION: Field: version.deployment_uri Error: Deployment directory gs://bucket_model_test/V_1/ is expected to contain exactly one of: [saved_model.pb, saved_model.pbtxt].
- '#type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:
- description: 'Deployment directory gs://bucket_model_test/V_1/ is expected to
contain exactly one of: [saved_model.pb, saved_model.pbtxt].'
field: version.deployment_uri
name@cloudshell:~ (name_account_analytics)$ gcloud ai-platform versions create --model=model_test V_1 --region=us-central1 --framework=tensorflow --python-version=3.7 --runtime-version=1.15 --origin=gs://bucket_model_test/V_1/model.bst --staging-bucket=gs://bucket_model_test
Using endpoint [https://us-central1-ml.googleapis.com/]
ERROR: (gcloud.ai-platform.versions.create) FAILED_PRECONDITION: Field: version.deployment_uri Error: The provided URI for model files doesn't contain any objects.
- '#type': type.googleapis.com/google.rpc.BadRequest
fieldViolations:
- description: The provided URI for model files doesn't contain any objects.
field: version.deployment_uri
How do I tell it to create it? Why doesn't it do it automatically?
Thanks for your help!
Welodya

Cross-region copy fails for Linode Object Storage

We're using boto3 with Linode Object Storage, which is compatible with AWS S3 according to their documentation.
Everything seems to work well, except the cross-region copy operation.
When I download an object from the source region/bucket and then upload it to the destination region/bucket, everything works well. However, I'd like to avoid that unnecessary download/upload step.
I have a bucket named test-bucket in both regions, and I'd like to copy the object named test-object from the us-east-1 to the us-southeast-1 cluster.
Here is the example code I'm using:
from boto3.session import Session

sess = Session(
    aws_access_key_id='***',
    aws_secret_access_key='***'
)

s3_client_src = sess.client(
    service_name='s3',
    region_name='us-east-1',
    endpoint_url='https://us-east-1.linodeobjects.com'
)

# test-bucket and test-object already exist.
s3_client_trg = sess.client(
    service_name='s3',
    region_name='us-southeast-1',
    endpoint_url='https://us-southeast-1.linodeobjects.com'
)

copy_source = {
    'Bucket': 'test-bucket',
    'Key': 'test-object'
}

s3_client_trg.copy(CopySource=copy_source, Bucket='test-bucket', Key='test-object', SourceClient=s3_client_src)
When I call:
s3_client_src.list_objects(Bucket='test-bucket')['Contents']
It shows me that test-object exists. But when I run the copy, it throws the following message:
An error occurred (NoSuchKey) when calling the CopyObject operation: Unknown
Any help is appreciated!
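For reference, the download-then-upload fallback mentioned above can be done through an in-memory buffer rather than a local file; a minimal sketch, reusing the two clients from the snippet:
import io

# Stream the object through memory instead of relying on cross-cluster CopyObject.
buf = io.BytesIO()
s3_client_src.download_fileobj(Bucket='test-bucket', Key='test-object', Fileobj=buf)
buf.seek(0)
s3_client_trg.upload_fileobj(buf, 'test-bucket', 'test-object')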

Is there a way to automate this Python script in GCP?

I am a complete beginner in using GCP functions/products.
I have written the code below, which takes a list of cities from a local folder, pulls weather data for each city in that list, and eventually uploads those weather values into a table in BigQuery. I don't need to change the code anymore, as it creates new tables when a new week begins. Now I would like to "deploy" it (I am not even sure if this is called deploying code) in the cloud so it runs there automatically. I tried using App Engine and Cloud Functions but faced issues in both places.
import requests, json, sqlite3, os, csv, datetime, re
from google.cloud import bigquery
#from google.cloud import storage

list_city = []
with open("list_of_cities.txt", "r") as pointer:
    for line in pointer:
        list_city.append(line.strip())

API_key = "PLACEHOLDER"
Base_URL = "http://api.weatherapi.com/v1/history.json?key="

yday = datetime.date.today() - datetime.timedelta(days = 1)
Date = yday.strftime("%Y-%m-%d")

table_id = f"sonic-cat-315013.weather_data.Historical_Weather_{yday.isocalendar()[0]}_{yday.isocalendar()[1]}"

credentials_path = r"PATH_TO_JSON_FILE"
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
client = bigquery.Client()

try:
    schema = [
        bigquery.SchemaField("city", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Date", "Date", mode="REQUIRED"),
        bigquery.SchemaField("Hour", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Temperature", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Humidity", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Condition", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("Chance_of_rain", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Precipitation_mm", "FLOAT", mode="REQUIRED"),
        bigquery.SchemaField("Cloud_coverage", "INTEGER", mode="REQUIRED"),
        bigquery.SchemaField("Visibility_km", "FLOAT", mode="REQUIRED")
    ]
    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="Date",  # name of column to use for partitioning
    )
    table = client.create_table(table)  # Make an API request.
    print(
        "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
    )
except:
    print("Table {}_{} already exists".format(yday.isocalendar()[0], yday.isocalendar()[1]))

def get_weather():
    try:
        x["location"]
    except:
        print(f"API could not call city {city_name}")
    global day, time, dailytemp, dailyhum, dailycond, chance_rain, Precipitation, Cloud_coverage, Visibility_km
    day = []
    time = []
    dailytemp = []
    dailyhum = []
    dailycond = []
    chance_rain = []
    Precipitation = []
    Cloud_coverage = []
    Visibility_km = []
    for i in range(24):
        dayval = re.search("^\S*\s", x["forecast"]["forecastday"][0]["hour"][i]["time"])
        timeval = re.search("\s(.*)", x["forecast"]["forecastday"][0]["hour"][i]["time"])
        day.append(dayval.group()[:-1])
        time.append(timeval.group()[1:])
        dailytemp.append(x["forecast"]["forecastday"][0]["hour"][i]["temp_c"])
        dailyhum.append(x["forecast"]["forecastday"][0]["hour"][i]["humidity"])
        dailycond.append(x["forecast"]["forecastday"][0]["hour"][i]["condition"]["text"])
        chance_rain.append(x["forecast"]["forecastday"][0]["hour"][i]["chance_of_rain"])
        Precipitation.append(x["forecast"]["forecastday"][0]["hour"][i]["precip_mm"])
        Cloud_coverage.append(x["forecast"]["forecastday"][0]["hour"][i]["cloud"])
        Visibility_km.append(x["forecast"]["forecastday"][0]["hour"][i]["vis_km"])
    for i in range(len(time)):
        time[i] = int(time[i][:2])

def main():
    i = 0
    while i < len(list_city):
        try:
            global city_name
            city_name = list_city[i]
            complete_URL = Base_URL + API_key + "&q=" + city_name + "&dt=" + Date
            response = requests.get(complete_URL, timeout = 10)
            global x
            x = response.json()
            get_weather()
            table = client.get_table(table_id)
            varlist = []
            for j in range(24):
                variables = city_name, day[j], time[j], dailytemp[j], dailyhum[j], dailycond[j], chance_rain[j], Precipitation[j], Cloud_coverage[j], Visibility_km[j]
                varlist.append(variables)
            client.insert_rows(table, varlist)
            print(f"City {city_name}, ({i+1} out of {len(list_city)}) successfully inserted")
            i += 1
        except Exception as e:
            print(e)
            continue
The code directly references two files that are located locally: one is the list of cities and the other is the JSON file containing the credentials to access my project in GCP. I believed that uploading these files to Cloud Storage and referencing them there wouldn't be an issue, but then I realised that I can't actually access my buckets in Cloud Storage without using the credential file.
This leaves me unsure whether the entire process is possible at all: how do I authenticate from the cloud in the first place, if I need to reference the credential file locally first? It seems like an endless circle, where I'd authenticate from the file in Cloud Storage, but I'd need authentication first to access that file.
I'd really appreciate some help here; I have no idea where to go from this, and I also don't have great knowledge in SE/CS, I only know Python, R and SQL.
For Cloud Functions, the deployed function will run with the project service account credentials by default, without needing a separate credentials file. Just make sure this service account is granted access to whatever resources it will be trying to access.
You can read more info about this approach here (along with options for using a different service account if you desire): https://cloud.google.com/functions/docs/securing/function-identity
This approach is very easy, and keeps you from having to deal with a credentials file at all on the server. Note that you should remove the os.environ line, as it's unneeded. The BigQuery client will use the default credentials as noted above.
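For example, a minimal sketch of this default-credentials approach, assuming the city list has been uploaded to a hypothetical bucket named my-weather-config:
from google.cloud import bigquery, storage

bq_client = bigquery.Client()        # picks up the function's service account
storage_client = storage.Client()    # same default credentials, no key file needed

# Read the city list from Cloud Storage instead of a local file.
blob = storage_client.bucket("my-weather-config").blob("list_of_cities.txt")
list_city = blob.download_as_text().splitlines()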
If you want the code to run the same whether on your local machine or deployed to the cloud, simply set a "GOOGLE_APPLICATION_CREDENTIALS" environment variable permanently in the OS on your machine. This is similar to what you're doing in the code you posted; however, you're temporarily setting it every time using os.environ rather than permanently setting the environment variable on your machine. The os.environ call only sets that environment variable for that one process execution.
If for some reason you don't want to use the default service account approach outlined above, you can instead reference the credentials file directly when you instantiate the bigquery.Client():
https://cloud.google.com/bigquery/docs/authentication/service-account-file
You just need to package the credential file with your code (i.e. in the same folder as your main.py file) and deploy it alongside so it's in the execution environment. In that case, it is loadable from your script without needing any special permissions or credentials. Just provide the relative path to the file (i.e. assuming you have it in the same directory as your Python script, reference only the filename).
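A minimal sketch of that option, using a hypothetical key file named service-account.json deployed next to main.py:
from google.cloud import bigquery

# Load credentials directly from the bundled key file (hypothetical filename).
client = bigquery.Client.from_service_account_json("service-account.json")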
There may be different flavors and options to deploy your application and these will depend on your application semantics and execution constraints.
It would be too hard to cover all of them here, and the official Google Cloud Platform documentation covers all of them in great detail:
Google Compute Engine
Google Kubernetes Engine
Google App Engine
Google Cloud Functions
Google Cloud Run
Based on my understanding of your application design, the most suitable ones would be:
Google App Engine
Google Cloud Functions
Google Cloud Run: check these criteria to see if your application is a good fit for this deployment style.
I would suggest using Cloud Functions as your deployment option, in which case your application will default to using the project's App Engine service account to authenticate itself and perform allowed actions. Hence, you only need to check that the default account PROJECT_ID@appspot.gserviceaccount.com under the IAM configuration section has proper access to the needed APIs (BigQuery in your case).
In such a setup, you won't need to push your service account key to Cloud Storage (which I would recommend avoiding in any case), and you won't need to pull it either, as the runtime will handle authenticating the function for you.
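If you go the Cloud Functions route, one way to adapt the script is to add a small HTTP entry point around it; a minimal sketch, assuming the question's code (including main()) lives in the same main.py, with the hypothetical entry-point name run_weather_job:
# main.py (Cloud Functions HTTP entry point)
# Assumes the question's script, including its main() function, is defined above
# in this same file and authenticates via the default service account.
def run_weather_job(request):
    main()  # run the existing weather-loading logic
    return "ok"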

Export a database (with nested documents) from Cloud Firestore to BigQuery using the command line

I am trying to export Cloud Firestore data into BigQuery to do SQL operations.
I exported Cloud Firestore to Cloud Storage using https://cloud.google.com/firestore/docs/manage-data/export-import:
gcloud beta firestore export gs://htna-3695c-storage --collection-ids='users','feeds'
Then I followed https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore to import into BigQuery.
We have 2 collections in Cloud Firestore: Users & Feeds.
I have successfully imported the feeds collection but am not able to import the users collection.
I am getting an error while importing data from Cloud Storage into BigQuery:
Error: unexpected property name 'Contacts'. We have a Contacts field in the users collection; this field is of type 'Map'.
I also tried the command line. Below is the command to load into BigQuery:
bq --location=US load --source_format=DATASTORE_BACKUP myproject_Dataset.users gs://myproject-storage/2019-04-19T13:29:28_75338/all_namespaces/kind_users/all_namespaces_kind_users.export_metadata
Here I also got the same error:
unexpected property name 'Contacts'
I thought of adding projection fields to load only specified fields, something like below:
bq --location=US load --projection_fields=[Coins,Referral,Profile] --source_format=DATASTORE_BACKUP myproject_Dataset.users gs://myproject-storage/2019-04-19T13:29:28_75338/all_namespaces/kind_users/all_namespaces_kind_users.export_metadata
Here I got the error:
Waiting on bqjob_r73b7ddbc9398b737_0000016a4909dd27_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'myproject:bqjob_r73b7ddbc9398b737_0000016a4909dd27_1': An internal error
occurred and the request could not be completed. Error: 4550751
Can anyone please let me know how to fix these issues?
Thank you in advance.
[Image of the Firestore DB]
It seems to be an issue with some of your documents. According to the limitations of the BigQuery import of Firestore data, you must have a schema with fewer than 10,000 unique field names. In your Contacts schema you are using the contacts' names as keys; that design is likely to produce a large number of field names. You would need to check whether this is happening in your other documents as well.
As a workaround, you could change the design (at some stage of the loading process) from:
"Contacts" : {
"contact1" : "XXXXX",
"contact2" : "YYYYY",
}
to:
"Contacts" : [
{
"name" : "contact1",
"number" : "XXXXX"
},
{
"name" : "contact2",
"number" : "YYYYY"
},
]
This schema would drastically reduce the number of field names, which will make the data easier to manipulate from BigQuery.
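A minimal sketch of that reshaping, applied to one user document at some stage of the loading process (the helper name flatten_contacts is hypothetical):
def flatten_contacts(user_doc):
    """Turn the Contacts map (contact name -> number) into a list of records."""
    contacts_map = user_doc.get("Contacts", {})
    user_doc["Contacts"] = [
        {"name": name, "number": number}
        for name, number in contacts_map.items()
    ]
    return user_doc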