Getting error while connecting ADLS to Notebook in AML - azure-data-lake

I am getting the error below while connecting to a dataset that was created and registered in an AML notebook and is based on ADLS. When I connect to this dataset in the designer, I am able to visualize it. Below is the code I am using. Please let me know the solution if anyone has faced the same error.
Example 1: Import a dataset into the notebook
from azureml.core import Workspace, Dataset
subscription_id = 'abcd'
resource_group = 'RGB'
workspace_name = 'DSG'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='abc')
dataset.to_pandas_dataframe()
Error 1
ExecutionError: Could not execute the specified transform.
(Error in getting metadata for path /local/top.txt.
Operation: GETFILESTATUS failed with Unknown Error: The operation has timed out..
Last encountered exception thrown after 5 tries.
[The operation has timed out.,The operation has timed out.,The operation has timed out.,The operation has timed out.,The operation has timed out.]
[ServerRequestId:])|session_id=2d67
Example 2: Import data from a datastore into the notebook
from azureml.core import Workspace, Datastore, Dataset
datastore_name = 'abc'
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name)
datastore_paths = [(datastore, '/local/top.txt')]
df_ds = Dataset.Tabular.from_delimited_files(
    path=datastore_paths, validate=True,
    include_path=False, infer_column_types=True,
    set_column_types=None, separator='\t',
    header=True, partition_format=None
)
df = df_ds.to_pandas_dataframe()
Error 2
Cannot load any data from the specified path. Make sure the path is accessible.

Try removing the initial slash from your path so that it is 'local/top.txt':
datastore_paths = [(datastore, 'local/top.txt')]

For your dataset abc, can you visualize/preview the data on ml.azure.com?
This might be because your data permissions are not set up correctly in ADLS. You need to grant the service principal permission on the file/folder you are accessing.
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-access-control
[Screenshot: Data Access settings on a file in ADLS]
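It may also be worth confirming that the datastore itself is registered with a service principal that actually has those ACLs. A minimal sketch of re-registering it, assuming an ADLS Gen1 account and placeholder credentials (none of these values come from the question; for Gen2 accounts the analogous method is Datastore.register_azure_data_lake_gen2):

from azureml.core import Workspace, Datastore

# Re-register the datastore with a service principal that has been granted
# read/execute ACLs on the folder and file being accessed.
workspace = Workspace.from_config()
datastore = Datastore.register_azure_data_lake(
    workspace=workspace,
    datastore_name='abc',            # name used in the examples above
    store_name='myadlsaccount',      # placeholder ADLS Gen1 account name
    tenant_id='<tenant-id>',         # placeholder
    client_id='<service-principal-client-id>',    # placeholder
    client_secret='<service-principal-secret>',   # placeholder
    overwrite=True                   # replace the existing registration
)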

Related

Timeout error when writing large amounts of data to big query

I am getting the following error when trying to write large amounts of data to BigQuery using the client.insert_rows_json() method:
google.api_core.exceptions.RetryError: Deadline of 600.0s exceeded while calling target function, last exception: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
I have tried modifying the timeout parameter in the following way:
client.insert_rows_json(*args, timeout=1000000)
but I still get the same timeout error where the deadline is still at 600.0s.
Is there some way to create the client with:
credentials = service_account.Credentials.from_service_account_info(service_account_json)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
and specify how long before timeout should occur?
Please try this solution:
from google.cloud import bigquery
from google.oauth2 import service_account
# Load the service account credentials
service_account_json = 'path/to/service_account.json'
credentials = service_account.Credentials.from_service_account_file(service_account_json)
# Create the client with the desired timeout value
client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id,
    default_query_job_config=bigquery.QueryJobConfig(
        timeout=1800,  # Set the timeout to 1800 seconds (30 minutes)
    ),
)
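Note that the 600.0s deadline in the RetryError comes from the retry object that insert_rows_json uses rather than from a query job setting, so another thing worth trying is to pass a longer retry deadline and a per-request timeout directly to the call. A minimal sketch, assuming default credentials and placeholder table/rows:

from google.cloud import bigquery
from google.cloud.bigquery.retry import DEFAULT_RETRY

client = bigquery.Client()  # or pass credentials as shown above

rows_to_insert = [{"column_name": "value"}]  # placeholder rows

# Allow the insert to be retried for up to 30 minutes instead of the default
# 600 s, and give each individual request a 10-minute timeout.
errors = client.insert_rows_json(
    "my_dataset.my_table",                    # placeholder table reference
    rows_to_insert,
    retry=DEFAULT_RETRY.with_deadline(1800),
    timeout=600,
)
if errors:
    print("Encountered errors while inserting rows:", errors)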

pubsub message ingestion into Bigquery using data fusion

I've built a simple realtime pipeline to receive messages and attributes from a Pub/Sub subscription, wrangle them to keep only a few fields, and load them into a BigQuery table. When deployed and run, the pipeline log says Importing into table '<tablename>' from 0 paths; path[0] is '(empty)'; awaitCompletion: true
I'm unable to understand why there are 0 paths and why all the records are going to errors even though an error collector was set up. Is there a way to debug the Wrangler stage better?
Sample Wrangler directives are below:
keep message,attributes
set-charset :message 'utf-8'
set-type :attributes string
parse-as-json :attributes 1
parse-as-json :message 5
keep attributes_page_url,attributes_cart_remove,attributes_page_title,attributes_transaction_complete,message_event_id,message_data_dom_domain,message_data_dom_title,message_data_dom_pathname,message_data_udo_ut_visitor_id
columns-replace s/^attributes_//g
columns-replace s/^message_//g
Any help is appreciated. Thanks!
The reason you see that there are 0 paths to load is that all records are causing errors during wrangling.
There are two ways in which you can capture these errors:
1. Configure the Wrangler stage to Fail Pipeline on error. This will show the exception/error in the logs.
2. Attach the Error output from the Wrangler stage to an Error Collector, and store the output in a File or GCS Sink. This allows you to capture the error message for each row. Configure the Error Collector as follows:
   Error Message Column Name = errorMsg
   Error Code Column Name = errorCode
   Error Emitter Node Name = invalidRecord

Failed to connect to BigQuery with Python - ServiceUnavailable

Querying data from BigQuery had been working for me. Then I updated my Google packages (e.g. google-cloud-bigquery) and suddenly I could no longer download data. Unfortunately, I no longer know which old version of the package I was using. Now I'm using version '1.26.1' of google-cloud-bigquery.
Here is my code which was running:
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
KEY_FILE_LOCATION = "path_to_json"
PROCECT_ID = 'bigquery-123454'
credentials = service_account.Credentials.from_service_account_file(KEY_FILE_LOCATION)
client = bigquery.Client(credentials= credentials,project=PROCECT_ID)
query_job = client.query("""
    SELECT
        x,
        y
    FROM
        `bigquery-123454.624526435.ga_sessions_*`
    WHERE
        _TABLE_SUFFIX BETWEEN '20200501' AND '20200502'
    """)
results = query_job.result()
df = results.to_dataframe()
Except for the last line, df = results.to_dataframe(), the code works perfectly. Now I get a weird error which consists of three parts:
Part 1:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1596627109.629000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1596627109.629000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
Part 2:
ServiceUnavailable: 503 failed to connect to all addresses
Part 3:
RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x0000000010BD3C80>, table_reference {
project_id: "bigquery-123454"
dataset_id: "_a0003e6c1ab4h23rfaf0d9cf49ac0e90083ca349e"
table_id: "anon2d0jth_f891_40f5_8c63_76e21ab5b6f5"
}
requested_streams: 1
read_options {
}
format: ARROW
parent: "projects/bigquery-123454"
, metadata=[('x-goog-request-params', 'table_reference.project_id=bigquery-123454&table_reference.dataset_id=_a0003e6c1abanaw4egacf0d9cf49ac0e90083ca349e'), ('x-goog-api-client', 'gl-python/3.7.3 grpc/1.30.0 gax/1.22.0 gapic/1.0.0')]), last exception: 503 failed to connect to all addresses
I don't have an explanation for this error. I don't think it has anything to do with me updating the packages.
I once had problems with the proxy, but those problems caused a different error.
My colleague said that the project "bigquery-123454" is still available in BigQuery.
Any ideas?
Thanks for your help in advance!
A 503 error occurs when there is a network issue. Try again after some time or retry the job.
You can read more about the error on the Google Cloud errors page.
I found the answer:
After downgrading the package "google-cloud-bigquery" from version 1.26.1 to 1.18.1, the code worked again! So the newer package version caused the errors.
I downgraded the package using pip install google-cloud-bigquery==1.18.1 --force-reinstall
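If I remember correctly, google-cloud-bigquery started using the gRPC-based BigQuery Storage API for to_dataframe() by default around version 1.26, and "failed to connect to all addresses" is what a blocked gRPC connection (for example, through a proxy) tends to look like. If downgrading is not an option, an alternative sketch, assuming your installed version supports the create_bqstorage_client argument, is to force the plain REST download path:

# Reuses query_job from the question above.
# create_bqstorage_client=False keeps the download on the REST API instead of
# the gRPC-based BigQuery Storage API, which may be blocked by a proxy/firewall.
results = query_job.result()
df = results.to_dataframe(create_bqstorage_client=False)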

Dataflow job fails and tries to create temp_dataset on Bigquery

I'm running a simple Dataflow job to read data from a table and write it back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset, though; it's basically trying to create a temp_dataset because the job fails. But I don't get any information on the real error behind the scenes.
The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, but my question is more about how to get the real error rather than this one.
Any idea how to work around this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv
options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'
with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using the BigQuerySource with a query, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export from that temporary table to read the results.
So it is expected behavior for it to create this temp_dataset. This means that it is probably not hiding an error.
This is not very well documented but can be seen in the implementation of the BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can specify the dataset to use. That way the process doesn't create a temp dataset.
Example:
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery").withQueryTempDataset("existingDataset")
    .usingStandardSql().withMethod(TypedRead.Method.DEFAULT);
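The snippet above is for the Java SDK. Since the question uses the Python SDK, here is a rough Python sketch of the same idea, assuming a Beam version whose ReadFromBigQuery exposes a temp_dataset argument (the project and dataset names are placeholders, and the dataset must already exist):

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery as beam_bigquery

# Query results are staged in an existing dataset instead of a generated
# _dataflow_temp_dataset_* one, so bigquery.datasets.create is not needed.
temp_dataset = beam_bigquery.DatasetReference(
    projectId="prj", datasetId="existing_dataset"  # placeholder names
)

# 'options' is the PipelineOptions object built in the question above.
with beam.Pipeline(options=options) as p:
    bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
        query="SELECT ...",
        use_standard_sql=True,
        temp_dataset=temp_dataset,
    )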

S3ResponseError: 403 Forbidden. An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist

import boto
from boto.s3.key import Key

try:
    conn = boto.connect_s3(access_key, secret_access_key)
    bucket = conn.get_bucket(bucket_name, validate=False)
    k1 = Key(bucket)
    k1.key = 'Date_Table.csv'
    # k = bucket.get_key('Date_Table.csv')
    k1.make_public()
    k1.get_contents_to_filename(tar)
except Exception as e:
    print(e)
I am getting the error:
S3ResponseError: 403 Forbidden
AccessDenied: Access Denied (RequestId: D9ED8BFF6D6A993E, HostId: aw0KmxskATNBTDUEo3SZdwrNVolAnrt9/pkO/EGlq6X9Gxf36fQiBAWQA7dBSjBNZknMxWDG9GI=)
I have tried every possibility and am still getting the same error. Please guide me on how to solve this issue.
I tried another way, as below, and am getting the error:
An error occurred (NoSuchKey) when calling the GetObject operation:
The specified key does not exist.
session = boto3.session.Session(aws_access_key_id=access_key, aws_secret_access_key=secret_access_key,region_name='us-west-2')
print ("session:"+str(session)+"\n")
client = session.client('s3', endpoint_url=s3_url)
print ("client:"+str(client)+"\n")
stuff = client.get_object(Bucket=bucket_name, Key='Date_Table.csv')
print ("stuff:"+str(stuff)+"\n")
stuff.download_file(local_filename)
Always use boto3; boto is deprecated.
As long as you have set up your AWS CLI credentials, you don't need to pass hard-coded credentials. Read the boto3 credentials setup documentation thoroughly.
There is no reason to initiate a boto3.session unless you are using a different region or user profile.
Take your time and study the difference between the service client (boto3.client) and the service resource (boto3.resource).
The low-level boto3.client is easier to use for experiments. Use the high-level boto3.resource if you need to pass around arbitrary objects.
Here is simple code for boto3.client("s3").download_file:
import boto3
# initiate the proper AWS services client, i.e. S3
s3 = boto3.client("s3")
s3.download_file('your_bucket_name', 'Date_Table.csv', '/your/local/path/and/filename')
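For comparison, a rough equivalent of the same download using the higher-level resource interface mentioned above (bucket name and paths are placeholders, as before):

import boto3

# Same download via the resource interface instead of the low-level client.
s3 = boto3.resource("s3")
s3.Bucket("your_bucket_name").download_file("Date_Table.csv", "/your/local/path/and/filename")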