Failed to connect to BigQuery with Python - ServiceUnavailable - pandas

Querying data from BigQuery had been working for me. Then I updated my Google packages (e.g. google-cloud-bigquery) and suddenly I could no longer download data. Unfortunately, I no longer know which version of the package I was using before; I am now on version 1.26.1 of google-cloud-bigquery.
Here is the code that used to run:
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
KEY_FILE_LOCATION = "path_to_json"
PROJECT_ID = 'bigquery-123454'
credentials = service_account.Credentials.from_service_account_file(KEY_FILE_LOCATION)
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)
query_job = client.query("""
    SELECT
        x,
        y
    FROM
        `bigquery-123454.624526435.ga_sessions_*`
    WHERE
        _TABLE_SUFFIX BETWEEN '20200501' AND '20200502'
""")
results = query_job.result()
df = results.to_dataframe()
Except for the last line, df = results.to_dataframe(), the code works perfectly. Now I get a weird error that consists of three parts:
Part 1:
_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"#1596627109.629000000","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1596627109.629000000","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
Part 2:
ServiceUnavailable: 503 failed to connect to all addresses
Part 3:
RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x0000000010BD3C80>, table_reference {
project_id: "bigquery-123454"
dataset_id: "_a0003e6c1ab4h23rfaf0d9cf49ac0e90083ca349e"
table_id: "anon2d0jth_f891_40f5_8c63_76e21ab5b6f5"
}
requested_streams: 1
read_options {
}
format: ARROW
parent: "projects/bigquery-123454"
, metadata=[('x-goog-request-params', 'table_reference.project_id=bigquery-123454&table_reference.dataset_id=_a0003e6c1abanaw4egacf0d9cf49ac0e90083ca349e'), ('x-goog-api-client', 'gl-python/3.7.3 grpc/1.30.0 gax/1.22.0 gapic/1.0.0')]), last exception: 503 failed to connect to all addresses
I don't have an explanation for this error, and I don't think it has anything to do with updating the packages.
I once had problems with a proxy, but those caused a different error.
My colleague said that the project "bigquery-123454" is still available in BigQuery.
Any ideas?
Thanks for your help in advance!

A 503 error occurs when there is a network issue. Try again after some time or retry the job, for example with a simple retry loop like the sketch below.
You can read more about this error in the Google Cloud documentation.
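For illustration only, a minimal retry sketch around the failing call (it reuses the client from the question; QUERY is a placeholder for the SQL string and the backoff values are arbitrary):
import time
from google.api_core.exceptions import RetryError, ServiceUnavailable

QUERY = "SELECT x, y FROM `bigquery-123454.624526435.ga_sessions_20200501`"  # placeholder query

df = None
for attempt in range(5):                      # arbitrary number of attempts
    try:
        query_job = client.query(QUERY)       # `client` as constructed in the question
        df = query_job.result().to_dataframe()
        break
    except (ServiceUnavailable, RetryError):  # the errors shown in the question
        time.sleep(2 ** attempt)              # simple exponential backoff before retrying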

I found the answer:
After downgrading the package google-cloud-bigquery from version 1.26.1 to 1.18.1, the code worked again, so the newer package version caused the errors.
I downgraded the package with pip install google-cloud-bigquery==1.18.1 --force-reinstall
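To confirm which version is actually installed after the downgrade, a quick check with pkg_resources (a small sketch; pkg_resources ships with setuptools):
import pkg_resources

# Should print 1.18.1 after the downgrade described above.
print(pkg_resources.get_distribution("google-cloud-bigquery").version)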

Related

Dataflow job fails and tries to create temp_dataset on Bigquery

I'm running a simple Dataflow job that reads data from one table and writes it back to another.
The job fails with the error:
Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error:
Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].
I'm not trying to create any dataset, though; it is basically trying to create a temp_dataset because the job fails. But I don't get any information about the real error behind the scenes.
The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, and my question is more about how to get the real error rather than this one.
Any idea how to work around this issue?
Here's the code:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv
options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'
with beam.Pipeline(options=options) as p:
    query = "SELECT ..."
    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)
    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)
    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
When using BigQuerySource, the SDK creates a temporary dataset, stores the output of the query in a temporary table, and then issues an export from that temporary table to read the results.
So creating this temp_dataset is expected behavior, which means the permission error is probably the real error rather than one that hides it.
This is not very well documented, but it can be seen in the implementation of BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
You can also specify the dataset to use; that way the process doesn't create a temporary one.
Example:
TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery").withQueryTempDataset("existingDataset")
    .usingStandardSql().withMethod(TypedRead.Method.DEFAULT);
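The snippet above is the Java SDK. For the Python pipeline in the question, newer Beam releases expose a similar temp_dataset argument on ReadFromBigQuery; a sketch under that assumption, where existing_dataset is a placeholder for a dataset that already exists in the project:
import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery

with beam.Pipeline(options=options) as p:
    bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
        query="SELECT ...",
        use_standard_sql=True,
        # Reuse an existing dataset for the query results instead of letting
        # the SDK create a temporary one (requires a recent Beam SDK).
        temp_dataset=bigquery.DatasetReference(
            projectId="prj", datasetId="existing_dataset"),
    )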

Getting error while connecting ADLS to Notebook in AML

I am getting the error below when connecting to a dataset that was created and registered in an AML notebook and is based on ADLS. When I connect to this dataset in the designer, I am able to visualize it. Below is the code I am using. Please let me know the solution if anyone has faced the same error.
Example 1: Import dataset into notebook
from azureml.core import Workspace, Dataset
subscription_id = 'abcd'
resource_group = 'RGB'
workspace_name = 'DSG'
workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='abc')
dataset.to_pandas_dataframe()
Error 1
ExecutionError: Could not execute the specified transform.
(Error in getting metadata for path /local/top.txt.
Operation: GETFILESTATUS failed with Unknown Error: The operation has timed out..
Last encountered exception thrown after 5 tries.
[The operation has timed out.,The operation has timed out.,The operation has timed out.,The operation has timed out.,The operation has timed out.]
[ServerRequestId:])|session_id=2d67
Example 2: Import data from datastore into notebook
from azureml.core import Workspace, Datastore, Dataset
datastore_name = 'abc'
workspace = Workspace.from_config()
datastore = Datastore.get(workspace, datastore_name)
datastore_paths = [(datastore, '/local/top.txt')]
df_ds = Dataset.Tabular.from_delimited_files(
    path=datastore_paths, validate=True,
    include_path=False, infer_column_types=True,
    set_column_types=None, separator='\t',
    header=True, partition_format=None
)
df = df_ds.to_pandas_dataframe()
Error 2
Cannot load any data from the specified path. Make sure the path is accessible.
Try removing the initial slash from your path, i.e. use 'local/top.txt':
datastore_paths = [(datastore, 'local/top.txt')]
For your dataset abc, can you visualize/preview the data on ml.azure.com?
This might be because your data permissions are not set up correctly in ADLS. You need to grant the service principal permission on the file/folder you are accessing.
https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-access-control
Data Access Setting on a file in ADLS

Could not claim interface pyusb

I'm trying to use pyusb over libusb-1.0 to read data from an old PS/2 mouse through a PS/2-to-USB adapter that presents it as a HID device.
I am able to access the device, but when I try to send it the GET_REPORT request over control transfer, it shows me this error:
[Errno None] b'libusb0-dll:err [claim_interface] could not claim interface 0, win error: The parameter is incorrect.\r\n'
Here is my code:
import usb.core as core
import time
from usb.core import USBError as USBError
dev = core.find(idVendor=0x13ba, idProduct=0x0018, address=0x03)
interface = 0
endpoint = dev[0][(interface, 0)][0]
dev.set_configuration()
collected = 0
attempts = 50
while collected < attempts:
    try:
        print(dev.ctrl_transfer(0b10100001, 0x01, wValue=100, data_or_wLength=64))
        collected += 1
    except USBError as e:
        print(e)
    time.sleep(0.1)
I'm using Python 3.x on Windows 10 (Lenovo G510, if it matters to anyone).
The driver I installed is libusb-win32, via Zadig.
Any help will be appreciated!
Thanks
EDIT:
I tried using WinUSB so that it would work with libusb-1.0, but it didn't find the device: the usb.core.find() function returned None.
Continuing with WinUSB and libusb-1.0, I did eventually find the device, but now it appears to have no configuration.
dev.set_configuration()
returns:
File "C:\Users\Idan Stark\AppData\Local\Programs\Python\Python36-32\lib\site-packages\usb\backend\libusb1.py", line 595, in _check
raise USBError(_strerror(ret), ret, _libusb_errno[ret])
usb.core.USBError: [Errno 2] Entity not found
Any help will be appreciated, with libusb-1.0 or libusb-0.1, anything to make this work! Thank you!

JVM is not ready after 10 seconds

I configured SparkR as described in the tutorials, and everything was working. I was able to read the database with read.df, but suddenly nothing else works, and the following error appears:
Error in sparkR.init(master = "local") : JVM is not ready after 10 seconds
Why does it suddenly appear now? I've read about other users with the same problem, but the solutions given did not work. Below is my code:
Sys.setenv(SPARK_HOME= "C:/Spark")
Sys.setenv(HADOOP_HOME = "C:/Hadoop")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
# initialize the SparkR environment
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.2.0" "sparkr-shell"')
Sys.setenv(SPARK_MEM="4g")
# Create a Spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
Try the few things below:
Check whether C:/Windows/System32/ is in the PATH.
Check whether spark-submit.cmd has the proper execute permissions.
If both of the above are true and it still gives the same error, delete the Spark directory and create a fresh one by unzipping the Spark archive again. A quick sketch for checking the first two points follows below.
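A minimal sketch of those first two checks in Python (the path C:\Spark is an example; adjust it to your SPARK_HOME):
import os

# Is C:/Windows/System32 on the PATH?
print(any("system32" in p.lower() for p in os.environ["PATH"].split(os.pathsep)))

# Does spark-submit.cmd exist and look executable?
# (On Windows, os.access with X_OK is mostly an existence check.)
print(os.access(r"C:\Spark\bin\spark-submit.cmd", os.X_OK))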
I'm a beginner in R, and I solved the same "JVM is not ready after 10 seconds" problem by installing a JDK (version 7+) before installing SparkR on my Mac. It works well now. I hope this helps with your problem.

Unable to load bigquery data in local spark (on my mac) using pyspark

I am getting the error below after executing the code below. Am I missing something in the installation? I am using Spark installed locally on my Mac, so I am checking whether I need to install additional libraries for the code below to work and load data from BigQuery.
Py4JJavaError Traceback (most recent call last)
<ipython-input-8-9d6701949cac> in <module>()
13 "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
14 "org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
---> 15 conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.gson.JsonObject
import json
import pyspark
sc = pyspark.SparkContext()
hadoopConf=sc._jsc.hadoopConfiguration()
hadoopConf.get("fs.gs.system.bucket")
conf = {"mapred.bq.project.id": "<project_id>", "mapred.bq.gcs.bucket": "<bucket>",
"mapred.bq.input.project.id": "publicdata",
"mapred.bq.input.dataset.id":"samples",
"mapred.bq.input.table.id": "shakespeare" }
tableData = sc.newAPIHadoopRDD(
"com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
"org.apache.hadoop.io.LongWritable", "com.google.gson.JsonObject",
conf=conf).map(lambda k: json.loads(k[1])).map(lambda x: (x["word"],
int(x["word_count"]))).reduceByKey(lambda x,y: x+y)
print tableData.take(10)
The error "java.lang.ClassNotFoundException: com.google.gson.JsonObject" seems to hint that a library is missing.
Please try adding the gson jar to your path: http://search.maven.org/#artifactdetails|com.google.code.gson|gson|2.6.1|jar
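One way to put the gson classes on the classpath without downloading the jar by hand is to pull the artifact at context creation time via spark.jars.packages (a sketch only, replacing the plain SparkContext() above; it assumes the machine can reach Maven Central):
import pyspark

conf = (pyspark.SparkConf()
        .set("spark.jars.packages", "com.google.code.gson:gson:2.6.1"))  # fetched from Maven Central at startup
sc = pyspark.SparkContext(conf=conf)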
Highlighting something buried in the connector link in Felipe's response: the BigQuery connector used to be included by default in Cloud Dataproc, but was dropped starting with version 1.3. The link shows three ways to get it back.