BIGQUERY PERMISSION | What can bigquery.readsessions do with a BigQuery dataset? - google-bigquery

I do not understand the BigQuery Read Session User permission. If I am assigned this role, can I query the dataset in BigQuery via the Python SDK?
I tried:
from google.cloud import bigquery
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'Path/xxx.json'
project_id = 'Project_ID'
client = bigquery.Client(project=project_id)
client.query('SQL')  # 'SQL' stands in for the actual query string
Got:
Forbidden: 403 Access Denied: Table <> User does not have permission to query table <>, or perhaps it does not exist in location <>.
Location: <>
Job ID: <>
To be clear, I want to know what a read session means in BigQuery.

When the Storage Read API is used, structured data is sent in a binary serialization format, which allows parallelism. The Storage Read API provides fast access to managed BigQuery storage using an RPC protocol. To use the Storage Read API, a ReadSession has to be created.
The ReadSession message contains information about the maximum number of streams, the snapshot time, the set of columns to return, and the predicate filter, all of which are provided to the CreateReadSession RPC. The ReadSession response contains a set of Stream identifiers, which are used by the Storage API; each Stream identifier returned in the response is used to read the data from the table. For more information, you can check this documentation.
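For illustration, here is a minimal sketch of creating a ReadSession with the Python BigQuery Storage client (google-cloud-bigquery-storage); the project and table paths, selected fields, and row restriction are placeholder values:
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

client = bigquery_storage.BigQueryReadClient()

# Hypothetical table path: projects/<project>/datasets/<dataset>/tables/<table>
table = "projects/my-project/datasets/my_dataset/tables/my_table"

requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["col_a", "col_b"],  # the set of columns to return
        row_restriction="col_a > 0",         # the predicate filter
    ),
)

session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=1,                      # maximum number of streams
)

# Each Stream identifier in the response can be read independently.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row)
Note that the BigQuery Read Session User role only covers these read-session permissions; running client.query() additionally needs job-creation and table-read permissions (for example BigQuery Job User plus BigQuery Data Viewer), which is why the query in the question still returns 403.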

Related

Azure Synapse Copy Data from BigQuery, Source ERROR [HY000] [Microsoft][BigQuery] (131) Unable to authenticate with Google BigQuery Storage API

I am getting this error in the Source tab under Use query (Table, Query) > Query, when running a Copy Data activity in an Azure Synapse pipeline.
Unable to authenticate with Google BigQuery Storage API:
The strange thing is that I can preview data in the Source dataset, and I can also preview data when I select the Use query > Table option.
I can even run a query that lists the table's schema:
SELECT
*
FROM
`3082`.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
but I get this authentication error when selecting columns
SELECT
*
FROM
`3082.gcp_billing_export_v1_019F74_6EA5E8_C96548`;
ERROR [HY000] [Microsoft][BigQuery] (131) Unable to authenticate with Google BigQuery Storage API. Check your account permissions
The above error is due to an authentication issue with the BigQuery Storage API. The permissions required to read data through it are:
bigquery.readsessions.create
bigquery.readsessions.getData
bigquery.readsessions.update
The BigQuery User role grants the above permissions (a sketch of granting it follows the references below).
Reference:
Google Cloud doc on Access Control - BigQuery User
MS doc on the Google BigQuery connector issue
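As a sketch of that fix (the project ID, service-account email, and use of the Resource Manager v1 API are assumptions for illustration), granting roles/bigquery.user to the service account that the Synapse connector authenticates with could look like this in Python:
from googleapiclient import discovery

project_id = "my-project"                                                # hypothetical project
member = "serviceAccount:synapse-sa@my-project.iam.gserviceaccount.com"  # hypothetical service account

service = discovery.build("cloudresourcemanager", "v1")
policy = service.projects().getIamPolicy(resource=project_id, body={}).execute()

# Add a binding for the BigQuery User role, which includes the
# bigquery.readsessions.* permissions listed above.
policy["bindings"].append({"role": "roles/bigquery.user", "members": [member]})

service.projects().setIamPolicy(resource=project_id, body={"policy": policy}).execute()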

Is there a way we can filter blobs that are not in archive tier in Azure Blob?

I have a blob container, say "demo". It has many files; some are in the hot tier and some are in the archive tier. I want to process only the files that are in the hot tier, ignoring archive-tier files.
Using the GetMetadata activity lists all files, including archive-tier files.
Using "az storage blob list" throws the error 'This operation is not permitted on an archived blob.'
Please point me in the right direction.
You can use the List Blobs operation, which returns a list of the blobs under the specified container.
Method: GET
Request URI: https://myaccount.blob.core.windows.net/mycontainer?restype=container&comp=list
HTTP Version: HTTP/1.1
This returns the response body in XML format, which you can then filter by access tier name.
You may also need to specify the service version on the request (for example via the x-ms-version header, 2019-12-12 or later) if the newest version is not used automatically.
For version 2017-04-17 and above, List Blobs returns the AccessTier
element if an access tier has been explicitly set. For Blob Storage or
General Purpose v2 accounts, valid values are Hot/Cool/Archive. If the
blob is in rehydrate pending state then ArchiveStatus element is
returned with one of the valid values
rehydrate-pending-to-hot/rehydrate-pending-to-cool.
Refer List Blobs for more details.
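If you would rather filter in code than parse the XML yourself, here is a minimal sketch using the azure-storage-blob Python package (the connection string is a placeholder; the container name "demo" is from the question); each listed blob exposes its tier through blob_tier:
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")  # placeholder
container = service.get_container_client("demo")

# list_blobs() yields BlobProperties; blob_tier is "Hot", "Cool" or "Archive".
hot_blobs = [blob.name for blob in container.list_blobs() if blob.blob_tier == "Hot"]
print(hot_blobs)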

In BigQuery, query to get GCS metadata (filenames in GCS)

We have a GCS bucket with a subfolder at url https://storage.googleapis.com/our-bucket/path-to-subfolder. This sub-folder contains files:
file_1_3.png
file_7_4.png
file_3_2.png
file_4_1.png
We'd like to create a table in BigQuery with a column number1 with values 1,7,3,4 (first number in filename) and a column number2 with the second numbers. String splitting is easy, once the data (a column with filenames) is in BigQuery. How can the filenames be retrieved? Is it possible to query a GCS bucket for metadata on files?
EDIT: want to do this
Updating the answer to reflect the question of how to retrieve GCS bucket metadata on files.
There are two options here, depending on the use case:
Use a Cloud Function on a cron schedule to read the metadata (like in the example you shared), then insert it with the BigQuery client library. Then apply the regex listed below.
Use a remote function, a feature in preview, so you may not have the functionality available, although you may be able to request it. This option gives you the latest data on read. It involves the following:
Create a Cloud Function that returns an array of blob names, see code below.
Create a connection resource in BigQuery (the overall process is listed here; however, since the remote function portion is in preview, the documentation and potentially your UI may not reflect the necessary options (it did not in mine)).
Create a remote function (third code block in the link).
Call the function from your code, then manipulate the result as needed with REGEXP.
Example CF for option 2:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    blob_array = []
    for blob in blobs:
        blob_array.append(blob.name)  # collect the blob names
    return blob_array
Example remote function from documentation:
CREATE FUNCTION mydataset.remoteMultiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION us.myconnection
OPTIONS(endpoint="https://us-central1-myproject.cloudfunctions.net/multiply");
Once it's in, it will return the full GCS path of the file. From there you can use a regex like regexp_extract(_FILE_NAME, 'file_(.+)_') to extract the important information.
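To make the splitting concrete, here is a minimal sketch run through the BigQuery Python client (the file names from the question are inlined via UNNEST rather than coming from the remote function) that pulls out number1 and number2 with REGEXP_EXTRACT:
from google.cloud import bigquery

client = bigquery.Client()
sql = r"""
SELECT
  fname,
  REGEXP_EXTRACT(fname, r'file_(\d+)_\d+\.png') AS number1,
  REGEXP_EXTRACT(fname, r'file_\d+_(\d+)\.png') AS number2
FROM UNNEST(['file_1_3.png', 'file_7_4.png', 'file_3_2.png', 'file_4_1.png']) AS fname
"""
for row in client.query(sql).result():
    print(row.fname, row.number1, row.number2)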
Now that BQ Remote Functions (RF) are GA, as well as JSON, I thought I would share a way to get any property of blobs in a bucket, right from BigQuery SQL.
!! Make sure to carefully read the official documentation first on how to set up an RF, as it's easy to miss a step. There are slight differences if you would rather use a 2nd gen function or Cloud Run.
Create the following Cloud Function for storage (here in Python); 1st gen is good enough:
import json
from google.cloud import storage

storage_client = storage.Client()

def list_blobs(request):
    print(request_json := request.get_json())  # print for debugging
    calls = request_json['calls']
    bucket_name = calls[0][0]
    blobs = storage_client.list_blobs(bucket_name)
    reply = [b._properties for b in blobs]
    return json.dumps({'replies': [reply]})
Create the BQ remote function (assumes a fns dataset, a us.api connection and my_project_id):
CREATE FUNCTION fns.list_blobs(bucket STRING)
RETURNS JSON
REMOTE WITH CONNECTION us.api
OPTIONS(endpoint="https://us-central1-my_project_id.cloudfunctions.net/storage")
The trick to returning multiple values for a single request is to use the JSON type. Then SELECT whatever properties you want:
SELECT STRING(blob.name), STRING(blob.size), CAST(STRING(blob.updated) AS TIMESTAMP)
FROM UNNEST(
  JSON_EXTRACT_ARRAY(
    fns.list_blobs('my_bucket')
  )
) blob
The JSON is converted to an ARRAY, and UNNEST() pivots it to multiple rows - unfortunately not to columns too.
Voilà! I wish there were an easier way to fully parse a JSON array into a table, populating all columns at once, but as of this writing you must explicitly extract the properties you want.
You can do many more cool things by extending the functions (Cloud and remote) so you don't have to leave SQL, like:
generate and return a signed URL to display/download right from a query result (e.g. in a BI tool)
use user_defined_context and branch logic in the CF code to perform other operations, like deleting blobs
Object tables are read-only tables containing a metadata index over the unstructured data objects in a specified Cloud Storage bucket. Each row of the table corresponds to an object, and the table columns correspond to the object metadata generated by Cloud Storage, including any custom metadata.
With object tables we can get the file names and do operations on top of them in BigQuery itself.
https://cloud.google.com/bigquery/docs/object-table-introduction
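As a rough sketch (the dataset name, connection name, and reuse of the bucket path from the question are assumptions), creating and querying such an object table from the Python client could look like this:
from google.cloud import bigquery

client = bigquery.Client()

# Create an object table over the bucket (requires an existing Cloud resource connection).
client.query("""
CREATE EXTERNAL TABLE mydataset.bucket_objects
WITH CONNECTION `us.my-connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://our-bucket/path-to-subfolder/*']
)
""").result()

# The object table exposes Cloud Storage metadata such as uri, size and updated.
for row in client.query("SELECT uri, size, updated FROM mydataset.bucket_objects").result():
    print(row.uri, row.size, row.updated)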

Access Denied: BigQuery BigQuery: Permission denied while writing data

I am trying to export data from BigQuery to Google Cloud Storage using this command:
EXPORT DATA OPTIONS(
uri='gs://bucket/archivage-base/Bases archivees/*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `base-012021.creation_tables.dataext`
And I have this error: Access Denied: BigQuery BigQuery: Permission denied while writing data.
I cannot understand why, because the service account seems to have all the grants, and I didn't find any topic that helps me solve the problem.
Thank you!
If this is the live query you're using and you haven't redacted the real bucket name, it's probably because of the bucket string in the URI. The URI should be something like gs://your-bucket-name/prefix/path/to/output/yourfileprefix_*.csv
If you have redacted the bucket name, then check to make sure that the user (or service account) identity issuing the query has the requisite access to the bucket and objects in cloud storage.
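If it does turn out to be a bucket permission problem, here is a minimal sketch of granting the querying service account write access on the bucket with the google-cloud-storage client (the service-account email and the role choice are assumptions):
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket")  # the bucket from the EXPORT DATA uri

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectAdmin",  # allows creating/overwriting the exported CSV files
    "members": {"serviceAccount:my-sa@base-012021.iam.gserviceaccount.com"},  # hypothetical account
})
bucket.set_iam_policy(policy)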
I had the same issue.
First, check the service account in IAM and make sure it has the required roles.
After adding them, create a JSON key file for the service account and add it to your project.

Exporting data from Google Bigquery table to Google Cloud Storage

When exporting data from a Google BigQuery table to Google Cloud Storage in Python, I get the error:
Access Denied: BigQuery BigQuery: Permission denied while writing data.
I checked the JSON key file and it belongs to the owner of the storage. What can I do?
There are several reasons for this type of error:
1. Make sure you give the exact path to the GOOGLE_APPLICATION_CREDENTIALS key.
2. Check that you have write permission in your project.
3. Make sure you have given a correct schema and values if you are writing to a table; this type of error often occurs due to an incorrect schema value.
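For reference, here is a minimal sketch of the export itself with the BigQuery Python client (the key path, table and bucket names are placeholders); once the permissions above are in place, extract_job.result() will surface any remaining access errors:
import os
from google.cloud import bigquery

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'Path/xxx.json'  # service-account key file

client = bigquery.Client()
table_ref = 'my-project.my_dataset.my_table'                 # placeholder table
destination_uri = 'gs://my-bucket/exports/my_table_*.csv'    # placeholder bucket

extract_job = client.extract_table(table_ref, destination_uri, location='US')
extract_job.result()  # waits for completion and raises on permission errors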