Is it possible to run an extract job from BigQuery to GCS across project IDs? - google-bigquery

Hey guys, I'm trying to export a BigQuery table to Cloud Storage à la this example. It's not working for me at the moment, and I'm worried the reason is that the Cloud Storage project is different from the BigQuery table's project. Is this actually doable? I can't see how, using the template above.

Confirming:
You CAN have your table in Project A exported/extracted to a GCS bucket in Project B. You just need to make sure you have the proper permissions on both sides. At least:
READ for the respective dataset in Project A and
WRITE for the respective bucket in Project B
Please note: the data in the respective dataset of Project A and the bucket in Project B MUST be in the same location - US or EU, etc.
Simply put: source and destination must be in the same location.
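For illustration, a minimal sketch of such a cross-project export with the BigQuery Python client, assuming a hypothetical table in project-a and a destination bucket owned by project-b, both in the EU location:

```python
from google.cloud import bigquery

# Hypothetical names: the table lives in project-a, the destination bucket
# belongs to project-b, and both the dataset and the bucket are in the EU.
client = bigquery.Client(project="project-a")

extract_job = client.extract_table(
    "project-a.my_dataset.my_table",
    "gs://project-b-export-bucket/my_table-*.csv",  # wildcard lets BigQuery shard large exports
    location="EU",  # must match the source dataset's location
)
extract_job.result()  # wait for the extract job to finish
```

The credentials running this job need the dataset READ and bucket WRITE permissions described above.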

Related

How to query a BigQuery table in one GCP project and one location and write results to a table in another project and another location with Airflow?

I need to query a BigQuery table in one GCP project (say #1) and one location (EU) and write results to a table in another project (say #2) and another location (US) with Airflow.
Composer/Airflow instance itself runs in project #2 and location US.
Airflow is using a GCP connection configured with a service account from project #2, which also has most of the rights in project #1.
I realise that this might involve multiple extra steps such as storing data temporarily in GCS, so this is fine as long as the end result is achieved.
How should I approach this problem? I saw quite a few articles, but none suggests a strategy for dealing with this situation, which I suppose is fairly common.
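One possible approach (a sketch only, with hypothetical dataset, bucket and table names) is to stage through GCS as the question already suggests: materialise the query results into a staging table in project #1 (EU), export them to an EU bucket, copy the files to a US bucket, then load them into the destination table in project #2 (US). With the Google provider's transfer operators in Airflow, that could look roughly like:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_gcs import GCSToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Hypothetical IDs throughout; assumes the query results have already been
# materialised into project-1.eu_dataset.staging_table (e.g. via BigQueryInsertJobOperator).
with DAG(
    dag_id="bq_cross_project_cross_location",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    export_to_eu_bucket = BigQueryToGCSOperator(
        task_id="export_to_eu_bucket",
        source_project_dataset_table="project-1.eu_dataset.staging_table",
        destination_cloud_storage_uris=["gs://eu-staging-bucket/staging/part-*.avro"],
        export_format="AVRO",
        gcp_conn_id="google_cloud_default",
    )

    copy_to_us_bucket = GCSToGCSOperator(
        task_id="copy_to_us_bucket",
        source_bucket="eu-staging-bucket",
        source_object="staging/part-*.avro",
        destination_bucket="us-staging-bucket",
        destination_object="staging/",
        gcp_conn_id="google_cloud_default",
    )

    load_into_us_table = GCSToBigQueryOperator(
        task_id="load_into_us_table",
        bucket="us-staging-bucket",
        source_objects=["staging/part-*.avro"],
        destination_project_dataset_table="project-2.us_dataset.final_table",
        source_format="AVRO",
        write_disposition="WRITE_TRUNCATE",
        gcp_conn_id="google_cloud_default",
    )

    export_to_eu_bucket >> copy_to_us_bucket >> load_into_us_table
```

The service account behind the connection would need BigQuery and GCS permissions in both projects, and each bucket must be colocated with the dataset it is exported from or loaded into.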

Who pays when using BigQuery Storage Read API across projects?

Say that I have two projects in GCP, Projects A and B. Project A has a BigQuery table, and project B needs to read that data using the BigQuery Storage Read API.
If I create a service account in project B, then go to project A and assign this user the role BigQuery Read Session User as well as add them to the dataset ACL, they will be able to stream the table content. Which project will receive the bill for the data extracted? Project A where the read session is created, or Project B which is the home of the acting service account?
To be clear, I would like for Project B to pay for the load they generate.
I have tried to find a way to be explicit about this, but as far as I can tell there is no way to specify billing project when creating a read session. I have also checked what happens when I try to create a read session with the "parent project" set to Project B while the table location says Project A, and this just leads to the table not being found at all.
In Storage Read API pricing, BigQuery charges for the number of bytes processed (also referred to as bytes read). In your scenario, Project A holds the BigQuery table and is where the read session is created, and you only attached Project B's service account as a BigQuery Read Session User in Project A, hence the billed amount will go to Project A.
As an alternative, you can check the cost trends for the BigQuery Storage API in Billing Reports by filtering on that service.
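For reference, this is roughly where the "parent project" appears when creating a read session with the Python client (hypothetical project, dataset and table names; the session below is created under Project A, as in the scenario described):

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

# Authenticated as Project B's service account, which has been granted
# BigQuery Read Session User and dataset access in Project A.
client = bigquery_storage_v1.BigQueryReadClient()

requested_session = types.ReadSession(
    table="projects/project-a/datasets/my_dataset/tables/my_table",
    data_format=types.DataFormat.AVRO,
)

session = client.create_read_session(
    parent="projects/project-a",  # the project the read session is created under
    read_session=requested_session,
    max_stream_count=1,
)

reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    pass  # each row is a mapping of column name to value
```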

is it possible to read a Google Drive folder (all files) as BigQuery external data source?

I am using Google Drive as an external data source in BigQuery. I am able to access a single file, but unable to read a folder with multiple files.
Note:
I picked up the shareable link for the folder from Google Drive and used the "bq mk ..." command referencing the link ID. Although it creates the table, I'm unable to pull data from it.
I've not tried it with Drive so I have no sense of how performant it is, but when defining an external table (or load job), you can specify the source data as a list of URIs. My suspicion is that it's not particularly scalable and may run into limits in Drive, as that's not a typical access pattern. Google Cloud Storage is a much more suitable data source for this kind of thing.
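A sketch of the list-of-URIs approach with the BigQuery Python client, using hypothetical GCS paths (Drive file URIs could be listed the same way, but GCS is the recommended source here):

```python
from google.cloud import bigquery

client = bigquery.Client()

# An external table definition can point at several source URIs (or a wildcard) at once.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = [
    "gs://my-bucket/data/file1.csv",
    "gs://my-bucket/data/file2.csv",
    # or simply "gs://my-bucket/data/*.csv"
]

table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table)
```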

Is it possible to change the region of a Google Cloud Platform project?

If I go to the Google Developer Console then I can see all my Cloud Platform projects, but not their regions.
How do I see the region of each project? And is it possible to change the region once it has been set?
Thanks for any help.
There is no such thing as a region of a GCP project.
In other words, region/location is specific to resources, and a GCP project is not permanently tied to a single region/location.
For example, you can have a project with multiple BigQuery datasets in different regions.
That same project can have many Compute Engine instances running, each one in different location/region.
There is a default region that is set per GCP project, but that can always be overridden when creating resources in GCP; it is mainly used to guess the default location when one is not specified in API calls.
Regarding the BigQuery aspect of this question:
Data Locations on a table are immutable once set.
In order to change the location, the easiest solution would be to export the data to Google Cloud Storage, delete the table, re-create the table in the correct region, then import the data.
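A rough sketch of that export-and-reload pattern with the Python client (hypothetical project, dataset and bucket names; each bucket must be colocated with the dataset it serves):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Export the table to a bucket colocated with the source dataset (US here).
client.extract_table(
    "my-project.us_dataset.my_table",
    "gs://my-us-bucket/my_table-*.avro",
    location="US",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
).result()

# 2. Copy the exported files to a bucket in the target location,
#    e.g. gsutil cp "gs://my-us-bucket/my_table-*.avro" gs://my-eu-bucket/

# 3. Create a dataset in the new location and load the data into it.
eu_dataset = bigquery.Dataset("my-project.eu_dataset")
eu_dataset.location = "EU"
client.create_dataset(eu_dataset, exists_ok=True)

client.load_table_from_uri(
    "gs://my-eu-bucket/my_table-*.avro",
    "my-project.eu_dataset.my_table",
    location="EU",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
).result()
```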
https://cloud.google.com/appengine/docs/python/console/#server-location
Setting the server location
When you create your project, you can specify the location from which it will be served. In the new project dialog, click on the link to Show Advanced Options, and select a location from the pulldown menu:
us-central
us-east1
europe-west
If you select us-east1 your project will be served from a single region in South Carolina. The us-central and europe-west locations contain multiple regions in the United States and western Europe, respectively. Projects deployed to either us-central or europe-west may be served from any one of the regions they contain. If you want to colocate your App Engine instances with other single-region services, such as Google Compute Engine, you should select us-east1.

How to upload multiple files to google cloud storage bucket as a transaction

Use Case:
Upload multiple files into a cloud storage bucket, and then use that data as a source to a bigquery import. Use the name of the bucket as the metadata to drive which sharded table the data should go into.
Question:
In order to prevent a partial import into the BigQuery table, ideally I would like to do the following:
Upload the files into a staging bucket
Verify all files have been uploaded correctly
Rename the staging bucket to its final name (for example, gs://20130112)
Trigger the bigquery import to load the bucket into a sharded table
Since gsutil does not seem to support bucket rename, what are the alternative ways to accomplish this?
Google Cloud Storage does not support renaming buckets, or more generally an atomic way to operate on more than one object at a time.
If your main concern is that all objects were uploaded correctly (as opposed to needing to ensure the bucket content is only visible once all objects are uploaded), gsutil cp supports that -- if any object fails to upload, it will report the number that failed to upload and exit with a non-zero status.
So, a possible implementation would be a script that runs gsutil cp to upload all your files, and then checks the gsutil exit status before creating the BigQuery table load job.
Mike Schwartz, Google Cloud Storage team
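A minimal sketch of such a script (hypothetical paths and table names), using gsutil's exit status as the gate before starting the load job:

```python
import subprocess

from google.cloud import bigquery

# gsutil expands the wildcards itself and exits non-zero if any upload fails.
result = subprocess.run(
    ["gsutil", "-m", "cp", "/local/data/*.csv", "gs://my-staging-bucket/20130112/"]
)
if result.returncode != 0:
    raise RuntimeError("One or more uploads failed; not starting the BigQuery load.")

# Only reached if every file uploaded successfully.
client = bigquery.Client()
load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/20130112/*.csv",
    "my-project.my_dataset.events20130112",  # hypothetical sharded table name
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.CSV),
)
load_job.result()  # wait for the load to complete
```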
Object names are actually flat in Google Cloud Storage; from the service's perspective, '/' is just another character in the name. The folder abstraction is provided by clients, like gsutil and various GUI tools. Renaming a folder requires clients to request a sequence of copy and delete operations on each object in the folder. There is no atomic way to rename a folder.
Mike Schwartz, Google Cloud Storage team