Google Dataflow - BigQuery job backed by Google Sheets

I have a Google Dataflow batch job written in Java.
This Java code reads from BigQuery, performs a few transformations,
and then writes the results back to BigQuery.
The code can access native BigQuery tables just fine.
But when I choose a table that is backed by a federated source such as Google Sheets, it doesn't work.
The error says that no OAuth token with the Google Drive scope was found.
Pipeline options:
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p1 = Pipeline.create(options);
Any ideas?

Can you try:
gcloud auth login --enable-gdrive-access
before you launch the Dataflow job?
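For comparison, here is a minimal Python sketch of the same idea outside Dataflow: the credentials used to query a Sheets-backed table need the Drive scope in addition to the BigQuery scope. It assumes the google-auth and google-cloud-bigquery packages are installed. The names SCOPES and make_client are made up for illustration, and this is not the Java pipeline from the question.

```python
# Scopes: BigQuery itself plus Drive, which Sheets-backed (federated) tables need.
SCOPES = [
    "https://www.googleapis.com/auth/bigquery",
    "https://www.googleapis.com/auth/drive",
]


def make_client(project_id=None):
    """Return a BigQuery client whose credentials request the Drive scope.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this file loads even where the libraries are absent.
    import google.auth
    from google.cloud import bigquery

    credentials, default_project = google.auth.default(scopes=SCOPES)
    return bigquery.Client(
        credentials=credentials,
        project=project_id or default_project,
    )
```

Whether the requested scopes are actually granted depends on the kind of credentials in use, so treat this as a starting point rather than a guaranteed fix.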

Answering my own question: to get around this issue I'm going to use Google Apps Script to upload the sheet to BigQuery as a native table.
Please see this link.
I'm just going to modify the "Load CSV data into BigQuery" code snippet and then create an installable trigger that runs the function every night to upload to BigQuery.
Beware that simple triggers such as onEdit and onOpen can't run code that requires authorisation, which is why an installable trigger is needed.

Related

Google Data Fusion Salesforce to Bigquery Pipeline, automatic way of managing schema updates in Salesforce

Hey, I am trying to create some batch jobs that read from a couple of Salesforce objects and push them to BQ. Every time the batch process runs, it truncates the table in BQ and pushes all the data in the SF object back into BQ. Is it possible for Google Data Fusion to automatically detect changes to an object in Salesforce (like a new column or a changed column data type) and have them registered and pushed to BQ?
For the SF side of the puzzle you could look into https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_describeGlobal.htm and the If-Modified-Since header, which tells you whether the definition of the table(s) changed. That URL covers all tables in the org; alternatively, you can run table-specific metadata describe calls with https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_sobject_describe.htm
But I can't tell you how to use it in your job.
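A rough sketch of such a conditional describe call, using only the Python standard library. The function name, instance URL, API version and token are placeholders I made up. The path follows the sObject Describe resource linked above, and nothing is sent until the request is passed to urlopen.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_describe_request(instance_url, api_version, sobject, access_token, since):
    """Build a conditional sObject Describe request (nothing is sent here).

    Hypothetical helper; `since` must be a timezone-aware UTC datetime.
    """
    url = (
        f"{instance_url.rstrip('/')}/services/data/"
        f"{api_version}/sobjects/{sobject}/describe/"
    )
    request = Request(url)
    request.add_header("Authorization", f"Bearer {access_token}")
    # HTTP-date in GMT, the usual format for If-Modified-Since.
    request.add_header("If-Modified-Since", format_datetime(since, usegmt=True))
    return request
```

When this is sent with urllib.request.urlopen, a 304 response surfaces as an HTTPError with code 304, which here would mean the object definition has not changed since the given time.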
You can use the answer from @eyescream as the condition or trigger for the update to BigQuery. You can push changes to BigQuery using the pre-built Salesforce Streaming Source plugin in Data Fusion, which, as mentioned in this documentation,
tracks updates in Salesforce sObjects. Examples of sObjects are
opportunities, contacts, accounts, leads, any custom object, etc.
You may use this approach to automatically track changes and push them to BigQuery. The full Salesforce Streaming Source configuration reference is in this documentation, which Google's official documentation also links to.
However, if you want a more dynamic approach for your overall use case, you may also use the integration of BigQuery with Salesforce. In that case you will need to write your own code, which can use @eyescream's answer as the primary condition or trigger and then automatically push the update to your BigQuery schema.
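If you go the custom-code route, the "did the schema change" check could look roughly like this. It assumes you keep the field list from the previous describe call and compare it with the current one; each entry is a dict with the "name" and "type" keys found in the describe response's fields array. The function name diff_fields is made up.

```python
def diff_fields(previous, current):
    """Compare two describe field lists and report schema changes.

    Hypothetical helper: each list item is a dict with "name" and "type".
    """
    prev = {field["name"]: field["type"] for field in previous}
    curr = {field["name"]: field["type"] for field in current}
    return {
        # Columns present now but not before.
        "added": sorted(curr.keys() - prev.keys()),
        # Columns that disappeared.
        "removed": sorted(prev.keys() - curr.keys()),
        # Columns whose data type changed.
        "retyped": sorted(
            name for name in prev.keys() & curr.keys() if prev[name] != curr[name]
        ),
    }
```

A non-empty result would then be the trigger for updating the BigQuery schema before the next load.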

Error when creating scheduled query on Bigquery "Error creating scheduled query: er"

I just started a new project on Google Cloud and set up some BigQuery datasets and tables. I now want to set up some scheduled queries. I have already enabled the BigQuery Data Transfer API. My query is valid (it's just SELECT * FROM table). I can't find anything about this error online.
See screenshot
UPDATE: I've experimented a bit and it seems to be an organization-wide issue. All projects within my organization, new and old, get this same error when trying to schedule a query. I tried a project in a different organization and did not have the issue. What could be causing this error for ALL projects in an organization?
UPDATE 2:
When I query a table that is not empty, the error changes to "Error creating scheduled query: Yn" instead of "Error creating scheduled query: er" (which appears when the scheduled query would have queried an empty table).
I faced the same issue as you, and basically I just needed to run the query once before creating the scheduled query... That did the trick.
From the BigQuery FAQ:
"Scheduled queries use features of BigQuery Data Transfer Service. Verify that you have completed all actions required in Enabling BigQuery Data Transfer Service."
Basically, this means that you need to enable the Data Transfer API in your project AND give the user who creates the scheduled query a BigQuery admin role, so that the user has the right permissions to access the transfer service.
If done right, you should get a popup when creating the scheduled query asking you to confirm that the data transfer service may access your user account (if you block popups you might not see this message and get stuck).
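If the console only shows the truncated message, creating the same scheduled query through the Data Transfer API may give a more readable error. A minimal sketch, assuming the google-cloud-bigquery-datatransfer package and working credentials; the helper names, the schedule string and the write disposition are illustrative choices, not something from the question.

```python
def scheduled_query_params(query, table_template, write_disposition="WRITE_TRUNCATE"):
    """Build the params dict for a scheduled-query transfer config."""
    return {
        "query": query,
        "destination_table_name_template": table_template,
        "write_disposition": write_disposition,
    }


def create_scheduled_query(project_id, dataset_id, display_name, query,
                           table_template, schedule="every 24 hours"):
    """Create a scheduled query via the BigQuery Data Transfer Service.

    Hypothetical helper; requires the Data Transfer API to be enabled.
    """
    # Imported lazily so this file loads even where the library is absent.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id=dataset_id,
        display_name=display_name,
        data_source_id="scheduled_query",
        params=scheduled_query_params(query, table_template),
        schedule=schedule,
    )
    return client.create_transfer_config(
        parent=client.common_project_path(project_id),
        transfer_config=config,
    )
```

Any permission or API-enablement problem should then come back as a regular exception rather than a two-letter code.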
If this error only occurs in your organisation, I believe it might be caused by an organisation policy on Google Cloud. I would encourage you to double-check whether any org policy is causing this error. If that's not the case, open a support ticket with GCP.
What worked for me was signing in through Incognito Mode with just my account and attempting to save the scheduled query. I have multiple Google Accounts signed in at once and, for whatever reason, BigQuery throws this generic error after authorization is successful and BigQuery is granted the access it requested.
You need to make sure that you are creating the query under the targeted project and not under any other project, because otherwise it won't appear.
You also need to enable the API, as mentioned in one of the answers above.
This eventually worked for me when I ran it in an incognito window.

Authentication into Google BigQuery without using Environment Variables

I'm making a simple script in Google Colab (Jupyter Notebook) that can grab data from our big data environment (in BigQuery) and analyze it. I'm avoiding environment variables, as most of the engineers won't know how to set them up. Ideally, I'm looking for a way to authenticate using our Google username/password. Does anyone have any experience authenticating into GBQ this way? Thanks
The Colab docs contain an example showing how to issue an authenticated BigQuery query.
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Then,
# Display query output immediately
%%bigquery --project yourprojectid
SELECT
COUNT(*) as total_rows
FROM `bigquery-public-data.samples.gsod`
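If you'd rather have the result in a Python variable than in cell output, the same query can go through the client library once auth.authenticate_user() has run. A small sketch, assuming the google-cloud-bigquery package is available in the notebook; count_rows and SQL are made-up names, and the project id is still your own.

```python
SQL = "SELECT COUNT(*) AS total_rows FROM `bigquery-public-data.samples.gsod`"


def count_rows(project_id):
    """Run the sample query and return the single total_rows value.

    Hypothetical helper; call it after auth.authenticate_user() in Colab.
    """
    # Imported lazily so this file loads even where the library is absent.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    rows = client.query(SQL).result()  # waits for the query job to finish
    return next(iter(rows))["total_rows"]
```

No environment variables are involved here; the notebook session's credentials are picked up after the authentication step.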

Google Cloud Logging export to Big Query does not seem to work

I am using the Google Cloud Logging web UI to export Google Compute Engine logs to a BigQuery dataset. According to the docs, you can even create the BigQuery dataset from this web UI (it simply asks you to give the dataset a name). It also automatically sets up the correct permissions on the dataset.
It seems to save the export configuration without errors, but a couple of hours have passed and I don't see any tables created in the dataset. According to the docs, exporting will stream the logs to BigQuery and create tables named after the following template:
my_bq_dataset.compute_googleapis_com_activity_log_YYYYMMDD
https://cloud.google.com/logging/docs/export/using_exported_logs#log_entries_in_google_bigquery
I can't think of anything else that might be wrong. I am the owner of the project and the dataset is created in the correct project (I only have one project).
I also tried exporting the logs to a Google Cloud Storage bucket, with no luck there either. I set the permissions using gsutil according to this:
https://cloud.google.com/logging/docs/export/configure_export#setting_product_name_short_permissions_for_writing_exported_logs
And finally I made sure that the 'source' I am trying to export actually has some log entries.
Thanks for the help!
Have you ingested any log entries since configuring the export? Cloud Logging only exports entries to BigQuery or Cloud Storage that arrive after the export configuration is set up. See https://cloud.google.com/logging/docs/export/using_exported_logs#exported_logs_availability.
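Since only entries that arrive after the export is configured get exported, the first table to look for is the one dated on or after the day you set it up. A small sketch of the naming template from the question; the sanitising rule (every non-alphanumeric character becomes an underscore) is my assumption, inferred only from that one example, and export_table_name is a made-up name.

```python
import re
from datetime import date


def export_table_name(log_id, day):
    """Return the dated table name expected for an exported log.

    Assumption: non-alphanumeric characters in the log id become "_".
    """
    sanitized = re.sub(r"[^A-Za-z0-9]", "_", log_id)
    return f"{sanitized}_{day:%Y%m%d}"
```

For example, export_table_name("compute.googleapis.com/activity_log", date.today()) gives the table name to check for in the dataset today.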
You might not have given edit permission to 'cloud-logs@google.com' in the BigQuery console. Refer to this.

Export Google Cloud Datastore and import to BigQuery programmatically

I'm looking for a way to export my Cloud Datastore and import it into BigQuery daily. The manual way is described on a Google documentation page, but I can't find a clean way to automate it.
There isn't a simple way to do this, but you can separate it into two parts: creating App Engine backups and loading them into BigQuery.
You can use scheduled backups to create datastore backups periodically (https://cloud.google.com/appengine/articles/scheduled_backups).
You can then use Apps Script to automate the BigQuery portion (https://developers.google.com/apps-script/advanced/bigquery#load_csv_data) or use an AppEngine cron to do the same thing.
As of last week there's a proper way to automate this. The most important part is gcloud beta datastore export.
I created a script around it: https://github.com/chees/datastore2bigquery
You could run this in a cron job.
See here for a demo of how it works: https://www.youtube.com/watch?v=dGyQCE3bWkU
Building on @Jordan's answer above, the steps to do this would be:
1) Make a storage bucket
2) Export the Datastore entities to this bucket
3) Open the BigQuery web UI and load the export using its Google Cloud Storage file path.
Full tutorial with images is available at this post.
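The three steps above can also be scripted instead of clicked through. A rough sketch that only builds the command lines, assuming the gsutil, gcloud and bq tools are installed and authenticated; the function names are made up, and the path to the .export_metadata file written by the export is left as a parameter because it depends on the export run.

```python
def make_bucket_cmd(bucket):
    """Step 1: create the Cloud Storage bucket."""
    return ["gsutil", "mb", f"gs://{bucket}"]


def export_cmd(bucket, kinds):
    """Step 2: export the given Datastore kinds to the bucket."""
    return [
        "gcloud", "beta", "datastore", "export",
        f"gs://{bucket}",
        f"--kinds={','.join(kinds)}",
    ]


def load_cmd(dataset, table, metadata_uri):
    """Step 3: load one kind's export into BigQuery, replacing the table."""
    return [
        "bq", "load",
        "--source_format=DATASTORE_BACKUP",
        "--replace",
        f"{dataset}.{table}",
        metadata_uri,  # gs:// path of the kind's .export_metadata file
    ]
```

Each list can be passed to subprocess.run(cmd, check=True) from a script that cron runs daily. Note that the export runs as a long-running operation, so the load step has to wait until the export has finished.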
It is possible using the following code. It basically uses App Engine Cron jobs and BigQuery API.
https://github.com/wenzhe/appengine_datastore_bigquery