Unable to retrieve job history beyond 2 months in BigQuery Web UI

I want to repeat a load job for a table I created 3 months ago, but I can only see 2 months of job history in the BigQuery web UI. I would like the table schema that was used in that specific load job.
Is there any way to view it?

Use the Jobs.list method in the BigQuery API:
Lists all jobs that you started in the specified project. Job information is available for a six month period after creation.
You could also use the CLI tool:
bq ls --jobs --all
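If you prefer to do it programmatically, here is a minimal sketch using the google-cloud-bigquery Python client; the project ID and the date window are placeholders. For a load job, the schema that was supplied is available on the job itself:
from datetime import datetime, timezone

from google.cloud import bigquery

# Placeholder project ID -- replace with your own.
client = bigquery.Client(project="my-project")

# List jobs created in a given window (job metadata is kept for six months).
jobs = client.list_jobs(
    all_users=True,
    min_creation_time=datetime(2020, 1, 1, tzinfo=timezone.utc),
    max_creation_time=datetime(2020, 3, 1, tzinfo=timezone.utc),
)

for job in jobs:
    # Load jobs expose the schema that was used, if one was specified.
    if job.job_type == "load" and job.schema:
        print(job.job_id, [(f.name, f.field_type) for f in job.schema])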
Or, you could use audit logs too.

Related

Running BQ Load command using Google cloud scheduler

I would like to run the bq load command once every day at 00:00 UTC. Can I use Google Cloud Scheduler to schedule this command?
As mentioned by @Daniel, there is no direct way to use Cloud Scheduler to execute or schedule queries; however, there are options you can consider to run queries on a schedule.
Use scheduled queries directly in BQ.
Since your source is GCS, you can load data from GCS to BQ and then execute scheduled queries, as mentioned here.
Use a scheduled Cloud Function to run queries.
Schedule using Data Transfer.
You can also try what @Graham Polley has mentioned in this blog post, which requires an architecture combining Cloud Scheduler, Cloud Source Repositories and Cloud Build.
Assuming you have a file that is being loaded into Cloud Storage every day before 7 AM, you may consider a more resilient design: when the file is created in Cloud Storage, trigger a notification that starts the process to load it. This design gets the information into BigQuery earlier, and it keeps working even if the file's creation is delayed.
When the file is created in Cloud Storage, get a message in Pub/Sub: https://cloud.google.com/storage/docs/pubsub-notifications
Then, a Cloud Function is invoked that executes the bq load command.
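As an illustration, here is a minimal sketch of such a function using the google-cloud-bigquery client with a Cloud Storage finalize trigger; the destination table, CSV format, and autodetected schema are assumptions, not part of the original answer:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination -- replace with your own dataset and table.
TABLE_ID = "my-project.my_dataset.my_table"

def load_file(event, context):
    """Background Cloud Function triggered when a file is finalized in GCS."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # Assumption: let BigQuery infer the schema.
    )
    load_job = client.load_table_from_uri(uri, TABLE_ID, job_config=job_config)
    load_job.result()  # Wait for the load to complete.
    print(f"Loaded {uri} into {TABLE_ID}")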
BTW, if you have many files or some dependencies between jobs, consider using Cloud Composer as an orchestrator to keep the complexity under control.
You would not be able to do it directly with Cloud Scheduler; you would need an intermediary like a Cloud Function to execute the command. Alternatively, you could try scheduling a data transfer, depending on the requirements of your load job.
Here is an example from the documentation:
https://cloud.google.com/bigquery/docs/cloud-storage-transfer#setting_up_a_cloud_storage_transfer
Based on your update saying you want to shard the table by date, try scheduled queries in the following manner.
Create an external table pointing to the desired path in GCS, as described here.
Define your query; I recommend defining a query with explicit column names and appropriate casting.
SELECT *
FROM `myproject.dataset_id.external_table_name`
-- Include filtering on _FILE_NAME if needed, like the following:
-- WHERE _FILE_NAME LIKE SOME_VALUE
Create the scheduled query with the run_date parameter in the destination table name, like new_table_{run_date}.
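For illustration, here is a minimal sketch of creating such a scheduled query with the google-cloud-bigquery-datatransfer client; the project ID, dataset, display name, and schedule are placeholder assumptions:
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholders -- replace with your own project and dataset.
project_id = "my-project"
dataset_id = "dataset_id"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=dataset_id,
    display_name="Daily GCS load via external table",
    data_source_id="scheduled_query",
    params={
        "query": "SELECT * FROM `myproject.dataset_id.external_table_name`",
        # {run_date} is expanded by the service at run time.
        "destination_table_name_template": "new_table_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")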

BigQuery - Scheduled Query Update Delete via CLI

BigQuery's Scheduled Query feature is an interesting one, and the ability to create one programmatically via the CLI tool offers some flexibility and convenience. After creating a few scheduled queries, I realised that from time to time updates need to be made to one or several of them.
The challenge here is finding a way to update, delete, or enable/disable an existing scheduled query via the CLI. GCP's documentation clearly explains how to create one using either the bq query or bq mk command, but there is nothing to suggest we can delete or update/modify an existing scheduled query, either via the CLI or Python.
I was thinking of bq rm, but there isn't a flag that specifically deletes a scheduled query, and I may be risking dropping an entire dataset or table.
Perhaps it is a limitation at the moment. However, if anyone has found a way to do so, please share your solution or workaround.
Sorry for the confusion.
Scheduled queries do support update/delete in the CLI. A scheduled query is managed as a transfer config in the CLI; please see update transfer config and delete transfer config.
For example, to update/delete a scheduled query with name projects/p/locations/us/transferConfigs/scheduled_query:
# Update the query parameter in a scheduled query.
bq update --transfer_config --params='{"query":"SELECT 1"}' projects/p/locations/us/transferConfigs/scheduled_query
# Delete a scheduled query.
bq rm --transfer_config projects/p/locations/us/transferConfigs/scheduled_query
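Since the question also asks about Python, the same update/delete operations are exposed by the BigQuery Data Transfer Service client. A minimal sketch, reusing the config name from above:
from google.cloud import bigquery_datatransfer
from google.protobuf import field_mask_pb2

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder resource name -- use the name of your own scheduled query.
config_name = "projects/p/locations/us/transferConfigs/scheduled_query"

# Update the query parameter of the scheduled query.
transfer_config = bigquery_datatransfer.TransferConfig(
    name=config_name,
    params={"query": "SELECT 1"},
)
transfer_client.update_transfer_config(
    transfer_config=transfer_config,
    update_mask=field_mask_pb2.FieldMask(paths=["params"]),
)

# Delete the scheduled query.
transfer_client.delete_transfer_config(name=config_name)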
Hope this helps!

BigQuery user statistics from MicroStrategy

I am using MicroStrategy to connect to BigQuery using a service account. I want to collect user-level job statistics from MSTR, but since I am using a service account, I need a way to track user-level job statistics in BigQuery for all the jobs executed via MicroStrategy.
Since you are using a service account to make the requests from MicroStrategy, you could look up all of your project's jobs by listing them and then, using each job ID in the list, gather the job's information, as this shows the email used for the job ID.
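As a minimal sketch of that approach with the google-cloud-bigquery client (the project ID is a placeholder), each job exposes the email of the identity that ran it:
from collections import Counter

from google.cloud import bigquery

# Placeholder project ID -- replace with your own.
client = bigquery.Client(project="my-project")

# Count jobs per identity across the whole project.
jobs_per_user = Counter(
    job.user_email for job in client.list_jobs(all_users=True)
)
for email, count in jobs_per_user.most_common():
    print(email, count)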
A workaround for this would also be using Stackdriver Logging advanced filters, with a filter to get the jobs made by the service account. For instance:
resource.type="bigquery_resource"
protoPayload.authenticationInfo.principalEmail="<your service account>"
Keep in mind this only shows jobs from the last 30 days, due to the logs retention periods.
Hope it helps.

BigQuery Google Analytics Export Processing Time Management

Our company has many scheduled reports in BigQuery that generate aggregation tables of Google Analytics data. Because we cannot control when Google Analytics data is imported into our BigQuery environment, we keep getting days with no data.
This means we then have to manually run the queries for the missing days.
I have edited my scheduled query to keep pushing back the time of day it runs; however, it is now running around 8 AM. These queries feed reports for stakeholders, and the stakeholders are requesting them earlier. Is there any way to guarantee Google Analytics export to BigQuery processing times?
You may also think about a scheduled query solution that reruns at a later time if the requested table isn't available yet.
You can't currently add a conditional trigger to a BigQuery scheduled query.
You could manually add a fail-safe to your query to check for yesterday's table, using a combination of the code below and DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY):
SELECT
  MAX(FORMAT_TIMESTAMP('%F %T',
    TIMESTAMP(PARSE_DATE('%Y%m%d',
      REGEXP_EXTRACT(_TABLE_SUFFIX, r'^\d\d\d\d\d\d\d\d')))))
FROM `DATASET.ga_sessions_*` AS ga_sessions
Obviously, this will fail if the conditions are not met and will not retry, which I understand is not an improvement on your current setup.
I've encountered this many times in the past and eventually had to move my data pipelines to another solution, as scheduled queries are still quite simplistic.
I would recommend you take a look at CRMint for simple pipelines into BigQuery:
https://github.com/google/crmint
If you still find this too simplistic, then you should look at Google Cloud Composer, where you can check that a table exists before running a particular job in a pipeline.
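For illustration, here is a minimal sketch of the kind of existence check such a pipeline could run before triggering the aggregation job; the project and dataset names are placeholders:
from datetime import date, timedelta

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()

# Placeholder table ID -- yesterday's GA export shard.
suffix = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
table_id = f"my-project.DATASET.ga_sessions_{suffix}"

try:
    client.get_table(table_id)  # Raises NotFound if the shard is missing.
    print(f"{table_id} exists; safe to run the aggregation query.")
except NotFound:
    print(f"{table_id} not exported yet; retry later.")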

BigQuery detailed charges just shows how much data was analyzed

I'm trying to find out what is causing my BigQuery bill to be so high but when I click View Detailed Charges on Google Cloud I just get how much data was analyzed and how much it costs. Is there a place where I can view a detailed breakdown of what jobs cost so much and what is causing the bill to get so large?
Is there a place where I can view a detailed breakdown of what jobs cost so much and what is causing the bill to get so large?
You should be able to use the Jobs.list API to list all jobs that you started in the specified project. Job information is available for a six month period after creation. The job list is sorted in reverse chronological order, by job creation time. It requires the Can View project role, or the Is Owner project role if you set the allUsers property.
You can actually even do it without any coding: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/list#try-it
Collect all your jobs' info and analyse it as you wish.
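If you do want a little code, here is a minimal sketch with the google-cloud-bigquery client that ranks query jobs by bytes billed; the project ID is a placeholder, and the $5/TiB figure assumes on-demand pricing, so check your own rate:
from google.cloud import bigquery

# Placeholder project ID -- replace with your own.
client = bigquery.Client(project="my-project")

# Collect query jobs and sort by how much data each one billed.
query_jobs = [
    job
    for job in client.list_jobs(all_users=True)
    if job.job_type == "query" and job.total_bytes_billed
]
query_jobs.sort(key=lambda job: job.total_bytes_billed, reverse=True)

for job in query_jobs[:10]:
    tib = job.total_bytes_billed / 2**40
    # Assumes on-demand pricing of roughly $5 per TiB scanned.
    print(f"{job.job_id} {job.user_email} {tib:.3f} TiB ~ ${tib * 5:.2f}")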
For a long-term solution, you can either automate the above process or use BigQuery monitoring via Stackdriver.