BigQuery query history: get performance counters/metrics - google-bigquery

What are the best practices for getting historical query metrics? Suppose there were 3 users who ran 3, 4, and 5 queries respectively during a day (through JDBC/ODBC). How could I get a list of those queries along with other metadata, e.g. price, data volume scanned, slots, start/end time, rows returned, etc.?
Could I also get the explain/execution plan equivalent for those queries?
I saw somewhere I could try to use the CLI:
Listing all query jobs: bq ls -j -q
Getting the data for specific job: bq show --format=prettyjson -j <Job ID>
or maybe API could give me more information?
but ultimately what is the best/recommended practice here?
For instance, in AWS Redshift I can use views/meta tables like STL_QUERY, STL_QUERYTEXT, STL_CONNECTION_LOG, the SVL_QUERY_SUMMARY view, etc. I am wondering if there's a similar mechanism for using SQL to access and filter that information?

... or maybe API could give me more information?
You can use Jobs: list and Jobs: get to, respectively, list the jobs started in the specified project and return information about a specific job.
If the Jobs: get call is successful, it returns a Jobs resource in the response body, where you can find all the details you mentioned in your question.
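As a rough illustration of where those details live, the sketch below pulls a few fields out of a job resource that has already been fetched (for example, the JSON printed by bq show --format=prettyjson -j <Job ID>). The field names follow the Jobs resource of the REST API, which encodes 64-bit integers as strings; the helper name summarize_job and all sample values are made up for this example.

```python
from datetime import datetime, timezone


def _ms_to_utc(ms):
    """Convert a millisecond epoch value (string or int) to an aware datetime."""
    if ms is None:
        return None
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


def summarize_job(job):
    """Pick commonly needed metrics out of a BigQuery job resource dict."""
    stats = job.get("statistics", {})
    query_stats = stats.get("query", {})
    return {
        "job_id": job.get("jobReference", {}).get("jobId"),
        "user_email": job.get("user_email"),
        "sql": job.get("configuration", {}).get("query", {}).get("query"),
        "start_time": _ms_to_utc(stats.get("startTime")),
        "end_time": _ms_to_utc(stats.get("endTime")),
        "bytes_processed": int(query_stats.get("totalBytesProcessed", 0)),
        "bytes_billed": int(query_stats.get("totalBytesBilled", 0)),
        "cache_hit": query_stats.get("cacheHit", False),
        "slot_ms": int(stats.get("totalSlotMs", 0)),
        # One entry per execution stage: the closest thing to an explain plan.
        "plan_stages": [s.get("name") for s in query_stats.get("queryPlan", [])],
    }


# Hypothetical job resource, trimmed to the fields used above.
sample_job = {
    "jobReference": {"projectId": "my-project", "jobId": "job_abc123"},
    "user_email": "alice@example.com",
    "configuration": {"query": {"query": "SELECT 1"}},
    "statistics": {
        "startTime": "1600000000000",
        "endTime": "1600000002000",
        "totalSlotMs": "350",
        "query": {
            "totalBytesProcessed": "1048576",
            "totalBytesBilled": "10485760",
            "cacheHit": False,
            "queryPlan": [{"name": "S00: Input"}, {"name": "S01: Output"}],
        },
    },
}
summary = summarize_job(sample_job)
```

The query plan stages in statistics.query.queryPlan are what answers the explain/execution plan part of the question; the sketch only keeps their names, but each stage carries more detail in the real response.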

You can use the BigQuery web UI to fetch all this information. Remember there is a limit of 1000 records, but it gives you a nice and elegant way to browse, similar to the AWS option.
This is how you set the option to see all your users' jobs:
And using the search box you can apply filters on your search
Clicking on the arrow on the right side gives some advanced options:

Related

Airflow: BigQueryOperator vs BigQuery Quotas and Limits

Is there any practical way to control quotas and limits on Airflow?
I'm especially interested in controlling BigQuery concurrency.
There are different levels of quotas on BigQuery. So, depending on the Operator inputs, there should be a way to check whether the conditions are met and otherwise wait until they are.
It seems to be a composition of Sensor-Operators, querying against a database like redis for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the Sensor-Operators approach.
How can I encapsulate all of it under a single operator? To avoid repetition of code,
a single operator: QuotaBigQueryOperator
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an open feature request to get/set other project quotas via the API. You can post there about the specific case you would like to have implemented, and follow it to track progress and ask for updates.
Meanwhile, as a workaround, you can try the PythonOperator. It lets you define your own custom code, so you can implement retries for the queries that fail with a quotaExceeded error (or the specific error you are getting). That way you don't have to check the quota levels explicitly: you just run the queries and retry until they get executed. This is simplified code for the strategy I have in mind:
for query in QUERIES_TO_RUN:
    while True:
        try:
            run(query)
        except quotaExceededException:
            continue  # Jumps to the next cycle of the nearest enclosing loop.
        break
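The loop above retries immediately and forever. A slightly more defensive variant, sketched below, waits with exponential backoff between attempts and gives up after a fixed number of tries. QuotaExceededError and run_with_retry are hypothetical names; in a real DAG you would catch whatever exception your BigQuery client actually raises for quota errors and call this from the PythonOperator callable.

```python
import random
import time


class QuotaExceededError(Exception):
    """Stand-in for whatever quota error your BigQuery client raises."""


def run_with_retry(run, query, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call run(query), retrying on quota errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return run(query)
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # Give up after the last attempt.
            # Wait 1x, 2x, 4x ... the base delay, plus up to 10% jitter.
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay / 10))
```

The sleep parameter exists only so the waiting can be replaced in tests; in normal use the default time.sleep applies.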

gcp NL api sentiment analysis - How to store the results in BigQuery

I am using GCP BigQuery to store news that is streamed in by a Google function.
How can I run a Python script that uses the data from BigQuery and then writes the score and magnitude results back to the related dataset?
I could not find anything about this in the Google documentation: it covers how to run the sentiment analysis, but not how to get data in from BigQuery and write the results back to BigQuery.
Thanks a lot for your support.
You didn't give us enough specifics for a specific answer, so let me give you my general way of trying this:
First, let's get the sentiment analysis of one arbitrary sentence with the gcloud CLI:
gcloud --format json ml language analyze-entity-sentiment --content "It's time we just let this thing go - it was a pretty good bad idea, wasn't it though? -- Bad Idea, Sara Bareilles" | jq -c . > sentiments.json
Please notice that I removed the formatting of the output JSON with jq and stored the results in one file.
To load this file with maybe multiple JSON lines for each sentence into BigQuery:
bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON temp.sentiments sentiments.json
The question asks for "stream into BigQuery", but it might make more sense to batch the load like shown here.
Now we have a table with the results in BigQuery:
SELECT * FROM `fh-bigquery.temp.sentiments` LIMIT 1000
Btw, I added Sara Bareilles to the sentence to make sure that BigQuery got a full schema for auto-detection when creating the table the first time.
If you want to stream data into BigQuery, then look at the streaming into BigQuery docs. I wanted to isolate in this answer the basics of getting and looking at Cloud NLP data into BigQuery - the rest is just the basics of working with it.
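For the Python side of the question, one possible shape is: read rows from BigQuery, score each text, and write one JSON object per line so the file can be loaded with bq load as shown above. The sketch below covers only that last step and takes the scoring function as a parameter. The column names id and text, the helper to_ndjson, and the placeholder fake_analyze are all assumptions for illustration; in practice analyze would wrap a Cloud Natural Language sentiment call.

```python
import json


def to_ndjson(rows, analyze):
    """Return newline-delimited JSON with a score and magnitude per input row.

    rows: iterable of dicts with "id" and "text" keys (assumed column names).
    analyze: callable returning (score, magnitude) for a text.
    """
    lines = []
    for row in rows:
        score, magnitude = analyze(row["text"])
        lines.append(json.dumps(
            {"id": row["id"], "score": score, "magnitude": magnitude}
        ))
    return "".join(line + "\n" for line in lines)


def fake_analyze(text):
    """Placeholder scorer so the example runs without any API access."""
    return (0.5, 0.5) if "good" in text else (-0.5, 0.5)


rows = [{"id": 1, "text": "a pretty good idea"}, {"id": 2, "text": "a bad idea"}]
ndjson = to_ndjson(rows, fake_analyze)
```

Writing ndjson to a file gives the same NEWLINE_DELIMITED_JSON input that the bq load command expects.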

Export Bigquery Logs

I want to analyze the activity on BigQuery during the past month.
I went to the cloud console and the (very inconvenient) log viewer. I set up exports to BigQuery, and now I can run queries on the logs and analyze the activity. There is even a very convenient guide here: https://cloud.google.com/bigquery/audit-logs.
However, all this only helps with data collected from now on. I need to analyze the past month.
Is there a way to export existing logs (rather than new) to Bigquery (or to flat file and later load them to BQ)?
Thanks
While you cannot "backstream" BigQuery's past logs, there is something you can still do, depending on what kind of information you're looking for. If you need information about query jobs (job stats, config, etc.), you can call the Jobs: list method of the BigQuery API to list all jobs in your project. The data is preserved there for 6 months, and if you're a project owner you can list the jobs of all users, regardless of who actually ran them.
If you don't want to code anything, you can even use API Explorer to call the method and save the output as json file and then load it back into BigQuery's table.
Sample code to list jobs with BigQuery API. It requires some modification but it should be fairly easy to get it done.
You can use Jobs: list API to collect job info and upload it to GBQ
Since it is in GBQ - you can analyze it any way you want using power of BigQuery
You can either flatten the result or use the original. I recommend using the original, as it is less of a headache: there is no transformation before loading to GBQ (you literally upload whatever you got from the API). Of course, all of this goes in a simple app/script that you still have to write.
Note: make sure you use full value for projection parameter
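Whatever client you use, the listing comes back in pages, so the script has to follow nextPageToken until it runs out. A minimal sketch of that loop follows; fetch_page is a hypothetical callable standing in for the actual Jobs: list request (with projection set to full and all users enabled), and the two fake pages exist only so the example runs offline.

```python
def iter_jobs(fetch_page):
    """Yield every job from a paginated Jobs: list style response.

    fetch_page(page_token) must return a dict shaped like the API response,
    i.e. with an optional "jobs" list and an optional "nextPageToken".
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("jobs", [])
        token = page.get("nextPageToken")
        if not token:
            break


# Fake two-page response keyed by page token (None means the first page).
_pages = {
    None: {"jobs": [{"id": "p:job_1"}, {"id": "p:job_2"}], "nextPageToken": "t1"},
    "t1": {"jobs": [{"id": "p:job_3"}]},
}
job_ids = [job["id"] for job in iter_jobs(_pages.get)]
```

Each yielded job can be written out as one JSON line and loaded into a table unchanged, in line with the advice to keep the original structure.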
I was facing the same problem when I found an article that describes how to inspect BigQuery using INFORMATION_SCHEMA, without any script or Jobs: list call as mentioned in the other answers.
I was able to run and got this working.
# Monitor Query costs in BigQuery; standard-sql; 2020-06-21
# #see http://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
DECLARE timezone STRING DEFAULT "Europe/Berlin";
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
DATE(creation_time, timezone) creation_date,
FORMAT_TIMESTAMP("%F %H:%M:%S", creation_time, timezone) as query_time,
job_id,
ROUND(total_bytes_processed / gb_divisor,2) as bytes_processed_in_gb,
IF(cache_hit != true, ROUND(total_bytes_processed * cost_factor,4), 0) as cost_in_dollar,
project_id,
user_email,
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
ORDER BY
bytes_processed_in_gb DESC
Credits: https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
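The cost column in the query above is plain arithmetic: bytes processed divided by 1024^4, times the price per TB, and zero when the result came from cache. The same calculation as a small Python sketch, useful for sanity-checking a few rows by hand. The function name query_cost_usd is made up, and the default of 5 dollars per TB is simply the figure used in the snippet, so check it against your own pricing.

```python
TIB = 1024 ** 4  # bytes per TB as defined by tb_divisor in the query


def query_cost_usd(total_bytes_processed, cache_hit, price_per_tib=5.0):
    """Estimate on-demand query cost the same way the SQL above does."""
    if cache_hit:
        return 0.0  # Cached results are not charged in the query's model.
    return round(total_bytes_processed * price_per_tib / TIB, 4)


# A query that scanned 100 GiB without hitting the cache.
estimate = query_cost_usd(100 * 1024 ** 3, cache_hit=False)
```

This is an estimate based on bytes processed, as in the original query, not a reproduction of the actual bill.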

How can I trigger an email or other notification based on a BigQuery query?

I would like to receive a notification, ideally via email, when some threshold is met in Google BigQuery. For example, if the query is:
SELECT name, count(id) FROM terrible_things
WHERE date(terrible_thing) < -1d
Then I would want to get an alert when there were greater than 0 results, and I would want that alert to contain the name of each object and how many there were.
BigQuery does not provide the kinds of services you'd need to build this without involving other technologies. However, you should be able to use something like App Engine (which does have a task scheduling mechanism) to periodically issue your monitoring query, check the results of the job, and alert if there are nonzero rows in the results. Alternatively, you could do this locally with some scripting and the BQ command line tool.
You could also refine things by using BQ's table decorators to only scan the data that's arrived since you last ran your monitoring query, if you retain knowledge of the last probe's execution in the calling system.
In short: Something else needs to issue the queries and react based on the outcome, but BQ can certainly evaluate the data.
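The "react based on the outcome" half can be quite small. The sketch below assumes the monitoring query has already run and its (name, count) rows are available as a list; it builds an email only when there is something to report. The function name build_alert and the addresses are placeholders, and sending is left out so the example runs without a mail server.

```python
from email.message import EmailMessage


def build_alert(rows, sender, recipient):
    """Return an EmailMessage listing each (name, count) row, or None if empty."""
    if not rows:
        return None  # Nothing matched, so no alert is needed.
    msg = EmailMessage()
    msg["Subject"] = f"BigQuery alert: {len(rows)} row(s) matched"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{name}: {count}" for name, count in rows))
    return msg


alert = build_alert([("widget", 3), ("gadget", 1)],
                    "monitor@example.com", "oncall@example.com")
```

A non-None result can then be handed to smtplib.SMTP(...).send_message(alert), or to whatever mail service the scheduler environment provides.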

how to list job ids from all users?

I'm using the Java API to query for all job ids using the code below
Bigquery.Jobs.List list = bigquery.jobs().list(projectId);
list.setAllUsers(true);
but it doesn't list the job ids that were run by a Client ID for web applications (i.e. Metric Insights). I'm using private key authentication.
The command line tool 'bq ls -j', in turn, gives me only the Metric Insights job ids but not the ones run with the private key auth. Is there a "get all" method?
The reason I'm doing this is trying to get better visibility into what queries are eating up our data usage. We have multiple sources of queries: metric insights, in house automation, some done manually, etc.
As of version 2.0.10, the bq client has support for API authorization using service account credentials. You can specify using a specific service account with the following flags:
bq --service_account your_service_account_here@developer.gserviceaccount.com \
--service_account_credential_store my_credential_file \
--service_account_private_key_file mykey.p12 <your_commands, etc>
Type bq --help for more information.
My hunch is that listing jobs for all users is broken, and nobody has mentioned it since there is usually a workaround. I'm currently investigating.
Jordan -- It sounds like you're honing in on what we want to do. For all access that we've allowed into our project/dataset we want to produce an aggregate/report of the "totalBytesProcessed" for all queries executed.
The problem we're struggling with is that we have a handful of distinct java programs accessing our data, a 3rd party service (metric insights) and 7-8 individual users who have query access via the web interface. Fortunately the incoming data only has one source so explaining the cost for that is simple. For queries though I am kinda blind at the moment (and it appears queries will be the bulk of the monthly bill).
It would be ideal if I could get the underlying data for this report with just one listing made with a single top-level auth. With that, I think I can attribute each query to a source from the timestamps and the actual SQL text.
One thing that might make this problem far easier is if there were more information in the job record (or some text adornment in the job_id for queries). I don't see that I can assign my own jobIDs on queries (perhaps I missed it?) and perhaps recording some source information in the job record would be possible? Just thinking out loud now...
There are three tables you can query for this.
region-**.INFORMATION_SCHEMA.JOBS_BY_{USER, PROJECT, ORGANIZATION}
Where ** should be replaced by your region.
Example query for JOBS_BY_USER in the eu region:
select
count(*) as num_queries,
date(creation_time) as date,
sum(total_bytes_processed) as total_bytes_processed,
sum(total_slot_ms) as total_slot_ms_cost
from
`region-eu`.INFORMATION_SCHEMA.JOBS_BY_USER
group by
2
order by 2 desc, total_bytes_processed desc;
Documentation is available at:
https://cloud.google.com/bigquery/docs/information-schema-jobs