Most queried and highest billable BigQuery tables - google-bigquery

I need to check the most queried BigQuery tables, the most expensive billable tables, the highest slot usage, and the associated users.
Could anyone suggest the simplest way (a query) to do this in BigQuery?

You can try exploring the INFORMATION_SCHEMA.JOBS_BY_* views. These contain details on the jobs that were executed (query, total_bytes_billed, etc.); see the jobs metadata documentation for details on each field.
Based on your requirements, you might want to try the query below. Feel free to tweak it as needed.
SELECT
  parent.user_email,
  SUM(parent.total_bytes_billed) AS sum_total_bytes_billed,
  COUNT(ref.table_id) AS query_table_count,
  ref.table_id,
  SUM(parent.total_slot_ms) AS slot_usage
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER AS parent,
  UNNEST(referenced_tables) AS ref
GROUP BY ref.table_id, parent.user_email
ORDER BY sum_total_bytes_billed DESC
Output: (result screenshot omitted)
NOTE: This was tested on my test project, hence the small sample size, and no other users query this project. Keep in mind that JOBS_BY_USER only returns jobs submitted by the current user.
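If other users also run jobs in the project, a similar sketch against JOBS_BY_PROJECT surfaces every user; it needs project-level permissions, and the region qualifier here is an assumption that should match your data's location:
-- Per-user, per-table bytes billed and slot usage for the whole project.
-- Note: a job's bytes are attributed to every table it references.
SELECT
  parent.user_email,
  ref.dataset_id,
  ref.table_id,
  COUNT(*) AS query_table_count,
  SUM(parent.total_bytes_billed) AS sum_total_bytes_billed,
  SUM(parent.total_slot_ms) AS slot_usage
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS parent,
  UNNEST(referenced_tables) AS ref
GROUP BY parent.user_email, ref.dataset_id, ref.table_id
ORDER BY sum_total_bytes_billed DESC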

Related

Is it possible to retrieve full query history and correlate its cost in google bigquery?

I am querying multiple tables and I am able to see the cost of each query I run myself. In the Query History I only see the queries run from my own account.
So my question is: is it possible to see the queries that have been run by others in a project (as well as their cost) from the query history?
You can use the Jobs INFORMATION_SCHEMA views (drop the user_email filter to see every user's queries):
SELECT query, total_bytes_processed FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT WHERE project_id = 'your_project_id' AND user_email = 'user@example.com'
According to the documentation, there is no direct method of getting costs by job and user. However, there is a workaround.
For a detailed billing analysis, I would advise you to export the logs to BigQuery with a custom filter and from there analyse the billing for each user and query job.
So, you can create an export using the Logs Viewer or the API. While creating your sink use the following custom filter:
resource.type="bigquery_resource"
logName="projects/<your_project>/logs/cloudaudit.googleapis.com%2Fdata_access"
protoPayload.methodName="jobservice.jobcompleted"
The above filter retrieves completed query jobs, and the data access logs are a comprehensive audit of every query run in BigQuery along with the total bytes scanned. Note that you have to make sure the data_access logs are enabled (see the audit logs documentation).
From the log entries you will get the fields:
protoPayload.authenticationInfo.principalEmail
protoPayload.serviceData.jobCompletedEvent.job.jobName.jobId
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.query.query
protoPayload.serviceData.jobCompletedEvent.job.jobStatistics.totalBilledBytes
In BigQuery, you can use a query as follows:
SELECT
protopayload_auditlog.authenticationInfo.principalEmail AS email,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes AS total_billed_bytes,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query AS query,
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobName.jobId as job_id
FROM
`<myproject>.<mydataset>.cloudaudit_googleapis_com_data_access`
WHERE
protopayload_auditlog.methodName = 'jobservice.jobcompleted';
Afterwards, to estimate the price of each query, you can combine totalBilledBytes with the Pricing summary to add a column with a price estimate for each query. You end up with a final table containing the user's email, the query text, total bytes billed, job id, and an estimated price.
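As a hedged sketch of that last step, assuming the historical on-demand rate of USD 5 per TB scanned (verify against the current pricing page) and treating 1 TB as 2^40 bytes:
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS email,
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobName.jobId AS job_id,
  protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes AS total_billed_bytes,
  -- 5.0 USD per 2^40 bytes is an assumed on-demand rate
  ROUND(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes / POW(2, 40) * 5.0, 4) AS estimated_cost_usd
FROM
  `<myproject>.<mydataset>.cloudaudit_googleapis_com_data_access`
WHERE
  protopayload_auditlog.methodName = 'jobservice.jobcompleted';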

Tableau count values after a GROUP BY in SQL

I'm using Tableau to show some schools data.
My data structure is a table that has all the school classes in the country. The thing is, I need to count, for example, how many schools have both Primary and Preschool.
A simplified version of my table looks like this (illustrative values, since the original screenshot is unavailable):
school_id | class
1 | Preschool
1 | Primary
2 | Primary
In that table, the result of the example should be 1, because only one school (id 1) has both Primary and Preschool.
I want to have a multiple filter in Tableau that gives me that information.
I was thinking about the SQL query that would be needed, and it requires a GROUP BY statement. An example of the query is in a fiddle: Database example query
In the SQL query I group by id all the schools that meet either of the conditions inside the IN (...) and then count how many of them meet both (c = 2), as in the sketch below.
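Roughly, the SQL behind that idea (table and column names are assumptions based on the description):
-- Count schools that offer BOTH selected types.
SELECT COUNT(*) AS schools_with_both
FROM (
  SELECT school_id
  FROM school_classes
  WHERE class IN ('Primary', 'Preschool')
  GROUP BY school_id
  HAVING COUNT(DISTINCT class) = 2  -- c = 2: the school has both types
) AS matching_schools;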
Is there a way to do something like this in Tableau? Either using groups or sets, using advanced filters, or programming a RAW SQL calculated field?
Thanks!
Dubafek
PS: I added a link to my question on Tableau's forum because you can download my testing workbook there: Tableau's forum question
I've solved the issue using LODs (specifically INCLUDE and EXCLUDE expressions).
I created two calculated fields with the aggregations I needed (shown in the linked workbook).
Then I made a calculated field that keeps only the School IDs whose number of types (given the filtering) matches the number of types selected in the multiple filter (both fields mentioned above).
Finally, I used COUNTD([Condition]) to display the number of schools matching at least the selected school types.
Hope this helps someone with a similar issue.
PS: If someone wants the workbook with the solution, I've uploaded it in an answer on the Tableau Forum

Finding the query that created a table in BigQuery

I am a new employee at the company. The person before me built some tables in BigQuery. I want to investigate the CREATE TABLE query for a particular table.
Things I want to check using the query:
What joins were used?
What are the other tables used to make the table in question?
I have not worked with BigQuery before but I did my due diligence by reading tutorials and the documentation. I could not find anything related there.
Brief outline of your actions below:
Step 1 - gather all query jobs of that user using the Jobs.list API - you must have the Is Owner permission on the respective projects to get someone else's jobs
Step 2 - extract only those jobs run by the user you mentioned that reference the table of interest - using the destination table attribute
Step 3 - for those extracted jobs - simply check the respective queries, which will show you how that table was populated
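On current BigQuery, the same outline can be done in SQL via the INFORMATION_SCHEMA jobs views instead of the Jobs.list API; a sketch, with the region and table names as assumptions (JOBS_BY_PROJECT keeps roughly the last 180 days of jobs):
-- Find jobs that wrote into the table of interest.
SELECT
  creation_time,
  user_email,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE destination_table.dataset_id = 'my_dataset'  -- assumed names
  AND destination_table.table_id = 'my_table'
ORDER BY creation_time DESC;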
Hth!
I had been looking for an answer for a long time.
Finally found it:
Go to the three-bars menu at the top left.
From there go to the Analytics tab.
Select BigQuery, under which you will find the Scheduled queries option; click on that.
In the filter tab you can enter keywords and get the required query for the table.
For me, I was able to go through my query history and find the query I used.
Step 1.
Go to the BigQuery UI; at the bottom there are Personal history and Project history tabs. If you can use the same account that executed the query, I recommend Personal history.
Step 2.
Click on the tab and there will be a list of queries ordered from most recently run. Check the time the table was created and find a query that ran just before the table creation time.
Since the query starts first and then creates the table, there will be a slight difference between the two timestamps. For me it was within a few seconds.
Step 3.
After you find the query used to create the table, simply copy it. And you're done.
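If scrolling through the history tab is slow, the same personal history can be pulled with SQL; a sketch assuming the us region, with an illustrative time window around the table's creation time:
-- Your own recent jobs, newest first.
SELECT creation_time, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time BETWEEN TIMESTAMP('2021-01-01')  -- illustrative window
  AND TIMESTAMP('2021-01-02')
ORDER BY creation_time DESC;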

SQL Server 2012 Management Studio Basic SELECT queries

I am new to SQL Server. I have been assigned to do some simple queries to start off, then eventually move on to more complex queries.
I have spent a lot of time on this website: http://www.w3schools.com and I understand it, I think, but then when I go back to my company's database, I find myself searching through many, many different tables with different information.
For example, a table might have a column [Acct_Name], and my query does not come back with the correct account name(s) that I need. Any advice that you think might help me? Thank you.
It sounds like you are looking to limit your results to specific accounts. There are many ways to go about this, so no one will be able to give you an all-encompassing answer, but if you are looking to pull just a single account:
SELECT * FROM (your table name) WHERE Acct_Name = 'the account name'
The * means you are selecting all columns in the table, and the WHERE clause is where you set your search conditions, such as account name or account ID. If you had an account creation date, you could get all accounts created on or before a date like this:
SELECT * FROM (your table name) WHERE Created < '2016-06-01 00:00:00'
Replace the column name 'Created' with the column that holds the account creation date.
Learning the WHERE clause and what you can do there to limit your results will get you on a solid footing to start; from there you will want to learn JOINs and how to link tables by primary keys, as in the sketch below.
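When you get to JOINs, a minimal sketch using hypothetical Accounts and Orders tables linked by an AccountID key:
-- Each order row is matched to its account via the AccountID key.
SELECT a.Acct_Name, o.OrderDate, o.Amount
FROM Accounts AS a
INNER JOIN Orders AS o
    ON o.AccountID = a.AccountID
WHERE a.Acct_Name = 'the account name';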
Codecademy has some great tutorials: https://www.codecademy.com/learn/learn-sql

BigQuery and Tableau

I connected Tableau to BigQuery and was working on the dashboards. The issue here is that BigQuery charges for the data a query scans every time.
My table is 200 GB of data. When someone opens the dashboard in Tableau, it runs the full query; using any filter on the dashboard runs it again over the whole table.
On 200 GB of data, if someone applies 5 filters across different analyses, BigQuery bills roughly 200 GB * 5 = 1 TB. In one day of testing the analysis we were charged for 30 TB of scanning, but the table behind it is only 200 GB. Is there any way I can stop Tableau from running over the whole table in BigQuery every time anything changes?
The extract in Tableau is indeed one valid strategy, but only when you are using a custom query. If you directly access the table it won't work, as that will download 200 GB to your machine.
Other options to limit the amount of data are:
Not selecting any columns that you don't need. Do this by hiding unused fields in Tableau; it will then not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and you pay for the full 200 GB even if you don't use those fields.
Another option that we use a lot is partitioning our tables, for instance a partition per day of data if you have a date field. Using the TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the number of partitions, and hence rows, that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view, and then use the view in Tableau. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
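A sketch of that wildcard trick, with the dataset and daily table prefix as assumptions (TABLE_DATE_RANGE is a legacy SQL function):
-- Legacy SQL: scan only the daily tables in the requested window.
SELECT *
FROM TABLE_DATE_RANGE([mydataset.events_],
                      TIMESTAMP('2016-06-01'),
                      TIMESTAMP('2016-06-07'))
Wrapping this in a view, as described above, keeps the wildcard logic out of Tableau.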
1) Right now I am learning BQ + Tableau too, and I found that using an Extract is a must for BQ in Tableau. With this option you can also save time building dashboards. So my current pipeline is: build query > add it to Tableau > make the dashboard > upload the dashboard to Tableau Online > schedule updates for the Extract.
2) You can send a Custom Quota Request to Google and set up limits per project / per user.
3) If each of your queries touches 200 GB every time, consider optimizing them (don't use SELECT *, select only the dates you need, etc.).
The best approach I found was to partition the table in BQ on a date (day) field that has no time component. BQ allows you to partition a table by a day-level field. The important thing here is that even though the field is a day/date with no time component, it should be a TIMESTAMP datatype in the BQ table, i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reason the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is that when you create a viz in Tableau, it generates SQL to run against BQ, and for the partitioned field to be utilised by that generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype, looked at the logs in GCP, and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
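For reference, a hedged sketch of that setup in today's DDL, with all names hypothetical:
-- Create a table partitioned by day on the TIMESTAMP column.
CREATE TABLE `myproject.mydataset.sales` (
  sale_day TIMESTAMP,  -- midnight UTC values, as described above
  amount FLOAT64
)
PARTITION BY DATE(sale_day);

-- Filtering on the partitioned field lets BigQuery prune partitions
-- instead of scanning the whole table.
SELECT SUM(amount) AS total
FROM `myproject.mydataset.sales`
WHERE sale_day >= TIMESTAMP('2018-01-01')
  AND sale_day < TIMESTAMP('2018-02-01');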
The thing about Tableau and BigQuery is that Tableau computes the filter values using your query (live connection). What I have seen in my project's logging is that it creates filters from your own query:
select `Custom SQL Query`.filtered_column from ( your_actual_datasource_query ) as `Custom SQL Query` group by `Custom SQL Query`.filtered_column
Instead, try to create the Tableau data source with incremental extracts, and also try to have your table date-partitioned (BigQuery only supports date partitioning) so that you can limit the data scanned.