I'm trying to avoid the processing cost in BigQuery when creating a table from a query result, e.g.:
select * from xxx where ....
...and write the result into a destination table.
Is there a way to do it using one of the BigQuery tools?
No.
Whenever you query, you pay for how much data the query touches, no matter where the result is written.
Remember, your first 1 TB per month is free in BigQuery, so maybe this is not really a problem for you.
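For illustration, here is a minimal sketch (project, dataset, table, and column names are placeholders, not from the question) of materializing a filtered result into a destination table; the bytes billed are the bytes scanned by the SELECT, regardless of where the result is written:
-- Placeholder names; billing is based on the bytes the SELECT scans,
-- not on the size of the destination table.
CREATE TABLE `my_project.my_dataset.filtered_events` AS
SELECT *
FROM `my_project.my_dataset.events`
WHERE event_date >= DATE '2020-01-01';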
We have a view in Athena which is partitioned on processing_date (data type: string, format: 20201231).
We are looking for data in 2020.
For exploration, we need all the columns.
Query:
select * from online_events_dw_view
where from_iso8601_date(processing_date) > from_iso8601_date('20191231')
Error:
Query exhausted resources at this scale factor
Is there any better way to optimize the query?
You are applying a function to the partition column. Chances are high that this causes Athena to scan all the data, and that is why you run into the problem.
Why not simply: processing_date like '2020%'
Maybe also add a limit 1000 to cap the amount of data if you are just exploring the columns, as in the sketch below.
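As a sketch, the rewritten query could look like this (view name taken from the question; the LIMIT is optional), filtering directly on the string partition column so Athena can prune partitions:
SELECT *
FROM online_events_dw_view
WHERE processing_date LIKE '2020%'
LIMIT 1000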
The error "Query exhausted resources at this scale factor" is most often caused when sorting result sets with a lot of columns.
Since you don't post the view SQL, there is no way to say for sure whether that is the problem in your case, but it's almost always wide rows and sorting, so I assume there is an ORDER BY in your view. Try removing it and see if the query executes without error.
Is there any better way to optimize the query?
You need to post much more information for us to be able to help you. Without the SQL for the view it is impossible to say anything. Also post the SQL for all involved tables, and give some context about partitioning, the amount of data, the file formats, etc.
A simple count query takes an amazingly long time to complete.
Am I doing something wrong?
SELECT COUNT(*) FROM `TABLE`
(If someone from BigQuery reads this, the job ID is:
southamerica-east1.bquxjob_6a6df24d_16dfdbe0b54)
There are multiple reasons why a query can run slowly in BigQuery. As mentioned in the comments, if your table is an external table, that might cause an issue as well. If timing is critical for you and your queries are extremely simple, you might want to consider using Cloud SQL, which is a transactional database better suited to fast, simple lookups.
BigQuery is normally used for larger, more complex queries over very large datasets. If you have a support package, you might want to reach out to the Google Cloud Support team to have a look at the query and understand why it is running so slowly.
Another workaround, just in case you only want to know the number of rows, could be to query the metadata:
SELECT table_id, row_count, size_bytes
FROM `PROJECT_ID.DATASET.__TABLES__`
WHERE table_id = 'your_table'
I use BigQuery heavily and there are now quite a number of intermediate tables. Because teammates can upload their own tables, I am not familiar with all of them.
I want to check whether a table has not been used for a long time, and then decide whether it can be deleted manually.
Does anyone know how to do this?
Many thanks
You could use logs if you have access. If you familiarize yourself with how to filter log entries, you can find out about your usage quite easily: https://cloud.google.com/logging/docs/quickstart-sdk#explore
There's also the possibility of exporting the logs to BigQuery, so you could analyze them using SQL - I guess that's even more convenient.
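As a rough sketch, assuming the BigQuery data-access audit logs are exported to a dataset via a log sink (the export table name and the field names below are assumptions and depend on how your sink is configured), you could look up the last time a given table was touched:
-- Assumed export table and fields; adjust to match your log sink's schema.
SELECT
  protopayload_auditlog.resourceName AS resource_name,
  MAX(timestamp) AS last_seen
FROM `my_project.my_logs.cloudaudit_googleapis_com_data_access_*`
WHERE protopayload_auditlog.resourceName LIKE '%tables/your_table%'
GROUP BY resource_name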
You can get table-specific metadata via the __TABLES__ meta-table.
SELECT *, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `DATASET.__TABLES__`
The snippet above gives you each table's last-modified time, which you can use as an indication of whether it is still being updated.
When you need to read all the data from one or more tables in BigQuery in a Dataflow job, there are two approaches to it, I would say. The first one is to use BigQueryIO with from, which reads the table in question directly, and the second approach is to use fromQuery, where you specify a query that selects all the data from the same table. So my question is:
Is there any cost or performance benefit to using one over the other?
I haven't found anything in the docs about this, but I would really like to know. I imagine that the from read might be faster, since you don't need to run a query that scans the data, making it more similar to the preview functionality you have in the BigQuery UI. If that is true it might also be much cheaper, but it would also make sense if they both cost the same.
So in short, what is the difference between:
BigQueryIO.read(...).from(tableName)
And
BigQueryIO.read(...).fromQuery("SELECT * FROM " + tableName)
from is both cheaper and faster than fromQuery(SELECT * FROM ...).
from directly exports the table, and exporting data is free in BigQuery.
fromQuery(SELECT * FROM ...) will first scan the entire table ($5/TB) and export the result.
They say there are no stupid questions, but this might be an exception.
I understand that BigQuery, being a columnar database, does a full table scan for any query over a specific column.
I also understand that query results can be cached or a named table can be created with the results of a query.
However I also see tabledata.list() in the documentation, and I'm unsure of how this fits in with query costs. Once a table is created from a query, am I free to access that table without cost through the API?
Let's say, for example, I run a query that is grouped by UserID, and I then want to present the results of that query to individual users based on that ID. As far as I understand, there are two obvious ways of getting the appropriate row out:
I can write another query over the destination table with a WHERE userID=xxx clause
I can use the tabledata.list() endpoint to get all the (potentially paginated) data and get the appropriate row myself in my code
My understanding is that situation 1 would incur a query cost, and situation 2 would not. Am I getting this right?
The tabledata.list API is free, as it does not actually use the BigQuery query engine at all,
so you are right about both 1 and 2.