Full table scan on partitioned table - google-bigquery

I have two tables (pageviews and base_events) that are both partitioned on a date field derived_tstamp. Every night I'm doing an incremental update to the base_events table, querying the new data from pageviews like so:
select
*
from
`project.sp.pageviews`
where derived_tstamp > (select max(derived_tstamp) from `project.sp_modeled.base_events`)
Looking at the query costs, this query scans the full table instead of only the new data. Normally it should only pick up yesterday's data.
Do you have any idea what's wrong with the query?

A subquery in the WHERE clause is not a constant expression, so BigQuery cannot use it to prune partitions and falls back to a full table scan. The solution is to use scripting: compute the checkpoint into a variable first, then filter on the variable. I have solved my problem with the following query:
declare event_date_checkpoint DATE default (
select max(date(page_view_start)) from `project.sp_modeled.base_events`
);
select
*
from
`project.sp.pageviews`
where derived_tstamp > event_date_checkpoint
More on scripting:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#declare
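Note that the question filters on derived_tstamp while the answer's checkpoint is built from page_view_start as a DATE. If derived_tstamp is a TIMESTAMP column, declaring the checkpoint with a matching type avoids implicit-conversion surprises; a minimal sketch using the question's own column names:
declare ts_checkpoint TIMESTAMP default (
  -- runs once, so the variable is a constant when the main query is planned
  select max(derived_tstamp) from `project.sp_modeled.base_events`
);

select
*
from
`project.sp.pageviews`
where derived_tstamp > ts_checkpoint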

Related

Optimize SELECT MAX(timestamp) query

I would like to run this query about once every 5 minutes to drive an incremental MERGE into another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even though the SELECT MAX doesn't filter on the partitioning column? Or will the columnar nature of BigQuery alone make this optimal?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS view within your dataset (see the BigQuery docs for INFORMATION_SCHEMA.PARTITIONS).
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds metadata at the rate of one record per partition. It is therefore far smaller than your table, which makes it an easy way to cut your query costs (it is also much faster to query).
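If myTable is partitioned by day, you can also derive the latest partition date directly from the same view. A sketch, assuming daily (YYYYMMDD) partition IDs:
SELECT PARSE_DATE('%Y%m%d', MAX(PARTITION_ID)) AS latest_partition_date
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
  -- exclude the special entries for NULL and streaming-buffer rows,
  -- which would otherwise sort above the date-formatted IDs
  AND PARTITION_ID NOT IN ('__NULL__', '__UNPARTITIONED__')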

How do I make a query to cast the value in a column for all partitioned tables in BigQuery

I am curious if there is a way to query and write to all the partitions of a table in BigQuery. I wanted to cast a single column to a different data type and apply it to all the values across the partitions of a BigQuery table.
i.e.
select cast(nums as STRING) from `project_id.dataset.table`
And have it written back out to all the values in the column across the table. Is there a straightforward way to do this in BigQuery?
Let's create a table:
CREATE TABLE `deleting.part`
PARTITION BY day
AS
SELECT DATE('2018-01-01') day, 2 i
UNION ALL SELECT DATE('2018-01-02'), 3
Now, let's change i from INT64 to FLOAT64:
CREATE OR REPLACE TABLE `deleting.part`
PARTITION BY day
AS
SELECT * REPLACE(CAST(i AS FLOAT64) AS i)
FROM `deleting.part`
Cost: Full table scan.
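For a widening conversion like INT64 to FLOAT64, BigQuery's ALTER COLUMN ... SET DATA TYPE DDL is a cheaper alternative where available: it follows the same coercion rules but is a metadata-only change, so it avoids the full-table rewrite above. A sketch against the same table:
-- metadata-only type change; no data is scanned or rewritten
ALTER TABLE `deleting.part`
ALTER COLUMN i SET DATA TYPE FLOAT64;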

Is there a way to make a SELECT in BigQuery query conditionally only if a table exists?

I have an app that has to query hundreds of BigQuery tables (in a Dataflow job), some of which may not exist (tables are named by day for the events that occur on each day, and on some days a table may not have been created).
Is there a way to write a BQ SQL query such that it makes a SELECT against some_table if and only if the named table exists, and returns no rows otherwise?
Someone had posted a query which returns whether a table exists:
#standardSQL
SELECT COUNT(1) AS table_count
FROM `my-project.blah.__TABLES_SUMMARY__`
WHERE table_id = 'some-table-name-2017-04-02'
But we are trying to do a job in Dataflow, and it's difficult to make these queries first, outside of the Dataflow control structure.
Is there a way to combine something like the query above with a SELECT against that table 'some-table-name-2017-04-02', in a single SQL statement such that if the table does not exist, we just get no rows back, rather than an error?
The problem is that the BigQuery SQL parser will not even compile a query if it references a table name that does not exist, even if that table would never actually be read.
You can check whether the table exists and branch on the result. In BigQuery scripting the pattern looks like this:
IF EXISTS (
  SELECT 1
  FROM `my-project.blah.__TABLES_SUMMARY__`
  WHERE table_id = 'some-table-name-2017-04-02'
) THEN
  -- EXECUTE IMMEDIATE defers name resolution until the branch actually runs
  EXECUTE IMMEDIATE
    "SELECT * FROM `my-project.blah.some-table-name-2017-04-02`";
END IF;
If the table does not exist, the script simply returns no rows instead of failing with an error.

Google BigQuery - Using wildcard table query with date partitioned table?

I am trying to use wildcard table functions to query a bunch of date-partitioned tables.
This query works:
select * from `Mydataset.fact_table_1` where _partitiontime='2016-09-30' limit 10
This query does not work:
select * from `Mydataset.fact_table_*` where _partitiontime='2016-09-30' limit 10
Is this operation not supported?
If it is not supported, what's the best way to read the same day's data from multiple date-partitioned tables?
The following legacy SQL statement should do the trick:
select * from TABLE_QUERY(YOUR_DATASET,'table_id contains "fact_table_"') where _PARTITIONTIME = TIMESTAMP('2016-09-30')
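If you need to stay in standard SQL, an explicit UNION ALL over the tables you know about works as well, at the cost of listing each one by name (a sketch; fact_table_2 is illustrative):
select * from `Mydataset.fact_table_1` where _PARTITIONTIME = TIMESTAMP('2016-09-30')
union all
select * from `Mydataset.fact_table_2` where _PARTITIONTIME = TIMESTAMP('2016-09-30')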

Redshift tuning for a time-separated table

Currently I handle big data in Redshift.
I would like to ask you about the best table schema.
The data set in the table is like this:
tmp_tbl_yyyy_MM
user_id int,
position_x int,
position_y int,
date date,
type int
sortkey (date)
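In Redshift DDL, one month's table would look roughly like this (a sketch; the month suffix is illustrative):
CREATE TABLE tmp_tbl_2018_01 (
    user_id    int,
    position_x int,
    position_y int,
    date       date,
    type       int
)
SORTKEY (date);   -- compound sort key on date, as in the question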
About 1 billion rows are inserted per month.
I usually scan the past 3 months of data, and
according to this article,
time-separated tables seem to be a good fit.
Therefore I separated the table by month, with names like "tmp_tbl_yyyy_MM".
Here is an example query I often run:
select user_id from (
select * from tmp_tbl_yyyy_MM
union all
select * from tmp_tbl_yyyy_MM
) t
where
(
(position_x between ? and ?
and position_y between ? and ?)
or
(position_x between ? and ?
and position_y between ? and ?)
or ...
)
and date between ? and ?
and type = ?;
The position_x/position_y conditions are repeated over 1,000 times.
This query's plan is a sequential scan, so it's very slow.
What is the best way to get the same results?
I guess the key points are the table design, the query, and the sort key.
Is UNION ALL bad?
Should I not separate the table by month?
Should the WHERE clause go inside the subqueries (see the sketch below)?
Should I set an interleaved sort key on all the condition columns, i.e. position_x, position_y, date, and type?
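For reference, the "WHERE clause inside the subqueries" variant would push the sort-key filter (date, plus type) into each monthly branch, so Redshift can range-restrict each table's scan before the UNION ALL instead of materializing everything first. A sketch only, with illustrative month suffixes:
select user_id from (
    -- each branch filters on the sort key (date) so Redshift can skip
    -- blocks instead of sequentially scanning the whole table
    select user_id, position_x, position_y
    from tmp_tbl_2018_01
    where date between ? and ? and type = ?
    union all
    select user_id, position_x, position_y
    from tmp_tbl_2018_02
    where date between ? and ? and type = ?
) t
where (position_x between ? and ? and position_y between ? and ?)
   or (position_x between ? and ? and position_y between ? and ?);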