How to select from partitioned table except a version in big query?

How to select from partitioned table except a version in big query? - sql

My data set have a partitioned table and I have select all from them:
Select Customer id
from Company.database.Customer_*
various from (2022-01-01 till today)
But have a error version on 2022-06-08 and I dont want to select this version?
I tried
Select Customer id
from Company.database.Customer_*[^20220608] but well doesnt work

Finally I found the answer:
Company.database.Customer_TABLE_SUFFIX >=
FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

Related

Update tables with new data

I receive new tables with data in bigquery everyday.
One tables = one date.
For example: costdata_01012018, costdata02012018 and so on.
I have script that union them every day so I have a new tables with all data I need. For now I truncate the final table every day and it doesn't seem right.
Is there any way to union them without truncation?
I just need to add a new table to the final one
I tried to create 'from' instruction that dynamically finds new table but it doesn't work.
SELECT date, adcost
FROM CONCAT('[test-project-187411:dataset.CostData_', STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d"), ']')
What am I doing wrong?

Two options to do this:
#standardsql
SELECT date, adcost
FROM `test-project-187411:dataset.CostData_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Legacy SQL
#legacysql
SELECT date, adcost
FROM TABLE_QUERY([test-project-187411:dataset], 'tableid = CONCAT("CostData_", STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d")')

With daily tables on BigQuery, how can I query rolling 12 months?

I have daily tables in BigQuery, (table_name_yyyymmdd). How can I write a view that will always query the rolling 12 months of data?

As an example:
Save below query as a view (let's name it - view_test - I assume it in the same dataset as tables)
#standardSQL
SELECT PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) as table, COUNT(1) as rows_count
FROM `yourProject.yourDataset.table_name_*`
WHERE _TABLE_SUFFIX
BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 13 DAY) )
AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) )
GROUP BY 1
Now you can use it as below for example:
#standardSQL
SELECT *
FROM `yourProject.yourDataset.view_test`
So, this views referencing last 12 full days
You can change DAY to MONTH to have 12 months instead
Hope you got an idea
If needed this can easily be "translated" to Legacy SQL (make sure the view and query that calls that view are using the same SQL version/dialect)
Note: Google recommends migrate to Standard SQL whenever it is possible!

You could use function TABLE_DATE_RANGE which according to doc (https://cloud.google.com/bigquery/docs/reference/legacy-sql#table-date-range) :
Queries multiple daily tables that span a date range.
like below:
SELECT *FROM TABLE_DATE_RANGE(data_set_name.table_name,
TIMESTAMP('2016-01-01'),
TIMESTAMP('2016-12-31'))
as there is currently no option to parametrise your view programatically you need to generate your queries/views by some other tool/code

How to choose the latest partition in BigQuery table?

I am trying to select data from the latest partition in a date-partitioned BigQuery table, but the query still reads data from the whole table.
I've tried (as far as I know, BigQuery does not support QUALIFY):
SELECT col FROM table WHERE _PARTITIONTIME = (
SELECT pt FROM (
SELECT pt, RANK() OVER(ORDER by pt DESC) as rnk FROM (
SELECT _PARTITIONTIME AS pt FROM table GROUP BY 1)
)
)
WHERE rnk = 1
);
But this does not work and reads all rows.
SELECT col from table WHERE _PARTITIONTIME = TIMESTAMP('YYYY-MM-DD')
where 'YYYY-MM-DD' is a specific date does work.
However, I need to run this script in the future, but the table update (and the _PARTITIONTIME) is irregular. Is there a way I can pull data only from the latest partition in BigQuery?

October 2019 Update
Support for Scripting and Stored Procedures is now in beta (as of October 2019)
You can submit multiple statements separated with semi-colons and BigQuery is able to run them now
See example below
DECLARE max_date TIMESTAMP;
SET max_date = (
SELECT MAX(_PARTITIONTIME) FROM project.dataset.partitioned_table`);
SELECT * FROM `project.dataset.partitioned_table`
WHERE _PARTITIONTIME = max_date;
Update for those who like downvoting without checking context, etc.
I think, this answer was accepted because it addressed the OP's main question Is there a way I can pull data only from the latest partition in BigQuery? and in comments it was mentioned that it is obvious that BQ engine still scans ALL rows but returns result based on ONLY recent partition. As it was already mentioned in comment for question - Still something that easily to be addressed by having that logic scripted - first getting result of subquery and then use it in final query
Try
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(TIMESTAMP(partition_id))
FROM [dataset.partitioned_table$__PARTITIONS_SUMMARY__]
)
or
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONTIME IN (
SELECT MAX(_PARTITIONTIME)
FROM [dataset.partitioned_table]
)

Sorry for digging up this old question, but it came up in a Google search and I think the accepted answer is misleading.
As far as I can tell from the documentation and running tests, the accepted answer will not prune partitions because a subquery is used to determine the most recent partition:
Complex queries that require the evaluation of multiple stages of a query in order to resolve the predicate (such as inner queries or subqueries) will not prune partitions from the query.
So, although the suggested answer will deliver the results you expect, it will still query all partitions. It will not ignore all older partitions and only query the latest.
The trick is to use a more-or-less-constant to compare to, instead of a subquery. For example, if _PARTITIONTIME isn't irregular but daily, try pruning partitions by getting yesterdays partition like so:
SELECT * FROM [dataset.partitioned_table]
WHERE _PARTITIONDATE = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Sure, this isn't always the latest data, but in my case this happens to be close enough. Use INTERVAL 0 DAY if you want todays data, and don't care that the query will return 0 results for the part of the day where the partition hasn't been created yet.
I'm happy to learn if there is a better workaround to get the latest partition!

List all partitions with:
#standardSQL
SELECT
_PARTITIONTIME as pt
FROM
`[DATASET].[TABLE]`
GROUP BY 1
And then choose the latest timestamp.
Good luck :)
https://cloud.google.com/bigquery/docs/querying-partitioned-tables

I found the workaround to this issue. You can use with statement, select last few partitions and filter out the result. This is I think better approach because:
You are not limited by fixed partition date (like today - 1 day). It will always take the latest partition from given range.
It will only scan last few partitions and not whole table.
Example with last 3 partitions scan:
WITH last_three_partitions as (select *, _PARTITIONTIME as PARTITIONTIME
FROM dataset.partitioned_table
WHERE _PARTITIONTIME > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 DAY))
SELECT col1, PARTITIONTIME from last_three_partitions
WHERE PARTITIONTIME = (SELECT max(PARTITIONTIME) from last_three_partitions)

A compromise that manages to query only a few partitions without resorting to scripting or failing with missing partitions for fixed dates.
WITH latest_partitions AS (
SELECT *, _PARTITIONDATE AS date
FROM `myproject.mydataset.mytable`
WHERE _PARTITIONDATE > DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
SELECT
*
FROM
latest_partitions
WHERE
date = (SELECT MAX(date) FROM latest_partitions)

You can leverage the __TABLES__ list of tables to avoid re-scanning everything or having to hope latest partition is ~3 days ago. I did the split and ordinal stuff to guard against in case my table prefix appears more than once in the table name for some reason.
This should work for either _PARTITIONTIME or _TABLE_SUFFIX.
select * from `project.dataset.tablePrefix*`
where _PARTITIONTIME = (
SELECT split(table_id,'tablePrefix')[ordinal(2)] FROM `project.dataset.__TABLES__`
where table_id like 'tablePrefix%'
order by table_id desc limit 1)

I had this answer in a less popular question, so copying it here as it's relevant (and this question is getting more pageviews):
Mikhail's answer looks like this (working on public data):
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
AND wiki='es'
# 122.2 MB processed
But it seems the question wants something like this:
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
AND wiki='es'
# 50.6 GB processed
... but for way less than 50.6GB
What you need now is some sort of scripting, to perform this in 2 steps:
max_date = (SELECT DATE(MAX(datehour)) FROM `fh-bigquery.wikipedia_v3.pageviews_2019` WHERE wiki='es')
;
SELECT MAX(views)
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) = {{max_date}}
AND wiki='es'
# 115.2 MB processed
You will have to script this outside BigQuery - or wait for news on https://issuetracker.google.com/issues/36955074.

Building on the answer from Chase. If you have a table that requires you filter over a column, and you're receiving the error:
Cannot query over table 'myproject.mydataset.mytable' without a filter over column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME' that can be used for partition elimination
Then you can use:
SELECT
MAX(_PARTITIONTIME) AS pt
FROM
`myproject.mydataset.mytable`
WHERE _PARTITIONTIME IS NOT NULL
Instead of the latest partition, I've used this to get the earliest partition in a dataset by simply changing max to min.

SELECT statement optimization

I'm not so expert in SQL queryes, but not even a complete newbie.
I'm exporting data from a MS-SQL database to an excel file using a SQL query.
I'm exporting many columns and two of this columns contain a date and an hour, this are the columns I use for the WHERE clause.
In detail I have about 200 rows for each day, everyone with a different hour, for many days. I need to extract the first value after the 15:00 of each day for more days.
Since the hours are different for each day i can't specify something like
SELECT a,b,hour,day FROM table WHERE hour='15:01'
because sometimes the value is at 15:01, sometimes 15:03 and so on (i'm looking for the closest value after the 15:00), for fix this i used this workaround:
SELECT TOP 1 a,b,hour,day FROM table WHERE hour > "15:00"
in this way i can take the first value after the 15:00 for a day...the problem is that i need this for more days...for a user-specifyed interval of days. At the moment i fix this with a UNION ALL statement, like this:
SELECT TOP 1 a,b,hour,day FROM table WHERE data="first_day" AND hour > "15:00"
UNION ALL SELECT TOP 1 a,b,hour,day FROM table WHERE data="second_day" AND hour > "15:00"
UNION ALL SELECT TOP 1 a,b,hour,day FROM table WHERE data="third_day" AND hour > "15:00"
...and so on for all the days (i build the SQL string with a for each day in the specifyed interval).
Until now this worked, but now I need to expand the days interval (now is maximun a week, so 5 days) to up to 60 days. I don't want to build an huge query string, but i can't imagine an alternative way for write the SQL.
Any help appreciated
Ettore

I typical solution for this uses row_number():
SELECT a, b, hour, day
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY day ORDER BY hour) as seqnum
FROM table t
WHERE hour > '15:00'
) t
WHERE seqnum = 1;

Help me build a SQL select statement

SQL isn't my greatest strength and I need some help building a select statement.
Basically, this is my requirement. The table stores a list of names and a timestamp of when the name was entered in the table. Names may be entered multiple times during a week, but only once a day.
I want the select query to return names that were entered anytime in the past 7 days, but not today.
To get a list of names entered today, this is the statement I have:
Select * from table where Date(timestamp) = Date(now())
And to get a list of names entered in the past 7 days, not including today:
Select * from table where (Date(now())- Date(timestamp) < 7) and (date(timestamp) != date(now()))
If the first query returns a set or results, say A, and the second query returns B, how can I get
B-A

Try this if you're working with SQL Server:
SELECT * FROM Table
WHERE Timestamp BETWEEN
dateadd(day,datediff(day,0,getdate()),-7),
AND dateadd(day,datediff(day,0,getdate()),0)
This ensures that the timestamp is between 00:00 7 days ago, and 00:00 today. Today's entries with time greater than 00:00 will not be included.

In plain English, you want records from your second query where the name is not in your first query. In SQL:
Select *
from table
where (Date(now())- Date(timestamp) < 7)
and (date(timestamp) != date(now()))
and name not in (Select name
from table
where Date(timestamp) = Date(now())
)

not in
like
select pk from B where PK not in A
or you can do something like
Select * from table where (Date(now())- Date(timestamp) < 7) and (Date(now())- Date(timestamp) > 1)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

How to select from partitioned table except a version in big query? - sql

Finally I found the answer: Company.database.Customer_TABLE_SUFFIX >= FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

Related

Update tables with new data

With daily tables on BigQuery, how can I query rolling 12 months?

How to choose the latest partition in BigQuery table?

SELECT statement optimization

Help me build a SQL select statement

Categories

Resources