BigQuery unexpected _table_suffix result

The following query:
select array_agg(distinct _table_suffix order by _table_suffix desc limit 10) from `project_id.visits.visit_20*`
returns:
but
select array_agg(distinct _table_suffix order by _table_suffix desc limit 10) from `project_id.visits.visit_202*`
returns:
How is this possible? (The length of the first result should be 10.)

choose latest partition of a Bigquery table where filter over partition column is required

I have been using the following query
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
It is not ideal, as the latest partition could be unavailable due to ingestion delay. Thus I tried the following queries:
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE DATE(_PARTITIONTIME) IN
(
SELECT
MAX(DATE(_PARTITIONTIME)) AS max_partition
FROM `project.dataset.table`
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
)
as well as
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE TIMESTAMP(DATE(_PARTITIONTIME)) IN
(
SELECT parse_timestamp("%Y%m%d", MAX(partition_id))
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'table'
)
Neither of them works, due to:
Cannot query over table 'project.dataset.table' without a filter over
column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME'
that can be used for partition elimination.
In both of your solutions the limiting filter for the partition column is calculated while the query runs, which leads to a full table scan.
Therefore, you need to add a filter on the partition column whose value is already known before the query runs.
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE DATE(_PARTITIONTIME) IN
(
SELECT
MAX(DATE(_PARTITIONTIME)) AS max_partition
FROM `project.dataset.table`
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
)
AND DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
If the last partition date could be months back, this is a better solution (BigQuery scripting):
DECLARE max_date DATE;
EXECUTE IMMEDIATE
"""
SELECT MAX(DATE(_PARTITIONTIME)) FROM `project.dataset.table`
WHERE DATE(_PARTITIONTIME) > "2000-12-15"
""" INTO max_date;
-- max_date is a DATE, so cast it to STRING before concatenating; || does not accept a DATE operand.
EXECUTE IMMEDIATE
"""
SELECT * FROM `project.dataset.table` WHERE DATE(_PARTITIONTIME) = DATE('""" || CAST(max_date AS STRING) || "')";
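A variation on the same idea, sketched rather than verified: the INFORMATION_SCHEMA.PARTITIONS lookup from the question can feed the scripting variable instead, so finding the latest partition reads only partition metadata and never scans the table (the project, dataset and table names are the placeholders from the question; '__NULL__' and '__UNPARTITIONED__' are BigQuery's pseudo-partitions and are excluded):
-- Look up the newest partition id from metadata only.
DECLARE max_date DATE DEFAULT (
  SELECT PARSE_DATE('%Y%m%d', MAX(partition_id))
  FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
  WHERE table_name = 'table'
    AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__'));
-- Inline the constant date so the partition filter is known before the query runs.
EXECUTE IMMEDIATE
"""
SELECT * FROM `project.dataset.table` WHERE DATE(_PARTITIONTIME) = DATE('""" || CAST(max_date AS STRING) || "')";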

select rows with condition of date presto

I'm trying to get the number of impressions per hour for a particular day.
I tried this code:
SELECT
date_trunc('hour', CAST(date_time AS timestamp)) date_time,
COUNT(impression_id) AS count_impression_id
FROM
parquet_db.imp_pixel
WHERE
date_time = '2022-07-27'
LIMIT 100
GROUP BY 1
But I get this error when I add the WHERE clause:
line 5:1: mismatched input 'group'. Expecting:
Can you help me fix it? Thanks.
LIMIT comes last in a SQL query, after GROUP BY and ORDER BY. Also, you should not be using LIMIT without ORDER BY. Use this version:
SELECT DATE_TRUNC('hour', CAST(date_time AS timestamp)) date_time,
COUNT(impression_id) AS count_impression_id
FROM parquet_db.imp_pixel
WHERE CAST(date_time AS date) = '2022-07-27'
GROUP BY 1
ORDER BY <something>
LIMIT 100;
Note that the ORDER BY clause determines which 100 records you get in the result set. Your current (intended) query lets Presto decide on its own which 100 records get returned.
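For example, assuming the busiest hours are what matters (the ORDER BY column below is an illustrative choice, and the date is written as an explicit DATE literal):
SELECT DATE_TRUNC('hour', CAST(date_time AS timestamp)) AS date_time,
       COUNT(impression_id) AS count_impression_id
FROM parquet_db.imp_pixel
WHERE CAST(date_time AS date) = DATE '2022-07-27'
GROUP BY 1
ORDER BY count_impression_id DESC -- largest hourly counts first
LIMIT 100;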

Using a subquery as reference to generate multiple result in same query

I'm using SQLite to store some data, and I have a subquery that I want to reuse to derive multiple results. The subquery goes like this:
select * from s_stats where datetime(start_time) > datetime('now','localtime','-3 days') group by src_ip,src_port,dest_ip,dest_port order by start_time desc
I want to reuse the above query to generate multiple filtered results in the same query.
One result is produced by this:
select start_time,action,count(*) from (select * from s_stats where datetime(start_time) > datetime('now','localtime','-3 days') group by src_ip,src_port,dest_ip,dest_port order by start_time desc) where action='BLOCKED' group by action,start_time order by start_time desc
I also want to do:
select start_time,action,count(*) from (select * from s_stats where datetime(start_time) > datetime('now','localtime','-3 days') group by src_ip,src_port,dest_ip,dest_port order by start_time desc) group by start_time order by start_time desc
Is there any way to combine both queries into a single query by using the subquery as some kind of variable?
Thanks
You can use conditional aggregation to get both counts in one query:
select start_time,
action,
count(case when action = 'BLOCKED' then 1 end) as blocked,
count(*) as total
from (select *
from s_stats
where datetime(start_time) > datetime('now','localtime','-3 days')
group by src_ip,src_port,dest_ip,dest_port
order by start_time desc)
group by start_time
order by start_time desc
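Since the question asks about treating the subquery as a variable: SQLite also supports common table expressions, so the shared subquery can be named once with WITH and referenced from several SELECTs in the same statement. Here is a sketch combining both of the original counts via UNION ALL (the CTE name recent_stats and the 'ALL' label are illustrative choices, not anything from the question):
WITH recent_stats AS (
  SELECT *
  FROM s_stats
  WHERE datetime(start_time) > datetime('now', 'localtime', '-3 days')
  GROUP BY src_ip, src_port, dest_ip, dest_port
)
SELECT start_time, action, COUNT(*) AS cnt          -- BLOCKED-only counts
FROM recent_stats
WHERE action = 'BLOCKED'
GROUP BY action, start_time
UNION ALL
SELECT start_time, 'ALL' AS action, COUNT(*) AS cnt -- totals per start_time
FROM recent_stats
GROUP BY start_time
ORDER BY start_time DESC;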

How to get COUNT with UNNEST function in Google BigQuery?

I need to get the count of events which have a specific parameter in them. Let's say I have the event event_notification_received with params (type, title, code_id), and the param code_id holds the unique name of an advertisement. I need to count how many events were received with that parameter. I am using the UNNEST function to get access to the params of the event, but it gives too many rows after execution; I think that's because of UNNEST. How can I count the events correctly? Thanks.
Here is my standard SQL query:
#standardSQL
SELECT event_date, event_timestamp, event_name, user_id, app_info.version,
geo.country, geo.region, geo.city,
my_event_params,
user_prop,
platform
FROM
`myProject.analytics_199660162.events_201807*`,
UNNEST(event_params) as my_event_params,
UNNEST(user_properties) as user_prop
WHERE
_TABLE_SUFFIX BETWEEN '24' AND '30' AND
event_name = "event_notification_received"
AND
my_event_params.value.string_value = "my_adverticement_name"
AND
platform = "ANDROID"
ORDER BY event_timestamp DESC
Is this what you want?
SELECT . . .,
(SELECT COUNT(*)
FROM UNNEST(event_params) as my_event_params
WHERE my_event_params.value.string_value = 'my_adverticement_name'
) as event_count
FROM `myProject.analytics_199660162.events_201807*`,
UNNEST(user_properties) as user_prop
WHERE _TABLE_SUFFIX BETWEEN '24' AND '30' AND
event_name = 'event_notification_received' AND
platform = 'ANDROID'
ORDER BY event_timestamp DESC;
If you UNNEST() and CROSS JOIN more than one array column at the FROM level, you'll get duplicated rows.
Instead, UNNEST() at the SELECT level, just to extract and COUNT the values you are looking for:
SELECT COUNT(DISTINCT (
SELECT value.string_value FROM UNNEST(user_properties) WHERE key='powers')
) AS distinct_powers
FROM `firebase-sample-for-bigquery.analytics_bingo_sample.events_20160607`
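If a single total is all that's needed, the same idea can be phrased as a plain count with EXISTS, sketched here under the assumption that the advertisement name is stored under the code_id key (the question doesn't show which key the filter targets):
#standardSQL
SELECT COUNT(*) AS notification_events
FROM `myProject.analytics_199660162.events_201807*`
WHERE _TABLE_SUFFIX BETWEEN '24' AND '30'
  AND event_name = 'event_notification_received'
  AND platform = 'ANDROID'
  -- Probe the params array with EXISTS instead of joining it, so event rows are never duplicated.
  AND EXISTS (
    SELECT 1
    FROM UNNEST(event_params) AS ep
    WHERE ep.key = 'code_id'
      AND ep.value.string_value = 'my_adverticement_name')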

Partition pruning for bigquery partitioned table

I have a query using an analytic function over a day-partitioned table. I would expect it to read only the partitions selected in the WHERE clause, but it reads all partitions in the table.
WITH query AS (
SELECT
* EXCEPT(rank)
FROM (
SELECT
*,
RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned` ) q_nested
) q
WHERE
rank < 1000
)
SELECT
num_mean_temp_samples ,
count(1) as samples
FROM query
WHERE
day in ( '20100101', '20100103')
GROUP BY 1 ORDER BY 1
I verified that partition pruning works without the analytic function:
WITH query AS (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned`
)
or after adding a UNION ALL of nested selects:
WITH query AS (
SELECT
* EXCEPT(rank)
FROM (
SELECT
*,
RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned` WHERE _PARTITIONDATE < "1970-01-01" ) q_nested1
UNION ALL SELECT
*,
RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned` WHERE _PARTITIONDATE >= "1970-01-01" ) q_nested2
) q
WHERE
rank < 1000
)
Table mydataset.gsod_partitioned is based on the public gsod dataset; the day=20100101 partition was created as follows:
bq query --destination_table 'private.gsod_partitioned$20100101' --time_partitioning_type=DAY --use_legacy_sql=false
'SELECT station_number, mean_temp, num_mean_temp_samples FROM `bigquery-public-data.samples.gsod` where year=2010 and month=01 and day=01'
Can you find a way to enable partition pruning for the analytic function without adding the extra UNION to the query?
Regarding _PARTITIONDATE: it isn't a documented feature, and it is recommended to use _PARTITIONTIME instead; see this other question where a Googler says as much: Use of the _PARTITIONDATE vs. the _PARTITIONTIME pseudo-columns in BigQuery
Regarding partition pruning with analytic functions: last year Google added support for filter pushdown, but it works only for _PARTITIONTIME, which has to be included in the fields covered by the PARTITION BY clause.
It should look like this:
WITH query AS (
SELECT
* EXCEPT(rank)
FROM (
SELECT
*,
RANK() OVER (PARTITION BY _pt order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", _PARTITIONTIME) AS day,
_PARTITIONTIME as _pt,
*
FROM
`mydataset.gsod_partitioned` ) q_nested
) q
WHERE
rank < 1000
)
SELECT
num_mean_temp_samples ,
count(1) as samples
FROM query
WHERE
day in ( '20100101', '20100103')
GROUP BY 1 ORDER BY 1
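If pruning still does not kick in with the string-formatted day column, one variant worth trying (a sketch only, not verified against BigQuery's pushdown rules) is to filter the outer query directly on the exposed _pt column, so the predicate stays on _PARTITIONTIME itself:
SELECT
  num_mean_temp_samples,
  COUNT(1) AS samples
FROM query
WHERE
  _pt IN (TIMESTAMP('2010-01-01'), TIMESTAMP('2010-01-03'))
GROUP BY 1 ORDER BY 1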