I have a table containing daily transactions with a date column.
The table is in BigQuery and is partitioned by the date column.
What is the most effective way to query all month-end data from the table?
I tried SQL like the query below, but it processed the whole table, which is about 100 GB:
SELECT * FROM table
WHERE date = LAST_DAY(date, MONTH)
Shouldn't it process fewer bytes, since the table is partitioned by the date column? (It is about 300 MB if I just choose one specific end-of-month date in the WHERE clause.)
SELECT * FROM table
WHERE date = "2022-11-30"
Is there any way to get what I want while processing less data?
You can minimize the volume of data processed, and the cost, by calculating the in-scope list of month-end dates and applying it as a filter condition on the partitioned table.
The following example illustrates this. The original data looks as given below; the expected output is the highlighted record, retrieved without scanning the complete table.
The code to achieve it is:
WITH data AS (
  SELECT '2020-11-20' AS add1, 'Robert' AS name UNION ALL
  SELECT '2021-10-10' AS add1, 'Smith' AS name UNION ALL
  SELECT '2023-9-9' AS add1, 'Mike' AS name UNION ALL
  SELECT '2024-8-2' AS add1, 'Donal' AS name UNION ALL
  SELECT '2025-7-31' AS add1, 'Kim' AS name
),
-- Calculating the in-scope list of month-end dates
new_data AS (
  SELECT add1, LAST_DAY(CAST(add1 AS DATE)) AS last_dt
  FROM data
)
-- Applying the filter condition on the date fields
SELECT *
FROM data a, new_data b
WHERE CAST(a.add1 AS DATE) = last_dt
The output will be the last record, which is the only one whose date falls on the last day of its month.
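Note that on a date-partitioned table, BigQuery only prunes partitions when the filter is a constant expression or a scripting variable, so a join like the one above may still scan every partition. A minimal sketch of one way to keep the pruning, assuming a hypothetical date-partitioned table `project.dataset.transactions` and a known date range of interest:

-- Build the list of month-end dates once, in a scripting variable,
-- then filter the partitioned table with it.
DECLARE month_ends ARRAY<DATE>;

SET month_ends = (
  SELECT ARRAY_AGG(LAST_DAY(m, MONTH))
  -- one entry per month in the range of interest (an assumed range here)
  FROM UNNEST(GENERATE_DATE_ARRAY('2020-01-01', '2025-12-01', INTERVAL 1 MONTH)) AS m
);

SELECT *
FROM `project.dataset.transactions`
WHERE date IN UNNEST(month_ends);

Because month_ends is resolved before the final query runs, the date filter behaves like a constant and should let BigQuery limit the scan to the month-end partitions.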
You can use the following query to filter on the last day of the current month and process only that day's partition:
SELECT * FROM table
WHERE date = DATE_TRUNC(DATE_ADD(CURRENT_DATE('Europe/Paris'), INTERVAL 1 MONTH), MONTH) - 1;
The same query with a date column instead of the current date:
SELECT * FROM table
WHERE date = DATE_TRUNC(DATE_ADD(your_date_column, INTERVAL 1 MONTH), MONTH) - 1;
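For reference, BigQuery also provides a LAST_DAY function, so the current-month filter can be written more directly while still scanning a single partition:

SELECT * FROM table
WHERE date = LAST_DAY(CURRENT_DATE('Europe/Paris'), MONTH);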
I have a table like this:
group_id | start_date | end_date
19335    | 20220613   | 20220714
19527    | 20220620   | 20220719
19339    | 20220614   | 20220720
19436    | 20220616   | 20220715
20095    | 20220711   | 20220809
I am trying to retrieve data from another table that is partitioned, where the data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_ids within the period [start_date, end_date]. What I need is to retrieve users' data for a column/metric over the 28 days prior to the start_date of each group_id.
My idea is to:
Retrieve distinct user_id per group_id within the period [start_date, end_date]
Retrieve the previous 28 days of metric data prior to the start_date of each group_id
A code snippet showing how to retrieve data for a single group_id is the following:
WITH users_per_group AS (
SELECT
users_metadata.user_id,
users_metadata.group_id,
FROM
`my_table_users_*` users_metadata
WHERE
_TABLE_SUFFIX BETWEEN '20220314' --start_date
AND '20220413' --end_date
AND experiment_id = 16709
GROUP BY
1,
2
)
SELECT
_TABLE_SUFFIX AS date,
user_id,
SUM(
COALESCE(metric, 0)
) AS metric,
FROM
users_per_group
JOIN `my_metric_table*` metric USING (user_id)
WHERE
_TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 28 DAY
)
) -- 28 days before it starts
AND FORMAT_TIMESTAMP(
'%Y%m%d',
TIMESTAMP_SUB(
PARSE_TIMESTAMP('%Y%m%d', '20220314'), --start_date
INTERVAL 1 DAY
)
) -- 1 day before it starts
GROUP BY
1,
2
ORDER BY
date ASC
Also, I want to avoid retrieving all data (across all dates) from that metric table, as it is huge and retrieval would take a very long time.
Is there an easy way to retrieve the metric data of each user across groups, considering the 28 days prior to the start_date of each group_id?
I can think of two approaches.
Join all the tables and then perform your query.
Create dynamic queries for each of your users.
Both approaches require search_from and search_to to be available beforehand, i.e., you need to calculate each user's search range before you do anything.
E.g.:
WITH users_per_group AS (
  SELECT
    user_id, group_id,
    DATE_SUB(PARSE_DATE("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from,
    DATE_SUB(PARSE_DATE("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
  FROM TableName
)
Once you have this kind of table then you can use any of the mentioned approaches.
Since I don't have your data and don't know your table names, I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user, grp, start_date, end_date
WITH maintable AS (
  SELECT 'India' AS visit_from, '20161115' AS start_date, '20161202' AS end_date
  UNION ALL SELECT 'Sweden', '20161201', '20161202'
),
-- then calculate the search from/to dates for every user and group
user_per_grp AS (
  SELECT *,
    DATE_SUB(PARSE_DATE("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from, -- change interval as per your need
    DATE_SUB(PARSE_DATE("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
  FROM maintable
)
SELECT visit_from, _TABLE_SUFFIX AS date, COUNT(visitId) AS total_visits
FROM user_per_grp ug
LEFT JOIN `bigquery-public-data.google_analytics_sample.ga_sessions_*` AS pub
  ON pub.geoNetwork.country = ug.visit_from
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", ug.search_from) AND FORMAT_DATE("%Y%m%d", ug.search_to)
GROUP BY 1, 2
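Note that the _TABLE_SUFFIX filter here comes from joined columns rather than constants, so BigQuery cannot use it to limit the tables scanned; that is why this approach reads more data than Approach 2 (see the comparison below).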
Approach 2
DECLARE queries ARRAY<STRING> DEFAULT [];

CREATE TEMP TABLE maintable AS (
  SELECT 'India' AS visit_from, '20161115' AS start_date, '20161202' AS end_date
  UNION ALL SELECT 'Sweden', '20161201', '20161202'
);

CREATE TEMP TABLE user_per_grp AS (
  SELECT *,
    DATE_SUB(PARSE_DATE("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from,
    DATE_SUB(PARSE_DATE("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
  FROM maintable
);

-- for each user, create a separate query here
FOR record IN (SELECT * FROM user_per_grp)
DO
  SET queries = queries || [FORMAT('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2', record.visit_from, record.search_from, record.search_to, record.visit_from)];
  -- replace your query here
END FOR;

-- aggregate all the queries and execute them
EXECUTE IMMEDIATE (SELECT STRING_AGG(query, ' union all ') FROM UNNEST(queries) query);
Here the second approach processed much less data (~750 KB) than the first (~17 MB). That might not hold for your dataset, though: if the date ranges of two users overlap, the same tables will be read twice.
I have a Google BigQuery table of orders with a DATE column and other columns related to the orders. The dataset starts from 2021-01-01 (yyyy-mm-dd).
My aim is to filter on the DATE column to the previous ISO week, both for this year and for last year. For this, I used ISOWEEK to create a new column:
WITH
last_week_last_year AS (
SELECT
DATE,
EXTRACT(ISOWEEK FROM DATE) AS isoweek,
FROM
`orders`
WHERE
EXTRACT(ISOWEEK FROM DATE) = EXTRACT(ISOWEEK FROM CURRENT_DATE())-1
GROUP BY 1, 2
ORDER BY DATE
)
SELECT * FROM last_week_last_year
This query results in the following table:
The issue is that when I filter the original orders table by the DATE values from the last_week_last_year table, I get all the orders back instead of just the filtered set.
My method to filter is WHERE DATE IN (SELECT DATE FROM last_week_last_year), as seen below.
SELECT
*
FROM
`orders`
WHERE
DATE IN (SELECT DATE FROM last_week_last_year)
ORDER BY DATE DESC;
A snapshot of the resulting table shows that it contains all of the records from 2021-01-01 until the latest day.
How can I make sure that in the latter query the table is filtered based on the first query's dates in the DATE column?
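One thing worth checking, offered as a hedged sketch rather than a definitive diagnosis of your data: EXTRACT(ISOWEEK FROM DATE) matches the same week number in every year, and a filter built from explicit date boundaries is both easier to verify and can prune a date-partitioned table. Assuming the same `orders` table and DATE column:

-- Monday of the previous ISO week, this year
WITH bounds AS (
  SELECT DATE_SUB(DATE_TRUNC(CURRENT_DATE(), ISOWEEK), INTERVAL 1 WEEK) AS wk_start
)
SELECT o.*
FROM `orders` o, bounds b
WHERE o.DATE BETWEEN b.wk_start AND DATE_ADD(b.wk_start, INTERVAL 6 DAY)
   -- same ISO week last year; note ISO years occasionally have 53 weeks
   OR o.DATE BETWEEN DATE_SUB(b.wk_start, INTERVAL 52 WEEK)
                 AND DATE_ADD(DATE_SUB(b.wk_start, INTERVAL 52 WEEK), INTERVAL 6 DAY)
ORDER BY o.DATE DESC;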
I have a table where every row is a transaction, with a few columns: client IDs and dates for every transaction.
I am trying to write a query which will give a table where column N shows the number of clients whose first transaction happened in month N and who made transactions in months N, N+1, N+2, ...
For example (desired table for 3 months of data):
1     2     3
100   90    78
80    80
60
The first row of column 1 shows the number of clients whose first transaction happened in month 1; the second row shows how many of these clients stayed after 1 month; the third row, after two months; and so on.
My current query (Year is a column with the year of the date, like 2017; month is the number of the month, like 1 for January):
WITH not_in AS (
  SELECT ID, Year, month
  FROM table
  WHERE trans_date < DATE '2017-01-01'),
ID_in AS (
  SELECT ID, Year, month
  FROM table
  WHERE trans_date BETWEEN DATE '2017-01-01' AND DATE '2017-01-31'
),
from_this AS (
  SELECT ID, Year, month
  FROM table
)
SELECT Year, Month, COUNT(DISTINCT ID)
FROM from_this
WHERE ID IN (SELECT ID FROM ID_in)
  AND ID NOT IN (SELECT ID FROM not_in)
GROUP BY 1, 2
ORDER BY 1, 2
But this gives only one column of the desired table (the one for January 2017). I would need to change the dates manually for the other months of 2017, 2018, and so on. How can I avoid this?
I guess it should be looped somehow. I think I should create a volatile table, add columns to it within a loop, and then select * from it.
Also, I cannot find documentation for variable declaration and WHILE loops in Teradata; any clarification is appreciated.
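For what it's worth, this kind of cohort table can usually be built without a loop: derive each client's first-transaction month once, then group by cohort month and months elapsed. A minimal set-based sketch, assuming a Teradata version that supports TRUNC(date, 'MON') and writing your transaction table as transactions (a stand-in name); pivoting the result into the triangular layout is left to the reporting layer:

-- cohort_month = month of each client's first transaction
WITH first_tx AS (
  SELECT ID, MIN(TRUNC(trans_date, 'MON')) AS cohort_month
  FROM transactions
  GROUP BY ID
)
SELECT
  f.cohort_month,
  -- whole months elapsed between the cohort month and the transaction month
  (EXTRACT(YEAR FROM t.trans_date) - EXTRACT(YEAR FROM f.cohort_month)) * 12
    + (EXTRACT(MONTH FROM t.trans_date) - EXTRACT(MONTH FROM f.cohort_month)) AS months_since,
  COUNT(DISTINCT t.ID) AS clients
FROM transactions t
JOIN first_tx f ON t.ID = f.ID
GROUP BY 1, 2
ORDER BY 1, 2;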
I have a simple question.
I need to count all records from multiple tables by day and hour and add all of them together into a single final table.
So the query for each table is something like this:
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_1 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_2 group by 1, 2
select timestamp_trunc(timestamp, day) date, timestamp_trunc(timestamp, hour) hour, count(*) from table_3 group by 1, 2
and so on and so forth.
I would like to combine all the results showing number of total records for each day and hour from these tables.
Expected results will be like this
date, hour, number of records of table 1, number of records of table 2, number of records of table 3 ........
What would be the optimal SQL query for this?
Probably the simplest way is to union them together and aggregate:
select timestamp_trunc(timestamp, hour) as hh,
       countif(which = 1) as num_1,
       countif(which = 2) as num_2
from ((select timestamp, 1 as which
       from table_1
      ) union all
      (select timestamp, 2 as which
       from table_2
      ) union all
      . . .
     ) t
group by hh
order by hh;
You are using timestamp_trunc(). It returns a timestamp truncated to the hour, with the date preserved -- there is no need to also include a separate date column.
Below is for BigQuery Standard SQL
#standardSQL
SELECT
TIMESTAMP_TRUNC(TIMESTAMP, DAY) day,
EXTRACT(HOUR FROM TIMESTAMP) hour,
COUNT(*) cnt,
_TABLE_SUFFIX AS table
FROM `project.dataset.table_*`
GROUP BY day, hour, table
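If you want the exact layout of the expected results, with one count column per source table, the two ideas above can be combined; a sketch assuming three tables table_1 through table_3:

SELECT
  TIMESTAMP_TRUNC(timestamp, DAY) AS date,
  EXTRACT(HOUR FROM timestamp) AS hour,
  COUNTIF(which = 1) AS table_1_records,
  COUNTIF(which = 2) AS table_2_records,
  COUNTIF(which = 3) AS table_3_records
FROM (
  SELECT timestamp, 1 AS which FROM table_1
  UNION ALL SELECT timestamp, 2 FROM table_2
  UNION ALL SELECT timestamp, 3 FROM table_3
)
GROUP BY date, hour
ORDER BY date, hour;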
I'm trying to create a quarterly report where some of the dates are generated from a lookup query. The input is start_date = 20181001 and end_date = 20191231. While I could just query the whole range, I don't need Q1/Q2/Q3, so I'm dynamically generating the in-between dates.
The problem comes when I use them in the subquery with the _TABLE_SUFFIX.
The dynamically generated ones don't work; it looks like the filter returns null and the query scans the entire table rather than the date-partitioned subset. But when I hard-code the values in a subquery, they work fine.
If you query both date lookup CTEs, they look identical.
The results of both the dynamically_created and hard_coded tables match, so I have no idea where this error is coming from.
CREATE TEMP FUNCTION start_end() AS ( [parse_date('%Y%m%d','{start_date}'), parse_date('%Y%m%d','{end_date}')] );
CREATE TEMP FUNCTION wildcard_format(date_object date) as (replace(cast(date_object as string),"-",""));
-- create a calendar table with one column "day" and one row for each day in the desired timeframe
WITH
calendar AS (
SELECT
extract(quarter from day) quarter
,extract(year from day) year
,day
FROM
UNNEST(GENERATE_DATE_ARRAY( start_end()[OFFSET(0)], start_end()[OFFSET(1)], INTERVAL 1 DAY) ) AS day
),
dynamically_created as (
select
wildcard_format(min(day)) start_py
,wildcard_format(max(case when year = extract (year from parse_date('%Y%m%d','{start_date}')) then day else null end)) end_py
,wildcard_format(min(case when year = extract (year from parse_date('%Y%m%d','{end_date}')) then day else null end)) start_cy
,wildcard_format(max(day)) end_cy
from
calendar
where quarter = extract (quarter from parse_date('%Y%m%d','{end_date}'))
),
hard_coded as (
SELECT
'20181001' as start_py
,'20181231' as end_py
,'20191001' as start_cy
,'20191231' as end_cy
),
sesh_data as (
select
*
from
`projectid.datasetid.summary_*`
where
(SELECT _table_suffix between start_py AND end_py FROM dynamically_created) #not working
-- swap in the line below to use the hard-coded bounds instead:
-- (SELECT _table_suffix between start_py AND end_py FROM hard_coded) #working
)
select * from sesh_data
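The likely cause, based on BigQuery's documented wildcard-table behavior: a filter on _TABLE_SUFFIX that contains a subquery is not a constant expression, so it cannot limit the tables scanned; only the hard-coded constants prune. One workaround, in the spirit of the EXECUTE IMMEDIATE approach shown earlier, is to compute the boundaries first and then build the final query as a string, so the suffixes are literals by the time the wildcard query runs. A sketch under those assumptions:

DECLARE start_py STRING;
DECLARE end_py STRING;

-- Compute the dynamic suffix boundaries first, so they end up as plain values
-- (shown here for a fixed range; substitute the calendar logic above).
SET (start_py, end_py) = (
  SELECT AS STRUCT
    REPLACE(CAST(MIN(day) AS STRING), '-', ''),
    REPLACE(CAST(MAX(day) AS STRING), '-', '')
  FROM UNNEST(GENERATE_DATE_ARRAY('2018-10-01', '2018-12-31')) AS day
);

-- The suffixes are string literals inside the generated SQL,
-- so the wildcard scan is limited to the matching tables.
EXECUTE IMMEDIATE FORMAT("""
  SELECT * FROM `projectid.datasetid.summary_*`
  WHERE _TABLE_SUFFIX BETWEEN '%s' AND '%s'
""", start_py, end_py);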