How to join in BigQuery SQL based on an IF condition - google-bigquery

I have a table where each row is a task with a created_at and a completed_at date.
I want to understand this data not by task but by date.
I need to be able to see the number of tasks on any given day.
The SQL below generates and unnests a date array and joins my data table to that result based on the created_at date, but this won't fit my needs.
WITH main AS (
SELECT * FROM `data.task_merge`),
second AS (
SELECT *
FROM unnest(GENERATE_DATE_ARRAY('2020-01-01', '2022-12-31', INTERVAL 1 DAY)) AS newdate)
SELECT *
FROM second
LEFT JOIN main
ON second.newdate = cast(main.created_at AS DATE)
What I need:
For every date x in the date array: If task y in the dataset has a created_at date <= x and a completed_at date >= x, join that task to the table against x. Then increment y+1 and repeat against x, and when we've finished the table of tasks, increment to x+1 and restart at y.

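I answered this myself: put the date-range condition directly in the ON clause of the join, so that each date x matches every task with created_at <= x and completed_at >= x.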
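The range condition in the ON clause behaves like a nested loop over dates and tasks, as the question describes. A minimal Python sketch of that logic follows; the function name and the sample tuples are hypothetical and only stand in for the rows of the real table.

```python
from datetime import date, timedelta


def tasks_open_per_day(tasks, start, end):
    """Count, for each day from start to end inclusive, the tasks whose
    created/completed range covers that day."""
    counts = {}
    day = start
    while day <= end:
        # Same test as the join condition: created_at <= x AND completed_at >= x
        counts[day] = sum(
            1 for created, completed in tasks if created <= day <= completed
        )
        day += timedelta(days=1)
    return counts


# Hypothetical sample tasks as (created_at, completed_at) pairs.
sample_tasks = [
    (date(2020, 1, 1), date(2020, 1, 3)),
    (date(2020, 1, 2), date(2020, 1, 2)),
]
open_counts = tasks_open_per_day(sample_tasks, date(2020, 1, 1), date(2020, 1, 4))
```

Days on which no task is open still appear with a count of 0, which mirrors what the LEFT JOIN from the date array gives you.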

Related

Most effective way to query all the month-end data in BigQuery

I have a table containing the daily transactions with date column.
The table is in BigQuery and is partitioned by the date column.
What is the most effective way to query all month-end data from the table?
I tried the SQL below, but it processed the whole table, which is about 100 GB:
SELECT * FROM table
WHERE date = LAST_DAY(date , month)
Shouldn't it process fewer bytes, since the table is partitioned by date? (It processes about 300 MB if I choose one specific month-end date in the WHERE clause.)
SELECT * FROM table
WHERE date = "2022-11-30"
Is there any way to get what I want while processing less data?
You can minimize the volume of data processed, and therefore the cost, by calculating a list of in-scope month-end dates and applying it as a filter condition on the partitioned table.
The following example illustrates this.
The sample data is defined in the data CTE below; the expected output is the one record that falls on the last day of its month, without scanning the complete table.
The code to achieve it is:
with data as
(select '2020-11-20' as add1, 'Robert' as name Union all
select '2021-10-10' as add1, 'Smith' as name Union all
select '2023-9-9' as add1, 'Mike' as name Union all
select '2024-8-2' as add1, 'Donal' as name Union all
select '2025-7-31' as add1, 'Kim' as name ),
-- Calculating the in-scope list of month-end dates
new_data as
(select add1, LAST_DAY(cast (add1 as date)) as last_dt
from data)
-- Applying the filter condition on the date fields
select a.* from data a, new_data b
where cast (a.add1 as date)=last_dt
The output is the last record (Kim, 2025-7-31), the only one dated on the last day of its month.
You can use the following query to filter on the last day of the current month and process only that day's partition:
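One way to apply the "list of in-scope month-end dates" idea is to compute the dates outside the query and filter on literal dates. The Python sketch below does this on the assumption that a filter on constant date literals lets the engine prune partitions, as the single-date query in the question suggests. The function name and the date range are hypothetical.

```python
import calendar
from datetime import date


def month_end_dates(start, end):
    """Return the last day of every month that falls between start and end, inclusive."""
    result = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # monthrange returns (weekday of first day, number of days in month)
        last = date(year, month, calendar.monthrange(year, month)[1])
        if start <= last <= end:
            result.append(last)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result


month_ends = month_end_dates(date(2022, 1, 1), date(2022, 12, 31))
# Render the dates as SQL literals for a "WHERE date IN (...)" filter.
date_list_sql = ", ".join(f"DATE '{d.isoformat()}'" for d in month_ends)
```

The resulting string can be placed inside an IN (...) list; whether this actually reduces bytes processed should be checked with a dry run against your own table.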
SELECT * FROM table
WHERE date = DATE_TRUNC(DATE_ADD(CURRENT_DATE('Europe/Paris'), INTERVAL 1 MONTH), MONTH) - 1;
The same query with a date column instead of the current date:
SELECT * FROM table
WHERE date = DATE_TRUNC(DATE_ADD(your_date_column, INTERVAL 1 MONTH), MONTH) - 1;

How to add/subtract n number of days from a custom date in postgresql?

I have a database table with an issue_date column and a forecast_date column.
I am selecting the maximum date from the table,
but I want to query the Nth previous day relative to the maximum available date
(maximum date minus 1, 2, or n days).
SELECT issue_date, forecast_date, state_name, district_name, rainfall, geometry
FROM all_parameters_forecast_data
WHERE "forecast_date" = (SELECT MAX("forecast_date") - INTERVAL '1 day' FROM all_parameters_forecast_data)
Because the max date is custom,
I cannot use today or yesterday logic here.
Is there any way to do this?
1. Calculate the MAX date
2. Filter your records using BETWEEN max_date - n AND max_date
Example:
SELECT issue_date, forecast_date, state_name, district_name, rainfall, geometry
FROM (
SELECT
*,
MAX("forecast_date") OVER () as max_forecast_date
FROM
all_parameters_forecast_data
) s
WHERE "forecast_date" BETWEEN max_forecast_date - n AND max_forecast_date
There are many ways to achieve #1. In my example I used the MAX() window function to add the required value as a separate column, which can then be used for the comparison.
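The two steps (find the max date, then keep rows within n days of it) can be sketched in Python as follows. The row dictionaries and the function name are hypothetical; only the forecast_date key mirrors the column in the question.

```python
from datetime import date, timedelta


def rows_within_n_days_of_max(rows, n):
    """Keep rows whose forecast_date lies between max_date - n days and max_date, inclusive."""
    if not rows:
        return []
    # Step 1: calculate the MAX date over all rows.
    max_date = max(row["forecast_date"] for row in rows)
    # Step 2: filter with the equivalent of BETWEEN max_date - n AND max_date.
    lower = max_date - timedelta(days=n)
    return [row for row in rows if lower <= row["forecast_date"] <= max_date]


# Hypothetical sample rows.
sample_rows = [
    {"forecast_date": date(2023, 5, 1), "rainfall": 0.0},
    {"forecast_date": date(2023, 5, 8), "rainfall": 1.5},
    {"forecast_date": date(2023, 5, 9), "rainfall": 3.2},
    {"forecast_date": date(2023, 5, 10), "rainfall": 0.4},
]
recent_rows = rows_within_n_days_of_max(sample_rows, 1)
```

To fetch only the Nth previous day rather than the whole range, compare for equality with max_date - n instead of using the range test.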

Get 30 days prior data for each row of query

I have a query that returns a list of ~20k users who logged on to our site during a specific week of the month.
For each of these users, I need to know for the past 30 days:
1. whether they logged on: defined by any rows recorded in the same table
2. the max event in the 30-day window prior to the date in the current where clause
This is the current code snippet that helps me narrow to the ~20k users for a given week to begin with:
select
user_id,
max(timestamp)
from table
where timestamp between '2019-02-01' and '2019-02-05'
group by 1;
Expected result set/columns:
user_id,
max(timestamp),
logged_on, [if they have any # of rows in the same table within 30 days prior to their max(timestamp) date]
previous_timestamp, [the 2nd most recent login date within 30 days prior to their max(timestamp) date]
I think this is what you're looking for. I'm not sure it's the most efficient method, though. Window functions may perform better, but as bob-mccormick mentioned, the tricky bit would be filling in dates where the user (partition key) was not active so that the range query works correctly.
Example data setup (Snowflake syntax)
-- Create sample table
create temporary table user_logins (userid number, date_logged_on timestamp);
-- Insert some random sample data
insert overwrite into user_logins
select
uniform(1,10,random()) userid,
dateadd('minutes', uniform(1,86400,random()) * -1,current_timestamp::timestamp_ntz) date_logged_on
from table(generator(rowcount => 100))
;
Select statement
-- Run select
with user_last_logins as (
select
userid,
max(date_logged_on) last_login
from user_logins
where
date_logged_on between '2019-01-01' and '2019-05-08'
group by userid
)
select
user_last_logins.userid,
max(user_last_logins.last_login) last_logged_on,
count(prior_30_each_user.userid) num_logins_prior_30,
max(prior_30_each_user.date_logged_on) previous_logged_on
from user_last_logins
left join user_logins prior_30_each_user
on user_last_logins.userid = prior_30_each_user.userid
and prior_30_each_user.date_logged_on > dateadd('day', -30, user_last_logins.last_login)
and prior_30_each_user.date_logged_on < user_last_logins.last_login
group by user_last_logins.userid
;
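The self-join above can be read as: find each user's last login in the window, then look back 30 days from that login. Here is a minimal Python sketch of the same logic. The function name and the sample logins are hypothetical, and the bounds are exclusive on both sides to match the join condition.

```python
from datetime import datetime, timedelta


def prior_30_day_summary(logins, window_start, window_end):
    """logins is a list of (user_id, timestamp) pairs.
    Returns {user_id: (last_login, num_logins_prior_30, previous_login)}."""
    # Last login per user inside the reporting window.
    last_login = {}
    for user, ts in logins:
        if window_start <= ts <= window_end:
            if user not in last_login or ts > last_login[user]:
                last_login[user] = ts
    summary = {}
    for user, last in last_login.items():
        cutoff = last - timedelta(days=30)
        # Logins strictly after the cutoff and strictly before the last login.
        prior = [ts for u, ts in logins if u == user and cutoff < ts < last]
        summary[user] = (last, len(prior), max(prior) if prior else None)
    return summary


# Hypothetical sample data.
sample_logins = [
    (1, datetime(2019, 2, 3, 9, 0)),
    (1, datetime(2019, 1, 20, 8, 0)),
    (1, datetime(2018, 12, 1, 8, 0)),
    (2, datetime(2019, 2, 2, 12, 0)),
]
login_summary = prior_30_day_summary(
    sample_logins, datetime(2019, 2, 1), datetime(2019, 2, 5)
)
```

A prior-login count of 0 with a previous login of None corresponds to the rows where the LEFT JOIN finds no match.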

Smoothing out a result set by date

Using SQL I need to return a smooth set of results (i.e. one per day) from a dataset that contains 0-N records per day.
The result per day should be the most recent previous value even if that is not from the same day. For example:
Starting data:
Date: Time: Value
19/3/2014 10:01 5
19/3/2014 11:08 3
19/3/2014 17:19 6
20/3/2014 09:11 4
22/3/2014 14:01 5
Required output:
Date: Value
19/3/2014 6
20/3/2014 4
21/3/2014 4
22/3/2014 5
First you need to complete the date range and fill in the missing dates (21/3/2014 in your example). This can be done either by joining a calendar table, if you have one, or by using a recursive common table expression to generate the complete sequence on the fly.
Once you have the complete sequence of dates, finding the max value for the date, or the value from the latest previous non-null row, becomes easy. In this query I use a correlated subquery to do it.
with cte as (
select min(date) date, max(date) max_date from your_table
union all
select dateadd(day, 1, date) date, max_date
from cte
where date < max_date
)
select
c.date,
(
select top 1 max(value) from your_table
where date <= c.date group by date order by date desc
) value
from cte c
order by c.date;
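The same two steps (complete the date range, then carry the most recent value forward) can be sketched in Python. Note that this sketch takes the latest reading by time, as the question requires, rather than the maximum value per day; the function name is hypothetical and the sample readings are the ones from the question.

```python
from datetime import date, datetime, timedelta


def daily_latest_values(readings):
    """readings is a list of (datetime, value) pairs.
    Returns one (date, value) pair per day from the first to the last date,
    carrying the most recent earlier value forward across gaps."""
    ordered = sorted(readings)
    if not ordered:
        return []
    result = []
    day = ordered[0][0].date()
    last_day = ordered[-1][0].date()
    index = 0
    current = None
    while day <= last_day:
        # Consume every reading up to and including this day.
        while index < len(ordered) and ordered[index][0].date() <= day:
            current = ordered[index][1]
            index += 1
        result.append((day, current))
        day += timedelta(days=1)
    return result


# Starting data from the question.
sample_readings = [
    (datetime(2014, 3, 19, 10, 1), 5),
    (datetime(2014, 3, 19, 11, 8), 3),
    (datetime(2014, 3, 19, 17, 19), 6),
    (datetime(2014, 3, 20, 9, 11), 4),
    (datetime(2014, 3, 22, 14, 1), 5),
]
smoothed = daily_latest_values(sample_readings)
```

For the sample data this yields one row per day from 19/3/2014 to 22/3/2014, with 21/3/2014 inheriting the value from 20/3/2014.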
Maybe this works; try it and let me know:
select date, value from test where (time,date) in (select max(time),date from test group by date);

Calculate closest working day in Postgres

I need to schedule some items in a postgres query based on a requested delivery date for an order. So for example, the order has a requested delivery on a Monday (20120319 for example), and the order needs to be prepared on the prior working day (20120316).
Thoughts on the most direct method? I'm open to adding a dates table. I'm thinking there's got to be a better way than a long set of case statements using:
SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40');
This gets you the previous business day.
SELECT
CASE (EXTRACT(ISODOW FROM current_date)::integer) % 7
WHEN 1 THEN current_date-3
WHEN 0 THEN current_date-2
ELSE current_date-1
END AS previous_business_day
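For comparison, the CASE expression above maps directly onto a few lines of Python. This is only a sketch of the weekday arithmetic; like the SQL, it ignores holidays, and the function name is hypothetical.

```python
from datetime import date, timedelta


def previous_weekday(day):
    """Return the closest Monday-to-Friday date strictly before day."""
    # isoweekday(): Monday = 1 ... Sunday = 7, so % 7 maps Sunday to 0.
    dow = day.isoweekday() % 7
    if dow == 1:      # Monday: go back to Friday
        offset = 3
    elif dow == 0:    # Sunday: go back to Friday
        offset = 2
    else:             # Tuesday to Saturday: the day before is a weekday
        offset = 1
    return day - timedelta(days=offset)


# The delivery date from the question (a Monday).
prep_day = previous_weekday(date(2012, 3, 19))
```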
To get the previous work day:
select max(s.a) as work_day
from (
select s.a::date
from generate_series('2012-01-02'::date, '2050-12-31', '1 day') s(a)
where extract(dow from s.a) between 1 and 5
except
select holiday_date
from holiday_table
) s
where s.a < '2012-03-19'
;
If you want the next work day just invert the query.
SELECT y.d AS prep_day
FROM (
SELECT generate_series(dday - 8, dday - 1, interval '1d')::date AS d
FROM (SELECT '2012-03-19'::date AS dday) x
) y
LEFT JOIN holiday h USING (d)
WHERE h.d IS NULL
AND extract(isodow from y.d) < 6
ORDER BY y.d DESC
LIMIT 1;
It should be faster to generate only as many days as necessary. I generate one week prior to the delivery. That should cover all possibilities.
isodow as extract parameter is more convenient than dow to test for workdays.
min() / max(), ORDER BY / LIMIT 1, that's a matter of taste with the few rows in my query.
To get several candidate days in descending order, not just the top pick, change the LIMIT 1.
I put the dday (delivery day) in a subquery so you only have to input it once. You can enter any date or timestamp literal. It is cast to date either way.
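A holiday-aware version of the same idea can be sketched in Python. Unlike the query above, which generates a fixed window of candidate days, this sketch simply walks backwards until it finds a weekday that is not a holiday. The function name and the holiday set are hypothetical stand-ins for the holiday table.

```python
from datetime import date, timedelta


def previous_working_day(delivery_day, holidays=frozenset()):
    """Return the latest date before delivery_day that is Monday to Friday
    and not in the holidays set."""
    candidate = delivery_day - timedelta(days=1)
    # isoweekday() >= 6 means Saturday or Sunday.
    while candidate.isoweekday() >= 6 or candidate in holidays:
        candidate -= timedelta(days=1)
    return candidate


# Hypothetical holiday set.
sample_holidays = {date(2012, 12, 25), date(2012, 12, 26)}
prep_without_holidays = previous_working_day(date(2012, 3, 19))
prep_with_holidays = previous_working_day(date(2012, 12, 27), sample_holidays)
```

Walking backwards avoids having to choose a window size, at the cost of one membership check per skipped day.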
CREATE TABLE Holidays (Holiday, PrecedingBusinessDay) AS VALUES
('2012-12-25'::DATE, '2012-12-24'::DATE),
('2012-12-26'::DATE, '2012-12-24'::DATE);
SELECT Day, COALESCE(PrecedingBusinessDay, PrecedingMondayToFriday)
FROM
(SELECT Day, Day - CASE DATE_PART('DOW', Day)
WHEN 0 THEN 2
WHEN 1 THEN 3
ELSE 1
END AS PrecedingMondayToFriday
FROM TestDays) AS PrecedingMondaysToFridays
LEFT JOIN Holidays ON PrecedingMondayToFriday = Holiday;
You might want to rename some of the identifiers :-).