Query needed to show data in a different way - sql

After making a query, I get this data in this format:
I need to set up a query that pairs the dates related to the same id and counts the difference in days.
The result should be like this:
I'm using PostgreSQL. Could you please help me set up the query to get the desired output?
Thanks in advance

Hmmm . . . I'm thinking lead() and some additional filtering and arithmetic:
select id, date as date_start, next_date as date_end,
(next_date - date) as days
from (select t.*, lead(date) over (partition by id order by date) as next_date
from t
) t
where event = 0;
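To see how the lead() approach pairs rows, here is a small sketch that runs an equivalent query through Python's built-in sqlite3 module. The sample rows and the use of SQLite instead of PostgreSQL are assumptions made for illustration. SQLite stores dates as text, so the day difference is computed with julianday() rather than plain date subtraction, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: event 0 opens a period, event 1 closes it.
rows = [
    (1, "2021-01-01", 0),
    (1, "2021-01-05", 1),
    (1, "2021-02-10", 0),
    (1, "2021-02-12", 1),
    (2, "2021-03-01", 0),
    (2, "2021-03-31", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, date TEXT, event INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Same shape as the query above: attach the next date per id to every row,
# then keep only the rows that open a period.
query = """
SELECT id, date AS date_start, next_date AS date_end,
       CAST(julianday(next_date) - julianday(date) AS INTEGER) AS days
FROM (SELECT t.*,
             lead(date) OVER (PARTITION BY id ORDER BY date) AS next_date
      FROM t) AS s
WHERE event = 0
ORDER BY id, date_start
"""
pairs = conn.execute(query).fetchall()
for row in pairs:
    print(row)
```

Each printed tuple is (id, date_start, date_end, days). If an id ends with an unmatched event = 0 row, lead() returns NULL for it, so date_end and days come back as None.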

The most robust and maintainable answer is the one by @GordonLinoff.
Another method to solve this can be using CTEs. Here I am assuming that event = 0 and event = 1 are paired (which is true in the example shown):
WITH t1 AS
(SELECT id, date,
rank() OVER (ORDER BY id, date) AS id_date_rank
FROM t
WHERE event = 0),
t2 AS
(SELECT id, date,
rank() OVER (ORDER BY id, date) AS id_date_rank
FROM t
WHERE event = 1)
SELECT t1.id,
t1.date AS date_start,
t2.date AS date_end,
DATE_PART('day', (t2.date - t1.date)) AS no_days
FROM t1
INNER JOIN t2 ON (t1.id_date_rank = t2.id_date_rank)
ORDER BY t1.id,
t1.date;

Related

BigQuery: 'join lateral' alternative for referencing value in subquery

I have a BigQuery table that holds append-only data - each time an entity is updated a new version of it is inserted. Each entity has its unique ID and each entry has a timestamp of when it was inserted.
When querying for the latest version of the entity, I order by rank, partition by id, and select the most recent version.
I want to take advantage of this and chart the progression of these entities over time. For example, I would like to generate a row for each day since Jan. 1st, with a summary of the entities as they were on that day. In postgres, I would do:
select
...
from generate_series('2022-01-01'::timestamp, '2022-09-01'::timestamp, '1 day'::interval) query_date
left join lateral (
select *
from (
with snapshot as (
select distinct on (id) *
from table
where "createdOn" <= query_date
order by id, "createdOn" desc
)
This basically behaves like a for-each, having each subquery run once for each query_date (day, in this instance) which I can reference in the where clause. Each subquery then filters the data so that it only uses data up to a certain time.
I know that I can create a saved query for the "subquery" logic and then schedule a prefill to run once for each day over the timeline, but I would like to understand how to write an exploratory query.
EDIT 1
Using a correlated subquery is a step in the right direction, but does not work when the subquery needs to join with another table (another append-only table holding a related entity).
So this works:
select
day
, (
select count(*)
from `table` t
where date(createdOn) < day
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
But if I need the subquery to join with another table, like in:
select
day
, (
select as struct *
from (
select
id
, status
, rank() over (partition by id order by createdOn desc) as rank
from `table1`
where date(createdOn) < day
qualify rank = 1
) t1
left join (
select
id
, other
, rank() over (partition by id order by createdOn desc) as rank
from `table2`
where date(createdOn) < day
qualify rank = 1
) t2 on t2.other = t1.id
)
from unnest((select generate_date_array(date('2022-01-01'), current_date(), interval 1 day) as day)) day
order by day desc
I get an error saying Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN. Another SO question about that error (Avoid correlated subqueries error in BigQuery) solves the issue by moving the correlated query to a join in the top query - which misses what I am trying to achieve.
Took me a while, but I figured out a way to do this using the answer in Bigquery: WHERE clause using column from outside the subquery.
Basically, it requires flipping the order of the queries. Here's how it's done:
select *
from (
select *
from `table1` t1
JOIN (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
ON (t1.createdOn) < day.day
QUALIFY ROW_NUMBER() OVER (PARTITION BY day, t1.id ORDER BY t1.createdOn desc) = 1
) t1
left join (
select
* -- aggregate here
from (
SELECT
id, other, createdOn, day.day
FROM `table2` t2
JOIN (select day from unnest((select generate_timestamp_array(timestamp('2022-01-01'), current_timestamp(), interval 1 day) as day)) day) day
ON (t2.createdOn) < day.day
QUALIFY ROW_NUMBER() OVER (PARTITION BY day, t2.id ORDER BY t2.createdOn desc) = 1
) snapshot
group by snapshot.other, day
) t2 on t2.other = t1.id and t2.day = t1.day
group by t1.day

Hive - max (rather than last) date in quarter

I'm querying a table and only want to select the end-of-quarter dates. I've done so like this:
select
yyyy_mm_dd,
id
from
t1
where
yyyy_mm_dd = cast(date_add(trunc(add_months(yyyy_mm_dd,3-pmod(month(yyyy_mm_dd)-1,3)),'MM'),-1) as date) --last day of q
With daily rows, from 2020-01-01 until 2020-12-31, the above works fine. However, 2021 rows end up being omitted as the quarter is incomplete. How could I modify the where clause so I select the last day of each quarter and the max date in the current quarter?
You can assign a row number within each quarter in descending order of date, and keep only the rows whose row number equals 1 (the last date in each quarter):
select yyyy_mm_dd, id
from
(select
yyyy_mm_dd,
id,
row_number() over (partition by id, year(yyyy_mm_dd), quarter(yyyy_mm_dd) order by yyyy_mm_dd desc) as rn
from
t1
) t2
where rn = 1
It is not clear if you have multiple rows on the end-of-quarter dates. It might be safer to take the max and use that:
select t1.*
from (select t1.*,
max(yyyy_mm_dd) over (partition by id, year(yyyy_mm_dd), quarter(yyyy_mm_dd)) as max_yyyy_mm_dd
from t1
) t1
where yyyy_mm_dd = max_yyyy_mm_dd;
Note that this uses t1.* for the select. If you only wanted the maximum date, you can aggregate:
select id, max(yyyy_mm_dd)
from t1
group by id, year(yyyy_mm_dd), quarter(yyyy_mm_dd);
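All three queries rest on the same grouping: bucket rows by (id, year, quarter) and keep the largest date in each bucket, which also covers a quarter that is still in progress. A minimal sketch of that logic in plain Python (not Hive), with made-up sample rows and a hypothetical helper name:

```python
from datetime import date

def last_date_per_quarter(rows):
    """rows: iterable of (id, date). Returns {(id, year, quarter): max date}."""
    latest = {}
    for entity_id, d in rows:
        # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
        key = (entity_id, d.year, (d.month - 1) // 3 + 1)
        if key not in latest or d > latest[key]:
            latest[key] = d
    return latest

# Hypothetical data: 2020 quarters are complete, 2021 Q1 is still in progress.
rows = [
    ("a", date(2020, 3, 30)),
    ("a", date(2020, 3, 31)),
    ("a", date(2020, 12, 31)),
    ("a", date(2021, 1, 15)),
    ("a", date(2021, 2, 3)),
]
result = last_date_per_quarter(rows)
for key in sorted(result):
    print(key, result[key])
```

For the incomplete quarter the result is simply the latest date seen so far (2021-02-03 here), which is what the original end-of-quarter filter could not return.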

Filter SQL Server Records by Latest Date on Every Year

How would I filter this SQL Server table so that only the green records are left, i.e. the last recorded date in every year for each Customer ID?
If you want to get the rows, not only the date values, using ROW_NUMBER() is an option (you only need to use the appropriate PARTITION BY and ORDER BY clauses):
SELECT *
FROM (
SELECT
CustomerId,
[Date],
ROW_NUMBER() OVER (PARTITION BY CustomerId, YEAR([Date]) ORDER BY [Date] DESC) AS Rn
FROM YourTable
) t
WHERE Rn = 1
To get the maximum date in each year, you can select the rows for which no later date exists for the same customer in the same year, as follows:
SELECT *
FROM yourtable t1
WHERE NOT EXISTS
(SELECT 1
FROM yourtable t2
WHERE t1.customerID = t2.customerID
AND t1.date < t2.date
AND DATEPART(YEAR, t1.date) = DATEPART(YEAR, t2.date))
If you have only two columns, then you can just use aggregation:
select customer_id, max(date)
from t
group by customer_id, year(date);

Window functions ordered by date when some dates don't exist

Suppose this example query:
select
id
, date
, sum(var) over (partition by id order by date rows 30 preceding) as roll_sum
from tab
When some dates are missing from the date column, the window does not account for them: it simply takes the 30 preceding rows. How could I make this window aggregation include the missing dates?
Many thanks!
You can join a sequence containing all dates from a desired interval.
select
*
from (
select
d.date,
q.id,
q.roll_sum
from unnest(sequence(date '2000-01-01', date '2030-12-31')) as d(date)
left join ( your_query ) q on q.date = d.date
) v
where v.date > (select min(my_date) from tab2)
and v.date < (select max(my_date) from tab2)
In standard SQL, you would typically use a window range specification, like:
select
id,
date,
sum(var) over (
partition by id
order by date
range interval '30' day preceding
) as roll_sum
from tab
However, I am unsure whether Presto supports this syntax. You can resort to a correlated subquery instead:
select
id,
date,
(
select sum(var)
from tab t1
where
t1.id = t.id
and t1.date >= t.date - interval '30' day
and t1.date <= t.date
) roll_sum
from tab t
I don't think Presto supports window functions with interval ranges. Alas. There is an old-fashioned way of doing this, by counting "ins" and "outs" of values:
with t2 as (
select id, date, var, 1 as is_orig
from t
union all
select id, date + interval '30' day, -var, 0
from t
)
select x.*
from (select id, date,
sum(sum(var)) over (partition by id order by date) as running_30,
sum(is_orig) as is_orig
from t2
group by id, date
) x
where is_orig > 0
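The "ins and outs" idea can be checked outside the database. The sketch below is plain Python rather than Presto, with made-up sample rows and hypothetical function names: every value enters the running total on its own date and leaves it 30 days later, and only the dates that exist in the input are reported. It is compared against a brute-force sum over the half-open window (date - 30 days, date], which is the window this bookkeeping implies.

```python
from collections import defaultdict
from datetime import date, timedelta

def rolling_sum_ins_outs(rows, window_days=30):
    """rows: list of (id, date, var). Returns {(id, date): rolling sum}."""
    deltas = defaultdict(int)   # net change per (id, date)
    original = set()            # (id, date) pairs present in the input
    for entity_id, d, var in rows:
        deltas[(entity_id, d)] += var                                # "in"
        deltas[(entity_id, d + timedelta(days=window_days))] -= var  # "out"
        original.add((entity_id, d))

    result = {}
    running = defaultdict(int)  # running total per id
    for entity_id, d in sorted(deltas):
        running[entity_id] += deltas[(entity_id, d)]
        if (entity_id, d) in original:  # skip the synthetic "out" dates
            result[(entity_id, d)] = running[entity_id]
    return result

def rolling_sum_brute_force(rows, window_days=30):
    """Reference implementation: sum var over (date - window, date] per id."""
    result = {}
    for entity_id, d, _ in rows:
        start = d - timedelta(days=window_days)
        result[(entity_id, d)] = sum(
            v for i, x, v in rows if i == entity_id and start < x <= d
        )
    return result

# Hypothetical data with gaps between dates.
rows = [
    (1, date(2021, 1, 1), 10),
    (1, date(2021, 1, 20), 5),
    (1, date(2021, 2, 15), 7),
    (2, date(2021, 1, 10), 3),
]
print(rolling_sum_ins_outs(rows))
```

On 2021-02-15 the value from 2021-01-01 has already left the window, so the total for id 1 is 5 + 7 rather than 22. Whether the window should span 30 or 31 calendar days is a business decision; shifting the "out" date by one day changes it.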

get date range between dates

I have the following table tbl in my database, and a dynamic joining date such as 1-1-2012. I want to find which pair of consecutive terms this date falls between: (Fall and Spring), (Spring and Summer), or (Summer and Fall). I want an Oracle query to which I pass only the joining date and which returns the Semestertime and joiningDate.
Semestertime joiningDate
Fall 10-13-2011
Spring 2-1-2012
Summer 6-11-2012
Fall 10-1-2015
If I understand your question correctly:
SELECT *
FROM your_table
WHERE joiningDate between to_date (your_lower_limit_date_here, 'mm-dd-yyyy')
AND to_date (your_upper_limit_date_here, 'mm-dd-yyyy');
What about something like that:
select 'BEFORE' term,
t."Semestertime", to_char(t."joiningDate", 'MM-DD-YYYY')
from (
-- the to_date(...) literal below is your reference date
select tbl.* from tbl where tbl."joiningDate" < to_date('1-1-2012','MM-DD-YYYY')
order by tbl."joiningDate" desc) t
-- rownum is assigned before ORDER BY, so filter on it outside the sorted subquery
where rownum = 1
union all
select 'AFTER' term,
t."Semestertime", to_char(t."joiningDate", 'MM-DD-YYYY')
from (
-- the to_date(...) literal below is your reference date
select tbl.* from tbl where tbl."joiningDate" > to_date('1-1-2012','MM-DD-YYYY')
order by tbl."joiningDate" asc) t
where rownum = 1
This will return the "term" before and after a given date. You will probably have to adapt such query to your specific needs. But that might be a good starting point.
For example, given your business rules, you might consider using <= instead of <. You might also need the result displayed as columns instead of rows. But none of this should be too hard to change.
As an alternate solution using CTE and sub-queries:
with testdata as (select to_date('1-1-2012','MM-DD-YYYY') refdate from dual)
select v.what, tbl.* from tbl join
(
select 'BEFORE' what, max(t1."joiningDate") d
from tbl t1
where t1."joiningDate" < (select refdate from testdata)
union all
select 'AFTER' what, min(t1."joiningDate") d
from tbl t1
where t1."joiningDate" > (select refdate from testdata)
) v
on tbl."joiningDate" = v.d
See http://sqlfiddle.com/#!4/c7fa5/15 for a live demo comparing those solutions.
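The "nearest term before and after a reference date" lookup that both queries perform can also be sketched outside Oracle. The following Python example uses the four rows from the question; the function name and the bisect-based approach are illustrative assumptions, not part of either answer.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Rows from the question's table, sorted by joining date.
terms = [
    (date(2011, 10, 13), "Fall"),
    (date(2012, 2, 1), "Spring"),
    (date(2012, 6, 11), "Summer"),
    (date(2015, 10, 1), "Fall"),
]

def surrounding_terms(ref):
    """Return the (date, term) rows strictly before and strictly after ref."""
    dates = [d for d, _ in terms]
    before_idx = bisect_left(dates, ref) - 1  # last date strictly before ref
    after_idx = bisect_right(dates, ref)      # first date strictly after ref
    before = terms[before_idx] if before_idx >= 0 else None
    after = terms[after_idx] if after_idx < len(terms) else None
    return before, after

before, after = surrounding_terms(date(2012, 1, 1))
print("BEFORE", before)
print("AFTER", after)
```

For 1-1-2012 this yields the Fall row of 10-13-2011 and the Spring row of 2-1-2012. A reference date outside the table's range gives None on the missing side, much as the SQL versions would return no matching row for that side.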