Grouping results by a set of dates in Redshift with two tables - sql

I am trying to count the number of observations I have in an employee database. The tables look more or less like this:
Date_Table

date_dt
2020-09-07
2020-09-14
2020-09-21

Employee_table

login_id  effective_date  is_active
a         2020-09-07      1
a         2020-09-14      1
b         2020-09-07      1
b         2020-09-14      0
c         2020-09-21      1
Keep in mind that effective_date marks the date of some change (attrition, position change, whatever; those are easily filtered). The higher the date, the more recent the change, and the latest row for a login reflects its current status.
In the example above, 2020-09-14 is the date on which login_id b stopped being active in the table.
I want to reflect something like this:

the_date    amount_of_employees
2020-09-07  2
2020-09-14  1
2020-09-21  2
This query works perfectly fine, and provides me the correct number:
SELECT '2020-09-07',COUNT(DISTINCT login_id) amount_of_employees
FROM (SELECT date_dt FROM Date_Table) AS dd,(SELECT *,
ROW_NUMBER() OVER (PARTITION BY login_id ORDER BY effective_date DESC) AS chk
FROM Employee_table WHERE effective_date <= '2020-09-07' ) AS dp
WHERE
dp.is_active =1
AND
dp.chk=1
GROUP BY 1
ORDER BY 1 ASC;
Great! This one works and gives me the right value:

the_date    amount_of_employees
2020-09-07  2
However, when I try to build the full dataset with this query:
SELECT dd.date_dt ,COUNT(DISTINCT login_id) amount_of_employees
FROM (SELECT date_dt FROM Date_Table) AS dd,(SELECT *,
ROW_NUMBER() OVER (PARTITION BY login_id ORDER BY effective_date DESC) AS chk
FROM Employee_table WHERE effective_date <= dd.date_dt ) AS dp
WHERE
dp.is_active =1
AND
dp.chk=1
GROUP BY 1
ORDER BY 1 ASC;
I get this error message:
Invalid operation: subquery in FROM may not refer to other relations of same query level
I looked into something like this:
https://w3coded.com/questions/672056/error-subquery-in-from-cannot-refer-to-other-relations-of-same-query-level
but it didn't work, or doesn't necessarily apply. Maybe I am not getting it.
Any ideas? I would rather not write a lot of UNIONs, although that is a workaround.
Thanks in advance

I'm not familiar with Amazon Redshift, but as long as your query syntax is supported, you can use a correlated subquery to get the count; inside it you can refer to the columns of the outer query, like this:
SELECT
dt.date_dt,
(
SELECT COUNT(DISTINCT login_id)
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY login_id ORDER BY effective_date DESC) AS rn
FROM employee_table et
WHERE et.effective_date <= dt.date_dt
ORDER BY effective_date DESC
) t
WHERE rn = 1 AND is_active = 1
) amount
FROM date_table dt

This is a solution for this:
SELECT dt.date_dt, COUNT(DISTINCT login_id) other_account
FROM Date_Table dt
LEFT JOIN employee_table et ON dt.date_dt BETWEEN et.effective_date AND et.effective_date + (some additional interval)
WHERE et.is_active = 1 (And other where clauses)
GROUP BY 1
Thanks for all your support
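As a rough, runnable sketch of the per-date count, here is one way to get the expected output with a plain join instead of a correlated FROM subquery. This is a swapped-in technique, not either of the queries above: each date is joined to every employee row effective on or before it, and ROW_NUMBER() is partitioned by both date and login. It uses Python's built-in sqlite3 purely so the example can be executed; whether it performs well on Redshift is an assumption that would need testing.

```python
import sqlite3

def active_counts():
    """Count active employees per date using the sample data from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Date_Table (date_dt TEXT);
        CREATE TABLE Employee_table (login_id TEXT, effective_date TEXT, is_active INTEGER);
        INSERT INTO Date_Table VALUES ('2020-09-07'), ('2020-09-14'), ('2020-09-21');
        INSERT INTO Employee_table VALUES
            ('a', '2020-09-07', 1), ('a', '2020-09-14', 1),
            ('b', '2020-09-07', 1), ('b', '2020-09-14', 0),
            ('c', '2020-09-21', 1);
    """)
    rows = conn.execute("""
        SELECT date_dt, COUNT(*) AS amount_of_employees
        FROM (
            -- For every date, rank each login's rows that are effective
            -- on or before that date, most recent first.
            SELECT dd.date_dt, et.login_id, et.is_active,
                   ROW_NUMBER() OVER (
                       PARTITION BY dd.date_dt, et.login_id
                       ORDER BY et.effective_date DESC
                   ) AS chk
            FROM Date_Table AS dd
            JOIN Employee_table AS et
              ON et.effective_date <= dd.date_dt
        ) AS latest
        WHERE chk = 1 AND is_active = 1   -- keep the latest status, if active
        GROUP BY date_dt
        ORDER BY date_dt
    """).fetchall()
    conn.close()
    return rows

print(active_counts())
```

Note that a date with no active employees would simply be absent from this result; a LEFT JOIN from Date_Table would be needed to report it as 0.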

Related

How can I group rows in SQL based on a condition

I am using Redshift SQL and would like to group users who have overlapping voucher periods into a single row (showing the minimum start date and maximum end date).
For example, if I have these records,
I would like to achieve this result using Redshift.
The explanation is that since row 1 and row 2 have overlapping dates, I would like to combine them and take the min(Start_date) and max(End_Date).
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date before the current date.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
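To see the cumulative-max approach in action, here is a small sketch run through Python's sqlite3. The sample rows are hypothetical (the records from the question are not available), and the voucher_code tie-breaker is dropped because the made-up table does not have that column.

```python
import sqlite3

def merge_overlapping_periods():
    """Merge overlapping (start_date, end_date) periods per id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER, start_date TEXT, end_date TEXT);
        -- Hypothetical data: the first two rows of id 1 overlap.
        INSERT INTO users VALUES
            (1, '2020-01-01', '2020-01-10'),
            (1, '2020-01-05', '2020-01-15'),
            (1, '2020-02-01', '2020-02-10'),
            (2, '2020-03-01', '2020-03-05');
    """)
    rows = conn.execute("""
        SELECT id, MIN(start_date) AS start_date, MAX(end_date) AS end_date
        FROM (
            -- A row starts a new group when no earlier period reaches it.
            SELECT u.*,
                   SUM(CASE WHEN prev_end_date >= start_date THEN 0 ELSE 1 END)
                       OVER (PARTITION BY id ORDER BY start_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
            FROM (
                -- Largest end_date among all earlier rows of the same id.
                SELECT users.*,
                       MAX(end_date) OVER (PARTITION BY id ORDER BY start_date
                             ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_end_date
                FROM users
            ) AS u
        ) AS g
        GROUP BY id, grp
        ORDER BY id, MIN(start_date)
    """).fetchall()
    conn.close()
    return rows

print(merge_overlapping_periods())
```

For the first row of each id, prev_end_date is NULL, so the CASE falls through to 1 and that row opens the first group.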
Another approach would be using recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date AS start_date, MAX(end_date) AS end_date
FROM Recursion
GROUP BY id, group_start_date
ORDER BY id, group_start_date

Oracle SQL LAG() function results in duplicate rows

I have a very simple query that results in two rows:
SELECT DISTINCT
id,
trunc(start_date) start_date
FROM example.table
WHERE ID = 1
This results in the following rows:
id start_date
1 7/1/2012
1 9/1/2016
I want to add a column that simply shows the previous date for each row. So I'm using the following:
SELECT DISTINCT id,
Trunc(start_date) start_date,
Lag(start_date, 1)
over (
ORDER BY start_date) pdate
FROM example.table
WHERE id = 1
However, when I do this, I get four rows instead of two:
id start_date pdate
1 7/1/2012 NULL
1 7/1/2012 7/1/2012
1 9/1/2016 7/1/2012
1 9/1/2016 9/1/2012
If I change the offset to 2 or 3 the results remain the same. If I change the offset to 0, I get two rows again but of course now the start_date == pdate.
I can't figure out what's going on.
Use an explicit GROUP BY instead:
SELECT id, trunc(start_date) as start_date,
LAG(trunc(start_date)) OVER (PARTITION BY id ORDER BY trunc(start_date))
FROM example.table
WHERE ID = 1
GROUP BY id, trunc(start_date)
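The reason is the order of evaluation of a SQL statement: window functions such as LAG are computed before DISTINCT is applied. LAG therefore sees every underlying row with its original, untruncated start_date, and DISTINCT then only removes rows that are identical in all three columns.
You actually want LAG to run after the DISTINCT, so the query should be: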
WITH t1 AS (
SELECT DISTINCT id, trunc(start_date) start_date
FROM example.table
WHERE ID = 1
)
SELECT *, LAG(start_date, 1) OVER (ORDER BY start_date) pdate
FROM t1
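The difference between the two evaluation orders can be reproduced with a small sqlite3 sketch. The table name example_table and its four rows are hypothetical; the assumption is that the real table holds several rows per day with different times, which would explain the four output rows. SQLite's date() stands in for Oracle's trunc().

```python
import sqlite3

def lag_before_and_after_distinct():
    """Compare LAG applied before DISTINCT with LAG applied after it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE example_table (id INTEGER, start_date TEXT);
        -- Hypothetical data: two timestamps on each of two days.
        INSERT INTO example_table VALUES
            (1, '2012-07-01 08:00:00'), (1, '2012-07-01 17:30:00'),
            (1, '2016-09-01 09:00:00'), (1, '2016-09-01 12:00:00');
    """)
    # LAG runs over all four raw rows; DISTINCT cannot collapse them
    # afterwards because the pdate values differ.
    lag_first = conn.execute("""
        SELECT DISTINCT id,
               date(start_date) AS start_day,
               LAG(start_date, 1) OVER (ORDER BY start_date) AS pdate
        FROM example_table
        WHERE id = 1
    """).fetchall()
    # De-duplicate first, then compute LAG over the two remaining rows.
    distinct_first = conn.execute("""
        WITH t1 AS (
            SELECT DISTINCT id, date(start_date) AS start_day
            FROM example_table
            WHERE id = 1
        )
        SELECT id, start_day,
               LAG(start_day, 1) OVER (ORDER BY start_day) AS pdate
        FROM t1
        ORDER BY start_day
    """).fetchall()
    conn.close()
    return lag_first, distinct_first

lag_first, distinct_first = lag_before_and_after_distinct()
print(len(lag_first), distinct_first)
```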

how to get unique row numbers in sql

How do I get only the first row from the result of the query below? I need the latest record for each date, so I partitioned by created_date. But in some places I get the same row number and cannot get the expected output. Please find the query, current output, and expected output below.
What changes do I need to make in order to get the expected output? Thank you.
WITH ctetable
AS (
SELECT created_date BPMDate
,tenor
,row_number() OVER (
PARTITION BY created_date ORDER BY created_date DESC
) rw
FROM table1 a
INNER JOIN table2 b ON a.case_id = b.case_id
AND a.eligible_transaction = 'true'
AND to_date(a.created_date) >= '2020-10-01'
AND to_date(a.created_date) <= '2020-10-05'
AND case_status = 'Completed'
)
SELECT BPMDate
,Tenor
,rw
FROM ctetable
Current output:
date tenor rw
2020-10-05 13:24:15.0 1W 1
2020-10-05 12:15:43.0 1Y 1
2020-10-05 12:15:43.0 1Y 2
2020-10-01 13:30:59.0 1W 1
2020-10-01 13:30:59.0 1W 2
Expected output:
date tenor rw
2020-10-05 13:24:15.0 1W 1
2020-10-01 13:30:59.0 1W 1
Regards,
Viresh
That would be:
with ctetable as (
    select created_date as bpmdate, tenor,
        row_number() over (partition by date(created_date) order by created_date desc) rn
    from table1 a
    inner join table2 b
        on a.case_id = b.case_id
        and a.eligible_transaction = 'true'
        and to_date(a.created_date) >= '2020-10-01'
        and to_date(a.created_date) <= '2020-10-05'
        and case_status = 'Completed'
)
select bpmdate, tenor, rn
from ctetable
where rn = 1
Changes to your original code:
you need to remove the time portion of the date in the partition by clause of the window function. You didn't say which database you are using; I used date(), but the function might be different in your database (trunc() in Oracle, date_trunc() in Postgres, and so on)
the outer query needs to filter on the row number being equal to 1
You seem to want the latest row per day:
select BPMDate, Tenor, rw
from (select t.*,
row_number() over (partition by trunc(bpmdate) order by bpmdate desc) as seqnum
from ctetable t
) t
where seqnum = 1;
Note: I don't know if your database supports trunc(), but that is simply some method for extracting the date from the column.
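A minimal sketch of the "one row per calendar day" idea, run through Python's sqlite3. The single table named cases is hypothetical and stands in for the joined table1/table2 result; its rows are copied from the current output in the question, and SQLite's date() plays the role of whatever date-truncation function your database offers.

```python
import sqlite3

def latest_row_per_day():
    """Return the most recent row for each calendar day."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cases (created_date TEXT, tenor TEXT);
        INSERT INTO cases VALUES
            ('2020-10-05 13:24:15', '1W'),
            ('2020-10-05 12:15:43', '1Y'),
            ('2020-10-05 12:15:43', '1Y'),
            ('2020-10-01 13:30:59', '1W'),
            ('2020-10-01 13:30:59', '1W');
    """)
    rows = conn.execute("""
        WITH ranked AS (
            SELECT created_date AS bpmdate, tenor,
                   -- Partition by the day only, so every timestamp on the
                   -- same day competes for row number 1.
                   ROW_NUMBER() OVER (
                       PARTITION BY date(created_date)
                       ORDER BY created_date DESC
                   ) AS rn
            FROM cases
        )
        SELECT bpmdate, tenor, rn
        FROM ranked
        WHERE rn = 1
        ORDER BY bpmdate DESC
    """).fetchall()
    conn.close()
    return rows

print(latest_row_per_day())
```

Partitioning by the full timestamp, as in the original query, puts each distinct timestamp in its own partition, which is why several rows received row number 1.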

Add columns to SQL query and filter by min(date) and sum(price)

I am trying to generate a list of users whose first purchase was in December 2018 and who have spent over 100 dollars since then, in SQL. I'm able to generate the list of users, but I'm unable to determine what their first purchase was or to include other variables. The issue appears to be that the columns I'm trying to include are neither grouped nor aggregated. I'm hoping someone can point me in the right direction, as I'm new to SQL.
Here's my code to generate the list I want to add more columns to:
select billing_address.name, contact_email, min(processed_at) as First_Purchase_Date, sum(total_price) as Total_Revenue
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
group by contact_email, billing_address.name
having min(processed_at) between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC' and sum(total_price) > 100
order by sum(total_price) desc
Is there some way I can modify this to pull each user's purchase from this list into a separate row and include more columns? So I'd pull in each user (and ALL of their purchases) who has a min(processed_at) in December 2018 AND their sum(total_price) > 100? something like this:
SELECT contact_email, billing_address, line_items, min(processed_at), sum(total_price) OVER (PARTITION BY contact_email)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY id) AS instance
FROM `table.orders`
) orders -- identify duplicate rows
WHERE instance = 1
However, the sum(total_price) doesn't work in this case and I can't filter by min(processed_at). Can someone guide me in the right direction?
I think you should use window functions instead of aggregation. You can compute the date of the first purchase and the total amount spent on the fly in a subquery, without aggregating (your original group by columns become the partition columns of the window functions). Then you can use this information to filter in the outer query.
This should get you close to what you want:
select o.*
from (
select
o.*,
min(processed_at) over(partition by contact_email, billing_address) min_processed_at,
sum(total_price) over(partition by contact_email, billing_address) sum_total_price
from (
select
o.*,
row_number() over(partition by id) instance
from orders o
) o
where instance = 1
) o
where
min_processed_at between '2019-01-01 00:00:00 UTC' and '2019-02-01 00:00:00 UTC'
and sum_total_price > 100
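The window-function filter can be tried on a toy table with Python's sqlite3. Everything about the data is hypothetical (column set, emails, amounts), the de-duplication step is omitted, and the date range follows the December 2018 wording of the question rather than the literals in the queries.

```python
import sqlite3

def orders_of_qualifying_users():
    """All orders of users whose first purchase was in Dec 2018 and who spent > 100."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER, contact_email TEXT,
                             processed_at TEXT, total_price INTEGER);
        -- Hypothetical data: only alice meets both criteria.
        INSERT INTO orders VALUES
            (1, 'alice@example.com', '2018-12-05 10:00:00', 60),
            (2, 'alice@example.com', '2019-01-10 09:00:00', 70),
            (3, 'bob@example.com',   '2018-11-20 12:00:00', 200),
            (4, 'carol@example.com', '2018-12-15 15:00:00', 40);
    """)
    rows = conn.execute("""
        SELECT id, contact_email, processed_at, total_price
        FROM (
            -- Per-user values are attached to every order row,
            -- so no GROUP BY is needed and no rows are collapsed.
            SELECT o.*,
                   MIN(processed_at) OVER (PARTITION BY contact_email) AS first_purchase,
                   SUM(total_price)  OVER (PARTITION BY contact_email) AS total_spent
            FROM orders AS o
        ) AS w
        WHERE first_purchase >= '2018-12-01'
          AND first_purchase <  '2019-01-01'
          AND total_spent > 100
        ORDER BY processed_at
    """).fetchall()
    conn.close()
    return rows

print(orders_of_qualifying_users())
```

The filter is applied to the windowed first_purchase value rather than to each row's own processed_at, so later orders of a qualifying user are kept as well.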
Your question was a bit unclear as you did not provide much detail about your input tables or your expected output, so this is a guess.
The following query gets all transactions from users who meet the criteria:
-- BigQuery StandardSQL
with ordered_orders as (
--rank each ID by processed_at date first to last
select *, row_number() over(partition by id order by processed_at asc) as rn
from `table.orders`
),
first_criteria as (
-- select IDs where first processed_at date is in 2018-12
select id, processed_at as first_order_date
from ordered_orders
where rn = 1
and extract(year from processed_at) = 2018
and extract(month from processed_at) = 12
),
second_criteria as (
-- further select IDs who meet first criteria and have a total of > 100
select id, sum(total_price) as total_revenue
from ordered_orders
inner join first_criteria using(id)
group by id
having total_revenue > 100
),
orders_with_criteria as (
-- get all orders for users who meet both criteria
select ordered_orders.* except(rn), first_order_date, total_revenue
from ordered_orders
inner join first_criteria using(id)
inner join second_criteria using(id)
)
-- select any fields you want
select * from orders_with_criteria
I prefer liberal use of CTEs in cases like this to keep the logic clear.
I also wouldn't be surprised if this query doesn't work as you intend. I think it is highly doubtful that the ID column in your orders table refers to the customer id, which is what you/we are partitioning on. Depending on who set up your tables, id probably refers to the order id. If you have a customer_id (or account #, etc), then I would use that instead of id in the query.
No need to use row_number() in BigQuery for this:
SELECT billing_address.name, contact_email,
MIN(processed_at) as First_Purchase_Date,
SUM(total_price) as Total_Revenue,
ARRAY_AGG(o ORDER BY processed_at LIMIT 1) as first_order
FROM `table.orders` o
WHERE instance = 1
GROUP BY contact_email, billing_address.name
HAVING MIN(processed_at) >= '2019-01-01 00:00:00 UTC' AND
MIN(processed_at) < '2019-02-01 00:00:00 UTC' AND
SUM(total_price) > 100
ORDER BY SUM(total_price) desc;
This returns the entire first order as a struct. You can select specific columns, if you prefer.

Cumulative distinct count

I'm having trouble getting a cumulative distinct count so let's just assume the below dataset.
DATE RID
1/1/18 1
1/1/18 2
1/1/18 3
1/1/18 3
So if we run this query
SELECT DATE, COUNT(DISTINCT RID) FROM TABLE GROUP BY DATE;
we would expect it to return 3, however let's assume that the data for the next day is as follows.
DATE RID
1/2/18 1
1/2/18 6
1/2/18 9
How would you write a query to get the following results where the data for 1/1/18 is considered when returning the distinct for 1/2/18.
So it would be the following results.
Date Count(*)
1/1/18 3
1/2/18 5 <- 1/1/18 distinct plus + 1/2 distinct.
Hope that makes sense, keep in mind this is a very large dataset if that changes things.
You can do a cumulative count of the earliest date for each rid:
select mindate, count(*), sum(count(*)) over (order by mindate)
from (select rid, min(date) as mindate
from t
group by rid
) t
group by mindate
order by mindate;
Note: This will be missing any date that is not the mindate of some rid. Here is one way to get all the dates, if that is an issue:
select mindate, count(rid), sum(count(rid)) over (order by mindate)
from ((select rid, min(date) as mindate
from t
group by rid
)
union all
(select distinct NULL, date
from t
)
) rd
group by mindate
order by mindate;
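Here is a small sqlite3 sketch of the "earliest date per rid" idea, using the sample rows from the question with ISO-formatted dates. The table name t and column name dt are assumptions, and the running total is computed in an extra nesting level instead of sum(count(*)) over (...), which should be equivalent but is easier to port.

```python
import sqlite3

def cumulative_distinct_counts():
    """Cumulative count of distinct rid values by date."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t (dt TEXT, rid INTEGER);
        INSERT INTO t VALUES
            ('2018-01-01', 1), ('2018-01-01', 2), ('2018-01-01', 3), ('2018-01-01', 3),
            ('2018-01-02', 1), ('2018-01-02', 6), ('2018-01-02', 9);
    """)
    rows = conn.execute("""
        SELECT mindate,
               SUM(new_rids) OVER (ORDER BY mindate) AS cumulative_distinct
        FROM (
            -- How many rids appear for the first time on each date.
            SELECT mindate, COUNT(*) AS new_rids
            FROM (
                -- Each rid is counted once, on its earliest date.
                SELECT rid, MIN(dt) AS mindate
                FROM t
                GROUP BY rid
            ) AS first_seen
            GROUP BY mindate
        ) AS per_day
        ORDER BY mindate
    """).fetchall()
    conn.close()
    return rows

print(cumulative_distinct_counts())
```

Because each rid is reduced to a single row before counting, the windowed sum never double-counts a rid that reappears on a later date.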
The query below gives the required cumulative distinct count.
--Step 3:
SELECT dt,
cum_distinct_cnt
FROM (
--Step 2:
SELECT rid,
dt,
COUNT(CASE WHEN row_num = 1 THEN rid END) OVER (ORDER BY dt ROWS BETWEEN Unbounded PRECEDING AND CURRENT ROW) cum_distinct_cnt
FROM (
--Step 1:
SELECT rid,
dt,
ROW_NUMBER() OVER (PARTITION BY rid ORDER BY dt) row_num
FROM table) innerTab1
) innerTab2
QUALIFY ROW_NUMBER() OVER (PARTITION BY dt ORDER BY cum_distinct_cnt DESC) = 1
Since your dataset is very large, you can break the above query into the steps marked in its comments and create work tables to populate innerTab1 and innerTab2 before producing the final output.