Comparing dates in same column based on a condition in SQL / Hive

I have a table with the below schema.
Each person_id can have multiple codes (A, B, C, D, etc.) associated with it. For each person_id with code 'A', compare the date of code 'A' to the dates of all other codes that person has, and keep only the rows dated within 6 months of the code 'A' date.
Take the first person_id, 30038590555, as an example: I want to check whether the dates of codes B and C are within 6 months of the date of A. Since both are more than 6 months away, they should be filtered out.
person_id code Date
30038590555 B 5/16/2017
30038590555 C 1/9/2019
30038590555 A 1/25/2020
37057397055 A 3/21/2020
38438355555 A 1/25/2020
59385393355 C 7/22/2014
59385393355 A 2/22/2020
44384037555 A 12/21/2019
49384037555 A 3/21/2020
50573409355 D 4/5/2016
50573409355 A 4/6/2016
50573409355 F 4/7/2016
50573409355 G 3/2/2017
50573409355 B 3/7/2017

This interprets "within 6 months" as "within 6 months after". The solution can be adapted if it really means 6 months before or after.
If I understand correctly, you want to keep all "A"s and then all others that are within six months of an A. Use a conditional running max:
select t.*
from (select t.*,
max(case when code = 'A' then date end) over (partition by person_id order by date) as prev_a_date
from t
) t
where code = 'A' or prev_a_date > add_months(date, -6)
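The conditional running max above can be tried out with a small sketch. This is not the Hive query itself: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later for window functions), a column renamed to dt, dates rewritten as ISO strings, and SQLite's date(dt, '-6 months') standing in for add_months(date, -6).

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings
# so that string comparison matches date order (an assumption of this sketch).
rows = [
    ("30038590555", "B", "2017-05-16"),
    ("30038590555", "C", "2019-01-09"),
    ("30038590555", "A", "2020-01-25"),
    ("50573409355", "D", "2016-04-05"),
    ("50573409355", "A", "2016-04-06"),
    ("50573409355", "F", "2016-04-07"),
    ("50573409355", "G", "2017-03-02"),
    ("50573409355", "B", "2017-03-07"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (person_id TEXT, code TEXT, dt TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# prev_a_date is the latest 'A' date seen so far for the person (NULL before the first 'A').
kept = conn.execute("""
    SELECT person_id, code, dt
    FROM (SELECT t.*,
                 MAX(CASE WHEN code = 'A' THEN dt END)
                     OVER (PARTITION BY person_id ORDER BY dt) AS prev_a_date
          FROM t) AS x
    WHERE code = 'A' OR prev_a_date > date(dt, '-6 months')
    ORDER BY person_id, dt
""").fetchall()

for row in kept:
    print(row)
```

With this sample, only the two 'A' rows and code F (one day after an 'A') survive; rows dated before any 'A' are dropped because prev_a_date is NULL for them.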

Related

SQL - Get historic count of rows collected within a certain period by date

For many years I've been collecting data and I'm interested in knowing the historic counts of IDs that appeared in the last 30 days. The source looks like this
id dates
1 2002-01-01
2 2002-01-01
3 2002-01-01
... ...
3 2023-01-10
If I wanted to know the historic count of ids that appeared in the last 30 days I would do something like this
with total_counter as (
select id, count(id) counts
from source
group by id
),
unique_obs as (
select id
from source
where dates >= DATEADD(Day ,-30, current_date)
group by id
)
select count(distinct(id))
from unique_obs
left join total_counter
on total_counter.id = unique_obs.id;
The problem is that this returns a single result: today's count, as determined by current_date.
I would like to see a table with such counts as if, for example, I had run this analysis yesterday, the day before, and so on. So the expected result would be something like
counts date
1235 2023-01-10
1234 2023-01-09
1265 2023-01-08
... ...
7383 2022-12-11
so for example, let's say that if the current_date was 2023-01-10, my query would've returned 1235.
If you need a distinct count of ids from the 30 days up to and including each date, the below should work:
WITH CTE_DATES
AS
(
--Create a list of anchor dates
SELECT DISTINCT
dates
FROM source
)
SELECT COUNT(DISTINCT s.id) AS "counts"
,D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY,-29,D.dates) AND D.dates --30 DAYS INCLUSIVE
GROUP BY D.dates
ORDER BY D.dates DESC
;
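As a rough check of the date-range self-join above, here is a minimal sketch on made-up data. It assumes Python's sqlite3 module, ISO date strings, and SQLite's date(d.dates, '-29 days') in place of DATEADD(DAY, -29, D.dates); the sample rows and the anchor_date alias are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, dates TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [
        (1, "2023-01-01"),
        (2, "2023-01-01"),
        (1, "2023-01-15"),
        (3, "2023-01-31"),
        (3, "2023-02-20"),
    ],
)

# For every date present in the table, count the distinct ids seen
# in the 30 days up to and including that date.
counts = conn.execute("""
    WITH cte_dates AS (SELECT DISTINCT dates FROM source)
    SELECT COUNT(DISTINCT s.id) AS counts, d.dates AS anchor_date
    FROM cte_dates d
    LEFT JOIN source s
           ON s.dates BETWEEN date(d.dates, '-29 days') AND d.dates
    GROUP BY d.dates
    ORDER BY d.dates DESC
""").fetchall()

for row in counts:
    print(row)
```

On 2023-02-20 only id 3 falls inside the window, so the count drops to 1 even though three ids exist in the table overall.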
If the distinct count didn't matter, you could likely simplify with a rolling sum, only hitting the source table once:
SELECT S.dates AS "date"
,COUNT(1) AS "count_daily"
,SUM("count_daily") OVER(ORDER BY S.dates DESC ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling" --assumes there is at least one row for every day.
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC;
This won't work, though, if you have gaps in your list of dates, as it'll just include the latest 30 days available. In that case, the first example without DISTINCT in the count will do the trick.
SELECT count(*) AS Counts,
dates AS Date
FROM source
WHERE dates >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY dates
ORDER BY dates DESC

Postgresql Sum statistics by month for a single year

I have the following table in the database:
| item_id (integer) | item_name (character varying) | price (double precision) | user_id (integer) | category_id (integer) | date (date) |
| 1 | Pizza | 2.99 | 1 | 2 | '2020-01-01' |
| 2 | Cinema | 5 | 1 | 3 | '2020-01-01' |
| 3 | Cheeseburger | 4.99 | 1 | 2 | '2020-01-01' |
| 4 | Rental | 100 | 1 | 1 | '2020-01-01' |
Now I want to get the statistics for the total price for each month in a year. It should include all items as well as a single category both for all the time and specified time period. For example, using this
SELECT EXTRACT(MONTH from date),COALESCE(SUM(price), 0)
FROM item_table
WHERE user_id = 1 AND category_id = 3 AND date BETWEEN '2020-01-01'AND '2021-01-01'
GROUP By date_part
ORDER BY date_part;
I expect to obtain this:
date_part total
1 5
2 0
3 0
... ...
12 0
However, I get this:
date_part total
1 5
1) How can I get zero value for a case when no items for a specified category are found? (now it just skips the month)
2) The above example gives the statistics for the selected category within some time period. For my purposes I need to write 3 more queries (all time and all categories / all time and a single category / a single year and all categories). Is there a single query that covers all these cases (when some parameters like category_id or date are null)?
You can get the "empty" months by doing a right join against a table that contains the month numbers and moving the WHERE criteria into the JOIN criteria:
-- Create a temporary "table" for month numbers
WITH months AS (SELECT * FROM generate_series(1, 12) AS t(n))
SELECT months.n as date_part, COALESCE(SUM(price), 0) AS total
FROM item_table
RIGHT JOIN months ON EXTRACT(MONTH from "date") = months.n
AND user_id = 1 AND category_id = 3 AND "date" BETWEEN '2020-01-01' AND '2021-01-01'
GROUP BY months.n
ORDER BY months.n;
I'm not quite sure what you want from your second part, but you could take a look at Grouping Sets, Cube and Rollup.
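The month-table join described above can be sketched on the question's four sample rows. This is not the PostgreSQL query as written: it assumes Python's sqlite3 module, so a recursive CTE stands in for generate_series(1, 12), strftime('%m', ...) for EXTRACT(MONTH ...), and the join is written as a LEFT JOIN starting from months, which is equivalent to the RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE item_table (item_id INTEGER, item_name TEXT, price REAL, '
    'user_id INTEGER, category_id INTEGER, "date" TEXT)'
)
conn.executemany(
    "INSERT INTO item_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Pizza", 2.99, 1, 2, "2020-01-01"),
        (2, "Cinema", 5, 1, 3, "2020-01-01"),
        (3, "Cheeseburger", 4.99, 1, 2, "2020-01-01"),
        (4, "Rental", 100, 1, 1, "2020-01-01"),
    ],
)

# The filters live in the ON clause, so months without matching items
# still produce a row, with the sum coalesced to 0.
totals = conn.execute("""
    WITH RECURSIVE months(n) AS (
        SELECT 1 UNION ALL SELECT n + 1 FROM months WHERE n < 12
    )
    SELECT months.n AS month, COALESCE(SUM(i.price), 0) AS total
    FROM months
    LEFT JOIN item_table i
           ON CAST(strftime('%m', i."date") AS INTEGER) = months.n
          AND i.user_id = 1 AND i.category_id = 3
          AND i."date" BETWEEN '2020-01-01' AND '2021-01-01'
    GROUP BY months.n
    ORDER BY months.n
""").fetchall()

for row in totals:
    print(row)
```

Moving the same conditions into a WHERE clause would discard the empty months again, which is the behaviour the question started with.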

How to get dates to filter into specific columns, without duplicating rows. Possible without Union?

I'm creating results that filter dates into two columns (we'll call them 30 day and 60 day). Which column gets filled depends on another column. I'm trying to get these columns to fill without having to union (if it's even possible).
Let's say that you get a letter on day 30 and a different letter on day 60. I want the results to look like this
| User | 30 Day Letter Date | 60 Day Letter Date |
| TIM | 2021-02-01 | 2021-03-03 |
Currently, the query is taking the above and splitting it into 2 separate rows where the 30/60 day column will be NULL when applicable and filling in the correct date. In order to get that I'm running the following
Select c.name as User,
CASE WHEN letter_id = '123' then l.created ELSE NULL End as '30 day letter',
CASE WHEN letter_id = '456' then l.created ELSE NULL End as '60 day letter'
From customer c
inner join letter l on c.id=l.customer_id
where letter_id IN ('123','456')
Again, I can get results with 2 rows but just wondering if it's possible to get 1 row without a union but still have items filter to the right place.
I think you just want aggregation:
Select c.name as User,
max(CASE WHEN l.letter_id = '123' then l.created End) as letter_30_day,
max(CASE WHEN l.letter_id = '456' then l.created End) as letter_60_day
From customer c inner join
letter l
on c.id = l.customer_id
where letter_id IN ('123', '456')
group by c.name;
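To see the aggregation collapse the two rows into one, here is a minimal sketch. It assumes Python's sqlite3 module and a made-up customer with id 1; the table layouts are guessed from the queries above, and the dates come from the expected output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE letter (customer_id INTEGER, letter_id TEXT, created TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'TIM')")
conn.executemany(
    "INSERT INTO letter VALUES (?, ?, ?)",
    [(1, "123", "2021-02-01"), (1, "456", "2021-03-03")],
)

# Each CASE yields NULL for the other letter type; MAX ignores NULLs,
# so grouping by customer leaves one row with both dates filled in.
pivoted = conn.execute("""
    SELECT c.name,
           MAX(CASE WHEN l.letter_id = '123' THEN l.created END) AS letter_30_day,
           MAX(CASE WHEN l.letter_id = '456' THEN l.created END) AS letter_60_day
    FROM customer c
    JOIN letter l ON c.id = l.customer_id
    WHERE l.letter_id IN ('123', '456')
    GROUP BY c.name
""").fetchall()

print(pivoted)
```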

Expected payments by day given start and end date

I'm trying to create a SQL view that gives me the expected amount to be received by calendar day for recurring transactions. I have a table containing recurring commitments data, with the following columns:
id,
start_date,
end_date (null if still active),
payment day (1,2,3,etc.),
frequency (monthly, quarterly, semi-annually, annually),
commitment amount
For now, I do not need to worry about business days vs calendar days.
In its simplest form, the end result would contain every historical calendar day as well as future dates for the next year, and produce how much was/is expected to be received in those particular days.
I've done quite a bit of researching, but cannot seem to find an answer that addresses the specific problem. Any direction on where to start would be greatly appreciated.
The expected output would look something like this:
| Date | Expected Amount |
|1/1/18 | 100 |
|1/2/18 | 200 |
|1/3/18 | 150 |
Thank you ahead of time!
It's something like this, but I've never used Netezza:
SELECT
cal.d, sum(r.amount) as expected_amount
FROM
(
SELECT MIN(a.start_date) + ROW_NUMBER() OVER(ORDER BY NULL) as d
FROM recurring a, recurring b, recurring c
) cal
LEFT JOIN
recurring r
ON
(
(r.frequency = 'monthly' AND r.payment_day = DATE_PART('DAY', cal.d)) OR
(r.frequency = 'annually' AND DATE_PART('MONTH', cal.d) = DATE_PART('MONTH', r.start_date) AND r.payment_day = DATE_PART('DAY', cal.d))
) AND
r.start_date <= cal.d AND
(r.end_date >= cal.d OR r.end_date IS NULL)
GROUP BY cal.d
In essence, we cartesian join our recurring table together a few times to generate a load of rows, number them and add the number onto the min date to get an incrementing date series.
The payments data table is left joined onto this incrementing date series on:
(the day of the date from the series) = (the payment day) for monthlies
(the month and day of the date from the series) = (the month of the start_date and the payment day) for annuals
Finally, the whole lot is grouped and summed.
I don't have a test instance of Netezza so if you encounter some minor syntax errors, do please have a stab at fixing them up yourself (to make it faster for you to get a solution). If you reach a point where you can't work out what the query is doing, let me know
Disclaimer: I'm no expert on Netezza, so I decided to write you a standard SQL that may need some tweaking to run on Netezza.
with
digit as (select 0 as x union select 1 union select 2 union select 3 union select 4
union select 5 union select 6 union select 7 union select 8 union select 9
),
number as ( -- produces numbers from 0 to 9999 (about 27 years of days)
select d1.x + d2.x * 10 + d3.x * 100 + d4.x * 1000 as n
from digit d1
cross join digit d2
cross join digit d3
cross join digit d4
),
expected_payment as ( -- expands all expected payments
select
c.start_date + nb.n as day,
c.committed_amount
from recurring_commitement c
cross join number nb
where c.start_date + nb.n <= c.end_date
and c.frequency ... -- add logic for monthly, quarterly, etc. here
)
select
day,
sum(committed_amount) as expected_amount
from expected_payment
group by day
order by day
This solution is valid for commitments that do not exceed roughly 27 years, since the number CTE (Common Table Expression) produces at most 10,000 day offsets (0 to 9999). Expand with a fifth digit if you need longer commitments.
Note: I think the way I'm adding days to a date is not correct in Netezza's SQL. The expression c.start_date + nb.n may need to be rephrased.
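The digit cross join can be checked in isolation. The sketch below is not Netezza: it assumes Python's sqlite3 module, and the way an offset is added to a date (SQLite's date(..., '+N days')) as well as the start date 2018-01-01 are illustrative assumptions. It only covers the number generation and date stepping, not the frequency logic left open above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Shared CTE text: ten digits cross joined four times give 0..9999.
NUMBER_CTE = """
    WITH digit(x) AS (
        SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
        UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ),
    number(n) AS (
        SELECT d1.x + d2.x * 10 + d3.x * 100 + d4.x * 1000
        FROM digit d1
        CROSS JOIN digit d2
        CROSS JOIN digit d3
        CROSS JOIN digit d4
    )
"""

n_min, n_max, n_count = conn.execute(
    NUMBER_CTE + "SELECT MIN(n), MAX(n), COUNT(*) FROM number"
).fetchone()

# How many years of daily offsets the number table covers.
years_covered = n_max / 365.25

# Turn the first few offsets into calendar days from a hypothetical start date.
first_days = [
    row[0]
    for row in conn.execute(
        NUMBER_CTE
        + "SELECT date('2018-01-01', '+' || n || ' days') FROM number WHERE n < 3 ORDER BY n"
    )
]

print(n_min, n_max, n_count, round(years_covered, 1), first_days)
```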

Assign a Y/N flag based last 12 month activity

I'm working with a list of hospital patients and would like to flag each patient account with a "Y" if they were seen in the hospital nine or more times over the past 12 months.
I've come up with this, which would work fine if the patient list were static and only included a 12 month period:
SELECT
ENC.HSP_ACCOUNT_ID,
ENC.PAT_MRN_ID,
ENC.ADT_ARRIVAL_DTTM,
case when count(distinct txn.hsp_account_id) over(partition by PAT.PAT_MRN_ID) >= 9 then 'Y' else 'N' end as familiar_face_yn
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN '1-JUL-17' AND '31-OCT-18'
But I'd like to query the prior two years worth of data but only use the 12 months prior to the arrival date (ENC.ADT_ARRIVAL_DTTM) in calculating the Y or N.
The problem I'm running into with the above query is that it's going back and counting all visits by a particular patient between 7/1/17 and 10/31/18.
What I'd like is that if the arrival date for a record is 8/1/18, it should count all visits between 8/1/17 and 8/1/18, ignoring anything with an arrival date earlier than 8/1/17 or later than 8/1/18.
Is this sort of "rolling" calculation possible? Many thanks!
You can use a windowing clause:
SELECT ENC.HSP_ACCOUNT_ID, ENC.PAT_MRN_ID, ENC.ADT_ARRIVAL_DTTM,
(CASE WHEN COUNT(DISTINCT txn.hsp_account_id) OVER
(PARTITION BY PAT.PAT_MRN_ID
ORDER BY ENC.SERVICE_DATE
RANGE BETWEEN 365 PRECEDING AND CURRENT ROW
) >= 9
THEN 'Y' ELSE 'N'
END) as familiar_face_yn
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN DATE '2017-07-01' AND DATE '2018-10-31'
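A rolling RANGE window like the one above can be tried on made-up visits. This sketch makes several assumptions: it uses Python's sqlite3 module (SQLite 3.28 or later for RANGE with an offset), orders by julianday(arrival) so that the offset is measured in days, and counts rows with COUNT(*) because SQLite does not accept DISTINCT inside a window function. The enc table and its columns are hypothetical stand-ins for the encounter data.

```python
import sqlite3

# One old visit, then nine visits on the 10th of each month from Jan to Sep 2018.
visits = [("P1", "2017-01-01")] + [("P1", f"2018-{m:02d}-10") for m in range(1, 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enc (pat_mrn_id TEXT, arrival TEXT)")
conn.executemany("INSERT INTO enc VALUES (?, ?)", visits)

# For each visit, count the patient's visits in the 365 days up to and including it.
result = conn.execute("""
    SELECT arrival,
           CASE WHEN COUNT(*) OVER (
                    PARTITION BY pat_mrn_id
                    ORDER BY julianday(arrival)
                    RANGE BETWEEN 365 PRECEDING AND CURRENT ROW
                ) >= 9
                THEN 'Y' ELSE 'N' END AS familiar_face_yn
    FROM enc
    ORDER BY arrival
""").fetchall()

flags = [flag for _, flag in result]
print(flags)
```

Only the last visit reaches nine visits inside its own trailing year, so it is the only row flagged 'Y'; the 2017 visit has already dropped out of every 2018 window.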
with cte as
(
SELECT
ENC.HSP_ACCOUNT_ID,
ENC.PAT_MRN_ID,
ENC.ADT_ARRIVAL_DTTM,
-- find the most recent visit
max(ENC.ADT_ARRIVAL_DTTM) over(partition by PAT.PAT_MRN_ID) as last_date
FROM CLARITY.F_ED_ENCOUNTERS ENC
WHERE ENC.SERVICE_DATE BETWEEN '1-JUL-17' AND '31-OCT-18'
)
select ...
-- count all rows within a 12 month range before the most recent visit
case when count(distinct case when ADT_ARRIVAL_DTTM >= add_months(last_date, -12) then txn.hsp_account_id end)
over (partition by PAT.PAT_MRN_ID) >= 9
then 'Y'
else 'N'
end as familiar_face_yn
from cte
I don't know if you really need the DISTINCT count...