Columns by time in a SQL table - sql

I have a sql table as follows:
+-----------+----------+----------+---------------+
| AccountID | PersonId | DoctorID | Admitdatetime |
+-----------+----------+----------+---------------+
| 1 | 2 | 345 | 20090108 |
| 2 | 3 | 53 | 20090109 |
| 3 | 1 | 234 | 20090110 |
| 4 | 2 | 345 | |
+-----------+----------+----------+---------------+
Each row of this table represents a patient visit, dated by Admitdatetime. Each record is uniquely identified by AccountID.
The date column is an int in yyyymmdd format, not a datetime, so simply subtracting two dates does not give the number of days between them. I just checked.
For each record in the table, I want to add 3 columns: one for the last 3 months, one for the last 6 months, and one for the last 12 months.
The columns are described as follows:
The number of cases a DoctorID has seen in the 3 months before the current record's Admitdatetime. Similarly, the number of cases a DoctorID has seen in the 6 and 12 months before the current record.
I am doing a self join like this:
SELECT a.DoctorID, count(AccountID) FROM
Visits AS a INNER JOIN
Visits AS b ON a.DoctorId = b.DoctorId
WHERE a.admitdatetime - b.admitdatetime <= 90
The query above is for the 3-month case, but I don't think it is right. For each record, I want the number of cases (count of AccountID) the doctor has seen in the 3, 6, and 12 months before that record's admitdatetime. So for a given DoctorID the value varies from record to record, whereas the code above gives only one value per DoctorID. That doesn't seem right.
I think the result should be grouped by DoctorId and AccountId, since each record is identified by AccountID and I need the doctor's counts attached to each record. So I would then join back on DoctorId and AccountId. Does this sound right?

I would suggest correlated subqueries:
select v.*,
(select sum(case when v2.AdmitDate >= v.AdmitDate - interval '3 month'
then 1 else 0
end)
from visits v2
where v2.doctorid = v.doctorid
) as last3,
(select sum(case when v2.AdmitDate >= v.AdmitDate - interval '6 month'
then 1 else 0
end)
from visits v2
where v2.doctorid = v.doctorid
) as last6,
(select sum(case when v2.AdmitDate >= v.AdmitDate - interval '12 month'
then 1 else 0
end)
from visits v2
where v2.doctorid = v.doctorid
) as last12
from visits v;
I should point out that Postgres allows you to simplify this syntax:
(select sum((v2.AdmitDate >= v.AdmitDate - interval '3 month')::int)
from visits v2
where v2.doctorid = v.doctorid
) as last3,
And in more recent versions of Postgres you can use a lateral join to combine the logic into a single subquery.
EDIT:
A reasonable simplification of the query is:
select v.*,
(select count(*)
from visits v2
where v2.doctorid = v.doctorid and v2.AdmitDate >= v.AdmitDate - interval '3 month'
) as last3,
(select count(*)
from visits v2
where v2.doctorid = v.doctorid and v2.AdmitDate >= v.AdmitDate - interval '6 month'
) as last6,
(select count(*)
from visits v2
where v2.doctorid = v.doctorid and v2.AdmitDate >= v.AdmitDate - interval '12 month'
) as last12
from visits v;
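The queries above assume a date-typed AdmitDate column, while the question stores the date as a yyyymmdd int. Here is a minimal sketch of the same correlated-subquery idea, run against SQLite from Python so it is self-contained. It first converts the int to an ISO date string, and it also caps the window at the current visit so later visits are not counted. The rows for AccountID 4 and 5 use made-up dates, the helper name window_count is hypothetical, and the SQLite date() syntax is a stand-in for the Postgres interval arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (AccountID INTEGER, PersonId INTEGER,"
    " DoctorID INTEGER, Admitdatetime INTEGER)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (1, 2, 345, 20090108),
        (2, 3, 53, 20090109),
        (3, 1, 234, 20090110),
        (4, 2, 345, 20090301),  # made-up date: blank in the question
        (5, 9, 345, 20080801),  # made-up extra row for doctor 345
    ],
)


def window_count(months: int) -> str:
    """Correlated subquery: the same doctor's visits in the trailing window."""
    return (
        "(SELECT count(*) FROM v AS v2"
        " WHERE v2.DoctorID = v.DoctorID"
        "   AND v2.d <= v.d"
        f"   AND v2.d >= date(v.d, '-{months} months'))"
    )


query = f"""
WITH v AS (
    -- turn the yyyymmdd integer into an ISO date string (yyyy-mm-dd)
    SELECT AccountID, DoctorID,
           substr(CAST(Admitdatetime AS TEXT), 1, 4) || '-' ||
           substr(CAST(Admitdatetime AS TEXT), 5, 2) || '-' ||
           substr(CAST(Admitdatetime AS TEXT), 7, 2) AS d
    FROM visits
)
SELECT v.AccountID, v.DoctorID,
       {window_count(3)} AS last3,
       {window_count(6)} AS last6,
       {window_count(12)} AS last12
FROM v
ORDER BY v.AccountID
"""
visit_counts = conn.execute(query).fetchall()
for row in visit_counts:
    print(row)
```

Each count includes the current visit itself; subtract 1, or change `<=` to `<`, if only earlier visits should count.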

Related

Count if previous month data exists postgres

I'm stuck on a query that counts, for each month, the ids that also exist in the previous month.
My table looks like this:
date | id |
2020-02-02| 1 |
2020-03-04| 1 |
2020-03-04| 2 |
2020-04-05| 1 |
2020-04-05| 3 |
2020-05-06| 2 |
2020-05-06| 3 |
2020-06-07| 2 |
2020-06-07| 3 |
This is the query I'm stuck on:
SELECT date_trunc('month',date), id
FROM table
WHERE id IN
(SELECT DISTINCT id FROM table WHERE date
BETWEEN date_trunc('month', current_date) - interval '1 month' AND date_trunc('month', current_date))
The main problem is that I'm stuck with the current_date function. Is there a dynamic way to replace current_date? Thanks.
My expected result is:
date | count |
2020-02-01| 0 |
2020-03-01| 1 |
2020-04-01| 1 |
2020-05-01| 1 |
2020-06-01| 2 |
Solution 1 with SELF JOIN
SELECT date_trunc('month', c.date) :: date AS date
, count(DISTINCT c.id) FILTER (WHERE p.date IS NOT NULL)
FROM test AS c
LEFT JOIN test AS p
ON c.id = p.id
AND date_trunc('month', c.date) = date_trunc('month', p.date) + interval '1 month'
GROUP BY date_trunc('month', c.date)
ORDER BY date_trunc('month', c.date)
Result :
date count
2020-02-01 0
2020-03-01 1
2020-04-01 1
2020-05-01 1
2020-06-01 2
Solution 2 with WINDOW FUNCTIONS
SELECT DISTINCT ON (date) date
, count(*) FILTER (WHERE count > 0 AND previous_month) OVER (PARTITION BY date)
FROM
( SELECT DISTINCT ON (id, date_trunc('month', date))
id
, date_trunc('month', date) AS date
, count(*) OVER w AS count
, first_value(date_trunc('month', date)) OVER w = date_trunc('month', date) - interval '1 month' AS previous_month
FROM test
WINDOW w AS (PARTITION BY id ORDER BY date_trunc('month', date) GROUPS BETWEEN 1 PRECEDING AND 1 PRECEDING)
) AS a
Result :
date count
2020-02-01 0
2020-03-01 1
2020-04-01 1
2020-05-01 1
2020-06-01 2
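To check the self-join idea from Solution 1 without a Postgres server, here is a rough SQLite adaptation run from Python. It is a sketch under assumptions: date_trunc('month', ...) is replaced by SQLite's date(..., 'start of month'), the FILTER clause is replaced by a CASE inside count(DISTINCT ...), and the table is named test as in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (date TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [
        ("2020-02-02", 1),
        ("2020-03-04", 1),
        ("2020-03-04", 2),
        ("2020-04-05", 1),
        ("2020-04-05", 3),
        ("2020-05-06", 2),
        ("2020-05-06", 3),
        ("2020-06-07", 2),
        ("2020-06-07", 3),
    ],
)

query = """
SELECT date(c.date, 'start of month') AS month,
       -- count an id only when a matching row exists in the previous month
       count(DISTINCT CASE WHEN p.id IS NOT NULL THEN c.id END) AS cnt
FROM test AS c
LEFT JOIN test AS p
       ON p.id = c.id
      AND date(p.date, 'start of month', '+1 month')
          = date(c.date, 'start of month')
GROUP BY date(c.date, 'start of month')
ORDER BY date(c.date, 'start of month')
"""
monthly = conn.execute(query).fetchall()
for month, cnt in monthly:
    print(month, cnt)
```

On the question's sample data this reproduces the expected counts 0, 1, 1, 1, 2.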

Get count of subscribers for each month in current year even if count is 0

I need to get the count of new subscribers each month of the current year.
DB Structure: Subscriber(subscriber_id, create_timestamp, ...)
Expected result:
date | count
-----------+------
2021-01-01 | 3
2021-02-01 | 12
2021-03-01 | 0
2021-04-01 | 8
2021-05-01 | 0
I wrote the following query:
SELECT
DATE_TRUNC('month',create_timestamp)
AS create_timestamp,
COUNT(subscriber_id) AS count
FROM subscriber
GROUP BY DATE_TRUNC('month',create_timestamp);
This works but does not include months where the count is 0. It only returns the months that exist in the table, like:
"2021-09-01 00:00:00" 3
"2021-08-01 00:00:00" 9
The first subquery generates one row for each month of the year. It is then LEFT JOINed with a second subquery that computes total_count per month. COALESCE() replaces NULL values with 0.
-- PostgreSQL (v11)
SELECT t.cdate
, COALESCE(p.total_count, 0) total_count
FROM (select generate_series('2021-01-01'::timestamp, '2021-12-15', '1 month') as cdate) t
LEFT JOIN (SELECT DATE_TRUNC('month',create_timestamp) create_timestamp
, COUNT(subscriber_id) total_count
FROM subscriber
GROUP BY DATE_TRUNC('month',create_timestamp)) p
ON t.cdate = p.create_timestamp
Please check from url https://dbfiddle.uk/?rdbms=postgres_11&fiddle=20dcf6c1784ed0d9c5772f2487bcc221
To get the count of new subscribers for each month of the current year:
SELECT month::date, COALESCE(s.count, 0) AS count
FROM generate_series(date_trunc('year', LOCALTIMESTAMP)
, date_trunc('year', LOCALTIMESTAMP) + interval '11 month'
, interval '1 month') m(month)
LEFT JOIN (
SELECT date_trunc('month', create_timestamp) AS month
, count(*) AS count
FROM subscriber
GROUP BY 1
) s USING (month);
That's assuming every row is a "new subscriber". So count(*) is simplest and fastest.
See:
Join a count query on generate_series() and retrieve Null values as '0'
Generating time series between two dates in PostgreSQL
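SQLite has no generate_series() by default, but the same "calendar LEFT JOIN counts" pattern can be sketched with a recursive CTE. The example below is illustrative only: the subscriber rows are made up, and the year start is passed as a parameter (year_start, a hypothetical name) instead of using the current date, so that the output is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriber (subscriber_id INTEGER PRIMARY KEY,"
    " create_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO subscriber (create_timestamp) VALUES (?)",
    [
        ("2021-01-05 10:00:00",),  # made-up sample rows
        ("2021-01-20 08:30:00",),
        ("2021-02-03 12:00:00",),
        ("2021-04-11 23:59:59",),
    ],
)

query = """
WITH RECURSIVE months(month) AS (
    SELECT date(:year_start)
    UNION ALL
    SELECT date(month, '+1 month')
    FROM months
    WHERE month < date(:year_start, '+11 months')
)
SELECT m.month, count(s.subscriber_id) AS cnt
FROM months AS m
LEFT JOIN subscriber AS s
       ON date(s.create_timestamp, 'start of month') = m.month
GROUP BY m.month
ORDER BY m.month
"""
per_month = conn.execute(query, {"year_start": "2021-01-01"}).fetchall()
for month, cnt in per_month:
    print(month, cnt)
```

Because count(s.subscriber_id) ignores the NULLs produced by the LEFT JOIN, months without subscribers come out as 0 without needing COALESCE().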

Get occurrences of past 2 weeks on any given date

I have data like
id | date |
-------------
1 | 1.1.20 |
3 | 4.1.20 |
2 | 4.1.20 |
1 | 5.1.20 |
6 | 2.1.20 |
What I would like to get is the number of occurrences a user (by id) had in the past 2 weeks as of any given date, i.e. occurrences between date - 14 days and date. I'm trying to categorize users by their number of sessions in the past 2 weeks, and I'm following them by daily cohorts.
This query does not work, since there can be days when the user does not log in and therefore has no row:
COUNT (distinct id) OVER (PARTITION BY id ORDER BY date ROWS BETWEEN 14 PRECEDING AND 0 FOLLOWING)
Unfortunately, Presto does not support range() window functions. One method is a self-join/aggregation or correlated subquery:
select t.id, t.date, count(tprev.id)
from t left join
t tprev
on tprev.id = t.id and
tprev.date > t.date - interval '14' day and
tprev.date <= t.date
group by t.id, t.date;
This interprets your request as wanting 14 days of data, including the current day.
Another method that is much more verbose but might be faster is to use lag() . . . and lag() again:
select t.id,
(1 + -- current date
(case when lag(date, 1) over (partition by id order by date) > date - interval '14' day then 1 else 0 end) +
(case when lag(date, 2) over (partition by id order by date) > date - interval '14' day then 1 else 0 end) +
. . .
(case when lag(date, 13) over (partition by id order by date) > date - interval '14' day then 1 else 0 end)
) as cnt_14
from t;
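The self-join variant can be tried out on a small data set. The sketch below uses SQLite from Python rather than Presto, so date(t.date, '-14 days') stands in for the interval arithmetic. The rows are made up, and it assumes at most one row per user per day (duplicate rows would inflate the counts through the join).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        (1, "2020-01-01"),  # made-up sample rows, ISO dates
        (1, "2020-01-05"),
        (1, "2020-01-14"),
        (1, "2020-01-15"),
        (1, "2020-02-20"),
        (2, "2020-01-04"),
    ],
)

query = """
SELECT t.id, t.date, count(tprev.id) AS cnt_14
FROM t
LEFT JOIN t AS tprev
       ON tprev.id = t.id
      AND tprev.date > date(t.date, '-14 days')  -- 14 days incl. current day
      AND tprev.date <= t.date
GROUP BY t.id, t.date
ORDER BY t.id, t.date
"""
sessions_14d = conn.execute(query).fetchall()
for row in sessions_14d:
    print(row)
```

Note the row for 2020-01-15: the session on 2020-01-01 is exactly 14 days earlier and falls just outside the window, so the count stays at 3.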

Calculating Aggregates on subset of data based on condition

I have a DB as follows:
| company | timestamp | value |
| ------- | ---------- | ----- |
| google | 2020-09-01 | 5 |
| google | 2020-08-01 | 4 |
| amazon | 2020-09-02 | 3 |
I'd like to calculate the average value for each company within the last year if there are >= 20 datapoints. If there are less than 20 datapoints then I'd like the average during the entire time duration. I know I can do two separate queries and get the averages for each scenario. The question I suppose is how do I merge them back in a single table based on the criteria I have.
select company, avg(value) from my_db GROUP BY company;
select company, avg(value) from my_db
where timestamp > (CURRENT_DATE - INTERVAL '12 months')
GROUP BY company;
WITH last_year AS (
SELECT company, avg(value), 'year' AS range -- optional tag
FROM tbl
WHERE timestamp >= now() - interval '1 year'
GROUP BY 1
HAVING count(*) >= 20 -- 20+ rows in range
)
SELECT company, avg(value), 'all' AS range
FROM tbl t
WHERE NOT EXISTS (SELECT FROM last_year WHERE company = t.company)
GROUP BY 1
UNION ALL TABLE last_year;
An index on (timestamp) will only be used if your table is big and holds many years.
If most companies have 20+ rows in range, an index on (company) will be used for the 2nd SELECT to retrieve the few outliers.
Use conditional aggregation:
select company,
case
when sum(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
If by 20 datapoints you mean 20 rows in the last 12 months for each company, then:
select company,
case
when count(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then
avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
else avg(value)
end
from my_db
group by company
You can use window functions to provide the information for filtering:
select company, avg(value),
(count(*) = cnt_this_year) as only_this_year
from (select t.*,
count(*) filter (where date_trunc('year', datecol) = date_trunc('year', now())) over (partition by company) as cnt_this_year
from t
) t
where cnt_this_year >= 20 and date_trunc('year', datecol) = date_trunc('year', now()) or
cnt_this_year < 20
group by company, cnt_this_year;
The third column specifies if all the rows are from this year. By filtering in the where clause, it is simple to add other calculations as well (such as min(), max(), and so on).
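The conditional-aggregation approach (the "20 rows in the last 12 months" reading) can be exercised on a toy table. This is a sketch under assumptions, not a drop-in query: it runs on SQLite from Python, the reference date and the row threshold are passed as parameters (today and min_rows, both hypothetical names) so the result is reproducible, the threshold is lowered to 2 to keep the sample small, and the extra rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_db (company TEXT, timestamp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO my_db VALUES (?, ?, ?)",
    [
        ("google", "2020-09-01", 5),
        ("google", "2020-08-01", 4),
        ("google", "2018-01-01", 12),  # made-up older row
        ("amazon", "2020-09-02", 3),
        ("amazon", "2017-05-05", 7),   # made-up older row
    ],
)

query = """
SELECT company,
       CASE
           -- enough rows in the last 12 months: average only those rows
           WHEN count(CASE WHEN timestamp > date(:today, '-12 months')
                           THEN 1 END) >= :min_rows
           THEN avg(CASE WHEN timestamp > date(:today, '-12 months')
                         THEN value END)
           -- otherwise fall back to the average over all rows
           ELSE avg(value)
       END AS avg_value
FROM my_db
GROUP BY company
ORDER BY company
"""
averages = conn.execute(
    query, {"today": "2020-09-15", "min_rows": 2}
).fetchall()
for company, avg_value in averages:
    print(company, avg_value)
```

With these rows, google has two recent datapoints and gets the last-year average (4.5), while amazon has only one and falls back to the all-time average (5.0).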

SQL not returning a value if no row exist for time queried

I'm writing a SQL query that returns the number of records created per hour over the last 24 hours. I only get a result for those hours that have a non-zero count. If no records were created in an hour, that hour is not returned at all.
Here's my query:
SELECT HOUR(timeStamp) as hour, COUNT(*) as count
FROM `events`
WHERE timeStamp > DATE_SUB(NOW(), INTERVAL 24 HOUR)
GROUP BY HOUR(timeStamp)
ORDER BY HOUR(timeStamp)
The output of current Query:
+-----------------+----------+
| hour | count |
+-----------------+----------+
| 14 | 6 |
| 15 | 5 |
+-----------------+----------+
But i'm expecting 0 for hours in which no records were created. Where am I going wrong?
One solution is to generate a table of numbers from 0 to 23 and left join it with your original table.
Here is a query that uses a recursive query to generate the list of hours (if you are running MySQL, this requires version 8.0):
with recursive hours as (
select 0 hr
union all select hr + 1 from hours where hr < 23
)
select h.hr, count(e.eventID) as cnt
from hours h
left join events e
on e.timestamp > now() - interval 1 day
and hour(e.timestamp) = h.hr
group by h.hr
If your RDBMS does not support recursive CTEs, then one option is to use an explicit derived table:
select h.hr, count(e.eventID) as cnt
from (
select 0 hr union all select 1 union all select 2 ... union all select 23
) h
left join events e
on e.timestamp > now() - interval 1 day
and hour(e.timestamp) = h.hr
group by h.hr
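As a quick way to try the recursive-CTE variant, here is a sketch adapted to SQLite and run from Python. The assumptions: strftime('%H', ...) replaces MySQL's HOUR(), the current time is passed in as a parameter (now, a hypothetical name) instead of calling NOW() so the output is deterministic, and the event rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (eventID INTEGER PRIMARY KEY, timeStamp TEXT)"
)
conn.executemany(
    "INSERT INTO events (timeStamp) VALUES (?)",
    [
        ("2021-06-02 14:05:00",),  # made-up sample rows
        ("2021-06-02 14:30:00",),
        ("2021-06-02 15:10:00",),
        ("2021-05-30 14:00:00",),  # older than 24 hours, must be ignored
    ],
)

query = """
WITH RECURSIVE hours(hr) AS (
    SELECT 0
    UNION ALL
    SELECT hr + 1 FROM hours WHERE hr < 23
)
SELECT h.hr, count(e.eventID) AS cnt
FROM hours AS h
LEFT JOIN events AS e
       ON e.timeStamp > datetime(:now, '-1 day')
      AND CAST(strftime('%H', e.timeStamp) AS INTEGER) = h.hr
GROUP BY h.hr
ORDER BY h.hr
"""
hourly = conn.execute(query, {"now": "2021-06-02 16:00:00"}).fetchall()
for hr, cnt in hourly:
    print(hr, cnt)
```

The time filter sits in the ON clause rather than in WHERE; putting it in WHERE would discard the hours with no matching events and undo the LEFT JOIN.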