How to find the number of users active for 1 day, 2 days, 3 days, ... in PostgreSQL

A distribution of the number of days active within a week: I am trying to find how many members are active for 1 day, 2 days, 3 days, …, 7 days during a specific week, 3/1-3/7.
Is there any way to use aggregate function on top of partition by?
If not what can be used to achieve this?
select distinct memberID,count(date) over(partition by memberID) as no_of_days_active
from visitor
where date between '2019-01-01 00:00:00' and '2019-01-07 00:00:00'
order by no_of_days_active
The result should look something like this:
#Days Active Count
1 20
2 32
3 678
4 34
5 3
6 678
7 2345

I think you want two levels of aggregation to count the number of days during the week:
select num_days_active, count(*) as num_members
from (select memberID, count(distinct date::date) as num_days_active
from visitor
where date >= '2019-01-01'::date and
date < '2019-01-08'::date
group by memberID
) v
group by num_days_active
order by num_days_active;
Note that I changed the date comparisons. If the column has a time component, between does not work as intended: an upper bound of '2019-01-07 00:00:00' excludes everything later that day. Because you included a time in the constant, I also added an explicit conversion to date for the count(distinct). That might not be necessary if date is really a date with no time component.
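To make the two levels of aggregation concrete, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. The sample rows are made up for illustration, and SQLite's date() function stands in for the PostgreSQL date::date cast, so treat it as an approximation of the query above rather than the exact PostgreSQL behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitor (memberID INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO visitor VALUES (?, ?)",
    [
        (1, "2019-01-01 09:00:00"),
        (1, "2019-01-01 17:30:00"),  # second visit on the same day
        (1, "2019-01-03 08:15:00"),
        (2, "2019-01-07 23:59:00"),  # late on the last day of the week
        (3, "2019-01-02 12:00:00"),
        (3, "2019-01-08 00:30:00"),  # outside the week, must be ignored
    ],
)

rows = conn.execute(
    """
    SELECT num_days_active, COUNT(*) AS num_members
    FROM (
        -- inner level: distinct active days per member
        SELECT memberID, COUNT(DISTINCT date(date)) AS num_days_active
        FROM visitor
        WHERE date >= '2019-01-01' AND date < '2019-01-08'
        GROUP BY memberID
    ) v
    -- outer level: how many members share each day count
    GROUP BY num_days_active
    ORDER BY num_days_active
    """
).fetchall()
print(rows)
```

Member 1 has three visits but only two distinct days, and member 2's late visit on the 7th is still counted because the upper bound is the half-open `< '2019-01-08'`.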

Piggybacking off of @Gordon's answer, I personally like using a with statement for the subqueries:
with dat as (
select
memberID,
count(distinct date::date) as no_of_days_active -- distinct days, not visits
from visitor
where 1=1
and date >= '2019-01-01'::date
and date < '2019-01-08'::date -- half-open range keeps all of the last day
group by memberID
)
select
no_of_days_active,
count(no_of_days_active) no_of_days_active_cnt
from dat
group by no_of_days_active
order by no_of_days_active

Related

How to conditional SQL select

My table consists of user_id, revenue, publish_month columns.
Right now I use group_by user_id and sum(revenue) to get revenue for all individual users.
Is there a single SQL query I can use to query for user revenue across a time period conditionally? If for a specific user, there is a row for this month, I want to query for this month, last month and the month before. If there is not yet a row for this month, I want to query for last month and the two months before.
Any advice with which approach to take would be helpful. If I should be using cases, if-elses with exists or if this is do-able with a single SQL query?
UPDATE---since I did a bad job of describing the question, I've come to include some example data and expected results
Where current month is not present for user 33
Where current month is present
Assuming publish_month is a DATE datatype, this should get the most recent three months of data per user...
SELECT
user_id, SUM(revenue) as s_revenue
FROM
(
SELECT
user_id, revenue, publish_month,
MAX(publish_month) OVER (PARTITION BY user_id) AS user_latest_publish_month
FROM
yourtableyoudidnotname
)
summarised
WHERE
publish_month >= DATEADD(month, -2, user_latest_publish_month)
GROUP BY
user_id
If you want to limit that to the most recent 3 months out of the last 4 calendar months, just add AND publish_month >= DATEADD(month, -3, DATE_TRUNC(month, GETDATE()))
The ambiguity here is why it is important to include a Minimal Reproducible Example.
With input data and required results, we could test our code against your requirements.
If you're using strings for the publish_month, you shouldn't be, and should fix that with utmost urgency.
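As a rough illustration of the MAX() OVER approach, the sketch below runs an equivalent query in SQLite from Python. The table name monthly_revenue and the sample rows are hypothetical, and SQLite's date(x, '-2 months') replaces DATEADD, which SQLite does not have. It needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monthly_revenue (user_id INTEGER, revenue INTEGER, publish_month TEXT)"
)
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?, ?)",
    [
        (33, 10, "2022-10-01"),
        (33, 20, "2022-11-01"),
        (33, 40, "2023-01-01"),  # user 33 has no row for 2022-12
        (44, 5, "2022-12-01"),
        (44, 7, "2023-01-01"),
    ],
)

rows = conn.execute(
    """
    SELECT user_id, SUM(revenue) AS s_revenue
    FROM (
        SELECT user_id, revenue, publish_month,
               MAX(publish_month) OVER (PARTITION BY user_id) AS user_latest_publish_month
        FROM monthly_revenue
    ) summarised
    -- keep the user's latest month and the two calendar months before it
    WHERE publish_month >= date(user_latest_publish_month, '-2 months')
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(rows)
```

Because the window is three calendar months ending at each user's latest month, user 33's missing December is simply skipped: only the November and January rows are summed.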
You can use a windowing function to "number" the months. In this way the most recent one will have a value of 1, the prior 2, and the one before 3. Then you can only select the items with a number of 3 or less.
Here is how:
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
Now you just select the items with RN of 3 or less and do your sum:
SELECT user_id, SUM(revenue) as s_revenue
FROM (
SELECT user_id, revenue, publish_month,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY publish_month DESC) as RN
FROM yourtableyoudidnotname
) X
WHERE RN <= 3
GROUP BY user_id
You could also do this without a sub query if you use the windowing function for SUM and a range, but I think this is easier to understand.
From the comment: there could be an issue if you have months from more than one year. To solve this, make the biggest number in the ORDER BY always the most recent. So instead of
ORDER BY publish_month DESC
you would have
ORDER BY (100*publish_year)+publish_month DESC
This means more recent years will always have a higher number, so January of 2023 will be 202301 while December of 2022 will be 202212. Since January has the bigger number, it will get a row number of 1 and December will get a row number of 2.
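The row-numbering approach can be tried out with a small SQLite session from Python. This is only a sketch: the table name revenue_by_month and the data are invented, it assumes one row per user per month, and publish_month is stored as an ISO date string so that plain ordering is already chronological across the year boundary. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_by_month (user_id INTEGER, revenue INTEGER, publish_month TEXT)"
)
conn.executemany(
    "INSERT INTO revenue_by_month VALUES (?, ?, ?)",
    [
        (33, 10, "2022-10-01"),
        (33, 20, "2022-11-01"),
        (33, 30, "2022-12-01"),
        (33, 40, "2023-01-01"),
        (44, 5, "2022-12-01"),
        (44, 7, "2023-01-01"),
    ],
)

rows = conn.execute(
    """
    SELECT user_id, SUM(revenue) AS s_revenue
    FROM (
        SELECT user_id, revenue,
               -- 1 = most recent month for this user, 2 = the one before, ...
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY publish_month DESC) AS rn
        FROM revenue_by_month
    ) x
    WHERE rn <= 3
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(rows)
```

Note that this picks the three most recent rows a user has, whatever months they fall in, which is not the same as a three-calendar-month window when a user skips a month.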

How to get number of billable customers per month SQL

This is what my table looks like:
NOTE: Don't worry about the BMI field being empty in some rows. We assume that each row is a reading. I have omitted some columns for privacy reasons.
I want to get a count of the number of active customers per month. A customer is active if they have at least 18 readings in total (1 reading per day for 18 days in a given month). How do I write this SQL query? Assume the table name is 'cust'. I'm using SQL Server. Any help is appreciated.
Presumably a patient is a customer in your world. If so, you can use two levels of aggregation:
select yyyy, mm, count(*)
from (select year(createdat) as yyyy, month(createdat) as mm,
patient_id,
count(distinct convert(date, createdat)) as num_days
from t
group by year(createdat), month(createdat), patient_id
) ymp
where num_days >= 18
group by yyyy, mm;
You need to group by patient and month, then group again by just the month:
SELECT
mth,
COUNT(*) NumPatients
FROM (
SELECT
EOMONTH(c.createdat) mth
FROM cust c
GROUP BY EOMONTH(c.createdat), c.patient_id
HAVING COUNT(*) >= 18
-- for distinct days you could change it to:
-- HAVING COUNT(DISTINCT CAST(c.createdat AS date)) >= 18
) c
GROUP BY mth;
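Here is a small sketch of the same two-step grouping, run in SQLite from Python so it can be tested without SQL Server. The readings are generated for illustration, and strftime('%Y-%m', ...) stands in for EOMONTH / YEAR and MONTH, which SQLite does not provide.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (patient_id INTEGER, createdat TEXT)")


def readings(patient_id, start, num_days):
    """One reading per day for num_days consecutive days starting at start."""
    return [
        (patient_id, (start + timedelta(days=i)).isoformat())
        for i in range(num_days)
    ]


sample = (
    readings(1, date(2021, 3, 1), 20)     # 20 distinct days in March
    + readings(2, date(2021, 3, 1), 10)   # only 10 distinct days in March
    + [(2, "2021-03-01")] * 10            # repeated same-day readings add no days
    + readings(3, date(2021, 3, 20), 30)  # 12 days in March, 18 in April
)
conn.executemany("INSERT INTO cust VALUES (?, ?)", sample)

rows = conn.execute(
    """
    SELECT mth, COUNT(*) AS num_patients
    FROM (
        -- first level: one row per (month, patient) with enough distinct days
        SELECT strftime('%Y-%m', createdat) AS mth, patient_id
        FROM cust
        GROUP BY strftime('%Y-%m', createdat), patient_id
        HAVING COUNT(DISTINCT date(createdat)) >= 18
    ) per_patient
    -- second level: count qualifying patients per month
    GROUP BY mth
    ORDER BY mth
    """
).fetchall()
print(rows)
```

Patient 2 has 20 rows in March but only 10 distinct days, so counting distinct days rather than rows is what keeps that patient out of the result.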

Finding id's available in previous weeks but not in current week

How can I find ids that were present in previous weeks but are not present in the current week, on a rolling basis? For example:
Week1 has id 1,2,3,4,5
Week2 has id 3,4,5,7,8
Week3 has id 1,3,5,10,11
So id 1 and 2 are missing in week 2, and id 2, 4, 7, 8 are missing in week 3 compared with the previous 2 weeks. But how can I do this on a rolling window for a large amount of data spanning 20+ years?
Please find the sample dataset and expected output. I am expecting the output to be partitioned based on the week_end Date
Dataset
ID|WEEK_START|WEEK_END|APPEARING_DATE
7152|2015-12-27|2016-01-02|2015-12-27
8350|2015-12-27|2016-01-02|2015-12-27
7152|2015-12-27|2016-01-02|2015-12-29
4697|2015-12-27|2016-01-02|2015-12-30
7187|2015-12-27|2016-01-02|2015-01-01
8005|2015-12-27|2016-01-02|2015-12-27
8005|2015-12-27|2016-01-02|2015-12-29
6254|2016-01-03|2016-01-09|2016-01-03
7962|2016-01-03|2016-01-09|2016-01-04
3339|2016-01-03|2016-01-09|2016-01-06
7834|2016-01-03|2016-01-09|2016-01-03
7962|2016-01-03|2016-01-09|2016-01-05
7152|2016-01-03|2016-01-09|2016-01-07
8350|2016-01-03|2016-01-09|2016-01-09
2403|2016-01-10|2016-01-16|2016-01-10
0157|2016-01-10|2016-01-16|2016-01-11
2228|2016-01-10|2016-01-16|2016-01-14
4697|2016-01-10|2016-01-16|2016-01-14
Expected Output
Partition1: WEEK_END=2016-01-02
ID|MAX(LAST_APPEARING_DATE)
7152|2015-12-29
8350|2015-12-27
4697|2015-12-30
7187|2015-01-01
8005|2015-12-29
Partition2: WEEK_END=2016-01-09
ID|MAX(LAST_APPEARING_DATE)
7152|2016-01-07
8350|2016-01-09
4697|2015-12-30
7187|2015-01-01
8005|2015-12-29
6254|2016-01-03
7962|2016-01-05
3339|2016-01-06
7834|2016-01-03
Partition3: WEEK_END=2016-01-16
ID|MAX(LAST_APPEARING_DATE)
7152|2016-01-07
8350|2016-01-09
4697|2016-01-14
7187|2015-01-01
8005|2015-12-29
6254|2016-01-03
7962|2016-01-05
3339|2016-01-06
7834|2016-01-03
2403|2016-01-10
0157|2016-01-11
2228|2016-01-14
Please use the query below:
select ID, MAX(APPEARING_DATE) from table_name
group by ID, WEEK_END;
Or, including WEEK_END:
select ID, WEEK_END, MAX(APPEARING_DATE) from table_name
group by ID, WEEK_END;
You can use aggregation:
select t.id, max(week_end)
from t
group by id
having max(week_end) < '2016-01-02';
Adjust the date in the having clause for the week end that you want.
Actually, your question is a bit unclear. I'm not sure if a later week end would keep the row or not. If you want "as of" data, then include a where clause:
select t.id, max(week_end)
from t
where week_end < '2016-01-02'
group by id
having max(week_end) < '2016-01-02';
If you want this for a range of dates, then you can use a derived table:
select we.the_week_end, t.id, max(week_end)
from (select '2016-01-02' as the_week_end union all
select '2016-01-09' as the_week_end
) we cross join
t
where t.week_end < we.the_week_end
group by id, we.the_week_end
having max(t.week_end) < we.the_week_end;
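One possible reading of the expected output is "for every week end, each id seen so far with its last appearing date". Under that assumption, the sketch below joins the distinct week ends back to the data, using SQLite from Python and a few rows taken from the question's dataset. The table name appearances is hypothetical. Adding the HAVING clause narrows the result to ids that were seen earlier but not in the given week.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE appearances (id INTEGER, week_end TEXT, appearing_date TEXT)"
)
conn.executemany(
    "INSERT INTO appearances VALUES (?, ?, ?)",
    [
        (7152, "2016-01-02", "2015-12-27"),
        (7152, "2016-01-02", "2015-12-29"),
        (8350, "2016-01-02", "2015-12-27"),
        (4697, "2016-01-02", "2015-12-30"),
        (7152, "2016-01-09", "2016-01-07"),
        (6254, "2016-01-09", "2016-01-03"),
    ],
)

# Every id seen up to and including each week end, with its last appearance.
as_of = conn.execute(
    """
    SELECT we.week_end, t.id, MAX(t.appearing_date) AS last_appearing_date
    FROM (SELECT DISTINCT week_end FROM appearances) we
    JOIN appearances t ON t.week_end <= we.week_end
    GROUP BY we.week_end, t.id
    ORDER BY we.week_end, t.id
    """
).fetchall()

# Only the ids whose most recent week is before the week end in question.
missing = conn.execute(
    """
    SELECT we.week_end, t.id, MAX(t.appearing_date) AS last_appearing_date
    FROM (SELECT DISTINCT week_end FROM appearances) we
    JOIN appearances t ON t.week_end <= we.week_end
    GROUP BY we.week_end, t.id
    HAVING MAX(t.week_end) < we.week_end
    ORDER BY we.week_end, t.id
    """
).fetchall()
print(as_of)
print(missing)
```

The join compares every week end with all earlier rows, so on 20+ years of data it may need an index on week_end or a pre-aggregated (id, week_end) table to stay practical.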

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: User_ID and fail_date. Each time somebody's card is rejected they are logged in the table, their card is automatically tried again 3 days later, and if they fail again, another entry is added to the table. I am trying to write a query that counts unique failures by month so I only want to count the first entry, not the 3 day retries, if they exist. My data set looks like this
user_id fail_date
222 01/01
222 01/04
555 02/15
777 03/31
777 04/02
222 10/11
so my desired output would be something like this:
month unique_fails
jan 1
feb 1
march 1
april 0
oct 1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies. Just help around how to approach this problem as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are less than or exactly three days apart, it's a follow up. Mark the row as such. Then you can filter to exclude the follow ups.
It might look something like:
SELECT month,
count(*) unique_fails
FROM (SELECT month(fail_date) month,
CASE
WHEN datediff(day,
lag(fail_date) OVER (PARTITION BY user_id
ORDER BY fail_date),
fail_date) <= 3 THEN
1
ELSE
0
END follow_up
FROM elbat) x
WHERE follow_up = 0
GROUP BY month;
I'm not so sure about the exact syntax in Vertica, so it might need some adaptations. I also don't know whether fail_date is actually some date/time type or just a string. If it's just a string, the date/time functions may not work on it and will have to be replaced, or the string will have to be converted before being passed to them.
If the data spans several years you might also want to include the year additionally to the month to keep months from different years apart. In the inner SELECT add a column year(fail_date) year and add year to the list of columns and the GROUP BY of the outer SELECT.
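To check the lag() idea on the sample data, here is a minimal sketch in SQLite driven from Python. The table name failures is made up, the year 2018 is an assumption (the question gives only month and day), and julianday() differences replace datediff, so the syntax will differ from Vertica. Window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE failures (user_id INTEGER, fail_date TEXT)")
conn.executemany(
    "INSERT INTO failures VALUES (?, ?)",
    [
        (222, "2018-01-01"),
        (222, "2018-01-04"),  # retry exactly 3 days later
        (555, "2018-02-15"),
        (777, "2018-03-31"),
        (777, "2018-04-02"),  # follow-up 2 days later, in the next month
        (222, "2018-10-11"),  # a new failure months later
    ],
)

rows = conn.execute(
    """
    SELECT strftime('%Y-%m', fail_date) AS fail_month, COUNT(*) AS unique_fails
    FROM (
        SELECT fail_date,
               -- days since this user's previous failure, NULL for the first one
               julianday(fail_date)
               - julianday(LAG(fail_date) OVER (PARTITION BY user_id
                                                ORDER BY fail_date)) AS days_since_prev
        FROM failures
    ) x
    -- keep first failures and failures more than 3 days after the previous one
    WHERE days_since_prev IS NULL OR days_since_prev > 3
    GROUP BY fail_month
    ORDER BY fail_month
    """
).fetchall()
print(rows)
```

A month that contains only follow-ups, such as April here, produces no row at all; getting an explicit 0 would take an outer join against a list of months.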
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
then 0 else 1
end) as first_failure_flag
from t;
Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'), -- should always include the year
sum(first_failure_flag)
from (select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) >= fail_date - 3
then 0 else 1
end) as first_failure_flag
from t
) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
In a derived table, determine the previous fail_date (prev_fail_date) for each user_id and fail_date, using a correlated subquery.
Using the derived table dt, count a failure if there is no previous failure or if the number of days between the current fail_date and prev_fail_date is greater than 3.
The DATEDIFF() function together with the IF() function is used to identify the rows that are not retries.
To group this result by month, you can use the MONTH function.
But the data can span multiple years, so you need to separate it by year as well; you can do a multi-level GROUP BY using the YEAR function too.
Try the following (in MySQL); you can adapt the idea to other RDBMSs as well:
SELECT YEAR(dt.fail_date) AS year_fail_date,
MONTH(dt.fail_date) AS month_fail_date,
COUNT( IF(dt.prev_fail_date IS NULL OR DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3, dt.user_id, NULL) ) AS unique_fails
FROM (
SELECT
t1.user_id,
t1.fail_date,
(
SELECT t2.fail_date
FROM your_table AS t2
WHERE t2.user_id = t1.user_id
AND t2.fail_date < t1.fail_date
ORDER BY t2.fail_date DESC
LIMIT 1
) AS prev_fail_date
FROM your_table AS t1
) AS dt
GROUP BY
year_fail_date,
month_fail_date
ORDER BY
year_fail_date ASC,
month_fail_date ASC

query to display additional column based on aggregate value

I've been mulling on this problem for a couple of hours now with no luck, so I thought people on SO might be able to help :)
I have a table with data regarding processing volumes at stores. The first three columns shown below can be queried from that table. What I'm trying to do is add a 4th column that's basically a flag for whether a store has processed >= $150 and, if so, displays the corresponding date. The way this works is that the date of the first instance where the store's cumulative volume reaches $150 is the date that gets displayed. Subsequent processing volumes don't count after the first instance the activated date is hit. For example, for store 4, there's just one instance of the activated date.
store_id sales_volume date activated_date
----------------------------------------------------
2 5 03/14/2012
2 125 05/21/2012
2 30 11/01/2012 11/01/2012
3 100 02/06/2012
3 140 12/22/2012 12/22/2012
4 300 10/15/2012 10/15/2012
4 450 11/25/2012
5 100 12/03/2012
Any insights as to how to build out this fourth column? Thanks in advance!
The solution starts by calculating the cumulative sales. Then, you want the activation date only when the cumulative sales first pass through the $150 level. This happens when adding the current sales amount pushes the cumulative amount over the threshold. The following case expression handles this.
select t.store_id, t.sales_volume, t.date,
(case when 150 > cumesales - t.sales_volume and 150 <= cumesales
then date
end) as ActivationDate
from (select t.*,
sum(sales_volume) over (partition by store_id order by date) as cumesales
from t
) t
If you have an older version of Postgres that does not support cumulative sum, you can get the cumulative sales with a subquery like:
(select sum(sales_volume) from t t2 where t2.store_id = t.store_id and t2.date <= t.date) as cumesales
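The threshold-crossing test can be verified against the question's sample rows with a short SQLite session from Python. This is a sketch only: the table and column names tbl and thedate are borrowed for illustration, dates are stored as ISO strings, and the windowed SUM requires SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (store_id INTEGER, sales_volume INTEGER, thedate TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (2, 5, "2012-03-14"),
        (2, 125, "2012-05-21"),
        (2, 30, "2012-11-01"),
        (3, 100, "2012-02-06"),
        (3, 140, "2012-12-22"),
        (4, 300, "2012-10-15"),
        (4, 450, "2012-11-25"),
        (5, 100, "2012-12-03"),
    ],
)

rows = conn.execute(
    """
    SELECT store_id, sales_volume, thedate,
           -- the row crosses the threshold if the running total was below 150
           -- before this row and is at least 150 after it
           CASE WHEN cumesales - sales_volume < 150 AND cumesales >= 150
                THEN thedate END AS activated_date
    FROM (
        SELECT t.*,
               SUM(sales_volume) OVER (PARTITION BY store_id
                                       ORDER BY thedate) AS cumesales
        FROM tbl t
    ) s
    ORDER BY store_id, thedate
    """
).fetchall()

activated = [r[3] for r in rows]
print(activated)
```

Store 4 gets a date only on its first row because the running total is already past 150 before the second row is added, and store 5 never reaches the threshold.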
Variant 1
You can LEFT JOIN to a table that calculates the first date surpassing the 150 $ limit per store:
SELECT t.*, b.activated_date
FROM tbl t
LEFT JOIN (
SELECT store_id, min(thedate) AS activated_date
FROM (
SELECT store_id, thedate
,sum(sales_volume) OVER (PARTITION BY store_id
ORDER BY thedate) AS running_sum
FROM tbl
) a
WHERE running_sum >= 150
GROUP BY 1
) b ON t.store_id = b.store_id AND t.thedate = b.activated_date
ORDER BY t.store_id, t.thedate;
The calculation of the first day has to be done in two steps, since the window function accumulating the running sum has to be applied in a separate SELECT.
Variant 2
Another window function instead of the LEFT JOIN. May or may not be faster. Test with EXPLAIN ANALYZE.
SELECT *
,CASE WHEN running_sum >= 150 AND thedate = first_value(thedate)
OVER (PARTITION BY store_id, running_sum >= 150 ORDER BY thedate)
THEN thedate END AS activated_date
FROM (
SELECT *
,sum(sales_volume)
OVER (PARTITION BY store_id ORDER BY thedate) AS running_sum
FROM tbl
) b
ORDER BY store_id, thedate;
->sqlfiddle demonstrating both.