Getting maximum sequential streak with events - updated question - sql

I've previously posted a similar question, but an update to the parameters means the solution posted there no longer works, and I've had trouble working out how to integrate the revised requirement. I'm not sure of the protocol here - it appears that I can't post an updated question to the original post at Getting maximum sequential streak with events.
I’m looking for a single query, if possible, running PostgreSQL 9.6.6 under pgAdmin3 v1.22.1
I have a table with a date and a row for each event on the date:
Date        Events
2018-12-10  1
2018-12-10  1
2018-12-10  0
2018-12-09  1
2018-12-08  0
2018-12-08  0
2018-12-07  1
2018-12-06  1
2018-12-06  1
2018-12-06  0
2018-12-06  1
2018-12-04  1
2018-12-03  0
I’m looking for the longest sequence of dates without a break. In this case, 2018-12-08 and 2018-12-03 are the only dates with no events; there are two dates with events between 2018-12-08 and today, and three between 2018-12-03 and 2018-12-08 - so I would like the answer of 3.
I know I can group them together with something like:
Select Date, count(Date) from Table group by Date order by Date Desc
To get just the most recent sequence, I’ve got something like this - the subquery returns the most recent date with no events, and the outer query counts the dates after that date:
select date, count(distinct date)
from Table
where date >
      (select date
       from Table
       group by date
       having count(case when Events is not null then 1 else null end) = 0
       order by date desc
       fetch first row only)
group by date
But now I need the longest streak, not just the most recent streak.
I had assumed when I posted previously that there were rows for every date in the range. But this assumption wasn't correct, so the answer given doesn't work. I also need the query to return the start and end date for the range.
Thank you!

You can assign groups by doing a cumulative count of the 0s. Then count the distinct dates in each group:
select count(*), min(date), max(date), count(distinct date)
from (select t.*,
             count(*) filter (where events = 0) over (order by date) as grp
      from t
     ) t
group by grp
order by count(distinct date) desc
limit 1;
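For reference, here is a sketch of the same idea that first collapses the table to one row per date, since a 0 row on a day that also has events (2018-12-10 above) would otherwise start a new group. This is only a sketch, assuming the table is called t with columns date and events:
with per_date as (
    select date, max(events) as had_event                    -- 1 if the date had any event
    from t
    group by date
), grouped as (
    select date, had_event,
           count(*) filter (where had_event = 0)
                    over (order by date) as grp               -- new group after each empty date
    from per_date
)
select min(date) filter (where had_event = 1) as streak_start,
       max(date) filter (where had_event = 1) as streak_end,
       count(*)  filter (where had_event = 1) as streak_length
from grouped
group by grp
order by streak_length desc
limit 1;
Against the sample data this should return 2018-12-04 to 2018-12-07 with a streak length of 3.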

Related

SQL - Get historic count of rows collected within a certain period by date

For many years I've been collecting data and I'm interested in knowing the historic counts of IDs that appeared in the last 30 days. The source looks like this
id   dates
1    2002-01-01
2    2002-01-01
3    2002-01-01
...  ...
3    2023-01-10
If I wanted to know the historic count of ids that appeared in the last 30 days I would do something like this
with total_counter as (
    select id, count(id) counts
    from source
    group by id
),
unique_obs as (
    select id
    from source
    where dates >= DATEADD(Day, -30, current_date)
    group by id
)
select count(distinct(id))
from unique_obs
left join total_counter
    on total_counter.id = unique_obs.id;
The problem is that this returns a single result: today's count, as of current_date.
I would like to see a table with such counts as if, for example, I had run this analysis yesterday, and the day before, and so on. So the expected result would be something like:
counts  date
1235    2023-01-10
1234    2023-01-09
1265    2023-01-08
...     ...
7383    2022-12-11
So, for example, if the current_date was 2023-01-10, my query would have returned 1235.
If you need a distinct count of IDs from the 30 days up to and including each date, the below should work:
WITH CTE_DATES AS
(
    --Create a list of anchor dates
    SELECT DISTINCT dates
    FROM source
)
SELECT COUNT(DISTINCT S.id) AS "counts"
      ,D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY, -29, D.dates) AND D.dates --30 DAYS INCLUSIVE
GROUP BY D.dates
ORDER BY D.dates DESC
;
If the distinct count didn't matter you could likely simplify with a rolling sum, only hitting the source table once:
SELECT S.dates AS "date"
      ,COUNT(1) AS "count_daily"
      ,SUM("count_daily") OVER (ORDER BY S.dates DESC ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS "count_rolling" --assumes there is at least one row for every day
FROM source S
GROUP BY S.dates
ORDER BY S.dates DESC;
This won't work, though, if you have gaps in your list of dates, as it will just include the latest 30 days available. In that case the first example, without DISTINCT in the count, will do the trick (a date-spine sketch for gap-free output follows below).
SELECT count(*) AS Counts
      ,dates AS Date
FROM source
WHERE dates >= DATEADD(DAY, -30, CURRENT_DATE)
GROUP BY dates
ORDER BY dates DESC
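If you also want an output row for every calendar day (the expected output above lists consecutive dates), the anchor-date CTE in the first example could be built as a generated date spine instead. A rough sketch, assuming a dialect that supports recursive CTEs and DATEADD; some dialects require the keyword WITH RECURSIVE, and SQL Server needs OPTION (MAXRECURSION 0) for ranges longer than 100 days:
WITH CTE_DATES AS
(
    -- one row per calendar day between the first and last date in source
    SELECT MIN(dates) AS dates, MAX(dates) AS max_date FROM source
    UNION ALL
    SELECT DATEADD(DAY, 1, dates), max_date
    FROM CTE_DATES
    WHERE dates < max_date
)
SELECT COUNT(DISTINCT S.id) AS "counts"
      ,D.dates AS "date"
FROM CTE_DATES D
LEFT JOIN source S ON S.dates BETWEEN DATEADD(DAY, -29, D.dates) AND D.dates
GROUP BY D.dates
ORDER BY D.dates DESC
OPTION (MAXRECURSION 0); -- SQL Server only; drop in other dialects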

Need to count unique transactions by month but ignore records that occur 3 days after 1st entry for that ID

I have a table with just two columns: User_ID and fail_date. Each time somebody's card is rejected they are logged in the table, their card is automatically tried again 3 days later, and if they fail again, another entry is added to the table. I am trying to write a query that counts unique failures by month so I only want to count the first entry, not the 3 day retries, if they exist. My data set looks like this
user_id fail_date
222 01/01
222 01/04
555 02/15
777 03/31
777 04/02
222 10/11
so my desired output would be something like this:
month unique_fails
jan 1
feb 1
march 1
april 0
oct 1
I'll be running this in Vertica, but I'm not so much looking for perfect syntax in replies - just help with how to approach this problem, as I can't really think of a way to make it work. Thanks!
You could use lag() to get the previous timestamp per user. If the current and the previous timestamp are less than or exactly three days apart, it's a follow up. Mark the row as such. Then you can filter to exclude the follow ups.
It might look something like:
SELECT month,
       count(*) unique_fails
FROM (SELECT month(fail_date) month,
             CASE
                WHEN datediff(day,
                              lag(fail_date) OVER (PARTITION BY user_id
                                                   ORDER BY fail_date),
                              fail_date) <= 3 THEN
                   1
                ELSE
                   0
             END follow_up
      FROM elbat) x
WHERE follow_up = 0
GROUP BY month;
I'm not so sure about the exact syntax in Vertica, so it might need some adaptations. I also don't know if fail_date actually is some date/time type variant or just a string. If it's just a string, the date/time specific functions may not work on it and would have to be replaced, or the string would have to be converted prior to passing it to the functions.
If the data spans several years you might also want to include the year in addition to the month, to keep months from different years apart. In the inner SELECT add a column year(fail_date) year, and add year to the list of columns and the GROUP BY of the outer SELECT, as sketched below.
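A rough sketch of that year-aware variant (the same caveats about exact Vertica syntax apply; table and column names as above):
SELECT year,
       month,
       count(*) unique_fails
FROM (SELECT year(fail_date) year,
             month(fail_date) month,
             CASE
                WHEN datediff(day,
                              lag(fail_date) OVER (PARTITION BY user_id
                                                   ORDER BY fail_date),
                              fail_date) <= 3 THEN
                   1
                ELSE
                   0
             END follow_up
      FROM elbat) x
WHERE follow_up = 0
GROUP BY year, month
ORDER BY year, month;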
You can add a flag about whether this is a "unique_fail" by doing:
select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
then 0 else 1
end) as first_failure_flag
from t;
Then, you want to count this flag by month:
select to_char(fail_date, 'Mon'), -- should always include the year
sum(first_failure_flag)
from (select t.*,
(case when lag(fail_date) over (partition by user_id order by fail_date) > fail_date - 3
then 0 else 1
end) as first_failure_flag
from t
) t
group by to_char(fail_date, 'Mon')
order by min(fail_date)
In a derived table, determine the previous fail_date (prev_fail_date) for a specific user_id and fail_date, using a correlated subquery.
Using the derived table dt, count the failure if the difference in days between the current fail_date and prev_fail_date is greater than 3.
The DATEDIFF() function, together with IF(), is used to determine the cases which are not repeated tries.
To group this result by month, you can use the MONTH() function.
But the data can be from multiple years, so you need to separate them out by year as well; you can do a multi-level group by, using the YEAR() function.
Try the following (in MySQL) - you can get the idea for other RDBMSs as well:
SELECT YEAR(dt.fail_date) AS year_fail_date,
MONTH(dt.fail_date) AS month_fail_date,
COUNT( IF(DATEDIFF(dt.fail_date, dt.prev_fail_date) > 3, user_id, NULL) ) AS unique_fails
FROM (
SELECT
t1.user_id,
t1.fail_date,
(
SELECT t2.fail_date
FROM your_table AS t2
WHERE t2.user_id = t1.user_id
AND t2.fail_date < t1.fail_date
ORDER BY t2.fail_date DESC
LIMIT 1
) AS prev_fail_date
FROM your_table AS t1
) AS dt
GROUP BY
year_fail_date,
month_fail_date
ORDER BY
year_fail_date ASC,
month_fail_date ASC

Running Sum for the last 30 days on BigQuery

I am trying to get the following from the Google Merchandise Store public dataset in BigQuery:
Date
Number of distinct users
Running sum of the number of distinct users in the last 30 days
For example (I used 3 days in the example for simplicity):
date distinct_users distinct_users_3days
15/07/2018 8 15
14/07/2018 2 12
13/07/2018 5 20
12/07/2018 5 15
11/07/2018 10 10
...
This is my current SQL code which gets the first two columns, but I can't figure out how to get the running sum:
SELECT
date,
COUNT(DISTINCT(fullVisitorId)) as daily_active_user
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_2017*`
WHERE
_table_suffix BETWEEN "0101"
AND "0715"
GROUP BY
date
Any help is appreciated! :)
I managed to figure out the answer to my question so I would like to share with the others who may encounter this problem in future.
The SQL code is:
SELECT
date,
COUNT(DISTINCT(fullVisitorId)) as daily_active_user,
SUM(count(Distinct(fullVisitorId))) OVER (ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS monthly_active_user
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_2017*`,
unnest(hits) as h
WHERE
_table_suffix BETWEEN "0101" AND "0715"
GROUP BY
date
This gives a column which sums the distinct users in a 30 day window.
Please try the following query for 3 days (SQL Server 2014):
SELECT date,
       COUNT(DISTINCT(fullVisitorId)) AS daily_active_user,
       SUM(COUNT(DISTINCT(fullVisitorId))) OVER (PARTITION BY null ORDER BY date DESC
           ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) AS distinct_users_3days
FROM YOUR_TABLE_NAME
WHERE _table_suffix BETWEEN '0101' AND '0715'
GROUP BY date
For 30 days:
SELECT date,
       COUNT(DISTINCT(fullVisitorId)) AS daily_active_user,
       SUM(COUNT(DISTINCT(fullVisitorId))) OVER (PARTITION BY null ORDER BY date DESC
           ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING) AS distinct_users_30days
FROM YOUR_TABLE_NAME
WHERE _table_suffix BETWEEN '0101' AND '0715'
GROUP BY date

SQL - Counting months passed by a Person Indicator

I'm trying to count the number of months that have passed, based on ID. It's possible that for some records the months will not increase by 1 each time (i.e. someone could have a record for 1/1/13 and 3/1/13 but not 2/1/13); however, I only want a count of the records in my table, so missing months don't matter.
An example table would be (notice the missing month and its irrelevance):
DATE ID Months Passed
----------- --- --------------
2013-11-01 105 1
2013-12-01 105 2
2014-02-01 105 3
2014-03-01 105 4
Essentially an Excel COUNTIFS in SQL, which I've written as:
=COUNTIFS(IDColumn, ID, MonthColumn, "<=" & Month)
Does anyone know of a way to generate the desired column using SQL?
Try ROW_NUMBER(). If you just want the "Months Passed" column to increase by 1 each time, and for each ID, that will do the trick.
SELECT
Date,
Id,
Indicator,
ROW_NUMBER() OVER(PARTITION BY Id ORDER BY Date) AS RowNum
FROM YourTable
WHERE Indicator = 'YES'
UNION
SELECT
Date,
Id,
Indicator,
0 AS RowNum
FROM YourTable
WHERE Indicator = 'NO'
You could more simply count rows grouped by month (it gets more complex if you have to count months in different years separately - see the sketch after the query):
SELECT COUNT(derived.monthVal)
FROM (SELECT MONTH(<your date field>) AS monthVal
FROM [your table]
WHERE [Your ID Column] = <the id>
GROUP BY MONTH(<your date field>)) AS derived;
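A rough sketch of the year-separated variant mentioned above, counting distinct (year, month) pairs instead of distinct month numbers (placeholder names as in the query above):
SELECT COUNT(*)
FROM (SELECT YEAR(<your date field>) AS yearVal,
             MONTH(<your date field>) AS monthVal
      FROM [your table]
      WHERE [Your ID Column] = <the id>
      GROUP BY YEAR(<your date field>), MONTH(<your date field>)) AS derived;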

Smoothing out a result set by date

Using SQL I need to return a smooth set of results (i.e. one per day) from a dataset that contains 0-N records per day.
The result per day should be the most recent previous value even if that is not from the same day. For example:
Starting data:
Date: Time: Value
19/3/2014 10:01 5
19/3/2014 11:08 3
19/3/2014 17:19 6
20/3/2014 09:11 4
22/3/2014 14:01 5
Required output:
Date: Value
19/3/2014 6
20/3/2014 4
21/3/2014 4
22/3/2014 5
First you need to complete the date range and fill in the missing dates (21/3/2014 in your example). This can be done by either joining a calendar table if you have one, or by using a recursive common table expression to generate the complete sequence on the fly.
When you have the complete sequence of dates, finding the max value for the date, or from the latest previous non-null row, becomes easy. In this query I use a correlated subquery to do it.
with cte as (
select min(date) date, max(date) max_date from your_table
union all
select dateadd(day, 1, date) date, max_date
from cte
where date < max_date
)
select
c.date,
(
select top 1 max(value) from your_table
where date <= c.date group by date order by date desc
) value
from cte c
order by c.date;
Maybe this works, but try it and let me know:
select date, value from test where (time,date) in (select max(time),date from test group by date);
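For completeness, a rough sketch combining the two answers above - fill the date range with the recursive CTE from the first answer, then take the value from the latest time on or before each day (column names as in the examples above; T-SQL-style syntax assumed):
with cte as (
    select min(date) date, max(date) max_date from your_table
    union all
    select dateadd(day, 1, date), max_date
    from cte
    where date < max_date
)
select
    c.date,
    (
        select top 1 value
        from your_table
        where date <= c.date
        order by date desc, time desc   -- latest row on or before this day
    ) as value
from cte c
order by c.date;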