Getting proper count for longest user streaks - sql

I'm having a difficult time getting the correct counts for longest user streaks. Streaks are consecutive days with check-ins for each user.
Any help would be greatly appreciated. Here's a fiddle with my script and sample data: http://sqlfiddle.com/#!17/d2825/1/0
check_ins table:

user_id | goal_id  | check_in_date
--------|----------|--------------------
colt    | 40365fa0 | 2019-01-07 15:35:53
colt    | d31efe70 | 2019-01-11 15:35:52
berry   | be2fcd50 | 2019-01-12 15:35:51
colt    | e754d050 | 2019-01-13 15:17:16
colt    | 9c87a7f0 | 2019-01-14 15:35:54
colt    | ucgtdes0 | 2019-01-15 12:30:59
PostgreSQL script:

WITH dates(date) AS (
    SELECT DISTINCT cast(check_in_date AS date),
           user_id
    FROM check_ins
),
groups AS (
    SELECT row_number() OVER (ORDER BY date) AS rn,
           date - (row_number() OVER (ORDER BY date) * interval '1' day) AS grp,
           date,
           user_id
    FROM dates
)
SELECT count(*) AS streak,
       user_id
FROM groups
GROUP BY grp,
         user_id
ORDER BY 1 DESC;
Here's what I get when I run the code above:
streak user_id
--------------
4 colt
1 colt
1 berry
Here's what it should be. I'd also like to get only the longest streak for each user.
streak user_id
--------------
3 colt
1 berry

In Postgres, you can write this as:
select distinct on (user_id) user_id, count(distinct check_in_date::date) as num_days
from (select ci.*,
             dense_rank() over (partition by user_id order by check_in_date::date) as seq
      from check_ins ci
     ) ci
group by user_id, check_in_date::date - seq * interval '1 day'
order by user_id, num_days desc;
Here is a db<>fiddle.
This follows logic similar to your approach, but your query seems more complicated than necessary. It uses Postgres's distinct on functionality, which is handy for avoiding an additional subquery.
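As a sanity check on the gaps-and-islands idea, here is a minimal sketch in Python's built-in sqlite3, using the question's sample rows. SQLite (3.25+ for window functions) has no DISTINCT ON and no date-minus-interval arithmetic, so julianday() stands in for the date subtraction and MAX over the per-group counts stands in for DISTINCT ON:

```python
import sqlite3

# Sketch of the gaps-and-islands streak count in SQLite, seeded with the
# question's sample rows. julianday() replaces Postgres's date - interval
# arithmetic; MAX over the per-group counts replaces DISTINCT ON.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE check_ins (user_id TEXT, goal_id TEXT, check_in_date TEXT);
INSERT INTO check_ins VALUES
  ('colt',  '40365fa0', '2019-01-07 15:35:53'),
  ('colt',  'd31efe70', '2019-01-11 15:35:52'),
  ('berry', 'be2fcd50', '2019-01-12 15:35:51'),
  ('colt',  'e754d050', '2019-01-13 15:17:16'),
  ('colt',  '9c87a7f0', '2019-01-14 15:35:54'),
  ('colt',  'ucgtdes0', '2019-01-15 12:30:59');
""")

rows = conn.execute("""
WITH days AS (
  SELECT DISTINCT user_id, date(check_in_date) AS d FROM check_ins
),
grouped AS (
  SELECT user_id, d,
         julianday(d) - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY d) AS grp
  FROM days
),
streaks AS (
  SELECT user_id, COUNT(*) AS streak FROM grouped GROUP BY user_id, grp
)
SELECT user_id, MAX(streak) FROM streaks GROUP BY user_id ORDER BY 2 DESC
""").fetchall()
print(rows)  # [('colt', 3), ('berry', 1)]
```

This reproduces the expected output: colt's longest streak is the three consecutive days 2019-01-13 through 2019-01-15, and berry's is 1.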

First, thanks for the fiddle script and sample data.
You are not using the right row_number() combination to implement the gaps-and-islands problem; it should look like the query below for your data set. On top of that, to get the row with the highest streak, you need DISTINCT ON after grouping by the group number (grp in your query; I called it seq).
I assume you want to count only the distinct entries per day in each user's data. I have tried to reflect that with slight changes in the WITH clause.
SELECT *
FROM (
    WITH check_ins_dt AS (
        SELECT DISTINCT check_in_date::date AS check_in_date,
               user_id
        FROM check_ins
    )
    SELECT DISTINCT ON (user_id) COUNT(*) AS streak,
           user_id
    FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (ORDER BY check_in_date)
               - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY check_in_date) AS seq
        FROM check_ins_dt c
    ) s
    GROUP BY user_id, seq
    ORDER BY user_id, COUNT(*) DESC
) q
ORDER BY streak DESC;
Demo

Related

How to sample for users and dates within a given timeframe in SQL?

I am using Redshift SQL and would like to sample by user_id, but I am not sure how to specify that.
Let's say my table looks like this
user_id | date       | other columns
--------|------------|--------------
1       | 2020-01-01 | ...
1       | 2020-02-01 | ...
2       | 2020-02-11 | ...
...
How do I filter for 10,000 random user_id & day pairs between 2000-01-01 and 2020-01-01? How do I do this in SQL?
We can use ROW_NUMBER() with a random ordering:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (ORDER BY RANDOM()) rn
    FROM yourTable
    WHERE date BETWEEN '2000-01-01' AND '2020-01-01'
)
SELECT *
FROM cte
WHERE rn <= 10000;
You can use ORDER BY and LIMIT (or TOP).
select *
from <table>
order by random()
limit 10000
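For a quick local test of the ORDER BY random() LIMIT approach, here is a sketch in Python's built-in sqlite3 on a made-up checkins table (the table name, columns, and data are illustrative, not from the original schema):

```python
import sqlite3

# Sketch of the ORDER BY random() LIMIT sampling approach in SQLite.
# The checkins table and its contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkins (user_id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO checkins VALUES (?, ?)",
    [(u, f"2010-01-{d:02d}") for u in range(1, 101) for d in range(1, 29)],
)

# 2800 candidate rows in the date range; keep 10 at random.
sample = conn.execute("""
    SELECT user_id, date
    FROM checkins
    WHERE date BETWEEN '2000-01-01' AND '2020-01-01'
    ORDER BY RANDOM()
    LIMIT 10
""").fetchall()
print(len(sample))  # 10
```

Note that if the same user_id/day pair can appear on several rows, you would sample from SELECT DISTINCT user_id, date rather than from the raw table, since the question asks for random pairs rather than random rows.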

How to select the last record of each ID

I need to extract the last records of each user from the table. The table schema is like below.
mytable
product | user_id |
-------------------
A | 15 |
B | 15 |
A | 16 |
C | 16 |
-------------------
The output I want to get is
product | user_id |
-------------------
B | 15 |
C | 16 |
Basically the last records of each user.
Thanks in advance!
You can use a window function called ROW_NUMBER. Here is a solution for you, given below. I have also made a demo query in db-fiddle; please check the link: Demo Code in DB-Fiddle
WITH CTE AS (
    SELECT product, user_id,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY product DESC) AS RN
    FROM Mytable
)
SELECT product, user_id FROM CTE WHERE RN = 1;
You can try using row_number():
select product, userid
from
(
    select product, userid,
           row_number() over (partition by userid order by product desc) as rn
    from tablename
) A
where rn = 1
There is no such thing as a "last" record unless you have a column that specifies the ordering. SQL tables represent unordered sets (well technically, multisets).
If you have such a column, then use distinct on:
select distinct on (user_id) t.*
from t
order by user_id, <ordering col> desc;
Distinct on is a very handy Postgres extension that returns one row per "group". It is the first row based on the ordering specified in the order by clause.
You should have a column that stores the insertion order, either through auto increment or a value with date and time.
Ex:

autoIncrement | product | user_id
--------------|---------|--------
1             | A       | 15
2             | B       | 15
3             | A       | 16
4             | C       | 16

SELECT product, user_id
FROM table
INNER JOIN (
    SELECT MAX(autoIncrement) AS id
    FROM table
    GROUP BY user_id
) AS table_Aux ON table.autoIncrement = table_Aux.id
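The ROW_NUMBER() approach can be tried end to end in Python's built-in sqlite3. Since the question's table has no ordering column, this sketch leans on SQLite's rowid (its physical insertion order) as a stand-in for the explicit auto-increment or timestamp column the answers above call for:

```python
import sqlite3

# Sketch of the ROW_NUMBER() "last record per user" answer in SQLite.
# rowid is assumed here as a stand-in for an insertion-order column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (product TEXT, user_id INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                 [("A", 15), ("B", 15), ("A", 16), ("C", 16)])

rows = conn.execute("""
    SELECT product, user_id
    FROM (
        SELECT product, user_id,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY rowid DESC) AS rn
        FROM mytable
    )
    WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(rows)  # [('B', 15), ('C', 16)]
```

This returns exactly the expected output: the last inserted row for each user_id.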

Exclude columns from Group by

I have a table like this
My current query:
Select team,
       stat_id,
       max(statsval) as statsval
from tbl
group by team,
         stat_id
Issue:
I need to get season in the select as well, and obviously I'd need to add it to the group by, but that gives me unexpected results. I can't change my group by, because I need to group by stat_id only, not by season. I need to get the season of the max() record. Can someone help me with this?
I even tried
Select team,
       stat_id,
       max(seasonid),
       max(statsval) as statsval
from tbl
group by team,
         stat_id
But that takes the max season, which is not necessarily the correct result.
Expected result
+--------+--------+-------+---------+---------+
| season | team | round | stat_id | statval |
+--------+--------+-------+---------+---------+
| 2004 | 500146 | 3 | 1 | 5 |
| 2007 | 500147 | 1 | 1 | 4 |
+--------+--------+-------+---------+---------+
Depending on your edition of SQL Server, this can be done with Window functions only:
SELECT DISTINCT team,
       stat_id,
       max(statsval) OVER (PARTITION BY team, stat_id) AS statsval,
       FIRST_VALUE(season_id) OVER (PARTITION BY team, stat_id ORDER BY statsval DESC) AS season_id
FROM tbl
Try this using window functions:
Select distinct team,
       statid,
       max(statsval) OVER (PARTITION BY team, statid) as statsval,
       max(seasonid) OVER (PARTITION BY team, statid) as seasonid
from tbl
Try this and look up the season id after the grouping is done:
;with tmp as
(
    select team,
           stat_id,
           max(statsval) as statsval
    from tbl
    group by team,
             stat_id
)
select tmp.*,
       tbl.seasonid
from tmp
join tbl on tmp.team = tbl.team
        and tmp.stat_id = tbl.stat_id
        and tmp.statsval = tbl.statsval;
If you want the complete row, you can simply use a correlated subquery:
Select t.*
from tbl t
where t.season = (select max(t2.season)
from tbl t2
where t2.team = t.team and t2.statsval = t.statsval
);
With an index on tbl(team, statsval, season), this probably performs as well as or better than other options.
A fun method that has worse performance (even with the index) is:
select top (1) with ties t.*
from tbl t
order by row_number() over (partition by team, statsval order by season desc);
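For a runnable illustration of the windows-only idea, here is a sketch in Python's built-in sqlite3, seeded with the question's expected-result rows plus one extra row so the aggregation actually discards something (the question's season vs. seasonid naming is normalized to season here):

```python
import sqlite3

# Sketch of the "window functions only" answer in SQLite: MAX over the
# partition gets the top statsval, FIRST_VALUE gets the season of that
# top row, and DISTINCT collapses the partition to one row.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tbl
                (season INTEGER, team INTEGER, round INTEGER,
                 stat_id INTEGER, statsval INTEGER)""")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?, ?)", [
    (2004, 500146, 3, 1, 5),
    (2005, 500146, 2, 1, 2),   # lower statsval; should be aggregated away
    (2007, 500147, 1, 1, 4),
])

rows = conn.execute("""
    SELECT DISTINCT team, stat_id,
           MAX(statsval) OVER (PARTITION BY team, stat_id) AS statsval,
           FIRST_VALUE(season) OVER (PARTITION BY team, stat_id
                                     ORDER BY statsval DESC) AS season
    FROM tbl
    ORDER BY team
""").fetchall()
print(rows)  # [(500146, 1, 5, 2004), (500147, 1, 4, 2007)]
```

Note that season comes out as 2004 for team 500146, not the max season 2005, which is exactly the behavior the asker wanted.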

How can i select only id of min created date in each group [duplicate]

This question already has answers here:
Select first row in each GROUP BY group?
(20 answers)
Closed 6 years ago.
Imagine next tables
Ticket Table

id | question
---|------------------
1  | Can u help me :)?

UserEntry Table

id | answer               | dateCreated | ticket_id
---|----------------------|-------------|----------
2  | It's my plessure :)? | 2016-08-05  | 1
3  | How can i help u ?   | 2016-08-06  | 1
So how can I get only the id of the row in each group that has the min date value?
My expected answer should look like this:

id
---
2
UPDATE:
I got the solution with the following query:
SELECT id FROM UserEntry WHERE datecreated IN (SELECT MIN(datecreated) FROM UserEntry GROUP BY ticket_id)
Improved Answer
SELECT id FROM UserEntry WHERE (ticket_id, datecreated) IN
(SELECT ticket_id, MIN(datecreated) FROM UserEntry GROUP BY ticket_id);
Also this is a good and right answer too (NOTE: DISTINCT ON is not a part of the SQL standard.)
SELECT DISTINCT ON (ue.ticket_id) ue.id
FROM UserEntry ue
ORDER BY ue.ticket_id, ue.datecreated
It seems you want to select the ID with the minimum datecreated. That is simple: select the minimum date and then select the id(s) matching this date.
SELECT id FROM UserEntry WHERE datecreated = (SELECT MIN(datecreated) FROM UserEntry);
If you are sure you won't have ties or if you are fine with just one row anyway, you can also use FETCH FIRST ROW ONLY which doesn't have a tie clause in PostgreSQL unfortunately.
SELECT id FROM UserEntry ORDER BY datecreated FETCH FIRST ROW ONLY;
UPDATE: You want the entry ID for the minimum date per ticket. Per ticket translates to GROUP BY ticket_id in SQL.
SELECT ticket_id, id FROM UserEntry WHERE (ticket_id, datecreated) IN
(SELECT ticket_id, MIN(datecreated) FROM UserEntry GROUP BY ticket_id);
The same can be achieved with window functions where you read the table only once:
SELECT ticket_id, id
FROM
(
SELECT ticket_id, id, RANK() OVER (PARTITION BY ticket_id ORDER BY datecreated) AS rnk
FROM UserEntry
) ranked
WHERE rnk = 1;
(Change SELECT ticket_id, id to SELECT id if you want the queries not to show the ticket ID, which would make the results harder to understand of course :-)
You may want fetch first row only or distinct on (if you care about more than one ticket):
SELECT DISTINCT ON (ue.ticket_id) ue.id
FROM UserEntry ue
ORDER BY ue.ticket_id, ue.date_created
This will get the id on the row with the minimum date_created value.
A solution with ANSI SQL that works on a wide range of DBMS that support modern SQL is to use window functions:
select id
from (
select id, row_number() over (partition by ticket_id order by date_created) as rn
from userentry
) t
where rn = 1;
Note that in Postgres, Gordon's solution using distinct on () is usually faster than using window functions.
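The tuple-IN variant is easy to test locally. Here is a sketch in Python's built-in sqlite3 (SQLite supports row-value comparisons since 3.15), seeded with the question's two UserEntry rows:

```python
import sqlite3

# Sketch of the (ticket_id, MIN(datecreated)) tuple-IN approach in SQLite,
# using the question's two sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE UserEntry
                (id INTEGER, answer TEXT, dateCreated TEXT, ticket_id INTEGER)""")
conn.executemany("INSERT INTO UserEntry VALUES (?, ?, ?, ?)", [
    (2, "It's my pleasure :)", "2016-08-05", 1),
    (3, "How can i help u ?", "2016-08-06", 1),
])

rows = conn.execute("""
    SELECT id FROM UserEntry
    WHERE (ticket_id, dateCreated) IN
          (SELECT ticket_id, MIN(dateCreated) FROM UserEntry GROUP BY ticket_id)
""").fetchall()
print(rows)  # [(2,)]
```

Pairing ticket_id with the min date in the IN list is what protects against the bug in the OP's first attempt, where a min date from one ticket could match an unrelated row of another ticket.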

Compare different orders of the same table

I have this following scenario, a table with these columns:
table_id|user_id|os_number|inclusion_date
In the system, the os_number is sequential for each user, but due to a system bug some users inserted OSs in the wrong order. Something like this:
table_id | user_id | os_number | inclusion_date
-----------------------------------------------
1 | 1 | 1 | 2015-11-01
2 | 1 | 2 | 2015-11-02
3 | 1 | 3 | 2015-11-01
Note that os number 3 was inserted before os number 2.
What I need:
Recover the table_id of rows 2 and 3, which are out of order.
I have these two select that show me the table_id in two different orders:
select table_id from table order by user_id, os_number
select table_id from table order by user_id, inclusion_date
I can't figure out how I can compare these two selects and see which users are affected by this system bug.
Your question is a bit difficult because, as presented, there is no single correct ordering: dates can have ties. So, use the rank() or dense_rank() function to compare the two values and return the ones that are not in the correct order:
select t.*
from (select t.*,
rank() over (partition by user_id order by inclusion_date) as seqnum_d,
rank() over (partition by user_id order by os_number) as seqnum_o
from t
) t
where seqnum_d <> seqnum_o;
Use row_number() over both orders:
select *
from (
select *,
row_number() over (order by os_number) rnn,
row_number() over (order by inclusion_date) rnd
from a_table
) s
where rnn <> rnd;
table_id | user_id | os_number | inclusion_date | rnn | rnd
----------+---------+-----------+----------------+-----+-----
3 | 1 | 3 | 2015-11-01 | 3 | 2
2 | 1 | 2 | 2015-11-02 | 2 | 3
(2 rows)
Not entirely sure about the performance of this, but you could use a cross apply on the same table to get the results in one query (note that cross apply is SQL Server syntax; the Postgres equivalent is a lateral join). This will bring up the pairs of table_ids which are in the wrong order.
select
a.table_id as InsertedAfterTableId,
c.table_id as InsertedBeforeTableId
from table a
cross apply
(
select b.table_id
from table b
where b.inclusion_date < a.inclusion_date and b.os_number > a.os_number
) c
Both query examples given below simply check for a mismatch between inclusion_date and os_number.
This first query should return the offending row (the one whose os_number is out of step with its inclusion date), in this example row 3.
select table.table_id, table.user_id, table.os_number from table
where EXISTS(select * from table t
where t.user_id = table.user_id and
t.inclusion_date > table.inclusion_date and
t.os_number < table.os_number);
This second query will return the table numbers and users for two rows that are mismatched:
select first_table.table_id, second_table.table_id, first_table.user_id from
table first_table
JOIN table second_table
ON (first_table.user_id = second_table.user_id and
first_table.inclusion_date > second_table.inclusion_date and
first_table.os_number < second_table.os_number);
I would use window functions to get the row numbers in both orders in question and then compare them:
SELECT
    sub.table_id,
    sub.user_id,
    sub.os_number,
    sub.inclusion_date,
    number_order_1,
    number_order_2
FROM (
    SELECT
        table_id,
        user_id,
        os_number,
        inclusion_date,
        row_number() OVER (PARTITION BY user_id
                           ORDER BY os_number) AS number_order_1,
        row_number() OVER (PARTITION BY user_id
                           ORDER BY inclusion_date) AS number_order_2
    FROM
        table
) sub
WHERE
    number_order_1 <> number_order_2;
EDIT:
Because a_horse_with_no_name made a good point about my final answer, I've gone back to my first answer (see the edit history), which also works if os_number isn't gapless.
select *
from (
    select a_table.*,
           lag(inclusion_date) over (partition by user_id order by os_number) as last_date
    from a_table
) result
where last_date is not null and last_date > inclusion_date;
This should cover gaps as well as ties. Basically, I simply look at the inclusion_date of the previous os_number and flag rows where it is strictly greater than the current date (so two versions on the same date are fine).
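The rank-comparison approach from the first answer can be checked against the question's three sample rows with Python's built-in sqlite3:

```python
import sqlite3

# Sketch of the rank-comparison answer in SQLite, seeded with the
# question's three sample rows (table name a_table as in the second answer).
# RANK() handles date ties, so rows 2 and 3 are flagged while row 1 is not.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE a_table
                (table_id INTEGER, user_id INTEGER,
                 os_number INTEGER, inclusion_date TEXT)""")
conn.executemany("INSERT INTO a_table VALUES (?, ?, ?, ?)", [
    (1, 1, 1, "2015-11-01"),
    (2, 1, 2, "2015-11-02"),
    (3, 1, 3, "2015-11-01"),
])

rows = conn.execute("""
    SELECT table_id
    FROM (
        SELECT table_id,
               RANK() OVER (PARTITION BY user_id ORDER BY inclusion_date) AS seq_d,
               RANK() OVER (PARTITION BY user_id ORDER BY os_number) AS seq_o
        FROM a_table
    )
    WHERE seq_d <> seq_o
    ORDER BY table_id
""").fetchall()
print(rows)  # [(2,), (3,)]
```

This matches the expected result in the question: rows 2 and 3 are the ones whose date order disagrees with their os_number order.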