How to count repeating values in a column in PostgreSQL? - sql

Hi, I have a table like the one below, and I want to count the repeating values in the status column. I don't want to count the overall duplicates. For example, I just want to count how many times "Offline" appears until the value changes to "Idle".
This is the result I want. Thank you.

This is often called a gaps-and-islands problem.
One way to do it is with two sequences of row numbers.
Examine each intermediate result of the query to understand how it works.
WITH
CTE_rn
AS
(
SELECT
status
,dt
,ROW_NUMBER() OVER (ORDER BY dt) as rn1
,ROW_NUMBER() OVER (PARTITION BY status ORDER BY dt) as rn2
FROM
T
)
SELECT
status
,COUNT(*) AS cnt
FROM
CTE_rn
GROUP BY
status
,rn1-rn2
ORDER BY
min(dt)
;
Result
| status | cnt |
|---------|-----|
| offline | 2 |
| idle | 1 |
| offline | 2 |
| idle | 1 |
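To look at the intermediate result mentioned above, here is a minimal sketch that runs the same query through Python's sqlite3 module. The rows and timestamps are hypothetical (the question's table is not shown) and were chosen only so that they reproduce the result above; SQLite stands in for PostgreSQL here.

```python
import sqlite3

# Hypothetical sample rows; the original table is not shown in the question.
rows = [
    ("offline", "2021-01-01 10:00"),
    ("offline", "2021-01-01 10:05"),
    ("idle",    "2021-01-01 10:10"),
    ("offline", "2021-01-01 10:15"),
    ("offline", "2021-01-01 10:20"),
    ("idle",    "2021-01-01 10:25"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (status TEXT, dt TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?)", rows)

cte = """
    WITH CTE_rn AS (
        SELECT status, dt,
               ROW_NUMBER() OVER (ORDER BY dt) AS rn1,
               ROW_NUMBER() OVER (PARTITION BY status ORDER BY dt) AS rn2
        FROM T
    )
"""

# Intermediate result: both row-number sequences, in time order.
intermediate = conn.execute(
    cte + "SELECT status, rn1, rn2 FROM CTE_rn ORDER BY dt"
).fetchall()
for status, rn1, rn2 in intermediate:
    # rn1 - rn2 stays constant while the status does not change.
    print(status, rn1, rn2, rn1 - rn2)

# Final result: one row per run of equal statuses.
islands = conn.execute(
    cte + """
    SELECT status, COUNT(*) AS cnt
    FROM CTE_rn
    GROUP BY status, rn1 - rn2
    ORDER BY MIN(dt)
    """
).fetchall()
print(islands)
```

Within a run of equal statuses both counters advance by one per row, so their difference does not move; it only changes when another status interrupts the run.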

WITH
cte1 AS ( SELECT status,
"date",
workstation,
CASE WHEN status = LAG(status) OVER (PARTITION BY workstation ORDER BY "date")
THEN 0
ELSE 1 END changed
FROM test ),
cte2 AS ( SELECT status,
"date",
workstation,
SUM(changed) OVER (PARTITION BY workstation ORDER BY "date") group_num
FROM cte1 )
SELECT status, COUNT(*) "count", workstation, MIN("date") "from", MAX("date") "till"
FROM cte2
GROUP BY group_num, status, workstation;
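A small sketch of what cte1 and cte2 compute, run through Python's sqlite3 module. The workstation names, timestamps and statuses are invented for illustration, and SQLite is only a stand-in for PostgreSQL.

```python
import sqlite3

# Hypothetical rows for two workstations; all values are invented.
rows = [
    ("ws1", "2021-01-01 10:00", "offline"),
    ("ws1", "2021-01-01 10:05", "offline"),
    ("ws1", "2021-01-01 10:10", "idle"),
    ("ws1", "2021-01-01 10:15", "offline"),
    ("ws2", "2021-01-01 10:00", "idle"),
    ("ws2", "2021-01-01 10:05", "idle"),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test (workstation TEXT, "date" TEXT, status TEXT)')
conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)

steps = conn.execute("""
    WITH cte1 AS (
        SELECT workstation, "date", status,
               -- 1 on the first row of a run (LAG is NULL or differs), else 0
               CASE WHEN status = LAG(status) OVER (PARTITION BY workstation ORDER BY "date")
                    THEN 0 ELSE 1 END AS changed
        FROM test
    )
    SELECT workstation, "date", status, changed,
           -- running total of the flags numbers the runs per workstation
           SUM(changed) OVER (PARTITION BY workstation ORDER BY "date") AS group_num
    FROM cte1
    ORDER BY workstation, "date"
""").fetchall()

for row in steps:
    print(row)
```

The first row of each workstation has no previous row, so LAG returns NULL, the comparison is not true, and the CASE falls through to 1. That is what starts the numbering at 1.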

Related

SQL The largest number of consecutive values for each value

I have a table MatchResults
id | player_win_id
------------------
1 | 1
2 | 1
3 | 3
4 | 1
5 | 2
6 | 3
7 | 3
8 | 1
9 | 1
10 | 1
I need to find out for each player ID the highest number of consecutive victories. I use MS SQL Server.
Expected Result
PLAYER_ID | WIN_COUNT
------------------
1 | 3
2 | 1
3 | 2
This is a type of gaps-and-islands problem. One solution uses the difference of row numbers. So, to get all streaks:
select player_win_id, count(*)
from (select t.*,
row_number() over (order by id) as seqnum,
row_number() over (partition by player_win_id order by id) as seqnum_p
from MatchResults t
) t
group by player_win_id, (seqnum - seqnum_p);
Why this works is a little tricky to explain. But if you look at the results of the subquery, you'll probably see how the difference between the row number values captures adjacent rows with the same player win id.
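To make that concrete, here is a minimal sketch that prints the subquery's rows for the sample data from the question, using Python's sqlite3 module as a stand-in for SQL Server (an assumption made only so the example is runnable anywhere).

```python
import sqlite3

# player_win_id for ids 1..10, copied from the question.
winners = [1, 1, 3, 1, 2, 3, 3, 1, 1, 1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MatchResults (id INTEGER PRIMARY KEY, player_win_id INTEGER)")
conn.executemany("INSERT INTO MatchResults VALUES (?, ?)",
                 list(enumerate(winners, start=1)))

subquery = conn.execute("""
    SELECT id, player_win_id,
           ROW_NUMBER() OVER (ORDER BY id) AS seqnum,
           ROW_NUMBER() OVER (PARTITION BY player_win_id ORDER BY id) AS seqnum_p
    FROM MatchResults
    ORDER BY id
""").fetchall()

for id_, player, seqnum, seqnum_p in subquery:
    # The last column is the value the outer query groups by.
    print(id_, player, seqnum, seqnum_p, seqnum - seqnum_p)
```

Note that the difference alone is not unique: ids 5 to 10 all have a difference of 4 although three different players are involved. That is why the outer query groups by player_win_id as well as by the difference.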
For the maximum, the simplest is probably just an aggregation query:
select player_win_id, max(cnt)
from (select player_win_id, count(*) as cnt
from (select t.*,
row_number() over (order by id) as seqnum,
row_number() over (partition by player_win_id order by id) as seqnum_p
from MatchResults t
) t
group by player_win_id, (seqnum - seqnum_p)
) p
group by player_win_id;
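As a quick check, this sketch runs the aggregation against the sample data from the question. It uses Python's sqlite3 module instead of SQL Server, which is an assumption for the sake of a self-contained example; the column alias win_count is also just illustrative.

```python
import sqlite3

# player_win_id for ids 1..10, copied from the question.
winners = [1, 1, 3, 1, 2, 3, 3, 1, 1, 1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MatchResults (id INTEGER PRIMARY KEY, player_win_id INTEGER)")
conn.executemany("INSERT INTO MatchResults VALUES (?, ?)",
                 list(enumerate(winners, start=1)))

longest = conn.execute("""
    SELECT player_win_id, MAX(cnt) AS win_count
    FROM (SELECT player_win_id, COUNT(*) AS cnt          -- length of each streak
          FROM (SELECT id, player_win_id,
                       ROW_NUMBER() OVER (ORDER BY id) AS seqnum,
                       ROW_NUMBER() OVER (PARTITION BY player_win_id ORDER BY id) AS seqnum_p
                FROM MatchResults) AS t
          GROUP BY player_win_id, seqnum - seqnum_p) AS p
    GROUP BY player_win_id                               -- longest streak per player
    ORDER BY player_win_id
""").fetchall()

print(longest)  # [(1, 3), (2, 1), (3, 2)]
```

This matches the expected result in the question.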
Now I understand the previous comment. The code for my table is:
select player_win_id, max(cnt)
from (select player_win_id, count(*) as cnt
from (select *,
row_number() over (order by id) as seqnum,
row_number() over (partition by player_win_id order by id) as seqnum_p
from MatchResults ) t
group by player_win_id, (seqnum - seqnum_p)
) p
group by player_win_id;

Select SUM and column with max

I am looking for the best or simplest way to SELECT type, user_with_max_value, SUM(value) GROUP BY type. The table looks like this:
type | user | value
type1 | 1 | 100
type1 | 2 | 200
type2 | 1 | 50
type2 | 2 | 10
And the result should look like this:
type1 | 2 | 300
type2 | 1 | 60
Use window functions:
select type, max(case when seqnum = 1 then user end), sum(value)
from (select t.*,
row_number() over (partition by type order by value desc) as seqnum
from t
) t
group by type;
Some databases have functionality for an aggregation function that returns the first value. One method without a subquery using standard SQL is:
select distinct type,
first_value(user) over (partition by type order by value desc) as user,
sum(value) over (partition by type)
from t;
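Here is a small sketch that checks two window-function variants against the sample data from the question: conditional aggregation grouped by type, and the first_value() form. It runs on SQLite through Python's sqlite3 module, which is an assumption for portability; the column names are double-quoted because user is a reserved word in some databases, and the aliases top_user and total are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t ("type" TEXT, "user" INTEGER, "value" INTEGER)')
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("type1", 1, 100), ("type1", 2, 200), ("type2", 1, 50), ("type2", 2, 10)],
)

# Variant 1: number the rows per type, then aggregate per type and pick
# the user from the row that was numbered 1.
by_group = conn.execute("""
    SELECT "type",
           MAX(CASE WHEN seqnum = 1 THEN "user" END) AS top_user,
           SUM("value") AS total
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY "type" ORDER BY "value" DESC) AS seqnum
          FROM t) AS s
    GROUP BY "type"
    ORDER BY "type"
""").fetchall()

# Variant 2: no subquery; every row carries the per-type answer and
# DISTINCT collapses the duplicates.
by_window = conn.execute("""
    SELECT DISTINCT "type",
           FIRST_VALUE("user") OVER (PARTITION BY "type" ORDER BY "value" DESC) AS top_user,
           SUM("value") OVER (PARTITION BY "type") AS total
    FROM t
    ORDER BY "type"
""").fetchall()

print(by_group)
print(by_window)
```

Both variants sum over all rows of a type. Filtering on the row number before summing would leave only the largest value in the sum.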
You can use window functions:
select t.*
from (select t.type, t.user,
row_number() over (partition by type order by value desc) as seq,
sum(value) over (partition by type) as value
from t
) t
where seq = 1;
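Try the query below.
Note that max(user) returns the largest user id in each group rather than the user with the largest value, so on the sample data it returns 2 for type2 instead of the expected 1.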
SELECT type, max(user), SUM(value) from table1 GROUP BY type
Use analytic functions:
create table poo2
(
thetype varchar(5),
theuser int,
thevalue int
)
insert into poo2
select 'type1',1,100 union all
select 'type1',2,200 union all
select 'type2',1,50 union all
select 'type2',2,10
select thetype,theuser,mysum
from
(
select thetype ,theuser
,row_number() over (partition by thetype order by thevalue desc) r
,sum(thevalue) over (partition by thetype) mysum from poo2
) ilv
where r=1

sql, big query: aggregate all entries between two strings in a variable

I have to solve this problem within BigQuery. I have these columns in my table:
event | time
_________________|____________________
start | 1
end | 2
random_event_X | 3
start | 4
error_X | 5
error_Y | 6
end | 7
start | 8
error_F | 9
start | 10
random_event_Y | 11
error_z | 12
end | 13
Starting from each end event, I would like to record everything back to the preceding start, and then count it. Anything can happen between a start and an end, and outside of them. If there is an end, there is a start, but if there is a start, there is not necessarily an end.
The desired output would be like:
string_agg | count
"start, end" | 1
"start, error_X, error_Y, end" | 1
"start, random_event_Y, error_z, end" | 1
So everything between each start and end, if the start has an end. That excludes the random_event_X at time 3, the start at time 8, and the error_F at time 9.
I was not able to find a solution and have struggled to understand how to approach this problem. Any help or advice is welcome.
Below is for BigQuery Standard SQL
#standardSQL
SELECT agg_events, COUNT(1) cnt
FROM (
SELECT STRING_AGG(event ORDER BY time) agg_events, COUNTIF(event IN ('start', 'end')) flag
FROM (
SELECT *, COUNTIF(event = 'start') OVER(PARTITION BY grp1 ORDER BY time) grp2
FROM (
SELECT *, COUNTIF(event = 'end') OVER(ORDER BY time DESC) grp1
FROM `project.dataset.table`
)
)
GROUP BY grp1, grp2
)
WHERE flag = 2
GROUP BY agg_events
Applied to the sample data from your question, the result is:
| Row | agg_events | cnt |
|-----|------------|-----|
| 1 | start,random_event_Y,error_z,end | 1 |
| 2 | start,error_X,error_Y,end | 1 |
| 3 | start,end | 1 |
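To see how grp1 and grp2 carve up the rows, here is a rough sketch on the sample data using Python's sqlite3 module. SQLite has no COUNTIF, so SUM(event = 'end') is used as an equivalent (an assumption of this sketch, not BigQuery syntax), and the final string aggregation is done in Python rather than with STRING_AGG.

```python
import sqlite3

# Events at times 1..13, copied from the question.
events = [
    "start", "end", "random_event_X", "start", "error_X", "error_Y", "end",
    "start", "error_F", "start", "random_event_Y", "error_z", "end",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event TEXT, time INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(e, i) for i, e in enumerate(events, start=1)])

labelled = conn.execute("""
    SELECT event, time, grp1,
           -- running count of 'start' inside each grp1 block
           SUM(event = 'start') OVER (PARTITION BY grp1 ORDER BY time) AS grp2
    FROM (SELECT event, time,
                 -- counted from the latest row backwards, so each 'end'
                 -- closes the block of rows that precede it
                 SUM(event = 'end') OVER (ORDER BY time DESC) AS grp1
          FROM events) AS s
    ORDER BY time
""").fetchall()

# Collect the events of each (grp1, grp2) pair in time order.
groups = {}
for event, time, grp1, grp2 in labelled:
    groups.setdefault((grp1, grp2), []).append(event)

# Keep only groups holding both a start and an end (flag = 2 in the query).
complete = [",".join(evs) for evs in groups.values()
            if sum(e in ("start", "end") for e in evs) == 2]
print(complete)
```

Each group can hold at most one start (grp2 increases at every start) and at most one end (grp1 changes at every end), so a flag of 2 means exactly one of each.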
SQL tables represent unordered sets -- this is particularly true in massively parallel, columnar databases such as BigQuery.
So, I have to assume that you have some other column that specifies the ordering. If so, you can use a cumulative sum to identify the groups and then aggregation:
select grp,
string_agg(event, ',' order by time)
from (select t.*,
countif(event = 'start') over (order by time) as grp
from t
) t
group by grp
order by min(time);
Note: I would also advise you to use array_agg() instead of string_agg(). Arrays are generally easier to work with than strings.
EDIT:
I see, you only want up to end. In that case, add another level of window functions:
select grp,
string_agg(event, ',' order by <ordering col>)
from (select t.*,
max(case when event = 'end' then time end) over (partition by grp) as max_end_time
from (select t.*,
countif(event = 'start') over (order by <ordering col>) as grp
from t
) t
) t
where max_end_time is null or time <= max_end_time
group by grp
order by min(<ordering col>);

redshift: how to find row_number after grouping and aggregating?

Suppose I have a table of customer purchases ("my_table") like this:
--------------------------------------
customerid | date_of_purchase | price
-----------|------------------|-------
1 | 2019-09-20 | 20.23
2 | 2019-09-21 | 1.99
1 | 2019-09-21 | 123.34
...
I'd like to be able to find the nth highest spending customer in this table (say n = 5). So I tried this:
with cte as (
select customerid, sum(price) as total_pay,
row_number() over (partition by customerid order by total_pay desc) as rn
from my_table group by customerid order by total_pay desc)
select * from cte where rn = 5;
But this gives me nonsense results. For some reason rn doesn't seem to be unique (for example there are a bunch of customers with rn = 1). I don't understand why. Isn't rn supposed to be just a row number?
Remove the partition by in the definition of row_number():
with cte as (
select customerid, sum(price) as total_pay,
row_number() over (order by total_pay desc) as rn
from my_table
group by customerid
)
select *
from cte
where rn = 5;
You are already aggregating by customerid, so each customer has only one row. With partition by customerid, every partition therefore holds a single row, and the value of rn will always be 1.
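A small sketch of the difference, with both row numbers side by side. The purchase rows are made up, SQLite (via Python's sqlite3 module) stands in for Redshift, and the totals are computed in a subquery first so the sketch does not depend on referencing the total_pay alias inside the window definition.

```python
import sqlite3

# Hypothetical purchases: (customerid, price).
purchases = [(1, 20.0), (2, 5.0), (1, 100.0), (3, 50.0), (2, 10.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (customerid INTEGER, price REAL)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", purchases)

ranked = conn.execute("""
    SELECT customerid, total_pay,
           -- one row per customer, so each partition has exactly one row
           ROW_NUMBER() OVER (PARTITION BY customerid ORDER BY total_pay DESC) AS rn_partitioned,
           -- no partition: rows are numbered across all customers
           ROW_NUMBER() OVER (ORDER BY total_pay DESC) AS rn
    FROM (SELECT customerid, SUM(price) AS total_pay
          FROM my_table
          GROUP BY customerid) AS totals
    ORDER BY rn
""").fetchall()

for row in ranked:
    print(row)
```

Only the unpartitioned rn can be used to pick the nth highest spender.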

SQL most recent using row_number() over partition

I'm working with some web clicks data, and am just looking for the most recent page_name that each user_id visited (by timestamp). With the code below, the user_id is repeated and every page_name is shown, sorted descending. However, I would like to keep only the rows where recent_click = 1. The query, when complete, will be used as a subquery in a larger query.
Here is my current code:
SELECT user_id,
page_name,
row_number() over(partition by session_id order by ts desc) as recent_click
from clicks_data;
user_id | page_name | recent_click
--------+-------------+--------------
0001 | login | 1
0001 | login | 2
0002 | home | 1
You should be able to move your query to a subquery and add where criteria:
SELECT user_id, page_name, recent_click
FROM (
SELECT user_id,
page_name,
row_number() over (partition by session_id order by ts desc) as recent_click
from clicks_data
) T
WHERE recent_click = 1
You should move the row_number() function into a subquery and then filter it in the outer query.
Something like this:
SELECT * FROM (
SELECT
[user_id]
,[page_name]
,ROW_NUMBER() OVER (PARTITION BY [session_id]
ORDER BY [ts] DESC) AS [recent_click]
FROM [clicks_data]
)x
WHERE [recent_click] = 1
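A minimal runnable sketch of the subquery-plus-filter pattern, using Python's sqlite3 module. The click rows, session ids and timestamps are invented for illustration.

```python
import sqlite3

# Hypothetical clicks: (user_id, session_id, page_name, ts).
clicks = [
    ("0001", "s1", "login", 1),
    ("0001", "s1", "home", 2),
    ("0002", "s2", "home", 1),
    ("0002", "s2", "cart", 3),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks_data (user_id TEXT, session_id TEXT, page_name TEXT, ts INTEGER)"
)
conn.executemany("INSERT INTO clicks_data VALUES (?, ?, ?, ?)", clicks)

latest = conn.execute("""
    SELECT user_id, page_name
    FROM (SELECT user_id, page_name,
                 ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY ts DESC) AS recent_click
          FROM clicks_data) AS x
    WHERE recent_click = 1      -- the window result can only be filtered one level up
    ORDER BY user_id
""").fetchall()

print(latest)  # [('0001', 'home'), ('0002', 'cart')]
```

Because the partition is session_id, a user with several sessions would get one row per session. If exactly one row per user is wanted, partitioning by user_id instead is probably what you need.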