Finding consecutive patterns (with SQL)

I have a table consecutive in PostgreSQL.
Each se_id has an idx running from 0 up to 100 (0 to 9 in this sample).
The search pattern:
SELECT *
FROM consecutive
WHERE val_3_bool = 1
AND val_1_dur > 4100 AND val_1_dur < 5900
Now I'm looking for the longest consecutive appearance of this pattern
for each p_id - and the AVG of the counted val_1_dur.
Is it possible to calculate this in pure SQL?

One method is the difference-of-row-numbers approach, which identifies each run of consecutive values:
select pid, count(*) as in_a_row, sum(val1_dur) as dur
from (select t.*,
             row_number() over (partition by pid order by idx) as seqnum,
             row_number() over (partition by pid, val3_bool order by idx) as seqnum_d
      from consecutive t
     ) t
group by (seqnum - seqnum_d), pid, val3_bool;
If you are looking specifically for "1" values, add where val3_bool = 1 to the outer query. To understand why this works, look at the results of the subquery: within a run of identical values both row numbers increase by one per row, so their difference stays constant for the whole run.
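To make the "look at the subquery" advice concrete, here is a minimal sketch that runs the inner query on a tiny made-up sample. It uses SQLite through Python's sqlite3 module purely as a stand-in for PostgreSQL (an assumption; it needs SQLite 3.25 or later for window functions), and the six sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample: a single pid with val3_bool = 1, 1, 0, 1, 1, 1
rows = [(1, 0, 1), (1, 1, 1), (1, 2, 0), (1, 3, 1), (1, 4, 1), (1, 5, 1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consecutive (pid INTEGER, idx INTEGER, val3_bool INTEGER)")
conn.executemany("INSERT INTO consecutive VALUES (?, ?, ?)", rows)

# Same subquery as in the answer, plus the difference as an extra column (grp)
query = """
SELECT idx, val3_bool, seqnum, seqnum_d, seqnum - seqnum_d AS grp
FROM (SELECT t.*,
             row_number() OVER (PARTITION BY pid ORDER BY idx) AS seqnum,
             row_number() OVER (PARTITION BY pid, val3_bool ORDER BY idx) AS seqnum_d
      FROM consecutive t
     ) t
ORDER BY idx
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)  # (idx, val3_bool, seqnum, seqnum_d, grp)
```

In the printed rows, grp stays the same across each unbroken run of equal val3_bool values and changes when the run is interrupted, which is why grouping by the difference (together with pid and val3_bool) isolates the runs.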
You can then get the max using distinct on:
select distinct on (pid) t.*
from (select pid, count(*) as in_a_row, sum(val1_dur) as dur
      from (select t.*,
                   row_number() over (partition by pid order by idx) as seqnum,
                   row_number() over (partition by pid, val3_bool order by idx) as seqnum_d
            from consecutive t
           ) t
      group by (seqnum - seqnum_d), pid, val3_bool
     ) t
order by pid, in_a_row desc;
The distinct on does not require an additional level of subquery, but I think that makes the logic clearer.
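For the full question (longest run of rows matching the search pattern per pid, plus the average val1_dur of that run), a rough end-to-end sketch could look like this. Assumptions: SQLite via Python's sqlite3 stands in for PostgreSQL, so distinct on is swapped for a row_number() filter; the is_match flag, the CTE names, and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical rows: (pid, idx, val1_dur, val3_bool)
rows = [
    (1, 0, 4500, 1), (1, 1, 5000, 1), (1, 2, 3000, 1),
    (1, 3, 4200, 1), (1, 4, 4400, 1), (1, 5, 4600, 1),
    (2, 0, 5000, 1), (2, 1, 5000, 0), (2, 2, 5500, 1), (2, 3, 4500, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE consecutive (pid INTEGER, idx INTEGER, val1_dur INTEGER, val3_bool INTEGER)"
)
conn.executemany("INSERT INTO consecutive VALUES (?, ?, ?, ?)", rows)

query = """
WITH flagged AS (          -- 1 where the row matches the search pattern
    SELECT pid, idx, val1_dur,
           CASE WHEN val3_bool = 1 AND val1_dur > 4100 AND val1_dur < 5900
                THEN 1 ELSE 0 END AS is_match
    FROM consecutive
), numbered AS (           -- difference of row numbers marks each run
    SELECT f.*,
           row_number() OVER (PARTITION BY pid ORDER BY idx) AS seqnum,
           row_number() OVER (PARTITION BY pid, is_match ORDER BY idx) AS seqnum_d
    FROM flagged f
), runs AS (               -- one row per run of matching rows
    SELECT pid, count(*) AS in_a_row, avg(val1_dur) AS avg_dur
    FROM numbered
    WHERE is_match = 1
    GROUP BY pid, seqnum - seqnum_d
), ranked AS (             -- stand-in for PostgreSQL's DISTINCT ON (pid)
    SELECT r.*, row_number() OVER (PARTITION BY pid ORDER BY in_a_row DESC) AS rn
    FROM runs r
)
SELECT pid, in_a_row, avg_dur FROM ranked WHERE rn = 1 ORDER BY pid
"""
longest = conn.execute(query).fetchall()
print(longest)
```

If two runs of a pid have the same length, this picks one of them arbitrarily; add a tie-breaker to the last ORDER BY if that matters.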

Window functions let you compare a row with the previous and the next one.
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/
https://www.postgresql.org/docs/current/static/tutorial-window.html
As seen on How to compare the current row with next and previous row in PostgreSQL? and Filtering by window function result in Postgresql

Related

Complex Ranking in SQL (Teradata)

I have a peculiar problem at hand. I need to rank in the following manner:
Each ID gets a new rank.
rank #1 is assigned to the ID with the lowest date. However, the subsequent dates for that particular ID can be higher but they will get the incremental rank w.r.t other IDs.
(E.g. ADF32 series will be considered to be ranked first as it had the lowest date, although it ends with dates 09-Nov, and RT659 starts with 13-Aug it will be ranked subsequently)
For a particular ID, if the days are consecutive then ranks are same, else they add by 1.
For a particular ID, ranks are given in date ASC.
How to formulate a query?
You need two steps:
select
id_col
,dt_col
,dense_rank()
over (order by min_dt, id_col, dt_col - rnk) as part_col
from
(
select
id_col
,dt_col
,min(dt_col)
over (partition by id_col) as min_dt
,rank()
over (partition by id_col
order by dt_col) as rnk
from tab
) as dt
dt_col - rnk calculates the same result for consecutive dates -> same rank
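A small sketch of why dt_col - rnk is constant for consecutive dates, run on invented sample dates. Assumptions: SQLite via Python's sqlite3 stands in for Teradata, and because SQLite has no date type the dates are ISO strings converted with julianday() before subtracting the rank.

```python
import sqlite3

# Hypothetical data modelled on the question: ADF32 has the earliest date
rows = [
    ("ADF32", "2017-08-01"), ("ADF32", "2017-08-02"), ("ADF32", "2017-11-09"),
    ("RT659", "2017-08-13"), ("RT659", "2017-08-14"), ("RT659", "2017-08-16"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id_col TEXT, dt_col TEXT)")
conn.executemany("INSERT INTO tab VALUES (?, ?)", rows)

query = """
SELECT id_col, dt_col,
       dense_rank() OVER (ORDER BY min_dt, id_col, julianday(dt_col) - rnk) AS part_col
FROM (SELECT id_col, dt_col,
             min(dt_col) OVER (PARTITION BY id_col) AS min_dt,
             rank() OVER (PARTITION BY id_col ORDER BY dt_col) AS rnk
      FROM tab
     ) AS dt
ORDER BY id_col, dt_col
"""
ranked = conn.execute(query).fetchall()
for row in ranked:
    print(row)  # (id_col, dt_col, part_col)
```

Each step forward by one day also raises rnk by one, so the difference does not move; a gap in the dates makes the difference jump, and dense_rank() then hands out the next rank.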
Try datediff on lead/lag and then perform partitioned ranking
select t.ID_COL,t.dt_col,
rank() over(partition by t.ID_COL, t.date_diff order by t.dt_col desc) as rankk
from ( SELECT ID_COL,dt_col,
DATEDIFF(day, Lag(dt_col, 1) OVER(ORDER BY dt_col),dt_col) as date_diff FROM table1 ) t
One way to think about this problem is "when to add 1 to the rank". Well, that occurs when the previous value on a row with the same id_col differs by more than one day. Or when the row is the earliest day for an id.
This turns the problem into a cumulative sum:
select t.*,
sum(case when prev_dt_col = dt_col - 1 then 0 else 1
end) over
(order by min_dt_col, id_col, dt_col) as ranking
from (select t.*,
lag(dt_col) over (partition by id_col order by dt_col) as prev_dt_col,
min(dt_col) over (partition by id_col) as min_dt_col
from t
) t;
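The cumulative-sum idea can be tried out the same way. This is only a sketch under the same assumptions as before: SQLite through Python's sqlite3 instead of Teradata, ISO date strings turned into day numbers with julianday(), and invented sample rows.

```python
import sqlite3

# Hypothetical data: two ids, each with one gap in its dates
rows = [
    ("ADF32", "2017-08-01"), ("ADF32", "2017-08-02"), ("ADF32", "2017-11-09"),
    ("RT659", "2017-08-13"), ("RT659", "2017-08-14"), ("RT659", "2017-08-16"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id_col TEXT, dt_col TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
SELECT id_col, dt_col,
       -- add 1 whenever the previous date of the same id is not exactly one day earlier
       sum(CASE WHEN prev_jd = jd - 1 THEN 0 ELSE 1 END)
           OVER (ORDER BY min_dt_col, id_col, dt_col) AS ranking
FROM (SELECT id_col, dt_col,
             julianday(dt_col) AS jd,
             lag(julianday(dt_col)) OVER (PARTITION BY id_col ORDER BY dt_col) AS prev_jd,
             min(dt_col) OVER (PARTITION BY id_col) AS min_dt_col
      FROM t
     ) s
ORDER BY min_dt_col, id_col, dt_col
"""
ranking = conn.execute(query).fetchall()
for row in ranking:
    print(row)  # (id_col, dt_col, ranking)
```

The first row of every id has no previous date (lag returns NULL), so the CASE falls through to 1 and the id starts a new rank.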

Spark SQL - Finding the maximum value of a month per year

I have created a data frame which contains Year, Month, and the occurrence of incidents (count).
I want to find the month of each year that had the most incidents using Spark SQL.
You can use window functions:
select *
from (select t.*, rank() over(partition by year order by cnt desc) rn from mytable t) t
where rn = 1
For each year, this gives you the row that has the greatest cnt. If there are ties, the query returns them.
Note that count is a language keyword in SQL, hence not a good choice for a column name. I renamed it to cnt in the query.
You can use window functions, if you want to use SQL:
select t.*
from (select t.*,
row_number() over (partition by year order by count desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per year, even if there are ties for the maximum count. If you want all such rows in the event of ties, then use rank() instead of row_number().
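To see the difference between rank() and row_number() when there are ties, here is a quick sketch on made-up data. It runs on SQLite through Python's sqlite3 rather than Spark (an assumption made only to keep the example self-contained); the table name mytable and column cnt follow the first answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (year INTEGER, month INTEGER, cnt INTEGER)")
# Hypothetical counts: months 2 and 3 tie for the maximum in 2019
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(2019, 1, 10), (2019, 2, 15), (2019, 3, 15), (2020, 1, 7), (2020, 5, 3)],
)

template = """
SELECT year, month, cnt
FROM (SELECT t.*, {fn}() OVER (PARTITION BY year ORDER BY cnt DESC) AS rn
      FROM mytable t) t
WHERE rn = 1
ORDER BY year, month
"""
with_ties = conn.execute(template.format(fn="rank")).fetchall()
one_per_year = conn.execute(template.format(fn="row_number")).fetchall()

print(with_ties)     # every month that reaches the yearly maximum
print(one_per_year)  # exactly one row per year; the tied month kept is arbitrary
```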

Aggregate function like MAX for most common cell in column?

Grouping by the highest number in a column worked great with MAX(), but what if I would like to get the value that is most common?
As example:
ID
100
250
250
300
200
250
So I would like to group by ID and, instead of getting the lowest (MIN) or highest (MAX) number, get the most common one (that would be 250, because it appears 3 times).
Is there an easy way in SQL Server 2012 or am I forced to add a second SELECT where I COUNT(DISTINCT ID) and add that somehow to my first SELECT statement?
You can use dense_rank to return all the id's with the highest counts. This would handle cases when there are ties for the highest counts as well.
select id from
(select id, dense_rank() over(order by count(*) desc) as rnk from tablename group by id) t
where rnk = 1
A simple way to do what you want uses top and order by:
SELECT top 1 id
FROM t
GROUP BY id
ORDER BY COUNT(*) DESC;
This is a statistic called the mode. Getting the mode and max is a bit challenging in SQL Server. I would approach it as:
WITH cte AS (
SELECT t.id, COUNT(*) AS cnt,
row_number() OVER (ORDER BY COUNT(*) DESC) AS seqnum
FROM t
GROUP BY id
)
SELECT MAX(id) AS themax, MAX(CASE WHEN seqnum = 1 THEN id END) AS MODE
FROM cte;
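Here is a sketch of the max-plus-mode query on the six values from the question. Assumptions: it runs on SQLite through Python's sqlite3 instead of SQL Server, and the counting is split into its own CTE (counts, a name made up here) so the window function only sorts by a plain column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
# The sample values from the question
conn.executemany("INSERT INTO t VALUES (?)", [(100,), (250,), (250,), (300,), (200,), (250,)])

query = """
WITH counts AS (
    SELECT id, COUNT(*) AS cnt
    FROM t
    GROUP BY id
), cte AS (
    SELECT id, cnt,
           row_number() OVER (ORDER BY cnt DESC) AS seqnum  -- 1 = most frequent id
    FROM counts
)
SELECT MAX(id) AS themax,
       MAX(CASE WHEN seqnum = 1 THEN id END) AS themode
FROM cte
"""
themax, themode = conn.execute(query).fetchone()
print(themax, themode)
```

With row_number() only one value is reported as the mode even when several values share the top count; the dense_rank version above is the one to use if all tied values are wanted.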

Selecting type(s) of account with 2nd maximum number of accounts

Suppose we have an accounts table along with the already given values
I want to find the type of account with the second highest number of accounts. In this case, the result should be 'FD'. In case there is a tie for the second highest count, I need all those types in the result.
I'm not getting any idea of how to do it. I've found numerous posts for finding second highest values, say salary, in a table. But not for second highest COUNT.
This can be done using cte's. Get the counts for each type as the first step. Then use dense_rank (to get multiple rows with same counts in case of ties) to get the rank of rows by type based on counts. Finally, select the second ranked row.
with counts as (
select type, count(*) cnt
from yourtable
group by type)
, ranks as (
select type, dense_rank() over(order by cnt desc) rnk
from counts)
select type
from ranks
where rnk = 2;
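A quick way to check the tie behaviour of the dense_rank approach is to run it on a small invented table. Assumptions: SQLite through Python's sqlite3 as a stand-in database, a hypothetical acc_no column, and made-up account rows in which 'FD' and 'RD' tie for second place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (acc_no INTEGER, type TEXT)")
# Hypothetical accounts: SB x3, FD x2, RD x2, CA x1
types = ["SB", "SB", "SB", "FD", "FD", "RD", "RD", "CA"]
conn.executemany("INSERT INTO yourtable VALUES (?, ?)", list(enumerate(types, start=1)))

query = """
WITH counts AS (
    SELECT type, count(*) AS cnt
    FROM yourtable
    GROUP BY type
), ranks AS (
    SELECT type, dense_rank() OVER (ORDER BY cnt DESC) AS rnk
    FROM counts
)
SELECT type
FROM ranks
WHERE rnk = 2
ORDER BY type
"""
second = [row[0] for row in conn.execute(query)]
print(second)
```

Because dense_rank() leaves no gaps after ties, rnk = 2 always means "second highest distinct count", no matter how many types share the top count.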
One option is to use row_number() (or dense_rank(), depending on what "second" means when there are ties):
select a.*
from (select a.type, count(*) as cnt,
row_number() over (order by count(*) desc) as seqnum
from accounta a
group by a.type
) a
where seqnum = 2;
In Oracle 12c+, you can use offset/fetch:
select a.type, count(*) as cnt
from accounta a
group by a.type
order by count(*) desc
offset 1 row
fetch first 1 row only

Excluding only one MIN value on Oracle SQL

I am trying to select all but the lowest value in a column (GameScore), but when there are two of this lowest value, my code excludes both (I know why it does this, I just don't know exactly how to correct it and include one of the two lowest values).
The code looks something like this:
SELECT Id, SUM(Score) / COUNT(Score) AS Score
FROM
(SELECT Id, Score
FROM GameScore
WHERE Game_No = 1
AND Score NOT IN
(SELECT MIN(Score)
FROM GameScore
WHERE Game_No = 1
GROUP BY Id))
GROUP BY Id
So if I am drawing from 5 values, but one of the rows only pulls 3 scores because the bottom two are the same, how do I include the 4th? Thanks.
To do this you have to distinguish the rows somehow; your current issue is that the two lowest scores are the same, so any (in)equality comparison on one value treats the other identically.
You could use something like the analytic query ROW_NUMBER() to uniquely identify rows:
select id, sum(score) / count(score) as score
from ( select id, score, row_number() over (order by score) as score_rank
from gamescore
where game_no = 1
)
where score_rank <> 1
group by id
ROW_NUMBER():
assigns a unique number to each row to which it is applied (either each row in the partition or each row returned by the query), in the ordered sequence of rows specified in the order_by_clause, beginning with 1.
Because the ORDER BY clause sorts on SCORE in ascending order, exactly one of the lowest scores is removed. Which of the tied rows is dropped is arbitrary unless you add tie-breaker columns to the ORDER BY.
You can do this a few ways, including what @Ben shows. Coming from a mostly SQL Server background, I was curious whether just ROWNUM could be used and found this piece on ROWNUM vs ROW_NUMBER interesting. I'm not sure if it is dated.
All in a SQLFiddle.
Note: I'm using subquery factoring/CTEs as I think they read more clearly than in-line subqueries.
Using ROWNUM:
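WITH OrderedScore AS (
-- ROWNUM is assigned before ORDER BY is applied, so sort in an inline view first
SELECT id, game_no, score
,rownum as score_rank
FROM (SELECT id, game_no, score
FROM GameScore
WHERE game_no = 1
ORDER BY Score ASC)
)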
SELECT id
,sum(score)/count(score)
FROM OrderedScore
WHERE score_rank > 1
GROUP BY id;
Using ROW_NUMBER() OVER(ORDER BY...) as Ben does:
WITH OrderedScore AS (
SELECT id, game_no, score
,ROW_NUMBER() OVER(ORDER BY score ASC) as score_rank
FROM GameScore
WHERE game_no = 1
ORDER BY Score ASC
)
SELECT id
,sum(score)/count(score)
FROM OrderedScore
WHERE score_rank > 1
GROUP BY id;
Using ROW_NUMBER() OVER(PARTITION BY...ORDER BY...), which I think gives more flexibility if you want to remove the low score by game_no or id at some point:
WITH OrderedScore AS (
SELECT id, game_no, score
,ROW_NUMBER() OVER(PARTITION BY id ORDER BY score ASC) as score_rank
FROM GameScore
WHERE game_no = 1
ORDER BY Score ASC
)
SELECT id
,sum(score)/count(score)
FROM OrderedScore
WHERE score_rank > 1
GROUP BY id;
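To compare the original NOT IN filter with the partitioned ROW_NUMBER() version, here is a small sketch on invented scores. Assumptions: it runs on SQLite through Python's sqlite3 instead of Oracle, the rows are hypothetical, and avg(score) replaces sum(score)/count(score) because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GameScore (id INTEGER, game_no INTEGER, score INTEGER)")
# Hypothetical scores: id 1 has two tied lowest scores (5 and 5)
scores = [(1, 5), (1, 5), (1, 8), (1, 9), (1, 10), (2, 2), (2, 4), (2, 6)]
conn.executemany("INSERT INTO GameScore VALUES (?, 1, ?)", scores)

# Original approach: NOT IN removes every row equal to a minimum
not_in_query = """
SELECT id, avg(score)
FROM GameScore
WHERE game_no = 1
  AND score NOT IN (SELECT MIN(score) FROM GameScore WHERE game_no = 1 GROUP BY id)
GROUP BY id
ORDER BY id
"""

# ROW_NUMBER approach: only the first row per id (one lowest score) is dropped
row_number_query = """
WITH OrderedScore AS (
    SELECT id, score,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY score ASC) AS score_rank
    FROM GameScore
    WHERE game_no = 1
)
SELECT id, avg(score)
FROM OrderedScore
WHERE score_rank > 1
GROUP BY id
ORDER BY id
"""
drops_all_minimums = conn.execute(not_in_query).fetchall()
drops_one_minimum = conn.execute(row_number_query).fetchall()
print(drops_all_minimums)
print(drops_one_minimum)
```

For id 1 the NOT IN version averages only three scores, while the ROW_NUMBER() version keeps one of the tied 5s and averages four.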