Exclude group of records if a number ever goes up - SQL

I have a road inspection table:
INSPECTION_ID  ROAD_ID  INSP_DATE  CONDITION_RATING
-------------  -------  ---------  ----------------
       506411     3040  01-JAN-81  15
       508738     3040  14-APR-85  15
       512461     3040  22-MAY-88  14
       515077     3040  17-MAY-91  14     -- all ok
       505967     3180  01-MAY-81  11
       507655     3180  13-APR-85  9
       512374     3180  11-MAY-88  17     <-- goes up; NOT ok
       515626     3180  25-APR-91  16.5
       502798     3260  01-MAY-83  14
       508747     3260  13-APR-85  13
       511373     3260  11-MAY-88  12
       514734     3260  25-APR-91  12     -- all ok
I want to write a query that will exclude the entire road -- if the road's condition ever goes up over time. For example, exclude road 3180, since the condition goes from 9 to 17 (an anomaly).
Question:
How can I do that using Oracle SQL?
Sample data: db<>fiddle

Here's one option:
find "next" condition_rating value (within the same road_id - that's the partition by clause, sorted by insp_date)
return road_id whose difference between the "next" and "current" condition_rating is less than zero
with temp as
  (select road_id,
          condition_rating,
          nvl(lead(condition_rating) over (partition by road_id order by insp_date),
              condition_rating) next_cr
   from test
  )
select distinct road_id
from temp
where condition_rating - next_cr < 0;

   ROAD_ID
----------
      3180
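A hedged follow-up sketch (mine, not part of the original answer): if you want the query to return the roads to keep rather than the ones to exclude, the same temp CTE can be aggregated per road_id so that only roads whose rating never increases survive.

with temp as
  (select road_id,
          condition_rating,
          nvl(lead(condition_rating) over (partition by road_id order by insp_date),
              condition_rating) next_cr
   from test
  )
select road_id
from temp
group by road_id
having min(condition_rating - next_cr) >= 0;   -- no inspection where the rating went up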

Based on the OP's own answer, which makes the expected outcome clearer.
In my perpetual urge to avoid self-joins, I'd go for nested window functions:
SELECT road_id, condition_rating, insp_date
FROM ( SELECT prev.*
            , COUNT(CASE WHEN condition_rating < next_cr THEN 1 END) OVER(PARTITION BY road_id) bad
         FROM (select t.*
                    , lead(condition_rating) over (partition by road_id order by insp_date) next_cr
                 from t
              ) prev
     ) tagged
WHERE bad = 0
ORDER BY road_id, insp_date
NOTE
lead() returns null for the last row of each partition; the case expression that marks bad rows (condition_rating < next_cr) takes care of that, because a comparison against a null next_cr is never true, so the last row counts as "not bad".
The case expression just mimics the FILTER clause: https://modern-sql.com/feature/filter
MATCH_RECOGNIZE might be another option for this problem, but due to the lack of '^' and '$' anchors I'm worried that backtracking might cause more problems than it is worth (a rough sketch follows below the notes).
Nested window functions are typically not a big performance hit if they use compatible OVER clauses, as in this query.
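For the record, here is a rough MATCH_RECOGNIZE sketch of the "rating ever goes up" detection mentioned in the notes. This is my own untested outline against the test table, not part of the original answer; with a single-row pattern there is no backtracking to worry about, and the result would still need an anti-join (NOT IN / NOT EXISTS) to actually exclude the flagged roads:

select distinct road_id
from test
match_recognize (
  partition by road_id
  order by insp_date
  one row per match
  pattern ( goes_up )
  define goes_up as condition_rating > prev(condition_rating)  -- prev() is null on the first row, so it never matches there
);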

Here's an answer that's similar to @Littlefoot's answer:
with insp as (
  select
    road_id,
    condition_rating,
    insp_date,
    case when condition_rating > lag(condition_rating, 1) over (partition by road_id order by insp_date) then 'Y' end as condition_goes_up
  from
    test_data
)
select
  insp.*
from
  insp
  left join
  (
    select distinct
      road_id,
      condition_goes_up
    from
      insp
    where
      condition_goes_up = 'Y'
  ) insp_flag
    on insp.road_id = insp_flag.road_id
where
  insp_flag.condition_goes_up is null
--Note: I removed the ORDER BY, because I think the window function already orders the rows the way I want.
db<>fiddle
Edit:
Here's a version that's similar to what @Markus Winand did:
with insp as (
  select
    road_id,
    condition_rating,
    insp_date,
    case when condition_rating > lag(condition_rating, 1) over (partition by road_id order by insp_date) then 'Y' end as condition_goes_up
  from
    test_data
)
select
  insp_tagged.*
from
  (
    select
      insp.*,
      count(condition_goes_up) over (partition by road_id) as condition_goes_up_count
    from
      insp
  ) insp_tagged
where
  condition_goes_up_count = 0
I ended up going with that option.
db<>fiddle

Related

count most repeated value per group in hive?

I am using Hive 0.14.0 on a Hortonworks Data Platform, on a big file similar to this input data:
tpep_pickup_datetime   pulocationid
2022-01-28 23:32:52.0  100
2022-02-28 23:02:40.0  202
2022-02-28 17:22:45.0  102
2022-02-28 23:19:37.0  102
2022-03-29 17:32:02.0  102
2022-01-28 23:32:40.0  101
2022-02-28 17:28:09.0  201
2022-03-28 23:59:54.0  100
2022-02-28 21:02:40.0  100
I want to find out what was the most common hour in each locationid, this being the result:
locationid  hour
100         23
101         17
102         17
201         17
202         23
I was thinking of using a partition command like this:
select * from (
select hour(tpep_pickup_datetime), pulocationid
(max (hour(tpep_pickup_datetime))) over (partition by pulocationid) as max_hour,
row_number() over (partition by pulocationid) as row_no
from yellowtaxi22
) res
where res.row_no = 1;
but it shows me this error:
SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Invalid function pulocationid
is there any other way of doing this?
with raw_groups as -- subquery syntax
(
  select
    struct(
      count(time_stamp), -- must be first for max to use it to sort on
      location.pulocationid,
      hour(time_stamp) as hour
    ) as mylocation -- create a struct to make max do the work for us
  from
    location
  group by
    location.pulocationid,
    hour(time_stamp)
),
grouped_data as -- another subquery syntax based on `with`
(
  select
    max(mylocation) as location -- will pick max based on count(time_stamp)
  from
    raw_groups
  group by
    mylocation.pulocationid
)
select -- format data into your requested format
  location.pulocationid,
  location.hour
from
  grouped_data
I do not remember whether Hive 0.14 can use the with clause, but you could easily rewrite the query not to use it (by substituting the selects in place of the table names); I just don't find it as readable:
select -- format data into your requested format
  location.pulocationid,
  location.hour
from
(
  select
    max(mylocation) as location -- will pick max based on count(time_stamp)
  from
  (
    select
      struct(
        count(time_stamp), -- must be first for max to use it to sort on
        location.pulocationid,
        hour(time_stamp) as hour
      ) as mylocation -- create a struct to make max do the work for us
    from
      location
    group by
      location.pulocationid,
      hour(time_stamp)
  ) raw_groups
  group by
    mylocation.pulocationid
) grouped_data
You were halfway there!
The idea was in the right direction; however, the syntax is a little bit off:
First find the count per each hour:
select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) cnt
from yellowtaxi22
group by pulocationid, hour(tpep_pickup_datetime)
Then add the row_number, but you need to order it by the total count in a descending way:
select pulocationid, hour, cnt, row_number() over (partition by pulocationid order by cnt desc) as row_no from ...
Last but not least, take only the rows with the highest count (this can also be done with the max function rather than row_number, by the way; see the sketch after this answer).
Or in total:
select pulocationid, hour from (
  select pulocationid, hour, cnt,
         row_number() over (partition by pulocationid order by cnt desc) as row_no
  from (
    select pulocationid, hour(tpep_pickup_datetime) as hour, count(*) cnt
    from yellowtaxi22
    group by pulocationid, hour(tpep_pickup_datetime)
  ) counted
) ranked
where row_no = 1
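As a hedged illustration of that max-based remark (my own sketch, assuming the same yellowtaxi22 table; the aliases hr, cnt, max_cnt, counted and ranked are mine, and ties would return more than one row per location):

-- keep each location's hour(s) whose count equals the per-location maximum
select pulocationid, hr as pickup_hour
from (
  select pulocationid, hr, cnt,
         max(cnt) over (partition by pulocationid) as max_cnt
  from (
    select pulocationid,
           hour(tpep_pickup_datetime) as hr,
           count(*) as cnt
    from yellowtaxi22
    group by pulocationid, hour(tpep_pickup_datetime)
  ) counted
) ranked
where cnt = max_cnt;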

SQL - Returning unique row based on criteria and a priority

I have a data table that looks in practice like this:
Team  Shirt Number  Name
1     1             Seaman
1     13            Lucas
2     1             Bosnic
2     14            Schmidt
2     23            Woods
3     13            Tubilandu
3     14            Lev
3     15            Martin
I want to remove duplicates of team by the following logic: if there is a "1" shirt number, use that. If not, look for a 13. If not, look for a 14, then take any.
I realise it is probably quite basic but I don't seem to be making any progress with case statements. I know it's something with sub-queries and case statements but I'm struggling and any help gratefully received!
Using SSMS.
Since you didn't specify any DBMS, let me assume row_number() would work for that:
DELETE
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY Team
                                ORDER BY (CASE WHEN Shirt_Number = 1
                                               THEN 1
                                               WHEN Shirt_Number = 13
                                               THEN 2
                                               WHEN Shirt_Number = 14
                                               THEN 3
                                               ELSE 4
                                          END)
                               ) AS Seq
      FROM table t
     ) t
WHERE Seq > 1;
This assumes the shirt numbers have gaps; otherwise, ordering by Shirt_Number alone would be enough.
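Since the OP mentions SSMS: SQL Server does not allow a DELETE against a derived table like the one above, but the same idea works against an updatable CTE. A hedged sketch, assuming the table is called TeamShirts (a name of my choosing):

WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Team
                              ORDER BY CASE WHEN Shirt_Number = 1  THEN 1
                                            WHEN Shirt_Number = 13 THEN 2
                                            WHEN Shirt_Number = 14 THEN 3
                                            ELSE 4
                                       END) AS Seq
    FROM TeamShirts
)
DELETE FROM ranked   -- removes everything except each team's highest-priority shirt
WHERE Seq > 1;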
I think you are looking for a partition by clause. The solution below worked in SQL Server.
create table #eray
(team int, shirtnumber int, name varchar(200))
insert into #eray values
(1, 1, 'Seaman'),
(1, 13, 'Lucas'),
(2, 1, 'Bosnic'),
(2, 14, 'Schmidt')
;with cte as (
Select Team, ShirtNumber, Name,
ROW_NUMBER() OVER (PARTITION BY Team ORDER BY ShirtNumber ASC) AS rn
From #eray
where ShirtNumber in (1,13,14)
)
select * from cte where rn=1
If you have a table of teams, you can use cross apply:
select ts.*
from teams t cross apply
(select top (1) ts.*
from teamshirts ts
where ts.team = t.team
order by (case shirt_number when 1 then 1 when 13 then 2 when 14 then 3 else 4 end)
) ts;
If you have no numbers between 2 and 12, you can simplify this to:
select ts.*
from teams t cross apply
(select top (1) ts.*
from teamshirts ts
where ts.team = t.team
order by shirt_number
) ts;

Find min and max for subsets of consecutive rows - gaps and islands

Trying to build a query.
The input is ordered by row number in column 'rn', starting at 1 for each unique value in 'name', and defines a given sequence of entries in 'act'. Column 'act' holds two values with multiple occurrences, >sleep< and >wake<. The goal is to find, for each consecutive set of rows with the same value, the minimum startt and the maximum endd.
This shall be the input:
name        act         rn      startt  endd
----------  ----------  ------  ------  ------
jimmy       sleep       1       1       3
jimmy       wake        2       4       7
jimmy       wake        3       8       10
jimmy       sleep       4       11      13
karen       wake        1       1       4
karen       sleep       2       5       7
karen       wake        3       8       9
karen       wake        4       10      12
karen       wake        5       13      14
karen       sleep       6       15      17
karen       sleep       7       18      20
the desired output:
name        act         startt  endd
----------  ----------  ------  ------
jimmy       sleep       1       3
jimmy       wake        4       10
jimmy       sleep       11      13
karen       wake        1       4
karen       sleep       5       7
karen       wake        8       14
karen       sleep       15      20
The source of the input does not provide further columns. The number of members in each subset can be much higher than in this example.
I tried different ways of aggregating, but none worked. I believe using LEAD and LAG and further trickery might get me there, but that appears to be awfully inelegant. I have the notion it is key to differentiate each subset, i.e. create an identifier unique to all its members. With this at hand, an aggregate with min and max is simple. Maybe I'm wrong. Maybe it's impossible. Maybe a self join. Maybe a recursive CTE. I don't know.
So: does anybody know how to get this? Help is much appreciated.
UPDATE:
Thank you to Gordon Linoff, shawnt00 and the other contributors who commented. With your advice I feel major gaps in my logic toolbox closing.
For the interested:
declare #t table (
name nvarchar(10)
,act nvarchar (10)
,startt smallint
,endd smallint
)
insert into #t (
name
,act
,startt
,endd
)
values
('jimmy','sleep', 1,3)
,('jimmy','wake', 4,7)
,('jimmy','wake', 8,10)
,('jimmy','sleep', 11,13)
,('karen','wake', 1,4)
,('karen','sleep', 5,7)
,('karen','wake', 8,9)
,('karen','wake', 10,12)
,('karen','wake', 13,14)
,('karen','sleep', 15,17)
,('karen','sleep', 18,20)
; --- all rows, no aggregating
with
cte as (
select
name
,act
,row_number() over (partition by name order by name,startt) rn
,row_number() over (partition by name, act order by name,startt) act_n
,startt
,endd
from
#t )
select
name
,act
,startt
,endd
,rn
,act_n
,rn - act_n diff
from
cte
order by
name
,rn
;--- aggregating for the desired output
with
cte as (
select
name
,act
,row_number() over (partition by name order by name,startt) rn
,row_number() over (partition by name, act order by name,startt) act_n
,startt
,endd
from
#t )
select
name
,act
,min(startt) startt
,max(endd) endd
,min(rn) min_rn
,max(rn) max_rn
from
cte
group by
name
,act
,rn - act_n
order by
name
,min(rn)
You want to find consecutive groups of similar rows and then aggregate. I like the difference-of-row-numbers approach:
select name, act, min(startt) as startt, max(endd) as endd
from (select i.*,
row_number() over (partition by name, act order by rn) as seqnum_na,
row_number() over (partition by name order by rn) as seqnum_n
from input i
) i
group by (seqnum_n - seqnum_na), name, act;
You can see how this works by looking at what the subquery does.
This assumes that you don't have any gaps in the rn numbering, so it doesn't calculate it again.
with T2 as (
select *, row_number() over (partition by name, act order by rn) as grp_rn
from T
)
select name, act, min(startt) as startt, max(endd) as endd
from T2
group by name, act, rn - grp_rn
order by name, startt;
http://rextester.com/CCQJJ93990
This is a typical gaps-and-islands query. The key here is that within a cluster of rows the two different numberings increase in step, which means their difference is constant for the cluster. The difference then grows as you work your way down the list.
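To make the mechanics concrete, here are the intermediate values the T2 subquery produces for karen with the sample data above (worked out by hand):

act     rn   grp_rn   rn - grp_rn
wake    1    1        0
sleep   2    1        1
wake    3    2        1
wake    4    3        1
wake    5    4        1
sleep   6    2        4
sleep   7    3        4

Grouping by name, act and rn - grp_rn then collapses rows 3-5 into the single wake island with startt 8 and endd 14, exactly as in the desired output.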

In T-SQL How Can I Select Up To The 5 Most Recent Rows, Grouped By An Identifier, If They Contain A Specific Value?

Long title.
I am using T-SQL and attempting to find all accounts whose most recent transactions are ACHFAIL, and determine how many in a row they have had, up to 5.
I already wrote a huge, insanely convoluted query to group and count all accounts that have had x ACHFAILs in a row. Now the requirement is the simpler "only count the most recent transactions".
Below is what I have so far, but I cannot wrap my head around the next step to take. I was trying to simplify my task by only counting up to 5, but if I could provide an accurate count of all the ACHFAIL attempts in a row, that would be more ideal.
WITH grouped
AS (
SELECT
ROW_NUMBER() OVER (PARTITION BY TRANSACTIONS.deal_id
ORDER BY TRANSACTIONS.deal_id, tran_date DESC) AS row_num
,TRANSACTIONS.tran_code
,TRANSACTIONS.tran_date
,TRANSACTIONS.deal_id
FROM TRANSACTIONS
)
SELECT TOP 1000 * FROM grouped
which returns rows such as:
row_num  tran_code  tran_date                deal_id
1        ACHFAIL    2014-08-05 09:20:38.000  {01xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
2        ACHCLEAR   2014-08-04 16:27:17.473  {01xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
1        ACHCLEAR   2014-09-09 15:14:48.337  {02xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
2        ACHCLEAR   2014-09-08 14:23:00.737  {02xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
1        ACHFAIL    2014-07-18 14:35:38.037  {03xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
2        ACHFAIL    2014-07-18 13:58:52.000  {03xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
3        ACHCLEAR   2014-07-17 14:48:58.617  {03xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
4        ACHFAIL    2014-07-16 15:04:28.023  {03xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}
01xxxxxx has 1 ACHFAIL
02xxxxxx has 0 ACHFAIL
03xxxxxx has 2 ACHFAIL
You are halfway there. For this sort of "consecutive rows" problem, a recursive CTE is one way to go (that's TEMP2 below):
;WITH
TEMP1 AS
(
SELECT tran_code,
deal_id,
ROW_NUMBER() OVER (PARTITION BY deal_id ORDER BY tran_date DESC) AS tran_rank
FROM TRANSACTIONS
),
TEMP2 AS
(
SELECT tran_code,
deal_id,
tran_rank
FROM TEMP1
WHERE tran_rank = 1 -- last transaction for a deal
AND tran_code = 'ACHFAIL' -- failed transactions only
UNION ALL
SELECT curr.tran_code,
curr.deal_id,
curr.tran_rank
FROM TEMP1 curr
INNER JOIN TEMP2 prev ON curr.deal_id = prev.deal_id -- transaction must be for the same deal
AND curr.tran_rank = prev.tran_rank + 1 -- must be consecutive
WHERE curr.tran_code = 'ACHFAIL' -- must have failed
AND curr.tran_rank <= 5 -- up to 5 only
)
SELECT t.deal_id,
ISNULL(MAX(tran_rank),0) AS FailCount
FROM TRANSACTIONS t
LEFT JOIN TEMP2 t2 ON t.deal_id = t2.deal_id
GROUP BY t.deal_id
SQL Fiddle
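For comparison, here is a hedged, non-recursive sketch of the same leading-streak count (my own outline, reusing the TEMP1 ranking above; the FailStreak name is mine):

;WITH TEMP1 AS
(
    SELECT tran_code,
           deal_id,
           ROW_NUMBER() OVER (PARTITION BY deal_id ORDER BY tran_date DESC) AS tran_rank
    FROM TRANSACTIONS
)
SELECT deal_id,
       -- position of the first non-ACHFAIL minus one = length of the leading fail streak;
       -- if every transaction failed, fall back to the total number of transactions
       ISNULL(MIN(CASE WHEN tran_code <> 'ACHFAIL' THEN tran_rank END) - 1,
              COUNT(*)) AS FailStreak
FROM TEMP1
GROUP BY deal_id;

Capping the streak at 5 would just be a CASE WHEN ... > 5 THEN 5 wrapped around that expression.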
If I understand correctly, you want the number of fails in the five most recent transactions for each deal. That would be something like:
WITH grouped AS (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY t.deal_id ORDER BY tran_date DESC
) AS seqnum
FROM TRANSACTIONS t
)
SELECT deal_id, sum(case when tran_code = 'ACHFAIL' then 1 else 0 end) as NuMFails
FROM grouped
WHERE seqnum <= 5
GROUP BY deal_id;
The CTE enumerates the rows. The where clause takes the 5 most recent rows for each deal. The group by then aggregates by deal_id.
Note that you do not need to include the partition by column(s) in the order by when you use over.

Different value counts on same column

I am new to Oracle. I have an Oracle table with three columns: serialno, item_category and item_status. In the third column the rows have values of serviceable, under_repair or condemned.
I want to run the query using count to show how many are serviceable, how many are under repair, how many are condemned against each item category.
I would like to run something like:
select item_category
, count(......) "total"
, count (.....) "serviceable"
, count(.....)"under_repair"
, count(....) "condemned"
from my_table
group by item_category ......
I am unable to run the inner query inside the count.
Here's what I'd like the result set to look like:
item_category  total  serviceable  under repair  condemned
=============  =====  ===========  ============  =========
chair          18     10           5             3
table          12     6            3             3
You can use either a CASE expression or DECODE inside the COUNT function. The DECODE version is shown here; a CASE sketch follows the output.
SELECT item_category,
COUNT (*) total,
COUNT (DECODE (item_status, 'serviceable', 1)) AS serviceable,
COUNT (DECODE (item_status, 'under_repair', 1)) AS under_repair,
COUNT (DECODE (item_status, 'condemned', 1)) AS condemned
FROM mytable
GROUP BY item_category;
Output:
ITEM_CATEGORY  TOTAL  SERVICEABLE  UNDER_REPAIR  CONDEMNED
----------------------------------------------------------
chair          5      1            2             2
table          5      3            1             1
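For completeness, a hedged sketch of the CASE variant mentioned above (same mytable, same grouping; COUNT ignores the NULLs that the CASE produces for non-matching rows):

SELECT item_category,
       COUNT (*) AS total,
       COUNT (CASE WHEN item_status = 'serviceable'  THEN 1 END) AS serviceable,
       COUNT (CASE WHEN item_status = 'under_repair' THEN 1 END) AS under_repair,
       COUNT (CASE WHEN item_status = 'condemned'    THEN 1 END) AS condemned
FROM mytable
GROUP BY item_category;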
This is a very basic "group by" query. If you search for that you will find plenty of documentation on how it is used.
For your specific case, you want:
select item_category, item_status, count(*)
from <your table>
group by item_category, item_status;
You'll get something like this:
item_category  item_status   count(*)
=====================================
Chair          under_repair  7
Chair          condemned     16
Table          under_repair  3
Change the column ordering as needed for your purpose
I have a tendency to write this stuff up so that when I forget how to do it, I have an easy-to-find example.
The PIVOT clause was new in 11g. Since that was 5+ years ago, I'm hoping you are using it.
Sample Data
create table t
(
serialno number(2,0),
item_category varchar2(30),
item_status varchar2(20)
);
insert into t ( serialno, item_category, item_status )
select
rownum serialno,
( case
when rownum <= 12 then 'table'
else 'chair'
end ) item_category,
( case
--table status
when rownum <= 12
and rownum <= 6
then 'servicable'
when rownum <= 12
and rownum between 7 and 9
then 'under_repair'
when rownum <= 12
and rownum > 9
then 'condemned'
--chair status
when rownum > 12
and rownum < 13 + 10
then 'servicable'
when rownum > 12
and rownum between 23 and 27
then 'under_repair'
when rownum > 12
and rownum > 27
then 'condemned'
end ) item_status
from
dual connect by level <= 30;
commit;
and the PIVOT query:
select *
from
(
select
item_status stat,
item_category,
item_status
from t
)
pivot
(
count( item_status )
for stat in ( 'servicable' as "servicable", 'under_repair' as "under_repair", 'condemned' as "condemned" )
);
ITEM_CATEGORY servicable under_repair condemned
------------- ---------- ------------ ----------
chair 10 5 3
table 6 3 3
I still prefer @Ramblin' Man's way of doing it (except using CASE in place of DECODE), though.
Edit
Just realized I left out the TOTAL column. I'm not sure there's a way to get that column using the PIVOT clause itself; perhaps someone else knows how. That may also be the reason I don't use it that often. One possible workaround is sketched below.
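A hedged workaround for the missing TOTAL column (my own sketch, not from the original answer): wrap the PIVOT in an outer query and add the pivoted columns back together. The quoted column names match the aliases defined in the PIVOT's IN list above, including the 'servicable' spelling used in the sample data.

select item_category,
       "servicable" + "under_repair" + "condemned" as total,
       "servicable", "under_repair", "condemned"
from
(
  select
    item_status stat,
    item_category,
    item_status
  from t
)
pivot
(
  count( item_status )
  for stat in ( 'servicable' as "servicable", 'under_repair' as "under_repair", 'condemned' as "condemned" )
);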