SQL Query to eliminate similar entries

I am working on a problem in SQL Server 2008.
I have a table with a primary key and six data columns:
PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFive SmallINT
dSix SmallINT
The table contains around a million records. It's probably worth noting that the value in column n+1 > the value in column n, i.e. 97, 98, 99, 120, 135. I am trying to eliminate all rows which have 5 digits in common (ignoring the PK), e.g.:
76, 89, 99, 102, 155, 122
11, 89, 99, 102, 155, 122
89, 99, 102, 155, 122, 130
In this case the algorithm should start at the first row and delete the second and third rows because they contain 5 matching digits. The first row persists.
I have tried to brute-force the solution, but finding all the duplicates for only the first record takes upwards of 25 seconds, meaning processing the whole table would take... way too long (this should be a repeatable process).
I am fairly new to SQL but this is what I have come up with (I have come up with a few solutions but none were adequate... this is the latest attempt):
(I won't include all the code but I will explain the method, I can paste more if it helps)
Save the digits of record n into variables. SELECT all records which have one digit in common with record n FROM largeTable.
Insert all selected digits into #oneMatch and include [matchingOne] with the digit that matched.
Select all records which have one digit in common with record n FROM the temp table WHERE 'digit in common' != [matchingOne]. INSERT all selected digits into #twoMatch and include [matchingOne] AND [matchingTwo]...
Repeat until inserting into #fiveMatch. Delete #fiveMatch from largeTable and move to record n+1
I am having a problem implementing this solution. How can I assign the matching variable depending on the WHERE clause?
-- SELECT all records with ONE matching field:
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE ( [dOne] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dOne?
OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dTwo?
OR [dThree] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dThree?
...
OR [dSix] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) -- mOne = dSix?
)
I am able to 'fake' the above using six queries but that is too inefficient...
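Perhaps a CASE expression could compute mOne in one pass instead; a minimal sketch of what I mean (same @-variables as above, untested):
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix,
       CASE WHEN [dOne]   IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) THEN [dOne]
            WHEN [dTwo]   IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) THEN [dTwo]
            WHEN [dThree] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) THEN [dThree]
            WHEN [dFour]  IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) THEN [dFour]
            WHEN [dFive]  IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) THEN [dFive]
            WHEN [dSix]   IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) THEN [dSix]
       END AS mOne -- first matching column wins; only one value is recorded even if several match
FROM dbo.BaseCombinationsExtended
WHERE [dOne] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)
   OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)
   OR [dThree] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)
   OR [dFour] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)
   OR [dFive] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)
   OR [dSix] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix)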
Sorry for the long description. Any help would be greatly appreciated (new solution or implementation of my attempt above) as this problem has been nagging at me for a while...

Unless I miss something this should produce the correct result.
declare @T table
(
PK INT identity primary key,
dOne SmallINT,
dTwo SmallINT,
dThree SmallINT,
dFour SmallINT,
dFive SmallINT,
dSix SmallINT
)
insert into @T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)
;with q1(PK, d1, d2, d3, d4, d5) as
(
-- all six ways to pick 5 of the 6 columns, keeping column order
select PK, dTwo, dThree, dFour, dFive, dSix
from @T
union all
select PK, dOne, dThree, dFour, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dFour, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFour, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFour, dFive
from @T
),
q2 as
(
-- rn = 1 marks the first PK that owns each distinct 5-column combination
select PK,
row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
from q1
),
q3 as
(
-- keep a row only if it is first for all six of its combinations
select PK
from q2
where rn = 1
group by PK
having count(*) = 6
)
select T.*
from @T as T
inner join q3 as Q
on T.PK = Q.PK
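If that returns the rows to keep, the deletion itself can be driven from it; a sketch (the @keep staging table is my own invention, untested):
declare @keep table (PK int primary key);

-- repeat the q1/q2/q3 chain above, but end it with:
-- insert into @keep (PK) select T.PK from @T as T inner join q3 as Q on T.PK = Q.PK

delete from @T
where PK not in (select PK from @keep);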

I can't make any promises on performance, but you can try this. The first thing that I do is put the data into a more normalized structure.
CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO
INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO
SELECT DISTINCT
T2.my_id
FROM
(SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
(
SELECT
COUNT(*)
FROM
dbo.Test_Sets_Normalized T3
INNER JOIN dbo.Test_Sets_Normalized T4 ON
T4.my_id = T2.my_id AND
T4.c = T3.c
WHERE
T3.my_id = T1.my_id) >= 5
That should get you the IDs that you need. Once you've confirmed that it does what you want, you can JOIN back to the original table and delete by IDs.
There's probably an improvement possible somewhere that doesn't require the DISTINCT. I'll give it a little more thought.
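For example, a sketch of that final delete step (the #dupes staging table is my own invention, untested):
CREATE TABLE #dupes (my_id INT NOT NULL PRIMARY KEY);

-- INSERT INTO #dupes (my_id) followed by the SELECT DISTINCT T2.my_id query above

DELETE ts
FROM dbo.Test_Sets ts
INNER JOIN #dupes d ON d.my_id = ts.my_id;

-- keep the normalized copy in step if the process is re-run
DELETE tsn
FROM dbo.Test_Sets_Normalized tsn
INNER JOIN #dupes d ON d.my_id = tsn.my_id;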

Edit - the following approach might be better than N squared performance, depending on the optimizer. If all 5 columns are indexed it should only need 6 index seeks per row, which is still N * logN. It does seem a little dopey though.
You could code-generate the WHERE condition based on all the permutations of 5 matches, so the records to delete would be given by:
SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
SELECT PK From SillyTable Duplicate
WHERE ( (
(Duplicate.dOne=ToDelete.dOne)
AND (Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
) OR (
(Duplicate.dOne=ToDelete.dTwo)
AND (Duplicate.dTwo=ToDelete.dThree)
AND (Duplicate.dThree=ToDelete.dFour)
AND (Duplicate.dFour=ToDelete.dFive)
AND (Duplicate.dFive=ToDelete.dSix)
) OR (
(Duplicate.dTwo=ToDelete.dOne)
AND (Duplicate.dThree=ToDelete.dTwo)
AND (Duplicate.dFour=ToDelete.dThree)
AND (Duplicate.dFive=ToDelete.dFour)
AND (Duplicate.dSix=ToDelete.dFive)
) OR (
(Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
AND (Duplicate.dSix=ToDelete.dSix)
) ...
This goes on to cover all 36 combinations (there is one non-match on each side of the join, out of 6 possible columns, so 6*6 gives you all the possibilities). I would code-generate this because it's a lot of typing, and what if you want 4 out of 6 matches tomorrow? But you could hand-code it, I guess.
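If it helps, here is a sketch (untested; the cols/skips/branches names are mine) of generating that predicate text in T-SQL itself, which you could then splice into the DELETE:
;WITH cols(i, name) AS (
    SELECT 1, 'dOne' UNION ALL SELECT 2, 'dTwo' UNION ALL SELECT 3, 'dThree'
    UNION ALL SELECT 4, 'dFour' UNION ALL SELECT 5, 'dFive' UNION ALL SELECT 6, 'dSix'
), skips AS (
    -- one row per (column skipped in Duplicate, column skipped in ToDelete): 36 branches
    SELECT d.i AS skip_dup, t.i AS skip_del
    FROM cols d CROSS JOIN cols t
), branches AS (
    -- pair the five remaining columns on each side in order
    SELECT '(' + STUFF((
        SELECT ' AND Duplicate.' + cd.name + '=ToDelete.' + ct.name
        FROM cols cd
        JOIN cols ct
          ON cd.i - CASE WHEN cd.i > s.skip_dup THEN 1 ELSE 0 END
           = ct.i - CASE WHEN ct.i > s.skip_del THEN 1 ELSE 0 END
        WHERE cd.i <> s.skip_dup
          AND ct.i <> s.skip_del
        ORDER BY cd.i
        FOR XML PATH('')), 1, 5, '') + ')' AS branch
    FROM skips s
)
SELECT STUFF((SELECT ' OR ' + branch FROM branches FOR XML PATH('')), 1, 4, '') AS where_condition;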


How to do an as-of-join in SQL (Snowflake)?

I am looking to join two time-ordered tables, such that the events in table1 are matched to the "next" event in table2 (within the same user). I am using SQL / Snowflake for this.
For argument's sake, table1 is "notification_clicked" events and table2 is "purchases".
This is one way to do it:
WITH partial_result AS (
SELECT
table1.userId, notificationId, notificationTimeStamp, transactionId, transactionTimeStamp
FROM table1 CROSS JOIN table2
WHERE table1.userId = table2.userId
AND notificationTimeStamp <= transactionTimeStamp)
SELECT *
FROM partial_result
QUALIFY ROW_NUMBER() OVER(
PARTITION BY userId, notificationId ORDER BY transactionTimeStamp ASC
) = 1
It is not super readable, but is this "the" way to do this?
If you're doing an AsOf join against small tables, you can use a regular Venn diagram type of join. If you're running it against large tables, a regular join will lead to an intermediate cardinality explosion before the filter.
For large tables, this is the highest performance approach I have to date. Rather than treating an AsOf join like a regular Venn diagram join, we can treat it like a special type of union between two tables with a filter that uses the information from that union. The sample SQL does the following:
Unions the A and B tables so that the Entity and Time come from both tables and all other columns come from only one table. Rows from the other table specify NULL for these values (measures 1 and 2 in this case). It also projects a source column for the table. We'll use this later.
In the unioned table, it uses a LAG function on windows partitioned by the Entity and ordered by the Time. For each row with a source indicator from the A table, it lags back to the first Time with source in the B table, ignoring all values in the A table.
with A as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M1" -- Measure (could be many)
from (values
(1, 7, 1, 'M1-1'),
(1, 8, 1, 'M1-2'),
(1, 41, 1, 'M1-3'),
(1, 89, 1, 'M1-4')
)
), B as
(
select
COLUMN1::int as "E", -- Entity
COLUMN2::int as "T", -- Time
COLUMN4::string as "M2" -- Different measure (could be many)
from (values
(1, 6, 1, 'M2-1'),
(1, 12, 1, 'M2-2'),
(1, 20, 1, 'M2-3'),
(1, 35, 1, 'M2-4'),
(1, 57, 1, 'M2-5'),
(1, 85, 1, 'M2-6'),
(1, 92, 1, 'M2-7')
)
), UNIONED as -- Unify schemas and union all
(
select 'A' as SOURCE_TABLE -- Project the source table
,E as AB_E -- AB_ means it's unified
,T as AB_T
,M1 as A_M1 -- A_ means it's from A
,NULL::string as B_M2 -- Make columns from B null for A
from A
union all
select 'B' as SOURCE_TABLE
,E as AB_E
,T as AB_T
,NULL::string as A_M1 -- Make columns from A null for B
,M2 as B_M2
from B
)
select AB_E as ENTITY
,AB_T as A_TIME
,lag(iff(SOURCE_TABLE = 'A', null, AB_T)) -- Lag back to
ignore nulls over -- previous B row
(partition by AB_E order by AB_T) as B_TIME
,A_M1 as M1_FROM_A
,lag(B_M2) -- Lag back to the previous non-null row.
ignore nulls -- The A sourced rows will already be NULL.
over (partition by AB_E order by AB_T) as M2_FROM_B
from UNIONED
qualify SOURCE_TABLE = 'A'
;
This will perform orders of magnitude faster for large tables because the highest intermediate cardinality is guaranteed to be the cardinality of A + B.
To simplify this refactor, I wrote a stored procedure that generates the SQL given the paths to table A and B, the entity column in A and B (right now limited to one, but if you have more it will get the SQL started), the order by (time) column in A and B, and finally the list of columns to "drag through" the AsOf join. It's rather lengthy so I posted it on Github and will work later to document and enhance it:
https://github.com/GregPavlik/AsOfJoin/blob/main/StoredProcedure.sql
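As an aside, Snowflake has since added a native ASOF JOIN; if it is available in your account, I believe the original question's query can be written directly as something like this (sketch, untested):
SELECT t1.userId, t1.notificationId, t1.notificationTimeStamp,
       t2.transactionId, t2.transactionTimeStamp
FROM table1 t1
ASOF JOIN table2 t2
    MATCH_CONDITION (t1.notificationTimeStamp <= t2.transactionTimeStamp)
    ON t1.userId = t2.userId;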

Count length of consecutive duplicate values for each id

I have a table as shown in the screenshot (first two columns) and I need to create a column like the last one. I'm trying to calculate the length of each sequence of consecutive values for each id.
For this, the last column is required. I played around with
row_number() over (partition by id, value)
but did not have much success, since the circled number was (quite predictably) computed as 2 instead of 1.
Please help!
First of all, we need to have a way to define how the rows are ordered. For example, in your sample data there is no way to be sure that the 'first' row (1, 1) will always be displayed before the 'second' row (1, 0).
That's why in my sample data I have added an identity column. In your real case, the details can be ordered by row ID, a date column or something else, but you need to ensure the rows can be sorted via unique criteria.
So, the task is pretty simple:
calculate trigger switch - when value is changed
calculate groups
calculate rows
That's it. I have used a common table expression and left all the columns in so it is easy for you to understand the logic. You are free to break this into separate statements and remove some of the columns.
DECLARE @DataSource TABLE
(
[RowID] INT IDENTITY(1, 1)
,[ID] INT
,[value] INT
);
INSERT INTO @DataSource ([ID], [value])
VALUES (1, 1)
,(1, 0)
,(1, 0)
,(1, 1)
,(1, 1)
,(1, 1)
--
,(2, 0)
,(2, 1)
,(2, 0)
,(2, 0);
WITH DataSourceWithSwitch AS
(
SELECT *
,IIF(LAG([value]) OVER (PARTITION BY [ID] ORDER BY [RowID]) = [value], 0, 1) AS [Switch]
FROM @DataSource
), DataSourceWithGroup AS
(
SELECT *
,SUM([Switch]) OVER (PARTITION BY [ID] ORDER BY [RowID]) AS [Group]
FROM DataSourceWithSwitch
)
SELECT *
,ROW_NUMBER() OVER (PARTITION BY [ID], [Group] ORDER BY [RowID]) AS [GroupRowID]
FROM DataSourceWithGroup
ORDER BY [RowID];
You want results that are dependent on actual data ordering in the data source. In SQL you operate on relations, sometimes on ordered sets of relation rows. Your desired end result is not well-defined in terms of SQL, unless you introduce an additional column in your source table over which your data is ordered (e.g. an auto-increment or some timestamp column).
Note: this answers the original question and doesn't take into account additional timestamp column mentioned in the comment. I'm not updating my answer since there is already an accepted answer.
One way to solve it could be through a recursive CTE:
create table #tmp (i int identity,id int, value int, rn int);
insert into #tmp (id,value) VALUES
(1,1),(1,0),(1,0),(1,1),(1,1),(1,1),
(2,0),(2,1),(2,0),(2,0);
WITH numbered AS (
SELECT i,id,value, 1 seq FROM #tmp WHERE i=1 UNION ALL
SELECT a.i,a.id,a.value, CASE WHEN a.id=b.id AND a.value=b.value THEN b.seq+1 ELSE 1 END
FROM #tmp a INNER JOIN numbered b ON a.i=b.i+1
)
SELECT * FROM numbered -- OPTION (MAXRECURSION 1000)
This will return the following:
i id value seq
1 1 1 1
2 1 0 1
3 1 0 2
4 1 1 1
5 1 1 2
6 1 1 3
7 2 0 1
8 2 1 1
9 2 0 1
10 2 0 2
See my little demo here: https://rextester.com/ZZEIU93657
A prerequisite for the CTE to work is a sequenced table (e.g. a table with an identity column in it) as a source. In my example I introduced the column i for this. As a starting point I need to find the first entry of the source table. In my case this was the entry with i=1.
For a longer source table you might run into a recursion-limit error as the default for MAXRECURSION is 100. In this case you should uncomment the OPTION setting behind my SELECT clause above. You can either set it to a higher value (like shown) or switch it off completely by setting it to 0.
IMHO, this is easier to do with a cursor and loop (a sketch of that idea follows after the self-join attempt below).
Maybe there is a way to do the job with a self-join:
declare @t table (id int, val int)
insert into @t (id, val)
select 1 as id, 1 as val
union all select 1, 0
union all select 1, 0
union all select 1, 1
union all select 1, 1
union all select 1, 1
;with cte1 (id, val, num) as
(
select id, val, row_number() over (ORDER BY (SELECT 1)) as num from @t
)
, cte2 (id, val, num, N) as
(
select id, val, num, 1 from cte1 where num = 1
union all
select t1.id, t1.val, t1.num,
case when t1.id=t2.id and t1.val=t2.val then t2.N + 1 else 1 end
from cte1 t1 inner join cte2 t2 on t1.num = t2.num + 1 where t1.num > 1
)
select * from cte2
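For completeness, a minimal sketch of the cursor-and-loop idea mentioned above (my own table and variable names; an identity column is added because the loop needs a guaranteed order):
declare @src table (i int identity(1,1), id int, val int);
insert into @src (id, val)
values (1,1),(1,0),(1,0),(1,1),(1,1),(1,1);

declare @out table (i int, id int, val int, seq int);
declare @i int, @id int, @val int,
        @prevId int = null, @prevVal int = null, @seq int = 0;

declare c cursor local fast_forward for
    select i, id, val from @src order by i; -- deterministic order via the identity column
open c;
fetch next from c into @i, @id, @val;
while @@fetch_status = 0
begin
    -- restart the counter whenever id or val changes
    set @seq = case when @id = @prevId and @val = @prevVal then @seq + 1 else 1 end;
    insert into @out values (@i, @id, @val, @seq);
    select @prevId = @id, @prevVal = @val;
    fetch next from c into @i, @id, @val;
end
close c;
deallocate c;
select * from @out;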

Multiple joins to get the same lookup column for different values

We have a rather large SQL query which performs rather poorly. One of the problems (from analysing the query plan) is the number of joins we have.
Essentially we have values in our data that we need to look up in another table to get the value to display to the user. The problem is that we have to do a join on the same table 4 times, because there are 4 different columns that all need the same lookup.
Hopefully this diagram might make it clearer
Raw_Event_data
event_id, datetime_id, lookup_1, lookup_2, lookup_3, lookup_4
1, 2013-01-01_12:00, 1, 5, 3, 9
2, 2013-01-01_12:00, 121, 5, 8, 19
3, 2013-01-01_12:00, 11, 2, 3, 32
4, 2013-01-01_12:00, 15, 2, 1, 0
Lookup_table
lookup_id, lookup_desc
1, desc1
2, desc2
3, desc3
...
Our query then looks something like this
Select
raw.event_id,
raw.datetime_id,
lookup1.lookup_desc,
lookup2.lookup_desc,
lookup3.lookup_desc,
lookup4.lookup_desc
FROM
Raw_Event_data raw, Lookup_table lookup1, Lookup_table lookup2, Lookup_table lookup3, Lookup_table lookup4
WHERE raw.event_id = 1 AND
raw.lookup_1 *= lookup1.lookup_id AND
raw.lookup_2 *= lookup2.lookup_id AND
raw.lookup_3 *= lookup3.lookup_id AND
raw.lookup_4 *= lookup4.lookup_id
So I get as an output
1, 2013-01-01_12:00, desc1, desc5, desc3, desc9
As I said the query works, but the joins are killing the performance.
That is a simplified example; in reality there will be 12 joins like the above, and we won't be selecting a specific event but rather a range of events.
The question is, is there a better way of doing those joins.
correlated subqueries might be the way to go:
SELECT r.event_id
, r.datetime_id
, (select lookup1.lookup_desc from lookup_table lookup1 where lookup1.lookup_id = r.lookup_1) as desc_1
, (select lookup2.lookup_desc from lookup_table lookup2 where lookup2.lookup_id = r.lookup_2) as desc_2
, (select lookup3.lookup_desc from lookup_table lookup3 where lookup3.lookup_id = r.lookup_3) as desc_3
, (select lookup4.lookup_desc from lookup_table lookup4 where lookup4.lookup_id = r.lookup_4) as desc_4
FROM Raw_Event_data r
WHERE r.event_id = 1
;
My first attempt would be to handle the indexing myself, if I were refused by the DBAs.
declare @start_range bigint, @end_range bigint
select
@start_range = 5
,@end_range = 500
create local temporary table raw_event_subset
( --going to assume some schema based on your comments...obviously you will change these to whatever the base schema is.
event_id bigint
,datetime_id timestamp
,lookup_1 smallint
,lookup_2 smallint
--etc
) on commit preserve rows
create HG index HG_temp_raw_event_subset_event_id on raw_event_subset (event_id)
create LF index LF_temp_raw_event_subset_lookup_1 on raw_event_subset (lookup_1)
create LF index LF_temp_raw_event_subset_lookup_2 on raw_event_subset (lookup_2)
--etc
insert into raw_event_subset
select
event_id
,datetime_id
,lookup_1
,lookup_2
--,etc
from raw_event_data
where event_id >= @start_range --event_id *must* have an HG index on it for this to be worthwhile.
and event_id <= @end_range
--then run your normal query, except replace raw_event_data with raw_event_subset
select
event_id
,datetime_id
,l1.lookup_desc
,l2.lookup_desc
--etc
from raw_event_subset r
left join lookup_table l1
on l1.lookup_id = r.lookup_1
left join lookup_table l2
on l2.lookup_id = r.lookup_2
--etc
drop table raw_event_subset
hope this helps...

How to improve performance of this query?

With reference to "SQL Query how to summarize students record by date?", I was able to get the report I wanted.
I was told that in the real world the students table will have 30 million records. I do have an index on (StudentID, Date). Any suggestions to improve the performance, or is there a better way to build the report?
Right now I have the following query
;with cte as
(
select id,
studentid,
date,
'#'+subject+';'+grade+';'+convert(varchar(10), date, 101) report
from student
)
-- insert into studentreport
select distinct
studentid,
STUFF(
(SELECT cast(t2.report as varchar(50))
FROM cte t2
where c.StudentId = t2.StudentId
order by t2.date desc
FOR XML PATH (''))
, 1, 0, '') AS report
from cte c;
Without seeing the execution plan, it's not really possible to write an optimized SQL statement, so I'll make suggestions instead.
Don't use a cte, as they often don't handle queries with large memory requirements well (at least, in my experience). Instead, stage the cte data in a real table, either with a materialized/indexed view or with a working table (maybe a large temp table), and then execute the second select (the part after the cte) to combine your data into an ordered list.
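For instance, a rough sketch of that staging step against your query (untested; the table and index names are my own, and the index choice is a guess):
CREATE TABLE #report_stage
(
    studentid INT NOT NULL,
    date DATETIME NOT NULL,
    report VARCHAR(50) NOT NULL
);

INSERT INTO #report_stage (studentid, date, report)
SELECT studentid,
       date,
       '#' + subject + ';' + grade + ';' + CONVERT(VARCHAR(10), date, 101)
FROM student;

-- clustered index so the per-student subquery below does seeks rather than scans
CREATE CLUSTERED INDEX ix_stage ON #report_stage (studentid, date DESC);

SELECT DISTINCT
       s.studentid,
       STUFF((SELECT t.report
              FROM #report_stage t
              WHERE t.studentid = s.studentid
              ORDER BY t.date DESC
              FOR XML PATH ('')), 1, 0, '') AS report
FROM #report_stage s;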
The number of comments on your question indicates that you have a large problem (or problems). You're converting tall and skinny data (think integers, datetime2 types) into ordered lists within strings. Try to think instead in terms of storing the smallest data formats available and putting off the manipulation into strings until afterward (or never). Alternatively, give serious thought to creating an XML data field to replace the 'report' field.
If you can make it work, this is what I would do (including a test case without indexes). Your mileage may vary, but give it a try:
create table #student (id int not null, studentid int not null, date datetime not null, subject varchar(40), grade varchar(40))
insert into #student (id,studentid,date,subject,grade)
select 1, 1, getdate(), 'history', 'A-' union all
select 2, 1, dateadd(d,1,getdate()), 'computer science', 'b' union all
select 3, 1, dateadd(d,2,getdate()), 'art', 'q' union all
--
select 1, 2, getdate() , 'something', 'F' union all
select 2, 2, dateadd(d,1,getdate()), 'genetics', 'e' union all
select 3, 2, dateadd(d,2,getdate()), 'art', 'D+' union all
--
select 1, 3, getdate() , 'memory loss', 'A-' union all
select 2, 3, dateadd(d,1,getdate()), 'creative writing', 'A-' union all
select 3, 3, dateadd(d,2,getdate()), 'history of asia 101', 'A-'
go
select studentid as studentid
,(select s2.date as '#date', s2.subject as '#subject', s2.grade as '#grade'
from #student s2 where s1.studentid = s2.studentid for xml path('report'), type) as 'reports'
from (select distinct studentid from #student) s1;
I don't know how to make the output legible on here, but the resultset is 2 fields. Field 1 is an integer, field 2 is XML with one node per report. This still isn't as ideal as just sending the resultset, but it is at least one result per studentid.

How to get the complete row from a maximum calculation?

I do struggle with a GROUP BY -- again. The basics I can handle, but there it is: how do I get at columns other than the ones I named in the GROUP BY, without destroying my grouping? Note that GROUP BY is only my own idea; there may be other approaches that work better. It must work in Oracle, though.
Here is my example:
create table xxgroups (
groupid int not null primary key,
groupname varchar2(10)
);
insert into xxgroups values(100, 'Group 100');
insert into xxgroups values(200, 'Group 200');
drop table xxdata;
create table xxdata (
num1 int,
num2 int,
state_a int,
state_b int,
groupid int,
foreign key (groupid) references xxgroups(groupid)
);
-- "ranks" are 90, 40, null, 70:
insert into xxdata values(10, 10, 1, 4, 100);
insert into xxdata values(10, 10, 0, 4, 200);
insert into xxdata values(11, 11, 0, 3, 100);
insert into xxdata values(20, 22, 5, 7, 200);
The task is to create a result-row for each distinct (num1, num2) and print that groupname with the highest calculated "rank" from state_a and state_b.
Note that the first two rows have the same nums and thus only the higher ranking should be selected -- with the groupname being "Group 200".
I got quite far with the basic group by, I think.
SELECT xd.num1||xd.num2 nummer, max(ranking.goodness)
FROM xxdata xd
, xxgroups xg
,( select state_a, state_b, r as goodness
from dual
model return updated rows
dimension by (0 state_a, 0 state_b) measures (0 r)
rules (r[1,4]=90, r[3,7]=80,r[5,7]=70, r[4,7]=60, r[0,7]=50, r[0,4]=40)
order by goodness desc
) ranking
WHERE xd.groupid=xg.groupid
and ranking.state_a (+) = xd.state_a
and ranking.state_b (+) = xd.state_b
GROUP BY xd.num1||xd.num2
ORDER BY nummer
;
The result is 90% of what I need:
nummer ranking
----------------
1010 90
1111
2022 70
100% perfect would be
nummer groupname
-------------------
1010 Group 100
1111 Group 100
2022 Group 200
The tricky part is that I want the groupname in the result. And I cannot include it in the select, because then I would have to put it into the group by as well -- which I do not want (then I would not select the best-ranking entry across all groups).
In my solution I use a model table to calculate the "rank". There are other solutions, I am sure. The point is that it is a non-trivial calculation that I do not want to do twice.
I know from other examples that one could use a second query to get back to the original row to get to the groupname, but I cannot see how I could do this here without duplicating my ranking calculation.
A nice suggestion was to replace the group by with a LIMIT 1/ORDER BY goodness and use this calculating select as a filtering subselect. But a) there is no LIMIT in Oracle, and I doubt a rownum<=1 would do in a subselect, and b) I cannot wrap my brain around it anyway. Maybe there is a way?
You can use the FIRST aggregation modifier to selectively apply your function over a subset of rows of a group -- here a single row (SQLFiddle demo):
SELECT xd.num1||xd.num2 nummer,
MAX(xg.groupname) KEEP (DENSE_RANK FIRST
ORDER BY ranking.goodness DESC) grp,
max(ranking.goodness)
FROM xxdata xd
, xxgroups xg
,( select state_a, state_b, r as goodness
from dual
model return updated rows
dimension by (0 state_a, 0 state_b) measures (0 r)
rules (r[1,4]=90, r[3,7]=80,r[5,7]=70, r[4,7]=60, r[0,7]=50, r[0,4]=40)
order by goodness desc
) ranking
WHERE xd.groupid=xg.groupid
and ranking.state_a (+) = xd.state_a
and ranking.state_b (+) = xd.state_b
GROUP BY xd.num1||xd.num2
ORDER BY nummer;
Your method with analytics works as well but since we already use aggregations here, we may as well use the FIRST modifier to get all columns in one go.
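If the KEEP syntax is unfamiliar: within each group, the aggregate is evaluated only over the rows that rank first under the ORDER BY inside KEEP. A toy illustration on the classic EMP demo table (my example, not from the question):
-- name of (one of) the highest-paid employees per department
SELECT deptno,
       MAX(ename) KEEP (DENSE_RANK FIRST ORDER BY sal DESC) AS top_paid
FROM emp
GROUP BY deptno;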
Wow, I did search before, but only now did I find this answer, which I could adapt to my question. The Oracle solution here is over (partition by ... order by ...) with row_number():
select *
from ( select data.*, row_number()
over (partition by nummer order by goodness desc) as seqnum
from (
SELECT xd.num1, xd.num2 nummer, xg.groupname, ranking.goodness
FROM xxdata xd
, xxgroups xg
,( select state_a, state_b, r as goodness
from dual
model return updated rows
dimension by (0 state_a, 0 state_b) measures (0 r)
rules (r[1,4]=90, r[3,7]=80,r[5,7]=70, r[4,7]=60, r[0,7]=50, r[0,4]=40)
) ranking
WHERE xd.groupid=xg.groupid
and ranking.state_a (+) = xd.state_a
and ranking.state_b (+) = xd.state_b
ORDER BY nummer
) data )
where seqnum = 1
;
The result is
10 10 Group 100 90 1
11 11 Group 100 1
20 22 Group 200 70 1
which is beautiful.
Now I have to try to understand what the over in the select exactly does....