How to label a big set of “transitive groups” with a constraint? - sql

EDIT after #NealB's solution: #NealB's solution is very, very fast compared with any other, and makes this new question about "adding a constraint to improve performance" unnecessary. It needs no improvement, runs in O(N) time, and is very simple.
The problem of "labeling transitive groups with SQL" has an elegant solution using recursion and a CTE... but that solution takes exponential time (!). I need to work with 10000 items: with 1000 items it needs 1 second, with 2000 it needs 1 day...
Constraint: in my case it is possible to break the problem into pieces of ~100 items or less, but only to select one group of ~10 items and discard all the other ~90 labeled items...
Is there a generic algorithm to add and use this kind of "pre-selection" to reduce the quadratic O(N^2) time? Perhaps, as shown by the comments and #wildplasser, to O(N log N) time; but with the "pre-selection" I expect to reduce it to O(N) time.
(EDIT)
I tried to use an alternative algorithm, but it needs some improvement to be used as a solution here; or, to really increase performance (to O(N) time), it needs to use the "pre-selection".
The "pre-selection" (constraint) is based on a "super-set grouping"... Starting from the t1 table of the original "How to label 'transitive groups' with SQL?" question,
table T1
(the original T1 augmented with a "super-set grouping label" ssg, and one more row)
ID1 | ID2 | ssg
1 | 2 | 1
1 | 5 | 1
4 | 7 | 1
7 | 8 | 1
9 | 1 | 1
10 | 11 | 2
So there are three groups,
g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
g2: {4,7,8} because "4 t 7" and "7 t 8"
g3: {10,11} because "10 t 11"
The super-group is only an auxiliary grouping,
ssg1: {g1,g2}
ssg2: {g3}
If we have M super-groups and N total T1 items, the average group length will be less than N/M. We can also suppose (for my typical problem) that the maximum ssg length is ~N/M.
So the "label algorithm" needs to run only M times with ~N/M items each if it uses the ssg constraint; even with a quadratic per-group algorithm, the total cost drops from ~N^2 to ~M·(N/M)^2 = N^2/M.

An SQL-only solution appears to be a bit of a problem here. With the help of some procedural programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline of a solution as it could be implemented using any procedural language invoking SQL.
Declare a table R with primary key ID, where ID corresponds to the same domain as ID1 and ID2 of table T1. Table R contains one other non-key column, a Label number.
Populate table R with the range of values found in T1. Set Label to zero (no label).
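For reference, here is a minimal sketch of that setup in generic SQL (the exact DDL is an assumption; only the names R, ID and Label come from the description above):
-- R holds one row per distinct ID found in T1; Label = 0 means "not yet labeled"
CREATE TABLE R (
  ID    integer PRIMARY KEY,   -- same domain as T1.ID1 / T1.ID2
  Label integer NOT NULL DEFAULT 0
);
INSERT INTO R (ID, Label)
SELECT ID, 0
FROM (SELECT ID1 AS ID FROM T1
      UNION
      SELECT ID2 FROM T1) AS ids;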
Using your example data, the initial setup for R would look like:
Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0
Using a host-language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of four cases:
Case 1: ID1.Label == 0 and ID2.Label == 0
In this case neither of these IDs has been "seen" before: Add 1 to the counter and then update both
rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)
Case 2: ID1.Label == 0 and ID2.Label <> 0
In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned to the
same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1
Case 3: ID1.Label <> 0 and ID2.Label == 0
In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned to the
same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2
Case 4: ID1.Label <> 0 and ID2.Label <> 0
In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not,
there is some sort of data integrity problem. Ahhhh... not quite; see the edit below...
EDIT I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different then two Label groups need to be merged at this point. All you need to do is choose one Label and update the others to match, with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged under the same Label value.
Upon completion of the cursor, table R will contain Label values needed to update T2.
Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1
Process table T2
using something along the lines of: set T2.Label to R.Label where T2.ID1 = R.ID. The end result should be:
table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1
This process is purely iterative and should scale to fairly large tables without difficulty.
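For illustration only, here is a hedged PL/pgSQL sketch of that cursor loop (PostgreSQL is an assumption, picked because the asker's own follow-up uses it; the function name and variables are invented, and this is an outline rather than a tested implementation):
CREATE FUNCTION label_transitive_groups() RETURNS void AS $$
DECLARE
  pair    record;
  l1      integer;
  l2      integer;
  counter integer := 0;
BEGIN
  FOR pair IN SELECT id1, id2 FROM t1 LOOP
    SELECT label INTO l1 FROM R WHERE id = pair.id1;
    SELECT label INTO l2 FROM R WHERE id = pair.id2;
    IF l1 = 0 AND l2 = 0 THEN            -- case 1: both IDs are new
      counter := counter + 1;
      UPDATE R SET label = counter WHERE id IN (pair.id1, pair.id2);
    ELSIF l1 = 0 THEN                    -- case 2: only ID1 is new
      UPDATE R SET label = l2 WHERE id = pair.id1;
    ELSIF l2 = 0 THEN                    -- case 3: only ID2 is new
      UPDATE R SET label = l1 WHERE id = pair.id2;
    ELSIF l1 <> l2 THEN                  -- case 4: merge two existing label groups
      UPDATE R SET label = l1 WHERE label = l2;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
After it has run, T2 could be labeled with something along the lines of update t2 set label = R.label from R where t2.id1 = R.id (PostgreSQL's UPDATE ... FROM form).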

I suggest you check this and use some
general-purpose language for solving it.
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Traverse the graph, maybe run DFS or BFS from each node,
then use this disjoint set hint. I think this should work.
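For what it is worth, here is a hedged sketch of that union-find idea in PL/pgSQL (PostgreSQL and the t1(id1, id2) table are assumptions carried over from the question; the uf table and the function names are invented, and there is no path compression, so treat it as an illustration rather than an optimized implementation):
-- parent table: every ID starts as its own root
CREATE TABLE uf AS
SELECT id, id AS parent
FROM (SELECT id1 AS id FROM t1 UNION SELECT id2 FROM t1) AS ids;
CREATE FUNCTION uf_find(p_id integer) RETURNS integer AS $$
DECLARE
  cur integer := p_id;
  par integer;
BEGIN
  LOOP
    SELECT parent INTO par FROM uf WHERE id = cur;
    EXIT WHEN par = cur;
    cur := par;
  END LOOP;
  RETURN cur;  -- the root of the set that p_id belongs to
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION uf_union_all() RETURNS void AS $$
DECLARE
  pair record;
  ra integer;
  rb integer;
BEGIN
  FOR pair IN SELECT id1, id2 FROM t1 LOOP
    ra := uf_find(pair.id1);
    rb := uf_find(pair.id2);
    IF ra <> rb THEN
      UPDATE uf SET parent = ra WHERE id = rb;  -- union: attach one root under the other
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
The group labels are then just the roots: SELECT id, uf_find(id) AS group_label FROM uf;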

The #NealB solution is the fastest(!) See an example of a PostgreSQL implementation here.
Below is an example of another "brute force algorithm", only for curiosity!
As #peter.petrov and #RBarryYoung suggested, some performance problems can be avoided by abandoning the CTE recursion... I fixed some issues in the basic labeler, and, on top of it, I added the constraint of grouping by a super-set label. This new transgroup1_loop() function is working!
PS: this solution still has performance limitations; please post your answer with a better one, or with some adaptation of this one.
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
id serial NOT NULL PRIMARY KEY,
items integer[], -- two or more items in the transitive relationship
ssg_label varchar(12), -- the super-set grouping label
dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items,ssg_label) values
(array[1, 2],'1'),
(array[1, 5],'1'),
(array[4, 7],'1'),
(array[7, 8],'1'),
(array[9, 1],'1'),
(array[10, 11],'2');
-- or SELECT array[id1, id2],ssg_label FROM t1, with 10000 items
Then, with these two functions, we can solve the problem:
CREATE FUNCTION transgroup1_loop(p_ssg varchar, p_max_i integer DEFAULT 100)
RETURNS integer AS $funcBody$
DECLARE
  cp_dels integer[];
  i integer;
BEGIN
  i := 1;
  LOOP
    UPDATE transgroup1
    SET items = array_uunion(transgroup1.items, t2.items),
        dels  = transgroup1.dels || t2.id
    FROM transgroup1 AS t1, transgroup1 AS t2
    WHERE transgroup1.id = t1.id AND t1.ssg_label = $1 AND
          t1.id > t2.id AND t1.items && t2.items;

    cp_dels := array(
      SELECT DISTINCT unnest(dels) FROM transgroup1
    ); -- ensures all items to delete
    RAISE NOTICE '-- bug, repeating dels, item-%; % dels! %', i, array_length(cp_dels,1), array_to_string(cp_dels,';','*');

    -- array_length() returns NULL for an empty array, hence the COALESCE
    EXIT WHEN i > p_max_i OR coalesce(array_length(cp_dels,1), 0) = 0;

    DELETE FROM transgroup1
    WHERE ssg_label = $1 AND id IN (SELECT unnest(cp_dels));
    UPDATE transgroup1 SET dels = array[]::integer[];
    i := i + 1;
  END LOOP;

  UPDATE transgroup1  -- only to beautify
  SET items = ARRAY(SELECT unnest(items) ORDER BY 1 DESC);
  RETURN i;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
To run it and see the results, you can use:
SELECT transgroup1_loop('1'); -- run with ssg-1 items only
SELECT transgroup1_loop('2'); -- run with ssg-2 items only
-- show all with a sequential group label:
SELECT *, dense_rank() over (ORDER BY id) AS group_label from transgroup1;
results:
id | items | ssg_label | dels | group_label
----+-----------+-----------+------+-------------
4 | {8,7,4} | 1 | {} | 1
5 | {9,5,2,1} | 1 | {} | 2
6 | {11,10} | 2 | {} | 3
PS: the function array_uunion() is the same as the original,
CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
-- ensures distinct items of a concatenation
SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;

Related

Group data series into variable width windows based on first event

I have a computational task which can be reduced to the following problem:
I have a large set of pairs of integers (key, val) which I want to group into windows. The first window starts with the first pair p ordered by key attribute and spans all the pairs where p[i].key belongs to [p[0].key; p[0].key + N), with some arbitrary integer N, positive and common to all windows.
The next window starts with the first pair ordered by key not included in the previous windows and again spans all the pairs from its key to key + N, and so on for the following windows.
The last step is to sum the second attribute for each window and display it together with the first key of the window.
For example, given list of records with values:
key | val
----+----
  1 |  3
  2 |  7
  5 |  1
  6 |  4
  7 |  1
 10 |  3
 13 |  5
and N=3, the windows would be:
{(1,3),(2,7)},
{(5,1),(6,4),(7,1)},
{(10,3)}
{(13,5)}
The final result:
key | sum_of_values
----+--------------
  1 | 10
  5 |  6
 10 |  3
 13 |  5
This is easy to program with a standard programming language but I have no clue how to solve this with SQL.
Note: If ClickHouse doesn't support the RECURSIVE keyword, just remove that keyword from the expression.
ClickHouse seems to use non-standard syntax for the WITH clause. The below uses standard SQL. Adjust as needed.
Sorry, ClickHouse may not support this approach. If not, we would need to find another method of walking through the data.
Standard SQL:
There are a few ways. Here's one approach. First assign row numbers to allow recursively stepping through the rows. We could use LEAD as well.
Assign a group (key value) to each row based on the current key and the last group/key value and whether they are within some distance (N = 3, in this case).
The last step is to just SUM these values per group start_key and to use the start_key value as the starting key in each group.
WITH RECURSIVE nrows (xkey, val, n) AS (
SELECT xkey, val, ROW_NUMBER() OVER (ORDER BY xkey) FROM test
)
, cte (xkey, val, n, start_key) AS (
SELECT xkey, val, n, xkey FROM nrows WHERE n = 1
UNION ALL
SELECT t1.xkey, t1.val, t1.n
, CASE WHEN t1.xkey <= t2.start_key + (3-1) THEN t2.start_key ELSE t1.xkey END
FROM nrows AS t1
JOIN cte AS t2
ON t2.n = t1.n-1
)
SELECT start_key
, SUM(val) AS sum_values
FROM cte
GROUP BY start_key
ORDER BY start_key
;
Result:
+-----------+------------+
| start_key | sum_values |
+-----------+------------+
| 1 | 10 |
| 5 | 6 |
| 10 | 3 |
| 13 | 5 |
+-----------+------------+

Oracle SQL statement to update column values based on specific condition

I have a table which has 3 columns: PID, LOCID, ISMGR. Now in the existing scenario, for some person, based on the location ID, he is set as ISMGR=true.
But as per the new requirement, we have to set ISMGR=true for every row of any person who has at least one ISMGR=true (meaning if he is a manager for any one location, he should be a manager for all the locations).
Table Data before running the script:
PID|LOCID|ISMGR
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
Table Data after running the script:
PID|LOCID|ISMGR
1 1 1
1 2 1
1 3 1
2 1 1
2 2 1
Any help will be highly appreciated..
Thanks in advance.
I would be inclined to write this using exists:
update t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from t t2 where t2.pid = t.pid and t2.ismgr = 1);
exists should be more efficient than doing a subquery with an aggregation.
This will work best with indexes on t(pid, ismgr) and t(ismgr).
This is not an answer but a test of the two solutions offered so far - I will call them the "EXISTS" and the "AGGREGATE" solutions or approaches.
Details of the tests are below, but here are two overall conclusions:
Both approaches have comparable execution times; on average the AGGREGATE approach worked a little faster than the EXISTS approach, but by a very small margin (smaller than the differences between running times from one trial to the next). Without indexes on any columns, the run times were (first number is for the EXISTS approach and the second for AGGREGATE):
Trial 1: 8.19s 8.08s
Trial 2: 8.98s 8.22s
Trial 3: 9.46s 9.55s
Note - Estimated optimizer costs should be used only to compare different execution plans for the same statement, not for different solutions using different approaches. Even so, someone will inevitably ask; so - for the EXISTS approach the lowest cost the Optimizer found was 4766; for AGGREGATE, 2665. Again, though, this is completely meaningless.
If a lot of rows need to be updated, indexes will hurt performance much more than they help it. Indeed, when rows are updated, the indexes must be updated as well. If only a small number of rows must be updated, then the indexes will help, because most of the time is spent finding the rows that must be updated and only little time is spent in the updates themselves. In my example almost 25% of rows had to be updated... so the AGGREGATE solution took 51.2 seconds and the EXISTS solution took 59.3 seconds!
RECOMMENDATION: If you expect that a large number of rows may need to be updated, and you already have indexes on the table, you may be better off DROPPING them and re-creating them after the updates! Or, perhaps there are other solutions to this problem; I am not an expert (keep that in mind!)
To test properly, after I created the test table and committed, I ran each solution by itself, then I rolled back and, logged in as SYS (in a different session), I ran alter system flush buffer_cache to make sure performance is not randomly helped by cache hits or hurt by misses. In all cases everything is done from disk storage.
I created a table with id's from 1 to 1.2 million and a random integer between 1 and 3, with probabilities 40%, 40% and 20% respectively (see the use of dbms_random below). Then from this prep data I created the test table: each pid was included one, two or three times based on this random integer; and a random 0 or 1 was added as ismgr (with 50-50 probability) in each row. I also added a random integer between 1 and 4 as locid just to simulate the actual data; I didn't worry about duplicate locid since that column plays no role in the problem.
Of the 1.2 million pids, approximately 480,000 (40%) appear just once in the test table, another ~480,000 appear twice and ~240,000 three times. Total rows should be about 2,160,000. That's the cardinality of the base table (in reality it ended up being 2,160,546). Then: none of the ~480,000 rows with unique pid need to be changed; half of the 480,000 pids with a count of 2 will have the same ismgr (so no change) and the other half will be split, so we will need to change 240,000 rows from these; and a simple combinatorial argument shows that 3/8, or 270,000 rows, of the 720,000 rows for pids that appear three times in the table must be changed. So we should expect that 510,000 rows should be changed. In fact the update statements resulted in 510,132 rows updated (same for both solutions). These sanity checks show that the test was probably set up correctly. Below I show also a small sample from the base table, also as a sanity check.
CREATE TABLE statement:
create table tbl as
with prep ( pid, dup ) as (
select level,
round( dbms_random.value(0.5, 3) ) as dup
from dual
connect by level <= 1200000
)
select pid,
round( dbms_random.value(0.5, 4.5) ) as locid,
round( dbms_random.value(0, 1) ) as ismgr
from prep
connect by level <= dup
and prior pid = pid
and prior sys_guid() is not null
;
commit;
Sanity checks:
select count(*) from tbl;
COUNT(*)
----------
2160546
select * from tbl where pid between 324720 and 324730;
PID LOCID ISMGR
---------- ---------- ----------
324720 4 1
324721 1 0
324721 4 1
324722 3 0
324723 1 0
324723 3 0
324723 3 1
324724 3 1
324724 2 0
324725 4 1
324725 2 0
324726 2 0
324726 1 0
324727 3 0
324728 4 1
324729 1 0
324730 3 1
324730 3 1
324730 2 0
19 rows selected
UPDATE statements:
update tbl t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);
rollback;
update tbl
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from tbl
group by pid
having max(ismgr) = 1);
rollback;
-- statements to create indexes, used in separate testing:
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);
Why PL/SQL? All you need is a plain SQL statement. For example:
update your_table t -- enter your actual table name here
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from your_table
group by pid
having max(ismgr) = 1)
;
The existing solutions are perfectly fine, but I prefer to use merge any time I'm updating rows from a correlated sub-query. I find it to be more readable and the performance is typically commensurate with the exists method.
MERGE INTO t
USING (SELECT DISTINCT pid
FROM t
WHERE ismgr = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;
As #mathguy pointed out, in this case using group by and having is more efficient than distinct. To use that with merge is just a matter of changing the sub-query:
MERGE INTO t
USING (SELECT pid
FROM t
GROUP BY pid
HAVING MAX(ismgr) = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
UPDATE SET ismgr = 1
WHERE ismgr = 0;

How to consolidate blocks of time?

I have a derived table with a list of relative seconds to a foreign key (ID):
CREATE TABLE Times (
ID INT
, TimeFrom INT
, TimeTo INT
);
The table contains mostly non-overlapping data, but there are occasions where I have a TimeTo < TimeFrom of another record:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 70 |
| 10 | 60 | 150 |
| 10 | 75 | 150 |
| .. | ... | ... |
+----+----------+--------+
The result set is meant to be a flattened linear idle report, but with too many of these overlaps, I end up with negative time in use. I.e. If the window above for ID = 10 was 150 seconds long, and I summed the differences of relative seconds to subtract from the window size, I'd wind up with 150-(20+20+90+75)=-55. This approach I've tried, and is what led me to realizing there were overlaps that needed to be flattened.
So, what I'm looking for is a solution to flatten the overlaps into one set of times:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 10 | 30 |
| 10 | 50 | 150 |
| .. | ... | ... |
+----+----------+--------+
Considerations: Performance is very important here, as this is part of a larger query that will perform well on its own, and I'd rather not impact its performance much if I can help it.
On a comment regarding "Which seconds have an interval", this is something I have tried for the end result, and am looking for something with better performance. Adapted to my example:
SELECT SUM(C.N)
FROM (
SELECT A.N, ROW_NUMBER()OVER(ORDER BY A.N) RowID
FROM
(SELECT TOP 60 1 N FROM master..spt_values) A
, (SELECT TOP 720 1 N FROM master..spt_values) B
) C
WHERE EXISTS (
SELECT 1
FROM Times SE
WHERE SE.ID = 10
AND SE.TimeFrom <= C.RowID
AND SE.TimeTo >= C.RowID
AND EXISTS (
SELECT 1
FROM Times2 D
WHERE ID = SE.ID
AND D.TimeFrom <= C.RowID
AND D.TimeTo >= C.RowID
)
GROUP BY SE.ID
)
The problem I have with this solution is that I get a Row Count Spool out of the EXISTS query in the query plan, with a number of executions equal to COUNT(C.*). I left the real numbers in that query to illustrate that getting around this approach is for the best, because even with a Row Count Spool reducing the cost of the query by quite a bit, its execution count increases the cost of the query as a whole by quite a bit as well.
Further Edit: The end goal is to put this in a procedure, so Table Variables and Temp Tables are also a possible tool to use.
OK. I'm still trying to do this with just one SELECT. But this totally works:
DECLARE @tmp TABLE (ID INT, GroupId INT, TimeFrom INT, TimeTo INT)
INSERT INTO @tmp
SELECT ID, 0, TimeFrom, TimeTo
FROM Times
ORDER BY Id, TimeFrom
DECLARE @timeTo int, @id int, @groupId int
SET @groupId = 0
UPDATE @tmp
SET
    @groupId = CASE WHEN id != @id THEN 0
                    WHEN TimeFrom > @timeTo THEN @groupId + 1
                    ELSE @groupId END,
    GroupId = @groupId,
    @timeTo = TimeTo,
    @id = id
SELECT Id, MIN(TimeFrom), Max(TimeTo) FROM @tmp
GROUP BY ID, GroupId ORDER BY ID
Left join each row to its successor overlapping row on the same ID value (where one exists).
Now for each row in the result-set of LHS left join RHS the contribution to the elapsed time for the ID is:
isnull(RHS.TimeFrom,LHS.TimeTo) - LHS.TimeFrom as TimeElapsed
Summing these by ID should give you the correct answer.
Note that:
- where there isn't an overlapping successor row the calculation is simply
LHS.TimeTo - LHS.TimeFrom
- where there is an overlapping successor row the calculation will net to
(RHS.TimeFrom - LHS.TimeFrom) + (RHS.TimeTo - RHS.TimeFrom)
which simplifies to
RHS.TimeTo - LHS.TimeFrom
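A possible T-SQL reading of this idea (a sketch only; it interprets "its successor overlapping row" as the nearest later row that starts inside the current one, implemented with OUTER APPLY rather than a plain LEFT JOIN so that at most one successor is picked per row):
SELECT T.ID,
       SUM(ISNULL(S.TimeFrom, T.TimeTo) - T.TimeFrom) AS TimeElapsed
FROM Times AS T
OUTER APPLY (
    SELECT TOP (1) S2.TimeFrom
    FROM Times AS S2
    WHERE S2.ID = T.ID
      AND S2.TimeFrom > T.TimeFrom   -- a later row...
      AND S2.TimeFrom < T.TimeTo     -- ...that overlaps the current one
    ORDER BY S2.TimeFrom
) AS S
GROUP BY T.ID;
For the sample rows of ID 10 this yields 20 + 10 + 15 + 75 = 120 seconds, the same total as the flattened intervals [10,30] and [50,150].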
What about something like below (assumes SQL 2008+ due to CTE):
WITH Overlaps
AS
(
SELECT t1.Id,
TimeFrom = MIN(t1.TimeFrom),
TimeTo = MAX(t2.TimeTo)
FROM dbo.Times t1
INNER JOIN dbo.Times t2 ON t2.Id = t1.Id
AND t2.TimeFrom > t1.TimeFrom
AND t2.TimeFrom < t1.TimeTo
GROUP BY t1.Id
)
SELECT o.Id,
o.TimeFrom,
o.TimeTo
FROM Overlaps o
UNION ALL
SELECT t.Id,
t.TimeFrom,
t.TimeTo
FROM dbo.Times t
INNER JOIN Overlaps o ON o.Id = t.Id
AND (o.TimeFrom > t.TimeFrom OR o.TimeTo < t.TimeTo);
I do not have a lot of data to test with but seems decent on the smaller data sets I have.
I also wrapped my head around this issue - and after all I found that the problem is your data.
You claim (if I get that right) that these entries should reflect the relative times when a user goes idle / comes back.
So you should consider sanitizing your data and refactoring your inserts to produce valid data sets.
For instance, the two lines:
+----+----------+--------+
| ID | TimeFrom | TimeTo |
+----+----------+--------+
| 10 | 50 | 70 |
| 10 | 60 | 150 |
How can it be possible that a user is idle until second 70, but goes idle again at second 60? This already implies that he came back at second 59 at the latest.
I can only assume that this issue comes from different threads and/or browser windows (tabs) a user might be using your application with. (Each having its own "idle detection".)
So instead of working around the symptoms, you should fix the cause! Why is this data entry inserted into the table? You could avoid this by simply checking whether the user is already idle before inserting a new row.
Create a unique key constraint on ID and TimeTo (a DDL sketch follows these steps).
Whenever an idle-event is detected, execute the following query:
INSERT IGNORE INTO Times (ID,TimeFrom,TimeTo)VALUES('10', currentTimeStamp, -1);
-- (If the user is already "idle" - nothing will happen)
Whenever a comeback-event is detected, execute the following query:
UPDATE Times SET TimeTo=currentTimeStamp WHERE ID='10' and TimeTo=-1
-- (If the user is already "back" - nothing will happen)
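The unique key from the first step could be created with something like this (a sketch; MySQL is assumed here only because INSERT IGNORE is used above, and the constraint name is invented):
ALTER TABLE Times ADD CONSTRAINT uq_times_id_timeto UNIQUE (ID, TimeTo);
With that in place, a second idle-event for a user who is still idle (ID already has a row with TimeTo = -1) violates the key and is silently skipped by INSERT IGNORE.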
The fiddle linked here: http://sqlfiddle.com/#!2/dcb17/1 would reproduce the chain of events for your example, but resulting in a clean and logical set of idle-windows:
ID TIMEFROM TIMETO
10 10 30
10 50 70
10 75 150
Note: The output is slightly different from the output you desired, but I feel that this is more accurate, because of the reason outlined above: a user cannot go idle at second 70 without returning from his current idle state first. Either he STAYS idle (and a second thread/tab runs into the idle-event), or he returned in between.
Especially for your need to maximize performance, you should fix the data and not invent a work-around-query. This is maybe 3 ms upon inserts, but could be worth 20 seconds upon select!
Edit: if multi-threading / multiple sessions are the cause of the wrong inserts, you would also need to implement a check whether most_recent_come_back_time < now() - idleTimeout - otherwise a user might come back on tab1 and be recorded idle on tab2 a few seconds later, because tab2 ran into its idle timeout since the user only refreshed tab1.
I had the 'same' problem once with 'days' (additionally without counting weekends and holidays).
The word 'counting' gave me the following idea:
create table Seconds ( sec INT);
insert into Seconds values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9), ...
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom and t.timeto-1
and id=10;
You can shift the start down to 0 (I put the '10' here in parentheses):
select count(distinct sec) from times t, seconds s
where s.sec between t.timefrom- (10) and t.timeto- (10)-1
and id=10;
and finally
select count(distinct sec) from times t, seconds s,
(select min(timefrom) m from times where id=10) as m
where s.sec between t.timefrom-m.m and t.timeto-m.m-1
and id=10;
Additionally, you can "ignore" e.g. 10 seconds by dividing; you lose some precision but gain speed:
select count(distinct sec)*d from times t, seconds s,
(select min(timefrom) m from times where id=10) as m,
(select 10 d) as d
where s.sec between (t.timefrom-m)/d and (t.timeto-m)/d-1
and id=10;
Sure, it depends on the range you have to look at, but a 'day' or two of seconds should work (although I did not test it).
fiddle ...

Aggregation over order-dependent partition?

I have a source data set like this (simplified to be more clear):
Key F1 F2
1 X 4
2 X 5
3 Y 6
4 X 9
5 X 7
6 X 8
7 Y 9
8 X 6
9 X 5
10 Y 3
The data is sorted by the Key field. Now, I want to compute an aggregate of the F2 field over partitions that are defined by the F1 field: A partition starts at the first X value and ends with the first subsequent Y value.
So, for example, I might want to compute the MIN() over the partitions defined as described above. Then the result set would look like this:
rownum MIN(F2)
1 4
2 7
3 3
I have tried a number of resources (incl. our own intranet community and of course stackoverflow) but found nothing for my case. Usually partitioning only works with a field that can be used to identify the partitions. Here, the partitions are defined by a change in a field's content with respect to a given order.
Although I am aware that I may have to resort to writing a procedural solution I would prefer to solve this in pure SQL.
Any ideas how such a partitioning could be achieved with a SQL select statement?
Thanks and regards
Kai.
A little bit shorter solution: http://sqlfiddle.com/#!12/7390d/24
Query:
select min(f2)
from t t1
group by (select max(key)
from t t2
where t2.f1='Y' and
t1.key > t2.key)
Result:
| MIN |
-------
| 4 |
| 7 |
| 3 |
The idea is to find the key of preceding 'Y' for each row and group by it. Should work with any SQL engine.
You didn't specify engine or dialect or version so I assumed SQL Server 2012.
Example that you can run to see the solution: http://sqlfiddle.com/#!6/f5d38/21
You solve it by creating correct partitions in your set. Code looks like this.
WITH groupLimits as
(
SELECT
[Key] AS groupend
,COALESCE(LAG([Key]) OVER (order by [Key]),0)+1 AS groupstart
FROM sourceData
WHERE F1 = 'Y'
)
SELECT
MIN(sourceData.F2)
FROM groupLimits
INNER JOIN sourceData
ON sourceData.[Key] BETWEEN groupLimits.groupstart and groupLimits.groupend
GROUP BY groupLimits.groupstart
ORDER BY groupLimits.groupstart

How to Order By within an Order By in SQL?

I'm trying to create a SQL statement which will recreate the hierarchical container/folder/test structure in SilkCentral Test Manager.
Test containers have no ParentID
Test folders contain a ParentID and IsLeaf = 0
Tests contain a ParentID and IsLeaf = 1
This Query results in all of the test containers, folders, and tests:
SELECT "NodeID", "ParentID", "Name", "IsLeaf", "OrderNumber"
FROM "Silk"."TM_TestPlanNodes" AS TPN
WHERE PROJECTID = 36
ORDER BY "ParentID", "OrderNumber", "IsLeaf"
Here are some of the Results:
NodeID ParentID Name IsLeaf OrderNumber
65408 Installation and Upgrades 0 0
65445 Connectivity 0 1
65448 Focus 0 2
65409 GINA / PLAP 0 3
65446 Graphical User Interface 0 4
71038 Login Properties 0 5
65449 Miscellaneous 0 6
70636 Net Firewall 0 7
70998 Software Updates 0 8
65447 Third-party Services 0 9
70805 SilkTest Automated Tests 0 10
68812 65408 0. Setup 0 0
65454 65408 1. Installations & Upgrades 0 1
65450 65408 Typical/Custom Installation 0 2
I would like this ordering instead:
The ParentID is sorted, but if there exists a node with ParentID = the previous node's ID, then that one is chosen next. If there are multiple such nodes, they should be ordered by IsLeaf and then OrderNumber.
How to accomplish this? I'm very limited in what I can do, because I think very complicated syntax will end up throwing errors in Silk. I was going to try a nested SELECT statement:
SELECT "NodeID", "ParentID", "Name","IsLeaf"
FROM "Silk"."TM_TestPlanNodes"
WHERE PROJECTID = '36' AND ParentID LIKE (
SELECT ParentID
FROM "Silk"."TM_TestPlanNodes"
WHERE NAME = 'Installation and Upgrades')
But this is getting this error: "Could not execute report query: Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression."
This is why I'm fiddling with Order By.
You can use a recursive CTE to create a hidden column and order by that column. The hidden column should be something like:
WITH cte (NodeID, ParentID, Name, IsLeaf, [Order])
AS
(
SELECT NodeID, ParentID, Name, IsLeaf, cast(NodeID as nvarchar(10))
FROM "Silk"."TM_TestPlanNodes"
WHERE PROJECTID = '36'
UNION ALL
SELECT "NodeID", "ParentID", "Name","IsLeaf", cast(leftNode.ParentID as nvarchar(10)) + cast(leftNode.NodeID as nvarchar(10))
FROM "Silk"."TM_TestPlanNodes" as leftNode
INNER JOIN cte on cte.NodeID = leftNode.ParentID
WHERE leftNode.ParentID = cte.NodeID
)
select "NodeID", "ParentID", "Name","IsLeaf" from cte
order by cast([Order] as nvarchar(50))
This was written in Notepad so it may have some errors, but the idea is to make an [Order] column that, for example, for 65530 would be 654086554569530 (the parent's parent, the parent and the node).
EDIT:
this only works if the ids are all 5 characters long, but from here you can make the proper tweaks.
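One possible tweak along those lines (an untested sketch, not part of the original answer): left-pad every NodeID to a fixed width before concatenating, cast both CTE branches to the same type, and start the recursion at the container rows, which have no ParentID according to the question:
WITH cte (NodeID, ParentID, Name, IsLeaf, [Order]) AS
(
    SELECT NodeID, ParentID, Name, IsLeaf,
           CAST(RIGHT('0000000000' + CAST(NodeID AS varchar(10)), 10) AS varchar(4000))
    FROM "Silk"."TM_TestPlanNodes"
    WHERE PROJECTID = '36' AND ParentID IS NULL          -- containers only
    UNION ALL
    SELECT child.NodeID, child.ParentID, child.Name, child.IsLeaf,
           CAST(cte.[Order] + RIGHT('0000000000' + CAST(child.NodeID AS varchar(10)), 10) AS varchar(4000))
    FROM "Silk"."TM_TestPlanNodes" AS child
    INNER JOIN cte ON cte.NodeID = child.ParentID
)
SELECT NodeID, ParentID, Name, IsLeaf
FROM cte
ORDER BY [Order];
To honor the requested sibling order (IsLeaf, then OrderNumber), those values could be padded into the path ahead of the NodeID in the same way.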
Although it might not be a PERFECT fit, it is very close: a nested hierarchical representation of parent-child records in a self-joined list, incorporating proper ordering concerns. You may have to tweak it a bit for your table, but here's a link to a prior solution.
To clarify, here is that problem with the menu and the corresponding data.
id | parentid | name
1 | 0 | CatOne
2 | 0 | CatTwo
3 | 0 | CatThree
4 | 1 | SubCatOne
5 | 1 | SubCatOne2
6 | 3 | SubCatThree
Desired output
CatOne 1
--SubCatOne 4
--SubCatOne2 5
CatTwo 2
CatThree 3
--SubCatThree 6
The FIRST case is pre-grouping all the like IDs based on the parent... So, when the parent ID is 0, it IS the top-most level, so we keep its ID. Then, for any children under it, we want their respective PARENT IDs so all of the same are correctly pre-grouped.
The purpose of the SECOND group by is to force the entry that represents the actual TOP LEVEL menu item to the top of the list regardless of the child entries.
Say you have a table where IDs are already established, and you now add a new item into position ID = 7 for "New Top Level" and want to move IDs 2 and 3 into the new top-level section. If you just do the query with the first CASE, your records would be returned as
ID Parent Name (natural order from the table)
2 7 CatTwo
3 7 CatThree
7 0 New Category. (we want THIS one in FIRST POSITION of the group)
As you can see, this would be a bad representation of the sub-grouping order. The top-level item actually is in the 3rd position... To bring it to the front, we are now sub-grouping and saying... if the Parent ID of the record = 0, then sort it as if it were a '1' priority. Anything else is considered a '2' priority and would simulate the result like
ID Parent Name SubPrioritySort
7 0 New Category. 1
2 7 CatTwo 2
3 7 CatThree 2
Since you are not actually returning these "CASE" values in your result query, you wouldn't otherwise visually see it... but for grins, add them as columns to your query to see the impact. Hopefully this clarified the answer for you.
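As an illustration only (the table name menu and the 0-means-root convention follow the small example above, not necessarily the linked answer), the two CASE expressions could sit in the ORDER BY like this:
SELECT id, parentid, name
FROM menu
ORDER BY
    CASE WHEN parentid = 0 THEN id ELSE parentid END,   -- FIRST case: pre-group children under their parent's id
    CASE WHEN parentid = 0 THEN 1 ELSE 2 END,           -- SECOND case: force the parent row to the top of its group
    id;
Running it against the six sample rows returns CatOne, SubCatOne, SubCatOne2, CatTwo, CatThree, SubCatThree, matching the desired output.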
In your question, you would obviously be able to add your sort order column to the basis of this query.