I have a problem that should be solved outside of SQL, but due to business constraints needs to be solved within SQL.
So please don't tell me to do this at data ingestion, outside of SQL; I want to, but it's not an option...
I have a stream of events, with 4 principal properties...
The source device
The event's timestamp
The event's "type"
The event's "payload" (a dreaded VARCHAR representing various data-types)
What I need to do is break the stream up into pieces (that I will refer to as "sessions").
Each session is specific to a device (effectively, PARTITION BY device_id)
No one session may contain more than one event of the same type
To shorten the examples, I'll limit them to include just the timestamp and the event_type...
 timestamp | event_type | desired_session_id
-----------+------------+--------------------
         0 |          1 |                  0
         1 |          4 |                  0
         2 |          2 |                  0
         3 |          3 |                  0
         4 |          2 |                  1
         5 |          1 |                  1
         6 |          3 |                  1
         7 |          4 |                  1
         8 |          4 |                  2
         9 |          4 |                  3
        10 |          1 |                  3
        11 |          1 |                  4
        12 |          2 |                  4
An idealised final output may be to pivot the final results...
device_id | session_id | event_type_1_timestamp | event_type_1_payload | event_type_2_timestamp | event_type_2_payload ...
(But that is not yet set in stone; I will need to "know" which events make up a session, what their timestamps are, and what their payloads are. It is possible that just appending the session_id column to the input is sufficient, as long as I don't "lose" the other properties.)
There are:
12 discrete event types
hundreds of thousands of devices
hundreds of thousands of events per device
a "norm" of around 6-8 events per "session"
but sometimes a session may have just 1 or all 12
These factors mean that half-cartesian products and the like are, umm, less than desirable, but possibly may be "the only way".
I've played (in my head) with analytic functions and gaps-and-islands type processes, but can never quite get there. I always fall back to a place where I "want" some flags that I can carry forward from row to row and reset them as needed...
Pseudo-code that doesn't work in SQL...
flags = [0,0,0,0,0,0,0,0,0,0,0,0]   # one flag per event type
session_id = 0
for each row in stream:
    if flags[row.event_type] == 0:
        flags[row.event_type] = 1
    else:
        session_id += 1
        flags = [0,0,0,0,0,0,0,0,0,0,0,0]
        flags[row.event_type] = 1
    row.session_id = session_id
Any SQL solution to that is appreciated, but you get "bonus points" if you can also take into account events "happening at the same time"...
If multiple events happen at the same timestamp
If ANY of those events are in the "current" session
ALL of those events go in to a new session
Else
ALL of those events go in to the "current" session
If such a group of events includes the same event type multiple times
Do whatever you like
I'll have had enough by that point...
But set the session as "ambiguous" or "corrupt" with some kind of flag?
I'm not 100% sure this can be done in SQL. But I have an idea for an algorithm that might work:
enumerate the counts for each event
take the maximum count up to each point as the "grouping" for the events (this is the session)
So:
select t.*,
(max(seqnum) over (partition by device order by timestamp) - 1) as desired_session_id
from (select t.*,
row_number() over (partition by device, event_type order by timestamp) as seqnum
from t
) t;
EDIT:
This is too long for a comment. I have a sense that this requires a recursive CTE (RBAR). This is because you cannot land at a single row and look at the cumulative information or neighboring information to determine if the row should start a new session.
Of course, there are some situations where it is obvious (say, the previous row has the same event). And, it is also possible that there is some clever method of aggregating the previous data that makes it possible.
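For what it's worth, here is a rough, Postgres-flavoured sketch of that recursive-CTE idea, carrying the "flags" forward as an integer bitmask. It assumes (my assumptions, not from the question) a table t(device_id, ts, event_type), timestamps unique per device, and event types numbered 1..12. It only illustrates the row-by-row semantics; it will not be quick on hundreds of thousands of events per device:
WITH RECURSIVE ordered AS (
    SELECT device_id, ts, event_type,
           ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY ts) AS rn
    FROM t
),
walk AS (
    -- the first event of each device starts session 0 with only its own flag set
    SELECT device_id, ts, event_type, rn,
           0 AS session_id,
           (1 << (event_type - 1)) AS flags
    FROM ordered
    WHERE rn = 1
    UNION ALL
    -- each later event either joins the current session or starts a new one
    SELECT o.device_id, o.ts, o.event_type, o.rn,
           CASE WHEN (w.flags & (1 << (o.event_type - 1))) <> 0
                THEN w.session_id + 1 ELSE w.session_id END,
           CASE WHEN (w.flags & (1 << (o.event_type - 1))) <> 0
                THEN (1 << (o.event_type - 1))               -- reset flags, keep current event
                ELSE w.flags | (1 << (o.event_type - 1)) END
    FROM walk w
    JOIN ordered o
      ON o.device_id = w.device_id
     AND o.rn = w.rn + 1
)
SELECT device_id, ts, event_type, session_id
FROM walk
ORDER BY device_id, ts;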
EDIT II:
I don't think this is possible without recursive CTEs (RBAR). This isn't quite a mathematical proof, but this is where my intuition comes from.
Imagine you are looking back 4 rows from the current and you have:
1
2
1
2
1 <-- current row
What is the session for this? It is not determinate. Consider:
e   s   vs   e   s
1   1        2   1   <-- row not in look back
1   2        1   1
2   2        2   2
1   3        1   2
2   3        2   3
1   4        1   3
The value depends on going further back. Obviously, this example can be extended all the way back to the first event. I don't think there is a way to "aggregate" the earlier values to distinguish between these two cases.
The problem is solvable if you can deterministically say that a given event is the start of a new session. That seems to require complete prior knowledge, at least in some cases. There are obviously cases where this is easy -- such as two events in a row. I suspect, though, that these are the "minority" of such sequences.
That said, you are not quite stuck with RBAR through the entire table, because you have device_id for parallelization. I'm not sure if your environment can do this, but in BQ or Postgres, I would:
Aggregate along each device to create an array of structs with the time and event information.
Loop through the arrays once, perhaps using custom code.
Reassign the sessions by joining back to the original table or unnesting the logic (a rough shape of this is sketched below).
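A rough Postgres-flavoured shape of that idea is below. label_sessions() is a hypothetical per-device function (PL/Python, PL/pgSQL, a BigQuery JS UDF, ...) that walks one device's events in order and returns one session_id per event as an array of the same length; the surrounding SQL only does the aggregation and the unnesting, and assumes a table tbl(device_id, ts, event_type):
WITH per_device AS (
    SELECT device_id,
           array_agg(ts ORDER BY ts)         AS ts_arr,
           array_agg(event_type ORDER BY ts) AS type_arr
    FROM tbl
    GROUP BY device_id
)
SELECT d.device_id,
       u.ts,
       u.event_type,
       u.session_id
FROM per_device d
CROSS JOIN LATERAL unnest(d.ts_arr, d.type_arr, label_sessions(d.type_arr))
     AS u(ts, event_type, session_id);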
UPD based on discussion (not checked/tested, rough idea):
WITH
trailing_events as (
select *, listagg(event_type::varchar,',') over (partition by device_id order by ts rows between 12 preceding and current row) as events
from tbl
)
,session_flags as (
select *, f_get_session_flag(events) as session_flag
from trailing_events
)
SELECT
*
,sum(session_flag::int) over (partition by device_id order by ts) as session_id
FROM session_flags
where f_get_session_flag is
create or replace function f_get_session_flag(arr varchar(max))
returns boolean
stable as $$
    stream = [int(x) for x in arr.split(',')]
    flags = [0] * 12
    is_new_session = False
    for event_type in stream:
        if flags[event_type - 1] == 0:
            flags[event_type - 1] = 1
            is_new_session = False
        else:
            flags = [0] * 12
            flags[event_type - 1] = 1
            is_new_session = True
    return is_new_session
$$ language plpythonu;
prev answer:
The flags could be replicated as the remainder of the running count of each event type divided by 2:
1 -> 1%2 = 1
2 -> 2%2 = 0
3 -> 3%2 = 1
4 -> 4%2 = 0
5 -> 5%2 = 1
6 -> 6%2 = 0
and concatenated into a bit mask (similar to the flags array in the pseudocode). The only tricky point is when exactly to reset all flags to zeros and initiate a new session ID, but I could get quite close. If your sample table is called t and it has ts and type columns, the script could look like this:
with
-- running count of the events
t1 as (
select
*
,sum(case when type=1 then 1 else 0 end) over (order by ts) as type_1_cnt
,sum(case when type=2 then 1 else 0 end) over (order by ts) as type_2_cnt
,sum(case when type=3 then 1 else 0 end) over (order by ts) as type_3_cnt
,sum(case when type=4 then 1 else 0 end) over (order by ts) as type_4_cnt
from t
)
-- mask
,t2 as (
select
*
,case when type_1_cnt%2=0 then '0' else '1' end ||
case when type_2_cnt%2=0 then '0' else '1' end ||
case when type_3_cnt%2=0 then '0' else '1' end ||
case when type_4_cnt%2=0 then '0' else '1' end as flags
from t1
)
-- previous row's mask
,t3 as (
select
*
,lag(flags) over (order by ts) as flags_prev
from t2
)
-- reset the mask if there is a switch from 1 to 0 at any position
,t4 as (
select *
,case
when (substring(flags from 1 for 1)='0' and substring(flags_prev from 1 for 1)='1')
or (substring(flags from 2 for 1)='0' and substring(flags_prev from 2 for 1)='1')
or (substring(flags from 3 for 1)='0' and substring(flags_prev from 3 for 1)='1')
or (substring(flags from 4 for 1)='0' and substring(flags_prev from 4 for 1)='1')
then '0000'
else flags
end as flags_override
from t3
)
-- get the previous value of the reset mask and same event type flag for corner case
,t5 as (
select *
,lag(flags_override) over (order by ts) as flags_override_prev
,type=lag(type) over (order by ts) as same_event_type
from t4
)
-- again, session ID is a switch from 1 to 0 OR same event type (that can be a switch from 0 to 1)
select
ts
,type
,sum(case
when (substring(flags_override from 1 for 1)='0' and substring(flags_override_prev from 1 for 1)='1')
or (substring(flags_override from 2 for 1)='0' and substring(flags_override_prev from 2 for 1)='1')
or (substring(flags_override from 3 for 1)='0' and substring(flags_override_prev from 3 for 1)='1')
or (substring(flags_override from 4 for 1)='0' and substring(flags_override_prev from 4 for 1)='1')
or same_event_type
then 1
else 0 end
) over (order by ts) as session_id
from t5
order by ts
;
You can add the necessary partitions and extend this to 12 event types; this code is intended to work on the sample table that you provided. It's not perfect: if you run the subqueries you'll see that the flags are reset more often than needed, but overall it works, except for the corner case of session ID 2, where a single event of type=4 follows the end of another session that also ended with type=4. So I have added a simple lookup in same_event_type and used it as another condition for a new session ID; hope this will work on a bigger dataset.
The solution I decided to live with is effectively "don't do it in SQL" by deferring the actual sessionising to a scalar function written in python.
--
-- The input parameter should be a comma delimited list of identifiers.
-- Each identifier should be a "power of 2" value, no lower than 1
-- (1, 2, 4, 8, 16, 32, 64, 128, etc, etc).
--
-- The input '1,2,4,2,1,1,4' will give the output '0001010'
--
CREATE OR REPLACE FUNCTION public.f_indentify_collision_indexes(arr varchar(max))
RETURNS VARCHAR(MAX)
STABLE AS
$$
    stream = map(int, arr.split(','))
    state = 0              # bitmask of the event types seen in the current session
    collisions = []
    for item in stream:
        if (state & item) == item:
            # this event type is already in the current session:
            # flag a new session and reset the state to just the current event
            collisions.append('1')
            state = item
        else:
            state |= item
            collisions.append('0')
    return ''.join(collisions)
$$
LANGUAGE plpythonu;
NOTE : I wouldn't use this if there are hundreds of event types ;)
Effectively I pass in a data structure of events in sequence, and the return is a data structure of where the new sessions start.
I chose the actual data structures to make the SQL side of things as simple as I could. (Might not be the best, very open to other ideas.)
INSERT INTO
sessionised_event_stream
SELECT
device_id,
REGEXP_COUNT(
LEFT(
public.f_indentify_collision_indexes(
LISTAGG(event_type_id, ',')
WITHIN GROUP (ORDER BY session_event_sequence_id)
OVER (PARTITION BY device_id)
),
session_event_sequence_id::INT
),
'1',
1
) + 1
AS session_login_attempt_id,
session_event_sequence_id,
event_timestamp,
event_type_id,
event_data
FROM
(
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY device_id
ORDER BY event_timestamp, event_type_id, event_data)
AS session_event_sequence_id
FROM
event_stream
    ) AS ordered_events
Assert a deterministic order to the events (in case of events happening at the same time, etc.)
ROW_NUMBER() OVER (stuff) AS session_event_sequence_id
Create a comma delimited list of event_type_id's
LISTAGG(event_type_id, ',') => '1,2,4,8,2,1,4,1,4,4,1,1'
Use python to work out the boundaries
public.f_indentify_collision_indexes('1,2,4,8,2,1,4,1,4,4,1,1') => '000010010101'
For the first event in the sequence, count the number of 1's up to and including the first character in the 'boundaries'. For the second event in the sequence, count the number of 1's up to and including the second character in the boundaries, etc, etc.
event 01 = 1 => boundaries = '0' => session_id = 0
event 02 = 2 => boundaries = '00' => session_id = 0
event 03 = 4 => boundaries = '000' => session_id = 0
event 04 = 8 => boundaries = '0000' => session_id = 0
event 05 = 2 => boundaries = '00001' => session_id = 1
event 06 = 1 => boundaries = '000010' => session_id = 1
event 07 = 4 => boundaries = '0000100' => session_id = 1
event 08 = 1 => boundaries = '00001001' => session_id = 2
event 09 = 4 => boundaries = '000010010' => session_id = 2
event 10 = 4 => boundaries = '0000100101' => session_id = 3
event 11 = 1 => boundaries = '00001001010' => session_id = 3
event 12 = 1 => boundaries = '000010010101' => session_id = 4
REGEXP_COUNT( LEFT('000010010101', session_event_sequence_id), '1', 1 )
The result is something that's not very speedy, but robust and still better than other options I've tried. What it "feels like" is that (perhaps, maybe, I'm not sure, caveat, caveat) if there are 100 items in a stream then LIST_AGG() is called once and the python UDF is called 100 times. I might be wrong. I've seen Redshift do worse things ;)
Pseudo code for what turns out to be a worse option.
Write some SQL that can find "the next session" from any given stream (a rough sketch of this step is shown after this list).
Run that SQL once storing the results in a temp table.
=> Now have the first session from every stream
Run it again using the temp table as an input
=> We now also have the second session from every stream
Keep repeating this until the SQL inserts 0 rows in to the temp table
=> We now have all the sessions from every stream
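A hedged sketch of what the "find the first session from every stream" query might look like (assuming the same event_stream layout as above, and ignoring the simultaneous-timestamp wrinkle): the first session of each device is everything strictly before the first repeat of any event type. Each pass would copy those rows to the temp table, remove them from the working set, and repeat.
SELECT e.*
FROM event_stream e
LEFT JOIN (
    SELECT device_id,
           MIN(event_timestamp) AS first_repeat_ts
    FROM (
        SELECT device_id, event_timestamp,
               ROW_NUMBER() OVER (PARTITION BY device_id, event_type_id
                                  ORDER BY event_timestamp) AS occurrence
        FROM event_stream
    ) x
    WHERE x.occurrence = 2
    GROUP BY device_id
) r ON r.device_id = e.device_id
WHERE r.first_repeat_ts IS NULL
   OR e.event_timestamp < r.first_repeat_ts;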
The time taken to calculate each session was relatively low, and was actually dominated by the overhead of making repeated requests to Redshift. It also meant that the dominant factor was "how many sessions are in the longest stream" (in my case, 0.0000001% of the streams were 1000x longer than the average).
The python version is actually slower in most individual cases, but is not dominated by those annoying outliers. This meant that overall the python version completed about 10x sooner than the "external loop" version described here. It also used a bucket load more CPU resources in total, but elapsed time is the more important factor right now :)
Related
I am reading data using Modbus. The data contains the status of 250 registers in a PLC, as either off or on, with the time of reading as the timestamp. The raw data received is stored in a table as below, where the column Register represents the register read and the column Value represents the status of the register as 0 or 1, with a timestamp. In the sample I am showing data for just one register (i.e. 250). SlaveID represents the PLC from which the data was obtained.
I need to populate one more table, Table_signal_on_log, from the raw data table. This table should contain the time at which the value changed to 1 as the start time and the time at which it changed back to 0 as the end time. This table is also given below.
I am able to do it with a cursor, but it is slow, and if the number of signals increases it could slow down the processing. How can I do this without a cursor? I tried to do it with set-based operations but couldn't get one working. I need to avoid repeated values, i.e. after recording 13:30:30 as the time at which the signal becomes 1, I have to ignore all entries until it becomes 0 and record that as the end time, then again ignore all values until it becomes 1. This process is done once every 20 seconds (it can be done at any interval, but presently 20), so I may have around 500 rows to loop through every time. This will increase as the number of PLCs connected grows, and the cursor operation is bound to become an issue.
Raw data table
SlaveID Register Value Timestamp ProcessTime
-------------------------------------------------------
3 250 0 13:30:10 NULL
3 250 0 13:30:20 NULL
3 250 1 13:30:30 NULL
3 250 1 13:30:40 NULL
3 250 1 13:30:50 NULL
3 250 1 13:31:00 NULL
3 250 0 13:31:10 NULL
3 250 0 13:31:20 NULL
3 250 0 13:32:30 NULL
3 250 0 13:32:40 NULL
3 250 1 13:32:50 NULL
Table_signal_on_log
SlaveID Register StartTime Endtime
3 250 13:30:30 13:31:10
3 250 13:32:50 NULL //value is still 1
This is a classic gaps-and-islands problem; there are a number of solutions. Here is one:
Get the previous Value for each row using LAG
Filter so we only have rows where the previous Value is different or non-existent, in other words the beginning of an "island" of rows.
Of those rows, get the next Timestamp for each row using LEAD.
Filter so we only have Value = 1.
WITH cte1 AS (
SELECT *,
PrevValue = LAG(t.Value) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
FROM YourTable t
),
cte2 AS (
SELECT *,
NextTime = LEAD(t.Timestamp) OVER (PARTITION BY t.SlaveID, t.Register ORDER BY t.Timestamp)
FROM cte1 t
WHERE (t.Value <> t.PrevValue OR t.PrevValue IS NULL)
)
SELECT
t.SlaveID,
t.Register,
StartTime = t.Timestamp,
Endtime = t.NextTime
FROM cte2 t
WHERE t.Value = 1;
Is there a way to combine these two queries into a single query - just one trip to the database?
Both queries hit the same table, but the first is looking for Total Active Circuits, while the second is looking for Total Circuits.
I am hoping to display results like this...
4/15, 12/34, 2/21 (where the first number is ActiveCircuits and the second number is TotalCircuits)
SELECT COUNT(CircuitID) AS ActiveCircuits
FROM Circuit
WHERE StateID = 5
AND Active = 1
SELECT COUNT(CircuitID) AS TotalCircuits
FROM Circuit
WHERE StateID = 5
Use conditional aggregation:
SELECT COUNT(*) AS TotalCircuits,
SUM(CASE WHEN Active = 1 THEN 1 ELSE 0 END) as ActiveCircuits
FROM Circuit
WHERE StateID = 5;
This assumes that CircuitId is never NULL, which seems quite reasonable in a table called Circuit.
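If the "4/15"-style display is wanted straight from SQL, the same conditional aggregation can be concatenated into a string, and grouped by StateID if several states are needed at once. The exact string-concatenation syntax varies by engine, so this is only a sketch (T-SQL style shown):
SELECT StateID,
       CAST(SUM(CASE WHEN Active = 1 THEN 1 ELSE 0 END) AS VARCHAR(10))
           + '/' + CAST(COUNT(*) AS VARCHAR(10)) AS ActiveOverTotal
FROM Circuit
GROUP BY StateID;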
You can use CASE WHEN to produce a 1 wherever the circuit is active and then take the SUM to get the total number of 1's, i.e. ActiveCircuits.
SELECT COUNT(CIRCUITID) AS TOTALCIRCUITS,
SUM(CASE WHEN ACTIVE = 1 THEN 1 ELSE 0 END) AS ACTIVECIRCUITS
FROM CIRCUIT
WHERE STATEID = 5
Consider the following table:
id gap groupID
0 0 1
2 3 1
3 7 2
4 1 2
5 5 2
6 7 3
7 3 3
8 8 4
9 2 4
Where groupID is the desired, computed column, such that its value is incremented whenever the gap column is greater than a threshold (in this case 6). The id column defines the sequential order of appearance of the rows (and it's already given).
Can you please help me figure out how to dynamically fill out the appropriate values for groupID?
I have looked in several other entries here in StackOverflow, and I've seen the usage of sum as an aggregate for a window function. I can't use sum because it's not supported in MonetDB window functions (only rank, dense_rank, and row_num). I can't use triggers (to modify the record insertion before it takes place) either because I need to keep the data mentioned above within a stored function in a local temporary table -- and trigger declarations are not supported in MonetDB function definitions.
I have also tried filling out the groupID column value by reading the previous table (id and gap) into another temporary table (id, gap, groupID), with the hope that this would force a row-by-row operation. But this has failed as well because it gives the groupID 0 to all records:
declare threshold int;
set threshold = 6;
insert into newTable( id, gap, groupID )
select A.id, A.gap,
case when A.gap > threshold then
(select case when max(groupID) is null then 0 else max(groupID)+1 end from newTable)
else
(select case when max(groupID) is null then 0 else max(groupID) end from newTable)
end
from A
order by A.id asc;
Any help, tip, or reference is greatly appreciated. It's been a long time already trying to figure this out.
BTW: Cursors are not supported in MonetDB either --
You can assign the group using a correlated subquery. Simply count the number of previous values that exceed 6:
select id, gap,
(select 1 + count(*)
from t as t2
where t2.id <= t.id and t2.gap > 6
) as Groupid
from t;
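For reference, on engines that do support SUM() as a window function (which MonetDB does not, per the question), the same idea is usually written as a running sum of "new group" flags:
select id, gap,
       1 + sum(case when gap > 6 then 1 else 0 end) over (order by id) as groupID
from t;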
EDIT after #NealB's solution: #NealB's solution is very, very fast compared with any other, and makes this new question about "adding a constraint to improve performance" unnecessary. #NealB's solution needs no improvement; it runs in O(n) time and is very simple.
The problem of "label transitive groups with SQL" has an elegant solution using recursion and CTEs... But this solution takes exponential time (!). I need to work with 10000 items: with 1000 items it needs 1 second, with 2000 it needs 1 day...
Constraint: in my case it is possible to break the problem into pieces of ~100 items or less, but only to select one group of ~10 items and discard all the other ~90 labelled items...
Is there a generic algorithm to add and use this kind of "pre-selection" to reduce the quadratic, O(N^2), time? Perhaps, as suggested by the comments and #wildplasser, an O(N log(N)) time; but I expect, with "pre-selection", to reduce it to O(N) time.
(EDIT)
I tried to use an alternative algorithm, but it needs some improvement to be used as a solution here; or, to really increase performance (to O(N) time), it needs to use the "pre-selection".
The "pre-selection" (constraint) is based on a "super-set grouping"... Starting from the original "How to label 'transitive groups' with SQL?" question's t1 table,
table T1
(original T1 augmented by the "super-set grouping label" ssg, and one more row)
ID1 | ID2 | ssg
1 | 2 | 1
1 | 5 | 1
4 | 7 | 1
7 | 8 | 1
9 | 1 | 1
10 | 11 | 2
So there are three groups,
g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1"
g2: {4,7,8} because "4 t 7" and "7 t 8"
g3: {10,11} because "10 t 11"
The super-group is only an auxiliary grouping,
ssg1: {g1,g2}
ssg2: {g3}
If we have M super-group items and N total T1 items, the average group length will be less than N/M. We can also suppose (for my typical problem) that the maximum ssg length is ~N/M.
So, the "label algorithm" needs to run only M times, with ~N/M items each, if it uses the ssg constraint.
An SQL-only solution appears to be a bit of a problem here. With the help of some procedural programming on top of SQL, the solution appears to be fairly simple and efficient. Here is a brief outline of a solution as it could be implemented using any procedural language invoking SQL.
Declare table R with primary key ID, where ID corresponds to the same domain as ID1 and ID2 of table T1. Table R contains one other non-key column, a Label number.
Populate table R with the range of values found in T1. Set Label to zero (no label).
Using your example data, the initial setup for R would look like:
Table R
ID Label
== =====
1 0
2 0
4 0
5 0
7 0
8 0
9 0
Using a host language cursor plus an auxiliary counter, read each row from T1. Look up ID1 and ID2 in R. You will find one of four cases:
Case 1: ID1.Label == 0 and ID2.Label == 0
In this case neither one of these IDs have been "seen" before: Add 1 to the counter and then update both
rows of R to the value of the counter: update R set R.Label = :counter where R.ID in (:ID1, :ID2)
Case 2: ID1.Label == 0 and ID2.Label <> 0
In this case, ID1 is new but ID2 has already been assigned a label. ID1 needs to be assigned to the
same label as ID2: update R set R.Label = :ID2.Label where R.ID = :ID1
Case 3: ID1.Label <> 0 and ID2.Label == 0
In this case, ID2 is new but ID1 has already been assigned a label. ID2 needs to be assigned to the
same label as ID1: update R set R.Label = :ID1.Label where R.ID = :ID2
Case 4: ID1.Label <> 0 and ID2.Label <> 0
In this case, the row contains redundant information. Both rows of R should contain the same Label value. If not,
there is some sort of data integrity problem. Ahhhh... not quite see edit...
EDIT I just realized that there are situations where both Label values here could be non-zero and different. If both are non-zero and different then two Label groups need to be merged at this point. All you need to do is choose one Label and update the others to match with something like: update R set R.Label = :ID1.Label where R.Label = :ID2.Label. Now both groups have been merged with the same Label value.
Upon completion of the cursor, table R will contain Label values needed to update T2.
Table R
ID Label
== =====
1 1
2 1
4 2
5 1
7 2
8 2
9 1
Process table T2
using something along the lines of: set T2.Label to R.Label where T2.ID1 = R.ID. The end result should be:
table T2
ID1 | ID2 | LABEL
1 | 2 | 1
1 | 5 | 1
4 | 7 | 2
7 | 8 | 2
9 | 1 | 1
This process is purely iterative and should scale to fairly large tables without difficulty.
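For a concrete (untested) PL/pgSQL sketch of the above, assuming T1 has columns id1 and id2 and that a working table r(id, label) is acceptable:
CREATE TABLE r (id integer PRIMARY KEY, label integer NOT NULL DEFAULT 0);

INSERT INTO r (id)
SELECT DISTINCT id
FROM (SELECT id1 AS id FROM t1
      UNION
      SELECT id2 FROM t1) s;

DO $$
DECLARE
    rec      t1%ROWTYPE;
    l1       integer;
    l2       integer;
    counter  integer := 0;
BEGIN
    FOR rec IN SELECT * FROM t1 LOOP
        SELECT label INTO l1 FROM r WHERE id = rec.id1;
        SELECT label INTO l2 FROM r WHERE id = rec.id2;

        IF l1 = 0 AND l2 = 0 THEN       -- case 1: neither ID seen before
            counter := counter + 1;
            UPDATE r SET label = counter WHERE id IN (rec.id1, rec.id2);
        ELSIF l1 = 0 THEN               -- case 2: ID1 joins ID2's group
            UPDATE r SET label = l2 WHERE id = rec.id1;
        ELSIF l2 = 0 THEN               -- case 3: ID2 joins ID1's group
            UPDATE r SET label = l1 WHERE id = rec.id2;
        ELSIF l1 <> l2 THEN             -- case 4 (edit): merge the two groups
            UPDATE r SET label = l1 WHERE label = l2;
        END IF;
    END LOOP;
END $$;
Applying the labels back to T2 is then a single joined update, e.g. UPDATE t2 SET label = r.label FROM r WHERE t2.id1 = r.id.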
I suggest you check this and use some
general-purpose language for solving it.
http://en.wikipedia.org/wiki/Disjoint-set_data_structure
Traverse the graph, maybe run DFS or BFS from each node,
then use this disjoint set hint. I think this should work.
The #NealB solution is the fastest (!). See an example of a PostgreSQL implementation here.
Below is an example of another "brute force" algorithm, only for curiosity!
As #peter.petrov and #RBarryYoung suggested, some performance problems can be avoided by abandoning the CTE recursion... I fixed some issues in the basic labeller and, above that, added the constraint for grouping by a super-set label. This new transgroup1_loop() function is working!
PS: this solution still has performance limitations; please post your answer with a better one, or with some adaptation of this one.
-- DROP table transgroup1;
CREATE TABLE transgroup1 (
id serial NOT NULL PRIMARY KEY,
items integer[], -- two or more items in the transitive relationship
  ssg_label varchar(12), -- the super-set grouping label
dels integer[] DEFAULT array[]::integer[]
);
INSERT INTO transgroup1(items,ssg_label) values
(array[1, 2],'1'),
(array[1, 5],'1'),
(array[4, 7],'1'),
(array[7, 8],'1'),
(array[9, 1],'1'),
(array[10, 11],'2');
-- or SELECT array[id1, id2],ssg_label FROM t1, with 10000 items
Then, with these two functions, we can solve the problem:
CREATE FUNCTION transgroup1_loop(p_ssg varchar, p_max_i integer DEFAULT 100)
RETURNS integer AS $funcBody$
DECLARE
cp_dels integer[];
i integer;
BEGIN
i:=1;
LOOP
UPDATE transgroup1
SET items = array_uunion(transgroup1.items,t2.items),
dels = transgroup1.dels || t2.id
FROM transgroup1 AS t1, transgroup1 AS t2
WHERE transgroup1.id=t1.id AND t1.ssg_label=$1 AND
t1.id>t2.id AND t1.items && t2.items;
cp_dels := array(
SELECT DISTINCT unnest(dels) FROM transgroup1
); -- ensures all items to delete
RAISE NOTICE '-- bug, repeating dels, item-%; % dels! %', i, array_length(cp_dels,1), array_to_string(cp_dels,';','*');
EXIT WHEN i>p_max_i OR array_length(cp_dels,1)=0;
DELETE FROM transgroup1
WHERE ssg_label=$1 AND id IN (SELECT unnest(cp_dels));
UPDATE transgroup1 SET dels=array[]::integer[];
i:=i+1;
END LOOP;
UPDATE transgroup1 -- only to beautify
SET items = ARRAY(SELECT unnest(items) ORDER BY 1 desc);
RETURN i;
END;
$funcBody$ LANGUAGE plpgsql VOLATILE;
to run and see results, you can use
SELECT transgroup1_loop('1'); -- run with ssg-1 items only
SELECT transgroup1_loop('2'); -- run with ssg-2 items only
-- show all with a sequential group label:
SELECT *, dense_rank() over (ORDER BY id) AS group_label from transgroup1;
results:
id | items | ssg_label | dels | group_label
----+-----------+-----------+------+-------------
4 | {8,7,4} | 1 | {} | 1
5 | {9,5,2,1} | 1 | {} | 2
6 | {11,10} | 2 | {} | 3
PS: the function array_uunion() is the same as in the original,
CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
-- ensures distinct items of a concatenation
SELECT ARRAY(SELECT unnest($1) UNION SELECT unnest($2))
$$ LANGUAGE sql immutable;
I'm having to return ~70,000 rows of 4 columns of INTs in a specific order and can only use very shallow caching, as the data involved is highly volatile and has to be up to date. One property of the data is that it is often highly repetitive when it is in order.
I've started to look at various methods of reducing the row count in order to reduce network bandwidth and client-side processing time/resources, but have not managed to find any kind of technique in T-SQL where I can 'compress' repetitive rows down into a single row and a 'count' column. e.g.
prop1 prop2 prop3 prop4
--------------------------------
0 0 1 53
0 0 2 55
1 1 1 8
1 1 1 8
1 1 1 8
1 1 1 8
0 0 2 55
0 0 2 55
0 0 1 53
Into:
prop1 prop2 prop3 prop4 count
-----------------------------------------
0 0 1 53 1
0 0 2 55 1
1 1 1 8 4
0 0 2 55 2
0 0 1 53 1
I'd estimate that if this was possible, in many cases what would be a 70,000 row result set would be down to a few thousand at most.
Am I barking up the wrong tree here (is there implicit compression as part of the SQL Server protocol)?
Is there a way to do this (SQL Server 2005)?
Is there a reason I shouldn't do this?
Thanks.
You can use the COUNT function! This will require you to use the GROUP BY clause, where you tell COUNT how to break up, or group, itself. GROUP BY is used with any aggregate function in SQL.
select
prop1,
prop2,
prop3,
prop4,
count(*) as count
from
tbl
group by
prop1,
prop2,
prop3,
prop4,
y,
x
order by y, x
Update: The OP mentioned these are ordered by y and x, not part of the result set. In this case, you can still use y and x as part of the group by.
Keep in mind that order means nothing if it doesn't have ordering columns, so in this case, we have to respect that with y and x in the group by.
This will work, though it is painful to look at:
;WITH Ordering
AS
(
SELECT Prop1,
Prop2,
Prop3,
Prop4,
ROW_NUMBER() OVER (ORDER BY Y, X) RN
FROM Props
)
SELECT
CurrentRow.Prop1,
CurrentRow.Prop2,
CurrentRow.Prop3,
CurrentRow.Prop4,
CurrentRow.RN -
ISNULL((SELECT TOP 1 RN FROM Ordering O3 WHERE RN < CurrentRow.RN AND (CurrentRow.Prop1 <> O3.Prop1 OR CurrentRow.Prop2 <> O3.Prop2 OR CurrentRow.Prop3 <> O3.Prop3 OR CurrentRow.Prop4 <> O3.Prop4) ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL OR (CurrentRow.Prop1 <> O2.Prop1 OR CurrentRow.Prop2 <> O2.Prop2 OR CurrentRow.Prop3 <> O2.Prop3 OR CurrentRow.Prop4 <> O2.Prop4)
ORDER BY CurrentRow.RN
The gist is the following:
Enumerate each row using ROW_NUMBER OVER to get the correct order.
Find the maximums per cycle by joining only when the next row has different fields or when the next row does not exist.
Figure out the count of repetitions by taking the current row number (presumed to be the max for this cycle) and subtracting from it the maximum row number of the previous cycle, if it exists. (An alternative formulation using the usual row-number-difference trick is sketched below.)
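For comparison, the same run-length collapse can be written with the row-number-difference (gaps-and-islands) trick: the difference between a global row number and a per-value row number is constant within each run of identical rows, so grouping on that difference collapses consecutive duplicates. This uses the same Props table and Y, X ordering columns assumed in the answer above:
;WITH Ordered AS
(
    SELECT Prop1, Prop2, Prop3, Prop4,
           ROW_NUMBER() OVER (ORDER BY Y, X) AS rn,
           ROW_NUMBER() OVER (PARTITION BY Prop1, Prop2, Prop3, Prop4 ORDER BY Y, X) AS rn_grp
    FROM Props
)
SELECT Prop1, Prop2, Prop3, Prop4,
       COUNT(*) AS [count]
FROM Ordered
GROUP BY Prop1, Prop2, Prop3, Prop4, rn - rn_grp
ORDER BY MIN(rn);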
70,000 rows of four integer columns is not really a worry for bandwidth on a modern LAN, unless you have many workstations executing this query concurrently; and on a WAN with more restricted bandwidth you could use DISTINCT to eliminate duplicate rows, an approach which would be frugal with your bandwidth but consume some server CPU. Again, however, unless you have a really overloaded server that is always performing at or near peak loads, this additional consumption would be a mere blip. 70,000 rows is next to nothing.