MonetDB: Enumerate groups of rows based on a given "boundary" condition

Consider the following table:
id  gap  groupID
 0    0        1
 2    3        1
 3    7        2
 4    1        2
 5    5        2
 6    7        3
 7    3        3
 8    8        4
 9    2        4
Where groupID is the desired, computed column, such that its value is incremented whenever the gap column is greater than a threshold (in this case 6). The id column defines the sequential order of appearance of the rows (and it is already given).
Can you please help me figure out how to dynamically fill out the appropriate values for groupID?
I have looked at several other entries here on Stack Overflow, and I've seen sum used as an aggregate in a window function. I can't use sum because it's not supported in MonetDB window functions (only rank, dense_rank, and row_number). I can't use triggers (to modify the record insertion before it takes place) either, because I need to keep the data mentioned above within a stored function in a local temporary table -- and trigger declarations are not supported in MonetDB function definitions.
I have also tried filling out the groupID column value by reading the previous table (id and gap) into another temporary table (id, gap, groupID), with the hope that this would force a row-by-row operation. But this has failed as well because it gives the groupID 0 to all records:
declare threshold int;
set threshold = 6;

insert into newTable( id, gap, groupID )
select A.id, A.gap,
       case when A.gap > threshold then
            (select case when max(groupID) is null then 0 else max(groupID)+1 end from newTable)
       else
            (select case when max(groupID) is null then 0 else max(groupID) end from newTable)
       end
from A
order by A.id asc;
Any help, tip, or reference is greatly appreciated. It's been a long time already trying to figure this out.
BTW: Cursors are not supported in MonetDB either --

You can assign the group using a correlated subquery. Simply count how many rows up to and including the current one have a gap that exceeds 6:
select id, gap,
       (select 1 + count(*)
        from t as t2
        where t2.id <= t.id and t2.gap > 6
       ) as groupID
from t;
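If you need this inside the INSERT from the question, with the threshold kept as a variable, the same idea should carry over directly. A sketch only (reusing the A, newTable, and threshold names from the question; not tested in MonetDB):
declare threshold int;
set threshold = 6;

insert into newTable( id, gap, groupID )
select A.id, A.gap,
       (select 1 + count(*)
        from A as A2
        where A2.id <= A.id and A2.gap > threshold)
from A
order by A.id asc;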

Related

Misleading count of 1 on JOIN in Postgres 11.7

I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10 AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS analytics.measurement_table CASCADE;
CREATE TABLE IF NOT EXISTS analytics.measurement_table (
hour smallint NOT NULL DEFAULT NULL,
measurement smallint NOT NULL DEFAULT NULL
);
INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
( 1, 1), ( 1, 1),
(10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour  Count  Sum
   0      1    1
   1      2    2
   2      0    0
   3      0    0
   4      0    0
   5      0    0
   6      0    0
   7      0    0
   8      0    0
   9      0    0
  10      3   10
  11      0    0
  12      0    0
This works correctly:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(measurement_table.hour) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's for the hours with no match:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(*) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
hour  frequency  total
   0          1      1
   1          2      2
   2          1      0
   3          1      0
   4          1      0
   5          1      0
   6          1      0
   7          1      0
   8          1      0
   9          1      0
  10          3     10
  11          1      0
  12          1      0
The only difference between these two examples is the count term:
count(*) -- a result of 1 on no match, and a correct count otherwise.
count(joined-table field) -- 0 on no match, a correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1, since the series row itself still counts once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as its argument, count() takes into account any non-null value and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups that do have matches.
Note that this behavior is not Postgres-specific; it is part of the ANSI SQL standard (all databases that I know of conform to it).
Bottom line:
in general cases, use count(*); this is more efficient, since the database does not need to check for nulls (and it makes clear to the reader of the query that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression in the count()
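A minimal illustration of the difference, using made-up data rather than the table from the post: the grid row for hour 1 has no match, yet count(*) still reports 1 while count(m.hour) reports 0.
SELECT g.hour,
       count(*)      AS count_star,    -- never 0: the group always contains the grid row itself
       count(m.hour) AS count_matches  -- 0 when the LEFT JOIN found no data row
FROM generate_series(0, 1) AS g(hour)
LEFT JOIN (VALUES (0, 1)) AS m(hour, measurement) ON m.hour = g.hour
GROUP BY g.hour
ORDER BY g.hour;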

MS SQL Server: Operate current select values in other selected columns

I want to take the value of a computed column and use it to compute the next column. For example:
SELECT CASE
           WHEN ID < 4 THEN ID
           ELSE 10
       END AS MY_ID,
       MY_ID + 5 AS EXTRA_ID
FROM FOO
That would output for IDs 1,2,3,4,5:
MY_ID EXTRA_ID
1 6
2 7
3 8
10 15
10 15
If I do MY_ID + 5 it will complain about MY_ID not existing (it's an alias, so that makes sense), and ID + 5 will compute 1+5, 2+5, 3+5, 4+5, 5+5 instead of 1+5, 2+5, 3+5, 4+10, 5+10 when it goes through the ELSE. Is it even possible to do this? I'm doing it in SSRS - Report Builder, and need to operate on a result that might be set to a default value depending on the CASE clause.
You can repeat the same CASE expression, with + 5 added at the end, for the EXTRA_ID column:
SELECT CASE
           WHEN ID < 4 THEN ID
           ELSE 10
       END AS MY_ID,
       CASE
           WHEN ID < 4 THEN ID
           ELSE 10
       END + 5 AS EXTRA_ID
FROM FOO
An alternative is to create the extra_id column value inside SSRS using an expression
= Fields!my_id.value + 5
You cannot reuse the calculation at the same level. Using MY_ID in the WHERE clause would fail as well. Either you have to calculate it multiple times, place another SELECT around your statement, or use a WITH statement (CTE).
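For the CTE option mentioned above, a sketch along these lines (same FOO table and column names as in the question) should do it:
WITH base AS (
    SELECT CASE
               WHEN ID < 4 THEN ID
               ELSE 10
           END AS MY_ID
    FROM FOO
)
SELECT MY_ID,
       MY_ID + 5 AS EXTRA_ID
FROM base;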
Simply wrap it with another select:
SELECT t.*,
       t.my_id + 5 AS extra_id
FROM (Your Query) t
Derived columns are not available in the same layer in which they are created. By wrapping the query in another SELECT, you make them available (that's because the inner query is processed before the outer one).
You just need a subquery to create MY_ID before doing anything with it. By creating MY_ID in the inner query, the outer query can use it to define new fields.
SELECT a.MY_ID,
       a.MY_ID + 5 AS EXTRA_ID
FROM (SELECT CASE
                 WHEN ID < 4 THEN ID
                 ELSE 10
             END AS MY_ID
      FROM FOO) AS a

Oracle SQL statement to update column values based on specific condition

I have a table with three columns: PID, LOCID, ISMGR. In the existing scenario, a person is set to ISMGR=true for certain locations, based on the location ID.
But as per the new requirement, we have to set ISMGR=true for every row of any person who has at least one row with ISMGR=true (meaning: if he is a manager for any one location, he should be the manager for all locations).
Table Data before running the script:
PID | LOCID | ISMGR
  1 |     1 |     1
  1 |     2 |     0
  1 |     3 |     0
  2 |     1 |     0
  2 |     2 |     1
Table Data after running the script:
PID | LOCID | ISMGR
  1 |     1 |     1
  1 |     2 |     1
  1 |     3 |     1
  2 |     1 |     1
  2 |     2 |     1
Any help will be highly appreciated..
Thanks in advance.
I would be inclined to write this using exists:
update t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from t t2 where t2.pid = t.pid and t2.ismgr = 1);
exists should be more efficient than doing a subquery with an aggregation.
This will work best with indexes on t(pid, ismgr) and t(ismgr).
This is not an answer but a test of the two solutions offered so far - I will call them the "EXISTS" and the "AGGREGATE" solutions or approaches.
Details of the tests are below, but here are two overall conclusions:
Both approaches have comparable execution times; on average the AGGREGATE approach worked a little faster than the EXISTS approach, but by a very small margin (smaller than the differences between running times from one trial to the next). Without indexes on any columns, the run times were (first number for the EXISTS approach, second for AGGREGATE):
Trial 1: 8.19s 8.08s
Trial 2: 8.98s 8.22s
Trial 3: 9.46s 9.55s
Note - Estimated optimizer costs should be used only to compare different execution plans for the same statement, not for different solutions using different approaches. Even so, someone will inevitably ask; so: for the EXISTS approach the lowest cost the Optimizer found was 4766; for AGGREGATE, 2665. Again, though, this is completely meaningless.
If a lot of rows need to be updated, indexes will hurt performance much more than they help it. Indeed, when rows are updated, the indexes must be updated as well. If only a small number of rows must be updated, then the indexes will help, because most of the time is spent finding the rows that must be updated and only little time is spent in the updates themselves. In my example almost 25% of rows had to be updated... so the AGGREGATE solution took 51.2 seconds and the EXISTS solution took 59.3 seconds! RECOMMENDATION: If you expect that a large number of rows may need to be updated, and you already have indexes on the table, you may be better off DROPPING them and re-creating them after the updates! Or, perhaps there are other solutions to this problem; I am not an expert (keep that in mind!)
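For reference, dropping and re-creating would look roughly like this (a sketch using the index names from the test setup further down):
drop index pid_ismgr_idx;
drop index ismgr_ids;
-- run the UPDATE (either the EXISTS or the AGGREGATE version) here
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);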
To test properly, after I created the test table and committed, I ran each solution by itself, then I rolled back and, logged in as SYS (in a different session), I ran alter system flush buffer_cache to make sure performance is not randomly helped by cache hits or hurt by misses. In all cases everything is done from disk storage.
I created a table with id's from 1 to 1.2 million and a random integer between 1 and 3, with probabilities 40%, 40% and 20% respectively (see the use of dbms_random below). Then from this prep data I created the test table: each pid was included one, two or three times based on this random integer; and a random 0 or 1 was added as ismgr (with 50-50 probability) in each row. I also added a random integer between 1 and 4 as locid just to simulate the actual data; I didn't worry about duplicate locid since that column plays no role in the problem.
Of the 1.2 million pids, approximately 480,000 (40%) appear just once in the test table, another ~480,000 appear twice and ~240,000 three times. Total rows should be about 2,160,000. That's the cardinality of the base table (in reality it ended up being 2,160,546). Then: none of the ~480,000 rows with unique pid need to be changed; half of the 480,000 pids with a count of 2 will have the same ismgr (so no change) and the other half will be split, so we will need to change 240,000 rows from these; and a simple combinatorial argument shows that 3/8, or 270,000 rows, of the 720,000 rows for pids that appear three times in the table must be changed. So we should expect that 510,000 rows should be changed. In fact the update statements resulted in 510,132 rows updated (same for both solutions). These sanity checks show that the test was probably set up correctly. Below I show also a small sample from the base table, also as a sanity check.
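(In case the 3/8 isn't obvious: with independent 50-50 values, a row of a three-row pid must be changed exactly when its own ismgr is 0, probability 1/2, and at least one of the other two rows for that pid has ismgr = 1, probability 1 - (1/2)^2 = 3/4; the product is 3/8, and 3/8 of 720,000 is 270,000.)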
CREATE TABLE statement:
create table tbl as
with prep ( pid, dup ) as (
    select level,
           round( dbms_random.value(0.5, 3) ) as dup
    from dual
    connect by level <= 1200000
)
select pid,
       round( dbms_random.value(0.5, 4.5) ) as locid,
       round( dbms_random.value(0, 1) ) as ismgr
from prep
connect by level <= dup
       and prior pid = pid
       and prior sys_guid() is not null
;
commit;
Sanity checks:
select count(*) from tbl;
COUNT(*)
----------
2160546
select * from tbl where pid between 324720 and 324730;
PID LOCID ISMGR
---------- ---------- ----------
324720 4 1
324721 1 0
324721 4 1
324722 3 0
324723 1 0
324723 3 0
324723 3 1
324724 3 1
324724 2 0
324725 4 1
324725 2 0
324726 2 0
324726 1 0
324727 3 0
324728 4 1
324729 1 0
324730 3 1
324730 3 1
324730 2 0
19 rows selected
UPDATE statements:
update tbl t
set ismgr = 1
where ismgr = 0 and
exists (select 1 from tbl t2 where t2.pid = t.pid and t2.ismgr = 1);
rollback;
update tbl
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from tbl
group by pid
having max(ismgr) = 1);
rollback;
-- statements to create indexes, used in separate testing:
create index pid_ismgr_idx on tbl(pid, ismgr);
create index ismgr_ids on tbl(ismgr);
Why PL/SQL? All you need is a plain SQL statement. For example:
update your_table t -- enter your actual table name here
set ismgr = 1
where ismgr = 0
and pid in ( select pid
from your_table
group by pid
having max(ismgr) = 1)
;
The existing solutions are perfectly fine, but I prefer to use merge any time I'm updating rows from a correlated sub-query. I find it to be more readable and the performance is typically commensurate with the exists method.
MERGE INTO t
USING (SELECT DISTINCT pid
       FROM t
       WHERE ismgr = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
    UPDATE SET ismgr = 1
    WHERE ismgr = 0;
As @mathguy pointed out, in this case using group by and having is more efficient than distinct. To use that with merge, it is just a matter of changing the sub-query:
MERGE INTO t
USING (SELECT pid
       FROM t
       GROUP BY pid
       HAVING MAX(ismgr) = 1) src
ON (t.pid = src.pid)
WHEN MATCHED THEN
    UPDATE SET ismgr = 1
    WHERE ismgr = 0;

Group rows into sets of 5

TableA
Col1
----------
1
2
3
4....all the way to 27
I want to add a second column that assigns a number to groups of 5.
Results
Col1  Col2
----  ----
   1     1
   2     1
   3     1
   4     1
   5     1
   6     2
   7     2
   8     2  ...and so on
The 6th group should have 2 rows in it.
NTILE doesn't accomplish what I want because of the way NTILE handles the groups if they aren't divisible by the integer.
If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.
This is clearly demonstrated in this SQL Fiddle. Groups 4, 5, 6 each have 4 rows while the rest have 5. I have started some solutions, but they were getting lengthy and I feel like I'm missing something and that this could be done in a single line.
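For reference, the NTILE attempt being ruled out would look roughly like this (a sketch; with 27 rows, NTILE(6) yields three groups of 5 followed by three groups of 4, as described in the quote above, rather than five groups of 5 and a final group of 2):
SELECT col1,
       NTILE(6) OVER (ORDER BY col1) AS col2
FROM TableA;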
You can use this:
;WITH CTE AS
(
    SELECT col1,
           RN = ROW_NUMBER() OVER(ORDER BY col1)
    FROM TableA
)
SELECT col1, (RN-1)/5+1 AS col2
FROM CTE;
In your sample data, col1 is a consecutive sequence without gaps, so you could use it directly (if it's an INT) without using ROW_NUMBER(). But in case it isn't, this answer works too. Here is the modified sqlfiddle.
A bit of math can go a long way. Subtracting 1 from all values puts the 5s (the edge cases) into the previous group, and the 6s into the next. Flooring the division by your group size and adding one gives the result you're looking for. Also, the SQLFiddle example here fixes your iterative insert - the table only went up to 27.
SELECT col1,
floor((col1-1)/5)+1 as grpNum
FROM tableA
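A quick check of the edge cases: col1 = 5 gives floor((5-1)/5) + 1 = 0 + 1 = 1, so 5 stays in the first group, while col1 = 6 gives floor((6-1)/5) + 1 = 1 + 1 = 2, the start of the second group.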

Finding contiguous regions in a sorted MS Access query

I am a long time fan of Stack Overflow but I've come across a problem that I haven't found addressed yet and need some expert help.
I have a query that is sorted chronologically with a date-time compound key (unique, never deleted) and several pieces of data. What I want to know is whether there is a way to find the start (or end) of a region where a value changes. For example:
DateTime  someVal1  someVal2  someVal3  target
                 1         3         4  A
                 1         2         4  A
                 1         3         4  A
                 1         2         4  B
                 1         2         5  B
                 1         2         5  A
so that my query returns rows 1, 4 and 6: it finds the change in column 5 (target) from A to B and then from B back to A. I have tried the find-duplicates method and using Min and Max in the Totals property, but that gives me the first and last overall instead of the local max and min. Has anyone seen a similar problem?
I didn't see any purpose for the someVal1, someVal2, and someVal3 fields, so I left them out. I used an autonumber as the primary key instead of your date/time field; but this approach should also work with your date/time primary key. This is the data in my version of your table.
pkey_field  target
         1  A
         2  A
         3  A
         4  B
         5  B
         6  A
I used a correlated subquery to find the previous pkey_field value for each row.
SELECT
    m.pkey_field,
    m.target,
    (SELECT Max(pkey_field)
     FROM YourTable
     WHERE pkey_field < m.pkey_field) AS prev_pkey_field
FROM YourTable AS m;
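Run against the sample table above, that query should produce:
pkey_field  target  prev_pkey_field
         1  A
         2  A                     1
         3  A                     2
         4  B                     3
         5  B                     4
         6  A                     5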
Then put that in a subquery which I joined to another copy of the base table.
SELECT
    sub.pkey_field,
    sub.target,
    sub.prev_pkey_field,
    prev.target AS prev_target
FROM
    (SELECT
         m.pkey_field,
         m.target,
         (SELECT Max(pkey_field)
          FROM YourTable
          WHERE pkey_field < m.pkey_field) AS prev_pkey_field
     FROM YourTable AS m) AS sub
    LEFT JOIN YourTable AS prev
    ON sub.prev_pkey_field = prev.pkey_field
WHERE
    sub.prev_pkey_field Is Null
    OR prev.target <> sub.target;
This is the output from that final query.
pkey_field  target  prev_pkey_field  prev_target
         1  A
         4  B                     3  A
         6  A                     5  B
Here is a first attempt:
SELECT t1.Row, t1.target
FROM t1
WHERE t1.target <> NZ((SELECT TOP 1 t2.target
                       FROM t1 AS t2
                       WHERE t2.DateTimeId < t1.DateTimeId
                       ORDER BY t2.DateTimeId DESC), "X");