Misleading count of 1 on JOIN in Postgres 11.7

I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10 AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS measurement_table CASCADE;
CREATE TABLE IF NOT EXISTS measurement_table (
    hour smallint NOT NULL,
    measurement smallint NOT NULL
);
INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
( 1, 1), ( 1, 1),
(10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour Count sum
0 1 1
1 2 2
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 3 10
11 0 0
12 0 0
This works correctly:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(measurement_table.hour) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's where there is no match:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(*) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
hour frequency total
0 1 1
1 2 2
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 3 10
11 1 0
12 1 0
The only difference between these two examples is the count term:
count(*) -- A result of 1 on no match, and a correct count otherwise.
count(joined-to table field) -- 0 on no match, and a correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1, since the series row matches once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.

You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as its argument, count() takes into account any non-null value and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups where there are matches.
Note that this behavior is not Postgres-specific, but is part of the ANSI SQL standard (all databases that I know of conform to it).
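To see both behaviours side by side with the sample data above, the two count variants can be put in the same query (a quick illustration only; the column aliases are just illustrative):
-- For hours with no measurements, count(*) reports 1 (the single
-- null-extended row produced by the LEFT JOIN), while
-- count(measurement_table.hour) ignores the nulls and reports 0.
WITH hour_series AS (
    SELECT * FROM generate_series(0, 12) AS hour
)
SELECT hour_series.hour,
       count(*) AS rows_in_group,
       count(measurement_table.hour) AS non_null_matches
FROM hour_series
LEFT JOIN measurement_table ON measurement_table.hour = hour_series.hour
GROUP BY 1
ORDER BY 1;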
Bottom line:
In general, use count(*); this is more efficient, since the database does not need to check for nulls (and it makes it clear to the reader of the query that you just want to know how many rows belong to the group).
In specific cases such as yours, put the relevant column or expression inside count().

Related

Performant query count of rows within range over sequence

I have a SQLite table with an Id and an active period, and I am trying to get counts of the number of active rows over a sequence of times.
A vastly simplified version of this table is:
CREATE TABLE Data (
EntityId INTEGER NOT NULL,
Start INTEGER NOT NULL,
Finish INTEGER
);
With some example data
INSERT INTO Data VALUES
(1, 0, 2),
(1, 4, 6),
(1, 8, NULL),
(2, 5, 7),
(2, 9, NULL),
(3, 8, NULL);
And a desired output of something like:
Time Count
0 1
1 1
2 0
3 0
4 1
5 2
6 1
7 0
8 2
9 3
For which I am querying with:
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT Time, COUNT(EntityId)
FROM Data
JOIN Generate_Time ON Start <= Time AND (Finish > Time OR Finish IS NULL)
GROUP BY Time
There is also some data I need to categorise the counts by (some of it is on the original table, some comes via a join), but I am hitting a performance bottleneck on the order of seconds on even small amounts of data (~25,000 rows) without any of that.
I have added an index on the table covering Start/Finish:
CREATE INDEX Ix_Data ON Data (
Start,
Finish
);
and that helped somewhat but I can't help but feel there's a more elegant & performant way of doing this. Using the CTE to iterate over a range doesn't seem like it will scale very well but I can't think of another way to calculate what I need.
I've been looking at the query plan too, and I think the slow part is the GROUP BY, since it can't use an index for that (it comes from the CTE), so SQLite generates a temporary B-tree:
3 0 0 MATERIALIZE 3
7 3 0 SETUP
8 7 0 SCAN CONSTANT ROW
21 3 0 RECURSIVE STEP
22 21 0 SCAN TABLE Generate_Time
27 21 0 SCALAR SUBQUERY 2
32 27 0 SEARCH TABLE Data USING COVERING INDEX Ix_Data
57 0 0 SCAN SUBQUERY 3
59 0 0 SEARCH TABLE Data USING INDEX Ix_Data (Start<?)
71 0 0 USE TEMP B-TREE FOR GROUP BY
Any suggestions of a way to speed this query up, or even a better way of storing this data to craft a tighter query would be most welcome!
To get to the desired output as per your question, the following can be done: keep the recursive CTE, but use a LEFT JOIN so that time slots with no active rows are preserved with a count of 0.
For better performance, one option is to make use of generate_series to generate the rows instead of the recursive CTE, and to limit the number of rows to the maximum Start value available in Data.
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT gt.Time
,count(d.entityid)
FROM Generate_Time gt
LEFT JOIN Data d
ON d.Start <= gt.Time AND (d.Finish > gt.Time OR d.Finish IS NULL)
GROUP BY gt.Time
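As a sketch of the generate_series variant mentioned above, assuming your SQLite build includes the generate_series table-valued function (the optional series extension; the standard sqlite3 shell ships with it, but not every language binding does), the recursive CTE can be dropped entirely:
-- Sketch only: gt.value is the generated time slot; the LEFT JOIN keeps
-- slots with no active rows so they report a count of 0.
WITH bounds AS (
    SELECT MAX(Start) AS max_start FROM Data
)
SELECT gt.value AS Time,
       COUNT(d.EntityId) AS "Count"
FROM bounds
CROSS JOIN generate_series(0, bounds.max_start) AS gt
LEFT JOIN Data AS d
       ON d.Start <= gt.value
      AND (d.Finish > gt.value OR d.Finish IS NULL)
GROUP BY gt.value
ORDER BY gt.value;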
This ended up being simply a case of the result set being too large. In my real data, the result set before grouping was ~19,000,000 records. I was able to do some partitioning on my client side, splitting the queries into smaller discrete chunks which improved performance ~10x, which still wasn't quite as fast as I wanted but was acceptable for my use case.

How to calculate a progressive sum within an Access query

I have a query, let's call it qry_01, that produces a set of data similar to this:
ID N CN Sum
1 4 0 0
2 3 3 3
5 4 4 7
8 3 3 10
The values shown in this query actually come from a chain of queries and from a bunch of different tables.
The corrected value CN is calculated within the query: it is equal to N if the ID is not 1, and 0 if it is 1.
The Sum is the value I want to calculate by progressively summing up the CN values.
I tried to use DSUM, but I came out with nothing.
Can anyone please help me?
You could use a correlated subquery in the following way:
select t.id, t.n, t.cn, (select sum(u.cn) from qry_01 u where u.id <= t.id) as [sum]
from qry_01 t
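If you would rather stick with the domain aggregate you mention, a DSum version of the same running sum should also work (a sketch; RunningSum is just an illustrative alias). Note that DSum re-queries qry_01 for every row, so it is usually slower than the correlated subquery above:
SELECT t.ID, t.N, t.CN,
       DSum("CN", "qry_01", "ID <= " & t.ID) AS RunningSum
FROM qry_01 AS t;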

MonetDB: Enumerate groups of rows based on a given "boundary" condition

Consider the following table:
id gap groupID
0 0 1
2 3 1
3 7 2
4 1 2
5 5 2
6 7 3
7 3 3
8 8 4
9 2 4
Where groupID is the desired, computed column, such that its value is incremented whenever the gap column is greater than a threshold (in this case 6). The id column defines the sequential order of appearance of the rows (and it's already given).
Can you please help me figure out how to dynamically fill out the appropriate values for groupID?
I have looked at several other entries here on StackOverflow, and I've seen the usage of sum as an aggregate for a window function. I can't use sum because it's not supported in MonetDB window functions (only rank, dense_rank, and row_number). I can't use triggers (to modify the record insertion before it takes place) either, because I need to keep the data mentioned above in a local temporary table within a stored function -- and trigger declarations are not supported in MonetDB function definitions.
I have also tried filling out the groupID column value by reading the previous table (id and gap) into another temporary table (id, gap, groupID), with the hope that this would force a row-by-row operation. But this has failed as well, because it gives a groupID of 0 to all records:
declare threshold int;
set threshold = 6;
insert into newTable( id, gap, groupID )
select A.id, A.gap,
case when A.gap > threshold then
(select case when max(groupID) is null then 0 else max(groupID)+1 end from newTable)
else
(select case when max(groupID) is null then 0 else max(groupID) end from newTable)
end
from A
order by A.id asc;
Any help, tip, or reference is greatly appreciated. I've been trying to figure this out for a long time already.
BTW: Cursors are not supported in MonetDB either.
You can assign the group using a correlated subquery. Simply count how many rows up to and including the current one have a gap that exceeds 6:
select id, gap,
(select 1 + count(*)
from t as t2
where t2.id <= t.id and t2.gap > 6
) as Groupid
from t;
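For reference, on databases (including later MonetDB releases) that do allow sum as a window aggregate, the same numbering can be written without the correlated subquery; a sketch, using the literal threshold of 6 as above:
select id, gap,
       -- running count of rows (including this one) whose gap exceeds 6
       1 + sum(case when gap > 6 then 1 else 0 end)
               over (order by id rows between unbounded preceding and current row) as Groupid
from t;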

Need help grouping a column that may or may not have the same value but have the same accounts

How do I output the data stored in Table 1 so that, within each account number, rows that have the same CPT are grouped together, while the ones that do not match fall to the bottom of the list?
I have one table: select * from CPTCounts and this is what it displays.
Format (relevant fields only):
account OriginalCPT Count ModifiedCPT Count
11 0 71010 1
11 71010 1 0
2 0 71010 1
2 0 71020 9
2 0 73130 1
2 0 77800 1
2 71010 1 0
2 71020 8 0
2 73130 1 0
2 73610 1 0
2 G0202 4 0
31 99010 1 0
31 0 99010 4
31 0 99700 2
What I want the results to be grouped like is below, displayed like this or similar:
Account OriginalCPT Count ModifiedCPT Count
11 71010 1 71010 1
2 71010 1 71010 1
2 71020 8 0
2 73130 1 0
2 73610 1 0
2 G0202 4 0
31 99010 1 99010 4
31 0 99700 2
I have one table with the values above;
Select * from #CPTCounts
The grouping I am looking for is Original CPT = Modified CPT; sometimes I will not have a value on one side or the other, but most of the time I will have a match. I would like to place all of the unmatched ones at the bottom of each account.
Any suggestions?
I was thinking of creating a second table and joining the two on the account, but how do I return each value?
select cpt1.account, cpt1.originalCPT, cpt1.count, cpt2.modifiedcpt, cpt2.count
from #cptcounts cpt1
join #cptcounts cpt2 on cpt1.account = cpt2.account
but I am having trouble with that solution.
I'm not sure I have an exact solution, but perhaps some food for thought at least. The fact that you need either the "original" or the "modified" set of columns makes me think that you need a full outer join rather than a left join. You don't mention which database you are using. In MySQL, for example, full joins can be emulated by means of a union of a left and a right join, as in the following:
select cpt1.account, cpt1.originalCPT, cpt1.countO, cpt2.modifiedcpt, cpt2.countM
from CPTCounts cpt1
left outer join CPTCounts cpt2
on cpt1.account = cpt2.account
and cpt1.originalCPT=cpt2.modifiedCPT
where cpt1.account is not null
and (cpt1.originalCPT is not null or cpt2.modifiedCPT is not null)
union
select cpt1.account, cpt1.originalCPT, cpt1.countO, cpt2.modifiedcpt, cpt2.countM
from CPTCounts cpt1
right outer join CPTCounts cpt2
on cpt1.account = cpt2.account
and cpt1.originalCPT=cpt2.modifiedCPT
where cpt2.account is not null
and (cpt1.originalCPT is not null or cpt2.modifiedCPT is not null)
order by originalCPT, modifiedCPT, account
The ordering brings the non-matching rows to the top, but that seemed a lesser problem than getting the matching to work.
(Your output data is a bit confusing, because the CPT 71020, for example, occurs in both original and modified columns, but you haven't shown it as one of the matching ones in your result set. I'm presuming this is because it is just an example... but if I'm wrong, then I am missing some part of your intention.)
You can play around in this SQL Fiddle.
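If your database supports FULL OUTER JOIN directly (the #CPTCounts temp-table syntax suggests SQL Server, which does), the union of a left and a right join can be collapsed into a single join. A sketch along the same lines, keeping the countO/countM column names assumed above:
select coalesce(cpt1.account, cpt2.account) as account,
       cpt1.originalCPT, cpt1.countO,
       cpt2.modifiedcpt, cpt2.countM
from CPTCounts cpt1
full outer join CPTCounts cpt2
       on cpt1.account = cpt2.account
      and cpt1.originalCPT = cpt2.modifiedCPT
-- unmatched rows from either side survive the full join with nulls on the other side
where cpt1.originalCPT is not null
   or cpt2.modifiedCPT is not null
order by cpt1.originalCPT, cpt2.modifiedcpt, coalesce(cpt1.account, cpt2.account)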

SQL Server cross-row compression

I'm having to return ~70,000 rows of 4 columns of INTs in a specific order and can only use very shallow caching as the data involved is highly volatile and has to be up to date. One property of the data is that it is often highly repetitive when it is in order.
I've started to look at various methods of reducing the row count in order to reduce network bandwidth and client-side processing time/resources, but have not managed to find any kind of technique in T-SQL where I can 'compress' repetitive rows down into a single row and a 'count' column. e.g.
prop1 prop2 prop3 prop4
--------------------------------
0 0 1 53
0 0 2 55
1 1 1 8
1 1 1 8
1 1 1 8
1 1 1 8
0 0 2 55
0 0 2 55
0 0 1 53
Into:
prop1 prop2 prop3 prop4 count
-----------------------------------------
0 0 1 53 1
0 0 2 55 1
1 1 1 8 4
0 0 2 55 2
0 0 1 53 1
I'd estimate that if this was possible, in many cases what would be a 70,000 row result set would be down to a few thousand at most.
Am I barking up the wrong tree here (is there implicit compression as part of the SQL Server protocol)?
Is there a way to do this (SQL Server 2005)?
Is there a reason I shouldn't do this?
Thanks.
You can use the count function! This will require you to use the group by clause, where you tell count how to break the rows up into groups. Group by is used with any aggregate function in SQL.
select
prop1,
prop2,
prop3,
prop4,
count(*) as count
from
tbl
group by
prop1,
prop2,
prop3,
prop4,
y,
x
order by y, x
Update: The OP mentioned that these are ordered by y and x, which are not part of the result set. In this case, you can still use y and x as part of the group by.
Keep in mind that ordering means nothing without ordering columns, so in this case we have to respect that by including y and x in the group by.
This will work, though it is painful to look at:
;WITH Ordering
AS
(
SELECT Prop1,
Prop2,
Prop3,
Prop4,
ROW_NUMBER() OVER (ORDER BY Y, X) RN
FROM Props
)
SELECT
CurrentRow.Prop1,
CurrentRow.Prop2,
CurrentRow.Prop3,
CurrentRow.Prop4,
CurrentRow.RN -
ISNULL((SELECT TOP 1 RN FROM Ordering O3 WHERE RN < CurrentRow.RN AND (CurrentRow.Prop1 <> O3.Prop1 OR CurrentRow.Prop2 <> O3.Prop2 OR CurrentRow.Prop3 <> O3.Prop3 OR CurrentRow.Prop4 <> O3.Prop4) ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL OR (CurrentRow.Prop1 <> O2.Prop1 OR CurrentRow.Prop2 <> O2.Prop2 OR CurrentRow.Prop3 <> O2.Prop3 OR CurrentRow.Prop4 <> O2.Prop4)
ORDER BY CurrentRow.RN
The gist is the following:
Enumerate each row using ROW_NUMBER OVER to get the correct order.
Find the maximums per cycle by joining only when the next row has different fields or when the next row does not exist.
Figure out the count of repetitions by taking the current row number (presumed to be the max for this cycle) and subtracting from it the maximum row number of the previous cycle, if it exists.
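An alternative way to get the same run lengths, still available on SQL Server 2005, is the gaps-and-islands trick: the difference between an overall ROW_NUMBER and a ROW_NUMBER partitioned by the four properties is constant within each consecutive run of identical rows, so you can group on it. A sketch, assuming the same Props table and Y, X ordering columns as above:
;WITH Runs AS
(
    SELECT Prop1, Prop2, Prop3, Prop4,
           ROW_NUMBER() OVER (ORDER BY Y, X) AS RN,
           -- Island is constant within each consecutive run of identical rows
           ROW_NUMBER() OVER (ORDER BY Y, X)
             - ROW_NUMBER() OVER (PARTITION BY Prop1, Prop2, Prop3, Prop4 ORDER BY Y, X) AS Island
    FROM Props
)
SELECT Prop1, Prop2, Prop3, Prop4,
       COUNT(*) AS [count]
FROM Runs
GROUP BY Prop1, Prop2, Prop3, Prop4, Island
ORDER BY MIN(RN)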
70,000 rows of four integer columns is not really a worry for bandwidth on a modern LAN, unless you have many workstations executing this query concurrently; and on a WAN with more restricted bandwidth you could use DISTINCT to eliminate duplicate rows, an approach which would be frugal with your bandwidth but consume some server CPU. Again, however, unless you have a really overloaded server that is always performing at or near peak loads, this additional consumption would be a mere blip. 70,000 rows is next to nothing.