SQL Server cross-row compression

I'm having to return ~70,000 rows of 4 columns of INTs in a specific order and can only use very shallow caching as the data involved is highly volatile and has to be up to date. One property of the data is that it is often highly repetitive when it is in order.
I've started to look at various methods of reducing the row count in order to cut network bandwidth and client-side processing time/resources, but I have not managed to find any technique in T-SQL for 'compressing' repetitive rows down into a single row plus a 'count' column. For example:
prop1  prop2  prop3  prop4
--------------------------
0      0      1      53
0      0      2      55
1      1      1      8
1      1      1      8
1      1      1      8
1      1      1      8
0      0      2      55
0      0      2      55
0      0      1      53
Into:
prop1  prop2  prop3  prop4  count
---------------------------------
0      0      1      53     1
0      0      2      55     1
1      1      1      8      4
0      0      2      55     2
0      0      1      53     1
I'd estimate that if this was possible, in many cases what would be a 70,000 row result set would be down to a few thousand at most.
Am I barking up the wrong tree here (is there implicit compression as part of the SQL Server protocol)?
Is there a way to do this (SQL Server 2005)?
Is there a reason I shouldn't do this?
Thanks.

You can use the COUNT function! This requires a GROUP BY clause, which tells COUNT how to break the rows up into groups. GROUP BY is used with any aggregate function in SQL.
select
    prop1,
    prop2,
    prop3,
    prop4,
    count(*) as [count]
from
    tbl
group by
    prop1,
    prop2,
    prop3,
    prop4,
    y,
    x
order by y, x
Update: the OP mentioned that the rows are ordered by y and x, which are not part of the result set. In that case you can still include y and x in the GROUP BY.
Keep in mind that an ordering means nothing without ordering columns, so in this case we have to respect that by keeping y and x in the GROUP BY.

This will work, though it is painful to look at:
;WITH Ordering AS
(
    SELECT Prop1,
           Prop2,
           Prop3,
           Prop4,
           ROW_NUMBER() OVER (ORDER BY Y, X) RN
    FROM Props
)
SELECT
    CurrentRow.Prop1,
    CurrentRow.Prop2,
    CurrentRow.Prop3,
    CurrentRow.Prop4,
    CurrentRow.RN -
        ISNULL((SELECT TOP 1 RN
                FROM Ordering O3
                WHERE RN < CurrentRow.RN
                  AND (CurrentRow.Prop1 <> O3.Prop1
                    OR CurrentRow.Prop2 <> O3.Prop2
                    OR CurrentRow.Prop3 <> O3.Prop3
                    OR CurrentRow.Prop4 <> O3.Prop4)
                ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL
   OR (CurrentRow.Prop1 <> O2.Prop1
    OR CurrentRow.Prop2 <> O2.Prop2
    OR CurrentRow.Prop3 <> O2.Prop3
    OR CurrentRow.Prop4 <> O2.Prop4)
ORDER BY CurrentRow.RN
The gist is the following:
Enumerate each row using ROW_NUMBER OVER to get the correct order.
Find the maximums per cycle by joining only when the next row has different fields or when the next row does not exist.
Figure out the count of repetitions by taking the current row number (presumed to be the maximum for this cycle) and subtracting from it the maximum row number of the previous cycle, if one exists.
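If the correlated subquery proves slow, a roughly equivalent gaps-and-islands sketch also works on SQL Server 2005. This is only a sketch, assuming the same Props table and Y/X ordering columns used above: consecutive identical rows share the same value of RN - GrpRN, so grouping on that difference collapses each run.
;WITH Ordered AS
(
    SELECT Prop1, Prop2, Prop3, Prop4,
           -- position in the overall output order
           ROW_NUMBER() OVER (ORDER BY Y, X) AS RN,
           -- position within all rows sharing the same prop values
           ROW_NUMBER() OVER (PARTITION BY Prop1, Prop2, Prop3, Prop4
                              ORDER BY Y, X) AS GrpRN
    FROM Props
)
SELECT Prop1, Prop2, Prop3, Prop4, COUNT(*) AS [count]
FROM Ordered
GROUP BY Prop1, Prop2, Prop3, Prop4, RN - GrpRN
ORDER BY MIN(RN)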

70,000 rows of four integer columns is not really a worry for bandwidth on a modern LAN, unless you have many workstations executing this query concurrently. On a WAN with more restricted bandwidth, you could use DISTINCT to eliminate duplicate rows; that approach is frugal with bandwidth but consumes some server CPU. Again, unless you have a really overloaded server that is always performing at or near peak load, this additional consumption would be a mere blip. 70,000 rows is next to nothing.
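For the WAN case, the DISTINCT variant is a one-line sketch (assuming the same Props table the other answers use). Note that it collapses all duplicate rows, not just consecutive runs, and returns neither a count column nor the original ordering:
SELECT DISTINCT prop1, prop2, prop3, prop4
FROM Props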

Related

Misleading count of 1 on JOIN in Postgres 11.7

I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10 AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS analytics.measurement_table CASCADE;

CREATE TABLE IF NOT EXISTS analytics.measurement_table (
    hour        smallint NOT NULL DEFAULT NULL,
    measurement smallint NOT NULL DEFAULT NULL
);

INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
       ( 1, 1), ( 1, 1),
       (10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour  Count  Sum
0     1      1
1     2      2
2     0      0
3     0      0
4     0      0
5     0      0
6     0      0
7     0      0
8     0      0
9     0      0
10    3      10
11    0      0
12    0      0
This works correctly:
WITH hour_series AS (
    SELECT * FROM generate_series(0, 12) AS hour
)
SELECT hour_series.hour,
       count(measurement_table.hour) AS frequency,
       COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's on the match:
WITH hour_series AS (
    SELECT * FROM generate_series(0, 12) AS hour
)
SELECT hour_series.hour,
       count(*) AS frequency,
       COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
0     1      1
1     2      2
2     1      0
3     1      0
4     1      0
5     1      0
6     1      0
7     1      0
8     1      0
9     1      0
10    3      10
11    1      0
12    1      0
The only difference between these two examples is the count term:
count(*) -- a result of 1 on no match, and a correct count otherwise.
count(column from the joined table) -- 0 on no match, a correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1 since the series row matches once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as its argument, count() takes into account only non-null values and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups that do have matches.
Note that this behavior is not Postgres-specific but is part of the ANSI SQL standard (all databases that I know of conform to it).
Bottom line:
in general cases, use count(*); it is more efficient, since the database does not need to check for nulls (and it makes clear to the reader of the query that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression in the count()
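A minimal, self-contained illustration of the same distinction (hypothetical inline data, not the asker's table): one hour of the grid matches and two do not, so count(*) and count(m.hour) disagree only on the unmatched hours.
SELECT g.hour,
       count(*)      AS rows_in_group,   -- never less than 1: the grid row itself
       count(m.hour) AS matching_rows    -- 0 when m.hour is NULL (no match)
FROM generate_series(0, 2) AS g(hour)
LEFT JOIN (VALUES (0, 1)) AS m(hour, measurement) ON m.hour = g.hour
GROUP BY g.hour
ORDER BY g.hour;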

Give first duplicate a 1 and the rest 0

I have data with 1000+ lines which contains errors people make. I have added an extra column and would like to find all duplicate Rev Names and give the first one a 1 and all remaining duplicates a 0. When there is no duplicate, it should be a 1. The outcome should look like this:
RevName  ErrorCount  Duplicate
Rev5588  23          1
Rev5588  67          0
Rev5588  7           0
Rev5588  45          0
Rev7895  6           1
Rev9065  4           1
Rev5588  1           1
I have tried CASE WHEN but it's not giving the first one a 1, it's giving them all zeros.
Thanks guys, I am pulling out my hair here trying to get this done.
You could use a case expression over the row_number window function:
SELECT RevName,
       ErrorCount,
       CASE ROW_NUMBER() OVER (PARTITION BY RevName
                               ORDER BY (SELECT 1))
            WHEN 1 THEN 1
            ELSE 0
       END AS Duplicate
FROM mytable
SQL tables represent unordered sets. There is no "first" of anything, unless a column specifies the ordering.
Your logic suggests lag():
select t.*,
(case when lag(revname) over (order by ??) = revname then 0
else 1
end) as is_duplicate
from t;
The ?? is for the column that specifies the ordering.
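For example, if the table had some column that fixed the row order (say an identity column id; hypothetical, since the question doesn't name one), the query would read:
select t.*,
       (case when lag(revname) over (order by id) = revname then 0  -- id is hypothetical: any column that fixes row order
             else 1
        end) as is_duplicate
from t;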

MonetDB: Enumerate groups of rows based on a given "boundary" condition

Consider the following table:
id  gap  groupID
0   0    1
2   3    1
3   7    2
4   1    2
5   5    2
6   7    3
7   3    3
8   8    4
9   2    4
Where groupID is the desired, computed column, such that its value is incremented whenever the gap column is greater than a threshold (in this case 6). The id column defines the sequential order of appearance of the rows (and it's already given).
Can you please help me figure out how to dynamically fill out the appropriate values for groupID?
I have looked at several other entries here on StackOverflow, and I've seen sum used as an aggregate in a window function. I can't use sum because it's not supported in MonetDB window functions (only rank, dense_rank, and row_number). I can't use triggers (to modify the record insertion before it takes place) either, because I need to keep the data mentioned above within a stored function in a local temporary table, and trigger declarations are not supported in MonetDB function definitions.
I have also tried filling out the groupID column value by reading the previous table (id and gap) into another temporary table (id, gap, groupID), with the hope that this would force a row-by-row operation. But this has failed as well because it gives the groupID 0 to all records:
declare threshold int;
set threshold = 6;

insert into newTable( id, gap, groupID )
select A.id, A.gap,
       case when A.gap > threshold then
            (select case when max(groupID) is null then 0 else max(groupID) + 1 end from newTable)
       else
            (select case when max(groupID) is null then 0 else max(groupID) end from newTable)
       end
from A
order by A.id asc;
Any help, tip, or reference is greatly appreciated. I've been trying to figure this out for a long time now.
BTW: cursors are not supported in MonetDB either.
You can assign the group using a correlated subquery. Simply count how many rows up to and including the current one have a gap that exceeds the threshold (6):
select id, gap,
       (select 1 + count(*)
        from t as t2
        where t2.id <= t.id and t2.gap > 6
       ) as Groupid
from t;
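On engines that do support SUM as a window function (which MonetDB did not at the time, per the question), the same numbering can be written without a correlated subquery. A sketch, assuming the same table t and threshold of 6:
select id, gap,
       -- running count of "boundary" rows up to and including the current row
       1 + sum(case when gap > 6 then 1 else 0 end) over (order by id) as groupID
from t;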

How to find daily differences over a flexible time period?

I have a very simple set of data as follows:
CustomerId char(6)
Points int
PointsDate date
with example data such as:
000021   0  01-JAN-2014
000021  10  02-JAN-2014
000021  20  03-JAN-2014
000021  30  06-JAN-2014
000021  40  07-JAN-2014
000021  10  12-JAN-2014
000034   0  04-JAN-2014
000034  40  05-JAN-2014
000034  20  06-JAN-2014
000034  40  08-JAN-2014
000034  60  10-JAN-2014
000034  80  21-JAN-2014
000034  10  22-JAN-2014
So, the PointsDate component is NOT consistent, nor is it contiguous (it's based around some "activity" happening)
I am trying to get, for each customer, the total amount of positive and negative differences in points, the number of positive and negative changes, as well as Max and Min...but ignoring the very first instance of the customer - which will always be zero.
e.g.
CustomerId  Pos  Neg  Count(pos)  Count(neg)  Max  Min
000021      40   30   3           1           40   10
000034      100  90   4           2           80   10
...but I have not a single clue how to achieve this!
I would put it in a cube, but a) there is only a single table and no other references and b) I know almost nothing about cubes!
The problem can be solved in regular T-SQL with a common table expression that numbers the rows per customer, along with a self join that compares each row with the previous one:
;WITH cte AS (
    SELECT customerid, points,
           ROW_NUMBER() OVER (PARTITION BY customerid ORDER BY pointsdate) rn
    FROM mytable
)
SELECT cte.customerid,
       SUM(CASE WHEN cte.points > old.points THEN cte.points - old.points ELSE 0 END) pos,
       SUM(CASE WHEN cte.points < old.points THEN old.points - cte.points ELSE 0 END) neg,
       SUM(CASE WHEN cte.points > old.points THEN 1 ELSE 0 END) [Count(pos)],
       SUM(CASE WHEN cte.points < old.points THEN 1 ELSE 0 END) [Count(neg)],
       MAX(cte.points) max,
       MIN(cte.points) min
FROM cte
JOIN cte old
  ON cte.rn = old.rn + 1
 AND cte.customerid = old.customerid
GROUP BY cte.customerid
An SQLfiddle to test with.
The query would have been somewhat simplified using SQL Server 2012's more extensive analytic functions.
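For reference, a sketch of what that simplification might look like on SQL Server 2012 or later, using LAG over the same assumed mytable; it mirrors the query above (each customer's first row produces a NULL difference and is filtered out):
WITH diffs AS (
    SELECT customerid, points,
           -- difference from the customer's previous reading; NULL on the first row
           points - LAG(points) OVER (PARTITION BY customerid
                                      ORDER BY pointsdate) AS diff
    FROM mytable
)
SELECT customerid,
       SUM(CASE WHEN diff > 0 THEN diff ELSE 0 END)  AS pos,
       SUM(CASE WHEN diff < 0 THEN -diff ELSE 0 END) AS neg,
       SUM(CASE WHEN diff > 0 THEN 1 ELSE 0 END)     AS [Count(pos)],
       SUM(CASE WHEN diff < 0 THEN 1 ELSE 0 END)     AS [Count(neg)],
       MAX(points) AS [Max],
       MIN(points) AS [Min]
FROM diffs
WHERE diff IS NOT NULL   -- excludes each customer's first row
GROUP BY customerid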
An approach similar to Joachim Isaksson's, but with more work in the CTE and less in the main query:
WITH A AS (
    SELECT c.CustomerID, c.Points, c.PointsDate
         , Diff = c.Points - l.Points
         , l.PointsDate lPointsDate
    FROM Customer c
    CROSS APPLY (SELECT TOP 1 Points, PointsDate
                 FROM Customer cu
                 WHERE c.CustomerID = cu.CustomerID
                   AND c.PointsDate > cu.PointsDate
                 ORDER BY cu.PointsDate DESC) l
)
SELECT CustomerID
     , Pos = SUM(Diff * CAST(SIGN(Diff) + 1 AS BIT))
     , Neg = SUM(Diff * (1 - CAST(SIGN(Diff) + 1 AS BIT)))
     , [Count(pos)] = SUM(0 + CAST(SIGN(Diff) + 1 AS BIT))
     , [Count(neg)] = SUM(1 - CAST(SIGN(Diff) + 1 AS BIT))
     , MAX(Points) [Max], MIN(Points) [Min]
FROM A
GROUP BY CustomerID
SQLFiddle Demo
The condition that removes the first day is the JOIN (CROSS APPLY) in the CTE: the first day has no previous day, so it is filtered out.
In the main query, instead of using a CASE to filter the positive and negative differences, I preferred the SIGN function (see the small illustration below):
this function returns -1 for negative, 0 for zero and +1 for positive;
shifting the value with Sign(Diff) + 1 means that the new return values are 0, 1 and 2;
the CAST to BIT compresses those to 0 for negative and 1 for zero or positive.
The 0 + in the definition of [Count(pos)] creates an implicit conversion to an integer value, as BIT cannot be summed.
The 1 - used to SUM and COUNT the negative differences is equivalent to a NOT: it inverts the BIT value of the shifted SIGN to 1 for negative and 0 for zero or positive.
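A tiny standalone illustration of that SIGN/BIT trick, using a few hypothetical difference values rather than the real table:
SELECT Diff,
       SIGN(Diff)                      AS SignVal,  -- -1, 0, +1
       CAST(SIGN(Diff) + 1 AS BIT)     AS PosFlag,  -- 0 for negative, 1 for zero or positive
       1 - CAST(SIGN(Diff) + 1 AS BIT) AS NegFlag   -- 1 for negative, 0 for zero or positive
FROM (SELECT -30 AS Diff UNION ALL SELECT 0 UNION ALL SELECT 40) v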
I'll copy my comment from above: I know literally nothing about cubes, but it sounds like what you're looking for is just a cursor, is it not? I know everyone hates cursors, but that's the best way I know to compare consecutive rows without loading it down onto a client machine (which is obviously worse).
I see you mentioned in your response to me that you'd be okay setting it off to run overnight, so if you're willing to accept that sort of performance, I definitely think a cursor will be the easiest and quickest to implement. If this is just something you do here or there, I'd definitely do that. It's nobody's favorite solution, but it'd get the job done.
Unfortunately, yeah, at twelve million records, you'll definitely want to spend some time optimizing your cursor. I work frequently with a database that's around that size, and I can only imagine how long it'd take. Although depending on your usage, you might want to filter based on user, in which case the cursor will be easier to write, and I doubt you'll be facing enough records to cause much of a problem. For instance, you could just look at the top twenty users and test their records, then do more as needed.

SQL: Sort by priority, but put 0 last

I have a (int) column called "priority". When I select my items I want the highest priority (lowest number) to be first, and the lowest priority (highest number) to be the last.
However, the items without a priority (currently priority 0) should be listed by some other column after the ones with a priority.
In other words. If I have these priorities:
1 2 0 0 5 0 8 9
How do I sort them like this:
1 2 5 8 9 0 0 0
I guess I could use Int.max instead of 0, but 0 makes up such a nice default value which I would try to keep.
I don't think it can get cleaner than this:
ORDER BY priority=0, priority
SQLFiddle Demo
Note that unlike the other solutions, this one can take advantage of an index on priority and will be fast if the number of records is large.
Try:
order by case priority when 0 then 2 else 1 end, priority
A very simple solution could be to use a composite value / "prefix" for sorting, like this:
SELECT ...
FROM ...
ORDER BY CASE WHEN priority = 0 THEN 9999 ELSE 0 END + priority, secondSortCriteriaCol
This will do the trick. You will need to replace testtable with your table name.
SELECT t.priority
FROM dbo.testtable t
ORDER BY (CASE WHEN t.priority = 0 THEN 2147483647 ELSE t.priority END)
In case it's not clear, I've picked 2147483647 because it is the maximum value of an int, so those rows will sort last.
Mark's answer is better and definitely the one to go with.
order by case(priority) when 0 then 10 else priority end