I have a very simple set of data as follows:
CustomerId char(6)
Points int
PointsDate date
with example data such as:
000021 0 01-JAN-2014
000021 10 02-JAN-2014
000021 20 03-JAN-2014
000021 30 06-JAN-2014
000021 40 07-JAN-2014
000021 10 12-JAN-2014
000034 0 04-JAN-2014
000034 40 05-JAN-2014
000034 20 06-JAN-2014
000034 40 08-JAN-2014
000034 60 10-JAN-2014
000034 80 21-JAN-2014
000034 10 22-JAN-2014
So, the PointsDate component is NOT consistent, nor is it contiguous (it's based around some "activity" happening)
I am trying to get, for each customer, the total amount of positive and negative differences in points, the number of positive and negative changes, as well as Max and Min...but ignoring the very first instance of the customer - which will always be zero.
e.g.
CustomerId Pos Neg Count(pos) Count(neg) Max Min
000021 40 30 3 1 40 10
000034 100 90 4 2 80 10
...but I have not a single clue how to achieve this!
I would put it in a cube, but a) there is only a single table and no other references and b) I know almost nothing about cubes!
The problem can be solved in regular T-SQL with a common table expression that numbers the rows per customer, along with a self join that compares each row with the previous one:
WITH cte AS (
SELECT customerid, points,
ROW_NUMBER() OVER (PARTITION BY customerid ORDER BY pointsdate) rn
FROM mytable
)
SELECT cte.customerid,
SUM(CASE WHEN cte.points > old.points THEN cte.points - old.points ELSE 0 END) pos,
SUM(CASE WHEN cte.points < old.points THEN old.points - cte.points ELSE 0 END) neg,
SUM(CASE WHEN cte.points > old.points THEN 1 ELSE 0 END) [Count(pos)],
SUM(CASE WHEN cte.points < old.points THEN 1 ELSE 0 END) [Count(neg)],
MAX(cte.points) max,
MIN(cte.points) min
FROM cte
JOIN cte old
ON cte.rn = old.rn + 1
AND cte.customerid = old.customerid
GROUP BY cte.customerid
The query would have been somewhat simplified using SQL Server 2012's more extensive analytic functions (such as LAG).
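For instance, here is a sketch of how LAG could replace the self join on SQL Server 2012 or later (not from the original answer; it assumes the same mytable and column names as above):
WITH diffs AS (
  SELECT customerid, points,
         points - LAG(points) OVER (PARTITION BY customerid ORDER BY pointsdate) AS diff
  FROM mytable
)
SELECT customerid,
       SUM(CASE WHEN diff > 0 THEN diff ELSE 0 END)  AS pos,
       SUM(CASE WHEN diff < 0 THEN -diff ELSE 0 END) AS neg,
       SUM(CASE WHEN diff > 0 THEN 1 ELSE 0 END)     AS [Count(pos)],
       SUM(CASE WHEN diff < 0 THEN 1 ELSE 0 END)     AS [Count(neg)],
       MAX(points) AS [Max],
       MIN(points) AS [Min]
FROM diffs
WHERE diff IS NOT NULL   -- drops each customer's first row, which has no previous value
GROUP BY customerid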
An approach similar to Joachim Isaksson's, but with more work in the CTE and less in the main query:
WITH A AS (
SELECT c.CustomerID, c.Points, c.PointsDate
, Diff = c.Points - l.Points
, l.PointsDate lPointsDate
FROM Customer c
CROSS APPLY (SELECT TOP 1
Points, PointsDate
FROM Customer cu
WHERE c.CustomerID = cu.CustomerID
AND c.PointsDate > cu.PointsDate
ORDER BY cu.PointsDate Desc) l
)
SELECT CustomerID
, Pos = SUM(Diff * CAST(Sign(Diff) + 1 AS BIT))
, Neg = SUM(Diff * (1 - CAST(Sign(Diff) + 1 AS BIT)))
, [Count(pos)] = SUM(0 + CAST(Sign(Diff) + 1 AS BIT))
, [Count(neg)] = SUM(1 - CAST(Sign(Diff) + 1 AS BIT))
, Max(Points) [Max], Min(Points) [Min]
FROM A
GROUP BY CustomerID
The condition that removes the first day is the JOIN (CROSS APPLY) in the CTE: the first day has no previous day, so it is filtered out.
In the main query, instead of using a CASE to filter the positive and negative differences, I preferred the SIGN function:
- this function returns -1 for negative, 0 for zero and +1 for positive values;
- shifting the value with Sign(Diff) + 1 means that the new return values are 0, 1 and 2;
- the CAST to BIT compresses those to 0 for negative and 1 for zero or positive.
The 0 + in the definition of [Count(pos)] creates an implicit conversion to an integer value, as BIT cannot be summed.
The 1 - used to SUM and COUNT the negative differences is equivalent to a NOT: it inverts the value of the BIT SIGN to 1 for negative and 0 for zero or positive.
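As a quick illustration of the SIGN/BIT mapping (not part of the original answer; the sample differences are made up):
SELECT Diff,
       SIGN(Diff)                      AS SignValue,   -- -1, 0 or +1
       CAST(SIGN(Diff) + 1 AS BIT)     AS NonNegFlag,  -- 0 for negative, 1 for zero or positive
       1 - CAST(SIGN(Diff) + 1 AS BIT) AS NegFlag      -- the inverted ("NOT") flag
FROM (VALUES (-30), (0), (10)) v(Diff)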
I'll copy my comment from above: I know literally nothing about cubes, but it sounds like what you're looking for is just a cursor, is it not? I know everyone hates cursors, but that's the best way I know to compare consecutive rows without loading it down onto a client machine (which is obviously worse).
I see you mentioned in your response to me that you'd be okay setting it off to run overnight, so if you're willing to accept that sort of performance, I definitely think a cursor will be the easiest and quickest to implement. If this is just something you do here or there, I'd definitely do that. It's nobody's favorite solution, but it'd get the job done.
Unfortunately, yeah, at twelve million records, you'll definitely want to spend some time optimizing your cursor. I work frequently with a database that's around that size, and I can only imagine how long it'd take. Although depending on your usage, you might want to filter based on user, in which case the cursor will be easier to write, and I doubt you'll be facing enough records to cause much of a problem. For instance, you could just look at the top twenty users and test their records, then do more as needed.
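For what it's worth, here is a minimal cursor sketch of that idea, assuming the question's table is called mytable and leaving Max/Min out for brevity:
DECLARE @cust char(6), @points int, @prevCust char(6), @prevPoints int;
DECLARE @results TABLE (CustomerId char(6), Pos int, Neg int, CntPos int, CntNeg int);

DECLARE points_cur CURSOR FAST_FORWARD FOR
    SELECT CustomerId, Points FROM mytable ORDER BY CustomerId, PointsDate;

OPEN points_cur;
FETCH NEXT FROM points_cur INTO @cust, @points;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @cust = @prevCust  -- skips each customer's first row (@prevCust is NULL or a different customer)
    BEGIN
        IF NOT EXISTS (SELECT 1 FROM @results WHERE CustomerId = @cust)
            INSERT INTO @results VALUES (@cust, 0, 0, 0, 0);
        UPDATE @results
           SET Pos    = Pos    + CASE WHEN @points > @prevPoints THEN @points - @prevPoints ELSE 0 END,
               Neg    = Neg    + CASE WHEN @points < @prevPoints THEN @prevPoints - @points ELSE 0 END,
               CntPos = CntPos + CASE WHEN @points > @prevPoints THEN 1 ELSE 0 END,
               CntNeg = CntNeg + CASE WHEN @points < @prevPoints THEN 1 ELSE 0 END
         WHERE CustomerId = @cust;
    END
    SELECT @prevCust = @cust, @prevPoints = @points;
    FETCH NEXT FROM points_cur INTO @cust, @points;
END
CLOSE points_cur;
DEALLOCATE points_cur;

SELECT * FROM @results;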
I created this query:
select first_price, last_price, cast((sum(1 - (first_price / nullif(last_price,0)))) as double) as first_vs_last_percentages
from prices
group by first_price, last_price
having first_vs_last_percentages >= 0.1
Unfortunately, this is the wrong data I get in the first_vs_last_percentages column:
ID  first_price  last_price  first_vs_last_percentages
1   10           11          1-(10/11) = 1.0
2   66           68          1-(66/68) = 1.0
It was supposed to return this output:
ID  first_price  last_price  first_vs_last_percentages
1   10           11          1-(10/11) = 0.0909
2   66           68          1-(66/68) = 0.0294
If someone has a good solution, ideally in Presto syntax, that would be wonderful.
It seems you got struck by another case of integer division (your cast to double comes too late). Update the query so the type of the divisor or dividend changes, for example by multiplying one of them by 1.0, which is a bit shorter than a cast to double:
select -- ...
, sum(1 - (first_price * 1.0) / nullif(last_price, 0)) first_vs_last_percentages
from ...
P.S.
Your query is a bit strange; I'm not sure why you need the grouping and SUM here.
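For completeness, here is a full sketch under the assumption that each prices row should simply yield its own ratio (no aggregation), keeping the 10% filter from your HAVING clause:
select
    first_price,
    last_price,
    1 - first_price * 1.0 / nullif(last_price, 0) as first_vs_last_percentages
from prices
where 1 - first_price * 1.0 / nullif(last_price, 0) >= 0.1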
It depends on which database engine you are working with. Typically, query confusion like this comes down to either conceptual or syntactic mistakes. Either way, what you seem to want is a per-row ratio (or percentage, 100.0 * (last_price - first_price) / first_price). That means you can drop the GROUP BY and HAVING, since we should not group by double values, but rather by the intervals they belong to.
select
    first_price,
    last_price,
    CASE
        WHEN first_price = 0 THEN NULL
        ELSE 1.0 * (last_price - first_price) / first_price  -- 1.0 avoids integer division
    END AS first_vs_last_percentage
from prices
I have two solutions for finding the sum of positive integers and negative integers. Please tell me which one is more correct and better optimized.
Or is there another query that is more optimized and correct?
Q:
Consider Table A with col1 and below values.
col1
20
-20
40
-40
-30
30
I need below output
POSITIVE_SUM NEGATIVE_SUM
90 -90
I have two solutions.
/q1/
select POSITIVE_SUM,NEGATIVE_SUM from
(select distinct sum(a2.col1) AS "POSITIVE_SUM" from A a1 join A a2 on a2.col1>0
group by a1.col1)
t1
,
(select distinct sum(a2.col1) AS "NEGATIVE_SUM" from A a1 join A a2 on a2.col1<0
group by a1.col1) t2;
/q2/
select sum (case when a1.col1 >= 0 then a1.col1 else 0 end) as positive_sum,
sum (case when a1.col1 < 0 then a1.col1 else 0 end) as negative_sum
from A a1;
POSITIVE_SUM NEGATIVE_SUM
90 -90
I wonder how you even came up with your 1st solution:
- self-joining the table (twice),
- producing 6 (identical) rows each, and finally getting 1 row with DISTINCT,
- then cross joining the 2 results.
I prepared a demo so you can see the steps that lead to the result of your 1st solution.
I don't know if this can be optimized in any way,
but is there any case where it can beat a single scan of the table with conditional aggregation, like your 2nd solution?
I don't think so.
The second query is not only better performing, but it returns the correct values. If you run the first query, you'll see that it returns multiple rows.
I think for the first query, you are looking for something like:
select p.positive_sum, n.negative_sum
from (select sum(col1) as positive_sum from A where col1 > 0) p cross join
     (select sum(col1) as negative_sum from A where col1 < 0) n
And that you are asking whether the CASE expression is faster than the WHERE.
What you are missing is that this version needs to scan the table twice. Reading data is generally more expensive than any functions on data elements.
Sometimes the second query might have very similar performance. I can think of three cases. First is when there is a clustered index on col1. Second is when col1 is used as a partitioning key. And third is on very small amounts of data (say data that fits on a single data page).
I have a table named 'candidate' which contains, among other columns, 'score_math' and 'score_language', reflecting a candidate's scores in the respective tests. I need to
show the number of students who scored at least 60 in both math and language (versatile_candidates) and the number of students who scored below 40 in both of
these tests (poor_candidates). Don't include students with a NULL preferred_contact. My query is:
select
count(case when score_math>=60 and score_language>=60 then 1 else 0
end) as versatile_candidates,
count(case when score_math<40 and score_language<40 then 1 else 0 end) as
poor_candidates
from candidate
where preferred_contact is not null
But this always produces the total number of candidates with a non-NULL preferred contact type. I can't really figure out what I did wrong and, more importantly, why this doesn't work. (The DBMS is Postgres, if that matters.) Please help.
You're close - the reason you're getting the total number of all candidates is because COUNT() will count a 0 the same as a 1 (and any other non-NULL value, for that matter). And since the values could only ever be 0 or 1, your COUNT() will return the total number of all candidates.
Since you're already defaulting the cases that don't match to 0, all you need to do is change the COUNT() to a SUM():
Select Sum(Case When score_math >= 60
And score_language >= 60 Then 1
Else 0
End) As versatile_candidates
, Sum(Case When score_math < 40
And score_language < 40 Then 1
Else 0
End) As poor_candidates
From candidate
Where preferred_contact Is Not Null
COUNT() does not take NULL values into consideration; every value that is not NULL will be counted.
You might want to replace it with SUM()
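Alternatively, COUNT() still works if you simply drop the ELSE, because unmatched rows then yield NULL; on Postgres the FILTER clause does the same job. A sketch, assuming the same table:
select
    count(case when score_math >= 60 and score_language >= 60 then 1 end) as versatile_candidates,
    count(*) filter (where score_math < 40 and score_language < 40)       as poor_candidates
from candidate
where preferred_contact is not null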
The job is actually a machine cycle count that rolls over to zero at 32,000 but the utility / electricity / odometer analogy gets the idea across.
Let's say we have a three digit meter. After 999 it will roll over to 0.
Reading Value Difference
1 990 -
2 992 2
3 997 5
4 003 6 *
5 008 5
I have a CTE query generating the difference between rows but the line
Cur.Value - Prv.Value as Difference
on reading 4 above returns -994 due to the clock rollover. (It should return '6'.)
Can anyone suggest an SQL trick to accommodate the rollover?
e.g., here's a trick to get around SQL's lack of a GREATEST function.
-- SQL doesn't have LEAST/GREATEST functions so we use a math trick
-- to return the greater number:
-- 0.5*((A+B) + abs(A-B))
0.5 * (Cur._VALUE - Prv._VALUE + ABS(Cur._VALUE - Prv._VALUE)) AS Difference
Can anyone suggest a similar trick for the rollover problem?
Fiddle: http://sqlfiddle.com/#!3/ce9d4/10
You could use a CASE statement to detect the negative value, which indicates a rollover condition, and compensate for it:
--Create CTE
;WITH tblDifference AS
(
SELECT Row_Number()
OVER (ORDER BY Reading) AS RowNumber, Reading, Value
FROM t1
)
SELECT
Cur.Reading AS This,
Cur.Value AS ThisRead,
Prv.Value AS PrevRead,
CASE WHEN Cur.Value - Prv.Value < 0 -- this happens during a rollover
THEN Cur.Value - Prv.Value + 1000 -- compensate for the rollover
ELSE Cur.Value - Prv.Value
END as Difference
FROM
tblDifference Cur
LEFT OUTER JOIN tblDifference Prv
ON Cur.RowNumber=Prv.RowNumber+1
ORDER BY Cur.Reading
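An alternative sketch (assuming the same CTE and join) folds the rollover away with modulo arithmetic instead of a CASE; since a three-digit meter can never drop by more than 1000 between readings, adding 1000 before taking the remainder is safe:
(Cur.Value - Prv.Value + 1000) % 1000 AS Difference  -- use 32000 for the real 32,000-cycle counter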
I'm having to return ~70,000 rows of 4 columns of INTs in a specific order and can only use very shallow caching as the data involved is highly volatile and has to be up to date. One property of the data is that it is often highly repetitive when it is in order.
I've started to look at various methods of reducing the row count in order to reduce network bandwidth and client-side processing time/resources, but have not managed to find any kind of technique in T-SQL where I can 'compress' repetitive rows down into a single row and a 'count' column, e.g.
prop1 prop2 prop3 prop4
--------------------------------
0 0 1 53
0 0 2 55
1 1 1 8
1 1 1 8
1 1 1 8
1 1 1 8
0 0 2 55
0 0 2 55
0 0 1 53
Into:
prop1 prop2 prop3 prop4 count
-----------------------------------------
0 0 1 53 1
0 0 2 55 1
1 1 1 8 4
0 0 2 55 2
0 0 1 53 1
I'd estimate that if this was possible, in many cases what would be a 70,000 row result set would be down to a few thousand at most.
Am I barking up the wrong tree here (is there implicit compression as part of the SQL Server protocol)?
Is there a way to do this (SQL Server 2005)?
Is there a reason I shouldn't do this?
Thanks.
You can use the COUNT function! This will require you to use the GROUP BY clause, where you tell COUNT how to break the rows up, or group them. GROUP BY is used with any aggregate function in SQL.
select
prop1,
prop2,
prop3,
prop4,
count(*) as count
from
tbl
group by
prop1,
prop2,
prop3,
prop4,
y,
x
order by y, x
Update: The OP mentioned these are ordered by y and x, which are not part of the result set. In this case, you can still use y and x as part of the GROUP BY.
Keep in mind that order means nothing without ordering columns, so we have to respect that by including y and x in the GROUP BY.
This will work, though it is painful to look at:
;WITH Ordering
AS
(
SELECT Prop1,
Prop2,
Prop3,
Prop4,
ROW_NUMBER() OVER (ORDER BY Y, X) RN
FROM Props
)
SELECT
CurrentRow.Prop1,
CurrentRow.Prop2,
CurrentRow.Prop3,
CurrentRow.Prop4,
CurrentRow.RN -
ISNULL((SELECT TOP 1 RN
        FROM Ordering O3
        WHERE RN < CurrentRow.RN
          AND (CurrentRow.Prop1 <> O3.Prop1
            OR CurrentRow.Prop2 <> O3.Prop2
            OR CurrentRow.Prop3 <> O3.Prop3
            OR CurrentRow.Prop4 <> O3.Prop4)
        ORDER BY RN DESC), 0) Repetitions
FROM Ordering CurrentRow
LEFT JOIN Ordering O2 ON CurrentRow.RN + 1 = O2.RN
WHERE O2.RN IS NULL OR (CurrentRow.Prop1 <> O2.Prop1 OR CurrentRow.Prop2 <> O2.Prop2 OR CurrentRow.Prop3 <> O2.Prop3 OR CurrentRow.Prop4 <> O2.Prop4)
ORDER BY CurrentRow.RN
The gist is the following:
- Enumerate each row using ROW_NUMBER() OVER to get the correct order.
- Find the maximums per cycle by joining only when the next row has different fields or when the next row does not exist.
- Figure out the count of repetitions by taking the current row number (presumed to be the max for this cycle) and subtracting from it the maximum row number of the previous cycle, if it exists.
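For reference, here is a shorter sketch using the classic "islands" trick: the difference of the two ROW_NUMBER values below is constant within each run of identical rows, so grouping by it collapses consecutive duplicates. It also works on SQL Server 2005 and assumes the same Props table with ordering columns Y and X as above:
;WITH Ordered AS
(
    SELECT Prop1, Prop2, Prop3, Prop4,
           ROW_NUMBER() OVER (ORDER BY Y, X) AS rn,
           ROW_NUMBER() OVER (PARTITION BY Prop1, Prop2, Prop3, Prop4 ORDER BY Y, X) AS grp_rn
    FROM Props
)
SELECT Prop1, Prop2, Prop3, Prop4, COUNT(*) AS [count]
FROM Ordered
GROUP BY Prop1, Prop2, Prop3, Prop4, rn - grp_rn
ORDER BY MIN(rn)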
70,000 rows of four integer columns is not really a worry for bandwidth on a modern LAN, unless you have many workstations executing this query concurrently; and on a WAN with more restricted bandwidth you could use DISTINCT to eliminate duplicate rows, an approach which would be frugal with your bandwidth but consume some server CPU. Again, however, unless you have a really overloaded server that is always performing at or near peak loads, this additional consumption would be a mere blip. 70,000 rows is next to nothing.