WHILE Window Operation with Different Starting Point Values From Column - SQL Server [duplicate] - sql

In SQL there are aggregation operators, like AVG, SUM, COUNT. Why doesn't it have an operator for multiplication? "MUL" or something.
I was wondering, does it exist for Oracle, MSSQL, MySQL ? If not is there a workaround that would give this behaviour?

By MUL do you mean progressive multiplication of values?
Even with 100 rows of some small size (say 10s), your MUL(column) is going to overflow any data type! With such a high probability of mis/ab-use, and very limited scope for use, it does not need to be a SQL Standard. As others have shown there are mathematical ways of working it out, just as there are many many ways to do tricky calculations in SQL just using standard (and common-use) methods.
Sample data:
Column
1
2
4
8
COUNT : 4 items (1 for each non-null)
SUM : 1 + 2 + 4 + 8 = 15
AVG : 3.75 (SUM/COUNT)
MUL : 1 x 2 x 4 x 8 ? ( =64 )
For completeness, the Oracle, MSSQL, MySQL core implementations *
Oracle : EXP(SUM(LN(column))) or POWER(N,SUM(LOG(column, N)))
MSSQL : EXP(SUM(LOG(column))) or POWER(N,SUM(LOG(column)/LOG(N)))
MySQL : EXP(SUM(LOG(column))) or POW(N,SUM(LOG(N,column)))
Care when using EXP/LOG in SQL Server, watch the return type http://msdn.microsoft.com/en-us/library/ms187592.aspx
The POWER form allows for larger numbers (using bases larger than Euler's number), and in cases where the result grows too large to turn it back using POWER, you can return just the logarithmic value and calculate the actual number outside of the SQL query
* LOG(0) and LOG(-ve) are undefined. The below shows only how to handle this in SQL Server. Equivalents can be found for the other SQL flavours, using the same concept
create table MUL(data int)
insert MUL select 1 yourColumn union all
select 2 union all
select 4 union all
select 8 union all
select -2 union all
select 0
select CASE WHEN MIN(abs(data)) = 0 then 0 ELSE
EXP(SUM(Log(abs(nullif(data,0))))) -- the base mathematics
* round(0.5-count(nullif(sign(sign(data)+0.5),1))%2,0) -- pairs up negatives
END
from MUL
Ingredients:
taking the abs() of data, if the min is 0, multiplying by whatever else is futile, the result is 0
When data is 0, NULLIF converts it to null. The abs(), log() both return null, causing it to be precluded from sum()
If data is not 0, abs allows us to multiple a negative number using the LOG method - we will keep track of the negativity elsewhere
Working out the final sign
sign(data) returns 1 for >0, 0 for 0 and -1 for <0.
We add another 0.5 and take the sign() again, so we have now classified 0 and 1 both as 1, and only -1 as -1.
again use NULLIF to remove from COUNT() the 1's, since we only need to count up the negatives.
% 2 against the count() of negative numbers returns either
--> 1 if there is an odd number of negative numbers
--> 0 if there is an even number of negative numbers
more mathematical tricks: we take 1 or 0 off 0.5, so that the above becomes
--> (0.5-1=-0.5=>round to -1) if there is an odd number of negative numbers
--> (0.5-0= 0.5=>round to 1) if there is an even number of negative numbers
we multiple this final 1/-1 against the SUM-PRODUCT value for the real result

No, but you can use Mathematics :)
if yourColumn is always bigger than zero:
select EXP(SUM(LOG(yourColumn))) As ColumnProduct from yourTable

I see an Oracle answer is still missing, so here it is:
SQL> with yourTable as
2 ( select 1 yourColumn from dual union all
3 select 2 from dual union all
4 select 4 from dual union all
5 select 8 from dual
6 )
7 select EXP(SUM(LN(yourColumn))) As ColumnProduct from yourTable
8 /
COLUMNPRODUCT
-------------
64
1 row selected.
Regards,
Rob.

With PostgreSQL, you can create your own aggregate functions, see http://www.postgresql.org/docs/8.2/interactive/sql-createaggregate.html
To create an aggregate function on MySQL, you'll need to build an .so (linux) or .dll (windows) file. An example is shown here: http://www.codeproject.com/KB/database/mygroupconcat.aspx
I'm not sure about mssql and oracle, but i bet they have options to create custom aggregates as well.

You'll break any datatype fairly quickly as numbers mount up.
Using LOG/EXP is tricky because of numbers <= 0 that will fail when using LOG. I wrote a solution in this question that deals with this

Using CTE in MS SQL:
CREATE TABLE Foo(Id int, Val int)
INSERT INTO Foo VALUES(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)
;WITH cte AS
(
SELECT Id, Val AS Multiply, row_number() over (order by Id) as rn
FROM Foo
WHERE Id=1
UNION ALL
SELECT ff.Id, cte.multiply*ff.Val as multiply, ff.rn FROM
(SELECT f.Id, f.Val, (row_number() over (order by f.Id)) as rn
FROM Foo f) ff
INNER JOIN cte
ON ff.rn -1= cte.rn
)
SELECT * FROM cte

Not sure about Oracle or sql-server, but in MySQL you can just use * like you normally would.
mysql> select count(id), count(id)*10 from tablename;
+-----------+--------------+
| count(id) | count(id)*10 |
+-----------+--------------+
| 961 | 9610 |
+-----------+--------------+
1 row in set (0.00 sec)

Related

After Success Execution i got error :Arithmetic overflow error converting expression to data type int

I am reading an sql article on how to cleverly use sql queries ,i get the bellow article which is calculating factorial and its working for number 5 but when i change to another number bigger than 12 it push the error:Msg 8115, Level 16, State 2, Line 1
Arithmetic overflow error converting expression to data type int.
I think about 12 is the sum of numbers<=5 which are 5 4 3 2
the bellow query is working for all num<12 but when i change num to bigger than 12
it does not work even if i change the condition "num<"
with fact as (
select 1 as fac, 1 as num
union all
select fac*(num+1), num+1
from fact
where num<12)
select fac
from fact
where num=5
Your code is a little strange -- counting up and then using the outer where to stop.
I would just put in one number and count down. If you use decimal(38, 0), you can go much higher:
with fact as (
select cast(1 as decimal(38, 0)) as fac, 33 as num
union all
select fac * num, num - 1
from fact
where num > 0
)
select max(fac)
from fact;
This returns 8,683,317,618,811,886,495,518,194,401,280,000,000. That seems sufficiently large for most practical application.
The result of the calculation when using the number 13 is greater than the range of values an integer can hold
Instead you will have to convert to BIGINT
with fact as (
select CONVERT(BIGINT,1) as fac,
1 as num
union all
select CONVERT(BIGINT,fac*(num+1)),
num+1
from fact
where num<13
)
select fac
from fact
where num=13

Quicker with SQL to SUM 0 values or exclude them?

I want to SUM a lot of rows.
Is it quicker (or better practice, etc) to do Option A or Option B?
Option A
SELECT
[Person]
SUM([Value]) AS Total
FROM
Database
WHERE
[Value] > 0
GROUP BY
[Person]
Option B
SELECT
[Person]
SUM([Value]) AS Total
FROM
Database
GROUP BY
[Person]
So if I have, for Person X:
0, 7, 0, 6, 0, 5, 0, 0, 0, 4, 0, 9, 0, 0
Option A does:
a) Remove zeros
b) 7 + 6 + 5 + 4 + 9
Option B does:
a) 0 + 7 + 0 + 6 + 0 + 5 + 0 + 0 + 0 + 4 + 0 + 9 + 0 + 0
Option A has less summing, because it has fewer records to sum, because I've excluded the load that have a zero value. But Option B doesn't need a WHERE clause.
Anyone got an idea as to whether either of these are significantly quicker/better than the other? Or is it just something that doesn't matter either way?
Thanks :-)
Well, if you have a filtered index that exactly matches the where clause, and if that index removes a significant amount of data (as in: a good chunk of the data is zeros), then definitely the first... If you don't have such an index: then you'll need to test it on your specific data, but I would probably expect the unfiltered scenario to be faster, as it can do use a range of tricks to do the sum if it doesn't need to do branching etc.
However, the two examples aren't functionally equivalent at the moment (the second includes negative values, the first doesn't).
Assuming that Value is always positive the 2nd query might still return less rows if there's a Person with all zeroes.
Otherwise you should simply test actual runtime/CPU on a really large amount of rows.
As already pointed out, the two are not functionally equivalent. In addition to the differences already pointed out (negative values, different output row count), Option B also filters out rows where Value is NULL. Option A doesn't.
Based on the Execution plan for both of these and using a small dataset similar to the one you provided, Option B is slightly faster with an Estimated Subtree Cost of .0146636 vs .0146655. However, you may get different results depending on the query or size of dataset. Only option is to test and see for yourself.
http://www.developer.com/db/how-to-interpret-query-execution-plan-operators.html
Drop Table #Test
Create Table #Test (Person nvarchar(200), Value int)
Insert Into #Test
Select 'Todd', 12 Union
Select 'Todd', 11 Union
Select 'Peter', 20 Union
Select 'Peter', 29 Union
Select 'Griff', 10 Union
Select 'Griff', 0 Union
Select 'Peter', 0 Union
SELECT [Person], SUM([Value]) AS Total
FROM #Test
WHERE [Value] > 0
GROUP BY [Person]
SELECT [Person],SUM([Value]) AS Total
FROM #Test
GROUP BY [Person]

How to find a value, in which Range that it lies without using Between or Comparison operator in expression

I have a Table
Low High Item
0 45 A
45.01 75.24 B
75.24 108.00 C
108..01 122.00 D
Example : If My input is 30 should find in which range does it lies and return the corresponding Item i.e, A ( without using Between or Comparison operator in expression)
Assuming the ranges are non-overlapping:
select top 1 t.*
from table t
order by (#input - low)*(high - #input) desc;
The expression (#input - low)*(high - #input) should be positive only for values in the range. The descending sort will put them first.
This is using SQL Server syntax. Other databases might use limit or something else.
Why not this
Qyery 1 :
Select *
From Table1
Where 30 > Low AND 30 < High
SQL FIDDLE DEMO
Qyery 2 :
Select *
From Table1
Where 50.02 Between Low and high
SQL FIDDLE DEMO
Query 3 :
select t.*
from table1 t
where high - 99.9 > 0 and Low - 99.99 < 0
SQL FIDDLE DEMO
Hope this help you

Ordered count of consecutive repeats / duplicates

I highly doubt I'm doing this in the most efficient manner, which is why I tagged plpgsql on here. I need to run this on 2 billion rows for a thousand measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity for spurts often but sometimes for a long time. You need to aggregate but when you do so, you need to look at how long it was repeating and make various filters based on that information. Say you are measuring mpg on a car but it's stuck at 20 mpg for an hour than moves around to 20.1 and so on. You'll want to evaluate the accuracy when it's stuck. You could also place some alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado:
--here's my data, you have different systems, the time of measurement, and the actual measurement
--as well, the raw data has whether or not it's a repeat (hense the included window function
select * into temporary table cumulative_repeat_calculator_data
FROM
(
select
system_measured, time_of_measurement, measurement,
case when
measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
then 1 else 0 end as repeat
FROM
(
SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
UNION
SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
UNION
SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
) as data
) as data;
--unfortunately you can't have window functions within window functions, so I had to break it down into subquery
--what we need is something to partion on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a row that stays the same when your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
select
*,
sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
from cumulative_repeat_calculator_data
order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example
select *,
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement
So, what would you do differently in order to run this on a huge table, or what alternative tools would you use? I'm thinking plpgsql because I suspect this needs to done in-database, or during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to sub-queries?
I have tested one alternative method, but it still relies on a sub-query and I think this is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join to the larger table and if the timestamp is between those, you classify it as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of 'events'. Do you think that's a better way to go?
Test case
First, a more useful way to present your data - or even better, in an sqlfiddle, ready to play with:
CREATE TEMP TABLE data(
system_measured int
, time_of_measurement int
, measurement int
);
INSERT INTO data VALUES
(1, 1, 5)
,(1, 2, 150)
,(1, 3, 5)
,(1, 4, 5)
,(2, 1, 5)
,(2, 2, 5)
,(2, 3, 5)
,(2, 4, 5)
,(2, 5, 150)
,(2, 6, 5)
,(2, 7, 5)
,(2, 8, 5);
Simplified query
Since it remains unclear, I am assuming only the above as given.
Next, I simplified your query to arrive at:
WITH x AS (
SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
ORDER BY time_of_measurement) = measurement
THEN 0 ELSE 1 END AS step
FROM data
)
, y AS (
SELECT *, sum(step) OVER(PARTITION BY system_measured
ORDER BY time_of_measurement) AS grp
FROM x
)
SELECT * ,row_number() OVER (PARTITION BY system_measured, grp
ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM y
ORDER BY system_measured, time_of_measurement;
Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do it in a single table scan where this query needs at least three scans.
Faster with plpgsql function:
CREATE OR REPLACE FUNCTION x.f_repeat_ct()
RETURNS TABLE (
system_measured int
, time_of_measurement int
, measurement int, repeat_ct int
) LANGUAGE plpgsql AS
$func$
DECLARE
r data; -- table name serves as record type
r0 data;
BEGIN
-- SET LOCAL work_mem = '1000 MB'; -- uncomment an adapt if needed, see below!
repeat_ct := 0; -- init
FOR r IN
SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
LOOP
IF r.system_measured = r0.system_measured
AND r.measurement = r0.measurement THEN
repeat_ct := repeat_ct + 1; -- start new array
ELSE
repeat_ct := 0; -- start new count
END IF;
RETURN QUERY SELECT r.*, repeat_ct;
r0 := r; -- remember last row
END LOOP;
END
$func$;
Call:
SELECT * FROM x.f_repeat_ct();
Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters which would take precedence if not qualified.
Billions of rows
If you have billions of rows, you may want to split this operation up. I quote the manual here:
Note: The current implementation of RETURN NEXT and RETURN QUERY
stores the entire result set before returning from the function, as
discussed above. That means that if a PL/pgSQL function produces a
very large result set, performance might be poor: data will be written
to disk to avoid memory exhaustion, but the function itself will not
return until the entire result set has been generated. A future
version of PL/pgSQL might allow users to define set-returning
functions that do not have this limitation. Currently, the point at
which data begins being written to disk is controlled by the work_mem
configuration variable. Administrators who have sufficient memory to
store larger result sets in memory should consider increasing this
parameter.
Consider computing rows for one system at a time or set a high enough value for work_mem to cope with the load. Follow the link provided in the quote on more about work_mem.
One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for for the current transaction. I added a commented line in the function. Do not set it very high globally, as this could nuke your server. Read the manual.

Counting non-null columns in a rather strange way

I have a table which has 32 columns in an Oracle table.
Two of these columns are identity columns
the rest are values
I would like to get the average of all the value columns, which is complicated by the null (identity) columns. Below is the pseudocode for what I am trying to achieve:
SELECT
((nvl(val0, 0) + nvl(val1, 0) + ... nvl(valn, 0))
/ nonZero_Column_Count_In_This_Row)
Such that: nonZero_Column_Count_In_This_Row = (ifNullThenZeroElse1(val0) + ifNullThenZeroElse1(val1) ... ifNullThenZeroElse(valn))
The difficulty here is of course in getting 1 for any non-null column. It seems I need a function similar to NVL, but with an else clause. Something that will return 0 if the value is null, but 1 if not, rather than the value itself.
How should I go about about getting the value for the denominator?
PS: I feel I must explain some motivation behind this design. Ideally this table would have been organized as the identity columns and one value per row with some identifier for the row itself. This would have made it more normalized and the solution to this problem would have been pretty simple. The reasons for it not to be done like this are throughput, and saving space. This is a huge DB where we insert 10 million values per minute into. Making each of these values one row would mean 10M rows per minute, which is definitely not attainable. Packing 30 of them into a single row reduces the number of rows inserted to something we can do with a single DB, and the overhead data amount (the identity data) much less.
(Case When col is null then 0 else 1 end)
You could use NVL2(val0, 1, 0) + NVL2(val1, 1, 0) + ... since you are using Oracle.
Another option is to use the AVG function, which ignores NULLs:
SELECT AVG(v) FROM (
WITH q AS (SELECT val0, val1, val2, val3 FROM mytable)
SELECT val0 AS v FROM q
UNION ALL SELECT val1 FROM q
UNION ALL SELECT val2 FROM q
UNION ALL SELECT val3 FROM q
);
If you're using Oracle11g you can use the UNPIVOT syntax to make it even simpler.
I see this is a pretty old question, but I don't see a sufficient answer. I had a similar problem, and below is how I solved it. It's pretty clear a case statement is needed. This solution is a workaround for such cases where
SELECT COUNT(column) WHERE column {IS | IS NOT} NULL
does not work for whatever reason, or, you need to do several
SELECT COUNT ( * )
FROM A_TABLE
WHERE COL1 IS NOT NULL;
SELECT COUNT ( * )
FROM A_TABLE
WHERE COL2 IS NOT NULL;
queries but want it as a data set when you run the script. See below; I use this for analysis and it's been working great for me so far.
SUM(CASE NVL(valn, 'X')
WHEN 'X'
THEN 0
ELSE 1
END) as COLUMN_NAME
FROM YOUR_TABLE;
Cheers!
Doug
Generically, you can do something like this:
SELECT (
(COALESCE(val0, 0) + COALESCE(val1, 0) + ...... COALESCE(valn, 0))
/
(SIGN(ABS(COALESCE(val0, 0))) + SIGN(ABS(COALESCE(val1, 0))) + .... )
) AS MyAverage
The top line will return the sum of values (omitting NULL values) whereas the bottom line will return the number of non-null values.
FYI - it's SQL Server syntax, but COALESCE is just like ISNULL for the most part. SIGN just returns -1 for a negative number, 0 for zero, and 1 for a positive number. ABS is "absolute value".