So I have two identical values that are results of sum functions with the exact same length (no rounding is being done). (Update, data type is float)
Value_1 = 29.9539194336501
Value_2 = 29.9539194336501
The issue I'm having is when I do an IF statement for Value_1 = Value_2, it comes up as FALSE.
Value_1:
SELECT SUM([INVN_DOL])/SUM([AVG_DLY_SLS_LST_35_DYS]) as DSO
FROM TABLE A
Value_2:
SELECT SUM ([Total_Inventory_Val]) / SUM ([Daily_Independent_Demand])
FROM TABLE B
Any idea why they may not be exactly equal and what I can do to get a TRUE value since they do match?
Thanks in advance
The issue here is that you are using a calculated value held in a float, which by design is slightly imprecise in its last digits, and that is why you are getting the mismatch.
Use data types like decimal with a defined precision and scale to hold your values and calculation results and you should get consistent results.
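You can use ROUND on both values to limit the number of decimal places before comparing them.
Or
Use ABS: treat the values as equal when ABS(Value_1 - Value_2) is smaller than a small tolerance.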
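The effect described in these answers can be reproduced outside the database. The following is a minimal Python sketch, not the original SQL: the numbers are made up for illustration, and Python's float is assumed to behave like the SQL float type (both are normally IEEE 754 doubles).

```python
from decimal import Decimal

# Hypothetical data: the same three numbers summed in a different order.
value_1 = (0.1 + 0.2) + 0.3
value_2 = 0.1 + (0.2 + 0.3)

# Displayed to 15 significant digits, the two sums look identical ...
shown_1 = f"{value_1:.15g}"
shown_2 = f"{value_2:.15g}"

# ... but the underlying doubles differ in the last bit.
exactly_equal = value_1 == value_2

# Option 1: compare with a tolerance (the ABS approach).
close_enough = abs(value_1 - value_2) < 1e-9

# Option 2: round both sides before comparing (the ROUND approach).
rounded_equal = round(value_1, 10) == round(value_2, 10)

# Option 3: do the arithmetic in an exact decimal type.
dec_1 = (Decimal("0.1") + Decimal("0.2")) + Decimal("0.3")
dec_2 = Decimal("0.1") + (Decimal("0.2") + Decimal("0.3"))
decimal_equal = dec_1 == dec_2

print(shown_1, shown_2, exactly_equal, close_enough, rounded_equal, decimal_equal)
```

The tolerance (1e-9) and the rounding depth (10 places) are arbitrary choices for this sketch; pick values that match the precision your data actually needs.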
I have encountered a situation where I can't explain how Redshift handles division of SUMs.
Here is an example table:
create table public.datatype_test(
a numeric(19,6),
b numeric(19,6));
insert into public.datatype_test values(222222.2222, 333333.3333);
insert into public.datatype_test values(444444.4444, 666666.6666);
Now I try to run query:
select sum(a)/sum(b) from public.datatype_test;
I get result 0.6666 (4 decimals). It is not related to tool display, it really returns only 4 decimal places, and it doesn't matter how big or small numbers are in table. In my case 4 decimals is not precise enough.
The same holds true if I use AVG instead of SUM.
If I use MAX instead of SUM, I get: 0.6666666666666666666 (19 decimals).
It also returns the correct result (0.6666666666666667) when no physical table is used:
with t as (
select 222222.2222::numeric(19,6) as a, 333333.3333::numeric(19,6) as b union all
select 444444.4444::numeric(19,6) as a, 666666.6666::numeric(19,6) as b
)
select sum(a)/sum(b) as d from t;
I have looked into Redshift documentation about SUM and Computations with Numeric Values, but I still don't get result according to documentation.
Using float datatype for table columns is not an option as I need to store precise currency amounts and 15 significant digits is not enough.
Using cast on SUM aggregation also gives 0.6666666666666666666 (19 decimals).
select sum(a)::numeric(19,6)/sum(b) from public.datatype_test;
But it looks wrong, and I can't force BI tools to apply this workaround; nobody who uses this data should have to use it either.
I have tried the same test in PostgreSQL 10, and it works as it should, returning a sufficient number of decimals for the division.
Is there anything I can do with database setup to avoid casting in SQL Query?
Any advice or guidance is highly appreciated.
Redshift version:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.4081
Using dc2.8xlarge nodes
I have run into similar issues, and although I don't have a solution that doesn't require a workaround, I can at least explain it.
The precision/scale of the result of division is defined by the rules in the "computations with numeric values" document.
A consequence of those rules is that a decimal(19,6) divided by another decimal(19,6) will return decimal(38,19).
What's happening to you, though, is that MAX returns the same precision/scale as the underlying column, but SUM returns decimal(38,*) no matter what.
(This is probably a safety precaution to prevent overflow on sums of "big data"). If you divide decimal(38,6) by another, you get decimal(38,4).
AWS support will probably not consider this a defect -- there is no SQL standard for how to treat decimal precision in division, and given that this is documented behavior, it's probably a deliberate decision.
The only way to address this is to typecast the numerator, or multiply it by something like sum(a) * cast(1 as decimal(10,9)) which is portable SQL and will force more decimal places in the numerator and thus the result.
As a convenience I made a calculator in JSFiddle with the rules so you can play around with different options:
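function divisionResultType(p1, s1, p2, s2) {
  // (p1, s1) is the numerator's precision/scale; (p2, s2) is the denominator's
  let scale = Math.max(4, s1 + p2 - s2 + 1)
  let precision = p1 - s1 + s2 + scale
  if (precision > 38) {
    scale = Math.max((38 + scale - precision), 4)
    precision = 38
  }
  return { precision, scale }
}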
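To check the cases mentioned in this answer, here is a Python port of the same rules applied to the three divisions from the question. The function name and the variable names are made up for this sketch; the arithmetic is copied from the calculator above.

```python
def division_result_type(p1, s1, p2, s2):
    """Return (precision, scale) of decimal(p1,s1) / decimal(p2,s2)."""
    scale = max(4, s1 + p2 - s2 + 1)
    precision = p1 - s1 + s2 + scale
    if precision > 38:
        # Precision is capped at 38; the scale shrinks, but never below 4.
        scale = max(38 + scale - precision, 4)
        precision = 38
    return precision, scale

# MAX(a)/MAX(b): both operands keep the column type decimal(19,6).
max_case = division_result_type(19, 6, 19, 6)

# SUM(a)/SUM(b): both operands are widened to decimal(38,6).
sum_case = division_result_type(38, 6, 38, 6)

# SUM(a)::numeric(19,6)/SUM(b): only the denominator is decimal(38,6).
cast_case = division_result_type(19, 6, 38, 6)

print(max_case, sum_case, cast_case)
```

Under these rules the MAX and cast variants come out as decimal(38,19), while the plain SUM variant collapses to decimal(38,4), which matches the 4 and 19 decimal places reported in the question.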
I have a table with three columns: traffic INTEGER, revtraffic INTEGER, ratio REAL.
When I update the ratio column in the following way:
UPDATE table
SET ratio = revtraffic/traffic
It returns 0.0 in all cells. However, if write:
SET ratio = revtraffic*100/traffic
then it displays the right results but obviously, 100x magnitude too much. What is going on?
revtraffic and traffic are integers, so dividing them will be done by integer division, which returns only the "whole" part of the division (i.e, the part to the left of the decimal point). This result is then promoted to a real when you assign it to a real.
You can work around this problem by explicitly casting one of the arguments to a real:
UPDATE table SET ratio = CAST(revtraffic AS REAL)/traffic
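The behaviour is easy to reproduce. The sketch below uses Python's built-in sqlite3 module as a stand-in, since the question does not name the database; the table name stats and the sample row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring the three columns from the question.
conn.execute("CREATE TABLE stats (traffic INTEGER, revtraffic INTEGER, ratio REAL)")
conn.execute("INSERT INTO stats (traffic, revtraffic) VALUES (10, 3)")

def updated_ratio(expression):
    """Run the UPDATE with the given expression and return the stored ratio."""
    conn.execute(f"UPDATE stats SET ratio = {expression}")
    return conn.execute("SELECT ratio FROM stats").fetchone()[0]

# Integer / integer: the fractional part is discarded before the assignment.
int_ratio = updated_ratio("revtraffic / traffic")

# Multiplying first keeps more digits, but the result is 100 times too large.
scaled_ratio = updated_ratio("revtraffic * 100 / traffic")

# Casting one operand makes the whole division floating point.
real_ratio = updated_ratio("CAST(revtraffic AS REAL) / traffic")

print(int_ratio, scaled_ratio, real_ratio)
```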
I'm trying to query a database; I need to get a list of customers whose weight is equal to 60.5. The problem is that 60.5 is a real, and I've never queried a database with a real in a WHERE clause before.
I've tried this:
SELECT Name FROM Customers WHERE Weight=60.5
SELECT Name FROM Customers WHERE Weight=cast(60.5 as real)
SELECT Name FROM Customers WHERE Weight=cast(60.5 as decimal)
SELECT Name FROM Customers WHERE Weight=convert(real,'60.5')
SELECT Name FROM Customers WHERE Weight=convert(decimal,'60.5')
These queries return 0 rows, but in the Customers table there are 10 rows with Weight=60.5
Your problem is that floating point numbers are inaccurate by definition. Comparing what seems to be 60.5 to a literal 60.5 might not work as you've noticed.
A typical solution is to measure the difference between the 2 values, and if it's smaller than some predefined epsilon, consider them equal:
SELECT Name FROM Customers WHERE ABS(Weight-60.5) < 0.001
For better performance, you should actually use:
SELECT Name FROM Customers WHERE Weight BETWEEN 60.499 AND 60.501
If you need equality comparison, you should change the type of the column to DECIMAL. Decimal numbers are stored and compared exactly, while real and float numbers are approximations.
@Amit's answer will work, but it will perform quite poorly in comparison to my approach. ABS(Weight-60.5) < 0.001 is unable to use index seeks. But if you convert the column to DECIMAL, then Weight=60.5 will perform well and use index seeks.
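The three comparison styles from these answers can be sketched in Python. This is an illustration under assumptions, not the asker's database: a SQL REAL is taken to be a 4-byte IEEE 754 float, simulated here with the struct module, and the value 60.1 is used because it has no exact binary representation.

```python
import struct

def as_real(x: float) -> float:
    """Round-trip x through a 4-byte float, as a REAL column would store it."""
    return struct.unpack("f", struct.pack("f", x))[0]

stored = as_real(60.1)   # what the column actually holds
literal = 60.1           # the literal in the WHERE clause (a double here)

# Direct equality fails: the stored float is only the nearest 4-byte value.
exact_match = stored == literal

# Epsilon comparison, as in ABS(Weight - x) < 0.001.
epsilon_match = abs(stored - literal) < 0.001

# Range comparison, as in Weight BETWEEN x - 0.001 AND x + 0.001.
range_match = (literal - 0.001) <= stored <= (literal + 0.001)

print(stored, exact_match, epsilon_match, range_match)
```

The range form expresses the same tolerance as the epsilon form, but leaves the column bare on one side of the comparison, which is what allows an index to be used.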
I'm trying to find the geometric average of values from a table with millions of rows. For those that don't know, to find the geometric average, you multiply all the values together and then take the Nth root of the product, where N is the number of rows.
You probably already see the problem: the running product will quickly exceed the maximum value the system allows. I found a great solution that uses the natural log.
http://timothychenallen.blogspot.com/2006/03/sql-calculating-geometric-mean-geomean.html
However that got me to wonder wouldn't the same problem apply with the arithmetic mean? If you have N records, and N is very large the running sum can also exceed the system maximum.
So how do RDBMSs calculate averages during queries?
I don't know an exact implementation for arithmetic mean in an RDBMS, nor did you specify one in your original question. But the RDBMS does not need to sum a million rows in a column in order to obtain the arithmetic mean. Consider the following summation:
sum = (x1 + x2 + x3 + ... + x1000000)
Then the mean can be written as
mean = sum / N = (x1 + x2 + x3 + ... + x1000000) / N, for N = 1,000,000
But this expression can be broken up into pieces like this:
mean = [(x1 + x2 + x3) / N ] + [(x4 + x5 + x6) / N] + ...
In other words, the RDBMS can simply scan down the million rows in a column and find the mean section by section, without running the risk of an overflow. And since each number in the column is presumably within range for the type storing it, there is no chance of the mean value itself overflowing.
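The decomposition above can be written as a short Python sketch. It only illustrates the arithmetic; whether any particular RDBMS computes averages this way is not established here, and the function name and section size are assumptions.

```python
def sectioned_mean(values, section_size=3):
    """Accumulate the mean section by section: sum(section) / N for each section."""
    n = len(values)
    mean = 0.0
    for start in range(0, n, section_size):
        section = values[start:start + section_size]
        # Each partial term is at most section_size * max(values) / n,
        # so no intermediate value approaches the full sum.
        mean += sum(section) / n
    return mean

result = sectioned_mean([1, 2, 3, 4, 5, 6])
print(result)
```

The trade-off is that every section adds a floating-point division, so the result can carry slightly more rounding error than a single exact sum followed by one division.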
Most databases don't support a product() function the way they support an average.
However, you can do what you want with logs. The product (simplified) is like:
select exp(sum(ln(x))) as product
The average would be:
select power(exp(sum(ln(x))), 1.0 / count(*)) as geoaverage
or
select EXP(AVG(LN(x))) as geoaverage
The LN() function might be LOG() on some platforms...
These are schematics. The functions for exp() and ln() and power() vary, depending on the database. Plus, if you have to take into account zero or negative numbers, the logic is more complicated.
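The same log trick can be sketched in Python to show why it sidesteps the overflow. The sample values are made up, and the function assumes every input is strictly positive, as noted above.

```python
import math

def geometric_mean(values):
    """exp(avg(ln(x))); assumes every value is strictly positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

small = geometric_mean([2.0, 8.0])   # sqrt(2 * 8), approximately 4

# Hypothetical large values: the direct product overflows a double ...
big_values = [1e200, 1e200]
naive_product = math.prod(big_values)

# ... while the sum of logs stays small (about 460.5 per value).
big = geometric_mean(big_values)

print(small, naive_product, big)
```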
Very easy to check. For example, SQL Server 2008.
DECLARE @T TABLE(i int);
INSERT INTO @T(i) VALUES
(2147483647),
(2147483647);
SELECT AVG(i) FROM @T;
result
(2 row(s) affected)
Msg 8115, Level 16, State 2, Line 7
Arithmetic overflow error converting expression to data type int.
There is no magic. Column type is int, server adds values together using internal variable of the same type int and intermediary result exceeds range for int.
You can run the similar check for any other DBMS that you use. Different engines may behave differently, but I would expect all of them to stick to the original type of the column. For example, averaging two int values 100 and 101 may result in 100 or 101 (still int), but never 100.5.
For SQL Server this behavior is documented. I would expect something similar for all other engines:
AVG () computes the average of a set of values by dividing the sum of
those values by the count of nonnull values. If the sum exceeds the
maximum value for the data type of the return value an error will be
returned.
So, you have to be careful when calculating simple average as well, not just product.
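The behaviour described above can be imitated in Python, whose integers never overflow on their own. This is a simulation under stated assumptions, not SQL Server's implementation: the 32-bit limits, the function name and the explicit range check are all added for illustration.

```python
INT_MIN = -2**31
INT_MAX = 2**31 - 1

def int_avg(values):
    """Average using a simulated 32-bit accumulator and integer division."""
    total = 0
    for v in values:
        total += v
        if not INT_MIN <= total <= INT_MAX:
            raise OverflowError("sum exceeds the range of int")
    # Floor division; equals truncation for the non-negative totals used here.
    return total // len(values)

try:
    int_avg([INT_MAX, INT_MAX])
    overflowed = False
except OverflowError:
    overflowed = True

truncated = int_avg([100, 101])   # stays an integer, never 100.5

print(overflowed, truncated)
```

In SQL Server a common way around the overflow is to widen the input first, for example AVG(CAST(i AS bigint)).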
Here is extract from SQL 92 Standard:
6) Let DT be the data type of the <value expression>.
9) If SUM or AVG is specified, then:
a) DT shall not be character string, bit string, or datetime.
b) If SUM is specified and DT is exact numeric with scale S, then the
data type of the result is exact numeric with implementation-defined
precision and scale S.
c) If AVG is specified and DT is exact numeric, then the data type of
the result is exact numeric with implementation-defined precision not
less than the precision of DT and implementation-defined scale not
less than the scale of DT.
d) If DT is approximate numeric, then the data type of the result is
approximate numeric with implementation-defined precision not less
than the precision of DT.
e) If DT is interval, then the data type of the result is interval
with the same precision as DT.
So a DBMS can convert int to a larger type when calculating AVG, but it has to be an exact numeric type, not floating-point. In any case, depending on the values, you can still get an arithmetic overflow.
Some DBMS — specifically, the Informix DBMS — convert from an INT type to a floating point type to do the calculation:
SQL[2148]: create table t(i int);
SQL[2149]: insert into t values(214748347);
SQL[2150]: insert into t values(214748347);
SQL[2151]: insert into t values(214748347);
SQL[2152]: select avg(i) from t;
214748347.0
SQL[2153]: types on;
SQL[2154]: select i from t;
INTEGER
214748347
214748347
214748347
SQL[2155]: select avg(i) from t;
DECIMAL(32)
214748347.0
SQL[2156]:
Similarly with other types. This can still end with an overflow under some circumstances; you then get a runtime error. However, it is rather seldom that you exceed the precision — it typically takes a very large number of rows for the sum to exceed the limits, even if you're counting the US deficit over the next century in atto-Zimbabwean dollars circa 2009.
I want to store a value that represents a percent in SQL Server. What data type should be the preferred one?
You should use decimal(p,s) in 99.9% of cases.
Percent is only a presentation concept: 10% is still 0.1.
Simply choose precision and scale for the highest expected values/desired decimal places when expressed as real numbers. You can have p = s for values < 100% and simply decide based on decimal places.
However, if you do need to store 100% or 1, then you'll need p = s+1.
This then allows up to 9.xxxxxx or 9xx.xxxx%, so I'd add a check constraint to keep it maximum of 1 if this is all I need.
decimal(p, s) and numeric(p, s)
p (precision):
The maximum total number of decimal digits that will be stored (both to the left and to the right of the decimal point)
s (scale):
The number of decimal digits that will be stored to the right of the decimal point (-> s defines the number of decimal places)
0 <= s <= p.
p ... total number of digits
s ... number of digits to the right of the decimal point
p-s ... number of digits to the left of the decimal point
Example:
CREATE TABLE dbo.MyTable
( MyDecimalColumn decimal(5,2)
,MyNumericColumn numeric(10,5)
);
INSERT INTO dbo.MyTable VALUES (123, 12345.12);
SELECT MyDecimalColumn, MyNumericColumn FROM dbo.MyTable;
Result:
MyDecimalColumn: 123.00 (p=5, s=2)
MyNumericColumn: 12345.12000 (p=10, s=5)
link: msdn.microsoft.com
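To get a feel for which values a given decimal(p, s) can hold, here is a small Python sketch using the decimal module. The helper name fits is made up, and it assumes that extra decimal places are rounded away before the digits are counted.

```python
from decimal import Decimal

def fits(value: str, p: int, s: int) -> bool:
    """True if value, rounded to s decimal places, needs at most p digits."""
    quantized = Decimal(value).quantize(Decimal(1).scaleb(-s))
    return len(quantized.as_tuple().digits) <= p

checks = {
    "123 in decimal(5,2)": fits("123", 5, 2),          # 123.00 -> 5 digits
    "1234 in decimal(5,2)": fits("1234", 5, 2),        # 1234.00 -> 6 digits
    "12345.12 in numeric(10,5)": fits("12345.12", 10, 5),
    "1.28 in decimal(3,2)": fits("1.28", 3, 2),        # 128% as a fraction
    "12.8 in decimal(3,2)": fits("12.8", 3, 2),        # needs p - s = 2
}

for label, ok in checks.items():
    print(label, ok)
```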
I agree, DECIMAL is where you should store this type of number. But to make the decision easier, store it as a fraction of 1, not as a percentage of 100. That way you can store exactly the number of decimal places you need regardless of the "whole" number. So if you want 6 decimal places, use DECIMAL(9, 8), and for 23.346435% you store 0.23346435. Changing it to 23.346435% is a display problem, not a storage problem, and most presentation languages, report writers and so on are capable of changing the display for you.
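As a sketch of the split between storage and display, the following Python snippet keeps the value as a fraction of 1 and only turns it into a percentage when formatting. It uses Python's decimal module to stand in for a DECIMAL(9, 8) column; the variable names are illustrative.

```python
from decimal import Decimal

# Stored value: a fraction of 1 with 8 decimal places, as in DECIMAL(9, 8).
stored = Decimal("0.23346435")

# Display: the '%' format type multiplies by 100 and appends the sign.
displayed = format(stored, ".6%")

# Going the other way: a percentage entered by a user becomes a fraction.
entered_percent = Decimal("23.346435")
to_store = entered_percent / 100

print(displayed, to_store)
```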
I think decimal(p, s) should be used, where s determines how many decimal places of the percentage can be stored.
Only one digit is ever needed to the left of the decimal point, because each unit there represents one hundred percent, so p must be at least s+1 if you want to store values up to (but not including) 1000%.
Note that SQL doesn't allow p to be smaller than s.
Examples:
28.2656579879% should be decimal(13, 12) and should be stored as 0.282656579879
128.2656579879% should be decimal(13, 12) and should be stored as 1.282656579879
28% should be stored in decimal(3,2) as 0.28
128% should be stored in decimal(3,2) as 1.28
Note: if you know that you're not going to reach 100% (i.e. your value will always be less than 100%), then use decimal(s, s); if it will, use decimal(s+1, s).
And so on
The datatype of the column should be decimal.