I have run into a situation I can't explain: how Redshift handles division of SUMs.
Here is an example table:
create table public.datatype_test(
a numeric(19,6),
b numeric(19,6));
insert into public.datatype_test values(222222.2222, 333333.3333);
insert into public.datatype_test values(444444.4444, 666666.6666);
Now I try to run query:
select sum(a)/sum(b) from public.datatype_test;
I get the result 0.6666 (4 decimals). This is not a display issue in the tool; the query really returns only 4 decimal places, regardless of how large or small the numbers in the table are. In my case 4 decimals is not precise enough.
The same holds true if I use AVG instead of SUM.
If I use MAX instead of SUM, I get 0.6666666666666666666 (19 decimals).
It also returns the correct result (0.6666666666666667) when no physical table is used:
with t as (
select 222222.2222::numeric(19,6) as a, 333333.3333::numeric(19,6) as b union all
select 444444.4444::numeric(19,6) as a, 666666.6666::numeric(19,6) as b
)
select sum(a)/sum(b) as d from t;
I have looked at the Redshift documentation on SUM and on Computations with Numeric Values, but the result I get still doesn't match what the documentation describes.
Using float datatype for table columns is not an option as I need to store precise currency amounts and 15 significant digits is not enough.
Using cast on SUM aggregation also gives 0.6666666666666666666 (19 decimals).
select sum(a)::numeric(19,6)/sum(b) from public.datatype_test;
But it looks wrong, and I can't force BI tools to apply this workaround; nobody who uses this data should have to resort to it.
I ran the same test in PostgreSQL 10, and it works as it should, returning a sufficient number of decimals for the division.
Is there anything I can do with database setup to avoid casting in SQL Query?
Any advice or guidance is highly appreciated.
Redshift version:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.4081
Using dc2.8xlarge nodes
I have run into similar issues, and although I don't have a solution that doesn't require a workaround, I can at least explain it.
The precision/scale of the result of division is defined by the rules in the "computations with numeric values" document.
A consequence of those rules is that a decimal(19,6) divided by another decimal(19,6) will return decimal(38,19).
What's happening to you, though, is that MAX returns the same precision/scale as the underlying column, but SUM returns decimal(38,*) no matter what.
(This is probably a safety precaution to prevent overflow on sums of "big data"). If you divide decimal(38,6) by another, you get decimal(38,4).
AWS support will probably not consider this a defect -- there is no SQL standard for how to treat decimal precision in division, and given that this is documented behavior, it's probably a deliberate decision.
The only way to address this is to typecast the numerator, or multiply it by something like sum(a) * cast(1 as decimal(10,9)) which is portable SQL and will force more decimal places in the numerator and thus the result.
As a convenience I made a calculator in JSFiddle with the rules so you can play around with different options:
scale = Math.max(4, s1 + p2 - s2 + 1)
precision = p1 - s1 + s2 + scale
if (precision > 38) {
    scale = Math.max((38 + scale - precision), 4)
    precision = 38
}
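The same rules can be sketched in Python to check the cases discussed in this thread. The function name `division_result_type` is made up for illustration; the arithmetic is copied from the snippet above, not taken from Redshift itself.

```python
def division_result_type(p1, s1, p2, s2):
    """Return (precision, scale) for decimal(p1,s1) / decimal(p2,s2),
    following the rules quoted above (hypothetical helper name)."""
    scale = max(4, s1 + p2 - s2 + 1)
    precision = p1 - s1 + s2 + scale
    if precision > 38:
        # Precision is capped at 38; scale shrinks, but never below 4.
        scale = max(38 + scale - precision, 4)
        precision = 38
    return precision, scale


# MAX(a) / MAX(b): both operands keep the column type decimal(19,6).
print(division_result_type(19, 6, 19, 6))  # (38, 19)
# SUM(a) / SUM(b): both operands are widened to decimal(38,6).
print(division_result_type(38, 6, 38, 6))  # (38, 4)
# SUM(a)::numeric(19,6) / SUM(b): only the numerator is cast back.
print(division_result_type(19, 6, 38, 6))  # (38, 19)
```

This reproduces the three results reported in the question: 19 decimals with MAX, 4 decimals with SUM, and 19 decimals again once the numerator is cast.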
I'm new to this.
I have a column (chocolate_weight) on the table (Chocolate) in which every value has a g at the end of the number, so 30x , 2x5g,10g etc.
I want to remove the letter at the end and then query it to show any that weigh greater than 35.
So far I have done
Select *
From Chocolate
Where chocolate_weight IN
(SELECT
REPLACE(chocolote_weight,'x','') From Chocolate) > 35
It is coming back with 0 , even though there are many that weigh more than 35.
Any help is appreciated
Thanks
If 'g' is always the suffix then your current query is along the right lines, but you don't need the IN; you can do the replace in the WHERE clause:
SELECT *
FROM Chocolate
WHERE CAST(REPLACE(chocolate_weight,'g','') AS DECIMAL(10, 2)) > 35;
N.B. This works in both the tagged DBMS SQL-Server and MySQL
This will fail (although only silently in MySQL) if you have anything that contains units other than grams, so I would strongly suggest that you fix your design if it is not too late: store the weight as a numeric type and lose the 'g' completely if you only ever store grams. If you use multiple different units then you may wish to standardise so that all values are in grams, or alternatively store the two things in separate columns, one decimal/int column for the numeric value and a separate column for the unit, e.g.
Weight  Unit
10      g
150     g
1000    lb
The issue you will have here, though, is that you will have to start doing conversions in your queries to ensure you get all results. It is easier to do the conversion once when the data is saved and use a standard measure for all records.
Consider these values which are of type MONEY (sample values and these can change)
select 4796.529 + 1585.0414 + 350.9863 + 223.3549 + 127.6314+479.6529 + 158.5041
For some reason I need to round each value to a scale of 3, like this:
select round(4796.529,3)+ round(1585.0414,3)+ round(350.9863,3)+ round(223.3549,3)+ round(127.6314,3)+ round(479.6529,3)+ round(158.5041,3)
But when I take the sums, they show a very minor variation. The first line of code returns 7721.7000 and the second one returns 7721.6990. This variation is not acceptable. What is the best way to solve this?
As Whencesoever said, your problem is a mathematical one, not a programming error.
12.5 + 11.6 = 24.1
ROUND(12.5) + ROUND(11.6) = 25
ROUND(12.5 + 11.6) = 24
I'd talk with the business and figure out where they want the rounding applied.
Also, as a side note, MONEY is a terrible datatype. If you can, you may want to consider switching to a DECIMAL. See Should you choose the MONEY or DECIMAL(x,y) datatypes in SQL Server?
When you round numbers before you sum them you will get a different result than if you round numbers after you have summed them. Simple as that. There is no way to solve this.
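The effect can be reproduced outside the database. The sketch below uses Python's `decimal` module with the seven values from the question; `round3` is a hypothetical helper, and `ROUND_HALF_UP` is an assumption about the rounding mode (none of these values is an exact tie, so the mode does not change the outcome here).

```python
from decimal import Decimal, ROUND_HALF_UP

# The sample values from the question, kept exact by building them from strings.
values = [Decimal(v) for v in (
    "4796.529", "1585.0414", "350.9863", "223.3549",
    "127.6314", "479.6529", "158.5041",
)]


def round3(d):
    """Round a Decimal to three decimal places (hypothetical helper)."""
    return d.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


sum_then_round = round3(sum(values))             # round once, at the end
round_then_sum = sum(round3(v) for v in values)  # round every term first

print(sum_then_round)  # 7721.700
print(round_then_sum)  # 7721.699
```

Each rounded term discards up to 0.0005, and those small losses accumulate, which is why the two totals differ by 0.001.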
So I have two identical values that are results of sum functions with the exact same length (no rounding is being done). (Update, data type is float)
Value_1 = 29.9539194336501
Value_2 = 29.9539194336501
The issue I'm having is when I do an IF statement for Value_1 = Value_2, it comes up as FALSE.
Value_1:
SELECT SUM([INVN_DOL])/SUM([AVG_DLY_SLS_LST_35_DYS]) end as DSO
FROM TABLE A
Value_2:
SELECT SUM ([Total_Inventory_Val]) / SUM ([Daily_Independent_Demand])
FROM TABLE B
Any idea why they may not be exactly equal and what I can do to get a TRUE value since they do match?
Thanks in advance
The issue you are having here is that you are using a calculated value held in a float, which by design is slightly imprecise in its last digits. Two results can display identically and still differ in those digits, which is why you are getting your mismatch.
Use data types like decimal with a defined precision and scale to hold your values and calculation results and you should get consistent results.
You can use ROUND to limit the number of decimal places before comparing.
Or
Try ABS: check whether the absolute difference between the two values is below a small tolerance.
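A small Python sketch of the same effect, using the well-known 0.1 + 0.2 case rather than the poster's data (which is not available). It shows two floats that print identically at 13 decimals yet compare unequal, followed by the tolerance and rounding comparisons suggested above. The tolerance 1e-9 is an arbitrary choice for illustration.

```python
import math

a = 0.1 + 0.2   # result of a calculation
b = 0.3         # "the same" value obtained another way

# Both display identically at 13 decimal places...
print(f"{a:.13f}", f"{b:.13f}")

# ...but they are not bit-for-bit equal.
print(a == b)                            # False

# Tolerance-based comparison (the ABS approach).
print(abs(a - b) < 1e-9)                 # True
print(math.isclose(a, b, rel_tol=1e-9))  # True

# Rounding both sides before comparing (the ROUND approach).
print(round(a, 10) == round(b, 10))      # True
```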
I'm trying to query a database. I need to get a list of customers whose weight is equal to 60.5. The problem is that 60.5 is a real, and I've never queried a database with a real in a WHERE clause before.
I've tried this:
SELECT Name FROM Customers WHERE Weight=60.5
SELECT Name FROM Customers WHERE Weight=cast(60.5 as real)
SELECT Name FROM Customers WHERE Weight=cast(60.5 as decimal)
SELECT Name FROM Customers WHERE Weight=convert(real,'60.5')
SELECT Name FROM Customers WHERE Weight=convert(decimal,'60.5')
These queries return 0 rows, but in the Customers table there are 10 rows with Weight=60.5
Your problem is that floating point numbers are inaccurate by definition. Comparing what seems to be 60.5 to a literal 60.5 might not work as you've noticed.
A typical solution is to measure the difference between the two values and, if it is smaller than some predefined epsilon, consider them equal:
SELECT Name FROM Customers WHERE ABS(Weight-60.5) < 0.001
For better performance, you should actually use:
SELECT Name FROM Customers WHERE Weight BETWEEN 60.499 AND 60.501
If you need equality comparison, you should change the type of the column to DECIMAL. Decimal numbers are stored and compared exactly, while real and float numbers are approximations.
@Amit's answer will work, but it will perform quite poorly in comparison to my approach. ABS(Weight-60.5) < 0.001 is unable to use index seeks. But if you convert the column to DECIMAL, then Weight=60.5 will perform well and use index seeks.
I'm trying to find the geometric average of values from a table with millions of rows. For those that don't know, to find the geometric average you multiply all the values together and then take the Nth root, where N is the number of rows.
You probably already see the problem: the running product will quickly exceed the system maximum. I found a great solution that uses the natural log.
http://timothychenallen.blogspot.com/2006/03/sql-calculating-geometric-mean-geomean.html
However that got me to wonder wouldn't the same problem apply with the arithmetic mean? If you have N records, and N is very large the running sum can also exceed the system maximum.
So how do RDBMSs calculate averages during queries?
I don't know an exact implementation for arithmetic mean in an RDBMS, nor did you specify one in your original question. But the RDBMS does not need to sum a million rows in a column in order to obtain the arithmetic mean. Consider the following summation:
sum = (x1 + x2 + x3 + ... + x1000000)
Then the mean can be written as
mean = sum / N = (x1 + x2 + x3 + ... + x1000000) / N, for N = 1,000,000
But this expression can be broken up into pieces like this:
mean = [(x1 + x2 + x3) / N ] + [(x4 + x5 + x6) / N] + ...
In other words, the RDBMS can simply scan down the million rows in a column and find the mean section by section, without running the risk of an overflow. And since each number in the column is presumably within range for the type storing it, there is no chance of the mean value itself overflowing.
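The algebra above can be sketched in Python. This only illustrates the identity; it is not a claim about how any particular engine implements AVG, and the name `chunked_mean` and the chunk size are made up for the example. Dividing each section by N keeps every partial result small, at the cost of some extra floating-point rounding.

```python
import math


def chunked_mean(values, chunk_size=3):
    """Accumulate sum(chunk) / N section by section (illustrative only)."""
    n = len(values)
    mean = 0.0
    for start in range(0, n, chunk_size):
        chunk = values[start:start + chunk_size]
        mean += sum(chunk) / n   # each section contributes its share of the mean
    return mean


data = [10, 20, 30, 40, 50, 60, 70]
print(chunked_mean(data))      # approximately 40
print(sum(data) / len(data))   # the direct calculation, for comparison
```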
Most databases don't support a product() function the way they support an average.
However, you can do what you want with logs. The product (simplified) is like:
select exp(sum(ln(x))) as product
The average would be:
select power(exp(sum(ln(x))), 1.0 / count(*)) as geoaverage
or
select EXP(AVG(LN(x))) as geoaverage
The LN() function might be LOG() on some platforms...
These are schematics. The functions for exp() and ln() and power() vary, depending on the database. Plus, if you have to take into account zero or negative numbers, the logic is more complicated.
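To see why the log form avoids overflow, here is a minimal Python sketch of the EXP(AVG(LN(x))) idea. The function name `geometric_mean` is an assumption for this example, and, as noted above, it only handles strictly positive values.

```python
import math


def geometric_mean(values):
    """exp(avg(ln(x))): the log form of the geometric mean (positive values only)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric_mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


print(geometric_mean([2, 8]))        # approximately 4
print(geometric_mean([1, 10, 100]))  # approximately 10

# A naive running product overflows to infinity long before the logs do.
big = [1e300] * 1000
product = 1.0
for v in big:
    product *= v
print(math.isinf(product))           # True
print(geometric_mean(big))           # approximately 1e300
```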
Very easy to check. For example, SQL Server 2008.
DECLARE @T TABLE(i int);
INSERT INTO @T(i) VALUES
(2147483647),
(2147483647);
SELECT AVG(i) FROM @T;
result
(2 row(s) affected)
Msg 8115, Level 16, State 2, Line 7
Arithmetic overflow error converting expression to data type int.
There is no magic. Column type is int, server adds values together using internal variable of the same type int and intermediary result exceeds range for int.
You can run a similar check for any other DBMS that you use. Different engines may behave differently, but I would expect all of them to stick to the original type of the column. For example, averaging two int values 100 and 101 may result in 100 or 101 (still int), but never 100.5.
For SQL Server this behavior is documented. I would expect something similar for all other engines:
AVG () computes the average of a set of values by dividing the sum of
those values by the count of nonnull values. If the sum exceeds the
maximum value for the data type of the return value an error will be
returned.
So, you have to be careful when calculating simple average as well, not just product.
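The behaviour can be imitated in Python, whose integers never overflow, by checking the running sum against the 32-bit range by hand. This is only a simulation of the reasoning above, not SQL Server's actual implementation; `avg_int32` is a hypothetical name.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def avg_int32(values):
    """Average ints while keeping the running sum in a simulated 32-bit int."""
    total = 0
    for v in values:
        total += v
        if not INT32_MIN <= total <= INT32_MAX:
            raise OverflowError("arithmetic overflow: sum exceeds the int range")
    # Integer result, truncated toward zero: the fraction is discarded.
    return int(total / len(values))


print(avg_int32([100, 101]))  # 100

try:
    avg_int32([2147483647, 2147483647])
except OverflowError as exc:
    print("error:", exc)
```

The second call fails even though the true average, 2147483647, fits in an int: it is the intermediate sum that goes out of range.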
Here is an extract from the SQL-92 standard:
6) Let DT be the data type of the <value expression>.
9) If SUM or AVG is specified, then:
a) DT shall not be character string, bit string, or datetime.
b) If SUM is specified and DT is exact numeric with scale S, then the
data type of the result is exact numeric with implementation-defined
precision and scale S.
c) If AVG is specified and DT is exact numeric, then the data type of
the result is exact numeric with implementation-defined precision not
less than the precision of DT and implementation-defined scale not
less than the scale of DT.
d) If DT is approximate numeric, then the data type of the result is
approximate numeric with implementation-defined precision not less
than the precision of DT.
e) If DT is interval, then the data type of the result is interval
with the same precision as DT.
So a DBMS can convert int to a larger type when calculating AVG, but it has to be an exact numeric type, not floating-point. In any case, depending on the values, you can still get arithmetic overflow.
Some DBMS — specifically, the Informix DBMS — convert from an INT type to a floating point type to do the calculation:
SQL[2148]: create table t(i int);
SQL[2149]: insert into t values(214748347);
SQL[2150]: insert into t values(214748347);
SQL[2151]: insert into t values(214748347);
SQL[2152]: select avg(i) from t;
214748347.0
SQL[2153]: types on;
SQL[2154]: select i from t;
INTEGER
214748347
214748347
214748347
SQL[2155]: select avg(i) from t;
DECIMAL(32)
214748347.0
SQL[2156]:
Similarly with other types. This can still end with an overflow under some circumstances; you then get a runtime error. However, it is rather seldom that you exceed the precision — it typically takes a very large number of rows for the sum to exceed the limits, even if you're counting the US deficit over the next century in atto-Zimbabwean dollars circa 2009.