How to handle very big numbers in Snowflake? - sql

I have a Python program that goes over tables in a DB (not mine) and, for each column of type NUMBER, performs some mathematical operations such as standard deviation. However, some columns contain very large numbers, and when I try to execute:
select STDDEV(big_col) from table1;
I'm getting the error:
Number out of representable range: type FIXED[SB16](38,0){not null}, value 3.67864e+38
Any idea how I can handle this? It's OK for me to just ignore these values in this case, but I don't want my query to fail.
Thanks,
Nir.

As @dnoeth mentioned in the comment section, casting the column to DOUBLE before taking the standard deviation should fix the issue: STDDEV(CAST(big_col AS DOUBLE)).
The OP then asked: if the resulting standard deviation is significantly smaller than 1e+38 (NUMBER can hold at most 38 digits), why does the value need to be cast to DOUBLE at all?
The reason for this lies in the standard deviation formula:
stddev = sqrt( sum((x_i - mean)^2) / N )
The first step is to subtract the column's mean from each individual value, and all of those differences are then squared. It is this sum of squares that sets the upper bound the NUMBER format has to handle, before the later operations (dividing by the number of records N and taking a square root) reduce it to the final answer. For example, a deviation of about 1.9e+19 squares to roughly 3.6e+38, already past the NUMBER(38,0) range and right at the magnitude shown in the error above. The final standard deviation can therefore be far smaller than the intermediate sum of squares that must be computed first.
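A minimal sketch of the fix, using the table and column from the question; casting first makes the intermediate sum of squares run in DOUBLE (range up to roughly 1.8e+308) instead of NUMBER:
-- cast before aggregating so the squared deviations cannot overflow NUMBER(38,0)
select STDDEV(CAST(big_col AS DOUBLE)) from table1;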

Related

Clever way to check if value meets threshold in VBA

Disclaimer: Numbers below are randomly generated
What I'm trying to do, purely in VBA, is look at the ratio [column B]/[column A] and check whether the ratio in row 10 (=1,241/468) is below the minimum or above the maximum of the ratios in rows 1 through 9, but only compared to the rows where there is a 1 in column C.
That is, compare Cell(B10)/Cell(A10) to Cell(B2)/Cell(A2), Cell(B3)/Cell(A3), etc. (only comparing against rows with a 1 in column C).
The workbook I'm working with has a lot more data and columns, and I'm not allowed to explicitly edit the cells, so defining a new column is out of the question. Is there a way to do this in VBA such that it essentially returns a Boolean depending on whether or not the ratio in the last row violates the threshold defined above?
You can achieve the minimum and maximum ratios (with criteria) easily with the AGGREGATE¹ function's SMALL sub-function and LARGE sub-function.
The formulas in D13:E13 are,
=AGGREGATE(15, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)
=AGGREGATE(14, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)
The 6 is the AGGREGATE parameter for ignoring error values. By dividing each ratio by the value in column C, we produce #DIV/0! errors for the rows we do not want considered, leaving them ignored. If the values in C were more diverse, we could divide by (C1:C9=1) to produce the same results.
Since we are using the SMALL and LARGE sub-functions, we can easily retrieve the second, third, etc. ratios by increasing the k parameter (the 1 off the back end).
I've modified some of the values in your sample slightly to demonstrate that the min and max with criteria are being picked up correctly.
These can be adapted to VBA with the WorksheetFunction object or Application.Evaluate method.
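For example, a minimal sketch with Application.Evaluate (the function name, and the assumption that the data sits on the active sheet, are mine):
Function RatioViolatesThreshold() As Boolean
    Dim minRatio As Double, maxRatio As Double, lastRatio As Double
    ' Evaluate runs worksheet formulas as array formulas, so the
    ' AGGREGATE expressions behave exactly as they do on the sheet
    minRatio = Application.Evaluate("AGGREGATE(15, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)")
    maxRatio = Application.Evaluate("AGGREGATE(14, 6, ((B1:B9)/(A1:A9))/C1:C9, 1)")
    ' the ratio in the last row
    lastRatio = Range("B10").Value / Range("A10").Value
    ' True when the last row's ratio falls outside [min, max]
    RatioViolatesThreshold = (lastRatio < minRatio) Or (lastRatio > maxRatio)
End Function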
¹ The AGGREGATE function was introduced with Excel 2010. It is not available in earlier versions.

Handling variable DECIMAL data in SQL

I have a scheduled job that pulls data from our legacy system every month. The data can sometimes swell and shrink, which causes havoc for DECIMAL precision.
I just found this job failed because DECIMAL(5,3) was too restrictive. I changed it to DECIMAL(6,3) and life is back on track.
Is there any way to evaluate this shifting data so it doesn't break on the DECIMAL()?
Thanks,
-Allen
Is there any way to evaluate this shifting data so it doesn't break on the DECIMAL()
Find the maximum value your data can have and set the column size appropriately.
Decimal columns have two size factors: precision and scale. Set the scale to as many decimal places as you need (3 in your case), and set the precision based on the largest possible number you can have.
A DECIMAL(5,3) has 5 total digits (the precision), three of them past the decimal point (the scale), so it can store numbers up to 99.999. If your data can be 100 or larger, use a bigger precision.
If your data is scientific in nature (e.g. temperature readings) and you don't care about exact equality, only about showing trends, relative values, etc., then you might use REAL instead. It takes less space than a DECIMAL(5,3) (4 bytes vs. 5), has 7 digits of precision (vs. 5), and a range of -3.4E38 to 3.4E38 (vs. -99.999 to 99.999).
DECIMAL is better suited to financial data or other data where exact equality is important (i.e. rounding errors are bad).
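To make the trade-off concrete, a hypothetical table (names invented here) holding both kinds of data:
CREATE TABLE sample_readings (
    price_exact DECIMAL(6,3), -- exact, up to 999.999, 5 bytes: fits financial data
    temp_approx REAL          -- approximate, ~7 digits, +/-3.4E38 range, 4 bytes: fits sensor data
);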

Proper Data Type in SQL Server to store Scientific Notation value? (Ex. 10^3)

Say I have test result values for a lab procedure that come in as 10^3. What would be the best way to store this in SQL Server? I would think, since this is numerical data, it would be improper to just store it as string text and then program around calculating the data value from the string.
If you want to use your data in numeric calculations, it is probably best to represent it using one of SQL Server's native numeric data types. Since you show scientific notation, you will likely want either REAL or FLOAT.
REAL gives you roughly 7 decimal digits of precision and FLOAT about 15 (at least, this is how they are normally used). You can actually specify reduced precision for FLOAT, but in practice most people just use REAL in that case. REAL takes 4 bytes of storage; FLOAT requires 8.
The other numeric types are for fixed decimal point arithmetic.
Numbers in scientific notation like this have three pieces of information:
The significand
The precision of the significand
The exponent of 10
Presuming we want to keep all this information as exact as possible, it may be best to store these in three non-floating point columns (floating-point values are inexact):
DECIMAL significand
INT precision (# of decimal places)
INT exponent
The downside to the approach of separating these parts out, of course, is that you'll have to put the values back together when doing calculations -- but by doing that you'll know the correct number of significant figures for the result. Storing these three parts will also take up 25 bytes per value (17 for the DECIMAL, and 4 each for the two INTs), which may be a concern if you're storing a very large quantity of values.
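A sketch of that three-column layout (the table and column names are hypothetical; the sizes match the byte counts above):
CREATE TABLE lab_results (
    significand DECIMAL(38,19) NOT NULL, -- exact significant digits (17 bytes at precision 38)
    sig_digits  INT NOT NULL,            -- number of significant decimal places (4 bytes)
    exponent    INT NOT NULL             -- power of ten (4 bytes)
);
-- 1.030 x 10^3, known to 4 significant figures:
INSERT INTO lab_results VALUES (1.030, 4, 3);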
Update per explanatory comments:
Given that your goal is to store an exponent from 1-8, you really only need to store the exponent, since you know the base is always 10. Therefore, if your value is always going to be a whole number, you can just use a single INT column; if it will have decimal places, you can use a FLOAT or REAL per Gary Walker, or use a DECIMAL to store a precise decimal to a specified number of places.
If you specify a DECIMAL, you can provide two arguments in the column type; the first is the total number of digits to be stored, while the second is the number of digits to the right of the decimal point. So if your values are going to be accurate to the tenths place, you might create a column of DECIMAL(2,1). SQL Server MSDN documentation: DECIMAL and NUMERIC types
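Hypothetical sketches of both options (table and column names invented):
-- whole-number powers of ten: store just the exponent n of 10^n
CREATE TABLE results_exp (exponent INT NOT NULL);
SELECT POWER(CAST(10 AS FLOAT), exponent) AS result_value FROM results_exp;
-- values accurate to the tenths place: one digit either side of the decimal point
CREATE TABLE results_dec (reading DECIMAL(2,1) NOT NULL);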

Lucene fieldNorm discrepancy between Similarity calculation and query-time value

I'm trying to understand how fieldNorm is calculated (at index time) and then used (and apparently re-calculated) at query time.
In all the examples I'm using the StandardAnalyzer with no stop words.
Debugging the DefaultSimilarity's computeNorm method while indexing, I've noticed that for 2 particular documents it returns:
0.5 for document A (which has 4 tokens in its field)
0.70710677 for document B (which has 2 tokens in its field)
It does this by using the formula:
state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
where boost is always 1
Afterwards, when I query for these documents I see that in the query explain I get
0.5 = fieldNorm(field=titre, doc=0) for document A
0.625 = fieldNorm(field=titre, doc=1) for document B
This is already strange (to me, I'm sure it's me who's missing something). Why don't I get the same values for field norm as those calculated at index time? Is this the "query normalization" thing in action? If so, how does it work?
This is more or less OK, though, since the two query-time fieldNorms preserve the order of those calculated at index time (the field with fewer tokens has the higher fieldNorm in both cases).
I then made my own Similarity class, implementing the computeNorm method like so:
public float computeNorm(String pField, FieldInvertState state) {
    // the boost is ADDED to the length factor here (rather than
    // multiplied), so every norm comes out greater than 1.0
    float norm = (float) (state.getBoost() + (1.0d / Math.sqrt(state.getLength())));
    return norm;
}
At index time I now get:
1.5 for document A (which has 4 tokens in its field)
1.7071068 for document B (which has 2 tokens in its field)
However now, when I query for these documents, I can see that they both have the same field norm as reported by the explain function:
1.5 = fieldNorm(field=titre, doc=0) for document A
1.5 = fieldNorm(field=titre, doc=1) for document B
To me this is now really strange: if I use an apparently good similarity to calculate the fieldNorm at index time, giving me values properly proportional to the number of tokens, why is all this lost at query time, where the explain says both documents have the same fieldNorm?
So my questions are:
why does the index-time fieldNorm reported by the Similarity's computeNorm method not remain the same as that reported by query explain?
why, for two different fieldNorm values obtained at index time (via the Similarity's computeNorm), do I get identical fieldNorm values at query time?
== UPDATE
OK, I've found something in Lucene's docs which clarifies part of my question, but not all of it:
However the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
How much precision loss is there? Is there a minimum gap we should put between different values so that they remain different even after the precision-loss re-calculations?
The documentation of encodeNormValue describes the encoding step (which is where the precision is lost), and particularly the final representation of the value:
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.
The most relevant piece to understand is that the mantissa is only 3 bits, which means precision is around one significant decimal digit.
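You can see the quantization directly by round-tripping values through Lucene's SmallFloat helper, which backs the norm encoding (a sketch; it assumes the Lucene 3.x org.apache.lucene.util.SmallFloat API):
import org.apache.lucene.util.SmallFloat;

public class NormRoundTrip {
    public static void main(String[] args) {
        // index-time norms from the examples above, plus the docs' 0.89 example
        float[] norms = {0.5f, 0.625f, 0.70710677f, 0.89f, 1.5f, 1.7071068f};
        for (float n : norms) {
            byte encoded = SmallFloat.floatToByte315(n);  // the lossy single-byte encoding
            float decoded = SmallFloat.byte315ToFloat(encoded);
            System.out.println(n + " decodes back to " + decoded);
        }
    }
}
With only a 3-bit mantissa there are just eight representable steps between consecutive powers of two, so nearby index-time norms can collapse onto the same byte, which is consistent with the identical query-time fieldNorms you observed.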
An important note on the rationale comes a few sentences after where your quote ended, where the Lucene docs say:
The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.

Precision gains when data moves from one table to another in SQL Server

There are three tables in our SQL Server 2008 database:
transact_orders
transact_shipments
transact_child_orders
All three have a common column, carrying_cost. The data type is the same in all three tables: float with NUMERIC_PRECISION 53 and NUMERIC_PRECISION_RADIX 2.
In table 1 - transact_orders - this column has the value 5.1 for three rows. convert(decimal(20,15), carrying_cost) returns 5.100000..... here.
In table 2 - transact_shipments - three rows fetch carrying_cost from those three rows in transact_orders.
convert(decimal(20,15), carrying_cost) returns 5.100000..... here as well.
Table 3 - transact_child_orders - sums up those three carrying costs from transact_shipments, and the value shown there is 15.3 when I run a normal select.
But convert(decimal(20,15), carrying_cost) returns 15.299999999999999 in this table, and that precision-gained value shows up in the UI as well, even though the UI only fetches the value and does no conversion. In the Java code the variable that fetches the value from the DB is defined as double.
The code in step 3, to sum up the three carrying costs, is simple:
...sum(isnull(transact_shipments.carrying_costs,0)) sum_carrying_costs,...
Any idea why this change occurs in the third step ? Any help will be appreciated. Please let me know if any more information is needed.
Rather than post a bunch of comments, I'll write an answer.
Floats are not suitable for precise values where you can't accept rounding errors - for example, finance.
Floats can scale from very small numbers to very large numbers, but they don't do that without losing a degree of accuracy. You can look the details up online; there is a host of good work out there for you to read.
But, simplistically, it's because they're true binary numbers - some decimal numbers just can't be represented as a binary value with 100% accuracy. (Just like 1/3 can't be represented with 100% accuracy in decimal.)
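You can reproduce the drift from the question directly; a T-SQL sketch:
-- 5.1 has no exact binary representation, so three float copies of it
-- do not sum to exactly 15.3
DECLARE @cost FLOAT = 5.1;
SELECT CONVERT(DECIMAL(20,15), @cost + @cost + @cost);
-- returns something like 15.299999999999999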
I'm not sure what is causing your performance issue with the DECIMAL data type, often it's because there is some implicit conversion going on. (You've got a float somewhere, or decimals with different definitions, etc.)
But regardless of the cause, nothing is faster than integer arithmetic. So, store your values as integers: £1.10 could be stored as 110p. Or, if you know you'll get fractions of a penny for some reason, store tenths of a penny: 1100dp (deci-pennies).
You do then need to consider the biggest value you will ever reach, and whether INT or BIGINT is more appropriate.
Also, when working with integers, be careful of divisions. If you divide £10 between 3 people, where does the last 1p need to go? £3.33 for two people and £3.34 for one person? £0.01 eaten by the bank? But, invariably, it should not get lost to the digital elves.
And, obviously, when presenting the number to a user, you then need to manipulate it back to £ rather than dp; but you need to do that often anyway, to get £10k or £10M, etc.
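Putting that together, a hypothetical sketch of the integer-pennies approach (table and column names invented):
-- store pence as integers; integer sums are exact
CREATE TABLE orders_pence (carrying_cost_pence BIGINT NOT NULL); -- £5.10 stored as 510
-- convert back to pounds only at presentation time
SELECT SUM(carrying_cost_pence) / 100.0 AS total_pounds FROM orders_pence;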
Whatever you do, and if you don't want rounding errors due to floating point values, don't use FLOAT.
(There is a lot written online about how to use floats and, more importantly, how not to. It's a big topic; just don't fall into the trap of "it's so accurate, it's amazing, it can do anything" - I can't count the number of times people have screwed up data using that unfortunately common but naive assumption.)