How to insert infinity in Impala Table - hive

How to insert Infinity and NaN in Impala? The same test works in Hive but throws an error in Impala.
> create table z2 (x double);
> insert into z2 values (1),("NaN"),("Infinity"),("-Infinity");
Query: insert into z2 values (1),("NaN"),("Infinity"),("-Infinity")
ERROR: AnalysisException: Incompatible return types 'TINYINT' and 'STRING' of
exprs '1' and ''NaN''.
Can anyone tell me how to fix this for Impala?

Based on the Impala Mathematical Functions documentation, you're missing the CAST(x AS DOUBLE).
Infinity and NaN can be specified in text data files as inf and nan respectively, and Impala interprets them as these special values. They can also be produced by certain arithmetic expressions; for example, 1/0 returns Infinity and pow(-1, 0.5) returns NaN. Or you can cast the literal values, such as CAST('nan' AS DOUBLE) or CAST('inf' AS DOUBLE).
So your insert should read:
> insert into z2 values (1), (CAST ('nan' AS DOUBLE)),
(CAST ('inf' AS DOUBLE)), (- CAST ('inf' AS DOUBLE));
Then you'll see:
> select * from z2;
+-----------+
| x         |
+-----------+
| 1         |
| NaN       |
| Infinity  |
| -Infinity |
+-----------+
Fetched 4 row(s) in 0.12s
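Outside of Impala, the same IEEE-754 special values behave identically in most languages; here is a quick Python sketch (purely illustrative, not Impala code) of the values that CAST('nan' AS DOUBLE) and CAST('inf' AS DOUBLE) produce:

```python
import math

# Parsing the special-value strings, analogous to CAST('nan' AS DOUBLE)
# and CAST('inf' AS DOUBLE) above.
nan = float("nan")
pos_inf = float("inf")
neg_inf = -float("inf")

# NaN is the only value that is not equal to itself.
print(math.isnan(nan), nan == nan)        # True False
# Infinity compares larger than any finite double.
print(pos_inf > 1e308, neg_inf < -1e308)  # True True
```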

Related

How to generate float sequence in Presto?

I want to generate a float range that can be unnested into a column in PrestoDb. I am following the documentation https://prestodb.io/docs/current/functions/array.html and trying out 'sequence', but it looks like float ranges cannot be generated with sequence. I want to generate a table like below, with the value decreasing by 0.3 each row:
| date       | value |
| 2020-01-31 | 47.6  |
| 2020-02-28 | 47.3  |
| 2020-03-31 | 47.0  |
I was trying to generate a sequence and then unnest it into column values. I am able to generate the date column using sequence in PrestoDB, but not the value column.
Any suggestions, please?
You can use sequence with bigint and convert to double after unnesting:
presto> SELECT x / 10e0 FROM UNNEST(sequence(476, 470, -3)) t(x);
_col0
-------
47.6
47.3
47.0
(verified on Presto 336)
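The underlying trick (step over exact integers, then divide once at the end) carries over to any language and avoids accumulating floating-point error; a quick Python sketch of the same sequence:

```python
# Generate 47.6 down to 47.0 in steps of 0.3: iterate over exact
# integers (476, 473, 470), then scale by 10 once at the end.
values = [x / 10 for x in range(476, 469, -3)]
print(values)  # [47.6, 47.3, 47.0]
```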

Creating bins in presto sql - programmatically

I am new to Presto SQL syntax and am wondering if a function exists that will bin rows into n bins in a certain range.
For example, I have a table with 1m different integers that range from 1 - 100. What can I do to create 20 bins between 1 and 100 (a bin for 1-5, 6-10, 11-15 ... etc.) without using 20 separate CASE WHEN statements? Are there any standard SQL functions that will perform the binning?
Any advice would be appreciated!
You can use the standard SQL function width_bucket. For example:
WITH data(value) AS (
    SELECT rand(100) + 1 FROM UNNEST(sequence(1, 10000))
)
SELECT value, width_bucket(value, 1, 101, 20) bucket
FROM data
produces:
value | bucket
-------+--------
100 | 20
98 | 20
38 | 8
42 | 9
67 | 14
74 | 15
6 | 2
...
You can just use integer division:
select (intcol - 1) / 5 as bin
Presto does integer division, so you shouldn't have to worry about the remainder.
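To illustrate why both answers produce the intended bins, here is a small Python sketch; the width_bucket semantics are re-implemented by hand for in-range values only, so treat it as an illustration rather than Presto's exact implementation:

```python
def width_bucket(value, low, high, buckets):
    # Simplified re-implementation for values inside [low, high).
    return int((value - low) * buckets / (high - low)) + 1

# 1-5 -> bucket 1, 6-10 -> bucket 2, ..., 96-100 -> bucket 20
assert width_bucket(1, 1, 101, 20) == 1
assert width_bucket(5, 1, 101, 20) == 1
assert width_bucket(6, 1, 101, 20) == 2
assert width_bucket(100, 1, 101, 20) == 20

# The integer-division version: (v - 1) // 5 gives bins numbered 0-19.
assert (1 - 1) // 5 == 0
assert (100 - 1) // 5 == 19
```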

Float stored as Real in database Sql Server

Why is float stored as real in sys.columns or INFORMATION_SCHEMA.COLUMNS when the precision is <= 24?
CREATE TABLE dummy
(
    a FLOAT(24),
    b FLOAT(25)
)
checking the data type
SELECT TABLE_NAME,
       COLUMN_NAME,
       DATA_TYPE,
       NUMERIC_PRECISION
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'dummy'
Result:
+------------+-------------+-----------+-------------------+
| TABLE_NAME | COLUMN_NAME | DATA_TYPE | NUMERIC_PRECISION |
+------------+-------------+-----------+-------------------+
| dummy      | a           | real      | 24                |
| dummy      | b           | float     | 53                |
+------------+-------------+-----------+-------------------+
So why is float stored as real when the precision is less than or equal to 24? Is this documented somewhere?
From an MSDN article which discusses the difference between float and real in T-SQL:
The ISO synonym for real is float(24).
float [ (n) ]
Where n is the number of bits that are used to store the mantissa of the float number in scientific notation and, therefore, dictates the precision and storage size. If n is specified, it must be a value between 1 and 53. The default value of n is 53.
n value | Precision | Storage size
--------+-----------+-------------
1-24    | 7 digits  | 4 bytes
25-53   | 15 digits | 8 bytes
SQL Server treats n as one of two possible values. If 1<=n<=24, n is treated as 24. If 25<=n<=53, n is treated as 53.
As to why SQL Server labels it as real, I think it is just a synonym. However, underneath the hood it is still a float(24).
The ISO synonym for real is float(24).
Please refer for more info:
https://msdn.microsoft.com/en-us/library/ms173773.aspx
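The 4-byte/8-byte split corresponds to IEEE-754 single and double precision, which is easy to check outside SQL Server too; a Python sketch using struct (an analogy only, since Python's own float is always a double):

```python
import struct

# real / float(24) is a 4-byte single; float(53) is an 8-byte double.
assert struct.calcsize("f") == 4
assert struct.calcsize("d") == 8

# Round-tripping through 4 bytes loses precision beyond ~7 digits.
as_real = struct.unpack("f", struct.pack("f", 0.123456789))[0]
print(as_real)  # no longer exactly 0.123456789
assert as_real != 0.123456789
assert abs(as_real - 0.123456789) < 1e-7
```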

What is the expected behaviour for multiple set-returning functions in SELECT clause?

I'm trying to get a "cross join" with the result of two set-returning functions, but in some cases I don't get the "cross join". See the examples:
Behaviour 1: When set lengths are the same, it matches item by item from each set
postgres=# SELECT generate_series(1,3), generate_series(5,7) order by 1,2;
generate_series | generate_series
-----------------+-----------------
1 | 5
2 | 6
3 | 7
(3 rows)
Behaviour 2: When set lengths are different, it "cross join"s the sets
postgres=# SELECT generate_series(1,2), generate_series(5,7) order by 1,2;
generate_series | generate_series
-----------------+-----------------
1 | 5
1 | 6
1 | 7
2 | 5
2 | 6
2 | 7
(6 rows)
I think I'm not understanding something here; can someone explain the expected behaviour?
Another example, even weirder:
postgres=# SELECT generate_series(1,2) x, generate_series(1,4) y order by x,y;
x | y
---+---
1 | 1
1 | 3
2 | 2
2 | 4
(4 rows)
I am looking for an answer to the question in the title, ideally with link(s) to documentation.
Postgres 10 or newer
pads with null values for smaller set(s). Demo with generate_series():
SELECT generate_series( 1, 2) AS row2
, generate_series(11, 13) AS row3
, generate_series(21, 24) AS row4;
row2 | row3 | row4
-----+------+-----
1 | 11 | 21
2 | 12 | 22
null | 13 | 23
null | null | 24
The manual for Postgres 10:
If there is more than one set-returning function in the query's select
list, the behavior is similar to what you get from putting the
functions into a single LATERAL ROWS FROM( ... ) FROM-clause item. For
each row from the underlying query, there is an output row using the
first result from each function, then an output row using the second
result, and so on. If some of the set-returning functions produce
fewer outputs than others, null values are substituted for the missing
data, so that the total number of rows emitted for one underlying row
is the same as for the set-returning function that produced the most
outputs. Thus the set-returning functions run “in lockstep” until they
are all exhausted, and then execution continues with the next
underlying row.
This ends the traditionally odd behavior.
Some other details changed with this rewrite. The release notes:
Change the implementation of set-returning functions appearing in a query's SELECT list (Andres Freund)
Set-returning functions are now evaluated before evaluation of scalar
expressions in the SELECT list, much as though they had been placed
in a LATERAL FROM-clause item. This allows saner semantics for cases
where multiple set-returning functions are present. If they return
different numbers of rows, the shorter results are extended to match
the longest result by adding nulls. Previously the results were cycled
until they all terminated at the same time, producing a number of rows
equal to the least common multiple of the functions' periods. In
addition, set-returning functions are now disallowed within CASE and
COALESCE constructs. For more information see Section 37.4.8.
Bold emphasis mine.
Postgres 9.6 or older
The number of result rows (somewhat surprisingly!) is the lowest common multiple of all sets in the same SELECT list. (Only acts like a CROSS JOIN if there is no common divisor to all set-sizes!) Demo:
SELECT generate_series( 1, 2) AS row2
, generate_series(11, 13) AS row3
, generate_series(21, 24) AS row4;
row2 | row3 | row4
-----+------+-----
1 | 11 | 21
2 | 12 | 22
1 | 13 | 23
2 | 11 | 24
1 | 12 | 21
2 | 13 | 22
1 | 11 | 23
2 | 12 | 24
1 | 13 | 21
2 | 11 | 22
1 | 12 | 23
2 | 13 | 24
Documented in manual for Postgres 9.6 the chapter SQL Functions Returning Sets, along with the recommendation to avoid it:
Note: The key problem with using set-returning functions in the select
list, rather than the FROM clause, is that putting more than one
set-returning function in the same select list does not behave very
sensibly. (What you actually get if you do so is a number of output
rows equal to the least common multiple of the numbers of rows
produced by each set-returning function.) The LATERAL syntax produces
less surprising results when calling multiple set-returning functions,
and should usually be used instead.
Bold emphasis mine.
A single set-returning function is OK (but still cleaner in the FROM list), but multiple in the same SELECT list is discouraged now. This was a useful feature before we had LATERAL joins. Now it's merely historical ballast.
Related:
Parallel unnest() and sort order in PostgreSQL
Unnest multiple arrays in parallel
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
I cannot find any documentation for this. However, I can describe the behavior that I observe.
The set-generating functions each return a finite number of rows. Postgres seems to run them in parallel, cycling each one and stopping when all of them simultaneously wrap back to their first row. Technically, that makes the number of result rows the least common multiple (LCM) of the series lengths.
I'm not sure why this is the case. And, as I say in a comment, I think it is generally better to put the functions in the from clause.
There is only one note about this issue in the documentation. I'm not sure whether it explains the described behavior or not. Perhaps more important is that such function usage is deprecated:
Currently, functions returning sets can also be called in the select list of a query. For each row that the query generates by itself, the function returning set is invoked, and an output row is generated for each element of the function's result set. Note, however, that this capability is deprecated and might be removed in future releases.
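Both behaviours are easy to mimic in Python, which makes the difference concrete (a sketch of the semantics, not of Postgres internals): zip_longest matches the Postgres 10 null-padding, while cycling each set up to the least common multiple matches 9.6.

```python
from itertools import cycle, islice, zip_longest
from math import lcm

row2 = [1, 2]
row3 = [11, 12, 13]

# Postgres 10+: run in lockstep, pad shorter sets with NULL.
new_style = list(zip_longest(row2, row3))
assert new_style == [(1, 11), (2, 12), (None, 13)]

# Postgres 9.6 and older: cycle each set until they all end together,
# i.e. lcm(2, 3) = 6 rows.
n = lcm(len(row2), len(row3))
old_style = list(zip(islice(cycle(row2), n), islice(cycle(row3), n)))
assert old_style == [(1, 11), (2, 12), (1, 13), (2, 11), (1, 12), (2, 13)]
```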

NSNumber how to get smallest common denominator? Like 3/8 for 0.375?

Say I have an NSNumber that is between 0 and 1 and can be represented as X/Y. How do I calculate X and Y in this case? I don't want to compare:
if (number.doubleValue == 0.125)
{
X = 1;
Y = 8;
}
so I get 1/8 for 0.125
That's relatively straightforward. For example, 0.375 is equivalent to 0.375/1.
The first step is to multiply numerator and denominator by 10 until the numerator is an integral value (a), giving you 375/1000.
Then find the greatest common divisor and divide both numerator and denominator by that.
A (recursive) function for GCD is:
int gcd (int a, int b) {
    return (b == 0) ? a : gcd (b, a % b);
}
If you call that with 375 and 1000, it will spit out 125 so that, when you divide the numerator and denominator by that, you get 3/8.
(a) As pointed out in the comments, there may be problems with numbers that have more precision bits than your integer types (such as IEEE754 doubles with 32-bit integers). You can solve this by choosing integers with a larger range (longs, or a bignum library like MPIR) or choosing a "close-enough" strategy (consider it an integer when the fractional part is relatively insignificant compared to the integral part).
Another issue is the fact that some numbers don't even exist in IEEE754, such as the infamous 0.1 and 0.3.
Unless a number can be represented as a sum of 2^-n values, where n is limited by the available precision (such as 0.375 being 1/4 + 1/8), the best you can hope for is an approximation.
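For comparison, Python packages this multiply-then-reduce approach, plus a "close-enough" fallback, in fractions.Fraction; a cross-language illustration of the answer's algorithm:

```python
from fractions import Fraction

# Exactly representable doubles reduce cleanly: 0.375 -> 3/8.
assert Fraction(0.375) == Fraction(3, 8)

# 1/3 is not exactly representable, so the raw fraction is huge;
# limit_denominator() finds the closest "nice" fraction.
approx = Fraction(1 / 3)
assert approx.denominator > 1_000_000
assert approx.limit_denominator(1000) == Fraction(1, 3)
```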
As an example, consider single-precision 1/3 (you'll see why below; I'm too lazy to do the whole 64 bits). As a single-precision value, it is stored as:
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111101 01010101010101010101010
In this example, the sign is 0 hence it's a positive number.
The exponent bits give 125 which, when you subtract the 127 bias, gives you -2. Hence the multiplier will be 2^-2, or 0.25.
The mantissa bits are a little trickier. They form the sum of an implicit 1 along with all the 2^-n values for the 1 bits, where n is 1 through 23 (left to right). So the mantissa is calculated thus:
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111101 01010101010101010101010
            | | | | | | | | | | |
            | | | | | | | | | | +-- 0.0000002384185791015625
            | | | | | | | | | +---- 0.00000095367431640625
            | | | | | | | | +------ 0.000003814697265625
            | | | | | | | +-------- 0.0000152587890625
            | | | | | | +---------- 0.00006103515625
            | | | | | +------------ 0.000244140625
            | | | | +-------------- 0.0009765625
            | | | +---------------- 0.00390625
            | | +------------------ 0.015625
            | +-------------------- 0.0625
            +---------------------- 0.25
                           Implicit 1
                           ========================
                           1.3333332538604736328125
When you multiply that by 0.25 (see exponent earlier), you get:
0.333333313465118408203125
Now that's why they say you only get about 7 decimal digits of precision (15 for IEEE754 double precision).
Were you to pass that actual number through my algorithm above, you would not get 1/3, you would instead get:
5,592,405
---------- (or 0.333333313465118408203125)
16,777,216
But that's not a problem with the algorithm per se, more a limitation of the numbers you can represent.
Thanks to Wolfram Alpha for helping out with the calculations. If you ever need to do any math that stresses out your calculator, that's one of the best tools for the job.
As an aside, you'll no doubt notice the mantissa bits follow a certain pattern: 0101010101.... This is because 1/3 is an infinitely recurring binary value as well as an infinitely recurring decimal one. You would need an infinite number of 01 bits to represent 1/3 exactly.
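The bit-level decomposition above can be reproduced mechanically with Python's struct module; here for 0.375 rather than 1/3, since 0.375 is exactly representable (illustrative only):

```python
import struct

# Raw IEEE-754 single-precision bits of 0.375 (big-endian).
bits = int.from_bytes(struct.pack(">f", 0.375), "big")
sign = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF

assert sign == 0
assert exponent == 125  # 125 - 127 bias = -2, so the multiplier is 2^-2
assert mantissa == 0b10000000000000000000000  # one 1 bit: 2^-1 = 0.5

# Reassemble: (implicit 1 + mantissa fraction) * 2^(exponent - bias)
assert (1 + mantissa / 2**23) * 2.0 ** (exponent - 127) == 0.375
```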
You can try this:
- (CGPoint)yourXAndYValuesWithANumber:(NSNumber *)number
{
    float x = 1.0f;
    float y = x / number.doubleValue;
    for (int i = 1; TRUE; i++)
    {
        // Alternatively floor(y * i), instead of (float)(int)(y * i)
        if ((float)(int)(y * i) == y * i)
        {
            x *= i;
            y *= i;
            break;
        }
    }
    /* Also alternatively:
    int coefficient = 1;
    while (floor(y * coefficient) != y * coefficient) coefficient++;
    x *= coefficient, y *= coefficient; */
    return CGPointMake(x, y);
}
This will not work if you have invalid input. X and Y will have to exist and be valid natural numbers (1 to infinity). A good example that will break it is 1/pi. If you have limits, you can do some critical thinking to implement them.
The approach outlined by paxdiablo is spot-on.
I just wanted to provide an efficient GCD function (implemented iteratively):
int gcd (int a, int b) {
    int c;
    while (a != 0) {
        c = a;
        a = b % a;
        b = c;
    }
    return b;
}
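For reference, Python's standard library ships the same Euclidean algorithm as math.gcd; a quick check against the 375/1000 example from the earlier answer:

```python
from math import gcd

# Reduce 375/1000 exactly as described above.
g = gcd(375, 1000)
assert g == 125
assert (375 // g, 1000 // g) == (3, 8)  # i.e. 0.375 == 3/8
```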