Why does count(*) return an unsigned integer? - google-bigquery

The following query is an example where default values (here INTEGER(21)) are mixed with computed values (here COUNT(*)).
SELECT
dimension,
SUM(metric)
FROM (
SELECT
"dim1" AS dimension,
INTEGER(21) AS metric),
(
SELECT
dimension,
COUNT(*) AS metric
FROM (
SELECT
"dim2" AS dimension,
INTEGER(42) AS metric)
GROUP BY
dimension)
GROUP BY
dimension
When run, this query is rejected with the following error message:
Cannot union tables : Incompatible types. 'metric' : TYPE_INT64 'metric' : TYPE_UINT64
In other words, the count operation returns an unsigned integer, whereas an integer created manually is signed. I understand the underlying logic of the count operation, which obviously always returns an integer greater than or equal to 0. I also understand that this can be avoided by casting COUNT(*), i.e. wrapping it in the INTEGER constructor on line 11 of my sample query.
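For reference, a sketch of that workaround applied to the second subquery; the only change is wrapping COUNT(*) in INTEGER() so that both sides of the implicit union carry TYPE_INT64:
SELECT
dimension,
INTEGER(COUNT(*)) AS metric
FROM (
SELECT
"dim2" AS dimension,
INTEGER(42) AS metric)
GROUP BY
dimension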
I guess my real question is: why does COUNT(*) return an unsigned integer instead of a signed one (which would allow for cleaner and simpler queries as is the case in other SQL-like environments)?

It was just an unfortunate mistake to make COUNT return an unsigned integer type, especially since BigQuery doesn't even support unsigned integers in its metadata. But this (and many other issues) is fixed with standard SQL support in BigQuery, which is available as an Alpha. For details on how to enable it, check https://cloud.google.com/bigquery/sql-reference/enabling-standard-sql

A count can never produce a negative number, so by making it an unsigned int the range of numbers that can be handled is expanded.

There are several reasons why using an unsigned int is advantageous:
Philosophical: As you mentioned, COUNT cannot return negative numbers, only natural numbers, which is what unsigned ints are designed to represent. It's the right tool for the job.
Range: An unsigned int can store roughly twice as many non-negative values as a signed int (the concrete 64-bit limits are shown after this list). This greatly decreases the likelihood that the variable will overflow while representing the output of the function.
Type safety: By using a type that cannot represent invalid data, it prevents you, the user, from making invalid comparisons. If you try to compare the output of COUNT with a negative number, the analyzer can tell you immediately that the comparison doesn't make sense and is likely to be wrong, potentially saving you from annoying bugs down the line.
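To put the range point in concrete terms, the 64-bit limits work out as follows:
signed 64-bit (INT64):    maximum 2^63 - 1 = 9223372036854775807
unsigned 64-bit (UINT64): maximum 2^64 - 1 = 18446744073709551615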

Related

Shouldn't binary_double store a higher value than number in Oracle?

Consider the following test code:
CREATE TABLE binary_test (bin_float BINARY_FLOAT, bin_double BINARY_DOUBLE, NUM NUMBER);
INSERT INTO binary_test VALUES (4356267548.32345E+100, 4356267548.32345E+2+300, 4356267548.32345E+100);
SELECT CASE WHEN bin_double>to_binary_double(num) THEN 'Greater'
WHEN bin_double=to_binary_double(num) THEN 'Equal'
WHEN bin_double<to_binary_double(num) THEN 'Lower'
ELSE 'Unknown' END comparison,
A.*
FROM binary_test A;
I've tried to see which one stores higher values. If I try to use E+300 for the number and binary_float columns, it returns a numeric overflow error. So I thought I could store a greater value in the binary_double column.
However, when I tried to check it, it shows a lower value, and the case comparison says it is lower too. Could you please elaborate on this situation?
You are inserting the value 4356267548.32345E+2+300 into the binary double column. That evaluates to 4356267548.32345E+2, which is 435626754832.345, plus 300 - which is 435626755132.345 (or 4.35626755132345E+011, which becomes 4.3562675513234497E+011 when converted to binary double). That is clearly lower than 4356267548.32345E+100 (or 4.35626754832345E+109, which becomes 4.3562675483234496E+109 when converted to binary double).
Not directly relevant, but you should also be aware that you're providing a decimal number literal, which will be implicitly converted to binary double during insert. So you can't use 4356267548.32345E+300, as that is too large for the number data type. If you want to specify a binary double literal then you need to append a d to it, i.e. 4356267548.32345E+300d; but that is still too large.
The highest you can go with that numeric part is 4356267548.32345E+298d, which evaluates to 4.3562675483234498E+307 - just below the data type limit of 1.79769313486231E+308; and note the loss of precision.
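To see the same thing directly against the table above (assuming the other columns are left NULL), the d suffix marks a BINARY_DOUBLE literal; the first insert fits, while the larger exponent does not:
INSERT INTO binary_test (bin_double) VALUES (4356267548.32345E+298d); -- about 4.3562675483234498E+307, just below the limit
-- INSERT INTO binary_test (bin_double) VALUES (4356267548.32345E+300d); -- still too large, as noted above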

CASTING to NUMERIC in SQL

I am trying to understand the ARPU calculation in SQL from the following code, but I don't understand why the author has used NUMERIC with revenue in the second query. Won't revenue (meal_price * order_quantity) be numeric anyway?
The issue is probably the following. NUMERIC is a specific data type. However, it is not clear that meal_price and order_quantity are specifically NUMERIC -- and not some other type such as INT.
Many databases do integer division for INT, so 1 / 2 is 0 rather than 0.5.
The conversion to NUMERIC is a simple way to avoid integer division.
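A minimal illustration of the difference (Postgres syntax, but the same idea applies in many databases):
SELECT 1 / 2;                  -- 0, because both operands are integers
SELECT CAST(1 AS NUMERIC) / 2; -- 0.5, once one operand is NUMERIC the division is exact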
Of course, if a and b are numeric types, a * b will be a numeric type.
But there are many different numeric types, see
https://www.postgresql.org/docs/13/datatype-numeric.html
NUMERIC is a keyword that specifies a numeric type of arbitrary precision (see the previous link); it's often used for exact calculations (accounting) that cannot be done in a floating-point type.
In your case the author chose to define the type he wants to use rather than let the system/DB choose for him (try to figure out, if a and b are integers, what the type of the result of 2 * 4 / 3 should be). It's good practice.
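That parenthetical example plays out like this in Postgres, which is why pinning the type down matters:
SELECT 2 * 4 / 3;                  -- 2, integer division truncates
SELECT 2 * 4 / CAST(3 AS NUMERIC); -- roughly 2.667, the fraction is kept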

Bigquery: INTEGER type overflow

I'm experiencing a problem with the INTEGER type. It overflows and there is no way to prevent it (as it's a 64-bit int). The worst thing is that it overflows with no error, just becoming a negative number:
SELECT 9223372036854775807 + 1
Is there any possibility to overcome this issue (maybe Google has plans to introduce new int types)?
BigQuery will provide an option for SQL to raise an error in such cases (integer overflow, division by zero, etc.).
You can detect such conditions and use e.g. NULL as an error indicator, at the cost of more typing.
Something like (assuming you are adding up two non-negative values):
select if(a + b >= a, a + b, NULL) from
( -- sample data
select 9223372036854775807 as a, 1 as b
)
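For what it's worth, under the standard SQL dialect now available in BigQuery the same addition fails with an integer overflow error instead of silently wrapping (the exact error text is not quoted here):
#standardSQL
SELECT 9223372036854775807 + 1 -- raises an overflow error rather than returning a negative number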

Arithmetic operation with numeric datatype in SQL server yields different results

I get different results when using real and numeric data type.
When I use real as the datatype I get finalValue as -139.2466; when I use the numeric datatype I get finalValue as -139.246409. Which value is correct?
When I plug these numbers into Excel, it matches the value -139.2466.
For example:
create table #resr ( a1 real, a2 real, a3 real)
insert #resr select 0.471163361822717, 0.0096160000 , 0.001669000000000
select a1*a2*-51.295/a3 finalValue from #resr
create table #resn ( a1 numeric(30,15), a2 numeric(30,15), a3 numeric(30,15))
insert #resn select 0.471163361822717, 0.0096160000 , 0.001669000000000
select a1*a2*-51.295/a3 finalValue from #resn
Floating point data types (of which REAL is a member) are approximate values, and can use any of a number of algorithms to encode a number, causing minute differences in how they're interpreted in SQL. This is why a single float(10) can hold a value of 1234567890 as well as .1234567890:
select cast(1234567890 as float(10))
select cast(.1234567890 as float(10))
Exact values (such as Decimal and Numeric) define exactly how many decimal places are allowed, and fill in zeroes out to as many places as have been defined.
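For instance (SQL Server):
SELECT CAST(1.5 AS numeric(10,4)); -- 1.5000, padded to the declared scale
SELECT CAST(1.5 AS float(10));     -- 1.5, stored as an approximate binary value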
Floats give you the ability to model a wider range of numbers since you can allow extremely large numbers and extremely small numbers by allowing the decimal point to "float" rather than be a fixed point in memory. They're also fine in most cases as usually the decimal precision you lose isn't a big deal. They also tend to be smaller than precise data types (not always). However, if you know the size of the values you're expecting ahead of time, it's usually best to use a decimal.
Which value is "correct"? The numeric value. If you're ever comparing a floating point representation of a number vs an exact representation, go with the exact representation.

Multiplication with NULL and empty column values in SQL

This was my interview question:
There are two columns called Length and Breadth in an Area table:
Length   Breadth   Length*Breadth
20       NULL      ?
30                 ?
21.2     1         ?
I tried running the same scenario in MySQL. To insert an empty value I tried the query below; am I missing anything when inserting empty values in MySQL?
insert into test.new_table values (30,);
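That statement is a syntax error in MySQL; presumably the intent was to supply NULL explicitly (assuming new_table has just the two columns and Breadth is nullable):
insert into test.new_table values (30, NULL);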
Answers: with NULL, the result is NULL.
With float and int multiplication, the result is a float.
As per your question the expected results would be as below.
SELECT LENGTH,BREADTH,LENGTH*BREADTH AS CALC_AREA FROM AREA;
LENGTH   BREADTH   CALC_AREA
20       NULL      NULL
30       0         0
21.2     1         21.2
For the first record, in SQL Server, if you do a computation with NULL the answer is NULL.
For the second record, in SQL Server, if you do a product computation between a non-empty value and an empty value, the result is zero because the empty value is treated as zero.
For the third record, in SQL Server, if you do a computation between two non-empty values, the answer is a non-empty value.
Check SQL Fiddle for reference - http://sqlfiddle.com/#!3/f250a/1
That blank Breadth (second row) cannot happen unless Breadth is VARCHAR. Assuming that, the answers will be:
NULL (since NULL times anything is NULL)
Throws an error (since an empty string is not a number; in SQL Server, the error is "Error converting data type varchar to numeric.")
21.20 (since in SQL Server, for example, conversion to a numeric type is automatic, so SELECT 21.2 * '1' returns 21.20).
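A quick way to check all three cases in SQL Server (the second statement is commented out because it fails):
SELECT 20 * NULL;    -- NULL
-- SELECT 21.2 * ''; -- fails: "Error converting data type varchar to numeric."
SELECT 21.2 * '1';   -- 21.20, the string is implicitly converted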
Assuming that Length and Breadth are numerical types of some kind, the second record does not contain possible values: Breadth must be either 0 or NULL.
In any event, any mathematical operation in SQL involving a NULL value will return NULL, indicating that the expression cannot be evaluated. The answers are NULL, impossible, and 21.2.
The product of any value and NULL is NULL. This is called "NULL propagation" if you want to Google it. To score points in an interview, you might want to mention that NULL isn't a value; it's a special marker.
The fact that the column Breadth has one entry "NULL" and one entry that's blank (on the second row) is misleading. A numeric column that doesn't have a value in a particular row is NULL in that row. So the second row should also show "NULL".
The answer to the third row, 21.2 * 1, depends on the data type of the column "Length*Breadth". If it's a data type like float, double, or numeric(16,2), the answer is 21.2. If it's an integer column (integer, long, etc.), the answer is 21.
A more snarky answer might be "There's no answer. The string "Length*Breadth" isn't a legal SQL column name."
In standard SQL they would all generate errors because you are comparing values (or nulls) of different types:
CAST ( 20 AS FLOAT ) * CAST ( NULL AS INTEGER ) -- mismatched types error
CAST ( '' AS INTEGER ) -- type conversion error
CAST ( AS INTEGER ) -- type conversion error
CAST ( 21.2 AS FLOAT ) * CAST ( 2 AS INTEGER ) -- mismatched types error
On the other hand, most SQL products will implicitly cast values when comparing values (or nulls) of different types according to type precedence, e.g. comparing a float value to an integer value will in effect cast the integer to float and result in a float. At the product level, the most interesting question is what happens when you compare a null of type integer with a value (or even a null) of type float...
...but, frankly, not terribly interesting. In an interview you are presented with a framework (in the form of questions asked of you) on which to present your knowledge, skills and experience. The 'answer' here is to discuss nulls (e.g. point out that nulls are tricky to define and behave in unintuitive ways, which leads to frequent bugs and a desire to avoid nulls entirely, etc) and whether implicit casting is a good thing.