BigQuery check entire table for null values - google-bigquery

Not sure if a reproducible example is necessary here. I have a big-ish and wide-ish table in BigQuery (10K rows x 100 cols) and I would like to know if any columns have null values, and how many null values there are. Is there a query that I can run that would return a 1-row table indicating the number of null values in each column, that doesn't require 100 ifnull calls?
Thanks!

Below is for BigQuery Standard SQL
#standardSQL
SELECT col_name, COUNT(1) nulls_count
FROM `project.dataset.table` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(\w+)":null')) col_name
GROUP BY col_name
Instead of returning just one row - it returns those column which have NULL in them - each column and count in separate row - like in below example
Row col_name nulls_count
1 col_a 21
2 col_d 12

This will provide the percentage of null values:
SELECT col_name,
COUNT(1) AS nulls_count,
round(100*(count(1)/
(SELECT count(*)
FROM `project.dataset.table`)), 2) AS percent_nulls
FROM `project.dataset.table` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"(\w+)":null')) col_name
GROUP BY col_name
ORDER BY nulls_count DESC

Related

BigQuery Count Unique and Count Distinct

I am looking for SQL to count unique values in the column.
I am aware of DISTINCT - that gives me how many unique values there are. However, I am looking for - how many ONLY unique values there are.
So if my data is Letters: {A,A,A,B,B,B,C,D}. I am looking to get:
Count Distinct = 4 {A,B,C,D) and
Count Unique = 2 {C,D} <== this is what I am looking for
I am working with BigQuery.
Thank You,
Do
Below query will return only unique values in the column.
SELECT col
FROM UNNEST(SPLIT('A,A,A,B,B,B,C,D')) col
GROUP BY 1 HAVING COUNT(1) = 1;
Then, you can simply count rows.
WITH uniques AS (
SELECT col
FROM UNNEST(SPLIT('A,A,A,B,B,B,C,D')) col
GROUP BY 1 HAVING COUNT(1) = 1
)
SELECT COUNT(*) cnt FROM uniques;
Another option
select count(*) from (
select * from your_table
qualify 1 = count(*) over(partition by col)
)

How to find maximum value in entire table using SQL

I have a table with multiple columns and wanted to find the maximum value in the entire table (across all columns), let me know if it is possible? if yes how
All columns are in integer data type
You can use MAX and GREATEST to achieve this.
Data Set:
CREATE TABLE test
(
col1 INTEGER,
col2 INTEGER,
col3 INTEGER
);
INSERT INTO test VALUES
(1,100,2 ),(2,300,3 ),(3,350, 400 );
You can achieve it using below.
SELECT Greatest(Max(col1), Max(col2), Max(col3)) as Max_Value
FROM test;
DB Fiddle: Try it here
In Postgres and Oracle (and I believe in MySQL as well) you can use:
select max(greatest(col_1, col_2, col_3, col_4))
from the_table;
Supose that you have these columns:
ID (PK)
Column 1 (int)
Column 2 (int)
Column 3 (int)
You can use a SELECT with a UNION clause inside it, something like this:
SELECT ID, MAX(FindNumber) AS FoundedNumber
FROM
(
SELECT ID, Column1 AS FindNumber
FROM YourTable
UNION
SELECT ID, Column2 AS FindNumber
FROM YourTable
UNION
SELECT ID, Column3 AS FindNumber
FROM YourTable
) subselect
GROUP BY ID
This solution is for Microsoft SQL Server.

What is the difference between Postgres DISTINCT vs DISTINCT ON?

I have a Postgres table created with the following statement. This table is filled by as dump of data from another service.
CREATE TABLE data_table (
date date DEFAULT NULL,
dimension1 varchar(64) DEFAULT NULL,
dimension2 varchar(128) DEFAULT NULL
) TABLESPACE pg_default;
One of the steps in a ETL I'm building is extracting the unique values of dimension1 and inserting them in another intermediary table.
However, during some tests I found out that the 2 commands below do not return the same results. I would expect for both to return the same sum.
The first command returns more results compared with the second (1466 rows vs. 1504.
-- command 1
SELECT DISTINCT count(dimension1)
FROM data_table;
-- command 2
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;
Any obvious explanations for this? Alternatively to an explanation, is there any suggestion of any check on the data I should do?
EDIT: The following queries both return 1504 (same as the "simple" DISTINCT)
SELECT count(*)
FROM data_table WHERE dimension1 IS NOT NULL;
SELECT count(dimension1)
FROM data_table;
Thank you!
DISTINCT and DISTINCT ON have completely different semantics.
First the theory
DISTINCT applies to an entire tuple. Once the result of the query is computed, DISTINCT removes any duplicate tuples from the result.
For example, assume a table R with the following contents:
#table r;
a | b
---+---
1 | a
2 | b
3 | c
3 | d
2 | e
1 | a
(6 rows)
SELECT distinct * from R will result:
# select distinct * from r;
a | b
---+---
1 | a
3 | d
2 | e
2 | b
3 | c
(5 rows)
Note that distinct applies to the entire list of projected attributes: thus
select distinct * from R
is semantically equivalent to
select distinct a,b from R
You cannot issue
select a, distinct b From R
DISTINCT must follow SELECT. It applies to the entire tuple, not to an attribute of the result.
DISTINCT ON is a postgresql addition to the language. It is similar, but not identical, to group by.
Its syntax is:
SELECT DISTINCT ON (attributeList) <rest as any query>
For example:
SELECT DISTINCT ON (a) * from R
It semantics can be described as follows. Compute the as usual--without the DISTINCT ON (a)---but before the projection of the result, sort the current result and group it according to the attribute list in DISTINCT ON (similar to group by). Now, do the projection using the first tuple in each group and ignore the other tuples.
Example:
select * from r order by a;
a | b
---+---
1 | a
2 | e
2 | b
3 | c
3 | d
(5 rows)
Then for every different value of a (in this case, 1, 2 and 3), take the first tuple. Which is the same as:
SELECT DISTINCT on (a) * from r;
a | b
---+---
1 | a
2 | b
3 | c
(3 rows)
Some DBMS (most notably sqlite) will allow you to run this query:
SELECT a,b from R group by a;
And this give you a similar result.
Postgresql will allow this query, if and only if there is a functional dependency from a to b. In other words, this query will be valid if for any instance of the relation R, there is only one unique tuple for every value or a (thus selecting the first tuple is deterministic: there is only one tuple).
For instance, if the primary key of R is a, then a->b and:
SELECT a,b FROM R group by a
is identical to:
SELECT DISTINCT on (a) a, b from r;
Now, back to your problem:
First query:
SELECT DISTINCT count(dimension1)
FROM data_table;
computes the count of dimension1 (number of tuples in data_table that where dimension1 is not null). This query
returns one tuple, which is always unique (hence DISTINCT
is redundant).
Query 2:
SELECT count(*)
FROM (SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
GROUP BY dimension1) AS tmp_table;
This is query in a query. Let me rewrite it for clarity:
WITH tmp_table AS (
SELECT DISTINCT ON (dimension1)
dimension1 FROM data_table
GROUP by dimension1)
SELECT count(*) from tmp_table
Let us compute first tmp_table. As I mentioned above,
let us first ignore the DISTINCT ON and do the rest of the
query. This is a group by by dimension1. Hence this part of the query
will result in one tuple per different value of dimension1.
Now, the DISTINCT ON. It uses dimension1 again. But dimension1 is unique already (due to the group by). Hence
this makes the DISTINCT ON superflouos (it does nothing).
The final count is simply a count of all the tuples in the group by.
As you can see, there is an equivalence in the following query (it applies to any relation with an attribute a):
SELECT (DISTINCT ON a) a
FROM R
and
SELECT a FROM R group by a
and
SELECT DISTINCT a FROM R
Warning
Using DISTINCT ON results in a query might be non-deterministic for a given instance of the database.
In other words, the query might return different results for the same tables.
One interesting aspect
Distinct ON emulates a bad behaviour of sqlite in a much cleaner way. Assume that R has two attributes a and b:
SELECT a, b FROM R group by a
is an illegal statement in SQL. Yet, it runs on sqlite. It simply takes a random value of b from any of the tuples in the group of same values of a.
In Postgresql this statement is illegal. Instead, you must use DISTINCT ON and write:
SELECT DISTINCT ON (a) a,b from R
Corollary
DISTINCT ON is useful in a group by when you want to access a value that is functionally dependent on the group by attributes. In other words, if you know that for every group of attributes they always have the same value of the third attribute, then use DISTINCT ON that group of attributes. Otherwise you would have to make a JOIN to retrieve that third attribute.
The first query gives the number of not null values of dimension1, while the second one returns the number of distinct values of the column. These numbers obviously are not equal if the column contains duplicates or nulls.
The word DISTINCT in
SELECT DISTINCT count(dimension1)
FROM data_table;
makes no sense, as the query returns a single row. Maybe you wanted
SELECT count(DISTINCT dimension1)
FROM data_table;
which returns the number of distinct not null values of dimension1. Note, that it is not the same as
SELECT count(*)
FROM (
SELECT DISTINCT ON (dimension1) dimension1
FROM data_table
-- GROUP BY dimension1 -- redundant
) AS tmp_table;
The last query yields the number of all (null or not) distinct values of the column.
To learn and understand what happens by visual example.
Here's a bit of SQL to execute on a PostgreSQL:
DROP TABLE IF EXISTS test_table;
CREATE TABLE test_table (
id int NOT NULL primary key,
col1 varchar(64) DEFAULT NULL
);
INSERT INTO test_table (id, col1) VALUES
(1,'foo'), (2,'foo'), (3,'bar'), (4,null);
select count(*) as total1 from test_table;
-- returns: 4
-- Because the table has 4 records.
select distinct count(*) as total2 from test_table;
-- returns: 4
-- The count(*) is just one value. Making 1 total unique can only result in 1 total.
-- So the distinct is useless here.
select col1, count(*) as total3 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- Since there are 3 unique col1 values. NULL's are included.
select distinct col1, count(*) as total4 from test_table group by col1 order by col1;
-- returns 3 rows: ('bar',1),('foo',2),(NULL,1)
-- The result is already grouped, and therefor already unique.
-- So again, the distinct does nothing extra here.
select count(distinct col1) as total5 from test_table;
-- returns 2
-- NULL's aren't counted in a count by value. So only 'foo' & 'bar' are counted
select distinct on (col1) id, col1 from test_table order by col1 asc, id desc;
-- returns 3 rows: (2,'a'),(3,'b'),(4,NULL)
-- So it gets the records with the maximum id per unique col1
-- Note that the "order by" matters here. Changing that DESC to ASC would get the minumum id.
select count(*) as total6 from (select distinct on (col1) id, col1 from test_table order by col1 asc, id desc) as q;
-- returns 3.
-- After seeing the previous query, what else would one expect?
select distinct col1 from test_table order by col1;
-- returns 3 unique values : ('bar'),('foo'),(null)
select distinct id, col1 from test_table order by col1;
-- returns all records.
-- Because id is the primary key and therefore makes each returned row unique
Here's a more direct summary that might useful for Googlers, answering the title but not the intricacies of the full post:
SELECT DISTINCT
availability: ISO
behaviour:
SELECT DISTINCT col1, col2, col3 FROM mytable
returns col1, col2 and col3 and omits any rows in which all of the tuple (col1, col2, col3) are the same. E.g. you could get a result like:
1 2 3
1 2 4
because those two rows are not identical due to the 4. But you could never get:
1 2 3
1 2 4
1 2 3
because 1 2 3 appears twice, and both rows are exactly the same. That is what DISTINCT prevents.
vs GROUP BY: SELECT DISTINCT is basically a subset of GROUP BY where you can't use aggregate functions: Is there any difference between GROUP BY and DISTINCT
SELECT DISTINCT ON
availability: PostgreSQL extension, WONTFIXED by SQLite
behavior: unlike DISTINCT, DISTINCT ON allows you to separate
what you want to be unique
from what you want to return
E.g.:
SELECT DISTINCT ON(col1) col2, col3 FROM mytable
returns col2 and col3, and does not return any two rows with the same col1. E.g.:
1 2 3
1 4 5
could not happen, because we have 1 twice on col1.
And e.g.:
SELECT DISTINCT ON(col1, col2) col2, col3 FROM mytable
would prevent any duplicated (col1, col2) tuples, e.g. you could get:
1 2 3
1 4 3
as it has different (1, 2) and (1, 4) tuples, but not:
1 2 3
1 2 4
where (1, 2) happens twice, only one of those two could appear.
We can uniquely determine which one of the possible rows will be selected with ORDER BY which guarantees that the first match is taken, e.g.:
SELECT DISTINCT ON(col1, col2) col2, col3 FROM mytable
ORDER BY col1 DESC, col2 DESC, col3 DESC
would ensure that among:
1 2 3
1 2 4
only 1 2 4 would be picked as it happens first on our DESC sorting.
vs GROUP BY: DISTINCT ON is not a subset of GROUP BY because it allows you to access extra rows not present in the GROUP BY, which is generally not allowed in GROUP BY, unless:
you group by primary key in Postgres (unique not null is a TODO for them)
or if that is allows as an ISO extension as in SQLite/MySQL
This makes DISTINCT ON extremely useful to fulfill the common use case of "find the full row that reaches the maximum/minimum of some column": Is there any difference between GROUP BY and DISTINCT
E.g. to find the city of each country that has the most sales:
SELECT DISTINCT ON ("country") "country", "city", "amount"
FROM "Sales"
ORDER BY "country" ASC, "amount" DESC, "city" ASC
or equivalently with * if we want all columns:
SELECT DISTINCT ON ("country") *
FROM "Sales"
ORDER BY "country" ASC, "amount" DESC, "city" ASC
Here each country appears only once, within each country we then sort by amount DESC and take the first, and therefore highest, amount.
RANK and ROW_NUMBER window functions
These can be used basically as supersets of DISTINCT ON, and implemented tested as of both SQLite 3.34 and PostgreSQL 14.3. I highly recommend also looking into them, see e.g.: How to SELECT DISTINCT of one column and get the others?
This is how the above "city with the highest amount of each country" query would look like with ROW_NUMBER:
SELECT *
FROM (
SELECT
ROW_NUMBER() OVER (
PARTITION BY "country"
ORDER BY "amount" DESC, "city" ASC
) AS "rnk",
*
FROM "Sales"
) sub
WHERE
"sub"."rnk" = 1
ORDER BY
"sub"."country" ASC
Try
SELECT count(dimension1a)
FROM (SELECT DISTINCT ON (dimension1) dimension1a
FROM data_table
ORDER BY dimension1) AS tmp_table;
DISTINCT ON appears to be synonymous with GROUP BY.

Does Hive use NULL as a possible value to aggregate by?

When I aggregate data in hive, how does the group by statement treat NULL values in the aggregating column?
Say I launch the following query
select col_a, count(1) from mytable group by col_a ;
and that col_a contains 0, 1 and NULL values. Will the result have 2 rows (0 and 1)
or 3 (0,1 and NULL)?
Hive does group by on NULL values and you will get 3 values. Also please modify your query to this:
Select col_a, count(1) from mytable group by col_a ;

GROUP BY without aggregate function

I am trying to understand GROUP BY (new to oracle dbms) without aggregate function.
How does it operate?
Here is what i have tried.
EMP table on which i will run my SQL.
SELECT ename , sal
FROM emp
GROUP BY ename , sal
SELECT ename , sal
FROM emp
GROUP BY ename;
Result
ORA-00979: not a GROUP BY expression
00979. 00000 - "not a GROUP BY expression"
*Cause:
*Action:
Error at Line: 397 Column: 16
SELECT ename , sal
FROM emp
GROUP BY sal;
Result
ORA-00979: not a GROUP BY expression
00979. 00000 - "not a GROUP BY expression"
*Cause:
*Action: Error at Line: 411 Column: 8
SELECT empno , ename , sal
FROM emp
GROUP BY sal , ename;
Result
ORA-00979: not a GROUP BY expression
00979. 00000 - "not a GROUP BY expression"
*Cause:
*Action: Error at Line: 425 Column: 8
SELECT empno , ename , sal
FROM emp
GROUP BY empno , ename , sal;
So, basically the number of columns have to be equal to the number of columns in the GROUP BY clause, but i still do not understand why or what is going on.
That's how GROUP BY works. It takes several rows and turns them into one row. Because of this, it has to know what to do with all the combined rows where there have different values for some columns (fields). This is why you have two options for every field you want to SELECT : Either include it in the GROUP BY clause, or use it in an aggregate function so the system knows how you want to combine the field.
For example, let's say you have this table:
Name | OrderNumber
------------------
John | 1
John | 2
If you say GROUP BY Name, how will it know which OrderNumber to show in the result? So you either include OrderNumber in group by, which will result in these two rows. Or, you use an aggregate function to show how to handle the OrderNumbers. For example, MAX(OrderNumber), which means the result is John | 2 or SUM(OrderNumber) which means the result is John | 3.
Given this data:
Col1 Col2 Col3
A X 1
A Y 2
A Y 3
B X 0
B Y 3
B Z 1
This query:
SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2, Col3
Would result in exactly the same table.
However, this query:
SELECT Col1, Col2 FROM data GROUP BY Col1, Col2
Would result in:
Col1 Col2
A X
A Y
B X
B Y
B Z
Now, a query:
SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2
Would create a problem: the line with A, Y is the result of grouping the two lines
A Y 2
A Y 3
So, which value should be in Col3, '2' or '3'?
Normally you would use a GROUP BY to calculate e.g. a sum:
SELECT Col1, Col2, SUM(Col3) FROM data GROUP BY Col1, Col2
So in the line, we had a problem with we now get (2+3) = 5.
Grouping by all your columns in your select is effectively the same as using DISTINCT, and it is preferable to use the DISTINCT keyword word readability in this case.
So instead of
SELECT Col1, Col2, Col3 FROM data GROUP BY Col1, Col2, Col3
use
SELECT DISTINCT Col1, Col2, Col3 FROM data
You're experiencing a strict requirement of the GROUP BY clause. Every column not in the group-by clause must have a function applied to reduce all records for the matching "group" to a single record (sum, max, min, etc).
If you list all queried (selected) columns in the GROUP BY clause, you are essentially requesting that duplicate records be excluded from the result set. That gives the same effect as SELECT DISTINCT which also eliminates duplicate rows from the result set.
The only real use case for GROUP BY without aggregation is when you GROUP BY more columns than are selected, in which case the selected columns might be repeated. Otherwise you might as well use a DISTINCT.
It's worth noting that other RDBMS's do not require that all non-aggregated columns be included in the GROUP BY. For example in PostgreSQL if the primary key columns of a table are included in the GROUP BY then other columns of that table need not be as they are guaranteed to be distinct for every distinct primary key column. I've wished in the past that Oracle did the same as it would have made for more compact SQL in many cases.
Let me give some examples.
Consider this data.
CREATE TABLE DATASET ( VAL1 CHAR ( 1 CHAR ),
VAL2 VARCHAR2 ( 10 CHAR ),
VAL3 NUMBER );
INSERT INTO
DATASET ( VAL1, VAL2, VAL3 )
VALUES
( 'b', 'b-details', 2 );
INSERT INTO
DATASET ( VAL1, VAL2, VAL3 )
VALUES
( 'a', 'a-details', 1 );
INSERT INTO
DATASET ( VAL1, VAL2, VAL3 )
VALUES
( 'c', 'c-details', 3 );
INSERT INTO
DATASET ( VAL1, VAL2, VAL3 )
VALUES
( 'a', 'dup', 4 );
INSERT INTO
DATASET ( VAL1, VAL2, VAL3 )
VALUES
( 'c', 'c-details', 5 );
COMMIT;
Whats there in table now
SELECT * FROM DATASET;
VAL1 VAL2 VAL3
---- ---------- ----------
b b-details 2
a a-details 1
c c-details 3
a dup 4
c c-details 5
5 rows selected.
--aggregate with group by
SELECT
VAL1,
COUNT ( * )
FROM
DATASET A
GROUP BY
VAL1;
VAL1 COUNT(*)
---- ----------
b 1
a 2
c 2
3 rows selected.
--aggregate with group by multiple columns but select partial column
SELECT
VAL1,
COUNT ( * )
FROM
DATASET A
GROUP BY
VAL1,
VAL2;
VAL1
----
b
c
a
a
4 rows selected.
--No aggregate with group by multiple columns
SELECT
VAL1,
VAL2
FROM
DATASET A
GROUP BY
VAL1,
VAL2;
VAL1
----
b b-details
c c-details
a dup
a a-details
4 rows selected.
--No aggregate with group by multiple columns
SELECT
VAL1
FROM
DATASET A
GROUP BY
VAL1,
VAL2;
VAL1
----
b
c
a
a
4 rows selected.
You have N columns in select (excluding aggregations), then you should have N or N+x columns
Use sub query e.g:
SELECT field1,field2,(SELECT distinct field3 FROM tbl2 WHERE criteria) AS field3
FROM tbl1 GROUP BY field1,field2
OR
SELECT DISTINCT field1,field2,(SELECT distinct field3 FROM tbl2 WHERE criteria) AS field3
FROM tbl1
If you have some column in SELECT clause , how will it select it if there is several rows ? so yes , every column in SELECT clause should be in GROUP BY clause also , you can use aggregate functions in SELECT ...
you can have column in GROUP BY clause which is not in SELECT clause , but not otherwise
As an addition
basically the number of columns have to be equal to the number of columns in the GROUP BY clause
is not a correct statement.
Any attribute which is not a part of GROUP BY clause can not be used for selection
Any attribute which is a part of GROUP BY clause can be used for selection but not mandatory.
For anyone trying to group data (from foreign tables as an example) like a json object with nested arrays of data you can achieve this in sql with array_agg (you can also use this in conjunction with json_build_object to create a json object with key-value pairs).
As a refference, I found helpful this video on yt: https://www.youtube.com/watch?v=A6N1h9mcJf4
-- Edit
If you want to have a nested array inside a nested array, you could do it by using array.
In the following example, 'variation_images' (subquery 2 - in relation to the variation table) are nested under the 'variation' query (subquery 1 - in relation to product table) which is nested under the product query (main query):
SELECT product.title, product.slug, product.description,
ARRAY(SELECT jsonb_build_object(
'var_id', variation.id, 'var_name', variation.name, 'images',
ARRAY(SELECT json_build_object('img_url', variation_images.images)
FROM variation_images WHERE variation_images.variation_id = variation.id)
)
FROM variation WHERE variation.product_id = product.id)
FROM product
I know you said you want to understand group by if you have data like this:
COL-A COL-B COL-C COL-D
1 Ac C1 D1
2 Bd C2 D2
3 Ba C1 D3
4 Ab C1 D4
5 C C2 D5
And you want to make the data appear like:
COL-A COL-B COL-C COL-D
4 Ab C1 D4
1 Ac C1 D1
3 Ba C1 D3
2 Bd C2 D2
5 C C2 D5
You use:
select * from table_name
order by col-c,colb
Because I think this is what you intend to do.