How to only SELECT rows with non-zero and non-null columns efficiently in Big Query? - sql

I am having a table with large number of columns in Big Query.
The table has lot of rows with some column values as 0/0.0 and null.
For example
Row A B C D E F
1 "abc" 0 null "xyz" 0 0.0
2 "bcd" 1 5 "wed" 4 65.5
I need to select only those rows which have non zero Integer, Float and non NULL values. Basically, I need just Row 2 in the above table
I know I can do this by using this query for each of the columns
SELECT * FROM table WHERE (B IS NOT NULL AND B is !=0) AND
.
.
.
But I have lot of columns and writing query like this for each of the columns would be difficult. Is there any better approach to handle this?

Below example for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT "abc" a, 0 b, NULL c, "xyz" d, 0 e, 0.0 f UNION ALL
SELECT "bcd", 1, 5, "wed", 4, 65.5
)
SELECT *
FROM `project.dataset.table` t
WHERE NOT REGEXP_CONTAINS(TO_JSON_STRING(t), r':0[,}]|null[,}]')
with output
Row a b c d e f
1 bcd 1 5 wed 4 65.5

Related

SQL Return rows with mix of nulls and non nulls in certain columns

If I have the following table
id a b c time
-----------------------------
0 1 4 "ca" 23
1 NULL NULL NULL 18
2 NULL 1 "pn" 13
3 6 NULL "ar" 27
4 1 2 NULL 24
I want to return all rows with at least one null and one non-null in columns a, b, and c. So I want to return:
id a b c time
-----------------------------
2 NULL 1 "pn" 13
3 6 NULL "ar" 27
4 1 2 NULL 24
I know I can write
select *
from table
where ((a is null and (b is not null or c is not null))
or (a is not null and (b is null or c is null)))
But what happens if I need to consider 4 columns or more? It becomes a mess. Note that the table could have 20 or more columns, of which I am only considering a small subset of columns for null/non-null analysis. Is there a concise way of doing this? Thanks
One method would be to unpivot your data, and COUNT the NULL and non-NULL values, and filter on that:
SELECT V.ID,
V.a,
V.b,
V.c,
V.time
FROM (VALUES(0,1,4,'"ca"',23),
(1,NULL,NULL,NULL,18),
(2,NULL,1,'"pn"',13),
(3,6,NULL,'"ar"',27),
(4,1,2,NULL,24))V(ID,a,b,c,time)
CROSS APPLY (SELECT COUNT(UP.V) AS NonNull,
COUNT(CASE WHEN UP.V IS NULL THEN 1 END) AS IsNull
FROM (VALUES(CONVERT(varchar(1),V.a)),
(CONVERT(varchar(1),V.b)),
(CONVERT(varchar(1),V.c)))UP(V))C
WHERE C.[IsNull] > 0
AND C.NonNull > 0;

Split function across multiple fields in BigQuery SQL

I have data like this:
Each column will have the same number of elements across a row, where the first element in the first column corresponds to the first element in the second column etc.
How can I flatten this to get the below?
With a single column I am able to do this by combining a CROSS JOIN with an UNNEST but I cannot get this to work with multiple columns since the join ends up creating multiple variations and UNNEST loses the order of the array so I can't match them.
If I were building the arrays from scratch, I would use some kind of STRUCT element in there, but I can't find a way of doing this when the arrays are created from a SPLIT()?
WITH_OFFSET is your friend here:
WITH strings AS (
SELECT "a,b,c" a, "aa,bb,cc" b
UNION ALL
SELECT "a1,b1,c1" a, "aa1,bb1,cc1" b
)
SELECT x_a, x_b
FROM strings
, UNNEST(SPLIT(a)) x_a WITH OFFSET o_a
JOIN UNNEST(SPLIT(b)) x_b WITH OFFSET o_b
ON o_a=o_b
Another approach for BigQuery Standard SQL is shown below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, 'a|b|c' col1, 'n|o|p' col2 UNION ALL
SELECT 2, 'd|e', 'q|r' UNION ALL
SELECT 3, 'f|g|h|i', 's|t|u|v' UNION ALL
SELECT 4, 'j', 'w' UNION ALL
SELECT 5, 'k|l|m', 'x|y|z'
)
SELECT
id,
SPLIT(col1, '|')[SAFE_ORDINAL(pos)] value1,
SPLIT(col2, '|')[SAFE_ORDINAL(pos)] value2
FROM `project.dataset.table`,
UNNEST(GENERATE_ARRAY(1, ARRAY_LENGTH(SPLIT(col1, '|')))) pos
with expected result
Row id value1 value2
1 1 a n
2 1 b o
3 1 c p
4 2 d q
5 2 e r
6 3 f s
7 3 g t
8 3 h u
9 3 i v
10 4 j w
11 5 k x
12 5 l y
13 5 m z

To pull 1 record out of multiple records having same data in a field based on other fields

A | B | C | D | E
a y 6 12 21
b n 3 10 5
c n 4 12 12
c n 7 12 2
c y 1 12 22
d n 6 10 32
d n 7 10 32
OUTPUT TABLE:
A | B | C | F
a y 6 21
b n 3 12
c y 1 22
d n 6 10
I have a table that contains certain fields. From that table I want to remove duplicate records in A and produce the output table.
Now, the field F is calculated based on the field C when there are no duplicates for the records in A. So, if there is only one record of a in A then if C>5 then the F Column(Output table) pulls the record in E column. So, if record b has the value <5 in field C, then the F column (output table) will pull the record in D column for b. I have been able to achieve this using a case statement.
However, when there are duplicate records in column A, I want only one of the records based on the column B. Only that record should be pulled that has the value 'y' in column B and where the column F contains the value from column E. If none of the duplicate records in A have a value of 'n' in the B column, then pull any record with column D as column F in the output table. I am not able to figure out this part.
Please let me know if anything is not clear.
Code I am using:
SELECT A,B,C,
CASE
WHEN (SELECT COUNT(*) FROM MyTable t2 WHERE t1.A=t2.A)>1
THEN (SELECT TOP 1 CASE WHEN b='y' THEN E ELSE D END
FROM MyTable t3
WHERE t3.A=t1.A
ORDER BY CASE WHEN b='y' THEN 0 ELSE 1 END)
ELSE {
case when cast(C as float) >= 5.00 then (Case when E = '0.00' then D else E end)
when cast(C as float)< 5.00 then D end )
}
END AS F
FROM MyTable t1
You might want to encapsulate this logic in a Function to make it look cleaner, but the logic would go like this:
IF the record count of rows in the table with the same value for A as the current row is greater than 1, THEN SELECT the TOP 1 record with this value for A ORDER BY CASE WHEN b='y' THEN 0 ELSE 1 END
Use another CASE WHEN b='y' to determine if you will use column E or D for output column F.
And ELSE (the record count is not greater than 1), use your existing CASE expression.
EDIT: Here is a more psuedo-codey explanation:
WITH cte AS (SELECT A,B,C,
ROW_NUMBER() OVER (PARTITION BY A, ORDER BY CASE WHEN b='y' THEN 0 ELSE 1 END) rn
FROM MyTable
)
SELECT A,B,C,
CASE
WHEN (SELECT COUNT(*) FROM MyTable t2 WHERE t1.A=t2.A)>1
THEN CASE WHEN b='y' THEN E ELSE D END
ELSE {use your existing CASE Expression}
END AS F
FROM cte t1
WHERE rn=1

SQL combining of a COUNT with a WHERE in single query

Here is the data, call it table T
A B
-- --
1 14
2 15
3 16
4 1
4 3
4 6
4 9
4 12
4 15
I would like to get the value of A that has only one value and a B value of 15.
There are two rows where B=15 but there are 6 rows where A=4 and only one row where A=2.
So the correct SQL should return me the 2.
I have tried this but it returns both rows.
select A from T group by A,B having Count(A) = 1 and B = 15
This similarly fails:
select A from T where B = 15 group by A having count( A ) = 1
Try this:
select A
from T
group by A
having Count(A) = 1 and Max(B) = 15;
Your problem seems to be that you are grouping by both columns. You only want to group by A.
Admittedly, your query has group by A, T, but I think that is a typo, based on the described behavior.
You can check the count of B after grouping by A.
select A
from T
group by A
having Count(B) = 1 and max(B) = 15

Finding unique values with multiple columns using certain condition

ID? A B C
--- -- -- --
1 J 1 B
2 J 1 S
3 M 1 B
4 M 1 S
5 M 2 B
6 M 2 S
7 T 1 B
8 T 2 S
9 C 1 B
10 C 1 S
11 C 2 B
12 N 1 S
13 N 2 S
14 N 3 S
15 Q 1 S
16 Q 1 S
17 Z 1 B
I need to find unique values with multiple column with some added condition. The unique value are combination of Col A,B and C.
If Col A has only two rows (like record 1 and 2) and the Column B is same on both data and there is a different value as in Column C then i dont need those records.
If Col A has only multiple rows (like record 3 to 6 ) with different Col B and C combination we want to see those values.
If Col A has multiple rows (like record 7 to 8 ) with different Col B and C combination we want to see those values.
If Col A has only multiple rows (like record 9 to 11 ) with similar/different Col B and C combination we want to see those values.
If Col A has only multiple rows (like record 12onwards ) with similar Col C and similar or different Column B we dont need those values...
If single value like Row 17 there is no need to display either
Tried a lot but not getting exact answer any help is greatly appreciated..
Trying to go through all the logic, I think you want all rows where the values of both columns A and B differ. An easy way to see whether records differ is by looking at the min and max values. And, you can do this using analytic functions:
select A, B, C
from (select t.*,
count(*) over (partition by A) as Acnt,
min(B) over (partition by A) as Bmin,
max(B) over (partition by A) as Bmax,
min(C) over (partition by A) as Cmin,
max(C) over (partition by A) as Cmax
from t
) t
where (Bmin <> Bmax or Cmin <> Cmax)
Your example data does not have any actual duplicates, so I don't think a count(distinct) is necessary. Your rules say nothing about what to do when A only appears once. This version will filter those rows out.