How to select columns of data in BigQuery that has all NULL values

A B C
NULL 1 NULL
NULL NULL NULL
NULL 2 NULL
NULL 3 NULL
I want to retrieve the names of the columns that contain only NULLs, here A and C. Can you help, please?

Expanding on my comment on Mikhail's answer, this is what I had in mind. It doesn't require generating a query string, which could be quite long if you have a large number of columns. It compares the count of null values for each column name to the total number of rows in the table to decide if the column should be included in the result.
#standardSQL
WITH `project.dataset.table` AS (
SELECT NULL A, 1 B, NULL C UNION ALL
SELECT NULL, NULL, NULL UNION ALL
SELECT NULL, 2, NULL UNION ALL
SELECT NULL, 3, NULL
)
SELECT null_column
FROM `project.dataset.table` AS t,
UNNEST(REGEXP_EXTRACT_ALL(
TO_JSON_STRING(t),
r'\"([a-zA-Z0-9\_]+)\":null')
) AS null_column
GROUP BY null_column
HAVING COUNT(*) = (SELECT COUNT(*) FROM `project.dataset.table`);
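To see why comparing per-column null counts with the row count works, here is a minimal Python sketch of the same idea. It is not BigQuery code: json.dumps stands in for TO_JSON_STRING, re.findall for REGEXP_EXTRACT_ALL, and the function name all_null_columns is made up for illustration.

```python
import json
import re
from collections import Counter

# Same pattern as in the query: a column name followed by a null value.
NULL_KEY = re.compile(r'"([a-zA-Z0-9_]+)":null')

def all_null_columns(rows):
    """Return the names of columns that are None in every row."""
    counts = Counter()
    for row in rows:
        # Compact separators give '{"A":null,"B":1}' with no spaces,
        # which is the shape the regex expects.
        text = json.dumps(row, separators=(",", ":"))
        counts.update(NULL_KEY.findall(text))
    # A column qualifies only if it was null once per row.
    return sorted(name for name, n in counts.items() if n == len(rows))

rows = [
    {"A": None, "B": 1, "C": None},
    {"A": None, "B": None, "C": None},
    {"A": None, "B": 2, "C": None},
    {"A": None, "B": 3, "C": None},
]
print(all_null_columns(rows))
```

Column B is null in one row only, so its count (1) never reaches the row count (4) and it is filtered out, just as the HAVING clause does.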

Below is for BigQuery StandardSQL
Simple option:
#standardSQL
WITH `project.dataset.table` AS (
SELECT NULL A, 1 B, NULL C UNION ALL
SELECT NULL, NULL, NULL UNION ALL
SELECT NULL, 2, NULL UNION ALL
SELECT NULL, 3, NULL
)
SELECT COUNT(A) A, COUNT(B) B, COUNT(C) C
FROM `project.dataset.table`
It returns the result below, where 0 (zero) indicates that the respective column contains only NULLs:
A B C
0 3 0
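If you want to try the COUNT(column) behaviour without touching BigQuery, the sketch below runs the same SELECT against an in-memory SQLite table from Python. SQLite is only a stand-in here (an assumption for the demo); the point is that COUNT(column) skips NULLs in standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A INTEGER, B INTEGER, C INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(None, 1, None), (None, None, None), (None, 2, None), (None, 3, None)],
)

# COUNT(col) counts only the non-NULL values of col.
cur = conn.execute("SELECT COUNT(A) AS A, COUNT(B) AS B, COUNT(C) AS C FROM t")
names = [d[0] for d in cur.description]
counts = dict(zip(names, cur.fetchone()))

# Columns with a count of zero contain nothing but NULLs.
all_null = [name for name, n in counts.items() if n == 0]
print(counts, all_null)
conn.close()
```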
If this is "not enough", below is a more "sophisticated" version:
#standardSQL
WITH `project.dataset.table` AS (
SELECT NULL A, 1 B, NULL C UNION ALL
SELECT NULL, NULL, NULL UNION ALL
SELECT NULL, 2, NULL UNION ALL
SELECT NULL, 3, NULL
)
SELECT SPLIT(y, ':')[OFFSET(0)] column
FROM (
SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}"]', '') x
FROM (
SELECT COUNT(A) A, COUNT(B) B, COUNT(C) C
FROM `project.dataset.table`
) t
), UNNEST(SPLIT(x)) y
WHERE CAST(SPLIT(y, ':')[OFFSET(1)] AS INT64) = 0
It returns the result below, listing only the columns that contain only NULLs:
column
A
C
Note: for your real table, just remove the WITH block and replace project.dataset.table with your real table reference.
Also, of course, use your real column names.
My table has around 700 columns.
Below is an example of how you can easily generate the above query for any number of columns.
1. Run the query below
2. Copy the result; this is the generated query
3. Paste the generated query into the new UI and run it
4. Enjoy the result (I hope you will) :o)
Of course, as usual, replace project.dataset.table with your real table reference
#standardSQL
SELECT
CONCAT('''
SELECT SPLIT(y, ':')[OFFSET(0)] column
FROM (
SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}"]', '') x
FROM (
SELECT ''', y,
'''
FROM `project.dataset.table`
) t
), UNNEST(SPLIT(x)) y
WHERE CAST(SPLIT(y, ':')[OFFSET(1)] AS INT64) = 0
'''
)
FROM (
SELECT
STRING_AGG(CONCAT('COUNT(', x, ') ', x), ', ') y
FROM (
SELECT REGEXP_EXTRACT_ALL(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''), r'"([\w_]+)":') x
FROM `project.dataset.table` t
LIMIT 1
), UNNEST(x) x
)
Note: please pay attention to query cost. Both the "generation query" and the final query itself will do a full scan.
You can generate the column list much more cheaply from the table schema in any client of your choice.
To test or play with it, you can use the same dummy data as in the initial queries in my answer.
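As a rough illustration of generating the query on the client side, the Python sketch below builds the COUNT-per-column statement from a list of column names. The helper build_count_query is hypothetical, and it assumes you have already obtained the column names from the table schema by whatever means your client offers; no BigQuery API is called here.

```python
def build_count_query(table, columns):
    """Build a COUNT-per-column query for the given column names."""
    if not columns:
        raise ValueError("at least one column name is required")
    # Backticks quote the identifiers; this sketch assumes the names
    # themselves contain no backticks.
    select_list = ", ".join(f"COUNT(`{c}`) AS `{c}`" for c in columns)
    return f"SELECT {select_list}\nFROM `{table}`"

query = build_count_query("project.dataset.table", ["A", "B", "C"])
print(query)
```

Building the string this way only costs the final scan, since the column names never have to be read out of the table data.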

Related

Snowflake SQL - OBJECT_CONSTRUCT from COUNT and GROUP BY

I'm trying to summarize data in a table:
counting total rows
counting values on specific fields
getting the distinct values on specific fields
and, more importantly, I'm struggling with:
getting the count for each field nested in an object
Given this data:
COL1 COL2
A 0
null 1
B null
B null
the expected result from this query would be:
with dummy as (
select 'A' as col1, 0 as col2
union all
select null, 1
union all
select 'B', null
union all
select 'B', null
)
select
count(1) as total
,count(col1) as col1
,array_agg(distinct col1) as dist_col1
--,object_construct(???) as col1_object_count
,count(col2) as col2
,array_agg(distinct col2) as dist_col2
--,object_construct(???) as col2_object_count
from
dummy
TOTAL | COL1 | DIST_COL1 | COL1_OBJECT_COUNT | COL2 | DIST_COL2 | COL2_OBJECT_COUNT
4 | 3 | ["A", "B"] | {"A": 1, "B": 2, null: 1} | 2 | [0, 1] | {0: 1, 1: 1, null: 2}
I've tried several functions inside OBJECT_CONSTRUCT mixed with ARRAY_AGG, but all failed.
OBJECT_CONSTRUCT can work with several columns, but only when given all of them (*); if you put a select statement inside, it fails.
Another issue is that analytical functions are not easily accepted by the object or array functions in Snowflake.
You could use Snowflake Scripting or Snowpark for this but here's a solution that is somewhat flexible so you can apply it to different tables and column sets.
Create test table/view:
Create or Replace View dummy as (
select 'A' as col1, 0 as col2
union all
select null, 1
union all
select 'B', null
union all
select 'B', null
);
Set session variables for table and colnames.
set tbname = 'DUMMY';
set colnames = '["COL1", "COL2"]';
Create view that generates the required table_column_summary data:
Create or replace View table_column_summary as
with
-- Create table of required column names
cn as (
select VALUE::VARCHAR CNAME
from table(flatten(input => parse_json($colnames)))
)
-- Convert rows into objects
,ro as (
select
object_construct_keep_null(*) row_object
-- using identifier on session variable to dynamically supply table/view name
from identifier($tbname) )
-- Flatten row objects into key/values
,rof as (
select
key col_name,
ifnull(value,'null')::VARCHAR col_value
from ro, lateral flatten(input => row_object), cn
-- You will only need this filter if you need a subset
-- of columns from the source table/query summarised
where col_name = cn.cname)
-- Get the column value distinct value counts
,cdv as (
select col_name,
col_value,
sum(1) col_value_count
from rof
group by 1,2
)
-- and derive required column level stats and combine with cdv
,cv as (
select
(select count(1) from dummy) total,
col_name,
object_construct('COL_COUNT', count(col_value) ,
'COL_DIST', array_agg(distinct col_value),
'COL_OBJECT_COUNT', object_agg(col_value,col_value_count)) col_values
from cdv
group by 1,2)
-- Return result
Select * from cv;
Use this final query if you want a solution that works flexibly with any table/columns provided as input...
Select total, object_agg(col_name, col_values) col_values_obj
From table_column_summary
Group by 1;
Or use this final query if you want the fixed columns output as described in your question...
Select total,
COL1[0]:COL_COUNT COL1,
COL1[0]:COL_DIST DIST_COL1,
COL1[0]:COL_OBJECT_COUNT COL1_OBJECT_COUNT,
COL2[0]:COL_COUNT COL2,
COL2[0]:COL_DIST DIST_COL2,
COL2[0]:COL_OBJECT_COUNT COL2_OBJECT_COUNT
from table_column_summary
PIVOT ( ARRAY_AGG ( col_values )
FOR col_name IN ( 'COL1', 'COL2' ) ) as pt (total, col1, col2);
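For reference, here is a small Python sketch of the summary the question asks for (total rows, non-null count, distinct values and per-value counts for each column). It is not Snowflake code and the function name summarize is made up; it can serve as a yardstick for checking the output of the views above. NULLs are reported under the key "null" and all keys are strings, which is an assumption mirroring the ifnull(value,'null')::VARCHAR step.

```python
from collections import Counter

def summarize(rows, columns):
    """Per-column count, distinct values and value counts (NULL as 'null')."""
    summary = {"total": len(rows)}
    for col in columns:
        values = [row[col] for row in rows]
        non_null = [v for v in values if v is not None]
        summary[col] = {
            "count": len(non_null),
            "distinct": sorted(set(non_null)),
            # Keys are stringified so that None can be shown as 'null'.
            "value_counts": dict(
                Counter("null" if v is None else str(v) for v in values)
            ),
        }
    return summary

rows = [
    {"COL1": "A", "COL2": 0},
    {"COL1": None, "COL2": 1},
    {"COL1": "B", "COL2": None},
    {"COL1": "B", "COL2": None},
]
result = summarize(rows, ["COL1", "COL2"])
print(result)
```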

Eliminating null values in union

I'm doing a query across databases with an identical structure, to show a mapping from a source value to a target value.
Every one of my databases has a table with two columns: source and target
DB1
Source Target
A X
A Y
B NULL
C NULL
DB2
Source Target
A NULL
A Y
B Z
So my query is
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
What I'm getting is
Source Target
A X
A Y
B NULL
C NULL
B Z
A NULL
But I'm only interested in a NULL target if no other mapping is present for that source.
So I'm looking for the following result:
Source Target
A X
A Y
C NULL
B Z
How can I easily eliminate the highlighted rows A | NULL and B | NULL from my results?
I've seen a few answers suggesting using MAX(Target), but that won't work for me since I can have multiple valid mappings for a single source (A | X and A | Y)
Something like this would work: rank each row by whether its Target is NULL, and select the top-ranked rows for each Source:
SELECT TOP(1) WITH TIES UN.Source
, UN.Target
FROM (
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
) AS UN
ORDER BY DENSE_RANK()OVER(PARTITION BY UN.Source ORDER BY CASE WHEN UN.Target IS NOT NULL THEN 1 ELSE 2 END)
You might find it easier to think in terms of minimums:
with data as (
select Source, Target from DB1.<table> union
select Source, Target from DB2.<table>
), qualified as (
select *,
case when Target is not null or min(Target) over (partition by Source) is null
then 1 end as Keep
from data
)
select Source, Target from qualified where Keep = 1;
For completeness, here's the solution I went with, based on the answer of @HoneyBadger, including the suggestion made by @MartinSmith in the comments:
SELECT * FROM
(SELECT UN.Source
, UN.Target
, DENSE_RANK()OVER(PARTITION BY UN.Source ORDER BY CASE WHEN UN.Target IS NOT NULL THEN 1 ELSE 2 END) as ranking
FROM (
Select t.Source, t.Target
from DB1.table t
union
Select t.Source, t.Target
from DB2.table t
) AS UN
) UN2
WHERE UN2.ranking = 1
ORDER BY UN2.Source, UN2.Target
This solution selects only the records that have a DENSE_RANK of 1, avoiding the TOP(1) WITH TIES.
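The rule all these queries implement is: keep a row if its Target is not NULL, or if its Source has no non-NULL Target at all. A minimal Python sketch of that rule, using the sample data from the question (the function name drop_redundant_nulls is made up for illustration):

```python
def drop_redundant_nulls(*tables):
    """Union the tables, then drop (source, None) rows whose source is mapped elsewhere."""
    rows = set().union(*tables)  # like UNION: duplicates are removed
    mapped = {source for source, target in rows if target is not None}
    kept = [
        (source, target)
        for source, target in rows
        if target is not None or source not in mapped
    ]
    # Sort by source, then real targets before None, for stable output.
    return sorted(kept, key=lambda r: (r[0], r[1] is None, r[1] or ""))

db1 = [("A", "X"), ("A", "Y"), ("B", None), ("C", None)]
db2 = [("A", None), ("A", "Y"), ("B", "Z")]
result = drop_redundant_nulls(db1, db2)
print(result)
```

Source C keeps its NULL row because it has no other mapping, while the NULL rows for A and B are dropped.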

How to select columns of data in BigQuery that has all non-NULL values

I found this question on here: How to select columns of data in BigQuery that has all NULL values
but I would like to do the opposite and find all the columns with non-null values. How would I flip this previous solution to accomplish the opposite? I am not that familiar with regexp syntax and I couldn't figure out a solution trying to research this online.
Thank you for your help in advance.
The script from "How to select columns of data in BigQuery that has all NULL values" can be modified as follows:
WITH `project.dataset.table` AS (
SELECT 77 A, 1 B, NULL C UNION ALL
SELECT NULL, 6, NULL UNION ALL
SELECT NULL, 2, NULL UNION ALL
SELECT NULL, 3, NULL
)
SELECT all_column, count(null_column) as count_null, count(1) as total_rows
FROM `project.dataset.table` AS t,
UNNEST(REGEXP_EXTRACT_ALL(
TO_JSON_STRING(t),
r'\"([a-zA-Z0-9\_]+)\":')
) AS all_column
left join UNNEST(REGEXP_EXTRACT_ALL(
TO_JSON_STRING(t),
r'\"([a-zA-Z0-9\_]+)\":null')
) AS null_column
on null_column=all_column
GROUP BY 1
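-- count(null_column) = 0 keeps only the columns that are never NULL
HAVING count(null_column) = 0
TO_JSON_STRING converts each row to a JSON string of "column_name":value pairs.
REGEXP_EXTRACT_ALL( ... , r'\"([a-zA-Z0-9\_]+)\":') extracts every column name from that string.
The second pattern, ending in :null, extracts a column name only if its value is null, so count(null_column) is the number of rows in which that column is NULL. With the sample data only B is returned. Use count(null_column) = count(1) instead to list the all-NULL columns, or count(null_column) < count(1) for columns with at least one non-NULL value.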

Big query find data that could be in multiple columns

I have a table with the following data
id|task1_name|task1_date|task2_name|task2_date
1,breakfast,1/1/20,,
2,null,null,breakfast,1/1/20
3,null,null,lunch,1/1/20
4,dinner,1/1/20,lunch,1/1/10
I'd like to build a view that always displayed the task names in the same column or null if it could not be found in any of the columns e.g.
id|dinner_date|lunch_date|breakfast_date
1,1/1/20, null, null
2,null, null, 1/1/20
4,1/1/20, 1/1/10, null
I've tried using a nested IF statement e.g.
SELECT *,
IF(task1_name = 'dinner', task1_date, IF(task2_name = 'dinner', task2_date, NULL)) AS `dinner_date`
FROM t
But as there are 50 or so columns in the real dataset, this seems like a stupid solution and would get complex very quickly, is there a smarter way here?
One method uses case expressions:
select t.*,
(case when task1_name = 'dinner' then task1_date
when task2_name = 'dinner' then task2_date
when task3_name = 'dinner' then task3_date
end) as dinner_date
from t;
Below is for BigQuery Standard SQL and is generic enough to address the concerns expressed in the question. You don't need to know the number of columns or the task names in advance (although the values should not contain , or : characters, which should not be a big limitation here and can be addressed if needed).
#standardSQL
CREATE TEMP TABLE ttt AS
SELECT id,
SPLIT(k, '_')[OFFSET(0)] task,
MAX(IF(SPLIT(k, '_')[OFFSET(1)] = 'name', v, NULL)) AS name,
MAX(IF(SPLIT(k, '_')[OFFSET(1)] = 'date', v, NULL)) AS DAY
FROM (
SELECT id,
TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') k,
TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') v
FROM `project.dataset.table` t,
UNNEST(SPLIT(TRIM(TO_JSON_STRING(t), '{}'))) kv
WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') != 'id'
AND TRIM(SPLIT(kv, ':')[OFFSET(1)], '"') != 'null'
)
GROUP BY id, task;
EXECUTE IMMEDIATE '''
SELECT id, ''' || (
SELECT STRING_AGG(DISTINCT "MAX(IF(name = '" || name || "', day, NULL)) AS " || name || "_date")
FROM ttt
) || '''
FROM ttt
GROUP BY 1
ORDER BY 1
'''
Note: the only assumption here is that the columns are named task<N>_name and task<N>_date.
If applied to sample data similar to yours in the question
WITH `project.dataset.table` AS (
SELECT 1 id, 'breakfast' task1_name, '1/1/21' task1_date, NULL task2_name, NULL task2_date UNION ALL
SELECT 2, NULL, NULL, 'breakfast', '1/1/22' UNION ALL
SELECT 3, NULL, NULL, 'lunch', '1/1/23' UNION ALL
SELECT 4, 'dinner', '1/1/24', 'lunch', '1/1/10'
)
output is
Row id breakfast_date lunch_date dinner_date
1 1 1/1/21 null null
2 2 1/1/22 null null
3 3 null 1/1/23 null
4 4 null 1/1/10 1/1/24
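What the two statements do is an unpivot (pair every task<N>_name with its task<N>_date) followed by a pivot on the task name. A minimal Python sketch of that reshaping for a single row, relying on the same task<N>_name / task<N>_date naming assumption (the function name pivot_tasks is made up for illustration):

```python
import re

def pivot_tasks(row):
    """Turn task<N>_name / task<N>_date pairs into <name>_date keys."""
    out = {"id": row["id"]}
    for key, name in row.items():
        match = re.fullmatch(r"task(\d+)_name", key)
        if match and name is not None:
            # Look up the date column that shares the same task number.
            out[f"{name}_date"] = row.get(f"task{match.group(1)}_date")
    return out

rows = [
    {"id": 1, "task1_name": "breakfast", "task1_date": "1/1/21",
     "task2_name": None, "task2_date": None},
    {"id": 4, "task1_name": "dinner", "task1_date": "1/1/24",
     "task2_name": "lunch", "task2_date": "1/1/10"},
]
pivoted = [pivot_tasks(row) for row in rows]
print(pivoted)
```

Tasks that do not occur in a row are simply absent from its dictionary; the SQL version reports them as null instead.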
Here is another solution which doesn't use dynamic SQL, doesn't rely on specific column names and works with an arbitrary number of columns:
WITH table AS (
SELECT 1 id, 'breakfast' task1_name, '1/1/21' task1_date, NULL task2_name, NULL task2_date UNION ALL
SELECT 2, NULL, NULL, 'breakfast', '1/1/22' UNION ALL
SELECT 3, NULL, NULL, 'lunch', '1/1/23' UNION ALL
SELECT 4, 'dinner', '1/1/24', 'lunch', '1/1/10'
)
SELECT
REGEXP_EXTRACT(f, r'breakfast\, ([^\,\)]*)'),
REGEXP_EXTRACT(f, r'lunch\, ([^\,\)]*)'),
REGEXP_EXTRACT(f, r'dinner\, ([^\,\)]*)')
FROM (
SELECT FORMAT("%t", t) f FROM table t
)

Group and count by another columns value

I have a table like below:
CREATE TABLE public.test_table
(
"ID" serial PRIMARY KEY NOT NULL,
"CID" integer NOT NULL,
"SEG" integer NOT NULL,
"DDN" character varying(3) NOT NULL
)
and data looks like this:
ID CID SEG DDN
1 1 1 "711"
2 1 2 "800"
3 1 3 "124"
4 2 1 "711"
5 3 1 "711"
6 3 2 "802"
7 4 1 "799"
8 5 1 "799"
9 5 2 "804"
10 6 1 "799"
I need to group this data by the CID column and count the groups per DDN value, where the DDN is taken from the first segment (SEG = 1) of each CID. The counts must distinguish two cases: whether a CID has more than one segment or not.
I'm really sorry if I couldn't explain this clearly. Let me show you what I need:
DDN END TRA
711 1 2
799 2 1
As you can see, DDN 711 has 1 CID with a single record (ID 4). This is the END column.
It also has 2 CIDs with multiple SEG records (IDs 1 to 3 and IDs 5 to 6). This is the TRA column.
I'm not sure which column should go in the GROUP BY clause.
My solution:
Just found a solution like below
WITH x AS (
SELECT
(SELECT t1."DDN" FROM public.test_table AS t1
WHERE t1."CID"=t."CID" AND t1."SEG"=1) AS ddn,
COUNT("CID") AS seg_count
FROM public.test_table AS t
GROUP BY "CID"
)
SELECT ddn, COUNT(seg_count) AS "TOTAL",
SUM(CASE WHEN x.seg_count=1 THEN 1 ELSE 0 END) as "END",
SUM(CASE WHEN x.seg_count>1 THEN 1 ELSE 0 END) as "TRA"
FROM x
GROUP BY ddn;
Equivalent, faster query:
SELECT "DDN"
, COUNT(*) AS "TOTAL"
, COUNT(*) FILTER (WHERE seg_count = 1) AS "END"
, COUNT(*) FILTER (WHERE seg_count > 1) AS "TRA"
FROM (
SELECT DISTINCT ON ("CID")
"DDN" -- assuming min "SEG" is always 1
, COUNT(*) OVER (PARTITION BY "CID") AS seg_count
FROM test_table
ORDER BY "CID", "SEG"
) sub
GROUP BY "DDN";
Notes
CTEs are typically slower and should only be used where needed in Postgres.
This is equivalent to the query in the question assuming that the minimum "SEG" per "CID" is always 1 - since this query returns the row with the minimum "SEG" while your query returns the one with "SEG" = 1. Typically, you would want the "first" segment and my query implements this requirement more reliably, but that's not clear from the question.
COUNT(*) is slightly faster than COUNT(column) and equivalent while not involving NULL values (applicable here). Related:
PostgreSQL: running count of rows for a query 'by minute'
About DISTINCT ON:
Select first row in each GROUP BY group?
The aggregate FILTER syntax requires Postgres 9.4+:
Conditional SQL count
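The grouping logic of the queries above can be traced in plain Python: group the rows by CID, take the DDN of the lowest SEG, and classify each CID by its segment count. This is only a sketch for checking the expected numbers, not Postgres code, and the function name end_tra_counts is made up for illustration.

```python
from collections import defaultdict

def end_tra_counts(rows):
    """rows are (cid, seg, ddn) tuples; returns {ddn: {"END": n, "TRA": m}}."""
    by_cid = defaultdict(list)
    for cid, seg, ddn in rows:
        by_cid[cid].append((seg, ddn))
    result = defaultdict(lambda: {"END": 0, "TRA": 0})
    for segments in by_cid.values():
        # min() on (seg, ddn) tuples picks the row with the lowest SEG.
        first_ddn = min(segments)[1]
        result[first_ddn]["END" if len(segments) == 1 else "TRA"] += 1
    return dict(result)

rows = [
    (1, 1, "711"), (1, 2, "800"), (1, 3, "124"),
    (2, 1, "711"),
    (3, 1, "711"), (3, 2, "802"),
    (4, 1, "799"),
    (5, 1, "799"), (5, 2, "804"),
    (6, 1, "799"),
]
counts = end_tra_counts(rows)
print(counts)
```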
Here is the solution I propose; the query can probably be simplified.
CREATE TABLE test_table
(
ID serial PRIMARY KEY NOT NULL,
CID integer NOT NULL,
SEG integer NOT NULL,
DDN character varying(3) NOT NULL
);
insert into test_table(CID,SEG,DDN)
values
( 1, 1, '711'),
( 1, 2, '800'),
( 1, 3, '124'),
( 2, 1, '711'),
( 3, 1, '711'),
( 3, 2, '802'),
( 4, 1, '799'),
( 5, 1, '799'),
( 5, 2, '804'),
( 6, 1, '799');
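with summary as (
with ddn_t as (
select cid, ddn,
-- order by seg so that row_number = 1 is always the first segment of each cid
row_number() over (partition by cid order by seg)
from test_table
)
-- count holds the number of distinct ddn values per cid, b.ddn the first segment's ddn
select a.cid, count(distinct a.ddn), b.ddn
from ddn_t a
join ddn_t b on b.cid = a.cid and b.row_number = 1
group by a.cid, b.ddn
)
select ddn,
sum(case when count > 1 then 1 else 0 end) as TRA,
sum(case when count = 1 then 1 else 0 end) as END
from summary
group by ddn;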