Count what percentage of values in each column are NULLs - SQL

Is there a way, through the information_schema or otherwise, to calculate what percentage of the values in each column of a table (or, better yet, a set of tables) are NULLs?

Your query has a number of problems; most importantly, you are not escaping identifiers (which could lead to exceptions at best or SQL injection attacks at worst), and you are not taking the schema into account.
Use instead:
SELECT 'SELECT ' || string_agg(concat('round(100 - 100 * count(', col
                 , ') / count(*)::numeric, 2) AS ', col_pct), E'\n     , ')
    || E'\nFROM ' || tbl
FROM  (
   SELECT quote_ident(table_schema) || '.' || quote_ident(table_name) AS tbl
        , quote_ident(column_name) AS col
        , quote_ident(column_name || '_pct') AS col_pct
   FROM   information_schema.columns
   WHERE  table_name = 'my_table_name'
   ORDER  BY ordinal_position
   ) sub
GROUP  BY tbl;
Produces a query like:
SELECT round(100 - 100 * count(id) / count(*)::numeric, 2) AS id_pct
     , round(100 - 100 * count(day) / count(*)::numeric, 2) AS day_pct
     , round(100 - 100 * count("oDd X") / count(*)::numeric, 2) AS "oDd X_pct"
FROM   public.my_table_name;
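If you run the generator in psql, you don't have to copy the produced statement by hand: ending the query with the psql meta-command \gexec (available since psql 9.6) executes every row of the result as a new statement. A minimal sketch:

-- psql only: \gexec takes the place of the trailing semicolon
-- and runs each generated statement immediately
SELECT 'SELECT count(*) FROM public.my_table_name' \gexec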
Closely related answer on dba.SE with a lot more details:
Check whether empty strings are present in character-type columns

In PostgreSQL, you can easily compute it from the statistics tables, provided your autovacuum setting is on (check it with SHOW ALL;). You can also tune the vacuum/analyze intervals to configure how quickly the statistics tables are updated. Bear in mind that these statistics are estimates gathered by sampling, not exact counts. You can then read the NULL percentage (a.k.a. the null fraction) simply with the query below:
select attname, null_frac from pg_stats where tablename = 'table_name'
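Since null_frac is a fraction between 0 and 1, you can scale it to a percentage to match the question. A minimal sketch, assuming the table lives in the public schema:

select attname, round(null_frac::numeric * 100, 2) as null_pct
from pg_stats
where schemaname = 'public' and tablename = 'table_name';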

I don't think there is a built-in feature for this, but you can do it yourself: walk through each column in the table and compute count(*) over all rows alongside the number of rows where the column is NULL. It is also possible to optimize this into a single query per table.
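A minimal sketch of that single-query form, assuming a table my_table_name with columns id and value (count(col) skips NULLs, so count(*) - count(col) is the NULL count):

SELECT 100.0 * (count(*) - count(id))    / count(*) AS id_null_pct
     , 100.0 * (count(*) - count(value)) / count(*) AS value_null_pct
FROM my_table_name;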

OK, I played around a little and made a query that returns a query (or several queries, if you use LIKE 'my_table%' instead of = 'my_table_name'):
SELECT 'select '
    || string_agg('(count(*)::real - count(' || column_name || ')::real) / count(*)::real as ' || column_name || '_percentage', ', ')
    || ' from ' || table_name
FROM information_schema.columns
WHERE table_name LIKE 'my_table_name'
GROUP BY table_name;
It returns a ready-to-run SQL query, like:
"SELECT (count(*)::real-count(id)::real)/count(*)::real AS id_percentage , (count(*)::real-count(value)::real)/count(*)::real AS value_percentage FROM my_table_name"
id_percentage;value_percentage
0;0.0177515
(I have adjusted the capitalization of the generated query for readability; it actually comes back in lowercase.)

Related

How do I select columns based on a string pattern in BigQuery

I have a table in BigQuery with hundreds of columns, and it just happens that I want to select all of them except for those that begin with an underscore. I know how to select the columns beginning with an underscore using the INFORMATION_SCHEMA.COLUMNS table, but I can't figure out how to use that query to select the columns I want. I know BigQuery has EXCEPT, but I want to avoid writing out each column that begins with an underscore, and I can't seem to pass it a subquery or even something like a._*.
Consider the approach below:
execute immediate (select '''
select * except(''' || string_agg(col) || ''') from your_table
'''
from (
  select col
  from (select * from your_table limit 1) t,
  unnest([struct(translate(to_json_string(t), '{}"', '') as kvs)]),
  unnest(split(kvs)) kv,
  unnest([struct(split(kv, ':')[offset(0)] as col)])
  where starts_with(col, '_')
));
Applied to a table that contains, among others, columns named _c and _e, it generates the statement below:
select * except(_c,_e) from your_table
which returns all columns except _c and _e.
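An alternative sketch builds the EXCEPT list from INFORMATION_SCHEMA.COLUMNS instead of parsing a JSON-encoded row; the dataset name your_dataset here is a placeholder assumption:

execute immediate (
  -- aggregate the underscore-prefixed column names into the EXCEPT list
  select 'select * except(' || string_agg(column_name) || ') from your_dataset.your_table'
  from your_dataset.INFORMATION_SCHEMA.COLUMNS
  where table_name = 'your_table'
  and starts_with(column_name, '_')
);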

Query to count all rows in all Snowflake views

I'm trying to get a count of all the rows in a set of views in my Snowflake database.
The built-in row_count from information_schema.tables is not present in information_schema.views, unfortunately.
It seems I'd need to count all rows in each view, something like:
with view_name as (
  select table_name
  from account_usage.views
  where table_schema = 'ACCESS' and RIGHT(table_name, 7) = 'CURRENT'
)
select count(*) from view_name;
But that returns only one result, instead of one per view.
If I change the select to include the view name, i.e.
select concat('Rows in ', view_name), count (*) from view_name;
…it returns the error "invalid identifier 'VIEW_NAME' (line 5)"
How can I show all results and include the view name?
You can write a query against the information_schema that generates a second query, which goes view by view getting its count:
select listagg(xx, ' union all ')
from (
  select 'select count(*) c, \'' || x || '\' v from ' || x as xx
  from (
    select TABLE_CATALOG || '.' || TABLE_SCHEMA || '."' || TABLE_NAME || '"' x
    from KNOEMA_FORECAST_DATA_ATLAS.INFORMATION_SCHEMA.VIEWS
    where table_schema = 'FORECAST'
  )
);
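For illustration, with two hypothetical views V1 and V2 in that schema, the generated statement would look like:

select count(*) c, 'KNOEMA_FORECAST_DATA_ATLAS.FORECAST."V1"' v from KNOEMA_FORECAST_DATA_ATLAS.FORECAST."V1"
union all select count(*) c, 'KNOEMA_FORECAST_DATA_ATLAS.FORECAST."V2"' v from KNOEMA_FORECAST_DATA_ATLAS.FORECAST."V2"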
See also How to find the number of rows for all views in a schema?

In BigQuery, identify when columns do not match on UNION ALL

with
table1 as (
  select 'joe' as name, 17 as age, 25 as speed
),
table2 as (
  select 'nick' as name, 21 as speed, 23 as strength
)
select * from table1
union all
select * from table2
In Google BigQuery, this UNION ALL does not throw an error because both tables have the same number of columns (3 each). However, I receive bad data output because the columns do not match: UNION ALL matches columns by position, so rather than outputting a new table with 4 columns name, age, speed, strength with correct values plus NULLs for missing values (which would probably be preferred), it keeps the 3 column names from the first SELECT.
Is there a good way to catch that the columns do not match, rather than the query silently returning bad data? Is there any way for this to return an error perhaps, as opposed to a successful table? I'm not sure how to check in SQL that the columns in the 2 tables match.
Edit: in this example it is easy to see that the columns do not match; however, our real data has 100+ columns, and we want to avoid a situation where we make an error in a UNION ALL.
The solution below is for BigQuery Standard SQL and uses BigQuery's scripting feature:
DECLARE statement STRING;
SET statement = (
  WITH table1_columns AS (
    SELECT column FROM (SELECT * FROM `project.dataset.table1` LIMIT 1) t,
    UNNEST(REGEXP_EXTRACT_ALL(TRIM(TO_JSON_STRING(t), '{}'), r'"([^"]*)":')) column
  ), table2_columns AS (
    SELECT column FROM (SELECT * FROM `project.dataset.table2` LIMIT 1) t,
    UNNEST(REGEXP_EXTRACT_ALL(TRIM(TO_JSON_STRING(t), '{}'), r'"([^"]*)":')) column
  ), all_columns AS (
    SELECT column FROM table1_columns UNION DISTINCT SELECT column FROM table2_columns
  )
  SELECT (
    SELECT 'SELECT ' || STRING_AGG(IF(t.column IS NULL, 'NULL as ', '') || a.column, ', ') || ' FROM `project.dataset.table1` UNION ALL '
    FROM all_columns a LEFT JOIN table1_columns t USING(column)
  ) || (
    SELECT 'SELECT ' || STRING_AGG(IF(t.column IS NULL, 'NULL as ', '') || a.column, ', ') || ' FROM `project.dataset.table2`'
    FROM all_columns a LEFT JOIN table2_columns t USING(column)
  )
);
EXECUTE IMMEDIATE statement;
When applied to the sample data from your question, the output is:

Row  name  age   speed  strength
1    joe   17    25     null
2    nick  null  21     23
After saving table1 and table2 as two tables in a dataset in BigQuery, I then queried the metadata via INFORMATION_SCHEMA to check whether the columns matched:
SELECT *
FROM models.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'table1';

SELECT *
FROM models.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'table2';
INFORMATION_SCHEMA.COLUMNS returns information including the column names and their positions. I can then join these two result sets to check that the names match...
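A sketch of such a check, assuming the dataset is called models as above; it returns a row only where a column name exists in one table but not the other:

SELECT
  COALESCE(t1.column_name, t2.column_name) AS column_name,
  t1.ordinal_position AS pos_in_table1,
  t2.ordinal_position AS pos_in_table2
FROM (
  SELECT column_name, ordinal_position
  FROM models.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'table1'
) t1
FULL OUTER JOIN (
  SELECT column_name, ordinal_position
  FROM models.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'table2'
) t2
ON t1.column_name = t2.column_name
WHERE t1.column_name IS NULL OR t2.column_name IS NULL;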

Run the same SQL query for multiple schemas

I am hoping to write one SQL query that goes through all 20+ schemas without the need to constantly change the search_path. I've tried UNION ALL, but in most situations writing out the separate queries would take all the time I saved by not hard-coding the schemas. The query itself can be very basic, such as:
SELECT * FROM schm1.table1
UNION ALL
SELECT * FROM schm2.table1
Thank you for your assistance!
"The impossible will be completed as you wait; please allow two days for the delivery of miracles".
I'm afraid what you want to achieve can only be done by SQL generating SQL:
SELECT
  CASE ROW_NUMBER() OVER(ORDER BY table_schema)
    WHEN 1 THEN ''
    ELSE 'UNION ALL '
  END
  || 'SELECT * FROM '
  || table_schema
  || '.'
  || table_name
  || CASE ROW_NUMBER() OVER(ORDER BY table_schema DESC)
       WHEN 1 THEN ';'
       ELSE CHR(10)
     END
FROM tables
WHERE table_name = 'd_teas_scd'
ORDER BY table_schema;
What I get with d_teas_scd as the table name is this:
SELECT * FROM flatt.d_teas_scd
UNION ALL SELECT * FROM public.d_teas_scd
UNION ALL SELECT * FROM star.d_teas_scd;
It can't guarantee that all tables with the same name have the same structure, though, so the resulting query could fail; checking that is your responsibility...
Happy playing
Marco the Sane

PostgreSQL convert columns to rows? Transpose?

I have a PostgreSQL function (or table) which gives me the following output:
Sl.no  username  Designation  salary  etc..
1      A         XYZ          10000   ...
2      B         RTS          50000   ...
3      C         QWE          20000   ...
4      D         HGD          34343   ...
Now I want the Output as below:
Sl.no        1      2      3      4      ...
Username     A      B      C      D      ...
Designation  XYZ    RTS    QWE    HGD    ...
Salary       10000  50000  20000  34343  ...
How to do this?
SELECT
  unnest(array['Sl.no', 'username', 'Designation', 'salary']) AS "Columns",
  unnest(array[sl_no::text, username, designation, salary::text]) AS "Values"
FROM view_name
ORDER BY "Columns";
Reference: convertingColumnsToRows
Basing my answer on a table of the form:
CREATE TABLE tbl (
   sl_no int
 , username text
 , designation text
 , salary int
);
Each row results in a new column to return. With a dynamic return type like this, it's hardly possible to make this completely dynamic with a single call to the database. Demonstrating solutions with two steps:
Generate query
Execute generated query
Generally, this is limited by the maximum number of columns a table can hold. So it is not an option for tables with more than 1600 rows (and the practical limit can be lower). Details:
What is the maximum number of columns in a PostgreSQL select query
Postgres 9.4+
Dynamic solution with crosstab()
Use the first one you can. Beats the rest.
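Note that crosstab() is not built in: it is supplied by the additional module tablefunc, which has to be installed once per database:

CREATE EXTENSION IF NOT EXISTS tablefunc;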
SELECT 'SELECT *
FROM   crosstab(
       $ct$SELECT u.attnum, t.rn, u.val
       FROM  (SELECT row_number() OVER () AS rn, * FROM '
    || attrelid::regclass || ') t
            , unnest(ARRAY[' || string_agg(quote_ident(attname)
                             || '::text', ',') || '])
              WITH ORDINALITY u(val, attnum)
       ORDER  BY 1, 2$ct$
   ) t (attnum bigint, '
    || (SELECT string_agg('r' || rn || ' text', ', ')
        FROM (SELECT row_number() OVER () AS rn FROM tbl) t)
    || ')' AS sql
FROM   pg_attribute
WHERE  attrelid = 'tbl'::regclass
AND    attnum > 0
AND    NOT attisdropped
GROUP  BY attrelid;
Operating with attnum instead of actual column names. Simpler and faster. Join the result to pg_attribute once more or integrate column names like in the pg 9.3 example.
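A minimal sketch of that extra lookup, mapping attnum back to column names for tbl:

SELECT attnum, attname
FROM   pg_attribute
WHERE  attrelid = 'tbl'::regclass
AND    attnum > 0           -- skip system columns
AND    NOT attisdropped     -- skip dropped columns
ORDER  BY attnum;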
Generates a query of the form:
SELECT *
FROM   crosstab(
       $ct$SELECT u.attnum, t.rn, u.val
       FROM  (SELECT row_number() OVER () AS rn, * FROM tbl) t
            , unnest(ARRAY[sl_no::text,username::text,designation::text,salary::text])
              WITH ORDINALITY u(val, attnum)
       ORDER  BY 1, 2$ct$
   ) t (attnum bigint, r1 text, r2 text, r3 text, r4 text);
This uses a whole range of advanced features. Just too much to explain.
Simple solution with unnest()
One unnest() can now take multiple arrays to unnest in parallel.
SELECT 'SELECT * FROM unnest(
  ''{sl_no, username, designation, salary}''::text[]
, ' || string_agg(quote_literal(ARRAY[sl_no::text, username::text, designation::text, salary::text])
                  || '::text[]', E'\n, ')
    || E') \n AS t(col,' || string_agg('row' || sl_no, ',') || ')' AS sql
FROM tbl;
Result:
SELECT * FROM unnest(
  '{sl_no, username, designation, salary}'::text[]
, '{10,Joe,Music,1234}'::text[]
, '{11,Bob,Movie,2345}'::text[]
, '{12,Dave,Theatre,2356}'::text[])
AS t(col, row1, row2, row3);
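As a minimal illustration of the multi-array form of unnest() used above (Postgres 9.4+), detached from the sample table:

-- Each array becomes one output column; elements are aligned by position.
SELECT * FROM unnest('{a,b}'::text[], '{1,2}'::int[]) AS t(col1, col2);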
db<>fiddle here
Old sqlfiddle
Postgres 9.3 or older
Dynamic solution with crosstab()
Completely dynamic, works for any table. Provide the table name in two places:
SELECT 'SELECT *
FROM   crosstab(
       ''SELECT unnest(''' || quote_literal(array_agg(attname))
                           || '''::text[]) AS col
             , row_number() OVER ()
             , unnest(ARRAY[' || string_agg(quote_ident(attname)
                              || '::text', ',') || ']) AS val
       FROM   ' || attrelid::regclass || '
       ORDER  BY generate_series(1,' || count(*) || '), 2''
   ) t (col text, '
    || (SELECT string_agg('r' || rn || ' text', ',')
        FROM (SELECT row_number() OVER () AS rn FROM tbl) t)
    || ')' AS sql
FROM   pg_attribute
WHERE  attrelid = 'tbl'::regclass
AND    attnum > 0
AND    NOT attisdropped
GROUP  BY attrelid;
Could be wrapped into a function with a single parameter ...
Generates a query of the form:
SELECT *
FROM   crosstab(
       'SELECT unnest(''{sl_no,username,designation,salary}''::text[]) AS col
             , row_number() OVER ()
             , unnest(ARRAY[sl_no::text,username::text,designation::text,salary::text]) AS val
        FROM   tbl
        ORDER  BY generate_series(1,4), 2'
   ) t (col text, r1 text, r2 text, r3 text, r4 text);
Produces the desired result:
col          r1     r2     r3     r4
-------------------------------------------
sl_no        1      2      3      4
username     A      B      C      D
designation  XYZ    RTS    QWE    HGD
salary       10000  50000  20000  34343
Simple solution with unnest()
SELECT 'SELECT unnest(''{sl_no, username, designation, salary}''::text[]) AS col
     , ' || string_agg('unnest('
            || quote_literal(ARRAY[sl_no::text, username::text, designation::text, salary::text])
            || '::text[]) AS row' || sl_no, E'\n     , ') AS sql
FROM tbl;
Slow for tables with more than a couple of rows, since each source row adds another unnest() call to the generated query.
Generates a query of the form:
SELECT unnest('{sl_no, username, designation, salary}'::text[]) AS col
     , unnest('{10,Joe,Music,1234}'::text[]) AS row1
     , unnest('{11,Bob,Movie,2345}'::text[]) AS row2
     , unnest('{12,Dave,Theatre,2356}'::text[]) AS row3
     , unnest('{4,D,HGD,34343}'::text[]) AS row4;
Same result.
If (like me) you need this information from a bash script, note there is a simple command-line switch for psql that tells it to output table columns as rows:
psql mydbname -x -A -F= -c "SELECT * FROM foo WHERE id=123"
The -x option is the key to getting psql to output columns as rows.
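For a hypothetical row with columns id and name, the output looks roughly like this (-A switches to unaligned output and -F= makes = the field separator):

id=123
name=foo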
I have a simpler approach than the one Erwin posted above, which worked for me with Postgres (and I think it should work with all major relational databases that support the SQL standard).
You can simply use UNION instead of crosstab:
SELECT text 'a' AS "text" UNION SELECT 'b';
text
------
a
b
(2 rows)
Of course, it depends on the case in which you are going to apply this. Given that you know beforehand which fields you need, you can take this approach even to query different tables. E.g.:
SELECT 'My first metric' as name, count(*) as total from first_table
UNION
SELECT 'My second metric' as name, count(*) as total from second_table;

       name       | total
------------------+-------
 My first metric  |    10
 My second metric |    20
(2 rows)
It's a more maintainable approach, IMHO. Look at this page for more information: https://www.postgresql.org/docs/current/typeconv-union-case.html
There is no proper way to do this in plain SQL or PL/pgSQL.
It is much better to do this in the application that gets the data from the DB.