I have the SQL code below, which returns an error stating that the AvgPYs column is not valid. This also affects my Auth column. Is there an issue with my formula? Any help is appreciated.
SELECT [tblCeiling].[Proj Code], [tblCeiling].[Act Code], [tblCeiling].[Cost Ctr], [tblCeiling].[Date], [tblCeiling].[Ref2], [tblCeiling].[Analyst], [tblCeiling].[Type], [tblCeiling].[B or O], [tblCeiling].[Jul], [tblCeiling].[Aug],
[tblCeiling].[Sep], [tblCeiling].[Oct], [tblCeiling].[Nov], [tblCeiling].[Dec], [tblCeiling].[Jan], [tblCeiling].[Feb], [tblCeiling].[Mar], [tblCeiling].[Apr], [tblCeiling].[May], [tblCeiling].[Jun], [tblCeiling].[Perm], [tblCeiling].[Temp], [tblCeiling].[LimitedTerm],
[tblCeiling].[LTDate], [tblCeiling].[Sal_Rate], [tblCeiling].[New], [Perm] + [Temp] + [LimitedTerm] AS Monthly, Format(([tblCeiling].[Jul] + [tblCeiling].[Aug] + [tblCeiling].[Sep] + [tblCeiling].[Oct] + [tblCeiling].[Nov] + [tblCeiling].[Dec] + [tblCeiling].[Jan] +
[tblCeiling].[Feb] + [tblCeiling].[Mar] + [tblCeiling].[Apr] + [tblCeiling].[May] + [tblCeiling].[Jun]) / 12, '0.0####') AS AvgPYs,
((([Sal_Rate] * [AvgPYs]) * 1000) / 1000) AS Auth, [Dollar Adj] + [Auth] AS Budget, [tblCeiling].[Import], [tblCeiling].[Dollar Adj], [tblCeiling].[OngoingOrOneTime], [tblCeiling].[OneTimeEndingDate]
FROM (SELECT DISTINCT *
FROM [tblactcode]) AS [tblactcode] RIGHT JOIN
(SELECT DISTINCT *
FROM [tblCeiling]) AS [tblCeiling] ON [tblactcode].[Act Code] = [tblCeiling].[Act Code]
WHERE ((([tblCeiling].[Jul]) = iif([jul] IS NULL, 0, [jul])) AND (([tblCeiling].[Aug]) = iif([aug] IS NULL, 0, [aug])) AND (([tblCeiling].[Sep]) = iif([sep] IS NULL, 0, [sep])) AND (([tblCeiling].[Oct]) = iif([oct] IS NULL, 0, [oct])) AND (([tblCeiling].[Nov]) = iif([nov] IS NULL,
0, [nov])) AND (([tblCeiling].[Dec]) = iif([dec] IS NULL, 0, [dec])) AND (([tblCeiling].[Jan]) = iif([jan] IS NULL, 0, [jan])) AND (([tblCeiling].[Feb]) = iif([feb] IS NULL, 0, [feb])) AND (([tblCeiling].[Mar]) = iif([mar] IS NULL, 0, [mar])) AND (([tblCeiling].[Apr])
= iif([apr] IS NULL, 0, [apr])) AND (([tblCeiling].[May]) = iif([may] IS NULL, 0, [may])) AND (([tblCeiling].[Jun]) = iif([jun] IS NULL, 0, [jun])) AND (([tblCeiling].[Import]) = 0))
ORDER BY [tblCeiling].[Proj Code], [tblCeiling].[Cost Ctr], [tblCeiling].[Date]
Here is the specific line I'm referring to:
Format(([tblCeiling].[Jul] + [tblCeiling].[Aug] + [tblCeiling].[Sep] + [tblCeiling].[Oct] + [tblCeiling].[Nov] + [tblCeiling].[Dec] + [tblCeiling].[Jan] +
[tblCeiling].[Feb] + [tblCeiling].[Mar] + [tblCeiling].[Apr] + [tblCeiling].[May] + [tblCeiling].[Jun]) / 12, '0.0####') AS AvgPYs,
The specific issue you have run into is referencing a calculated column (the AvgPYs alias) in the same SELECT list that defines it, which isn't possible: a column alias is not in scope for the other expressions at the same level. You can only reference a calculated value from an outer query.
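For example, here is a minimal sketch of the outer-query approach, using a small throwaway temp table (not part of your schema) purely to show the scoping rule:

-- Hypothetical demo table, only to illustrate alias scope
CREATE TABLE #demo (Sal_Rate DECIMAL(10, 2), Jul INT, Aug INT);
INSERT INTO #demo VALUES (50.0, 12, 10);

-- Fails: AvgPYs is not visible to the other expressions in the same SELECT list
-- SELECT (Jul + Aug) / 2.0 AS AvgPYs, Sal_Rate * AvgPYs AS Auth FROM #demo;

-- Works: define AvgPYs in a derived table, then reuse it in the outer query
SELECT D.AvgPYs,
       D.Sal_Rate * D.AvgPYs AS Auth
FROM (
    SELECT Sal_Rate, (Jul + Aug) / 2.0 AS AvgPYs
    FROM #demo
) AS D;

DROP TABLE #demo;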
Alternatively, a neat solution is CROSS APPLY, which lets you compute a value once and reuse it. In general the pattern looks like this:
select -- existing columns before AvgPYs
, AvgPYs
-- , some formula which depends on AvgPYs
from (
-- existing query
) C -- C is an acceptable short alias for Ceiling
cross apply (
values (formula)
) X (AvgPYs)
In your case I think the following is correct:
SELECT C.[Proj Code], C.[Act Code], C.[Cost Ctr], C.[Date], C.[Ref2], C.[Analyst], C.[Type], C.[B or O], C.[Jul], C.[Aug]
, C.[Sep], C.[Oct], C.[Nov], C.[Dec], C.[Jan], C.[Feb], C.[Mar], C.[Apr], C.[May], C.[Jun], C.[Perm], C.[Temp], C.[LimitedTerm]
, C.[LTDate], C.[Sal_Rate], C.[New], [Perm] + [Temp] + [LimitedTerm] AS Monthly
, X.AvgPYs
, Y.Auth
, [Dollar Adj] + Y.Auth AS Budget, C.[Import], C.[Dollar Adj], C.[OngoingOrOneTime], C.[OneTimeEndingDate]
FROM (
SELECT DISTINCT *
FROM [tblactcode]
) AS AC
RIGHT JOIN (
SELECT DISTINCT *
FROM [tblCeiling]
) AS C ON AC.[Act Code] = C.[Act Code]
CROSS APPLY (
VALUES (Format((C.[Jul] + C.[Aug] + C.[Sep] + C.[Oct] + C.[Nov] + C.[Dec] + C.[Jan] +
C.[Feb] + C.[Mar] + C.[Apr] + C.[May] + C.[Jun]) / 12, '0.0####'))
) AS X (AvgPYs)
CROSS APPLY (
VALUES (((([Sal_Rate] * X.AvgPYs) * 1000) / 1000))
) Y (Auth)
WHERE (((C.[Jul]) = iif([jul] IS NULL, 0, [jul])) AND ((C.[Aug]) = iif([aug] IS NULL, 0, [aug])) AND ((C.[Sep]) = iif([sep] IS NULL, 0, [sep])) AND ((C.[Oct]) = iif([oct] IS NULL, 0, [oct])) AND ((C.[Nov]) = iif([nov] IS NULL,
0, [nov])) AND ((C.[Dec]) = iif([dec] IS NULL, 0, [dec])) AND ((C.[Jan]) = iif([jan] IS NULL, 0, [jan])) AND ((C.[Feb]) = iif([feb] IS NULL, 0, [feb])) AND ((C.[Mar]) = iif([mar] IS NULL, 0, [mar])) AND ((C.[Apr])
= iif([apr] IS NULL, 0, [apr])) AND ((C.[May]) = iif([may] IS NULL, 0, [may])) AND ((C.[Jun]) = iif([jun] IS NULL, 0, [jun])) AND ((C.[Import]) = 0)
)
ORDER BY C.[Proj Code], C.[Cost Ctr], C.[Date];
Note: A key purpose of a table alias is to have a short reference to the table. See how much easier it is to read with shorter aliases.
I am trying to create a data quality dashboard, showing every table in my Snowflake database, the row count, the distinct row count, and the number of duplicates. The table I want should look like this:
table_name | row_count | distinct_row_count | duplicates
-----------|-----------|--------------------|-----------
table_a    |     1,372 |              1,370 |          2
table_b    |     4,735 |              4,735 |          0
I've been able to get the table name and row count using information_schema.tables. I'm trying to figure out how to get distinct counts for all of these tables. The primary key column for every table is different. On some tables it will be a user_id, on others a session_id, etc.
I've looked through the Snowflake documentation for built-in functions that could help, and I've explored the information/usage schemas. I'm not sure whether a stored procedure would help here (I haven't used many of those).
In Python or another language, I'd loop through every table and calculate what I need. Is there a way to do this in SQL?
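For a single table, I can get the numbers I want with something like the query below (user_id standing in for whatever that table's key column is); I just don't know how to do this across every table at once:

-- Per-table version of the dashboard row; user_id is this table's key column
SELECT 'table_a'                          AS table_name,
       COUNT(*)                           AS row_count,
       COUNT(DISTINCT user_id)            AS distinct_row_count,
       COUNT(*) - COUNT(DISTINCT user_id) AS duplicates
FROM table_a;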
create or replace TABLE DEMO_DB.PUBLIC.SNOWBALL (
TABLE_NAME VARCHAR(314),
TOTAL_ROWS NUMBER(18,0),
TABLE_LAST_ALTERED TIMESTAMP_LTZ(9),
TABLE_CREATED TIMESTAMP_LTZ(9),
TABLE_BYTES NUMBER(18,0),
COL_NAME ARRAY,
COL_DATA_TYPE ARRAY,
COL_HLL ARRAY,
COL_NULL_CNT ARRAY,
COL_MIN ARRAY,
COL_MAX ARRAY,
COL_TOP ARRAY,
COL_AVG ARRAY,
COL_MODE ARRAY,
COL_STDDEV ARRAY,
COL_VAR_POP ARRAY,
COL_AVG_LENGTH ARRAY,
STATS_RUN_DATE_TIME TIMESTAMP_LTZ(9)
);
create or replace view SNOWBALL_COLUMNS as
select
concat_ws('.', table_catalog, table_schema, table_name) as full_table_name,
*
from (
select * from demo_db.information_schema.columns
union
select * from snowflake_sample_data.information_schema.columns
union
select * from util_db.information_schema.columns
);
create or replace view SNOWBALL_TABLES as
select
concat_ws('.', table_catalog, table_schema, table_name) as full_table_name,
*
from (
select * from demo_db.information_schema.tables
union
select * from snowflake_sample_data.information_schema.tables
union
select * from util_db.information_schema.tables
);
CREATE OR REPLACE PROCEDURE DEMO_DB.PUBLIC.SNOWBALL(
db_name STRING,
schema_name STRING,
snowball_table STRING,
max_age_days FLOAT,
limit FLOAT
)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
COMMENT = 'Collects table and column stats.'
EXECUTE AS OWNER
AS
$$
var validLimit = Math.max(LIMIT, 0); // prevent SQL syntax error caused by negative numbers
var sqlGenerateInserts = `
WITH snowball_tables AS (
SELECT CONCAT_WS('.', table_catalog, table_schema, table_name) AS full_table_name, *
FROM IDENTIFIER(?) -- <<DB_NAME>>.INFORMATION_SCHEMA.TABLES
),
snowball_columns AS (
SELECT CONCAT_WS('.', table_catalog, table_schema, table_name) AS full_table_name, *
FROM IDENTIFIER(?) -- <<DB_NAME>>.INFORMATION_SCHEMA.COLUMNS
),
snowball AS (
SELECT table_name, MAX(stats_run_date_time) AS stats_run_date_time
FROM IDENTIFIER(?) -- <<SNOWBALL_TABLE>> table
GROUP BY table_name
)
SELECT full_table_name, aprox_row_count,
CONCAT (
'INSERT INTO IDENTIFIER(''', ?, ''') ', -- SNOWBALL table
'(table_name,total_rows,table_last_altered,table_created,table_bytes,col_name,',
'col_data_type,col_hll,col_avg_length,col_null_cnt,col_min,col_max,col_top,col_mode,col_avg,stats_run_date_time)',
'SELECT ''', full_table_name, ''' AS table_name, ',
table_stats_sql,
', ARRAY_CONSTRUCT( ', col_name, ') AS col_name',
', ARRAY_CONSTRUCT( ', col_data_type, ') AS col_data_type',
', ARRAY_CONSTRUCT( ', col_hll, ') AS col_hll',
', ARRAY_CONSTRUCT( ', col_avg_length, ') AS col_avg_length',
', ARRAY_CONSTRUCT( ', col_null_cnt, ') AS col_null_cnt',
', ARRAY_CONSTRUCT( ', col_min, ') AS col_min',
', ARRAY_CONSTRUCT( ', col_max, ') AS col_max',
', ARRAY_CONSTRUCT( ', col_top, ') AS col_top',
', ARRAY_CONSTRUCT( ', col_MODE, ') AS col_MODE',
', ARRAY_CONSTRUCT( ', col_AVG, ') AS col_AVG',
', CURRENT_TIMESTAMP() AS stats_run_date_time ',
' FROM ', quoted_table_name
) AS insert_sql
FROM (
SELECT
tbl.full_table_name,
tbl.row_count AS aprox_row_count,
CONCAT ( '"', col.table_catalog, '"."', col.table_schema, '"."', col.table_name, '"' ) AS quoted_table_name,
CONCAT (
'COUNT(1) AS total_rows,''',
IFNULL( tbl.last_altered::VARCHAR, 'NULL'), ''' AS table_last_altered,''',
IFNULL( tbl.created::VARCHAR, 'NULL'), ''' AS table_created,',
IFNULL( tbl.bytes::VARCHAR, 'NULL'), ' AS table_bytes' ) AS table_stats_sql,
LISTAGG (
CONCAT ('''', col.full_table_name, '.', col.column_name, '''' ), ', '
) AS col_name,
LISTAGG ( CONCAT('''', col.data_type, '''' ), ', ' ) AS col_data_type,
LISTAGG ( CONCAT( ' HLL(', '"', col.column_name, '"',') ' ), ', ' ) AS col_hll,
LISTAGG ( CONCAT( ' AVG(ZEROIFNULL(LENGTH(', '"', col.column_name, '"','))) ' ), ', ' ) AS col_avg_length,
LISTAGG ( CONCAT( ' SUM( IFF( ', '"', col.column_name, '"',' IS NULL, 1, 0) ) ' ), ', ') AS col_null_cnt,
LISTAGG ( IFF ( col.data_type = 'NUMBER', CONCAT ( ' MODE(', '"', col.column_name, '"', ') ' ), 'NULL' ), ', ' ) AS col_MODE,
LISTAGG ( IFF ( col.data_type = 'NUMBER', CONCAT ( ' MIN(', '"', col.column_name, '"', ') ' ), 'NULL' ), ', ' ) AS col_min,
LISTAGG ( IFF ( col.data_type = 'NUMBER', CONCAT ( ' MAX(', '"', col.column_name, '"', ') ' ), 'NULL' ), ', ' ) AS col_max,
LISTAGG ( IFF ( col.data_type = 'NUMBER', CONCAT ( ' AVG(', '"', col.column_name,'"',') ' ), 'NULL' ), ', ' ) AS col_AVG,
LISTAGG ( CONCAT ( ' APPROX_TOP_K(', '"', col.column_name, '"', ', 100, 10000)' ), ', ' ) AS col_top
FROM snowball_tables tbl JOIN snowball_columns col ON col.full_table_name = tbl.full_table_name
LEFT OUTER JOIN snowball sb ON sb.table_name = tbl.full_table_name
WHERE (tbl.table_catalog, tbl.table_schema) = (?, ?)
AND ( sb.table_name IS NULL OR sb.stats_run_date_time < TIMESTAMPADD(DAY, - FLOOR(?), CURRENT_TIMESTAMP()) )
--AND tbl.row_count > 0 -- NB: also excludes views (table_type = 'VIEW')
GROUP BY tbl.full_table_name, aprox_row_count, quoted_table_name, table_stats_sql, stats_run_date_time
ORDER BY stats_run_date_time NULLS FIRST )
LIMIT ` + validLimit;
var tablesAnalysed = [];
var currentSql;
try {
currentSql = sqlGenerateInserts;
var generateInserts = snowflake.createStatement( {
sqlText: currentSql,
binds: [
`"${DB_NAME}".information_schema.tables`,
`"${DB_NAME}".information_schema.columns`,
SNOWBALL_TABLE, SNOWBALL_TABLE,
DB_NAME, SCHEMA_NAME, MAX_AGE_DAYS, LIMIT
]
} );
var insertStatements = generateInserts.execute();
// loop over generated INSERT statements and execute them
while (insertStatements.next()) {
var tableName = insertStatements.getColumnValue('FULL_TABLE_NAME');
currentSql = insertStatements.getColumnValue('INSERT_SQL');
var insertStatement = snowflake.createStatement( {
sqlText: currentSql,
binds: [ SNOWBALL_TABLE ]
} );
var insertResult = insertStatement.execute();
tablesAnalysed.push(tableName);
}
return { result: "SUCCESS", analysedTables: tablesAnalysed };
}
catch (err) {
return {
error: err,
analysedTables: tablesAnalysed,
sql: currentSql
};
}
$$;
call DEMO_DB.PUBLIC.SNOWBALL(
'SNOWFLAKE_SAMPLE_DATA',
'TPCH_SF1',
'DEMO_DB.PUBLIC.SNOWBALL',
1,    -- re-analyse tables whose stats are older than this many days (irrelevant on the first run)
1000  -- limit on the number of tables analysed
);
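Once the call completes, the collected stats sit in DEMO_DB.PUBLIC.SNOWBALL and can be queried for the dashboard. A rough sketch follows; it flattens the per-column arrays, and because HLL() is an approximate distinct count the duplicate figure is only an estimate. You would also filter down to each table's key column.

-- Approximate per-column distinct counts from the collected stats (one row per table per run)
SELECT s.table_name,
       s.total_rows,
       f.value::string                                AS column_name,
       GET(s.col_hll, f.index)::number                AS approx_distinct_count,
       s.total_rows - GET(s.col_hll, f.index)::number AS approx_duplicates
FROM DEMO_DB.PUBLIC.SNOWBALL s,
     LATERAL FLATTEN(input => s.col_name) f
ORDER BY s.table_name, column_name;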
I've built a somewhat overkill solution for this. The supplied SQL does everything you've asked for, plus the top 100 values, min, max, stddev, avg, and null % at the column level for every table in all databases.
It also works out all PK/FKs, returning not just the PK but the description instead.
It runs in seconds. All the SQL is available from a post in the Snowflake community. Hit me up if you want the really smart stuff :-)
SQL here:
https://community.snowflake.com/s/group/0F90Z000000IOX5SAO/general-snowflake-community-help