In BigQuery, identify when columns do not match on UNION ALL - sql

with
table1 as (
select 'joe' as name, 17 as age, 25 as speed
),
table2 as (
select 'nick' as name, 21 as speed, 23 as strength
)
select * from table1
union all
select * from table2
In Google BigQuery, this union all does not throw an error because both tables have the same number of columns (3 each). However I receive bad data output because the columns do not match. Rather than outputting a new table with 4 columns name, age, speed, strength with correct values + nulls for missing values (which would probably be preferred), the union all keeps the 3 columns from the top row.
Is there a good way to catch that the columns do not match, rather than the query silently returning bad data? Is there any way for this to return an error perhaps, as opposed to a successful table? I'm not sure how to check in SQL that the columns in the 2 tables match.
Edit: in this example it is clear to see that the columns do not match, however in our data we have 100+ columns and we want to avoid a situation where we make an error in a UNION ALL

Below is for BigQuery Standard SQL and using scripting feature of BQ
DECLARE statement STRING;
SET statement = (
WITH table1_columns AS (
SELECT column FROM (SELECT * FROM `project.dataset.table1` LIMIT 1) t,
UNNEST(REGEXP_EXTRACT_ALL(TRIM(TO_JSON_STRING(t), '{}'), r'"([^"]*)":')) column
), table2_columns AS (
SELECT column FROM (SELECT * FROM `project.dataset.table2` LIMIT 1) t,
UNNEST(REGEXP_EXTRACT_ALL(TRIM(TO_JSON_STRING(t), '{}'), r'"([^"]*)":')) column
), all_columns AS (
SELECT column FROM table1_columns UNION DISTINCT SELECT column FROM table2_columns
)
SELECT (
SELECT 'SELECT ' || STRING_AGG(IF(t.column IS NULL, 'NULL as ', '') || a.column, ', ') || ' FROM `project.dataset.table1` UNION ALL '
FROM all_columns a LEFT JOIN table1_columns t USING(column)
) || (
SELECT 'SELECT ' || STRING_AGG(IF(t.column IS NULL, 'NULL as ', '') || a.column, ', ') || ' FROM `project.dataset.table2`'
FROM all_columns a LEFT JOIN table2_columns t USING(column)
)
);
EXECUTE IMMEDIATE statement;
when applied to sample data from your question - output is
Row name age speed strength
1 joe 17 25 null
2 nick null 21 23

After saving table1 and table2 as 2 tables in a dataset in BigQuery, I then used the metadata using INFORMATION_SCHEMA to check that the columns matched.
SELECT *
FROM models.INFORMATION_SCHEMA.COLUMNS
where table_name = 'table1'
SELECT *
FROM models.INFORMATION_SCHEMA.COLUMNS
where table_name = 'table2'
INFORMATION_SCHEMA.COLUMNS returns information including the column names and their positioning. I can join these 2 tables together then to check that the names match...

Related

How to extract the table name from a CREATE/UPDATE/INSERT statement in an SQL query?

I am trying to parse the table being created, inserted into or updated from the following sql queries stored in a table column.
Let's call the table column query. Following is some sample data to demonstrate variations in how the data could look like.
with sample_data as (
select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
select 2 as id, 'CREATE OR REPLACE TABLE tbl1 ...' as query union all
select 3 as id, 'DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 ...' as query union all
select 4 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 5 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 6 as id, 'UPDATE tbl3 SET col1 = ...' as query union all
select 7 as id, '/*some garbage comments*/ UPDATE tbl3 SET col1 = ...' as query union all
select 8 as id, 'DELETE tbl4 ...' as query
),
Following are the formats of the queries (we are trying to extract table_name ):
#1
some optional statements like drop table
CREATE some comments or optional statement like OR REPLACE TABLE table_name
everything else
#2
some optional statements like drop table
INSERT some comments INTO some comments table_name
#3
some optional statements like drop table
UPDATE some comments table_name
everything else
Regular Expression
To construct a suitable regex, let's start with the following relatively simple/readable version:
((CREATE( OR REPLACE)?|DROP) TABLE( IF EXISTS)?|UPDATE|DELETE|INSERT INTO) ([^\s\/*]+)
All the spaces above could be replaced with "at least one whitespace character", i.e. \s+. But we also need to allow comments. For a comment that looks like /*anything*/ the regex looks like \/\*.*\*\/ (where the comment characters are escaped with \ and "anything" is the .* in the middle). Given there could be multiple such comments, optionally separated by whitespace, we end up with (\s*\/\*.*\*\/\s*?)*\s+. Plugging this in everywhere there was a space gives:
((CREATE((\s*\/\*.*\*\/\s*?)*\s+OR(\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(\s*\/\*.*\*\/\s*?)*\s+TABLE((\s*\/\*.*\*\/\s*?)*\s+IF(\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(\s*\/\*.*\*\/\s*?)*\s+INTO)(\s*\/\*.*\*\/\s*?)*\s+([^\s\/*]+)
One further refinement needs to be made: Bracketed expressions have been used for choices, e.g. (CHOICE1|CHOICE2). But this syntax includes them as capturing groups. Actually we only require one capturing group for the table name so we can exclude all the other capturing groups via ?:, e.g. (?:CHOICE1|CHOICE2). This gives:
(?:(?:CREATE(?:(?:\s*\/\*.*\*\/\s*?)*\s+OR(?:\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(?:\s*\/\*.*\*\/\s*?)*\s+TABLE(?:(?:\s*\/\*.*\*\/\s*?)*\s+IF(?:\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(?:\s*\/\*.*\*\/\s*?)*\s+INTO)(?:\s*\/\*.*\*\/\s*?)*\s+([^\s\/*]+)
Online Regex Demo
Here's a demo of it working with your examples: Regex101 demo
SQL
The Google BigQuery documentation for REGEXP_EXTRACT says it will return the substring matched by the capturing group. So I'd expect something like this to work:
with sample_data as (
select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
select 2 as id, 'CREATE OR REPLACE TABLE tbl1 ...' as query union all
select 3 as id, 'DROP TABLE IF EXISTS tbl1; CREATE TABLE tbl1 ...' as query union all
select 4 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 5 as id, 'INSERT /*some comment*/ INTO tbl2 ...' as query union all
select 6 as id, 'UPDATE tbl3 SET col1 = ...' as query union all
select 7 as id, '/*some garbage comments*/ UPDATE tbl3 SET col1 = ...' as query union all
select 8 as id, 'DELETE tbl4 ...' as query
)
SELECT
*, REGEXP_EXTRACT(query, r"(?:(?:CREATE(?:(?:\s*\/\*.*\*\/\s*?)*\s+OR(?:\s*\/\*.*\*\/\s*?)*\s+REPLACE)?|DROP)(?:\s*\/\*.*\*\/\s*?)*\s+TABLE(?:(?:\s*\/\*.*\*\/\s*?)*\s+IF(?:\s*\/\*.*\*\/\s*?)*\s+EXISTS)?|UPDATE|DELETE|INSERT(?:\s*\/\*.*\*\/\s*?)*\s+INTO)(?:\s*\/\*.*\*\/\s*?)*\s+([^\s\/*]+)") AS table_name
FROM sample_data;
(The above is untested so please let me know in the comments if there are any issues.)
I think it really depends on your data, but you might find some success using an approach like this:
with data as (
select 1 as id, 'CREATE TABLE tbl1 ...' as query union all
select 2 as id, 'INSERT INTO tbl2 ...' as query union all
select 3 as id, 'UPDATE tbl3 ...' as query union all
select 4 as id, 'DELETE tbl4 ...' as query
),
splitted as (
select id, split(query, ' ') as query_parts from data
)
select
id,
case
when query_parts[safe_offset(0)] in('CREATE', 'INSERT') then query_parts[safe_offset(2)]
when query_parts[safe_offset(0)] in('UPDATE', 'DELETE') then query_parts[safe_offset(1)]
else 'Error'
end as table_name
from splitted
Of course this depends on the cleanliness and syntax in your query column. Also, if your table_name is qualified with project.table.dataset you would need to do further splitting.

How to avoid duplicates in the STRING_AGG function SQL Server

I was testing a query in SQL in which I need to concatenate values ​​in the form of a comma-separated list, and it works, I just have the problem of duplicate values.
This is the query:
SELECT t0.id_marcas AS CodMarca,
t0.nombremarcas AS NombreMarca,
t0.imagenmarcas,
(SELECT String_agg((t2.name), ', ')
FROM exlcartu_devcit.store_to_cuisine t1
INNER JOIN exlcartu_devcit.cuisine t2
ON t1.cuisine_id = t2.cuisine_id
WHERE store_id = (SELECT TOP 1 store_id
FROM exlcartu_devcit.store
WHERE id_marcas = t0.id_marcas
AND status = 1)) AS Descripcion,
t0.logo,
t0.imagen,
(SELECT TOP 1 preparing_time
FROM exlcartu_devcit.store
WHERE id_marcas = t0.id_marcas
AND status = 1) AS Tiempo,
t0.orden,
(SELECT TOP 1 Avg(minimum_amount)
FROM exlcartu_devcit.store_delivery_zone
WHERE id_marcas = t0.id_marcas) AS MontoMinimo
FROM exlcartu_devcit.[marcas] t0
I thought the solution could be just adding a DISTINCT to the query to avoid repeated values ​​in this way ...
(SELECT STRING_AGG(DISTINCT (t2.name), ', ') AS Descripcion
But apparently the STRING_AGG() function does not support it, any idea how to avoid repeated values?
Simplest way is just select from select, like this:
with dups as (select 1 as one union all select 1 as one)
select string_agg(one, ', ') from (select distinct one from dups) q;
vs original
with dups as (select 1 as one union all select 1 as one)
select string_agg(one, ', ') from dups;

SQL Search rows that contain strings from 2nd table list

I have a master table that contains a list of strings to search for. it returns TRUE/FALSE if any string in the cell contains text from the master lookup table. Currently I use excel's
=SUMPRODUCT(--ISNUMBER(SEARCH(masterTable,[#searchString])))>0
is there a way to do something like this in SQL? LEFT JOIN or OUTER APPLY would be simple solutions if the strings were equal; but they need be contains..
SELECT *
FROM t
WHERE col1 contains(lookupString,lookupColumn)
--that 2nd table could be maintained and referenced from multiple queries
hop
bell
PRS
2017
My desired results would be a column that shows TRUE/FALSE if the row contains any string from the lookup table
SEARCH_STRING Contained_in_lookup_column
hopping TRUE
root FALSE
Job2017 TRUE
PRS_tool TRUE
hand FALSE
Sorry i dont have access to the DB now to confirm the syntax, but should be something like this:
SELECT t.name,
case when (select count(1) from data_table where data_col like '%' || t.name || '%' > 0) then 'TRUE' else 'FALSE' end
FROM t;
or
SELECT t.name,
case when exists(select null from data_table where data_col like '%' || t.name || '%') then 'TRUE' else 'FALSE' end
FROM t;
Sérgio
You can use a combination of % wildcards with LIKE and EXISTS.
Example (using Oracle syntax) - we have a v_data table containing the data and a v_queries table containing the query terms:
with v_data (pk, value) as (
select 1, 'The quick brown fox jumps over the lazy dog' from dual union all
select 2, 'Yabba dabba doo' from dual union all
select 3, 'forty-two' from dual
),
v_queries (text) as (
select 'quick' from dual union all
select 'forty' from dual
)
select * from v_data d
where exists (
select null
from v_queries q
where d.value like '%' || q.text || '%');

T-SQL Comma delimited value from resultset to in clause in Subquery

I have an issue where in my data I will have a record returned where a column value will look like
-- query
Select col1 from myTable where id = 23
-- result of col1
111, 104, 34, 45
I want to feed these values to an in clause. So far I have tried:
-- Query 2 -- try 1
Select * from mytableTwo
where myfield in (
SELECT col1
from myTable where id = 23)
-- Query 2 -- try 2
Select * from mytableTwo
where myfield in (
SELECT '''' +
Replace(col1, ',', ''',''') + ''''
from myTable where id = 23)
-- query 2 test -- This works and will return data, so I verify here that data exists
Select * from mytableTwo
where myfield in ('111', '104', '34', '45')
Why aren't query 2 try 1 or 2 working?
You don't want an in clause. You want to use like:
select *
from myTableTwo t2
where exists (select 1
from myTable t
where id = 23 and
', '+t.col1+', ' like '%, '+t2.myfield+', %'
);
This uses like for the comparison in the list. It uses a subquery for the value. You could also phrase this as a join by doing:
select t2.*
from myTableTwo t2 join
myTable t
on t.id = 23 and
', '+t.col1+', ' like '%, '+t2.myfield+', %';
However, this could multiply the number of rows in the output if there is more than one row with id = 23 in myTable.
If you observe closely, Query 2 -- try 1 & Query 2 -- try 2 are considered as single value.
like this :
WHERE myfield in ('111, 104, 34, 45')
which is not same as :
WHERE myfield in ('111', '104', '34', '45')
So, If you intend to filter myTable rows from MyTableTwo, you need to extract the values of fields column data to a table variable/table valued function and filter the data.
I have created a table valued function which takes comma seperated string and returns a table value.
you can refer here T-SQL : Comma separated values to table
Final code to filter the data :
DECLARE #filteredIds VARCHAR(100)
-- Get the filter data
SELECT #filteredIds = col1
FROM myTable WHERE id = 23
-- TODO : Get the script for [dbo].[GetDelimitedStringToTable]
-- from the given link and execute before this
SELECT *
FROM mytableTwo T
CROSS APPLY [dbo].[GetDelimitedStringToTable] ( #filteredIds, ',') F
WHERE T.myfield = F.Value
Please let me know If this helps you!
I suppose col is a character type, whose result would be like like '111, 104, 34, 45'. If this is your situation, it's not the best of the world (denormalized database), but you can still relate these tables by using character operators like LIKE or CHARINDEX. The only gotcha is to convert the numeric column to character -- the default conversion between character and numeric is numeric and it will cause a conversion error.
Since #Gordon, responded using LIKE, I present a solution using CHARINDEX:
SELECT *
FROM mytableTwo tb2
WHERE EXISTS (
SELECT 'x'
FROM myTable tb1
WHERE tb1.id = 23
AND CHARINDEX(CONVERT(VARCHAR(20), tb2.myfield), tb1.col1) > 0
)

PostgreSQL convert columns to rows? Transpose?

I have a PostgreSQL function (or table) which gives me the following output:
Sl.no username Designation salary etc..
1 A XYZ 10000 ...
2 B RTS 50000 ...
3 C QWE 20000 ...
4 D HGD 34343 ...
Now I want the Output as below:
Sl.no 1 2 3 4 ...
Username A B C D ...
Designation XYZ RTS QWE HGD ...
Salary 10000 50000 20000 34343 ...
How to do this?
SELECT
unnest(array['Sl.no', 'username', 'Designation','salary']) AS "Columns",
unnest(array[Sl.no, username, value3Count,salary]) AS "Values"
FROM view_name
ORDER BY "Columns"
Reference : convertingColumnsToRows
Basing my answer on a table of the form:
CREATE TABLE tbl (
sl_no int
, username text
, designation text
, salary int
);
Each row results in a new column to return. With a dynamic return type like this, it's hardly possible to make this completely dynamic with a single call to the database. Demonstrating solutions with two steps:
Generate query
Execute generated query
Generally, this is limited by the maximum number of columns a table can hold. So not an option for tables with more than 1600 rows (or fewer). Details:
What is the maximum number of columns in a PostgreSQL select query
Postgres 9.4+
Dynamic solution with crosstab()
Use the first one you can. Beats the rest.
SELECT 'SELECT *
FROM crosstab(
$ct$SELECT u.attnum, t.rn, u.val
FROM (SELECT row_number() OVER () AS rn, * FROM '
|| attrelid::regclass || ') t
, unnest(ARRAY[' || string_agg(quote_ident(attname)
|| '::text', ',') || '])
WITH ORDINALITY u(val, attnum)
ORDER BY 1, 2$ct$
) t (attnum bigint, '
|| (SELECT string_agg('r'|| rn ||' text', ', ')
FROM (SELECT row_number() OVER () AS rn FROM tbl) t)
|| ')' AS sql
FROM pg_attribute
WHERE attrelid = 'tbl'::regclass
AND attnum > 0
AND NOT attisdropped
GROUP BY attrelid;
Operating with attnum instead of actual column names. Simpler and faster. Join the result to pg_attribute once more or integrate column names like in the pg 9.3 example.
Generates a query of the form:
SELECT *
FROM crosstab(
$ct$
SELECT u.attnum, t.rn, u.val
FROM (SELECT row_number() OVER () AS rn, * FROM tbl) t
, unnest(ARRAY[sl_no::text,username::text,designation::text,salary::text]) WITH ORDINALITY u(val, attnum)
ORDER BY 1, 2$ct$
) t (attnum bigint, r1 text, r2 text, r3 text, r4 text);
This uses a whole range of advanced features. Just too much to explain.
Simple solution with unnest()
One unnest() can now take multiple arrays to unnest in parallel.
SELECT 'SELECT * FROM unnest(
''{sl_no, username, designation, salary}''::text[]
, ' || string_agg(quote_literal(ARRAY[sl_no::text, username::text, designation::text, salary::text])
|| '::text[]', E'\n, ')
|| E') \n AS t(col,' || string_agg('row' || sl_no, ',') || ')' AS sql
FROM tbl;
Result:
SELECT * FROM unnest(
'{sl_no, username, designation, salary}'::text[]
,'{10,Joe,Music,1234}'::text[]
,'{11,Bob,Movie,2345}'::text[]
,'{12,Dave,Theatre,2356}'::text[])
AS t(col,row1,row2,row3,row4);
db<>fiddle here
Old sqlfiddle
Postgres 9.3 or older
Dynamic solution with crosstab()
Completely dynamic, works for any table. Provide the table name in two places:
SELECT 'SELECT *
FROM crosstab(
''SELECT unnest(''' || quote_literal(array_agg(attname))
|| '''::text[]) AS col
, row_number() OVER ()
, unnest(ARRAY[' || string_agg(quote_ident(attname)
|| '::text', ',') || ']) AS val
FROM ' || attrelid::regclass || '
ORDER BY generate_series(1,' || count(*) || '), 2''
) t (col text, '
|| (SELECT string_agg('r'|| rn ||' text', ',')
FROM (SELECT row_number() OVER () AS rn FROM tbl) t)
|| ')' AS sql
FROM pg_attribute
WHERE attrelid = 'tbl'::regclass
AND attnum > 0
AND NOT attisdropped
GROUP BY attrelid;
Could be wrapped into a function with a single parameter ...
Generates a query of the form:
SELECT *
FROM crosstab(
'SELECT unnest(''{sl_no,username,designation,salary}''::text[]) AS col
, row_number() OVER ()
, unnest(ARRAY[sl_no::text,username::text,designation::text,salary::text]) AS val
FROM tbl
ORDER BY generate_series(1,4), 2'
) t (col text, r1 text,r2 text,r3 text,r4 text);
Produces the desired result:
col r1 r2 r3 r4
-----------------------------------
sl_no 1 2 3 4
username A B C D
designation XYZ RTS QWE HGD
salary 10000 50000 20000 34343
Simple solution with unnest()
SELECT 'SELECT unnest(''{sl_no, username, designation, salary}''::text[] AS col)
, ' || string_agg('unnest('
|| quote_literal(ARRAY[sl_no::text, username::text, designation::text, salary::text])
|| '::text[]) AS row' || sl_no, E'\n , ') AS sql
FROM tbl;
Slow for tables with more than a couple of columns.
Generates a query of the form:
SELECT unnest('{sl_no, username, designation, salary}'::text[]) AS col
, unnest('{10,Joe,Music,1234}'::text[]) AS row1
, unnest('{11,Bob,Movie,2345}'::text[]) AS row2
, unnest('{12,Dave,Theatre,2356}'::text[]) AS row3
, unnest('{4,D,HGD,34343}'::text[]) AS row4
Same result.
If (like me) you were needing this information from a bash script, note there is a simple command-line switch for psql to tell it to output table columns as rows:
psql mydbname -x -A -F= -c "SELECT * FROM foo WHERE id=123"
The -x option is the key to getting psql to output columns as rows.
I have a simpler approach than Erwin pointed above, that worked for me with Postgres (and I think that it should work with all major relational databases whose support SQL standard)
You can use simply UNION instead of crosstab:
SELECT text 'a' AS "text" UNION SELECT 'b';
text
------
a
b
(2 rows)
Of course that depends on the case in which you are going to apply this. Considering that you know beforehand what fields you need, you can take this approach even for querying different tables. I.e.:
SELECT 'My first metric' as name, count(*) as total from first_table UNION
SELECT 'My second metric' as name, count(*) as total from second_table
name | Total
------------------|--------
My first metric | 10
My second metric | 20
(2 rows)
It's a more maintainable approach, IMHO. Look at this page for more information: https://www.postgresql.org/docs/current/typeconv-union-case.html
There is no proper way to do this in plain SQL or PL/pgSQL.
It will be way better to do this in the application, that gets the data from the DB.