Is there something like Spark's unionByName in BigQuery? - google-bigquery

I'd like to concatenate tables with different schemas, filling unknown values with null.
Simply using UNION ALL of course does not work like this:
WITH
x AS (SELECT 1 AS a, 2 AS b ),
y AS (SELECT 3 AS b, 4 AS c )
SELECT * FROM x
UNION ALL
SELECT * FROM y
a b
1 2
3 4
(unwanted result)
In Spark, I'd use unionByName to get the following result:
a    b    c
1    2    null
null 3    4
(wanted result)
Of course, I can manually create the needed query (adding nulls) in BigQuery like so:
SELECT a, b, NULL c FROM x
UNION ALL
SELECT NULL a, b, c FROM y
But I'd prefer to have a generic solution, not requiring me to generate something like that.
So, is there something like unionByName in BigQuery? Or can one come up with a generic SQL function for this?

Consider the approach below (I think it is as generic as one can get):
create temp function json_extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function json_extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp table temp_table as (
select json, key, value
from (
select to_json_string(t) json from table_x as t
union all
select to_json_string(t) from table_y as t
) t, unnest(json_extract_keys(json)) key with offset
join unnest(json_extract_values(json)) value with offset
using(offset)
order by key
);
execute immediate(select '''
select * except(json) from temp_table
pivot (any_value(value) for key in ("''' || string_agg(distinct key, '","') || '"))'
from temp_table
)
If applied to the sample data in your question, the output has the columns a, b and c, with null wherever a table lacks a column.
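If the column names of each table are known up front, the hand-written query from the question can also be generated rather than typed. The following is only a minimal sketch of that idea, not part of the answer above: the function name `union_by_name_sql` and its input format (a mapping from table name to column list) are assumptions made for illustration, and in practice the column lists might be read from a metadata view.

```python
def union_by_name_sql(tables):
    """Build a UNION ALL query that aligns columns by name.

    `tables` maps a table name to its list of column names
    (hypothetical input format chosen for this sketch).
    """
    # Collect every column once, preserving first-seen order.
    all_columns = []
    for columns in tables.values():
        for column in columns:
            if column not in all_columns:
                all_columns.append(column)

    selects = []
    for table, columns in tables.items():
        # Use the real column where it exists, otherwise a NULL placeholder.
        exprs = [c if c in columns else f"NULL AS {c}" for c in all_columns]
        selects.append(f"SELECT {', '.join(exprs)} FROM {table}")
    return "\nUNION ALL\n".join(selects)


query = union_by_name_sql({"x": ["a", "b"], "y": ["b", "c"]})
print(query)
```

For the sample tables x and y this produces the same shape of query as the manual version in the question, with the generated text passed to the database as a normal statement.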

Related

PostgreSQL union a variable of type array to the result set of a query

Is it possible to union a variable to a select statement in PostgreSQL? I have a recursive function at the moment that in essence does this:
create or replace function call_recurrsive_function(ids bigint[])
.
.
select id from x where y
union call_recurrsive_function(select id from x where y)
.
I've recently made some changes that increase the complexity of the select by a lot, and to increase performance I'd like to run that query only once per function call and do something like
var = select id from x where y
union call_recurrsive_function(var)
You can try using a CTE (Common Table Expression). For example:
with
r as (
select id from x where y
)
select *
from (
select id from r
union call_recurrsive_function(select id from r)
) x

Return five rows of random DNA instead of just one

This is the code I have to create a string of DNA:
prepare dna_length(int) as
with t1 as (
select chr(65) as s
union select chr(67)
union select chr(71)
union select chr(84) )
, t2 as ( select s, row_number() over() as rn from t1)
, t3 as ( select generate_series(1,$1) as i, round(random() * 4 + 0.5) as rn )
, t4 as ( select t2.s from t2 join t3 on (t2.rn=t3.rn))
select array_to_string(array(select s from t4),'') as dna;
execute dna_length(20);
I am trying to figure out how to re-write this to give a table of 5 rows of strings of DNA of length 20 each, instead of just one row. This is for PostgreSQL.
I tried:
CREATE TABLE dna_table(g int, dna text);
INSERT INTO dna_table (1, execute dna_length(20));
But this does not seem to work. I am an absolute beginner. How to do this properly?
PREPARE creates a prepared statement that can only be used "as is". If your prepared statement returns one string, then you can only get one string. You can't use it inside other statements, such as an INSERT.
In your case you may create a function:
create or replace function dna_length(int) returns text as
$$
with t1 as (
select chr(65) as s
union
select chr(67)
union
select chr(71)
union
select chr(84))
, t2 as (select s,
row_number() over () as rn
from t1)
, t3 as (select generate_series(1, $1) as i,
round(random() * 4 + 0.5) as rn)
, t4 as (select t2.s
from t2
join t3 on (t2.rn = t3.rn))
select array_to_string(array(select s from t4), '') as dna
$$ language sql;
And use it in a way like this:
insert into dna_table(g, dna) select generate_series(1,5), dna_length(20)
From the official doc:
PREPARE creates a prepared statement. A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
About functions.
This can be much simpler and faster:
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1,100) g -- 100 = 5 rows * 20 nucleotides
GROUP BY g%5;
random() produces a random value in the range 0.0 <= x < 1.0. Multiply by 4 and take the mathematical ceiling with ceil() (cheaper than round()), and you get a random distribution of the numbers 1-4. Convert to ACTG, and aggregate with GROUP BY g%5, % being the modulo operator.
About string_agg():
Concatenate multiple result rows of one column into one, group by another column
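The grouping trick (number every nucleotide, then bucket by the remainder of its number) can be checked outside the database. This is a rough Python sketch of the same idea, not a translation of the SQL: the function name `random_dna_rows` and the `seed` parameter are assumptions for illustration, and `rng.choice` is swapped in for the ceil(random() * 4) plus CASE mapping.

```python
import random


def random_dna_rows(rows, length, seed=None):
    """Return `rows` random DNA strings of `length` nucleotides each."""
    rng = random.Random(seed)
    # One bucket per output row, like GROUP BY g % rows.
    groups = [[] for _ in range(rows)]
    # g runs over 1..rows*length, like generate_series(1, rows * length).
    for g in range(1, rows * length + 1):
        groups[g % rows].append(rng.choice("ACTG"))
    # Concatenate each bucket, like string_agg(..., '').
    return ["".join(group) for group in groups]


dna = random_dna_rows(5, 20, seed=0)
for row in dna:
    print(row)
```

Because the series length is an exact multiple of the number of rows, every remainder occurs equally often, so each row ends up with the same number of nucleotides.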
As prepared statement, taking
$1 ... the number of rows
$2 ... the number of nucleotides per row
PREPARE dna_length(int, int) AS
SELECT string_agg(CASE ceil(random() * 4)
WHEN 1 THEN 'A'
WHEN 2 THEN 'C'
WHEN 3 THEN 'T'
WHEN 4 THEN 'G'
END, '') AS dna
FROM generate_series(1, $1 * $2) g
GROUP BY g%$1;
Call:
EXECUTE dna_length(5,20);
Result:
| dna |
| :------------------- |
| ATCTTCGACACGTCGGTACC |
| GTGGCTGCAGATGAACAGAG |
| ACAGCTTAAAACACTAAGCA |
| TCCGGACCTCTCGACCTTGA |
| CGTGCGGAGTACCCTAATTA |
If you need it a lot, consider a function instead. See:
What is the difference between a prepared statement and a SQL or PL/pgSQL function, in terms of their purposes?

Sybase -User Defined Functions

I have 2 user defined functions which return a table:
Let's call them UDF1 and UDF2:
select * from UDF1(param1) -> returns 1 result
select * from UDF2(param2) -> returns 1 result
The problem is that when I do
select * from UDF1(param1) union all select * from UDF2(param2) -> it returns only 1 result.
Ideally it should return 2 results, as it's a union all.
Can someone help me understand why this behaviour is observed in Sybase?
The exact code is as follows:
Created function as below:
EXEC SQL.
CREATE FUNCTION "ZCHECK_4" (
#COL3_VAL smallint
)
RETURNS TABLE (
"COL1" varchar(000030),
"COL2" varchar(000030),
"COL3" smallint
) AS RETURN SELECT
"ZTESTFUNC"."COL1",
"ZTESTFUNC"."COL2",
"ZTESTFUNC"."COL3"
FROM "ZTESTFUNC" "ZTESTFUNC"
WHERE "ZTESTFUNC"."COL3" = #COL3_VAL
ENDEXEC.
Final SQL view, which is returning only 1 row:
CREATE VIEW "ZCHECK_5" AS SELECT
"ZCHECK_4"."COL1",
"ZCHECK_4"."COL2",
"ZCHECK_4"."COL3"
FROM "ZCHECK_4"(
CAST(
20 AS TINYINT
)
) "ZCHECK_4"
UNION ALL SELECT
"ZCHECK_4"."COL1",
"ZCHECK_4"."COL2",
"ZCHECK_4"."COL3"
FROM "ZCHECK_4"(
CAST(
10 AS TINYINT
)
) "ZCHECK_4"
Note: the underlying table (ZTESTFUNC) has 2 records, which I cross-validated.
Apparently, for a UDF (user defined function), any syntax that follows the select statement on the function is ignored by the Sybase compiler.
Consider the scenario below:
Select F1 UNION ALL F2.
With F1 and F2 being UDFs with parameters, the part after F1 (UNION ALL F2) wouldn't be compiled in Sybase.
It might be a limitation of Sybase.
Note: this is not the case with tables or views, where UNION ALL works perfectly fine.

Multiple Columns in an "in" statement

I am using DB2 and I am trying to write a query which checks multiple columns against a given set of values, like field a, field b and field c against the values x, y, z, f. One way I can think of is writing the same condition 3 times with OR, i.e. field a in ('x','y','z','f') or field b in ... and so on. Please let me know if there is some other efficient and easy way to accomplish this. I am looking for a query that returns yes if any of the conditions is true, else no. Please suggest!
This may or may not work on as400:
create table a (a int not null, b int not null);
insert into a (a,b) values (1,1),(1,3),(2,3),(0,23);
select a.*
from a
where a in (1,2) or b in (1,2);
A B
----------- -----------
1 1
1 3
2 3
Rewriting as a join:
select a.*
from a
join ( values (1),(2) ) b (x)
on b.x in (a.a, a.b);
A B
----------- -----------
1 1
1 3
2 3
Assuming the column data types are the same, create a subquery that combines all the columns you want to search into one column with a union, then apply your IN to it:
SELECT *
FROM (
SELECT
YOUR_TABLE_PRIMARY_KEY
,A AS Col
FROM YOUR_TABLE
UNION ALL
SELECT
YOUR_TABLE_PRIMARY_KEY
,B AS Col
FROM YOUR_TABLE
UNION ALL
SELECT
YOUR_TABLE_PRIMARY_KEY
,C AS Col
FROM YOUR_TABLE
) AS SQ
WHERE
SQ.Col IN ('x','y','z','f')
Make sure to include the table key so you know which row the data refers to.
You can create a regular expression that describes the set of characters and use it with XQuery.
Assuming you're on a supported version of the OS (tested on 7.1 TR6), this should work...
with sel (val) as (values ('x'),('y'),('f'))
select * from mytbl
where flda in (select val from sel)
or fldb in (select val from sel)
or fldc in (select val from sel)
Expanding on the above since your OP asked for "condition is true return yes else no"
Assuming you've got the key to a row to check, would 'yes' or the empty set be good enough? somekey is the key for the row you want to check.
with sel (val) as (values ('x'),('y'),('f'))
select 'yes' from mytbl
where thekey = somekey
and ( flda in (select val from sel)
or fldb in (select val from sel)
or fldc in (select val from sel)
)
It's actually rather difficult to return a value when you don't have a matching row. Here's one way. Note I've switched to 1=yes, 0=no.
with sel (val) as (values ('x'),('y'),('f'))
select 1 from mytbl
where thekey = somekey
and ( flda in (select val from sel)
or fldb in (select val from sel)
or fldc in (select val from sel)
)
UNION ALL
select 0
from sysibm.sysdummy1
order by 1 desc
fetch first row only
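The "yes else no" requirement can be tried out end to end with a small in-memory database. The sketch below uses SQLite through Python rather than DB2, so it only illustrates the logic and the syntax may need adjusting on other systems. The sample rows and the helper name `any_field_matches` are made up for illustration, and it uses CASE WHEN EXISTS as an alternative to the UNION ALL with a dummy row shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytbl (thekey INTEGER PRIMARY KEY, flda TEXT, fldb TEXT, fldc TEXT)"
)
# Hypothetical sample rows: key 1 has a matching field, key 2 does not.
conn.executemany(
    "INSERT INTO mytbl VALUES (?, ?, ?, ?)",
    [(1, "a", "x", "b"), (2, "a", "b", "c")],
)

QUERY = """
WITH sel(val) AS (VALUES ('x'), ('y'), ('f'))
SELECT CASE WHEN EXISTS (
    SELECT 1 FROM mytbl
    WHERE thekey = ?
      AND (flda IN (SELECT val FROM sel)
        OR fldb IN (SELECT val FROM sel)
        OR fldc IN (SELECT val FROM sel))
) THEN 'yes' ELSE 'no' END
"""


def any_field_matches(key):
    """Return 'yes' if any field of the row with this key is in the value list."""
    return conn.execute(QUERY, (key,)).fetchone()[0]


print(any_field_matches(1), any_field_matches(2), any_field_matches(3))
```

Wrapping the check in EXISTS means a single row always comes back, including when the key itself does not exist.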

Using IN with convert in sql

I would like to use the IN clause, but with the convert function.
Basically, I have a table (A) with the column of type int.
But in the other table (B) I Have values which are of type varchar.
Essentially, what I am looking for something like this
select *
from B
where myB_Column IN (select myA_Columng from A)
However, I am not sure if the int from table A, would map / convert / evaluate properly for the varchar in B.
I am using SQL Server 2008.
You can use a CASE expression in the WHERE clause like this and CAST only if the value is an integer,
otherwise returning 0 or NULL depending on your requirements.
SELECT *
FROM B
WHERE CASE ISNUMERIC(myB_Column) WHEN 1 THEN CAST(myB_Column AS INT) ELSE 0 END
IN (SELECT myA_Columng FROM A)
ISNUMERIC will be 1 (true) for decimal values as well, so ideally you should implement your own IsInteger UDF. To do that, look at this question:
T-sql - determine if value is integer
Option #1
Select * from B where myB_Column IN
(
Select Cast(myA_Columng As Int) from A Where ISNUMERIC(myA_Columng) = 1
)
Option #2
Select B.* from B
Inner Join
(
Select Cast(myA_Columng As Int) As myA_Columng from A
Where ISNUMERIC(myA_Columng) = 1
) T
On T.myA_Columng = B.myB_Column
Option #3
Select B.* from B
Left Join
(
Select Cast(myA_Columng As Int) As myA_Columng from A
Where ISNUMERIC(myA_Columng) = 1
) T
On T.myA_Columng = B.myB_Column
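I would opt for the third one, for the reason given below. Note that a LEFT JOIN keeps every row of B, so add WHERE T.myA_Columng IS NOT NULL if only matching rows are wanted.
Disadvantages of the IN predicate
Suppose I have two lists.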
List 1 List 2
1 12
2 7
3 8
4 98
5 9
6 10
7 6
Using Contains, each List 1 item is searched for in List 2, which means up to 7 x 7 = 49 comparisons!
You can also use an EXISTS clause:
select *
from B
where EXISTS (select 1 from A WHERE CAST(myA_Columng AS VARCHAR) = myB_Column)
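Casting the integer side to text, as the EXISTS query does, avoids conversion errors on non-numeric strings in B. The sketch below tries this with SQLite through Python, not SQL Server 2008, so it only illustrates the comparison logic. The sample values are invented, and CAST ... AS TEXT stands in for CAST ... AS VARCHAR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (myA_Columng INTEGER);
    CREATE TABLE B (myB_Column TEXT);
    INSERT INTO A VALUES (1), (2), (3);
    INSERT INTO B VALUES ('1'), ('abc'), ('3'), ('2.0');
""")

# Convert the int column to text so that non-numeric strings in B
# are simply non-matching instead of causing a conversion error.
rows = conn.execute("""
    SELECT myB_Column
    FROM B
    WHERE EXISTS (
        SELECT 1 FROM A WHERE CAST(myA_Columng AS TEXT) = myB_Column
    )
    ORDER BY myB_Column
""").fetchall()

matches = [row[0] for row in rows]
print(matches)
```

The comparison is a plain string comparison, so a value such as '2.0' does not match the integer 2. Whether that is desirable depends on how the varchar column is populated.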
You can use the query below:
select B.*
from B
inner join (Select distinct MyA_Columng from A) AS X ON B.MyB_Column = CAST(x.MyA_Columng as NVARCHAR(50))
Try it by using CAST()
SELECT *
FROM B
WHERE CAST(myB_Column AS INT) IN (
SELECT myA_Columng
FROM A
)