Combine 2 tables which don't have any relationship - sql

I have a couple of tables like below:
Table1:
A B C D <<Columns
1 2 3 4 <<single row
Table2:
W X Y Z << Columns
5 6 7 8 << Single row
I want to combine these 2 tables in such a way that I get the following result:
Result:
P Q R S << Column headers
1 2 3 4 << row from table1
5 6 7 8 << row from table2
The expected result has the column headers P, Q, R, S, followed by the row from table1 and the row from table2.
How can I achieve this using SQL?

UNION ALL does not eliminate duplicates (UNION does).
In set operations (UNION / INTERSECT / EXCEPT) the column aliases are taken from the first query. (The only exception I'm currently aware of is Hive, which requires the aliases to be the same in all queries; I consider this a bug.)
select A as P, B as Q, C as R, D as S
from table1
union all
select W,X,Y,Z
from table2
+---+---+---+---+
| p | q | r | s |
+---+---+---+---+
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
+---+---+---+---+
If table2 has only 3 columns:
select B as Q, C as R, D as S
from table1
union all
select X,Y,Z
from table2
+---+---+---+
| q | r | s |
+---+---+---+
| 2 | 3 | 4 |
| 6 | 7 | 8 |
+---+---+---+
or
select A as P, B as Q, C as R, D as S
from table1
union all
select null,X,Y,Z
from table2
+--------+---+---+---+
| p | q | r | s |
+--------+---+---+---+
| 1 | 2 | 3 | 4 |
| (null) | 6 | 7 | 8 |
+--------+---+---+---+
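To check the alias behaviour without a full database server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here (an assumption, the question does not name an engine); the table and column names mirror the question.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (A INT, B INT, C INT, D INT);
    CREATE TABLE table2 (W INT, X INT, Y INT, Z INT);
    INSERT INTO table1 VALUES (1, 2, 3, 4);
    INSERT INTO table2 VALUES (5, 6, 7, 8);
""")

# Aliases are given only in the first SELECT of the set operation.
cur = conn.execute("""
    SELECT A AS P, B AS Q, C AS R, D AS S FROM table1
    UNION ALL
    SELECT W, X, Y, Z FROM table2
""")
column_names = [d[0] for d in cur.description]
rows = sorted(cur.fetchall())  # sorted, since UNION ALL promises no row order

print(column_names)
print(rows)
```

The column names of the combined result come from the first SELECT, so the second SELECT needs no aliases.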

Updated to be more strict and more complete, thanks to @AntDC (and @Matt) and @Dudu Markovitz.
Use UNION with aliases, like this:
SELECT A AS P, B AS Q, C AS R, D AS S
FROM table1
UNION
-- or UNION ALL if you want to keep duplicate rows
SELECT W, X, Y, Z
FROM table2
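The difference between UNION and UNION ALL only shows up when the two tables contain the same row. A small sketch with Python's sqlite3 module (SQLite as a stand-in engine; the duplicate row in table2 is made up for the demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (A INT, B INT, C INT, D INT);
    CREATE TABLE table2 (W INT, X INT, Y INT, Z INT);
    INSERT INTO table1 VALUES (1, 2, 3, 4);
    INSERT INTO table2 VALUES (1, 2, 3, 4);  -- hypothetical duplicate of the table1 row
""")

template = (
    "SELECT A AS P, B AS Q, C AS R, D AS S FROM table1 "
    "{op} "
    "SELECT W, X, Y, Z FROM table2"
)

# UNION removes duplicate rows, UNION ALL keeps them.
union_rows = conn.execute(template.format(op="UNION")).fetchall()
union_all_rows = conn.execute(template.format(op="UNION ALL")).fetchall()

print(len(union_rows), len(union_all_rows))
```

If both rows must always appear in the result, as in the question, UNION ALL is the safer choice.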

Related

Bigquery: Joining 2 tables one having repeated records and one with count ()

I want to join the tables after unnesting the arrays in Table:1, but the records are duplicated after the join because of the unnest.
Table:1
| a | d.b | d.c |
-----------------
| 1 | 5 | 2 |
- -------------
| | 3 | 1 |
-----------------
| 2 | 2 | 1 |
Table:2
| a | c | f |
-----------------
| 1 | 12 | 13 |
-----------------
| 2 | 14 | 15 |
I want to join table 1 and 2 on a, but I also need the following output:
| a | d.b | d.c | f | h | Sum(count(a))
---------------------------------------------
| 1 | 5 | 2 | 13 | 12 |
- ------------- - - 1
| | 3 | 1 | | |
---------------------------------------------
| 2 | 2 | 1 | 15 | 14 | 1
a can be repeated in table 2, so I need to count(a) and then select the sum after the join.
My problem is that after joining I need the nested and repeated record to look the same as in the first table. When I aggregate to get the sum, I can't group by structs or arrays, so I UNNEST the records first and then use the ARRAY_AGG function, but then the sum is wrong.
SELECT
t1.a,
t2.f,
t2.h,
ARRAY_AGG(DISTINCT(t1.db)) as db,
ARRAY_AGG(DISTINCT(t1.dc)) as dc,
SUM(t2.total) AS total
FROM (
SELECT
a,
d.b as db,
d.c as dc
FROM
`table1`,
UNNEST(d) AS d,
) AS t1
LEFT JOIN (
SELECT
a,
f,
h,
COUNT(*) AS total,
FROM
`table2`
GROUP BY
a,f,h) AS t2
ON
t1.a = t2.a
GROUP BY
1,
2,
3
Note: the error is in the total after the sum, which is much higher than expected; all other data is correct.
I guess column a in your table 2 is not unique.
Let's assume that table 2 looks like this:
| a | c | f |
-----------------
| 1 | 12 | 13 |
-----------------
| 2 | 14 | 15 |
-----------------
| 1 | 100 | 101 |
There are two rows where a is 1. Since their other columns differ, the grouping in the subquery (GROUP BY a,f,h) does not collapse them, and COUNT(*) AS total is 1 for each row:
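| a | c | f | total |
-------------------------
| 1 | 12 | 13 | 1 |
-------------------------
| 2 | 14 | 15 | 1 |
-------------------------
| 1 | 100 | 101 | 1 |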
In the next step you join this table to your table 1. The rows of table1 with value 1 in column a are duplicated, because table2 has two entries for that value. As a result, the sum is too high.
Instead of unnesting the tables, I recommend the following approach:
-- Creating of sample data as given:
with tbl_A as (select 1 a, [struct(5 as b,2 as c),struct(3,1)] d union all select 2,[struct(2,1)] union all select null,[struct(50,51)]),
tbl_B as (select 1 as a,12 b, 13 f union all select 2,14,15 union all select 1,100,101 union all select null,500,501)
-- Query:
select *
from tbl_A A
left join
(Select a,array_agg(struct(b,f)) as B, count(1) as counts from tbl_B group by 1) B
on ifnull(A.a,-9)=ifnull(B.a,-9)
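The fan-out described above can be reproduced outside BigQuery. This is a rough sketch with Python's sqlite3 module; t1_flat is a hypothetical flat stand-in for table1 after UNNEST, and SQLite is used only because it runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- table1 after UNNEST: one row per array element (hypothetical stand-in)
    CREATE TABLE t1_flat (a INT, db INT, dc INT);
    INSERT INTO t1_flat VALUES (1, 5, 2), (1, 3, 1), (2, 2, 1);
    -- table2 in which a = 1 appears twice
    CREATE TABLE t2 (a INT, c INT, f INT);
    INSERT INTO t2 VALUES (1, 12, 13), (2, 14, 15), (1, 100, 101);
""")

# Grouping table2 by every column keeps both a = 1 rows, so the join
# multiplies them with the two unnested table1 rows (2 x 2 = 4).
inflated = conn.execute("""
    SELECT t1.a, SUM(g.total)
    FROM t1_flat AS t1
    LEFT JOIN (SELECT a, c, f, COUNT(*) AS total FROM t2 GROUP BY a, c, f) AS g
      ON t1.a = g.a
    GROUP BY t1.a
    ORDER BY t1.a
""").fetchall()

# Aggregating table2 down to one row per a before joining avoids the fan-out.
correct = conn.execute("""
    SELECT k.a, g.total
    FROM (SELECT DISTINCT a FROM t1_flat) AS k
    LEFT JOIN (SELECT a, COUNT(*) AS total FROM t2 GROUP BY a) AS g
      ON k.a = g.a
    ORDER BY k.a
""").fetchall()

print(inflated)
print(correct)
```

The second query follows the same idea as the recommended approach: reduce table 2 to one row per join key first, then join.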

How to merge different schemas of structs in arrays (filling missing columns with null)?

Given these two tables (a being an array of structs).
baz_v1 (a being ARRAY<STRUCT<x INT64>>):
+===========+
| a.x | b |
+===========+
| 1 | one |
| 2 | |
+===========+
baz_v2 (a being ARRAY<STRUCT<x INT64, y INT64>>):
+=================+
| a.x | a.y | b |
+=================+
| 3 | 4 | one |
| 5 | 6 | |
+-----------------+
| 7 | 8 | two |
| 9 | 10 | |
+-----------------+
| 11 | 12 | two |
| 13 | 14 | |
+=================+
How can I obtain the following (concatenated) table/view?
+==================+
| a.x | a.y | b |
+==================+
| 1 | null | one |
| 2 | null | |
+------------------+
| 3 | 4 | one |
| 5 | 6 | |
+------------------+
| 7 | 8 | two |
| 9 | 10 | |
+------------------+
| 11 | 12 | two |
| 13 | 14 | |
+==================+
Code:
WITH `baz_v1` AS (
SELECT
[
STRUCT(1 AS x),
STRUCT(2 AS x)
]
a,
"one" b
), `baz_v2` AS (
SELECT
[
STRUCT(3 AS x, 4 AS y),
STRUCT(5 AS x, 6 AS y)
]
a,
"one" b
UNION ALL
SELECT
[
STRUCT(7 AS x, 8 AS y),
STRUCT(9 AS x, 10 AS y)
]
a,
"two" b
UNION ALL
SELECT
[
STRUCT(11 AS x, 12 AS y),
STRUCT(13 AS x, 14 AS y)
]
a,
"two" b
)
-- todo: Insert magic here, because the below, of course, does not work.
SELECT * FROM baz_v2
UNION ALL
SELECT * FROM baz_v1
Consider below
select * replace(
array(select as struct x, null as y from t.a) as a)
from `baz_v1` t
union all
select * from `baz_v2`
When applied to the sample data in your question, this returns the desired concatenated table.
Building on the very good answer given by Mikhail Berlyant, I've found another solution, one that does not use SELECT * REPLACE:
SELECT ARRAY(SELECT AS STRUCT x, NULL AS y FROM baz_v1.a) AS a, b FROM baz_v1
UNION ALL
SELECT * FROM baz_v2
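For readers without BigQuery at hand, the idea behind both queries (rebuild every struct of the narrower table with the missing field set to NULL, then concatenate) can be sketched in plain Python. The function pad_structs is a made-up helper, and dicts stand in for STRUCTs.

```python
def pad_structs(rows, fields):
    """Return rows whose array elements carry every name in `fields` (None if absent)."""
    return [
        {
            "a": [{name: elem.get(name) for name in fields} for elem in row["a"]],
            "b": row["b"],
        }
        for row in rows
    ]


# Sample data from the question, with dicts standing in for STRUCTs.
baz_v1 = [{"a": [{"x": 1}, {"x": 2}], "b": "one"}]
baz_v2 = [
    {"a": [{"x": 3, "y": 4}, {"x": 5, "y": 6}], "b": "one"},
    {"a": [{"x": 7, "y": 8}, {"x": 9, "y": 10}], "b": "two"},
    {"a": [{"x": 11, "y": 12}, {"x": 13, "y": 14}], "b": "two"},
]

# Equivalent of: padded baz_v1 UNION ALL baz_v2
merged = pad_structs(baz_v1, ["x", "y"]) + baz_v2

for row in merged:
    print(row)
```

This only mirrors the shape of the result; in BigQuery the padding is done by ARRAY(SELECT AS STRUCT x, NULL AS y FROM ...) as shown above.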
The following method was used to merge these tables.
(1) To use UNION ALL on the target tables, the array type was flattened.
(2) To match the number of columns in the UNION ALL target tables, the missing column is added with
LEFT JOIN (SELECT '' as y) ON FALSE
(3) ARRAY_AGG, STRUCT, and GROUP BY were used to output the array result.
WITH `baz_v1` AS (
SELECT
[
STRUCT(1 AS x),
STRUCT(2 AS x)
] a,
"one" b
), `baz_v2` AS (
SELECT
[
STRUCT(3 AS x, 4 AS y),
STRUCT(5 AS x, 6 AS y)
] a,
"two" b
)
SELECT ARRAY_AGG (
STRUCT(x, y)
) as a,
b
FROM (
SELECT xy.x as x, xy.y as y, b
FROM baz_v2, UNNEST(a) as xy
UNION ALL
SELECT x, y, b
FROM (
SELECT x.x as x, CAST(y as INT64) as y, b
FROM baz_v1, UNNEST(a) as x
LEFT JOIN (SELECT '' as y) ON FALSE
)
)
GROUP BY b
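The three steps (flatten, pad the missing column, regroup) can also be tried with Python's sqlite3 module. This is only a sketch: SQLite has no arrays or structs, so the tables baz_v1_flat and baz_v2_flat are hypothetical, already flattened stand-ins, and the regrouping is done in Python instead of ARRAY_AGG.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 1: tables as they look after UNNEST (hypothetical flat stand-ins)
    CREATE TABLE baz_v1_flat (x INT, b TEXT);
    CREATE TABLE baz_v2_flat (x INT, y INT, b TEXT);
    INSERT INTO baz_v1_flat VALUES (1, 'one'), (2, 'one');
    INSERT INTO baz_v2_flat VALUES (3, 4, 'two'), (5, 6, 'two');
""")

# Step 2: pad the missing column with NULL so both sides have the same shape.
flat = conn.execute("""
    SELECT x, y, b FROM baz_v2_flat
    UNION ALL
    SELECT x, NULL AS y, b FROM baz_v1_flat
    ORDER BY b, x
""").fetchall()

# Step 3: regroup by b, standing in for ARRAY_AGG(STRUCT(x, y)) ... GROUP BY b.
arrays = defaultdict(list)
for x, y, b in flat:
    arrays[b].append({"x": x, "y": y})
merged = dict(arrays)

print(merged)
```

Note that grouping by b merges every row that shares the same b into one array, which works for this sample because the two tables use different b values.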

generate serial number in decreasing order given a variable in netezza aginity sql

Is there any SQL syntax in Netezza SQL that, given a column number, generates rows with the numbers in decreasing order down to 0?
Below is an example of what I'm trying to do:
BEFORE
ID | NUMBER
A | 4
B | 5
AFTER
ID | NUMBER
A | 4
A | 3
A | 2
A | 1
B | 5
B | 4
B | 3
B | 2
B | 1
You can use the _v_vector_idx table for this purpose
select
id, idx
from
test join _v_vector_idx
on idx <= number
order
by id asc, idx desc ;
Here's the example in action
select * from test
ID | NUMBER
-------+--------
A | 4
B | 5
(2 rows)
select id, idx from test join _v_vector_idx on
idx <= number order by id asc, idx desc ;
ID | IDX
-------+-----
A | 4
A | 3
A | 2
A | 1
A | 0
B | 5
B | 4
B | 3
B | 2
B | 1
B | 0
(11 rows)
insert into test values ('C', 3);
INSERT 0 1
select * from test;
ID | NUMBER
-------+--------
A | 4
B | 5
C | 3
(3 rows)
select id, idx from test join _v_vector_idx
on idx <= number order by id asc, idx desc ;
ID | IDX
-------+-----
A | 4
A | 3
A | 2
A | 1
A | 0
B | 5
B | 4
B | 3
B | 2
B | 1
B | 0
C | 3
C | 2
C | 1
C | 0
(15 rows)
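The same join-against-a-numbers-table technique can be tried outside Netezza. A minimal sketch with Python's sqlite3 module follows; the table vector_idx is a hypothetical stand-in for _v_vector_idx, filled here with 0 to 9. Since the AFTER table in the question stops at 1 while the output above includes 0, the sketch also shows an extra idx >= 1 condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (id TEXT, number INT);
    INSERT INTO test VALUES ('A', 4), ('B', 5);
    CREATE TABLE vector_idx (idx INT);  -- hypothetical stand-in for _v_vector_idx
""")
conn.executemany("INSERT INTO vector_idx VALUES (?)", [(i,) for i in range(10)])

query = """
    SELECT id, idx
    FROM test JOIN vector_idx
      ON idx <= number {extra}
    ORDER BY id ASC, idx DESC
"""

# Same join condition as in the answer: every id counts down to 0.
with_zero = conn.execute(query.format(extra="")).fetchall()
# Extra lower bound so the countdown stops at 1.
from_one = conn.execute(query.format(extra="AND idx >= 1")).fetchall()

print(with_zero)
print(from_one)
```

The numbers table must contain at least all values up to the largest number in test, otherwise the countdown starts too low.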

Efficient query to Group by column name in SQL or hive

Imagine I have a table with 2 columns m1 and m2:
m1 | m2
3 | 17
3 | 18
4 | 17
9 | 9
I would like to get a table with 3 columns:
m is the name of the column (in my example m_1 or m_2).
d is the value contained in that column.
count is the number of occurrences of each value, grouped by value and column.
In the example, the result is:
m | d | count
m_1 | 3 | 2
m_1 | 4 | 1
m_1 | 9 | 1
m_2 | 17| 2
m_2 | 18| 1
m_2 | 9 | 1
The first line must be read as 'value 3 occurs 2 times in column m_1'.
A naive solution is to execute two times a parametric query like this:
for (i in 1 .. 2)
SELECT CONCAT('m_', i), m_i, count(*) FROM table GROUP BY m_i
But this algorithm scans my table twice. This is a problem since I have 255 m columns and billions of rows.
Will the solution become easier if I use Hive instead of a relational database?
You can write this using union all and group by:
select colname, d, count(*)
from ((select 'm_1' as colname, m1 as d from t) union all
(select 'm_2' as colname, m2 as d from t)
) m12
group by colname, d;
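The query above can be checked against the sample data with Python's sqlite3 module (SQLite as a stand-in engine; the ORDER BY is added only to make the output deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (m1 INT, m2 INT);
    INSERT INTO t VALUES (3, 17), (3, 18), (4, 17), (9, 9);
""")

# Unpivot each column into (colname, d) pairs, then count per pair.
counts = conn.execute("""
    SELECT colname, d, COUNT(*) AS cnt
    FROM (SELECT 'm_1' AS colname, m1 AS d FROM t
          UNION ALL
          SELECT 'm_2' AS colname, m2 AS d FROM t) AS m12
    GROUP BY colname, d
    ORDER BY colname, d
""").fetchall()

for row in counts:
    print(row)
```

Each branch of the UNION ALL reads the table once, so with 255 columns this form still means 255 branches.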
In Hive, use posexplode(array(m1,m2)):
select concat('m_',cast(pe.pos+1 as string)) as m
,pe.val as d
,count(*) as `count`
from mytable t
lateral view posexplode(array(m1,m2)) pe
group by pos
,val
;
+------+-----+--------+
| m | d | count |
+------+-----+--------+
| m_1 | 3 | 2 |
| m_1 | 4 | 1 |
| m_1 | 9 | 1 |
| m_2 | 9 | 1 |
| m_2 | 17 | 2 |
| m_2 | 18 | 1 |
+------+-----+--------+
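What posexplode does per row can be imitated in plain Python: each row is read once, and every column contributes one (position, value) pair to the count. This is only an illustration of the idea on the sample data, not Hive code.

```python
from collections import Counter

# Sample rows (m1, m2) from the question.
rows = [(3, 17), (3, 18), (4, 17), (9, 9)]

counts = Counter()
for row in rows:  # single pass over the data
    # enumerate plays the role of posexplode: (position, value) per column
    for pos, val in enumerate(row):
        counts[(f"m_{pos + 1}", val)] += 1

result = sorted((m, d, n) for (m, d), n in counts.items())
for line in result:
    print(line)
```

Adding more columns only widens the tuple; the number of passes over the rows stays at one.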

PostgreSQL distinct rows joined with a count of distinct values in one column

I'm using PostgreSQL 9.4, and I have a table with 13 million rows and with data roughly as follows:
a | b | u | t
-----+---+----+----
foo | 1 | 1 | 10
foo | 1 | 2 | 11
foo | 1 | 2 | 11
foo | 2 | 4 | 1
foo | 3 | 5 | 2
bar | 1 | 6 | 2
bar | 2 | 7 | 2
bar | 2 | 8 | 3
bar | 3 | 9 | 4
bar | 4 | 10 | 5
bar | 5 | 11 | 6
baz | 1 | 12 | 1
baz | 1 | 13 | 2
baz | 1 | 13 | 2
baz | 1 | 13 | 3
There are indices on md5(a), on b, and on (md5(a), b). (In reality, a may contain values longer than 4k chars.) There is also a primary key column of type SERIAL which I have omitted above.
I'm trying to build a query which will return the following results:
a | b | u | t | z
-----+---+----+----+---
foo | 1 | 1 | 10 | 3
foo | 1 | 2 | 11 | 3
foo | 2 | 4 | 1 | 3
foo | 3 | 5 | 2 | 3
bar | 1 | 6 | 2 | 5
bar | 2 | 7 | 2 | 5
bar | 2 | 8 | 3 | 5
bar | 3 | 9 | 4 | 5
bar | 4 | 10 | 5 | 5
bar | 5 | 11 | 6 | 5
In these results, all rows are deduplicated as if GROUP BY a, b, u, t were applied, z is a count of distinct values of b for every partition over a, and only rows with a z value greater than 2 are included.
I can get just the z filter working as follows:
SELECT a, COUNT(b) AS z from (SELECT DISTINCT a, b FROM t) AS foo GROUP BY a
HAVING COUNT(b) > 2;
However, I'm stumped on combining this with the rest of the data in the table.
What's the most efficient way to do this?
Your first step can be simpler already:
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2;
Working with md5(a) in place of a, since a can obviously be very long, and you already have an index on md5(a) etc.
Since your table is big, you need an efficient query. This should be among the fastest possible solutions - with adequate index support. Your index on (md5(a), b) is instrumental but - assuming b, u, and t are small columns - an index on (md5(a), b, u, t) would be even better for the second step of the query (the lateral join).
Your desired end result:
SELECT DISTINCT ON (md5(t.a), b, u, t)
t.a, t.b, t.u, t.t, a.z
FROM (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) a
JOIN t ON md5(t.a) = md5_a
ORDER BY 1, 2, 3, 4; -- optional
Or probably faster, yet:
SELECT a, b, u, t, z
FROM (
SELECT DISTINCT ON (1, 2, 3, 4)
md5(t.a) AS md5_a, t.b, t.u, t.t, t.a
FROM t
) t
JOIN (
SELECT md5(a) AS md5_a, count(DISTINCT b) AS z
FROM t
GROUP BY 1
HAVING count(DISTINCT b) > 2
) z USING (md5_a)
ORDER BY 1, 2, 3, 4; -- optional
Detailed explanation for DISTINCT ON:
Select first row in each GROUP BY group?
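To sanity-check the expected result on the sample data without a PostgreSQL server, here is a simplified sketch using Python's sqlite3 module. It is not the query from the answer: SQLite has neither DISTINCT ON nor md5(), so plain SELECT DISTINCT and the raw column a are used instead, and none of the index considerations apply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INT, u INT, t INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("foo", 1, 1, 10), ("foo", 1, 2, 11), ("foo", 1, 2, 11),
        ("foo", 2, 4, 1), ("foo", 3, 5, 2),
        ("bar", 1, 6, 2), ("bar", 2, 7, 2), ("bar", 2, 8, 3),
        ("bar", 3, 9, 4), ("bar", 4, 10, 5), ("bar", 5, 11, 6),
        ("baz", 1, 12, 1), ("baz", 1, 13, 2), ("baz", 1, 13, 2), ("baz", 1, 13, 3),
    ],
)

# Deduplicated rows joined to the per-a count of distinct b values (z > 2 only).
rows = conn.execute("""
    SELECT d.a, d.b, d.u, d.t, z.z
    FROM (SELECT DISTINCT a, b, u, t FROM t) AS d
    JOIN (SELECT a, COUNT(DISTINCT b) AS z
          FROM t
          GROUP BY a
          HAVING COUNT(DISTINCT b) > 2) AS z
      ON d.a = z.a
    ORDER BY d.a, d.b, d.u, d.t
""").fetchall()

for row in rows:
    print(row)
```

The rows match the expected table in the question (sorted by a here, so bar comes before foo), and baz is filtered out because it has only one distinct b.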