Two equal SELECT expressions within a SELECT - SQL

I have to process my data with the Levenshtein function. In this case I'm using a nested selection:
SELECT levenshtein(a.param, b.param), *
FROM (
    SELECT 5 fields
    FROM table t
    JOIN x ...
    JOIN y ...
    GROUP BY 1, 2, 3
) a,
(
    SELECT 5 fields
    FROM table t
    JOIN x ...
    JOIN y ...
    GROUP BY 1, 2, 3
) b
Is there a way to avoid duplicating the inner SELECT?

The solution is pretty simple, thanks to @Nikarus for the suggestion to use a WITH expression:
WITH subtable AS (
    SELECT 5 fields
    FROM table t
    JOIN x ...
    JOIN y ...
    GROUP BY 1, 2, 3
)
SELECT levenshtein(a.param, b.param), *
FROM subtable a, subtable b
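To sanity-check the CTE self-join, here is a runnable sketch using SQLite through Python's sqlite3 module. SQLite has no built-in levenshtein, so a small Python implementation is registered as a SQL function; the table name and sample words are invented for illustration:

```python
import sqlite3

def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(s) < len(t):
        s, t = t, s
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (cs != ct)))      # substitution
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call levenshtein(x, y).
conn.create_function("levenshtein", 2, levenshtein)
conn.executescript("""
    CREATE TABLE words(param TEXT);
    INSERT INTO words VALUES ('kitten'), ('sitting'), ('mitten');
""")
# The CTE is defined once and referenced twice, exactly as in the
# WITH-based answer above.
rows = conn.execute("""
    WITH subtable AS (SELECT param FROM words)
    SELECT a.param, b.param, levenshtein(a.param, b.param)
    FROM subtable a, subtable b
    WHERE a.param < b.param
    ORDER BY 1, 2
""").fetchall()
print(rows)
```

The `WHERE a.param < b.param` filter just avoids comparing each pair twice and a word with itself.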


How do you aggregate two columns into arrays in BigQuery?

Say I have a table with 4 columns, a of type string, b of type integer, c of type integer, and d of type integer. How would I go ahead and use something like ARRAY_AGG on a STRUCT(b, c) and d separately (in other words, have two separate columns that would be arrays)?
The Query I have so far:
SELECT table1.a, table1.x, table2.y
FROM (
    SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x
    FROM project.table
    GROUP BY a
    ORDER BY a
) AS table1
LEFT OUTER JOIN (
    SELECT a, ARRAY_AGG(d) AS y
    FROM project.table
    GROUP BY a, d
    ORDER BY a, d
) table2 ON table1.a = table2.a
GROUP BY table1.a
ORDER BY table1.a;
This gives me the error: SELECT list expression references table1.x which is neither grouped nor aggregated at [1:20]
But if I try to add table1.x to the GROUP BY clause at the end, I get a new error: Grouping by expressions of type ARRAY is not allowed at [14:22]
Below is for BigQuery Standard SQL
#standardSQL
SELECT a,
    ARRAY(SELECT AS STRUCT b, c FROM t.x GROUP BY b, c) AS x,
    y
FROM (
    SELECT a,
        ARRAY_AGG(STRUCT(b, c)) AS x,
        ARRAY_AGG(DISTINCT d) AS y
    FROM `project.dataset.table`
    GROUP BY a
) t
Why not just do this in one query?
SELECT a, ARRAY_AGG(STRUCT(b, c)) AS x, ARRAY_AGG(DISTINCT d) AS y
FROM project.table
GROUP BY a
ORDER BY a;
Your version will work without the outer GROUP BY, but it is needlessly complicated.
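As a runnable stand-in for the single-query shape, here is a sketch in SQLite through Python's sqlite3. SQLite has no array type, so group_concat substitutes for BigQuery's ARRAY_AGG; the table name and sample data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t(a TEXT, b INTEGER, c INTEGER, d INTEGER);
    INSERT INTO t VALUES
        ('x', 1, 10, 7),
        ('x', 2, 20, 7),
        ('y', 3, 30, 8);
""")
# Both aggregates in one pass over the table, one GROUP BY -- the
# shape of the "why not one query?" answer.  group_concat stands in
# for ARRAY_AGG, since SQLite has no arrays.
rows = conn.execute("""
    SELECT a,
           group_concat(b || ':' || c) AS x,   -- the (b, c) pairs
           group_concat(DISTINCT d)    AS y    -- the distinct d values
    FROM t
    GROUP BY a
    ORDER BY a
""").fetchall()
print(rows)
```

Note that group_concat makes no ordering promise within a group, which mirrors the fact that ARRAY_AGG output order is also unspecified without an ORDER BY inside the aggregate.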

SQL: creating a table from columns in other tables

I have 4 tables called table1, table2, table3, table4. Each has a column in it called x,y,z, and w respectively:
x   y   z      w
--- --- ------ ------
1   A   120    Red
2   B   33.3   Orange
3   C   81.3   Green
4   D   41.3   Blue
I would like to create a new table that simply has these columns, with the order of the rows unchanged. In R, I would just do something like data.frame(x, y, z, w), but I don't know how to do something equivalent in SQL Server 2016. The tables have no common key (except their row number, of course!).
You can use something like this:
WITH CTE_Table1 AS (
    SELECT x, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS x_row
    FROM Table1
), CTE_Table2 AS (
    SELECT y, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS y_row
    FROM Table2
), CTE_Table3 AS (
    SELECT z, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS z_row
    FROM Table3
), CTE_Table4 AS (
    SELECT w, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS w_row
    FROM Table4
)
SELECT x, y, z, w
FROM CTE_Table1
INNER JOIN CTE_Table2 ON CTE_Table1.x_row = CTE_Table2.y_row
INNER JOIN CTE_Table3 ON CTE_Table1.x_row = CTE_Table3.z_row
INNER JOIN CTE_Table4 ON CTE_Table1.x_row = CTE_Table4.w_row
Be aware that tables have no inherent order and ORDER BY (SELECT 1) makes no ordering guarantee, so without a real ordering column the pairing of rows is not strictly deterministic.
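To see the row-numbering idea work end to end, here is a sketch using SQLite via Python's sqlite3, with sample data invented to match the question's display. SQLite's implicit rowid gives a deterministic insertion order to number by, a guarantee the T-SQL ORDER BY (SELECT 1) trick does not provide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1(x INTEGER);
    CREATE TABLE Table2(y TEXT);
    CREATE TABLE Table3(z REAL);
    CREATE TABLE Table4(w TEXT);
    INSERT INTO Table1 VALUES (1), (2), (3), (4);
    INSERT INTO Table2 VALUES ('A'), ('B'), ('C'), ('D');
    INSERT INTO Table3 VALUES (120), (33.3), (81.3), (41.3);
    INSERT INTO Table4 VALUES ('Red'), ('Orange'), ('Green'), ('Blue');
""")
# Number the rows of each table, then join on the row number.
# ORDER BY rowid (SQLite's insertion order) is deterministic here,
# unlike ORDER BY (SELECT 1) in the T-SQL version.
rows = conn.execute("""
    WITH c1 AS (SELECT x, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM Table1),
         c2 AS (SELECT y, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM Table2),
         c3 AS (SELECT z, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM Table3),
         c4 AS (SELECT w, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM Table4)
    SELECT x, y, z, w
    FROM c1
    JOIN c2 ON c1.rn = c2.rn
    JOIN c3 ON c1.rn = c3.rn
    JOIN c4 ON c1.rn = c4.rn
    ORDER BY c1.rn
""").fetchall()
print(rows)
```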

selecting incremental data from multiple tables in Hive

I have five tables (A, B, C, D, E) in a Hive database, and I have to union the data from these tables based on logic over the column "id".
The condition is :
Select * from A
UNION
select * from B (only ids not in A)
UNION
select * from C (only ids not in A and B)
UNION
select * from D (only ids not in A, B and C)
UNION
select * from E (only ids not in A, B, C and D)
I have to insert this data into a final table. One way is to create the target table (target), append it with the data for each UNION stage, and then use this table for joining with the next UNION stage.
This would be the relevant part of my .hql file:
insert into target
(select * from A
 UNION
 select B.*
 from A
 right outer join B on A.id = B.id
 where ISNULL(A.id));

INSERT INTO target
select C.*
from target
right outer join C on target.id = C.id
where ISNULL(target.id);

INSERT INTO target
select D.*
from target
right outer join D on target.id = D.id
where ISNULL(target.id);

INSERT INTO target
select E.*
from target
right outer join E on target.id = E.id
where ISNULL(target.id);
Is there a better way to make this happen? I assume we have to do the multiple joins/lookups anyway. I am looking for the best approach to achieve this in:
1) Hive with Tez
2) Spark SQL
Many thanks in advance.
If id is unique within each table, then row_number can be used instead of rank.
select *
from (
    select *,
        rank() over (partition by id order by src) as rnk
    from (
        select 1 as src, * from a
        union all select 2 as src, * from b
        union all select 3 as src, * from c
        union all select 4 as src, * from d
        union all select 5 as src, * from e
    ) t
) t
where rnk = 1;
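A runnable miniature of the rank-based approach, trimmed to three tables, using SQLite through Python's sqlite3 (schemas and sample rows are invented; SQLite 3.25+ supports window functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a(id INTEGER, val TEXT);
    CREATE TABLE b(id INTEGER, val TEXT);
    CREATE TABLE c(id INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'a1');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2');
    INSERT INTO c VALUES (2, 'c2'), (3, 'c3');
""")
# Tag each source with a priority, then keep the highest-priority
# (lowest src) row per id -- the rank-based answer in miniature.
rows = conn.execute("""
    SELECT id, val
    FROM (
        SELECT *, RANK() OVER (PARTITION BY id ORDER BY src) AS rnk
        FROM (
            SELECT 1 AS src, * FROM a
            UNION ALL SELECT 2, * FROM b
            UNION ALL SELECT 3, * FROM c
        ) t
    ) t
    WHERE rnk = 1
    ORDER BY id
""").fetchall()
print(rows)
```

id 1 appears in both a and b, but only the row from a (src = 1) survives; ids 2 and 3 fall through to the first table that has them.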
I think I would try to do this as:
with ids as (
    select id, min(which) as which
    from (select id, 1 as which from a union all
          select id, 2 as which from b union all
          select id, 3 as which from c union all
          select id, 4 as which from d union all
          select id, 5 as which from e
         ) x
    group by id
)
select a.*
from a join ids on a.id = ids.id and ids.which = 1
union all
select b.*
from b join ids on b.id = ids.id and ids.which = 2
union all
select c.*
from c join ids on c.id = ids.id and ids.which = 3
union all
select d.*
from d join ids on d.id = ids.id and ids.which = 4
union all
select e.*
from e join ids on e.id = ids.id and ids.which = 5;
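The min(which) variant runs the same way; here is a two-table sketch in SQLite via Python's sqlite3 (sample data invented), including the GROUP BY id the aggregation needs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a(id INTEGER, val TEXT);
    CREATE TABLE b(id INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'a1');
    INSERT INTO b VALUES (1, 'b1'), (2, 'b2');
""")
# First resolve, per id, which table "wins" (the lowest tag), then
# pull full rows only from the winning table.
rows = conn.execute("""
    WITH ids AS (
        SELECT id, MIN(which) AS which
        FROM (SELECT id, 1 AS which FROM a
              UNION ALL
              SELECT id, 2 FROM b) x
        GROUP BY id
    )
    SELECT a.* FROM a JOIN ids ON a.id = ids.id AND ids.which = 1
    UNION ALL
    SELECT b.* FROM b JOIN ids ON b.id = ids.id AND ids.which = 2
    ORDER BY id
""").fetchall()
print(rows)
```

Compared with the rank query, this scans each base table twice but avoids carrying every wide row through a window function.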

Redshift Join VS. Union with Group By

Let's say I would like to pull the fields dim, a, b, c, d from two tables, one of which contains a, b and the other c, d.
I'm wondering if there's a preferred way, performance-wise, to do it (between the following):
1:
select t1.dim, a, b, c, d
from
    (select dim, sum(a) as a, sum(b) as b from t1 group by dim) t1
join
    (select dim, sum(c) as c, sum(d) as d from t2 group by dim) t2
    on t1.dim = t2.dim;
2:
select dim, sum(a) as a, sum(b) as b, sum(c) as c, sum(d) as d
from
(
    select dim, a, b, null as c, null as d from t1
    union
    select dim, null as a, null as b, c, d from t2
) a
group by dim
Of course, this is when handling a large amount of data (5-30M records in the final query).
Thanks!
The first method would filter out any dim values that are not in both tables. UNION (as opposed to UNION ALL) removes duplicates, which is inefficient. So, neither is appealing.
I would go for:
select dim, sum(a) as a, sum(b) as b, sum(c) as c, sum(d) as d
from (select dim, a, b, null as c, null as d from t1
union all
select dim, null as a, null as b, c, d from t2
) a
group by dim;
You could also pre-aggregate the values in each subquery. Or use full outer join for the first method.
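A small SQLite check of the recommended UNION ALL form (sample data invented) shows why it is safer than the join version: a dim present in only one table still comes through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(dim TEXT, a INTEGER, b INTEGER);
    CREATE TABLE t2(dim TEXT, c INTEGER, d INTEGER);
    INSERT INTO t1 VALUES ('x', 1, 2), ('x', 3, 4);
    INSERT INTO t2 VALUES ('x', 5, 6), ('y', 7, 8);
""")
# UNION ALL keeps every row (no dedup pass), and dims present in only
# one table survive -- note 'y', which the inner-join version would drop.
rows = conn.execute("""
    SELECT dim, SUM(a) AS a, SUM(b) AS b, SUM(c) AS c, SUM(d) AS d
    FROM (SELECT dim, a, b, NULL AS c, NULL AS d FROM t1
          UNION ALL
          SELECT dim, NULL, NULL, c, d FROM t2) u
    GROUP BY dim
    ORDER BY dim
""").fetchall()
print(rows)
```

SUM over a column that is all NULL for a group stays NULL, which is how the 'y' row reports no a/b values.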

SQL Server: grouping by 2 variables, keeping all distinct

I made several attempts at googling a solution to this, but I'm having a hard time generating keywords for an accurate search.
Say I have a table with the following information.
A, B
1, 1
1, 2
2, 1
If I perform a group-by operation on both columns A and B, I'll get a table indexed only by the combinations that actually occur, but I'm interested in something of the form:
A, B, nRecords
1, 1, 1
1, 2, 1
2, 1, 1
2, 2, 0
Query:
SELECT
A, B, COUNT(*) nRecords
FROM
table
GROUP BY
A, B
will not include information for the A = 2, B = 2 case. Any thoughts on moving forward? This needs to scale to many distinct values in both columns.
select a.A, b.B, count(t.A) as nRecords
from
    (select distinct A from T) as a
cross join
    (select distinct B from T) as b
left outer join T as t on t.A = a.A and t.B = b.B
group by a.A, b.B
Note count(t.A) rather than count(*): the left join still produces one row for the unmatched A = 2, B = 2 combination, so count(*) would return 1 there instead of 0.
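A quick check of this cross-join pattern in SQLite via Python's sqlite3, using the question's sample data; counting a column from the left-joined table (rather than COUNT(*)) is what makes unmatched combinations come out as 0:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T(A INTEGER, B INTEGER);
    INSERT INTO T VALUES (1, 1), (1, 2), (2, 1);
""")
# Build every (A, B) combination with a cross join, then left-join the
# real rows back in.  COUNT(t.A) yields 0 where the join found nothing,
# because t.A is NULL on those rows.
rows = conn.execute("""
    SELECT a.A, b.B, COUNT(t.A) AS nRecords
    FROM (SELECT DISTINCT A FROM T) a
    CROSS JOIN (SELECT DISTINCT B FROM T) b
    LEFT OUTER JOIN T t ON t.A = a.A AND t.B = b.B
    GROUP BY a.A, b.B
    ORDER BY a.A, b.B
""").fetchall()
print(rows)
```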