Joining arrays on internal (nested) row ids - google-bigquery

How can I join/sum up array values row by row (i.e. rows representing offsets into the array) when joining two tables?
i.e. given the two tables below,
the output for row_id = 1 should be STRUCT("A", ARRAY[41,43,45]), i.e. an array of the sums of the col1.val1 and col2.val2 values at each index, per internal_id.
WITH table1 AS (
SELECT 1 AS row_id, STRUCT("A" AS internal_id, ARRAY[1,2,3] as val1) AS col1
),
table2 AS (
SELECT 1 AS row_id, STRUCT("A" AS internal_id, ARRAY[40,41,42] as val2) AS col2
)
EDIT:
So I tried the below, un-nesting the arrays first, but the result is incorrect: joining on row_id alone matches every element of val1 with every element of val2, so the aggregate collects 9 values instead of just 3.
WITH table1 AS (
SELECT 1 AS row_id, STRUCT("A" AS internal_id, ARRAY[1,2,3] as val1) AS col1
),
table2 AS (
SELECT 1 AS row_id, STRUCT("A" AS internal_id, ARRAY[40,41,42] as val2) AS col2
),
table1_unnested as(
select row_id, col1.internal_id, val1 from table1, unnest(col1.val1) as val1
),
table2_unnested as(
select row_id, col2.internal_id, val2 from table2, unnest(col2.val2) as val2
)
select t1.row_id, t1.internal_id, ARRAY_AGG(t1.val1+t2.val2) as newval
from table1_unnested as t1
join table2_unnested as t2 using(row_id)
group by t1.row_id, t1.internal_id

You were very close - see the correction below
WITH table1 AS (
SELECT 1 AS row_id, STRUCT("A" AS internal_id, ARRAY[1,2,3] as val1) AS col1
),
table2 AS (
SELECT 1 AS row_id, STRUCT("A" AS internal_id, ARRAY[40,41,42] as val2) AS col2
),
table1_unnested as(
select row_id, col1.internal_id, val1, offset from table1, unnest(col1.val1) as val1 with offset
),
table2_unnested as(
select row_id, col2.internal_id, val2, offset from table2, unnest(col2.val2) as val2 with offset
)
select t1.row_id, t1.internal_id, ARRAY_AGG(t1.val1+t2.val2) as newval
from table1_unnested as t1
join table2_unnested as t2 using(row_id, offset)
group by t1.row_id, t1.internal_id
with output:

row_id  internal_id  newval
1       A            [41, 43, 45]
As you can see - I added offset in five (5) places to fix your original query
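One caveat worth adding: ARRAY_AGG does not guarantee element order unless you specify one, so to make the order of the summed array deterministic you can order by the offset inside the aggregate (same query as above, only the final select changes):
select t1.row_id, t1.internal_id, ARRAY_AGG(t1.val1 + t2.val2 ORDER BY offset) as newval
from table1_unnested as t1
join table2_unnested as t2 using(row_id, offset)
group by t1.row_id, t1.internal_id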

SQL with HAVING statement - now want complete rows

Here is a mock table, MYTABLE:

PKEY  COL1  COL2
1     a     55
2     b     44
3     b     33
4     c     88
5     d     22
6     d     33
I want to know which rows have duplicated COL1 values:
select col1, count(*)
from MYTABLE
group by col1
having count(*) > 1
This returns:
b,2
d,2
I now want all the rows that contain b and d. Normally I would use a WHERE ... IN statement, but with the count column I'm not certain what type of statement I should use.
Maybe you need:
select * from MYTABLE
where col1 in
(
select col1
from MYTABLE
group by col1
having count(*) > 1
)
Use a CTE and a windowed aggregate:
WITH CTE AS(
SELECT Pkey,
Col1,
Col2,
COUNT(1) OVER (PARTITION BY Col1) AS C
FROM dbo.YourTable)
SELECT PKey,
Col1,
Col2
FROM CTE
WHERE C > 1;
Lots of ways to solve this; here's another:
select * from MYTABLE
join
(
select col1, count(*) as cnt
from MYTABLE
group by col1
having count(*) > 1
) s on s.col1 = mytable.col1;
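Since there really are many ways, here is one more for completeness - an EXISTS check against the other rows (a sketch, assuming PKEY is unique):
select *
from MYTABLE t
where exists (
    select 1
    from MYTABLE d
    where d.col1 = t.col1
      and d.pkey <> t.pkey
);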

In SQL, how do you get the column name for the largest item in a row?

I have a table that has a unique id, col1, col2, col3, etc. All of the columns are numeric except the id column. I need to extract, for each id, the column with the highest value. So let's say we have an id 1, for which the col1 value is 10, the col2 value is 20, and the col3 value is 30. The result should be two columns: 1 and col3. Basically the id, and the name of the column with the highest value. I hope that is clear.
You should probably restructure your data. Having thirty repeated columns is a lousy way to store the data. Traditionally in SQL, you would use a separate table with one row per value. But BigQuery also supports arrays and JSON formats.
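For illustration, the one-row-per-value layout might look like this (a sketch; the table names are made up and only three of the columns are shown):
create table `project.dataset.table_long` as
select id, 'col1' as col, col1 as value from `project.dataset.table` union all
select id, 'col2', col2 from `project.dataset.table` union all
select id, 'col3', col3 from `project.dataset.table`;
-- the original question then becomes a simple aggregation
select id, array_agg(col order by value desc limit 1)[offset(0)] as greatest_col
from `project.dataset.table_long`
group by id;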
A brute force method uses a giant case expression:
select t.*,
(case greatest(col1, col2, col3, . . . )
when col1 then 'col1'
when col2 then 'col2'
. . .
when col30 then 'col30'
end) as greatest_value
from t;
I am not sure why the structure of your data is like that, but a way around could be like this:
Use GREATEST(), which fetches the largest of the values that are passed in:
SELECT
id,
GREATEST(col1,col2,.....col30) AS largest_value
FROM [table name]
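One caveat, if your data can contain NULLs: in BigQuery, GREATEST returns NULL when any of its arguments is NULL, so you may want to coalesce first (a sketch, assuming 0 is a safe floor for your values):
SELECT
  id,
  GREATEST(IFNULL(col1, 0), IFNULL(col2, 0), IFNULL(col3, 0)) AS largest_value
FROM [table name]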
I would go for a union all to put all the columns into the same column and then either get the max or order descending and get the top 1.
Something like this:
select o.id, (select top 1 name from (
    select id, col1 as col, 'col1' as name from tablename
    union all
    select id, col2 as col, 'col2' as name from tablename
    union all
    select id, col3 as col, 'col3' as name from tablename
) as t
where t.id = o.id
order by col desc) as greatest_col
from tablename as o
Consider the below approach with use of UNPIVOT (assumes each row has a unique id):
select as value array_agg(t order by value desc limit 1)[offset(0)]
from (
select * from `project.dataset.table`
unpivot (value for col in (col1, col2, col3, col4))
) t
group by id
You can test it with the below dummy data:
with `project.dataset.table` as (
select 1 id, 11 col1, 12 col2, 13 col3, 14 col4 union all
select 2, 24, 23, 22, 21 union all
select 3, 31, 34, 32, 33
)
select as value array_agg(t order by value desc limit 1)[offset(0)]
from (
select * from `project.dataset.table`
unpivot (value for col in (col1, col2, col3, col4))
) t
group by id
with output:

id  value  col
1   14     col4
2   24     col1
3   34     col2
Yet another option - it requires no knowledge of column names, so the same query can be used no matter how many columns there are or what they are called. It serializes each row to a JSON string like {"col1":11,"col2":12,...}, strips the braces and quotes with TRANSLATE, splits on commas into key:value strings, and then keeps the pair with the highest value (this assumes the values are numeric and that no column name contains ':' or ','):
select id,
( select as struct split(kv, ':')[offset(0)] col,
cast(split(kv, ':')[offset(1)] as numeric) value
from t.kvs as kv
order by value desc
limit 1
).*
from(
select *,
split(translate(to_json_string((select as struct * except(id) from unnest([t]))), '{}"', '')) kvs
from `project.dataset.table` t
) t
You can test it with dummy data as in the below example:
with `project.dataset.table` as (
select 1 id, 11 col1, 12 col2, 13 col3, 14 col4 union all
select 2, 24, 23, 22, 21 union all
select 3, 31, 34, 32, 33
)
select id,
( select as struct split(kv, ':')[offset(0)] col,
cast(split(kv, ':')[offset(1)] as numeric) value
from t.kvs as kv
order by value desc
limit 1
).*
from(
select *,
split(translate(to_json_string((select as struct * except(id) from unnest([t]))), '{}"', '')) kvs
from `project.dataset.table` t
) t
with output:

id  col   value
1   col4  14
2   col1  24
3   col2  34

Create multiple columns from existing Hive table columns

How can I create multiple columns from an existing Hive table? The example data would be like below:

col  code
a    1
b    2
c    3
b1   2
d    4
c1   3
a1   1
d1   4

My requirement is to create 2 new columns from the existing table, populated only when the condition is met: col1 when code=1, col2 when code=2.
Expected output:

col1  col2
a     b
a1    b1

Please help with how to achieve this in Hive queries.
If you aggregate the required values into arrays, then you can explode them and keep only the elements with matching positions.
Demo:
with
my_table as (--use your table instead of this CTE
select stack(8,
'a',1,
'b',2,
'c',3,
'b1',2,
'd',4,
'c1',3,
'a1',1,
'd1',4
) as (col, code)
)
select c1.val as col1, c2.val as col2 from
(
select collect_set(case when code=1 then col else null end) as col1,
collect_set(case when code=2 then col else null end) as col2
from my_table where code in (1,2)
)s lateral view outer posexplode(col1) c1 as pos, val
lateral view outer posexplode(col2) c2 as pos, val
where c1.pos=c2.pos
Result:
col1 col2
a b
a1 b1
This approach will not work if the arrays are of different sizes (and bear in mind that collect_set removes duplicates and does not guarantee element order).
Another approach - calculate row_number and full join on row_number; this will work if col1 and col2 have different numbers of values (some values will be null):
with
my_table as (--use your table instead of this CTE
select stack(8,
'a',1,
'b',2,
'c',3,
'b1',2,
'd',4,
'c1',3,
'a1',1,
'd1',4
) as (col, code)
),
ordered as
(
select code, col, row_number() over(partition by code order by col) rn
from my_table where code in (1,2)
)
select c1.col as col1, c2.col as col2
from (select * from ordered where code=1) c1
full join
(select * from ordered where code=2) c2 on c1.rn = c2.rn
Result:
col1 col2
a b
a1 b1
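To illustrate the difference: if the data had one more code=1 value than code=2 values (say an extra 'a2' row), the full join would keep it with a NULL partner, whereas the position-matching approach above would drop it:

col1  col2
a     b
a1    b1
a2    NULL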

SQL - avoid select if there is a pair

Is it possible to write an MS SQL query for this case? If there is a pair with 1 and -1, I don't want to select those entries at all.

COL1  COL2  NOTE
A     1     I don't want to select this entry because it is paired with A -1
A     -1    I don't want to select this entry because it is paired with A 1
A     1     OK to select - no pair (no -1 for this A)
B     1     OK to select - no pair
C     1     OK to select - no pair
D     1     I don't want to select this entry because it is paired with D -1
D     -1    I don't want to select this entry because it is paired with D 1
I understand there are 1s and -1s, and that these are the only possible values for col2. If this is the case, and the counts of 1s and -1s per col1 differ by at most one row, then you can just add the values up:
select col1, sum(col2)
from mytable
group by col1
having sum(col2) <> 0;
If the counts can differ by more than one row, or there exist other values besides 1 and -1, then we must generate row numbers.
select col1, max(col2)
from
(
select
col1,
col2,
row_number() over (partition by col1, col2 order by col2) as rn
from mytable
) numbered
group by col1, rn
having count(*) = 1;
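To see why this works, here is the intermediate numbered result for the sample data (row numbers restart within each (col1, col2) partition):

col1  col2  rn
A     1     1
A     1     2
A     -1    1
B     1     1
C     1     1
D     1     1
D     -1    1

Grouping by (col1, rn): the group (A, 1) holds both a 1 and a -1 (count 2) and is filtered out, while (A, 2) holds only the unpaired 1 and survives; (D, 1) is likewise filtered out, and B and C survive.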
One method is aggregation. Assuming there are only -1 and 1, no duplicates with the same sign, and an additional grouping column col3:
select col1, max(col2), col3
from t
group by col1, col3
having count(*) = 1;
Alternatively, you could use NOT EXISTS:
select t.*
from t
where not exists (select 1
                  from t t2
                  where t2.col3 = t.col3 and t2.col1 = t.col1 and
                        t2.col2 = -t.col2
                 );
If, for any value of Col1, the 1s and -1s do not sum to 0, it means that value has an unpaired entry.
Try this:
select *
from t
where col1 in
(select col1 from t group by col1 having sum(col2) <> 0);
Note that this returns every row for such a Col1 value, including rows that do form a pair (for the sample data it returns all three A rows, not just the unpaired one).

Redshift sample from table based on count of another table

I have TableA with, say, 3000 rows (could be any number < 10000). I need to create TableX with 10000 rows, so I need to select 10000 - (number of rows in TableA) random rows from TableB (and include TableA as well) to create TableX. Any ideas, please?
Something like this (which obviously wouldn't work):
Create table TableX as
select * from TableA
union
select * from TableB limit (10000 - count(*) from TableA);
You could use union all and window functions. You did not list the table columns, so I assumed col1 and col2:
insert into tableX (col1, col2)
select col1, col2 from table1
union all
select t2.col1, t2.col2
from (select t2.*, row_number() over(order by random()) as rn from table2 t2) t2
inner join (select count(*) cnt from table1) t1 on t2.rn <= 10000 - t1.cnt
The first query in union all selects all rows from table1. The second query assigns random row numbers to rows in table2, and then selects as many rows as needed to reach a total of 10000.
Actually it might be simpler to select all rows from both tables, then order by and limit in the outer query; ordering by the source tag first ('t1' sorts before 't2') guarantees that all table1 rows survive the limit:
insert into tableX (col1, col2)
select col1, col2
from (
select col1, col2, 't1' as which from table1
union all
select col1, col2, 't2' from table2
) t
order by which, random()
limit 10000
with inparms as (         -- parameterize the target row count
    select 10000 as target_rows
), acount as (            -- count how many rows tablea contributes
    select count(*) as acount, inparms.target_rows
    from tablea
    cross join inparms
), btag as (              -- shuffle tableb by assigning random row numbers
    select b.*, 'tableb' as tabsource,
           row_number() over (order by random()) as rnum
    from tableb b
)
select a.*, 'tablea' as tabsource, row_number() over (order by 1) as rnum
from tablea a
union all
select b.*                -- take just enough random tableb rows to reach the target
from btag b
join acount a on b.rnum <= a.target_rows - a.acount
;