Big Query view (table without duplicate rows) - google-bigquery

I need to create a view that is essentially a copy of a table with some simple transformations, and I want to make sure the values in a particular column are not duplicated.
So let's say the table looks like this:
ID, ColumnA, ColumnB
-------------------
1 cars shirts
2 tvs dogs
1 fingers computers
And the resulting view would look like this:
ID, ColumnA, ColumnB
-------------------
1 cars shirts
2 tvs dogs
So, is there an equivalent to SELECT DISTINCT(ID), ColumnA, ColumnB?
What's the most efficient way to do it?

If you just want an arbitrary row for each ID, use ANY_VALUE:
#standardSQL
WITH Input AS (
SELECT 1 AS ID, 'cars' AS ColumnA, 'shirts' AS ColumnB UNION ALL
SELECT 2 AS ID, 'tvs' AS ColumnA, 'dogs' AS ColumnB UNION ALL
SELECT 1 AS ID, 'fingers' AS ColumnA, 'computers' AS ColumnB
)
SELECT
ANY_VALUE(t).*
FROM Input AS t
GROUP BY t.ID;
Or you can use the ARRAY_AGG trick to select the latest row based on a condition.

Below is for BigQuery Standard SQL
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, 'cars' AS columnA, 'shirts' AS columnB UNION ALL
SELECT 2, 'tvs', 'dogs' UNION ALL
SELECT 1, 'fingers', 'computers'
)
SELECT r.*
FROM (
SELECT ARRAY_AGG(t ORDER BY columnA LIMIT 1)[OFFSET (0)] AS r
FROM yourTable t
GROUP BY id
)
-- ORDER BY id
Note: you need some logic for deciding which row to keep for each id, for example cars over fingers.
The version above, as an example, keeps the first row in ascending order of columnA.
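The "keep one row per id according to a rule" idea can also be sketched with ROW_NUMBER(), which some readers find easier to follow than the ARRAY_AGG form. This is only an illustration for BigQuery Standard SQL: the sample rows mirror the question, and the ORDER BY inside the window (columnA, then columnB) is an assumed rule for choosing the surviving row, not something the question specifies.

```sql
#standardSQL
-- Sketch only: the ordering inside ROW_NUMBER() is an assumed rule for
-- picking which row survives; replace it with your own logic.
WITH yourTable AS (
  SELECT 1 AS id, 'cars' AS columnA, 'shirts' AS columnB UNION ALL
  SELECT 2, 'tvs', 'dogs' UNION ALL
  SELECT 1, 'fingers', 'computers'
),
ranked AS (
  SELECT
    t.*,
    -- Number the rows within each id; rn = 1 is the row that is kept.
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY columnA, columnB) AS rn
  FROM yourTable AS t
)
SELECT * EXCEPT (rn)  -- drop the helper column from the output
FROM ranked
WHERE rn = 1
ORDER BY id;
```

With the sample rows this keeps (1, cars, shirts) and (2, tvs, dogs), because 'cars' sorts before 'fingers'.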

Related

How to find mode of multiple columns in Snowflake SQL

Input column example:
ID  Column A  Column B  Column C
--------------------------------
1   cat       cat       dog
2   dog       cat       dog
3   cat       cat       dog
4   bird      cat       dog
Output column example:
ID  Column A  Column B  Column C  Mode
--------------------------------------
1   cat       cat       dog       cat
2   dog       cat       dog       dog
3   cat       cat       dog       cat
4   bird      cat       bird      bird
So far I have only calculated mode for a single column. Not sure how we can do it horizontally by combining 4 columns.
We can use an unpivot approach with the help of a union query. Then, use ROW_NUMBER() to select the mode:
WITH cte AS (
SELECT ID, ColumnA AS val FROM yourTable UNION ALL
SELECT ID, ColumnB FROM yourTable UNION ALL
SELECT ID, ColumnC FROM yourTable
),
cte2 AS (
SELECT *, COUNT(*) OVER (PARTITION BY ID, val) cnt
FROM cte
),
cte3 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY cnt DESC, val) rn
FROM cte2
)
SELECT t1.ColumnA, t1.ColumnB, t1.ColumnC, t2.val AS Mode
FROM yourTable t1
INNER JOIN cte3 t2
ON t2.ID = t1.ID
WHERE t2.rn = 1
ORDER BY t1.ID;
If two or more values are tied for the mode, the query breaks the tie by returning the alphabetically lowest value.
I think you can use the built-in MODE function for this, together with Snowflake's semi-structured functionality to unpivot the columns and apply it. Check that the behaviour of MODE suits your needs with regard to breaking ties, null handling, etc.
First, create test data like your example (please include code to reproduce in future!):
create view YOURTABLE as
select
ID,
COLUMN_A,
COLUMN_B,
COLUMN_C
from (values
(1,'cat','cat','dog'),
(2,'dog','cat','dog'),
(3,'cat','cat','dog'),
(4,'bird','cat','dog')
) vw (ID, COLUMN_A, COLUMN_B, COLUMN_C);
Here's the query to get your output:
with o_r as (Select ID, COLUMN_A, COLUMN_B, COLUMN_C,
array_construct(COLUMN_A, COLUMN_B, COLUMN_C) arr_row from YOURTABLE )
select ID, COLUMN_A, COLUMN_B, COLUMN_C, mode(value::VARCHAR) MODE
from o_r, lateral flatten (input => o_r.arr_row) lf
group by 1,2,3,4
order by 1;
Gather up the columns we want to calculate MODE over with array_construct(), lateral flatten the array, and then group by the ID and the original columns, applying MODE to the flattened VALUE column.

Counting the count of distinct values from two columns in sql

I have a table in a database that holds several values for each ID.
I want to count the distinct values across two columns.
I already know one method: use UNION ALL and then apply GROUP BY to the resulting table.
Select Id,Brand1
into #Temp
from data
union all
Select Id,Brand2
from data
Select ID,Count(Distinct Brand1)
from #Temp
group by ID
The same thing can be done in BigQuery as well, again using a temp table.
Sample Table
ID Brand1 Brand2
1 A B
1 B C
2 D A
2 A D
Resultant Table
ID Distinct_Count_Brand
1 3
2 2
As you can see, the column Distinct_Count_Brand holds the number of unique brands across the two columns Brand1 and Brand2.
I already know one way (basically unpivoting), but I want to know whether there is some other way to count unique values from two columns.
I don't know BigQuery's quirks, but perhaps you can just inline the union query:
SELECT ID, COUNT(DISTINCT Brand)
FROM
(
SELECT ID, Brand1 AS Brand FROM data
UNION ALL
SELECT ID, Brand2 FROM data
) t
GROUP BY ID;
In SQL Server, I would use:
Select b.id, count(distinct b.brand)
from data d cross apply
(values (id, brand1), (id, brand2)) b(id, brand)
group by b.id;
Here is a db<>fiddle.
In BigQuery, the equivalent would be expressed as:
select t.id, count(distinct brand)
from t cross join
unnest(array[brand1, brand2]) brand
group by t.id;
Here is a BQ query that demonstrates that this works:
with t as (
select 1 as id, 'A' as brand1, 'B' as brand2 union all
select 1, 'B', 'C' union all
select 2, 'D', 'A' union all
select 2, 'A', 'D'
)
select t.id, count(distinct brand)
from t cross join
unnest(array[brand1, brand2]) brand
group by t.id;

Sorting within collect_list() in hive

Let's say I have a hive table that looks like this:
ID event order_num
------------------------
A red 2
A blue 1
A yellow 3
B yellow 2
B green 1
...
I'm trying to use collect_list to generate a list of events for each ID. So something like the following:
SELECT ID,
collect_list(event) as events_list
FROM table
GROUP BY ID;
However, within each of the IDs that I group by, I need to sort by order_num. So that my resulting table would look like this:
ID events_list
------------------------
A ["blue","red","yellow"]
B ["green","yellow"]
I can't do a global sort by ID and order_num before the collect_list() query because the table is massive. Is there a way to sort by order_num within collect_list?
Thanks!
So, I found the answer here. The trick is to use a subquery with a DISTRIBUTE BY and SORT BY statement. See below:
WITH table1 AS (
SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
-- Collect it
SELECT subquery.ID,
collect_list(subquery.event) as events_list
FROM (
SELECT
table1.ID,
table1.event,
table1.order_num
FROM table1
DISTRIBUTE BY
table1.ID
SORT BY
table1.ID,
table1.order_num
) subquery
GROUP BY subquery.ID;
The function sort_array() sorts the collect_list() items. Note that it sorts by the event values themselves, not by order_num:
select ID, sort_array(collect_list(event)) as events_list
from table
group by ID;
This is my first answer on Stack Overflow, but this approach is very useful.
WITH table1 AS (
SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
select ID
,sort_array(collect_list(struct(order_num, item_score))).col2 as item_list
from (
select ID
,event
,order_num
,concat(event, ':', order_num) as item_score
from table1
) t0
group by ID
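A variation on the struct approach above, offered only as a sketch: named_struct() gives the struct fields readable names, and keeping order_num as a number avoids sorting it as text. It assumes, as the answer above does, that Hive's sort_array() compares structs field by field (so the first field, order_num, drives the order) and that a field can be read from an array of structs with dot notation. The sample rows are the same as above.

```sql
-- Sketch only: relies on the same Hive behaviour as the answer above
-- (sort_array() on an array of structs, and .field on that array).
WITH table1 AS (
SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
SELECT ID,
-- order_num comes first in the struct, so it is the primary sort key;
-- .event then pulls the event names out of the sorted structs.
sort_array(collect_list(named_struct('order_num', order_num, 'event', event))).event AS events_list
FROM table1
GROUP BY ID;
```

Unlike the concat(event, ':', order_num) version, this returns the plain event names rather than "event:order_num" strings.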
Try the following:
WITH tmp AS (
SELECT * FROM data DISTRIBUTE BY ID SORT BY ID, order_num desc
)
SELECT ID, collect_list(event)
FROM tmp
GROUP BY ID

How to use order by with union all in sql?

I tried the sql query given below:
SELECT * FROM (SELECT *
FROM TABLE_A ORDER BY COLUMN_1)DUMMY_TABLE
UNION ALL
SELECT * FROM TABLE_B
It results in the following error:
The ORDER BY clause is invalid in views, inline functions, derived
tables, subqueries, and common table expressions, unless TOP or FOR
XML is also specified.
I need to use order by in union all. How do I accomplish this?
SELECT *
FROM
(
SELECT * FROM TABLE_A
UNION ALL
SELECT * FROM TABLE_B
) dum
-- ORDER BY .....
But if you want all records from TABLE_A at the top of the result list, then you can add a user-defined value to use for ordering:
SELECT *
FROM
(
SELECT *, 1 sortby FROM TABLE_A
UNION ALL
SELECT *, 2 sortby FROM TABLE_B
) dum
ORDER BY sortby
You don't really need the parentheses. You can sort directly:
SELECT *, 1 AS RN FROM TABLE_A
UNION ALL
SELECT *, 2 AS RN FROM TABLE_B
ORDER BY RN, COLUMN_1
This is not a direct response to the OP, but I thought I would jump in here to respond to the OP's error message, which may point you in another direction entirely!
All these answers refer to an overall ORDER BY that sorts the whole record set once it has been retrieved.
What if you want to ORDER BY each portion of the UNION independently, and still have them "joined" in the same SELECT?
SELECT pass1.* FROM
(SELECT TOP 1000 tblA.ID, tblA.CustomerName
FROM TABLE_A AS tblA ORDER BY 2) AS pass1
UNION ALL
SELECT pass2.* FROM
(SELECT TOP 1000 tblB.ID, tblB.CustomerName
FROM TABLE_B AS tblB ORDER BY 2) AS pass2
Note that TOP 1000 is an arbitrary number. Use a number big enough to capture all of the data you require.
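One caveat with the TOP approach: in SQL Server, an ORDER BY inside a derived table only decides which rows TOP returns, and the order of the final result is not guaranteed unless the outermost query has its own ORDER BY. A minimal sketch of a way to keep each portion sorted independently is to number the rows in each branch and sort on that. TABLE_A, TABLE_B, ID and CustomerName come from the answer above; src and rn are hypothetical helper columns, and the DESC on the second branch is only there to show that each branch can use a different sort.

```sql
-- Sketch only: src and rn are helper columns introduced for illustration.
SELECT u.ID, u.CustomerName
FROM (
    SELECT ID, CustomerName,
           1 AS src,                                        -- marks rows from TABLE_A
           ROW_NUMBER() OVER (ORDER BY CustomerName) AS rn  -- position within TABLE_A
    FROM TABLE_A
    UNION ALL
    SELECT ID, CustomerName,
           2 AS src,                                             -- marks rows from TABLE_B
           ROW_NUMBER() OVER (ORDER BY CustomerName DESC) AS rn  -- position within TABLE_B
    FROM TABLE_B
) AS u
ORDER BY u.src, u.rn;  -- TABLE_A rows first, each block in its own order
```

This avoids having to guess a "big enough" TOP value, at the cost of two extra columns in the derived table.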
There will be times when you need to do something like this:
Pull the top 5 from table 1 based on a sort,
and the bottom 5 from table 2 based on another sort,
and union these together.
Solution:
select * from (
-- top 5 records
select top 5 col1, col2, col3
from table1
group by col1, col2, col3 -- col3 must be grouped (or aggregated) to be selected and sorted on
order by col3 desc ) z
union all
select * from (
-- bottom 5 records
select top 5 col1, col2, col3
from table2
group by col1, col2, col3 -- same reason as above
order by col3 ) z
This was the only way I was able to get around the error, and it worked fine for me.
SELECT * FROM (SELECT *
FROM TABLE_A ORDER BY COLUMN_1)DUMMY_TABLE
UNION ALL
SELECT * FROM TABLE_B
ORDER BY 2;
Here, 2 is the column number. In Oracle SQL you can sort by the position of the column you want to order the data by.
This solved my SELECT statement:
SELECT * FROM
(SELECT id,name FROM TABLE_A
UNION ALL
SELECT id,name FROM TABLE_B ) dum
order by dum.id , dum.name
Here id and name are columns available in both tables; substitute your own columns.
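Simply use this; no parentheses or anything else are needed. Give the sort column the same alias in both SELECTs, because the result of a UNION takes its column names from the first SELECT and the ORDER BY can only reference those names:
SELECT *, id AS sort_id FROM TABLE_A
UNION ALL
SELECT *, id AS sort_id FROM TABLE_B
ORDER BY sort_id
The ORDER BY after the last SELECT applies to the combined result of the UNION ALL.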
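The solution is shown below. The alias must be the same in both SELECTs, since the ORDER BY can only use the column names of the combined result:
SELECT *,id AS sameColumn FROM Locations
UNION ALL
SELECT *,id AS sameColumn FROM Cities
ORDER BY sameColumn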
select CONCAT(Name, '(',substr(occupation, 1, 1), ')') AS f1
from OCCUPATIONS
union
select temp.str AS f1 from
(select count(occupation) AS counts, occupation, concat('There are a total of ' ,count(occupation) ,' ', lower(occupation),'s.') As str from OCCUPATIONS group by occupation order by counts ASC, occupation ASC
) As temp
order by f1

Union select statements horizontally

Let's say the results of my select statements are as follows (I have 5 of those):
Id Animal AnimalId
1 Dog Dog1
1 Cat Cat57
Id Transport TransportId
2 Car Car100
2 Plane Plane500
I'd like to get a result as follows:
Id  Animal  AnimalId  Transport  TransportId
1   Dog     Dog1
1   Cat     Cat57
2                     Car        Car100
2                     Plane      Plane500
What I can do is create a table variable, specify all possible columns, and insert the records from each select statement into it. But maybe there is a better solution, like PIVOT?
Edit
queries: 1st: Select CategoryId as Id, Animal, AnimalId from Animal
2nd: Select CategoryId as Id, Transport, TransportId from Transport
How about this? If you need them in the same rows, this gets the row_number() for each row and joins on those:
select a.id,
a.aname,
a.aid,
t.tname,
t.tid
from
(
select id, aname, aid, row_number() over(order by aid) rn
from animal
) a
left join
(
select id, tname, tid, row_number() over(order by tid) rn
from transport
) t
on a.rn = t.rn
see SQL Fiddle with Demo
If you don't need them in the same row, then use UNION ALL:
select id, aname, aid, 'Animal' tbl
from animal
union all
select id, tname, tid, 'Transport'
from transport
see SQL Fiddle with Demo
Edit #1, here is a version with an UNPIVOT and PIVOT:
select an_id, [aname], [aid], [tname], [tid]
from
(
select *, row_number() over(partition by col order by col) rn
from animal
unpivot
(
value
for col in (aname, aid)
) u
union all
select *, row_number() over(partition by col order by col) rn
from transport
unpivot
(
value
for col in (tname, tid)
) u
) x1
pivot
(
min(value)
for col in([aname], [aid], [tname], [tid])
) p
order by an_id
see SQL Fiddle with Demo
This would do it for you:
SELECT
ID, field1, field2, '' as field3, '' as field4
FROM sometable
UNION ALL
SELECT
ID, '', '', field3, field4
FROM someothertable
create table Animal (
Animal varchar(50)
,AnimalID varchar(50)
)
create table Transport (
Transport varchar(50)
,TransportID varchar(50)
)
insert into Animal values ('Dog', 'Dog1')
insert into Animal values ('Cat', 'Cat57')
insert into Transport values ('Car', 'Car100')
insert into Transport values ('Plane', 'Plane500')
select ID = 1
,A.Animal
,A.AnimalID
,Transport = ''
,TransportID = ''
from Animal A
union
select ID = 2
,Animal = ''
,AnimalID = ''
,T.Transport
,T.TransportID
from Transport T
To get it in the format you want, select the values you want, and then null (or an empty string) for the other columns.
SELECT
CategoryId as Id,
Animal as 'Animal',
AnimalId as 'AnimalId',
null as 'Transport',
null as 'TransportId'
FROM Animal
UNION
SELECT
CategoryId as Id,
null as 'Animal',
null as 'AnimalId',
Transport as 'Transport',
TransportId as 'TransportId'
FROM Transport
I'm still not sure of the purpose of this, but this should give the output you want.
You shouldn't need to pivot, your results are already fine.
If you want, you can just UNION all 5 statements together in the same format as the first select: ID/Category/CategoryID. Then you'll get one long result set with all 5 sets appended 3 columns wide.
Is that what you want? Or do you need to distinguish between 'categories'?
Given your example, try:
Select CategoryId as Id, Animal, AnimalId from Animal
union all
Select CategoryId as Id, Transport, TransportId from Transport
If you want, you can alias the columns like this:
Select CategoryId as Id, Animal as category, AnimalId as categoryID from Animal
union all
Select CategoryId as Id, Transport, TransportId from Transport
You really don't need to pivot; just space out your columns as you were thinking initially. You don't pivot to move columns, you pivot to perform an aggregate function over grouped data.