How to get Sum(Column) over (partition by other columns)? - hive

I am trying to convert Teradata code like the following:
Select A.col1, sum(A.metric1) over (partition by A.col1, B.col1 order by
A.col2 asc) as Cust_col, B.col1 from A JOIN B on (A.join_key=B.join_key)
where A.col3='X' QUALIFY ROW_NUMBER () OVER (PARTITION BY A.col1,B.COL1
ORDER BY A.col3 DESC) = 1
My attempt in Hive:
Select C.col1,C.cust_col,C.col1,ROW_NUMBER () OVER (PARTITION BY A.col1,C.COL1
ORDER BY C.col3 DESC) as Row_num from (Select A.col1, sum(A.metric1) over
(partition by A.col1, B.col1 order by A.col2 asc) as Cust_col,B.col1 from A
JOIN B on (A.join_key=B.join_key) where A.col3='X') C where C.Row_num =1
But I am getting an error like this:
SemanticException Failed to breakup Windowing invocations into Groups.
At least 1 group must only depend on input columns. Also check for
circular dependencies. Underlying error: Primitve type DATE not
supported in Value Boundary expression
I know the Sum(A.metric1) over (partition by ...) expression is causing the problem, but how do I resolve it?

select a_col1,
sum(metric1) over (partition by a_col1, b_col1 order by a_col2 asc) as Cust_col,
b_col1
from
(
Select A.metric1, A.col1 a_col1, B.COL1 b_col1, A.col2 a_col2,
ROW_NUMBER () OVER (PARTITION BY A.col1,B.COL1 ORDER BY A.col3 DESC ) as rn
from A JOIN B on (A.join_key=B.join_key)
where A.col3='X'
) s
where rn=1
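One variant worth comparing, as a rough sketch: Teradata applies QUALIFY after the window functions have been evaluated, so the running sum there sees every row that passes the WHERE clause. Computing both SUM() OVER and ROW_NUMBER() in the same subquery and filtering outside keeps that behaviour, whereas summing in the outer query sums only the rows that survived rn = 1. The example below runs both shapes against in-memory SQLite (an assumption standing in for Hive; needs SQLite 3.25+ for window functions). The table contents are invented, and A.col2 DESC is added as a tie-breaker so the result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy data (hypothetical): two rows for group ('a', 'p'), one for ('b', 'q'),
# plus one row that the WHERE clause removes.
conn.executescript("""
CREATE TABLE A (join_key INTEGER, col1 TEXT, col2 TEXT, col3 TEXT, metric1 INTEGER);
CREATE TABLE B (join_key INTEGER, col1 TEXT);
INSERT INTO A VALUES (1, 'a', '2020-01-01', 'X', 10),
                     (1, 'a', '2020-01-02', 'X', 20),
                     (2, 'b', '2020-01-01', 'X', 5),
                     (2, 'b', '2020-01-03', 'Y', 99);
INSERT INTO B VALUES (1, 'p'), (2, 'q');
""")

# Both window functions in one subquery; the outer query only filters.
qualify_like = conn.execute("""
SELECT a_col1, cust_col, b_col1
FROM (
    SELECT A.col1 AS a_col1,
           B.col1 AS b_col1,
           SUM(A.metric1) OVER (PARTITION BY A.col1, B.col1 ORDER BY A.col2) AS cust_col,
           ROW_NUMBER() OVER (PARTITION BY A.col1, B.col1
                              ORDER BY A.col3 DESC, A.col2 DESC) AS rn
    FROM A JOIN B ON A.join_key = B.join_key
    WHERE A.col3 = 'X'
) s
WHERE rn = 1
ORDER BY a_col1
""").fetchall()

# Filter first, then sum in the outer query: the sum only sees rn = 1 rows.
filter_first = conn.execute("""
SELECT a_col1,
       SUM(metric1) OVER (PARTITION BY a_col1, b_col1 ORDER BY a_col2) AS cust_col,
       b_col1
FROM (
    SELECT A.metric1, A.col1 AS a_col1, B.col1 AS b_col1, A.col2 AS a_col2,
           ROW_NUMBER() OVER (PARTITION BY A.col1, B.col1
                              ORDER BY A.col3 DESC, A.col2 DESC) AS rn
    FROM A JOIN B ON A.join_key = B.join_key
    WHERE A.col3 = 'X'
) s
WHERE rn = 1
ORDER BY a_col1
""").fetchall()

print(qualify_like)   # running sum over all 'X' rows of each group
print(filter_first)   # sum over the single surviving row of each group
```

The error text mentions a DATE in a value boundary, so if A.col2 is a DATE it may also be worth trying an explicit frame such as rows between unbounded preceding and current row on the SUM; that is an untested guess, and a ROWS frame treats tied col2 values differently from the default frame.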

Related

First row in one to many relationship join

I have 2 tables like this:
Table A:
guv, col1, col2
Table B:
guv, col3, col4, col5..
Now each A and B have one to many relationship, so when I run the following query:
select * from A, B where a.guv = b.guv
It returns all the rows in B that match the join. How do I return only one matching row (chosen by some ordering on one of the columns)?
I tried to do this using TOP, as suggested in some other answers, but it's not supported by AWS Athena.
You can use the ROW_NUMBER() function within the join query, as follows:
SELECT guv, col1, col2, col3, col4, col5
FROM
(
SELECT A.guv, A.col1, A.col2, B.col3, B.col4, B.col5,
ROW_NUMBER() OVER (PARTITION BY A.guv ORDER BY B.col3) rn
FROM TableA A JOIN TableB B
ON A.guv=B.guv
) T
WHERE rn = 1
In ROW_NUMBER() OVER (PARTITION BY A.guv ORDER BY B.col3), you can change ORDER BY B.col3 to whichever column (and direction) should decide which row comes first.

Find the first N nearest points in Bigquery

To find the nearest point and its distance in BigQuery, I am using this query:
WITH table_a AS (
SELECT id, geom
FROM bqtable
), table_b AS (
SELECT id, geom
FROM bqtable
)
SELECT AS VALUE ARRAY_AGG(STRUCT<id_a STRING,id_b STRING, dist FLOAT64>(a.id,b.id,ST_DISTANCE(a.geom, b.geom)) ORDER BY ST_DISTANCE(a.geom, b.geom) LIMIT 1)[OFFSET(0)]
FROM (SELECT id, geom FROM table_a) a
CROSS JOIN (SELECT id, geom FROM table_b) b
WHERE a.id <> b.id
GROUP BY a.id
How can I modify this query to find the nearest 10 points and their distances?
Thanks!
One method uses ORDER BY, LIMIT, and UNNEST(). Using your approach:
SELECT AS VALUE s
FROM (SELECT ARRAY_AGG(STRUCT<id_a STRING,id_b STRING, dist FLOAT64>(a.id, b.id, ST_DISTANCE(a.geom, b.geom))
ORDER BY ST_DISTANCE(a.geom, b.geom)
LIMIT 10
) as ar
FROM (SELECT id, geom FROM table_a) a CROSS JOIN
(SELECT id, geom FROM table_b) b
WHERE a.id <> b.id
GROUP BY a.id
) ab CROSS JOIN
UNNEST(ab.ar) s;
A simpler method uses QUALIFY:
select a.id as id_a, b.id as id_b, ST_DISTANCE(a.geom, b.geom) as dist
from table_a a cross join
table_b b
where a.id <> b.id
qualify row_number() over (partition by a.id order by ST_DISTANCE(a.geom, b.geom)) <= 10;
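The "rank within each partition, keep ranks up to N" idea can be tried locally. This is only a sketch under assumptions: SQLite (3.25+) stands in for BigQuery, a made-up table pts holds 1-D points, ABS(a.x - b.x) replaces ST_DISTANCE, and a nested query replaces QUALIFY, which SQLite does not have. N is 2 here instead of 10 to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical 1-D points; a real table would hold geographies instead.
conn.executescript("""
CREATE TABLE pts (id TEXT, x REAL);
INSERT INTO pts VALUES ('p1', 0.0), ('p2', 1.0), ('p3', 3.0), ('p4', 7.0);
""")

N = 2  # number of nearest neighbours to keep per point

nearest = conn.execute("""
SELECT id_a, id_b, dist
FROM (
    SELECT a.id AS id_a,
           b.id AS id_b,
           ABS(a.x - b.x) AS dist,
           -- rank every other point by distance from a; b.id breaks ties
           ROW_NUMBER() OVER (PARTITION BY a.id
                              ORDER BY ABS(a.x - b.x), b.id) AS rn
    FROM pts a CROSS JOIN pts b
    WHERE a.id <> b.id
) t
WHERE rn <= ?
ORDER BY id_a, rn
""", (N,)).fetchall()

for row in nearest:
    print(row)
```

Each point ends up with exactly N rows, closest first. The cross join still compares every pair, so this shape says nothing about performance on large tables.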

How to avoid DISTINCT in a query that joins multiple tables?

I want to avoid using DISTINCT and produce the same result for queries that join multiple tables.
Without DISTINCT, it produces the same row multiple times.
I already tried looking up how to avoid DISTINCT, but nothing seems to work for me, seemingly because my query is more complicated and joins multiple tables at the same time.
SELECT DISTINCT C.COL3, B.COL1, A.COL2, A.COL4, B.COL5 FROM C
INNER JOIN B
ON B.COL1 = C.COL1
INNER JOIN A
ON B.COL2 = A.COL2
ORDER BY C.COL3 ASC;
I know I have to use GROUP BY somehow, but I just can't wrap my head around it...
You can just group by all the columns (without any aggregation):
SELECT
C.COL3, B.COL1, A.COL2, A.COL4, B.COL5
FROM C
JOIN B ON B.COL1 = C.COL1
JOIN A ON B.COL2 = A.COL2
GROUP BY C.COL3, B.COL1, A.COL2, A.COL4, B.COL5 -- group by all selected columns
ORDER BY C.COL3 ASC
If you then wanted to aggregate over the de-duped rows of the above query, use it as a subquery. For example, to SUM(B.COL5) of the de-duped rows:
SELECT
COL3, COL1, COL2, COL4, SUM(COL5)
FROM (
SELECT
C.COL3, B.COL1, A.COL2, A.COL4, B.COL5
FROM C
JOIN B ON B.COL1 = C.COL1
JOIN A ON B.COL2 = A.COL2
GROUP BY C.COL3, B.COL1, A.COL2, A.COL4, B.COL5
) deduped
GROUP BY COL3, COL1, COL2, COL4
ORDER BY COL3 ASC
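A quick way to convince yourself that grouping by every selected column behaves like DISTINCT is to run both on a tiny data set. The sketch below uses in-memory SQLite and invented rows (B deliberately contains a duplicate), so treat it as an illustration rather than proof for any particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: table B holds the same row twice, so the join duplicates it.
conn.executescript("""
CREATE TABLE A (COL2 INTEGER, COL4 TEXT);
CREATE TABLE B (COL1 INTEGER, COL2 INTEGER, COL5 INTEGER);
CREATE TABLE C (COL1 INTEGER, COL3 TEXT);
INSERT INTO A VALUES (10, 'x');
INSERT INTO B VALUES (1, 10, 5), (1, 10, 5);
INSERT INTO C VALUES (1, 'c1');
""")

cols = "C.COL3, B.COL1, A.COL2, A.COL4, B.COL5"
joins = "FROM C JOIN B ON B.COL1 = C.COL1 JOIN A ON B.COL2 = A.COL2"

raw = conn.execute(f"SELECT {cols} {joins}").fetchall()
with_distinct = conn.execute(
    f"SELECT DISTINCT {cols} {joins} ORDER BY C.COL3").fetchall()
with_group_by = conn.execute(
    f"SELECT {cols} {joins} GROUP BY {cols} ORDER BY C.COL3").fetchall()

# Summing COL5 before and after de-duplication gives different totals.
raw_sum = conn.execute(f"SELECT SUM(B.COL5) {joins}").fetchone()[0]
deduped_sum = conn.execute(
    f"SELECT SUM(COL5) FROM (SELECT {cols} {joins} GROUP BY {cols}) deduped"
).fetchone()[0]

print(len(raw), with_distinct, with_group_by)
print(raw_sum, deduped_sum)
```

The join alone returns two rows, while DISTINCT and GROUP BY each return one, which is why an aggregate has to sit on top of the de-duplicated subquery rather than on the raw join.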
Are you getting duplicate rows of the same data when you do not use DISTINCT? If so, this query worked for me when I was joining multiple ASP.NET tables to show the user info plus the site roles each user is assigned to. Hopefully it helps.
SELECT AspNetUsers.Id, AspNetRoles.Name as SiteRole,
AspNetRoles.ID as RoleID, AspNetUsers.UserName,
AspNetUsers.Email FROM AspNetUserRoles INNER JOIN
AspNetUsers ON AspNetUserRoles.UserId = AspNetUsers.Id INNER JOIN
AspNetRoles ON AspNetUserRoles.RoleId = AspNetRoles.Id
You can use row_number() with partition by [the column you want to be distinct] and keep only the first row of each partition:
select *
from (select c.col3, b.col1, a.col2, a.col4, b.col5
, row_number() over (partition by c.col1 order by c.col3) as rn
from c
inner join b on b.col1 = c.col1
inner join a on a.col2 = b.col2) t1
where t1.rn = 1
order by t1.col3
SELECT COL3, COL1, SUM(COL5)
FROM
(
SELECT DISTINCT C.COL3, B.COL1, A.COL2, A.COL4, B.COL5 FROM C
INNER JOIN B
ON B.COL1 = C.COL1
INNER JOIN A
ON B.COL2 = A.COL2
) X
GROUP BY COL3, COL1
ORDER BY COL3, COL1

Removing duplicate values from a column in SQL

I have two tables, A (group_id, id, subject) and B (id, date). Below is the result of joining tables A and B on id. I have tried using DISTINCT and PARTITION BY to remove the duplicates in the group_id field only, but no luck:
My code:
select
a.group_id, a.id, a.subject, b.date
from
A a
inner join
(select
b.*,
row_number() over (partition by group_id order by date asc) as seqnum
from
B b) b on a.id = b.id and seqnum = 1
order by
date desc;
I got this error when I ran the code:
Partitioning can not be used stand-alone in query near 'partition by group_id order by date asc) as seqnum from B' at line 1
This is my expected result:
Thank you in advance!
It looks like you want the earliest date for each row in the table you show. Your question mentions two tables, but you only show one.
I recommend a correlated subquery in most databases:
select b.*
from b
where b.date = (select min(b2.date)
from b b2
where b2.group_id = b.group_id
);
I see. You need to join first and then use row_number():
select ab.*
from (select a.group_id, a.id, a.subject, b.date,
row_number() over (partition by a.group_id order by b.date) as seqnum
from A a join
B b
on a.id = b.id
) ab
where seqnum = 1
order by date desc;
You are almost there. But the column you are trying to partition by (i.e. group_id) comes from table a, which is not available in the subquery.
You would need to JOIN and assign the row number in a subquery, and then filter in the outer query.
select *
from (
select
a.group_id,
a.id,
a.subject,
b.date,
row_number() over (partition by a.group_id order by b.date asc) as seqnum
from a
inner join b on a.id = b.id
) t
where seqnum = 1
ORDER BY date desc;
Another way to achieve your goal, though it may not be the most efficient one:
SELECT
A.group_id, A.id, B.Date, A.subject
FROM A
INNER JOIN B
ON A.Id = B.Id
INNER JOIN
(
SELECT
A.Group_id, MIN(B.Date) AS Date
FROM A
INNER JOIN B
ON A.Id = B.Id
GROUP BY A.group_id
) AS supportTable
ON A.group_id = supportTable.group_id
AND B.Date = supportTable.Date
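The ROW_NUMBER() answers and the MIN(date) join above differ in one respect that is easy to miss: when two rows in a group share the earliest date, ROW_NUMBER() keeps exactly one of them, while the MIN(date) join keeps both. A small sketch with in-memory SQLite (3.25+) and invented rows, where group 1 has such a tie:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: ids 1 and 2 in group 1 share the earliest date.
conn.executescript("""
CREATE TABLE A (group_id INTEGER, id INTEGER, subject TEXT);
CREATE TABLE B (id INTEGER, date TEXT);
INSERT INTO A VALUES (1, 1, 's1'), (1, 2, 's2'), (1, 3, 's3'), (2, 4, 's4');
INSERT INTO B VALUES (1, '2020-01-01'), (2, '2020-01-01'),
                     (3, '2020-02-01'), (4, '2020-03-01');
""")

# Join first, number the rows per group, keep the first one.
one_per_group = conn.execute("""
SELECT group_id, id, subject, date
FROM (
    SELECT a.group_id, a.id, a.subject, b.date,
           ROW_NUMBER() OVER (PARTITION BY a.group_id ORDER BY b.date) AS seqnum
    FROM A a JOIN B b ON a.id = b.id
) ab
WHERE seqnum = 1
ORDER BY group_id
""").fetchall()

# Join back to the per-group minimum date: ties on that date all survive.
min_date_join = conn.execute("""
SELECT a.group_id, a.id, a.subject, b.date
FROM A a
JOIN B b ON a.id = b.id
JOIN (
    SELECT a2.group_id, MIN(b2.date) AS min_date
    FROM A a2 JOIN B b2 ON a2.id = b2.id
    GROUP BY a2.group_id
) s ON a.group_id = s.group_id AND b.date = s.min_date
ORDER BY a.group_id, a.id
""").fetchall()

print(one_per_group)
print(min_date_join)
```

Which of the tied rows ROW_NUMBER() keeps is arbitrary unless the ORDER BY includes a tie-breaker such as a.id.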

How would I write this query as a join instead of a correlated query?

Netezza can't use correlated subqueries in SELECT statements, and unfortunately I can't think of a way to avoid one in my particular case. I was thinking about doing something with ROW_NUMBER(); however, I can't include window functions in a HAVING clause.
I've got the following query:
select
a.*
,( select b.col1
from b
where b.ky = a.ky
and a.date <= b.date
order by b.date desc
limit 1
) as new_col
from a
Any suggestions?
This should return the expected result:
select *
from
(
select
a.*
,b.col1 as b_col1
,row_number()
over (partition by a.ky
order by b.date desc NULLS FIRST) as rn
from a left join b
on b.ky = a.ky
and a.date <= b.date
) as dt
where rn = 1
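To check that a LEFT JOIN plus ROW_NUMBER() rewrite returns the same values as the correlated subquery, both can be run side by side. The sketch below does that in in-memory SQLite (3.25+), which is only a stand-in for Netezza; the rows are invented, ky is assumed to be unique in a (otherwise partitioning by a.ky would merge several a rows), and NULLS FIRST is left out because it is not needed for this data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: ky 1 has later b rows, ky 2 only an earlier one, ky 3 none.
conn.executescript("""
CREATE TABLE a (ky INTEGER, date TEXT);
CREATE TABLE b (ky INTEGER, date TEXT, col1 TEXT);
INSERT INTO a VALUES (1, '2020-01-05'), (2, '2020-01-05'), (3, '2020-01-05');
INSERT INTO b VALUES (1, '2020-01-01', 'old'),
                     (1, '2020-01-10', 'mid'),
                     (1, '2020-01-20', 'new'),
                     (2, '2020-01-01', 'old2');
""")

# Original shape: correlated scalar subquery in the SELECT list.
correlated = conn.execute("""
SELECT a.ky,
       (SELECT b.col1
        FROM b
        WHERE b.ky = a.ky AND a.date <= b.date
        ORDER BY b.date DESC
        LIMIT 1) AS new_col
FROM a
ORDER BY a.ky
""").fetchall()

# Rewrite: the correlation predicates move into the LEFT JOIN's ON clause,
# so rows of a without a match are kept with NULLs.
joined = conn.execute("""
SELECT ky, b_col1
FROM (
    SELECT a.ky,
           b.col1 AS b_col1,
           ROW_NUMBER() OVER (PARTITION BY a.ky ORDER BY b.date DESC) AS rn
    FROM a LEFT JOIN b
      ON b.ky = a.ky AND a.date <= b.date
) dt
WHERE rn = 1
ORDER BY ky
""").fetchall()

print(correlated)
print(joined)
```

Keeping both predicates in ON rather than WHERE is what preserves the unmatched rows; with a WHERE filter on b's columns the LEFT JOIN would behave like an inner join.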
I'm not completely sure I understand your question, but is this what you're looking for?
SELECT TOP 1 a.*, b.col1 FROM a JOIN b ON a.ky = b.ky
WHERE a.date <= b.date ORDER BY b.date desc