Efficiently take min of a column with other column as the key - hive

I recently moved from Presto to Hive and have the following scenario: a table with columns A, B, and C. I want to group by A and, for each group, find the value of B on the row where C is minimal. In Presto you can write this as
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following
SELECT A, COALESCE(B, 0)
from <TABLE> as primary
JOIN (
SELECT A, MIN(C) as C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY A
I have two problems with this solution:
It's not concise at all.
It's not efficient either: it needs an extra subquery, an extra aggregation, and an extra JOIN. It would be nice to have first-class aggregate support for this.
Is there a way to achieve this without writing a custom UDF?

A join is slower than an analytic function here. Try this approach without a join, so the table is scanned only once:
select s.*
from
(
SELECT A, COALESCE(B, 0) as B, C,
min(C) over (partition by A) as min_C
from <TABLE> as primary
)s
where s.C=s.min_C;
If you need min(C) to be calculated by more group columns, add them to the partition BY clause.
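To make the behaviour of the window-function query concrete, here is a minimal sketch that runs the same shape of query against SQLite through Python's sqlite3 module rather than Hive. The table name t and the sample rows are made up for illustration, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical sample data: (A, B, C) rows. The table name "t" is an assumption.
rows = [
    ("x", 10, 3),
    ("x", 20, 1),
    ("x", 30, 2),
    ("y", 40, 5),
    ("y", 50, 7),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A TEXT, B INTEGER, C INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# Compute min(C) per A with a window function, then keep only the rows
# whose C equals that per-group minimum.
query = """
SELECT s.A, s.B
FROM (
    SELECT A, B, C, MIN(C) OVER (PARTITION BY A) AS min_C
    FROM t
) s
WHERE s.C = s.min_C
ORDER BY s.A
"""
result = conn.execute(query).fetchall()
print(result)
```

Note that if several rows in a group share the minimal C, this pattern returns all of them, unlike min_by, which returns a single value.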

You can try TD_first(B, C) in Hive. It works the same way.

Related

Can I (in a many-to-many relationship) select only those IDs in column A that have a connection to all IDs in column B?

I need to retrieve only those IDs in "A" that have a connection to all IDs in "B".
In the example below, the result should be '...fa3e' because '...65d6' does NOT have a reference to all IDs in "B".
However, if '...fa3e' and '...65d6' reference the same IDs in column B, then the query should return both '...fa3e' and '...65d6'.
And, subsequently, if a fifth row connected '...fa3e' with a completely new ID in "B", then '...65d6' would be excluded because it would no longer hold a reference to all IDs in column "B".
Is there a way to accomplish this in SQL Server?
I can't really come up with a good description or search term for what I'm trying to do ("Exclude column A based on values in column B" is not quite right), so I'm striking out looking for resources.
I believe these values reside in the same table.
For distinct a values only:
select a
from T
group by a
having count(distinct b) = (select count(distinct b) from T);
To return all the rows:
select * from T where a in (
select a from T group by a
having count(distinct b) = (select count(distinct b) from T)
);
If (a, b) pairs are always unique then you wouldn't need the distinct qualifier on the left-hand counts. In fact you could even use count(*) for that.
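As a rough check of the counting approach, the sketch below runs the first query in SQLite via Python's sqlite3 module (not SQL Server). The IDs a1, a2, b1, b2, b3 are hypothetical stand-ins for the GUIDs in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (a TEXT, b TEXT)")
# Hypothetical data: a1 is linked to every b, a2 is missing b3.
conn.executemany("INSERT INTO T VALUES (?, ?)", [
    ("a1", "b1"), ("a1", "b2"), ("a1", "b3"),
    ("a2", "b1"), ("a2", "b2"),
])

# Keep each a whose number of distinct b values equals the number of
# distinct b values in the whole table.
query = """
SELECT a
FROM T
GROUP BY a
HAVING COUNT(DISTINCT b) = (SELECT COUNT(DISTINCT b) FROM T)
"""
complete = sorted(r[0] for r in conn.execute(query))

# Once a2 is also linked to b3, both values of a qualify.
conn.execute("INSERT INTO T VALUES ('a2', 'b3')")
complete_after = sorted(r[0] for r in conn.execute(query))
print(complete, complete_after)
```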
This seems like it's going to be a terrible query, but at its most basic, you want
All A where B in...
All B that are fully distinct
In SQL, that looks like
select distinct A
from test
where B in (select B from test group by B having count(1) = 1);
Absolutely zero guarantees on performance, but, this gives you the right value A. If you want to see which A/B pair actually made the cut, it could be SELECT A, B FROM test... too.

Postgres select distinct based on timestamp

I have a table with two columns a and b where a is an ID and b is a timestamp.
I need to select all of the a's which are distinct but I only care about the most up to date row per ID.
I.e. I need a way of selecting distinct a's conditional on the b values.
Is there a way to do this using DISTINCT ON in postgres?
Cheers
As @a_horse_with_no_name suggests, the solution is
SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC
As the manual says,
Note that the "first row" of a set is unpredictable unless the query
is sorted on enough columns to guarantee a unique ordering of the rows
arriving at the DISTINCT filter. (DISTINCT ON processing occurs after
ORDER BY sorting.)
As the upvoted answers say, SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC works on PostgreSQL 12. However, I am posting this answer to highlight a few important points:
The results will be sorted by column a, not column b.
For each distinct value of a, the row with the most recent (highest) value of b is picked.
If you want the most recent value of b across the entire result set, you can run: SELECT MAX(b) FROM (SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC) AS latest.
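For readers unfamiliar with DISTINCT ON, the following is an emulation in plain Python of what the query does: sort by a, then by b descending, and keep the first row of each a group. It is only an illustration of the semantics with made-up rows, not how PostgreSQL implements it.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (a, b) rows; b is an ISO date string, so string order is date order.
rows = [
    (1, "2021-01-01"),
    (1, "2021-03-01"),
    (2, "2021-02-01"),
    (1, "2021-02-15"),
    (2, "2021-01-10"),
]

# ORDER BY a, b DESC: sort by b descending first, then stable-sort by a.
ordered = sorted(rows, key=itemgetter(1), reverse=True)
ordered.sort(key=itemgetter(0))

# DISTINCT ON (a): keep only the first row of each run of equal a values.
latest = [next(group) for _, group in groupby(ordered, key=itemgetter(0))]
print(latest)
```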

how to get the first (or any single) value in GROUP BY without ARRAY_AGG?

I'm migrating some SQL from PostgreSQL 9.2 to Vertica 7.0, and I could use some help replacing postgres's cool array_agg feature with something that Vertica (and possibly other RDBMS) supports, such as partitions and window functions. I'm new to these features, and I'd really appreciate your ideas.
The (working) query using array_agg ( sql fiddle demo ):
SELECT B.id, (array_agg(A.X))[1]
FROM B, AB, A
WHERE B.id = AB.B_id AND A.id = AB.A_id AND A.X IS NOT NULL
GROUP BY B.id;
If I naively select A.X by itself without the aggregation (i.e., let the RDBMS pick, which actually works with MySQL and SQLite), postgres complains. Running the same query but with "A.X" instead of "(array_agg(A.X))[1]":
ERROR: column "a.x" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT B.id, A.X
I was thinking of trying a window function, e.g., something like from this question:
SELECT email, FIRST_VALUE(email) OVER (PARTITION BY email)
FROM questions
GROUP BY email;
but I get the same error:
SELECT B.id, FIRST_VALUE(A.X) OVER (PARTITION BY A.id)
FROM B, AB, A
WHERE B.id = AB.B_id AND A.id = AB.A_id AND A.X IS NOT NULL
GROUP BY B.id;
ERROR: column "a.x" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT B.id AS id, FIRST_VALUE(A.X) OVER (PARTITION BY A.id)...
Note that we don't care so much about getting the first value, we just need any (ideally deterministic) single value.
Thank you in advance.
@a_horse_with_no_name's comment, along with that of Denis, was what we needed to rethink our approach. We have switched to MIN(). Thanks!
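A minimal sketch of the MIN() replacement, run against SQLite through Python's sqlite3 module instead of Vertica. The table layout follows the question; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER, X TEXT);
CREATE TABLE B (id INTEGER);
CREATE TABLE AB (A_id INTEGER, B_id INTEGER);
-- Hypothetical sample rows
INSERT INTO A VALUES (1, 'pear'), (2, 'apple'), (3, NULL), (4, 'fig');
INSERT INTO B VALUES (10), (20);
INSERT INTO AB VALUES (1, 10), (2, 10), (3, 10), (4, 20);
""")

# MIN(A.X) returns one deterministic value per group, which is enough
# when any single value will do.
query = """
SELECT B.id, MIN(A.X)
FROM B, AB, A
WHERE B.id = AB.B_id AND A.id = AB.A_id AND A.X IS NOT NULL
GROUP BY B.id
ORDER BY B.id
"""
picked = conn.execute(query).fetchall()
print(picked)
```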

How to delay expensive calculation in postgres until final steps

I have a simple postgres 9.1 query that contains a very expensive calculation:
SELECT a, b, sum(c), very_expensive_calculation(f)
FROM my_table
GROUP BY a, b, f
The function very_expensive_calculation() is a custom (non-aggregate) function of mine that runs a recursive summation of values behind the scenes, so it takes a while. The problem is that my_table is very de-normalized and contains many duplicates, so the function runs once for each row when it should only run on distinct values. I tried the following to run the function on the pre-grouped values:
SELECT a, b, c_sum, very_expensive_calculation(f)
FROM (
SELECT a, b, sum(c) c_sum, f
FROM my_table
GROUP BY a, b, f
) pre_group
This cuts down on the number of runs of very_expensive_calculation(): if the original query contained 100 rows but the grouped one contained only 10, then I have a 90% reduction in executions. However, this is hacky and comes with other problems (there are a lot more criteria and columns with custom logic that I'm not showing, and they're suffering from this hack).
Can I run the first query but delay very_expensive_calculation() to run on the already grouped values for f, possibly by declaring very_expensive_calculation() as an aggregate?
EDIT (@Gordon Linoff): Would the following behave the same as the answer mentioned below?
WITH fvec AS (
select f, very_expensive_calculation(f) as vec
from (select distinct f from my_table) mt
)
Select a, b, sum(c), fvec.vec
from my_table agg inner join fvec on agg.f = fvec.f
group by a, b, fvec.vec
Our auto-generated code can do WITH clauses easily but I'm not sure if this behaves the same as the join solution below.
Given your original query, I think the cheapest way would be to do the calculation only once for each f and join the results back in:
select agg.*, fvec.vec
from (SELECT a, b, f, sum(c) as sumc
FROM my_table
GROUP BY a, b, f
) agg join
(select f, very_expensive_calculation(f) as vec
from (select distinct f from my_table) mt
) fvec
on agg.f = fvec.f;
I don't see how writing a custom-aggregation function would help unless the aggregation function stashed already computed values. The aggregation function is going to be called for each row in the original table, not each row in the post-aggregated results.
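The idea of computing the function once per distinct f and joining back can be sketched with SQLite through Python's sqlite3 module. This is an illustration under assumptions, not PostgreSQL behaviour: very_expensive_calculation is a stand-in that squares its input and records each call, the sample rows are made up, and the per-f results are materialized in a temporary table so that the number of calls is easy to observe.

```python
import sqlite3

calls = []  # records every invocation of the stand-in function

def very_expensive_calculation(f):
    # Hypothetical stand-in for the real function: just squares f.
    calls.append(f)
    return f * f

conn = sqlite3.connect(":memory:")
conn.create_function("very_expensive_calculation", 1, very_expensive_calculation)
conn.execute("CREATE TABLE my_table (a TEXT, b TEXT, c INTEGER, f INTEGER)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", [
    ("p", "q", 1, 2), ("p", "q", 4, 2), ("p", "r", 5, 3),
    ("s", "t", 7, 2), ("s", "t", 8, 2), ("s", "u", 1, 3),
])

# Step 1: evaluate the function once per distinct f and store the result.
conn.execute("""
CREATE TEMP TABLE fvec AS
SELECT f, very_expensive_calculation(f) AS vec
FROM (SELECT DISTINCT f FROM my_table)
""")

# Step 2: aggregate as usual and join the precomputed values back in.
result = conn.execute("""
SELECT agg.a, agg.b, agg.sumc, fvec.vec
FROM (SELECT a, b, f, SUM(c) AS sumc
      FROM my_table
      GROUP BY a, b, f) agg
JOIN fvec ON agg.f = fvec.f
ORDER BY agg.a, agg.b
""").fetchall()
print(result, len(calls))
```

Here the table has six rows but only two distinct values of f, so the stand-in function runs twice.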

Pass a function return to another in the same row

I need to pass the return value of a function selected in one column as a parameter to a following function in the same row. I cannot use the alias.
What I want to have is:
SELECT
dbo.GetSecondID(f.ID) as SecondID,
dbo.GetThirdID(SecondID) as ThirdID
FROM Foo f
Any workaround? Thank you!
EDIT:
The method dbo.GetSecondID() is very heavy and I am dealing with a couple of million records in the table. It is not wise to pass the method as a parameter.
The way that SQL is designed, it is intended that all columns can be computed in parallel (in theory). This means that you cannot have one column's value depend on the result of computing a different column (within the same SELECT clause).
To be able to reference the column, you might introduce a subquery:
SELECT SecondID,dbo.GetThirdID(SecondID) as ThirdID
FROM
(
SELECT
dbo.GetSecondID(f.ID) as SecondID
FROM Foo f
) t
or a CTE:
;WITH Results1 AS (
SELECT
dbo.GetSecondID(f.ID) as SecondID
FROM Foo f
)
SELECT SecondID,dbo.GetThirdID(SecondID) as ThirdID
FROM Results1
If you're building up calculations multiple times (e.g. A depends on B, B depends on C, C depends on D...), then the CTE form usually ends up looking neater (IMO).
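The subquery form can be tried out quickly with SQLite through Python's sqlite3 module. This is only a sketch: SQLite has no dbo schema, so the functions are registered without it, and the two function bodies are invented placeholders. It shows that the alias is visible to the outer SELECT; it says nothing about how often SQL Server would evaluate the function.

```python
import sqlite3

def get_second_id(i):
    # Hypothetical placeholder for dbo.GetSecondID
    return i + 100

def get_third_id(i):
    # Hypothetical placeholder for dbo.GetThirdID
    return i * 2

conn = sqlite3.connect(":memory:")
conn.create_function("GetSecondID", 1, get_second_id)
conn.create_function("GetThirdID", 1, get_third_id)
conn.execute("CREATE TABLE Foo (ID INTEGER)")
conn.executemany("INSERT INTO Foo VALUES (?)", [(1,), (2,), (3,)])

# The inner query names the first result; the outer query can then
# pass that name to the second function.
query = """
SELECT SecondID, GetThirdID(SecondID) AS ThirdID
FROM (
    SELECT GetSecondID(f.ID) AS SecondID
    FROM Foo f
) t
ORDER BY SecondID
"""
pairs = conn.execute(query).fetchall()
print(pairs)
```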
Bingo! The secret lies in using CROSS APPLY. The following code was helpful:
SELECT
sndID.SecondID,
dbo.GetThirdID(sndID.SecondID) as ThirdID
FROM Foo f
CROSS APPLY
(
SELECT dbo.GetSecondID(f.ID) as SecondID
) sndID
EDIT:
This only works if SecondID is unique (only one record is returned) or GROUP BY is used
Did you mean this:
SELECT
dbo.GetThirdID(dbo.GetSecondID(f.ID)) as ThirdID
FROM Foo f