How to delay expensive calculation in postgres until final steps - sql

I have a simple postgres 9.1 query that contains a very expensive calculation:
SELECT a, b, sum(c), very_expensive_calculation(f)
FROM my_table
GROUP BY a, b, f
The function very_expensive_calculation() is a custom, non-aggregate function of mine that runs a recursive summation behind the scenes, so it takes a while. The problem is that my_table is heavily de-normalized and contains many duplicates, so the function runs once per row when it should only run once per distinct value of f. I tried the following to run the function on the pre-grouped values:
SELECT a, b, c_sum, very_expensive_calculation(f)
FROM (
SELECT a, b, sum(c) c_sum, f
FROM my_table
GROUP BY a, b, f
) pre_group
This cuts down on the number of executions of very_expensive_calculation(): if the original query produced 100 rows but the grouped subquery produces only 10, I get a 90% reduction. However, this is hacky and comes with other problems (there are many more criteria and columns with custom logic that I'm not showing, and they suffer from this hack).
Can I run the first query but delay very_expensive_calculation() so it runs on the already grouped values of f, possibly by declaring very_expensive_calculation() as an aggregate?
EDIT (@Gordon Linoff): Would the following behave the same as the answer mentioned below?
WITH fvec AS (
    SELECT f, very_expensive_calculation(f) AS vec
    FROM (SELECT DISTINCT f FROM my_table) mt
)
SELECT a, b, sum(c), fvec.vec
FROM my_table agg INNER JOIN fvec ON agg.f = fvec.f
GROUP BY a, b, fvec.vec
Our auto-generated code can do WITH clauses easily but I'm not sure if this behaves the same as the join solution below.

Given your original query, I think the cheapest way might be to do the calculation only once for each f and join the results back in:
select agg.*, fvec.vec
from (SELECT a, b, f, sum(c) as sumc
      FROM my_table
      GROUP BY a, b, f
     ) agg join
     (select f, very_expensive_calculation(f) as vec
      from (select distinct f from my_table) mt
     ) fvec
     on agg.f = fvec.f;
I don't see how writing a custom aggregation function would help unless the aggregation function stashed already-computed values. The aggregation function is going to be called for each row in the original table, not for each row in the post-aggregated results.
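To make that concrete, here is a minimal sketch (hypothetical names vec_step and vec_agg, assuming f is an integer) of what such a custom aggregate would look like in Postgres. The state-transition function, which is where the expensive call would have to live, fires once per input row, before grouping ever collapses the duplicates:
-- Hypothetical custom aggregate: the transition function runs once per row,
-- so duplicates in my_table still trigger the expensive call.
CREATE FUNCTION vec_step(state numeric, f int) RETURNS numeric AS $$
BEGIN
    RETURN state + very_expensive_calculation(f);
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE vec_agg(int) (
    SFUNC    = vec_step,
    STYPE    = numeric,
    INITCOND = '0'
);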

Related

Can I (in a many to many relationship) select only those id's in column A that have a connection to all id's in column B?

I need to retrieve only those id's in "A" that have a connection to all id's in "B".
In the example below, the result should be '...fa3e' because '...65d6' does NOT have a reference to all id's in "B".
However, if '...fa3e' and '...65d6' reference the same id's in column B, then the query should return both '...fa3e' and '...65d6'.
And, subsequently, if a fifth row were to connect '...fa3e' with a completely new id in "B", then '...65d6' would be excluded because it no longer holds a reference to all id's in column "B".
Is there a way to accomplish this in SQL Server?
I can't really come up with a good description/search term for what I'm trying to do ("Exclude column A based on values in column B" is not quite right), hence I'm striking out looking for resources.
I believe these values reside in the same table.
For distinct a values only:
select a
from T
group by a
having count(distinct b) = (select count(distinct b) from T);
To return all the rows:
select * from T where a in (
select a from T group by a
having count(distinct b) = (select count(distinct b) from T)
);
If (a, b) pairs are always unique then you wouldn't need the distinct qualifier on the left-hand counts. In fact you could even use count(*) for that.
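For illustration, a small worked example (hypothetical table and values, shortened from the ids in the question):
create table T (a varchar(16), b varchar(16));
insert into T values
    ('fa3e', 'b1'), ('fa3e', 'b2'), ('fa3e', 'b3'),
    ('65d6', 'b1'), ('65d6', 'b2');

-- There are 3 distinct b values, and only 'fa3e' is connected to all of them:
select a
from T
group by a
having count(distinct b) = (select count(distinct b) from T);
-- returns: fa3e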
This seems like it's going to be a terrible query, but at its most basic, you want
All A where B in...
All B that are fully distinct
In SQL, that looks like
select distinct A
from test
where B in (select B from test group by B having count(1) = 1);
Absolutely zero guarantees on performance, but this gives you the right value A. If you want to see which A/B pair actually made the cut, it could be SELECT A, B FROM test... too.

Efficiently take min of a column with other column as the key

I recently transitioned from Presto to Hive. I have the following scenario: columns A, B, C. I want to aggregate on A and find the value of B for which the value of C is minimized. In Presto you can do something like this:
SELECT A, min_by(B, C) from <TABLE> GROUP BY A
Now I want to do the same thing in Hive. But unfortunately I couldn't find a UDF similar to this anywhere in the documentation. Now I know I can do the following
SELECT primary.A, COALESCE(primary.B, 0) AS B
FROM <TABLE> AS primary
JOIN (
    SELECT A, MIN(C) AS C FROM <TABLE> GROUP BY A
) secondary
ON primary.A = secondary.A AND primary.C = secondary.C
GROUP BY primary.A, COALESCE(primary.B, 0)
I have 2 problems with this solution
It's not concise at all.
It's not efficient either: I am doing an extra subquery, an extra aggregation, and an extra JOIN. It would be nice to have first-class aggregation support for such a function.
Is there a way to achieve what I am trying to do without writing a custom UDF?
A join works slower than analytic functions. Try this approach without a join, so the table is scanned only once:
select s.*
from (
    SELECT A, COALESCE(B, 0) as B, C,
           min(C) over (partition by A) as min_C
    from <TABLE>
) s
where s.C = s.min_C;
If you need min(C) to be calculated over more grouping columns, add them to the PARTITION BY clause, as in the sketch below.
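For example, assuming a second (hypothetical) grouping column D:
select s.*
from (
    SELECT A, D, COALESCE(B, 0) as B, C,
           min(C) over (partition by A, D) as min_C
    from <TABLE>
) s
where s.C = s.min_C;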
You can try TD_first(B, C) in Hive; it works in the same fashion.

SQL Basic Syntax

I have the following problem:
What happens if the query didn't ask for B in the select? I think it would give an error because the aggregate is computed based on the values in the select clause.
I have the following relation schema and queries:
Suppose R(A,B) is a relation with a single tuple (NULL, NULL).
SELECT A, COUNT(B)
FROM R
GROUP BY A;
SELECT A, COUNT(*)
FROM R
GROUP BY A;
SELECT A, SUM(B)
FROM R
GROUP BY A;
The first query returns NULL and 0. I am not sure what the second query returns. The aggregate COUNT(*) counts the number of tuples in a table; however, I don't know what it does within a group. The third returns NULL, NULL.
The only rule about SELECT and GROUP BY is that the unaggregated columns in the SELECT must be in the GROUP BY (with very specific exceptions).
You can have columns in the GROUP BY that never appear in the SELECT. That is fine. It doesn't affect the definition of a group, but multiple result rows may then appear to be duplicates, because the column that distinguishes the groups isn't shown.
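As for the concrete results on the single (NULL, NULL) tuple, they are easy to verify directly (the comments show what each query returns):
CREATE TABLE R (A int, B int);
INSERT INTO R VALUES (NULL, NULL);

SELECT A, COUNT(B) FROM R GROUP BY A;  -- NULL, 0    (COUNT(B) ignores NULLs)
SELECT A, COUNT(*) FROM R GROUP BY A;  -- NULL, 1    (COUNT(*) counts the row itself)
SELECT A, SUM(B)   FROM R GROUP BY A;  -- NULL, NULL (SUM over no non-NULL values is NULL)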

Postgres select distinct based on timestamp

I have a table with two columns a and b where a is an ID and b is a timestamp.
I need to select all of the a's which are distinct, but I only care about the most up-to-date row per ID.
I.e. I need a way of selecting distinct a's conditional on the b values.
Is there a way to do this using DISTINCT ON in postgres?
Cheers
Like @a_horse_with_no_name suggests, the solution is
SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC
As the manual says,
Note that the "first row" of a set is unpredictable unless the query is sorted on enough columns to guarantee a unique ordering of the rows arriving at the DISTINCT filter. (DISTINCT ON processing occurs after ORDER BY sorting.)
As posted in the upvoted answers, SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC works on Postgres 12. However, I am posting this answer to highlight a few important points:
The results will be sorted by column a, not column b.
Within each group of a, the most recent (highest) value of column b is picked.
In case someone wants the most recent value of column b over the entire result set, we can run: SELECT MAX(b) FROM (SELECT DISTINCT ON (a) a, b FROM the_table ORDER BY a, b DESC) t (note that the subquery needs an alias in Postgres).
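To make the ordering behavior concrete, a small sketch with hypothetical data:
CREATE TABLE the_table (a int, b timestamp);
INSERT INTO the_table VALUES
    (1, '2020-01-01'), (1, '2020-06-01'),
    (2, '2020-03-01');

SELECT DISTINCT ON (a) a, b
FROM the_table
ORDER BY a, b DESC;
-- a=1 pairs with 2020-06-01 (its latest b) and a=2 with 2020-03-01;
-- the rows come back sorted by a, per point 1 above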

SQL query in MySQL using GROUP BY

Okay, so this query should be easy but I'm having a bit of difficulty. Let's say I have a table called 'foo' with columns 'a' and 'b'.
I'm trying to figure out the following in one query:
select how many rows of column 'a' are available per type of column 'b'; this is done with the following:
mysql> select count(a),b from foo GROUP BY b;
that's straightforward. But now I want to add a third output to that query which shows the percentage of the result from count(a) divided by count(*). So if I have 100 rows total, and one of the GROUP BY results comes back with 20, the third column should output 20%, meaning that group makes up 20% of the aggregate pool.
Assuming you have > 0 rows in foo
SELECT count(a), b, (count(a) / (SELECT count(*) FROM foo)) * 100
FROM foo
GROUP BY b
There is a risk of it running slow; your best bet is to have whatever program runs this perform two separate queries.
SELECT count(*) INTO @c FROM foo;
SELECT count(a), b, (count(a)/@c)*100 FROM foo GROUP BY b;
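A small worked example (hypothetical data) showing the expected output of the single-query version:
CREATE TABLE foo (a int, b varchar(10));
INSERT INTO foo VALUES (1,'x'), (2,'x'), (3,'y'), (4,'y'), (5,'y');

SELECT count(a), b, (count(a) / (SELECT count(*) FROM foo)) * 100 AS pct
FROM foo
GROUP BY b;
-- 'x': count = 2, pct = 40.0000
-- 'y': count = 3, pct = 60.0000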