Can I add aggregated column without performing a join? - sql

I have a table table1 with three columns a, b, c. I am creating another column d by grouping on c and applying some function func(a,b), which gives me view1. To add the column d to table1, the only thing I can think of is to join view1 and table1. However, both have millions of rows and the join gets really slow. Is there any other way, without joining them? Intuitively, it seems this should be possible.
Here is a snippet of the script:
with
found_mean
as
(select sum(count*avg)/sum(count) as combined_avg , b from view_1 group by b),
view_1_m
as
(select combined_avg , count , avg, variance , found_mean.b from found_mean , view_1 where found_mean.b = view_1.b),

Depending on what your function is, you can use window functions (sometimes called analytic functions). For instance, if you wanted the maximum value of b for a given a:
select a, b, c, max(b) over (partition by a) as d
from table1;
Without more information, it is hard to be more specific.
EDIT:
You should be able to do this with analytic functions:
select count , avg, variance,
(sum(count * avg) over (partition by b) /
sum(count) over (partition by b)
) as weighted_average
from view_1;
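To sanity-check the window-function version on a small data set, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The column names cnt and avg_val and the sample rows are made up for illustration, since the real view_1 is not shown in full.

```python
import sqlite3

# Hypothetical stand-in for view_1: per-row count and average, grouped by b.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE view_1 (b TEXT, cnt INTEGER, avg_val REAL)")
conn.executemany(
    "INSERT INTO view_1 VALUES (?, ?, ?)",
    [("x", 1, 10.0), ("x", 3, 20.0), ("y", 2, 5.0)],
)

# Every input row is kept; the weighted average of its b-group is attached
# to it without a self-join.
rows = conn.execute(
    """
    SELECT b, cnt, avg_val,
           SUM(cnt * avg_val) OVER (PARTITION BY b)
             / SUM(cnt) OVER (PARTITION BY b) AS weighted_average
    FROM view_1
    ORDER BY b, cnt
    """
).fetchall()

for row in rows:
    print(row)
```

Both rows of group x carry the same weighted_average, which is the value the join against found_mean would have attached.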

Related

How to select * in addition to group by?

Consider a PostgreSQL table with fields a-z
a, b, c ... z
-------------
5, 6, 2 ... 9
5, 6, 3 ... 1
I'd like to do a group on fields a,b and keep only records where b was maximum.
SELECT a, max(b) as b, c, d, e ... z
FROM table
GROUP BY a, b
This works fine, but it's annoying to have to type out all the values in SELECT. I'd much rather do something like
SELECT max(b) as b, *
FROM TABLE
But doing so gives error
[42803] ERROR: column "table.id" must appear in the GROUP BY clause or
be used in an aggregate function.
Any idea how to avoid having to type all the column names in a lengthy table when doing a groupby operation?
You can use rank():
select t.*
from (select t.*, rank() over (partition by a order by b desc) as seqnum
from t
) t
where seqnum = 1;
Actually, in Postgres, the fastest method is usually distinct on:
select distinct on (a) t.*
from t
order by a, b desc;
With an index on (a, b desc) this should be the fastest method.
Gordon Linoff's answer put me on the right track, namely using distinct on. This works in postgres
SELECT DISTINCT ON (a, b) *
FROM table
ORDER BY a, b DESC
Basically, DISTINCT ON keeps only the first row of each distinct (a, b) combination, where "first" is decided by the ORDER BY clause, so the sort order determines which row survives. Actually surprised this works...

SQL Server, include columns that are not in group by statement

I have a recurring problem.
Let's assume that I have the following columns:
T:A(PK), B, C, D, E
Now, this works:
select A, MAX(B) from T group BY A
But I can't do:
select A, C, MAX(B) from T group BY A
I don't understand why. When it comes to AVG or SUM I get it. However, MAX or MIN comes from exactly one row.
How do I deal with it?
You can use ROW_NUMBER() for that like this:
select A, C, B
from (
select *
, row_number() over (partition by A order by B desc) seq
-- "partition by A" plays the role of GROUP BY A; "order by B desc" puts the MAX(B) row first
from yourTable ) t
where seq = 1;
That's because every column in the select list must either be part of the GROUP BY clause or be wrapped in an aggregate function. You may have columns that are part of the GROUP BY but not present in the select list, but the reverse is not possible.
Generally, you put in the select clause only those columns on which you want the grouping to happen, plus the aggregates.
Try this. It finds the MAX of one column (f3) per group (f1), and also adds another column you want (f2) without affecting the MAX operation. The ORDER BY makes TOP(1) pick f2 from the row that holds the maximum rather than from an arbitrary row:
SELECT m.f1, s.f2, m.maxf3 FROM
(SELECT f1, max(f3) maxf3 FROM t1 GROUP BY f1) m
CROSS APPLY (SELECT TOP(1) f2, f1 FROM t1 WHERE t1.f1 = m.f1 ORDER BY t1.f3 DESC) s
Your question isn't very clear in that we aren't sure what you are trying to do.
Assuming you don't actually want to do a group by in your main query but want to return the max of B based on column A you can do it like so.
select A, C,(Select Max(B) from T as T2 WHERE T.A = T2.A) as MaxB from T

Getting row with MAX value together with SUM

I have a PostgreSQL table example with three columns: a INT, b INT, c TEXT.
For each value of a I want the c with the highest value of b, together with the sum of all b. Something like (if there was an ARGMAX function):
SELECT a, ARGMAX(c for MAX(b)), SUM(b) FROM example GROUP BY a
I've found a lot of solutions with varying techniques to get the ARGMAX bit, but none of them seem to use GROUP BY, so I was wondering what the most efficient way would be to capture the SUM (or other aggregate functions) as well.
This can be easily achieved using window functions:
SELECT a, b, c, s
FROM (
SELECT a, b, c,
ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn,
SUM(b) OVER (PARTITION BY a) AS s
FROM example) AS t
WHERE t.rn = 1
ROW_NUMBER enumerates records within each a partition: the record having the highest b value is assigned a value of 1, next record a value of 2, etc.
SUM(b) OVER (PARTITION BY a) returns the sum of all b within each a partition.
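As a quick check of the query above, the following sketch runs it against a tiny in-memory table with Python's sqlite3 module (SQLite 3.25+ for window functions). The sample rows are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (a INTEGER, b INTEGER, c TEXT)")
conn.executemany(
    "INSERT INTO example VALUES (?, ?, ?)",
    [(1, 10, "low"), (1, 30, "high"), (2, 7, "only")],
)

# rn = 1 marks the row with the highest b in each a partition;
# s carries the sum of b over the whole partition.
result = conn.execute(
    """
    SELECT a, b, c, s
    FROM (
        SELECT a, b, c,
               ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn,
               SUM(b) OVER (PARTITION BY a) AS s
        FROM example
    ) AS t
    WHERE t.rn = 1
    ORDER BY a
    """
).fetchall()

print(result)  # [(1, 30, 'high', 40), (2, 7, 'only', 7)]
```

For a = 1 the row with c = 'high' wins because it has the largest b, while s still sums both rows of that partition.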

Sorted SQL groups

I was trying to do something like:
SELECT
a, b, c, MAX(d)
FROM
table -- table with 4 columns a, b, c and d
GROUP BY
a, b
I would like to have c as an additional value from the table that I do not want to group by, but that distinguishes rows within groups. My problem is that GROUP BY makes c come from the first row of each group and not from the row that actually contains d = MAX(d) in the table.
ORDER BY is applied to the whole result, so it's not an option. Can I achieve that in any other way than sorting the table prematurely (as a subquery) and then applying the grouping? Would that work in every SQL engine? Do standards define such behaviors?
Edit1:
I tested something like:
SELECT
t.*,
MAX(d) AS v
FROM
(SELECT
a, b, c, d
FROM
table
ORDER BY
d DESC) AS t
GROUP BY
a, b
and it works... but I do not think anybody can guarantee that the sort order will also be applied to the group rows... - maybe it works this way in MySQL, but how will it go with Oracle or PostgreSQL?
This is ANSI SQL:
SELECT a,
b,
c,
MAX(d) over (partition by a,b) as max_d
FROM the_table
This will still return all rows from the table. The max value will be repeated for every row that is returned. If you want to get only the rows with the max value, you need to wrap this in a derived table:
select a,b,c,d
from (
SELECT a,
b,
c,
d,
MAX(d) over (partition by a,b) as max_d
FROM the_table
) t
where d = max_d;
That will return multiple rows if the same max value occurs more than once within a group. If you only want a single row for each group, you need to use row_number() instead.
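The difference between the two approaches shows up only when a group has ties. Here is a small sketch with Python's sqlite3 module (SQLite 3.25+); the sample rows and the tie-breaker on c are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (a INTEGER, b INTEGER, c TEXT, d INTEGER)")
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?, ?)",
    # Group (1, 1) has two rows sharing the maximum d = 5.
    [(1, 1, "p", 5), (1, 1, "q", 5), (1, 1, "r", 2), (2, 1, "s", 9)],
)

# Filtering on d = max_d keeps every row that ties for the maximum.
with_ties = conn.execute(
    """
    SELECT a, b, c, d
    FROM (
        SELECT a, b, c, d,
               MAX(d) OVER (PARTITION BY a, b) AS max_d
        FROM the_table
    ) t
    WHERE d = max_d
    ORDER BY a, b, c
    """
).fetchall()

# row_number() keeps exactly one row per group; the extra "c" in ORDER BY
# is a tie-breaker so the chosen row is deterministic.
one_per_group = conn.execute(
    """
    SELECT a, b, c, d
    FROM (
        SELECT a, b, c, d,
               ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY d DESC, c) AS rn
        FROM the_table
    ) t
    WHERE rn = 1
    ORDER BY a, b
    """
).fetchall()

print(with_ties)      # three rows: both tied rows of group (1, 1) plus group (2, 1)
print(one_per_group)  # two rows: one per group
```

Without a tie-breaker in the ORDER BY of the window, which of the tied rows gets rn = 1 is up to the engine.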
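You can also join the grouped result back to the table. Join on a and b as well as d, otherwise rows from other groups that happen to have the same d would match too:
select x.*, y.c
from (SELECT a, b, MAX(d) as d FROM table GROUP BY a, b) x
join (select a, b, c, d from table) y
on x.a = y.a and x.b = y.b and x.d = y.d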

Group Count in SQL

I am looking for a way to display a table where a set of multiple attributes appear more than one time.
For example, suppose I had a table, Tbl1 with attributes A, B, C, D, E
How do I make a query such that it only shows rows where A, B, C appear more than once (as in the same A, B, C as a group), but D and E may or may not be different?
My attempt:
SELECT *
FROM Tbl1
WHERE COUNT(A, B, C) > 1
and I get an error: "group function is not allowed here"
The reason is that you cannot use an aggregate function in the WHERE part of an SQL statement.
SELECT columns
FROM tables
WHERE condition
Here the condition refers to a single row of the table.
What you want is HAVING:
SELECT columns
FROM tables
GROUP BY columns
HAVING condition
The condition after HAVING is evaluated after the grouping, and there you can use aggregate functions like COUNT or SUM.
Use the GROUP BY clause (SQL Server: http://msdn.microsoft.com/en-us/library/ms177673.aspx, MySQL: http://www.tutorialspoint.com/mysql/mysql-group-by-clause.htm).
Within each group, you'll want to get the count of rows in that group (using COUNT(*)) and then use a HAVING clause to filter on that count. HAVING is like a WHERE clause for GROUP BY. It filters on the results of the grouping, and can make reference to the grouped columns (in this case, A, B and C), or any aggregates (in this case, COUNT(*)).
Here's what your query could look like. Note that you can only include columns in the SELECT field list that are mentioned in the GROUP BY or that are contained in aggregate functions such as COUNT() and MAX(). MySQL will let you get away with putting other columns in (when its ONLY_FULL_GROUP_BY mode is not enabled), but SQL Server will give you an error. It's best to follow this rule even if the database allows breaking it.
SELECT A,
B,
C,
COUNT(*) AS GroupCount
FROM Tbl1
GROUP BY A, B, C
HAVING COUNT(*) > 1
If you want the full rows where this is true, then you can use a derived table:
SELECT *
FROM Tbl1
JOIN (
SELECT A,
B,
C,
COUNT(*) AS GroupCount
FROM Tbl1
GROUP BY A, B, C
HAVING COUNT(*) > 1
) AS duplicates
ON duplicates.A = Tbl1.A AND
duplicates.B = Tbl1.B AND
duplicates.C = Tbl1.C
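To see the derived-table query return the full duplicate rows, here is a minimal sketch using Python's sqlite3 module. The sample data is invented for illustration, and the outer SELECT is narrowed to Tbl1.* so the result shows only the original columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tbl1 (A INTEGER, B INTEGER, C INTEGER, D TEXT, E TEXT)")
conn.executemany(
    "INSERT INTO Tbl1 VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "d1", "e1"),  # (A, B, C) = (1, 1, 1) appears twice
        (1, 1, 1, "d2", "e2"),
        (2, 2, 2, "d3", "e3"),  # appears once, so it is filtered out
    ],
)

# The derived table finds (A, B, C) groups with more than one row;
# joining back to Tbl1 recovers D and E for each of those rows.
duplicate_rows = conn.execute(
    """
    SELECT Tbl1.*
    FROM Tbl1
    JOIN (
        SELECT A, B, C, COUNT(*) AS GroupCount
        FROM Tbl1
        GROUP BY A, B, C
        HAVING COUNT(*) > 1
    ) AS duplicates
      ON duplicates.A = Tbl1.A
     AND duplicates.B = Tbl1.B
     AND duplicates.C = Tbl1.C
    ORDER BY Tbl1.D
    """
).fetchall()

print(duplicate_rows)  # [(1, 1, 1, 'd1', 'e1'), (1, 1, 1, 'd2', 'e2')]
```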