Postgres Aggregate over unnest - sql

I have a query like the following:
select count(unnest(regexp_matches(column_name, regex)))
from table_name group by unnest(regexp_matches(column_name, regex));
The above query gives the following error:
ERROR: aggregate function calls cannot contain set-returning function calls
Hint: You might be able to move the set-returning function into a LATERAL FROM item.
I know I can first compute the unnested values in a subquery in the FROM clause and then count them. But I was wondering why Postgres does not allow such an expression?

It's unclear to me what result you are after. But in general, you need to move the unnest to the FROM clause to do anything "regular" with the values.
If you want to count per value extracted you can use:
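select u.val, count(*)
from table_name t
-- regexp_matches() is itself set-returning, so it gets its own FROM item
cross join lateral regexp_matches(t.column_name, regex) as m(parts)
cross join lateral unnest(m.parts) as u(val)
group by u.val;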
Or maybe you want to count per "column_name"?
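select t.column_name, count(*)
from table_name t
-- regexp_matches() is itself set-returning, so it gets its own FROM item
cross join lateral regexp_matches(t.column_name, regex) as m(parts)
cross join lateral unnest(m.parts) as u(val)
group by t.column_name;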

Related

Filter SQL by Aggregate Not in SELECT Statement

Can you filter a SQL table based on an aggregated value, but still show column values that weren't in the aggregate statement?
My table has only 3 columns: "Composer_Tune", "_Year", and "_Rank".
I want to use SQL to find which "Composer_Tune" values are repeated in each annual list, as well as which ranks the duplicated items had.
Since I am grouping by "Composer_Tune" and "_Year", I can't list "_Rank" with my current code.
The image shows the results of my original "find the duplicates" query vs what I want:
Current vs Desired Results
I tried applying the concepts in this Aggregate Subquery StackOverflow post but am still getting "_Rank is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause" from this code:
WITH DUPE_DB AS (SELECT * FROM DB.dbo.[NAME] GROUP BY Composer_Tune, _Year HAVING COUNT(*)>1)
SELECT Composer_Tune, _Year, _Rank
FROM DUPE_DB
You need to explicitly declare the columns used in the Group By expression in the select columns.
If you are using Transact-SQL, see its documentation for the proper use of GROUP BY.
Simply join the aggregated result set back to the original row-level table:
WITH DUPE_DB AS (
SELECT Composer_Tune, _Year
FROM DB.dbo.[NAME]
GROUP BY Composer_Tune, _Year
HAVING COUNT(*) > 1
)
SELECT n.Composer_Tune, n._Year, n._Rank
FROM DB.dbo.[NAME] n
INNER JOIN DUPE_DB
ON n.Composer_Tune = DUPE_DB.Composer_Tune
AND n._Year = DUPE_DB._Year
ORDER BY n.Composer_Tune, n._Year
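To see the join-back pattern run end to end, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The table name `tunes` and the sample rows are made up for illustration; only the three column names come from the question.

```python
import sqlite3

# Hypothetical stand-in for DB.dbo.[NAME]; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tunes (Composer_Tune TEXT, _Year INTEGER, _Rank INTEGER)"
)
conn.executemany(
    "INSERT INTO tunes VALUES (?, ?, ?)",
    [
        ("Bach - Air", 2020, 3),
        ("Bach - Air", 2020, 17),
        ("Holst - Jupiter", 2020, 5),
        ("Bach - Air", 2021, 4),
    ],
)

# The CTE finds (tune, year) pairs that occur more than once; joining it
# back to the base table recovers the _Rank of every duplicated row.
query = """
WITH dupes AS (
    SELECT Composer_Tune, _Year
    FROM tunes
    GROUP BY Composer_Tune, _Year
    HAVING COUNT(*) > 1
)
SELECT n.Composer_Tune, n._Year, n._Rank
FROM tunes AS n
INNER JOIN dupes AS d
    ON n.Composer_Tune = d.Composer_Tune
   AND n._Year = d._Year
ORDER BY n.Composer_Tune, n._Year, n._Rank
"""
duplicate_rows = conn.execute(query).fetchall()
for row in duplicate_rows:
    print(row)
```

Only the two 2020 "Bach - Air" rows come back, each with its own rank.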

Does it have to be that in SQL, aggregate functions go hand in hand with grouping?

Question in the title. Thanks for the time.
EXAMPLE v
SELECT
customer_id,
SUM(unit_price * quantity) AS total_price
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_id
GROUP BY customer_id
Yes, grouping goes hand in hand with aggregate functions: any result column that is not wrapped in an aggregate function must appear in the GROUP BY clause.
Grouping and the operations of aggregation are typically linked to three keywords:
any aggregation function
the GROUP BY clause
the DISTINCT (ON) modifier after the SELECT keyword
You can use:
aggregation functions alone when they are used on every field found inside the SELECT statement
the GROUP BY clause alone - allowed, but bad practice: it signals an aggregation, yet the SELECT statement contains no aggregation function (DISTINCT or DISTINCT ON states the intent more clearly)
the DISTINCT modifier alone to select distinct rows (not fields)
the DISTINCT ON modifier only when accompanied by an ORDER BY clause that defines the order for extracting the rows
aggregation functions + the GROUP BY clause, which must then contain every field in the SELECT statement that is not aggregated
the DISTINCT modifier + the GROUP BY clause - you can do it but the GROUP BY clause really is superfluous as it is already implied by the DISTINCT keyword itself.
You can't use:
aggregation functions + the GROUP BY clause when non-aggregated fields included in the SELECT statement are not found within the GROUP BY clause
aggregation functions alone when in the SELECT statement they are found together with non-aggregated fields.
When you can't aggregate because you need to select multiple fields, window functions plus a filtering step (with a WHERE clause) typically come in handy for computing values row by row and removing unneeded records.
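As a rough illustration of the window-function route, here is a sketch using Python's sqlite3 module (it assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The `order_items` layout and the rows are invented for the example.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate the idea.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_items (customer_id INTEGER, item TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)",
    [(1, "pen", 2), (1, "book", 10), (2, "lamp", 30), (3, "cup", 5)],
)

# SUM(...) OVER (...) attaches the per-customer total to every row without
# collapsing rows. A window function cannot be referenced in the WHERE
# clause of the same query level, so the filter goes in an outer query.
query = """
SELECT customer_id, item, price, customer_total
FROM (
    SELECT customer_id, item, price,
           SUM(price) OVER (PARTITION BY customer_id) AS customer_total
    FROM order_items
) AS t
WHERE customer_total > 10
ORDER BY customer_id, item
"""
rows_with_totals = conn.execute(query).fetchall()
for row in rows_with_totals:
    print(row)
```

Every item row keeps its own columns, and customer 3 (total 5) is filtered out.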

Alternatives of array_agg() or string_agg() on redshift

I am using this query to get the aggregated results:
select _bs, string_agg(_wbns, ',') from bag group by 1;
I am getting this error:
Error running query: function string_agg(character varying, "unknown")
does not exist HINT: No function matches the given name and argument
types. You may need to add explicit type casts.
I also tried array_agg() and getting the same error.
Please help me in figuring out the other options I can use to aggregate the results.
You have to use LISTAGG on Redshift.
For each group in a query, the LISTAGG aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string.
LISTAGG is a compute-node only function. The function returns an error if the query doesn't reference a user-defined table or Amazon Redshift system table.
Your query would look like this:
select _bs,
listagg(_wbns,',')
within group (order by _wbns) as val
from bag
group by _bs
order by _bs;
For a better understanding, see the LISTAGG documentation.
Redshift has a listagg function you can use instead:
SELECT _bs, LISTAGG(_wbns, ',') FROM bag GROUP BY _bs;
To get an array type back instead of a varchar, you need to combine the LISTAGG function with the SPLIT_TO_ARRAY function like so:
SELECT
some_grouping_key,
SPLIT_TO_ARRAY(LISTAGG(col_to_agg, ','), ',')
FROM some_table
GROUP BY 1
Use listagg function:
select _bs,
listagg(_wbns,',')
within group (order by _bs) as val
from bag
group by _bs
I got this error: One or more of the used functions must be applied on at least one user created tables.
Examples of user table only functions are LISTAGG, MEDIAN, PERCENTILE_CONT, etc
SELECT refc.constraint_name, refc.update_rule, refc.delete_rule, kcu.table_name,
LISTAGG(distinct kcu.column_name, ',') AS columns
FROM information_schema.referential_constraints AS refc,
information_schema.key_column_usage AS kcu
WHERE refc.constraint_schema = 'abc' AND refc.constraint_name = kcu.constraint_name AND refc.constraint_schema = kcu.table_schema
AND kcu.table_name = 'xyz'
GROUP BY refc.constraint_name, refc.update_rule, refc.delete_rule, kcu.table_name;

Count of 2 columns by GROUP BY and catx giving different outputs

I have to find the distinct count of combinations of 2 variables. I used the following 2 queries to find the count:
select count(*) from
( select V1, V2
from table1
group by 1,2
) a
select count(distinct catx('-', V1, V2))
from table1
Logically, both the above queries should give the same count but I am getting different counts. Note that
both V1 and V2 are integers
Both variables can have null values, though there are no null values in my table
There are no negative values
Any idea why I might be getting different outputs? And which is the best way to find the count of distinct combinations of 2 or more columns?
Thanks.
The SAS log gives the answer when you run the first sql code. Using 'group by' requires a summary function, otherwise it is ignored. The count will therefore return the overall number of rows instead of a distinct count of the 2 variables combined.
Just add count(*) to the subquery and you will get the same answer with both methods.
select count(*) from
( select V1, V2, count(*)
from table1
group by 1,2
) a
Use distinct in the subquery for the first query.
When you do a group by but don't include any aggregate function, SAS discards the group by,
so you will still have duplicate combinations of v1 and v2.
It seems that GROUP BY doesn't work that way in SAS. You can't use it to remove duplicates unless you have an aggregate function in your query. I found this in the log of my query output:
NOTE: A GROUP BY clause has been discarded because neither the SELECT
clause nor the optional HAVING clause of the associated
table-expression referenced a summary function.
This answers the question.
You can also drop the group by entirely and just add a distinct in the subquery. Also, the second query you wrote is more efficient.
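The DISTINCT-in-a-subquery variant can be sketched as follows. This uses Python's sqlite3 module, not SAS, so it only demonstrates the shape of the query; it does not reproduce the PROC SQL behaviour of discarding GROUP BY. The sample rows are made up.

```python
import sqlite3

# Invented sample data: five rows, three distinct (V1, V2) combinations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (V1 INTEGER, V2 INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, 2), (1, 2), (1, 3), (2, 3), (2, 3)],
)

# Deduplicate the pairs first, then count the rows that remain.
distinct_pairs = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT V1, V2 FROM table1) AS a"
).fetchone()[0]

# Plain row count, for comparison with the deduplicated count.
total_rows = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]

print(distinct_pairs, total_rows)
```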

Give priority to ORDER BY over a GROUP BY in MySQL without subquery

I have the following query which does what I want, but I suspect it is possible to do this without a subquery:
SELECT *
FROM (SELECT *
FROM `versions`
ORDER BY `ID` DESC) AS X
GROUP BY `program`
What I need is to group by program, but returning the results for the objects in versions with the highest value of "ID".
In my past experience, a query like this should work in MySQL, but for some reason, it's not:
SELECT *
FROM `versions`
GROUP BY `program`
ORDER BY MAX(`ID`) DESC
What I want to do is have MySQL do the ORDER BY first and then the GROUP BY, but it insists on doing the GROUP BY first followed by the ORDER BY. i.e. it is sorting the results of the grouping instead of grouping the results of the ordering.
Of course it is not possible to write
SELECT * FROM `versions` ORDER BY `ID` DESC GROUP BY `program`
Thanks.
By definition, ORDER BY is processed after grouping with GROUP BY. Conceptually, any SELECT statement is processed in this order:
Compute the cartesian product of all tables referenced in the FROM clause
Apply the join criteria from the FROM clause to filter the results
Apply the filter criteria in the WHERE clause to further filter the results
Group the results into subsets based on the GROUP BY clause, collapsing the results to a single row for each such subset and computing the values of any aggregate functions -- SUM(), MAX(), AVG(), etc. -- for each such subset. Note that if no GROUP BY clause is specified, the results are treated as if there is a single subset and any aggregate functions apply to the entire results set, collapsing it to a single row.
Filter the now-grouped results based on the HAVING clause.
Sort the results based on the ORDER BY clause.
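The order of the steps above can be observed on a small example. This is a sketch using Python's sqlite3 module with an invented `sales` table; the comments mark which clause acts at which stage.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("north", 5), ("north", 50),
        ("south", 40), ("south", 70),
        ("east", 20), ("east", 200),
        ("west", 15),
    ],
)

query = """
SELECT region, SUM(amount) AS total
FROM sales
WHERE amount >= 10        -- filters individual rows before grouping
GROUP BY region           -- collapses the remaining rows per region
HAVING SUM(amount) > 30   -- filters whole groups after aggregation
ORDER BY total DESC       -- sorts the grouped result last
"""
region_totals = conn.execute(query).fetchall()
for row in region_totals:
    print(row)
```

The 5-unit "north" row is removed by WHERE before any sum is computed, "west" is removed by HAVING after its sum is known, and only then is the result sorted.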
The only things allowed in the result set of a SELECT with a GROUP BY clause are:
The columns referenced in the GROUP BY clause
Aggregate functions (such as MAX())
Literals/constants
Expressions derived from any of the above.
Only broken SQL implementations allow things like select xxx,yyy,a,b,c FROM foo GROUP BY xxx,yyy. The references to columns a, b and c are meaningless/undefined, given that the individual groups have been collapsed to a single row.
This should do it and work pretty well as long as there is a composite index on (program,id). The subquery should only inspect the very first id for each program branch, and quickly retrieve the required record from the outer query.
select v.*
from
(
select program, MAX(id) id
from versions
group by program
) m
inner join versions v on m.program=v.program and m.id=v.id
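A quick way to try this join on sample data is sketched below, using Python's sqlite3 module instead of MySQL. The `label` column, the index name, and the rows are assumptions added for the example; only `versions`, `program` and `id` come from the question.

```python
import sqlite3

# Hypothetical schema: "label" stands in for whatever other columns exist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions (id INTEGER PRIMARY KEY, program TEXT, label TEXT)"
)
# Composite index as suggested above; the index name is made up.
conn.execute("CREATE INDEX idx_versions_program_id ON versions (program, id)")
conn.executemany(
    "INSERT INTO versions VALUES (?, ?, ?)",
    [
        (1, "editor", "1.0"),
        (2, "editor", "1.1"),
        (3, "player", "0.9"),
        (4, "editor", "2.0"),
        (5, "player", "1.0"),
    ],
)

# The derived table picks the highest id per program; the join then
# fetches the full row that carries that id.
query = """
SELECT v.id, v.program, v.label
FROM (
    SELECT program, MAX(id) AS id
    FROM versions
    GROUP BY program
) AS m
INNER JOIN versions AS v
    ON m.program = v.program AND m.id = v.id
ORDER BY v.program
"""
latest_versions = conn.execute(query).fetchall()
for row in latest_versions:
    print(row)
```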
SELECT v.*
FROM (
SELECT DISTINCT program
FROM versions
) vd
JOIN versions v
ON v.id =
(
SELECT vi.id
FROM versions vi
WHERE vi.program = vd.program
ORDER BY
vi.program DESC, vi.id DESC
LIMIT 1
)
Create an index on (program, id) for this to work fast.
Regarding your original query:
SELECT * FROM `versions` GROUP BY `program` ORDER BY MAX(`ID`) DESC
This query would not parse in any SQL dialect except MySQL.
It abuses MySQL's ability to return ungrouped and unaggregated expressions from a GROUP BY statement.