ORDER BY value in join table not grouped before aggregation - sql

I am trying to order a Postgres result set based on the values fed into an array_agg aggregate.
I have the following query that works great:
select a.id, a.name, array_agg(f.name)
from actors a
join actor_films af on a.id = actor_id
join films f on film_id = f.id
group by a.id
order by a.id;
This gives me the following results, for example:
id | name | array_agg
----+--------+---------------------------------
1 | bob | {"delta force"}
2 | joe | {"delta force","the funny one"}
3 | fred | {"bad movie",AARRR}
4 | sally | {"the funny one"}
5 | suzzy | {"bad movie","delta force"}
6 | jill | {AARRR}
7 | victor | {"the funny one"}
I want to sort the results alphabetically by film name. For example, the final order should be:
id | name | array_agg
----+--------+---------------------------------
3 | fred | {"bad movie",AARRR}
6 | jill | {AARRR}
5 | suzzy | {"bad movie","delta force"}
1 | bob | {"delta force"}
2 | joe | {"delta force","the funny one"}
4 | sally | {"the funny one"}
7 | victor | {"the funny one"}
This is based on the alphabetical name of any movies they are in. When I add the ORDER BY f.name I get the following error:
ERROR: column "f.name" must appear in the GROUP BY clause or be used in an aggregate function
I cannot add it to the GROUP BY, because I need it aggregated into the array, and I want to sort pre-aggregation so that I get the order shown above. Is this possible?
If you would like to reproduce this example, here is the setup code:
create table actors(id serial primary key, name text);
create table films(id serial primary key, name text);
create table actor_films(actor_id int references actors (id), film_id int references films (id));
insert into actors (name) values('bob'), ('joe'), ('fred'), ('sally'), ('suzzy'), ('jill'), ('victor');
insert into films (name) values('AARRR'), ('the funny one'), ('bad movie'), ('delta force');
insert into actor_films(actor_id, film_id) values (2, 2), (7, 2), (4,2), (2, 4), (1, 4), (5, 4), (6, 1), (3, 1), (3, 3), (5, 3);
And the final query with the error:
select a.id, a.name, array_agg(f.name)
from actors a
join actor_films af on a.id = actor_id
join films f on film_id = f.id
group by a.id
order by f.name, a.id;

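You can order by an aggregate of the column instead. After grouping, each actor has several f.name values, so min(f.name) picks the alphabetically first film of each actor as the sort key: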
order by min(f.name), a.id
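A minimal sketch of the same query on the question's sample data, run through Python's sqlite3 module only so that it is self-contained. SQLite has no array_agg, so group_concat stands in for it here; that substitution is an assumption of this sketch, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table actors(id integer primary key, name text);
    create table films(id integer primary key, name text);
    create table actor_films(actor_id int references actors (id),
                             film_id int references films (id));
    insert into actors (name) values ('bob'), ('joe'), ('fred'), ('sally'),
                                     ('suzzy'), ('jill'), ('victor');
    insert into films (name) values ('AARRR'), ('the funny one'),
                                    ('bad movie'), ('delta force');
    insert into actor_films (actor_id, film_id) values
        (2, 2), (7, 2), (4, 2), (2, 4), (1, 4),
        (5, 4), (6, 1), (3, 1), (3, 3), (5, 3);
""")

# min(f.name) yields exactly one value per actor, so it is a legal sort key
# after GROUP BY; a bare f.name is not, because each group has several.
rows = conn.execute("""
    select a.id, a.name, group_concat(f.name) as films
    from actors a
    join actor_films af on a.id = af.actor_id
    join films f on af.film_id = f.id
    group by a.id
    order by min(f.name), a.id
""").fetchall()

ordered_ids = [row[0] for row in rows]
print(ordered_ids)
```

The ids come back as 3, 6, 5, 1, 2, 4, 7, which is the order asked for in the question.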

Related

group by after inner join not working as expected

I have two tables: a raw user information table called user_raw (a row per user in a company) and a separate user details table called user_details that holds the unique details and values in the user_raw.detail JSON column.
create table user_details (
id numeric
,key text
,value text
);
create table user_raw (
id numeric
,amount numeric
,detail jsonb
);
insert into user_details values
(1, 'job', 'doctor'),
(1, 'job', 'police'),
(1, 'name', 'John'),
(1, 'name', 'Angela');
insert into user_raw values
(1, 500, '{"job": "doctor", "name": "John"}'::jsonb),
(1, 238, '{"job": "police", "name": "John"}'::jsonb),
(1, 486, '{"job": "police", "name": "Angela"}'::jsonb);
So, user_raw looks like:
id | amount | detail
---+--------+-------------------------------------
1 | 500 | {"job": "doctor", "name": "John"}
1 | 238 | {"job": "police", "name": "John"}
1 | 486 | {"job": "police", "name": "Angela"}
(3 rows)
and user_details like:
id | key | value
---+------+--------
1 | job | doctor
1 | job | police
1 | name | John
1 | name | Angela
The IDs are all meant to be the same.
I want to produce a summary table that sums all the amounts per distinct user detail in the user_details.value column, i.e.
id | key | value | sum
---+------+--------+------
1 | job | police | 724
1 | name | John | 738
1 | name | Angela | 486
1 | job | doctor | 500
I tried to do this by the query:
select r.id
,d.key
,d.value
,sum(r.amount)
from user_raw as r
inner join user_details as d
on d.id = r.id
group by 1, 2, 3;
but that just summarizes the whole user_raw.amount column.
How would I produce the desired table? Thanks!
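Join the tables on the id and on the JSON value stored under each key (r.detail ->> d.key extracts that value as text), then aggregate: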
SELECT d.id, d.key, d.value,
SUM(r.amount)
FROM user_details d INNER JOIN user_raw r
ON r.detail ->> d.key = d.value AND r.id = d.id
GROUP BY d.id, d.key, d.value;
See the demo.
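As a rough check of that join condition, here is a sketch on the question's data using Python's sqlite3 module. It assumes an SQLite build that includes the JSON functions, and it uses json_extract(r.detail, '$.' || d.key) as a stand-in for Postgres's r.detail ->> d.key; the detail column is stored as plain text here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table user_details (id numeric, key text, value text);
    create table user_raw (id numeric, amount numeric, detail text);
    insert into user_details values
        (1, 'job', 'doctor'), (1, 'job', 'police'),
        (1, 'name', 'John'), (1, 'name', 'Angela');
    insert into user_raw values
        (1, 500, '{"job": "doctor", "name": "John"}'),
        (1, 238, '{"job": "police", "name": "John"}'),
        (1, 486, '{"job": "police", "name": "Angela"}');
""")

# A raw row matches a detail row only when the JSON value stored under
# d.key equals d.value, so each amount is counted once per matching detail.
rows = conn.execute("""
    select d.id, d.key, d.value, sum(r.amount) as total
    from user_details d
    join user_raw r
      on r.id = d.id
     and json_extract(r.detail, '$.' || d.key) = d.value
    group by d.id, d.key, d.value
""").fetchall()

totals = {(key, value): total for _id, key, value, total in rows}
print(totals)
```

Joining on the id alone pairs every raw row with every detail row, which is why the original query returned the grand total for each detail.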

How can I choose which column do I refer to?

I have 2 tables with some duplicate columns. I need to join them without picking which columns I want to select:
CREATE TABLE IF NOT EXISTS animals (
id int(6) unsigned NOT NULL,
cond varchar(200) NOT NULL,
animal varchar(200) NOT NULL,
PRIMARY KEY (id)
) DEFAULT CHARSET=utf8;
INSERT INTO animals (id, cond, animal) VALUES
('1', 'fat', 'cat'),
('2', 'slim', 'cat'),
('3', 'fat', 'dog'),
('4', 'slim', 'dog'),
('5', 'normal', 'dog');
CREATE TABLE IF NOT EXISTS names (
id int(6) unsigned NOT NULL,
name varchar(200) NOT NULL,
animal varchar(200) NOT NULL,
PRIMARY KEY (id)
) DEFAULT CHARSET=utf8;
INSERT INTO names (id, name, animal) VALUES
('1', 'LuLu', 'cat'),
('2', 'DoDo', 'cat'),
('3', 'Jack', 'dog'),
('4', 'Shorty', 'dog'),
('5', 'Stinky', 'dog');
SELECT *
FROM animals AS a
JOIN names as n
ON a.id = n.id;
Result:
| id | cond | animal | id | name | animal |
| --- | ------ | ------ | --- | ------ | ------ |
| 1 | fat | cat | 1 | LuLu | cat |
| 2 | slim | cat | 2 | DoDo | cat |
| 3 | fat | dog | 3 | Jack | dog |
| 4 | slim | dog | 4 | Shorty | dog |
| 5 | normal | dog | 5 | Stinky | dog |
But when I try to make another request from the resulting table like:
SELECT name
FROM
(
SELECT *
FROM animals AS a
JOIN names as n
ON a.id = n.id
) as res_tbl
WHERE name = 'LuLu';
I get:
Query Error: Error: ER_DUP_FIELDNAME: Duplicate column name 'id'
Is there any way of avoiding it except removing duplicate columns from the 1st request?
P.S. I am actually using PostgreSQL; I wrote the schema in MySQL syntax because I am more used to it.
You have columns with the same name in both tables, which causes ambiguity.
If you just want the name column in the outer query, then select that column only in the subquery:
select name
from (
select n.name
from animals a
inner join names n using (id)
) t
where ...
If you want more columns, then you would typically alias the homonym columns to remove the ambiguity - as for the joining column (here, id), the using() syntax is sufficient. So, for example:
select ...
from (
select id, a.cond, a.animal as animal1, n.name, n.animal as animal2
from animals a
inner join names n using (id)
) t
where ...
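A small sketch of the aliasing approach on the question's data, run through Python's sqlite3 module so it is self-contained. The explicit "as cond" and "as name" aliases are added only to make the derived table's column names unambiguous in this sketch; animal1 and animal2 are the aliases from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table animals (id integer primary key, cond text, animal text);
    create table names (id integer primary key, name text, animal text);
    insert into animals values (1, 'fat', 'cat'), (2, 'slim', 'cat'),
        (3, 'fat', 'dog'), (4, 'slim', 'dog'), (5, 'normal', 'dog');
    insert into names values (1, 'LuLu', 'cat'), (2, 'DoDo', 'cat'),
        (3, 'Jack', 'dog'), (4, 'Shorty', 'dog'), (5, 'Stinky', 'dog');
""")

# Every column of the derived table t has a unique name: id is merged by
# using (id), and the two animal columns are aliased apart.
rows = conn.execute("""
    select id, cond, animal1, name, animal2
    from (
        select id, a.cond as cond, a.animal as animal1,
               n.name as name, n.animal as animal2
        from animals a
        inner join names n using (id)
    ) t
    where name = 'LuLu'
""").fetchall()
print(rows)
```

Because the derived table no longer exposes two columns with the same name, the outer query can refer to any of them without ambiguity.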
You may also select the records themselves, instead of the columns from them, which you can then access in an outer query using the (record).column syntax. The parentheses are required in PostgreSQL so that the record name is not mistaken for a table name:
SELECT (t.a).cond AS animal_cond,
(t.n).name AS animal_name
FROM (
SELECT a, n
FROM animals AS a
JOIN names AS n
ON a.id = n.id
) t

Counting different occurrences

I am studying SQL using PostgreSQL and I have a question about counting the number of distinct values of one column with respect to another.
I suppose this is not the typical COUNT and GROUP BY problem because I cannot find any help or reference for my problem, so I will better explain what I would like to do (if possible) with a short example.
Suppose I have the following table:
CREATE TABLE MYTABLE
(
id INTEGER NOT NULL,
genre VARCHAR(20) NOT NULL,
country VARCHAR(20) NOT NULL,
CONSTRAINT PK_MOVIE PRIMARY KEY (id)
);
INSERT INTO MYTABLE VALUES (1, 'Horror', 'EEUU');
INSERT INTO MYTABLE VALUES (2, 'Drama', 'EEEU');
INSERT INTO MYTABLE VALUES (3, 'Drama', 'Italy');
INSERT INTO MYTABLE VALUES (4, 'Horror', 'UK');
INSERT INTO MYTABLE VALUES (5, 'Drama', 'EEEU');
INSERT INTO MYTABLE VALUES (6, 'Drama', 'EEEU');
So MYTABLE looks like this:
id | genre | country
----+--------+---------
1 | Horror | EEUU
2 | Drama | EEEU
3 | Drama | Italy
4 | Horror | UK
5 | Drama | EEEU
6 | Drama | EEEU
I can now count how many times the value of country is repeated for each value of genre with the following query:
select distinct count(*), m.genre, m.country
FROM MYTABLE m
GROUP BY m.genre, m.country;
which returns:
count | genre | country
-------+--------+---------
3 | Drama | EEEU
1 | Horror | EEUU
1 | Horror | UK
1 | Drama | Italy
(4 rows)
But how could I obtain how many different values of country I have for each genre? In other words, I would like to obtain a table like this:
genre | different_countries
--------+------------------
Horror | 2
Drama | 2
Does such a query exist?
You want count(distinct):
select m.genre, count(distinct m.country)
from MYTABLE m
group by m.genre;
As for your query, you almost never need select distinct together with group by, and you do not need it here: group by already returns a single row per combination of the group by keys.
You may also want to use a subquery:
select count(1), t1.genre from (
select distinct country, genre
from MYTABLE) as t1
group by t1.genre
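Both approaches can be checked side by side. The following is a minimal sketch on the question's data using Python's sqlite3 module; the variable names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table mytable (
        id integer primary key,
        genre text not null,
        country text not null
    );
    insert into mytable values
        (1, 'Horror', 'EEUU'), (2, 'Drama', 'EEEU'), (3, 'Drama', 'Italy'),
        (4, 'Horror', 'UK'), (5, 'Drama', 'EEEU'), (6, 'Drama', 'EEEU');
""")

# count(distinct ...) counts each country once per genre.
distinct_counts = dict(conn.execute("""
    select m.genre, count(distinct m.country)
    from mytable m
    group by m.genre
""").fetchall())

# Equivalent form: deduplicate (country, genre) pairs first, then count rows.
subquery_counts = dict(conn.execute("""
    select t1.genre, count(1)
    from (select distinct country, genre from mytable) as t1
    group by t1.genre
""").fetchall())

print(distinct_counts, subquery_counts)
```

Both queries report two countries for Drama and two for Horror; count(distinct) is the shorter way to write it.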

SQL Joins with NOT IN displays incorrect data

I have 3 tables as below, and I need the data where Expense.Expense_Code is not present in Income.Income_Code.
Table: Base
+----+-----------+----------------+
| ID | Reference | Reference_Name |
+----+-----------+----------------+
| 1 | 10000 | AAAA |
| 2 | 10001 | BBBB |
| 3 | 10002 | CCCC |
+----+-----------+----------------+
Table: Expense
+-----+---------+--------------+----------------+
| EID | BASE_ID | Expense_Code | Expense_Amount |
+-----+---------+--------------+----------------+
| 1 | 1 | I0001 | 25 |
| 2 | 1 | I0002 | 50 |
| 3 | 2 | I0003 | 75 |
+-----+---------+--------------+----------------+
Table: Income
+------+---------+-------------+------------+
| I_ID | BASE_ID | Income_Code | Income_Amt |
+------+---------+-------------+------------+
| 1 | 1 | I0001 | 10 |
| 2 | 1 | I0002 | 20 |
| 3 | 1 | I0003 | 30 |
+------+---------+-------------+------------+
SELECT DISTINCT Base.Reference,Expense.Expense_Code
FROM Base
JOIN Expense ON Base.ID = Expense.BASE_ID
JOIN Income ON Base.ID = Income.BASE_ID
WHERE Expense.Expense_Code IN ('I0001','I0002')
AND Income.Income_Code NOT IN ('I0001','I0002')
I expect no data to be returned.
However I am getting the result as below:
+-----------+--------------+
| REFERENCE | Expense_Code |
+-----------+--------------+
| 10000 | I0001 |
| 10000 | I0002 |
+-----------+--------------+
For Base.Reference 10000 and Expense.Expense_Code 'I0001' and 'I0002', the same codes are available in the Income table, therefore I should not get any data.
Am I doing something wrong with the joins?
Thanks in advance for your help!
You are not joining the EXPENSE and INCOME tables to each other in your query at all; they are only related through BASE. There needs to be a condition comparing the two codes in order to get the desired result. You can use a NOT EXISTS clause for this. Prefer NOT EXISTS over NOT IN when the compared columns can contain NULLs: NOT IN returns no rows at all if the subquery yields a NULL, whereas NOT EXISTS is unaffected.
SELECT * FROM BASE B
JOIN EXPENSE E ON B.ID = E.BASE_ID
WHERE NOT EXISTS (SELECT 1 FROM INCOME I WHERE I.BASE_ID = E.BASE_ID AND I.INCOME_CODE = E.EXPENSE_CODE)
When the first join is performed, you end up with two rows having ID 1, because the relationship between the tables is not one-to-one: every row of the first table is joined to each matching row of the second table. Like so:
(Image: output of the first join statement)
Then, when the second join is executed, the DBMS finds two rows with ID 1 in the first joined result (BASE + EXPENSE) and three in the third table (INCOME).
Again, since the relationship is not one-to-one, every row of the first joined result is paired with every matching row of INCOME, like so:
(Image: output of the second join statement)
Finally, the WHERE clause is applied to those rows and outputs what you see. I highlighted the rows excluded by the WHERE clause:
(Image: output of the WHERE clause)
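The row counts described in these steps can be reproduced with a short sketch on the question's original data, run through Python's sqlite3 module. The column types are simplified and the variable names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table base (id int, reference int, reference_name text);
    create table expense (eid int, base_id int, expense_code text, expense_amount int);
    create table income (i_id int, base_id int, income_code text, income_amt int);
    insert into base values (1, 10000, 'AAAA'), (2, 10001, 'BBBB'), (3, 10002, 'CCCC');
    insert into expense values (1, 1, 'I0001', 25), (2, 1, 'I0002', 50), (3, 2, 'I0003', 75);
    insert into income values (1, 1, 'I0001', 10), (2, 1, 'I0002', 20), (3, 1, 'I0003', 30);
""")

# Step 1: BASE joined to EXPENSE (two of these rows have base id 1).
first_join = conn.execute("""
    select count(*) from base b
    join expense e on b.id = e.base_id
""").fetchone()[0]

# Step 2: each base-1 expense row is paired with all three income rows.
second_join = conn.execute("""
    select count(*) from base b
    join expense e on b.id = e.base_id
    join income i on b.id = i.base_id
""").fetchone()[0]

# Step 3: the WHERE clause filters individual pairs, so the pairs whose
# income code is I0003 survive.
survivors = conn.execute("""
    select b.reference, e.expense_code, i.income_code from base b
    join expense e on b.id = e.base_id
    join income i on b.id = i.base_id
    where e.expense_code in ('I0001', 'I0002')
      and i.income_code not in ('I0001', 'I0002')
    order by e.expense_code
""").fetchall()

print(first_join, second_join, survivors)
```

The two surviving rows are the I0001 and I0002 expenses, each paired with the I0003 income row, which is exactly the output the question reports.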
...I need data where Expense.Expense_Code should not be available in Income.Income_Code
The following query will retrieve this data:
select b.*, e.*
from base b
join expense e on e.base_id = b.id
left join income i on i.base_id = e.base_id
and e.expense_code = i.income_code
where i.i_id is null
For reference the data script (slightly modified) is:
create table base (
id number(6),
reference number(6),
reference_name varchar2(10)
);
insert into base (id, reference, reference_name) values (1, 10000, 'AAAA');
insert into base (id, reference, reference_name) values (2, 10001, 'BBBB');
insert into base (id, reference, reference_name) values (3, 10002, 'CCCC');
create table expense (
eid number(6),
base_id number(6),
expense_code varchar2(10),
expense_amount number(6)
);
insert into expense (eid, base_id, expense_code, expense_amount) values (1, 1, 'I0001', 25);
insert into expense (eid, base_id, expense_code, expense_amount) values (2, 1, 'I0002', 50);
insert into expense (eid, base_id, expense_code, expense_amount) values (3, 1, 'I0003', 75);
insert into expense (eid, base_id, expense_code, expense_amount) values (4, 2, 'I0004', 101);
create table income (
i_id number(6),
base_id number(6),
income_code varchar2(10),
income_amt number(6)
);
insert into income (i_id, base_id, income_code, income_amt) values (1, 1, 'I0001', 10);
insert into income (i_id, base_id, income_code, income_amt) values (2, 1, 'I0002', 20);
insert into income (i_id, base_id, income_code, income_amt) values (3, 1, 'I0003', 30);
Result:
ID REFERENCE REFERENCE_NAME EID BASE_ID EXPENSE_CODE EXPENSE_AMOUNT
-- --------- -------------- --- ------- ------------ --------------
2 10,001 BBBB 4 2 I0004 101
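For comparison, here is a sketch that runs both anti-join forms, LEFT JOIN ... IS NULL and NOT EXISTS, on the modified data above, using Python's sqlite3 module with simplified column types. Matching on both base_id and the code is the assumption carried over from the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table base (id int, reference int, reference_name text);
    create table expense (eid int, base_id int, expense_code text, expense_amount int);
    create table income (i_id int, base_id int, income_code text, income_amt int);
    insert into base values (1, 10000, 'AAAA'), (2, 10001, 'BBBB'), (3, 10002, 'CCCC');
    insert into expense values (1, 1, 'I0001', 25), (2, 1, 'I0002', 50),
                               (3, 1, 'I0003', 75), (4, 2, 'I0004', 101);
    insert into income values (1, 1, 'I0001', 10), (2, 1, 'I0002', 20), (3, 1, 'I0003', 30);
""")

# Anti-join via LEFT JOIN: keep expense rows that found no income partner.
left_join_rows = conn.execute("""
    select b.reference, e.eid, e.expense_code
    from base b
    join expense e on e.base_id = b.id
    left join income i on i.base_id = e.base_id
                      and e.expense_code = i.income_code
    where i.i_id is null
""").fetchall()

# The same anti-join written with NOT EXISTS.
not_exists_rows = conn.execute("""
    select b.reference, e.eid, e.expense_code
    from base b
    join expense e on e.base_id = b.id
    where not exists (
        select 1 from income i
        where i.base_id = e.base_id
          and i.income_code = e.expense_code
    )
""").fetchall()

print(left_join_rows, not_exists_rows)
```

Both forms return only the I0004 expense of reference 10001, matching the result shown above.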

SELECT check the column of the max row

Here are the rows returned by my first select:
SELECT
user.id, analytic_youtube_demographic.age,
analytic_youtube_demographic.percent
FROM
`user`
INNER JOIN
analytic ON analytic.user_id = user.id
INNER JOIN
analytic_youtube_demographic ON analytic_youtube_demographic.analytic_id = analytic.id
Result:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |13-17| 19.6 |
| 1 |18-24| 38.4 |
| 1 |25-34| 22.5 |
| 1 |35-44| 11.5 |
| 1 |45-54| 5.3 |
| 1 |55-64| 1.6 |
| 1 |65+ | 1.2 |
| 2 |13-17| 10 |
| 2 |18-24| 10 |
| 2 |25-34| 25 |
| 2 |35-44| 5 |
| 2 |45-54| 25 |
| 2 |55-64| 5 |
| 2 |65+ | 20 |
---------------------------
The max value by user_id:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |18-24| 38.4 |
| 2 |45-54| 25 |
| 2 |25-34| 25 |
---------------------------
And I need to keep only the rows where Age is in ['25-34', '65+'].
At the end I must have:
-----------
| id |
|----------
| 2 |
-----------
Thanks a lot for your help.
I have tried to use MAX(analytic_youtube_demographic.percent), but I don't know how to filter by age as well.
You can use the rank() function to identify the largest percentage values within each user's data set, and then a simple WHERE clause to get those entries that are both of the highest rank and belong to one of the specific demographics you're interested in. Since you can't use window functions like rank() in a WHERE clause, this is a two-step process with a subquery or a CTE. Something like this ought to do it:
-- Sample data from the question:
create table [user] (id bigint);
insert [user] values
(1), (2);
create table analytic (id bigint, [user_id] bigint);
insert analytic values
(1, 1), (2, 2);
create table analytic_youtube_demographic (analytic_id bigint, age varchar(32), [percent] decimal(5, 2));
insert analytic_youtube_demographic values
(1, '13-17', 19.6),
(1, '18-24', 38.4),
(1, '25-34', 22.5),
(1, '35-44', 11.5),
(1, '45-54', 5.3),
(1, '55-64', 1.6),
(1, '65+', 1.2),
(2, '13-17', 10),
(2, '18-24', 10),
(2, '25-34', 25),
(2, '35-44', 5),
(2, '45-54', 25),
(2, '55-64', 5),
(2, '65+', 20);
-- First, within the set of records for each user.id, use the rank() function to
-- identify the demographics with the highest percentage.
with RankedDataCTE as
(
select
[user].id,
youtube.age,
youtube.[percent],
[rank] = rank() over (partition by [user].id order by youtube.[percent] desc)
from
[user]
inner join analytic on analytic.[user_id] = [user].id
inner join analytic_youtube_demographic youtube on youtube.analytic_id = analytic.id
)
-- Now select only those records that are (a) of the highest rank within their
-- user.id and (b) either the '25-34' or the '65+' age group.
select
id,
age,
[percent]
from
RankedDataCTE
where
[rank] = 1 and
age in ('25-34', '65+');
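The script above uses SQL Server syntax (bracketed identifiers, [rank] = ...). As a portable sketch of the same two-step idea, here is the query in standard syntax run through Python's sqlite3 module. It assumes SQLite 3.25 or later for window functions, and the CTE name ranked and the column alias rnk are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table "user" (id integer);
    insert into "user" values (1), (2);
    create table analytic (id integer, user_id integer);
    insert into analytic values (1, 1), (2, 2);
    create table analytic_youtube_demographic (analytic_id integer, age text, percent real);
    insert into analytic_youtube_demographic values
        (1, '13-17', 19.6), (1, '18-24', 38.4), (1, '25-34', 22.5),
        (1, '35-44', 11.5), (1, '45-54', 5.3), (1, '55-64', 1.6), (1, '65+', 1.2),
        (2, '13-17', 10), (2, '18-24', 10), (2, '25-34', 25),
        (2, '35-44', 5), (2, '45-54', 25), (2, '55-64', 5), (2, '65+', 20);
""")

# Step 1 (CTE): rank each user's demographics by percent; ties share rank 1.
# Step 2: keep top-ranked rows that fall in the wanted age groups.
rows = conn.execute("""
    with ranked as (
        select u.id, y.age, y.percent,
               rank() over (partition by u.id order by y.percent desc) as rnk
        from "user" u
        join analytic a on a.user_id = u.id
        join analytic_youtube_demographic y on y.analytic_id = a.id
    )
    select id, age
    from ranked
    where rnk = 1 and age in ('25-34', '65+')
""").fetchall()
print(rows)
```

Only user 2 is returned: its top percentage (25) is shared by the 25-34 and 45-54 groups, and 25-34 is one of the requested ages, while user 1's top group is 18-24.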