group by after inner join not working as expected - sql

I have two tables: a raw user information table called user_raw (a row per user in a company) and a separate user details table called user_details that holds the distinct keys and values found in the user_raw.detail JSON column.
create table user_details (
id numeric
,key text
,value text
);
create table user_raw (
id numeric
,amount numeric
,detail jsonb
);
insert into user_details values
(1, 'job', 'doctor'),
(1, 'job', 'police'),
(1, 'name', 'John'),
(1, 'name', 'Angela');
insert into user_raw values
(1, 500, '{"job": "doctor", "name": "John"}'::jsonb),
(1, 238, '{"job": "police", "name": "John"}'::jsonb),
(1, 486, '{"job": "police", "name": "Angela"}'::jsonb);
So, user_raw looks like:
id | amount | detail
---+--------+-------------------------------------
1 | 500 | {"job": "doctor", "name": "John"}
1 | 238 | {"job": "police", "name": "John"}
1 | 486 | {"job": "police", "name": "Angela"}
(3 rows)
and user_details like:
id | key | value
---+------+--------
1 | job | doctor
1 | job | police
1 | name | John
1 | name | Angela
The IDs are all meant to be the same.
I want to produce a summary table that sums all the amounts per distinct user detail in the user_details.value column, i.e.
id | key | value | sum
---+------+--------+------
1 | job | police | 724
1 | name | John | 738
1 | name | Angela | 486
1 | job | doctor | 500
I tried to do this by the query:
select r.id
,d.key
,d.value
,sum(r.amount)
from user_raw as r
inner join user_details as d
on d.id = r.id
group by 1, 2, 3;
but that just summarizes the whole user_raw.amount column.
How would I produce the desired table? Thanks!

Join the tables on both the id and the JSON key/value pair, then aggregate:
SELECT d.id, d.key, d.value,
SUM(r.amount)
FROM user_details d INNER JOIN user_raw r
ON r.detail ->> d.key = d.value AND r.id = d.id
GROUP BY d.id, d.key, d.value;
See the demo.
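To make the join condition concrete, here is a minimal Python sketch of the same logic outside the database. It is only an illustration: the function name sum_per_detail is made up for this example, and the rows are copied from the question. A raw row contributes to a detail row when the ids match and the JSON value stored under the detail's key equals the detail's value, which is what `r.detail ->> d.key = d.value` expresses.

```python
import json
from collections import defaultdict

# Sample rows copied from the question.
user_details = [
    (1, "job", "doctor"),
    (1, "job", "police"),
    (1, "name", "John"),
    (1, "name", "Angela"),
]
user_raw = [
    (1, 500, '{"job": "doctor", "name": "John"}'),
    (1, 238, '{"job": "police", "name": "John"}'),
    (1, 486, '{"job": "police", "name": "Angela"}'),
]


def sum_per_detail(details, raw):
    """Sum amounts per (id, key, value), mimicking the join plus GROUP BY."""
    totals = defaultdict(int)
    for d_id, key, value in details:
        for r_id, amount, detail_json in raw:
            detail = json.loads(detail_json)
            # Same test as the ON clause: ids match and detail ->> key = value.
            if r_id == d_id and detail.get(key) == value:
                totals[(d_id, key, value)] += amount
    return dict(totals)


summary = sum_per_detail(user_details, user_raw)
for row, total in sorted(summary.items()):
    print(row, total)
```

Because each raw row matches one job detail and one name detail, its amount is counted once per key rather than once per detail row, which is what the plain id-only join got wrong.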

Related

SQL LEFT JOIN WITH SPLIT

I want to do a left join on two tables where the formats of the join columns are not the same. I use REPLACE to remove the "[ ]", but I'm having trouble splitting one of the rows into two rows so that I can complete the join.
emp_tbl
+--------+-------+
| emp    | state |
+--------+-------+
| Steve  | [1]   |
| Greg   | [2|3] |
| Steve  | [4]   |
+--------+-------+
state_tbl
+------+------+
| id   | name |
+------+------+
| 1    | AL   |
| 2    | NV   |
| 3    | AZ   |
| 4    | NH   |
+------+------+
Desired output:
+--------+------+
| Steve | AL |
| Greg | NV |
| Greg | AZ |
| Steve | NH |
+--------+------+
SELECT emp_tbl.emp, state_tbl.name
FROM emp_tbl
LEFT JOIN state_tbl on state_tbl.id = REPLACE(REPLACE(emp_tbl.state, '[', ''), ']', '')
With this query I can remove the "[ ]" and do the join, but the row with two "states" obviously does not work.
Your query will never produce 4 rows because the left table only has 3 rows. You need to flatten the rows that contain multiple state ids before the join.
Prepare the table and data:
create or replace table emp_tbl (emp varchar, state string);
create or replace table state_tbl (id varchar, name varchar);
insert into emp_tbl values
('Steve', '[1]'), ('Greg', '[2|3]'), ('Steve', '[4]');
insert into state_tbl values
(1, 'AL'), (2, 'NV'), (3, 'AZ'), (4, 'NH');
Then the query below should give you the data you want:
with emp_tbl_tmp as (
select emp, parse_json(replace(state, '|', ',')) as states from emp_tbl
),
flattened_tbl as (
select emp, value as state_id from emp_tbl_tmp, table(flatten(input => states))
)
select emp, name from flattened_tbl emp
left join state_tbl state on (emp.state_id = state.id);
Or if you want to save one step:
with flattened_emp_tbl as (
select emp, value as state_id
from emp_tbl,
table(flatten(
input => parse_json(replace(state, '|', ','))
))
)
select emp, name from flattened_emp_tbl emp
left join state_tbl state
on (emp.state_id = state.id);
Here is how you can do it:
select tw.emp, state_tbl.name
from emp_tbl tw,
lateral flatten(input => split(replace(replace(tw.state, '[', ''), ']', ''), '|')) s
left join state_tbl on s.value::varchar = state_tbl.id
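The flatten-then-join idea can be sketched in a few lines of Python, which may help if the Snowflake syntax is unfamiliar. This is only an illustration: flatten_and_join is a made-up helper and state_tbl is modelled as a dict keyed by id. Each employee row is expanded into one row per state id before the lookup, and a missing id yields None, mirroring an unmatched LEFT JOIN.

```python
# Sample rows copied from the question; state_tbl is modelled as id -> name.
emp_tbl = [("Steve", "[1]"), ("Greg", "[2|3]"), ("Steve", "[4]")]
state_tbl = {"1": "AL", "2": "NV", "3": "AZ", "4": "NH"}


def flatten_and_join(emps, states):
    """Expand each '[a|b]' value into one row per id, then look up the name."""
    rows = []
    for emp, state in emps:
        # Drop the surrounding brackets, then split on the pipe separator.
        for state_id in state.strip("[]").split("|"):
            # dict.get returns None when there is no match, like a LEFT JOIN.
            rows.append((emp, states.get(state_id)))
    return rows


result = flatten_and_join(emp_tbl, state_tbl)
for row in result:
    print(row)
```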

ORDER BY value in join table not grouped before aggregation

I am trying to order a Postgres result set based on the values fed into an array_agg function.
I have the following query that works great:
select a.id, a.name, array_agg(f.name)
from actors a
join actor_films af on a.id = actor_id
join films f on film_id = f.id
group by a.id
order by a.id;
This gives me the following results, for example:
id | name | array_agg
----+--------+---------------------------------
1 | bob | {"delta force"}
2 | joe | {"delta force","the funny one"}
3 | fred | {"bad movie",AARRR}
4 | sally | {"the funny one"}
5 | suzzy | {"bad movie","delta force"}
6 | jill | {AARRR}
7 | victor | {"the funny one"}
I want to sort the results so that it is sorted alphabetically by Film name. For example, the final order should be:
id | name | array_agg
----+--------+---------------------------------
3 | fred | {"bad movie",AARRR}
6 | jill | {AARRR}
5 | suzzy | {"bad movie","delta force"}
1 | bob | {"delta force"}
2 | joe | {"delta force","the funny one"}
4 | sally | {"the funny one"}
7 | victor | {"the funny one"}
This is based on the alphabetical name of any movies they are in. When I add the ORDER BY f.name I get the following error:
ERROR: column "f.name" must appear in the GROUP BY clause or be used in an aggregate function
I cannot add it to the group, because I need it aggregated in the array, and I want to sort pre-aggregation, such that I can get the following order. Is this possible?
If you would like to reproduce this example, here is the setup code:
create table actors(id serial primary key, name text);
create table films(id serial primary key, name text);
create table actor_films(actor_id int references actors (id), film_id int references films (id));
insert into actors (name) values('bob'), ('joe'), ('fred'), ('sally'), ('suzzy'), ('jill'), ('victor');
insert into films (name) values('AARRR'), ('the funny one'), ('bad movie'), ('delta force');
insert into actor_films(actor_id, film_id) values (2, 2), (7, 2), (4,2), (2, 4), (1, 4), (5, 4), (6, 1), (3, 1), (3, 3), (5, 3);
And the final query with the error:
select a.id, a.name, array_agg(f.name)
from actors a
join actor_films af on a.id = actor_id
join films f on film_id = f.id
group by a.id
order by f.name, a.id;
You can use an aggregation function:
order by min(f.name), a.id
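A quick way to check this is to run it against the question's setup data. The sketch below uses Python's sqlite3 module rather than Postgres, so it makes two substitutions: group_concat stands in for array_agg, and the ids come from integer primary key instead of serial. The ordering expression itself is unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table actors(id integer primary key, name text);
create table films(id integer primary key, name text);
create table actor_films(actor_id int references actors (id),
                         film_id int references films (id));
insert into actors (name) values
  ('bob'), ('joe'), ('fred'), ('sally'), ('suzzy'), ('jill'), ('victor');
insert into films (name) values
  ('AARRR'), ('the funny one'), ('bad movie'), ('delta force');
insert into actor_films (actor_id, film_id) values
  (2, 2), (7, 2), (4, 2), (2, 4), (1, 4), (5, 4), (6, 1), (3, 1), (3, 3), (5, 3);
""")

# Sort each actor by the alphabetically first film they appear in.
rows = conn.execute("""
    select a.id, a.name, group_concat(f.name) as films
    from actors a
    join actor_films af on a.id = af.actor_id
    join films f on af.film_id = f.id
    group by a.id, a.name
    order by min(f.name), a.id
""").fetchall()

for row in rows:
    print(row)
```

The actors come back in the order 3, 6, 5, 1, 2, 4, 7, matching the desired output in the question. Ordering by an aggregate is allowed because it yields exactly one value per group, which is why min(f.name) works where the bare f.name does not.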

SQL Joins with NOT IN displays incorrect data

I have 3 tables as below, and I need the rows where Expense.Expense_Code is not available in Income.Income_Code.
Table: Base
+----+-----------+----------------+
| ID | Reference | Reference_Name |
+----+-----------+----------------+
| 1 | 10000 | AAAA |
| 2 | 10001 | BBBB |
| 3 | 10002 | CCCC |
+----+-----------+----------------+
Table: Expense
+-----+---------+--------------+----------------+
| EID | BASE_ID | Expense_Code | Expense_Amount |
+-----+---------+--------------+----------------+
| 1 | 1 | I0001 | 25 |
| 2 | 1 | I0002 | 50 |
| 3 | 2 | I0003 | 75 |
+-----+---------+--------------+----------------+
Table: Income
+------+---------+-------------+------------+
| I_ID | BASE_ID | Income_Code | Income_Amt |
+------+---------+-------------+------------+
| 1 | 1 | I0001 | 10 |
| 2 | 1 | I0002 | 20 |
| 3 | 1 | I0003 | 30 |
+------+---------+-------------+------------+
SELECT DISTINCT Base.Reference,Expense.Expense_Code
FROM Base
JOIN Expense ON Base.ID = Expense.BASE_ID
JOIN Income ON Base.ID = Income.BASE_ID
WHERE Expense.Expense_Code IN ('I0001','I0002')
AND Income.Income_Code NOT IN ('I0001','I0002')
I expect no data to be returned.
However I am getting the result as below:
+-----------+--------------+
| REFERENCE | Expense_Code |
+-----------+--------------+
| 10000 | I0001 |
| 10000 | I0002 |
+-----------+--------------+
For Base.Reference (10000) and Expense.Expense_Code in ('I0001','I0002'), the same codes are available in the Income table, so I should not get any data.
Am I doing something wrong with the joins?
Thanks in advance for your help!
Your query never relates the EXPENSE and INCOME rows to each other; both are only joined to BASE. There needs to be a condition comparing these two tables in order to get the desired result. You can use a NOT EXISTS clause for this. Prefer NOT EXISTS over NOT IN when NULLs are allowed in the columns you compare: if the NOT IN subquery returns a NULL, the predicate is never true and no rows come back.
SELECT * FROM BASE B
JOIN EXPENSE E ON B.ID = E.BASE_ID
WHERE NOT EXISTS (SELECT 1 FROM INCOME I WHERE I.BASE_ID = E.BASE_ID AND I.INCOME_CODE = E.EXPENSE_CODE)
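The point about NULLs is easy to see with a tiny experiment. The sketch below uses Python's sqlite3 module and two deliberately simplified, hypothetical single-column tables (not the question's schema), with one NULL income code added on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified tables: one code column each.
create table expense (expense_code text);
create table income (income_code text);
insert into expense values ('I0001'), ('I0004');
insert into income values ('I0001'), (NULL);
""")

# NOT IN: 'I0004' NOT IN ('I0001', NULL) evaluates to NULL, so the row is dropped.
not_in_rows = conn.execute("""
    select expense_code from expense
    where expense_code not in (select income_code from income)
""").fetchall()

# NOT EXISTS: the NULL income row simply never matches, so 'I0004' is kept.
not_exists_rows = conn.execute("""
    select e.expense_code from expense e
    where not exists (select 1 from income i
                      where i.income_code = e.expense_code)
""").fetchall()

print("NOT IN:    ", not_in_rows)
print("NOT EXISTS:", not_exists_rows)
```

The NOT IN query returns nothing at all, while NOT EXISTS returns the one expense code that has no matching income code.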
When the first join is performed, you end up with two rows for ID 1, because the relationship between the tables is not one-to-one: every row of the first table is paired with each matching row of the second table. Like so:
[Image: output of the first join statement]
Then, when the second join is executed, the DBMS finds two rows with ID 1 in the first joined table (BASE + EXPENSE) and three in the third table (INCOME).
Again, since the relationship between the tables is not one-to-one, every row from the first joined table is paired with every matching row from the third table, like so:
[Image: output of the second join statement]
Finally, the WHERE clause is applied to those rows, which produces the output you see. I highlighted the rows excluded by the WHERE clause.
[Image: output of the WHERE clause]
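Since the screenshots are not reproduced here, the row counts at each step can be checked with a small script. This is a sketch using Python's sqlite3 module and the question's original data; the column types are simplified assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table base (id int, reference int, reference_name text);
create table expense (eid int, base_id int, expense_code text, expense_amount int);
create table income (i_id int, base_id int, income_code text, income_amt int);
insert into base values (1, 10000, 'AAAA'), (2, 10001, 'BBBB'), (3, 10002, 'CCCC');
insert into expense values (1, 1, 'I0001', 25), (2, 1, 'I0002', 50), (3, 2, 'I0003', 75);
insert into income values (1, 1, 'I0001', 10), (2, 1, 'I0002', 20), (3, 1, 'I0003', 30);
""")

# Step 1: BASE joined to EXPENSE only.
after_first_join = conn.execute(
    "select count(*) from base b join expense e on b.id = e.base_id"
).fetchone()[0]

# Step 2: add INCOME; each ID 1 expense row pairs with all three ID 1 income rows.
after_second_join = conn.execute("""
    select count(*) from base b
    join expense e on b.id = e.base_id
    join income i on b.id = i.base_id
""").fetchone()[0]

# Step 3: the WHERE clause keeps the pairs whose income code is I0003.
final_rows = conn.execute("""
    select distinct b.reference, e.expense_code
    from base b
    join expense e on b.id = e.base_id
    join income i on b.id = i.base_id
    where e.expense_code in ('I0001', 'I0002')
      and i.income_code not in ('I0001', 'I0002')
    order by e.expense_code
""").fetchall()

print(after_first_join, after_second_join, final_rows)
```

The counts are 3 rows after the first join and 6 after the second. The two surviving rows are the I0001 and I0002 expenses paired with the I0003 income, which is exactly the unexpected output in the question.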
...I need the rows where Expense.Expense_Code is not available in Income.Income_Code
The following query will retrieve this data:
select b.*, e.*
from base b
join expense e on e.base_id = b.id
left join income i on i.base_id = e.base_id
and e.expense_code = i.income_code
where i.i_id is null
For reference the data script (slightly modified) is:
create table base (
id number(6),
reference number(6),
reference_name varchar2(10)
);
insert into base (id, reference, reference_name) values (1, 10000, 'AAAA');
insert into base (id, reference, reference_name) values (2, 10001, 'BBBB');
insert into base (id, reference, reference_name) values (3, 10002, 'CCCC');
create table expense (
eid number(6),
base_id number(6),
expense_code varchar2(10),
expense_amount number(6)
);
insert into expense (eid, base_id, expense_code, expense_amount) values (1, 1, 'I0001', 25);
insert into expense (eid, base_id, expense_code, expense_amount) values (2, 1, 'I0002', 50);
insert into expense (eid, base_id, expense_code, expense_amount) values (3, 1, 'I0003', 75);
insert into expense (eid, base_id, expense_code, expense_amount) values (4, 2, 'I0004', 101);
create table income (
i_id number(6),
base_id number(6),
income_code varchar2(10),
income_amt number(6)
);
insert into income (i_id, base_id, income_code, income_amt) values (1, 1, 'I0001', 10);
insert into income (i_id, base_id, income_code, income_amt) values (2, 1, 'I0002', 20);
insert into income (i_id, base_id, income_code, income_amt) values (3, 1, 'I0003', 30);
Result:
ID REFERENCE REFERENCE_NAME EID BASE_ID EXPENSE_CODE EXPENSE_AMOUNT
-- --------- -------------- --- ------- ------------ --------------
2 10,001 BBBB 4 2 I0004 101
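If you do not have an Oracle instance at hand, the same anti-join can be tried with Python's sqlite3 module. This sketch loads the modified data script above, with number and varchar2 swapped for SQLite's int and text as an assumption, and selects only a few columns to keep the output short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table base (id int, reference int, reference_name text);
create table expense (eid int, base_id int, expense_code text, expense_amount int);
create table income (i_id int, base_id int, income_code text, income_amt int);
insert into base values (1, 10000, 'AAAA'), (2, 10001, 'BBBB'), (3, 10002, 'CCCC');
insert into expense values
  (1, 1, 'I0001', 25), (2, 1, 'I0002', 50), (3, 1, 'I0003', 75), (4, 2, 'I0004', 101);
insert into income values (1, 1, 'I0001', 10), (2, 1, 'I0002', 20), (3, 1, 'I0003', 30);
""")

# LEFT JOIN keeps every expense row; "i.i_id is null" keeps only those
# for which no income row with the same base_id and code was found.
unmatched = conn.execute("""
    select b.reference, e.eid, e.expense_code
    from base b
    join expense e on e.base_id = b.id
    left join income i on i.base_id = e.base_id
                      and e.expense_code = i.income_code
    where i.i_id is null
""").fetchall()

print(unmatched)
```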

SELECT check the column of the max row

Here are the rows returned by my first select:
SELECT
user.id, analytic_youtube_demographic.age,
analytic_youtube_demographic.percent
FROM
`user`
INNER JOIN
analytic ON analytic.user_id = user.id
INNER JOIN
analytic_youtube_demographic ON analytic_youtube_demographic.analytic_id = analytic.id
Result:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |13-17| 19.6 |
| 1 |18-24| 38.4 |
| 1 |25-34| 22.5 |
| 1 |35-44| 11.5 |
| 1 |45-54| 5.3 |
| 1 |55-64| 1.6 |
| 1 |65+ | 1.2 |
| 2 |13-17| 10 |
| 2 |18-24| 10 |
| 2 |25-34| 25 |
| 2 |35-44| 5 |
| 2 |45-54| 25 |
| 2 |55-64| 5 |
| 2 |65+ | 20 |
---------------------------
The max value by user_id:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |18-24| 38.4 |
| 2 |45-54| 25 |
| 2 |25-34| 25 |
---------------------------
And I need to filter Age in ['25-34', '65+']
I must have at the end :
-----------
| id |
|----------
| 2 |
-----------
Thanks a lot for your help.
I have tried to use MAX(analytic_youtube_demographic.percent), but I don't know how to filter by age as well.
You can use the rank() function to identify the largest percentage values within each user's data set, and then a simple WHERE clause to get those entries that are both of the highest rank and belong to one of the specific demographics you're interested in. Since you can't use windowed functions like rank() in a WHERE clause, this is a two-step process with a subquery or a CTE. Something like this ought to do it:
-- Sample data from the question:
create table [user] (id bigint);
insert [user] values
(1), (2);
create table analytic (id bigint, [user_id] bigint);
insert analytic values
(1, 1), (2, 2);
create table analytic_youtube_demographic (analytic_id bigint, age varchar(32), [percent] decimal(5, 2));
insert analytic_youtube_demographic values
(1, '13-17', 19.6),
(1, '18-24', 38.4),
(1, '25-34', 22.5),
(1, '35-44', 11.5),
(1, '45-54', 5.3),
(1, '55-64', 1.6),
(1, '65+', 1.2),
(2, '13-17', 10),
(2, '18-24', 10),
(2, '25-34', 25),
(2, '35-44', 5),
(2, '45-54', 25),
(2, '55-64', 5),
(2, '65+', 20);
-- First, within the set of records for each user.id, use the rank() function to
-- identify the demographics with the highest percentage.
with RankedDataCTE as
(
select
[user].id,
youtube.age,
youtube.[percent],
[rank] = rank() over (partition by [user].id order by youtube.[percent] desc)
from
[user]
inner join analytic on analytic.[user_id] = [user].id
inner join analytic_youtube_demographic youtube on youtube.analytic_id = analytic.id
)
-- Now select only those records that are (a) of the highest rank within their
-- user.id and (b) either the '25-34' or the '65+' age group.
select
id,
age,
[percent]
from
RankedDataCTE
where
[rank] = 1 and
age in ('25-34', '65+');
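The same two-step idea (rank inside a CTE, filter outside) can be tried without SQL Server. The sketch below uses Python's sqlite3 module, which needs SQLite 3.25 or newer for window functions. The bracket quoting is replaced by plain names, and the CTE name ranked and the column name rnk are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table "user" (id integer);
create table analytic (id integer, user_id integer);
create table analytic_youtube_demographic (analytic_id integer, age text, percent real);
insert into "user" values (1), (2);
insert into analytic values (1, 1), (2, 2);
insert into analytic_youtube_demographic values
  (1, '13-17', 19.6), (1, '18-24', 38.4), (1, '25-34', 22.5), (1, '35-44', 11.5),
  (1, '45-54', 5.3), (1, '55-64', 1.6), (1, '65+', 1.2),
  (2, '13-17', 10), (2, '18-24', 10), (2, '25-34', 25), (2, '35-44', 5),
  (2, '45-54', 25), (2, '55-64', 5), (2, '65+', 20);
""")

rows = conn.execute("""
    with ranked as (
        select u.id, y.age,
               -- rank() gives tied percentages the same rank, so both of
               -- user 2's 25% groups get rank 1.
               rank() over (partition by u.id order by y.percent desc) as rnk
        from "user" u
        join analytic a on a.user_id = u.id
        join analytic_youtube_demographic y on y.analytic_id = a.id
    )
    select id, age from ranked
    where rnk = 1 and age in ('25-34', '65+')
""").fetchall()

print(rows)
```

Only user 2 is returned: user 1's top group is 18-24, which is not in the filter list. Using rank() rather than row_number() matters here, because row_number() would arbitrarily keep just one of the tied groups.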

Select Ticket records from one table that are associated with a Customer or the Customers children in another table

Let's say that I have a table containing all of my Customer records.
Each record has a unique ID, a name and possibly a parent record ID.
(In case it makes a difference a parent can have multiple children but children can only have one parent. There's also no grandfather records, so a parent may not have a parent and children may not have children)
Customers
+-----+------------+----------+
| ID | Name | ParentID |
+-----+------------+----------+
| 100 | Customer A | |
| 101 | Customer B | |
| 102 | Customer C | 100 |
| 103 | Customer D | 100 |
| 104 | Customer E | 101 |
+-----+------------+----------+
As you can see from this example I have 5 unique Customer records, with C & D being children of A and E a child of B.
Now I have a table containing all of the Tickets these Customers raise.
Each ticket has a unique ID, a description and a parent customer ID.
Tickets
+-----+-------------+----------+
| ID | Description | ParentID |
+-----+-------------+----------+
| 500 | Ticket A | 100 |
| 501 | Ticket B | 100 |
| 502 | Ticket C | 102 |
| 503 | Ticket D | 102 |
| 504 | Ticket E | 103 |
| 505 | Ticket F | 101 |
| 506 | Ticket G | 104 |
| 507 | Ticket H | 101 |
+-----+-------------+----------+
Goal
I will have been given a Customer ID and need to select all Tickets belonging to this record.
If the record has children I also need the tickets belonging to these records.
If the record is a child I'm not interested in its parent.
Example 1
I'm given the ID 100. This is Customer A and has two children, C & D.
As the results of my select I would expect the following:
Ticket A - Directly belongs to ID 100
Ticket B - Directly belongs to ID 100
Ticket C - Belongs to ID 102, a child of 100
Ticket D - Belongs to ID 102, a child of 100
Ticket E - Belongs to ID 103, a child of 100
Example 2
I'm given ID 104. This is Customer E, a child record.
As the results of my select I would expect the following:
Ticket G - Directly belongs to ID 104
I would not expect anything further as the record is a child and therefore has no children and I'm not looking upwards at parent records.
Where I'm stuck...
Getting Tickets belonging to one ID is easy:
SELECT
tickets.Description
FROM
Tickets AS tickets
LEFT JOIN
Customers AS customers ON
tickets.ParentID = customers.ID
WHERE
customers.ID = 100
I'm stuck getting the Tickets belonging to children.
It seems like I'd first have to request the Customer belonging to the given ID, then fetch all child Customers where the ParentID matched the given ID, then finally request Tickets belonging to any of these records.
Unfortunately I haven't got the faintest idea where to start and require some help.
In case it's relevant I'm using SQL Server 2008 R2.
You probably need to use a recursive common table expression to iterate through the ancestry and get all related records:
DECLARE @CustomerID INT = 100;
-- SAMPLE DATA FOR CUSTOMERS
DECLARE @Customers TABLE (ID INT, Name VARCHAR(255), ParentID INT);
INSERT @Customers (ID, Name, ParentID)
VALUES
(100, 'Customer A', NULL),
(101, 'Customer B', NULL),
(102, 'Customer C', 100),
(103, 'Customer D', 100),
(104, 'Customer E', 101);
-- SAMPLE DATA FOR TICKETS
DECLARE @Tickets TABLE (ID INT, Name VARCHAR(255), ParentID INT);
INSERT @Tickets (ID, Name, ParentID)
VALUES
(500, 'Ticket A', 100),
(501, 'Ticket B', 100),
(502, 'Ticket C', 102),
(503, 'Ticket D', 102),
(504, 'Ticket E', 103),
(505, 'Ticket F', 101),
(506, 'Ticket G', 104),
(507, 'Ticket H', 101);
-- USE RECURSIVE CTE TO LOOP THROUGH HIERARCHY AND GET ALL ANCESTORS
WITH RecursiveCustomers AS
( SELECT c.ID, c.Name, c.ParentID
FROM @Customers AS c
UNION ALL
SELECT rc.ID, rc.Name, c.ParentID
FROM RecursiveCustomers AS rc
INNER JOIN @Customers AS c
ON rc.ParentID = c.ID
)
SELECT t.ID, t.Name, t.ParentID
FROM @Tickets AS t
INNER JOIN RecursiveCustomers AS rc
ON rc.ID = t.ParentID
WHERE rc.ParentID = @CustomerID OR (rc.ID = @CustomerID AND rc.ParentID IS NULL);
RESULT FOR 100
+-----+-------------+----------+
| ID | Description | ParentID |
+-----+-------------+----------+
| 500 | Ticket A | 100 |
| 501 | Ticket B | 100 |
| 502 | Ticket C | 102 |
| 503 | Ticket D | 102 |
| 504 | Ticket E | 103 |
+-----+-------------+----------+
RESULT FOR 104
+-----+-------------+----------+
| ID | Description | ParentID |
+-----+-------------+----------+
| 506 | Ticket G | 104 |
+-----+-------------+----------+
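For comparison, here is a sketch of a recursive CTE that walks downward from the given customer instead of building ancestor pairs for every customer. It is a variant written for this illustration, not the query above, and it runs on Python's sqlite3 module rather than SQL Server. The names family, tickets_for, and the snake_case columns are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers (id int, name text, parent_id int);
create table tickets (id int, description text, parent_id int);
insert into customers values
  (100, 'Customer A', NULL), (101, 'Customer B', NULL),
  (102, 'Customer C', 100), (103, 'Customer D', 100), (104, 'Customer E', 101);
insert into tickets values
  (500, 'Ticket A', 100), (501, 'Ticket B', 100), (502, 'Ticket C', 102),
  (503, 'Ticket D', 102), (504, 'Ticket E', 103), (505, 'Ticket F', 101),
  (506, 'Ticket G', 104), (507, 'Ticket H', 101);
""")


def tickets_for(customer_id):
    """Return ticket ids owned by the customer or any of its descendants."""
    rows = conn.execute("""
        with recursive family(id) as (
            -- Anchor: the customer we were given.
            select id from customers where id = ?
            union all
            -- Recursive step: children of anything already in the family.
            select c.id from customers c join family f on c.parent_id = f.id
        )
        select t.id
        from tickets t
        join family f on f.id = t.parent_id
        order by t.id
    """, (customer_id,)).fetchall()
    return [r[0] for r in rows]


print(tickets_for(100))
print(tickets_for(104))
```

Because the recursion starts at a single customer and only moves to children, each customer appears in family once, so tickets are not duplicated and parents of a child customer are never visited.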
SELECT tickets.Description
FROM
Tickets AS tickets
INNER JOIN
Customers AS customers ON
customers.ID = tickets.ParentID
WHERE
customers.ID = 100
OR
customers.ParentID = 100
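Because the hierarchy is only one level deep, a plain join is enough: join each ticket to its owning customer and keep the ticket when that customer is the given ID or has it as a parent. The sketch below checks this on the question's sample data using Python's sqlite3 module. The function name tickets_for_customer and the snake_case column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers (id int, name text, parent_id int);
create table tickets (id int, description text, parent_id int);
insert into customers values
  (100, 'Customer A', NULL), (101, 'Customer B', NULL),
  (102, 'Customer C', 100), (103, 'Customer D', 100), (104, 'Customer E', 101);
insert into tickets values
  (500, 'Ticket A', 100), (501, 'Ticket B', 100), (502, 'Ticket C', 102),
  (503, 'Ticket D', 102), (504, 'Ticket E', 103), (505, 'Ticket F', 101),
  (506, 'Ticket G', 104), (507, 'Ticket H', 101);
""")


def tickets_for_customer(customer_id):
    """Tickets owned by the customer itself or by one of its direct children."""
    rows = conn.execute("""
        select t.description
        from tickets t
        join customers c on c.id = t.parent_id
        where c.id = ? or c.parent_id = ?
        order by t.id
    """, (customer_id, customer_id)).fetchall()
    return [r[0] for r in rows]


print(tickets_for_customer(100))
print(tickets_for_customer(104))
```

Each ticket joins to exactly one customer row, so no ticket is repeated. This only covers one level of children; deeper hierarchies would need the recursive approach.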
The demo provided by GarethD worked fine in the demonstration, and at first I thought it worked on the live data too. However, I got weird results on the live data: tickets belonging to the parent were repeated 4 times each, although rows related to children only appeared once.
In my ignorance and with an inability to fix said issue I've used a different approach that works fine so I thought I'd leave it here as an alternative.
First I select the parent and child records and store them in a table:
DECLARE @CUSTOMERS TABLE (
ID BIGINT,
PARENTID BIGINT,
NAME VARCHAR(MAX)
)
INSERT INTO @CUSTOMERS(ID, PARENTID, NAME)
SELECT
id,
parent_id,
name
FROM [customer_table]
WHERE id = '194' OR parent_id = '194';
Now I can join to this table as normal to select the correct Tickets:
SELECT
customer.[NAME],
ticket.[id],
ticket.[description]
FROM [ticket_table] AS ticket
LEFT JOIN @CUSTOMERS AS customer
ON ticket.[id] = customer.[ID]
WHERE
ticket.[id] = customer.[ID];
This produces the correct number of tickets and seems fast enough. Hope it helps someone.