PostgreSQL - Query with aggregate functions - sql

I need some help for a PostgreSQL query.
I have 4 tables involved on it: customer, organization_complete, entity and address. I retrieve some data from everyone and with this query:
SELECT distinct ON (c.customer_number, trim(lower(o.name)), a.street, a.zipcode, a.area, a.country)
c.xid AS customer_xid, o.xid AS entity_xid, c.customer_number, c.deleted, o.name, o.vat, 'organisation' AS customer_type, a.street, a.zipcode, a.city, a.country
FROM customer c
INNER JOIN organisation_complete o ON (c.xid = o.customer_xid AND c.deleted = 'FALSE')
INNER JOIN entity e ON e.customer_xid = o.customer_xid
INNER JOIN address a ON (a.contact_info_xid = e.contact_info_xid and a.address_type = 'delivery')
WHERE c.account_xid = "<value>"
I get a distinct of all the customers splitted by customer_number, name, street, zipcode, area and country (what's specified after the DISTINCT ON statement).
What I need to retrieve now is a distinct of all customers having a doubled row on DB but I also need to retrieve the customer_xid and the entity_xid, that are primary keys of the respective tables and so are unique. For this reason they can't be included into an aggregate function. All I need is to count how many rows with the same customer_number, name, street, zipcode, area and country I have for each distinct tuple and to select only tuples with a count bigger than 1.
For each selected tuple I need also to take a customer_xid and an entity_xid, at random, like MySQL would do with a_key in a query like this:
SELECT COUNT(*), tab.a_key, tab.b, tab.c from tab
WHERE 1
GROUP BY tab.b
I know MySQL is quite an exception regarding this, I just want to know if may be possible to obtain the same result on PostgreSQL.
Thanks,
L.

This query in MySql is using a nonstandard (see note below) "MySql group by extension": http://dev.mysql.com/doc/refman/5.0/en/group-by-extensions.html
SELECT COUNT(*), tab.a_key, tab.b, tab.c
from tab
WHERE 1
GROUP BY tab.b
Note: This is a feature definied in SQL:2003 Standard as T301 Functional dependencies, it is not required by the standard, and many RDBMS don't support it, including PostgreSql (see this link for version 9.3 - unsupported features: http://www.postgresql.org/docs/9.3/static/unsupported-features-sql-standard.html ).
The above query could be expressed in PostgreSQL in this way:
SELECT tab.a_key, tab.b, tab.c,
q.cnt
FROM (
SELECT tab.b,
COUNT(*) As cnt,
MIN(tab.unique_id) As unique_id /* could be also MAX */
from tab
WHERE 1
GROUP BY tab.b
) q
JOIN tab ON tab.unique_id = q.unique_id
where unique_id is a column that uniquely identifies each row in tab (usually a primary key).
Min or Max functions choose one row from the table in a pseudo-random manner.

Related

How can I make my PostgreSQL select query shorter?

I am working on the website and I need to execute a pretty complex select query. I managed to write it, but It looks too long for me and I want to make it shorter. Here is my query:
select friend_id, name, surname, mail, birth_date, address
from (select friend_id, name, surname, mail, birth_date, address, count(friend_id) as rents
from (select friend.friend_id, name, surname, mail, birth_date, address
from friend
left join profile p on friend.profile_id = p.profile_id
left join friend_group_record fgr on friend.friend_id = fgr.friend_id
left join meeting m on fgr.friend_group_id = m.friend_group_id
where m.date between '2019-09-01' and '2020-01-01') as filtered_friend_profiles
group by friend_id, name, surname, mail, birth_date, address) as counted_friend_profiles
where counted_friend_profiles.rents >= 25;
Does anyone have idea how to make it shorter?
Shorter? (And correct?!)
SELECT f.friend_id, name, surname, mail, birth_date, address -- ⑤
FROM friend f
JOIN profile p USING (profile_id) -- ①
JOIN friend_group_record fgr ON f.friend_id = fgr.friend_id -- also USING?
JOIN meeting m ON fgr.friend_group_id = m.friend_group_id -- also USING?
WHERE m.date >= '2019-09-01' -- ②
AND m.date < '2020-01-01'
GROUP BY 1, 2, 3, 4, 5, 6 -- ③
HAVING count(*) >= 25; -- ④
① Changed all three instances of LEFT JOIN to [INNER] JOIN, since the condition on the last table in the food chain (meeting) forces all joins to behave like plain joins anyway. (As #wildplasser pointed out.) See:
PostgreSQL: Using AND statement in LEFT JOIN is not working as expected
SQL / PostgreSQL left join ignores "on = constant" predicate, on left table
Also, if column names are identical and distinct among all tables left of the join, USING is convenient short syntax. (Returning only one of each pair of joined columns, which has no bearing on the query at hand.)
Not knowing exact table definitions, I only applied it in the first join, where ambiguities are impossible as there is only a single table left of the join. For persisted queries, generally only advisable if involved column names are stable and distinct over all joined tables.
② Typically, you'd want to exclude the upper bound and BETWEEN is the wrong tool. Related:
How to add a day/night indicator to a timestamp column?
POSTGRES select n equally distributed rows by time over millions of records
How do I write a function in plpgsql that compares a date with a timestamp without time zone?
③ About this short syntax:
Select first row in each GROUP BY group?
Possibly shorter if there are functional dependencies with PK columns. See:
Why can't I exclude dependent columns from `GROUP BY` when I aggregate by a key?
④ count(*) is shorter & faster (and equivalent in this case). Also, can be used in HAVING clause without listing in SELECT list. See:
Postgres: count(*) vs count(id)
⑤ All source column names should be table-qualified, if only for documentation. Also avoids breakage and confusion from later changes to underlying tables.
How about just doing this?
select friend.friend_id, name, surname, mail, birth_date, address
from friend
left join profile p on friend.profile_id = p.profile_id
left join friend_group_record fgr on friend.friend_id = fgr.friend_id
left join meeting m on fgr.friend_group_id = m.friend_group_id
where m.date between '2019-09-01' and '2020-01-01'
group by friend.friend_id, name, surname, mail, birth_date, address
having count(friend.friend_id) >= 25;
See the addition of having clause.

Get row from one table, plus COUNT from a related table

I'm trying to build an SQL query where I grab one table's information (WHERE shops.shop_domain = X) along with the COUNT of the customers table WHERE customers.product_id = 4242451.
The shops table DOES NOT have product.id in it, but the customers table DOES HAVE the shop_domain in it, hence my attempt to do some sort of join.
I essentially want to return the following:
shops.id
shops.name
shops.shop_domain
COUNT OF CUSTOMERS WHERE customers.product_id = '4242451'
Here is my not so lovely attempt at the query.
I think I have the idea right (maybe...) but I can't wrap my head around building this query.
SELECT shops.id, shops.name, shops.shop_domain, COUNT(customers.customer_id)
FROM shops
LEFT JOIN customers ON shops.shop_domain = customers.shop_domain
WHERE shops.shop_domain = 'myshop.com' AND
customers.product_id = '4242451'
GROUP BY shops.shop_id
Relevant database schemas:
shops:
id, name, shop_domain
customers:
id, name, product_id, shop_domain
You are close. The condition on customers needs to go in the ON clause, because this is a LEFT JOIN and customers is the second table:
SELECT s.id, s.name, s.shop_domain, COUNT(c.customer_id)
FROM shops s LEFT JOIN
customers c
ON s.shop_domain = c.shop_domain AND c.product_id = '4242451'
WHERE s.shop_domain = 'myshop.com'
GROUP BY s.id, s.name, s.shop_domain;
I am also inclined to include all three columns in the GROUP BY, although Postgres (and ANSI/ISO standards) are happy with just id if it is declared as the primary key in the table.
A correlated subquery should be substantially cheaper (and simpler) for the purpose:
SELECT id, name, shop_domain
, (SELECT count(*)
FROM customers
WHERE shop_domain = s.shop_domain
AND product_id = 4242451) AS special_count
FROM shops s
WHERE shop_domain = 'myshop.com';
This way you only need to aggregate in the subquery, and need not worry about undesired effects on the outer query.
Assuming product_id is a numeric data type, so I use a numeric literal (4242451) instead of a string literal '4242451' - which might cause problems otherwise.

SQL Server 2016 Sub Query Guidance

I am currently working on an assignment for my SQL class and I am stuck. I'm not looking for full code to answer the question, just a little nudge in the right direction. If you do provide full code would you mind a small explanation as to why you did it that way (so I can actually learn something.)
Here is the question:
Write a SELECT statement that returns three columns: EmailAddress, ShipmentId, and the order total for each Client. To do this, you can group the result set by the EmailAddress and ShipmentId columns. In addition, you must calculate the order total from the columns in the ShipItems table.
Write a second SELECT statement that uses the first SELECT statement in its FROM clause. The main query should return two columns: the Client’s email address and the largest order for that Client. To do this, you can group the result set by the EmailAddress column.
I am confused on how to pull in the EmailAddress column from the Clients table, as in order to join it I have to bring in other tables that aren't being used. I am assuming there is an easier way to do this using sub Queries as that is what we are working on at the time.
Think of SQL as working with sets of data as opposed to just tables. Tables are merely a set of data. So when you view data this way you immediately see that the query below returns a set of data consisting of the entirety of another set, being a table:
SELECT * FROM MyTable1
Now, if you were to only get the first two columns from MyTable1 you would return a different set that consisted only of columns 1 and 2:
SELECT col1, col2 FROM MyTable1
Now you can treat this second set, a subset of data as a "table" as well and query it like this:
SELECT
*
FROM (
SELECT
col1,
col2
FROM
MyTable1
)
This will return all the columns from the two columns provided in the inner set.
So, your inner query, which I won't write for you since you appear to be a student, and that wouldn't be right for me to give you the entire answer, would be a query consisting of a GROUP BY clause and a SUM of the order value field. But the key thing you need to understand is this set thinking: you can just wrap the ENTIRE query inside brackets and treat it as a table the way I have done above. Hopefully this helps.
You need a subquery, like this:
select emailaddress, max(OrderTotal) as MaxOrder
from
( -- Open the subquery
select Cl.emailaddress,
Sh.ShipmentID,
sum(SI.Value) as OrderTotal -- Use the line item value column in here
from Client Cl -- First table
inner join Shipments Sh -- Join the shipments
on Sh.ClientID = Cl.ClientID
inner join ShipItem SI -- Now the items
on SI.ShipmentID = Sh.ShipmentID
group by C1.emailaddress, Sh.ShipmentID -- here's your grouping for the sum() aggregation
) -- Close subquery
group by emailaddress -- group for the max()
For the first query you can join the Clients to Shipments (on ClientId).
And Shipments to the ShipItems table (on ShipmentId).
Then group the results, and count or sum the total you need.
Using aliases for the tables is usefull, certainly when you select fields from the joined tables that have the same column name.
select
c.EmailAddress,
i.ShipmentId,
SUM((i.ShipItemPrice - i.ShipItemDiscountAmount) * i.Quantity) as TotalPriceDiscounted
from ShipItems i
join Shipments s on (s.ShipmentId = i.ShipmentId)
left join Clients c on (c.ClientId = s.ClientId)
group by i.ShipmentId, c.EmailAddress
order by i.ShipmentId, c.EmailAddress;
Using that grouped query in a subquery, you can get the Maximum total per EmailAddress.
select EmailAddress,
-- max(TotalShipItems) as MaxTotalShipItems,
max(TotalPriceDiscounted) as MaxTotalPriceDiscounted
from (
select
c.EmailAddress,
-- i.ShipmentId,
-- count(*) as TotalShipItems,
SUM((i.ShipItemPrice - i.ShipItemDiscountAmount) * i.Quantity) as TotalPriceDiscounted
from ShipItems i
join Shipments s on (s.ShipmentId = i.ShipmentId)
left join Clients c on (c.ClientId = s.ClientId)
group by i.ShipmentId, c.EmailAddress
) q
group by EmailAddress
order by EmailAddress
Note that an ORDER BY is mostly meaningless inside a subquery if you don't use TOP.

query SQL Aggregate rows inside join

Table: ID, Person_ID,Name
Each Person ID can have several rows because he can have several names (first name, last name, nick name, etc..)
I have another table that contains one row per person and some other data in it
I want to join both tables into 1 row per person and in the last column to aggregate all of the person names in to one string like this: "Thomas, anderson, neo"
Something like this:
SELECT A.*,
B.PERSON_ID,
B.(aggregated names here)
FROM USERS A, USERS_NAMES B;
How do i do this?
I would do this in the following way:
select u.*, un.names
from users u left outer join
(select un.person_id, listagg(un.name, ',') within group (order by un.id) as names
from users_names un
group by un.person_id
) un
on u.person_id = un.person_id;
Note that the list aggregation is being done in a subquery. That allows the use of u.* in the outer query with no aggregation. Otherwise, you have to group by each column in users explicitly.

Using group by and having clause

Using the following schema:
Supplier (sid, name, status, city)
Part (pid, name, color, weight, city)
Project (jid, name, city)
Supplies (sid, pid, jid**, quantity)
Get supplier numbers and names for suppliers of parts supplied to at least two different projects.
Get supplier numbers and names for suppliers of the same part to at least two different projects.
These were my answers:
1.
SELECT s.sid, s.name
FROM Supplier s, Supplies su, Project pr
WHERE s.sid = su.sid AND su.jid = pr.jid
GROUP BY s.sid, s.name
HAVING COUNT (DISTINCT pr.jid) >= 2
2.
SELECT s.sid, s.name
FROM Suppliers s, Supplies su, Project pr, Part p
WHERE s.sid = su.sid AND su.pid = p.pid AND su.jid = pr.jid
GROUP BY s.sid, s.name
HAVING COUNT (DISTINCT pr.jid)>=2
Can anyone confirm if I wrote this correctly? I'm a little confused as to how the Group By and Having clause works
The semantics of Having
To better understand having, you need to see it from a theoretical point of view.
A group by is a query that takes a table and summarizes it into another table. You summarize the original table by grouping the original table into subsets (based upon the attributes that you specify in the group by). Each of these groups will yield one tuple.
The Having is simply equivalent to a WHERE clause after the group by has executed and before the select part of the query is computed.
Lets say your query is:
select a, b, count(*)
from Table
where c > 100
group by a, b
having count(*) > 10;
The evaluation of this query can be seen as the following steps:
Perform the WHERE, eliminating rows that do not satisfy it.
Group the table into subsets based upon the values of a and b (each tuple in each subset has the same values of a and b).
Eliminate subsets that do not satisfy the HAVING condition
Process each subset outputting the values as indicated in the SELECT part of the query. This creates one output tuple per subset left after step 3.
You can extend this to any complex query there Table can be any complex query that return a table (a cross product, a join, a UNION, etc).
In fact, having is syntactic sugar and does not extend the power of SQL. Any given query:
SELECT list
FROM table
GROUP BY attrList
HAVING condition;
can be rewritten as:
SELECT list from (
SELECT listatt
FROM table
GROUP BY attrList) as Name
WHERE condition;
The listatt is a list that includes the GROUP BY attributes and the expressions used in list and condition. It might be necessary to name some expressions in this list (with AS). For instance, the example query above can be rewritten as:
select a, b, count
from (select a, b, count(*) as count
from Table
where c > 100
group by a, b) as someName
where count > 10;
The solution you need
Your solution seems to be correct:
SELECT s.sid, s.name
FROM Supplier s, Supplies su, Project pr
WHERE s.sid = su.sid AND su.jid = pr.jid
GROUP BY s.sid, s.name
HAVING COUNT (DISTINCT pr.jid) >= 2
You join the three tables, then using sid as a grouping attribute (sname is functionally dependent on it, so it does not have an impact on the number of groups, but you must include it, otherwise it cannot be part of the select part of the statement). Then you are removing those that do not satisfy your condition: the satisfy pr.jid is >= 2, which is that you wanted originally.
Best solution to your problem
I personally prefer a simpler cleaner solution:
You need to only group by Supplies (sid, pid, jid**, quantity) to
find the sid of those that supply at least to two projects.
Then join it to the Suppliers table to get the supplier same.
SELECT sid, sname from
(SELECT sid from supplies
GROUP BY sid
HAVING count(DISTINCT jid) >= 2
) AS T1
NATURAL JOIN
Supliers;
It will also be faster to execute, because the join is only done when needed, not all the times.
--dmg
Because we can not use Where clause with aggregate functions like count(),min(), sum() etc. so having clause came into existence to overcome this problem in sql. see example for having clause go through this link
http://www.sqlfundamental.com/having-clause.php
First of all, you should use the JOIN syntax rather than FROM table1, table2, and you should always limit the grouping to as little fields as you need.
Altought I haven't tested, your first query seems fine to me, but could be re-written as:
SELECT s.sid, s.name
FROM
Supplier s
INNER JOIN (
SELECT su.sid
FROM Supplies su
GROUP BY su.sid
HAVING COUNT(DISTINCT su.jid) > 1
) g
ON g.sid = s.sid
Or simplified as:
SELECT sid, name
FROM Supplier s
WHERE (
SELECT COUNT(DISTINCT su.jid)
FROM Supplies su
WHERE su.sid = s.sid
) > 1
However, your second query seems wrong to me, because you should also GROUP BY pid.
SELECT s.sid, s.name
FROM
Supplier s
INNER JOIN (
SELECT su.sid
FROM Supplies su
GROUP BY su.sid, su.pid
HAVING COUNT(DISTINCT su.jid) > 1
) g
ON g.sid = s.sid
As you may have noticed in the query above, I used the INNER JOIN syntax to perform the filtering, however it can be also written as:
SELECT s.sid, s.name
FROM Supplier s
WHERE (
SELECT COUNT(DISTINCT su.jid)
FROM Supplies su
WHERE su.sid = s.sid
GROUP BY su.sid, su.pid
) > 1
What type of sql database are using (MSSQL, Oracle etc)?
I believe what you have written is correct.
You could also write the first query like this:
SELECT s.sid, s.name
FROM Supplier s
WHERE (SELECT COUNT(DISTINCT pr.jid)
FROM Supplies su, Projects pr
WHERE su.sid = s.sid
AND pr.jid = su.jid) >= 2
It's a little more readable, and less mind-bending than trying to do it with GROUP BY. Performance may differ though.
1.Get supplier numbers and names for suppliers of parts supplied to at least two different projects.
SELECT S.SID, S.NAME
FROM SUPPLIES SP
JOIN SUPPLIER S
ON SP.SID = S.SID
WHERE PID IN
(SELECT PID FROM SUPPPLIES GROUP BY PID, JID HAVING COUNT(*) >= 2)
I am not slear about your second question