PostgreSQL remove duplicates by GROUP BY - sql

I would like to print only the latest message of each person. I use PostgreSQL 10. The expected result is:
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test3 | 2017-07-07 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I have tried the following SQL query. It returns the messages I want, but unfortunately each person appears more than once in the result.
SELECT * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test1 | 2016-06-01 |
| Maria | Test2 | 2016-11-01 |
| Maria | Test3 | 2017-07-07 |
| Paul | Test4 | 2017-01-01 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I tried to remove the duplicates with a DISTINCT, but unfortunately I get this error message:
SELECT DISTINCT ON (name) * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON (name) * FROM messages
                            ^
Do you have any ideas how I can solve this?

You would use DISTINCT ON as follows:
SELECT DISTINCT ON (name) *
FROM messages
WHERE receive = 't'
ORDER BY name, created_at DESC
That is:
no GROUP BY clause is needed
the column(s) listed in DISTINCT ON (...) must appear first in the ORDER BY clause
... followed by the column that decides which row of each group is kept (here, created_at DESC, so the latest message wins)
Note that the results of a DISTINCT ON query are always sorted by the columns in the DISTINCT ON clause first (because this sort is what is used to identify which rows should be kept).
If you want more control over the sort order, then you can use window functions instead:
SELECT *
FROM (
SELECT m.*, ROW_NUMBER() OVER(PARTITION BY name ORDER BY created_at DESC) rn
FROM messages m
WHERE receive = 't'
) t
WHERE rn = 1
ORDER BY created_at DESC
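To make the window-function variant concrete, here is a minimal sketch that runs it against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example is self-contained; SQLite has no DISTINCT ON, but it supports ROW_NUMBER() from version 3.25). The receive column type and its 't' values are assumed from the question's WHERE clause.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; window functions need SQLite 3.25+.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (name TEXT, body TEXT, created_at TEXT, receive TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, 't')",
    [
        ("Maria", "Test1", "2016-06-01"),
        ("Maria", "Test2", "2016-11-01"),
        ("Maria", "Test3", "2017-07-07"),
        ("Paul", "Test4", "2017-01-01"),
        ("Paul", "Test5", "2017-06-01"),
    ],
)

# Number each person's messages from newest to oldest, then keep number 1.
rows = conn.execute(
    """
    SELECT name, body, created_at
    FROM (
        SELECT m.*,
               ROW_NUMBER() OVER (PARTITION BY name ORDER BY created_at DESC) AS rn
        FROM messages m
        WHERE receive = 't'
    ) t
    WHERE rn = 1
    ORDER BY created_at DESC
    """
).fetchall()

for row in rows:
    print(row)
```

Because the outer ORDER BY is independent of the partitioning, the final sort can be changed freely, which is the extra control mentioned above.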

Use DISTINCT ON, but with the right ORDER BY:
SELECT DISTINCT ON (name) m.*
FROM messages m
WHERE receive = 't'
ORDER BY name, created_at DESC;
In general, you don't use DISTINCT ON with GROUP BY. It is used with ORDER BY: it keeps the first row for each name, as determined by the ORDER BY clause.
You should not be thinking of what you are doing as aggregation. You want to filter based on the created_at. In many databases, you would express this using a correlated subquery:
select m.*
from messages m
where m.created_at = (select max(m2.created_at)
from messages m2
where m2.name = m.name and m2.receive = 't'
) and
m.receive = 't'; -- this condition is probably not needed
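Whether the outer condition on receive is needed depends on the data. The sketch below, again using Python's sqlite3 as a stand-in for PostgreSQL, suggests one case where it matters: a row with receive = 'f' that shares its timestamp with the latest receive = 't' row of the same person. The 'Draft' row is hypothetical and exists only to trigger that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (name TEXT, body TEXT, created_at TEXT, receive TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [
        ("Maria", "Test2", "2016-11-01", "t"),
        ("Maria", "Test3", "2017-07-07", "t"),
        ("Maria", "Draft", "2017-07-07", "f"),  # hypothetical row, same timestamp
        ("Paul", "Test5", "2017-06-01", "t"),
    ],
)

# Correlated subquery: keep rows whose timestamp equals the person's latest
# timestamp among receive = 't' rows.
base = """
    SELECT m.name, m.body
    FROM messages m
    WHERE m.created_at = (SELECT MAX(m2.created_at)
                          FROM messages m2
                          WHERE m2.name = m.name AND m2.receive = 't')
"""
without_outer = conn.execute(base + " ORDER BY m.name, m.body").fetchall()
with_outer = conn.execute(
    base + " AND m.receive = 't' ORDER BY m.name, m.body"
).fetchall()

print(without_outer)  # the 'Draft' row slips through
print(with_outer)     # only receive = 't' rows remain
```

If such timestamp collisions cannot occur in your data, both variants return the same rows.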

SELECT *
FROM messages
WHERE receive = 't' and not exists (
select 1
from messages m
where m.receive = messages.receive and messages.name = m.name and m.created_at > messages.created_at
)
ORDER BY created_at DESC
The query above finds the messages which fulfill the following criteria:
receive is 't'
there does not exist another message which
has the same value for receive
has the same name
and is newer
Assuming that the same name does not send two messages at exactly the same time, this should be enough. Another point is that two names might look the same but differ if whitespace characters are present inside the value. So if the query above returns two records with the same name but different created_at values, it is highly probable that whitespace characters are playing tricks on you.
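As a rough check of both points, the following sketch runs the NOT EXISTS query on the question's sample rows and then adds a hypothetical row whose name has a trailing space. Python's sqlite3 is used as a stand-in for PostgreSQL, and the 'Paul ' row is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (name TEXT, body TEXT, created_at TEXT, receive TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, 't')",
    [
        ("Maria", "Test1", "2016-06-01"),
        ("Maria", "Test2", "2016-11-01"),
        ("Maria", "Test3", "2017-07-07"),
        ("Paul", "Test4", "2017-01-01"),
        ("Paul", "Test5", "2017-06-01"),
    ],
)

# Keep a message only if no newer message exists for the same name.
QUERY = """
    SELECT name, body, created_at
    FROM messages
    WHERE receive = 't' AND NOT EXISTS (
        SELECT 1
        FROM messages m
        WHERE m.receive = messages.receive
          AND m.name = messages.name
          AND m.created_at > messages.created_at
    )
    ORDER BY created_at DESC
"""
clean = conn.execute(QUERY).fetchall()

# Hypothetical row: 'Paul ' (trailing space) is a different name than 'Paul'.
conn.execute("INSERT INTO messages VALUES ('Paul ', 'Test6', '2017-02-01', 't')")
with_stray_space = conn.execute(QUERY).fetchall()

print(clean)
print(with_stray_space)  # now contains what looks like a second 'Paul'
```

Trimming the name before comparing it would be one way to guard against this, if that matches how names are meant to be treated.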

Related

Is it possible simplify this query to work with paginator (without creating a custom paginator)?

I have list of messages by users and I need to get the oldest message of each user, then use paginate() on the result.
example data:
id | user_id | message | date_posted
1 | 5 | some_message | 2022-07-15
2 | 125 | some_message | 2022-08-02
3 | 5 | some_message | 2022-04-05
So in this case I need to get only rows with id 2 and 3
The problem is that I got this complex query to do it, and I have to use it inside DB::select(DB::raw($query));, which returns an array, and paginate can't be used on array.
This is the query:
select T.*
from (select *,
row_number() over (partition by user_id order by date_posted, id) as sn
from my_table
) T
where sn = 1;
Is there a way to get these results with statements that can be converted to Query Builder or Eloquent?
*I can't disable only_full_group_by
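I solved the problem with this SQL command. The ORDER BY inside GROUP_CONCAT makes the first element come from the oldest row of each user:
select substring_index(GROUP_CONCAT(id ORDER BY date_posted, id), ',', 1) AS id, user_id, MIN(date_posted) AS date_posted, substring_index(GROUP_CONCAT(message ORDER BY date_posted, id), ',', 1) AS message from `messages` group by `user_id` order by `date_posted` asc
Note that a message containing a comma is cut at that comma.
If you need Laravel code:
$message = Message::
selectRaw("substring_index(GROUP_CONCAT(id ORDER BY date_posted, id), ',', 1) AS id, user_id,
MIN(date_posted) AS date_posted,
substring_index(GROUP_CONCAT(message ORDER BY date_posted, id), ',', 1) AS message")
->groupBy('user_id')
->orderBy('date_posted')
->get();
If you need pagination, call paginate() instead of get().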

Oracle distinct on single column returning row

I have an API endpoint that accepts distinct arguments for filtering on specific columns. For this reason I'm trying to build a base query to which arbitrary filters are easy to add. For some reason, if I use:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW"
GROUP BY ID)
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
I get terrible performance so I started using this query:
SELECT * FROM "MY_VIEW"
-- Distinct on ID filter
LEFT JOIN(
SELECT DISTINCT
FIRST_VALUE("MY_VIEW"."ID")
OVER(PARTITION BY "MY_VIEW"."UNIQUE_ID") as DISTINCT_ID
FROM "MY_VIEW"
) d ON d.DISTINCT_ID = "MY_VIEW"."ID"
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
However when I left join it discards the distinct filter.
Also I can't use rowid because it is a view.
The view is a versioned table.
Index Info
UNIQUENESS | STATUS | INDEX_TYPE | TEMPORARY | PARTITIONED | JOIN_INDEX | COLUMNS
NONUNIQUE | VALID | NORMAL | N | NO | NO | ID
UNIQUE | VALID | NORMAL | N | NO | NO | UNIQUE_ID
NONUNIQUE | VALID | DOMAIN | N | NO | NO | NAME
I don't have enough reputation to leave a "comment" so I will post this as an "answer." Your first example is:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW"
GROUP BY ID)
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
But do you realize that the "GROUP BY ID" clause negates the effect of the MAX() function on ID? In other words, you will get all the rows and the MAX will be computed on each row's ID, returning . . . that row's ID. Perhaps try:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW")
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
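The effect of grouping by the same column that is aggregated can be checked with a small sketch. Python's sqlite3 is used here as a stand-in for Oracle, a plain table stands in for the view, and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A plain table stands in for the view; the rows are made up for illustration.
conn.execute("CREATE TABLE my_view (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO my_view VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "c"), (3, "d")],
)

# MAX(id) per id group is just every distinct id.
per_group = [
    r[0]
    for r in conn.execute("SELECT MAX(id) FROM my_view GROUP BY id ORDER BY 1")
]

# Without GROUP BY there is a single maximum.
overall = conn.execute("SELECT MAX(id) FROM my_view").fetchone()[0]

# Used as an IN filter, the grouped subquery therefore keeps every row.
kept = conn.execute(
    "SELECT COUNT(*) FROM my_view "
    "WHERE id IN (SELECT MAX(id) FROM my_view GROUP BY id)"
).fetchone()[0]

print(per_group, overall, kept)
```

So the grouped subquery filters nothing, while the ungrouped one narrows the result to the rows carrying the single largest id.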

Greatest N Per Group with JOIN and multiple order columns

I have two tables:
Table0:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-18 | 100 |
| aa | 1 | 12-10 | 101 |
| bb | 2 | 12-10 | 102 |
| cc | 1 | 12-09 | 100 |
| cc | 2 | 12-12 | 103 |
| cc | 2 | 12-01 | 109 |
| cc | 1 | 12-07 | 101 |
| dd | 1 | 12-08 | 100 |
and
Table1:
| ID |
|----|
| aa |
| cc |
| cc |
| dd |
| dd |
I'm trying to output results where:
ID must exist in both tables.
TYPE must be the maximum for each ID.
TIME must be the minimum value for the maximum TYPE for each ID.
SITE should be the value from the same row as the minimum TIME value.
Given my sample data, my results should look like this:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-10 | 101 |
| cc | 2 | 12-01 | 109 |
| dd | 1 | 12-08 | 100 |
I've tried these statements:
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MASTY, MIN("TIME") AS MASTM
FROM TABLE0
GROUP BY "ID") AS MAS,
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MSD.MASTY =MA."TYPE"
...which generates a syntax error
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MAB
FROM TABLE0
GROUP BY "ID") AS MAS,
((SELECT "ID", MIN("TIME") AS MACTM, MIN("TYPE") AS MACTY
FROM TABLE0
WHERE "TYPE" = 1
GROUP BY "ID")
UNION
(SELECT "ID", MIN("TIME"), MAX("TYPE")
FROM TABLE0
WHERE "TYPE" = 2
GROUP BY "ID")) AS MACU
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MACU."ID" = QTS."ID"
AND MA."TIME" = MACU.MACTM
AND MA."TYPE" = MACU.MACTB
... which is getting the wrong results.
Answering your direct question "how to avoid...":
You get this error when you specify a column in a SELECT area of a statement that isn't present in the GROUP BY section and isn't part of an aggregating function like MAX, MIN, AVG
With your data, I cannot write:
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id
I didn't say what to do with SITE; it's either a key of the group (in which case I'll get every unique combination of ID,site and the min time in each) or it should be aggregated (eg max site per ID)
These are ok:
SELECT
ID, max(site), min(time)
FROM
table
GROUP BY
id
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id,site
I cannot simply leave it unspecified: what should the database return in such a case? (If you're still struggling, tell me in the comments what you think the db should do, and I'll better understand your thinking so I can tell you why it can't do that.) The programmer of the database cannot make this decision for you; you must make it.
Usually people ask this when they want to identify:
The min time per ID, and get all the other row data as well. eg "What is the full earliest record data for each id?"
In this case you have to write a query that identifies the min time per id and then join that subquery back to the main data table on id=id and time=mintime. The db runs the subquery, builds a list of min time per id, then that effectively becomes a filter of the main data table
SELECT * FROM
(
SELECT
ID, min(time) as mintime
FROM
table
GROUP BY
id
) findmin
INNER JOIN table t ON t.id = findmin.id and t.time = findmin.mintime
What you cannot do is start putting the other data you want into the query that does the grouping, because you either have to group by the columns you add in (makes the group more fine grained, not what you want) or you have to aggregate them (and then it doesn't necessarily come from the same row as other aggregated columns - min time is from row 1, min site is from row 3 - not what you want)
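Both points can be tried out on the Table0 sample data. The sketch below uses Python's sqlite3 as a stand-in database and the table name table0 in place of the placeholder name used above (both are assumptions for the sake of a runnable example). It first joins the per-id minimum back to the table, then shows how aggregating two columns independently mixes values from different rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table0 (id TEXT, type INTEGER, time TEXT, site INTEGER)")
conn.executemany(
    "INSERT INTO table0 VALUES (?, ?, ?, ?)",
    [
        ("aa", 1, "12-18", 100),
        ("aa", 1, "12-10", 101),
        ("bb", 2, "12-10", 102),
        ("cc", 1, "12-09", 100),
        ("cc", 2, "12-12", 103),
        ("cc", 2, "12-01", 109),
        ("cc", 1, "12-07", 101),
        ("dd", 1, "12-08", 100),
    ],
)

# Find the min time per id, then join back to fetch the full earliest row.
earliest = conn.execute(
    """
    SELECT t.id, t.type, t.time, t.site
    FROM (SELECT id, MIN(time) AS mintime FROM table0 GROUP BY id) findmin
    INNER JOIN table0 t ON t.id = findmin.id AND t.time = findmin.mintime
    ORDER BY t.id
    """
).fetchall()

# Aggregating each column separately can combine values from different rows.
mixed = conn.execute(
    "SELECT id, MIN(time), MIN(site) FROM table0 WHERE id = 'aa' GROUP BY id"
).fetchone()

print(earliest)
print(mixed)  # site 100 belongs to the 12-18 row, not to the 12-10 row
```

For id aa, the joined query returns site 101 (the real earliest row), while the doubly aggregated query pairs time 12-10 with site 100, a combination that exists in no row.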
Looking at your actual problem:
The ID value must exist in two tables.
The Type value must be largest group by id.
The Time value must be smallest in the largest type group.
Leaving out a solution that involves having or analytics for now, so you can get to grips with the theory here:
You need to find the max type group by id, and then join it back to the table to get the other relevant data also (time is needed) for that id/maxtype and then on this new filtered data set you need the id and min time
SELECT t.id,min(t.time) FROM
(
SELECT
ID, max(type) as maxtype
FROM
table
GROUP BY
id
) findmax
INNER JOIN table t ON t.id = findmax.id and t.type = findmax.maxtype
GROUP BY t.id
If you can't see why, let me know
SELECT DISTINCT ON (t0.id)
t0.id,
type,
time,
first_value(site) OVER (PARTITION BY t0.id ORDER BY time) as site
FROM table0 t0
JOIN table1 t1 ON t0.id = t1.id
ORDER BY t0.id, type DESC, time
ID must exist in both tables
This can be achieved by joining both tables against their ids. The result of inner joins are rows that exist in both tables.
SITE should be the value from the same row as the minimum TIME value.
This is the same as "Give me the first value of each group of ids ordered by time". This can be done by using the first_value() window function. Window functions can group your data set (PARTITION BY). So you are getting groups of ids which can be ordered separately. first_value() gives the first value of these ordered groups.
TYPE must be the maximum for each ID.
To get the maximum type per id you'll first have to ORDER BY id, type DESC. You are getting the maximum type as first row per id...
TIME must be the minimum value for the maximum TYPE for each ID.
... Then you can order this result by time additionally to assure this condition.
Now you have an ordered data set: For each id, the row with the maximum type and its minimum time is the first one.
DISTINCT ON gives you exactly the first row of each group. In this case the group you defined is (id). The result is your expected one.
I would write this using distinct on and in/exists:
select distinct on (t0.id) t0.*
from table0 t0
where exists (select 1 from table1 t1 where t1.id = t0.id)
order by t0.id, type desc, time asc;
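The same ordering idea (highest type first, then earliest time, keep the first row per id) can be verified against the sample data. SQLite has no DISTINCT ON, so this sketch swaps in ROW_NUMBER() to pick the first row of each group; it is an equivalent formulation under that assumption, not the PostgreSQL query itself. Python's sqlite3 serves as the stand-in database and needs SQLite 3.25+ for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table0 (id TEXT, type INTEGER, time TEXT, site INTEGER)")
conn.execute("CREATE TABLE table1 (id TEXT)")
conn.executemany(
    "INSERT INTO table0 VALUES (?, ?, ?, ?)",
    [
        ("aa", 1, "12-18", 100),
        ("aa", 1, "12-10", 101),
        ("bb", 2, "12-10", 102),
        ("cc", 1, "12-09", 100),
        ("cc", 2, "12-12", 103),
        ("cc", 2, "12-01", 109),
        ("cc", 1, "12-07", 101),
        ("dd", 1, "12-08", 100),
    ],
)
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [("aa",), ("cc",), ("cc",), ("dd",), ("dd",)],
)

# Rank rows per id by type (descending) and then time (ascending); the row
# ranked 1 plays the role of the row DISTINCT ON would keep.
rows = conn.execute(
    """
    SELECT id, type, time, site
    FROM (
        SELECT t0.*,
               ROW_NUMBER() OVER (
                   PARTITION BY t0.id ORDER BY t0.type DESC, t0.time
               ) AS rn
        FROM table0 t0
        WHERE EXISTS (SELECT 1 FROM table1 t1 WHERE t1.id = t0.id)
    ) ranked
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

Using EXISTS instead of a join also avoids multiplying rows when table1 contains the same id more than once, as it does for cc and dd.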

Redundant values while fetching distinct values from column after joins

While I was trying to fetch unique email ids from my postgres database, I am still getting redundant values. The query is as follows :
select distinct(t2.email_id), user_id, registration_date,
last_login, status, count_uo
from (
select t1.*
from (
select distinct(u.email_id), u.user_id,
u.registration_date, u.last_login,
u.status, count(distinct(uo.id)) as count_uo
from users u
join user_offers uo on u.user_id = uo.user_id
and u.email_id != ''
and uo.offer_id in ('13', '9', 18, 7, 19, 25)
join user_utils uu on u.user_id = uu.user_id
and uu.carrier ~* 'Airtel'
or uu.carrier ~* 'Jio'
or uu.carrier ~* 'Idea'
or uu.carrier ~* '!dea'
where u.registration_date::date between date'2016-08-04' and date'2017-09-28'
and u.last_login::date between date'2017-06-01' and date'2017-09-29'
and u.gender = 'm'
and u.status = 'sms-verified'
and u.email_verification_status = 'UN-VERIFIED'
and u.email_id != '' group by u.user_id
) as t1
where t1.count_uo >1 and t1.count_uo < 100
) t2;
I get the output as follows, even after applying distinct twice.
email_id | user_id | registration_date | last_login | status | count_uo
---------------+---------+----------------------------+----------------------------+--------------+----------
abc@gmail.com | 509 | 2017-07-26 16:59:50.608219 | 2017-07-26 17:56:54.88664 | sms-verified | 3
def@gmail.com | 518 | 2017-08-18 19:26:45.217283 | 2017-08-22 15:38:01.591841 | sms-verified | 3
abc@gmail.com | 512 | 2017-08-17 12:01:00.003048 | 2017-08-21 17:52:56.303841 | sms-verified | 3
Since I'm weak in SQL, any help will be appreciated very much.
If you are using Postgres, you can use distinct on:
select distinct on (t2.email_id) t2.email_id, user_id,
registration_date, last_login, status, count_uo
from ( . . . ) t2
order by t2.email_id;
You can add a second key to the order by to get a particular row (say the most recent login by using order by t2.email_id, last_login desc).
You have two users (rows) with 'abc@gmail.com' as email_id. Notice that they have distinct values in the user_id column (509 and 512).
As @GordonLinoff said, you can hide one of those results by using a DISTINCT ON clause. But I suspect that is not what you want...
I imagine it's more likely that you inserted some test data and duplicated 'abc@gmail.com' in it.
This also points to (I think) a mistake in your model definition: UNIQUE constraints on both the email_id and user_id columns of your users table are missing, and adding them would prevent this from happening again.

select rows satisfying some criteria and with maximum value in a certain column

I have a table of metadata for updates to a software package. The table has columns id, name, version. I want to select all rows where the name is one of some given list of names and the version is maximum of all the rows with that name.
For example, given these records:
+----+------+---------+
| id | name | version |
+----+------+---------+
| 1 | foo | 1 |
| 2 | foo | 2 |
| 3 | bar | 4 |
| 4 | bar | 5 |
+----+------+---------+
And given the task "give me the highest versions of records "foo" and "bar"", I want the result to be:
+----+------+---------+
| id | name | version |
+----+------+---------+
| 2 | foo | 2 |
| 4 | bar | 5 |
+----+------+---------+
What I have come up with so far uses nested queries:
SELECT *
FROM updates
WHERE (
id IN (SELECT id
FROM updates
WHERE name = 'foo'
ORDER BY version DESC
LIMIT 1)
) OR (
id IN (SELECT id
FROM updates
WHERE name = 'bar'
ORDER BY version DESC
LIMIT 1)
);
This works, but feels wrong. If I want to filter on more names, I have to replicate the whole subquery multiple times. Is there a better way to do this?
select distinct on (name) id, name, version
from updates
where name in ('foo', 'bar')
order by name, version desc
NOT EXISTS is a way to avoid unwanted sub optimal tuples:
SELECT *
FROM updates uu
WHERE uu.zname IN ('foo', 'bar')
AND NOT EXISTS (
SELECT *
FROM updates nx
WHERE nx.zname = uu.zname
AND nx.version > uu.version
);
Note: I replaced name by zname, since it is more or less a keyword in postgresql.
Update after rereading the Q:
I want to select all rows where the name is one of some given list
of names and the version is maximum of all the rows with that name.
If there can be ties (multiple rows with the maximum version per name), you could use the window function rank() in a subquery. Requires PostgreSQL 8.4+.
SELECT *
FROM (
SELECT *, rank() OVER (PARTITION BY name ORDER BY version DESC) AS rnk
FROM updates
WHERE name IN ('foo', 'bar')
) sub
WHERE rnk = 1;
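To see how rank() behaves when there is a tie, here is a small sketch on the question's sample rows plus one extra row. Python's sqlite3 is used as a stand-in for PostgreSQL (window functions need SQLite 3.25+), and the row with id 5 is hypothetical, added only to create a tie for "foo".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE updates (id INTEGER, name TEXT, version INTEGER)")
conn.executemany(
    "INSERT INTO updates VALUES (?, ?, ?)",
    [
        (1, "foo", 1),
        (2, "foo", 2),
        (3, "bar", 4),
        (4, "bar", 5),
        (5, "foo", 2),  # hypothetical tie with id 2 on the highest version
    ],
)

# rank() gives every row sharing the top version the rank 1.
rows = conn.execute(
    """
    SELECT id, name, version
    FROM (
        SELECT *, rank() OVER (PARTITION BY name ORDER BY version DESC) AS rnk
        FROM updates
        WHERE name IN ('foo', 'bar')
    ) sub
    WHERE rnk = 1
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

Both tied "foo" rows are returned. If exactly one row per name is wanted instead, row_number() with an extra tie-breaking sort key would be the usual alternative.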