PostgreSQL: select distinct column A based on conditions on column B

I have a table with data:
+--------+---------+
| userid | status |
+--------+---------+
| user_1 | success |
| user_2 | fail |
| user_2 | success |
| user_3 | fail |
| user_3 | fail |
+--------+---------+
I would like the query output to be distinct on userid, with one condition: when a userid has both fail and success values in the status column, success should be chosen. If all rows for a userid are fail (as for user_3), fail should be chosen. The table below shows the result I would like to get:
+--------+---------+
| userid | status |
+--------+---------+
| user_1 | success |
| user_2 | success |
| user_3 | fail |
+--------+---------+
Any efficient query would be nice as well. Thanks!

Here is a pretty efficient way to get the results you need.
SELECT userid, MAX(status) AS status
FROM table1
GROUP BY userid
The MAX() function works for strings as well.
Since 'success' > 'fail' in string comparison,
if a userid has one 'success' row and one 'fail' row, the maximum value is 'success'.
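As a quick check of the string comparison, the query can be run against the sample data from the question. This is only a minimal sketch: it uses Python's built-in sqlite3 module instead of PostgreSQL (an assumption made so the example is self-contained), and the table name table1 is the one used in the query above.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (assumption for the sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (userid TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("user_1", "success"),
        ("user_2", "fail"),
        ("user_2", "success"),
        ("user_3", "fail"),
        ("user_3", "fail"),
    ],
)

# MAX() on a text column picks the greatest string, so 'success' beats 'fail'
rows = conn.execute(
    "SELECT userid, MAX(status) AS status "
    "FROM table1 GROUP BY userid ORDER BY userid"
).fetchall()

for userid, status in rows:
    print(userid, status)
```

Note that this relies on 'success' happening to sort after 'fail'; with other status values the ordering would have to be expressed explicitly.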

Use SELECT DISTINCT ON, which provides a simple and readable way to get one row per userid. The ORDER BY sorts status in descending order, so 'success' comes before 'fail' within each userid and is therefore selected if present:
SELECT DISTINCT ON (userid) userid,
status
FROM my_table
ORDER BY userid,
status DESC;
Note: A multicolumn index on (userid, status DESC) may help performance. Also, in some cases a query using GROUP BY (see the MAX() answer above) may be faster than the one using DISTINCT ON.
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row
of each set of rows where the given expressions evaluate to equal. ...
The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.
(From SELECT DISTINCT docs )
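The "first row of each group" rule quoted above can be illustrated outside the database. The following is a rough sketch in plain Python, not PostgreSQL's actual implementation: it sorts the sample rows the way ORDER BY userid, status DESC would and then keeps the first row per userid. The function name distinct_on_first is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    ("user_1", "success"),
    ("user_2", "fail"),
    ("user_2", "success"),
    ("user_3", "fail"),
    ("user_3", "fail"),
]


def distinct_on_first(rows):
    """Emulate DISTINCT ON (userid) ... ORDER BY userid, status DESC."""
    # Sort by the secondary key (status DESC) first, then by the primary key
    # (userid); Python's sort is stable, so the secondary order is preserved.
    ordered = sorted(rows, key=itemgetter(1), reverse=True)
    ordered.sort(key=itemgetter(0))
    # Keep only the first row of each userid group
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


result = distinct_on_first(rows)
print(result)
```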

First, select the userids that have a success status.
select distinct userid,status from yourtable where status='success'
Second, select the userids that have no success status at all.
select distinct userid,status from yourtable where
userid not in(select distinct userid from yourtable where status='success')
Then combine the two results with UNION.
select distinct userid,status from yourtable where status='success'
union
select distinct userid,status from yourtable where
userid not in(select distinct userid from yourtable where status='success')

Related

PostgreSQL remove duplicates by GROUP BY

I would like to print messages, but only the latest message per person. I use PostgreSQL 10.
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test3 | 2017-07-07 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I tried the following SQL query. It returns the messages, but unfortunately each person appears more than once.
SELECT * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test1 | 2016-06-01 |
| Maria | Test2 | 2016-11-01 |
| Maria | Test3 | 2017-07-07 |
| Paul | Test4 | 2017-01-01 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I tried to remove the duplicates with a DISTINCT, but unfortunately I get this error message:
SELECT DISTINCT ON (name) * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
LINE 1: SELECT DISTINCT ON (name) * FROM messages
^
Do you have any ideas how I can solve this?
You would use DISTINCT ON as follows:
SELECT DISTINCT ON (name) *
FROM messages
WHERE receive = 't'
ORDER BY name, created_at DESC
That is:
no GROUP BY clause is needed
the column(s) listed in DISTINCT ON(...) must appear first in the ORDER BY clause
... followed by the column that should be used to pick the row within each group (here, that is created_at)
Note that the results of a DISTINCT ON query are always sorted by the columns in the ORDER BY clause (because this sort is what is used to identify which rows should be kept).
If you want more control over the sort order, then you can use window functions instead:
SELECT *
FROM (
SELECT m.*, ROW_NUMBER() OVER(PARTITION BY name ORDER BY created_at DESC) rn
FROM messages m
WHERE receive = 't'
) t
WHERE rn = 1
ORDER BY created_at DESC
Use DISTINCT ON, but with the right ORDER BY:
SELECT DISTINCT ON (name) m.*
FROM messages m
WHERE receive = 't'
ORDER BY name, created_at DESC;
In general, you don't use DISTINCT ON with GROUP BY. It is used with ORDER BY. It works by choosing the first row for each name based on the ORDER BY clause.
You should not be thinking of what you are doing as aggregation. You want to filter based on the created_at. In many databases, you would express this using a correlated subquery:
select m.*
from messages m
where m.created_at = (select max(m2.created_at)
from messages m2
where m2.name = m.name and m2.receive = 't'
) and
m.receive = 't'; -- keeps rows that are not received out of the result
SELECT *
FROM messages
WHERE receive = 't' and not exists (
select 1
from messages m
where m.receive = messages.receive and messages.name = m.name and m.created_at > messages.created_at
)
ORDER BY created_at DESC
The query above finds the messages which fulfill the following criteria:
receive is 't'
there does not exist another message which
has the same value for receive
has the same name
and is newer
Assuming that the same name does not send two messages at exactly the same time, this should be enough. Another point is that two names might look the same but differ if whitespace characters are present inside the value. So if you see two records in the result of the query above with the same name but different created_at values, it is highly probable that whitespace characters are playing tricks on you.
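The NOT EXISTS approach can be tried on the sample rows from the question. This is a minimal sketch using Python's sqlite3 module in place of PostgreSQL. The receive values are assumptions (the question does not show that column), the extra "Test6" row is hypothetical and only demonstrates that messages that are not received are ignored, and the aliases cur and newer are introduced here for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (name TEXT, body TEXT, created_at TEXT, receive TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [
        ("Maria", "Test1", "2016-06-01", "t"),
        ("Maria", "Test2", "2016-11-01", "t"),
        ("Maria", "Test3", "2017-07-07", "t"),
        ("Paul", "Test4", "2017-01-01", "t"),
        ("Paul", "Test5", "2017-06-01", "t"),
        ("Paul", "Test6", "2018-01-01", "f"),  # hypothetical, not received
    ],
)

# Keep a received message only if no newer received message has the same name
latest = conn.execute(
    """
    SELECT cur.name, cur.body, cur.created_at
    FROM messages cur
    WHERE cur.receive = 't'
      AND NOT EXISTS (
          SELECT 1
          FROM messages newer
          WHERE newer.receive = cur.receive
            AND newer.name = cur.name
            AND newer.created_at > cur.created_at
      )
    ORDER BY cur.created_at DESC
    """
).fetchall()

for row in latest:
    print(row)
```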

Oracle distinct on single column returning row

I have an API endpoint that accepts distinct arguments for filtering on specific columns. For this reason I'm trying to build a base query to which arbitrary filters can easily be added. For some reason if I use:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW"
GROUP BY ID)
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
I get terrible performance so I started using this query:
SELECT * FROM "MY_VIEW"
-- Distinct on ID filter
LEFT JOIN(
SELECT DISTINCT
FIRST_VALUE("MY_VIEW"."ID")
OVER(PARTITION BY "MY_VIEW"."UNIQUE_ID") as DISTINCT_ID
FROM "MY_VIEW"
) d ON d.DISTINCT_ID = "MY_VIEW"."ID"
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
However when I left join it discards the distinct filter.
Also I can't use rowid because it is a view.
The view is a versioned table.
Index Info
UNIQUENESS | STATUS | INDEX_TYPE | TEMPORARY | PARTITIONED | JOIN_INDEX | COLUMNS
NONUNIQUE | VALID | NORMAL | N | NO | NO | ID
UNIQUE | VALID | NORMAL | N | NO | NO | UNIQUE_ID
NONUNIQUE | VALID | DOMAIN | N | NO | NO | NAME
I don't have enough reputation to leave a "comment" so I will post this as an "answer." Your first example is:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW"
GROUP BY ID)
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
But do you realize that the "GROUP BY ID" clause negates the effect of the MAX() function on ID? In other words, you will get all the rows and the MAX will be computed on each row's ID, returning . . . that row's ID. Perhaps try:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW")
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
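To see why grouping by the aggregated column changes nothing, here is a small sketch using Python's sqlite3 module. The table my_view and its three rows are hypothetical stand-ins for the view; they only mirror the index information in the question (ID is non-unique, UNIQUE_ID is unique). The second query is one possible reading of the intent (keep the row with the highest UNIQUE_ID per ID) and is an assumption, not something the question confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_view (id INTEGER, unique_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO my_view VALUES (?, ?, ?)",
    [(1, 100, "a v1"), (1, 101, "a v2"), (2, 102, "b v1")],  # made-up versions
)

# MAX(id) ... GROUP BY id yields every distinct id, so the filter keeps all rows
grouped = conn.execute(
    "SELECT id, unique_id FROM my_view "
    "WHERE id IN (SELECT MAX(id) FROM my_view GROUP BY id) "
    "ORDER BY unique_id"
).fetchall()

# Assumed intent: one row per id, taking the highest unique_id for each id
latest = conn.execute(
    "SELECT id, unique_id FROM my_view "
    "WHERE unique_id IN (SELECT MAX(unique_id) FROM my_view GROUP BY id) "
    "ORDER BY id"
).fetchall()

print(grouped)
print(latest)
```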

Selecting all elements that only one of their associated value is not in a predefined list

Note: I had a hard time choosing a title for this question. I am not sure it describes accurately what I want, so I will be grateful if instead of a downvote you will help to improve the title. :)
I have a table with the following structure:
log:
+-----+-----+
| uid | uip | <- user id and user ip
+-----+-----+
I also have a table with some predefined user id's:
predfined_users:
+-----+
| uid |
+-----+
| 1 |
| 2 |
| 3 |
+-----+
What I am trying to achieve:
My "algorithm" should find every uip for which the following steps produce a count of exactly 1:
Collect from log all distinct users which are associated with the uip
Count how many of the users are NOT in the predfined_users table.
Example:
Let's say this is the list of the users that are associated with the IP address 1.0.0.0:
+-----+
| uid |
+-----+
| 1 |
| 3 |
| 7 |
+-----+
Only one of these values is not in predfined_users (7), so 1.0.0.0 should be returned. I want to select all the uip that satisfy this as well, meaning, only one of the uid associated with them is not in predfined_users. Also, it is worth noting that if a uip is associated only with one uid, then the query should not return it.
What I have already tried
Here is my general idea but I am not sure what to write instead of the ??? or even if I am in the right direction:
SELECT [uip]
FROM log
WHERE (
SELECT COUNT(*)
FROM (
SELECT DISTINCT [uid]
FROM log WHERE [uip] = ???
)a
WHERE uid NOT IN (
SELECT uid
from predfined_users
)
)=1
Something like this:
select l.uip, count(distinct l.uid)
from log l left join
predfined_users pu
on l.uid = pu.uid
where pu.uid is null
group by l.uip;
You can use this query to retrieve the uips that have exactly one associated uid which does not exist in the predfined_users table:
Select uip from [dbo].[Log]
where uid not in(Select uid from predfined_users)
group by uip
having count(distinct uid) = 1
If you want to retrieve the uips that have any uid not present in predfined_users, regardless of how many such uids there are, you can use this code:
Select distinct uip from [dbo].[Log]
where uid not in(Select uid from predfined_users)
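The question also asks that a uip associated with only a single uid is not returned. One way to cover both conditions in a single pass is conditional counting in the HAVING clause. The following is a sketch using Python's sqlite3 module rather than SQL Server; all rows in log other than the 1.0.0.0 example are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (uid INTEGER, uip TEXT)")
conn.execute("CREATE TABLE predfined_users (uid INTEGER)")
conn.executemany("INSERT INTO predfined_users VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [
        (1, "1.0.0.0"), (3, "1.0.0.0"), (7, "1.0.0.0"), (7, "1.0.0.0"),
        (8, "2.0.0.0"),                                   # only one uid at all
        (1, "3.0.0.0"), (8, "3.0.0.0"), (9, "3.0.0.0"),   # two unknown uids
        (1, "4.0.0.0"), (2, "4.0.0.0"),                   # no unknown uid
    ],
)

matching = conn.execute(
    """
    SELECT l.uip
    FROM log l
    LEFT JOIN predfined_users pu ON pu.uid = l.uid
    GROUP BY l.uip
    -- exactly one distinct uid without a match in predfined_users ...
    HAVING COUNT(DISTINCT CASE WHEN pu.uid IS NULL THEN l.uid END) = 1
    -- ... and more than one distinct uid overall
       AND COUNT(DISTINCT l.uid) > 1
    ORDER BY l.uip
    """
).fetchall()

print(matching)
```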

Greatest N Per Group with JOIN and multiple order columns

I have two tables:
Table0:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-18 | 100 |
| aa | 1 | 12-10 | 101 |
| bb | 2 | 12-10 | 102 |
| cc | 1 | 12-09 | 100 |
| cc | 2 | 12-12 | 103 |
| cc | 2 | 12-01 | 109 |
| cc | 1 | 12-07 | 101 |
| dd | 1 | 12-08 | 100 |
and
Table1:
| ID |
|----|
| aa |
| cc |
| cc |
| dd |
| dd |
I'm trying to output results where:
ID must exist in both tables.
TYPE must be the maximum for each ID.
TIME must be the minimum value for the maximum TYPE for each ID.
SITE should be the value from the same row as the minimum TIME value.
Given my sample data, my results should look like this:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-10 | 101 |
| cc | 2 | 12-01 | 109 |
| dd | 1 | 12-08 | 100 |
I've tried these statements:
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MASTY, MIN("TIME") AS MASTM
FROM TABLE0
GROUP BY "ID") AS MAS,
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MSD.MASTY =MA."TYPE"
...which generates a syntax error
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MAB
FROM TABLE0
GROUP BY "ID") AS MAS,
((SELECT "ID", MIN("TIME") AS MACTM, MIN("TYPE") AS MACTY
FROM TABLE0
WHERE "TYPE" = 1
GROUP BY "ID")
UNION
(SELECT "ID", MIN("TIME"), MAX("TYPE")
FROM TABLE0
WHERE "TYPE" = 2
GROUP BY "ID")) AS MACU
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MACU."ID" = QTS."ID"
AND MA."TIME" = MACU.MACTM
AND MA."TYPE" = MACU.MACTB
... which is getting the wrong results.
Answering your direct question "how to avoid...":
You get this error when you specify a column in the SELECT list of a statement that isn't present in the GROUP BY clause and isn't part of an aggregate function like MAX, MIN or AVG.
With your data, I cannot write:
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id
I didn't say what to do with SITE; it's either a key of the group (in which case I'll get every unique combination of ID,site and the min time in each) or it should be aggregated (eg max site per ID)
These are ok:
SELECT
ID, max(site), min(time)
FROM
table
GROUP BY
id
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id,site
I cannot simply leave it unspecified: what should the database return in such a case? (If you're still struggling, tell me in the comments what you think the db should do, and I'll better understand your thinking so I can tell you why it can't do that.) The programmer of the database cannot make this decision for you; you must make it.
Usually people ask this when they want to identify:
The min time per ID, and get all the other row data as well. eg "What is the full earliest record data for each id?"
In this case you have to write a query that identifies the min time per id and then join that subquery back to the main data table on id=id and time=mintime. The db runs the subquery, builds a list of min time per id, then that effectively becomes a filter of the main data table
SELECT * FROM
(
SELECT
ID, min(time) as mintime
FROM
table
GROUP BY
id
) findmin
INNER JOIN table t ON t.id = findmin.id and t.time = findmin.mintime
What you cannot do is start putting the other data you want into the query that does the grouping, because you either have to group by the columns you add in (makes the group more fine grained, not what you want) or you have to aggregate them (and then it doesn't necessarily come from the same row as other aggregated columns - min time is from row 1, min site is from row 3 - not what you want)
Looking at your actual problem:
The ID value must exist in two tables.
The Type value must be largest group by id.
The Time value must be smallest in the largest type group.
Leaving out a solution that involves having or analytics for now, so you can get to grips with the theory here:
You need to find the max type group by id, and then join it back to the table to get the other relevant data also (time is needed) for that id/maxtype and then on this new filtered data set you need the id and min time
SELECT t.id,min(t.time) FROM
(
SELECT
ID, max(type) as maxtype
FROM
table
GROUP BY
id
) findmax
INNER JOIN table t ON t.id = findmax.id and t.type = findmax.maxtype
GROUP BY t.id
If you can't see why, let me know
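Carrying the same join-back idea one step further gives the full expected result, including SITE and the restriction to ids present in Table1. This is a sketch run through Python's sqlite3 module instead of PostgreSQL; the lower-case table names table0 and table1 and the alias names are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table0 (id TEXT, type INTEGER, time TEXT, site INTEGER)")
conn.execute("CREATE TABLE table1 (id TEXT)")
conn.executemany(
    "INSERT INTO table0 VALUES (?, ?, ?, ?)",
    [
        ("aa", 1, "12-18", 100), ("aa", 1, "12-10", 101),
        ("bb", 2, "12-10", 102),
        ("cc", 1, "12-09", 100), ("cc", 2, "12-12", 103),
        ("cc", 2, "12-01", 109), ("cc", 1, "12-07", 101),
        ("dd", 1, "12-08", 100),
    ],
)
conn.executemany(
    "INSERT INTO table1 VALUES (?)", [("aa",), ("cc",), ("cc",), ("dd",), ("dd",)]
)

rows = conn.execute(
    """
    SELECT t.id, t.type, t.time, t.site
    FROM table0 t
    JOIN (
        -- min time within the max type of each id
        SELECT x.id, x.type, MIN(x.time) AS mintime
        FROM table0 x
        JOIN (
            SELECT id, MAX(type) AS maxtype FROM table0 GROUP BY id
        ) findmax ON x.id = findmax.id AND x.type = findmax.maxtype
        GROUP BY x.id, x.type
    ) findmin
      ON t.id = findmin.id AND t.type = findmin.type AND t.time = findmin.mintime
    WHERE t.id IN (SELECT id FROM table1)  -- id must exist in both tables
    ORDER BY t.id
    """
).fetchall()

for row in rows:
    print(row)
```

Using IN for the Table1 check avoids multiplying rows when an id appears there more than once.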
SELECT DISTINCT ON (t0.id)
t0.id,
type,
time,
first_value(site) OVER (PARTITION BY t0.id ORDER BY type DESC, time) as site
FROM table0 t0
JOIN table1 t1 ON t0.id = t1.id
ORDER BY t0.id, type DESC, time
ID must exist in both tables
This can be achieved by joining both tables on their ids. The result of an inner join is the rows that exist in both tables.
SITE should be the value from the same row as the minimum TIME value.
This is the same as "give me the first site of each group of ids, ordered by type descending and then by time". This can be done with the first_value() window function. Window functions can group your data set (PARTITION BY), so you get groups of ids which can be ordered separately; first_value() gives the first value of each ordered group. The window has to use the same ordering as the outer ORDER BY (type DESC, time): ordered by time alone, it would return the site of the earliest row of any type, which is not necessarily the row that DISTINCT ON keeps.
TYPE must be the maximum for each ID.
To get the maximum type per id you'll first have to ORDER BY id, type DESC. You are getting the maximum type as first row per id...
TIME must be the minimum value for the maximum TYPE for each ID.
... Then you can order this result by time additionally to assure this condition.
Now you have an ordered data set: For each id, the row with the maximum type and its minimum time is the first one.
DISTINCT ON gives you exactly the first row of each group. In this case the group you defined is (id). The result is your expected one.
I would write this using distinct on and in/exists:
select distinct on (t0.id) t0.*
from table0 t0
where exists (select 1 from table1 t1 where t1.id = t0.id)
order by t0.id, type desc, time asc;

Query to count the frequency of many-to-many associations

I have two tables with a many-to-many association in PostgreSQL. The first table contains activities, which may have zero or more reasons:
CREATE TABLE activity (
id integer NOT NULL,
-- other fields removed for readability
);
CREATE TABLE reason (
id varchar(1) NOT NULL,
-- other fields here
);
For performing the association, a join table exists between those two tables:
CREATE TABLE activity_reason (
activity_id integer NOT NULL, -- refers to activity.id
reason_id varchar(1) NOT NULL, -- refers to reason.id
CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);
I would like to count the possible association between activities and reasons. Supposing I have those records in the table activity_reason:
+--------------+------------+
| activity_id | reason_id |
+--------------+------------+
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 3 | A |
| 4 | C |
| 4 | D |
| 4 | E |
+--------------+------------+
I should have something like:
+-------+---+------+-------+
| count | | | |
+-------+---+------+-------+
| 2 | A | B | NULL |
| 1 | A | NULL | NULL |
| 1 | C | D | E |
+-------+---+------+-------+
Or, alternatively, something like:
+-------+-------+
| count | |
+-------+-------+
| 2 | A,B |
| 1 | A |
| 1 | C,D,E |
+-------+-------+
I can't find the SQL query to do this.
I think you can get what you want using this query:
SELECT count(*) as count, reasons
FROM (
SELECT activity_id, array_agg(reason_id) AS reasons
FROM (
SELECT A.id AS activity_id, AR.reason_id
FROM activity A
LEFT JOIN activity_reason AR ON AR.activity_id = A.id
ORDER BY activity_id, reason_id
) AS ordered_reasons
GROUP BY activity_id
) reason_arrays
GROUP BY reasons
First you aggregate all the reasons for an activity into an array for each activity. You have to order the associations first, otherwise ['a','b'] and ['b','a'] will be considered different sets and will have individual counts. You also need to include the join or any activity that doesn't have any reasons won't show up in the result set. I'm not sure if that is desirable or not, I can take it back out if you want activities that don't have a reason to not be included. Then you count the number of activities that have the same sets of reasons.
As mentioned by Gordon Linoff you could also use a string instead of an array. I'm not sure which would be better for performance.
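The reason the associations must be ordered before aggregating can be shown in a few lines of plain Python. This is only an illustration of the counting logic, not of how PostgreSQL executes the query; the sample rows are those from the question, with activity 2 deliberately listed as B, A to show that sorting makes it equal to activity 1.

```python
from collections import Counter, defaultdict

# (activity_id, reason_id) pairs; activity 2 is intentionally out of order
activity_reason = [
    (1, "A"), (1, "B"),
    (2, "B"), (2, "A"),
    (3, "A"),
    (4, "C"), (4, "D"), (4, "E"),
]

# Step 1: collect the reasons of each activity (the array_agg step)
reasons_by_activity = defaultdict(list)
for activity_id, reason_id in activity_reason:
    reasons_by_activity[activity_id].append(reason_id)

# Step 2: sort each list so equal sets compare equal, then count them
counts = Counter(tuple(sorted(r)) for r in reasons_by_activity.values())

# Without sorting, ('A', 'B') and ('B', 'A') would be counted separately
unsorted_counts = Counter(tuple(r) for r in reasons_by_activity.values())

for reasons, count in counts.most_common():
    print(count, ",".join(reasons))
```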
We need to compare sorted lists of reasons to identify equal sets.
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id) AS reason_list
FROM (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.
And we don't strictly need the innermost subquery at all. This works as well:
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
FROM activity_reason
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
But it's typically slower for processing all or most of the table. Quoting the manual:
Alternatively, supplying the input values from a sorted subquery will usually work.
We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", btw). It can fail for longer strings, though. The aggregated value can be ambiguous.
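The ambiguity with longer strings is easy to reproduce. A small sketch in plain Python, where str.join stands in for string_agg() and the reason values are made up: two different sets of reasons produce the same aggregated string once a value contains the separator, while the array-like form keeps them apart.

```python
# Two different reason sets (hypothetical values)
reasons_x = ["a,b"]      # a single reason whose text contains a comma
reasons_y = ["a", "b"]   # two separate reasons

# Aggregating to a comma-separated string loses the distinction
as_string_x = ",".join(sorted(reasons_x))
as_string_y = ",".join(sorted(reasons_y))
print(as_string_x == as_string_y)

# Keeping the elements separate (like an array) preserves it
as_array_x = tuple(sorted(reasons_x))
as_array_y = tuple(sorted(reasons_y))
print(as_array_x == as_array_y)
```

With varchar(1) values and a separator that cannot occur in them, as in the question, the string form is safe.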
If reason_id were an integer (as it typically is), there is another, faster solution using sort() from the additional module intarray:
SELECT count(*) AS ct, reason_list
FROM (
SELECT sort(array_agg(reason_id)) AS reason_list
FROM activity_reason2
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
Related, with more explanation:
Compare arrays for equality, ignoring order of elements
Storing and comparing unique combinations
You can do this using string_agg():
select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
from activity_reason
group by activity_id
) a
group by reasons
order by count(*) desc;