Oracle distinct on single column returning row - sql

I have an api endpoint that accepts distinct arguments for filtering on specific columns. For this reason I'm trying to build a query that is easy to add arbitrary filters to the base query. For some reason if I use:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW"
GROUP BY ID)
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
I get terrible performance so I started using this query:
SELECT * FROM "MY_VIEW"
-- Distinct on ID filter
LEFT JOIN(
SELECT DISTINCT
FIRST_VALUE("MY_VIEW"."ID")
OVER(PARTITION BY "MY_VIEW"."UNIQUE_ID") as DISTINCT_ID
FROM "MY_VIEW"
) d ON d.DISTINCT_ID = "MY_VIEW"."ID"
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
)
However when I left join it discards the distinct filter.
Also I can't use rowid because it is a view.
The view is a versioned table.
Index Info
UNIQUENESS | STATUS | INDEX_TYPE | TEMPORARY | PARTITIONED | JOIN_INDEX | COLUMNS
NONUNIQUE | VALID | NORMAL | N | NO | NO | ID
UNIQUE | VALID | NORMAL | N | NO | NO | UNIQUE_ID
NONUNIQUE | VALID | DOMAIN | N | NO | NO | NAME

I don't have enough reputation to leave a "comment" so I will post this as an "answer." Your first example is:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW"
GROUP BY ID)
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC
But do you realize that the "GROUP BY ID" clause negates the effect of the MAX() function on ID? In other words, you will get all the rows and the MAX will be computed on each row's ID, returning . . . that row's ID. Perhaps try:
SELECT "MY_VIEW".*
FROM "MY_VIEW"
-- Distinct on ID filter
WHERE ID IN (SELECT Max(ID)
FROM "MY_VIEW")
-- Other arbitrary filters...
ORDER BY "MY_VIEW"."NAME" DESC

Related

Postgresql select distinct Column A based on certain conditions on Column B

I have a table with data:
+--------+---------+
| userid | status |
+--------+---------+
| user_1 | success |
| user_2 | fail |
| user_2 | success |
| user_3 | fail |
| user_3 | fail |
+--------+---------+
I would like my query output to be distinct on userid but with condition that between fail and success values in status column. I would like to choose success instead (if both fail as in user_3, choose fail then). The table below shows the output that I would like to have as my result:
+--------+---------+
| userid | status |
+--------+---------+
| user_1 | success |
| user_2 | success |
| user_3 | fail |
+--------+---------+
Any efficient query would be nice as well. Thanks!
Here is a pretty efficient way to get the results you need.
SELECT userid, MAX(status)
FROM table1
GROUP BY userid
The MAX() function will work for strings as well.
Since, "success" > "fail",
if a userid has 1 row of "success" and 1 row of "fail", the maximum value is "success"
Use SELECT DISTINCT ON, which provides a simple and readable method to get the rows unique on userid. The ORDER BY ensures that status = 'success' is sorted before 'fail', and hence 'success'is selected if present:
SELECT DISTINCT ON (userid) userid,
status
FROM my_table
ORDER BY userid,
status DESC;
Note: An multicolumn index on (status, userid) may help performance. Also, in some cases a query using GROUP BY (see the answer from Terence) may be faster than the one using DISTINCT.
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row
of each set of rows where the given expressions evaluate to equal. ...
The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.
(From SELECT DISTINCT docs )
First select status is success userid.
select distinct userid,status from yourtable where status='success'
Second select userid not contain success status.
select distinct userid,status from yourtable where
userid not in(select distinct userid from yourtable where status='success')
Then union.
select distinct userid,status from yourtable where status='success'
union
select distinct userid,status from yourtable where
userid not in(select distinct userid from yourtable where status='success')

PostgreSQL remove duplicates by GROUP BY

I would like to print the last message of a person, but only his latest message should be printed per person. I use PostgreSQL 10.
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test3 | 2017-07-07 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I have tried this with the following SQL query, this gives me exactly that back but unfortunately the people are doubled in it.
SELECT * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
+-----------+----------+--------------+
| name | body | created_at |
+-----------+----------+--------------+
| Maria | Test1 | 2016-06-01 |
| Maria | Test2 | 2016-11-01 |
| Maria | Test3 | 2017-07-07 |
| Paul | Test4 | 2017-01-01 |
| Paul | Test5 | 2017-06-01 |
+-----------+----------+--------------+
I tried to remove the duplicates with a DISTINCT, but unfortunately I get this error message:
SELECT DISTINCT ON (name) * FROM messages
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 1: SELECT DISTINCT ON (name) * FROM messages ^ : SELECT DISTINCT ON (name) * FROM messages WHERE receive = 't' GROUP BY name ORDER BY MAX(created_at) DESC
Do you have any ideas how I can solve this ?
You would use DISTINCT ON as follows:
SELECT DISTINCT ON (name) *
FROM messages
WHERE receive = 't'
ORDER BY name, created_at DESC
That is:
no GROUP BY clause is needed
the column(s) listed in DISTINCT ON(...) must appear first in the ORDER BY clause
... followed by the column that should be use to break the group (here, that is created_at)
Note that the results of a distinct on query are always sorted by columns in the clause (because this sort is what is used to identifiy which rows should be kept).
If you want more control over the sort order, then you can use window functions instead:
SELECT *
FROM (
SELECT m.*, ROW_NUMBER() OVER(PARTITION BY name ORDER BY created_at DESC) rn
FROM messages m
WHERE receive = 't'
) t
WHERE rn = 1
ORDER BY created_at DESC
Use DISTINCT ON, but with the right ORDER BY:
SELECT DISTINCT ON (name) m.*
FROM messages m
WHERE receive = 't'
ORDER BY name, created_at DESC;
In general, you don't use DISTINCT ON with GROUP BY. It is used with ORDER BY. The way it works is to that it chooses the first row for each name based on the ORDER BY clause.
You should not be thinking of what you are doing as aggregation. You want to filter based on the created_at. In many databases, you would express this using a correlated subquery:
select m.*
from messages m
where m.created_at = (select max(m2.created_at)
from messages m2
where m2.name = m.name and m2.receive = 't'
) and
m.receive = 't'; -- this condition is probably not needed
SELECT *
FROM messages
WHERE receive = 't' and not exists (
select 1
from messages m
where m.receive = message.receive and messages.name = m.name and m.created_at > messages.created_at
)
ORDER BY created_at DESC
The query above finds the messages which fulfill the following criteria:
receive is 't'
there not exists another message which
has the same value for receive
has the same name
and is newer
Assuming that the same name does not send two messages at exactly the same time this should be enough. Another point to make is that the name might look similar, but be different, if some white characters are present inside the value, so, if you see two records in the result with the same name, but with different created_at in the query above, then it is highly probable that white characters are playing tricks on you.

Greatest N Per Group with JOIN and multiple order columns

I have two tables:
Table0:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-18 | 100 |
| aa | 1 | 12-10 | 101 |
| bb | 2 | 12-10 | 102 |
| cc | 1 | 12-09 | 100 |
| cc | 2 | 12-12 | 103 |
| cc | 2 | 12-01 | 109 |
| cc | 1 | 12-07 | 101 |
| dd | 1 | 12-08 | 100 |
and
Table1:
| ID |
|----|
| aa |
| cc |
| cc |
| dd |
| dd |
I'm trying to output results where:
ID must exist in both tables.
TYPE must be the maximum for each ID.
TIME must be the minimum value for the maximum TYPE for each ID.
SITE should be the value from the same row as the minimum TIME value.
Given my sample data, my results should look like this:
| ID | TYPE | TIME | SITE |
|----|------|-------|------|
| aa | 1 | 12-10 | 101 |
| cc | 2 | 12-01 | 109 |
| dd | 1 | 12-08 | 100 |
I've tried these statements:
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MASTY, MIN("TIME") AS MASTM
FROM TABLE0
GROUP BY "ID") AS MAS,
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MSD.MASTY =MA."TYPE"
...which generates a syntax error
INSERT INTO "NuTable"
SELECT DISTINCT(QTS."ID"), "SITE",
CASE WHEN MAS.MAB=1 THEN 'B'
WHEN MAS.MAB=2 THEN 'F'
ELSE NULL END,
"TIME"
FROM (SELECT DISTINCT("ID") FROM TABLE1) AS QTS,
TABLE0 AS MA,
(SELECT "ID", MAX("TYPE") AS MAB
FROM TABLE0
GROUP BY "ID") AS MAS,
((SELECT "ID", MIN("TIME") AS MACTM, MIN("TYPE") AS MACTY
FROM TABLE0
WHERE "TYPE" = 1
GROUP BY "ID")
UNION
(SELECT "ID", MIN("TIME"), MAX("TYPE")
FROM TABLE0
WHERE "TYPE" = 2
GROUP BY "ID")) AS MACU
WHERE QTS."ID" = MA."ID"
AND QTS."ID" = MAS."ID"
AND MACU."ID" = QTS."ID"
AND MA."TIME" = MACU.MACTM
AND MA."TYPE" = MACU.MACTB
... which is getting the wrong results.
Answering your direct question "how to avoid...":
You get this error when you specify a column in a SELECT area of a statement that isn't present in the GROUP BY section and isn't part of an aggregating function like MAX, MIN, AVG
in your data, I cannot say
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id
I didn't say what to do with SITE; it's either a key of the group (in which case I'll get every unique combination of ID,site and the min time in each) or it should be aggregated (eg max site per ID)
These are ok:
SELECT
ID, max(site), min(time)
FROM
table
GROUP BY
id
SELECT
ID, site, min(time)
FROM
table
GROUP BY
id,site
I cannot simply not specify what to do with it- what should the database return in such a case? (If you're still struggling, tell me in the comments what you think the db should do, and I'll better understand your thinking so I can tell you why it can't do that ). The programmer of the database cannot make this decision for you; you must make it
Usually people ask this when they want to identify:
The min time per ID, and get all the other row data as well. eg "What is the full earliest record data for each id?"
In this case you have to write a query that identifies the min time per id and then join that subquery back to the main data table on id=id and time=mintime. The db runs the subquery, builds a list of min time per id, then that effectively becomes a filter of the main data table
SELECT * FROM
(
SELECT
ID, min(time) as mintime
FROM
table
GROUP BY
id
) findmin
INNER JOIN table t ON t.id = findmin.id and t.time = findmin.mintime
What you cannot do is start putting the other data you want into the query that does the grouping, because you either have to group by the columns you add in (makes the group more fine grained, not what you want) or you have to aggregate them (and then it doesn't necessarily come from the same row as other aggregated columns - min time is from row 1, min site is from row 3 - not what you want)
Looking at your actual problem:
The ID value must exist in two tables.
The Type value must be largest group by id.
The Time value must be smallest in the largest type group.
Leaving out a solution that involves having or analytics for now, so you can get to grips with the theory here:
You need to find the max type group by id, and then join it back to the table to get the other relevant data also (time is needed) for that id/maxtype and then on this new filtered data set you need the id and min time
SELECT t.id,min(t.time) FROM
(
SELECT
ID, max(type) as maxtype
FROM
table
GROUP BY
id
) findmax
INNER JOIN table t ON t.id = findmax.id and t.type = findmax.maxtype
GROUP BY t.id
If you can't see why, let me know
demo:db<>fiddle
SELECT DISTINCT ON (t0.id)
t0.id,
type,
time,
first_value(site) OVER (PARTITION BY t0.id ORDER BY time) as site
FROM table0 t0
JOIN table1 t1 ON t0.id = t1.id
ORDER BY t0.id, type DESC, time
ID must exist in both tables
This can be achieved by joining both tables against their ids. The result of inner joins are rows that exist in both tables.
SITE should be the value from the same row as the minimum TIME value.
This is the same as "Give me the first value of each group ofids ordered bytime". This can be done by using the first_value() window function. Window functions can group your data set (PARTITION BY). So you are getting groups of ids which can be ordered separately. first_value() gives the first value of these ordered groups.
TYPE must be the maximum for each ID.
To get the maximum type per id you'll first have to ORDER BY id, type DESC. You are getting the maximum type as first row per id...
TIME must be the minimum value for the maximum TYPE for each ID.
... Then you can order this result by time additionally to assure this condition.
Now you have an ordered data set: For each id, the row with the maximum type and its minimum time is the first one.
DISTINCT ON gives you exactly the first row of each group. In this case the group you defined is (id). The result is your expected one.
I would write this using distinct on and in/exists:
select distinct on (t0.id) t0.*
from table0 t0
where exists (select 1 from table1 t1 where t1.id = t0.id)
order by t0.id, type desc, time asc;

Query to count the frequence of many-to-many associations

I have two tables with a many-to-many association in postgresql. The first table contains activities, which may count zero or more reasons:
CREATE TABLE activity (
id integer NOT NULL,
-- other fields removed for readability
);
CREATE TABLE reason (
id varchar(1) NOT NULL,
-- other fields here
);
For performing the association, a join table exists between those two tables:
CREATE TABLE activity_reason (
activity_id integer NOT NULL, -- refers to activity.id
reason_id varchar(1) NOT NULL, -- refers to reason.id
CONSTRAINT activity_reason_activity FOREIGN KEY (activity_id) REFERENCES activity (id),
CONSTRAINT activity_reason_reason FOREIGN KEY (reason_id) REFERENCES reason (id)
);
I would like to count the possible association between activities and reasons. Supposing I have those records in the table activity_reason:
+--------------+------------+
| activity_id | reason_id |
+--------------+------------+
| 1 | A |
| 1 | B |
| 2 | A |
| 2 | B |
| 3 | A |
| 4 | C |
| 4 | D |
| 4 | E |
+--------------+------------+
I should have something like:
+-------+---+------+-------+
| count | | | |
+-------+---+------+-------+
| 2 | A | B | NULL |
| 1 | A | NULL | NULL |
| 1 | C | D | E |
+-------+---+------+-------+
Or, eventually, something like :
+-------+-------+
| count | |
+-------+-------+
| 2 | A,B |
| 1 | A |
| 1 | C,D,E |
+-------+-------+
I can't find the SQL query to do this.
I think you can get what you want using this query:
SELECT count(*) as count, reasons
FROM (
SELECT activity_id, array_agg(reason_id) AS reasons
FROM (
SELECT A.activity_id, AR.reason_id
FROM activity A
LEFT JOIN activity_reason AR ON AR.activity_id = A.activity_id
ORDER BY activity_id, reason_id
) AS ordered_reasons
GROUP BY activity_id
) reason_arrays
GROUP BY reasons
First you aggregate all the reasons for an activity into an array for each activity. You have to order the associations first, otherwise ['a','b'] and ['b','a'] will be considered different sets and will have individual counts. You also need to include the join or any activity that doesn't have any reasons won't show up in the result set. I'm not sure if that is desirable or not, I can take it back out if you want activities that don't have a reason to not be included. Then you count the number of activities that have the same sets of reasons.
Here is a sqlfiddle to demonstrate
As mentioned by Gordon Linoff you could also use a string instead of an array. I'm not sure which would be better for performance.
We need to compare sorted lists of reasons to identify equal sets.
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id) AS reason_list
FROM (SELECT * FROM activity_reason ORDER BY activity_id, reason_id) ar1
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
ORDER BY reason_id in the innermost subquery would work, too, but adding activity_id is typically faster.
And we don't strictly need the innermost subquery at all. This works as well:
SELECT count(*) AS ct, reason_list
FROM (
SELECT array_agg(reason_id ORDER BY reason_id) AS reason_list
FROM activity_reason
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
But it's typically slower for processing all or most of the table. Quoting the manual:
Alternatively, supplying the input values from a sorted subquery will usually work.
We could use string_agg() instead of array_agg(), and that would work for your example with varchar(1) (which might be more efficient with data type "char", btw). It can fail for longer strings, though. The aggregated value can be ambiguous.
If reason_id would be an integer (like it typically is), there is another, faster solution with sort() from the additional module intarray:
SELECT count(*) AS ct, reason_list
FROM (
SELECT sort(array_agg(reason_id)) AS reason_list
FROM activity_reason2
GROUP BY activity_id
) ar2
GROUP BY reason_list
ORDER BY ct DESC, reason_list;
Related, with more explanation:
Compare arrays for equality, ignoring order of elements
Storing and comparing unique combinations
You can do this using string_agg():
select reasons, count(*)
from (select activity_id, string_agg(reason_id, ',' order by reason_id) as reasons
from activity_reason
group by activity_id
) a
group by reasons
order by count(*) desc;

Deterministic sort order for window functions

I've a status table and I want to fetch the latest details.
Slno | ID | Status | date
1 | 1 | Pass | 15-06-2015 11:11:00 - this is inserted first
2 | 1 | Fail | 15-06-2015 11:11:00 - this is inserted second
3 | 2 | Fail | 15-06-2015 12:11:11 - this is inserted first
4 | 2 | Pass | 15-06-2015 12:11:11 - this is inserted second
I use a window function with partition by ID order by date desc to fetch the first value.
Excepted Output :
2 | 1 | Fail | 15-06-2015 11:11:00 - this is inserted second
4 | 2 | Pass | 15-06-2015 12:11:11 - this is inserted second
Actual Output :
1 | 1 | Pass | 15-06-2015 11:11:00 - this is inserted first
3 | 2 | Fail | 15-06-2015 12:11:11 - this is inserted first
According to [http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_order_by_WF.html], adding a second ORDER BY column to the window function may solve the problem. But I don't have any other column to differentiate the rows!
Is there another approach to solve the issue?
EDIT: I've added slno here for clarity. I don't have slno as such in the table!
My SQL:
with range as (
select id from status where date between 01-06-2015 and 30-06-2015
), latest as (
select status, id, row_number() OVER (PARTITION BY id ORDER BY date DESC) row_num
)
select * from latest where row_num = 1
If you don't have slno in your table, then you don't have any reliable information which row was inserted first. There is no natural order in a table, the physical order of rows can change any time (with any update, or with VACUUM, etc.)
You could use an unreliable trick: order by the internal ctid.
select *
from (
select id, status
, row_number() OVER (PARTITION BY id
ORDER BY date, ctid) AS row_num
from status -- that's your table name??
where date >= '2015-06-01' -- assuming column is actually a date
and date < '2015-07-01'
) sub
where row_num = 1;
In absence of any other information which row came first (which is a design error to begin with, fix it!), you might try to save what you can using the internal tuple ID ctid
In-order sequence generation
Rows will be in physical order when inserted initially, but that can change any time with any write operation to the table or VACUUM or other events.
This is a measure of last resort and it will break.
Your presented query was invalid on several counts: missing column name in 1st CTE, missing table name in 2nd CTE, ...
You don't need a CTE for this.
Simpler with DISTINCT ON (considerations for ctid apply the same):
SELECT DISTINCT ON (id)
id, status
FROM status
WHERE date >= '2015-06-01'
AND date < '2015-07-01'
ORDER BY id, date, ctid;
Select first row in each GROUP BY group?