Oracle dedupe rows based on max values of 2 columns in conjunction

Was wondering if anyone knew an efficient way to dedupe records in a large dataset using Oracle SQL based on the max values of 2 attributes in conjunction.
In the hypothetical example below, I am looking to remove all duplicate CompanyID / ChildID pairs by first selecting the maximum transactionid and then, where duplicates still remain, the maximum BatchID.
Note: transactionid and BatchID may have null values (which would be expected to be the lowest value).
Table: Transaction
CompanyID | ChildID | transactionid | BatchID | Product Details
--- | --- | --- | --- | ---
ABC | EFG | 306 | | Product1
ABC | EFG | 306 | 54 | Product2
ZXY | BFG | 405 | 003 | Product1
ZXY | BFG | 405 | 004 | Product2
ZXY | BFG | 407 | | Product3
Expected Result:
ABC | EFG | 306 | 54 | Product 2 --selected on basis of highest transactionid and batchid
ZXY | BFG | 405 | 407 | Product 3 --selected on basis of highest transactionid
I envisioned simply:
1) Using a max function on the transactionid and subquerying the result to max the batchID in addition
2) Self joining the "de-duped" set to the original set to obtain product information
Does anybody know of a more efficient / cleaner way to achieve this and a way to handle the nulls better?
Appreciate any feedback.

From Oracle 11g onward, you can use this kind of query:
with w(CompanyID, ChildID, transactionid, BatchID, Product_Details) as
(
select 'ABC', 'EFG', 306, null, 'Product1 ' from dual
union all
select 'ABC', 'EFG', 306, 54, 'Product2' from dual
union all
select 'ZXY', 'BFG', 405, 003, 'Product1' from dual
union all
select 'ZXY', 'BFG', 405, 004, 'Product2' from dual
union all
select 'ZXY', 'BFG', 407, null, 'Product3' from dual
)
select w.CompanyID,
w.ChildID,
max(w.transactionid) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_transactionid,
max(w.batchid) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_batchid,
max(w.Product_Details) keep (dense_rank last order by nvl(w.transactionid, 0), nvl(w.batchid, 0)) max_Product_Details
from w
group by w.CompanyID, w.ChildID
;
The nvl function allows you to handle null cases. Here is the output (which does not match yours, but I wrote the query as I understood what you wanted):
COMPANYID | CHILDID | MAX_TRANSACTIONID | MAX_BATCHID | MAX_PRODUCT_DETAILS
--- | --- | --- | --- | ---
ABC | EFG | 306 | 54 | Product2
ZXY | BFG | 407 | | Product3
EDIT: Let me try to explain further DENSE_RANK and LAST: inside a GROUP BY, this syntax appears as an aggregate function (like SUM, AVG...).
In a group, the ORDER BY gives the sorting (here, transactionid and batchid)
then the DENSE_RANK LAST states that you will focus on the last ranked row(s) of this sorting (you can have indeed several rows with same rank)
the MAX takes the maximum value inside these top-ranked rows. Most of the time, you only have one row so MAX can appear useless, but it is not. So you will often see MIN and DENSE_RANK FIRST, or MAX and DENSE_RANK LAST.
Here is the Oracle doc on this subject.
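To make the three steps above concrete, here is a rough Python emulation of MAX(...) KEEP (DENSE_RANK LAST ORDER BY ...). It is only a sketch of the semantics, not how Oracle implements it; the helper names nvl, keep_last and col_max are made up for this illustration, and the rows are the sample data from the question.

```python
from itertools import groupby

# Sample rows from the question: (company, child, transaction_id, batch_id, product).
ROWS = [
    ("ABC", "EFG", 306, None, "Product1"),
    ("ABC", "EFG", 306, 54, "Product2"),
    ("ZXY", "BFG", 405, 3, "Product1"),
    ("ZXY", "BFG", 405, 4, "Product2"),
    ("ZXY", "BFG", 407, None, "Product3"),
]


def nvl(value, default=0):
    """Mimic Oracle's NVL: replace None (NULL) with a default."""
    return default if value is None else value


def keep_last(rows):
    """Per (company, child) group, emulate
    MAX(col) KEEP (DENSE_RANK LAST ORDER BY nvl(transaction_id, 0), nvl(batch_id, 0))."""
    group_key = lambda r: (r[0], r[1])
    sort_key = lambda r: (nvl(r[2]), nvl(r[3]))
    result = []
    for key, group in groupby(sorted(rows, key=group_key), key=group_key):
        group = list(group)
        top = max(sort_key(r) for r in group)            # the last rank of the ordering
        tied = [r for r in group if sort_key(r) == top]  # every row sharing that rank

        def col_max(i):
            # MAX is applied to the tied rows only; None is skipped like SQL NULL.
            values = [r[i] for r in tied if r[i] is not None]
            return max(values) if values else None

        result.append(key + (col_max(2), col_max(3), col_max(4)))
    return result


RESULT = keep_last(ROWS)
```

Note that each MAX is evaluated on its own over the tied rows, so if several rows share the last rank, the returned columns may come from different rows.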

Because you are dealing with multiple columns, you should also consider just using row_number():
select t.*
from (select t.*,
row_number() over (partition by CompanyId, ChildId
order by transactionid desc nulls last, BatchID desc nulls last
) as seqnum
from t
) t
where seqnum = 1;
The keep/dense_rank method is fast. I'm not sure if doing it multiple times is faster than using row_number(). Testing can give you this information.
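For anyone who wants to try the row_number() approach without an Oracle instance, here is a small sketch using Python's built-in sqlite3 module as a stand-in (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added). The "x IS NULL" sort keys are my substitute for NULLS LAST so the query does not depend on dialect defaults; the table and data are the sample from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (CompanyID TEXT, ChildID TEXT, transactionid INT, "
    "BatchID INT, Product_Details TEXT)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", [
    ("ABC", "EFG", 306, None, "Product1"),
    ("ABC", "EFG", 306, 54, "Product2"),
    ("ZXY", "BFG", 405, 3, "Product1"),
    ("ZXY", "BFG", 405, 4, "Product2"),
    ("ZXY", "BFG", 407, None, "Product3"),
])

DEDUPED = conn.execute("""
    SELECT CompanyID, ChildID, transactionid, BatchID, Product_Details
    FROM (SELECT t.*,
                 row_number() OVER (
                     PARTITION BY CompanyID, ChildID
                     -- "x IS NULL" is 0 or 1, so NULLs sort after real values
                     ORDER BY transactionid IS NULL, transactionid DESC,
                              BatchID IS NULL, BatchID DESC
                 ) AS seqnum
          FROM t) AS ranked
    WHERE seqnum = 1
    ORDER BY CompanyID, ChildID
""").fetchall()
```

Unlike the keep/dense_rank version, every column here comes from one and the same row per group.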

Related

Filter the rows of the last N group

Suppose I have created the following query (I use SQL Server), which returns the following output:
SELECT *
FROM DB
ORDER BY CLIENT_ID
In that case, how can I update the query above to select only the last 2 CLIENT_IDs? I should also be able to use any other number, like the last 20, last 60, last 100, etc.
In my example the expected output would be
meaning that we see only the rows related to the last 2 clients, B99 and C93 (the first client, A19, is filtered out since it is not among the last 2).

This will give you the expected output. But as others already mentioned, it's unclear what "last" means. I'm just guessing from your expected result.
Also, please don't post photos of tables next time.
SELECT A.CLIENT_ID, A.PRICE_BILL
FROM DB A
WHERE A.CLIENT_ID IN (
SELECT DISTINCT TOP(2) A.CLIENT_ID
FROM DB A
ORDER BY A.CLIENT_ID DESC
)
ORDER BY A.CLIENT_ID ASC, A.PRICE_BILL ASC
See Demo

You can accomplish what you require using dense_rank() and filtering out the last 2 rankings.
The reason you use dense_rank is because it assigns the same ranking to ties thereby ranking all of the same CLIENT_ID the same. Also note the reverse ordering of the dense_rank to make it easy to filter out the last 2 values because they are ranked 1 & 2.
declare @MyTable table (CLIENT_ID varchar(3), PRICE_BILL int);
insert into @MyTable (CLIENT_ID, PRICE_BILL)
values
('A19',91), ('A19',29), ('A19',92)
, ('B99',85), ('B99',202)
, ('C93',399), ('C93',929), ('C93',929);
with cte as (
select *
, dense_rank() over (order by CLIENT_ID desc) dr
from @MyTable
)
select *
from cte
where dr < 3
order by CLIENT_ID;
Returns:
CLIENT_ID | PRICE_BILL | dr
--- | --- | ---
B99 | 85 | 2
B99 | 202 | 2
C93 | 399 | 1
C93 | 929 | 1
C93 | 929 | 1
fiddle
Note the provision of sample data as DDL+DML makes it much easier for people to assist.
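If you need the group count to be a parameter (last 2, last 20, last 60...), the same dense_rank() idea can be wrapped in a function. Below is a sketch in Python with the stdlib sqlite3 module standing in for SQL Server (assuming SQLite 3.25+ for window functions); the function name last_n_clients and the table name my_table are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (client_id TEXT, price_bill INT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)", [
    ("A19", 91), ("A19", 29), ("A19", 92),
    ("B99", 85), ("B99", 202),
    ("C93", 399), ("C93", 929), ("C93", 929),
])


def last_n_clients(n):
    """Return all rows belonging to the n highest client ids."""
    return conn.execute("""
        SELECT client_id, price_bill
        FROM (SELECT client_id, price_bill,
                     -- descending order: the last client gets rank 1
                     dense_rank() OVER (ORDER BY client_id DESC) AS dr
              FROM my_table)
        WHERE dr <= ?
        ORDER BY client_id, price_bill
    """, (n,)).fetchall()


LAST_TWO = last_n_clients(2)
```

Here "last" is taken to mean the highest CLIENT_ID in sort order, which is the same guess made in the answers above.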

How to limit number of groups returned in a query, but not the number of rows in Oracle

How to limit the number of groups in a query, but not the number of rows in Oracle?
If I had to do that manually, I would have to use a DISTINCT.
Would be something like this:
FOR d IN (
SELECT DISTINCT COLUMN_1 FROM myTable
WHERE myDate BETWEEN x AND y
OFFSET o ROWS
FETCH NEXT l ROWS ONLY
) LOOP
And then, do the selects from each of the ids returned in the query, which, in my opinion, is a terrible solution.
SAMPLE DATA:
If I limit the number of groups to 2 by using COLUMN_2, the expected result should be something like:

I believe you may be looking for something like this:
select *
from my_table
where id in (
select distinct id
from my_table
where my_date between x and y
fetch first :n rows only
)
;
:n is a bind variable, encoding the number of groups you want to select.
This should be more efficient than solutions using analytic functions - even if it must read the base table twice. In tests posted on OTN, I showed that the difference is not small.
EDIT If I remember correctly, FETCH is not implemented in the most efficient way (perhaps for good reasons, having to do with features we don't need in this query - such as how to deal with ties). FETCH itself resembles a DENSE_RANK() implementation rather than the faster row limiting clause (using ROWNUM). I would likely need to modify the query to do away with FETCH, if speed was really important. END EDIT
Further edit to do with performance comparisons
Frequent poster MT0 requested a pointer for the claim that aggregate solutions can be (and often are) more efficient than analytic function approaches, even when the former may require multiple passes through the data where the analytic function approach requires only one.
Alas, OTN (what now calls itself the "Oracle Groundbreakers Developer Community", the discussion board hosted by Oracle itself) went through a massive - and massively botched - platform change at the end of September 2020; that messed up both the search facilities and the formatting of old posts, to the point of rendering them almost unusable.
Instead, I will show here a simple mock-up of the OP's problem in this thread; code that anyone can run so they can repeat the tests on their own machine.
I created a table with two columns, ID and STR - the ID plays the same role as in the OP's question, and STR is just extra payload to mimic real-life data. ID is number and STR is varchar2(100). I populated the table with 9 million rows - 1 million ID's, nine rows for each ID. The task is to select just three "groups" (three distinct ID's, then select all the rows from the base table for those three distinct ID's).
With no index on the ID column, the aggregate solution runs in 0.81 seconds on my machine; with an index on ID, it runs in 0.47 seconds. The analytic functions solution runs in 0.91 seconds, with or without an index (obviously - there is no way an index can benefit the analytic function solution). All these results are for column ID not declared NOT NULL.
Here is the code to create the table, the index on ID, and the two queries I tested. Note: As I explained in my first edit (above), fetch is slow; I replaced it with a standard row-limiting technique using ROWNUM in an over-query.
drop table t purge;
create table t (id number, str varchar2(100));
insert into t
with row_gen as (select level from dual connect by level <= 3000)
select mod(344227 * rownum, 1000000), rpad('x', 100, 'x')
from row_gen cross join row_gen
;
commit;
create index t_idx on t(id);
select *
from t
where id in (
select id from (select distinct id from t)
where rownum <= 3
);
select *
from ( select t.*, dense_rank() over (order by id) dr from t )
where dr <= 3;
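Before timing the two queries it is worth checking that they return the same rows. Here is a scaled-down sketch of that check in Python with sqlite3 (a stand-in for Oracle, so ROWNUM becomes LIMIT; SQLite 3.25+ assumed). I added an ORDER BY to the inner DISTINCT query, which the original does not have, purely so that both variants pick the same three ids. It says nothing about Oracle timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INT, str TEXT)")
# 10 distinct ids with 9 rows each: a tiny stand-in for the 9-million-row table.
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i % 10, "x" * 100) for i in range(90)])
conn.execute("CREATE INDEX t_idx ON t(id)")

# Aggregate style: pick the ids first, then fetch all of their rows.
in_rows = conn.execute("""
    SELECT id, str
    FROM t
    WHERE id IN (SELECT DISTINCT id FROM t ORDER BY id LIMIT 3)
""").fetchall()

# Analytic style: rank every row, then filter on the rank.
dr_rows = conn.execute("""
    SELECT id, str
    FROM (SELECT t.*, dense_rank() OVER (ORDER BY id) AS dr FROM t)
    WHERE dr <= 3
""").fetchall()

same_result = sorted(in_rows) == sorted(dr_rows)
```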

You can use DENSE_RANK:
SELECT *
FROM (
SELECT t.*,
DENSE_RANK() OVER ( ORDER BY column2 ) AS rnk
FROM table_name t
)
WHERE rnk <= 2;
Which, for the sample data:
CREATE TABLE table_name ( column1, column2, column3, column4 ) AS
SELECT 1, 1, 1.0, 1.0 FROM DUAL UNION ALL
SELECT 2, 2, 2.0, 2.0 FROM DUAL UNION ALL
SELECT 2, 2, 2.2, 2.1 FROM DUAL UNION ALL
SELECT 2, 2, 2.2, 2.2 FROM DUAL UNION ALL
SELECT 2, 2, 2.0, 2.3 FROM DUAL UNION ALL
SELECT 3, 3, 3.0, 3.1 FROM DUAL UNION ALL
SELECT 3, 3, 3.1, 3.1 FROM DUAL UNION ALL
SELECT 3, 3, 3.1, 3.1 FROM DUAL UNION ALL
SELECT 4, 4, 4.2, 4.0 FROM DUAL;
Outputs:
COLUMN1 | COLUMN2 | COLUMN3 | COLUMN4 | RNK
------: | ------: | ------: | ------: | --:
1 | 1 | 1 | 1 | 1
2 | 2 | 2 | 2 | 2
2 | 2 | 2.2 | 2.1 | 2
2 | 2 | 2.2 | 2.2 | 2
2 | 2 | 2 | 2.3 | 2
(and, if you want DISTINCT rows then add DISTINCT to the outer query)
db<>fiddle here

If I understand correctly, you want ROW_NUMBER():
SELECT t.*
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) as seqnum
FROM myTable t
WHERE t.myDate BETWEEN x AND y
) t
WHERE seqnum = 1;
This returns an arbitrary row for each id meeting the conditions.

Getting the row which has the maximum value of a column [duplicate]

As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.
Specifically, if I've got a purchases table that looks like this:
SELECT * FROM purchases;
My Output:
id | customer | total
--- | --- | ---
1 | Joe | 5
2 | Sally | 3
3 | Joe | 2
4 | Sally | 1
I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:
SELECT FIRST(id), customer, FIRST(total)
FROM purchases
GROUP BY customer
ORDER BY total DESC;
Expected Output:
FIRST(id) | customer | FIRST(total)
--- | --- | ---
1 | Joe | 5
2 | Sally | 3

DISTINCT ON is typically simplest and fastest for this in PostgreSQL.
(For performance optimization for certain workloads see below.)
SELECT DISTINCT ON (customer)
id, customer, total
FROM purchases
ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2)
id, customer, total
FROM purchases
ORDER BY 2, 3 DESC, 1;
If total can be null, add NULLS LAST:
...
ORDER BY customer, total DESC NULLS LAST, id;
Works either way, but you'll want to match existing indexes
db<>fiddle here
Major points
DISTINCT ON is a PostgreSQL extension of the standard, where only DISTINCT on the whole SELECT list is defined.
List any number of expressions in the DISTINCT ON clause, the combined row value defines duplicates. The manual:
Obviously, two rows are considered distinct if they differ in at least
one column value. Null values are considered equal in this
comparison.
Bold emphasis mine.
DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY must be in the set of expressions in DISTINCT ON, but you can rearrange order among those freely. Example.
You can add additional expressions to ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:
The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.
I added id as last item to break ties:
"Pick the row with the smallest id from each group sharing the highest total."
To order results in a way that disagrees with the sort order determining the first per group, you can nest above query in an outer query with another ORDER BY. Example.
If total can be null, you most probably want the row with the greatest non-null value. Add NULLS LAST like demonstrated. See:
Sort by column ASC, but NULL values first?
The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way:
You don't have to include any of the expressions in DISTINCT ON or ORDER BY.
You can include any other expression in the SELECT list. This is instrumental for replacing complex subqueries and aggregate / window functions.
I tested with Postgres versions 8.3 – 15. But the feature has been there at least since version 7.1, so basically always.
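For readers without a Postgres instance at hand, the row-picking rule of DISTINCT ON (customer) ... ORDER BY customer, total DESC NULLS LAST, id can be mimicked in a few lines of Python: sort, then keep the first row of each run of equal customers. This is only a sketch of the semantics, not of how Postgres executes the query; the function name distinct_on_customer is made up for the illustration.

```python
from itertools import groupby

# (id, customer, total): the sample rows from the question.
PURCHASES = [(1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 2), (4, "Sally", 1)]


def distinct_on_customer(rows):
    # ORDER BY customer, total DESC NULLS LAST, id
    ordered = sorted(
        rows,
        key=lambda r: (r[1], r[2] is None, -(r[2] or 0), r[0]),
    )
    # DISTINCT ON (customer): first row of each group of equal customers.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[1])]


TOP = distinct_on_customer(PURCHASES)
```

With the extra id sort key, a tie on the highest total resolves to the smallest id, as described above.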
Index
The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
May be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is perfectly applicable.
Effectiveness / Performance optimization
Weigh cost and benefit before creating tailored indexes for each query. The potential of above index largely depends on data distribution.
The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though. Example.
For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:
Configuration parameter work_mem in PostgreSQL on Linux
Optimize simple query using ORDER BY date and text
For many rows per customer (low cardinality in column customer), an "index skip scan" or "loose index scan" would be (much) more efficient. But that's not implemented up to Postgres 15. Serious work to implement it one way or another has been ongoing for years now, but so far unsuccessful. See here and here.
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
SELECT DISTINCT is slower than expected on my table in PostgreSQL
Optimize GROUP BY query to retrieve latest row per user
Optimize groupwise maximum query
Query last N related rows per row
Benchmarks
See separate answer.

On databases that support CTE and windowing functions:
WITH summary AS (
SELECT p.id,
p.customer,
p.total,
ROW_NUMBER() OVER(PARTITION BY p.customer
ORDER BY p.total DESC) AS rank
FROM PURCHASES p)
SELECT *
FROM summary
WHERE rank = 1
Supported by any database:
But you need to add logic to break ties:
SELECT MIN(x.id), -- change to MAX if you want the highest
x.customer,
x.total
FROM PURCHASES x
JOIN (SELECT p.customer,
MAX(total) AS max_total
FROM PURCHASES p
GROUP BY p.customer) y ON y.customer = x.customer
AND y.max_total = x.total
GROUP BY x.customer, x.total
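To see the tie-breaking of the second query in action, here is a quick sketch that runs it through Python's sqlite3 module (SQLite is just a convenient stand-in here). Row 5 is an extra row I added to the question's data so that Joe has two purchases with the same maximum total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    (1, "Joe", 5), (2, "Sally", 3), (3, "Joe", 2), (4, "Sally", 1),
    (5, "Joe", 5),  # extra row (not in the question) that ties with id 1
])

BEST = conn.execute("""
    SELECT MIN(x.id) AS id, x.customer, x.total
    FROM purchases x
    JOIN (SELECT customer, MAX(total) AS max_total
          FROM purchases
          GROUP BY customer) y
      ON y.customer = x.customer
     AND y.max_total = x.total
    GROUP BY x.customer, x.total
    ORDER BY x.customer
""").fetchall()
```

Without the MIN(x.id) and the outer GROUP BY, Joe would come back twice, once for id 1 and once for id 5.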

Benchmarks
I tested the most interesting candidates:
Initially with Postgres 9.4 and 9.5.
Added accented tests for Postgres 13 later.
Basic test setup
Main table: purchases:
CREATE TABLE purchases (
id serial -- PK constraint added below
, customer_id int -- REFERENCES customer
, total int -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);
Dummy data (with some dead tuples), PK, index:
INSERT INTO purchases (customer_id, total, some_column) -- 200k rows
SELECT (random() * 10000)::int AS customer_id -- 10k distinct customers
, (random() * random() * 100000)::int AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM generate_series(1,200000) g;
ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);
DELETE FROM purchases WHERE random() > 0.9; -- some dead rows
INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int AS customer_id -- 10k customers
, (random() * random() * 100000)::int AS total
, 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM generate_series(1,20000) g; -- add 20k to make it ~ 200k
CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);
VACUUM ANALYZE purchases;
customer table - used for optimized query:
CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM purchases
GROUP BY 1
ORDER BY 1;
ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);
VACUUM ANALYZE customer;
In my second test for 9.5 I used the same setup, but with 100000 distinct customer_id to get few rows per customer_id.
Object sizes for table purchases
Basic setup: 200k rows in purchases, 10k distinct customer_id, avg. 20 rows per customer.
For Postgres 9.5 I added a 2nd test with 86446 distinct customers - avg. 2.3 rows per customer.
Generated with a query taken from here:
Measure the size of a PostgreSQL table row
Gathered for Postgres 9.5:
what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
core_relation_size | 20496384 | 20 MB | 102
visibility_map | 0 | 0 bytes | 0
free_space_map | 24576 | 24 kB | 0
table_size_incl_toast | 20529152 | 20 MB | 102
indexes_size | 10977280 | 10 MB | 54
total_size_incl_toast_and_indexes | 31506432 | 30 MB | 157
live_rows_in_text_representation | 13729802 | 13 MB | 68
------------------------------ | | |
row_count | 200045 | |
live_tuples | 200045 | |
dead_tuples | 19955 | |
Queries
1. row_number() in CTE, (see other answer)
WITH cte AS (
SELECT id, customer_id, total
, row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
)
SELECT id, customer_id, total
FROM cte
WHERE rn = 1;
2. row_number() in subquery (my optimization)
SELECT id, customer_id, total
FROM (
SELECT id, customer_id, total
, row_number() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
FROM purchases
) sub
WHERE rn = 1;
3. DISTINCT ON (see other answer)
SELECT DISTINCT ON (customer_id)
id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC, id;
4. rCTE with LATERAL subquery (see here)
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT id, customer_id, total
FROM purchases
ORDER BY customer_id, total DESC
LIMIT 1
)
UNION ALL
SELECT u.*
FROM cte c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id > c.customer_id -- lateral reference
ORDER BY customer_id, total DESC
LIMIT 1
) u
)
SELECT id, customer_id, total
FROM cte
ORDER BY customer_id;
5. customer table with LATERAL (see here)
SELECT l.*
FROM customer c
, LATERAL (
SELECT id, customer_id, total
FROM purchases
WHERE customer_id = c.customer_id -- lateral reference
ORDER BY total DESC
LIMIT 1
) l;
6. array_agg() with ORDER BY (see other answer)
SELECT (array_agg(id ORDER BY total DESC))[1] AS id
, customer_id
, max(total) AS total
FROM purchases
GROUP BY customer_id;
Results
Execution time for above queries with EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF), best of 5 runs to compare with warm cache.
All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some only to benefit from the smaller size of the index, others more effectively.
A. Postgres 9.4 with 200k rows and ~ 20 per customer_id
1. 273.274 ms
2. 194.572 ms
3. 111.067 ms
4. 92.922 ms -- !
5. 37.679 ms -- winner
6. 189.495 ms
B. Same as A. with Postgres 9.5
1. 288.006 ms
2. 223.032 ms
3. 107.074 ms
4. 78.032 ms -- !
5. 33.944 ms -- winner
6. 211.540 ms
C. Same as B., but with ~ 2.3 rows per customer_id
1. 381.573 ms
2. 311.976 ms
3. 124.074 ms -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms
Retest with Postgres 13 on 2021-08-11
Simplified test setup: no deleted rows, because VACUUM ANALYZE cleans the table completely for the simple case.
Important changes for Postgres:
General performance improvements.
CTEs can be inlined since Postgres 12, so query 1. and 2. now perform mostly identical (same query plan).
D. Like B. ~ 20 rows per customer_id
1. 103 ms
2. 103 ms
3. 23 ms -- winner
4. 71 ms
5. 22 ms -- winner
6. 81 ms
db<>fiddle here
E. Like C. ~ 2.3 rows per customer_id
1. 127 ms
2. 126 ms
3. 36 ms -- winner
4. 620 ms
5. 145 ms
6. 203 ms
db<>fiddle here
Accented tests with Postgres 13
1M rows, 10.000 vs. 100 vs. 1.6 rows per customer.
F. with ~ 10.000 rows per customer
1. 526 ms
2. 527 ms
3. 127 ms
4. 2 ms -- winner !
5. 1 ms -- winner !
6. 356 ms
db<>fiddle here
G. with ~ 100 rows per customer
1. 535 ms
2. 529 ms
3. 132 ms
4. 108 ms -- !
5. 71 ms -- winner
6. 376 ms
db<>fiddle here
H. with ~ 1.6 rows per customer
1. 691 ms
2. 684 ms
3. 234 ms -- winner
4. 4669 ms
5. 1089 ms
6. 1264 ms
db<>fiddle here
Conclusions
DISTINCT ON uses the index effectively and typically performs best for few rows per group. And it performs decently even with many rows per group.
For many rows per group, emulating an index skip scan with an rCTE performs best - second only to the query technique with a separate lookup table (if that's available).
The row_number() technique demonstrated in the currently accepted answer never wins any performance test. Not then, not now. It never comes even close to DISTINCT ON, not even when the data distribution is unfavorable for the latter. The only good thing about row_number(): it does not scale terribly, just mediocre.
More benchmarks
Benchmark by "ogr" with 10M rows and 60k unique "customers" on Postgres 11.5. Results are in line with what we have seen so far:
Proper way to access latest row for each individual identifier?
Original (outdated) benchmark from 2011
I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):
1. Select the whole table, results in 5958 rows in this case.
A: 567.218 ms
B: 386.673 ms
2. Use condition WHERE customer BETWEEN x AND y resulting in 1000 rows.
A: 249.136 ms
B: 55.111 ms
3. Select a single customer with WHERE customer = x.
A: 0.143 ms
B: 0.072 ms
Same test repeated with the index described in the other answer:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
1A: 277.953 ms
1B: 193.547 ms
2A: 249.796 ms -- special index not used
2B: 28.679 ms
3A: 0.120 ms
3B: 0.048 ms

This is the common greatest-n-per-group problem, which already has well tested and highly optimized solutions. Personally I prefer the left join solution by Bill Karwin (the original post with lots of other solutions).
Note that a bunch of solutions to this common problem can, surprisingly, be found in the MySQL manual. Even though your problem is in Postgres, not MySQL, the solutions given should work with most SQL variants. See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.

In Postgres you can use array_agg like this:
SELECT customer,
(array_agg(id ORDER BY total DESC))[1],
max(total)
FROM purchases
GROUP BY customer
This will give you the id of each customer's largest purchase.
Some things to note:
array_agg is an aggregate function, so it works with GROUP BY.
array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
You could use array_agg in a similar way for your third output column, but max(total) is simpler.
Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.

The Query:
SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p
ON
p.customer = purchases.customer
AND
purchases.total < p.total
WHERE p.total IS NULL
HOW DOES THAT WORK! (I've been there)
We want to make sure that we only have the highest total for each customer.
Some Theoretical Stuff (skip this part if you only want to understand the query)
Let Total be a function T(customer, id) that returns the total for a given customer and id.
To prove that a given total T(customer, id) is the highest, we have to prove either
∀x T(customer,id) > T(customer,x) (this total is higher than all other
total for that customer)
OR
¬∃x T(customer, id) < T(customer, x) (there exists no higher total for
that customer)
The first approach will need us to get all the records for that name which I do not really like.
The second one will need a smart way to say there can be no record higher than this one.
Back to SQL
If we left join the table to itself on the customer and on the total being less than the joined table's total:
LEFT JOIN purchases as p
ON
p.customer = purchases.customer
AND
purchases.total < p.total
we make sure that every record that has another record with a higher total for the same customer gets joined to it:
+--------------+--------------------+-----------------+------+------------+---------+
| purchases.id | purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+--------------------+-----------------+------+------------+---------+
| 1            | Tom                | 200             | 2    | Tom        | 300     |
| 2            | Tom                | 300             |      |            |         |
| 3            | Bob                | 400             | 4    | Bob        | 500     |
| 4            | Bob                | 500             |      |            |         |
| 5            | Alice              | 600             | 6    | Alice      | 700     |
| 6            | Alice              | 700             |      |            |         |
+--------------+--------------------+-----------------+------+------------+---------+
That will help us filter for the highest total for each customer with no grouping needed:
WHERE p.total IS NULL
+--------------+--------------------+-----------------+------+------------+---------+
| purchases.id | purchases.customer | purchases.total | p.id | p.customer | p.total |
+--------------+--------------------+-----------------+------+------------+---------+
| 2            | Tom                | 300             |      |            |         |
| 4            | Bob                | 500             |      |            |         |
| 6            | Alice              | 700             |      |            |         |
+--------------+--------------------+-----------------+------+------------+---------+
And that's the answer we need.
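If you want to run the walk-through yourself, here is a small sketch using Python's sqlite3 module with the Tom/Bob/Alice rows from the tables above. I aliased the left-hand table as cur (my own naming, not in the query above) to avoid any ambiguity in the self join, and added a hypothetical row 7 at the end to show what happens on a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    (1, "Tom", 200), (2, "Tom", 300),
    (3, "Bob", 400), (4, "Bob", 500),
    (5, "Alice", 600), (6, "Alice", 700),
])

ANTI_JOIN = """
    SELECT cur.id, cur.customer, cur.total
    FROM purchases AS cur
    LEFT JOIN purchases AS p
           ON p.customer = cur.customer
          AND cur.total < p.total      -- p is any row that beats cur
    WHERE p.total IS NULL              -- keep cur only if nothing beats it
    ORDER BY cur.id
"""

TOP_ROWS = conn.execute(ANTI_JOIN).fetchall()

# Hypothetical extra row that ties with Tom's best purchase.
conn.execute("INSERT INTO purchases VALUES (7, 'Tom', 300)")
WITH_TIE = conn.execute(ANTI_JOIN).fetchall()
```

Because the join condition uses a strict "<", two rows tied for the highest total both survive the filter, so add a tie-breaker to the ON clause if you need exactly one row per customer.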

This solution is not very efficient, as pointed out by Erwin, because of the presence of subqueries:
select * from purchases p1 where total in
(select max(total) from purchases where p1.customer=customer) order by total desc;

I use this way (postgresql only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29
-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $1;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
sfunc = public.first_agg,
basetype = anyelement,
stype = anyelement
);
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
Then your example should work almost as is:
SELECT FIRST(id), customer, FIRST(total)
FROM purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;
CAVEAT: It ignores NULL rows
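The caveat follows from how Postgres treats a STRICT transition function without an initial condition: NULL inputs are skipped and the first non-NULL input becomes the starting state. Here is a rough Python model of that behaviour (the helper name strict_aggregate is made up; it only mimics the semantics, not the actual aggregate machinery):

```python
from functools import reduce


def strict_aggregate(sfunc, values):
    """Model of a Postgres aggregate with a STRICT transition function and
    no initial condition: None (NULL) inputs are skipped, and the first
    non-None input seeds the state."""
    non_null = [v for v in values if v is not None]
    return reduce(sfunc, non_null) if non_null else None


first_agg = lambda state, value: state   # body of first_agg: SELECT $1
last_agg = lambda state, value: value    # body of last_agg:  SELECT $2

FIRST = strict_aggregate(first_agg, [None, 7, 3, None])
LAST = strict_aggregate(last_agg, [None, 7, 3, None])
```

So first() gives 7 rather than NULL for this input, and last() gives 3.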
Edit 1 - Use the postgres extension instead
Now I use this way: http://pgxn.org/dist/first_last_agg/
To install on ubuntu 14.04:
apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'
It's a postgres extension that gives you first and last functions; apparently faster than the above way.
Edit 2 - Ordering and filtering
If you use aggregate functions (like these), you can order the results, without the need to have the data already ordered:
http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES
So the equivalent example, with ordering would be something like:
SELECT first(id order by id), customer, first(total order by id)
FROM purchases
GROUP BY customer
ORDER BY first(total);
Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.

Use ARRAY_AGG function for PostgreSQL, U-SQL, IBM DB2, and Google BigQuery SQL:
SELECT customer, (ARRAY_AGG(id ORDER BY total DESC))[1], MAX(total)
FROM purchases
GROUP BY customer

In SQL Server you can do this:
SELECT *
FROM (
SELECT ROW_NUMBER()
OVER(PARTITION BY customer
ORDER BY total DESC) AS StRank, *
FROM Purchases) n
WHERE StRank = 1
Explanation: the rows are partitioned by customer and ordered by total within each partition; each row in a partition gets a serial number as StRank, and we keep only the row whose StRank is 1 for each customer.

Very fast solution
SELECT a.*
FROM
purchases a
JOIN (
SELECT customer, min( id ) as id
FROM purchases
GROUP BY customer
) b USING ( id );
and really very fast if table is indexed by id:
create index purchases_id on purchases (id);

Snowflake/Teradata supports QUALIFY clause which works like HAVING for windowed functions:
SELECT id, customer, total
FROM PURCHASES p
QUALIFY ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) = 1

In PostgreSQL, another possibility is to use the first_value window function in combination with SELECT DISTINCT:
select distinct customer_id,
first_value(row(id, total)) over(partition by customer_id order by total desc, id)
from purchases;
I created a composite (id, total), so both values are returned by the same aggregate. You can of course always apply first_value() twice.

This way works for me:
SELECT article, dealer, price
FROM shop s1
WHERE price=(SELECT MAX(s2.price)
FROM shop s2
WHERE s1.article = s2.article
GROUP BY s2.article)
ORDER BY article;
Select highest price on each article

This is how we can achieve this by using a window function:
create table purchases (id int4, customer varchar(10), total integer);
insert into purchases values (1, 'Joe', 5);
insert into purchases values (2, 'Sally', 3);
insert into purchases values (3, 'Joe', 2);
insert into purchases values (4, 'Sally', 1);
select ID, CUSTOMER, TOTAL from (
select ID, CUSTOMER, TOTAL,
row_number () over (partition by CUSTOMER order by TOTAL desc) RN
from purchases) A where RN = 1;

The accepted OMG Ponies' "Supported by any database" solution has good speed from my test.
Here I provide a more complete and cleaner any-database solution using the same approach. Ties are considered (assuming you want only one row per customer, even when multiple records share the max total for that customer), and other purchase fields (e.g. purchase_payment_id) are selected from the actual matching rows in the purchase table.
Supported by any database:
select * from purchase
join (
select min(id) as id from purchase
join (
select customer, max(total) as total from purchase
group by customer
) t1 using (customer, total)
group by customer
) t2 using (id)
order by customer
This query is reasonably fast, especially when there is a composite index like (customer, total) on the purchase table.
Remark:
t1 and t2 are subquery aliases, which can be removed depending on the database.
Caveat: the using (...) clause is not supported in MS-SQL as of this edit in Jan 2017. There you have to expand it yourself, e.g. on t2.id = purchase.id. The USING syntax works in SQLite, MySQL, PostgreSQL and Oracle.
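As a sketch of the expanded form mentioned in the caveat, the same query can be written with explicit ON conditions. It is run here against an in-memory SQLite database through Python's sqlite3 module; the sample rows and the aliases p and p2 are assumptions made for illustration. Joe has two purchases tied at the maximum total, and only the one with the lowest id is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table purchase (id integer, customer text, total integer);
    insert into purchase values
        (1, 'Joe', 5), (2, 'Joe', 5), (3, 'Joe', 2),
        (4, 'Sally', 3), (5, 'Sally', 1);
""")

# t1: maximum total per customer.
# t2: lowest id among the rows that reach that maximum (breaks ties).
# The outer join fetches the full row for each surviving id.
winners = conn.execute("""
    select p.id, p.customer, p.total
    from purchase p
    join (
        select min(p2.id) as id
        from purchase p2
        join (
            select customer, max(total) as total
            from purchase
            group by customer
        ) t1 on t1.customer = p2.customer and t1.total = p2.total
        group by p2.customer
    ) t2 on t2.id = p.id
    order by p.customer
""").fetchall()

print(winners)
```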

Use this if you want to select a row from each group of aggregated rows by some specific condition of your own,
or if you want to use another aggregate function (sum/avg) in addition to max/min, so that the DISTINCT ON approach cannot be used.
You can use the following subquery:
SELECT
(
SELECT id FROM t2
WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )
) id,
name,
MAX(amount) ma,
SUM( ratio )
FROM t2 tf
GROUP BY name
You can replace amount = MAX( tf.amount ) with any condition you want, with one restriction: the subquery must not return more than one row.
But if you want to do such things, you are probably looking for window functions.

For SQL Server the most efficient way is:
with
ids as ( --values used to split the table into groups
select i from (values (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) as v(i)
)
,src as (
select * from yourTable where <condition> --use this as filter for other conditions
)
,joined as (
select tops.* from ids
cross apply --works like a for-each over the rows of ids
(
select top(1) * --add an ORDER BY here to control which row is picked
from src
where CommodityId = ids.i
) as tops
)
select * from joined
Also, don't forget to create a clustered index on the columns used.

This can be achieved with the MAX function on total and GROUP BY customer (also grouping by id, which is unique per purchase, would return every row):
SELECT customer, MAX(total) AS total FROM purchases GROUP BY customer
ORDER BY total DESC;

My approach via window functions (dbfiddle):
Assign a row number within each group: row_number() over (partition by agreement_id, order_id ) as nrow
Take only the first row of each group: filter (where nrow = 1)
with intermediate as (select
*,
row_number() over ( partition by agreement_id, order_id ) as nrow,
(sum( suma ) over ( partition by agreement_id, order_id ))::numeric( 10, 2) as order_suma
from <your table>)
select
*,
sum( order_suma ) filter (where nrow = 1) over (partition by agreement_id)
from intermediate

group by 2 fields while returning the first value from other columns

I have a query that will select two fields, group them and then order the results based on the first result of the second field. There is a one to many relationship between these two fields.
select Product
, Material
from dbo.Files
group by Product, Material
ORDER BY Product, MIN(Material)
The result is exactly what we want. In reality, there are actually dozens of "10-001" records but we just get the first one.
Product Material
10 001
10 002
10 003
10 004
10 005
10 006
10 007
10 008
10 009
11 001
11 009
13 012
13 013
13 014
The problem is when I want to display additional columns. Obviously I cannot select additional columns unless I also add them to the group by statement. But when I add them to my group by statement it changes the results.
This is as close as I have come.
select Product
,Material
,XXIMPORT
,Field1
,Field2
,Field3
,Field4
,Field5
from dbo.files
where Field1 is not null
group by Product
,Material
,XXIMPORT
,Field1
,Field2
,Field3
,Field4
,Field5
ORDER BY Product
, MIN(Material)
, MIN(Field1)
, MIN(Field2)
With these results:
Product Material XXIMPORT Field1 Field2
10 NULL NULL OUTER DIAMETER CRITICAL FIT
10 001 5/27/15 Inside Diameter Cross Section
10 001 5/27/15 Part INSIDE DIAMETER
10 002 5/27/15 OUTER DIAMETER INSIDE DIAMETER
10 003 5/27/15 ID OD
10 003 5/27/15 TYPE (TY) Thickness (T)
10 011 5/27/15 OVERALL LENGTH THREAD SIZE
10 012 5/27/15 Height (HT) Outer Diameter (OD)
So I know why the results are changing... but how can I tell SQL Server to just return the first result it finds in the matching "other" columns? For example, just the top "10-001" row or the top "10-003" row.
Everything depends on the result of the first query. Based on other posts on stackoverflow that look very similar, I have tried to put the first part in a sub query with typical results. I have also tried to join the table on itself and add the columns but I must not have the syntax correct.

You can take advantage of the ROW_NUMBER function here to sort your rows in the order you desire, based on Product and Material.
WITH cteFiles AS (
SELECT Product
,Material
,XXIMPORT
,Field1
,Field2
,Field3
,Field4
,Field5
,ROW_NUMBER() OVER(PARTITION BY Product, Material ORDER BY Field1, Field2) AS RowNum
FROM dbo.Files
)
SELECT *
FROM cteFiles
WHERE RowNum = 1;
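A reduced sketch of this CTE, run in an in-memory SQLite database via Python's sqlite3 module (window functions need SQLite 3.25 or later). The table is named Files without the dbo schema, only Field1 and Field2 are kept, and the rows are a hypothetical subset modeled on the question's output. The ORDER BY inside OVER is what defines which row counts as "first" in each Product/Material group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Files (Product TEXT, Material TEXT, Field1 TEXT, Field2 TEXT);
    INSERT INTO Files VALUES
        ('10', '001', 'Part', 'INSIDE DIAMETER'),
        ('10', '001', 'Inside Diameter', 'Cross Section'),
        ('10', '002', 'OUTER DIAMETER', 'INSIDE DIAMETER'),
        ('10', '003', 'TYPE (TY)', 'Thickness (T)'),
        ('10', '003', 'ID', 'OD');
""")

# One row per (Product, Material): the one that sorts first by Field1, Field2.
first_rows = conn.execute("""
    WITH cteFiles AS (
        SELECT Product, Material, Field1, Field2,
               ROW_NUMBER() OVER (PARTITION BY Product, Material
                                  ORDER BY Field1, Field2) AS RowNum
        FROM Files
    )
    SELECT Product, Material, Field1, Field2
    FROM cteFiles
    WHERE RowNum = 1
    ORDER BY Product, Material
""").fetchall()

print(first_rows)
```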

First, your ORDER BY ... MIN() makes no sense. Material is used in your grouping term, so there is only 1 possible value in each group, and MIN is redundant.
As for your question, there is logically no definition of "first" or any other order related feature unless you explicitly specify one (such as using ORDER BY).
You need to explicitly tell SQL what you want it to do.
If you really don't care which values you get in the non-grouping columns, then:
1. you probably shouldn't include them in the result set anyway.
2. You could use one of:
2.1. Correlated Subquery with TOP 1
2.2. CROSS APPLY
2.3. PARTITION BY
2....

SQL Remove Duplicates, save lowest of certain column

I've been looking for an answer to this but couldn't find anything the same as this particular situation.
I have a table that I want to remove duplicates from.
__________________
| JobNumber-String |
| JobOp - Number |
------------------
These two values occur multiple times; together they make up the key for the row. I want to keep all distinct job numbers with the lowest job op. How can I do this? I've tried a bunch of things, mainly the min function, but that only seems to work on the entire table, not on each set of JobNumber rows. Thanks!

Original Table Values:
JobNumber Jobop
123 100
123 101
456 200
456 201
780 300
Code Ran:
DELETE FROM table
WHERE CONCAT(JobNumber,JobOp) NOT IN
(
SELECT CONCAT(JobNumber,MIN(JobOp))
FROM table
GROUP BY JobNumber
)
Ending Table Values:
JobNumber Jobop
123 100
456 200
780 300

With SQL Server 2008 or higher you can enhance the MIN function with an OVER clause specifying a PARTITION BY section.
Please have a look at https://msdn.microsoft.com/en-us/library/ms189461.aspx
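A small sketch of MIN with an OVER (PARTITION BY ...) clause, using the question's sample values. It runs in an in-memory SQLite database through Python's sqlite3 module rather than SQL Server (SQLite 3.25 or later supports the same windowed syntax); the table name jobs is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (JobNumber TEXT, JobOp INTEGER);
    INSERT INTO jobs VALUES
        ('123', 100), ('123', 101), ('456', 200), ('456', 201), ('780', 300);
""")

# Attach the per-JobNumber minimum to every row, then keep the rows
# whose JobOp equals that minimum.
lowest_ops = conn.execute("""
    SELECT JobNumber, JobOp
    FROM (
        SELECT JobNumber, JobOp,
               MIN(JobOp) OVER (PARTITION BY JobNumber) AS MinJobOp
        FROM jobs
    ) AS t
    WHERE JobOp = MinJobOp
    ORDER BY JobNumber
""").fetchall()

print(lowest_ops)
```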

You can simply select the values you want to keep:
select JobNumber, min(JobOp) from table group by JobNumber
Then you can delete the records you don't want:
DELETE t FROM table t
left JOIN (select JobNumber, min(JobOp) as minJobOp from table group by JobNumber ) e
ON t.JobNumber = e.JobNumber and t.JobOp = e.minJobOp
Where e.JobNumber is null

I like to do this with window functions:
with todelete as (
select t.*, min(jobop) over (partition by jobnumber) as minjobop
from table t
)
delete from todelete
where jobop > minjobop;
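Deleting through a CTE like this is SQL Server syntax. As a portable sketch of the same "delete every row above its group's minimum" idea, the following uses a correlated subquery instead (a different technique from the window function above) and runs in an in-memory SQLite database via Python's sqlite3 module. The table name jobs is hypothetical; the rows are the question's sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (JobNumber TEXT, JobOp INTEGER);
    INSERT INTO jobs VALUES
        ('123', 100), ('123', 101), ('456', 200), ('456', 201), ('780', 300);
""")

# Delete each row whose JobOp is greater than the smallest JobOp
# recorded for the same JobNumber.
conn.execute("""
    DELETE FROM jobs
    WHERE JobOp > (SELECT MIN(j2.JobOp)
                   FROM jobs AS j2
                   WHERE j2.JobNumber = jobs.JobNumber)
""")

remaining = conn.execute(
    "SELECT JobNumber, JobOp FROM jobs ORDER BY JobNumber, JobOp"
).fetchall()

print(remaining)
```

Comparing the two key columns directly also avoids concatenating them into a single string, which can make different pairs look identical (for example '12' + '3' and '1' + '23').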

It sounds like you are not using the correct GROUP BY clause with the MIN function. This SQL should give you the minimum JobOp value for each JobNumber:
SELECT JobNumber, MIN(JobOp) FROM test.so_test GROUP BY JobNumber;
Using this in a subquery, along with CONCAT (this is from MySQL; SQL Server might use a different function) because both fields form your key, gives you this SQL, which selects the rows that are not the minimum for their JobNumber:
SELECT * FROM test.so_test WHERE CONCAT(JobNumber,JobOp)
NOT IN (SELECT CONCAT(JobNumber,MIN(JobOp)) FROM test.so_test GROUP BY JobNumber);
