SQL - Need to find duplicates where one column can have multiple values - sql

I am pretty sure this SQL requires using GROUP BY and HAVING, but not sure how to write it.
I have a table similar to this:
ID | Cust# | Order# | ItemCode | DataPoint1 | DataPoint2
---------------------------------------------------------
1  | 001   | 123    | I        | xxxyyyxxx  | 123456
2  | 001   | 123    | Insert   | xxxyyyxxx  | 123456
3  | 001   | 123    | Delete   | asdf       | 9999
4  | 001   | 123    | D        | asdf       | 9999
In this table Rows 1 & 2 are effectively duplicates, as are rows 3 & 4.
This is determined by the ItemCode having the value of 'I' or 'Insert' in rows 1 & 2, and 'D' or 'Delete' in rows 3 & 4.
How could I write a SQL select statement to return rows 2 and 4? I am interested in pulling out the duplicated rows with the higher ID value.
Thanks for any help.

Replace the "offending" column with a consistent value. Then, you can use row_number() or a similar mechanism:
select t.*
from (select t.*,
             row_number() over (partition by Cust#, Order#, left(ItemCode, 1), DataPoint1, DataPoint2
                                order by id asc
                               ) as seqnum
      from t
     ) t
where seqnum > 1;
Note: Not all databases support left(), but all support the functionality somehow. This does assume that the first character of the ItemCode is sufficient to identify identical rows, regardless of the value.
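As a runnable sketch of the idea, here it is against SQLite via Python (an assumption, since the question doesn't name a database). SQLite has no left(), so substr(itemcode, 1, 1) takes its place; the column names Cust# and Order# are simplified to cust and ord, since # would need quoting in most databases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, cust TEXT, ord TEXT, itemcode TEXT,
                datapoint1 TEXT, datapoint2 TEXT);
INSERT INTO t VALUES
  (1, '001', '123', 'I',      'xxxyyyxxx', '123456'),
  (2, '001', '123', 'Insert', 'xxxyyyxxx', '123456'),
  (3, '001', '123', 'Delete', 'asdf',      '9999'),
  (4, '001', '123', 'D',      'asdf',      '9999');
""")

# Number rows within each "duplicate group" (first char of itemcode),
# then keep everything past the first row of each group.
dupes = conn.execute("""
SELECT id FROM (
  SELECT id,
         row_number() OVER (
           PARTITION BY cust, ord, substr(itemcode, 1, 1), datapoint1, datapoint2
           ORDER BY id
         ) AS seqnum
  FROM t
) AS sub
WHERE seqnum > 1
ORDER BY id
""").fetchall()
print(dupes)  # [(2,), (4,)] -- the higher-ID duplicates
```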

Related

Get certain rows, plus rows before and after

Let's say I have the following data set:
ID  | Identifier | Admission_Date | Release_Date
-------------------------------------------------
234 | 2          | 5/1/22         | 5/5/22
234 | 1          | 4/25/22        | 4/30/22
234 | 2          | 4/20/22        | 4/24/22
234 | 2          | 4/15/22        | 4/18/22
789 | 1          | 7/15/22        | 7/19/22
789 | 2          | 7/8/22         | 7/14/22
789 | 2          | 7/1/22         | 7/5/22
321 | 2          | 6/1/21         | 6/3/21
321 | 2          | 5/27/21        | 5/31/21
321 | 1          | 5/20/21        | 5/26/21
321 | 2          | 5/15/21        | 5/19/21
321 | 2          | 5/6/21         | 5/10/21
I want all rows with identifier=1. I also want rows that are either directly below or above rows with Identifier=1 - sorted by most recent to least recent.
There is always a row below rows with identifier=1. There may or may not be a row above. If there is no row with identifier=1 for an ID, then it will not be brought in with a prior step.
The resulting data set should be as follows:
ID  | Identifier | Admission Date | Release Date
-------------------------------------------------
234 | 2          | 5/1/22         | 5/5/22
234 | 1          | 4/25/22        | 4/30/22
234 | 2          | 4/20/22        | 4/24/22
789 | 1          | 7/15/22        | 7/19/22
789 | 2          | 7/8/22         | 7/14/22
321 | 2          | 5/27/21        | 5/31/21
321 | 1          | 5/20/21        | 5/26/21
321 | 2          | 5/15/21        | 5/19/21
I am using DBeaver, which runs PostgreSQL.
I admittedly don't know Postgres well, so the following could possibly be optimised. However, using a combination of lag and lead to obtain the previous and next admission dates (assuming Admission_Date is the column to order by), you could try:
with d as (
  select *,
         case when identifier = 1 then lag(admission_date)
              over (partition by id order by admission_date desc) end as pd,
         case when identifier = 1 then lead(admission_date)
              over (partition by id order by admission_date desc) end as nd
  from t
)
select id, identifier, admission_date, release_date
from d
where identifier = 1
   or exists (
        select * from d d2
        where d2.id = d.id
          and (d.admission_date = d2.pd or d.admission_date = d2.nd)
      )
order by id, admission_date desc;
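The same lag/lead pattern runs unchanged on SQLite's window functions, so here is a self-contained sketch via Python (the dates are converted to ISO format so plain text comparison matches chronological order):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, identifier INT, admission_date TEXT, release_date TEXT);
INSERT INTO t VALUES
  (234, 2, '2022-05-01', '2022-05-05'), (234, 1, '2022-04-25', '2022-04-30'),
  (234, 2, '2022-04-20', '2022-04-24'), (234, 2, '2022-04-15', '2022-04-18'),
  (789, 1, '2022-07-15', '2022-07-19'), (789, 2, '2022-07-08', '2022-07-14'),
  (789, 2, '2022-07-01', '2022-07-05'),
  (321, 2, '2021-06-01', '2021-06-03'), (321, 2, '2021-05-27', '2021-05-31'),
  (321, 1, '2021-05-20', '2021-05-26'), (321, 2, '2021-05-15', '2021-05-19'),
  (321, 2, '2021-05-06', '2021-05-10');
""")

# On each identifier=1 row, pd/nd record the neighbouring admission dates;
# the outer query keeps identifier=1 rows plus any row matching a neighbour date.
rows = conn.execute("""
WITH d AS (
  SELECT *,
         CASE WHEN identifier = 1 THEN lag(admission_date)
              OVER (PARTITION BY id ORDER BY admission_date DESC) END AS pd,
         CASE WHEN identifier = 1 THEN lead(admission_date)
              OVER (PARTITION BY id ORDER BY admission_date DESC) END AS nd
  FROM t
)
SELECT id, identifier, admission_date, release_date
FROM d
WHERE identifier = 1
   OR EXISTS (SELECT 1 FROM d d2
              WHERE d2.id = d.id
                AND (d.admission_date = d2.pd OR d.admission_date = d2.nd))
ORDER BY id, admission_date DESC
""").fetchall()
```

Each id keeps exactly its identifier=1 row plus the rows directly above and below it, matching the expected output (ordered by numeric id here, so 321 sorts before 789).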
One way:
SELECT (x.my_row).*  -- decompose fields from row type
FROM (
   SELECT identifier
        , lag(t)  OVER w AS t0  -- take whole row
        , t       AS t1
        , lead(t) OVER w AS t2
   FROM   tbl t
   WINDOW w AS (PARTITION BY id ORDER BY admission_date)
   ) sub
CROSS JOIN LATERAL (
   VALUES (t0), (t1), (t2)  -- pivot
   ) x(my_row)
WHERE sub.identifier = 1
AND   (x.my_row).id IS NOT NULL;  -- exclude rows with NULL ( = missing row)
db<>fiddle here
The query is designed to only make a single pass over the table.
Uses some advanced SQL / Postgres features.
About LATERAL:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
About the VALUES expression:
Postgres: convert single row to multiple rows (unpivot)
The manual about extracting fields from a composite type.
If there are many rows per id, other solutions will be (much) faster - with proper index support. You did not specify ...

Getting aggregate data in MySql

I am attempting to write a sql query to fetch aggregate data from a table. I have a table with data that looks as follows (example data):
trackingId | numberOfRecords | totalRecords | dateSubmitted | fileName    | checkpoint  | status
--------------------------------------------------------------------------------------------------
1          | 10              | 100          | 01/01/2021    | example.doc | gateway     | in-progress
1          | 20              | 100          | 02/01/2021    | null        | checkpoint1 | in-progress
1          | 20              | 100          | 03/01/2021    | null        | checkpoint2 | in-progress
The aggregate data I would like to query would look like:
trackingId | numberOfRecords | totalRecords | dateSubmitted | fileName    | checkpoint  | status
--------------------------------------------------------------------------------------------------
1          | 50              | 100          | 03/01/2021    | example.doc | checkpoint2 | in-progress
In summary, I would like to:
group on trackingId (done)
Sum of all records fetched (done)
get the latest date (done)
name of original document (not sure how to fetch a value from the first row only; I am trying to avoid subqueries due to inefficiency)
latest checkpoint (get value from the newest record)
latest status (get value from the newest record)
My issue mainly is fetching specific data from either the newest or oldest record.
Thanks.
Consider below
select trackingId,
       sum(numberOfRecords) as numberOfRecords,
       any_value(totalRecords) as totalRecords,
       max(dateSubmitted) as dateSubmitted,
       array_agg(fileName order by dateSubmitted limit 1)[offset(0)] as fileName,
       array_agg(checkpoint order by dateSubmitted desc limit 1)[offset(0)] as checkpoint,
       array_agg(status order by dateSubmitted desc limit 1)[offset(0)] as status
from `project.dataset.table`
group by trackingId
If applied to the sample data in your question, this produces the expected output. Note that the array_agg(... limit 1)[offset(0)] construct is BigQuery syntax.
Try this:
CREATE TABLE test (
  id   INT, -- used for ordering
  cat  INT, -- used for aggregation
  col1 INT, -- used to get SUM
  col2 INT, -- used to get the value from the first row
  col3 INT  -- used to get the value from the last row
);
INSERT INTO test VALUES
  (1,1,11,111,1111), (2,1,22,222,2222), (3,1,33,333,3333),
  (4,2,4,4,4), (5,2,5,5,5);
SELECT * FROM test;
SELECT * FROM test;
id | cat | col1 | col2 | col3
------------------------------
1  | 1   | 11   | 111  | 1111
2  | 1   | 22   | 222  | 2222
3  | 1   | 33   | 333  | 3333
4  | 2   | 4    | 4    | 4
5  | 2   | 5    | 5    | 5
SELECT cat,
       SUM(col1) AS col1_sum,
       SUBSTRING_INDEX(GROUP_CONCAT(col2 ORDER BY id), ',', 1) AS col2_first,
       SUBSTRING_INDEX(GROUP_CONCAT(col3 ORDER BY id), ',', -1) AS col3_last
FROM test
GROUP BY cat;
cat | col1_sum | col2_first | col3_last
----------------------------------------
1   | 66       | 111        | 3333
2   | 9        | 4          | 5
db<>fiddle here
The values processed by GROUP_CONCAT() must not contain the separator (a comma here).
PS. Do not forget about group_concat_max_len, especially when single value in the column may be long.
PPS. The expression for last value may be SUBSTRING_INDEX(GROUP_CONCAT(col3 ORDER BY id DESC), ',', 1) col3_last.
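SUBSTRING_INDEX is MySQL-specific, but its string logic is easy to sketch in plain Python. The helper below is hypothetical (not a library function) and mirrors what the query does to the GROUP_CONCAT result:

```python
def substring_index(s, delim, count):
    """Mimic MySQL SUBSTRING_INDEX: everything before the count-th
    occurrence of delim, counting from the left when count > 0
    and from the right when count < 0."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    return delim.join(parts[count:])

# GROUP_CONCAT(col2 ORDER BY id) for cat = 1 yields '111,222,333';
# taking the first element gives the value from the first row.
first = substring_index("111,222,333", ",", 1)     # '111'
# Likewise, -1 on '1111,2222,3333' gives the last row's value.
last = substring_index("1111,2222,3333", ",", -1)  # '3333'
```

This also makes the comma caveat above concrete: if a value itself contains the separator, the split lands in the wrong place.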

SQL compares the value of 2 columns and select the column with max value row-by-row

I have table something like:
GROUP | NAME | Value_1 | Value_2
---------------------------------
1     | ABC  | 0       | 0
1     | DEF  | 4       | 4
50    | XYZ  | 6       | 6
50    | QWE  | 6       | 7
100   | XYZ  | 26      | 2
100   | QWE  | 26      | 2
What I would like to do is group by GROUP and select the name with the highest Value_1. If the Value_1 values are tied, compare Value_2 and select the max. If they're still tied, select the first one.
The output will be something like:
GROUP | NAME | Value_1 | Value_2
---------------------------------
1     | DEF  | 4       | 4
50    | QWE  | 6       | 7
100   | XYZ  | 26      | 2
The challenge for me here is that I don't know how many categories are in NAME, so a simple CASE WHEN does not work. Thanks for the help.
You can use window functions to solve the bulk of your problem:
select t.*
from (select t.*,
             row_number() over (partition by "group"
                                order by value_1 desc, value_2 desc) as seqnum
      from t
     ) t
where seqnum = 1;
(GROUP is a reserved word, so it is quoted here.)
The one caveat is the condition:
If they're still the same, select the first one.
SQL tables represent unordered (multi-) sets. There is no "first" one unless a column specifies the ordering. The best you can do is choose an arbitrary value when all the other values are the same.
That said, you might have another column that has an ordering. If so, add that as a third key to the order by.
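Here is that query as a runnable sketch, using SQLite via Python as a stand-in (the question doesn't name a database). The quoted "group" column and the value_1/value_2 names follow the sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t ("group" INT, name TEXT, value_1 INT, value_2 INT);
INSERT INTO t VALUES
  (1,   'ABC', 0,  0), (1,   'DEF', 4,  4),
  (50,  'XYZ', 6,  6), (50,  'QWE', 6,  7),
  (100, 'XYZ', 26, 2), (100, 'QWE', 26, 2);
""")

# Rank rows within each group by value_1, then value_2, and keep the top row.
rows = conn.execute("""
SELECT "group", name, value_1, value_2 FROM (
  SELECT *,
         row_number() OVER (PARTITION BY "group"
                            ORDER BY value_1 DESC, value_2 DESC) AS seqnum
  FROM t
) AS sub
WHERE seqnum = 1
ORDER BY "group"
""").fetchall()
```

For group 100 both rows are completely tied, so which name comes back is arbitrary, exactly as the caveat above describes.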

Cumulative count of duplicates

For a table looking like
ID | Value
-------------
1 | 2
2 | 10
3 | 3
4 | 2
5 | 0
6 | 3
7 | 3
I would like to calculate the number of IDs with a higher Value, for each Value that appears in the table, i.e.
Value | Position
----------------
10 | 0
3 | 1
2 | 4
0 | 6
This equates to the offset of the Value in an ORDER BY Value DESC ordering.
I have considered doing this by calculating the number of duplicates with something like
SELECT Value, count(*) AS ct FROM table GROUP BY Value;
and then accumulating the result, but I guess that is not the optimal way to do it (nor have I managed to combine the commands accordingly).
How would one go about calculating this efficiently (for several dozens of thousands of rows)?
This seems like a perfect opportunity for the window function rank() (not the related dense_rank()):
SELECT DISTINCT ON (value)
value, rank() OVER (ORDER BY value DESC) - 1 AS position
FROM tbl
ORDER BY value DESC;
rank() starts with 1, while your count starts with 0, so subtract 1.
Adding a DISTINCT step (DISTINCT ON is slightly cheaper here) removes duplicate rows after the ranks have been computed: DISTINCT is applied after window functions. Details in this related answer:
Best way to get result count before LIMIT was applied
Result exactly as requested.
An index on value will help performance.
SQL Fiddle.
You might also try this if you're not comfortable with window functions:
SELECT t1.value, COUNT(DISTINCT t2.id) AS position
FROM tbl t1
LEFT OUTER JOIN tbl t2 ON t1.value < t2.value
GROUP BY t1.value;
Note the self-join.
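A quick check of the rank() approach, run on SQLite via Python (SQLite has no DISTINCT ON, but plain DISTINCT suffices here since every duplicate value carries the same rank):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (id INT, value INT);
INSERT INTO tbl VALUES (1,2),(2,10),(3,3),(4,2),(5,0),(6,3),(7,3);
""")

# rank() counts all rows with a strictly higher value, plus 1;
# subtracting 1 gives the requested 0-based position.
rows = conn.execute("""
SELECT DISTINCT value, rank() OVER (ORDER BY value DESC) - 1 AS position
FROM tbl
ORDER BY value DESC
""").fetchall()
print(rows)  # [(10, 0), (3, 1), (2, 4), (0, 6)]
```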

How to distinguish between the first and rest for duplicate records using sql?

These are the input table and required output table.
Input table
ID Name
-------------
1 aaa
1 ababaa
2 bbbbbb
2 bcbcbccbc
2 bcdbcdbbbbb
3 ccccc
Output table
ID Name Ord
-----------------------------
1 aaa first
1 ababaa rest
2 bbbbbb first
2 bcbcbccbc rest
2 bcdbcdbbbbb rest
3 ccccc first
First and Rest are based on the order of occurrence within an ID.
Is there a way to write a SQL query to achieve this?
P.S. - This question is somewhat similar to what I am looking for.
select id, name,
       case rnk when 1 then 'first' else 'rest' end as ord
from (
       select *, rank() over (partition by id order by id, name) as rnk
       from input
     ) X
You can also try this
SELECT id, name,
       DECODE(ROW_NUMBER() OVER (PARTITION BY id ORDER BY id, name), 1, 'First', 'Rest') AS Ord
FROM Input_table;
You can use this query, as it is simpler and performs well. (DECODE is Oracle-specific; on other databases, use a CASE expression instead.)
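The same first/rest labelling can be checked on SQLite via Python; since SQLite has no DECODE, this sketch uses the equivalent CASE expression:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE input (id INT, name TEXT);
INSERT INTO input VALUES
  (1,'aaa'), (1,'ababaa'),
  (2,'bbbbbb'), (2,'bcbcbccbc'), (2,'bcdbcdbbbbb'),
  (3,'ccccc');
""")

# The first row of each id (ordered by name) is 'first'; all others are 'rest'.
rows = conn.execute("""
SELECT id, name,
       CASE row_number() OVER (PARTITION BY id ORDER BY name)
            WHEN 1 THEN 'first' ELSE 'rest' END AS ord
FROM input
ORDER BY id, name
""").fetchall()
```

This reproduces the output table from the question exactly.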