How do I merge and delete duplicated rows in SQL using UPDATE? - sql

For example, I have a table of:
id | code | name | type | deviceType
---+------+------+------+-----------
1 | 23 | xyz | 0 | web
2 | 23 | xyz | 0 | mobile
3 | 24 | xyzc | 0 | web
4 | 25 | xyzc | 0 | web
I want the result to be:
id | code | name | type | deviceType
---+------+------+------+-----------
1 | 23 | xyz | 0 | web&mobile
2 | 24 | xyzc | 0 | web
3 | 25 | xyzc | 0 | web
How do I do this in SQL Server using UPDATE and DELETE statements?
Any help is greatly appreciated!

I might actually suggest just leaving the original data intact, and instead creating a view here:
CREATE VIEW yourView AS
SELECT ROW_NUMBER() OVER (ORDER BY MIN(id)) AS id,
code, name, type,
STRING_AGG(deviceType, '&') WITHIN GROUP (ORDER BY id) AS deviceType
FROM yourTable
GROUP BY code, name, type;
Demo
One main reason for not actually doing the update is that every time new data comes in, you might possibly have to run that update, over and over. Instead, just keeping the original data and running the view occasionally might perform better here.
Note that I assume that you are using SQL Server 2017 or later. If not, then STRING_AGG would have to be replaced with an uglier approach, but you should consider upgrading in this case.

To do what you want, you would need two separate statements.
This updates the "first" row of each group with all the device types in the group:
update t
set t.devicetype = t1.devicetype
from mytable t
inner join (
select min(id) as id, string_agg(devicetype, '&') within group(order by id) as devicetype
from mytable
group by code, name, type
having count(*) > 1
) t1 on t1.id = t.id
This deletes everything but the first row per group:
with t as (
select row_number() over(partition by code, name, type order by id) rn
from mytable
)
delete from t where rn > 1
Demo on DB Fiddle

Related

ORACLE SELECT DISTINCT VALUE ONLY IN SOME COLUMNS

+----+------+-------+---------+---------+
| id | order| value | type | account |
+----+------+-------+---------+---------+
| 1 | 1 | a | 2 | 1 |
| 1 | 2 | b | 1 | 1 |
| 1 | 3 | c | 4 | 1 |
| 1 | 4 | d | 2 | 1 |
| 1 | 5 | e | 1 | 1 |
| 1 | 5 | f | 6 | 1 |
| 2 | 6 | g | 1 | 1 |
+----+------+-------+---------+---------+
I need get a select of all fields of this table but only getting 1 row for each combination of id+type (I don't care the value of the type). But I tried some approach without result.
At the moment that I make an DISTINCT I cant include rest of the fields to make it available in a subquery. If I add ROWNUM in the subquery all rows will be different making this not working.
Some ideas?
My better query at the moment is this:
SELECT ID, TYPE, VALUE, ACCOUNT
FROM MYTABLE
WHERE ROWID IN (SELECT DISTINCT MAX(ROWID)
FROM MYTABLE
GROUP BY ID, TYPE);
It seems you need to select one (random) row for each distinct combination of id and type. If so, you could do that efficiently using the row_number analytic function. Something like this:
select id, type, value, account
from (
select id, type, value, account,
row_number() over (partition by id, type order by null) as rn
from your_table
)
where rn = 1
;
order by null means random ordering of rows within each group (partition) by (id, type); this means that the ordering step, which is usually time-consuming, will be trivial in this case. Also, Oracle optimizes such queries (for the filter rn = 1).
Or, in versions 12.1 and higher, you can get the same with the match_recognize clause:
select id, type, value, account
from my_table
match_recognize (
partition by id, type
all rows per match
pattern (^r)
define r as null is null
);
This partitions the rows by id and type, it doesn't order them (which means random ordering), and selects just the "first" row from each partition. Note that some analytic functions, including row_number(), require an order by clause (even when we don't care about the ordering) - order by null is customary, but it can't be left out completely. By contrast, in match_recognize you can leave out the order by clause (the default is "random order"). On the other hand, you can't leave out the define clause, even if it imposes no conditions whatsoever. Why Oracle doesn't use a default for that clause too, only Oracle knows.

Self join to create a new column with updated records

I am trying to write a SQL query to get the start date for employees in a store. As seen in the first screenshot, employee number 5041 had the number A0EH but as the number got updated, it updated the start date for the employee as well. This effects the metric of total duration in the store.
I am trying to get to the output below but haven't been able to figure out how to get this view.
This is the code I was trying but I am not getting the correct output.
select
esd.employee_number,
(case when esd.old_employee_number is null then es.employee_number else es.old_employee_number end) as old_employee_number,
esd.entity_id,
esd.original_start_date
from earliest_start_date as esd
left join earliest_start_date as es
on (es.employee_number = esd.old_employee_number)
How do I solve this on SQL?
Redshift reportedly supports recursion via WITH clause. Here's an example:
MariaDB 10.5 has similar support. Test case is here:
Fully working test case (via MariaDB 10.5) (Updated)
Link to Amazon Redshift detail for WITH clause and window functions:
Amazon Redshift - WITH clause
Amazon redshift - Window functions
WITH RECURSIVE cte (employee_number, original_no, entity_id, original_start_date, n) AS (
SELECT employee_number, employee_number, entity_id, original_start_date, 1 FROM earliest_start_date WHERE old_employee_number IS NULL UNION ALL
SELECT new_tbl.employee_number, cte.original_no, cte.entity_id, cte.original_start_date, n+1
FROM earliest_start_date new_tbl
JOIN cte
ON cte.employee_number = new_tbl.old_employee_number
)
, xrows AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY n DESC) AS rn
FROM cte
)
SELECT * FROM xrows WHERE rn = 1
;
Result:
+-----------------+-------------+-----------+---------------------+------+----+
| employee_number | original_no | entity_id | original_start_date | n | rn |
+-----------------+-------------+-----------+---------------------+------+----+
| XXXX | XXXX | 88 | 2021-09-02 | 1 | 1 |
| 5041 | A0EH | 96 | 2021-09-05 | 2 | 1 |
+-----------------+-------------+-----------+---------------------+------+----+
2 rows in set
Raw test data:
SELECT * FROM earliest_start_date;
+-----------------+---------------------+-----------+---------------------+
| employee_number | old_employee_number | entity_id | original_start_date |
+-----------------+---------------------+-----------+---------------------+
| 5041 | A0EH | 96 | 2021-09-10 |
| A0EH | NULL | 96 | 2021-09-05 |
| XXXX | NULL | 88 | 2021-09-02 |
+-----------------+---------------------+-----------+---------------------+
Note that the logic makes assumption about uniqueness of the employee_number and, in the current form, can't handle cases where the employee_number is reused by the same employee or used again with a different employee without adjusting prior data. There may not be enough detail in the current structure to handle those cases.

CTE to represent a logical table for the rows in a table which have the max value in one column

I have an "insert only" database, wherein records aren't physically updated, but rather logically updated by adding a new record, with a CRUD value, carrying a larger sequence. In this case, the "seq" (sequence) column is more in line with what you may consider a primary key, but the "id" is the logical identifier for the record. In the example below,
This is the physical representation of the table:
seq id name | CRUD |
----|-----|--------|------|
1 | 10 | john | C |
2 | 10 | joe | U |
3 | 11 | kent | C |
4 | 12 | katie | C |
5 | 12 | sue | U |
6 | 13 | jill | C |
7 | 14 | bill | C |
This is the logical representation of the table, considering the "most recent" records:
seq id name | CRUD |
----|-----|--------|------|
2 | 10 | joe | U |
3 | 11 | kent | C |
5 | 12 | sue | U |
6 | 13 | jill | C |
7 | 14 | bill | C |
In order to, for instance, retrieve the most recent record for the person with id=12, I would currently do something like this:
SELECT
*
FROM
PEOPLE P
WHERE
P.ID = 12
AND
P.SEQ = (
SELECT
MAX(P1.SEQ)
FROM
PEOPLE P1
WHERE P.ID = 12
)
...and I would receive this row:
seq id name | CRUD |
----|-----|--------|------|
5 | 12 | sue | U |
What I'd rather do is something like this:
WITH
NEW_P
AS
(
--CTE representing all of the most recent records
--i.e. for any given id, the most recent sequence
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
The first SQL example using the the subquery already works for us.
Question: How can I leverage a CTE to simplify our predicates when needing to leverage the "most recent" logical view of the table. In essence, I don't want to inline a subquery every single time I want to get at the most recent record. I'd rather define a CTE and leverage that in any subsequent predicate.
P.S. While I'm currently using DB2, I'm looking for a solution that is database agnostic.
This is a clear case for window (or OLAP) functions, which are supported by all modern SQL databases. For example:
WITH
ORD_P
AS
(
SELECT p.*, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY seq DESC) rn
FROM people p
)
,
NEW_P
AS
(
SELECT * from ORD_P
WHERE rn = 1
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
PS. Not tested. You may need to explicitly list all columns in the CTE clauses.
I guess you already put it together. First find the max seq associated with each id, then use that to join back to the main table:
WITH newp AS (
SELECT id, MAX(seq) AS latestseq
FROM people
GROUP BY id
)
SELECT p.*
FROM people p
JOIN newp n ON (n.latestseq = p.seq)
ORDER BY p.id
What you originally had would work, or moving the CTE into the "from" clause. Maybe you want to use a timestamp field rather than a sequence number for the ordering?
Following up from #Glenn's answer, here is an updated query which meets my original goal and is on par with #mustaccio's answer, but I'm still not sure what the performance (and other) implications of this approach vs the other are.
WITH
LATEST_PERSON_SEQS AS
(
SELECT
ID,
MAX(SEQ) AS LATEST_SEQ
FROM
PERSON
GROUP BY
ID
)
,
LATEST_PERSON AS
(
SELECT
P.*
FROM
PERSON P
JOIN
LATEST_PERSON_SEQS L
ON
(
L.LATEST_SEQ = P.SEQ)
)
SELECT
*
FROM
LATEST_PERSON L2
WHERE
L2.ID = 12

select rows satisfying some criteria and with maximum value in a certain column

I have a table of metadata for updates to a software package. The table has columns id, name, version. I want to select all rows where the name is one of some given list of names and the version is maximum of all the rows with that name.
For example, given these records:
+----+------+---------+
| id | name | version |
+----+------+---------+
| 1 | foo | 1 |
| 2 | foo | 2 |
| 3 | bar | 4 |
| 4 | bar | 5 |
+----+------+---------+
And a task "give me the highest versions of records "foo" and "bar", I want the result to be:
+----+------+---------+
| id | name | version |
+----+------+---------+
| 2 | foo | 2 |
| 4 | bar | 5 |
+----+------+---------+
What I come up with so far, is using nested queries:
SELECT *
FROM updates
WHERE (
id IN (SELECT id
FROM updates
WHERE name = 'foo'
ORDER BY version DESC
LIMIT 1)
) OR (
id IN (SELECT id
FROM updates
WHERE name = 'bar'
ORDER BY version DESC
LIMIT 1)
);
This works, but feels wrong. If I want to filter on more names, I have to replicate the whole subquery multiple times. Is there a better way to do this?
select distinct on (name) id, name, version
from metadata
where name in ('foo', 'bar')
order by name, version desc
NOT EXISTS is a way to avoid unwanted sub optimal tuples:
SELECT *
FROM updates uu
WHERE uu.zname IN ('foo', 'bar')
AND NOT EXISTS (
SELECT *
FROM updates nx
WHERE nx.zname = uu.zanme
AND nx.version > uu.version
);
Note: I replaced name by zname, since it is more or less a keyword in postgresql.
Update after rereading the Q:
I want to select all rows where the name is one of some given list
of names and the version is maximum of all the rows with that name.
If there can be ties (multiple rows with the maximum version per name), you could use the window function rank() in a subquery. Requires PostgreSQL 8.4+.
SELECT *
FROM (
SELECT *, rank() OVER (PARTITION BY name ORDER BY version DESC) AS rnk
FROM updates
WHERE name IN ('foo', 'bar')
)
WHERE rnk = 1;

Row Rank in a MySQL View

I need to create a view that automatically adds virtual row number in the result. the graph here is totally random all that I want to achieve is the last column to be created dynamically.
> +--------+------------+-----+
> | id | variety | num |
> +--------+------------+-----+
> | 234 | fuji | 1 |
> | 4356 | gala | 2 |
> | 343245 | limbertwig | 3 |
> | 224 | bing | 4 |
> | 4545 | chelan | 5 |
> | 3455 | navel | 6 |
> | 4534345| valencia | 7 |
> | 3451 | bartlett | 8 |
> | 3452 | bradford | 9 |
> +--------+------------+-----+
Query:
SELECT id,
variety,
SOMEFUNCTIONTHATWOULDGENERATETHIS() AS num
FROM mytable
Use:
SELECT t.id,
t.variety,
(SELECT COUNT(*) FROM TABLE WHERE id < t.id) +1 AS NUM
FROM TABLE t
It's not an ideal manner of doing this, because the query for the num value will execute for every row returned. A better idea would be to create a NUMBERS table, with a single column containing a number starting at one that increments to an outrageously large number, and then join & reference the NUMBERS table in a manner similar to the variable example that follows.
MySQL Ranking, or Lack Thereof
You can define a variable in order to get psuedo row number functionality, because MySQL doesn't have any ranking functions:
SELECT t.id,
t.variety,
#rownum := #rownum + 1 AS num
FROM TABLE t,
(SELECT #rownum := 0) r
The SELECT #rownum := 0 defines the variable, and sets it to zero.
The r is a subquery/table alias, because you'll get an error in MySQL if you don't define an alias for a subquery, even if you don't use it.
Can't Use A Variable in a MySQL View
If you do, you'll get the 1351 error, because you can't use a variable in a view due to design. The bug/feature behavior is documented here.
Oracle has a rowid pseudo-column. In MySQL, you might have to go ugly:
SELECT id,
variety,
1 + (SELECT COUNT(*) FROM tbl WHERE t.id < id) as num
FROM tbl
This query is off the top of my head and untested, so take it with a grain of salt. Also, it assumes that you want to number the rows according to some sort criteria (id in this case), rather than the arbitrary numbering shown in the question.