Remove duplicate rows based on multiple criteria using SQL queries

I have a table with columns Machine, Product and Sources:
Machine   Product   Sources
---------------------------
M3        H         cmdd6
M3        H         91
M3        H         cmdd3
M4        I         cmdd7
M4        J         cmdd7
M4        B         827
M4        B         cmdd7
In the above table, where Machine is M3 the Product is the same but the Sources column has multiple entries. The requirement is to remove the duplicate rows so that Sources always keeps a 'cmdd' value, taking the lowest one in ascending order.
For example, if there are duplicate rows whose Product is the same but whose Sources differ, i.e. 'cmdd6' or 'cmdd3', then the duplicate rows should be removed and the row with Sources 'cmdd3' should remain.
Below is the result table I would like to achieve:
Machine   Product   Sources
---------------------------
M3        H         cmdd3
M4        I         cmdd7
M4        J         cmdd7
M4        B         cmdd7
Below is the query I tried, which removes the duplicates whose duplicate count is greater than 1.
WITH CTE(Machine, Product,Sources, duplicatecount) AS
(
SELECT
Machine, Product, Sources,
ROW_NUMBER() OVER (PARTITION BY Machine, Product
ORDER BY Machine, Sources) AS DuplicateCount
FROM
Concatcleanup
)
DELETE FROM cte
WHERE duplicatecount > 1
Any help is highly appreciated.

You can add one extra crafted expression inside the ORDER BY clause of the ROW_NUMBER window function, to pull 'cmdd%'-like values above all the others.
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER(
               PARTITION BY Machine, Product
               ORDER BY CASE WHEN Sources NOT LIKE 'cmdd%' THEN 1 END,
                        Sources
           ) AS DuplicateCount
    FROM Concatcleanup
)
DELETE FROM cte
WHERE DuplicateCount > 1;
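If you want to preview which rows the DELETE would remove before running it, you can keep the CTE exactly as above and replace the final DELETE with a SELECT (a sketch only, same assumptions as above):
-- every row returned here is a row the DELETE above would remove
SELECT *
FROM cte
WHERE DuplicateCount > 1;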
If you want to avoid the deletion, you can use SELECT ... INTO <new_table> FROM ... together with the same query used for the CTE:
SELECT Machine, Product, Sources,
       ROW_NUMBER() OVER(
           PARTITION BY Machine, Product
           ORDER BY CASE WHEN Sources NOT LIKE 'cmdd%' THEN 1 END,
                    Sources
       ) AS DuplicateCount
INTO newtab
FROM Concatcleanup;
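If you would rather end up with a clean copy that contains only the surviving rows, a variation (a sketch only; dedupedtab is a hypothetical table name) filters on the row number while copying:
-- Sketch: copy only the first row per (Machine, Product) group into a new table.
-- 'dedupedtab' is a hypothetical name, adjust it to your naming conventions.
SELECT Machine, Product, Sources
INTO dedupedtab
FROM (
    SELECT Machine, Product, Sources,
           ROW_NUMBER() OVER(
               PARTITION BY Machine, Product
               ORDER BY CASE WHEN Sources NOT LIKE 'cmdd%' THEN 1 END,
                        Sources
           ) AS DuplicateCount
    FROM Concatcleanup
) ranked
WHERE DuplicateCount = 1;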

Related

SQL/BigQuery - List of products sold together

How are you doing?
I have a sales table with DATE, TICKET_ID (transaction id) and PRODUCT_ID (product sold). I'd like to have a list of the items sold together PER DAY (that is, today product X was sold with product Y 10 times, yesterday product X was sold with product Y 5 times...)
I have this code, however it has two problems:
1- It generates inverted duplicates. Example:
product_id   product_id_bought_with   counting
12345        98765                    130
98765        12345                    130
abcde        fghij                    88
fghij        abcde                    88
2- This code ran fine WITHOUT THE DATE COLUMN. After I added the date, the data volume is much larger and I get a limit error:
"Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 152% of limit. Top memory consumer(s): ORDER BY operations: 99% other/unattributed: 1%"
My code:
SELECT
c.DATE,
c.product_id,
c.product_id_bought_with,
count(*) counting
FROM ( SELECT a.DATE, a.product_id, b.product_id as product_id_bought_with
FROM `MY-TABLE` a
INNER JOIN `THE-SAME-TABLE` b
ON a.ID_TICKETS = b.ID_TICKETS
AND a.product_id != b.product_id
AND a.DATE = b.DATE
) c
GROUP BY DATE, product_id, product_id_bought_with
ORDER BY counting DESC
I'm open to new ideas on how to do this. Thanks in advance!
Edit: sample data
CREATE TABLE `project_id.dataset.table_name` (
DAT_VTE DATE,
ID_TICKET STRING,
product_id int
);
INSERT INTO `project_id.dataset.table_name` (DAT_VTE, ID_TICKET, product_id)
VALUES
(DATE('2022-01-01'), '123_abc', 876123),
(DATE('2022-01-01'), '123_abc', 324324),
(DATE('2022-01-02'), '456_def', 876123),
(DATE('2022-01-02'), '456_def', 324324),
(DATE('2022-01-02'), '456_def', 432321),
(DATE('2022-05-23'), '987_xyz', 876123),
(DATE('2022-05-23'), '987_xyz', 324324)
For your requirement, you can try the below query:
with mytable as (
    select *,
           row_number() over (partition by DAT_VTE, ID_TICKET) as rownum
    from `project_id.dataset.MY-TABLE`
)
select DAT_VTE,
       product_id,
       product_id_bought_with,
       count(*) as counting
from (
    select a.DAT_VTE, a.ID_TICKET, a.product_id as product_id, b.product_id as product_id_bought_with
    from mytable a
    join mytable b
      on a.ID_TICKET = b.ID_TICKET
     and a.DAT_VTE = b.DAT_VTE
     and a.rownum < b.rownum
)
GROUP BY DAT_VTE, product_id, product_id_bought_with
According to the error you provided, "resources exceeded" errors are usually triggered when an operation needs to gather all the data on a single computation unit; if it doesn't fit, the job fails. Ordering a huge amount of data takes heavy computation resources that can be used far more efficiently when partitions are available.
Below are ways to resolve your issue:
1. Partitioning usually helps with resource issues, as described in the BigQuery partitioned-tables documentation; a sketch of a date-partitioned copy follows this list.
2. You can also split your query: write the results of every individual sub/inner query to another table as temporary storage for further processing.
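For the partitioning option, a minimal sketch (assuming the sample schema above; table_name_partitioned is a hypothetical name) that creates a date-partitioned copy for the aggregation query to read:
-- Sketch only: create a copy of the table partitioned on the DAT_VTE date column.
-- 'table_name_partitioned' is a hypothetical name.
CREATE TABLE `project_id.dataset.table_name_partitioned`
PARTITION BY DAT_VTE
AS
SELECT DAT_VTE, ID_TICKET, product_id
FROM `project_id.dataset.table_name`;
Filtering the query on DAT_VTE (for example with a date range in the WHERE clause) then lets BigQuery prune to the relevant partitions instead of scanning and ordering the whole table.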

Oracle SQL Merge Statement with Conditions

I"m relatively new to SQL, and I'm having an issue where the target table is not being updated.
I have duplicate account # (key) with different contact information in the associated columns. I’m attempting to consolidate the contact information (source) into a single row / account number with the non duplicate contact information going into (target) extended columns.
I constructed a Merge statement with a case condition to check if the data exists in the target table. If the data is not in the target table then add the information in the extended columns. The issue is that the target table doesn’t get updated. Both Source and Target tables are similarity defined.
Merge SQL - reduced query
MERGE INTO target tgt
USING (select accountno, cell, site, contact, email1
       from (select w.accountno, w.cell, w.site, w.contact, email1,
                    row_number() over (PARTITION BY w.accountno order by accountno desc) acct
             from source w) inn
       where inn.acct = 1) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED
THEN
UPDATE SET
tgt.phone4 =
CASE WHEN src.cell <> tgt.cell
THEN src.cell
END,
tgt.phone5 =
CASE WHEN src.site <> tgt.site
THEN src.site
END
I have validated that there is contact information in the source table for an accountno that should be added to the target table. I greatly appreciate any insight as to why the target table is not being updated.
I saw a similar question on Stack, but it didn't have a response.
Your SRC subquery in the USING clause returns just one arbitrary row for each accountno.
You need to aggregate them, for example using PIVOT:
with source(accountno, cell, site, contact) as ( --test data:
select 1,8881234567,8881235678,8881236789 from dual union all
select 1,8881234567,8881235678,8881236789 from dual
)
select accountno, contact,
r1_cell, r1_site,
r2_cell, r2_site
from (select s.*,row_number()over(partition by accountno order by cell) rn
from source s
)
pivot (
max(cell) cell,max(site) site
FOR rn
IN (1 R1,2 R2)
)
So finally you can compare r1_cell, r1_site, r2_cell, r2_site with the target values and use the ones you need:
MERGE INTO target tgt
USING (
select accountno, contact,
r1_cell, r1_site,
r2_cell, r2_site
from (select s.*,row_number()over(partition by accountno order by cell) rn
from source s
)
pivot (
max(cell) cell,max(site) site
FOR rn
IN (1 R1,2 R2)
)
) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED
THEN
UPDATE SET
tgt.phone4 =
CASE
WHEN src.r1_cell <> tgt.cell
THEN src.r1_cell
ELSE src.r2_cell
END,
tgt.phone5 =
CASE WHEN src.r1_site <> tgt.site
THEN src.r1_site
ELSE src.r2_site
END
/
The issue is with the logic you used to number the rows that share the same account number.
MERGE
INTO target tgt
USING (select accountno, cell, site, contact, email1
from (select w.accountno, w.cell, w.site, w.contact, email1
, row_number() over (PARTITION BY w.accountno order by w.accountno desc) acct
from source w
left join target w2
on w.accountno=w2.accountno
where w2.cell is null /* get records which are not in target*/
) inn
where inn.acct =1
) src
ON (tgt.accountno = src.accountno)
WHEN MATCHED THEN
UPDATE
SET tgt.phone4 = src.cell,
tgt.phone5 = src.site

Pivot multiple rows / columns into 1 row

We need to take multiple rows and multiple columns and transpose them into 1 row per key. I have a pivot query, but it is not working. I get an error about "Column ambiguously defined".
Our data looks like this:
SECTOR TICKER COMPANY
-----------------------------------------------------
5 ADNT Adient PLC
5 AUTO Autobytel Inc.
5 THRM Gentherm Inc
5 ALSN Allison Transmission Holdings, Inc.
5 ALV Autoliv, Inc.
12 HES Hess Corporation
12 AM Antero Midstrm
12 PHX Panhandle Royalty Company
12 NBR Nabors Industries Ltd.
12 AMRC Ameresco, Inc.
What we need is 1 row per ID, with each TICKER / COMPANY in a different column. So, output would look like:
5 ADNT Adient PLC AUTO Autobytel Inc. THRM Gentherm Inc........
You get the idea. 1 row per ID, and each other value in its own column. The query I tried is:
SELECT sector, ticker, company_name
FROM (SELECT d.sector, d.ticker, v.company_name, ROW_NUMBER() OVER(PARTITION BY d.sector ORDER BY d.sector) rn
FROM template13_ticker_data d, template13_vw v
WHERE d.m_ticker = v.m_ticker)
PIVOT (MAX(sector) AS sector, MAX(ticker) AS ticker, MAX(company_name) AS company_name
FOR (rn) IN (1 AS sector, 2 AS ticker, 3 AS company_name))
ORDER BY sector;
First thing to understand about pivots: you pick a single column in a result set to act as the PIVOT anchor, the hinge that the data will be pivoted around; this is specified in the FOR clause.
You can only PIVOT FOR a single column, but you can construct that column in a subquery, or from joins or views, as your target data query. The OP has used ROW_NUMBER(), but you can use any SQL mechanism you wish, even a CASE expression, to build a custom column to pivot around if there is no natural column within the dataset to use.
PIVOT will make a column for each value in the FOR column and will give that column the value of the aggregation function that you specify.
It helps to visualise the constructed record set before you apply the pivot; the following SQL recreates the data scenario that the OP has presented. I have used table variables here in place of the OP's tables and views.
-- template13_ticker_data (with sector_char added)
DECLARE @tickerData TABLE
(
    sector INT,
    ticker CHAR(4),
    m_ticker CHAR(4),
    sector_char CHAR(10)
)
-- template13_vw
DECLARE @Company TABLE
(
    m_ticker CHAR(4),
    ticker CHAR(4),
    company_name VARCHAR(100)
)
INSERT INTO @tickerData (sector, ticker)
VALUES (5 ,'ADNT')
,(5 ,'AUTO')
,(5 ,'THRM')
,(5 ,'ALSN')
,(5 ,'ALV')
,(12,'HES')
,(12,'AM')
,(12,'PHX')
,(12,'NBR')
,(12,'AMRC')
INSERT INTO @Company (ticker, company_name)
VALUES ('ADNT','Adient PLC')
,('AUTO','Autobytel Inc.')
,('THRM','Gentherm Inc')
,('ALSN','Allison Transmission Holdings, Inc.')
,('ALV ','Autoliv, Inc.')
,('HES ','Hess Corporation')
,('AM ','Antero Midstrm')
,('PHX ','Panhandle Royalty Company')
,('NBR ','Nabors Industries Ltd.')
,('AMRC','Ameresco, Inc.')
-- Just re-creating a record set that matches the given data and query structure
UPDATE @tickerData SET m_ticker = ticker
UPDATE @Company SET m_ticker = ticker
-- populate 'sector_char' to show multiple aggregates
UPDATE @tickerData SET sector_char = '|' + cast(sector as varchar) + '|'
-- Unpivoted data Proof
SELECT d.sector, d.sector_char, d.ticker, v.company_name, ROW_NUMBER() OVER(PARTITION BY d.sector ORDER BY d.sector) rn
FROM @tickerData d, @Company v
WHERE d.m_ticker = v.m_ticker
The data before the pivot looks like this:
sector sector_char ticker company_name rn
------------------------------------------------------------------------
5 |5| ADNT Adient PLC 1
5 |5| AUTO Autobytel Inc. 2
5 |5| THRM Gentherm Inc 3
5 |5| ALSN Allison Transmission Holdings, Inc. 4
5 |5| ALV Autoliv, Inc. 5
12 |12| HES Hess Corporation 1
12 |12| AM Antero Midstrm 2
12 |12| PHX Panhandle Royalty Company 3
12 |12| NBR Nabors Industries Ltd. 4
12 |12| AMRC Ameresco, Inc. 5
Now visualise a subset of the results that you are expecting. To show the limitations around multiple-column operations, I have created sector_char and included it in the final output:
sector sector_char ticker_1 company_1 ticker_2 company_2
-----------------------------------------------------------------------------
5 |5| ADNT Adient PLC AUTO Autobytel Inc.
12 |12| HES Hess Corporation AM Antero Midstrm
Because we want more than one output column from each original row (ticker and company), we have to use one of the following techniques:
1. Concatenate the values from multiple columns into a single column.
Only useful if you can easily split those columns apart before you need the individual values, or if you don't need to process the columns at all and the output is purely for visualisation. A sketch of this option follows the list.
2. Execute multiple PIVOT queries and join the results.
Necessary when the aggregation logic is different for each column, or when you are not simply transposing a row value into a column value (i.e. you are aggregating multiple rows into a single cell). In scenarios like this one, where we are just transposing the value (the result of the aggregate matches the original cell value), I regard this as a bit of a hack, but it can also involve less syntax than the alternative. I say hack because the core PIVOT logic is duplicated, which makes it harder to maintain as the query evolves.
3. Execute a single PIVOT on the unique column, then join on other tables to build out the additional columns.
This easily allows an unlimited number of additional columns in the output. The PIVOT resolves the ID of the table that holds the multiple values that we want to display in the final results.
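Option 1 is not demonstrated further below, so here is a minimal sketch of it (an assumption: it relies on SQL Server 2017+ for STRING_AGG; it reuses the OP's tables, with the proof table variables as commented alternatives):
-- Sketch of option 1: concatenate ticker and company into one delimited column per sector.
-- Assumes SQL Server 2017+ (STRING_AGG); Oracle's equivalent would be LISTAGG.
SELECT d.sector,
       STRING_AGG(d.ticker + ' ' + v.company_name, ' | ')
           WITHIN GROUP (ORDER BY d.ticker) AS tickers_and_companies
FROM template13_ticker_data d /* OPs Syntax */
-- FROM @tickerData d /* Use this with the proof table variables */
INNER JOIN template13_vw v /* OPs Syntax */
-- INNER JOIN @Company v /* Use this with the proof table variables */
ON d.m_ticker = v.m_ticker
GROUP BY d.sector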
Let's look at option 3 first, as it demonstrates a single PIVOT and how to include multiple columns for each of the PIVOT results.
In this example I have allowed for up to 8 results for each sector. It is important to note that you MUST specify all the output columns from the PIVOT; it is not dynamic.
You could use dynamic SQL to test for the maximum number of columns you need and generate the following query based on those results.
Also note that in this solution we do not need to join on the template13_vw table within the PIVOT source query; instead we join on the result. That is why the pivot returns m_ticker (which I assume to be the key) instead of the ticker that is displayed in the final result.
-- NOTE: using CTE here, you could use table variables, temporary tables or whatever else you need
;WITH TickersBySector as
(
-- You must specify the fixed number of columns in the output
SELECT sector, sector_char, [1] as [m_ticker_1],[2] as [m_ticker_2],[3] as [m_ticker_3],[4] as [m_ticker_4],[5] as [m_ticker_5],[6] as [m_ticker_6],[7] as [m_ticker_7],[8] as [m_ticker_8]
FROM (
SELECT d.sector, d.sector_char, d.m_ticker, ROW_NUMBER() OVER(PARTITION BY d.sector ORDER BY d.sector) rn
FROM template13_ticker_data d /* OPs Syntax */
-- FROM @tickerData d /* Use this with the proof table variables */
) data
PIVOT (
MAX(m_ticker)
FOR rn IN ( [1],[2],[3],[4],[5],[6],[7],[8])
) as PivotTable
)
-- To use with the proof table variables, replace 'template13_vw' with '@Company'
SELECT sector, sector_char
,c1.[ticker] as [ticker_1], c1.company_name as [company_1]
,c2.[ticker] as [ticker_2], c2.company_name as [company_2]
,c3.[ticker] as [ticker_3], c3.company_name as [company_3]
,c4.[ticker] as [ticker_4], c4.company_name as [company_4]
,c5.[ticker] as [ticker_5], c5.company_name as [company_5]
,c6.[ticker] as [ticker_6], c6.company_name as [company_6]
,c7.[ticker] as [ticker_7], c7.company_name as [company_7]
,c8.[ticker] as [ticker_8], c8.company_name as [company_8]
FROM TickersBySector
LEFT OUTER JOIN template13_vw c1 ON c1.m_ticker = TickersBySector.m_ticker_1
LEFT OUTER JOIN template13_vw c2 ON c2.m_ticker = TickersBySector.m_ticker_2
LEFT OUTER JOIN template13_vw c3 ON c3.m_ticker = TickersBySector.m_ticker_3
LEFT OUTER JOIN template13_vw c4 ON c4.m_ticker = TickersBySector.m_ticker_4
LEFT OUTER JOIN template13_vw c5 ON c5.m_ticker = TickersBySector.m_ticker_5
LEFT OUTER JOIN template13_vw c6 ON c6.m_ticker = TickersBySector.m_ticker_6
LEFT OUTER JOIN template13_vw c7 ON c7.m_ticker = TickersBySector.m_ticker_7
LEFT OUTER JOIN template13_vw c8 ON c8.m_ticker = TickersBySector.m_ticker_8
The following is the same result, produced using multiple PIVOT queries joined together.
Notice that in this scenario it is not necessary for both PIVOTs to bring back the additional common column sector_char, so use this style of syntax when the aggregate or the additional common columns differ between the result sets.
;WITH TickersBySector as
(
-- You must specify the fixed number of columns in the output
SELECT sector, sector_char, [1] as [ticker_1],[2] as [ticker_2],[3] as [ticker_3],[4] as [ticker_4],[5] as [ticker_5],[6] as [ticker_6],[7] as [ticker_7],[8] as [ticker_8]
FROM (
SELECT d.sector, d.sector_char, d.m_ticker, ROW_NUMBER() OVER(PARTITION BY d.sector ORDER BY d.sector) rn
FROM template13_ticker_data d /* OPs Syntax */
-- FROM @tickerData d /* Use this with the proof table variables */
) data
PIVOT (
MAX(m_ticker)
FOR rn IN ( [1],[2],[3],[4],[5],[6],[7],[8])
) as PivotTable
)
, CompanyBySector as
(
-- You must specify the fixed number of columns in the output
SELECT sector,[1] as [company_1],[2] as [company_2],[3] as [company_3],[4] as [company_4],[5] as [company_5],[6] as [company_6],[7] as [company_7],[8] as [company_8]
FROM (
SELECT d.sector, v.company_name, ROW_NUMBER() OVER(PARTITION BY d.sector ORDER BY d.sector) rn
FROM template13_ticker_data d /* OPs Syntax */
-- FROM @tickerData d /* Use this with the proof table variables */
INNER JOIN template13_vw v /* OPs Syntax */
-- INNER JOIN @Company v /* Use this with the proof table variables */
ON d.m_ticker = v.m_ticker
) data
PIVOT (
MAX(company_name)
FOR rn IN ( [1],[2],[3],[4],[5],[6],[7],[8])
) as PivotTable
)
SELECT TickersBySector.sector, sector_char
,[ticker_1], [company_1]
,[ticker_2], [company_2]
,[ticker_3], [company_3]
,[ticker_4], [company_4]
,[ticker_5], [company_5]
,[ticker_6], [company_6]
,[ticker_7], [company_7]
,[ticker_8], [company_8]
FROM TickersBySector
INNER JOIN CompanyBySector ON TickersBySector.sector = CompanyBySector.sector

Selecting Top Row in Calculated Column

I need to subtract the top row of a table that has multiple records from another table that has one row. One table has assets with a single date and the other has multiple assets grouped by older dates. I am also limiting the results to cases where the newer asset value is more than 40% above or below the older asset value.
I have already tried using the row_number function to pull the top row from the second table but am having trouble with the subquery.
Select
p.pid, e.coname, p.seq, p.valmo, p.valyr, p.assets,
(case
when ((p.assets-p1.assets)/p.assets) * 100 <= -40
or ((p.assets-p1.assets)/p.assets) * 100 >=40
and p.assets <> p1.assets
then ((p.assets - p1.assets) / p.assets) * 100
end) as "PercentDiff"
from
pen_plans p
join
pen_plans_archive p1 on p.pid = p1.pid and p.seq = p1.seq
join
entities e on p.pid = e.pid
where
p.assets > 500000 and e.mmd = 'A'
order by
VALYR desc
So I need to subtract the top row in "pen_plans_archive" from the assets in "pen_plans". I've tried to combine something like this in a subquery into the above:
select assets from (select assets row_number() over (partition by assets
order by valyr DESC) as R
from pen_plans_archive) RS
where R=1 order by valyr DESC
The "assets" column definition is Number(12,0).
I expect the query to produce the columns PID, CONAME, SEQ, VALMO, VALYR, ASSETS, and the calculated column PERCENTDIFF with no null values.
The first query produces null values and also subtracts every asset figure in pen_plans_archive from pen_plans, which is not what I need.
Are you just trying to use TOP?
SELECT TOP 1 <column>
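For example, a sketch only (assuming a dialect that supports TOP, such as SQL Server, and the pen_plans_archive columns mentioned in the question):
-- Sketch: grab just the most recent archive row.
-- Oracle 12c+ would use FETCH FIRST 1 ROW ONLY instead of TOP 1.
SELECT TOP 1 assets
FROM pen_plans_archive
ORDER BY valyr DESC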

SQL For Each - write to file

So we have a Production Table with the following data (in simple terms)
ID, Item, QTY
1,AAA,3
2,BBB,4
So: 2 production tasks, one with a quantity of 3 and one with a quantity of 4. I require an export file (txt) that would display the following:
ID,Item
1,AAA
1,AAA
1,AAA
2,BBB
2,BBB
2,BBB
2,BBB
Basically, I need a file with a line for each unit of the quantity. This is because I use a 3rd-party piece of software that uses each line in the file to create a ticket/label for the task.
Any help on the above would be greatly appreciated.
Thanks,
Dean
Basically, you need a numbers table, so you can do:
select p.id, p.item
from production p join
numbers n
on n.n <= p.qty;
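If you don't already have a numbers table, a tiny one is easy to create (a sketch only; the table name and the range are assumptions, sized to the largest qty you expect):
-- Sketch: a minimal numbers table covering quantities 1 through 10.
-- Multi-row VALUES works in most databases; use repeated single-row inserts otherwise.
create table numbers (n int);
insert into numbers (n)
values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);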
If your table has enough rows, then one ANSI-standard method that will work in many databases is:
select p.id, p.item
from production p join
(select row_number() over (order by id) as n
from production
) n
on n.n <= p.qty;
There are other database-specific ways of generating numbers.
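For instance, a sketch of one such database-specific method (assuming PostgreSQL, where generate_series is built in):
-- Sketch, PostgreSQL only: generate one row per unit of qty directly.
select p.id, p.item
from production p
cross join lateral generate_series(1, p.qty) as g(n);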
Another ANSI-compatible method is a recursive CTE:
with cte (id, item, qty) as (
      select id, item, qty
      from production
      union all
      select id, item, qty - 1
      from cte
      where qty > 1
     )
select id, item
from cte;
(Note: some databases require the RECURSIVE keyword.)