How to find bad references in a table in Oracle - sql

I have a data problem I need to clean up. Basically I have two tables storing "package" information, one table for documents and one table for audit information. I have entries in the package tables that reference documents that no longer exist and have been replaced (same name but different id) and I want to write a query to find all the bad ones and which new document should replace them. The only thing linking these two is a string value in the audit table which stores the document name (not id).
I've set up a sample schema here: http://sqlfiddle.com/#!4/997bda/1
package_s holds the single-valued attributes of a package in our application
package_r holds the repeating values of a package in our application
(the two tables are joined on the same value in the id column)
audit_info holds all the audit information for a package
docs holds all the documents that can be attached to a package
This query finds the packages with bad attachments (may be more than one per package)
select distinct ps.pkgname, pr.doc_list
from package_s ps, package_r pr
where ps.id = pr.id
and not exists (
select 1 from docs
where pr.doc_list = id
)
order by 1,2 asc
;
I need to build a query with the following rules:
I need to return at least the package id, the position value and the new document id (I will build an update statement to put this new document id in the row matching the package id / position in the package_r table)
the way to get the document name from the audit information is:
SUBSTR(description,0,INSTR(description,'[')-2)
If the document was Added and then Removed, it should be ignored (string_1)
string_2 must not be 'Supporting'
the new document must match
state = 'Master'
latest = 1
pub = '0'
Right now I have a semi-working script that works on a per package basis, but the problem is affecting 2000+ packages. I find the audit entries that don't match documents correctly attached to the package and then search for those names in the document table. The problem with this is since there is no direct link between the package and document tables, if there are multiple problem attachments on one package, each "new" document is returned once per position value, i.e.
package id   bad doc id   position   new doc id
p1           d1           -1         d1-new
p1           d1           -1         d4-new
p1           d4           -2         d1-new
p1           d4           -2         d4-new
It doesn't matter which new id goes into which position value, but duplicated results like these make it hard to mass-generate update scripts; some manual filtering would be required.
This is a somewhat complex and unique data issue, so any help would be greatly appreciated.

This query works according to the information provided:
with ai as (
select a1.audited_id id, dc.id doc_id, dc.docname,
row_number() over (partition by a1.audited_id order by dc.id) rn
from audit_info a1
join docs dc
on dc.state = 'Master' and dc.latest = 1 and dc.pub = '0'
and dc.docname = substr(a1.description, 1, instr(a1.description, '[')-2)
where string_1 = 'Added' and string_2 <> 'Supporting'
and not exists (
select * from audit_info a2
where a2.audited_id = a1.audited_id and string_1 = 'Removed'
and a2.description = a1.description )
and not exists ( -- here matching docs are eliminated
select 1 from package_r pr
where pr.id = a1.audited_id and pr.doc_list = dc.id ) ),
p as (
select ps.id, ps.pkgname, pr.doc_list, pr.position,
row_number() over (partition by ps.id order by doc_list) rn
from package_s ps
join package_r pr on pr.id = ps.id
where not exists ( select * from docs where pr.doc_list = docs.id )
)
select p.id, p.pkgname, p.doc_list, p.position
, ai.docname, ai.doc_id
from p join ai on ai.id = p.id and p.rn = ai.rn
order by p.id, p.doc_list, ai.doc_id
Output:
ID PKGNAME DOC_LIST POSITION DOCNAME DOC_ID
-- ------- -------- -------- ------- ------
p1 000001 d3 -3 doc3 d3-new
p1 000001 d4 -4 doc4 d4-new
p2 000002 d5 -2 doc5 d5-new
p4 000004 d6 -1 doc6 d6-new
Edit: Answers to issues reported in comments
"it is identifying packages that do not have bad values, and then the doc_list column is blank"
Note that the query for identifying packages (my subquery p) is basically your query; I just added a counter to it.
I guess that some process or application, or someone working manually, cleared the doc_list column in package_r.
If you don't want such entries, just add the condition and trim(doc_list) is not null to subquery p.
"for the ones it gets right on the package part (they have a bad value) it is bringing back the wrong docname/doc_id to replace the bad value with, it is a different doc_id in the list."
I only partially understand this. Can you add such entries to your examples (in the Fiddle, or just edit your question and add the problematic input rows and the expected output for them)?
"It doesn't matter which new id goes into which position value".
I made the assignment this way: if we had two old docs with names "ABC" and "DEF", and the corrected docs have names "XXA" and "DE12",
then they will be linked as "ABC"->"DE12" and "DEF"->"XXA" (alphabetical ordering seems more rational than totally random).
To make the assignment arbitrary instead, change order by ... to order by null in both row_number() functions.
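The pairing trick in the answer (number the bad rows and the candidate replacements separately within each package, then join on that number) can be tried in isolation. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module, version 3.25 or later for window functions) as a stand-in for Oracle; the tables bad and fix and their rows are hypothetical and only mimic the outputs of the two subqueries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: 'bad' mimics subquery p, 'fix' mimics subquery ai.
    CREATE TABLE bad (pkg TEXT, doc_list TEXT, position INTEGER);
    CREATE TABLE fix (pkg TEXT, doc_id TEXT);
    INSERT INTO bad VALUES ('p1', 'd1', -1), ('p1', 'd4', -2), ('p2', 'd5', -1);
    INSERT INTO fix VALUES ('p1', 'd1-new'), ('p1', 'd4-new'), ('p2', 'd5-new');
""")

# Number both sides per package, then join on (package, row number).
# A plain join on package alone would return every bad/new combination.
rows = conn.execute("""
    WITH b AS (
        SELECT pkg, doc_list, position,
               ROW_NUMBER() OVER (PARTITION BY pkg ORDER BY doc_list) AS rn
        FROM bad
    ),
    f AS (
        SELECT pkg, doc_id,
               ROW_NUMBER() OVER (PARTITION BY pkg ORDER BY doc_id) AS rn
        FROM fix
    )
    SELECT b.pkg, b.doc_list, b.position, f.doc_id
    FROM b JOIN f ON f.pkg = b.pkg AND f.rn = b.rn
    ORDER BY b.pkg, b.doc_list
""").fetchall()

for row in rows:
    print(row)
```

Each bad row is matched with exactly one replacement, so the four-row cross product from the question collapses to one row per position.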

Related

How to delete duplicate entries from a specific field by SQL query in DSpace?

I am using DSpace version 6.3 here. I discovered that I have created duplicate entries when performing a batch import. Using the SQL query from this answer, I managed to list all the duplicates in a given field. For this example, I am using the dc.subject field (metadata_field_id=57) to list items (dspace_object_id) that have duplicate values in dc.subject field.
Below is the query that I used:
SELECT metadata_value_id,
dspace_object_id,
text_value
FROM (SELECT *,
COUNT(*) OVER (PARTITION BY dspace_object_id, text_value) AS cnt
FROM metadatavalue where metadata_field_id=57) e
WHERE cnt > 1
Below is the sample list generated from that query:
metadata_value_id  dspace_object_id                      text_value
503018             13f07109-7797-4d5b-a8bd-1f9e91a2433d  pompanos
503021             13f07109-7797-4d5b-a8bd-1f9e91a2433d  pompanos
503217             233d1e67-b698-4e90-8e70-a175776b2d80  pests
503219             233d1e67-b698-4e90-8e70-a175776b2d80  pests
83574              47753988-fc2a-4416-b20d-acbff6e256de  Penaeus monodon
10800              47753988-fc2a-4416-b20d-acbff6e256de  Penaeus monodon
531520             50965923-bc65-4fdf-af61-2c8debdfe057  Penaeus monodon
531521             50965923-bc65-4fdf-af61-2c8debdfe057  Penaeus monodon
483882             57d0544c-1825-431a-acf9-eb835c24920b  development
483879             57d0544c-1825-431a-acf9-eb835c24920b  development
Based on the table above, you can see that there are 2 occurrences (others have more than 2) of a text_value in the same item (dspace_object_id).
My question is how can I delete the duplicate but retain the first occurrence?
In the table above, rows with the following metadata_value_id should be deleted:
503021
503219
83574 <-- This should be deleted instead of 10800 because 10800 was put in (inserted) first.
531521
483882 <-- Should be deleted instead of 483879 because of the same reason above.
I'm hoping to do this via SQL query alone because what I have done before is import this list to a Google spreadsheet, remove the duplicates there using a Remove Duplicates add-on, download it as a CSV, and then use that CSV to update the metadatavalue table.
Assuming that the 'first occurrence' means the one with the smaller metadata_value_id:
delete from metadatavalue as Not1stOcc
where exists (select *
from metadatavalue as FirstOcc
where FirstOcc.text_value = Not1stOcc.text_value
and FirstOcc.dspace_object_id = Not1stOcc.dspace_object_id -- compare only within the same item
and FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id
)
should work.
For my future reference, here is my modified version of tinazmu's answer, which worked for my use case.
DELETE FROM metadatavalue AS Not1stOcc
WHERE Not1stOcc.metadata_field_id = 57
AND EXISTS (SELECT 1
FROM metadatavalue AS FirstOcc
WHERE FirstOcc.text_value = Not1stOcc.text_value
AND FirstOcc.dspace_object_id = Not1stOcc.dspace_object_id
AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id
AND FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id
)
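Before running a delete like this against a live DSpace database, it may help to rehearse it on a throwaway copy. The sketch below assumes SQLite (via Python's sqlite3) rather than the PostgreSQL that DSpace uses, a cut-down metadatavalue table with only the four relevant columns, and made-up item ids ('item-a', 'item-b'); the field 64 row is invented to show that other fields are left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down, hypothetical version of the metadatavalue table.
    CREATE TABLE metadatavalue (
        metadata_value_id INTEGER PRIMARY KEY,
        dspace_object_id  TEXT,
        metadata_field_id INTEGER,
        text_value        TEXT
    );
    INSERT INTO metadatavalue VALUES
        (10800,  'item-a', 57, 'Penaeus monodon'),
        (83574,  'item-a', 57, 'Penaeus monodon'),
        (531520, 'item-b', 57, 'Penaeus monodon'),
        (531521, 'item-b', 57, 'Penaeus monodon'),
        (600000, 'item-b', 64, 'Penaeus monodon');
""")

# Delete a row when an older row (smaller id) exists with the same
# item, the same field and the same text. The oldest row always survives.
conn.execute("""
    DELETE FROM metadatavalue
    WHERE metadata_field_id = 57
      AND EXISTS (
          SELECT 1
          FROM metadatavalue AS first_occ
          WHERE first_occ.text_value        = metadatavalue.text_value
            AND first_occ.dspace_object_id  = metadatavalue.dspace_object_id
            AND first_occ.metadata_field_id = metadatavalue.metadata_field_id
            AND first_occ.metadata_value_id < metadatavalue.metadata_value_id
      )
""")

remaining = [r[0] for r in conn.execute(
    "SELECT metadata_value_id FROM metadatavalue ORDER BY metadata_value_id")]
print(remaining)
```

The same value in two different items is kept once per item, which is why the match on dspace_object_id matters.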

SQL - Returning fields based on where clause then joining same table to return max value?

I have a table named Ticket Numbers, which (for this example) contains the columns:
Ticket_Number
Assigned_Group
Assigned_Group_Sequence_No
Reported_Date
Each ticket number could contain 4 rows, depending on how many times the ticket changed assigned groups. Some of these rows could contain an assigned group of "Desktop Support," but some may not. Here is an example:
Example of raw data
What I am trying to accomplish is to get an output that contains every ticket number that has a 'Desktop Support' row, together with the assigned group at the max sequence number. Here is what I am trying to accomplish with SQL:
Queried Data
I'm trying to use SQL with the following query but have no clue what I'm doing wrong:
select ih.incident_number,ih.assigned_group, incident_history2.maxseq, incident_history2.assigned_group
from incident_history_public as ih
left join
(
select max(assigned_group_seq_no) maxseq, incident_number, assigned_group
from incident_history_public
group by incident_number, assigned_group
) incident_history2
on ih.incident_number = incident_history2.incident_number
and ih.assigned_group_seq_no = incident_history2.maxseq
where ih.ASSIGNED_GROUP LIKE '%DS%'
Does anyone know what I am doing wrong?
You might want to create a proper alias for incident_history. e.g.
from incident_history as incident_history1
and
on incident_history1.ticket_number = incident_history2.ticket_number
and incident_history1.assigned_group_seq_no = incident_history2.maxseq
In my humble opinion a first error could be that I don't see any column named "incident_history2.assigned_group".
I would try to use a common table expression to get only the ticket numbers that contain 'Desktop Support':
WITH desktop as (
SELECT distinct Ticket_Number
FROM incident_history
WHERE Assigned_Group = 'Desktop Support'
),
Then an inner join of that result with your inner query gives the ticket number and maxseq, so that in a second step you can also get the "MAX_Group" (tmp continues the same WITH clause, so the keyword is not repeated):
tmp AS (
SELECT i2.Ticket_Number, i2.maxseq
FROM desktop D inner join
(SELECT Ticket_number, max(assigned_group_seq_no) as maxseq
FROM incident_history
GROUP BY ticket_number) as i2
ON D.Ticket_Number = i2.Ticket_Number
)
SELECT i.Ticket_Number, i.Assigned_Group as MAX_Group, T.maxseq, i.Reported_Date
FROM tmp T inner join incident_history i
ON T.Ticket_Number = i.Ticket_Number and i.assigned_group_seq_no = T.maxseq
I think there are several different methods to solve this, but I really hope this one is helpful for you!
For more information about Common Table Expression: https://www.essentialsql.com/introduction-common-table-expressions-ctes/
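The two-step idea (first collect the tickets that ever reached Desktop Support, then look up the group at each ticket's highest sequence number) can be checked on a few rows. This is a rough sketch, assuming SQLite via Python's sqlite3 and an invented incident_history table with three sample tickets; the CTE name last_seq is also made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented sample data: T2 never reaches Desktop Support.
    CREATE TABLE incident_history (
        ticket_number         TEXT,
        assigned_group        TEXT,
        assigned_group_seq_no INTEGER
    );
    INSERT INTO incident_history VALUES
        ('T1', 'Service Desk',    1),
        ('T1', 'Desktop Support', 2),
        ('T1', 'Network',         3),
        ('T2', 'Service Desk',    1),
        ('T2', 'Network',         2),
        ('T3', 'Desktop Support', 1);
""")

rows = conn.execute("""
    WITH desktop AS (          -- tickets that were ever with Desktop Support
        SELECT DISTINCT ticket_number
        FROM incident_history
        WHERE assigned_group = 'Desktop Support'
    ),
    last_seq AS (              -- highest sequence number per ticket
        SELECT ticket_number, MAX(assigned_group_seq_no) AS maxseq
        FROM incident_history
        GROUP BY ticket_number
    )
    SELECT i.ticket_number, i.assigned_group AS max_group, l.maxseq
    FROM desktop d
    JOIN last_seq l ON l.ticket_number = d.ticket_number
    JOIN incident_history i
      ON i.ticket_number = l.ticket_number
     AND i.assigned_group_seq_no = l.maxseq
    ORDER BY i.ticket_number
""").fetchall()

for row in rows:
    print(row)
```

Grouping by ticket number alone is the key difference from the query in the question, which also grouped by assigned_group and therefore computed a maximum per group instead of per ticket.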

How to select and alias data only from the column that contains a certain string in Oracle SQL

I am working with address records in an Oracle db. Each row contains information on two parents. There are four columns for phone number types and four columns for numbers. The types are Other_No_Type_1, Other_No_Type_2, Other_No_Type_3 and Other_No_Type_4, and any one of them might contain a value of either 'Name1:Mobile', 'Name2:Mobile', 'Father Work', or 'Mother Work'. That value refers to the number in the next column (Other_No_1, Other_No_2, Other_No_3 or Other_No_4). I need to pull the Other_No_x value when Other_No_Type_x is equal to Name1:Mobile and alias it "contact_1_mobile", and pull the Name2:Mobile number and alias it "contact_2_mobile". In my SELECT below, you can see that I've just written "a.other_no_1 as contact_1_mobile" for example, but actually that might be retrieving a work number or the Name2:Mobile number. This is my first request for help to the forum, so I apologize for probably not presenting my question properly. Thank you for any help you can give. Here is my statement as it stands now:
SELECT final.*
FROM (
SELECT
--Name 1 in P1 household
a.id
,a.name1_web_user_id as contact_1_id
,a.name1_full_name as contact_1_name
,a.other_no_1 as contact_1_mobile -- THIS IS MY PROBLEM: "Other_No_1" MAY NOT ACTUALLY BE THE NAME1:MOBILE NUMBER TYPE I NEED. THIS PROBLEM IS THE SAME IN EACH SECTION OF MY STATEMENT.
,a.email as contact_1_email
--Name 2 in P1 household
,a.name2_web_user_id as contact_2_id
,a.name2_full_name as contact_2_name
,a.other_no_2 as contact_2_mobile -- PROBLEM: HERE I ACTUALLY NEED TO FIND THE COLUMN THAT CONTAINS THE "Name2:Mobile" NUMBER
,a.EMAIL_2 as contact_2_email
FROM rg_student s left outer join rg_addr a on s.id = a.id
WHERE (
(a.addr_code='P1' AND a.rg_active = 'Y') AND ((a.name1_web_user_id is not null) OR (a.name2_web_user_id is not null))
AND a.id in(SELECT id from rg_student where student_group='Student')
)
UNION
SELECT
--Name 1 in P2 household
a.id
,a.name1_web_user_id as contact_3_id
,a.name1_full_name as contact_3_name
,a.other_no_1 as contact_3_mobile -- PROBLEM LINE
,a.email as contact_3_email
--Name 2 in P2 household
,a.name2_web_user_id as contact_4_id
,a.name2_full_name as contact_4_name
,a.other_no_2 as contact_4_mobile -- PROBLEM LINE
,a.EMAIL_2 as contact_4_email
FROM rg_student s left outer join rg_addr a on s.id = a.id
WHERE (
(a.addr_code='P2' AND a.rg_active = 'Y') AND ((a.name1_web_user_id is not null) OR (a.name2_web_user_id is not null))
AND a.id in(SELECT id from rg_student where student_group='Student')
)
)final
ORDER BY final.id
Just use case or decode:
case 'Name1:Mobile'
when Other_No_Type_1 then Other_No_1
when Other_No_Type_2 then Other_No_2
when Other_No_Type_3 then Other_No_3
when Other_No_Type_4 then Other_No_4
end as contact_1_mobile
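To see how the "reversed" simple CASE behaves (the constant is the selector and the type columns are the WHEN branches), here is a small sketch. It assumes SQLite via Python's sqlite3 instead of Oracle, a cut-down rg_addr table with invented phone numbers, and a hypothetical helper pick_number that only builds the CASE text for a fixed label.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down rg_addr with invented rows; only the phone columns are kept.
    CREATE TABLE rg_addr (
        id INTEGER,
        other_no_type_1 TEXT, other_no_1 TEXT,
        other_no_type_2 TEXT, other_no_2 TEXT,
        other_no_type_3 TEXT, other_no_3 TEXT,
        other_no_type_4 TEXT, other_no_4 TEXT
    );
    INSERT INTO rg_addr VALUES
        (1, 'Father Work',  '555-0100', 'Name1:Mobile', '555-0101',
            'Name2:Mobile', '555-0102', NULL, NULL),
        (2, 'Name2:Mobile', '555-0200', NULL, NULL,
            NULL, NULL, 'Name1:Mobile', '555-0201'),
        (3, 'Mother Work',  '555-0300', NULL, NULL,
            NULL, NULL, NULL, NULL);
""")


def pick_number(label: str) -> str:
    """Build a CASE expression returning the number whose type equals label.

    Hypothetical helper; label must be a trusted constant, not user input.
    """
    branches = " ".join(
        f"WHEN other_no_type_{i} THEN other_no_{i}" for i in range(1, 5))
    return f"CASE '{label}' {branches} END"


sql = f"""
    SELECT id,
           {pick_number('Name1:Mobile')} AS contact_1_mobile,
           {pick_number('Name2:Mobile')} AS contact_2_mobile
    FROM rg_addr
    ORDER BY id
"""
rows = conn.execute(sql).fetchall()

for row in rows:
    print(row)
```

A row with no matching type falls through every branch and yields NULL, since the CASE has no ELSE.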

SQL Filtering duplicate rows due to bad ETL

The database is Postgres but any SQL logic should help.
I am retrieving the set of sales quotations that contain a given product within the bill of materials. I'm doing that in two steps: step 1, retrieve all DISTINCT quote numbers which contain a given product (by product number).
The second step, retrieve the full quote, with all products listed for each unique quote number.
So far, so good. Now the tough bit. Some rows are duplicates, some are not. Those that are duplicates (quote number & quote version & line number) might or might not have maintenance on them. I want to pick the row that has maintenance greater than 0. The duplicate rows I want to exclude are those that have a 0 maintenance. The problem is that some rows, which have no duplicates, have 0 maintenance, so I can't just filter on maintenance.
To make this exciting, the database holds quotes over 20+ years. And the data scientists guys have just admitted that maybe the ETL process has some bugs...
--- step 0
--- cleanup the workspace
SET CLIENT_ENCODING TO 'UTF8';
DROP TABLE IF EXISTS product_quotes;
--- step 1
--- get list of Product Quotes
CREATE TEMPORARY TABLE product_quotes AS (
SELECT DISTINCT master_quote_number
FROM w_quote_line_d
WHERE item_number IN ( << model numbers >> )
);
--- step 2
--- Now join on that list
SELECT
d.quote_line_number,
d.item_number,
d.item_description,
d.item_quantity,
d.unit_of_measure,
f.ref_list_price_amount,
f.quote_amount_entered,
f.negtd_discount,
--- need to calculate discount rate based on list price and negtd discount (%)
CASE
WHEN ref_list_price_amount > 0
THEN 100 - (ref_list_price_amount + negtd_discount) / ref_list_price_amount *100
ELSE 0
END AS discount_percent,
f.warranty_months,
f.master_quote_number,
f.quote_version_number,
f.maintenance_months,
f.territory_wid,
f.district_wid,
f.sales_rep_wid,
f.sales_organization_wid,
f.install_at_customer_wid,
f.ship_to_customer_wid,
f.bill_to_customer_wid,
f.sold_to_customer_wid,
d.net_value,
d.deal_score,
f.transaction_date,
f.reporting_date
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON (pq.master_quote_number = d.master_quote_number)
INNER JOIN w_quote_f f ON
(f.quote_line_number = d.quote_line_number
AND f.master_quote_number = d.master_quote_number
AND f.quote_version_number = d.quote_version_number)
WHERE d.net_value >= 0 AND item_quantity > 0
ORDER BY f.master_quote_number, f.quote_version_number, d.quote_line_number
The logic to filter the duplicate rows is like this:
For each master_quote_number / version_number pair, check to see if there are duplicate line numbers. If so, pick the one with maintenance > 0.
Even in a CASE statement, I'm not sure how to write that.
Thoughts? The database is Postgres but any SQL logic should help.
I think you will want to use Window Functions. They are, in a word, awesome.
Here is a query that would "dedupe" based on your criteria:
select *
from (
select
* -- simplifying here to show the important parts
,row_number() over (
partition by f.master_quote_number, f.quote_version_number, f.quote_line_number
order by f.maintenance_months desc) as seqnum
from w_quote_line_d d
inner join product_quotes pq
on (pq.master_quote_number = d.master_quote_number)
inner join w_quote_f f
on (f.quote_line_number = d.quote_line_number
and f.master_quote_number = d.master_quote_number
and f.quote_version_number = d.quote_version_number)
) x
where seqnum = 1
The use of row_number() with these partition by and order by criteria guarantees that only ONE row for each combination of master_quote_number/quote_version_number/quote_line_number gets the value 1, and it will be the one with the highest value in maintenance_months (if your colleagues are right, there would only be one with a value > 0 anyway). A line without duplicates is alone in its partition, so it is kept even when its maintenance is 0.
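A quick way to convince yourself of the row_number() behaviour is to run it on a handful of rows. The sketch below assumes SQLite 3.25 or later (via Python's sqlite3) in place of Postgres, and a single invented table quote_lines that stands in for the joined w_quote_line_d / w_quote_f result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Invented stand-in for the joined quote tables.
    CREATE TABLE quote_lines (
        master_quote_number  TEXT,
        quote_version_number INTEGER,
        quote_line_number    INTEGER,
        maintenance_months   INTEGER
    );
    INSERT INTO quote_lines VALUES
        ('Q1', 1, 1,  0),   -- duplicate of line 1, no maintenance
        ('Q1', 1, 1, 12),   -- duplicate of line 1, with maintenance
        ('Q1', 1, 2,  0),   -- unique line with zero maintenance: must be kept
        ('Q1', 1, 3, 24),
        ('Q1', 1, 3,  0);
""")

# One partition per (quote, version, line); highest maintenance gets seqnum 1.
rows = conn.execute("""
    SELECT master_quote_number, quote_version_number,
           quote_line_number, maintenance_months
    FROM (
        SELECT q.*,
               ROW_NUMBER() OVER (
                   PARTITION BY master_quote_number,
                                quote_version_number,
                                quote_line_number
                   ORDER BY maintenance_months DESC) AS seqnum
        FROM quote_lines q
    ) x
    WHERE seqnum = 1
    ORDER BY master_quote_number, quote_version_number, quote_line_number
""").fetchall()

for row in rows:
    print(row)
```

Line 2 survives with its zero maintenance because it has no competitor in its partition, which is exactly the case a plain filter on maintenance > 0 would lose.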
Can you do something like...
select
*
from
w_quote_line_d d
inner join
(
select
...
,max(maintenance)
from
w_quote_line_d
group by
...
) d1
on
d1.id = d.id
and d1.maintenance = d.maintenance;
Am I understanding your problem correctly?
Edit: Forgot the group by!
I'm not sure, but maybe you could Group By all other columns and use MAX(Maintenance) to get only the greatest.
What do you think?

outer query to list only if its rowcount equates to inner subquery

Need help on a query using sql server 2005
I have two tables:
code (chargecode, chargeid, orgid)
entry (chargeid, itemNo, rate)
I need to list all the chargeids in the entry table when it contains entries with different chargeids that are listed in the code table under the same chargecode.
Data:
code (chargecode, chargeid, orgid)
100,1,100
100,2,100
100,3,100
101,11,100
101,12,100
entry (chargeid, itemNo, rate)
1,x1,1
1,x2,2
2,x3,2
11,x4,1
11,x5,1
Using the above data, the query should list chargeids 1 and 2 but not 11.
I found a way to count how many rows in entry satisfy the criteria, but I am failing to get the chargeids:
select count (distinct chargeId)
from entry where chargeid in (select chargeid from code where chargecode = (SELECT A.chargecode
from code as A join code as B
ON A.chargecode = B.chargeCode and A.chargetype = B.chargetype and A.orgId = B.orgId AND A.CHARGEID = b.CHARGEid
group by A.chargecode,A.orgid
having count(A.chargecode) > 1)
)
First off: I apologise for my completely inaccurate original answer.
The solution to your problem is a self-join. Self-joins are used when you want to select more than one row from the same table. In our case we want to select two charge IDs that have the same charge code:
SELECT DISTINCT c1.chargeid, c2.chargeid FROM code c1
JOIN code c2 ON c1.chargeid != c2.chargeid AND c1.chargecode = c2.chargecode
JOIN entry e1 ON e1.chargeid = c1.chargeid
JOIN entry e2 ON e2.chargeid = c2.chargeid
WHERE c1.chargeid < c2.chargeid
Explanation of this:
First we pick any two charge IDs from 'code'. The DISTINCT avoids duplicates. We make sure they're two different IDs and that they map to the same chargecode.
Then we join on 'entry' (twice) to make sure they both appear in the entry table.
Without the WHERE clause this approach gives (for your example) the pairs (1,2) and (2,1). So we also insist on an ordering; this cuts the result set down to just (1,2), as you described.
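The self-join can be run against the sample data from the question to confirm the result. This is a minimal sketch, assuming SQLite via Python's sqlite3 as a stand-in for SQL Server 2005; the column aliases first_id and second_id are added only for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE code  (chargecode INTEGER, chargeid INTEGER, orgid INTEGER);
    CREATE TABLE entry (chargeid INTEGER, itemNo TEXT, rate INTEGER);
    INSERT INTO code VALUES
        (100, 1, 100), (100, 2, 100), (100, 3, 100),
        (101, 11, 100), (101, 12, 100);
    INSERT INTO entry VALUES
        (1, 'x1', 1), (1, 'x2', 2), (2, 'x3', 2),
        (11, 'x4', 1), (11, 'x5', 1);
""")

# Pair up two different charge ids that share a charge code, and keep the
# pair only if both ids occur in entry. c1 < c2 removes mirrored pairs.
pairs = conn.execute("""
    SELECT DISTINCT c1.chargeid AS first_id, c2.chargeid AS second_id
    FROM code c1
    JOIN code  c2 ON c1.chargecode = c2.chargecode
                 AND c1.chargeid < c2.chargeid
    JOIN entry e1 ON e1.chargeid = c1.chargeid
    JOIN entry e2 ON e2.chargeid = c2.chargeid
""").fetchall()

print(pairs)
```

Charge id 11 is excluded because its only partner under code 101 (id 12) never appears in entry, and id 3 is excluded for the same reason under code 100.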