SQL to determine which rows to retain and which to remove - pdf

Good Morning Everyone,
I am having a bit of a mental block and was going to see if anyone can help me. I have a table that inventories PDF Files that our office creates. We have changed the naming convention and I am trying to develop logic that specifies when a PDF with the new naming convention has been created to tag the old one so that I can develop a batch script to move them out of the files location. Below are some examples. Each file is its own row in the table by the way.
PAR ORIGFILENAME
111100000012 | 1100000012.pdf
111100000012 | 1100000012_C_1_UB.pdf
111100000012 | 1100000012_R_1.pdf
The new naming convention contains _R_Number or _C_Number. In the above example The first file is old and I want to retain the second 2.
A second example that has a bit more. Below are 5 files. I want to retain the last two that have the new naming convention and remove the top 3.
PAR ORIGFILENAME
1100000076 1100000076-2.pdf
1100000076 1100000076-3.pdf
1100000076 1100000076.pdf
1100000076 1100000076_C_7_BARN.pdf
1100000076 1100000076_R_1.pdf
My plans if I can key on these old files when a new one exists is to develop those names into a batch script and incorporate it into a SSIS package that will run weekly to keep our PDF repository clean. I appreciate any help and incite.

The following should work, although a more varied amount of sample data would be useful.
The following uses an updatable CTE to identify the old/new format names and deletes the old format where the same PAR has a new format
with f as (
select *,
case when OrigFilename like '%*_%' escape '*' then 0 else 1 end del
from t
)
delete from f
where del=1
and exists (select * from f f2 where f2.par=f.par and f2.del=0)

If you're trying to highlight the records you want to remove - and you only want to return the records in the old format, when both a C_7 and R_1 record exists, maybe something like this?
;WITH c_7_records AS (
SELECT par
FROM my_table
WHERE origfilename LIKE '%_C_7_%'
),
r_1_records AS (
SELECT par
FROM my_table
WHERE origfilename LIKE '%_R_1%'
),
records_to_remove AS (
SELECT
DISTINCT mt.origfilename
FROM my_table AS mt
JOIN c_7_records AS cr ON mt.par = cr.par
JOIN r_1_records AS rr ON mt.par = rr.par
WHERE mt.origfilename NOT LIKE '%_C_7_%'
AND mt.origfilename NOT LIKE '%_R_1%'
)
SELECT * FROM records_to_remove;
sql fiddle

Related

Completely Unique Rows and Columns in SQL

I want to randomly pick 4 rows which are distinct and do not have any entry that matches with any of the 4 chosen columns.
Here is what I coded:
SELECT DISTINCT en,dialect,fr FROM words ORDER BY RANDOM() LIMIT 4
Here is some data:
**en** **dialect** **fr**
number SFA numero
number TRI numero
hotel CAI hotel
hotel SFA hotel
I want:
**en** **dialect** **fr**
number SFA numero
hotel CAI hotel
Some retrieved rows would have something similar with each other, like having the same en or the same fr, I would like to retrieved rows that do not share anything similar with each other, how do I do that?
I think I’d do this in the front end code rather the dB, here’s a pseudo code (don’t know what your node looks like):
var seenEn = “en not in (''“;
var seenFr = “fr not in (''“;
var rows =[];
while(rows.length < 4)
{
var newrow = sqlquery(“SELECT *
FROM table WHERE “ + seenEn + “) and ”
+ seenFr + “) ORDER BY random() LIMIT 1”);
if(!newrow)
break;
rows.push(newrow);
seenEn += “,‘“+ newrow.en + “‘“;
seenFr += “,‘“+ newrow.fr + “‘“;
}
The loop runs as many times as needed to retrieve 4 rows (or maybe make it a for loop that runs 4 times) unless the query returns null. Each time the query returns the values are added to a list of values we don’t want the query to return again. That list had to start out with some values (null) that are never in the data, to prevent a syntax error when concatenation a comma-value string onto the seenXX variable. Those syntax errors can be avoided in other ways like having a Boolean of “if it’s the first value don’t put the comma” but I chose to put dummy ineffective values into the sql to make the JS simpler. Same goes for the
As noted, it looks like JS to ease your understanding but this should be treated as pseudo code outlining a general algorithm - it’s never been compiled/run/tested and may have syntax errors or not at all work as JS if pasted into your file; take the idea and work it into your solution
Please note this was posted from an iphone and it may have done something stupid with all the apostrophes and quotes (turned them into the curly kind preferred by writers rather than the straight kind used by programmers)
You can use Rank or find first row for each group to achieve your result,
Check below , I hope this code will help you
SELECT 'number' AS Col1, 'SFA' AS Col2, 'numero' AS Col3 INTO #tbl
UNION ALL
SELECT 'number','TRI','numero'
UNION ALL
SELECT 'hotel','CAI' ,'hotel'
UNION ALL
SELECT 'hotel','SFA','hotel'
UNION ALL
SELECT 'Location','LocationA' ,'Location data'
UNION ALL
SELECT 'Location','LocationB','Location data'
;
WITH summary AS (
SELECT Col1,Col2,Col3,
ROW_NUMBER() OVER(PARTITION BY p.Col1 ORDER BY p.Col2 DESC) AS rk
FROM #tbl p)
SELECT s.Col1,s.Col2,s.Col3
FROM summary s
WHERE s.rk = 1
DROP TABLE #tbl

Ways to Clean-up messy records in sql

I have the following sql data:
ID Company Name Customer Address 1 City State Zip Date
0108500 AAA Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046
0108500 AAA.Test Mish~Sara Newa Claims Chtiana CO 123 06FE0046
1802600 AAA Test Company Ban, Adj.~Gorge PO Box 83 MouLaurel CA 153 09JS0025
1210600 AAA Test Company Biwel~Brce 97kehst ve Jacn CA 153 04JS0190
AAA Test, AAA.Test and AAA Test Company are considered as one company.
Since their data is messy I'm thinking either to do this:
Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name?
In this case, the AAA Test and AAA.Test will be AAA Test Company.
OR Is there a way to filter only record with company name that are almost the same then they can have option to change it?
If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.
Thank you in advance.
You could use String matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate People's names that have been typed in differently. It can take awhile but it does work well for the fuzzy match you're looking for.
Something like a self join? || is ANSI SQL concat, some products have a concat function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas and dots etc.
Use case-insensitive collation. SOUNDEX can be used etc etc.
I think most Database Servers support Full-Text search ability, and if so there are some functions related to Full-Text search that support Proximity.
for example there is a Near function in SqlServer and here is its documentation https://msdn.microsoft.com/en-us/library/ms142568.aspx
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows
/** Populate dupstr column for fuzzy match **/
update vendor v
set v.dupstr = left(upper(regex_replace(regex_replace(v.companyname,'.',''),' ','')),6)
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep' --indicate keeper record
where
--dupes to clean up
exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr
and v.id != v1.id )
and
( --keeper has longest name
length(v.companyname) =
( select max(length(v2.companyname)) from vendor v2
where v.dupstr = v2.dupstr
)
or
--keeper has latest record (assuming ID is sequential)
v.id =
( select max(v3.id) from vendor v3
where v.dupstr = v3.dupstr
)
)
group by v.dupstr
;
The above SQL can be refined to add "dupe" status to other records , or you can do a separate update.
Clean Up Stragglers
Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)
You can use SQL query with SOUDEX of DIFFRENCE
For example:
SELECT DIFFERENCE ('AAA Test','AAA Test Company')
DIFFERENCE returns 0 - 4 ( 4 = almost the same, 0 - totally diffrent)
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017

How to find bad references in a table in Oracle

I have a data problem I need to clean up. Basically I have two tables storing "package" information, one table for documents and one table for audit information. I have entries in the package tables that reference documents that no longer exist and have been replaced (same name but different id) and I want to write a query to find all the bad ones and which new document should replace them. The only thing linking these two is a string value in the audit table which stores the document name (not id).
I've setup a sample schema here: http://sqlfiddle.com/#!4/997bda/1
package_s is the single values for a package in our application
package_r is the repeating values for a package in our application
(these are joined with the same value in the id column)
audit_info is all the audit information in a package
docs is all the documents that can be attached to a package
This query finds the packages with bad attachments (may be more than one per package)
select distinct ps.pkgname, pr.doc_list
from package_s ps, package_r pr
where ps.id = pr.id
and not exists (
select 1 from docs
where pr.doc_list = id
)
order by 1,2 asc
;
I need to build a query with the following rules:
I need to return at least the package id, the position value and the new document id (I will build an update statement to put this new document id in the row matching the package id / position in the package_r table)
the way to get the document name from the audit information is:
SUBSTR(description,0,INSTR(description,'[')-2)
If the document was Added and then Removed, it should be ignored (string_1)
string_2 must not be 'Supporting'
the new document must match
state = 'Master'
latest = 1
pub = '0'
Right now I have a semi-working script that works on a per package basis, but the problem is affecting 2000+ packages. I find the audit entries that don't match documents correctly attached to the package and then search for those names in the document table. The problem with this is since there is no direct link between the package and document tables, if there are multiple problem attachments on one package, each "new" document is returned once per position value, i.e.
package id bad doc id position new doc id
p1 d1 -1 d1-new
p1 d1 -1 d4-new
p1 d4 -2 d1-new
p1 d4 -2 d4-new
It doesn't matter which new id goes into which position value, but the duplication result problem like this makes it hard to mass generate update scripts, some manual filtering would be required.
This is a somewhat complex and unique data issue, so any help would be greatly appreciated.
This query works according to informations provided:
with ai as (
select a1.audited_id id, dc.id doc_id, dc.docname,
row_number() over (partition by a1.audited_id order by dc.id) rn
from audit_info a1
join docs dc
on dc.state = 'Master' and dc.latest = 1 and dc.pub = '0'
and dc.docname = substr(a1.description, 1, instr(a1.description, '[')-2)
where string_1 = 'Added' and string_2 <> 'Supporting'
and not exists (
select * from audit_info a2
where a2.audited_id = a1.audited_id and string_1 = 'Removed'
and a2.description = a1.description )
and not exists ( -- here matching docs are eliminated
select 1 from package_r pr
where pr.id = a1.audited_id and pr.doc_list = dc.id ) ),
p as (
select ps.id, ps.pkgname, pr.doc_list, pr.position,
row_number() over (partition by ps.id order by doc_list) rn
from package_s ps
join package_r pr on pr.id = ps.id
where not exists ( select * from docs where pr.doc_list = docs.id )
)
select p.id, p.pkgname, p.doc_list, p.position
, ai.docname, ai.doc_id
from p join ai on ai.id = p.id and p.rn = ai.rn
order by p.id, p.doc_list, ai.doc_id
Output:
ID PKGNAME DOC_LIST POSITION DOCNAME DOC_ID
-- ------- -------- -------- ------- ------
p1 000001 d3 -3 doc3 d3-new
p1 000001 d4 -4 doc4 d4-new
p2 000002 d5 -2 doc5 d5-new
p4 000004 d6 -1 doc6 d6-new
Edit: Answers to issues reported in comments
it is identifying packages that do not have bad values, and then the doc_list column is blank,
Note that query (my subquery p) for identyfing packages is basically your query, I just added counter there.
I guess that some process/application or someone manually cleared column doc_list in package_r.
If you don't want such entries, just add condition and trim(doc_list) is not null in subquery p.
for the ones it gets right on the package part (they have a bad value) it is bringing back the wrong docname/doc_id to replace the bad value with, it is a different doc_id in the list.
I understand this only partially. Can you add such entries to your examples (in Fiddle or just edit your question and add problematic input rows and expected output for them?)
"It doesn't matter which new id goes into which position value".
Assignment I made this way - if we had two old docs with names "ABC", "DEF" and corrected docs have names "XXA", "DE12"
then they will be linked as "ABC"->"DE12" and "DEF"->"XXA" (alphabetical ordering seems more rational than totally random).
To make assigning random change order by ... to order by null in both row_number() functions.

MS SQL complex update query

I've multiple tables and try to update some tables based on matching translation in master tables. I could update the table based on the first available translation, but I would like to do it based on the most frequent one. I found quite a lot of samples on how to achieve that in simple queries, but can't have it work in this complex query below.
In brief I just need to take the most frequent translation [target] from [EC3_800_FR_M] and copy it to [EC3_800_FR], instead of taking the first occurence in the table.
UPDATE [ec3_800_fr]
SET [ec3_800_fr].[target] = (SELECT TOP 1
[ec3_800_fr_m].[target]
FROM [ec3_800_fr_m]
WHERE (
[ec3_800_fr].[enus] = [ec3_800_fr_m].[enus] AND
[ec3_800_fr].[length] = [ec3_800_fr_m].[length] AND
[ec3_800_fr].[key6_domainname] = [ec3_800_fr_m].[key6_domainname] AND
[ec3_800_fr_m].[status] > 1
)
);
who can help me???
merci - thank you - Dank u - Danke
BR
lolo

how to i remove characheters from the prefix table and make sure only from the start of string are removed

I need to be able to strip the following prefixes from product codes as you see I have included a simple query while yes the below shows me the cm im not wanting i cant use replace as it code replace any instance of cm the prefixes are held in the supplire table cross refer with the products table
prefixes are not always two chrachters for example can be TOW
SELECT * , left(prod.productcode, LEN(sup.prefix)) AS MyTrimmedColumn
FROM MSLStore1_Products prod ,supplier sup
WHERE prod.suppid = 9039 AND prod.SgpID = 171
and sup.supno = prod.suppid
Product Codes:
ProductCode
CMDI25L
CMDI300M
CMDI750M
CMXFFP5L
Prefixes:
Prefix
CM
CM
CM
CM
You could use SUBSTRING() for that:
SUBSTRING(prod.productcode, LEN(sup.prefix))
(concrete syntax may vary for different database-managers)
You might want to put your prefixes in another table, and run the productCode field through a filtering UDF to retrieve the final code.
Something like
SELECT * , dbo.FilterPCode(prod.productcode) AS MyTrimmedColumn
FROM MSLStore1_Products prod ,supplier sup
WHERE prod.suppid = 9039 AND prod.SgpID = 171
and sup.supno = prod.suppid
then just define a UDF that takes a string product code and removes any prefix from the front.
You might see a performance hit using a UDF like this for really massive queries, and if that's the case, it might be better to generate a table of table of product codes without the prefix linked to the product codes with the prefix. Don't know enough about the data schema to know if that's truly possible without getting cross links though.
how do i thought use this query and imortant the results from that into the products table external code field
SELECT * , right(prod.productcode, len(prod.productcode) - LEN(sup.prefix) ) AS ExternalCoode
FROM MSLStore1_Products prod ,supplier sup
WHERE prod.suppid = 9217 AND prod.SgpID = 123 and sup.supno = prod.suppid