Find more than one occurrence of a word in the same row - SQL

Query to generate a list of companies that have “prior to” more than once in their names.
For Example:
Company Name
Ittal PLC (adz ll **prior to** 04/2012) (Z Amp C **prior to** 02/2009)

One simple way is to use like:
where CompanyName like '%prior to%prior to%'

Try this:
select *
from YourTable
where len([Company Name]) - len(replace([Company Name], 'prior to', '')) > 1
  and len([Company Name]) - len(replace([Company Name], 'prior to', '')) <> len('prior to')
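The length-difference trick works because replace() removes len('prior to') characters for every occurrence, so the difference divided by the pattern length is the occurrence count. The same check written more directly (a sketch, assuming SQL Server syntax and the same hypothetical YourTable):

-- count occurrences of 'prior to' and keep rows with two or more
select *
from YourTable
where (len([Company Name]) - len(replace([Company Name], 'prior to', ''))) / len('prior to') >= 2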

I'd probably implement this as a two-step process.
1) Find all records containing 'prior to'.
E.g.
SELECT Company_name FROM COMPANY_TABLE WHERE Company_name LIKE '%prior to%'
2) Iterate through those records and keep only the ones with two or more occurrences of the 'prior to' substring (in whatever language you're using).

Related

How to make this 2 step query solution better

The query is made in MS Access.
Here is a simplified example of my problem.
I have one table Members
Name  Age  Nickname
----- ---  --------
Tom   12   wolf
Chris 11   ranger
Phil  14   H-man
Chris 16   walker
Chris 18   Mo
Goal: count how many times each name occurs, but only count rows where the nickname has an "a" in it.
I needed two queries:
Step1:
SELECT Members.Name, Members.Age, Members.Nickname
FROM Members
WHERE (((Members.Nickname) Like "*A*"));
Step2:
SELECT Step1.Name, Count(Step1.Age) AS AantalVanAge
FROM Step1
GROUP BY Step1.Name;
Result
- Chris 2
- Phil 1
You can achieve this in a single query using:
select t.name, count(*) as AantalVanAge
from members t
where t.nickname like "*A*"
group by t.name
Use your 1st query as a subquery in the 2nd step:
SELECT t.Name, Count(t.Age) AS AantalVanAge
FROM (
SELECT Name, Age
FROM Members
WHERE Nickname Like "*A*"
) AS t
GROUP BY t.Name;
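One hedged caveat: the * wildcard in Like "*A*" works when the query runs inside Access itself; over ODBC/ADO (ANSI-92 mode) the ANSI wildcards % and _ are expected instead. The single-query version in that mode would look like:

SELECT Name, COUNT(*) AS AantalVanAge
FROM Members
WHERE Nickname LIKE '%a%'
GROUP BY Name;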

SQL Filtering duplicate rows due to bad ETL

The database is Postgres but any SQL logic should help.
I am retrieving the set of sales quotations that contain a given product within the bill of materials. I'm doing that in two steps: step 1, retrieve all DISTINCT quote numbers which contain a given product (by product number).
In the second step, I retrieve the full quote, with all products listed, for each unique quote number.
So far, so good. Now the tough bit. Some rows are duplicates and some are not. The duplicates (same quote number, quote version and line number) might or might not have maintenance on them. I want to pick the row that has maintenance greater than 0; the duplicate rows I want to exclude are those with 0 maintenance. The problem is that some rows which have no duplicates also have 0 maintenance, so I can't just filter on maintenance.
To make this exciting, the database holds quotes from 20+ years, and the data science guys have just admitted that the ETL process may have some bugs...
--- step 0
--- cleanup the workspace
SET CLIENT_ENCODING TO 'UTF8';
DROP TABLE IF EXISTS product_quotes;
--- step 1
--- get list of Product Quotes
CREATE TEMPORARY TABLE product_quotes AS (
SELECT DISTINCT master_quote_number
FROM w_quote_line_d
WHERE item_number IN ( << model numbers >> )
);
--- step 2
--- Now join on that list
SELECT
d.quote_line_number,
d.item_number,
d.item_description,
d.item_quantity,
d.unit_of_measure,
f.ref_list_price_amount,
f.quote_amount_entered,
f.negtd_discount,
--- need to calculate discount rate based on list price and negtd discount (%)
CASE
WHEN ref_list_price_amount > 0
THEN 100 - (ref_list_price_amount + negtd_discount) / ref_list_price_amount *100
ELSE 0
END AS discount_percent,
f.warranty_months,
f.master_quote_number,
f.quote_version_number,
f.maintenance_months,
f.territory_wid,
f.district_wid,
f.sales_rep_wid,
f.sales_organization_wid,
f.install_at_customer_wid,
f.ship_to_customer_wid,
f.bill_to_customer_wid,
f.sold_to_customer_wid,
d.net_value,
d.deal_score,
f.transaction_date,
f.reporting_date
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON (pq.master_quote_number = d.master_quote_number)
INNER JOIN w_quote_f f ON
(f.quote_line_number = d.quote_line_number
AND f.master_quote_number = d.master_quote_number
AND f.quote_version_number = d.quote_version_number)
WHERE d.net_value >= 0 AND item_quantity > 0
ORDER BY f.master_quote_number, f.quote_version_number, d.quote_line_number
The logic to filter the duplicate rows is like this:
For each master_quote_number / version_number pair, check to see if there are duplicate line numbers. If so, pick the one with maintenance > 0.
Even in a CASE statement, I'm not sure how to write that.
Thoughts? The database is Postgres but any SQL logic should help.
I think you will want to use Window Functions. They are, in a word, awesome.
Here is a query that would "dedupe" based on your criteria:
select *
from (
    select
        *  -- simplifying here to show the important parts
        ,row_number() over (
            partition by d.master_quote_number, f.quote_version_number, d.quote_line_number
            order by f.maintenance_months desc) as seqnum
    from w_quote_line_d d
    inner join product_quotes pq
        on pq.master_quote_number = d.master_quote_number
    inner join w_quote_f f
        on  f.quote_line_number = d.quote_line_number
        and f.master_quote_number = d.master_quote_number
        and f.quote_version_number = d.quote_version_number
) x
where seqnum = 1
The use of row_number() with that partition by / order by guarantees that only ONE row for each combination of master_quote_number / quote_version_number / quote_line_number gets seqnum = 1, and it will be the one with the highest maintenance value (if your colleagues are right, there would only be one with a value > 0 anyway).
Can you do something like...
select *
from w_quote_line_d d
inner join (
    select
        ...  -- the key columns, same list as the group by
        ,max(maintenance) as maintenance
    from w_quote_line_d
    group by
        ...
) d1
    on  d1.id = d.id
    and d1.maintenance = d.maintenance;
Am I understanding your problem correctly?
Edit: Forgot the group by!
I'm not sure, but maybe you could Group By all other columns and use MAX(Maintenance) to get only the greatest.
What do you think?
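Since the question is Postgres, DISTINCT ON is another way to keep exactly one row per key; a minimal sketch, assuming the dedup key is master_quote_number / quote_version_number / quote_line_number and that maintenance_months is the column to prefer:

SELECT DISTINCT ON (d.master_quote_number, f.quote_version_number, d.quote_line_number)
       d.*, f.*
FROM w_quote_line_d d
INNER JOIN product_quotes pq ON pq.master_quote_number = d.master_quote_number
INNER JOIN w_quote_f f
        ON  f.quote_line_number = d.quote_line_number
        AND f.master_quote_number = d.master_quote_number
        AND f.quote_version_number = d.quote_version_number
ORDER BY d.master_quote_number, f.quote_version_number, d.quote_line_number,
         f.maintenance_months DESC;  -- the row with the highest maintenance wins per key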

Ways to clean up messy records in SQL

I have the following sql data:
ID       Company Name      Customer         Address 1    City       State  Zip  Date
-------  ----------------  ---------------  -----------  ---------  -----  ---  --------
0108500  AAA Test          Mish~Sara        Newa Claims  Chtiana    CO     123  06FE0046
0108500  AAA.Test          Mish~Sara        Newa Claims  Chtiana    CO     123  06FE0046
1802600  AAA Test Company  Ban, Adj.~Gorge  PO Box 83    MouLaurel  CA     153  09JS0025
1210600  AAA Test Company  Biwel~Brce       97kehst ve   Jacn       CA     153  04JS0190
AAA Test, AAA.Test and AAA Test Company are considered one company.
Since the data is messy, I'm thinking of doing one of the following:
Is there a way to search all the records in the DB, find company names that are almost the same, and rename them to the longest variant? In this case, AAA Test and AAA.Test would become AAA Test Company.
OR: Is there a way to filter only the records whose company names are almost the same, so that someone has the option to change them?
If there's no way to do it via a SQL query, what do you suggest so that we can clean up the records? There are almost 1 million records in the database and it's hard to clean them up manually.
Thank you in advance.
You could use a string-matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate people's names that have been typed in differently. It can take a while, but it works well for the fuzzy matching you're looking for.
Something like a self join? || is ANSI SQL concat, some products have a concat function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on the data type you may have to remove trailing blanks from t2.companyname; use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas, dots, etc.
Use a case-insensitive collation; SOUNDEX can be used as well, and so on.
I think most database servers support full-text search, and full-text search usually includes functions that support proximity.
For example, SQL Server has a NEAR function; its documentation is here: https://msdn.microsoft.com/en-us/library/ms142568.aspx
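For illustration only, a hedged SQL Server sketch (it assumes a full-text index already exists on the company-name column; the Companies table and column names here are hypothetical):

-- NEAR requires a full-text index on the searched column
SELECT ID, CompanyName
FROM Companies
WHERE CONTAINS(CompanyName, 'AAA NEAR Test');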
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation and whitespace, then match on the first 6 to 10 characters (using a self join). Assuming your table is called "vendor": add two columns, "status" and "dupstr", then update as follows:
/** Populate dupstr column for fuzzy match **/
update vendor v
set v.dupstr = left(upper(replace(replace(v.companyname, '.', ''), ' ', '')), 6)
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep'  -- indicate keeper record
where
    -- dupes to clean up
    exists ( select 1 from vendor v1
             where v.dupstr = v1.dupstr
               and v.id != v1.id )
    and
    (   -- keeper has the longest name
        length(v.companyname) =
            ( select max(length(v2.companyname)) from vendor v2
              where v.dupstr = v2.dupstr )
        or
        -- keeper has the latest record (assuming ID is sequential)
        v.id =
            ( select max(v3.id) from vendor v3
              where v.dupstr = v3.dupstr )
    )
;
The above SQL can be refined to mark the other records with a "dupe" status, or you can do that in a separate update.
Clean Up Stragglers
Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)
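A possible shape for that report, as a hedged sketch that reuses the status and dupstr columns added above:

-- Dupe groups with no 'keep' row: candidates for manual review
select v.dupstr, count(*) as rows_in_group
from vendor v
group by v.dupstr
having count(*) > 1
   and sum(case when v.status = 'keep' then 1 else 0 end) = 0;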
You can use a SQL query with SOUNDEX or DIFFERENCE.
For example:
SELECT DIFFERENCE ('AAA Test','AAA Test Company')
DIFFERENCE returns 0-4 (4 = almost the same, 0 = totally different).
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017
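As a hedged sketch of how DIFFERENCE could surface candidate duplicates in bulk (SQL Server syntax; the Companies table and ID column are assumptions for illustration):

-- Pairs of rows whose company names sound alike
SELECT c1.ID, c1.CompanyName, c2.ID, c2.CompanyName
FROM Companies c1
JOIN Companies c2
  ON c1.ID < c2.ID
WHERE DIFFERENCE(c1.CompanyName, c2.CompanyName) >= 3;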

Getting unwanted rows with LIKE in SQL

I'm back again, and it's another school assignment.
I'm restricted to the commands WHERE, FROM and SELECT.
The task is basically to show the names from the Customers table that begin with A or B and end with S or P. I've tried multiple OR solutions but none of them seem to work; the expected answer is 1 column and 6 rows.
I can't get this code to work:
USE Northwind
SELECT
CompanyName
FROM Customers
WHERE CompanyName LIKE '[A/B]%' or CompanyName LIKE '%[S/P]'
go
With this code I get 31 rows :/
How about combining the two patterns into one:
SELECT
CompanyName
FROM Customers
WHERE CompanyName LIKE '[AB]%[SP]'
I believe this should accomplish what you want:
SELECT CompanyName
FROM Customers
WHERE
(LEFT(CompanyName,1) = 'A' OR LEFT(CompanyName,1) = 'B')
AND (RIGHT(CompanyName,1) = 'S' OR RIGHT(CompanyName,1) = 'P')
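For completeness, the original two-LIKE approach from the question also works once the OR becomes AND and the stray slashes are dropped from the character classes:

SELECT CompanyName
FROM Customers
WHERE CompanyName LIKE '[AB]%'
  AND CompanyName LIKE '%[SP]'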

Is it possible in SQL to match a LIKE from a list of records in a subquery?

Using these tables, as a sample:
CodeVariations:
CODE
-----------
ABC_012
DEF_024
JKLX048
RegisteredCodes:
CODE AMOUNT
-------- ------
ABCM012 5
ABCK012 25
JKLM048 16
Is it possible to write a query to retrieve all rows in RegisteredCodes when CODE matches a pattern in any row of the CodeVariations table? That is, any row that matches a LIKE pattern of either 'ABC_012', 'DEF_024' or 'JKLX048'
Result should be:
CODE AMOUNT
-------- ------
ABCM012 5
ABCK012 25
I'm using PostgreSQL, but it would be interesting to know if it's possible to do this in a simple query for either PostgreSQL or any other DB.
Does this do what you need?
select distinct RC.* from RegisteredCodes RC, CodeVariations CV
where RC.CODE LIKE CV.CODE;
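The same idea with explicit join syntax, plus the output expected for the sample data above (a sketch against the tables shown in the question):

SELECT DISTINCT rc.code, rc.amount
FROM RegisteredCodes rc
JOIN CodeVariations cv
  ON rc.code LIKE cv.code;  -- '_' in CodeVariations acts as a single-character wildcard

--  code    | amount
-- ---------+--------
-- ABCM012  |      5
-- ABCK012  |     25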
Is this what you are looking for?
SELECT * FROM RegisteredCodes RC WHERE RC.Code IN (SELECT CODE FROM CodeVariations WHERE CODE LIKE 'ABC%' AND CODE LIKE '%012')
This will fetch all the records that start with 'ABC' and end with '012', and similarly for 'DEF' and 'JKL'.
OR
Are you looking for something like this?
select * from CAT_ITEM where DESCRICAO LIKE '%TUBO%%PVC%%DNR%'
Here the whole LIKE list is in a single pattern string.
In Oracle and PostgreSQL you can use the single-character wildcard "_" to match exactly one character.
select RC.* from RegisteredCodes RC
where RC.CODE LIKE 'ABC_012';
Use substring to build the pattern:
select RC.* from RegisteredCodes RC, CodeVariations CV
where RC.CODE LIKE substring(CV.Code,1,3)||'_'||substring(CV.Code,5) ;