SQL for Splitting Names - Some with Middle Initial, Some Without - sql

I'm fairly new to SQL. I'm building a small database in Access right now, although if I need functionality that Access lacks, I'll remake the tables in SQL Server.
Here's my situation: I have a list of names that come from a data dump from a third party. In our database I need to be able to compare first and last names in separate columns.
I've been trying to use InStr, Left and Right, but I'm getting hung up on weird results:
Left([NewClaims]![Claimant Full Name],InStr([NewClaims]![Claimant Full Name],",")-1) AS LastName,
Right([NewClaims]![Claimant Full Name],InStr([NewClaims]![Claimant Full Name], ", ")+2) AS FirstName,
On some names it works perfectly
West, Krystal --becomes--> LastName = West, FirstName= Krystal
On other names, similar in format, it doesn't work
Dalton, Kathy ----> LastName = Dalton, First Name = ON, KATHY
On Names with middle initials I get
Earles, Barbara A. ----> LastName = Earles, FirstName= ARBARA A. (one missing letter)
OR
Beard, Chekitha G. ----> LastName = Beard, FirstName= KITHA G. (three missing letters)
I'm frustrated. Can anyone offer another idea on how to make this work? I seem to have the last name down, but I can't get the first name to be consistently correct.

Try this. But I'm assuming that there's always a comma that separates last name from first name.
select
txt,
LastName = left(txt,charindex(',',txt)-1),
FirstName = ltrim(right(txt,len(txt)-charindex(',',txt)))
from (
select 'West, Krystal' as txt union all
select 'Dalton, Kathy' union all
select 'Earles, Barbara A.' union all
select 'Beard, Chekitha G.'
) x
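Your mistake was in the call to Right: its second argument is the number of characters to take from the end of the string, not a starting position, so you have to subtract the comma's position from the length of the string.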
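To see why the original expression misbehaves, here is a small sketch in Python. The helper names (access_right, broken_first_name, split_names) are made up for illustration, and the corrected split runs on SQLite through the standard sqlite3 module rather than on Access or SQL Server, so instr and substr stand in for InStr/charindex and Mid/right. It assumes every name contains a comma.

```python
import sqlite3

def access_right(text, count):
    """Mimic Access Right(text, count): return the last `count` characters."""
    return text[-count:] if count > 0 else ""

def broken_first_name(full_name):
    # Same arithmetic as the question: InStr is 1-based, so find() + 1.
    position = full_name.find(", ") + 1
    # The bug: a position is passed where a character count is expected.
    return access_right(full_name, position + 2)

def split_names(full_names):
    """Split 'Last, First' strings at the comma (assumes a comma is present)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE NewClaims (FullName TEXT)")
    conn.executemany("INSERT INTO NewClaims VALUES (?)",
                     [(name,) for name in full_names])
    rows = conn.execute(
        """
        SELECT substr(FullName, 1, instr(FullName, ',') - 1)     AS LastName,
               ltrim(substr(FullName, instr(FullName, ',') + 1)) AS FirstName
        FROM NewClaims
        ORDER BY rowid
        """
    ).fetchall()
    conn.close()
    return rows
```

The broken version only looks right for 'West, Krystal' because the first name happens to be exactly as long as the comma position plus two.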

Related

Selecting multiple options in MS Access

I am writing a query in SQL for Access. The query asks three questions. I have three categories; I'll just use 'country', 'city' and 'street' for now. I am trying to figure out how to make it so that you only have to answer one of the three prompts, but if you answer two, only the records matching both are returned. For example, if I answered Georgia and Atlanta, Atlanta, Georgia would show up. Or if I entered Canal in 'street' and Louisiana, every street named Canal in Louisiana would show up.
Currently, if I typed out Canal and Louisiana, the query would show me everything listed under Louisiana and every street titled Canal (even the ones not in Louisiana).
SELECT *
FROM File
WHERE (((File.State)=[Enter the state]))
OR (((File.City)=[Enter the city]))
OR (((File.Street)=[Enter the street]));
I think you should be able to do it by using AND rather than OR to connect the criteria for the different columns, but not using the criteria for a column if its parameter wasn't given.
SELECT *
FROM File
WHERE ( ([Enter the state] = '') OR (File.State=[Enter the state]) )
AND ( ([Enter the city] = '') OR (File.City=[Enter the city]) )
AND ( ([Enter the street] = '') OR (File.Street=[Enter the street]) );
I'm kind of rusty with Access, so I'm not sure whether the parameter will be Null or '' if nothing is entered. If it turns out to be Null, test with Is Null instead, e.g. ([Enter the state] Is Null) OR (File.State=[Enter the state]).
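The same "ignore a criterion when its parameter is empty" pattern can be tried outside Access. The sketch below uses SQLite via Python's sqlite3 module, with named parameters standing in for the Access prompts and None (SQL NULL) meaning "nothing entered". The search function and the sample rows are invented for illustration.

```python
import sqlite3

def search(conn, state=None, city=None, street=None):
    """Return rows matching every criterion that was actually supplied."""
    sql = """
        SELECT State, City, Street
        FROM File
        WHERE (:state  IS NULL OR State  = :state)
          AND (:city   IS NULL OR City   = :city)
          AND (:street IS NULL OR Street = :street)
        ORDER BY State, City, Street
    """
    params = {"state": state, "city": city, "street": street}
    return conn.execute(sql, params).fetchall()

# Made-up sample data, only to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE File (State TEXT, City TEXT, Street TEXT)")
conn.executemany("INSERT INTO File VALUES (?, ?, ?)", [
    ("Louisiana", "New Orleans", "Canal"),
    ("Louisiana", "Baton Rouge", "Main"),
    ("Georgia", "Atlanta", "Canal"),
    ("Georgia", "Atlanta", "Peachtree"),
])

canal_in_louisiana = search(conn, state="Louisiana", street="Canal")
```

Because the criteria are joined with AND, supplying Canal and Louisiana no longer returns Canal streets in other states.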
SELECT * FROM File WHERE
(((File.State)='[Enter the state]'))
OR (((File.City)='[Enter the city]'))
OR (((File.Street)='[Enter the street]'));
You just needed some quotes around them because they are strings.

Ways to Clean-up messy records in sql

I have the following sql data:
ID      | Company Name     | Customer        | Address 1   | City      | State | Zip | Date
0108500 | AAA Test         | Mish~Sara       | Newa Claims | Chtiana   | CO    | 123 | 06FE0046
0108500 | AAA.Test         | Mish~Sara       | Newa Claims | Chtiana   | CO    | 123 | 06FE0046
1802600 | AAA Test Company | Ban, Adj.~Gorge | PO Box 83   | MouLaurel | CA    | 153 | 09JS0025
1210600 | AAA Test Company | Biwel~Brce      | 97kehst ve  | Jacn      | CA    | 153 | 04JS0190
AAA Test, AAA.Test and AAA Test Company are considered as one company.
Since their data is messy I'm thinking either to do this:
Is there a way to search all the records in the DB wherein it will search the company name with almost the same name then re-name it to the longest name?
In this case, the AAA Test and AAA.Test will be AAA Test Company.
OR Is there a way to filter only record with company name that are almost the same then they can have option to change it?
If there's no way to do it via sql query, what are your suggestions so that we can clean-up the records? There are almost 1 million records in the database and it's hard to clean it up manually.
Thank you in advance.
You could use a string matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate people's names that have been typed in differently. It can take a while, but it does work well for the fuzzy match you're looking for.
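For readers who have not met Jaro-Winkler before, here is a minimal sketch of the usual textbook formulation in Python. It is not the SQL version mentioned above (which isn't shown), and it assumes the common defaults of a prefix weight of 0.1 and a prefix capped at four characters.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, from 0.0 (no match) to 1.0 (equal)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)
```

In practice you would compare upper-cased names and treat pairs above some threshold as candidates for review; the right threshold depends on the data.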
Something like a self join? || is ANSI SQL concat, some products have a concat function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on datatype you may have to remove blanks from the t2.companyname, use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas and dots etc.
Use a case-insensitive collation. SOUNDEX is another option.
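As a rough illustration of the self join with the TRIM and REPLACE suggestions applied, the sketch below runs it on SQLite through Python's sqlite3 module. Replacing dots with spaces and excluding self-matches via an id column are assumptions added for the demo; the three company names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, companyname TEXT)")
conn.executemany(
    "INSERT INTO tablename (companyname) VALUES (?)",
    [("AAA Test",), ("AAA.Test",), ("AAA Test Company",)])

# Each row pairs a name (t1) with another name (t2) that it contains,
# after dots are turned into spaces. SQLite's LIKE ignores ASCII case.
pairs = conn.execute(
    """
    SELECT t1.companyname, t2.companyname
    FROM tablename t1
    JOIN tablename t2
      ON replace(t1.companyname, '.', ' ')
         LIKE '%' || replace(trim(t2.companyname), '.', ' ') || '%'
     AND t1.id <> t2.id
    ORDER BY t1.id, t2.id
    """
).fetchall()
```

A leading '%' wildcard generally prevents index use, so on a table with around a million rows this kind of join is likely to be slow.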
I think most database servers support full-text search, and some of the full-text search functions support proximity matching.
For example, SQL Server has a NEAR proximity term; its documentation is at https://msdn.microsoft.com/en-us/library/ms142568.aspx
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation & whitespace, then match on the first 6 to 10 characters (using self join). Assuming your table is called "vendor": add two columns, "status", "dupstr", then update as follows
/** Populate dupstr column for fuzzy match **/
update vendor v
set v.dupstr = left(upper(replace(replace(v.companyname,'.',''),' ','')),6) -- plain replace: as a regex, '.' would match every character
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep' --indicate keeper record
where
  --dupes to clean up
  exists ( select 1 from vendor v1
           where v.dupstr = v1.dupstr
             and v.id != v1.id )
  and
  ( --keeper has longest name
    length(v.companyname) =
      ( select max(length(v2.companyname)) from vendor v2
        where v.dupstr = v2.dupstr )
    or
    --keeper has latest record (assuming ID is sequential)
    v.id =
      ( select max(v3.id) from vendor v3
        where v.dupstr = v3.dupstr )
  )
;
The above SQL can be refined to add a "dupe" status to the other records, or you can do a separate update.
Clean Up Stragglers
Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)
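The staged approach can be prototyped on a sample of names before touching the real table. The Python sketch below is only an illustration: dup_key and pick_keepers are hypothetical helper names, the key strips everything except letters and digits, and only the "longest name wins" rule is applied.

```python
import re
from collections import defaultdict

def dup_key(name, length=6):
    """Upper-case, drop punctuation and whitespace, keep the first characters."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())[:length]

def pick_keepers(names, length=6):
    """Map each duplicate key to the longest name variant that shares it."""
    groups = defaultdict(set)
    for name in names:
        groups[dup_key(name, length)].add(name)
    # Only keys shared by more than one spelling are duplicates to clean up.
    # Sorting first makes the choice deterministic when lengths tie.
    return {key: max(sorted(variants), key=len)
            for key, variants in groups.items()
            if len(variants) > 1}

keepers = pick_keepers(
    ["AAA Test", "AAA.Test", "AAA Test Company", "Other Vendor"])
```

A short key merges aggressively, so unrelated companies that share the same first six characters would also be grouped; that is what the human review step is for.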
You can use an SQL query with SOUNDEX or DIFFERENCE.
For example:
SELECT DIFFERENCE('AAA Test','AAA Test Company')
DIFFERENCE returns 0 to 4 (4 = almost the same, 0 = totally different).
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017

SQL - Getting a column from another table to join this query

I've got the code below, which displays the location_id and the total number of antisocial crimes, but I would also like to output the location_name from a different table called location_dim. I tried to find a way to UNION it but couldn't get it to work. Any ideas?
SELECT fk5_location_id , COUNT(fk3_crime_id) as TOTAL_ANTISOCIAL_CRIMES
from CRIME_FACT
WHERE fk1_time_id = 3 AND fk3_crime_id = 1
GROUP BY fk5_location_id;
You want to use a join to look up the location name. The query would probably look like this:
SELECT ld.location_name, COUNT(cf.fk3_crime_id) as TOTAL_ANTISOCIAL_CRIMES
from CRIME_FACT cf join
LOCATION_DIM ld
on cf.fk5_location_id = ld.location_id
WHERE cf.fk1_time_id = 3 AND cf.fk3_crime_id = 1
GROUP BY ld.location_name;
You need to put in the right column names for ld.location_name and ld.location_id.
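Here is a small runnable sketch of that join on SQLite using Python's sqlite3 module. The column names location_id and location_name in LOCATION_DIM are assumptions, as noted above, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE LOCATION_DIM (
        location_id   INTEGER PRIMARY KEY,
        location_name TEXT
    );
    CREATE TABLE CRIME_FACT (
        fk1_time_id     INTEGER,
        fk3_crime_id    INTEGER,
        fk5_location_id INTEGER
    );
    INSERT INTO LOCATION_DIM VALUES (10, 'North'), (20, 'South');
    INSERT INTO CRIME_FACT VALUES
        (3, 1, 10), (3, 1, 10), (3, 1, 20), (3, 2, 20), (4, 1, 10);
    """
)

# Count crimes of type 1 in time period 3, labelled by location name.
totals = conn.execute(
    """
    SELECT ld.location_name, COUNT(cf.fk3_crime_id) AS TOTAL_ANTISOCIAL_CRIMES
    FROM CRIME_FACT cf
    JOIN LOCATION_DIM ld ON cf.fk5_location_id = ld.location_id
    WHERE cf.fk1_time_id = 3 AND cf.fk3_crime_id = 1
    GROUP BY ld.location_name
    ORDER BY ld.location_name
    """
).fetchall()
```

UNION stacks rows from two queries on top of each other; a join is what adds columns from a second table to each row.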
You need to find a relationship between the two tables that links a location to a crime. That way you can use a "join" and select the fields you are interested in from each table.
I suggest taking a step back and reading up on the fundamentals of relational databases. There are many good books out there, and they are the perfect place to start.

How to remove duplicate data lines with data not in same columns

A SELECT QUERY for Person Matches produces a Table in which each 2nd line contains the same info as the line above it.
After a Sort By Surname,GivenName,BirthD
e.g.
IDIR1, Surname, GivenName, BirthD IDIR2.
IDIR2, Surname, GivenName, BirthD IDIR1.
(Both persons have the same criteria but diff IDIR)
What options are there to eliminate the appearance of the second lines?
Delete is acceptable but NOT IN, <>, etc. do not work because:
All IDIRs (1 & 2) are in the 2 IDIR columns.
Only one line is read to check if both are Individuals & not Same Person.
Something along these lines should give you a result set where every second record has been left out.
SELECT a.*
FROM thetable AS a
JOIN thetable AS b
ON a.Surname = b.Surname
AND a.GivenName = b.GivenName
...
WHERE a.IDIR < b.IDIR
Note that it is untested - you may need to clean it up a little bit, but the trick is using a.IDIR < b.IDIR to weed out the duplicates.
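To show the effect of a.IDIR < b.IDIR, here is a sketch on SQLite via Python's sqlite3 module. The table layout (one row per person with IDIR, Surname, GivenName, BirthD) and the sample people are assumptions made for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE thetable (IDIR INTEGER, Surname TEXT, GivenName TEXT, BirthD TEXT)")
conn.executemany("INSERT INTO thetable VALUES (?, ?, ?, ?)", [
    (1, "Smith", "John", "1950-01-01"),
    (2, "Smith", "John", "1950-01-01"),
    (3, "Jones", "Mary", "1960-05-05"),
])

MATCH_QUERY = """
    SELECT a.IDIR, b.IDIR
    FROM thetable AS a
    JOIN thetable AS b
      ON a.Surname = b.Surname
     AND a.GivenName = b.GivenName
     AND a.BirthD = b.BirthD
    WHERE a.IDIR {op} b.IDIR
    ORDER BY a.IDIR, b.IDIR
"""

# '<>' reports each match twice, once in each direction.
mirrored = conn.execute(MATCH_QUERY.format(op="<>")).fetchall()
# '<' keeps only the row where the lower IDIR comes first.
deduplicated = conn.execute(MATCH_QUERY.format(op="<")).fetchall()
```

The comparison works because each matching pair has exactly one ordering in which the first IDIR is the smaller one.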

How to write a query returning non-chosen records

I have written a psychological testing application, in which the user is presented with a list of words, and s/he has to choose ten words which very much describe himself, then choose words which partially describe himself, and words which do not describe himself. The application itself works fine, but I was interested in exploring the meta-data possibilities: which words have been most frequently chosen in the first category, and which words have never been chosen in the first category. The first query was not a problem, but the second (which words have never been chosen) leaves me stumped.
The table structure is as follows:
table words: id, name
table choices: pid (person id), wid (word id), class (value between 1-6)
Presumably the answer involves a left join between words and choices, but there has to be a modifying statement - where choices.class = 1 - and this is causing me problems. Writing something like
select words.name
from words left join choices
on words.id = choices.wid
where choices.class = 1
and choices.pid = null
causes the database manager to go on a long trip to nowhere. I am using Delphi 7 and Firebird 1.5.
TIA,
No'am
Maybe this is a bit faster:
SELECT w.name
FROM words w
WHERE NOT EXISTS
(SELECT 1
FROM choices c
WHERE c.class = 1 and c.wid = w.id)
Something like that should do the trick:
SELECT name
FROM words
WHERE id NOT IN
(SELECT DISTINCT wid -- DISTINCT is actually redundant
FROM choices
WHERE class = 1)
SELECT words.name
FROM
words
LEFT JOIN choices ON words.id = choices.wid AND choices.class = 1
WHERE choices.pid IS NULL
Make sure you have an index on choices (class, wid).
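The three variants above can be compared side by side. This sketch runs them on SQLite through Python's sqlite3 module rather than Firebird 1.5, with invented sample rows, so it only shows that they return the same words, not how they perform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE words (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE choices (pid INTEGER, wid INTEGER, class INTEGER);
    CREATE INDEX idx_choices_class_wid ON choices (class, wid);
    INSERT INTO words VALUES (1, 'calm'), (2, 'bold'), (3, 'shy');
    INSERT INTO choices VALUES (1, 1, 1), (1, 2, 2), (2, 1, 1), (2, 3, 3);
    """
)

QUERIES = {
    "not_exists": """
        SELECT w.name FROM words w
        WHERE NOT EXISTS
              (SELECT 1 FROM choices c WHERE c.class = 1 AND c.wid = w.id)
        ORDER BY w.name
    """,
    "not_in": """
        SELECT name FROM words
        WHERE id NOT IN (SELECT wid FROM choices WHERE class = 1)
        ORDER BY name
    """,
    "left_join": """
        SELECT words.name
        FROM words
        LEFT JOIN choices ON words.id = choices.wid AND choices.class = 1
        WHERE choices.pid IS NULL
        ORDER BY words.name
    """,
}

# Words that nobody has placed in class 1, according to each variant.
never_chosen = {label: [row[0] for row in conn.execute(sql)]
                for label, sql in QUERIES.items()}
```

The left join works here because the class = 1 test sits in the ON clause; putting it in WHERE, as in the question, discards the unmatched rows before IS NULL can find them.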