Fuzzy match column to column


I'm trying to find a way to match a column of clean data in Table 1 to a column of dirty data in Table 2 without making any changes to the dirty data. I was thinking of a fuzzy match, but there are too many entries in the clean table to allow for CDEs to be used. For example:
Table 1
GroupID   CompanyName
123       CompanyA
445       CompanyB
556       CompanyC
Table 2
GroupID   Patientname
AE123789  PatientA
123987    PatientB
445111    PatientC
And I'm trying to match the insurance company to the patient using the group number. Is there a matching method out there? (Fortunately the group numbers are actually much longer and when looking for a single group's worth of patients, fuzzy matching works really well, so they seem to be unique enough to be applied here).
Working in SQL Server 2008.

This changes slightly depending on which database you are using, but it looks like you're looking for something like this:
MSSQL
select *
from table1 t1
join table2 t2 on t2.groupid like '%'+cast(t1.groupid as varchar(max))+'%'
MySQL - use Concat():
select *
from table1 t1
join table2 t2 on t2.groupid like concat('%',t1.groupid,'%')
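The substring join above can be tried out with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in database, so the concatenation operator is `||` rather than `+` or `concat()`; the rows are the sample rows from the question.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in (assumption: the real
# target is SQL Server or MySQL, where the concatenation syntax differs).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (groupid INTEGER, companyname TEXT);
    CREATE TABLE table2 (groupid TEXT, patientname TEXT);
    INSERT INTO table1 VALUES (123, 'CompanyA'), (445, 'CompanyB'), (556, 'CompanyC');
    INSERT INTO table2 VALUES ('AE123789', 'PatientA'), ('123987', 'PatientB'), ('445111', 'PatientC');
""")

# Match when the clean group id appears anywhere inside the dirty group id.
rows = conn.execute("""
    SELECT t1.companyname, t2.patientname
    FROM table1 t1
    JOIN table2 t2 ON t2.groupid LIKE '%' || t1.groupid || '%'
    ORDER BY t2.patientname
""").fetchall()

for company, patient in rows:
    print(company, patient)
```

A pattern with a leading wildcard generally cannot use an ordinary index, so this kind of join may be slow on large tables.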

Related

SQL Server query: pair each record from table 1 to all record at table 2

I am having a hard time searching for this because I can't put it into words properly, but it's like this (see the attached image):
[image: MY PROBLEM]

You need a CROSS JOIN.
SELECT TABLE1.BAKESHOP_NO, TABLE2.PRODUCT_NO
FROM TABLE1 CROSS JOIN TABLE2
ORDER BY BAKESHOP_NO, PRODUCT_NO
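As a quick illustration of what the CROSS JOIN returns, here is a minimal sketch using Python's built-in sqlite3 module. The two bakeshops and three products are made-up sample rows, not data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (BAKESHOP_NO INTEGER);
    CREATE TABLE TABLE2 (PRODUCT_NO INTEGER);
    -- Hypothetical sample rows for illustration only.
    INSERT INTO TABLE1 VALUES (1), (2);
    INSERT INTO TABLE2 VALUES (10), (20), (30);
""")

# Every bakeshop is paired with every product: 2 x 3 = 6 rows.
pairs = conn.execute("""
    SELECT TABLE1.BAKESHOP_NO, TABLE2.PRODUCT_NO
    FROM TABLE1 CROSS JOIN TABLE2
    ORDER BY BAKESHOP_NO, PRODUCT_NO
""").fetchall()

for bakeshop_no, product_no in pairs:
    print(bakeshop_no, product_no)
```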

How do I write an SQL query that retrieves one line per combination of existing unique ids?

I apologize if this has already been asked or if it's a basic concept - I'm not experienced enough in SQL to know how to succinctly ask it (thus I'm not having much luck searching for it).
Refer to the first row of each table in the first snippet below for an idea of the tables I'm working with. Table1 and Table3 can have multiple rows with the same table2Id.
Originally, I was just working with Table1 and Table2. I retrieved one row for each Table1, and all was good. However, I now need to also retrieve data from Table3 - and the best way I can describe the implications is using the following example:
Table1:
id   table2Id  otherData1
abc  123       "otherData"
def  123       "anotherData"
Table2:
id   otherData2
123  "yetAnotherData"
Table3:
id   table2Id  otherData3
!##  123       "thereCouldntBeMoreData"
$%^  123       "iCantBelieveItsNotData"
To fully represent all the data, I'd need 4 rows, one for abc + !##, one for abc + $%^, one for def + !##, and finally one for def + $%^ . If I'm picturing this correctly, my end result would be something to this effect:
comboId  t1.id  t2.id  t3.id  t1.otherData1  t2.otherData2     t3.otherData3
abc!##   abc    123    !##    "otherData"    "yetAnotherData"  "thereCouldntBeMoreData"
abc$%^   abc    123    $%^    "otherData"    "yetAnotherData"  "iCantBelieveItsNotData"
def!##   def    123    !##    "anotherData"  "yetAnotherData"  "thereCouldntBeMoreData"
def$%^   def    123    $%^    "anotherData"  "yetAnotherData"  "iCantBelieveItsNotData"
How would I be able to achieve this? And thank you in advance for any help you can provide, even if it's just pointing me in the direction of someone else's answer to a similar problem.

As WW said, you need a cartesian join. Fortunately, a join pairs every row on one side with every matching row on the other, so you get all the combinations without any extra work. However, since the columns you need to join on have different names, you need ON clauses; you should try to give the columns you join on the same name.
To create comboId you need to concatenate columns; the syntax for that is + in SQL Server and || in Oracle, and I think there is a Concat() function in MySQL. Whenever you ask a question about SQL, always tell us which SQL you are using.
In SQL Server it would be:
SELECT t1.id+t3.id as comboId,
       t1.id as 't1.Id',
       t2.id as 't2.Id',
       t3.id as 't3.Id',
       t1.otherData1,
       t2.otherData2,
       t3.otherData3
FROM Table1 t1
JOIN Table2 t2 ON t2.id=t1.table2Id
JOIN Table3 t3 ON t3.table2Id=t2.id
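To check that the two joins really yield one row per Table1/Table3 combination, here is a minimal sketch that loads the sample rows from the question. It assumes Python's built-in sqlite3 module as a stand-in, so concatenation is written with `||` instead of the SQL Server `+`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (id TEXT, table2Id INTEGER, otherData1 TEXT);
    CREATE TABLE Table2 (id INTEGER, otherData2 TEXT);
    CREATE TABLE Table3 (id TEXT, table2Id INTEGER, otherData3 TEXT);
    INSERT INTO Table1 VALUES ('abc', 123, 'otherData'), ('def', 123, 'anotherData');
    INSERT INTO Table2 VALUES (123, 'yetAnotherData');
    INSERT INTO Table3 VALUES ('!##', 123, 'thereCouldntBeMoreData'),
                              ('$%^', 123, 'iCantBelieveItsNotData');
""")

# Both joins go through Table2, so each Table1 row is combined with each
# Table3 row that shares the same table2Id.
rows = conn.execute("""
    SELECT t1.id || t3.id AS comboId,
           t1.id, t2.id, t3.id,
           t1.otherData1, t2.otherData2, t3.otherData3
    FROM Table1 t1
    JOIN Table2 t2 ON t2.id = t1.table2Id
    JOIN Table3 t3 ON t3.table2Id = t2.id
    ORDER BY t1.id, t3.id
""").fetchall()

combo_ids = [row[0] for row in rows]
print(combo_ids)
```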

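This will get you the result you want, but I will be the first to say that it's hacky at best. Without a primary key and foreign key, I'm not quite sure how to perform a "textbook" join operation. Note that it pairs every row with every other row regardless of table2Id, so it only produces the desired output when all rows share the same table2Id, as in the sample data.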
SELECT t1.id+''+t3.id comboId,t1.id t1ID, t2.id t2ID, t3.id t3ID, t1.otherData1, t2.otherData2, t3.otherData3
FROM table1 t1,table2 t2,table3 t3;

Using LIKE in a JOIN query

I have two separate data tables.
This is Table1:
Customer Name  Address 1      City     State  Zip
ACME COMPANY   1 Street Road  Maspeth  NY     11777
This is Table2:
Customer  Active Account  New Contact
ACME      Y               John Smith
I am running a query with a JOIN that only includes rows where the joined fields from both tables are equal.
I am joining Customer Name from Table1 and Customer from Table2. Obviously there is no match. What I am trying to do is show results where the first 4 characters match in each table, so that I get a match. Is this possible using LIKE or LEFT?

Yes, that's possible.
But I doubt that every name in Table2 has only 4 letters, so here's a solution where the name in Table2 is the beginning of the name in Table1.
Concatenate the string with a %, which is a wildcard for "anything or nothing".
SELECT
*
FROM
Table1
INNER JOIN Table2 ON Table1.CustomerName LIKE CONCAT(Table2.Customer, '%');
String concatenation may work differently between DBMSs.
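The prefix match described above can be tried out with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in, where concatenation is `||` rather than CONCAT(); the GLOBEX row is a made-up non-matching customer, and the column names are simplified from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (CustomerName TEXT, City TEXT);
    CREATE TABLE Table2 (Customer TEXT, NewContact TEXT);
    INSERT INTO Table1 VALUES ('ACME COMPANY', 'Maspeth'),
                              ('GLOBEX CORP', 'Springfield');  -- hypothetical row
    INSERT INTO Table2 VALUES ('ACME', 'John Smith');
""")

# A row matches when Table1's name starts with Table2's (shorter) name.
matches = conn.execute("""
    SELECT Table1.CustomerName, Table2.NewContact
    FROM Table1
    INNER JOIN Table2 ON Table1.CustomerName LIKE Table2.Customer || '%'
""").fetchall()

print(matches)
```

If a customer name can itself contain `%` or `_`, those characters would act as wildcards in the pattern and would need escaping.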

It probably is, though this might depend on the database you are using. For example, in Microsoft SQL Server, it would work to use something like this:
SELECT *
FROM [Table1] INNER JOIN [Table2]
ON LEFT([Table1].[Customer Name],4) = LEFT([Table2].[Customer],4)
Syntax may be different if using other RDBMS. What are you trying this on?

Seems like this should work:
Select *
From Table1, Table2
Where Table1.CustomerName Like Cat('%',Trim(Table2.Customer),'%')

If you are only trying to match the first four characters, you can use the following:
SELECT --your columns
FROM Table1 T1
JOIN Table2 T2
ON SUBSTRING(T1.CustomerName, 1, 4) = SUBSTRING(T2.Customer, 1, 4)

sql query to Combine different fields from 2 different tables with no relations except a common field - SQL Server Compact 3.5 SP2

My first question here. This has been a really helpful platform so far. I am somewhat of a newbie in SQL, but I have a freelance project in hand which I should release this month (a reporting application with no database writes).
To the point now: I have been provided with data (Excel sheets with up to 135000 rows). The requirement is to implement a standalone application. I decided to use SQL Server Compact 3.5 SP2 and C#. Due to time pressure (and I thought it made sense too), I created one table per XLS module, with the fields of each table matching the header names in the XLS, so that the data can be easily imported via CSV import using SDF Viewer or the SQL Server Compact Toolbox added in Visual Studio. (For this reason, no further table normalization was done.)
I have a UI design for a typical form1 in which inputs from its controls are to be checked in an SQL query spanning 2 or 3 tables (e.g. groupbox1 has checkboxes whose names match field1, field2, ... of table1, and groupbox2 has checkboxes matching field3, field4 of table2). There are also date controls, based on which a common 'DateTimeField' is checked in each of the tables.
There are no foreign keys defined on the tables for linking (the need did not arise, since the data are different for each). The only common field is a 'DateTimeField' (same name), which exists in each table. (Basically these are readings from locations at a datetime stamp; field1, field2, etc. are locations. For a particular datetime there may or may not be readings in table 1 or table 2.)
How can I write an SQL SELECT query (using UNION, joins, or nested selects, if SQL Server Compact 3.5 supports them) that returns fields from the 2 tables based on datetime (in the WHERE clause)? For a given datetime there can even be empty values for fields in table 2. I have done a lot of research on this and tried as well, but have not yet found a good solution, probably also due to my lack of experience. Apologies!
I would really appreciate any help! I can provide a sample of the data if you need it. Thanks in advance.
Edit:
Sample Data (simple as that)
Table 1
t1Id xDateTime loc1 loc2 loc3
(I could not format the tabular schema here, sorry, but it is self-explanatory.)
... and so on up to 135000 records existing imported from xls
Table 2
t2Id xDateTime loc4 loc5 loc6
... and so on, up to 100000 records imported from XLS. Merging table 1 and table 2 would result in a huge number of blank rows/values for a datetime, hence I am leaving them as they are.
But a UI multiselect event from the WinForm (loc1, loc2, loc4, loc5 from both t1 and t2) needs to combine the results from both tables based on a datetime.
... and so on
I managed to write a query that comes very close. I say very close because I still have to test it in detail with different combinations of inputs. Thanks to No'am for the hint. I will mark it as the answer if everything goes well.
SELECT T1.xDateTime, T1.loc2, T2.loc4 FROM Table1 T1
INNER JOIN Table2 T2 ON T1.xDateTime = T2.xDateTime
WHERE (T1.xDateTime BETWEEN 'somevalue1' AND 'somevalue2')
UNION
SELECT T2.xDateTime, T1.loc2, T2.loc4 FROM Table1 T1
RIGHT JOIN Table2 T2 ON T1.xDateTime = T2.xDateTime
WHERE (T1.xDateTime BETWEEN 'somevalue1' AND 'somevalue2')
UNION
SELECT T1.xDateTime, T1.loc2, T2.loc4 FROM Table1 T1
LEFT JOIN Table2 T2 ON T1.xDateTime = T2.xDateTime
WHERE (T1.xDateTime BETWEEN 'somevalue1' AND 'somevalue2')

If 't1DateTime' and 't2DateTime' are the common fields, then apparently you need a query such as
SELECT table1.t1DateTime, table1.t1Id, table1.loc2, table2.t2Id, table2.loc4
FROM table1
INNER JOIN table2 ON table2.t2DateTime = table1.t1DateTime
This will give you values from rows which match in both tables, according to DateTime. If there is also supposed to be a match on the locations, then you will have to add the desired condition to the ON clause.

Based on your comment:
For a given date time there can be even empty values for fields in table 2
my understanding would be that you are not interested in orphaned records in table 2 (based on date) so in that case a LEFT JOIN would do it:
SELECT table1.t1DateTime, table1.t1Id, table1.loc2, table2.t2Id, table2.loc4
FROM table1
LEFT JOIN table2 ON table2.t2DateTime = table1.t1DateTime
However, if there are also entries in table2 with no matching dates in table1 that you need to return, you could try this:
SELECT table1.t1DateTime, table1.t1Id, table1.loc2, ISNULL(table2.t2Id, 0), ISNULL(table2.loc4, 0.0)
FROM table1
LEFT JOIN table2 ON table2.t2DateTime = table1.t1DateTime
WHERE table1.t1DateTime BETWEEN 'somevalue1' AND 'somevalue2'
UNION ALL
SELECT table2.t2DateTime, '0', '0.0', table2.t2Id, table2.loc4
FROM table2
LEFT OUTER JOIN table1 ON table1.t1DateTime = table2.t2DateTime
WHERE table1.t1DateTime IS NULL AND table2.t2DateTime BETWEEN 'somevalue1' AND 'somevalue2'

Thanks a lot to @kbbucks.
It works with this so far:
SELECT T1.MonitorDateTime, T1.loc2, T2.loc4
FROM Table1 T1
LEFT JOIN Table2 T2 ON T2.MonitorDateTime = T1.MonitorDateTime
WHERE T1.MonitorDateTime BETWEEN '04/05/2011 15:10:00' AND '04/05/2011 16:00:00'
UNION ALL
SELECT T2.MonitorDateTime, '', T2.loc4
FROM Table2 T2
LEFT OUTER JOIN Table1 T1 ON T1.MonitorDateTime = T2.MonitorDateTime
WHERE T1.MonitorDateTime IS NULL AND T2.MonitorDateTime BETWEEN '04/05/2011 15:10:00' AND '04/05/2011 16:00:00'
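The pattern above (a LEFT JOIN for everything in Table1, plus a second branch for the Table2 rows that have no partner) can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for SQL Server Compact, the readings are made-up values, the missing side is returned as NULL rather than an empty string, and timestamps are stored as ISO-8601 text so that BETWEEN compares them correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (MonitorDateTime TEXT, loc2 REAL);
    CREATE TABLE Table2 (MonitorDateTime TEXT, loc4 REAL);
    -- Hypothetical readings: 15:20 exists in both tables,
    -- 15:10 only in Table1, 15:30 only in Table2.
    INSERT INTO Table1 VALUES ('2011-04-05 15:10:00', 1.5), ('2011-04-05 15:20:00', 2.5);
    INSERT INTO Table2 VALUES ('2011-04-05 15:20:00', 7.0), ('2011-04-05 15:30:00', 8.0);
""")

query = """
    SELECT T1.MonitorDateTime, T1.loc2, T2.loc4
    FROM Table1 T1
    LEFT JOIN Table2 T2 ON T2.MonitorDateTime = T1.MonitorDateTime
    WHERE T1.MonitorDateTime BETWEEN :start AND :end
    UNION ALL
    -- Anti-join: only Table2 rows with no matching Table1 timestamp.
    SELECT T2.MonitorDateTime, NULL, T2.loc4
    FROM Table2 T2
    LEFT JOIN Table1 T1 ON T1.MonitorDateTime = T2.MonitorDateTime
    WHERE T1.MonitorDateTime IS NULL
      AND T2.MonitorDateTime BETWEEN :start AND :end
    ORDER BY 1
"""
params = {"start": "2011-04-05 15:10:00", "end": "2011-04-05 16:00:00"}
readings = conn.execute(query, params).fetchall()

for stamp, loc2, loc4 in readings:
    print(stamp, loc2, loc4)
```

Using UNION ALL is safe here because the second branch only contains rows excluded from the first, so no duplicates can arise between the two branches.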

What would be the best way to write this query

I have a table in my database that has 1.1 million records. I have another table in my database that has about 2000 records under the field name "NAME". What I want to do is search Table 1 using the smaller table and pull the records that match the smaller table's records. For example, Table 1 has First Name and Last Name, and Table 2 has Name. I want to find every record in Table 1 that contains any of the Table 2 names in either the first name field or the last name field. I tried just making an Access query, but my computer just froze. Any thoughts would be appreciated.

Have you considered the following:
Select Table1.FirstName, Table1.LastName
from Table1
where EXISTS(Select * from Table2 WHERE Name = Table1.FirstName)
or EXISTS(Select * from Table2 WHERE Name = Table1.LastName)
I have found before that on large tables this might work better than an inner join.

Be sure to create indexes on Table1.first_name, Table1.last_name, and Table2.name. They will dramatically speed up your query.
Edit: For Microsoft Access 2007, see CREATE INDEX.
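The two suggestions so far (an EXISTS filter and indexes on the compared columns) can be combined in a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for Access; the names in the tables and the index names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (FirstName TEXT, LastName TEXT);
    CREATE TABLE Table2 (Name TEXT);
    -- Hypothetical sample rows.
    INSERT INTO Table1 VALUES ('Ann', 'Lee'), ('Bob', 'Stone'),
                              ('Cara', 'Ann'), ('Dan', 'Moss');
    INSERT INTO Table2 VALUES ('Ann'), ('Stone');

    -- Indexes on the columns that take part in the comparisons.
    CREATE INDEX idx_table1_first ON Table1 (FirstName);
    CREATE INDEX idx_table1_last  ON Table1 (LastName);
    CREATE INDEX idx_table2_name  ON Table2 (Name);
""")

# Keep a Table1 row if either of its name fields appears in Table2.
found = conn.execute("""
    SELECT Table1.FirstName, Table1.LastName
    FROM Table1
    WHERE EXISTS (SELECT 1 FROM Table2 WHERE Name = Table1.FirstName)
       OR EXISTS (SELECT 1 FROM Table2 WHERE Name = Table1.LastName)
    ORDER BY Table1.FirstName
""").fetchall()

print(found)
```

Whether the indexes are actually used depends on the database engine and its query planner, so it is worth checking the execution plan on the real data.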

See above previous notes about indexes, but I believe from your description, you want something like:
select table1.* from table1
inner join
table2 on (table1.first_name = table2.name OR table1.last_name = table2.name);

It should go something like this,
Select Table1.FirstName, Table1.LastName
from Table1
where Table1.FirstName IN (Select Distinct Name from Table2)
or Table1.LastName IN (Select Distinct Name from Table2)
There are various other ways to write this same query. I would suggest looking at the execution plan for each of these queries to find out which one is fastest. In addition, creating indexes on the columns used in the WHERE condition will also speed up the query.

I agree with astander. Based on my experience, using EXISTS instead of IN is a lot faster.
