Combining almost identical rows into 1 - sql

I have a tricky problem that I wouldn't mind a bit of help on. I've made some progress using queries that I've found here and elsewhere, but am getting seriously stumped now.
I have a mailing list that has numerous near duplications that I'm trying to combine into one meaningful row, taking data such as this.
Title  Forename  Surname  Address1       Postcode  Phone   Age    Income  Ownership  Gas
Mrs    D         Andrews  122 Somewhere  BH10      123456  66-70          Homeowner
Ms     Diane     Andrews  122 Somewhere  BH10      123456         £25-40             EDF
and making one row along the lines of
Title  Forename  Surname  Address1       Postcode  Phone   Age    Income  Ownership  Gas
Mrs    Diane     Andrews  122 Somewhere  BH10      123456  66-70  £25-40  Homeowner  EDF
I have over 127 million records, most duplicated with a similar pattern, but no clear logic as was proven when I added an identity field. I also have over 90 columns to consider, so it's a bit of work!
There isn't a clear pattern to the data, so I'm thinking I may have a huge case statement to try to climb over.
Using the following code I can make a decent start on returning only the row with the fuller forename, but given the pattern of the data I'm still stuck on comparing the other fields across rows. My attempt so far is as follows.
SELECT c1.*
FROM Mailing c1
JOIN Mailing c2
  ON c1.Telephone1 = c2.Telephone1
 AND c1.surname = c2.surname
WHERE len(c1.Forename) > len(c2.Forename)
  AND c2.over_18 <> ''
  AND c1.Telephone1 = '123456'
Has anyone got any pointers as to how I should progress please? I'm open to discussion and ideas...
I'm using SQL 2005 and apologies in advance if the tagging is all over the place!
Cheers,
Jon

Would it work by assuming that all persons with the same surname and phone number (Do all persons have a phone?) were the same person?
INSERT INTO newtable <fieldnames>
SELECT lastname,phone,max(field3),max(field4)....
FROM oldtable
GROUP BY lastname,phone
But that would collapse John Smith and Jack Smith living together into one person.
Perhaps you should consider outsourcing it to a data-entry sweatshop somewhere, after you have preprocessed the data. :-)
And/or be prepared to take the flak for mistaken bundling.
Perhaps add something like "To improve our green footprint, we have merged x listings at your address together. If you would like separate mailings, please contact us."
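The GROUP BY idea above can be tried on the two sample rows from the question. This is only a sketch: it runs the SQL through Python's sqlite3 module as a stand-in for SQL Server 2005, the cut-down Mailing schema is an assumption, and NULLIF is added so that blank strings never win over real values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, cut-down schema; the real table has 90+ columns.
conn.executescript("""
CREATE TABLE Mailing (
    Title TEXT, Forename TEXT, Surname TEXT, Address1 TEXT, Postcode TEXT,
    Telephone1 TEXT, Age TEXT, Income TEXT, Ownership TEXT, Gas TEXT
);
INSERT INTO Mailing VALUES
    ('Mrs', 'D',     'Andrews', '122 Somewhere', 'BH10', '123456', '66-70', '',       'Homeowner', ''),
    ('Ms',  'Diane', 'Andrews', '122 Somewhere', 'BH10', '123456', '',      '£25-40', '',          'EDF');
""")

# One output row per (Surname, Telephone1). NULLIF turns '' into NULL,
# and MAX ignores NULLs, so the populated value survives.
merged = conn.execute("""
    SELECT MAX(Forename), Surname, Telephone1,
           MAX(NULLIF(Age, '')), MAX(NULLIF(Income, '')),
           MAX(NULLIF(Ownership, '')), MAX(NULLIF(Gas, ''))
    FROM Mailing
    GROUP BY Surname, Telephone1
""").fetchall()

print(merged)
```

MAX simply keeps the alphabetically greatest value, so it would pick 'Ms' over 'Mrs' for Title; columns where both rows hold different real values still need a rule of their own.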

Related

SQL how to display results only if two parts are unique

I'm currently having an issue trying to make a query that displays rows only if two parts are unique. For example, let's say the fields to be displayed currently are as follows:
SELECT
Name,
CompanyName,
JobStartDate,
Birthday,
Age,
Favorite Ice Cream,
Height
From 'sample_person_data'
How would I set this so that it only displays fields where both CompanyName and JobStartDate are both distinct?
At first, I thought just putting distinct would be enough, but realized that would not work. I then thought of checking CompanyName + JobStartDate together as a unique key, showing only the rows where that combination is unique, but I could not work out how to implement it.
Essentially what I'm aiming to achieve is if there was a large dataset with some repeated values, how could I help display only the unique fields. I use CompanyName and JobStartDate as examples here, but I understand that people can start at the same company on the same day, therefore this would be a concept which could expand into adding more comparisons.
Thank you for your time.
EDIT: Based on comments I am trying to provide further detail by example
Say this is the sample data:
Name    CompanyName  JobStartDate  Birthday  Age  Favorite Ice Cream  Height
John    Google       04-17-00      01-01-78  50   Vanilla             5-7
John    Google       04-17-00      01-01-78  50   Chocolate           5-7
John    Microsoft    04-17-00      02-01-95  30   Chocolate           5-8
Nancy   Google       06-27-00      04-01-78  50   Vanilla             5-2
Joanna  Google       08-19-00      05-01-78  50   Vanilla             5-0
So here we see the same John from Google filled in the form twice because, say, he decided to change his favorite ice cream. How do I edit the query so that it displays the following?
Name    CompanyName  JobStartDate  Birthday  Age  Favorite Ice Cream  Height
John    Google       04-17-00      01-01-78  50   Vanilla             5-7
John    Microsoft    04-17-00      02-01-95  30   Chocolate           5-8
Nancy   Google       06-27-00      04-01-78  50   Vanilla             5-2
Joanna  Google       08-19-00      05-01-78  50   Vanilla             5-0
I don't really care if his favorite ice cream shows up as Chocolate or Vanilla, but rather that only 1 entry of a John from google shows up, using the current company + job start date as the identifying fields for example.
Use the simple approach below:
select * from your_table
qualify 1 = row_number() over(partition by CompanyName, JobStartDate)
Applied to the sample data in your question, this returns one row per distinct (CompanyName, JobStartDate) pair.
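Not every engine has QUALIFY, so here is a rough equivalent that numbers the rows in a subquery and filters on rn = 1. It is a sketch run through Python's sqlite3 module (window functions need SQLite 3.25 or newer); the trimmed column list, the FavoriteIceCream spelling and the ORDER BY inside the window are assumptions made to keep the result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed version of the question's sample table (column names assumed).
conn.execute(
    "CREATE TABLE sample_person_data "
    "(Name TEXT, CompanyName TEXT, JobStartDate TEXT, FavoriteIceCream TEXT)"
)
conn.executemany(
    "INSERT INTO sample_person_data VALUES (?, ?, ?, ?)",
    [
        ("John", "Google", "04-17-00", "Vanilla"),
        ("John", "Google", "04-17-00", "Chocolate"),
        ("John", "Microsoft", "04-17-00", "Chocolate"),
        ("Nancy", "Google", "06-27-00", "Vanilla"),
        ("Joanna", "Google", "08-19-00", "Vanilla"),
    ],
)

# Number the rows within each (CompanyName, JobStartDate) group and
# keep only the first one from every group.
rows = conn.execute("""
    SELECT Name, CompanyName, JobStartDate, FavoriteIceCream
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY CompanyName, JobStartDate
                   ORDER BY FavoriteIceCream
               ) AS rn
        FROM sample_person_data
    )
    WHERE rn = 1
    ORDER BY CompanyName, JobStartDate
""").fetchall()

for row in rows:
    print(row)
```

Adding more columns to PARTITION BY is how the identifying key would be widened if company plus start date turns out not to be unique enough.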

Merge SQL Rows in Subquery

I am trying to work with two tables on BigQuery. From table1 I want to find the accession ID of all records that are "World", and then from each of those accession numbers I want to create a column with every name in a separate row. Unfortunately, when I run this:
Select name
From `table2`
Where acc IN (Select acc
From `table1`
WHERE source = 'World')
Instead of getting something like this:
Acc1   Acc2  Acc3
Jeff   Jeff  Ted
Chris  Ted   Blake
Rob    Jack  Jack
I get something more like this:
row  name
1    Jeff
2    Chris
3    Rob
4    Jack
5    Jeff
6    Jack
7    Ted
8    Blake
Ultimately, I am hoping to download the data and somehow use python or something to take each name and count the number of times it shows up with each other name at a given accession number, and furthermore measure the degree to which each pairing is also found with third names in any given column, i.e. the degree to which they share a cohort. So I need to preserve the groupings which exist with each accession number, but I am struggling to find info on how one might do this.
Could anybody point me in the right direction for this? Or, if that is my end goal, is there a wiser way to go about it?
Thanks!
This is not a direct answer to the question you asked. In general, it is easier to handle multiple rows rather than multiple columns.
So, I would recommend that you put each acc value in a separate row and then list the names as an array:
select t2.acc, array_agg(t2.name order by t2.name) as names
from `table2` t2
where t2.acc in (Select t1.acc
From `table1` t1
where t1.source = 'World'
)
group by t2.acc;
Otherwise, you are going to have a challenge just naming the columns in your result set.
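Once the names are grouped per accession, the pair counting mentioned in the question could look roughly like this in Python. The names_by_acc dictionary is hypothetical: it is just the three columns of the question's desired table typed in by hand, standing in for whatever the downloaded query result looks like.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for the downloaded (acc, names) rows.
names_by_acc = {
    "Acc1": ["Jeff", "Chris", "Rob"],
    "Acc2": ["Jeff", "Ted", "Jack"],
    "Acc3": ["Ted", "Blake", "Jack"],
}

# Count how many accessions each unordered pair of names shares.
pair_counts = Counter()
for acc, names in names_by_acc.items():
    # sorted(set(...)) removes repeats and gives each pair one canonical order.
    for pair in combinations(sorted(set(names)), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```

Here Jack and Ted come out with a count of 2 because they appear together under both Acc2 and Acc3; every other pair appears once.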

SQL Combine null rows with non null

Due to the way a particular table is written I need to do something a little strange in SQL, and I can't find a 'simple' way to do this.
Table
Name   Place     Amount
Chris  Scotland
Chris            £1
Amy    England
Amy              £5
Output
Chris  Scotland  £1
Amy    England   £5
What I am trying to do is shown above: the null values are essentially ignored and the rows are 'grouped' up based on the Name.
I have this working using FOR XML, but it is incredibly slow. Is there a smarter way to do this?
This is where MAX would work
select
Name
,Place = Max(Place)
,Amount = Max(Amount)
from
YourTable
group by
Name
Naturally, if you have more than one occurrence of a place for a given name, you may get unexpected results.
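To see how much that caveat matters, one option is to list the names that carry more than one distinct Place before trusting MAX. A sketch using Python's sqlite3 module as a stand-in for your database; the extra 'Wales' row is made up to trigger the case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (Name TEXT, Place TEXT, Amount TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [
        ("Chris", "Scotland", None),
        ("Chris", None, "£1"),
        ("Amy", "England", None),
        ("Amy", None, "£5"),
        ("Amy", "Wales", None),  # hypothetical conflicting row
    ],
)

# Names where MAX(Place) would silently pick one of several places.
# COUNT(DISTINCT ...) ignores NULLs, so empty cells are not counted.
conflicts = conn.execute("""
    SELECT Name, COUNT(DISTINCT Place) AS places
    FROM YourTable
    GROUP BY Name
    HAVING COUNT(DISTINCT Place) > 1
""").fetchall()

# The MAX-based merge from the answer, for comparison.
merged = {
    name: (place, amount)
    for name, place, amount in conn.execute(
        "SELECT Name, MAX(Place), MAX(Amount) FROM YourTable GROUP BY Name"
    )
}

print(conflicts)
print(merged)
```

Amy shows up as a conflict, and the merge keeps 'Wales' only because it sorts after 'England'.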

Levenshtein distance with multiple comparisons

Currently I am trying to create a "best match" query.
I came across this answer, but the main difference is that I have a table with more columns, and I need to compare 6 strings.
Is there a way to implement the Levenshtein distance algorithm with a query that involves this many comparisons? All the examples I've seen online involve a single comparison sort. Is there a better way of getting the best match in a query involving this many comparisons?
EDIT
So here is the table I am trying to query best match:
CustomerID CustomerName CompanyName CompanyPhone CompanyEmail AddressL1 PostalCode
1 terbubbs terbubbs incorporated 1234567890 terbubbs#gmail.com 5 Main St 06482
This "best match" query is done when a user submits an order request. They will enter data into identical fields and I need to determine whether this user has submitted a request in the past.
Here are three possible requests:
1. CustomerName CompanyName CompanyPhone CompanyEmail AddressL1 PostalCode
terrbubbs terbubbs inc 11234567890 terbubbs#gmail.com 7 Main St 06482
2. CustomerName CompanyName CompanyPhone CompanyEmail AddressL1 PostalCode
terribble Terribble Incorporated 1254643789 terribble#gmail.com 12 State St 04422
3. CustomerName CompanyName CompanyPhone CompanyEmail AddressL1 PostalCode
john doe JD inc 5468791313 john#gmail.com 12 Main St 06482
Now based on these three requests, I would want Request 1 to be the best match. Honestly, this is probably a terrible example. My point is that a user might submit an almost identical request apart from a few misspellings or grammar mistakes. I want to retrieve the most similar entry in the datatable if possible.
EDIT 2
I'm wondering if it is better to try and concatenate corresponding datatable column values into a formatted string and compare it to a formatted string of the request. Any thoughts?
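One way to explore the idea outside the database is to compute the Levenshtein distance per field and add the results up, then take the row with the lowest total. The sketch below does that in plain Python with the example row and the three requests from above; the function names, the lower-casing and the unweighted sum are all assumptions, not an established method.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


FIELDS = ["CustomerName", "CompanyName", "CompanyPhone",
          "CompanyEmail", "AddressL1", "PostalCode"]


def total_distance(stored: dict, request: dict) -> int:
    """Sum of per-field distances, ignoring case."""
    return sum(levenshtein(stored[f].lower(), request[f].lower())
               for f in FIELDS)


stored = dict(zip(FIELDS, ["terbubbs", "terbubbs incorporated", "1234567890",
                           "terbubbs#gmail.com", "5 Main St", "06482"]))
requests = [
    dict(zip(FIELDS, ["terrbubbs", "terbubbs inc", "11234567890",
                      "terbubbs#gmail.com", "7 Main St", "06482"])),
    dict(zip(FIELDS, ["terribble", "Terribble Incorporated", "1254643789",
                      "terribble#gmail.com", "12 State St", "04422"])),
    dict(zip(FIELDS, ["john doe", "JD inc", "5468791313",
                      "john#gmail.com", "12 Main St", "06482"])),
]

scores = [total_distance(stored, r) for r in requests]
best = scores.index(min(scores))
print(scores, "best match: request", best + 1)
```

Keeping the fields separate, rather than concatenating them into one string, would make it possible to weight some columns (say email or phone) more heavily than others.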

Exclude entire row based on values from another query

I am using MS Access and I have a rather complex situation.
I have Respondents who are linked to varying numbers of different Companies via 2 connecting tables. I want to be able to create a list of distinct customers which excludes any customer associated with Company X.
Here is a pic of the relationships that are involved with the query.
And here is an example of what I'm trying to achieve.
RespondentRef | Respondent Name
8 Joe Bloggs
.
RespondentRef | GroupRef
8 2
.
GroupRef | CompanyRef
2 10
.
CompanyRef | CompanyName
10 Ball of String
I want a query where I enter in 'Ball of String' for the company name, and then it produces a list of all the Respondents (taken from Tbl_Respondent) which completely excludes Respondent 8 (as he is linked to CompanyName: Ball of String).
Tbl_Respondent
RespondentRef | Respondent Name
... ...
7 Bob Carlyle
9 Anton Boyle
I have tried many combinations of subqueries with <> and NOT EXISTS and NOT IN and nothing seems to work. I suspect the way these tables are linked may have something to do with it.
Any help you could offer would be very much appreciated. If you have any questions let me know. (I have made best efforts, but please accept my apologies for any formatting conventions or etiquette faux-pas I may have committed.)
Thank you very much.
EDIT:
My formatted version of Frazz's code is still resulting in a syntax error. Any help would be appreciated.
SELECT *
FROM Tbl_Respondent
WHERE RespondentRef NOT IN (
SELECT tbl_Group_Details_Respondents.RespondentRef
FROM tbl_Group_Details_Respondents
JOIN tbl_Group_Details ON tbl_Group_Details.GroupReference = tbl_Group_Details_Respondents.GroupReference
JOIN tbl_Company_Details ON tbl_Company_Details.CompanyReference = tbl_Group_Details.CompanyReference
WHERE tbl_Company_Details.CompanyName = "Ball of String"
)
This should do what you need:
SELECT *
FROM Tbl_Respondent
WHERE RespondentRef NOT IN (
SELECT gdr.RespondentRef
FROM Tbl_Group_Details_Respondent gdr
JOIN Tbl_Group_Details gd ON gd.GroupRef=gdr.GroupRef
JOIN Tbl_Company_Details cd ON cd.CompanyRef=gd.CompanyRef
WHERE cd.CompanyName='Ball of String'
)
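To check the logic independently of Access, the same NOT IN query can be run against a tiny copy of the example data. This is a sketch using Python's sqlite3 module; the column names and the extra 'Other Co' rows are assumptions. As far as I know, Access's own SQL dialect wants INNER JOIN written out in full and multiple joins wrapped in parentheses, which may be what is behind the syntax error mentioned in the edit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column names, based on the example in the question.
conn.executescript("""
CREATE TABLE Tbl_Respondent (RespondentRef INTEGER, RespondentName TEXT);
CREATE TABLE Tbl_Group_Details_Respondent (RespondentRef INTEGER, GroupRef INTEGER);
CREATE TABLE Tbl_Group_Details (GroupRef INTEGER, CompanyRef INTEGER);
CREATE TABLE Tbl_Company_Details (CompanyRef INTEGER, CompanyName TEXT);

INSERT INTO Tbl_Respondent VALUES (7, 'Bob Carlyle'), (8, 'Joe Bloggs'), (9, 'Anton Boyle');
INSERT INTO Tbl_Group_Details_Respondent VALUES (8, 2), (7, 3);  -- (7, 3) is made up
INSERT INTO Tbl_Group_Details VALUES (2, 10), (3, 11);           -- (3, 11) is made up
INSERT INTO Tbl_Company_Details VALUES (10, 'Ball of String'), (11, 'Other Co');
""")

# Respondents NOT linked, through any group, to the named company.
remaining = conn.execute("""
    SELECT RespondentRef, RespondentName
    FROM Tbl_Respondent
    WHERE RespondentRef NOT IN (
        SELECT gdr.RespondentRef
        FROM Tbl_Group_Details_Respondent gdr
        JOIN Tbl_Group_Details gd ON gd.GroupRef = gdr.GroupRef
        JOIN Tbl_Company_Details cd ON cd.CompanyRef = gd.CompanyRef
        WHERE cd.CompanyName = ?
    )
    ORDER BY RespondentRef
""", ("Ball of String",)).fetchall()

print(remaining)
```

Respondent 8 drops out, while 7 (linked to a different company) and 9 (linked to none) stay in the list.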