SQL - How to "flatten" several very similar rows into 1

Apologies if the title isn't clear - I just didn't know how to describe the issue and I really don't know SQL that well/at all.
I am working with a database used by our case management system. At places it has clearly been extended over time by the developers. I am working with Contact details (names, addresses, etc...) and they have added extra fields to deal with email addresses and to allow for home/work/mobile phone numbers etc...
The problem is that they haven't added a new column for each new piece of data. They have instead added a couple of fields in 2 different tables: the first field holds the field name, the second holds the actual data.
The first is called AddElTypeText in a table called AdditionalAddElTypes - The AddElTypeText field includes values like "Work Telephone 1", "Fax", "Home Email" etc... (There are a total of 10 different values and I can't see the developers expanding this number any time soon)
The second field is called AddElementText in a table called AdditionalAddressElements - the AddElementText then includes the actual data e.g. the phone number, email address.
For those of you who (unlike me) find it easier to look at the SQL code, it's:
SELECT
Address.AddressLine1
,AdditionalAddElTypes.AddElTypeText
,AdditionalAddressElements.AddElementText
FROM
Address
INNER JOIN AdditionalAddressElements
ON Address.AddressID = AdditionalAddressElements.AddressID
INNER JOIN AdditionalAddElTypes
ON AdditionalAddressElements.AddElTypeID = AdditionalAddElTypes.AddElTypeID
I can work with this, but if any contact has 2 or more "additional" elements, I get multiple rows, with most of the data being the same, but just the 2 columns of AddElTypeText and AddElementText being different.
So can anyone suggest anything to "flatten" a contact into a single row. I had in mind something like concatenating AddElTypeText and AddElementText into a single string field, ideally with a space in between AddElTypeText and AddElementText, and then a : or , separating the pairs of AddElTypeText and AddElementText.
However, I have very little idea how to achieve that, or whether an entirely different approach would be better. Any help very gratefully received!
Gary

As @twn08 said, this type of question has been asked before. Grouped string concatenation is generally a pain in SQL Server and involves the use of FOR XML.
That being said, here's a SQLFiddle that (I believe) does something like what you wanted. And here's the actual query:
WITH Results AS
(
SELECT a.*,
t.AddElTypeText,
aa.AddElementText
FROM
Address a
INNER JOIN
AdditionalAddressElements aa
ON a.AddressID = aa.AddressID
INNER JOIN
AdditionalAddElTypes t
ON aa.AddElTypeID = t.AddElTypeID
)
SELECT
res.AddressID,
STUFF((
SELECT ', ' + AddElTypeText + ': ' + AddElementText
FROM Results
WHERE (AddressID = res.AddressID)
FOR XML PATH (''))
,1,2,'') AS AdditionalElements
FROM Results res
GROUP BY res.AddressID
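For comparison, here is a minimal sketch of the same "one row per address" flattening in a different engine. It uses Python's standard-library sqlite3 module and SQLite's group_concat aggregate instead of FOR XML PATH, so it is not SQL Server syntax. The table and column names come from the question; the sample rows and the flatten_contacts helper are made up for illustration.

```python
import sqlite3

def flatten_contacts(conn):
    """Return {AddressID: 'Type: value, Type: value'}, one entry per address."""
    rows = conn.execute(
        """
        SELECT a.AddressID,
               group_concat(t.AddElTypeText || ': ' || e.AddElementText, ', ')
        FROM Address AS a
        JOIN AdditionalAddressElements AS e ON a.AddressID = e.AddressID
        JOIN AdditionalAddElTypes AS t ON e.AddElTypeID = t.AddElTypeID
        GROUP BY a.AddressID
        """
    ).fetchall()
    return dict(rows)

# Hypothetical sample data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Address (AddressID INTEGER PRIMARY KEY, AddressLine1 TEXT);
    CREATE TABLE AdditionalAddElTypes (AddElTypeID INTEGER PRIMARY KEY, AddElTypeText TEXT);
    CREATE TABLE AdditionalAddressElements (AddressID INTEGER, AddElTypeID INTEGER, AddElementText TEXT);
    INSERT INTO Address VALUES (1, '1 High Street'), (2, '2 Low Road');
    INSERT INTO AdditionalAddElTypes VALUES (1, 'Work Telephone 1'), (2, 'Fax'), (3, 'Home Email');
    INSERT INTO AdditionalAddressElements VALUES
        (1, 2, '555 0101'),
        (1, 3, 'gary@example.com'),
        (2, 1, '555 0102');
    """
)
flattened = flatten_contacts(conn)
```

As in the query above, the inner joins mean an address with no additional elements does not appear at all, and the order of the pairs inside each string is not guaranteed.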

Related

Optimize a SQL Server query with conditional and formatted string join

I need to execute a query that will join two tables on fields named a.PatientAddress and b.ADDRESS, the issue is that b.ADDRESS needs to be standardized and formatted to match the standardized address found in a.PatientAddress. I don't have control over the incoming data format, so having the data scrubbed before it comes into my b table is not an option. Example:
a.PatientAddress may equal something like 1234 Someplace Cool Dr. Apt 1234 while the matching address in b.ADDRESS may equal something like 1234 Someplace Cool Dr. #1234 (in reality that is just one of many possibilities). The Apartment number (if existent in the address) is the area of fluctuation that needs formatting in order to join properly.
Some possible Apt variations I've seen in the data set are:
1234 Someplace Cool Dr. #1234
1234 Someplace Cool Dr. Apt 1234
1234 Someplace Cool Dr. Apt #1234
1234 Someplace Cool Dr. Apt # 1234
Now, for what I've already tried;
SELECT vgi.VisitNo
,vgi.AdmitDate
,vgi.ChargesTotal
,MONTH(vgi.AdmitDate) AS AdmitMonth
,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
,YEAR(vgi.AdmitDate) AS AdmitYear
,vgi.PatientAddress
,mm.MAIL_DATE
,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN MARKETING_MAILING mm ON vgi.AdmitDate >= mm.MAIL_DATE
AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
AND (
-- IF APT IS NOT FOUND, THEN ADDRESS SHOULD DIRECTLY EQUAL ANOTHER ADDRESS
( mm.ADDRESS NOT LIKE '%[$0-9]'
AND UPPER(vgi.PatientAddress) = UPPER(mm.ADDRESS)
)
OR
(
mm.ADDRESS LIKE '%[$0-9]'
AND UPPER(vgi.PatientAddress) =
-- PATIENT ADDRESS SHOULD EQUAL THE FORMATTED ADDRESS OF THE MAIL RECIPIENT
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(mm.ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS)-1)),'#',''),mm.ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(mm.ADDRESS, CHARINDEX(' ', mm.ADDRESS)-1)),'#','')
)
)
The problem here is that the query takes 20+ minutes to execute, and sometimes doesn't even finish before the operation time expires. I've also tried splitting the two conditions up into UNION statements. I've also tried splitting the street address and apartment number to create a like statement that reads UPPER(vgi.PatientAddress) LIKE UPPER('%1234 Someplace Cool Dr.%1234%') and that doesn't seem to work either. I'm starting to run out of ideas and wanted to see what others could suggest.
Thanks in advance for any pointers or help!
The logic needed to scrub the data is beyond the scope of what we can do for you. You'll likely find that, ultimately, you need some other key for this query to ever work. However, assuming your existing logic is adequate to create a good match (even if slow), we might be able to help improve performance a bit.
One way you can improve things is to join on a projection of the address table that cleans the data. (That means join to a sub query). That projection might look like this:
SELECT Mail_Date, Address,
CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#',''),ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#','')
ELSE UPPER(ADDRESS)
END AS ADDRESS_CLEAN
FROM MARKETING_MAILING
This improves things because it avoids the "OR" condition in your JOIN; you simply match to the projected column. However, this will force the projection over every row in the table (hint: that was probably happening anyway), and so it's still not as fast as it could be. You can get an idea for whether this will help from how long it takes to run the projection by itself.
You can further improve on the projection method by adding the ADDRESS_CLEAN column above as a persisted computed column to your Marketing_Mailing table. This will force the adjustment to happen at insert time, meaning the work is already done for your slow query. You can even index on the column. Of course, that is at the cost of slower inserts. You might also try a view (or materialized view) on the table. This will help SQL Server save some of the work it does computing that extra column across multiple queries. For best results, also think about what WHERE filters you can use at the time you are creating the projection, to avoid needing to ever compute the extra column on those rows in the first place.
An additional note is that, for the default collation, you can skip using the UPPER() function. That is likely hurting your index use.
Put it all together like this:
SELECT vgi.VisitNo
,vgi.AdmitDate
,vgi.ChargesTotal
,MONTH(vgi.AdmitDate) AS AdmitMonth
,DATENAME(MONTH, vgi.AdmitDate) AS AdmitMonthName
,YEAR(vgi.AdmitDate) AS AdmitYear
,vgi.PatientAddress
,mm.MAIL_DATE
,mm.ADDRESS
FROM VISIT_GENERAL_INFORMATION vgi
INNER JOIN
(
SELECT Mail_Date, Address,
CASE WHEN ADDRESS LIKE '%[$0-9]' THEN
-- GET THE FIRST PART OF THE ADDRESS, UP TO THE ADDRESS NUMBER
SUBSTRING(ADDRESS,1,CHARINDEX(REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#',''),ADDRESS))
+ ' ' +
-- GET THE APARTMENT ADDRESS NUMBER AND FORMAT IT
-- TAKE OUT EXTRA SPACING AROUND IT AND THE # CHARACTER IF IT EXISTS
REPLACE(LTRIM(RIGHT(ADDRESS, CHARINDEX(' ', ADDRESS)-1)),'#','')
ELSE ADDRESS END AS ADDRESS_CLEAN
FROM MARKETING_MAILING
) mm ON vgi.AdmitDate >= mm.MAIL_DATE
AND vgi.AdmitDate > '2014-01-01 00:00:00.000'
AND vgi.PatientAddress = mm.ADDRESS_CLEAN
Another huge factor not yet covered is indexes. What indexes are on your VISIT_GENERAL_INFORMATION table? I'd especially like to see a single index that covers both AdmitDate and PatientAddress. Which column should come first depends on the cardinality of those fields, and on how clean and how large the Marketing_Mailing table is.
Finally, one request of my own: if this helps, I'd like to hear back on just how much it helped. If the query used to take 20 minutes, how long does it take now?
I agree with @TomTom that you would really benefit from "pre-standardizing" into either a derived column that updates on the fly, or a view, or just a temp table in your query process, that gives you a clean match.
Ideally I would use a third-party service or library for that, because they have spent a lot of time making it a reliable parse.
Either option works after receiving the data you can't control, so that is not a problem. What you're doing is creating your own internal copy that is standardized.
Of course, you're going to need to run the other side, "a", through the same standardization.
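To make the "standardize both sides the same way" idea concrete, here is a small sketch in Python. It only handles the four apartment variants listed in the question, the standardize_address name and the regular expression are illustrative assumptions, and it is no substitute for a real address-parsing library.

```python
import re

# Matches a trailing apartment marker such as "#1234", "Apt 1234",
# "Apt #1234" or "Apt # 1234" and captures the number.
_APT_SUFFIX = re.compile(r"\s+(?:apt\.?\s*#?\s*|#\s*)(\d+)\s*$", re.IGNORECASE)

def standardize_address(address):
    """Upper-case, collapse whitespace, and rewrite the apartment suffix as 'APT n'."""
    collapsed = " ".join(address.split()).upper()
    return _APT_SUFFIX.sub(r" APT \1", collapsed)

variants = [
    "1234 Someplace Cool Dr. #1234",
    "1234 Someplace Cool Dr. Apt 1234",
    "1234 Someplace Cool Dr. Apt #1234",
    "1234 Someplace Cool Dr. Apt # 1234",
]
standardized = {standardize_address(v) for v in variants}
```

The output of a function like this would be stored in the extra column (or temp table) on both tables, so the join itself becomes a plain equality comparison.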

SQL query to find records with specific prefix

I'm writing SQL queries and getting tripped up by wanting to solve everything with loops instead of set operations. For example, here's two tables (lists, really - one column each); idPrefix is a subset of idFull. I want to select every full ID that has a prefix I'm interested in; that is, every row in idFull which has a corresponding entry in idPrefix.
idPrefix.ID   idFull.ID
-----------   ---------
12            8
15            12
300           12-1-1
              12-1-2
              15
              15-1
              300
Desired result would be everything in idFull except the value 8. Super-easy with a for each loop, but I'm just not conceptualizing it as a set operation. I've tried a few variations on the below; everything seems to return all of one table. I'm not sure if my issue is with how I'm doing joins, or how I'm using LIKE.
SELECT f.ID
FROM idPrefix AS p
JOIN idFull AS f
ON f.ID LIKE (p.ID + '%')
Details:
Values are varchars, prefixes can be any length but do not contain the delimiter '-'.
This question seems similar, but more complex; this one only uses one table.
Answer doesn't need to be fast/optimized/whatever.
Using SQL Server 2008, but am more interested in conceptual understanding than a flavor-specific query.
Aaaaand I'm coming back to both real coding & SO after ~3 years, so sorry if I'm rusty on any etiquette.
Thanks!
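You can join the full table to the prefix table with a LIKE. (FULL is a reserved word, so use a different alias, and refer to the table by its alias once you have given it one.)
SELECT f.ID
FROM idFull AS f
INNER JOIN idPrefix AS p ON f.ID LIKE p.ID + '%'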
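Here is a runnable sketch of that join against the sample data from the question, using Python's standard-library sqlite3 module. SQLite concatenates with || rather than +. The extra row '120' is not in the question: it is added to show that a bare prefix match also picks up IDs that merely start with the same digits, and the second query is one possible stricter variant that relies on the '-' delimiter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE idPrefix (ID TEXT);
    CREATE TABLE idFull (ID TEXT);
    INSERT INTO idPrefix VALUES ('12'), ('15'), ('300');
    INSERT INTO idFull VALUES ('8'), ('12'), ('12-1-1'), ('12-1-2'),
                              ('15'), ('15-1'), ('300'),
                              ('120');  -- hypothetical extra row
    """
)

# Plain prefix match: every full ID that starts with some prefix.
loose = [row[0] for row in conn.execute(
    """
    SELECT DISTINCT f.ID
    FROM idFull AS f
    JOIN idPrefix AS p ON f.ID LIKE p.ID || '%'
    """
)]

# Stricter variant: the ID is the prefix itself, or the prefix followed by '-'.
strict = [row[0] for row in conn.execute(
    """
    SELECT DISTINCT f.ID
    FROM idFull AS f
    JOIN idPrefix AS p ON f.ID = p.ID OR f.ID LIKE p.ID || '-%'
    """
)]
```

Conceptually the join compares every row of idFull with every row of idPrefix and keeps the pairs where the condition holds, which is the set-based equivalent of the nested for-each loop.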

Search a column for values LIKE in another column

I searched but couldn't find what I was looking for, maybe I'm not looking for the right terms though.
I have a column for SKUs and a Keywords column. The SKUs are formatted like AA 12345, and the Keywords are just long lists of words. I need to find any records where the numbers in the SKU match any part of the Keywords, and I'm not sure how to do this. For example, I'd like to remove the AA so that I'm looking for %12345% anywhere inside the value of Keywords, but I need to do it for every record.
I've tried a few variations of:
SELECT *, Code AS C
FROM Prod
WHERE Keywords LIKE '%C%';
but I get errors on all of them. Can someone help?
Thank you.
EDIT: Okay, sorry about that, the question wasn't the clearest. I'll try to clarify;
The SKU column has values that have a 2 letter prefix in front of a varying amount of numbers such as, AA 12345 or UN 98767865
The Keywords columns are full of information, but also include the SKU values, the problem here is that some of the keyword columns contain the SKU values of products that have entirely different records
I'm trying to find what columns contain the value of different records.
I hope that's more understandable.
EDIT EDIT: Here is some actual sample data
Code: AD 56409429
Keywords: 56409429, 409249, AD 56409429, AD-56409429, Advance 56409429, Nilfisk 56409429, Nilfisk Advance 56409429, spx56409429, 56409429M, 56409429G, 56409429H, ADV56409429, KNT56409429, Kent 56409429, AA 12345
Code: AA 12345
Keywords: AA 12345, 12345, Brush
I need to find all the records where an errant Code value has found its way into the Keywords, such as the first case above, so I need a query that would only return the first example.
I'm really sorry my explanation is confusing, it's perhaps an extension of how confused I am trying to figure out how to do it. Imagine me sitting there with the site owner who added thousands of these extra sku numbers to their keywords and having them ask me to then remove them :/
Assuming all of your SKU values are in exactly the same format, you can remove the 'AA ' prefix using SUBSTRING and then use the result in the LIKE pattern:
SELECT * FROM Prod WHERE Keywords LIKE '%' + SUBSTRING(Code, 4, 5) + '%'
Seeing as your SKU codes can be variable length, the SUBSTRING call above has to be changed to:
SELECT * FROM Prod WHERE Keywords LIKE '%' + SUBSTRING(Code, 4, LEN(Code)) + '%'
SUBSTRING is 1-based, so starting at position 4 skips the first 3 characters of your SKU code (the two letters and the space) regardless of the number of digits that follow.
It is not entirely clear from your question whether the Keywords are in the format AA 12345 or just 12345, but assuming they are comma separated, you can find all records where the code is in the keywords alongside OTHER keywords by using this statement:
SELECT *
FROM Prod
WHERE Keywords LIKE '%' + SUBSTRING(Code, 4, LEN(Code)) + '%'
AND Keywords <> SUBSTRING(Code, 4, LEN(Code))
This statement basically says: find me all records where the SKU code is somewhere in the Keywords but does not exactly match the Keywords contents, i.e. there must be other keywords in the data.
OK, based on your last revisions I think this will work, or at least get you along the road (I am assuming your Prod table has a primary key of Id). This is most likely horribly inefficient, but since it sounds like a one-off tidy-up it may not matter too much as long as it works (at least that is what I am hoping).
SELECT DISTINCT P.Id
FROM Prod P
INNER JOIN
(
-- Get all unique SKU codes from Prod table
SELECT DISTINCT SUBSTRING(Code, 4, LEN(Code)) AS Code FROM Prod
) C ON P.Keywords LIKE '%' + C.Code + '%'
AND SUBSTRING(P.Code, 4, LEN(P.Code)) <> C.Code
The above statement joins a unique list of SKU codes (with the letter prefix removed) to every record whose Keywords column contains one of them. Note: the join on its own will return duplicate product records. Additionally, the result set is filtered so that it only returns matches where the SKU code of the product record itself differs from the SKU code found in its Keywords column.
The DISTINCT then returns only a unique list of product Ids that have an erroneous SKU code in the Keywords column (they may have several).
STUFF() seems better suited here. I would do this:
SELECT *
FROM Prod WHERE
Keywords LIKE '%' + STUFF(SKU,1,3,'') + '%'
This will work for both AA 12345 and UN 98767865: it replaces the first 3 characters with an empty string.
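To check the "errant code" logic against the sample data from the question, here is a sketch using Python's standard-library sqlite3 module. SQLite spells the function substr and concatenates with ||, the Id column is an assumption (as in the answer above), and the first record's keyword list is shortened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Prod (Id INTEGER PRIMARY KEY, Code TEXT, Keywords TEXT)")
conn.executemany(
    "INSERT INTO Prod (Id, Code, Keywords) VALUES (?, ?, ?)",
    [
        # Shortened version of the first sample record; it ends with an errant 'AA 12345'.
        (1, "AD 56409429", "56409429, 409249, AD 56409429, Nilfisk 56409429, AA 12345"),
        (2, "AA 12345", "AA 12345, 12345, Brush"),
    ],
)

# substr(Code, 4) drops the two-letter prefix and the space.
errant_ids = [row[0] for row in conn.execute(
    """
    SELECT DISTINCT p.Id
    FROM Prod AS p
    JOIN (SELECT DISTINCT substr(Code, 4) AS Num FROM Prod) AS c
      ON p.Keywords LIKE '%' || c.Num || '%'
     AND substr(p.Code, 4) <> c.Num
    ORDER BY p.Id
    """
)]
```

Because this is a plain substring match, a short code number that happens to appear inside a longer one would also be reported, so the results are best treated as candidates to review rather than a final list.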

Sorting SQL by first two characters of fields

I'm trying to sort some data by sales person initials, and the sales rep field is 3 chars long, and is Firstname, Lastname and Account type. So, Bob Smith would be BS* and I just need to sort by the first two characters.
How can I pull all data for a certain rep, where the first two characters of the field equals BS?
In some databases you can actually do
select * from SalesRep order by substring(SalesRepID, 1, 2)
Others require you to
select *, Substring(SalesRepID, 1, 2) as foo from SalesRep order by foo
And in still others, you can't do it at all (but will have to sort your output in program code after you get it from the database).
Addition: If you actually want just the data for one sales rep, do as the others suggest. Otherwise, either you want to sort by the thing or maybe group by the thing.
What about this
SELECT * FROM SalesTable WHERE SalesRepField LIKE 'BS_'
I hope that you never end up with two sales reps who happen to have the same initials.
Also, sorting and filtering are two completely different things. You talk about sorting in the question title and first paragraph, but your question is about filtering. Since you can just ORDER BY on the field and it will use the first two characters anyway, I'll give you an answer for the filtering part.
You don't mention your RDBMS, but this will work in any product:
SELECT
my_columns
FROM
My_Table
WHERE
sales_rep LIKE 'BS%'
If you're using a variable/parameter then:
SELECT
my_columns
FROM
My_Table
WHERE
sales_rep LIKE @my_param + '%'
You can also use:
LEFT(sales_rep, 2) = 'BS'
I would stay away from:
SUBSTRING(sales_rep, 1, 2) = 'BS'
Depending on your SQL engine, it might not be smart enough to realize that it can use an index on the last one.
You haven't said what DBMS you are using. The following would work in Oracle, and something like them in most other DBMSs
1) where sales_rep like 'BS%'
2) where substr(sales_rep,1,2) = 'BS'
SELECT * FROM SalesRep
WHERE SUBSTRING(SalesRepID, 1, 2) = 'BS'
You didn't say what database you were using, this works in MS SQL Server.
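The filtering and sorting variants above can be tried side by side with a small sketch in Python's standard-library sqlite3 module. The SalesRep rows are invented, the two helper functions are hypothetical names, and SQLite uses substr and || where SQL Server uses SUBSTRING and +.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SalesRep (SalesRepID TEXT)")
conn.executemany(
    "INSERT INTO SalesRep VALUES (?)",
    [("BSA",), ("BTA",), ("ABS",), ("BSB",)],  # made-up rep codes
)

def reps_with_initials(conn, initials):
    """Filter: rows whose code starts with the given initials (bound as a parameter)."""
    rows = conn.execute(
        "SELECT SalesRepID FROM SalesRep WHERE SalesRepID LIKE ? || '%' ORDER BY SalesRepID",
        (initials,),
    )
    return [row[0] for row in rows]

def reps_sorted_by_initials(conn):
    """Sort: all rows, ordered by the first two characters, then by the full code."""
    rows = conn.execute(
        "SELECT SalesRepID FROM SalesRep ORDER BY substr(SalesRepID, 1, 2), SalesRepID"
    )
    return [row[0] for row in rows]
```

Note that 'ABS' is not returned by the filter even though it contains 'BS', because the pattern only has a wildcard at the end.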

How to sort and display mixed lists of alphas and numbers as the users expect?

Our application has a CustomerNumber field. We have hundreds of different people using the system (each has their own login and their own list of CustomerNumbers). An individual user might have at most 100,000 customers. Many have less than 100.
Some people only put actual numbers into their customer number fields, while others use a mixture of things. The system allows 20 characters which can be A-Z, 0-9 or a dash, and stores these in a VARCHAR2(20). Anything lowercase is made uppercase before being stored.
Now, let's say we have a simple report that lists all the customers for a particular user, sorted by Customer Number. e.g.
SELECT CustomerNumber,CustomerName
FROM Customer
WHERE User = ?
ORDER BY CustomerNumber;
This is a naive solution as the people that only ever use numbers do not want to see a plain alphabetic sort (where "10" comes before "9").
I do not wish to ask the user any unnecessary questions about their data.
I'm using Oracle, but I think it would be interesting to see some solutions for other databases. Please include which database your answer works on.
What do you think the best way to implement this is?
Probably your best bet is to pre-calculate a separate column and use that for ordering and use the customer number for display. This would probably involve 0-padding any internal integers to a fixed length.
The other possibility is to do your sorting post-select on the returned results.
Jeff Atwood has put together a blog posting about how some people calculate human friendly sort orders.
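Both suggestions can be sketched in a few lines of Python, assuming customer numbers contain only A-Z, 0-9 and dashes as described in the question. natural_key is for sorting the rows after they are selected; padded_key builds the zero-padded string you could store in a separate sort column. Both function names and the width of 20 are illustrative choices.

```python
import re

def natural_key(customer_number):
    """Sort key that compares digit runs as numbers and everything else as text."""
    key = []
    for part in re.split(r"(\d+)", customer_number):
        if part == "":
            continue
        if part.isdigit():
            key.append((0, int(part), ""))   # numeric run
        else:
            key.append((1, 0, part))         # text run
    return key

def padded_key(customer_number, width=20):
    """Zero-pad every digit run so a plain string sort gives the same order."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), customer_number)

numbers = ["10", "9", "A-10", "A-9", "2"]
sorted_post_select = sorted(numbers, key=natural_key)
sorted_by_column = sorted(numbers, key=padded_key)
```

With this key, purely numeric customer numbers sort as numbers and come before anything that starts with a letter. The padded key can be longer than the 20-character customer number, so the sort column would need to be wider than the display column.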
In Oracle 10g:
SELECT cust_name
FROM t_customer c
ORDER BY
REGEXP_REPLACE(cust_name, '[0-9]', ''), TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+'))
This will sort by the first occurrence of a number, regardless of its position, i.e.:
customer1 < customer2 < customer10
cust1omer ? customer1
cust8omer1 ? cust8omer2
, where a ? means that the order is undefined.
That suffices for most cases.
To force sort order on case 2, you may add REGEXP_INSTR(cust_name, '[0-9]+', 1, n) to the ORDER BY list for n = 1, 2, 3 and so on, forcing order on the position of the n-th group of digits.
To force sort order on case 3, you may add TO_NUMBER(REGEXP_SUBSTR(cust_name, '[0-9]+', 1, n)) to the ORDER BY list for n = 2, 3 and so on, forcing order on the value of the n-th group of digits.
In practice, the query I wrote is enough.
You may create a function-based index on these expressions, but you'll need to force it with a hint, and a one-pass SORT ORDER BY will be performed anyway, as the CBO doesn't trust function-based indexes enough to allow an ORDER BY on them.
You could have a numeric column [CustomerNumberInt] that is only used when the CustomerNumber is purely numeric (NULL otherwise[1]), then
ORDER BY CustomerNumberInt, CustomerNumber
[1] depending on how your SQL version handles NULLs in ORDER BY you might want to default it to zero (or infinity!)
I have a similar horrible situation and have developed a suitably horrible function to deal with it (SQLServer)
In my situation I have a table of "units" (this is a work-tracking system for students, so unit in this context represents a course they're doing). Units have a code, which for the most part is purely numeric, but for various reasons it was made a varchar and they decided to prefix some by up to 5 characters. So they expect 53,123,237,356 to sort normally, but also T53, T123, T237, T356
UnitCode is a nvarchar(30)
Here's the body of the function:
declare @sortkey nvarchar(30)
select @sortkey =
case
when @unitcode like '[^0-9][0-9]%' then left(@unitcode,1) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-1)
when @unitcode like '[^0-9][^0-9][0-9]%' then left(@unitcode,2) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-2)
when @unitcode like '[^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,3) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-3)
when @unitcode like '[^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,4) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-4)
when @unitcode like '[^0-9][^0-9][^0-9][^0-9][^0-9][0-9]%' then left(@unitcode,5) + left('000000000000000000000000000000',30-(len(@unitcode))) + right(@unitcode,len(@unitcode)-5)
when @unitcode like '%[^0-9]%' then @unitcode
else left('000000000000000000000000000000',30-len(@unitcode)) + @unitcode
end
return @sortkey
I wanted to shoot myself in the face after writing that, however it works and seems not to kill the server when it runs.
I used this in SQL Server and it works great. The solution is to pad the numeric values with a character in front so that all are of the same string length.
Here is an example using that approach:
select MyCol
from MyTable
order by
case IsNumeric(MyCol)
when 1 then Replicate('0', 100 - Len(MyCol)) + MyCol
else MyCol
end
The 100 should be replaced with the actual length of that column.