SQL code to remove extra spaces and line breaks in free text?

I'm currently working with a table that deals with patients who have visited a clinic. One of the fields in this table shows the reason for the visit, and it's free text so whoever's booking the appointment can leave a custom note for the doctor depending on what the issue is. Yes, I'm well aware free text is the actual worst, but I did not design this database or the front-end medical record system (which is also the worst) and I'm simply stuck dealing with it. Bear with me.
Because of the special characters, extra spaces, and carriage returns that often find their way into that free text field on the front end, all its contents would show up on a single line in SSMS but would cause all sorts of formatting issues with extra line breaks when the SQL results were pasted into Excel. I did a little research and found a snippet of code that would replace carriage returns, etc. in a given field, thus forcing all the contents of that field to remain in a single cell:
REPLACE(REPLACE(FieldName,char(10),''),char(13),'') as FieldName
This has worked splendidly for this VisitReason field and any other free text fields I've been forced to work with. However, does it account for every possible issue one might find in free text? Yesterday I was working with this table and pasted the results from SSMS into Excel, and there were two people whose VisitReason fields were cut off prematurely and then had all the results (as in multiple fields) from a bunch of other people's visits crammed into that same field (thus making for one really long cell in Excel).
For example, the VisitReason for one of these people showed up in SSMS as complaining of rash, see note. But then when it was pasted into Excel, the results looked like...
PatientID PatientName VisitDate ... VisitReason
----------------------------------------------------------------------------------------------
1001 Smith, John 01/08/2023 ... complaining of rash, see
PatientID1002PatientNameJaneDoeVisitDate01/08/2023VisitRe
asondiabetesfollowupPatientID1003PatientNameBobBrownVisitDa
(and so on)
I can't tell if this has something to do with the free text field, and there's some hidden character in there that's causing the weird line breaks and field merging that my REPLACE function isn't catching, or whether it's an error with Excel (in which case this obviously isn't the right place to be asking). But I wanted to check and see if there was anything that potentially needed to be added to the REPLACE line that would fix the problem.
My full query is really simple:
SELECT
d.PatientID,
d.PatientName,
v.VisitDate,
[some other visit-related fields, none of which are free text],
REPLACE(REPLACE(v.VisitReason,char(10),''),char(13),'') as VisitReason,
[some other demographic fields, none of which are free text]
FROM Demographics d
JOIN Visit v ON d.PatientID = v.PatientID
The REPLACE function works perfectly fine for literally every other patient in the list except for the two with results like what's shown above, which then go on to affect a number of other rows following them. Anyone have any thoughts?

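Please try the following solution.
Casting the value to the XQuery xs:token data type normalizes the whitespace: tabs, carriage returns, and line feeds become spaces, leading and trailing spaces are removed, and runs of spaces collapse to a single space.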
USE tempdb;
GO
DROP FUNCTION IF EXISTS dbo.udf_tokenize;
GO
/*
1. All invisible TAB, Carriage Return, and Line Feed characters will be replaced with spaces.
2. Then leading and trailing spaces are removed from the value.
3. Further, contiguous occurrences of more than one space will be replaced with a single space.
*/
CREATE FUNCTION dbo.udf_tokenize(@input VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
RETURN (SELECT CAST('<r><![CDATA[' + @input + ' ' + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)'));
END
GO
-- DDL and sample data population, start
DECLARE @mockTbl TABLE (ID INT IDENTITY(1,1), col_1 VARCHAR(100), col_2 VARCHAR(100));
INSERT INTO @mockTbl (col_1, col_2) VALUES
(CHAR(13) + ' FL ' + CHAR(9), CHAR(10) + ' Miami'),
(' FL ', ' Fort Lauderdale '),
(' NY ', ' New York '),
(' NY ', ''),
(' NY ', NULL);
-- DDL and sample data population, end
-- before
SELECT *, LEN(col_2) AS [col_2_len]
FROM @mockTbl;
-- remove invisible white space chars
UPDATE @mockTbl
SET col_1 = dbo.udf_tokenize(col_1)
, col_2 = dbo.udf_tokenize(col_2);
-- after
SELECT *, LEN(col_2) AS [col_2_len]
FROM @mockTbl;
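If rows still break in Excel after the cleanup, it may help to look at the raw character codes in the field before deciding what else to strip. The following is only a sketch: the sample string and its embedded tab are made up for illustration, and sys.all_objects is used merely as a convenient row source for numbering character positions.

```sql
-- Hypothetical sample value; in practice, load v.VisitReason for one affected row.
DECLARE @s VARCHAR(MAX) = 'complaining of rash, see' + CHAR(9) + 'note';

-- Number every character position, then report any control character found.
WITH positions AS (
    SELECT TOP (DATALENGTH(@s))
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
    FROM sys.all_objects   -- assumed to hold at least as many rows as the string has characters
)
SELECT pos,
       ASCII(SUBSTRING(@s, pos, 1)) AS char_code
FROM positions
WHERE ASCII(SUBSTRING(@s, pos, 1)) < 32
   OR ASCII(SUBSTRING(@s, pos, 1)) = 127;
```

Any code this reports that the existing REPLACE calls do not cover (for example 9 for a tab) is a candidate for one more REPLACE(..., CHAR(n), ' ').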

Related

Blank Space in every row of table SQL

Hello, I have a table with rows, and I was doing a simple
select * from table where column = 'string'
and it gives me back no result, but when I use:
select * from table where column like '%string%'
it gives me the row that exists in my table.
Then I did a select * from table and noticed that there is a blank space before my rows:
[Image of my SQL result]
If you look closely, there's a space at the beginning of the second row, and only the first row has no blank space.
So I thought it was a simple white space at the beginning, but when I tried using this:
SELECT LTRIM(RTRIM(MATERIAL)) FROM table
nothing happened.
Then I tried to copy the result of my
select * from table
to Excel and noticed this:
[Image: Excel paste from SQL]
My 2nd row got split into 2 rows right at the start of the column 'material', so the thing I thought was a blank space is something like a line break.
I have never had this problem or seen this before.
Larnu has commented how to remove all the linebreaks from the data. Here are some other things that could also work, and slightly differently depending on the effect you want:
--trim everything that is not a number or letter off the left hand side only
UPDATE table SET material = SUBSTRING(material, PATINDEX('%[0-9a-z]%', material), 99999)
--convert all linebreaks to spaces and trim off the left and right spaces
UPDATE table SET material = RTRIM(LTRIM(REPLACE(material, CHAR(10), ' ')))
Larnu's SQL isn't wrong, it'll just remove every line break anywhere, which may cause more formatting disruption than is wanted. I'd be tempted to replace all the linebreaks with spaces, as two words that are separated by a line break would remain separated by a space rather than become one word if the space was removed
some
word
-> some word (if you replace linebreak with space)
-> someword (if you replace linebreak with nothing)
If all you want is to remove linebreaks from the left side of the field, the PATINDEX method will search the field for the first occurrence of a number or a letter and return its index; SUBSTRING will then cut everything from that index for a length of 99999 (use a bigger number if your field is longer). This has the effect of removing only the linebreaks at the start of the field.
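As a quick check of the PATINDEX approach, here is a small sketch on a made-up value (the variable name and data are illustrative only). Note that PATINDEX takes the pattern first and needs % wildcards around the character class.

```sql
-- Hypothetical value: CR, LF and a space in front of the real content.
DECLARE @material VARCHAR(50) = CHAR(13) + CHAR(10) + ' 12345';

SELECT PATINDEX('%[0-9a-z]%', @material) AS first_alnum_pos,   -- 4: the '1' follows CR, LF, space
       SUBSTRING(@material, PATINDEX('%[0-9a-z]%', @material), 99999) AS cleaned;   -- '12345'
```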
As to how it happened: whoever inserted the data, or the data import program, made some mistakes when cutting up the data. Perhaps it was a Windows-style text file, whose line endings are CR LF (ASCII 13 followed by 10), and the program that did the import decided to cut the file up based on the 13 only, leaving behind the 10 to become "part of" the material field:
this,is,my,data1<13><10>this,is,my,data2<13><10>
//now lets cut it up into 2 records, based on using <13> only to denote the end of line:
record 1= this,is,my,data1
record 2= <10>this,is,my,data2
The program just sees a stream of bytes, it is we humans that interpret "lines". If the program treats 13 as the separator, then all the 10s get left behind as part of the data that gets inserted. The very first record in the file won't have 13/10 (crlf) before it because it's the first line, so one of your rows (the one with ascii (49)) won't suffer this problem
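To see the leftover line feed for yourself, a sketch along these lines may help. The table variable and its two values are invented for the example; it shows that LTRIM/RTRIM leave CHAR(10) in place while the code of the first character gives it away.

```sql
-- Hypothetical data: the second row kept the LF that the import left behind.
DECLARE @t TABLE (material VARCHAR(50));
INSERT INTO @t (material) VALUES ('1001'), (CHAR(10) + '1002');

SELECT material,
       ASCII(LEFT(material, 1))             AS first_char_code,    -- 49 for '1', 10 for a line feed
       LEN(LTRIM(RTRIM(material)))          AS len_after_trim,     -- trimming does not remove CHAR(10)
       LEN(REPLACE(material, CHAR(10), '')) AS len_after_replace   -- REPLACE does
FROM @t;
```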
You could "cure" the bad data with a trigger upon insert:
CREATE TRIGGER prevent_bad_data
ON yourtable
INSTEAD OF INSERT
AS
BEGIN
INSERT INTO yourtable(somecolumn,othercolumn,material)
SELECT somecolumn,
othercolumn,
LTRIM(REPLACE(material, CHAR(10), ' '))
FROM Inserted
END
Or you could program the db to reject bad rows and fix the tool that is inserting the bad data:
ALTER TABLE yourtable
ADD CONSTRAINT prevent_bad_material
CHECK (material LIKE '[0-9a-z]%'); --check it starts with a number or letter
Edit: though having seen your updated question with screenshots, the material column really should be a number, not a varchar type, then this wouldn't happen

SQL Server search using like while ignoring blank spaces

I have a phone column in the database, and the records contain unwanted spaces on the right. I tried to use trim and replace, but it didn't return the correct results.
If I use
phone like '%2581254%'
it returns
customerid
-----------
33470
33472
33473
33474
but I need to use the percent sign (wildcard) at the beginning only, that is, on the left side only.
So if I use it like this
phone like '%2581254'
I get nothing, because of the spaces on the right!
So I tried to use trim and replace, and I get one result only
LTRIM(RTRIM(phone)) LIKE '%2581254'
returns
customerid
-----------
33474
Note that these four ids have same phone number!
Table data
customerid phone
-------------------------------------
33470 96506217601532388254
33472 96506217601532388254
33473 96506217601532388254
33474 96506217601532388254
33475 966508307940
I added many digits for testing purposes.
The PHP function takes the last 7 digits and compares them.
For example
01532388254 will be 2581254
and I want to search for all users that have these 7 digits in their phone number
2581254
I can't figure out where's the problem!
It should return 4 ids instead of 1 id
Given the sample data, I suspect you have control characters in your data. For example char(13), char(10)
To confirm this, just run the following
Select customerid,phone
From YourTable
Where CharIndex(CHAR(0),[phone])+CharIndex(CHAR(1),[phone])+CharIndex(CHAR(2),[phone])+CharIndex(CHAR(3),[phone])
+CharIndex(CHAR(4),[phone])+CharIndex(CHAR(5),[phone])+CharIndex(CHAR(6),[phone])+CharIndex(CHAR(7),[phone])
+CharIndex(CHAR(8),[phone])+CharIndex(CHAR(9),[phone])+CharIndex(CHAR(10),[phone])+CharIndex(CHAR(11),[phone])
+CharIndex(CHAR(12),[phone])+CharIndex(CHAR(13),[phone])+CharIndex(CHAR(14),[phone])+CharIndex(CHAR(15),[phone])
+CharIndex(CHAR(16),[phone])+CharIndex(CHAR(17),[phone])+CharIndex(CHAR(18),[phone])+CharIndex(CHAR(19),[phone])
+CharIndex(CHAR(20),[phone])+CharIndex(CHAR(21),[phone])+CharIndex(CHAR(22),[phone])+CharIndex(CHAR(23),[phone])
+CharIndex(CHAR(24),[phone])+CharIndex(CHAR(25),[phone])+CharIndex(CHAR(26),[phone])+CharIndex(CHAR(27),[phone])
+CharIndex(CHAR(28),[phone])+CharIndex(CHAR(29),[phone])+CharIndex(CHAR(30),[phone])+CharIndex(CHAR(31),[phone])
+CharIndex(CHAR(127),[phone]) >0
If the Test Results are Positive
The following UDF can be used to strip the control characters from your data via an update
Update YourTable Set Phone=[dbo].[udf-Str-Strip-Control](Phone)
The UDF if Interested
CREATE FUNCTION [dbo].[udf-Str-Strip-Control](@S varchar(max))
Returns varchar(max)
Begin
;with cte1(N) As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
cte2(C) As (Select Top (32) Char(Row_Number() over (Order By (Select NULL))-1) From cte1 a,cte1 b)
Select @S = Replace(@S,C,' ')
From cte2
Return LTrim(RTrim(Replace(Replace(Replace(@S,' ','><'),'<>',''),'><',' ')))
End
--Select [dbo].[udf-Str-Strip-Control]('Michael '+char(13)+char(10)+'LastName') --Returns: Michael LastName
As promised (and nudged by Bill), the following is a little commentary on the UDF.
We pass a string that we want stripped of control characters.
We create an ad-hoc tally table of ASCII characters 0 - 31.
We then run a global search-and-replace for each character in the tally table. Each character found will be replaced with a space.
The final string is stripped of repeating spaces (a little trick Gordon demonstrated several weeks ago - don't have the original link).
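The repeating-space trick in the last step can be followed one REPLACE at a time. This is just an illustrative sketch on a made-up string; it assumes the data itself never contains the characters < or >, which the trick borrows as temporary markers.

```sql
-- Hypothetical input: four spaces between a and b, two between b and c.
DECLARE @s VARCHAR(100) = 'a    b  c';

SELECT REPLACE(@s, ' ', '><')                                       AS step1,  -- 'a><><><><b><><c'
       REPLACE(REPLACE(@s, ' ', '><'), '<>', '')                    AS step2,  -- 'a><b><c'
       REPLACE(REPLACE(REPLACE(@s, ' ', '><'), '<>', ''), '><', ' ') AS step3; -- 'a b c'
```

Every space becomes the pair ><, each adjoining <> formed between neighbouring pairs is removed, and the single >< that survives per run is turned back into one space.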

Query for blank white space before AND after a number string

How would I go about constructing a query that would return all material numbers that have a blank white space either BEFORE or AFTER the number string? We are exporting straight from SSMS to Excel and we see the problem in the spreadsheet. If I could return all of the material numbers with spaces, I could go in and edit them or do a replace to fix this issue prior to exporting. (The material numbers are imported via a Windows application to which users upload an Excel template. This template has all of this data, and sometimes users put spaces in or after the material number.) The query we have used to work, but now it does not return anything; yet upon export we identify the problems you see highlighted in the screenshot (left screenshot), and then we query to find that material number in the table (right screenshot). And indeed, it has a space before the 1.
Currently the query we use looks like:
SELECT Mtrl
FROM dbo.Source
WHERE Mtrl LIKE '% %'
Since you are getting the data from a query, you should just have that query remove any potential spaces using LTRIM and RTRIM:
LTRIM(RTRIM([MTRL]))
Keep in mind that these two commands remove only spaces, not tabs or returns or other white-space characters.
Doing the above will make sure that the data for the entire set of data is fine, whether or not you find it and/or fix it.
Or, since you are copying-and-pasting from the Results Grid into Excel, you can just CONVERT the value to a number which will naturally remove any spaces:
SELECT CONVERT(INT, ' 12 ');
Returns:
12
So you would just use:
CONVERT(INT, [MTRL])
Now, if you want to find the data that has anything that is not a digit in it, you would use this:
SELECT Mtrl
FROM dbo.Source
WHERE [Mtrl] LIKE '%[^0-9]%'; -- any single non-digit character
If the issue is with non-space white-space characters, you can find out which ones they are via the following (to find them at the beginning instead of at the end, change the RIGHT to be LEFT):
;WITH cte AS
(
SELECT UNICODE(RIGHT([MTRL], 1)) AS [CharVal]
FROM dbo.Source
)
SELECT *
FROM cte
WHERE cte.[CharVal] NOT BETWEEN 48 AND 57 -- digits 0 - 9
AND cte.[CharVal] <> 32; -- space
And you can fix in one shot using the following, which removes regular spaces (char 32 via LTRIM/RTRIM), tabs (char 9), and non-breaking spaces (char 160):
UPDATE src
SET src.[Mtrl] = REPLACE(
REPLACE(
LTRIM(RTRIM(src.[Mtrl])),
CHAR(160),
''),
CHAR(9),
'')
FROM dbo.Source src
WHERE src.[Mtrl] LIKE '%[' -- find rows with any of the following characters
+ CHAR(9) -- tab
+ CHAR(32) -- space
+ CHAR(160) -- non-breaking space
+ ']%';
Here I used the same WHERE condition that you have since if there can't be any spaces then it doesn't matter if you check both ends or for any at all (and maybe it is faster to have a single LIKE instead of two).
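If the goal is specifically to list values with something other than a digit at the very start or the very end, a variation like the one below might be used. It is a sketch only: the table variable and sample values are invented, and it assumes Mtrl is a VARCHAR column that should contain nothing but digits.

```sql
-- Hypothetical sample data: leading space, trailing space, and a tab in the middle.
DECLARE @Source TABLE (Mtrl VARCHAR(20));
INSERT INTO @Source (Mtrl)
VALUES ('1001'), (' 1002'), ('1003 '), ('10' + CHAR(9) + '04');

SELECT Mtrl,
       CASE WHEN Mtrl LIKE '[^0-9]%' THEN 1 ELSE 0 END AS bad_start,  -- first character is not a digit
       CASE WHEN Mtrl LIKE '%[^0-9]' THEN 1 ELSE 0 END AS bad_end     -- last character is not a digit
FROM @Source
WHERE Mtrl LIKE '[^0-9]%'
   OR Mtrl LIKE '%[^0-9]';
```

Unlike LIKE '%[^0-9]%', this ignores junk in the middle of the value, so it is narrower than the cleanup query above.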

how do I retrieve data from a sql table with huge number of inputs for a single column

I have a Company table in SQL Server and I would like to retrieve data for a particular set of companies. The list is very large, around 200 company names, and I am trying to use the T-SQL IN clause. That complicates the retrieval because a few of the companies have special characters in their names, like O'Brien, so it throws an error, as is to be expected.
SELECT *
FROM COMPANY
WHERE COMPANYNAME IN
('Archer Daniels Midland'
'Shell Trading (US) Company - Financial'
'Redwood Fund, LLC'
'Bunge Global Agribusiness - Matt Thibodeaux'
'PTG, LLC'
'Morgan Stanley Capital Group'
'Vitol Inc.'..
.....
....
.....)
Above is the script that is not working for obvious reasons, is there any way I can input those company names from an excel file and retrieve the data?
The easiest way would be to make a table and join it:
CREATE TABLE dbo.IncludedCompanies (CompanyName varchar(1000));
INSERT INTO dbo.IncludedCompanies
VALUES
('Archer Daniels Midland'),
('PTG, LLC')
...
SELECT *
FROM Company C
JOIN IncludedCompanies IC
ON C.CompanyName = IC.CompanyName
I do not think that mysql knows how to handle excel format, but you can fix your query.
Check how complicated names are stored in the database (check whether they have escape characters in them or anything else).
Replace all ' with \' in your query and it will take care of the ' characters
mysql> select now() as 'O\'Brian'; returns
O'Brian
2014-03-17 15:06:39
So I'm guessing you have an Excel sheet with a column containing these names, and you want to use this in your WHERE clause. In addition, some of the values have special characters in them, which need to be escaped.
First, escape the ' characters. You do this in Excel, with a search and replace of all occurrences of ' with '' (the escaped version in SQL Server; it is \' in MySQL). Then create a new column on each side of your companies column, and in the first row input a ' on the left-hand side and ', on the right. Then use the copy-cell functionality (the little square in the bottom right of the cell when you select it) to copy the cells on the left and right to all the rows, as far as the company list goes (just grab the square and pull it downwards).
Then, take your list, now containing three columns and x rows and paste it into your favorite text editor. It should look something like this:
' Company#1 ',
' Company with special '' char ',
[...]
' Last company ',
Now, you will have some whitespace to get rid of. Use search and replace to replace two space characters with nothing, and repeat (or take the space from the first ' to the start of the text and replace this with nothing).
Now, you should have a list of:
'Company#1',
'Company with special '' char',
[...]
'Last company',
Remove the last comma, and you'll have a valid list of parameters to your in-clause (or a (temporary) table if you want to keep your query a bit cleaner.)
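For reference, this is roughly what the finished list looks like when used in SQL Server. The sketch uses a table variable and a made-up name, 'O''Brien Energy', purely for illustration; the point is that a doubled single quote inside the literal stands for one apostrophe.

```sql
-- Hypothetical stand-in for the Company table.
DECLARE @Company TABLE (CompanyName VARCHAR(100));
INSERT INTO @Company (CompanyName)
VALUES ('O''Brien Energy'), ('PTG, LLC'), ('Vitol Inc.');

-- Commas separate the list items; '' inside a literal is a single apostrophe.
SELECT CompanyName
FROM @Company
WHERE CompanyName IN ('O''Brien Energy',
                      'PTG, LLC');
```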

Searching for postcode when space exists or does not exist

I'm running a SQL search query to bring up records that match a postcode.
Say I have a postcode:
'CB4 1AB'
If the database has (with and without a space)
cb41ab
cb4 1ab
or I search with (with and without a space)
cb41ab
cb4 1ab
I want it to bring back the record.
How can I do it?
select addr1, addr2, postcode
from addresses p
where p.postcode LIKE 'cb%'
Thanks
You can try something like this:
select addr1, addr2, postcode
from addresses p
where replace(p.postcode, ' ', '') LIKE 'cb41ab'
So, it sounds like you're going for this. Expanding on other answers:
DECLARE @input VARCHAR(50)
SET @input = 'CB4 1AB'
SELECT addr1, addr2, postcode
FROM addresses p
WHERE REPLACE(p.postcode, ' ', '') = REPLACE(@input, ' ', '')
EDIT: I removed the "LIKE" since this should cover all above cases.
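A self-contained way to try the idea of stripping spaces on both sides of the comparison is sketched below. The table variable and addresses are made up, and the match between 'cb4' and 'CB4' assumes a case-insensitive collation, which is a common default but not guaranteed.

```sql
-- Hypothetical stand-in for the addresses table.
DECLARE @addresses TABLE (addr1 VARCHAR(50), postcode VARCHAR(10));
INSERT INTO @addresses (addr1, postcode)
VALUES ('1 High St', 'cb41ab'),
       ('2 High St', 'CB4 1AB'),
       ('3 Low St',  'CB1 2XY');

DECLARE @search VARCHAR(50) = 'cb4 1ab';   -- user input, with or without the space

-- Remove spaces from both the stored value and the search term before comparing.
SELECT addr1, postcode
FROM @addresses
WHERE REPLACE(postcode, ' ', '') = REPLACE(@search, ' ', '');
```

Wrapping the column in REPLACE generally prevents an index on postcode from being used for a seek, which is one reason to store the value in a single normalized form if that is an option.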
The question is unclear to me, so I have multiple answers:
a) If you look for a specific UK post code, do it like this: LIKE 'cb4%1ab'. Both versions will be returned (the "spaced" ones and the ones without a space).
b) If you use LIKE 'cb%', that won't give any trouble either. The field will be returned whether or not it has a space in it.
c) If you want to find a post code specifically having or not having a space, say LIKE 'cb4_1%' for post codes with a space and LIKE 'cb41%' for post codes without one, or even better, look at the field length instead (6 or 7 characters long), so LENGTH([fieldname]) = 6 [or 7].
d) The user enters a postcode and you don't know whether the value has a space in it: I'd say a code revision is needed; this shouldn't be sorted out on the SQL/server side. If that's not possible, the REPLACE added by others is just fine. If both "d" and "a" are true in your case, use REPLACE to change the space into "%", so whatever the input was and whatever is stored in the db (with or without a space), you'll get the result.
Last but not least, just a piece of advice: if you have the chance to make the field's values uniform, do it (set the field max length to 6 and/or update the table and remove the space from values).
Hope this helps!
[EDIT: added 'd' option :) ]