It seems that a regular expression would be ideal, yet some team members are not fond of regex...
Problem: Data in a column (from a mainframe flat file import) comes in two different forms:
BreakID = 83823737237
OR
MFR BreakID=482883
Thus, the differences are: there may or may not be a space before the numerics, the length of the alphabetic text before the equals sign varies, and the length of the number varies.
Seems I have a few approaches,
1. Everything after the = sign , and trim ?
2. regex , get only the numerics?
So I found this code, and I assume PATINDEX is the standard way of doing regex in T-SQL? What is "string" in this?
SELECT SUBSTRING(string, PATINDEX('%[0-9]%', string), PATINDEX('%[0-9][^0-9]%', string + 't') - PATINDEX('%[0-9]%',
string) + 1) AS Number
How would this be solved with best practices?
Slightly different answer than scsimon. I usually go this route when I have to grab the vals at the end of a string. You reverse the string and grab position of the first instance of your key value ('=' in this case). Get that position with charindex, and then grab the RIGHT() chars using that charindex value.
DECLARE @val1 VARCHAR(100) = 'BreakID = 83823737237'
DECLARE @val2 VARCHAR(100) = 'MFR BreakID=482883'
SELECT
LTRIM(RTRIM(RIGHT(@val1, CHARINDEX('=', REVERSE(@val1), 0)-1)))
,LTRIM(RTRIM(RIGHT(@val2, CHARINDEX('=', REVERSE(@val2), 0)-1)))
This solution will play nice if you have weird cases, like if you have a company called SQL=Cool in your data and it needs an ID:
'SQL=CoolID = 12345'
and you wanted to still get 12345.
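As a quick check of that edge case, here is a minimal sketch. The variable name @val3 is made up for illustration, and the second column is only there for comparison: it takes everything after the first = instead of the last one.

```sql
-- Hypothetical test value containing two '=' characters.
DECLARE @val3 VARCHAR(100) = 'SQL=CoolID = 12345';

SELECT
    -- Last '=': reverse the string, find the first '=' from the right,
    -- then take that many characters (minus the '=' itself) from the end.
    LTRIM(RTRIM(RIGHT(@val3, CHARINDEX('=', REVERSE(@val3)) - 1))) AS AfterLastEquals,
    -- First '=': everything after it, including the company-name remainder.
    LTRIM(RTRIM(SUBSTRING(@val3, CHARINDEX('=', @val3) + 1, 100))) AS AfterFirstEquals;
```

AfterLastEquals comes back as 12345, while AfterFirstEquals still carries the 'CoolID = ' prefix, which is why the reversed lookup is the safer choice when the text before the ID may itself contain an equals sign.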
Seems like a good use case for substring and replace with charindex
We take the substring starting with the first character after the = and running for up to 99 characters (or however many you want to allow). We use replace to get rid of the leading space, if there is one; note that it removes any other spaces in that substring as well.
select replace(substring(stringColumn,charindex('=',stringColumn) + 1,99),' ','')
That solution is good and versatile, although it sounds like your string will always have an = so you could write something more specific around that if you want to.
That solution finds the start location of the first number string:
PATINDEX('%[0-9]%', string)
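And finds the position of the last digit of that number string, i.e. the first digit that is followed by a non-digit (a 't' is appended to the string so that the pattern still matches when the string ends in a number; without it PATINDEX returns 0, which would typically give SUBSTRING a negative length and throw an error):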
PATINDEX('%[0-9][^0-9]%', string + 't')
And finally it subtracts the start position of the number from the end position to find the length of the number string, and pulls that length out with substring:
SELECT SUBSTRING(string, PATINDEX('%[0-9]%', string), PATINDEX('%[0-9][^0-9]%', string + 't') - PATINDEX('%[0-9]%',
string) + 1) AS Number
Here "string" is a placeholder that should be replaced with your column name. Also, the easiest way to test stuff like this in T-SQL is to use a variable:
DECLARE @string varchar(100) = 'foo bar la la la 83823737237'
SELECT SUBSTRING(@string, PATINDEX('%[0-9]%', @string), PATINDEX('%[0-9][^0-9]%', @string + 't') - PATINDEX('%[0-9]%',
@string) + 1) AS Number
Output:
83823737237
Kaizen: go for the simple solution, not the perfect one
SELECT substring(c, charindex('=', c) + 1, 999)
I'm assuming the column you're putting this in is some kind of number. SQL Server doesn't care about leading spaces when casting to a number.
If it's going in a string column then wrap it in an ltrim().
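A minimal sketch of both variants, assuming the extraction starts one character after the = sign. The column alias c and the inline sample rows are made up for illustration, and bigint is used because the first sample value does not fit in an int.

```sql
-- Two sample rows standing in for the imported column (hypothetical data).
SELECT
    v.c,
    -- Numeric target: the leading space is ignored by the conversion.
    CAST(SUBSTRING(v.c, CHARINDEX('=', v.c) + 1, 999) AS bigint) AS AsNumber,
    -- String target: strip the leading space explicitly.
    LTRIM(SUBSTRING(v.c, CHARINDEX('=', v.c) + 1, 999))          AS AsString
FROM (VALUES ('BreakID = 83823737237'),
             ('MFR BreakID=482883')) AS v(c);
```

The CAST will fail on any row where the text after the = is not a valid number, so on versions that have it, TRY_CAST may be the more forgiving option for dirty import data.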
Now to your questions
1 .. trim
Sure, as above
2 regex...
Not implemented in sqlserver unless you use CLR
PATINDEX ...
It's like regex but it's a very limited subset that only does searching, only returns one string index, doesn't capture, has limited/no character classes. It's more like dos/vb6 wildcards/like than regex
...best practice?
Look at it simply; you're getting the part of a string after an =, not landing on the moon. The best solution to minor optimisations like these is the one that requires the least amount of mental effort from the next human who takes over your job to get up to speed with it (it'll still be in use in 20 years) :)
Related
I have a column in which data has letters with numbers.
For example:
1 name
2 names ....
100 names
When I sort this data, it is not sorted correctly. How can I fix this? I wrote a query, but it doesn't sort correctly.
select name_subagent
from Subagent
order by
case IsNumeric(name_subagent)
when 1 then Replicate('0', 100 - Len(name_subagent)) + name_subagent
else name_subagent
end
This should work
select name_subagent
from Subagent
order by CAST(LEFT(name_subagent, PATINDEX('%[^0-9]%', name_subagent + 'a') - 1) as int)
This expression finds the first occurrence of a non-digit character within the string and assumes anything prior to this position is a number.
You will need to adapt this statement to your needs as apparently your data is not in Latin characters.
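To try the ordering without touching the real table, here is a small sketch that uses a table variable with made-up rows; @Subagent and its sample values are assumptions for illustration only.

```sql
-- Hypothetical stand-in for the Subagent table.
DECLARE @Subagent TABLE (name_subagent VARCHAR(100));

INSERT INTO @Subagent (name_subagent)
VALUES ('100 names'), ('2 names'), ('1 name'), ('10 names');

SELECT name_subagent
FROM @Subagent
ORDER BY
    -- Leading digits, converted to a number; the appended 'a' guarantees
    -- PATINDEX finds a non-digit even when the value is all digits.
    CAST(LEFT(name_subagent, PATINDEX('%[^0-9]%', name_subagent + 'a') - 1) AS INT),
    name_subagent;   -- tie-breaker for rows with the same leading number
```

With these rows the result order is 1 name, 2 names, 10 names, 100 names. Rows with no leading digits convert from an empty string, which SQL Server casts to 0, so they sort first.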
With a bit of tweaking you should be able to achieve exactly what you're looking for:
select
name_subagent
from
Subagent
order by
CAST(SUBSTRING(name_subagent,0,PATINDEX('%[A-Z]%',name_subagent)) as numeric)
Note, the '%[A-Z]%' expression. This will only look for the first occurrence of a letter within the string.
I'm not considering special characters such as '!', '#' and so on. This is the bit you might want to play around with and adapt to your needs.
I have the following test table in SQL Server 2005:
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] NOT NULL,
[TestField] [varchar](100) NOT NULL
)
Populated with:
INSERT INTO TestTable (ID, TestField) VALUES (1, 'A value'); -- Len = 7
INSERT INTO TestTable (ID, TestField) VALUES (2, 'Another value      '); -- Len = 13 + 6 spaces
When I try to find the length of TestField with the SQL Server LEN() function it does not count the trailing spaces - e.g.:
-- Note: The grid view of the results also does not show the trailing spaces of TestField (SQL Server 2005).
SELECT
ID,
TestField,
LEN(TestField) As LenOfTestField -- Does not include trailing spaces
FROM
TestTable
How do I include the trailing spaces in the length result?
This is clearly documented by Microsoft in MSDN at http://msdn.microsoft.com/en-us/library/ms190329(SQL.90).aspx, which states LEN "returns the number of characters of the specified string expression, excluding trailing blanks". It is, however, an easy detail to miss if you're not wary.
You need to instead use the DATALENGTH function - see http://msdn.microsoft.com/en-us/library/ms173486(SQL.90).aspx - which "returns the number of bytes used to represent any expression".
Example:
SELECT
ID,
TestField,
LEN(TestField) As LenOfTestField, -- Does not include trailing spaces
DATALENGTH(TestField) As DataLengthOfTestField -- Shows the true length of data, including trailing spaces.
FROM
TestTable
You can use this trick:
LEN(Str + 'x') - 1
I use this method:
LEN(REPLACE(TestField, ' ', '.'))
I prefer this over DATALENGTH because this works with different data types, and I prefer it over adding a character to the end because you don't have to worry about the edge case where your string is already at the max length.
Note: I would test the performance before using it against a very large data set; though I just tested it against 2M rows and it was no slower than LEN without the REPLACE...
"How do I include the trailing spaces in the length result?"
You get someone to file a SQL Server enhancement request/bug report because nearly all the listed workarounds to this amazingly simple issue here have some deficiency or are inefficient. This still appears to be true in SQL Server 2012. The auto trimming feature may stem from ANSI/ISO SQL-92 but there seem to be some holes (or lack of counting them).
Please vote up "Add setting so LEN counts trailing whitespace" here:
https://feedback.azure.com/forums/908035-sql-server/suggestions/34673914-add-setting-so-len-counts-trailing-whitespace
Retired Connect link:
https://connect.microsoft.com/SQLServer/feedback/details/801381
There are problems with the two top voted answers. The answer recommending DATALENGTH is prone to programmer errors. The result of DATALENGTH must be divided by 2 for NVARCHAR types, but not for VARCHAR types. This requires knowledge of the type you're getting the length of, and if that type changes, you have to diligently change the places you used DATALENGTH.
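The byte-versus-character difference is easy to see with two variables holding the same text. This is only an illustrative sketch; @v and @n are made-up names, and the value is 'abc' followed by two trailing spaces.

```sql
DECLARE @v VARCHAR(20)  = 'abc  ';   -- 1 byte per character here
DECLARE @n NVARCHAR(20) = N'abc  ';  -- 2 bytes per character here

SELECT
    LEN(@v)        AS LenVarchar,    -- 3: trailing spaces are not counted
    DATALENGTH(@v) AS BytesVarchar,  -- 5: bytes, trailing spaces included
    LEN(@n)        AS LenNvarchar,   -- 3
    DATALENGTH(@n) AS BytesNvarchar; -- 10: must be halved to get characters
```

So code that uses DATALENGTH as a character count silently doubles its answer the day a column is changed from VARCHAR to NVARCHAR.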
There is also a problem with the most upvoted answer (which I admit was my preferred way to do it until this problem bit me). If the thing you are getting the length of is of type NVARCHAR(4000), and it actually contains a string of 4000 characters, SQL will ignore the appended character rather than implicitly cast the result to NVARCHAR(MAX). The end result is an incorrect length. The same thing will happen with VARCHAR(8000).
What I've found works, is nearly as fast as plain old LEN, is faster than LEN(@s + 'x') - 1 for large strings, and does not assume the underlying character width is the following:
DATALENGTH(@s) / DATALENGTH(LEFT(LEFT(@s, 1) + 'x', 1))
This gets the datalength, and then divides by the datalength of a single character from the string. The append of 'x' covers the case where the string is empty (which would give a divide by zero in that case). This works whether @s is VARCHAR or NVARCHAR. Doing the LEFT of 1 character before the append shaves some time when the string is large. The problem with this though, is that it does not work correctly with strings containing surrogate pairs.
There is another way mentioned in a comment to the accepted answer, using REPLACE(@s,' ','x'). That technique gives the correct answer, but is a couple orders of magnitude slower than the other techniques when the string is large.
Given the problems introduced by surrogate pairs on any technique that uses DATALENGTH, I think the safest method that gives correct answers that I know of is the following:
LEN(CONVERT(NVARCHAR(MAX), @s) + 'x') - 1
This is faster than the REPLACE technique, and much faster with longer strings. Basically this technique is the LEN(@s + 'x') - 1 technique, but with protection for the edge case where the string has a length of 4000 (for nvarchar) or 8000 (for varchar), so that the correct answer is given even for that. It also should handle strings with surrogate pairs correctly.
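A small sketch of the maximum-length edge case, for anyone who wants to see it fail. The test value is an assumption chosen for illustration: one non-space character followed by 3999 spaces, which exactly fills an NVARCHAR(4000).

```sql
-- Hypothetical worst case: the variable is already at its maximum length.
DECLARE @s NVARCHAR(4000) = N'S' + REPLICATE(N' ', 3999);

SELECT
    LEN(@s)                                    AS PlainLen,        -- trailing spaces ignored
    LEN(@s + N'x') - 1                         AS AppendOnly,      -- the appended 'x' is truncated away
    LEN(CONVERT(NVARCHAR(MAX), @s) + N'x') - 1 AS AppendAfterMax;  -- all 4000 characters counted
```

PlainLen should come back as 1 and AppendAfterMax as 4000. AppendOnly ends up at 0, because the concatenation of two non-MAX strings is cut off at 4000 characters, so the 'x' never makes it into the value that LEN sees.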
LEN cuts trailing spaces by default, so I found this worked as you move them to the front
LEN(REVERSE(TestField))
So if you wanted to, you could say
SELECT
t.TestField,
LEN(REVERSE(t.TestField)) AS [Reverse],
LEN(t.TestField) AS [Count]
FROM TestTable t
WHERE LEN(REVERSE(t.TestField)) <> LEN(t.TestField)
Don't use this for leading spaces of course.
You also need to ensure that your data is actually saved with the trailing blanks. When ANSI_PADDING is OFF (non-default):
"Trailing blanks in character values inserted into a varchar column are trimmed."
You should define a CLR function that returns the String's Length field, if you dislike string concatenation.
I use LEN('x' + @string + 'x') - 2 in my production use-cases.
If you dislike DATALENGTH because of n/varchar concerns, how about:
select DATALENGTH(@var)/isnull(nullif(DATALENGTH(left(@var,1)),0),1)
which is just
select DATALENGTH(@var)/DATALENGTH(left(@var,1))
wrapped with divide-by-zero protection.
By dividing by the DATALENGTH of a single char, we get the length normalised.
(Of course, still issues with surrogate-pairs if that's a concern.)
This is the best algorithm I've come up with which copes with the maximum length and variable byte count per character issues:
ISNULL(LEN(STUFF(@Input, 1, 1, '') + '.'), 0)
This is a variant of the LEN(@Input + '.') - 1 algorithm but by using STUFF to remove the first character we ensure that the modified string doesn't exceed maximum length and remove the need to subtract 1.
ISNULL(..., 0) is added to deal with the case where @Input = '' which causes STUFF to return NULL.
This does have the side effect that the result is also 0 when @Input is NULL which is inconsistent with LEN(NULL) which returns NULL, but this could be dealt with by logic outside this function if need be.
Here are the results using LEN(@Input), LEN(@Input + '.') - 1, LEN(REPLACE(@Input, ' ', '.')) and the above STUFF variant, using a sample of @Input = CAST(' S' + SPACE(3998) AS NVARCHAR(4000)) over 1000 iterations:
Algorithm    DataLength   ExpectedResult   Result   ms
LEN          8000         4000             2        14
+DOT-1       8000         4000             1        13
REPLACE      8000         4000             4000     514
STUFF+DOT    8000         4000             4000     0
In this case the STUFF algorithm is actually faster than LEN()!
I can only assume that internally SQL looks at the last character and if it is not a space then optimizes the calculation
But that's a good result eh?
Don't use the REPLACE option unless you know your strings are small - it's hugely inefficient
use
SELECT DATALENGTH('string ')
I have found many answers that have pointed me on the right track for what I want, like this.
However, if I had a string like "this attachment will be 22 in. x 15 in. long" I would want to grab "22" (the first int value in the string). Also, I won't know how large (# of characters) the int value will be so I'll need to find the WHOLE first int value until the next special character/space in the string.
Any idea how I can go about doing this?
Assuming there really is a number in the string, you can use patindex():
select left(s, patindex('%[^0-9]%', s) - 1)
from (select substring(col, patindex('%[0-9]%', col), len(col)) as s
from t
) t;
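Here is a self-contained sketch of the same idea run against the sample sentence from the question. The variable @s and the alias tail are made up for illustration, and the appended space is an addition of mine so that PATINDEX still finds a non-digit when the number is the last thing in the string.

```sql
DECLARE @s VARCHAR(100) = 'this attachment will be 22 in. x 15 in. long';

SELECT
    -- Keep the leading run of digits; the appended ' ' guards against
    -- PATINDEX returning 0 when the string ends with the number.
    LEFT(x.tail, PATINDEX('%[^0-9]%', x.tail + ' ') - 1) AS FirstInt
FROM (
    -- Drop everything before the first digit.
    SELECT SUBSTRING(@s, PATINDEX('%[0-9]%', @s), LEN(@s)) AS tail
) AS x;
```

For this input FirstInt is 22. The result is still a string, so wrap it in a CAST if a numeric type is needed.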
Is it possible to delete part of string using regexp (or something else, may be something like CHARINDEX could help) in SQL query?
I use MS SQL Server (2008 most likely).
Example: I have strings like "[some useless info] Useful part of string" I want to delete parts with text in brackets if they are in line.
Use REPLACE
for example :
UPDATE authors SET city = replace(city, 'To Remove', 'With BLACK or Whatever')
WHERE city LIKE 'Salt%'; -- with where condition
You can use the PATINDEX function. It's not a complete regular expression implementation but you can use it for simple things.
PATINDEX (Transact-SQL): "Returns the starting position of the first occurrence of a pattern in a specified expression, or zeros if the pattern is not found, on all valid text and character data types."
OR You can use CLR to extend the SQL Server with a complete regular expression implementation.
SQL Server 2005: CLR Integration
SELECT * FROM temp where replace(replace(replace(url,'http://',''),'www.',''),'https://','')='"+url+"';
You can use STUFF to insert a string into another string. It deletes a specified length of characters in the first string at the start position and then inserts the second string into the first string at the start position.
For example, the code below, replaces the 5 with 666666:
DECLARE @Variable NVARCHAR(MAX) = '12345678910'
SELECT STUFF(@Variable, 5, 1, '666666')
Note that the second argument is not a string, it is a position, and you can calculate that position using CHARINDEX, for example.
Here is your case:
DECLARE @Variable NVARCHAR(MAX) = '[some useless info] Useful part of string'
SELECT STUFF(
@Variable
,CHARINDEX('[', @Variable)
,LEN(SUBSTRING(@Variable, CHARINDEX('[', @Variable), CHARINDEX(']', @Variable) - LEN(SUBSTRING(@Variable, 0, CHARINDEX('[', @Variable)))))
,''
)
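The length argument can also be computed directly from the two bracket positions. The following is a sketch of that variation, not the answer's original code; @open and @close are names made up for illustration, and the CASE guard for strings without brackets is an addition of mine.

```sql
DECLARE @Variable NVARCHAR(MAX) = N'[some useless info] Useful part of string';

-- Position of the opening bracket, and of the first closing bracket after it.
DECLARE @open  INT = CHARINDEX(N'[', @Variable);
DECLARE @close INT = CHARINDEX(N']', @Variable, @open + 1);

SELECT CASE
           WHEN @open > 0 AND @close > @open
               -- Delete from '[' through ']' inclusive, then trim the leftover space.
               THEN LTRIM(STUFF(@Variable, @open, @close - @open + 1, N''))
           ELSE @Variable   -- no bracketed part: leave the value unchanged
       END AS Cleaned;
```

For the sample value this returns 'Useful part of string'. Only the first bracketed section is removed; values with several bracketed parts would need the step repeated.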
In the end, REPLACE, SUBSTRING and PATINDEX helped.
REPLACE(t.badString, Substring(t.badString , Patindex('%[%' , t.badString)+1 , Patindex('%]%' , t.badString)), '').
Thanks to all.
I am trying to remove all words in front of a consistent known substring ("To Find a"). I would like to remove everything in front of "To Find a" in bulk over 600 Description strings. The words in front of this substring are different in all cases. For example (Description 'Some Text, Some More Text…To Find a… Some More Text'). I have read several other posts and have tried using TRIM, CHARINDEX, and SUBSTRING_INDEX.
Thanks for the help!
If this is SQL Server, a relatively easy way to remove the leading bit would be with the help of SUBSTRING and CHARINDEX:
SELECT SUBSTRING(ColumnName, CHARINDEX('To Find a', ColumnName), 2147483647)
FROM YourTable;
The CHARINDEX function finds the position of the substring, and the result is used as SUBSTRING's second argument. The length argument is set to the maximum int value to make sure all the remaining characters to the end of the string are returned. (You don't need to calculate the exact number.) If the substring isn't found, CHARINDEX returns 0. In this context, 0 as the starting position causes the entire string value to be returned.
If you actually want to do the opposite, i.e. keep the leading text and remove the rest, as one of your comments seems to imply, you could try using CHARINDEX and LEFT in this way:
SELECT LEFT(ColumnName, CHARINDEX('To Find a', ColumnName + 'To Find a') - 1)
FROM YourTable;
Again, CHARINDEX returns the position of 'To Find a' in the column value. After subtracting 1, that becomes the length argument of LEFT. To make sure CHARINDEX does find the search term, the term is appended to the value being searched: if the original value doesn't have 'To Find a', CHARINDEX hits the appended bit and returns the position after the last character of the original string, which, when subtracted, becomes the string's exact length.
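To see how both expressions behave when the search term is missing, here is a small sketch using two made-up rows, one with the term and one without. The alias v and the column name Txt are assumptions for illustration.

```sql
SELECT
    v.Txt,
    -- Remove the leading text: start at the term, or at 0 (whole string) if absent.
    SUBSTRING(v.Txt, CHARINDEX('To Find a', v.Txt), 2147483647)       AS FromTerm,
    -- Keep the leading text: the appended term guarantees a match.
    LEFT(v.Txt, CHARINDEX('To Find a', v.Txt + 'To Find a') - 1)      AS BeforeTerm
FROM (VALUES ('Some Text, To Find a thing'),
             ('No marker here')) AS v(Txt);
```

For the first row FromTerm is 'To Find a thing' and BeforeTerm is 'Some Text, '. For the second row both columns return the value unchanged, which is the fallback behaviour described above.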
In SQL Server to select the leading text:
DECLARE @String AS VARCHAR(255) = 'Some Text, Some More Text…To Find a… Some More Text'
SELECT LEFT(@String,CHARINDEX('To Find a',@String)-1)
(Assuming string is consistently present, as stated in question)
To remove the leading text:
DECLARE @String AS VARCHAR(255) = 'Some Text, Some More Text…To Find a… Some More Text'
SELECT RIGHT(@String,CHARINDEX(REVERSE('To Find a'),REVERSE(@String))-1)
If you want to keep the 'To Find a' then you adjust the -1 near the end of the query.
Update:
If 'to find a' isn't in every string, and using your table:
SELECT CASE WHEN CHARINDEX('To Find a',YourField) > 0
THEN LEFT(YourField,CHARINDEX('To Find a',YourField)-1)
ELSE YourField
END AS 'FixedField'
FROM YourTable