I have a column of values that are as shown below:
ID
x-644478134
x-439220334
x-645948923
x-10686843432
x-4273883234
I would like to return a column like so:
ID
644478134
439220334
645948923
10686843432
4273883234
Can someone advise how to do this cleanly? I believe it is something to do with substring but not sure exactly
SELECT SUBSTRING(d.ID)
FROM data d
You need to use substring, charindex and len functions such as below, assuming the dash (-) is the separator of the text and numeric part:
declare #x varchar(50) = 'a-0123'
select substring(#x,CHARINDEX('-',#x,1)+1,len(#x)-CHARINDEX('-',#x,1))
If you are sure that ID always starts with x-, then:
Select substring(ID,3,LEN(ID)-2)
I have a table which contains some bad data I am trying to clean up.
An example of the fields is below
36234735HAN876
2342JOE9823
554444PUT003
What I want to do is remove all the numeric characters before the first alphabetical character so it would look like the below:
HAN876
JOE9823
PUT003
What would be the best way to achieve this? I have used the below method but this can only be used to extract ALL numeric from the string, not the ones before the alphabetical characters
How to get the numeric part from a string using T-SQL?
You could achieve this using PATINDEX to locate the first position of an alphabetical character in the string, and then use SUBSTRING to only return the characters after that position:
CREATE TABLE #temp (val VARCHAR(50));
INSERT INTO #temp VALUES ('36234735HAN876'), ('2342JOE9823'), ('554444PUT003'), ('TEST1234');
SELECT val,
SUBSTRING(val, PATINDEX('%[A-Z]%', val), LEN(val)) AS output
FROM #temp;
DROP TABLE #temp;
Outputs:
val output
36234735HAN876 HAN876
2342JOE9823 JOE9823
554444PUT003 PUT003
TEST1234 TEST1234
Note that I have created a temporary table with a column named val. You should change this to work with whatever the actual column is called.
About case sensitivity: If you are using a non-case sensitive collation this will work without issue. If your collation is case sensitive then you may need to alter the pattern being matched to cater for upper- and lower-case letters.
Use PATINDEX to find the first non-numeric character (or first alpha character, depending on the logic) and STUFF to remove them:
SELECT STUFF(V.YourString,1,ISNULL(NULLIF(PATINDEX('%[^0-9]%',V.YourString),0)-1,0),'')
FROM (VALUES('36234735HAN876'),
('2342JOE9823'),
('554444PUT003'),
('ABC123'))V(YourString)
If the logic is the first alpha character, instead of the first non-numeric, then the pattern would be [A-z].
The NULLIF and ISNULL are in there for when/if the string starts with a alpha/non-numeric and thus doesn't cause STUFF to error due to the 3rd parameter being -1. The is demonstrated with the additional example I put into the sample data ('ABC123').
I have the following test table in SQL Server 2005:
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] NOT NULL,
[TestField] [varchar](100) NOT NULL
)
Populated with:
INSERT INTO TestTable (ID, TestField) VALUES (1, 'A value'); -- Len = 7
INSERT INTO TestTable (ID, TestField) VALUES (2, 'Another value '); -- Len = 13 + 6 spaces
When I try to find the length of TestField with the SQL Server LEN() function it does not count the trailing spaces - e.g.:
-- Note: Also results the grid view of TestField do not show trailing spaces (SQL Server 2005).
SELECT
ID,
TestField,
LEN(TestField) As LenOfTestField, -- Does not include trailing spaces
FROM
TestTable
How do I include the trailing spaces in the length result?
This is clearly documented by Microsoft in MSDN at http://msdn.microsoft.com/en-us/library/ms190329(SQL.90).aspx, which states LEN "returns the number of characters of the specified string expression, excluding trailing blanks". It is, however, an easy detail on to miss if you're not wary.
You need to instead use the DATALENGTH function - see http://msdn.microsoft.com/en-us/library/ms173486(SQL.90).aspx - which "returns the number of bytes used to represent any expression".
Example:
SELECT
ID,
TestField,
LEN(TestField) As LenOfTestField, -- Does not include trailing spaces
DATALENGTH(TestField) As DataLengthOfTestField -- Shows the true length of data, including trailing spaces.
FROM
TestTable
You can use this trick:
LEN(Str + 'x') - 1
I use this method:
LEN(REPLACE(TestField, ' ', '.'))
I prefer this over DATALENGTH because this works with different data types, and I prefer it over adding a character to the end because you don't have to worry about the edge case where your string is already at the max length.
Note: I would test the performance before using it against a very large data set; though I just tested it against 2M rows and it was no slower than LEN without the REPLACE...
"How do I include the trailing spaces in the length result?"
You get someone to file a SQL Server enhancement request/bug report because nearly all the listed workarounds to this amazingly simple issue here have some deficiency or are inefficient. This still appears to be true in SQL Server 2012. The auto trimming feature may stem from ANSI/ISO SQL-92 but there seems to be some holes (or lack of counting them).
Please vote up "Add setting so LEN counts trailing whitespace" here:
https://feedback.azure.com/forums/908035-sql-server/suggestions/34673914-add-setting-so-len-counts-trailing-whitespace
Retired Connect link:
https://connect.microsoft.com/SQLServer/feedback/details/801381
There are problems with the two top voted answers. The answer recommending DATALENGTH is prone to programmer errors. The result of DATALENGTH must be divided by the 2 for NVARCHAR types, but not for VARCHAR types. This requires knowledge of the type you're getting the length of, and if that type changes, you have to diligently change the places you used DATALENGTH.
There is also a problem with the most upvoted answer (which I admit was my preferred way to do it until this problem bit me). If the thing you are getting the length of is of type NVARCHAR(4000), and it actually contains a string of 4000 characters, SQL will ignore the appended character rather than implicitly cast the result to NVARCHAR(MAX). The end result is an incorrect length. The same thing will happen with VARCHAR(8000).
What I've found works, is nearly as fast as plain old LEN, is faster than LEN(#s + 'x') - 1 for large strings, and does not assume the underlying character width is the following:
DATALENGTH(#s) / DATALENGTH(LEFT(LEFT(#s, 1) + 'x', 1))
This gets the datalength, and then divides by the datalength of a single character from the string. The append of 'x' covers the case where the string is empty (which would give a divide by zero in that case). This works whether #s is VARCHAR or NVARCHAR. Doing the LEFT of 1 character before the append shaves some time when the string is large. The problem with this though, is that it does not work correctly with strings containing surrogate pairs.
There is another way mentioned in a comment to the accepted answer, using REPLACE(#s,' ','x'). That technique gives the correct answer, but is a couple orders of magnitude slower than the other techniques when the string is large.
Given the problems introduced by surrogate pairs on any technique that uses DATALENGTH, I think the safest method that gives correct answers that I know of is the following:
LEN(CONVERT(NVARCHAR(MAX), #s) + 'x') - 1
This is faster than the REPLACE technique, and much faster with longer strings. Basically this technique is the LEN(#s + 'x') - 1 technique, but with protection for the edge case where the string has a length of 4000 (for nvarchar) or 8000 (for varchar), so that the correct answer is given even for that. It also should handle strings with surrogate pairs correctly.
LEN cuts trailing spaces by default, so I found this worked as you move them to the front
(LEN(REVERSE(TestField))
So if you wanted to, you could say
SELECT
t.TestField,
LEN(REVERSE(t.TestField)) AS [Reverse],
LEN(t.TestField) AS [Count]
FROM TestTable t
WHERE LEN(REVERSE(t.TestField)) <> LEN(t.TestField)
Don't use this for leading spaces of course.
You need also to ensure that your data is actually saved with the trailing blanks. When ANSI PADDING is OFF (non-default):
Trailing blanks in character values
inserted into a varchar column are
trimmed.
You should define a CLR function that returns the String's Length field, if you dislike string concatination.
I use LEN('x' + #string + 'x') - 2 in my production use-cases.
If you dislike the DATALENGTH because of of n/varchar concerns, how about:
select DATALENGTH(#var)/isnull(nullif(DATALENGTH(left(#var,1)),0),1)
which is just
select DATALENGTH(#var)/DATALENGTH(left(#var,1))
wrapped with divide-by-zero protection.
By dividing by the DATALENGTH of a single char, we get the length normalised.
(Of course, still issues with surrogate-pairs if that's a concern.)
This is the best algorithm I've come up with which copes with the maximum length and variable byte count per character issues:
ISNULL(LEN(STUFF(#Input, 1, 1, '') + '.'), 0)
This is a variant of the LEN(#Input + '.') - 1 algorithm but by using STUFF to remove the first character we ensure that the modified string doesn't exceed maximum length and remove the need to subtract 1.
ISNULL(..., 0) is added to deal with the case where #Input = '' which causes STUFF to return NULL.
This does have the side effect that the result is also 0 when #Input is NULL which is inconsistent with LEN(NULL) which returns NULL, but this could be dealt with by logic outside this function if need be
Here are the results using LEN(#Input), LEN(#Input + '.') - 1, LEN(REPLACE(#Input, ' ', '.')) and the above STUFF variant, using a sample of #Input = CAST(' S' + SPACE(3998) AS NVARCHAR(4000)) over 1000 iterations
Algorithm
DataLength
ExpectedResult
Result
ms
LEN
8000
4000
2
14
+DOT-1
8000
4000
1
13
REPLACE
8000
4000
4000
514
STUFF+DOT
8000
4000
4000
0
In this case the STUFF algorithm is actually faster than LEN()!
I can only assume that internally SQL looks at the last character and if it is not a space then optimizes the calculation
But that's a good result eh?
Don't use the REPLACE option unless you know your strings are small - it's hugely inefficient
use
SELECT DATALENGTH('string ')
On running the below query:
SELECT DISTINCT [RULE], LEN([RULE]) FROM MYTBL WHERE [RULE]
LIKE 'Trademarks, logos, slogans, company names, copyrighted material or brands of any third party%'
I am getting the output as:
The column datatype is NCHAR(120) and the collation used is SQL_Latin1_General_CP1_CI_AS
The data is inserted with an extra leading space in the end. But using RTRIM function also I am not able to trim the extra space. I am not sure which type of leading space(encoded) is inserted here.
Can you please suggest some other alternative except RTRIM to get rid of extra white space at the end as the Column is NCHAR.
Below are the things which I have already tried:
RTRIM(LTRIM(CAST([RULE] as VARCHAR(200))))
RTRIM([RULE])
Update to Question
Please download the Database from here TestDB
Please use below query for your reference:
SELECT DISTINCT [RULE], LEN([RULE]) FROM [TestDB].[BI].[DimRejectionReasonData]
WHERE [RULE]
LIKE 'Trademarks, logos, slogans, company names, copyrighted material or brands of any third party%'
You may have a non-breaking space nchar(160) inside the string.
You can convert it to a simple space and then use the usual trim function
LTRIM(RTRIM(REPLACE([RULE], NCHAR(160), ' ')))
In case of unicode space
LTRIM(RTRIM(REPLACE(RULE, NCHAR(0x00A0), ' ')))
I guess this is what you are looking for ( Not sure ) . Make a try with this approach
SELECT REPLACE(REPLACE([RULE], CHAR(13), ''), CHAR(10), '')
Reference links : Link 1 & Link 2
Note: FYI refer those links for better understanding .
change the type nchar into varchar it will return the result without extra space
I have a CHAR column that contains messy OCR'd scan of printed integers.
I need to do SUM() operators on that column. But I'm unable to cast properly.
;Good
sqlite> select CAST("123" as integer);
123
;No Good, should be '323999'
sqlite> select CAST("323,999" as integer);
323
I believe SQLite interprets the comma as marking the end of the "the longest possible prefix of the value that can be interpreted as an integer number"
I prefer to avoid the agony of writing python scripts to do data cleaning on this column. Is there any clever way to do it strictly with SQL?
If you are trying to ignore commas, then remove them before the conversion:
select cast(replace('323,999', ',', '') as integer)