SQL Server substring breaking on words, not characters - sql

I'd like to show no more than n characters of a text field in search results to give the user an idea of the content. However, I can't find a way to easily break on words, so I wind up with a partial word at the break.
When I want to show: "This student has not submitted his last few assignments", the system might show: "This student has not submitted his last few assig"
I'd prefer that the system show up to the n character limit where words are preserved, so I'd like to see:
"This student has not submitted his last few"
Is there a nearest word function that I could write in T-SQL, or should I do that when I get the results back into ASP or .NET?

If you must do it in T-SQL:
DECLARE @t VARCHAR(100)
SET @t = 'This student has not submitted his last few assignments'
SELECT LEFT(LEFT(@t, 50), LEN(LEFT(@t, 50)) - CHARINDEX(' ', REVERSE(LEFT(@t, 50))))
It will not be catastrophically slow, but it will definitely be slower than doing it in the presentation layer.
Other than that, just cutting off the word and appending an ellipsis for longer strings is not a bad option either. This way at least all truncated strings have the same length, which might come in handy if you are formatting for a fixed-width output.
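For comparison, here is a minimal sketch of the same idea done in the presentation layer. It is written in Python rather than ASP or .NET, and the function name truncate_at_word is made up for illustration; it assumes words are separated by plain spaces only.

```python
def truncate_at_word(text: str, limit: int) -> str:
    """Return at most `limit` characters of `text`, cut at the last space."""
    if len(text) <= limit:
        return text
    # Search up to index `limit` inclusive, so a word that ends exactly
    # at the limit is kept whole.
    cut = text.rfind(" ", 0, limit + 1)
    if cut <= 0:
        # No usable space: fall back to a hard cut.
        return text[:limit]
    return text[:cut].rstrip()


sample = "This student has not submitted his last few assignments"
print(truncate_at_word(sample, 50))
```

A string without any spaces falls back to a hard cut at the limit, which is one possible policy; returning an empty string would be another.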

I agree with doing this outside of the database; that way, other applications with different length restrictions can make their own decisions on what to show or hide. Perhaps the length can be a parameter to the database call, though.
Here's a quick stab at a solution:
DECLARE @OriginalData NVARCHAR(MAX)
,@ReversedData NVARCHAR(MAX)
,@MaxLength INT
,@DelimiterPosition INT ;
SELECT @OriginalData = 'This student has not submitted his last few assignments'
,@MaxLength = 45;
SET @ReversedData = REVERSE(
LEFT(@OriginalData, @MaxLength)
);
SET @DelimiterPosition = CHARINDEX(' ', @ReversedData);
PRINT LEFT(@OriginalData, @MaxLength - @DelimiterPosition);
/*
This student has not submitted his last few assignments
1234567890123456789012345678901234567890123456789012345
*/

I recommend doing that kind of logic outside the database. With C# it could look similar to this:
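static string Cut(string s, int length)
{
    if (s.Length <= length)
    {
        return s;
    }
    // Walk back to the nearest space; stop at 0 so that a string
    // without any spaces cannot push the index below zero.
    while (length > 0 && s[length] != ' ')
    {
        length--;
    }
    return s.Substring(0, length).Trim();
}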
Of course you could do this with T-SQL, but that is a bad idea (bad performance etc.). If you really need to put it inside the DB, I would use a CLR-based stored procedure instead.

I'd like to add to the solutions already offered that word breaking logic is a lot more complicated than it seems on the surface. To do it well you are going to need to define a number of rules for what constitutes a word. Consider the following:
Spaces - No brainer.
Hyphens - Well that depends. In Over-exposed probably, in re-animated probably not. Then what about dates such as 01-02-1985?
Periods - No brainer. Oh wait, what about the one in myemail@myisp.com or $79.95?
Commas - In numbers such as 1,239 no, but in sentences yes.
Apostrophes - In O'Reily no, in SQL is an 'Enterprise' Database tool yes.
Do special characters alone constitute words?: In Item 1 : Buy TP is the colon counted as a word?
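To make the point concrete, here is a small Python sketch (the helper name last_break and its break_chars parameter are hypothetical) showing how the choice of break characters changes where a cut may land. It only covers the simplest rule, a fixed set of break characters, and none of the context-dependent cases listed above.

```python
import re


def last_break(text: str, limit: int, break_chars: str = " ") -> int:
    """Index of the last break character at or before `limit`, or -1."""
    pattern = "[" + re.escape(break_chars) + "]"
    positions = [m.start() for m in re.finditer(pattern, text[: limit + 1])]
    return positions[-1] if positions else -1


text = "Over-exposed film"
print(last_break(text, 8))        # spaces only: no break available yet
print(last_break(text, 8, " -"))  # spaces and hyphens: break at the hyphen
```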

I found an answer on this site and modified it:
the cast (150) must be greater than the number of characters you're returning (100)
LEFT (Cast(myTextField As varchar(150)),
CHARINDEX(' ', CAST(flag_myTextField AS VARCHAR(150)), 100) ) AS myTextField_short

I'm not sure how fast this will run, but it will work....
DECLARE @Max int
SET @Max=??
SELECT
REVERSE(RIGHT(REVERSE(LEFT(YourColumnHere,@Max)),@Max- CHARINDEX(' ',REVERSE(LEFT(YourColumnHere,@Max)))))
FROM YourTable
WHERE X=Y

I wouldn't advise doing that either, but if you must, you can do something like this:
DECLARE @text nvarchar(max);
DECLARE @end_char int;
SELECT @text = 'This student has not submitted his last few assignments', @end_char = 50 ;
WHILE @end_char > 0 AND SUBSTRING( @text, @end_char+1, 1 ) <> ' '
SET @end_char = @end_char - 1
SELECT @text = SUBSTRING( @text, 1, @end_char ) ;
SELECT @text

Related

In SQL how can I remove the first 3 characters on the left and everything on the right after a specific character

In SQL, how can I remove (from the display on my report, not from the database) the first 3 characters (CN=) and everything after the comma that is followed by "OU", so that I am left with the first name and last name in the same column? For example:
CN=Tom Chess,OU=records,DC=1234564786_data for testing, 1234567
CN=Jack Bauer,OU=records,DC=1234564786_data for testing, 1234567
CN=John Snow,OU=records,DC=1234564786_data for testing, 1234567
CN=Anna Rodriguez,OU=records,DC=1234564786_data for testing, 1234567
Desired display:
Tom Chess
Jack Bauer
John Snow
Anna Rodriguez
I tried playing with TRIM, but I don't know how to do it without declaring the position. With first and last names having different lengths, I really don't know how to handle that.
Thank you in advance
Update: I wonder about using LOCATE to find the position of the comma and then feeding that to a substring function. I'm not sure whether an approach like that would work, or how to put the syntax together. What do you think? Would it be a feasible approach?
You can try this one SUBSTRING(ColumnName, 4, CHARINDEX(',', ColumnName) - 4)
In Postgres, you could use split_part() assuming no name contains a ,
select substr(split_part(the_column, ',', 1), 4)
from ...
Db2 11.x for LUW:
with tab (str) as (values
' CN = Tom Chess , OU = records,DC=1234564786_data for testing, 1234567'
, 'CN=Jack Bauer,OU=records,DC=1234564786_data for testing, 1234567'
, 'CN=John Snow,OU=records,DC=1234564786_data for testing, 1234567'
, 'CN=Anna Rodriguez,OU=records,DC=1234564786_data for testing, 1234567'
)
select REGEXP_REPLACE(str, '^\s*CN\s*=\s*(.*)\s*,\s*OU\s*=.*', '\1')
from tab;
Note that such a regex pattern allows an arbitrary number of spaces, as in the first record of the example above.
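In Oracle 11g, something like this might work. A character class such as [^CN=] excludes the individual characters C, N and =, so it would cut "Tom Chess" short at the C of "Chess"; capturing everything between the leading CN= and the first comma avoids that.
REGEXP_SUBSTR(COLUMN_NAME, '^CN=([^,]+)', 1, 1, NULL, 1)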
I think there has to be a loop to handle this. Here's SQL Server function that will parse this out. (I know the question didn't specify SQL Server, but it's an example of how it can be done.)
select dbo.ScrubFieldValue(value) from table will return what you're looking for
CREATE FUNCTION ScrubFieldValue
(
    @Input varchar(8000)
)
RETURNS varchar(8000)
AS
BEGIN
    DECLARE @retval varchar(8000)
    DECLARE @charidx int
    DECLARE @remaining varchar(8000)
    DECLARE @current varchar(8000)
    DECLARE @currentLength int
    select @retval = ''
    select @remaining = @Input
    select @charidx = CHARINDEX('CN=', @remaining,2)
    while(LEN(@remaining) > 0)
    BEGIN
        --strip current row from remaining
        if (@charidx > 0)
        BEGIN
            select @current = SUBSTRING(@remaining, 1, @charidx - 1)
        END
        else
        BEGIN
            select @current = @remaining
        END
        select @currentLength = LEN(@current)
        -- get current name
        select @current = SUBSTRING(@current, 4, CHARINDEX(',OU', @current)-4)
        select @retval = @retval + @current + ' '
        -- strip off current from remaining
        select @remaining =substring(@remaining,@currentLength + 1,
            LEN(@remaining) - @currentLength)
        select @charidx = CHARINDEX('CN=', @remaining,2)
    END
    RETURN @retval
END
On my version of DB2 for Z/OS CHARINDEX throws a syntax error. Here are two ways to work around that.
SUBSTRING(ColumnName, 4, INSTR(ColumnName,',',1) - 4)
SUBSTRING(ColumnName, 4, LOCATE_IN_STRING(ColumnName,',') - 4)
I should add that the version is V12R1
If the input str is well-formed (i.e. it looks like your sample data, without any additional tokens such as spaces), you could use something like:
substr(str,locate('CN=', str)+length('CN='), locate(',', str)-length('CN=')-1)
If your Db2 version supports REGEXP, that's a better choice.
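If the trimming ends up happening in the reporting layer instead of in SQL, the same regex idea carries over directly. A minimal Python sketch, assuming every row starts with a CN= token followed somewhere by ,OU= (the names CN_PATTERN and common_name are made up for illustration):

```python
import re

# Optional whitespace around the tokens is tolerated, as in the Db2 regex.
CN_PATTERN = re.compile(r"^\s*CN\s*=\s*(.*?)\s*,\s*OU\s*=")


def common_name(dn: str) -> str:
    """Return the text between 'CN=' and ',OU=', or the input unchanged."""
    match = CN_PATTERN.match(dn)
    return match.group(1) if match else dn


rows = [
    " CN = Tom Chess , OU = records,DC=1234564786_data for testing, 1234567",
    "CN=Jack Bauer,OU=records,DC=1234564786_data for testing, 1234567",
]
for row in rows:
    print(common_name(row))
```

Rows that do not match are returned unchanged here; whether that or an empty value is preferable depends on the report.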

SQL Server Function to Split Full Names

I have scoured SO for a resolution to my problem with getting this function to run properly, but I mostly see solutions regarding the use of the function and not as many in creating the function. I have already created the function, so that's why it reads 'ALTER FUNCTION' at the top of my code. The end goal is to parse out First Name, Middle Initial, and Last Name.
I keep getting an incorrect syntax error near the first 'END' in the CASE statement regarding the parsing of the FirstName. I apologize if this is such an easy fix but I just cannot figure out what I am missing. Any help in error recognition or a cleaner syntax would be much appreciated for a beginner like myself.
Also, the 2nd SET statement towards the bottom is just a simple function(already written before I got here) that CamelCases the output.
Sorry about the first comment. Here are some sample names that I have been using and I want to parse these from one column into 3 columns. First, middle, and last name.
Carlton J Smith
Charmane Thorn
Deel S Shah
Curtis Brennan
Allie F Allison
Alex Finde
Tina D Page
Jackie Russell
I tried adding two more SET statements but it still is giving me the same syntax error around the first CASE statement. Anything else I could provide to give more context? Thanks for the prompt responses.
ALTER FUNCTION fn_clean_Name_Split (@source VARCHAR(255))
RETURNS VARCHAR(255)
AS
BEGIN
DECLARE @target VARCHAR(255) = @source
DECLARE @index INT = CHARINDEX(' ',@target)
SET @target =
CASE
WHEN @index <> LEN(@target)
THEN LEFT(@target, @index)
END AS FirstName,
CASE
WHEN @index <> LEN(@target) - CHARINDEX(' ', REVERSE(@target)) + 1
THEN SUBSTRING(@target, @index + 1, LEN(@target) - CHARINDEX(' ', REVERSE(@target)) - @index)
END AS MI,
CASE
WHEN @index <> LEN(@target) - @index + 1
THEN RIGHT(@target, CHARINDEX(' ', REVERSE(@target))) AS LastName,
ELSE @target
SET @target = dbo.fn_standardize_CamelCase(@target)
RETURN @target
END
This is too long for a comment.
A set is used to set the value of a single variable, not multiple variables. This is easy to work around; you can just use multiple set statements.
You can set multiple variables using select. That is a nice convenience. But you cannot both set values and return values in the same statement.
You have a function, so you don't want a query that returns values anyway.
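To show the parsing rule itself, separate from the T-SQL syntax problem, here is a small Python sketch. It assumes names look like the samples in the question (first name, optional middle initial, last name, separated by spaces); split_name is a made-up helper, not part of the original function.

```python
def split_name(full_name: str) -> tuple:
    """Split 'First [Middle] Last' into (first, middle, last)."""
    parts = full_name.split()
    if not parts:
        return ("", "", "")
    if len(parts) == 1:
        # Only one token: treat it as the first name.
        return (parts[0], "", "")
    # Everything between the first and last token counts as the middle part.
    return (parts[0], " ".join(parts[1:-1]), parts[-1])


for name in ["Carlton J Smith", "Charmane Thorn"]:
    print(split_name(name))
```

Returning three separate values is the key difference from the scalar function in the question, which can only return one string.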

Simple Explanation for PATINDEX

I have been reading up on PATINDEX, attempting to understand the what and the why. I understand that when using the wildcards it will return an INT indicating where the character(s) appear or start. So:
SELECT PATINDEX('%b%', '123b') -- returns 4
However I am looking to see if someone can explain the reason as to why you would use this in a simple(ish) way. I have read some other forums but it just is not sinking in to be honest.
Are you asking for realistic use-cases? I can think of two, real-life use-cases that I've had at work where PATINDEX() was my best option.
I had to import a text-file and parse it for INSERT INTO later on. But these files sometimes had numbers in this format: 00000-59. If you try CAST('00000-59' AS INT) you'll get an error. So I needed code that would parse 00000-59 to -59 but also 00000159 to 159 etc. The - could be anywhere, or it could simply not be there at all. This is what I did:
DECLARE @my_var VARCHAR(255) = '00000-59', @my_int INT
SET @my_var = STUFF(@my_var, 1, PATINDEX('%[^0]%', @my_var)-1, '')
SET @my_int = CAST(@my_var AS INT)
[^0] in this case means "any character that isn't a 0". So PATINDEX() tells me when the 0's end, regardless of whether that's because of a - or a number.
The second use-case I've had was checking whether an IBAN number was correct. In order to do that, any letters in the IBAN need to be changed to a corresponding number (A=10, B=11, etc...). I did something like this (incomplete but you get the idea):
SET @i = PATINDEX('%[^0-9]%', @IBAN)
WHILE @i <> 0 BEGIN
SET @num = UNICODE(SUBSTRING(@IBAN, @i, 1))-55
SET @IBAN = STUFF(@IBAN, @i, 1, CAST(@num AS VARCHAR(2)))
SET @i = PATINDEX('%[^0-9]%', @IBAN)
END
So again, I'm not concerned with finding exactly the letter A or B etc. I'm just finding anything that isn't a number and converting it.
PATINDEX is roughly equivalent to CHARINDEX except that it returns the position of a pattern instead of a literal substring. Examples:
Check if a string contains at least one digit:
SELECT PATINDEX('%[0-9]%', 'Hello') -- 0
SELECT PATINDEX('%[0-9]%', 'H3110') -- 2
Extract numeric portion from a string:
SELECT SUBSTRING('12345', PATINDEX('%[0-9]%', '12345'), 100) -- 12345
SELECT SUBSTRING('x2345', PATINDEX('%[0-9]%', 'x2345'), 100) -- 2345
SELECT SUBSTRING('xx345', PATINDEX('%[0-9]%', 'xx345'), 100) -- 345
Quoted from PATINDEX (Transact-SQL)
The following example uses % and _ wildcards to find the position at
which the pattern 'en', followed by any one character and 'ure' starts
in the specified string (index starts at 1):
SELECT PATINDEX('%en_ure%', 'please ensure the door is locked');
Here is the result set.
8
You'd use the PATINDEX function when you want to know at which character position a pattern begins in an expression of a valid text or character data type.
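For readers who know regular expressions better than T-SQL wildcards, the idea maps onto "find the first match and report its position". A rough Python analogue of the leading-zeros use-case, as a sketch only: first_match_position and parse_padded_int are hypothetical names, and Python regex syntax differs from PATINDEX patterns (no surrounding % wildcards).

```python
import re


def first_match_position(pattern: str, text: str) -> int:
    """1-based position of the first regex match, or 0 if there is none."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0


def parse_padded_int(text: str) -> int:
    """Drop leading zeros, keeping a sign that may follow them."""
    pos = first_match_position(r"[^0]", text)
    # pos == 0 means the string consists of zeros only.
    return int(text[pos - 1:]) if pos else 0


print(first_match_position(r"[0-9]", "H3110"))
print(parse_padded_int("00000-59"))
```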

Looking for a scalar function to find the last occurrence of a character in a string

Table FOO has a column FILEPATH of type VARCHAR(512). Its entries are absolute paths:
FILEPATH
------------------------------------------------------------
file://very/long/file/path/with/many/slashes/in/it/foo.xml
file://even/longer/file/path/with/more/slashes/in/it/baz.xml
file://something/completely/different/foo.xml
file://short/path/foobar.xml
There's ~50k records in this table and I want to know all distinct filenames, not the file paths:
foo.xml
baz.xml
foobar.xml
This looks easy, but I couldn't find a DB2 scalar function that allows me to search for the last occurrence of a character in a string. Am I overlooking something?
I could do this with a recursive query, but this appears to be overkill for such a simple task and (oh wonder) is extremely slow:
WITH PATHFRAGMENTS (POS, PATHFRAGMENT) AS (
SELECT
1,
FILEPATH
FROM FOO
UNION ALL
SELECT
POSITION('/', PATHFRAGMENT, OCTETS) AS POS,
SUBSTR(PATHFRAGMENT, POSITION('/', PATHFRAGMENT, OCTETS)+1) AS PATHFRAGMENT
FROM PATHFRAGMENTS
)
SELECT DISTINCT PATHFRAGMENT FROM PATHFRAGMENTS WHERE POS = 0
I think what you're looking for is the LOCATE_IN_STRING() scalar function. This is what Info Center has to say if you use a negative start value:
If the value of the integer is less than zero, the search begins at
LENGTH(source-string) + start + 1 and continues for each position to
the beginning of the string.
Combine that with the LENGTH() and RIGHT() scalar functions, and you can get what you want:
SELECT
RIGHT(
FILEPATH
,LENGTH(FILEPATH) - LOCATE_IN_STRING(FILEPATH,'/',-1)
)
FROM FOO
One way to do this is by taking advantage of the power of DB2's XQuery engine. The following worked for me (and fast):
SELECT DISTINCT XMLCAST(
XMLQuery('tokenize($P, ''/'')[last()]' PASSING FILEPATH AS "P")
AS VARCHAR(512) )
FROM FOO
Here I use tokenize to split the file path into a sequence of tokens and then select the last of these tokens. The rest is only conversion from SQL to XML types and back again.
I know that the problem from the OP was already solved but I decided to post the following anyway to hopefully help others like me that land here.
I came across this thread while searching for a solution to my similar problem which had the exact same requirement but was for a different kind of database that was also lacking the REVERSE function.
In my case this was for an OpenEdge (Progress) database, which has a slightly different syntax. It does offer the INSTR function that most Oracle-style databases provide.
So I came up with the following code:
SELECT
SUBSTRING(
foo.filepath,
INSTR(foo.filepath, '/',1, LENGTH(foo.filepath) - LENGTH( REPLACE( foo.filepath, '/', '')))+1,
LENGTH(foo.filepath))
FROM foo
However, for my specific situation (the OpenEdge (Progress) database) this did not result in the desired behaviour, because replacing the character with an empty string gave the same length as the original string. This doesn't make much sense to me, but I was able to bypass the problem with the code below:
SELECT
SUBSTRING(
foo.filepath,
INSTR(foo.filepath, '/',1, LENGTH( REPLACE( foo.filepath, '/', 'XX')) - LENGTH(foo.filepath))+1,
LENGTH(foo.filepath))
FROM foo
Now I understand that this code won't solve the problem for T-SQL, because T-SQL has no built-in alternative to INSTR that offers the occurrence parameter.
Just to be thorough, I'll add the code needed to create this scalar function so it can be used the same way as in the above examples.
-- Drop the function if it already exists
IF OBJECT_ID('INSTR', 'FN') IS NOT NULL
DROP FUNCTION INSTR
GO
-- User-defined function to implement Oracle INSTR in SQL Server
CREATE FUNCTION INSTR (@str VARCHAR(8000), @substr VARCHAR(255), @start INT, @occurrence INT)
RETURNS INT
AS
BEGIN
    DECLARE @found INT = @occurrence,
            @pos INT = @start;
    WHILE 1=1
    BEGIN
        -- Find the next occurrence
        SET @pos = CHARINDEX(@substr, @str, @pos);
        -- Nothing found
        IF @pos IS NULL OR @pos = 0
            RETURN @pos;
        -- The required occurrence found
        IF @found = 1
            BREAK;
        -- Prepare to find another occurrence
        SET @found = @found - 1;
        SET @pos = @pos + 1;
    END
    RETURN @pos;
END
GO
To state the obvious: when the REVERSE function is available, you do not need to create this scalar function and can get the required result like this:
SELECT
SUBSTRING(
foo.filepath,
LEN(foo.filepath) - CHARINDEX('/', REVERSE(foo.filepath))+2,
LEN(foo.filepath))
FROM foo
You could just do it in a single statement:
select distinct reverse(substring(reverse(FILEPATH), 1, charindex('/', reverse(FILEPATH))-1))
from filetable
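If the paths are already being read by an application, the "last occurrence" search is a one-liner there. A minimal Python sketch using the sample paths from the question (file_name is a made-up helper; it assumes '/' is the only separator):

```python
def file_name(path: str) -> str:
    """Return everything after the last '/' (the whole string if none)."""
    return path.rsplit("/", 1)[-1]


paths = [
    "file://very/long/file/path/with/many/slashes/in/it/foo.xml",
    "file://even/longer/file/path/with/more/slashes/in/it/baz.xml",
    "file://something/completely/different/foo.xml",
    "file://short/path/foobar.xml",
]
# A set removes duplicates, mirroring SELECT DISTINCT.
distinct_names = sorted({file_name(p) for p in paths})
print(distinct_names)
```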

Find and Replace credit card numbers

We have a large database with a lot of data in it. I found out recently that our sales and shipping departments have been using a part of the application to store clients' credit card numbers in the open. We've put a stop to it, but now there are thousands of rows with the numbers.
We're trying to figure out how to scan certain columns for 16 digits in a row (or with dash separation) and replace them with X's.
It's not a simple UPDATE statement because the card numbers are stored among large amounts of text. So far I've been unable to figure out if SQL Server is capable of regex (it would seem not).
If all else fails, I will do this through PHP since that is what I'm best at... but it'll be painful.
Sounds like you need to use PATINDEX with a WHERE LIKE clause.
Something like this. Create a stored proc with something similar, then call it with a bunch of different parameters (make @pattern & @patternlength the params) that you have identified, until you've replaced all of the instances.
declare @pattern varchar(100), @patternlength int
set @pattern = '[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
set @patternlength = 19
update tableName
set fieldName =
LEFT(fieldName, patindex('%'+ @pattern + '%', fieldName)-1)
+ 'XXXX-XXXX-XXXX-XXXX'
+ SUBSTRING(fieldName, PATINDEX('%'+ @pattern + '%', fieldName)+@patternlength, LEN(fieldName))
from tableName
where fieldName like '%'+ @pattern + '%'
The trick is just finding the appropriate patterns, and setting the appropriate @patternlength value (the length of the matched text, not the length of @pattern, as that won't work!)
I think you are better off doing this programmatically, especially since you mentioned the data can be in a couple of different formats. Do keep in mind that not all credit card numbers are 16 digits long (Amex is 15, Visa is 13 or 16, etc).
The ability to check for various regexes and validate code will probably be best served at a cleanup job level, if possible.
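As an illustration of the programmatic route, here is a minimal Python sketch. It is deliberately narrow: it only masks 16-digit numbers written as four groups of four, either unseparated or consistently separated by dashes or spaces, so 15-digit Amex numbers and other layouts would need additional patterns. The names CARD_PATTERN and mask_cards are made up for this example.

```python
import re

# Four groups of four digits; the captured separator (dash, space, or
# nothing) must be the same between all groups.
CARD_PATTERN = re.compile(r"\b\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}\b")


def mask_cards(text: str) -> str:
    """Replace every matching card number with X's, keeping separators."""
    return CARD_PATTERN.sub(r"XXXX\1XXXX\1XXXX\1XXXX", text)


print(mask_cards("Card 1234-5678-9012-3456 on file"))
```

The word boundaries keep the pattern from firing inside longer digit runs, which lowers the risk of mangling order numbers that merely contain 16 consecutive digits.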
I improved on Sean's answer.
The following will find all the occurrences of @maskPattern in @text and replace them with 'x'.
For example, if @maskPattern = XXXX-XXXX-XXXX-XXXX, it will find every run of digits in that shape in @text and replace each occurrence with XXXX-XXXX-XXXX-XXXX. If it does not find any occurrence, it will leave the text as is.
This stored procedure can also be adapted to mask only the first three quarters of the pattern. Cheers!
ALTER PROCEDURE [dbo].[SP_MaskCharacters] @text nvarchar(max),
@maskPattern nvarchar(500)
AS
BEGIN
DECLARE @numPattern nvarchar(max) = REPLACE(@maskPattern, 'x', '[0-9]')
DECLARE @patternLength int = LEN(@maskPattern)
WHILE (@text IS NOT NULL)
BEGIN
IF PATINDEX('%' + @numPattern + '%', @text) = 0 BREAK;
SET @text =
LEFT(@text, PATINDEX('%' + @numPattern + '%', @text)-1) -- Get the beginning chars of the input text up to the first occurrence of the pattern
+ @maskPattern -- Append the masking pattern
+ SUBSTRING(@text, PATINDEX('%' + @numPattern + '%', @text) + @patternLength, LEN(@text)) -- Get & append the rest of the text found after the masked pattern
END
SELECT @text
END
I faced this situation recently. Using Patindex and Stuff should help, but you would need to repeat for CC numbers with different number of digits separately.
-- For 16 digits CC numbers
UPDATE table
SET columnname = Stuff (columnname, Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname), 16, '################')
WHERE Patindex(
'%[3-6][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%'
, columnname) > 0
You can use patindex. It won't be pretty and there might be a more concise way to write it. But you can use sets ie [0-9]
patindex: http://msdn.microsoft.com/en-us/library/ms188395.aspx
similar question: SQL Server Regular expressions in T-SQL
For anyone finding this question who does want to use PHP, here's a function I use that takes a credit card number (all digits, with dashes, or with spaces) and replaces all but the first and last 4 digits with 'X'.
To accept credit card numbers with dashes as well, use this regex pattern instead:
$cc_regex_pattern = '/(\d{4})(-)?(\d{4})(-)?(\d{4})(-)?(\d{4})/'
and remove the preprocessing of the cc number that removes the dashes:
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
and so the replacement string becomes (because we've changed the index of patterns - note the $7):
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$7';
or if you want, simply replace the whole cc number, like in the original question:
$cc_regex_replacement = 'XXXX$2XXXX$4XXXX$6XXXX';
Here's the original function for credit card numbers with or without spaces or dashes, which obfuscates and removes any dashes:
/**
* @param integer|string $credit_card_number
* @return mixed
*/
static function obfuscate_credit_card($credit_card_number)
{
$compressed_cc_number = preg_replace('/(\ |-)/', '', $credit_card_number);
$cc_length = strlen($compressed_cc_number);
$cc_middle_length = $cc_length >= 9 ? $cc_length - 8 : 0;
//create middle pattern
$cc_middle_pattern = '';
for ($i = 0; $i < $cc_middle_length; $i++) {
$cc_middle_pattern .= 'X';
}
//replace cc middle digits with middle pattern
$cc_regex_pattern = '/(\d{4})(\d+)(\d{4})/';
$cc_regex_replacement = '$1' . $cc_middle_pattern . '$3';
$obfuscated_cc = preg_replace($cc_regex_pattern, $cc_regex_replacement, $compressed_cc_number);
return $obfuscated_cc;
}