Find character sequence at specific position in string - sql

I need to use SQL to find a sequence of characters at a specific position in a string.
Example:
atcgggatgccatg
I need to find 'atg' starting at character 7 or at character 7-9, either way would work. I don't want to find the 'atg' at the end of the string. I know about LIKE but couldn't find how to use it for a specific position.
Thank you

In MS Access, you could write this as:
where col like '??????atg*'
Each ? matches exactly one character, so six of them place 'atg' at characters 7-9.
However, if you are interested in this type of comparison, you might consider using a database that supports regular expressions.

If you have a look at this page you'll find that LIKE is entirely capable of doing what you want. To find something at, for example, a 3 char offset you can use something like this
SELECT * FROM SomeTable WHERE [InterestingField] LIKE '___FOO%'
The '_' (underscore) is a place marker for any char. Having 3 "any char" markers in the pattern, with a trailing '%', means that the above SQL will match anything with FOO starting from the fourth char, and then anything else (including nothing).
To look for something starting at character 7, use 6 underscores.
Let me know if this isn't quite clear.
EDIT: I quoted SQL Server stuff, not Access. Swap in '?' where I have '_', use '*' instead of '%', and check out this link instead.
Revised query:
SELECT * FROM SomeTable WHERE [InterestingField] LIKE '???FOO*'
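As a rough check of the pattern logic, here is a minimal sketch using Python's built-in sqlite3 module. SQLite uses the same '_' and '%' wildcards as SQL Server (in Access, substitute '?' and '*'). The table name seqs and the two extra sample rows are made up for illustration, and the substr variant is an alternative that none of the answers above mention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seqs (col TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO seqs VALUES (?)",
    [("atcgggatgccatg",), ("atgcccccccccgg",), ("cccccccccccatg",)],
)

# Six underscores consume characters 1-6, so 'atg' must sit at characters 7-9.
like_hits = [row[0] for row in conn.execute(
    "SELECT col FROM seqs WHERE col LIKE '______atg%'")]

# Same test expressed with a 1-based substring: 3 characters starting at 7.
substr_hits = [row[0] for row in conn.execute(
    "SELECT col FROM seqs WHERE substr(col, 7, 3) = 'atg'")]

print(like_hits, substr_hits)
```

Only the first row qualifies; the 'atg' at the start or end of the other rows is ignored because the wildcards pin the position.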

Related

sql looking for pattern

I have a string similar to this 'MSH|^~\&|STF_ALL_LAB_IN_C...
I'm trying to find some sql that will bring back all messages that contain
MSH|^~\&|(any 3 characters)_(anything after the underscore).
Tried something like this
WHERE TransText LIKE 'MSH|^~\&|%_%_%_'
But that doesn't seem to require the underscore.
Any suggestions?
MSH|^~\&|(any 3 characters)_(anything after the underscore).
The pattern would be:
where TransText like 'MSH|^~\&|___\_%'
In some databases, the backslash would need to be escaped, so that would be:
where TransText like 'MSH|^~\\&|___\_%'
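_ is a special character in a LIKE pattern. It matches any single character, whereas % matches any sequence of zero or more characters.
To match a literal underscore you need to escape it, for example with \_. In SQL Server no escape character is defined by default, so either add ESCAPE '\' after the pattern or write the underscore as [_].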
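The difference between an escaped and an unescaped underscore can be sketched with Python's sqlite3 module. This is only an illustration under assumptions: the table msgs and the second sample row are invented, and '!' is chosen as the escape character (via an explicit ESCAPE clause) so that it cannot be confused with the backslash that is part of the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgs (TransText TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO msgs VALUES (?)",
    [(r"MSH|^~\&|STF_ALL_LAB_IN_C",), (r"MSH|^~\&|STFXALL",)],
)

def matching(pattern):
    """Return the rows matching a LIKE pattern that uses '!' as escape character."""
    sql = "SELECT TransText FROM msgs WHERE TransText LIKE ? ESCAPE '!'"
    return [row[0] for row in conn.execute(sql, (pattern,))]

# Unescaped: the fourth '_' is just another wildcard, so both rows match.
loose = matching(r"MSH|^~\&|____%")

# Escaped: '!_' demands a literal underscore after the three wildcards.
strict = matching(r"MSH|^~\&|___!_%")

print(loose, strict)
```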

How can I escape the wildcard for like operator? [duplicate]

This question also has the answer, but it mentions DB2 specifically.
How do I search for a string using LIKE that already has a percent % symbol in it? The LIKE operator uses % symbols to signify wildcards.
Use brackets. So to look for 75%
WHERE MyCol LIKE '%75[%]%'
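This is simpler than ESCAPE, but the bracket syntax is specific to SQL Server's LIKE; databases such as MySQL, PostgreSQL and Oracle do not support it, so ESCAPE is the more portable choice.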
You can use the ESCAPE keyword with LIKE. Simply prepend the desired character (e.g. '!') to each of the existing % signs in the string and then add ESCAPE '!' (or your character of choice) to the end of the query.
For example:
SELECT *
FROM prices
WHERE discount LIKE '%80!% off%'
ESCAPE '!'
This will make the database treat 80% as an actual part of the string to search for and not 80(wildcard).
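When the search text comes from user input, the "prepend the escape character" step can be automated. The following is a minimal sketch, not a library function: escape_like is a hypothetical helper, '!' is an arbitrary escape character, and SQLite (through Python's sqlite3 module) stands in for whatever database you actually use.

```python
import sqlite3

def escape_like(text, esc="!"):
    """Prefix the escape character itself, '%' and '_' with esc (hypothetical helper)."""
    # Handle esc first so the prefixes added for % and _ are not escaped again.
    for ch in (esc, "%", "_"):
        text = text.replace(ch, esc + ch)
    return text

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (discount TEXT)")
conn.executemany(
    "INSERT INTO prices VALUES (?)",
    [("now 80% off",), ("now 800 off",)],
)

# Wrap the escaped text in real wildcards for a "contains" search.
pattern = "%" + escape_like("80% off") + "%"
hits = [row[0] for row in conn.execute(
    "SELECT discount FROM prices WHERE discount LIKE ? ESCAPE '!'", (pattern,))]
print(pattern, hits)
```

Without the escaping step the pattern '%80% off%' would also match 'now 800 off', because the middle % would act as a wildcard.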
MSDN Docs for LIKE
WHERE column_name LIKE '%save 50[%] off!%'
You can use the code below to find a specific value.
WHERE col1 LIKE '%[%]75%'
When you want a single digit after the % sign, you can write the following code (a plain _ would match any single character, not only a digit).
WHERE col2 LIKE '%[%][0-9]'
In MySQL,
WHERE column_name LIKE '%|%%' ESCAPE '|'

Converting symbols from outside of alphabet caused by copying text in different encoding

In my database I should only have data written using the Polish alphabet, but sometimes there are symbols outside the Polish alphabet (words copied from a source with a different encoding) that correspond to Polish letters in another encoding. Is it possible to somehow convert symbols outside the Polish alphabet to the corresponding letters?
The only solution I have come up with is to manually find and replace those characters, but maybe you have a better solution to my problem.
The question concerns Oracle SQL.
I don't have the database in front of me, but if I remember correctly the example could look like this - two rows from my db:
ŚWIAT
ÚWIAT
and what I need is to convert Ú that doesn't belong to Polish alphabet to Ś.
You can try this. Experiment with it first to see if it works.
If I want to change every occurrence of the letter z to a j in a string, I would use the translate function: translate(text_string, 'z', 'j'). I don't have to type the letters z and j; instead, I can write translate(text_string, chr(122), chr(106)). To find out the character code, I use select ascii('z') from dual;. For example:
SQL> select translate('banzo', chr(122), chr(106)) from dual;
TRANS
-----
banjo
This changes every occurrence of z to j in text_string.
Now, you will have to find the code for the characters you want to change (both the "from" and the "to" characters) in your character set - it should be your session character set, not the database character set. (At least I think this is correct; experiment with it or read the documentation for CHR and perhaps for TRANSLATE - CHR returns the character for a given code in the DATABASE character set unless you indicate otherwise, while I believe TRANSLATE uses the session character set.)
The function ascii may or may not work for non-ASCII characters, but if you google the name of your character set, you should find a character set table that will show you the codes for all the letters available in that character set.
Then, if this works, you can do the translation in one shot - translate(text_string, 'abcd', 'qrst') will change every 'a' to a 'q', every 'b' to an 'r' etc. And with chr(...), instead of 'abcd' you can write chr(97) || chr(98) || chr(99) || chr(100).
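To make the character-by-character semantics concrete, here is a small sketch in Python rather than Oracle. str.translate with str.maketrans performs the same kind of positional mapping as TRANSLATE, and ord/chr play the role of ASCII/CHR. It says nothing about Oracle's session or database character sets, which you would still need to check as described above.

```python
# ord() and chr() are Python's counterparts of the ASCII and CHR lookups above.
assert ord("z") == 122 and chr(106) == "j"

# One-character mapping, like translate('banzo', chr(122), chr(106)).
single = "banzo".translate(str.maketrans(chr(122), chr(106)))

# Several characters in one pass, like translate(text_string, 'abcd', 'qrst').
multi = "abcd cab".translate(str.maketrans("abcd", "qrst"))

# The case from the question: map the stray 'Ú' back to the Polish 'Ś'.
fixed = "ÚWIAT".translate(str.maketrans("Ú", "Ś"))

print(single, multi, fixed)
```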

substring extraction in HQL

There's a URL field in my Hive DB that is of string type with this specific pattern:
/Cats-g294078-o303631-Maine_Coon_and_Tabby.html
and I would like to extract the two Cat "types" near the end of the string, with the result being something like:
mainecoontabby
Basically, I'd like to only extract - as one lowercase string - the Cat "types" which are always separated by '_ and _', preceded by '-', and followed by '.html'.
Is there a simple way to do this in HQL? I know HQL has limited functionality, otherwise I'd be using regexp or substring or something like that.
Thanks,
Clark
HQL does have a substr function as cited here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
It returns the part of a string from a given start position to the end (or for a given length).
I'd also use the locate function to find the positions of the '-' and '_' characters in the URL.
As long as there are always three dashes and three underscores, this should be pretty straightforward.
Otherwise you might need case statements to handle a varying number of dashes and underscores.
solution here...
LOWER(REGEXP_REPLACE(SUBSTRING(catString, LOCATE('-', catString, 19)+1), '(_to_)|(\.html)|_', ''))
Interestingly, the following did NOT work... JJFord3, any idea why?
LOWER(REGEXP_EXTRACT(SUBSTRING(FL.url, LOCATE('-', FL.url, 19)+1), '[^(_to_)|(\.html)|_]', 0))
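A likely reason the second expression fails is that square brackets turn the whole text into a character class, which matches one character at a time instead of the listed tokens. The sketch below uses Python's re module, which treats alternation and character classes the same way as Hive's Java regexes for these constructs. Note that the posted expressions strip '_to_' while the sample URL contains '_and_', so the sketch assumes '_and_' in order to reproduce the desired output.

```python
import re

url = "/Cats-g294078-o303631-Maine_Coon_and_Tabby.html"

# LOCATE('-', url, 19) is 1-based; the 0-based equivalent starts at index 18.
tail = url[url.index("-", 18) + 1:]

# Alternation removes whole tokens: '_and_', '.html', then any leftover '_'.
replaced = re.sub(r"(_and_)|(\.html)|_", "", tail).lower()

# Inside [...] the same text is a set of single characters, so the negated
# class matches just the first character that is not one of them.
extracted = re.search(r"[^(_and_)|(\.html)|_]", tail).group(0)

print(tail, replaced, extracted)
```

The replace-based version yields the full lowercase string, whereas the extract-based version returns a single character, which is consistent with the "did NOT work" observation.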

Unable to replace Char(63) by SQL query

I have some rows in a table that contain an unusual character. When I use ascii() or unicode() for that character, it returns 63. But when I try this -
update MyTable
set MyColumn = replace(MyColumn,char(63),'')
it does not replace. The unusual character still exists after the replace function. Char(63) incidentally looks like a question mark.
For example, my string is 'ddd#dd ddd', where # is my unusual character, and
select unicode('#')
returns 63. But this code
declare @str nvarchar(10) = 'ddd#dd ddd'
set @char = char(unicode('#'))
set @str = replace(@str,@char,'')
is working!
Any ideas how to resolve this?
Additional information:
select ascii('�') returns 63, and so does select ascii('?'). Finally select char(63) returns ? and not the diamond-question-mark.
When this character is pasted into Excel or a text editor, it looks like a space, but in an SQL Server Query window (and, apparently, here on StackOverflow as well), it looks like a diamond containing a question mark.
Not only does char(63) look like a '?', it is actually a '?'.
(As a simple test, make sure Num Lock is on, hold down the Alt key and type '63' on the number pad. You can have all sorts of fun this way: try Alt-205, then Alt-206 and Alt-205 again: ═╬═)
It's possible, however, that the '?' you are seeing isn't a char(63) but is rather indicative of a character that SQL Server doesn't know how to display.
What do you get when you run:
select ascii(substring('[yourstring]',[pos],1));
--or
select unicode(substring('[yourstring]',[pos],1));
Where [yourstring] is your string and [pos] is the position of your char in the string.
EDIT
From your comment it seems like it is a question mark. Have you tried:
replace(MyColumn,'?','')
EDIT2
Out of interest, what does the following do for you:
replace(replace(MyColumn,char(146),''),char(63),'')
char(63) is a question mark. It sounds like these "unusual" characters are displayed as a question mark, but are not actually characters with char code 63.
If this is the case, then removing occurrences of char(63) (aka '?') will of course have no effect on these "unusual" characters.
I believe you actually didn't have issues with literally CHAR(63), because that should be just a normal character and you should be able to properly work with it.
What I think happened is that, by mistake, a Unicode character (for example, a Cyrillic "А") was inserted into the table - and either your:
columns setup,
the SQL code,
or the passed in parameters
were not prepared for that.
In this case, the character might be displayed as ?, and the ASCII() function would actually give 63 for it, but you should really use UNICODE() to figure out its real code.
Let me give a specific example that I have run into multiple times: the Cyrillic "А", which looks identical to the Latin one but has a Unicode code point of 1040.
If you apply the non-Unicode ASCII function to that character, you get 63, which is not its real code: when the character cannot be represented in the non-Unicode code page, it is converted to '?', whose code is 63.
Actually, run this to make the differences in my example obvious:
SELECT NCHAR(65) AS Latin_A, NCHAR(1040) Cyrilic_A, ASCII(NCHAR(1040)) Latin_A_Code, UNICODE(NCHAR(1040)) Cyrilic_A_Code;
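The same effect can be reproduced outside SQL Server. The sketch below is a Python analogy, not SQL Server's actual conversion code: it assumes a Latin single-byte code page (cp1252 is picked arbitrarily) and shows how a character missing from that code page turns into '?', which is where the 63 comes from.

```python
cyrillic_a = "\u0410"  # Cyrillic capital A, code point 1040
latin_a = "A"          # Latin capital A, code point 65

# Real code points, comparable to UNICODE() in SQL Server.
codes = (ord(latin_a), ord(cyrillic_a))

# Forcing the Cyrillic letter into a Latin single-byte code page loses it:
# the encoder substitutes '?', and the code of '?' is 63.
lossy = cyrillic_a.encode("cp1252", errors="replace")

print(codes, lossy, lossy[0])
```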
The seemingly empty character that shows up as '?' in a substring and gives an ASCII value of 63 can be a zero-width space, which may get appended when you copy data from a UI and insert it into the database.
To remove it, you can use the query below:
set MyColumn = replace(MyColumn,NCHAR(8203),'')
It's an older question, but I've run into this problem as well. I found the solution somewhere else on the internet, but I thought it would be good to share it here as well. Have a good day.
Replace(YourString, nchar(65533) COLLATE Latin1_General_BIN2, '')
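Before choosing which code point to strip, it helps to list what is actually in the string. The following sketch does that in Python on a made-up sample value; the equivalent in SQL Server would be looping over UNICODE(SUBSTRING(...)) as suggested earlier. The two code points removed are the ones named in the answers above, and your data may well contain different ones.

```python
# Hypothetical sample: a zero-width space (U+200B) and a replacement
# character (U+FFFD) hidden inside otherwise plain text.
raw = "ddd\u200bdd\ufffd ddd"

# Report the 1-based position and code point of everything outside ASCII.
suspects = [(pos, ord(ch)) for pos, ch in enumerate(raw, start=1) if ord(ch) > 127]

# Strip the two code points named in the answers above (8203 and 65533).
cleaned = raw.translate({0x200B: None, 0xFFFD: None})

print(suspects, cleaned)
```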
This should work as well when the unwanted character is the first character of the field:
UPDATE TABLE
SET [FieldName] = SUBSTRING([FieldName], 2, LEN([FieldName]))
WHERE ASCII([FieldName]) = 63