sql complex like expression #user - sql

I'm trying to find all the mentions of a given username (sam) in the database text, using the following:
$sql = "select * from tweet where feed like '%#sam%'";
This returns all the rows in which the user has been mentioned.
But the problem is that it also returns rows with #sam3, #sam-dadf, or anything else that follows #sam.
How can I limit this so that only the specific username is matched in the text, and not every string that starts with #sam? The following is the format of the text in the database, as inserted in the feed column.
1. i been out with #sam today, but im not sure what we should do
2. we head great party today and all the frineds were invited, such as #sam, #jon, #dan...
3. i been out today with #samFan and with #dj. << this row should not get pulled from the database

Yes, use REGEXP (or RLIKE), but watch out for a common regex mistake: matching the end of the desired "token" with nothing more than a negated character class such as [^A-Za-z0-9]. Instead, use the zero-width "end of word" construct [[:>:]] (which Perl's engine and the regex engines it inspired know as \b). Note that MySQL 8.0.4 and later use the ICU regex library, which does not support [[:>:]]; there, write the word boundary as \\b.
The negated character class fails to match at the end-of-string:
mysql> SELECT 'I am #sam' REGEXP '#sam[^A-Za-z0-9]' AS "Does This Match?";
+------------------+
| Does This Match? |
+------------------+
| 0 |
+------------------+
1 row in set (0.00 sec)
where the word boundary match succeeds:
mysql> SELECT 'I am #sam' REGEXP '#sam[[:>:]]' AS "Does This Match?";
+------------------+
| Does This Match? |
+------------------+
| 1 |
+------------------+
1 row in set (0.00 sec)
If [[:>:]] isn't quite right for your application (because your "username" character set is not what the MySQL regex engine thinks of as one side of a word boundary in your locale), you can, alternatively, specify a negated character class and separately test for end-of-string:
SELECT ... WHERE (feed REGEXP '#sam[^A-Za-z0-9]' or feed REGEXP '#sam$')
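The difference between the three patterns above can be sketched outside the database. This is only an illustration in Python's `re` module, not MySQL's engine: `\b` stands in for [[:>:]], the sample rows are adapted from the question, and the helper name `matching_rows` is made up for the example.

```python
import re

# Sample rows adapted from the question, plus the end-of-string case.
rows = [
    "i been out with #sam today, but im not sure what we should do",
    "all the friends were invited, such as #sam, #jon, #dan...",
    "i been out today with #samFan and with #dj.",
    "I am #sam",
]

# Negated class only: needs one more character after "#sam".
NEGATED_CLASS = re.compile(r"#sam[^A-Za-z0-9]")
# Word boundary: zero-width, so it also succeeds at end-of-string.
WORD_BOUNDARY = re.compile(r"#sam\b")
# Negated class OR end-of-string, as in the last query above.
CLASS_OR_END = re.compile(r"#sam(?:[^A-Za-z0-9]|$)")


def matching_rows(pattern, texts):
    """Return the texts in which the pattern is found anywhere."""
    return [t for t in texts if pattern.search(t)]


for name, pattern in [("negated class", NEGATED_CLASS),
                      ("word boundary", WORD_BOUNDARY),
                      ("class or end", CLASS_OR_END)]:
    print(name, "->", len(matching_rows(pattern, rows)), "rows")
```

All three patterns still accept `#sam-dadf`, because `-` is neither a letter nor a digit. If hyphens are legal in usernames, the negated class would have to exclude them as well.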

I would recommend using MySQL regular expressions to achieve this.
Search the string for your required text, excluding selected follow-up characters.
MySQL Regular Expressions

I'm sorry, but I'm not familiar with MySQL.
In Microsoft SQL Server, the LIKE clause lets you specify characters NOT to match.
Ex:
if ('I like #Spam.' LIKE '%#Spam[^a-z]%') SELECT 1

select * from table where feed REGEXP '#sam[^A-Za-z0-9]'

Related

Impala Regex - find specific sequence of characters, with delimiters between them, some are not letters, digits or underscore

I am new to regex and need to search a string field in Impala for multiple matches of this exact sequence of characters: ~FC* followed by 11 more * that may or may not have letters/digits between them (the asterisks are basically delimiters in this string field). The 12th * (counting the one in ~FC* as #1) should be immediately followed by Y~.
Since the asterisks are not letters or digits, I am unsure how to search for these delimiters properly.
This is my SQL so far:
select
regexp_extract(col_name, '(~FC\\*).*(\\*Y~)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
data returned:
pattern_found
--------------
~FC*
(~FC\\*) in Impala SQL returns ~FC*, which is great (I got it from my other question).
I have been trying (~FC\\*).*(\\*Y~), which obviously isn't counting the number of asterisks, but it is also not picking up the Y.
This is a test string, it has 2 occurrences:
N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
The results should be these 2, which share an overlapping ~ between them, but I will settle for at least the first being found if both cannot be.
~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y~
~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~
I figured out a solution, but I'm happy to learn of a better way to accomplish this.
This is what worked in Impala SQL; it needed parentheses and double-escaped backslashes for all the asterisks:
(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)
Full SQL:
select
regexp_extract(col_name, '(~FC\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*[^\\*]*\\*Y)', 1) as "pattern_found"
from db.table
where id = 123456789
limit 1
and here is the RegexDemo without the additional syntax needed for Impala SQL
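As a possible shortening of the pattern above, the eleven repeated `[^*]*\*` pieces can be written once with a counted quantifier. The sketch below uses Python's `re` module rather than Impala, so the backslashes are not doubled. The lookahead wrapper used to return both overlapping matches is a Python-side assumption and may not be accepted by Impala's regex engine. `find_segments` is a name invented for the example.

```python
import re

# Test string from the question (split across two literals for readability).
SAMPLE = ("N4*CITY*STATE*2155446*2120~FC*C*IND*30*MC*blah blah fjdgfeufh*27*0*****Y"
          "~FC*Z*IND*39*MC*jhlkfhfudfgsdkufgkusgfn*23*0*****Y~")

# ~FC* followed by eleven "anything-but-asterisk, then asterisk" groups, then Y~.
# Wrapping the whole thing in a lookahead lets the two segments share the "~".
SEGMENT = re.compile(r"(?=(~FC\*(?:[^*]*\*){11}Y~))")


def find_segments(text):
    """Return every ~FC*...*Y~ segment, including ones that overlap."""
    return SEGMENT.findall(text)


for segment in find_segments(SAMPLE):
    print(segment)
```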

SQL Server Like Pattern

I want to match a custom pattern in one of the columns in a SQL Server database. The problem is that I don't know the exact pattern length.
I want only those rows that contain 'function' and an alphanumeric pattern of at least 5 and at most 8 characters. The starting and ending characters are not fixed, and the match is not case sensitive.
Column value looks like this:
Row Value
--------------------------------------------------------------------
1 I have a single own function and its namely 123BA689,BAS54256
2 Everyone has base function AFD12,CHD12234
3 Nicole has its own ASS1256902,25ADFG2
Desired output:
Row Value
--------------------------------------------------------------------
1 I have a single own function and its namely 123BA689,BAS54256a
2 Everyone has base function AFD12,CHD1223465AS
I have tried Like and regex to match pattern but failed.
Does anybody know how to fix it?
select *
from ab
where lower(ab.a) like '%function' and '%[a-z0-9]{6}%'
Thanks.
SQL Server doesn't support regular expressions. You could conceptually do what you want with:
where lower(ab.a) like '%function%' and
lower(ab.a) like '%[a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]%' and
lower(ab.a) not like '%[a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]%'
However, this will return any string that has "function" in it, because "function" is itself an alphanumeric pattern with 5-8 characters.
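The caveat can be made concrete with a small sketch in Python's `re` module (not SQL Server). It assumes that a "pattern" means a run of 5 to 8 letters or digits that does not touch any other letter or digit. `runs_5_to_8` is a hypothetical helper.

```python
import re

# A run of 5-8 lowercase letters/digits with no letter/digit on either side.
RUN_5_TO_8 = re.compile(r"(?<![a-z0-9])[a-z0-9]{5,8}(?![a-z0-9])")


def runs_5_to_8(value):
    """Return all 5-8 character alphanumeric runs, compared case-insensitively."""
    return RUN_5_TO_8.findall(value.lower())


# "function" is 8 alphanumeric characters, so it satisfies the length test itself.
print(runs_5_to_8("my function x"))
# Ordinary words such as "everyone" qualify too, not only the codes.
print(runs_5_to_8("Everyone has base function AFD12,CHD12234"))
```

So any filter built only on "contains a 5-8 character alphanumeric run" needs some further rule to tell the codes apart from ordinary words.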

Regular expression to remove element not match specific prefix

I am doing this in Impala or Hive. Basically let say I have a string like this
f-150:aa|f-150:cc|g-210:dd
Each element is separated by the pipe |. Each element has a prefix such as f-150. I want to be able to remove the prefix and keep only the elements that match a specific prefix. For example, if the prefix is f-150, I want the final string after regexp_replace to be
aa|cc
dd is removed because g-210 is a different prefix and does not match, so the whole element is removed.
Any idea how to do this using string expression in one SQL?
Thanks
UPDATE 1
I tried this in Impala:
select regexp_extract('f-150:aa|f-150:cc|g-210:dd','(?:(?:|(\\|))f-150|keep|those):|(?:^|\\|)\\w-\\d{3}:\\w{2}',0);
But got this output:
f-150:aa
In Hive, I got NULL.
The regex in question could look like this:
(?:(?:|(\\|))f-150|keep|those):|(?:^|\\|)\\w-\\d{3}:\\w{2}
I have added some pseudo keywords to retain, but I am sure you get the idea:
Wholly match the elements that should be dropped, but match only the prefix of those that should be retained.
To keep the separator intact, match | at the beginning of an element in group 1 and put it back in the replacement with $1.
Demo
According to the documentation, your query should be written like a Java regex; likewise, this should perform like this code sample in Java.
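The replace-with-group-1 idea can be tried out in Python's `re` module as a rough stand-in for the Java-style engine. The backslashes are single here because this is a raw string, the pseudo keywords `keep` and `those` are kept from the answer, and `keep_prefix` is a name made up for this sketch.

```python
import re

# Either match only the prefix of a retained element (capturing a leading "|"
# in group 1), or match a whole element of the form <letter>-<3 digits>:<2 chars>.
KEEP_OR_DROP = re.compile(r"(?:(?:|(\|))f-150|keep|those):|(?:^|\|)\w-\d{3}:\w{2}")


def keep_prefix(value):
    """Replace each match with group 1 (the separator), or nothing if unset."""
    return KEEP_OR_DROP.sub(lambda m: m.group(1) or "", value)


print(keep_prefix("f-150:aa|f-150:cc|g-210:dd"))
```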
You could match the values that you want to remove and then replace with an empty string:
f-150:|\|[^:]+:[^|]+$|[^|]+:[^|]+\|
f-150:|\\|[^:]+:[^|]+$|[^|]+:[^|]+\\|
Explanation
f-150: Match literally
| Or
\|[^:]+:[^|]+$ Match a pipe, not a colon one or more times followed by not a pipe one or more times and assert the end of the line
| Or
[^|]+:[^|]+\| Match not a pipe one or more times, a colon followed by matching not a pipe one or more times and then match a pipe
Test with multiple lines and combinations
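The three alternatives explained above can be checked with a short sketch in Python's `re` module. It uses the unescaped form of the pattern and replaces each match with an empty string. `strip_elements` is a hypothetical helper name.

```python
import re

# 1) the literal prefix "f-150:"
# 2) a pipe plus a whole "prefix:value" element at the end of the string
# 3) a whole "prefix:value" element followed by a pipe
DROP = re.compile(r"f-150:|\|[^:]+:[^|]+$|[^|]+:[^|]+\|")


def strip_elements(value):
    """Remove the f-150 prefixes and every element with a different prefix."""
    return DROP.sub("", value)


print(strip_elements("f-150:aa|f-150:cc|g-210:dd"))
print(strip_elements("g-210:dd|f-150:aa"))
```

The order of the alternatives matters: `f-150:` is tried first, so a retained element only loses its prefix instead of being consumed whole by the third alternative.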
You may have to loop through the string until the end to get all the matching substrings. Lookahead syntax is not supported in most SQL dialects, so the regexps above might not be usable in SQL. For your purpose, you can create a table to loop through, mimicking Oracle's level syntax, and join it with the table containing the string.
With loop_tab as (
Select 1 loop union all
Select 2 union all
select 3 union all
select 4 union all
select 5),
string_tab as(Select 'f-150:aa|ade|f-150:ce|akg|f-150:bb|'::varchar(40) as str)
Select regexp_substr(str,'(f\\-150\\:\\w+\\|)',1,loop)
from string_tab
join loop_tab on 1=1
Output:
regexp_substr
f-150:aa|
f-150:ce|
f-150:bb|

How to use regex OR operation in impala regex_extract method and get different capture group

I have the following table1 with attribute co:
|-----------------------------------------
| co
|-----------------------------------------
| fsdsdf "This one" fdsfsd ghjhgj "sfdsf"
| Just This
|-----------------------------------------
If there are quotation marks, I would like to get the content of the first quoted occurrence. If there are no quotation marks, I would like to return the content as is.
For the above example:
For the first line - This one
For the second line - Just This
I have SQL code in Impala that solves the first case:
select regexp_extract (co, '"([^"]*")',1) from table1
How can I generalize it to detect and return the required results for the next case?
You cannot generalize it in Impala. The problem you are having requires an OR (|) in your regex. With regexp_extract you need to pass the capture group number as the last argument, e.g.
select regexp_extract (co, '"([^"]*")',1) from table1
But with the | operator in a regex, the capture group will be different for each case, and you cannot express that in a single regexp_extract call.
Say (A)|(B) is your regex: for your first case the capture group will be 1, and for your second case it will be 2. But to date you cannot pass both 1 and 2 to regexp_extract.
The generic regex would be (which I guess won't work with Impala's grouping):
^(?!.*")(.*)$|^[^"]*"(.*?)".*$
Watch out for the capture groups.
In the link, you will see "This one" is captured as group 2,
whereas "Just This" is captured as group 1.
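The group numbering described above can be seen in a small Python sketch. Python's `re` module is used only to illustrate the generic regex, not Impala. The coalescing step that picks whichever group matched is the part a single regexp_extract call cannot express. `first_quoted_or_whole` is a name invented here.

```python
import re

# Group 1: the whole value when it contains no quotation mark.
# Group 2: the content of the first quoted section otherwise.
FIRST_QUOTED_OR_WHOLE = re.compile(r'^(?!.*")(.*)$|^[^"]*"(.*?)".*$')


def first_quoted_or_whole(co):
    """Return group 1 if it took part in the match, else group 2."""
    m = FIRST_QUOTED_OR_WHOLE.match(co)
    if m is None:
        return None  # e.g. a single unbalanced quotation mark
    return m.group(1) if m.group(1) is not None else m.group(2)


print(first_quoted_or_whole('fsdsdf "This one" fdsfsd ghjhgj "sfdsf"'))
print(first_quoted_or_whole("Just This"))
```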
Try this using a union.
select regexp_extract (co, '"([^"]*")',1) from table1
union
select co from table1 where co like '"%"'
You can use an if function and put RegEx functions inside for the arguments. So,
if(regexp_like(co,'"'),
regexp_extract(co,'"([^"]*)',1), co)

Informix Accent Insensitive Search

Is there any way (a function, a config option, etc.) to force informix to ignore accents on searches?
Example:
select id, name from user where name like 'conceição%'
Returns:
1 | conceicao oliveira
2 | conceiçao santos
3 | conceicão andrade
4 | conceição barros
Thanks
Not directly, that I'm aware of. You could install the Regex DataBlade module, then use its regexp_match function, replacing the query with something like this:
where regexp_match(name , 'concei[çc][ãa][o]%')
Or, if you don't have that option, I would add another 'normalized_name' column in which all the accented characters are replaced with a "standard" character, and then query the table based on that.
name='conceiçao santos', normalized_name='conceicao santos'
Adding a normalized column will also make sure you're not dependent on any module, or on any particular database for that matter.
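One way the normalized_name value could be computed in application code before insert is sketched below. It assumes the names pass through Python, and it uses Unicode decomposition from the standard `unicodedata` module to drop the accent marks. `normalize_name` is a hypothetical helper, not an Informix function.

```python
import unicodedata


def normalize_name(name):
    """Strip accents: decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["conceicao oliveira", "conceiçao santos",
         "conceicão andrade", "conceição barros"]
for name in names:
    # Each row stores both the original and the accent-free form.
    print(name, "->", normalize_name(name))
```

The search term would be normalized the same way, so that `normalized_name like 'conceicao%'` finds all four rows.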