Can I use Regular Expressions in USQL? - azure-data-lake

Is it possible to write regular expression comparisons in USQL?
For example, rather than multiple "LIKE" statements to search for the name of various food items, I want to perform a comparison of multiple items using a single Regex expression.

You can create a new Regex object inline and then use the IsMatch() method.
The example below returns "Y" if the Offer_Desc column contains the word "bacon", "croissant", or "panini".
@output =
SELECT
CSHARP(new Regex("\\b(BACON|CROISSANT|PANINI)S?\\b"
)).IsMatch(wrk.Offer_Desc.ToUpper())
? "Y"
: "N" AS Is_Food
FROM ... AS wrk
Notes:
The CSHARP() block is optional, but you do need to escape any backslashes in your regex by doubling them (as in the example above).
The regex sample accepts these only as whole words, in either singular or plural form ("paninis" is okay but "baconator" is not).
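To check the whole-word behaviour of that pattern outside U-SQL, here is a minimal Python sketch. It assumes plain ASCII input, where Python's `re` module handles `\b` and `S?` the same way for this pattern; the function name `is_food` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the U-SQL example above. A raw string means the
# backslashes do not need to be doubled here.
FOOD_PATTERN = re.compile(r"\b(BACON|CROISSANT|PANINI)S?\b")


def is_food(offer_desc: str) -> str:
    """Return "Y" if the description contains one of the food words as a whole word."""
    # Upper-case the input first, mirroring Offer_Desc.ToUpper() in the query.
    return "Y" if FOOD_PATTERN.search(offer_desc.upper()) else "N"


if __name__ == "__main__":
    for desc in ("Two paninis for five dollars", "Baconator combo"):
        print(desc, "->", is_food(desc))
```

The trailing `\b` is what rejects "baconator": there is no word boundary between the "N" and the following "A".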

I'd assume the same applies inline, but when I used regex in code-behind I hit some show-stopping speed issues.
If you are checking a reasonable number of food items I'd really recommend just using an inline ternary statement to get the results you're looking for.
@output =
SELECT
wrk.Offer_Desc.ToLowerInvariant() == "bacon" ||
wrk.Offer_Desc.ToLowerInvariant() == "croissant" ||
wrk.Offer_Desc.ToLowerInvariant() == "panini" ? "Y" : "N" AS Is_Food
FROM ... AS wrk
If you do need to check whether the description contains one of the words as a substring, the string Contains method might still be a better approach.
@output =
SELECT
wrk.Offer_Desc.ToLowerInvariant().Contains("bacon") ||
wrk.Offer_Desc.ToLowerInvariant().Contains("croissant") ||
wrk.Offer_Desc.ToLowerInvariant().Contains("panini") ? "Y" : "N" AS Is_Food
FROM ... AS wrk
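The two plain-string variants above behave differently from each other, which a small Python sketch can make concrete. The helper names `equals_food` and `contains_food` and the sample strings are hypothetical; Python's `==` and `in` stand in for the C# equality and Contains calls.

```python
FOODS = ("bacon", "croissant", "panini")


def equals_food(offer_desc: str) -> str:
    """Mirror the == chain: the whole description must be exactly one food word."""
    return "Y" if offer_desc.lower() in FOODS else "N"


def contains_food(offer_desc: str) -> str:
    """Mirror the Contains chain: any food word may appear anywhere as a substring."""
    lowered = offer_desc.lower()
    return "Y" if any(food in lowered for food in FOODS) else "N"


if __name__ == "__main__":
    for desc in ("Bacon", "Bacon roll", "Baconator combo"):
        print(desc, equals_food(desc), contains_food(desc))
```

Note that the substring check accepts "Baconator combo", so it is looser than a whole-word regex, while the equality check rejects anything longer than the bare word.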

Related

What is the proper syntax for an Ecto query using ilike and SQL concatenation?

I am attempting to make an Ecto query where I concatenate the first and last name of a given query result, then perform an ilike search using a query string. For example, I may want to search the database for all names that start with "Bob J". Currently, my code looks like this:
pending_result_custom_search = from result in pending_result_query,
where: ilike(fragment("CONCAT(?, '',?)", result.first_name, result.last_name), ^query)
(pending_result_query is a previous query that I am composing on top of)
This approach does not work and I continue to get an empty query set. If I perform the query doing something like this
query = "Bob"
pending_result_custom_search = from result in pending_result_query,
where: ilike(fragment("CONCAT(?, '',?)", "%Bob%", ""), ^query)
I get the correct functionality.
What is the proper syntax to get the first approach working properly?
I think in your case I would use only fragment, e.g.
query = "%" <> "Bob" <> "%"
pending_result_custom_search = from result in pending_result_query,
where: fragment("first_name || last_name ILIKE ?", ^query)
That way you can shift the focus to PostgreSQL and use its functions instead of worrying too much about the Ecto abstractions of them. In the above example, I used || to concatenate column values, but you could use PostgreSQL's CONCAT() if you prefer:
pending_result_custom_search = from result in pending_result_query,
where: fragment("CONCAT(first_name, last_name) ILIKE ?", ^query)
Note that both examples here did not include a space between first_name and last_name. Also, I added the % characters to the search query before binding it.
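If the search string is expected to contain the space between the names (as in "Bob J"), the concatenation needs one too. Here is a rough sketch of the idea using Python's built-in sqlite3 module rather than Ecto and PostgreSQL, so treat it as an analogy: SQLite's LIKE (case-insensitive for ASCII by default) stands in for ILIKE, and the table name `results`, the sample rows, and the helper `search_full_name` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [("Bob", "Jones"), ("Bobby", "Smith"), ("Alice", "Bobson")],
)


def search_full_name(pattern: str):
    """Match a LIKE pattern against "first_name last_name" (with a space)."""
    rows = conn.execute(
        "SELECT first_name, last_name FROM results "
        "WHERE first_name || ' ' || last_name LIKE ? "
        "ORDER BY first_name",
        (pattern,),  # the wildcards are added by the caller before binding
    )
    return rows.fetchall()


if __name__ == "__main__":
    print(search_full_name("bob j%"))
    print(search_full_name("%bob%"))
```

As in the answer above, the % characters are part of the bound value, not of the SQL text.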

PostgreSQL - find matching line in char/string column?

How can I find a matching line in a char/string type column?
For example, let's say I have a column called text and some row has this content:
12345\nabcdf\nXKJKJ
(where \n are real new lines)
Now I want to find the row if any of its lines matches exactly. For example, if I have the value 12345,
it should find a match. But if I have the value 123, it should not.
I tried using LIKE, but it matches in both cases: when I have a fully matching value (like 12345) and when I have a partially matching value (like 123).
I want something like this, but with a boundary so that only whole lines are compared:
SELECT id
FROM my_table
WHERE text like [SOME_VALUE]
Update
Maybe it's not yet clear what I'm asking. Basically, I want something equivalent to what you can do with a regular expression,
like this: https://regexr.com/5akj1
Here the regular expression /^123$/m would not match my string; only the pattern /^12345$/m would. (The value is dynamic, so the pattern changes depending on the value I get.)
You may use regexp_replace and then check that the replaced string is not equal to the original column value:
select count(*)
from dummy
where regexp_replace(mytext, '(?m)^1234$', '') <> mytext;
You have a demo here.
Bear in mind that I have used the (?m) modifier, which makes ^ and $ match the beginning and end of each line instead of the beginning and end of the whole string.
You should be able to use ~ for matching:
where mytext ~ '(\n|^)1234(\n|$)'
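The same whole-line idea can be tried out quickly in Python, whose `re.MULTILINE` flag plays the role of the multiline modifier. This is only an illustration of the matching logic, not PostgreSQL code; the helper name `has_exact_line` is hypothetical, and `re.escape` is used because the value is dynamic and might contain regex metacharacters.

```python
import re


def has_exact_line(text: str, value: str) -> bool:
    """Return True if any complete line of `text` equals `value`."""
    # ^ and $ anchor to line starts and ends because of re.MULTILINE.
    pattern = re.compile(r"^" + re.escape(value) + r"$", re.MULTILINE)
    return pattern.search(text) is not None


if __name__ == "__main__":
    sample = "12345\nabcdf\nXKJKJ"
    for value in ("12345", "123", "abcdf"):
        print(value, has_exact_line(sample, value))
```

When building the pattern from a dynamic value in SQL, the value would need equivalent escaping before it is spliced into the regex.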

Why "=" and "like" work in the same statement

I was practicing my SQL injection skills, and I found that I could put = and LIKE in a single statement.
However, I'm not sure what this means or why it works.
SELECT 1 FROM users WHERE name='' LIKE '%'
So, what does it mean when I put = and LIKE in the same statement, and when would I write something like this?
I am guessing that you are using MySQL, because this is syntactically correct in MySQL. It treats boolean types as numbers (which will be converted to integers and strings).
So, your code should be parsed as:
WHERE (name = '') LIKE '%'
This is because = and LIKE have the same precedence, and when operators have the same precedence, they are evaluated left-to-right (as explained in the documentation).
This, in turn evaluates to one of these three possibilities:
WHERE 1 LIKE '%' -- when name = ''
WHERE 0 LIKE '%' -- otherwise when name is not null
WHERE NULL like '%'
The first two will always evaluate to true. The third would discard any row where name is null.
In MySQL and other popular DBMSs, the LIKE operator searches a column for a specified pattern. It accepts "%" as a wildcard that represents zero, one, or multiple characters.
Your query passes for every non-NULL name because '%' matches any string, including the empty one (zero characters). Some DBMSs will react differently to such a query, though.
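The left-to-right parsing can be observed with Python's built-in sqlite3 module. This is SQLite, not MySQL, so take it as an analogy: SQLite also puts = and LIKE at the same precedence level and also represents comparison results as 1, 0, or NULL. The table contents and the helper `matching_names` are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("",), ("bob",), (None,)])


def matching_names(where_clause: str):
    """Run the query with the given WHERE clause and return the matching names."""
    # The clause is concatenated only because it comes from this demo itself.
    sql = "SELECT name FROM users WHERE " + where_clause + " ORDER BY name"
    return [row[0] for row in conn.execute(sql)]


if __name__ == "__main__":
    # Without parentheses, and with the grouping spelled out explicitly.
    print(matching_names("name='' LIKE '%'"))
    print(matching_names("(name='') LIKE '%'"))
```

Both clauses select the same rows: every row whose name is not NULL, whether or not it is the empty string.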

Escaping (, round brackets sybase SQL

I am working with Sybase SQL and want to exclude all entries that look like this:
(NOT PRESENT)
So I tried using:
SELECT col FROM table WHERE col NOT LIKE '(%)'
Do you guys know what is happening? I think I need to escape ( somehow, but I do not know how. The following returns an error:
SELECT col FROM table WHERE col NOT LIKE '\(%\)' ESCAPE '\'
Kind Regards
Try this:
SELECT col FROM table WHERE col NOT LIKE ('(%)')
You might find this helpful
Sybase Event Stream Processor 5.0 CCL Programmers Guide - String Functions
like()
Scalar. Determines whether a given string matches a specified pattern string.
Syntax
like ( string, pattern )
Parameters
string A string.
pattern A pattern of characters, as a string. Can contain wildcards.
Usage
Determines whether a string matches a pattern string. The function returns 1 if the string matches the pattern, and 0 otherwise. The pattern argument can contain wildcards: '_' matches a single arbitrary character, and '%' matches 0 or more arbitrary characters. The function takes in two strings as its arguments, and returns an integer.
Note: In SQL, the infix notation can also be used: sourceString like patternString.
Example
like ('MSFT', 'M%T') returns 1.
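Going only by the wildcard rules quoted above ('_' for one character, '%' for zero or more), parentheses are ordinary characters in a pattern. The following Python sketch emulates that description; it is an approximation written for illustration, not Sybase's implementation, it ignores ESCAPE clauses and any other wildcard forms a particular Sybase product may support, and the function simply reuses the name `like` from the quoted documentation.

```python
import re


def like(string: str, pattern: str) -> int:
    """Emulate like(string, pattern) as described: returns 1 on a match, else 0."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # zero or more arbitrary characters
        elif ch == "_":
            parts.append(".")   # exactly one arbitrary character
        else:
            parts.append(re.escape(ch))  # everything else, including ( and ), is literal
    regex = "".join(parts)
    return 1 if re.fullmatch(regex, string, flags=re.DOTALL) else 0


if __name__ == "__main__":
    print(like("MSFT", "M%T"))
    print(like("(NOT PRESENT)", "(%)"))
    print(like("PRESENT", "(%)"))
```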

regexp after a word appear

I'm using a regexp to find the text that comes after a certain word.
Fiddle demo
The problem is that some addresses use different abbreviations for "big house"; some are followed by a space and some by a dot:
Quinta
QTA
Qta.
I want all the text after any of those appears, ignoring case.
I tried this one, but I'm not sure how to include multiple starting words:
SELECT
REGEXP_SUBSTR ("Address", '[^QUINTA]+') "REGEXPR_SUBSTR"
FROM Address;
Solution:
I believe this will match the abbreviations you want:
SELECT
REGEXP_REPLACE("Address", '^.*Q(UIN)?TA\.? *|^.*', '', 1, 1, 'i')
"REGEXPR_SUBSTR"
FROM Address;
Demo in SQL fiddle
Explanation:
It tries to match everything from the beginning of the string:
until it finds Q + UIN (optional) + TA + . (optional) + any number of spaces.
if it doesn't find it, then it matches the whole string with ^.*.
Since I'm using REGEXP_REPLACE, it replaces the match with an empty string, thus removing all characters until "QTA", any of its alternations, or the whole string.
Notice the last parameter passed to REGEXP_REPLACE: 'i'. That is a flag that sets a case-insensitive match (flags described here).
The part you were interested in making optional uses a ( pattern ) that is a group with the ? quantifier (which makes it optional). Therefore, Q(UIN)?TA matches either "QUINTA" or "QTA".
Alternatively, in the scope of your question, if you wanted different options, you need to use alternation with a |. For example (pattern1|pattern2|etc) matches any one of the 3 options. Also, the regex (QUINTA|QTA) matches exactly the same as Q(UIN)?TA
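The replacement logic explained above can be tried in Python, whose regex syntax is close enough for this particular pattern. This is a sketch rather than Oracle code: `re.IGNORECASE` plays the role of the 'i' flag and `count=1` replaces only the first match. The helper name `text_after_quinta` and the sample addresses are invented.

```python
import re

# Same pattern as in the REGEXP_REPLACE call above.
PREFIX = re.compile(r"^.*Q(UIN)?TA\.? *|^.*", re.IGNORECASE)


def text_after_quinta(address: str) -> str:
    """Strip everything up to and including the last QUINTA/QTA/QTA. marker."""
    return PREFIX.sub("", address, count=1)


def same_match(word: str) -> bool:
    """Check that Q(UIN)?TA and (QUINTA|QTA) agree on a whole word."""
    a = re.fullmatch(r"Q(UIN)?TA", word) is not None
    b = re.fullmatch(r"(QUINTA|QTA)", word) is not None
    return a == b


if __name__ == "__main__":
    for addr in ("Av. Principal Quinta Los Pinos", "Calle 8 QTA. Las Rosas", "Calle 5"):
        print(repr(text_after_quinta(addr)))
```

An address with no marker at all falls through to the second alternative (^.*), so the whole string is removed and the result is empty.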
What was wrong with your pattern:
The construct you were trying ([^QUINTA]+) uses a character class, and it matches any character except Q, U, I, N, T or A, repeated 1 or more times. But it's applied to characters, not words. For example, [^QUINTA]+ matches the string "BCDEFGHJKLMOPRSVWXYZ" completely, and it fails to match "TIA".
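A short Python check makes the character-class behaviour visible. The helper name `runs_outside_class` and the sample address are made up for illustration, and the pattern is case-sensitive here, as in the original attempt.

```python
import re

# A negated character class: any run of characters other than Q, U, I, N, T, A.
CHAR_CLASS = re.compile(r"[^QUINTA]+")


def runs_outside_class(text: str):
    """Return every run of characters that are not one of Q, U, I, N, T, A."""
    return CHAR_CLASS.findall(text)


if __name__ == "__main__":
    print(CHAR_CLASS.fullmatch("BCDEFGHJKLMOPRSVWXYZ") is not None)
    print(CHAR_CLASS.search("TIA") is not None)
    print(runs_outside_class("CASA QUINTA ROSA"))
```

On an address, the class splits the text at every individual Q, U, I, N, T, or A, which is why it cannot be used to find the word "QUINTA".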