Compare strings by their written representation - is it possible? - sql

For example, I have a column that contains the string "AB1234" (written with the English letters "A" and "B"), and I'd like to compare it to the string "AB1234" written with the Russian letters "А" and "В".
Is there any built-in function to achieve this?
The best way I have found is to use the TRANSLATE function, where I enumerate all the needed symbols.

You are looking for a function LOOKS LIKE.
Unfortunately there is no such function in SQL.
Instead, you can create a function-based index that casts every string to a common denominator using TRANSLATE, and search for the string:
CREATE INDEX ix_mytable_transliterate ON mytable (TRANSLATE(UPPER(str), 'АВЕКМНОРСТУХ', 'ABEKMHOPCTYX'));
SELECT *
FROM mytable
WHERE TRANSLATE(UPPER(str), 'АВЕКМНОРСТУХ', 'ABEKMHOPCTYX') = TRANSLATE(UPPER('весна на танке'), 'АВЕКМНОРСТУХ', 'ABEKMHOPCTYX');
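To make the "common denominator" idea concrete, here is a minimal Python sketch of the same folding step outside the database. The names `canonical` and `looks_like` are made up for illustration, and the look-alike set simply mirrors the twelve letters in the TRANSLATE call; it is an assumption that this set is enough for your data.

```python
# Cyrillic capitals А В Е К М Н О Р С Т У Х, written as escapes so they
# cannot be confused with the Latin letters they resemble.
CYRILLIC_LOOKALIKES = (
    "\u0410\u0412\u0415\u041a\u041c\u041d"
    "\u041e\u0420\u0421\u0422\u0423\u0425"
)
LATIN_TWINS = "ABEKMHOPCTYX"

# One-to-one mapping, position by position, like TRANSLATE's two arguments.
_TABLE = str.maketrans(CYRILLIC_LOOKALIKES, LATIN_TWINS)


def canonical(text: str) -> str:
    """Upper-case the text, then fold Cyrillic look-alikes to Latin letters."""
    return text.upper().translate(_TABLE)


def looks_like(left: str, right: str) -> bool:
    """True when both strings share the same canonical (visual) form."""
    return canonical(left) == canonical(right)
```

With this sketch, `looks_like("\u0410\u04121234", "AB1234")` compares a Cyrillic-prefixed string against a Latin one and treats them as equal, because both fold to the same canonical form.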

Related

Selecting substrings from different points in strings depending on another column entry SQL

I have 2 columns that look a little like this:
Column A | Column B                            | Column C
ABC      | {"ABC":1.0,"DEF":24.0,"XYZ":10.50,} | 1.0
DEF      | {"ABC":1.0,"DEF":24.0,"XYZ":10.50,} | 24.0
I need a select statement to create Column C: the numerical digits in Column B that correspond to the letters in Column A. I have got as far as finding the starting point of the numbers I want to take out, but as they have different character lengths I can't use a fixed length. I want to extract the characters from the calculated starting point (below) up to the next comma.
STRPOS(Column B, Column A) + 5 gives me the correct character for the starting point of a SUBSTRING query, but from here I am lost. Any help much appreciated.
NB: I am using Google BigQuery, which doesn't recognise CHARINDEX.
You can use a regular expression as well.
WITH sample_table AS (
SELECT 'ABC' ColumnA, '{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}' ColumnB UNION ALL
SELECT 'DEF', '{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}' UNION ALL
SELECT 'XYZ', '{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}'
)
SELECT *,
REGEXP_EXTRACT(ColumnB, FORMAT('"%s":([0-9.]+)', ColumnA)) ColumnC
FROM sample_table;
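For readers who want to try the pattern outside BigQuery, here is a small Python sketch of the same idea: build the regular expression from the key, then capture the digits and dots that follow it. The function name `extract_value` is hypothetical, and `re.escape` is an extra precaution (not in the query above) in case a key contains regex metacharacters.

```python
import re
from typing import Optional


def extract_value(json_like: str, key: str) -> Optional[str]:
    """Return the number that follows "key": in json_like, or None if absent."""
    # Same shape as FORMAT('"%s":([0-9.]+)', ColumnA), with the key escaped.
    pattern = '"' + re.escape(key) + '":([0-9.]+)'
    match = re.search(pattern, json_like)
    return match.group(1) if match else None


column_b = '{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}'
```

As in the SQL version, the result is a string, so the trailing zero of 10.50 is preserved; cast it if you need a number.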
[Updated]
Regarding @Bihag Kashikar's suggestion: since ColumnB is invalid JSON (note the trailing comma), it will not be parsed properly within a JS UDF like the one below. If it were valid JSON, I think a JS UDF that looks up the JSON key could be an alternative to a regular expression.
CREATE TEMP FUNCTION custom_json_extract(json STRING, key STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    obj = JSON.parse(json);
  }
  catch {
    return null;
  }
  return obj[key];
""";
SELECT custom_json_extract('{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}', 'ABC') invalid_json,
       custom_json_extract('{"ABC":1.0,"DEF":24.0,"XYZ":10.50}', 'ABC') valid_json;
Take a look at this post too, which shows an approach using a JS UDF and another using SPLIT:
Error when trying to have a variable pathsname: JSONPath must be a string literal or query parameter

Presto array contains an element that likes some pattern

For example, one column in my table is an array, I want to check if that column contains an element that contains substring "denied" (so elements like "denied at 12:00 pm", "denied by admin" will all count, I believe I will have to use "like" to identify the pattern). How to write sql for this?
Use Presto's array functions:
filter(), which returns elements that satisfy the given condition
cardinality(), which returns the size of an array:
Like this:
where cardinality(filter(myArray, x -> x like '%denied%')) > 0
In newer versions of PrestoSQL (now known as Trino), you can use the any_match function:
WHERE any_match(column, e -> e like '%denied%')
See array operator docs here
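Note that contains(array_column, 'denied') only matches elements that are exactly equal to 'denied'; it does not do substring matching.
We can also use strpos (which returns the starting position of a substring, or 0 if it is not found) here. (documentation) Since strpos takes a string rather than an array, join the elements first:
where strpos(array_join(array_column, ' '), 'denied') > 0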
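The difference between "some element contains the substring" and "some element equals the value" is easy to see in a few lines of Python. This is only an illustration of the predicate logic; `has_match` is a made-up name and plays the role of filter/cardinality or any_match with a '%denied%' pattern.

```python
from typing import Iterable


def has_match(items: Iterable[str], needle: str) -> bool:
    """True if any element of items contains needle as a substring."""
    return any(needle in item for item in items)


events = ["denied at 12:00 pm", "approved by admin"]

# Substring test: succeeds because one element contains "denied".
substring_hit = has_match(events, "denied")

# Exact-membership test: fails because no element is exactly "denied".
exact_hit = "denied" in events
```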

How to write SQL query with many % wildcard characters

I have a column in a SQL Server table as:
companystring = {"CompanyId":0,"CompanyType":1,"CompanyName":"Test 215","TradingName":"Test 215","RegistrationNumber":"Test 215","Email":"test215@tradeslot.com","Website":"Test 215","DateStarted":"2012","CompanyValidationErrors":[],"CompanyCode":null}
I want to query the column to search for
companystring like '%CompanyName":"%test 2%","%'
I want to know if I'm querying correctly, because for some search strings it does not yield the proper result. Could anyone please help me with this?
% is a special character that means a wildcard. If you want to find the actual character inside a string, you need to escape it.
DECLARE @d TABLE(id INT, s VARCHAR(32));
INSERT @d VALUES(1,'foo%bar'),(2,'fooblat');
SELECT id, s FROM @d WHERE s LIKE 'foo[%]%'; -- returns only 1
SELECT id, s FROM @d WHERE s LIKE 'foo%'; -- returns both 1 and 2
Depending on your platform, you might be able to use some combination of regular expressions and/or lambda expressions that are built into its main libraries. For example, .NET has LINQ, a powerful tool that abstracts querying and can be leveraged for searches.
It looks like you have JSON data stored in a column called "companystring". If you want to search within the JSON data from SQL things get very tricky.
I would suggest you look at doing some extra processing at insert/update to expose the properties of the JSON you want to search on.
If you search in the way you describe, you would actually need to use Regular Expressions or something else to make it reliable.
In your example you say you want to search for:
companystring like '%CompanyName":"%test 2%","%'
I understand this as searching inside the JSON for the string "test 2" somewhere inside the "CompanyName" property. Unfortunately this would also return results where "test 2" was found in any other property after "CompanyName", such as the following:
-- formatted for readability
companystring = '{
"CompanyId":0,
"CompanyType":1,
"CompanyName":"Test Something 215",
"TradingName":"Test 215",
"RegistrationNumber":"Test 215",
"Email":"test215@tradeslot.com",
"Website":"Test 215",
"DateStarted":"2012",
"CompanyValidationErrors":[],
"CompanyCode":null}'
Even though "test 2" isn't in the CompanyName, it is in the text following it (TradingName), which is also followed by the string "," so it would meet your search criteria.
Another option would be to create a view that exposes the value of CompanyName using a column defined as follows:
LEFT(
SUBSTRING(companystring, CHARINDEX('"CompanyName":"', companystring) + LEN('"CompanyName":"'), LEN(companystring)),
CHARINDEX('"', SUBSTRING(companystring, CHARINDEX('"CompanyName":"', companystring) + LEN('"CompanyName":"'), LEN(companystring))) - 1
) AS CompanyName
Then you could query that view using WHERE CompanyName LIKE '%test 2%' and it would work, although performance could be an issue.
The logic of the above is to get everything after "CompanyName":":
SUBSTRING(companystring, CHARINDEX('"CompanyName":"', companystring) + LEN('"CompanyName":"'), LEN(companystring))
Up to but not including the first " in the sub-string (which is why it is used twice).
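The two behaviours discussed above can be compared side by side in a short Python sketch. `company_name` mirrors the CHARINDEX/SUBSTRING logic with `str.find`, and `loose_like` is a rough regex stand-in for the LIKE pattern. Both names are invented here, the sketch assumes values never contain an escaped quote, and the case-insensitive flag assumes a case-insensitive collation on the SQL side.

```python
import re
from typing import Optional

PREFIX = '"CompanyName":"'


def company_name(companystring: str) -> Optional[str]:
    """Return the CompanyName value, or None when the property is missing."""
    start = companystring.find(PREFIX)
    if start == -1:
        return None
    rest = companystring[start + len(PREFIX):]  # everything after "CompanyName":"
    end = rest.find('"')                        # first closing quote
    return rest[:end] if end != -1 else rest


def loose_like(companystring: str) -> bool:
    """Rough equivalent of LIKE '%CompanyName":"%test 2%","%'."""
    pattern = r'CompanyName":".*test 2.*","'
    return re.search(pattern, companystring, re.I | re.S) is not None


row = ('{"CompanyId":0,"CompanyType":1,"CompanyName":"Test Something 215",'
       '"TradingName":"Test 215","RegistrationNumber":"Test 215"}')

# The loose pattern matches because "Test 2" appears in TradingName...
false_positive = loose_like(row)
# ...while checking only the extracted property does not.
real_match = "test 2" in company_name(row).lower()
```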

Is there a more efficient way to handle these replace calls

I'm querying across two databases separated by a legacy application. When the app encounters characters like 'ü', '’' or 'ó', it replaces them with a '?'.
So to match messages, I've been using a bunch of 'replace' calls like so:
(replace(replace(replace(replace(replace(replace(lower(substring([Content],1,153)) , '’', '?'),'ü','?'),'ó','?'), 'é','?'),'á','?'), 'ñ','?'))
Over a couple of thousand records this is, as you would expect, very slow. There is probably a better way to do this. Thanks for telling me what it is.
One thing you can do is implement a RegEx Replace function as a SQL assembly and call it as a user-defined function on your column instead of the Replace() calls. It could be faster. You probably also want to do the same RegEx Replace on your passed-in query values.
TSQL Regular Expression
You could create a persisted computed column on the same table where the [Content] column is.
Alternatively, you can probably speed up the replace by creating a user defined function in C# using a StringBuilder. And you can even combine both of these solutions.
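using System;
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;
public static class LegacyFunctions // container class name is illustrative
{
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlString LegacyReplace(SqlString value)
    {
        if (value.IsNull) return value;
        string s = value.Value;
        // Only the first 153 characters take part in the comparison.
        int l = Math.Min(s.Length, 153);
        var sb = new StringBuilder(s, 0, l, l);
        sb.Replace('’', '?');
        sb.Replace('ü', '?');
        // etc...
        return new SqlString(sb.ToString());
    }
}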
Why not first do the same replacement (characters to "?") on the string you are searching for, on the app side, using regular expressions? That is, instead of passing a raw search string to your SQL Server query and relying on these nested replace() calls, have your app code pass a search string that already contains the "?"s.
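A minimal sketch of that app-side step in Python follows. It assumes the legacy application replaces every non-ASCII character with '?', which goes beyond the handful of characters listed in the question, so adjust the character class if the real rule is narrower. The name `legacy_form` and the constant `MAX_LEN` are made up for illustration.

```python
import re

MAX_LEN = 153  # same cut-off as substring([Content], 1, 153) in the question


def legacy_form(text: str) -> str:
    """Truncate, lower-case, and replace every non-ASCII character with '?'."""
    truncated = text[:MAX_LEN].lower()
    # Assumption: anything outside 7-bit ASCII is what the legacy app mangles.
    return re.sub(r"[^\x00-\x7f]", "?", truncated)


search_value = legacy_form("Se\u00f1or M\u00fcller\u2019s")
```

The search value can then be passed to the query as an ordinary parameter and compared against the legacy column directly.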
Could you convert the strings to varbinary before comparing? Something like the below:
declare
    @Test varbinary (100)
    ,@Test2 varbinary (100)
select
    @Test = convert(varbinary(100),'abcu')
    ,@Test2 = convert(varbinary(100),'abcü')
select
    case
        when @Test <> @Test2 then 'NO MATCH'
        else 'MATCH'
    end

What is the best way to select string fields based on character ranges?

I need to add the ability for users of my software to select records by character ranges.
How can I write a query that returns all widgets from a table whose name falls in the range Ba-Bi for example?
Currently I'm using greater than and less than operators, so the above example would become:
select * from widget
where name >= 'ba' and name < 'bj'
Notice how I have "incremented" the last character of the upper bound from i to j so that "bike" would not be left out.
Is there a generic way to find the next character after a given character based on the field's collation or would it be safer to create a second condition?
select * from widget
where name >= 'ba'
and (name < 'bi' or name like 'bi%')
My application needs to support localization. How sensitive is this kind of query to different character sets?
I also need to support both MSSQL and Oracle. What are my options for ensuring that character casing is ignored no matter what language appears in the data?
Let's skip directly to localization. Would you say "aa" >= "ba" ? Probably not, but that is where it sorts in Sweden. Also, you simply can't assume that you can ignore casing in any language. Casing is explicitly language-dependent, with the most common example being Turkish: uppercase i is İ. Lowercase I is ı.
Now, your SQL DB defines the result of <, == etc by a "collation order". This is definitely language specific. So, you should explicitly control this, for every query. A Turkish collation order will put those i's where they belong (in Turkish). You can't rely on the default collation.
As for the "increment part", don't bother. Stick to >= and <=.
For MSSQL see this thread: http://bytes.com/forum/thread483570.html .
For Oracle, it depends on your Oracle version, as Oracle 10 now supports regex(p) like queries: http://www.psoug.org/reference/regexp.html (search for regexp_like ) and see this article: http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/rischert_regexp_pt1.html
HTH
Frustratingly, the Oracle substring function is SUBSTR(), whilst in SQL Server it's SUBSTRING().
You could write a simple wrapper around one or both of them so that they share the same function name + prototype.
Then you can just use
MY_SUBSTRING(name, 2) >= 'ba' AND MY_SUBSTRING(name, 2) <= 'bi'
or similar.
You could use this...
select * from widget
where name Like 'b[a-i]%'
This will match any row where the name starts with b, the second character is in the range a to i, and any other characters follow.
I think that I'd go with something simple like appending a high-sorting string to the end of the upper bound. Something like:
select * from widget
where name >= 'ba' and name <= 'bi'||'~'
I'm not sure that would survive EBCDIC conversion, though.
You could also do it like this:
select * from widget
where left(name, 2) between 'ba' and 'bi'
If your criteria length changes (as you seemed to indicate in a comment you left), the query would need to have the length as an input also:
declare @CriteriaLength int
set @CriteriaLength = 4
select * from widget
where left(name, @CriteriaLength) between 'baaa' and 'bike'
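The prefix-range test is perhaps easiest to reason about as a plain function. The Python sketch below takes the comparison length from the lower bound and folds case before comparing. The name `in_prefix_range` is hypothetical, and Python compares strings by code point, so this shows the logic only; it does not reproduce any database collation.

```python
def in_prefix_range(name: str, low: str, high: str) -> bool:
    """True when the first len(low) characters of name fall within [low, high]."""
    # Assumes low and high have the same length, as in the queries above.
    prefix = name[:len(low)].casefold()
    return low.casefold() <= prefix <= high.casefold()
```

Because only the prefix is compared, "bike" is included in the range ba to bi without having to "increment" the upper bound to bj.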