BigQuery regex to find strings that contain Chinese characters

I want to find strings that contain any Chinese characters.
I have the following query in PostgreSQL, which works as expected.
with tmp as (
select '中文zz' as word
union all
select '中文' as word
union all
select 'english' as word
union all
select 'にほんご' as word
union all
select 'eng–lish' as word
)
select word,
word ~* '[\x4e00-\x9fff\x3400-\x4dbf]'
from tmp
Results:
中文zz true
中文 true
english false
にほんご false
eng–lish false
However, if I convert this SQL to BigQuery, it does not produce the same result.
with tmp as (
select '中文zz' as word
union all
select '中文' as word
union all
select 'english' as word
union all
select 'にほんご' as word
union all
select 'eng–lish' as word
)
select word,
regexp_contains(word, r'[\x4e00-\x9fff\x3400-\x4dbf]')
from tmp
Results:
中文zz true
中文 false
english true
にほんご false
eng–lish true

You can use the following regex with BigQuery:
with tmp as (
select '中文zz' as word
union all
select '中文' as word
union all
select 'english' as word
union all
select 'にほんご' as word
union all
select 'eng–lish' as word
)
select word,
regexp_contains(word, '''[\u4E00-\u9FA5]''')
from tmp
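This returns true for 中文zz and 中文 and false for the other three rows, the same as the PostgreSQL query. Note that the range \u4E00-\u9FA5 is narrower than the ranges in the question: it omits \u3400-\u4DBF and \u9FA6-\u9FFF.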
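A plausible explanation for the difference between the two engines, offered as a sketch rather than a confirmed diagnosis: PostgreSQL reads \x4e00 as one code point, while RE2 (the regex engine behind BigQuery) reads \x followed by exactly two hex digits, so \x4e00 becomes the characters N, 0, 0. Python's re module parses \x the same two-digit way, so it can reproduce the results from the question. The \u patterns below are Python syntax and the variable names are only illustrative.

```python
import re

words = ["中文zz", "中文", "english", "にほんご", "eng–lish"]

# \x takes exactly two hex digits here, so this class is really
# N, 0, the range 0-\x9f, f, 4, the range 0-M, b: mostly ASCII characters.
misparsed = re.compile(r"[\x4e00-\x9fff\x3400-\x4dbf]")

# \u takes four hex digits, so these are the intended CJK ranges
# (the same two ranges as the PostgreSQL pattern in the question).
intended = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]")


def contains(pattern, word):
    """Return True if the pattern matches anywhere in the word."""
    return pattern.search(word) is not None


misparsed_results = [contains(misparsed, w) for w in words]
intended_results = [contains(intended, w) for w in words]

for word, bad, good in zip(words, misparsed_results, intended_results):
    print(word, bad, good)
```

The first column of booleans matches the BigQuery output in the question, and the second matches the PostgreSQL output.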

Related

How to use regex to select rows where the column has more than two words in oracle

for example:
id  center
1   man
2   some men here
I want to select rows with three or more words, so the output should be:
id  center
2   some men here
I've tried regexp_like(center, '\w{3,}'), but it does not give the expected output.

You can use REGEXP_COUNT to count the words and keep the rows that have more than two:
WITH
some_table (id, center)
AS
(SELECT 1, 'man' FROM DUAL
UNION ALL
SELECT 2, 'some men here' FROM DUAL)
SELECT *
FROM some_table
WHERE REGEXP_COUNT (center, '\w+') > 2;
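As a rough illustration of why counting works where \w{3,} does not, here is the same idea in Python. The function name and sample rows are hypothetical; the regex semantics for \w+ and \w{3,} are the same in both engines for this ASCII data.

```python
import re


def has_more_than_two_words(text):
    """Count runs of word characters and require more than two of them."""
    return len(re.findall(r"\w+", text)) > 2


rows = [(1, "man"), (2, "some men here")]
selected = [row for row in rows if has_more_than_two_words(row[1])]
print(selected)

# The pattern from the question asks for three or more consecutive word
# characters, so a single three-letter word such as "man" already matches.
question_pattern_matches_man = re.search(r"\w{3,}", "man") is not None
print(question_pattern_matches_man)
```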

You could use the regex pattern \w+ \w+ \w+:
SELECT id, center
FROM yourTable
WHERE REGEXP_LIKE(center, '\w+[[:space:]]+\w+[[:space:]]+\w+');

I think this is the regex you are looking for:
regexp_like(center, '((\s|^)\w+(\s|$)?){3,}')
or with a short test:
select * from (
select 'abc' center
from dual
union all
select 'abc def'
from dual
union all
select 'abc def ghi'
from dual
union all
select 'abc def ghi jkl'
from dual
)
where regexp_like(center, '((\s|^)\w+(\s|$)?){3,}')
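It reads as:
Start of line or whitespace
One or more word characters (letters, digits, or underscore)
Optionally, whitespace or end of line (the trailing ? makes the group optional, not non-greedy)
Repeat all of the above at least three times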

How to check in SQL which character occurs first in a string

I have a column with string values which I need to parse based on which of the 2 characters occurred first - # and /.
Can you please help me with a SQL select query that will check the string? The possible scenarios are:
Both # and / are present in the string, if so which one comes first
Only one of the characters occurs in the string

You could use a combination of charindex and outer apply & values, from which you can select the first-occurring character:
select t.*, FirstChar
from t
outer apply(
select top (1) FirstChar
from (values
(CharIndex('#',string),'#'),
(CharIndex('/',string),'/')
)v(i,FirstChar)
where i > 0
order by i
)x;

I wrote this for Oracle before I realised you needed MS SQL; maybe it can be of some help anyway.
with data (text) as(
select '#' from dual union all
select '/' from dual union all
select '#/' from dual union all
select '/#' from dual union all
select '/#/' from dual union all
select '//#' from dual union all
select 'aaa#aaa/aaa' from dual union all
select 'aaa/aaa#aaa' from dual union all
select 'aaaaaaaaa' from dual
)
select text ,
case when INSTR(text,'/')=0 and INSTR(text,'#')>0 then '#'
when INSTR(text,'#')=0 and INSTR(text,'/')>0 then '/'
when INSTR(text,'#')>INSTR(text,'/') then '/'
when INSTR(text,'#')<INSTR(text,'/') then '#'
else ' '
end first from data;
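Both answers above apply the same rule: find the position of each character, ignore any that is absent, and take the one with the smallest position. A minimal Python sketch of that rule, with a hypothetical function name, might look like this:

```python
def first_of(text, candidates=("#", "/")):
    """Return whichever candidate occurs earliest in text, or None if none occurs."""
    # Keep only the candidates that are present, paired with their first position.
    positions = [(text.find(c), c) for c in candidates if c in text]
    # The smallest position wins; an empty list means neither character occurs.
    return min(positions)[1] if positions else None


for sample in ["#", "/#/", "//#", "aaa#aaa/aaa", "aaa/aaa#aaa", "aaaaaaaaa"]:
    print(repr(sample), first_of(sample))
```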

Oracle SQL : Check if specified words are present in comma separated string

I have an SQL function that returns me a string of comma separated country codes.
I have configured some specific codes in another table and I may remove or add more later.
I want to check whether the comma-separated string consists only of those specific country codes. That is, if the string contains even a single country code other than the specified ones, the check should return true.
Suppose I configured two rows in the static data table GB and CH. Then I need below results:
String from function  result
GB                    false
CH                    false
GB,CH                 false
CH,GB                 false
GB,FR                 true
FR,ES                 true
ES,CH                 true
CH,GB,ES              true
I am on Oracle 19c and can use only the functions available in this version. I also want it to be optimised. For example, I could count the values in the string and then count the occurrences of each specific code; if the two counts do not match, some other code must be present. But I don't want to use loops.
Can someone please suggest a better option?

Assuming that all country codes in the static table, as well as all tokens in the comma-separated strings, are always exactly two-letter strings, you could do something like this:
with
static_data(country_code) as (
select 'GB' from dual union all
select 'CH' from dual
)
, sample_inputs(string_from_function) as (
select 'GB' from dual union all
select 'CH' from dual union all
select 'GB,CH' from dual union all
select 'CH,GB' from dual union all
select 'GB,FR' from dual union all
select 'FR,ES' from dual union all
select 'ES,CH' from dual union all
select 'CH,GB,ES' from dual
)
select string_from_function,
case when regexp_replace(string_from_function,
',| |' || (select listagg(country_code, '|')
within group (order by null)
from static_data))
is null then 'false' else 'true' end as result
from sample_inputs
;
Output:
STRING_FROM_FUNCTION RESULT
---------------------- --------
GB false
CH false
GB,CH false
CH,GB false
GB,FR true
FR,ES true
ES,CH true
CH,GB,ES true
The regular expression replaces comma, space, and every two-letter country code from the static data table with null. If the result of the whole thing is null, then all codes in the CSV are in the static table; that's what you need to test for.
The assumptions guarantee that a token like GBCH (for a country like "Great Barrier Country Heat") would not be mistakenly considered OK because GB and CH are OK separately.
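The replace-and-test idea can be sketched in Python as follows. This is an illustration under the same two-letter-code assumption, not a translation of Oracle's behaviour in every detail; the function name is hypothetical.

```python
import re


def has_unknown_code(csv_string, allowed_codes):
    """True if anything is left after removing separators and allowed codes."""
    # Same alternation as the SQL: comma, space, or any allowed code.
    pattern = "|".join([",", " "] + [re.escape(code) for code in allowed_codes])
    leftover = re.sub(pattern, "", csv_string)
    # Oracle treats the empty string as NULL, hence IS NULL in the SQL;
    # in Python the equivalent check is for an empty string.
    return leftover != ""


allowed = ["GB", "CH"]
inputs = ["GB", "CH", "GB,CH", "CH,GB", "GB,FR", "FR,ES", "ES,CH", "CH,GB,ES"]
results = {s: has_unknown_code(s, allowed) for s in inputs}
for s, flag in results.items():
    print(s, flag)
```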

You can convert a csv column to a table and use EXISTS. For example
with tbl(id,str) as
(
SELECT 1,'GB,CH' FROM DUAL UNION ALL
SELECT 2,'GB,CH,FR' FROM DUAL UNION ALL
SELECT 3,'GB' FROM DUAL
),
countries (code) as
(SELECT 'GB' FROM DUAL UNION ALL
SELECT 'CH' FROM DUAL
)
select t.* ,
case when exists (
select 1
from xmltable(('"' || REPLACE(str, ',', '","') || '"')) s
where trim(s.column_value) not in (select code from countries)
)
then 'true' else 'false' end flag
from tbl t

One option is to match the country codes one by one and then check whether the string passed as the parameter contains any country that has no match.
The following query, which uses a FULL JOIN, implements that logic:
WITH
FUNCTION with_function(i_countries VARCHAR2) RETURN VARCHAR2 IS
o_val VARCHAR2(10);
BEGIN
SELECT CASE WHEN SUM(NVL2(t.country_code,0,1))=0 THEN 'false'
ELSE 'true'
END
INTO o_val
FROM (SELECT DISTINCT REGEXP_SUBSTR(i_countries,'[^ ,]+',1,level) AS country
FROM dual
CONNECT BY level <= REGEXP_COUNT(i_countries,',')+1) tt
FULL JOIN t
ON tt.country = t.country_code;
RETURN o_val;
END;
SELECT with_function(<comma-separated-parameter-list>) AS result
FROM dual

Here is one solution
with cte as
(select distinct
s,regexp_substr(s, '[^,]+',1, level) code from strings
connect by regexp_substr(s, '[^,]+', 1, level) is not null
)
select
s string,min(case when exists
(select * from countries
where cod = code) then 'yes'
else 'no'end) all_found
from cte
group by s
order by s;
STRING | ALL_FOUND
:----- | :--------
CH | yes
CH,GB | yes
ES | no
ES,CH | no
FR | no
GB | yes
GB,CH | yes
GB,ES | no

If you have a small number of values in the static table then the simplest method may not be to split the values from the function but to generate all combinations of values from the static table using:
SELECT SUBSTR(SYS_CONNECT_BY_PATH(value, ','), 2) AS combination
FROM static_table
CONNECT BY NOCYCLE PRIOR value != value;
Which, for the sample data:
CREATE TABLE static_table(value) AS
SELECT 'GB' FROM DUAL UNION ALL
SELECT 'CH' FROM DUAL;
Outputs:
COMBINATION
GB
GB,CH
CH
CH,GB
Then you can use a simple CASE expression to compare your string output to the combinations:
SELECT function_value,
CASE
WHEN function_value IN (SELECT SUBSTR(SYS_CONNECT_BY_PATH(value, ','), 2)
FROM static_table
CONNECT BY NOCYCLE PRIOR value != value)
THEN 'false'
ELSE 'true'
END AS not_matched
FROM string_from_function;
Which, for the sample data:
CREATE TABLE string_from_function(function_value) AS
SELECT 'GB' FROM DUAL UNION ALL
SELECT 'CH' FROM DUAL UNION ALL
SELECT 'GB,CH' FROM DUAL UNION ALL
SELECT 'CH,GB' FROM DUAL UNION ALL
SELECT 'GB,FR' FROM DUAL UNION ALL
SELECT 'FR,ES' FROM DUAL UNION ALL
SELECT 'ES,CH' FROM DUAL UNION ALL
SELECT 'CH,GB,ES' FROM DUAL;
Outputs:
FUNCTION_VALUE  NOT_MATCHED
GB              false
CH              false
GB,CH           false
CH,GB           false
GB,FR           true
FR,ES           true
ES,CH           true
CH,GB,ES        true
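The combination-based check can be modelled in Python with itertools.permutations, which also makes it easy to see why the approach only suits a small static table: the number of combinations grows factorially with the number of codes. The function and variable names below are hypothetical.

```python
from itertools import permutations


def all_combinations(codes):
    """Every non-empty ordered selection of distinct codes, joined by commas."""
    return {
        ",".join(selection)
        for size in range(1, len(codes) + 1)
        for selection in permutations(codes, size)
    }


allowed = all_combinations(["GB", "CH"])
print(sorted(allowed))

# A string is "not matched" when it is not one of the generated combinations.
samples = ["GB", "CH,GB", "GB,FR", "CH,GB,ES"]
not_matched = {s: s not in allowed for s in samples}
print(not_matched)

# Three codes already give 3 + 6 + 6 = 15 combinations.
print(len(all_combinations(["GB", "CH", "FR"])))
```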

Find value that is not a number or a predefined string

I have to test a column of a sql table for invalid values and for NULL.
Valid values are: Any number and the string 'n.v.' (with and without the dots and in every possible combination as listed in my sql command)
So far, I've tried this:
select count(*)
from table1
where column1 is null
or not REGEXP_LIKE(column1, '^[0-9,nv,Nv,nV,NV,n.v,N.v,n.V,N.V]+$');
The regular expression also matches the single character values 'n','N','v','V' (with and without a following dot). This shouldn't be the case, because I only want the exact character combinations as written in the sql command to be matched. I guess the problem has to do with using REGEXP_LIKE. Any ideas?

I guess this regexp will work:
NOT REGEXP_LIKE(column1, '^([0-9]+|n\.?v\.?)$', 'i')
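Note that square brackets define a set of single characters, so the commas and every letter in your pattern are matched individually; , is not a separator. Outside brackets, . means any character and \. means the dot character itself. The 'i' flag ignores case, so there is no need to hard-code every combination of upper- and lower-case characters.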
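To see the difference between the bracketed pattern from the question and the alternation suggested here, the two can be compared in Python. This is only a sketch: Python's re is not Oracle's regex engine, but character classes, alternation, and the escaped dot behave the same way for these inputs. The names VALID and is_valid are hypothetical.

```python
import re

# Either a run of digits, or n and v with an optional dot after each letter.
VALID = re.compile(r"[0-9]+|n\.?v\.?", re.IGNORECASE)


def is_valid(value):
    """None counts as invalid, mirroring the separate IS NULL test in the SQL."""
    return value is not None and VALID.fullmatch(value) is not None


for value in ["123", "n.v.", "NV", "n.V", "n", "n..v", "123nv", None]:
    print(repr(value), is_valid(value))

# The bracketed pattern from the question is one character class, so a
# lone "n" is accepted because "n" is a member of the set.
question_pattern = re.compile(r"[0-9,nv,Nv,nV,NV,n.v,N.v,n.V,N.V]+")
lone_n_accepted = question_pattern.fullmatch("n") is not None
print(lone_n_accepted)
```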

There is no need for a regexp here (which helps performance on large data sets); plain old TRANSLATE is good enough for this validation.
Note that the first translate(column1,'x0123456789','x') removes all numeric characters from the string, so if the result is null, the string is OK.
The second translate(lower(column1),'x.','x') removes all dots from the lower-cased string, so you expect the result nv.
To reject cases such as n.....v...., you also limit the string length.
select
column1,
case when
translate(column1,'x0123456789','x') is null or /* numeric string */
translate(lower(column1),'x.','x') = 'nv' and length(column1) <= 4 then 'OK'
end as status
from table1
COLUMN1 STATUS
--------- ------
1010101 OK
1012828n
1012828nv
n.....v....
n.V OK
Test data
create table table1 as
select '1010101' column1 from dual union all -- OK numbers
select '1012828n' from dual union all -- invalid
select '1012828nv' from dual union all -- invalid
select 'n.....v....' from dual union all -- invalid
select 'n.V' from dual; -- OK nv

You can use:
select count(*)
from table1
WHERE TRANSLATE(column1, ' 0123456789', ' ') IS NULL
OR LOWER(column1) IN ('nv', 'n.v', 'nv.', 'n.v.');
Which, for the sample data:
CREATE TABLE table1 (column1) AS
SELECT '12345' FROM DUAL UNION ALL
SELECT 'nv' FROM DUAL UNION ALL
SELECT 'NV' FROM DUAL UNION ALL
SELECT 'nV' FROM DUAL UNION ALL
SELECT 'n.V.' FROM DUAL UNION ALL
SELECT '...................n.V.....................' FROM DUAL UNION ALL
SELECT '..nV' FROM DUAL UNION ALL
SELECT 'n..V' FROM DUAL UNION ALL
SELECT 'nV..' FROM DUAL UNION ALL
SELECT 'xyz' FROM DUAL UNION ALL
SELECT '123nv' FROM DUAL;
Outputs:
COUNT(*)
5
or, if you want any quantity of . then:
select count(*)
from table1
WHERE TRANSLATE(column1, ' 0123456789', ' ') IS NULL
OR REPLACE(LOWER(column1), '.') = 'nv';
Which outputs:
COUNT(*)
9

Oracle SQL -- find the values NOT in a table

Take this table WORDS
WORD
Hello
Aardvark
Potato
Dog
Cat
And this list:
('Hello', 'Goodbye', 'Greetings', 'Dog')
How do I return a list of words that AREN'T in the words table, but are in my list?
If I have a table that "contains all possible words", I can do:
SELECT * from ALL_WORDS_TABLE
where word in ('Hello', 'Goodbye', 'Greetings', 'Dog')
and word not in
(SELECT word from WORDS
where word in ('Hello', 'Goodbye', 'Greetings', 'Dog')
);
However I do not have such a table. How else can this be done?
Also, constructing a new table is not an option because I do not have that level of access.

Instead of hard-coding the list values as rows, use DBMS_DEBUG_VC2COLL to turn your list into rows, then use the MINUS operator to remove the rows of the first query that also appear in the second query:
select column_value
from table(sys.dbms_debug_vc2coll('Hello', 'Goodbye', 'Greetings', 'Dog'))
minus
select word
from words;
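MINUS behaves like a set difference: it returns the distinct rows of the first query that do not appear in the second. A small Python sketch with the data from the question shows the expected outcome; the variable names are illustrative.

```python
words_table = ["Hello", "Aardvark", "Potato", "Dog", "Cat"]
candidates = ["Hello", "Goodbye", "Greetings", "Dog"]

# Set difference: keep the candidates that are absent from the table.
missing = sorted(set(candidates) - set(words_table))
print(missing)
```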

Try this solution :
SELECT
a.word
FROM
(
SELECT 'Hello' word FROM DUAL UNION
SELECT 'Goodbye' word FROM DUAL UNION
SELECT 'Greetings' word FROM DUAL UNION
SELECT 'Dog' word FROM DUAL
) a
LEFT JOIN WORDS t ON t.word = a.word
WHERE
t.word IS NULL

You can turn your list into a view like this:
select 'Hello' as word from dual
union all
select 'Goodbye' from dual
union all
select 'Greetings' from dual
union all
select 'Dog' from dual
Then you can select from that:
select * from
(
select 'Hello' as word from dual
union all
select 'Goodbye' from dual
union all
select 'Greetings' from dual
union all
select 'Dog' from dual
)
where word not in (select word from words);
Possibly not as neat a solution as you might have hoped for...
You say you don't have sufficient privileges to create tables, so presumably you can't create types either - but if you can find a suitable type "lying around" in your database you can do this:
select * from table (table_of_varchar2_type('Hello','Goodbye','Greetings','Dog'))
where column_value not in (select word from words);
Here table_of_varchar2_type is imagined to be the name of a type that is defined like:
create type table_of_varchar2_type as table of varchar2(100);
One such type you are likely to be able to find is SYS.KU$_VCNT which is a TABLE OF VARCHAR2(4000).
