Number of consecutive digits in a column string - google-bigquery

I am trying to count the number of consecutive digits that appear in a string column. Let me give an example to better illustrate what I am trying to do. If I have a table called email:
email
lucas1234#gmail.com
fer12#gmail.com
lupal#gmail.com
carlos1perez222#gmail.com
My expected output would be:
email                      count_cons_digits
lucas1234#gmail.com        4
fer12#gmail.com            2
lupal#gmail.com            0
carlos1perez222#gmail.com  3

You could use a regex replacement with a length trick:
SELECT email,
LENGTH(email) - LENGTH(REGEXP_REPLACE(email, '[0-9]{2,}', '')) AS count_cons_digits
FROM yourTable;
Note that this answer assumes there is at most one segment of two or more consecutive digits in a given email string. With several such segments, the expression returns their combined length. If that can happen, you would need to define what should be counted in that case.
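If an email can contain several runs of digits, one possible definition is the length of the longest run. This is an assumption, although it agrees with the expected output in the question, where carlos1perez222 gives 3. A minimal Python sketch of that definition, handy for checking expected values before writing the query (the helper name longest_digit_run is hypothetical):

```python
import re

def longest_digit_run(text):
    """Return the length of the longest run of consecutive digits in text."""
    runs = re.findall(r"[0-9]+", text)  # every maximal run of digits
    return max((len(run) for run in runs), default=0)

emails = [
    "lucas1234#gmail.com",
    "fer12#gmail.com",
    "lupal#gmail.com",
    "carlos1perez222#gmail.com",
]
for email in emails:
    print(email, longest_digit_run(email))
```

Unlike the length trick, this counts a lone digit as a run of length 1, so decide which behaviour you want for such rows.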

Related

How to extract the number part of a field using the regexp_substr function?

I need to extract the numerical part of values in a column (varchar) if there exists a number in the value.
ColumnA has values like ABC, M365, J344, MCT etc.
I would like to check the entire value from the second position onward and, if it is a number, extract it. For instance:
a. M365, from 2nd position 365 is a number so I would like to return this substring.
b. M3AB, from 2nd position 3AB is not a number so I would not want to return this substring.
I tried regexp_substr('M365', '[0-9]', 2), but this is not what I want: it only returns the digit in the second position, not the entire substring.
This seems to do what you want:
select regexp_substr(substr(x, 2), '^\d+$')
This drops the first character and then requires that everything from the second position to the end of the string be digits; otherwise it returns NULL.
[0-9] matches only a single digit. To match a run of digits, you need the '+' quantifier. For more info, visit:
https://www.techonthenet.com/oracle/functions/regexp_substr.php
The following returns 365 for M365. Note that it is not anchored, so it also returns 3 for M3AB, which the question wants to exclude.
regexp_substr('M365', '[0-9]+', 2)
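The difference between "starts with digits at position 2" and "is all digits from position 2" is easy to check outside the database. A small Python sketch of the second rule, which is the one the question asks for (the helper name numeric_tail is hypothetical, and None stands in for SQL NULL):

```python
import re

def numeric_tail(value):
    """Return everything after the first character if it is all digits, else None."""
    tail = value[1:]  # same idea as substr(x, 2) in SQL
    if re.fullmatch(r"[0-9]+", tail):
        return tail
    return None

for value in ["ABC", "M365", "J344", "MCT", "M3AB"]:
    print(value, numeric_tail(value))
```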

Randomize integers in a string

I am attempting to randomize all integers in a string.
E.g "Transferred to account 123456789" randomized into "Transferred to account 256829876"
I already have a slow solution in PL/SQL where I loop through each character in the string individually. If the character has an ASCII value between 48 and 57 (digits 0 to 9), I randomize the digit accordingly.
In SQL I have gotten this far:
select regexp_replace('Transferred to account 05172262116','[0-9]',
floor(dbms_random.value(0, 10)))
from dual;
However, this does not give me the expected result, because every digit is replaced with the same value. (E.g. 'Transferred to account 555555555')
Is it possible to achieve what I am looking for via use of SQL?
Thanks.
If you know the numbers are always 11 digits, you can explicitly look for that:
select regexp_replace('Transferred to account 05172262116','[0-9]{11}', floor(dbms_random.value(10000000000, 99999999999)))
from dual;
Otherwise, you can replace each run of digits with a random integer, but its length may differ from the original's:
select regexp_replace('Transferred to account 05172262116','[0-9]+', floor(dbms_random.value(10000000000, 99999999999)))
from dual;
As a note: things like account numbers are often removed using translate(), but this produces a fixed string:
select translate('Transferred to account 05172262116', ' 0123456789', ' ##########')
from dual;
(And you can do the same thing with regexp_replace().)
This answer may be viewed as a cop-out, but I would argue that information as sensitive as an account number should not be shown in any form, even if the digits are randomly permuted. So, I recommend just completely masking the account number using e.g.
SELECT
REGEXP_REPLACE('Transferred to account 05172262116', '[0-9]', '*')
FROM dual;
Even the above presents some security risk, because it shows the same number of * as there are digits in the account number. But, it is often the case, e.g. with credit cards or account numbers at a given bank, that all account numbers have the same length anyway.
The issue is that the replacement expression is evaluated only once, so you get a single value that replaces every digit. To get a different random value for each digit, you would have to loop through each character and generate a new random value for it.
You could use translate() with a single 10-digit random number:
select translate('Transferred to account 05172262116',
'1234567890',
floor(dbms_random.value(1000000000, 10000000000))) from dual;
TRANSLATE('TRANSFERREDTOACCOUNT051
----------------------------------
Transferred to account 81677787668
It will work with any number of digits anywhere in the string, and preserves the original length (number of digits) of the replaced value. It maps an original digit to the same (random) digit each time, at least within that string. (If you apply the same translate across multiple source rows at once, they will get different mappings, as dbms_random is non-deterministic.)
with t (s) as (
select 'Transferred to account 05172262116' from dual
union all
select 'Transferred to account 05172262116' from dual
)
select s, translate(s,
'1234567890',
floor(dbms_random.value(1000000000, 10000000000))) from t;
S TRANSLATE(S,'1234567890',FLOOR(DBM
---------------------------------- ----------------------------------
Transferred to account 05172262116 Transferred to account 57238858225
Transferred to account 05172262116 Transferred to account 95587747554
Each digit in your original string is translated to the corresponding digit in the random number. For instance, the first output above came from the generated random number 6703187918. The first digit of your original string was 0; that's the 10th digit of the second argument to translate(); so you get the 10th digit of the (random) replacement string which is the third argument to that function - which is 8. The second digit in your string is 5, which is the 5th digit in the second argument; so you get the 5th digit in the third argument - which is 1. And so on.
It's arguable if this is random enough, I suppose, but the main goal is presumably to stop you reconstructing the original value from the replacement. You could potentially learn something about the shape of the original value by looking for repetitions in the new one; but as you could have repeated characters in the random value too, that doesn't get you very far.
For instance, in the example above the replacement has a row of three consecutive 7s, so you might think the original has three consecutive identical digits too - but it didn't. The random value had two positions - 2nd and 7th - which both mapped to 7 in the new string, and you can't tell which of those mappings was applied. (So even if you knew the random value you couldn't get back to the original, in this case anyway - it won't always have repeated numbers, of course.)
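The mapping described above can be reproduced outside Oracle to see exactly which digit goes where. This is a Python sketch using str.translate as a stand-in for the SQL translate() function; the function name translate_digits is hypothetical, and the random value 6703187918 is the one quoted above rather than a freshly generated one:

```python
def translate_digits(text, random_digits):
    """Map each digit in text via '1234567890' -> random_digits, position by position."""
    source = "1234567890"
    if len(random_digits) != len(source):
        raise ValueError("random_digits must contain exactly 10 characters")
    table = str.maketrans(source, random_digits)
    return text.translate(table)

original = "Transferred to account 05172262116"
print(translate_digits(original, "6703187918"))
# -> Transferred to account 81677787668

# Both 7 and 2 map to 7 here, which is why the output shows "777"
# where the original had "722".
print(translate_digits("7", "6703187918"), translate_digits("2", "6703187918"))
```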

Consider a query to find details of research fields where the first two parts of the ID are D and 2 and the last part is one character (digit)

The ID of research fields have three parts, each part separated by a period.
Consider a query to find the details of research fields where the first two parts of the ID are D and 2, and the last part is a single character (digit).
IDs like D.2.1 and D.2.3 are in the query result whereas IDs like D.2.12 or D.2.15 are not.
The SQL query given below does not return the correct result. Explain the reason why it does not return the correct result and give the correct SQL query.
select *
from field
where ID like 'B.1._';
I have no idea why it doesn't work.
Can anyone help with this? Many thanks.
D.2.1 and D.2.3 are in the query result whereas IDs like D.2.12 or D.2.15 are not.
An underscore matches any single character in a LIKE filter, so B.1._ matches the start of the string, followed by a B character, a . character, a 1 character, a . character, then any single character, then the end of the string. The wildcard is therefore already right for a one-character last part; the query returns the wrong rows because it filters on the prefix B.1 instead of D.2.
You could use:
SELECT *
FROM field
WHERE ID like 'D.2._';
Do not append % after the underscore: % matches any number of characters (including zero) until the end of the string, so 'D.2._%' would also return IDs like D.2.12 and D.2.15.
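The behaviour of _ and % can be tried on a throwaway table. This sketch uses Python's built-in sqlite3 module, on the assumption that SQLite's LIKE wildcards behave like the ones in the question's database for these patterns; the sample IDs and the helper name matching_ids are made up for illustration:

```python
import sqlite3

SAMPLE_IDS = ["D.2.1", "D.2.3", "D.2.12", "D.2.15", "B.1.4"]

def matching_ids(pattern):
    """Return the sample IDs matched by ID LIKE pattern, sorted as text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE field (ID TEXT)")
    conn.executemany("INSERT INTO field VALUES (?)", [(i,) for i in SAMPLE_IDS])
    rows = conn.execute(
        "SELECT ID FROM field WHERE ID LIKE ? ORDER BY ID", (pattern,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

print(matching_ids("B.1._"))   # only the B.1 row
print(matching_ids("D.2._"))   # exactly one character after the last period
print(matching_ids("D.2._%"))  # one or more characters after the last period
```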

REGEXP_LIKE between number range

Can someone please finalize the code on the below.
I only want to look for a 6 digit number range anywhere in the RMK field, between 100000 and 999999
REGEXP_LIKE(RMKADH.RMK, '[[:digit:]]')
The current code works but is bringing back anything with a number so I'm trying to narrow it down to 6 digits together. I've tried a few but no luck.
Edit:
I want to flag this field if a 6 digit number is present. The reference will always be 6 digits long only, no more no less. But as it's a free text field it could be anywhere and contain anything.
Example output I do want to flag: >abc123456markj< = flagged.
Output I don't want to flag: >Mark 23647282<. Because the number it finds is more than 6 characters long, I know it's not a valid reference.
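Try this:
REGEXP_LIKE(RMKADH.RMK, '(^|[^[:digit:]])[1-9][[:digit:]]{5}([^[:digit:]]|$)')
The [1-9] keeps the value within 100000 to 999999, and the non-digit (or start/end of string) on each side ensures the 6 digits are not part of a longer number. Adding length(RMKADH.RMK) = 6 would only match rows where the whole field is exactly 6 characters long, which does not suit a free text field.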
For more info, see: Multilingual Regular Expression Syntax
You can use REGEXP_SUBSTR to get 6 digits out of the given field and compare the result using BETWEEN:
select * from t
where to_number(regexp_substr(col,'[[:digit:]]{6}')) between 100000 and 999999;
Please note that if a sequence longer than 6 digits exists, the above solution takes only its first 6 digits into consideration. If longer sequences should be handled differently, a different solution is needed.
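One way to express "exactly 6 digits, not embedded in a longer number" without lookarounds is to require a non-digit, or the start or end of the string, on each side of the 6 digits. A Python sketch of that rule for testing sample remarks follows; the pattern is an assumption about what the asker needs, the name has_six_digit_reference is hypothetical, and the equivalent pattern should be confirmed against the database's own regex engine before use:

```python
import re

# Non-digit or start, then 6 digits with a non-zero first digit, then non-digit or end.
SIX_DIGIT_REF = re.compile(r"(^|[^0-9])[1-9][0-9]{5}([^0-9]|$)")

def has_six_digit_reference(remark):
    """True if remark contains a number from 100000 to 999999 that is exactly 6 digits long."""
    return SIX_DIGIT_REF.search(remark) is not None

for remark in ["abc123456markj", "Mark 23647282", "123456", "ref 012345"]:
    print(repr(remark), has_six_digit_reference(remark))
```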
If you want to get all the records that contain only numeric values, you can use the condition below:
REGEXP_LIKE(RMKADH.RMK, '^[[:digit:]]+$')
The above matches strings that consist entirely of digits from start to end, whatever their length, so it is useful if your numbers can have anywhere from 1 to any number of digits.
SELECT
to_number(regexp_replace('abc123456markj', '[^[:digit:]]', '')) digits
FROM
dual
WHERE
REGEXP_LIKE('abc123456markj', '[[:digit:]]')
AND
length(regexp_replace('abc123456markj', '[^[:digit:]]', '')) = 6
AND
regexp_replace('abc123456markj', '[^[:digit:]]', '') BETWEEN 100000 AND 999999;

Query to return all results where a specific character is in the 4th position of a string

I have a table that contains 6 digit ID numbers ('AB1E11' for example) and I need to build a query in Teradata SQL that returns all results where 'E' is in the fourth position of the string. I haven't had a reason to do anything like this in several years, so I am extremely rusty. I know how to filter the results so that 'E' is contained anywhere in the string ( using SELECT * WHERE PLANID LIKE '%E%'), but I'm not sure how to filter the results so that only the ones where 'E' is in the 4th position show up. Can anyone help me out with this? I tried searching several times but couldn't find an answer.
Thank you.
Just use LIKE with the _ wildcard:
where planid like '___E%'
Note: that is 3 underscores, each of which matches any single character, so E lands in the 4th position.
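A position search such as CHARINDEX('E', planid) = 4 (SQL Server syntax) only finds the first 'E', so it would miss an ID like 'EB1E11'. Comparing the fourth character directly avoids that (yourTable is a placeholder for the actual table name):
SELECT planid FROM yourTable WHERE SUBSTR(planid, 4, 1) = 'E';
The starting position passed to SUBSTR is 1-based, not 0-based.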
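The effect of the number of underscores, and the difference between testing the 4th character and searching for the first 'E', can be checked on a few sample IDs. This sketch uses Python's built-in sqlite3 module rather than Teradata, so substr and instr here are SQLite functions standing in for their Teradata counterparts; the table name plans, the sample IDs and the helper name query_planids are all made up:

```python
import sqlite3

PLAN_IDS = ["AB1E11", "ABE111", "AB11E1", "EB1E11"]

def query_planids(where_sql):
    """Run SELECT planid ... WHERE <where_sql> against the sample IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plans (planid TEXT)")
    conn.executemany("INSERT INTO plans VALUES (?)", [(p,) for p in PLAN_IDS])
    rows = conn.execute(
        f"SELECT planid FROM plans WHERE {where_sql} ORDER BY planid"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

print(query_planids("planid LIKE '___E%'"))         # E in 4th position
print(query_planids("planid LIKE '____E%'"))        # E in 5th position
print(query_planids("substr(planid, 4, 1) = 'E'"))  # 4th character is E
print(query_planids("instr(planid, 'E') = 4"))      # first E is at position 4
```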