Removing n characters from the beginning / end of a record with regex - sql

If I have a column, say:
paths
And paths holds the absolute path to a file.
If I wanted to remove n characters from the beginning and end of every record in the paths column, would this be possible?
Edit - example:
Let's say there are records in paths like so:
C:\Users\Alex\Documents\Files\File-1.txt
C:\Users\Alex\Documents\Files\File-2.txt
C:\Users\Alex\Documents\Files\File-3.txt
C:\Users\Alex\Documents\Files\File-4.txt
I would like to update the records to be:
Users\Alex\Documents\Files\File-1
Users\Alex\Documents\Files\File-2
Users\Alex\Documents\Files\File-3
Users\Alex\Documents\Files\File-4
So essentially removing n characters from the beginning and end of an entire column.

A general regex solution would be:
^(?:.{3})(.*)(?:.{4})$
# ^(?:.{3}) matches exactly three characters at the beginning (non-capturing group)
# (.*) matches everything up to the end of the string
# but as .* backtracks when it has to...
# it gives back the four characters that (?:.{4})$ needs at the end
# your match is in the first group
See a demo on regex101.com
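To try the capture-group pattern outside the database, here is a minimal sketch using Python's re module. Python's regex flavor is an assumption here and may differ from the one in your SQL engine, and the helper name trim_path is made up for illustration.

```python
import re

# Pattern from the answer: skip 3 leading and 4 trailing characters.
TRIM = re.compile(r"^(?:.{3})(.*)(?:.{4})$")

def trim_path(path: str) -> str:
    """Return group 1 of the match, or the path unchanged if it is shorter than 7 characters."""
    m = TRIM.match(path)
    return m.group(1) if m else path

print(trim_path(r"C:\Users\Alex\Documents\Files\File-1.txt"))
# Users\Alex\Documents\Files\File-1
```

Strings shorter than 3 + 4 = 7 characters do not match at all, so the sketch returns them untouched.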

Use this pattern and replace each match with an empty string:
^.{3}|.{4}$
Demo
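The replace-with-nothing approach can be sketched the same way in Python (an assumption; the name of the regex replace function in SQL depends on the engine). The helper name strip_ends is hypothetical.

```python
import re

def strip_ends(path: str) -> str:
    """Delete the first 3 characters and the last 4 characters by replacing both matches with ''."""
    return re.sub(r"^.{3}|.{4}$", "", path)

print(strip_ends(r"C:\Users\Alex\Documents\Files\File-1.txt"))
# Users\Alex\Documents\Files\File-1
```

Because the two alternatives match independently, a short string can lose only its prefix: strip_ends("abcde") gives "de".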

You tagged the question SQL, which implies Standard SQL:
Answers to questions tagged with SQL should use ANSI SQL.
There's no need for anything more complicated than SUBSTRING; you just need to calculate how many characters to strip (3 from the start plus 4 from the end, so 7 in total):
SUBSTRING(path FROM 4 FOR CHARACTER_LENGTH(path)-7)
This also works as-is in PostgreSQL :)
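To check the arithmetic, here is a small Python sketch that mimics the 1-based SUBSTRING(... FROM ... FOR ...) semantics. The function sql_substring is a hypothetical stand-in, not a real library call, and it assumes start >= 1 and length >= 0.

```python
def sql_substring(s: str, start: int, length: int) -> str:
    """Mimic SUBSTRING(s FROM start FOR length) with a 1-based start."""
    return s[start - 1 : start - 1 + length]

path = r"C:\Users\Alex\Documents\Files\File-1.txt"
# FROM 4 skips the first 3 characters; the length drops 3 + 4 = 7 characters in total.
print(sql_substring(path, 4, len(path) - 7))
# Users\Alex\Documents\Files\File-1
```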

Related

Regex like telephone number on Hive without prefix (+01)

We have a problem with a regular expression on hive.
We need to exclude numbers that start with +37 or 0037 (regexp_like should return false for them), as well as numbers that contain letters or spaces.
We're trying with this one:
regexp_like(tel_number,'^\+37|^0037+[a-zA-ZÀÈÌÒÙ ]')
but it doesn't work.
Edit: we want it to come out from the select as true (correct number) or false.
To exclude numbers which start with +01 or +001 or +0001 and consist only of digits, without spaces or letters:
... WHERE tel_number NOT rlike '^\\+0{1,3}1\\d+$'
Special characters like + and character classes like \d should be escaped in Hive with a double backslash: \\+ and \\d.
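A quick way to see which rows the filter keeps is to run the same pattern in Python (an assumption: Python's re is used as a stand-in for Hive's Java regex). In a Python raw string a single backslash is enough, whereas the Hive string literal needs the doubled form shown above. The name passes_filter is made up.

```python
import re

# Same pattern as in the Hive query, written with single backslashes in a raw string.
BLOCKED = re.compile(r"^\+0{1,3}1\d+$")

def passes_filter(tel_number: str) -> bool:
    """True when the row would survive NOT rlike, i.e. the pattern does not match."""
    return BLOCKED.search(tel_number) is None

for number in ["+0112345", "+000112345", "+3912345", "+01 12345"]:
    print(number, passes_filter(number))
```

Note that "+01 12345" passes the filter: the pattern only excludes all-digit numbers with that prefix, it does not reject spaces or letters on its own.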
The general question is whether you want to describe a malformed telephone number in your regex and exclude everything that matches the pattern, or describe a well-formed telephone number and include everything that matches the pattern.
Which way to go depends on your scenario. From what I understand of your requirements, adding "not starting with 0037 or +37" as a condition to a well-formed telephone number could be a good approach.
The pattern would be like this:
1. Your number can start with either + or 00: ^(\+|00)
2. It cannot be followed by 37, which in regex can be expressed by the following set of alternatives:
a. It is followed first by a 3, then by anything but 7: 3[0-689]
b. It is followed first by anything but 3, then by any digit: [0-24-9]\d
3. After that there is a sequence of digits of undefined length (at least one) until the end of the string: \d+$
Putting everything together:
^(\+|00)(3[0-689]|[0-24-9]\d)\d+$
You can play with this regex here and see if this fits your needs: https://regex101.com/r/KK5rjE/3
Note: as leftjoin has pointed out, to use this regex in Hive you might need to additionally escape the backslashes \ in the pattern.
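The combined pattern can be checked against a few sample numbers with a short Python sketch (Python's re is an assumption standing in for Hive; the sample numbers and the name is_well_formed are invented for illustration).

```python
import re

WELL_FORMED = re.compile(r"^(\+|00)(3[0-689]|[0-24-9]\d)\d+$")

def is_well_formed(tel_number: str) -> bool:
    """True for numbers that start with + or 00, are not prefixed 37, and contain only digits after that."""
    return WELL_FORMED.search(tel_number) is not None

for number in ["+3912345", "004912345", "+3712345", "003712345", "+39 12345"]:
    print(number, is_well_formed(number))
```

The first two numbers print True; the two 37 prefixes and the number with a space print False.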
You can use
regexp_like(tel_number,'^(?!\\+37|0037)\\+?\\d+$')
See the regex demo. Details:
^ - start of string
(?!\+37|0037) - a negative lookahead that fails the match if there is +37 or 0037 immediately to the right of the current location
\+? - an optional + sign
\d+ - one or more digits
$ - end of string.
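The lookahead version can be tried out the same way. This is a sketch in Python, which is an assumption (the regex engine behind your Hive function must support lookaheads for the original to work), and is_allowed is a hypothetical name.

```python
import re

NOT_37 = re.compile(r"^(?!\+37|0037)\+?\d+$")

def is_allowed(tel_number: str) -> bool:
    """True for an optional + followed by digits only, unless the string starts with +37 or 0037."""
    return NOT_37.search(tel_number) is not None

for number in ["+3912345", "3912345", "+3712345", "003712345", "0039 12345"]:
    print(number, is_allowed(number))
```

Unlike the pattern that requires a + or 00 prefix, this one also accepts numbers without any prefix, such as "3912345".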

Regexp_extract everything after appearance of '-q_'

I have strings, some of which contain 'q_', and for those rows I want to extract everything that comes after it. Example values in the column are:
prod-q_cat_trait_cat_social_issue
_prod-q_body_modification_graffiti
event_tickets
dappled_grey
_prod-q_cat_tech_support
What is wrong with my regular expression? I'm trying to extract everything after the '_' that follows q.
REGEXP_EXTRACT(queue_id, '[^q_]+$')
Is just returning
issue
I've also tried the split method:
SPLIT(queue_id, 'q_')[OFFSET(2)]
But this returns
Array index 2 is out of bounds (overflow)
Any suggestions? Thanks! (I am using Google Cloud SQL)
Using a capturing group, you may extract all after the first q_ with:
REGEXP_EXTRACT(queue_id, 'q_(.*)')
You may extract all after the last q_ with:
REGEXP_EXTRACT(queue_id, '.*q_(.*)')
See the regex demo #1 and regex demo #2.
Here, q_ finds the first occurrence of q_ and (.*) grabs the rest of the line into Group 1, and this is the value returned by REGEXP_EXTRACT. .* matches any 0+ chars other than line break chars as many as possible, that is why the second regex will start capturing the rest of the line after the last occurrence of q_.
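A short Python sketch makes the difference between the patterns visible (Python's re is an assumption standing in for REGEXP_EXTRACT, and None stands in for "no match"). It also shows why the original attempt returns only issue: [^q_] is a character class that excludes the single characters q and _, not the sequence q_.

```python
import re
from typing import Optional

def after_first(queue_id: str) -> Optional[str]:
    """Everything after the first q_, or None if q_ does not occur."""
    m = re.search(r"q_(.*)", queue_id)
    return m.group(1) if m else None

def after_last(queue_id: str) -> Optional[str]:
    """Everything after the last q_, because the leading .* is greedy."""
    m = re.search(r".*q_(.*)", queue_id)
    return m.group(1) if m else None

sample = "prod-q_cat_trait_cat_social_issue"
print(re.search(r"[^q_]+$", sample).group())  # issue
print(after_first(sample))                    # cat_trait_cat_social_issue
print(after_first("a-q_b-q_c"), after_last("a-q_b-q_c"))  # b-q_c c
```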
Google Cloud SQL uses MySQL. I think the simplest method is substring_index():
select substring_index(queue_id, '-q_', -1)
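As far as I know, substring_index() with a count of -1 returns the whole string when the delimiter is missing, so rows such as event_tickets come back unchanged rather than as NULL. A rough Python equivalent of that behavior, with a made-up helper name:

```python
def substring_index_last(value: str, delimiter: str) -> str:
    """Return the part after the last delimiter, or the whole value if the delimiter is absent."""
    return value.rsplit(delimiter, 1)[-1]

print(substring_index_last("_prod-q_cat_tech_support", "-q_"))  # cat_tech_support
print(substring_index_last("event_tickets", "-q_"))             # event_tickets
```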
Can you try this: q_([^q_]+)$? You'll have what you want in the first group.
Edit: this one matches all the cases: (?(?<=-q_).*|^((?!-q_).)*$)

how to get specific part from string in sql

I want to retrieve file names from URLs in SQL.
for example:
Input:
url:
https://www.google.co.in/root/subdir/file.extension?p1=v1&p2=v2
https://www.abxdhcak.com/sitemap-companies.xml
Then the output should be:
file.extension
sitemap-companies.xml
To match your expected output you can use REGEXP_REPLACE
REGEXP_REPLACE(txt, '^.*/|\?.*$') as rg
This does 2 things:
'^.*/'
This removes all characters up to and including the last forward-slash in the string.
'\?.*$'
This removes all characters after and including a question mark.
This may not work for all cases, but it works for the examples provided.
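Here is a small Python sketch of the same two-part replacement (Python's re.sub is an assumption standing in for REGEXP_REPLACE; the third URL is invented to show one case where the pattern breaks).

```python
import re

def file_name(url: str) -> str:
    """Drop everything up to the last '/' and everything from the '?' onward."""
    return re.sub(r"^.*/|\?.*$", "", url)

print(file_name("https://www.google.co.in/root/subdir/file.extension?p1=v1&p2=v2"))
# file.extension
print(file_name("https://www.abxdhcak.com/sitemap-companies.xml"))
# sitemap-companies.xml
print(file_name("https://a.example/file.txt?next=/home"))
# home
```

The last call shows the limitation: when the query string itself contains a slash, the greedy ^.*/ removes everything up to that slash, file name included.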

Teradata regular expressions, 0 or 1 spaces

In Teradata, I'm looking for one regular expression pattern that would allow me to find a pattern of some numbers, then a space or maybe no space, and then 'SF'. It should return 7 in both cases below:
SELECT
REGEXP_INSTR('12345 1000SF', pattern),
REGEXP_INSTR('12345 1000 SF', pattern)
Or, my actual goal is to extract the 1000 in both cases if there's an easier way, probably using REGEXP_SUBSTR. More details are below if you need them.
I have a column that contains free text and I would like to extract the square footage. But, in some cases, there is a space between the number and 'SF' and in some cases there is not:
'other stuff 1000 SF'
'other stuff 1000SF'
I am trying to use the REGEXP_INSTR function to find the starting position. Through google, I have found the pattern for the first to be
'([0-9])+ SF'
When I try the pattern for the second, I try
'([0-9])+SF'
and I get the error
SELECT Failed. [2662] SUBSTR: string subscript out of bounds
I've also found answers to similar questions, but they don't work for Teradata. For example, I don't think you can use ? in Teradata.
The error message indicates you're using SUBSTR, not REGEXP_SUBSTR.
Try this:
RegExp_Substr(col, '[0-9]*(?= {0,1}SF)')
Find multiple digits followed by a single optional blank followed by SF and extract those digits.
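To see what the lookahead pattern returns for the two strings from the question, here is a sketch in Python (an assumption: Python's re instead of Teradata's RegExp_Substr). The helper square_feet is hypothetical and returns both the matched text and its 1-based position, which is what REGEXP_INSTR was expected to give.

```python
import re

SQFT = re.compile(r"[0-9]*(?= {0,1}SF)")

def square_feet(text: str):
    """Return (matched digits, 1-based start position), or None when nothing matches."""
    m = SQFT.search(text)
    return (m.group(), m.start() + 1) if m else None

print(square_feet("12345 1000SF"))   # ('1000', 7)
print(square_feet("12345 1000 SF"))  # ('1000', 7)
print(square_feet("no SF"))          # ('', 3)
```

The last line shows a side effect of the * quantifier: when SF appears without digits in front of it, the match is an empty string. Using [0-9]+ instead would avoid that, if your data can contain such rows.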
I would pattern it like this:
\b(\d+)\s*[Ss][Ff]\b
\b # word boundary
(\d+) # 1 or more digits (captured)
\s* # 0 or more white-space characters
[Ss] # character class
[Ff] # character class
\b # word boundary
Demo
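The capture-group variant can be sketched in Python as well (again an assumption; whether \b and \s are available depends on the regex flavor of your database). The name extract_sqft is made up.

```python
import re
from typing import Optional

SQFT_CAPTURE = re.compile(r"\b(\d+)\s*[Ss][Ff]\b")

def extract_sqft(text: str) -> Optional[int]:
    """Return the number in front of SF (any case, optional whitespace), or None."""
    m = SQFT_CAPTURE.search(text)
    return int(m.group(1)) if m else None

print(extract_sqft("other stuff 1000 SF"))  # 1000
print(extract_sqft("other stuff 1000sf"))   # 1000
print(extract_sqft("1000SFX"))              # None
```

The trailing \b is what rejects "1000SFX", since there is no word boundary between F and X.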

SQL substring non greedy regex

I have data like
http://www.linz.at/politik_verwaltung/32386.asp
stored in a text column. I thought a non-greedy extraction with
select substring(turl from '\..*?$') as ext from tdata
would give me .asp, but instead it still greedily results in
.linz.at/politik_verwaltung/32386.asp
How can I match only from the last occurrence of the dot (.)?
Using Postgresql 9.3
\.[^.]*$ matches . followed by any number of non-dot characters followed by end-of-string:
# select substring('http://www.linz.at/politik_verwaltung/32386.asp'
from '\.[^.]*$');
substring
-----------
.asp
(1 row)
The non-greedy quantifier does not help here because the regex engine still starts matching as early as possible (at the first dot) and only then tries to match as little as possible from there on.
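The leftmost-match behavior is not specific to PostgreSQL. A minimal Python sketch (Python's re is an assumption here) reproduces both results with the URL from the question:

```python
import re

url = "http://www.linz.at/politik_verwaltung/32386.asp"

# The lazy quantifier still starts at the first dot and must reach the end of the string.
lazy = re.search(r"\..*?$", url).group()
# Forbidding further dots forces the match to start at the last dot.
last_dot = re.search(r"\.[^.]*$", url).group()

print(lazy)      # .linz.at/politik_verwaltung/32386.asp
print(last_dot)  # .asp
```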
Try this:
\.[\w]*$
Here is how it works:
It matches a dot (\.), followed by any number (*) of word characters (\w), up to the end of the string ($), so the match starts at the last dot.
Note: updated the answer; it now also captures strings that end with a dot (.).