Get the last part of the value returned by split_part() function - sql

I have a file_path string separated by forward slashes. I want to split them based on the forward slashes and return the file name.
INPUT
//a/b/c/xyz.png
OUTPUT
xyz.png
CURRENT SOLUTION
SELECT REVERSE(SPLIT_PART(REVERSE('//a/b/c/xyz.py'), '/', 1)) as "file_name";
Is there a more efficient way of doing this?

regexp_match() is more concise:
select (regexp_match('//a/b/c/xyz.py', '[^/]+$'))[1]

I would just use regexp_replace() to remove everything before the last slash (included):
select regexp_replace('//a/b/c/xyz.png', '.*/', '')
Demo on DB Fiddle:
| regexp_replace |
| :------------- |
| xyz.png |
You can also use substring(), which may or may not be more efficient:
substring('//a/b/c/xyz.png' from '[^/]*$')

PostgreSQL 14 will support negative index so it will be straightforward operation.
split_part
Splits string at occurrences of delimiter and returns the n'th field (counting from one), or when n is negative, returns the |n|'th-from-last field.
split_part('abc,def,ghi,jkl', ',', -2) → ghi
In this particular scenario:
SELECT SPLIT_PART('//a/b/c/xyz.py', '/', -1) as "file_name";

Related

How to add delimiter to String after every n character using hive functions?

I have the hive table column value as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any hive string manipulation function support available to achieve the above output?
For fixed length this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution which will work for dynamic length string. Split string by the empty string that has the last match (\\G) followed by two digits (\\d{2}) before it ((?<= )), concatenate array and remove delimiter at the end (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If it can contain not only digits, use dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')
This is actually quite simple if you're familiar with regex & lookahead.
Replace every 2 characters that are followed by another character, with themselves + ':'
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
| _c0 |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+

Separate with different characters sql

So I have a column which contains multiple different strings. If the string contains a _ it has to be split on that character. For the others I use would use a separate rule like: If it starts with 4FH, GWO, CTW and doesn't have an _ then it has to split after 3 characters. If it starts with 4 and doesn't have an _.. etc..
Example
|Source |
|EC_HKT |
|4FHHTK |
|ABC_GJE |
|4SHARED |
|ETK_ETK-40|
etc..
What i want as a result is
|Source|Instance|
|EC |HKT |
|4FH |HTK |
|ABC |GJE |
|4 |SHARED |
|ETK |40 |
As a start I first tried
SELECT
LEFT(lr.Source, CHARINDEX('_',lr.Source)) AS Source,
RIGHT(lr.Source, LEN(lr.Source) - CHARINDEX('_', lr.Source)) AS Interface,
But this would only work if all the results had a _ . Any tips or ideas? Would a CASE WHEN THEN work?
This requires a little creativity and no doubt more work than what I've done here, however this gives you at least one pattern to work with and enhance as required.
The following uses a simple function to apply basic rules from your sample data to derive the point to split your string, plus some additional removal of characters and removal of the source part if it also exists in the instance part.
If more than one "rule" matches, it uses the one with higher number of matching characters.
Function:
create or alter function splitpos(#string varchar(50))
returns table as
return
(
with map as (
select * from (values ('4FH'),('GWO'),('CTW'),('4'))m(v)
)
select IsNull(NullIf(CharIndex('_',#string),0)-1,Max(Len(m.v))) pos
from map m
where #string like m.v + '%'
)
Query:
select l.v source, Replace(Replace(Replace(Stuff(source,1,pos,''),'_',''),'-',''),l.v,'') instance
from t
cross apply dbo.splitpos(t.source)
cross apply (values(Left(source,pos)))l(v)
Demo DB<>Fiddle
To split with different rules, use a CASE expression. (W3Schools)
SELECT CASE
WHEN lr.Source LIKE '4FH%' AND CHARINDEX('_', lr.Source) = 0
THEN LEFT(lr.Source, 3)
...
END as Source
If theses are separate columns then you would need a case statement for each column.

How do I remove the | special character in a CHAR dataset using proc sql?

I am trying to remove the charater | using proc sql. The position of | is not fixed and varies in the data, hence I do not want to use the substr function
Example 1- 1234|5678|9|101
Example 2 - 12345|6789|1|011
You can use TRANSLATE() function
UPDATE tab
SET TRANSLATE(Col, '', '|')
In oracle you could use REPLACE, something like
SELECT REPLACE('1234|5678|9|101 Example 2 - 12345|6789|1|011','|','') Changed
FROM DUAL;

BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

I need to extract 8 digits after a known string:
| MyString | Extract: |
| ---------------------------- | -------- |
| mypasswordis 12345678 | 12345678 |
| # mypasswordis 12345678 | 12345678 |
| foobar mypasswordis 12345678 | 12345678 |
I can do this with regex like:
(?<=mypasswordis.*)[0-9]{8})
However, when I want to do this in BigQuery using the REGEXP_EXTRACT command, I get the error message, "Cannot parse regular expression: invalid perl operator: (?<".
I searched through the re2 library and saw there doesn't seem to be an equivalent for positive lookbehind.
Is there any way I can do this using other methods? Something like
SELECT REGEXP_EXTRACT(MyString, r"(?<=mypasswordis.*)[0-9]{8}"))
You need a capturing group here to extract a part of a pattern, see the REGEXP_EXTRACT docs you linked to:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group. If the expression does not contain a capturing group, the function returns the entire matching substring.
Also, the .* pattern is too costly, you only need to match whitespace between the word and the digits.
In general, to "convert" a (?<=mypasswordis).* pattern with a positive lookbehind, you can use mypasswordis(.*).
In this case, you can use
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]{8})"))
Or just
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]+)"))
See the re2 regex online test.
Try to not use regexp as much as you can, its quite slow. Try substring and instr as example:
SELECT SUBSTR(MyString, INSTR(MyString,'mypasswordis') + LENGTH('mypasswordis')+1)
otherwise Wiktor Stribiżew have probably right answer.
Use REGEXP_REPLACE instead to match what you don't want and delete that:
REGEXP_REPLACE(str, r'^.*mypasswordis ', '')

Get the substring from a string in Apache drill using the position of a charecter

Please help me with a solution.
From the string '/This/is/apache/drill/queries' I want a sql query that runs on Apache Drill to fetch sub-string 'drill' that comes after the 4th occurrence of '/'
NOTE: the length of the string varies, hence position of the '/' also varies.
And the string starts with '/'
In drill the instr(string,'/',1,4) will not work. Hence I am not able to get the string that appears after the 4th occurrence of '/'.
Drill has split UDF which does the same thing as String.split(), so this query:
SELECT split(a, '/')[4] FROM (VALUES('/This/is/apache/drill/queries')) t(a);
Will return the desired result:
+---------+
| EXPR$0 |
+---------+
| drill |
+---------+
You can use REGEXP_REPLACE for that purpose:
SELECT REGEXP_REPLACE('/This/is/apache/drill/queries', '^\/.*?\/.*?\/.*?\/(.*?)\/.*$.','\1');
The above regular expression looks for fourth '/' and takes the content from there till fifth '/'.