Get the substring from a string in Apache Drill using the position of a character - sql

Please help me with a solution.
From the string '/This/is/apache/drill/queries', I want a SQL query that runs on Apache Drill and fetches the substring 'drill', which comes after the 4th occurrence of '/'.
NOTE: the length of the string varies, so the position of each '/' also varies.
The string always starts with '/'.
In Drill, instr(string,'/',1,4) does not work, so I cannot get the string that appears after the 4th occurrence of '/'.

Drill has a split UDF that does the same thing as Java's String.split(), so this query:
SELECT split(a, '/')[4] FROM (VALUES('/This/is/apache/drill/queries')) t(a);
returns the desired result:
+---------+
| EXPR$0 |
+---------+
| drill |
+---------+
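
The index 4 can look off by one at first glance. As a rough check outside Drill, the sketch below uses Python's str.split as a stand-in for the split UDF (an assumption: the two behave the same for this input, with zero-based indexing and an empty first element caused by the leading '/').

```python
# Stand-in for Drill's split UDF, used only to illustrate the indexing.
path = '/This/is/apache/drill/queries'

parts = path.split('/')
# The leading '/' produces an empty string at index 0,
# so the text after the 4th '/' sits at index 4.
print(parts)   # ['', 'This', 'is', 'apache', 'drill', 'queries']

fourth = parts[4]
print(fourth)  # drill
```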

You can use REGEXP_REPLACE for that purpose:
SELECT REGEXP_REPLACE('/This/is/apache/drill/queries', '^\/.*?\/.*?\/.*?\/(.*?)\/.*$','\1');
The above regular expression skips to the fourth '/' and captures the content from there up to the fifth '/'. The whole string is then replaced by that captured group.
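
To see how a pattern of this shape behaves, here is a minimal sketch in Python's re module rather than Drill. The helper name segment_after_fourth_slash is hypothetical, and the pattern is written without the escaped slashes and ends at the $ anchor. One thing it suggests: the pattern needs a fifth '/', so a string that ends right after the wanted segment comes back unchanged.

```python
import re

# Hypothetical helper mirroring the REGEXP_REPLACE idea with Python's re module.
PATTERN = re.compile(r'^/.*?/.*?/.*?/(.*?)/.*$')

def segment_after_fourth_slash(s: str) -> str:
    # Replace the whole string with capture group 1 when the pattern matches;
    # re.sub returns the input unchanged when it does not match.
    return PATTERN.sub(r'\1', s)

print(segment_after_fourth_slash('/This/is/apache/drill/queries'))  # drill
print(segment_after_fourth_slash('/This/is/apache/drill'))  # unchanged, no fifth '/'
```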

Related

How to add delimiter to String after every n character using hive functions?

I have the hive table column value as below.
"112312452343"
I want to add a delimiter such as ":" (i.e., a colon) after every 2 characters.
I would like the output to be:
11:23:12:45:23:43
Is there any hive string manipulation function support available to achieve the above output?
For fixed length this will work fine:
select regexp_replace(str, "(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})","$1:$2:$3:$4:$5:$6")
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
Another solution works for strings of any length. Split the string at each empty position that is preceded ((?<= )) by the end of the previous match (\\G) followed by two digits (\\d{2}), concatenate the array with ':' and remove the delimiter left at the end (:$):
select regexp_replace(concat_ws(':',split(str,'(?<=\\G\\d{2})')),':$','')
from
(select "112312452343" as str)s
Result:
11:23:12:45:23:43
If it can contain not only digits, use dot (.) instead of \\d:
regexp_replace(concat_ws(':',split(str,'(?<=\\G..)')),':$','')
This is actually quite simple if you're familiar with regex & lookahead.
Replace every 2 characters that are followed by another character, with themselves + ':'
select regexp_replace('112312452343','..(?=.)','$0:')
+-------------------+
| _c0 |
+-------------------+
| 11:23:12:45:23:43 |
+-------------------+
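
The lookahead behaviour can be checked outside Hive. The sketch below uses Python's re module as a stand-in (an assumption: Hive uses Java regex, where the whole match is written $0, while Python writes it \g<0>). The function name add_colons is made up for illustration.

```python
import re

def add_colons(s: str) -> str:
    # '..' matches two characters; '(?=.)' requires one more character after
    # them without consuming it, so the final pair gets no trailing colon.
    return re.sub(r'..(?=.)', r'\g<0>:', s)

print(add_colons('112312452343'))  # 11:23:12:45:23:43
print(add_colons('12345'))         # 12:34:5
```

With an odd-length input the leftover single character simply becomes the last group.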

Get the last part of the value returned by split_part() function

I have a file_path string separated by forward slashes. I want to split them based on the forward slashes and return the file name.
INPUT
//a/b/c/xyz.png
OUTPUT
xyz.png
CURRENT SOLUTION
SELECT REVERSE(SPLIT_PART(REVERSE('//a/b/c/xyz.py'), '/', 1)) as "file_name";
Is there a more efficient way of doing this?
regexp_match() is more concise:
select (regexp_match('//a/b/c/xyz.py', '[^/]+$'))[1]
I would just use regexp_replace() to remove everything before the last slash (included):
select regexp_replace('//a/b/c/xyz.png', '.*/', '')
Demo on DB Fiddle:
| regexp_replace |
| :------------- |
| xyz.png |
You can also use substring(), which may or may not be more efficient:
substring('//a/b/c/xyz.png' from '[^/]*$')
PostgreSQL 14 supports a negative index in split_part(), so this becomes a straightforward operation.
split_part
Splits string at occurrences of delimiter and returns the n'th field (counting from one), or when n is negative, returns the |n|'th-from-last field.
split_part('abc,def,ghi,jkl', ',', -2) → ghi
In this particular scenario:
SELECT SPLIT_PART('//a/b/c/xyz.py', '/', -1) as "file_name";
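
For readers who want to check the negative-index semantics without a PostgreSQL 14 server, here is a rough Python emulation. It is a sketch under assumptions: the function below is not PostgreSQL code, it assumes a non-empty delimiter, and it returns an empty string for an out-of-range field.

```python
def split_part(s: str, delimiter: str, n: int) -> str:
    """Rough emulation of split_part with 1-based and negative field numbers."""
    if n == 0:
        raise ValueError('field position must not be zero')
    fields = s.split(delimiter)
    # Positive n counts from the first field, negative n from the last.
    index = n - 1 if n > 0 else len(fields) + n
    if 0 <= index < len(fields):
        return fields[index]
    return ''

print(split_part('abc,def,ghi,jkl', ',', -2))  # ghi
print(split_part('//a/b/c/xyz.py', '/', -1))   # xyz.py
```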

BigQuery - Regex to match a pattern after a known string (positive lookbehind alternative)

I need to extract 8 digits after a known string:
| MyString | Extract: |
| ---------------------------- | -------- |
| mypasswordis 12345678 | 12345678 |
| # mypasswordis 12345678 | 12345678 |
| foobar mypasswordis 12345678 | 12345678 |
I can do this with regex like:
(?<=mypasswordis.*)[0-9]{8}
However, when I want to do this in BigQuery using the REGEXP_EXTRACT command, I get the error message, "Cannot parse regular expression: invalid perl operator: (?<".
I searched through the re2 library and saw there doesn't seem to be an equivalent for positive lookbehind.
Is there any way I can do this using other methods? Something like
SELECT REGEXP_EXTRACT(MyString, r"(?<=mypasswordis.*)[0-9]{8}")
You need a capturing group here to extract a part of a pattern, see the REGEXP_EXTRACT docs you linked to:
If the regular expression contains a capturing group, the function returns the substring that is matched by that capturing group. If the expression does not contain a capturing group, the function returns the entire matching substring.
Also, the .* pattern is too costly, you only need to match whitespace between the word and the digits.
In general, to "convert" a (?<=mypasswordis).* pattern with a positive lookbehind, you can use mypasswordis(.*).
In this case, you can use
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]{8})")
Or just
SELECT REGEXP_EXTRACT(MyString, r"mypasswordis\s*([0-9]+)")
See the re2 regex online test.
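
The capturing-group approach can be tried against the three sample rows from the question. This is only a sketch in Python's re module, which is not RE2, although this particular pattern uses syntax both engines share. The helper extract_digits is hypothetical, and returning None on no match is an assumption meant to resemble a NULL result.

```python
import re

# Match the known word, optional whitespace, then capture exactly 8 digits.
PATTERN = re.compile(r'mypasswordis\s*([0-9]{8})')

def extract_digits(text: str):
    match = PATTERN.search(text)
    # group(1) is the captured digits, not the whole match.
    return match.group(1) if match else None

rows = [
    'mypasswordis 12345678',
    '# mypasswordis 12345678',
    'foobar mypasswordis 12345678',
]
for row in rows:
    print(extract_digits(row))  # 12345678 each time
```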
Avoid regular expressions where you can; they can be quite slow. Try SUBSTR and INSTR instead, for example:
SELECT SUBSTR(MyString, INSTR(MyString,'mypasswordis') + LENGTH('mypasswordis')+1)
Otherwise, Wiktor Stribiżew probably has the right answer.
Use REGEXP_REPLACE instead to match what you don't want and delete that:
REGEXP_REPLACE(str, r'^.*mypasswordis ', '')

Getting everything after last '/'

I have a path like this in my TBL_Documents table:
Uploads/Documents/6093/12/695-Graco-SW_5-15-19.pdf
I need to compare it to a file being uploaded now that will look like this:
695-Graco-SW_5-15-19.pdf
I want to compare the path in my table with the uploaded file name. I tried using substring() on the first / from the right, but I don't really understand how substring works. For example, I tried to do this:
select substring(right(path,1),1,1) as path from TBL_DOCUMENT
but it is only giving me the very first character from the right. I expected to see everything after the last / character.
How can I do this?
I would use an approach of finding how many characters you need to use from the right. I would do this by first reversing the string and then searching for the '/'. This will tell you how many characters from the right this '/' is. I would then use this in the RIGHT function:
SQL Fiddle
MS SQL Server 2017 Schema Setup:
Query 1:
DECLARE @documentName varchar(100) = 'Uploads/Documents/6093/12/695-Graco-SW_5-15-19.pdf'
SELECT RIGHT(@documentName, CHARINDEX('/',REVERSE(@documentName))-1)
Results:
| |
|--------------------------|
| 695-Graco-SW_5-15-19.pdf |
RIGHT(path,1) means you want [1] character from the right of the path string, or 'f'.
You then wrap 'f' in a substring, asking for [1] character starting at the [1]st position of the string. Since the expression passed to substring returns 'f', your substring also returns 'f'.
You want to use a combination of charindex and reverse to handle this appropriately.
SUBSTRING(path,len(path) - charindex('/',reverse(path))). That will not parse but it should get you on the right track.
In plain terms, this returns the string from the rightmost '/' of the path to the end of the string.
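
The reverse-and-search idea from the answers above can be traced step by step outside SQL Server. The sketch below is in Python, with a hypothetical helper name. It assumes that a path without any '/' should come back whole, a case the T-SQL expressions above do not handle.

```python
def after_last_slash(path: str) -> str:
    # Searching the reversed string finds the last '/' first. The 0-based
    # result equals the number of characters that follow that slash.
    chars_after = path[::-1].find('/')
    if chars_after == -1:
        return path  # assumption: no slash means the whole string is the name
    # Keep that many characters from the right-hand end.
    return path[len(path) - chars_after:]

print(after_last_slash('Uploads/Documents/6093/12/695-Graco-SW_5-15-19.pdf'))
# 695-Graco-SW_5-15-19.pdf
```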

How to select 1st half part of pipe separated data

Data in each record of a column named REQUEST_IP_ADDR looks like '10.247.32.44 | 10.247.32.44'. How do I select only the first part, that is, 10.247.32.44?
--Below is the select query I am trying to run
SELECT DISTINCT MSG_TYPE_CD, SRC, SRC_IP from MESSAGE_LOG order by MSG_TYPE_CD;
--My table looks as below
MSG_TYPE_CD SRC SRC_IP
KB0192 ZOHO 10.247.32.44 | 10.247.32.44
KB0192 ZOHO 10.247.32.45 | 10.247.32.45
KB0192 ZOHO 127.0.0.1 | 10.240.20.137
KB0192 ZOHO 127.0.0.1 | 10.240.20.138
KB0196 GUPSHUP 10.240.20.59 | 10.10.1.19
I want to select only the first part of the data, which is before the pipe.
Using the base string functions we can try:
SELECT
SRC_IP,
SUBSTR(SRC_IP, 1, INSTR(SRC_IP, '|') - 2) AS first_ip
FROM MESSAGE_LOG
ORDER BY
MSG_TYPE_CD;
The logic behind the first query is that we find the position of the pipe | using INSTR. Then, we take the substring from the first character until two characters before the pipe (to leave out both the pipe and the space that precedes it).
A very slick answer using REGEXP_SUBSTR:
SELECT
SRC_IP,
REGEXP_SUBSTR(SRC_IP, '^[^ |]+') AS first_ip
FROM MESSAGE_LOG
ORDER BY
MSG_TYPE_CD;
The regex pattern used here is:
^[^ |]+
This matches one or more characters from the start of the SRC_IP column that are neither a space nor a pipe |. In other words, it takes the first IP address.
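
Both approaches can be compared on the sample rows with a small sketch. It is written in Python rather than SQL, the function names are hypothetical, and the positional variant assumes exactly one space before the pipe, as in the sample data.

```python
import re

def first_ip_by_position(src_ip: str) -> str:
    # str.index is 0-based while INSTR is 1-based, so slicing up to
    # (index - 1) drops the pipe and the single space before it.
    pipe_at = src_ip.index('|')
    return src_ip[:pipe_at - 1]

def first_ip_by_regex(src_ip: str):
    # Leading run of characters that are neither a space nor a pipe.
    match = re.match(r'^[^ |]+', src_ip)
    return match.group(0) if match else None

for row in ['10.247.32.44 | 10.247.32.44', '127.0.0.1 | 10.240.20.137']:
    print(first_ip_by_position(row), first_ip_by_regex(row))
```

The regex variant also copes with rows that have no space around the pipe, which the fixed offset does not.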