how to use regexp_extract in hive - sql

I am trying to extract a portion of the below string using regexp_extract but am not having any success:
CUST_NEW_ACCOUNTS_LINES_2019-03-03.dat.gz
I want to just get the date portion. On the regex101.com website this seemed to work, but hive is giving me an error message.
regexp_extract(meta_source_filename,'^(?:[^_]+_){4}([^_]+)') file_date
Can someone help me understand what is incorrect here? I am not at all familiar with regexp_extract syntax so have been using another function as a starting point.
Thank you!

with your_data as (
select 'CUST_NEW_ACCOUNTS_LINES_2019-03-03.dat.gz' str
)
select regexp_extract(str,'_(\\d{4}(-\\d{2}){2})\\.',1)
from your_data;
Result:
OK
2019-03-03
Time taken: 0.062 seconds, Fetched: 1 row(s)
The expression '_(\\d{4}(-\\d{2}){2})\\.' means:
an underscore _, four digits \\d{4}, a hyphen followed by two digits repeated twice (-\\d{2}){2}, and a literal dot \\.
Capture group 1 (the date only) is (\\d{4}(-\\d{2}){2}).
In Hive, each backslash in the pattern must be escaped as \\.
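To see what each group index selects, the same pattern can be tried outside Hive. This is a minimal sketch that uses Python's re module as a stand-in for Hive's Java regex engine (an assumption that should hold for a pattern this simple); extract_group is a hypothetical helper, not a Hive function.

```python
import re

# Same pattern as in the Hive query; a Python raw string needs only single backslashes.
PATTERN = r'_(\d{4}(-\d{2}){2})\.'

def extract_group(text, index):
    """Return capture group `index` of the first match, or None if nothing matches."""
    m = re.search(PATTERN, text)
    return m.group(index) if m else None

name = 'CUST_NEW_ACCOUNTS_LINES_2019-03-03.dat.gz'
whole_match = extract_group(name, 0)   # the whole match, including '_' and '.'
date_only = extract_group(name, 1)     # outer group: the date
last_repeat = extract_group(name, 2)   # inner group: only its last repetition
print(whole_match, date_only, last_repeat)
```

Index 0 is the whole match, so index 1 is the one that yields the bare date.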

You have already captured the substring you need in a capturing group. Pass the number of that group as the third argument:
regexp_extract(meta_source_filename,'^(?:[^_]+_){4}([^_]+)', 1) file_date
See the regexp_extract(string subject, string pattern, int index) docs:
The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
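One thing worth checking with this pattern: ([^_]+) runs up to the next underscore or the end of the string, so group 1 appears to include the file extension as well as the date. The sketch below checks that with Python's re module, assumed to behave like Java regex for this pattern. The tightened pattern, which also excludes dots, is a hypothetical variation and not part of the answer above.

```python
import re

name = 'CUST_NEW_ACCOUNTS_LINES_2019-03-03.dat.gz'

# Pattern from the question: skip four underscore-terminated tokens, capture the fifth.
original = re.search(r'^(?:[^_]+_){4}([^_]+)', name)

# Hypothetical variation: stop the capture at the first dot as well.
tightened = re.search(r'^(?:[^_]+_){4}([^_.]+)', name)

group1_original = original.group(1)
group1_tightened = tightened.group(1)
print(group1_original)
print(group1_tightened)
```

In Hive the tightened pattern would be passed in the same way, with index 1 as the third argument.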

Splitting a string and converting to integer in BigQuery

I have a simple problem, but I have only just started using Google BigQuery and found its help pages too complex for me.
I have a column like that for some rows:
ANSWER(title of column)
9
10 - Certainly Satisfied.
7 -
My aim is to take the part of that column before the "-" sign and convert it to an integer. I found functions like split() and regexp_extract(), but I am not sure how to apply them to my data.
Thanks for your help in advance :)
If the number is always first, you can use:
select sum(safe_cast(split(answer, '-')[ordinal(1)] as int64))
from t;
Note: It looks like you have spaces, so you might really want to split on the space:
select sum(safe_cast(split(answer, ' ')[ordinal(1)] as int64))
from t;
Consider the option below:
select answer,
safe_cast(regexp_extract(trim(answer), r'^\d+') as int64) as score
from `project.dataset.table`
Applied to the sample data in your question, score comes out as 9, 10 and 7 for the three rows.
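The trim, extract and cast steps can be checked against the sample rows outside BigQuery. This is a minimal sketch in Python; score is a hypothetical helper that mirrors the query's logic, returning None where SAFE_CAST would return NULL.

```python
import re

def score(answer):
    """Leading integer of `answer` after trimming, or None if there is none."""
    if answer is None:
        return None
    # re.match anchors at the start of the string, like ^\d+ in the query
    m = re.match(r'\d+', answer.strip())
    return int(m.group(0)) if m else None

samples = ['9', '10 - Certainly Satisfied.', '7 -', 'no score']
scores = [score(s) for s in samples]
print(scores)
```

The last sample is made up to show the fallback for a row with no leading number.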

Druid SQL: get substring issue

There is the table column which holds the comma-separated values, e.g:
abc321,rd512,spwewr
I need to extract the substring which starts from the user-defined pattern.
Example:
Input Pattern | Expected result
abc           | abc321
r             | rd512
spwe          | spwewr
b             | NULL
Following fails in Druid SQL:
SELECT SUBSTRING('abc321,rd512,spwewr', POSITION('r' IN 'abc321,rd512,spwewr'), 2)
This is a known Druid bug:
"Substring operator converter does not handle non-constant literals correctly":
https://issues.apache.org/jira/browse/CALCITE-2226
I think the way to go is to use REGEXP_EXTRACT() or REGEXP_LIKE()
but I cannot figure out the specific syntax.
select regexp_extract('abc321,rd512,spwewr', 'rd[^,]+', 0)
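The query above hard-codes rd. To cover the whole table of expected results, including NULL for b, the pattern probably has to be anchored to the start of a comma-separated value. This is a minimal sketch of that idea in Python, with re assumed to behave like Druid's Java regex for this pattern; find_value is a hypothetical helper.

```python
import re

def find_value(csv_text, prefix):
    """First comma-separated value that starts with `prefix`, or None."""
    # (?:^|,) makes sure the prefix begins a value instead of sitting inside one
    pattern = r'(?:^|,)(' + re.escape(prefix) + r'[^,]*)'
    m = re.search(pattern, csv_text)
    return m.group(1) if m else None

row = 'abc321,rd512,spwewr'
results = {p: find_value(row, p) for p in ['abc', 'r', 'spwe', 'b']}
print(results)
```

In Druid SQL the same idea would presumably be a pattern such as '(^|,)(r[^,]*)' with the group index set to 2, which is worth verifying on your own cluster.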

Getting an error when using CONCAT in BigQuery

I'm trying to run a query where I combine two columns and separate them with an x in between.
I'm also trying to get some other columns from the same table. However, I get the following error.
Error: No matching signature for function CONCAT for argument types: FLOAT64, FLOAT64. Supported signatures: CONCAT(STRING, [STRING, ...]); CONCAT(BYTES, [BYTES, ...]).
Here is my code:
SELECT
CONCAT(right,'x',left),
position,
numbercreated,
Madefrom
FROM
table
WHERE
Date = "2018-10-07%"
I have tried also putting a cast before but that did not work.
SELECT CONCAT(cast(right,'x',left)), position,...
SELECT CONCAT(cast(right,'x',left)as STRING), position,...
Why am I getting this error?
Are there any fixes?
Thanks for the help.
You need to cast each value before the concat():
SELECT CONCAT(CAST(right as string), 'x', CAST(left as string)),
position, numbercreated, Madefrom
FROM table
WHERE Date = '2018-10-07%';
If you want a particular format, then use the FORMAT() function.
I also doubt that your WHERE will match anything. If Date is a string, then you probably want LIKE:
WHERE Date LIKE '2018-10-07%';
More likely, you should use the DATE function or direct comparison:
WHERE DATE(Date) = '2018-10-07'
or:
WHERE Date >= '2018-10-07' AND
Date < '2018-10-08'
Another option to fix your issue with CONCAT is to use the FORMAT function, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1.01 AS `right`, 2.0 AS `left`
)
SELECT FORMAT('%g%s%g', t.right, 'x', t.left)
FROM `project.dataset.table` t
result will be
Row f0_
1 1.01x2
Note: in this specific example you could use an even simpler statement:
FORMAT('%gx%g', t.right, t.left)
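Python's % formatting uses the same printf-style %g specifier, so it gives a quick way to see both the type error and the formatted result. This is only an analogy, not BigQuery itself: concat_strict and join_dimensions are hypothetical helpers, and BigQuery's own output should be confirmed by running the query.

```python
def concat_strict(*parts):
    """Joins strings only; raises TypeError on floats, much like CONCAT(STRING, ...)."""
    return ''.join(parts)

def join_dimensions(right, left):
    """Formats two numbers with %g, which drops trailing zeros, around a literal x."""
    return '%gx%g' % (right, left)

try:
    concat_strict(1.01, 'x', 2.0)
    concat_failed = False
except TypeError:
    concat_failed = True   # floats have to be converted to strings first

formatted = join_dimensions(1.01, 2.0)
print(concat_failed, formatted)
```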
See the FORMAT documentation for the supported format specifiers.
A few recommendations: try not to use keywords as column names or aliases. If for some reason you do, wrap them in backticks or prefix them with the table name or alias.
One more comment: it looks like you switched the positions of your values, so right ends up on the left side and left on the right. That might be exactly what you need, but I wanted to mention it.
Try it as below, using SAFE_CAST:
SELECT
CONCAT(SAFE_CAST( right as string ),'x',SAFE_CAST(left as string)),
position,
numbercreated,
Madefrom
FROM
table
WHERE
Date = '2018-10-07'

Translate function not returning relevant string in amazon redshift

I am trying to use a simple TRANSLATE function to remove "-" from a 23-character string. An example of one such string is "1049477-1623095-2412303". The expected outcome of my query is 104947716230952412303.
All such strings are stored in a single column named "data" in the table "table1".
My query is
Select TRANSLATE(t.data, '-', '')
from table1 as t
However, it returns 104947716230952000000 as the output.
At first I thought it was an overflow error, since the result is a 21-digit integer, so I also tried the following:
SELECT CAST(TRANSLATE(t.data,'-','') AS VARCHAR)
from table1 as t
but this does not work either.
Please suggest a way to get the desired output.
This is too long for a comment.
This code:
select translate('1049477-1623095-2412303', '-', '')
is going to return:
'104947716230952412303'
The return value is a string, not a number.
There is no way that it can return '104947716230952000000'. I could only imagine that happening if somehow the value is being converted to a numeric or bigint type.
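The reported value is what you get when the 21-digit result is kept as a number with only 15 significant digits. The sketch below reproduces that in Python. The 15-digit limit is an assumption about whatever client or tool displayed the result; the question does not say where the conversion happens. Both helper names are hypothetical.

```python
from decimal import Context

def strip_hyphens(value):
    """String-only equivalent of translate(value, '-', '')."""
    return value.replace('-', '')

def as_15_digit_number(digits):
    """Round a digit string to 15 significant digits, as a numeric display might."""
    return int(Context(prec=15).create_decimal(digits))

as_string = strip_hyphens('1049477-1623095-2412303')
as_number = as_15_digit_number(as_string)
print(as_string)
print(as_number)
```

Kept as a string, the value is intact. The trailing zeros only appear once it is treated as a number.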
Try regexp_replace()
Taking your own example, execute:
select regexp_replace('[string / column_name]','-');
This can be achieved with RPAD; try the code below.
SELECT RPAD(TRANSLATE(CAST(t.data as VARCHAR),'-','') ,20,'00000000000000000000')

In Hive I need to Get numeric value after a particular word is it possible?

I want to get the numeric value immediately after a particular word in a string.
In Hive, for example:
in APDSGDSCRAM051 I need to get the numeric value after the word RAM.
Is this possible in Hive?
Note: it is not a fixed-length string.
Here you go; you need to use the built-in Hive functions substr and instr:
create table str_testing (c string);
insert into table str_testing values ('APDSGDSCRAM051');
select substr(c, instr(c, 'RAM') + 3) from str_testing;
OK
051
Time taken: 0.243 seconds, Fetched: 1 row(s)
As explained here, you can implement this in Hive as:
select regexp_extract(name, '\\d+', 0) from <table_name>;
Note: I do not have a Hive environment configured, so please check this by running it on your end. This only returns the first run of digits found in the string; if your string has numbers in multiple places, it may not return the one you want.
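To tie the digits to the word before them, the word can be made part of the pattern and only the digits captured. This is a minimal sketch in Python, with re assumed to match Hive's Java regex for this pattern; digits_after is a hypothetical helper, and the second input string is made up for illustration.

```python
import re

def digits_after(text, word):
    """Digits that immediately follow `word` in `text`, or None if there are none."""
    m = re.search(re.escape(word) + r'(\d+)', text)
    return m.group(1) if m else None

after_ram = digits_after('APDSGDSCRAM051', 'RAM')

# With digits in several places, a bare \d+ picks up the first run instead
mixed = 'AB12RAM051X9'
first_digits = re.search(r'\d+', mixed).group(0)
after_ram_mixed = digits_after(mixed, 'RAM')
print(after_ram, first_digits, after_ram_mixed)
```

The Hive equivalent would presumably be regexp_extract(c, 'RAM(\\d+)', 1), which is worth confirming on your own data.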