regexp_extract in Hive to extract a date

I need help with regexp_extract in Hive. I have a string column from which I need to extract a date. Sample data is given below:
Abc def: 23-oct-17
Def:abc abc: 23-nov-2017
My data is: 17-nov-17

Since the date is the last part of the string, you can use the query below:
hive> select regexp_extract('Def:abc abc: 23-nov-2017', '\\d*-\\w*-\\d*$', 0);
OK
23-nov-2017
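The regex above matches a day-month-year pattern (DD-MON-YY or DD-MON-YYYY) anchored at the end of the string; index 0 returns the whole match. Because every part uses *, the pattern is loose: it only requires two hyphens at the end, with any digits or word characters around them.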
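To see how the end-of-string pattern behaves on all three sample rows, here is a minimal Python sketch. It is not Hive: Python's re module stands in for Hive's Java regex engine, which I assume treats this simple pattern the same way. The helper name extract_date and the stricter STRICT pattern are hypothetical additions, not part of the answer.

```python
import re

# Pattern from the answer, as a Python raw string (single backslashes
# instead of the doubled backslashes needed inside a Hive string literal).
LOOSE = re.compile(r"\d*-\w*-\d*$")

# Hypothetical stricter variant: 1-2 digit day, 3-letter month, 4- or 2-digit year.
STRICT = re.compile(r"\d{1,2}-[A-Za-z]{3}-(?:\d{4}|\d{2})$")


def extract_date(text, pattern=LOOSE):
    """Return the trailing date-like substring, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


samples = [
    "Abc def: 23-oct-17",
    "Def:abc abc: 23-nov-2017",
    "My data is: 17-nov-17",
]

for s in samples:
    # Both patterns agree on the sample data.
    print(extract_date(s), extract_date(s, STRICT))
```

The loose pattern also accepts a bare "--" at the end of a string, which is why a stricter pattern may be worth considering if the column can contain other trailing text.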

The split() function is also regexp-based, so you can split on a colon followed by one or more spaces:
select
    split(str, ':\\s+')[1] date
from
(
    select
        stack(3,
            'Abc def: 23-oct-17',
            'Def:abc abc: 23-nov-2017',
            'My data is: 17-nov-17'
        ) as str
) s
Result:
OK
23-oct-17
23-nov-2017
17-nov-17
Time taken: 0.063 seconds, Fetched: 3 row(s)
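The split-based approach can be sketched in Python as well, again only as a stand-in for Hive's split(). The function name date_after_colon is hypothetical, and returning None when there is no second element is my choice for the sketch, not a statement about Hive.

```python
import re


def date_after_colon(text):
    """Split on a colon followed by whitespace and return the second piece."""
    parts = re.split(r":\s+", text)
    # A colon with no following space (as in "Def:abc") does not split.
    return parts[1] if len(parts) > 1 else None


for s in [
    "Abc def: 23-oct-17",
    "Def:abc abc: 23-nov-2017",
    "My data is: 17-nov-17",
]:
    print(date_after_colon(s))
```

Note that this relies on there being exactly one "colon plus space" before the date; a second such separator earlier in the string would shift the date to a later index.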

Related

Extract a substring and take second value in a Bigquery Column

I have this data:
id val
1 ajkdks - jkdj
2 djs - djsd
I want to take only the second value. Which is:
id val
1 jkdj
2 djsd
I know the query in MySQL:
SUBSTRING_INDEX(SUBSTRING_INDEX(val, " - ", 2)," - ",-1)
But what is the equivalent query in BigQuery?

Use the query below:
select id, split(val, ' - ')[safe_offset(1)] val
from your_table
Applied to the sample data in your question, the output is:
id val
1 jkdj
2 djsd

We could phrase this using REGEXP_EXTRACT:
SELECT id, REGEXP_EXTRACT(val, r'[^ -]+$') AS val
FROM yourTable
ORDER BY id;
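Note that the regex above returns the last run of characters that are neither spaces nor hyphens. If val has no " - " separator and contains no spaces or hyphens, the entire value is returned, whereas the split approach returns NULL. If the second value itself contains a space or a hyphen, the regex returns only its final part.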
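The difference between the two answers shows up on edge cases. The Python sketch below mimics both; it is not BigQuery, and the function names are hypothetical. None plays the role of the NULL that safe_offset returns for a missing element.

```python
import re


def second_by_split(val, sep=" - "):
    """Mimic split(val, ' - ')[safe_offset(1)]: second piece, or None."""
    parts = val.split(sep)
    return parts[1] if len(parts) > 1 else None


def last_token_by_regex(val):
    """Mimic REGEXP_EXTRACT(val, r'[^ -]+$'): trailing run without spaces or hyphens."""
    match = re.search(r"[^ -]+$", val)
    return match.group(0) if match else None


for val in ["ajkdks - jkdj", "djs - djsd", "nosep", "ab - cd ef"]:
    print(repr(val), second_by_split(val), last_token_by_regex(val))
```

On the question's data both functions agree. They diverge on "nosep" (None versus the whole value) and on "ab - cd ef" ("cd ef" versus "ef").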

AWS Athena: How can we get integer value as string with thousand comma separator in AWS Athena

How can we show integers with a comma as the thousands separator?
For example, when executing the statement below
select 1234567890
how can we get the result as 1,234,567,890?

You can achieve this by casting the number to a string and using a regex:
with dataset(num) as (
values (1234567890),
(123456789),
(12345678),
(1234567),
(123456),
(12345),
(1234),
(123)
)
select regexp_replace(cast(num as VARCHAR), '(\d)(?=(\d\d\d)+(?!\d))', '$1,')
from dataset
Output:
_col0
1,234,567,890
123,456,789
12,345,678
1,234,567
123,456
12,345
1,234
123
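The regex inserts a comma after every digit that is followed by one or more complete groups of three digits and then no further digit. A small Python sketch of the same pattern follows, with Python's re standing in for Athena's regex engine. The function name is hypothetical, and Python writes the replacement group as \1 where the query uses $1.

```python
import re


def add_thousands_separator(num):
    """Insert commas using the same lookahead pattern as the query."""
    return re.sub(r"(\d)(?=(\d\d\d)+(?!\d))", r"\1,", str(num))


for n in [1234567890, 123456789, 12345678, 1234567, 123456, 12345, 1234, 123]:
    print(add_thousands_separator(n))
```

For non-negative integers the result matches Python's built-in format(n, ","), which is a convenient cross-check of the pattern.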

Column Name as a Parameter [duplicate]

I am trying to get data from a table that has columns named year_2016, year_2017, year_2018, etc.
I am not sure how to select a column from this table when its name is only known at run time.
The data looks like:
| count_of_accidents | year_2016 | year_2017 |year_2018 |
|--------------------|-----------|-----------|----------|
| 15 | 12 | 5 | 1 |
| 5 | 10 | 6 | 18 |
I have tried the concat function to build the column name, but this does not work.
This is what I tried:
select SUM( count_of_accidents * concat('year_',year(regexp_replace('2018_1_1','_','-'))))
from table_name;
The column name (year_2017, year_2018, etc.) will be passed as a parameter, so I cannot hardcode the column name like this:
select SUM( count_of_accidents * year_2018) from table_name;
Is there any way I can do this?

You can do it using regular expressions. Like this:
--create test table
create table test_col(year_2018 string, year_2019 string);
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;
--test select using hard-coded pattern
select year_2018, `(year_)2019` from test_col;
OK
year_2018 year_2019
Time taken: 0.862 seconds
--test pattern parameter
set hivevar:year_param=2019;
select year_2018, `(year_)${year_param}` from test_col;
OK
year_2018 year_2019
Time taken: 0.945 seconds
--two parameters
set hivevar:year_param1=2018;
set hivevar:year_param2=2019;
select `(year_)${year_param1}`, `(year_)${year_param2}` from test_col t;
OK
year_2018 year_2019
Time taken: 0.159 seconds
--parameter contains full column_name and using more strict regexp pattern
set hivevar:year_param2=year_2019;
select `^${year_param2}$` from test_col t;
OK
year_2019
Time taken: 0.053 seconds
--select all columns using single pattern year_ and four digits
select `^year_[0-9]{4}$` from test_col t;
OK
year_2018 year_2019
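The regex column specifications above can be pictured as a filter over the table's column names. The Python sketch below is only an illustration: select_columns is a hypothetical helper, and it assumes the pattern must match the whole column name, case-insensitively, which is my reading of the outputs above rather than something the answer states.

```python
import re


def select_columns(columns, pattern):
    """Return the column names that the pattern matches in full."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [c for c in columns if rx.fullmatch(c)]


columns = ["number_of_incidents", "year_2016", "year_2017", "year_2018"]

# Pattern built from a prefix plus a substituted parameter value.
year_param = "2018"
print(select_columns(columns, "(year_)" + year_param))

# One pattern selecting every year_NNNN column.
print(select_columns(columns, "^year_[0-9]{4}$"))
```

The string concatenation here happens in Python before the pattern is used, mirroring how the ${year_param} variable is substituted into the query text before Hive parses it.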
The parameter must be calculated outside the script and passed in, because functions like concat() or regexp_replace() are not supported in column names.
Also, column aliasing does not work for columns selected using regular expressions:
select t.number_of_incidents, `^${year_param}$` as year1 from test_t t;
throws exception:
FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias
or column reference '^year_2018$': (possible column names are:
number_of_incidents, year_2016, year_2017, year_2018)
I found a workaround to alias a column using union all with empty dataset, see this test:
create table test_t(number_of_incidents int, year_2016 int, year_2017 int, year_2018 int);
insert into table test_t values(15, 12, 5, 1); --insert test data
insert into table test_t values(5,10,6,18);
--parameter, can be passed from outside the script from command line
set hivevar:year_param=year_2018;
--enable regex columns and print column names
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;
--Alias column using UNION ALL with empty dataset
select sum(number_of_incidents*year1) incidents_year1
from
(--UNION ALL with empty dataset to alias columns extracted
select 0 number_of_incidents, 0 year1 where false --returns no rows because of false condition
union all
select t.number_of_incidents, `^${year_param}$` from test_t t
)s;
Result:
OK
incidents_year1
105
Time taken: 38.003 seconds, Fetched: 1 row(s)
The first query in the UNION ALL does not affect the data because it returns no rows, but its column names become the column names of the whole UNION ALL dataset and can be used in the outer query. If you find a better workaround for aliasing columns selected by regexp, please add your solution as well.
Update:
There is no need for regular expressions if you can pass the full column name as a parameter. Hive substitutes variables as is (it does not evaluate them) before query execution. Use a regexp only if you cannot pass the full column name for some reason and, as in the original query, part of the name has to be combined with a pattern. See this test:
--parameter, can be passed from outside the script from command line
set hivevar:year_param=year_2018;
select sum(number_of_incidents*${year_param}) incidents_year1 from test_t t;
Result:
OK
incidents_year1
105
Time taken: 63.339 seconds, Fetched: 1 row(s)
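Because the substitution is purely textual, it can be illustrated with Python's string.Template, which happens to use the same ${name} syntax. This is a sketch, not Hive: the ROWS list just copies the two test rows inserted above, and the function names are hypothetical.

```python
from string import Template

# Query text containing a variable reference, as in the Hive script.
QUERY = Template(
    "select sum(number_of_incidents*${year_param}) incidents_year1 from test_t t"
)


def substitute(year_param):
    """Textually replace ${year_param}, as variable substitution does before parsing."""
    return QUERY.substitute(year_param=year_param)


# The two rows inserted into test_t in the answer.
ROWS = [
    {"number_of_incidents": 15, "year_2016": 12, "year_2017": 5, "year_2018": 1},
    {"number_of_incidents": 5, "year_2016": 10, "year_2017": 6, "year_2018": 18},
]


def incidents_for(column):
    """Compute sum(number_of_incidents * <column>) over the sample rows."""
    return sum(row["number_of_incidents"] * row[column] for row in ROWS)


print(substitute("year_2018"))
print(incidents_for("year_2018"))  # 15*1 + 5*18
```

The arithmetic reproduces the 105 shown in the results above for year_2018.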

Google Big Query SQL to extract numeric ID from string

How do I write a SQL Query in Google Big Query to extract numeric ID from a string like these:
Example 1:
Column Value: "http://www.google.com/abc/eeq/entity/32132"
Desired Extraction: 32132
Example 2:
Column Value: "http://www.google.com/abc/eeq/entity/32132/ABC/2138"
Desired Extraction: 32132
Example 3:
Column Value: "http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132"
Desired Extraction: 32132

Below is an example for BigQuery Standard SQL:
#standardSQL
WITH `project.dataset.table` AS (
SELECT "http://www.google.com/abc/eeq/entity/32132" url UNION ALL
SELECT "http://www.google.com/abc/eeq/entity/32132/ABC/2138" UNION ALL
SELECT "http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132"
)
SELECT url, REGEXP_EXTRACT(url, r'\d+') extracted_id
FROM `project.dataset.table`
with output:
| Row | url | extracted_id |
|-----|-----|--------------|
| 1 | http://www.google.com/abc/eeq/entity/32132 | 32132 |
| 2 | http://www.google.com/abc/eeq/entity/32132/ABC/2138 | 32132 |
| 3 | http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132 | 32132 |

You can use regexp_extract(). To get the first series of digits in the string:
select regexp_extract(col, '[0-9]+')
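Both answers rely on the regex returning the first run of digits in the string. A Python sketch of that behavior is below, with re.search standing in for BigQuery's REGEXP_EXTRACT. The second function, anchored on the "/entity/" path segment, is a hypothetical variant for URLs whose host or earlier path segments might contain digits; it is not part of either answer.

```python
import re


def extract_id(url):
    """First run of digits anywhere in the string, or None."""
    match = re.search(r"\d+", url)
    return match.group(0) if match else None


def extract_entity_id(url):
    """Hypothetical variant: digits that directly follow '/entity/'."""
    match = re.search(r"/entity/(\d+)", url)
    return match.group(1) if match else None


urls = [
    "http://www.google.com/abc/eeq/entity/32132",
    "http://www.google.com/abc/eeq/entity/32132/ABC/2138",
    "http://www.google.com/abc/eeq/entity/32132http://www.google.com/abc/eeq/entity/32132",
]

for url in urls:
    print(extract_id(url), extract_entity_id(url))
```

On the three example URLs both functions return the same ID. They differ only when digits appear before the entity segment.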

Hive/SQL query to extract value of the keyword "swid" from a table struct field "Details"

I need a Hive/SQL query to extract the value of the keyword "swid" from a table struct field "Details".
column - "Details"
Value - "id":123;"name":"Alex";"depID":100;"swid":5456213
Desired Output:
swid
5456213

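Using the str_to_map function (the semicolons in the test string are escaped because the Hive CLI treats an unescaped semicolon as a statement terminator):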
with test_data as (select '"id":123\\;"name":"Alex"\\;"depID":100\\;"swid":5456213' as str)
select str, str_to_map(regexp_replace(str,'\\"',''),'\\;',':')['swid'] as swid from test_data
;
Result:
OK
str swid
"id":123;"name":"Alex";"depID":100;"swid":5456213 5456213
Time taken: 0.971 seconds, Fetched: 1 row(s)
Another solution is to convert the string to valid JSON (replace the semicolons with commas and wrap the result in curly braces), then use get_json_object to extract the element:
with test_data as (select '"id":123\\;"name":"Alex"\\;"depID":100\\;"swid":5456213' as str)
select str, get_json_object(concat('{',regexp_replace(str,'\\;',','),'}'),'$.swid') as swid from test_data;
OK
str swid
"id":123;"name":"Alex";"depID":100;"swid":5456213 5456213
Time taken: 6.54 seconds, Fetched: 1 row(s)
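Using regexp_extract (capture group 1 holds the digits after "swid":):
with test_data as (select '"id":123\\;"name":"Alex"\\;"depID":100\\;"swid":5456213' as str)
select str, regexp_extract(str,'\\"swid\\":(\\d+)',1) as swid from test_data;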
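The three approaches above can be compared side by side in a short Python sketch. It mirrors the logic only: a dict stands in for str_to_map, json.loads for get_json_object, and re.search for regexp_extract. All function names are hypothetical, and the JSON variant converts the number back to a string so that every function returns the same type.

```python
import json
import re

DETAILS = '"id":123;"name":"Alex";"depID":100;"swid":5456213'


def swid_via_map(details):
    """Strip quotes, split into key:value pairs on ';' and ':', look up 'swid'."""
    pairs = (item.split(":", 1) for item in details.replace('"', "").split(";"))
    return dict(pairs).get("swid")


def swid_via_json(details):
    """Turn the string into a JSON object and read the 'swid' key."""
    obj = json.loads("{" + details.replace(";", ",") + "}")
    value = obj.get("swid")
    return None if value is None else str(value)


def swid_via_regex(details):
    """Capture the digits that follow the quoted key."""
    match = re.search(r'"swid":(\d+)', details)
    return match.group(1) if match else None


print(swid_via_map(DETAILS), swid_via_json(DETAILS), swid_via_regex(DETAILS))
```

The map and JSON variants assume that no value contains a semicolon, and the map variant also assumes that no value contains a quote that must be kept. The regex variant only assumes that the value of swid is a run of digits.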

We Keep Coding
