Select a string within a string within ATHENA - sql

I have a table in AWS Athena that I need to clean up for production, but I am having difficulty extracting only a specific portion of a string.
EXAMPLE:
Column A
{"display_value":"TECH_FinOps_SERVICE","link":" https://sdfs.saff-now.com/api/now/v2/table/sys_user_group/8fc10b99dbeedf12321317e15b9619b2"}
Basically, I would like to extract just TECH_FinOps_SERVICE from the string in Column A.

Your string looks like JSON, so you can try using the JSON functions:
-- sample data
WITH dataset(column_a) AS (
values ('{"display_value":"TECH_FinOps_SERVICE","link":" https://sdfs.saff-now.com/api/now/v2/table/sys_user_group/8fc10b99dbeedf12321317e15b9619b2"}')
)
-- query
select json_extract_scalar(column_a, '$.display_value') display_value
from dataset;
Output:
display_value
---------------------
TECH_FinOps_SERVICE
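For spot-checking a few exported rows outside Athena, the same lookup can be sketched in Python. This is only an illustration of what the path `$.display_value` selects from the sample row. The helper name `display_value` is hypothetical, and returning None for unparseable input is an assumption of this sketch, not a statement about how Athena treats malformed rows.

```python
import json

# Sample value copied from the question (note the leading space inside "link").
SAMPLE = (
    '{"display_value":"TECH_FinOps_SERVICE",'
    '"link":" https://sdfs.saff-now.com/api/now/v2/table/sys_user_group/'
    '8fc10b99dbeedf12321317e15b9619b2"}'
)


def display_value(column_a):
    """Return the top-level "display_value" field, or None if it cannot be read."""
    try:
        parsed = json.loads(column_a)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON (assumed handling for this sketch).
        return None
    if not isinstance(parsed, dict):
        return None
    return parsed.get("display_value")


if __name__ == "__main__":
    print(display_value(SAMPLE))  # TECH_FinOps_SERVICE
```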

Related

Get the filename from filepath column in Hive

I have a table containing the FILE_PATH column like below
FILE_PATH
\root\2010\2010-01\1234.zip
\root\2010\2010-02\2345.zip
\root\2010\2010-03\3456.zip
How can I extract the filename from the FILE_PATH column using a SELECT query?
I used the query below to get the output.
select file_path,substr(file_path,-1*(locate('\\',reverse(file_path),1)-1)) from TABLE_NAME limit 2
You can use regexp_extract to get everything after the last backslash:
select regexp_extract(file_path, '\\\\([^\\\\]*)$',1) from TABLE_NAME
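Four backslashes are used to represent a single backslash because the backslash is an escape character in both the SQL string literal and the regular expression, so it has to be escaped twice.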
Using split and size:
select split(file_path, '\\\\')[size(split(file_path, '\\\\'))-1] from TABLE_NAME
Using split and reverse:
select reverse(split(reverse(file_path), '\\\\')[0]) from TABLE_NAME
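The three approaches above (regex, split and take the last element, reverse and split) can be compared on the sample paths with a small Python sketch. It mirrors the logic only and is not Hive code; the function names are hypothetical. Python raw strings need just one level of escaping, so the regex uses two backslashes where the Hive literal uses four.

```python
import re

# Matches the last backslash and captures everything after it.
LAST_SEGMENT = re.compile(r"\\([^\\]*)$")


def file_name_regex(file_path):
    """Regex variant: text after the final backslash, or None if there is none."""
    match = LAST_SEGMENT.search(file_path)
    return match.group(1) if match else None


def file_name_split(file_path):
    """Split variant: last element of the backslash-separated parts."""
    return file_path.split("\\")[-1]


def file_name_reverse(file_path):
    """Reverse variant: first part of the reversed string, reversed back."""
    return file_path[::-1].split("\\")[0][::-1]


if __name__ == "__main__":
    for path in (r"\root\2010\2010-01\1234.zip", r"\root\2010\2010-02\2345.zip"):
        print(file_name_regex(path), file_name_split(path), file_name_reverse(path))
```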

AWS Athena custom data format?

I'd like to query my app logs on S3 with AWS Athena but I'm having trouble creating the table/specifying the data format.
This is how the log lines look:
2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}
which is a timestamp followed by space and the JSON line I want to query.
Is there a way to query logs like this? I see CSV, TSV, JSON, Apache Web Logs and Text File with Custom Delimiters data formats are supported but because of the timestamp I can't simply use JSON.
Define a table with a single column:
CREATE EXTERNAL TABLE your_table(
line STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/path/mylogs/';
You can extract the timestamp and the JSON part using a regexp, then parse the JSON separately:
select ts,
json_extract(json_col, '$.reqid') AS reqid
...
from
(
select regexp_extract(line, '(.*?) +',1) as ts,
regexp_extract(line, '(.*?) +(.*)',2) as json_col
from your_table
)s
Alternatively, you can define a RegexSerDe table with two columns. The SerDe splits each line into the two columns, so all you need to do is parse JSON_COL:
CREATE EXTERNAL TABLE your_table (
ts STRING,
json_col STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(.*?) +(.*)$"
)
LOCATION 's3://mybucket/path/mylogs/';
SELECT ts, json_extract(json_col, '$.reqid') AS reqid ...
FROM your_table
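Before creating the table, it can help to check the regex against a sample log line locally. The sketch below applies the same pattern as `input.regex` using Python's `re` module and then parses the JSON part. The function name `parse_log_line` is hypothetical, and Python's regex engine is not the Java engine the SerDe uses, although this simple pattern should behave the same in both.

```python
import json
import re

# Same pattern as "input.regex": lazy first group up to the first run of spaces,
# then the rest of the line.
LINE_PATTERN = re.compile(r"^(.*?) +(.*)$")

SAMPLE_LINE = (
    '2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17",'
    '"cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}'
)


def parse_log_line(line):
    """Return (ts, fields) for a log line, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    ts, json_col = match.groups()
    return ts, json.loads(json_col)


if __name__ == "__main__":
    ts, fields = parse_log_line(SAMPLE_LINE)
    print(ts, fields["reqid"], fields["level"])
```

Spaces inside the JSON values (as in "cached value found") are safe because the first group is lazy and stops at the first space.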

Databricks - String manipulation via sql command

I have a table column in Databricks from which I need to get whatever appears between the 15th and 16th occurrence of the character #, as in the following example:
Column
1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#
Results
Z.POIUS.LKJS_20200103
How can I do this?
select reverse(substring_index(reverse(substring_index('1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#', '#', 16)),'#', 1))
You can just split the string and take the element at index 15 (array indexes are zero-based, so this is the text between the 15th and 16th #), e.g. something like this:
%sql
SELECT *,
regexp_extract( yourCol, '(?:[^#]*(#)){15}(.[^#]+)', 2 ) xregex,
split( yourCol, '#' )[15] AS xsplit
FROM tmp
I was also experimenting with regex, which may be appropriate for some cases too.
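To confirm which piece index 15 actually selects, the sample value can be split locally. This Python sketch mirrors the split and regex expressions above; it is not Spark SQL, and the function names are hypothetical.

```python
import re

SAMPLE = "1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#"

# Same pattern as the regexp_extract call: skip 15 '#'-terminated pieces,
# then capture the following run of non-'#' characters.
AFTER_15TH_HASH = re.compile(r"(?:[^#]*(#)){15}(.[^#]+)")


def field_by_split(value, index=15):
    """Zero-based piece after splitting on '#', or None if there are too few pieces."""
    parts = value.split("#")
    return parts[index] if index < len(parts) else None


def field_by_regex(value):
    """Second capture group of the pattern, or None if the pattern does not match."""
    match = AFTER_15TH_HASH.search(value)
    return match.group(2) if match else None


if __name__ == "__main__":
    print(field_by_split(SAMPLE))  # Z.POIUS.LKJS_20200103
    print(field_by_regex(SAMPLE))  # Z.POIUS.LKJS_20200103
```

Empty pieces (the `##` runs) still count as elements when splitting, which is why index 15 lines up with the 15th `#`. The regex variant needs at least two characters in the target field because of the leading `.` in the second group.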

Single hive query to remove certain text in data

I have column data like this in two formats:
1)"/abc/testapp/v1?FirstName=username&Lastname=test123"
2)"/abc/testapp/v1?FirstName=username"
I want to retrieve the output as "/abc/testapp/v1?FirstName=username" and strip out everything from "&Lastname" to the end of the string. The idea is to remove Lastname together with its value.
If the data doesn't contain "&Lastname", as in the second format, the query should still work.
The value for Lastname shown in the example is "test123", but in general it will be dynamic.
I started with regexp_replace, but I am only able to remove "&Lastname", not its value:
select regexp_replace("/abc/testapp/v1?FirstName=username&Lastname=test123&type=en_US","&Lastname","");
Can someone please help me achieve both of these with a single Hive query?
Use the split function:
with your_data as (--Use your table instead of this example
select stack (2,
"/abc/testapp/v1?FirstName=username&Lastname=test123",
"/abc/testapp/v1?FirstName=username"
) as str
)
select split(str,'&')[0] from your_data;
Result:
_c0
/abc/testapp/v1?FirstName=username
/abc/testapp/v1?FirstName=username
Or use '&Lastname' pattern for split:
select split(str,'&Lastname')[0] from your_data;
This keeps any other & parameters that come before &Lastname and drops everything from &Lastname onward.
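The difference between cutting at `&Lastname` and removing only the Lastname parameter shows up when other parameters follow it, as in the `&type=en_US` example from the question. The Python sketch below mirrors both behaviours. The function names are hypothetical, and the regex variant is an alternative that does not appear in the answers; in Hive it would presumably correspond to a regexp_replace call with the pattern `&Lastname=[^&]*`, which is an assumption to verify.

```python
import re


def before_lastname(url):
    """Mirror of split(str, '&Lastname')[0]: keep everything before '&Lastname'."""
    return url.split("&Lastname")[0]


def drop_lastname_param(url):
    """Remove only '&Lastname=<value>' and keep any parameters that follow it."""
    return re.sub(r"&Lastname=[^&]*", "", url)


if __name__ == "__main__":
    urls = [
        "/abc/testapp/v1?FirstName=username&Lastname=test123",
        "/abc/testapp/v1?FirstName=username",
        "/abc/testapp/v1?FirstName=username&Lastname=test123&type=en_US",
    ]
    for url in urls:
        print(before_lastname(url), drop_lastname_param(url))
```

For the first two inputs both functions return the same string. For the third, the split variant drops `&type=en_US` as well, while the regex variant keeps it.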
Using split works for both cases, with or without Lastname. In Hive you don't need a table to test this; you can execute the function directly in a select:
select
split("/abc/testapp/v1?FirstName=username&Lastname=test123",'&')[0];
select
split("/abc/testapp/v1?FirstName=username",'&')[0];
Result:
_c0
/abc/testapp/v1?FirstName=username
You can also combine them into a single query:
select
split("/abc/testapp/v1?FirstName=username&Lastname=test123",'&')[0],
split("/abc/testapp/v1?FirstName=username",'&')[0];
Result:
_c0 _c1
/abc/testapp/v1?FirstName=username /abc/testapp/v1?FirstName=username

How to extract postgres hstore in string type on BigQuery

I downloaded data from Postgres that contains an hstore column and uploaded it to BigQuery as a STRING. The column looks like this:
"bar"=>"12356","website_url"=>"http://www.google.com","baz"=>"1722.0"
How can I get the website_url field (http://www.google.com) with a BigQuery query?
You can use REGEXP_EXTRACT(str, r'"website_url"=>"(.*?)"') as in the example below. The pattern ends at the closing quote, so it also matches when website_url is the last key in the string.
#standardSQL
WITH `project.dataset.table` AS (
SELECT '"bar"=>"12356","website_url"=>"http://www.google.com","baz"=>"1722.0"' str
)
SELECT
REGEXP_EXTRACT(str, r'"website_url"=>"(.*?)"') url
FROM `project.dataset.table`
with result
Row url
1 http://www.google.com
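The extraction can be checked locally with Python's `re` module, using a pattern that captures the quoted value following the key. The sketch also shows a generic way to turn the whole hstore text into a dictionary. Both helpers are hypothetical names, and both assume that keys and values never contain escaped double quotes, which real hstore output can. BigQuery uses RE2 rather than Python's engine, but these simple patterns should behave the same in both.

```python
import re

SAMPLE = '"bar"=>"12356","website_url"=>"http://www.google.com","baz"=>"1722.0"'

# One "key"=>"value" pair; assumes no escaped quotes inside keys or values.
PAIR = re.compile(r'"([^"]*)"=>"([^"]*)"')


def website_url(hstore_text):
    """Value of the website_url key, or None if the key is absent."""
    match = re.search(r'"website_url"=>"(.*?)"', hstore_text)
    return match.group(1) if match else None


def hstore_to_dict(hstore_text):
    """All key/value pairs of the hstore text as a dict of strings."""
    return dict(PAIR.findall(hstore_text))


if __name__ == "__main__":
    print(website_url(SAMPLE))  # http://www.google.com
    print(hstore_to_dict(SAMPLE)["baz"])  # 1722.0
```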
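You can use the REGEXP_EXTRACT function to extract the relevant string from the field and capture it as a new field. For example, to pull out just the host name:
REGEXP_EXTRACT(MYFIELD, r'www\.[^.]+\.com') AS website_url
When used on your example, this will return:
www.google.com
The pattern has no capturing group, so the whole match is returned; with a group such as ([^.]+) only the captured part (google) would come back.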