AWS Athena custom data format? - sql

I'd like to query my app logs on S3 with AWS Athena but I'm having trouble creating the table/specifying the data format.
This is how the log lines look:
2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}
which is a timestamp followed by a space and then the JSON I want to query.
Is there a way to query logs like this? I see the CSV, TSV, JSON, Apache Web Logs and Text File with Custom Delimiters data formats are supported, but because of the timestamp I can't simply use JSON.

Define a table with a single column:
CREATE EXTERNAL TABLE your_table(
line STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/path/mylogs/';
You can extract the timestamp and the JSON with regexp_extract, then parse the JSON separately:
select ts,
       json_extract(json_col, '$.reqid') AS reqid
       ...
from
(
  select regexp_extract(line, '(.*?) +', 1) as ts,
         regexp_extract(line, '(.*?) +(.*)', 2) as json_col
  from your_table
) s
Alternatively, you can define a RegexSerDe table with two columns. The SerDe will do the parsing into the two columns, and all you need to do is parse json_col:
CREATE EXTERNAL TABLE your_table (
ts STRING,
json_col STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(.*?) +(.*)$"
)
LOCATION 's3://mybucket/path/mylogs/';
SELECT ts, json_extract(json_col, '$.reqid') AS reqid ...
FROM your_table
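If you also want the timestamp as a real timestamp type and plain string values out of the JSON, a minimal sketch against the RegexSerDe table above could look like this (the keys are taken from the sample log line; from_iso8601_timestamp and json_extract_scalar are built-in Presto/Athena functions):
-- sketch only: assumes the two-column RegexSerDe table defined above
SELECT from_iso8601_timestamp(ts)                 AS event_time,  -- '2020-12-09T18:08:48.789Z' -> timestamp
       json_extract_scalar(json_col, '$.reqid')   AS reqid,       -- returns VARCHAR instead of a JSON value
       json_extract_scalar(json_col, '$.level')   AS level,
       json_extract_scalar(json_col, '$.message') AS message
FROM your_table
WHERE json_extract_scalar(json_col, '$.level') = 'debug';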

Related

Select a string within a string within ATHENA

I have a table in AWS ATHENA that I need to clean up for production, but I'm having difficulties extracting only a specific portion of a string.
EXAMPLE:
Column A
{"display_value":"TECH_FinOps_SERVICE","link":" https://sdfs.saff-now.com/api/now/v2/table/sys_user_group/8fc10b99dbeedf12321317e15b9619b2"}
Basically I would like to extract just TECH_FinOps_SERVICE from the string in Column_A.
Your string looks like JSON, so you can try using the JSON functions:
-- sample data
WITH dataset(column_a) AS (
values ('{"display_value":"TECH_FinOps_SERVICE","link":" https://sdfs.saff-now.com/api/now/v2/table/sys_user_group/8fc10b99dbeedf12321317e15b9619b2"}')
)
-- query
select json_extract_scalar(column_a, '$.display_value') display_value
from dataset;
Output:
display_value
---------------------
TECH_FinOps_SERVICE
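Against the real table it is the same call; only the table name changes (the name below is just a placeholder):
select json_extract_scalar(column_a, '$.display_value') AS display_value
from my_table;  -- replace with your actual table name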

Newline characters in external table in HIVE

I have a lot (think terabytes) of data that is partitioned per day (a few gigabytes per file) in a TSV format; some fields, unfortunately, have \n chars.
I am trying to create an external table in a way that the newline chars in the data do not break the rows :( I tried
CREATE EXTERNAL TABLE test
( `column` int, `column1` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/home/';
I also tried
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 'field.delim' = '\t', 'line.delim'='\n')
and
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
In both cases I get no luck; newlines in fields cause the row to terminate :(
How can I get my TSV into Hive with newlines in columns?

read a csv file with comma as delimiter and escaping quotes in psql

I want to read a CSV file which is separated by commas (,) but want to ignore commas within double quotes (""). I want to store the result in a table.
Example:
abc,00.000.00.00,00:00:00:00:00:00,Sun Nov 01 00:00:00 EST 0000,Sun Nov 01 00:00:00 EST 0000,"Apple, Inc.",abcd-0000abc-a,abcd-abcd-a0000-00
Here I don't want to split on the comma inside "Apple, Inc.".
I know there is a CSV reader in Python and I could use it in PL/Python, but that's slow considering millions of such strings! I would like a pure psql method!
Here is an example of reading a CSV file with an External Table using the CSV format.
CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc1 text )
LOCATION ('gpfdist://etlhost-1:8081/*.txt',
'gpfdist://etlhost-2:8082/*.txt')
FORMAT 'CSV' ( DELIMITER ',' )
LOG ERRORS SEGMENT REJECT LIMIT 5;
This was taken from the Greenplum docs too.
http://gpdb.docs.pivotal.io/530/admin_guide/external/g-example-4-single-gpfdist-instance-with-error-logging.html
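For completeness, since the question asks for a pure psql approach: on plain PostgreSQL (rather than Greenplum) the built-in COPY command's CSV mode also keeps quoted commas such as "Apple, Inc." together. A minimal sketch, assuming a throwaway table shaped like the sample row and a made-up file path:
-- hypothetical target table matching the eight fields of the sample row
CREATE TABLE devices (
    name      text,
    ip        text,
    mac       text,
    seen_from text,
    seen_to   text,
    vendor    text,   -- "Apple, Inc." stays in one field because it is quoted
    asset_id  text,
    serial_no text
);

-- server-side load; use \copy from psql for a client-side file
COPY devices FROM '/path/to/data.csv' WITH (FORMAT csv);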

How to import csv file into hive table when it has delimiter value, say comma, as a field value?

I want to import a CSV file into a Hive table. The CSV file has a comma (,) within a field value. How can we escape it?
You can use the CSV SerDe based on the conditions below.
If the fields that contain commas are in quoted strings:
sam,1,"sam is adventurous, brave"
bob,2,"bob is affectionate, affable"
CREATE EXTERNAL TABLE csv_table(name STRING, userid BIGINT, comment STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION 'location_of_csv_file';
If the fields that contain commas are escaped as below:
sam,1,sam is adventurous\, brave
bob,2,bob is affectionate\, affable
CREATE EXTERNAL TABLE csv_table(name String, userid BIGINT, comment STRING)
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
LOCATION '/user/cloudera/input/csv';
In both cases, the output will be as below:
hive> select * from csv_table;
OK
sam 1 sam is adventurous, brave
bob 2 bob is affectionate, affable
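One caveat worth noting: OpenCSVSerde treats every column as a string, even when the DDL declares another type such as BIGINT, so you may need to cast in queries. A small example of that (assuming the csv_table above):
-- OpenCSVSerde hands every column back as a string,
-- so cast userid before numeric comparisons
SELECT name,
       CAST(userid AS BIGINT) AS userid,
       comment
FROM csv_table
WHERE CAST(userid AS BIGINT) > 1;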

How to use XMLType in a SQL*Loader control file?

I have a CSV file which contains a column of XML data. A sample record of the CSV looks like:
1,,,<capacidade><numfields>3</numfields><template>1</template><F1><name lang="pt">Apple</name></F1></capacidade>
I'd like to use SQL*Loader to import all the data into Oracle; I defined the ctl file as follows:
LOAD DATA
CHARACTERSET UTF8
INFILE '/home/db2inst1/result.csv'
CONTINUEIF NEXT(1:1) = '#'
INTO TABLE "TEST"."T_DATA_TEMP"
FIELDS TERMINATED BY','
( "ID"
, "SOURCE"
, "CATEGORY"
, "RAWDATA"
)
When running this, the error log shows that the RAWDATA column is treated as the CHARACTER data type. How can I define RAWDATA to be an XMLType in this case so that it can be correctly inserted into Oracle?
Try something like this:
Add a comma at the end of the XML to mark a delimiter (you can't use whitespace, as the XML may contain spaces in between),
then create your ctl file like below:
LOAD DATA
APPEND
INTO TABLE "TEST"."T_DATA_TEMP" fields OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
( ID terminated by "," ,
SOURCE terminated by ",",
CATEGORY terminated by ",",
RAWDATA char(4000) terminated by ","
)