Vertica S3 export query: escaping multiple characters

In our Vertica database we have an attribute column whose values can contain commas or be enclosed in quotation marks (double or single). When we run an S3 export query on Vertica we get a CSV file, but when we validate it with an online CSV validator or query it with S3 Select, we get an error.
SELECT S3EXPORT(* USING PARAMETERS url='xxxxxxxxxxxxxxxxxxxx.csv', delimiter=',', enclosed_by='\"', prepend_hash=false, header=true, chunksize='10485760'....
Any suggestions on how to resolve this issue?
PS: Manually reading every row and checking the columns is not an option.
Example attributes:
select uid, cid, att1 from table_name where uid in (16, 17, 15);
uid | cid | att1
-----+-------+---------------------
16 | 78940 | yel,k
17 | 78940 | master#$;#
15 | 78940 | "hello , how are you"

S3EXPORT() has been deprecated since Version 11. We are currently at Version 12.
Now you would export like this:
EXPORT TO DELIMITED(
directory='s3://mybucket/mydir'
, filename='indata'
, addHeader='true'
, delimiter=','
, enclosedBy='"'
) OVER(PARTITION BEST) AS
SELECT * FROM indata;
With your three lines, this would generate the below:
dbadmin@gessnerm-HP-ZBook-15-G3:~$ cat /tmp/export/indata.csv
uid,cid,att1
15,78940,"\"hello \, how are you\""
16,78940,"yel\,k"
17,78940,"master#$;#"
Do you need a different format?
Then, try this : ...
EXPORT TO DELIMITED(
directory='/tmp/csv'
, filename='indata'
, addHeader='true'
, delimiter=','
, enclosedBy=''
) OVER(PARTITION BEST) AS
SELECT
uid
, cid
, QUOTE_IDENT(att1) AS att1
FROM indata;
... to get this:
dbadmin@gessnerm-HP-ZBook-15-G3:~$ cat /tmp/csv/indata.csv
uid,cid,att1
15,78940,"""hello \, how are you"""
16,78940,"yel\,k"
17,78940,"master#$;#"
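To see how a consumer would read the two outputs above, here is a minimal sketch using Python's standard csv module. It assumes the reader is told that backslash is the escape character; the variable and function names are made up for illustration, and a validator that does not know about the backslash escape may well reject the same files.

```python
import csv
import io

# The two export outputs shown above, copied verbatim.
BACKSLASH_ESCAPED = r'''uid,cid,att1
15,78940,"\"hello \, how are you\""
16,78940,"yel\,k"
17,78940,"master#$;#"
'''

DOUBLED_QUOTES = r'''uid,cid,att1
15,78940,"""hello \, how are you"""
16,78940,"yel\,k"
17,78940,"master#$;#"
'''


def parse_export(text, doublequote):
    """Parse exported text, treating backslash as the escape character."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quotechar='"',
        escapechar="\\",
        doublequote=doublequote,
    )
    return list(reader)


# First output: embedded quotes are backslash-escaped, not doubled.
rows_a = parse_export(BACKSLASH_ESCAPED, doublequote=False)
# Second output: embedded quotes are doubled, commas are still backslash-escaped.
rows_b = parse_export(DOUBLED_QUOTES, doublequote=True)

for row in rows_a:
    print(row)
```

With these settings both variants parse to the same rows, so the choice between them mostly depends on which escaping convention the downstream tool expects.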

Related

How do I identify problematic documents in S3 when querying data in Athena?

I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
Inconsistent schema
A schema is inconsistent when the values of the same field have different data types in different rows. Let's assume that we have two JSON files:
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file in the WHERE clause:
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file | name | age
---------------------------------------
s3://path/to/good.json | 2Patrick | 35
s3://path/to/good.json | 2Carlos | 11
s3://path/to/good.json | 2Fabiana | 22
Of course, this is the best-case scenario, where we already know which files are bad. However, you can use the same approach to work out, somewhat manually, which files are good. You can also use LIKE or regexp_like to check multiple files at a time.
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, then those files are good.
The obvious drawback of this approach is the cost and time of running the queries, especially if it is done file by file.
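If the files (or a sample of them) can be downloaded, an offline check avoids the query cost. The following is only a sketch: the FILES dict is a hypothetical stand-in for downloaded objects, EXPECTED_TYPES mirrors the two-field example above rather than any real table, and Athena's own type coercion rules may be more lenient or stricter than this exact-type comparison.

```python
import json

# Hypothetical local copies of the two example files, keyed by their S3 path.
FILES = {
    "s3://path/to/bad.json": [
        '{"name":"1Patrick", "age":35}',
        '{"name":"1Carlos", "age":"eleven"}',
        '{"name":"1Fabiana", "age":22}',
    ],
    "s3://path/to/good.json": [
        '{"name":"2Patrick", "age":35}',
        '{"name":"2Carlos", "age":11}',
        '{"name":"2Fabiana", "age":22}',
    ],
}

# Assumed schema for the example: name is a string, age is an integer.
EXPECTED_TYPES = {"name": str, "age": int}


def find_type_mismatches(files, expected_types):
    """Return (path, line_number, field, value) for each value of the wrong type."""
    problems = []
    for path, lines in files.items():
        for line_number, line in enumerate(lines, start=1):
            record = json.loads(line)
            for field, expected in expected_types.items():
                value = record.get(field)
                # Exact type comparison, so True/False are not accepted as int.
                if value is not None and type(value) is not expected:
                    problems.append((path, line_number, field, value))
    return problems


mismatches = find_type_mismatches(FILES, EXPECTED_TYPES)
for path, line_number, field, value in mismatches:
    print(f"{path}: line {line_number}, field {field!r} has value {value!r}")
```

This reports the file and line directly, which the "$PATH" queries can only narrow down by trial and error.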
Malformed records
In the eyes of AWS Athena, good records are those formatted as a single JSON object per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports the OpenX JSON SerDe library, which can be set to evaluate malformed records as NULL by specifying the following when you create the table:
-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
Thus, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: a single col_1 IS NULL check is enough if you are 100% sure that col_1 is never empty other than in corrupted rows.
In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'.
For example, if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
the following query will still succeed
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
source_s3_file | name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35 | North Street
2 s3://path/to/malformed.json| | |
3 s3://path/to/malformed.json| | |
4 s3://path/to/malformed.json| | |
5 s3://path/to/malformed.json| | |
6 s3://path/to/malformed.json| | |
7 s3://path/to/malformed.json| 2Fabiana | 22 | Main Street
With 'ignore.malformed.json' = 'false' (which is the default behaviour), exactly the same query will throw an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

Hive Explode and extract a value from a String

Folks, I'm trying to extract the value of 'status' from the string below (column name: people) in Hive. The problem is that the column is neither complete JSON nor stored as an array.
I tried to make it look like JSON by replacing '=' with ':', which didn't help.
[{name=abc, org=true, self=true, status=accepted, email=abc#gmail.com}, {name=cab abc, org=false, self=false, status=needsAction, email=cab#google.com}]
Below is the query I used:
SELECT
str.name,
str.org,
str.status
FROM table
LATERAL VIEW EXPLODE (TRANSLATE(people,'=',':')) exploded as str;
but I'm getting the error below:
FAILED: UDFArgumentException explode() takes an array or a map as a parameter
I need output something like this:
name | org | status
-------- ------- ------------
abc | true | accepted
cab abc | false | needsAction
Note: The table already exists, the column datatype is string, and I can't change the table schema.
Solution for Hive. It possibly can be optimized. Read comments in the code:
with your_table as ( --your data example, you select from your table instead
select "[{name=abc, org=true, self=true, status=accepted, email=abc#gmail.com}, {name=cab abc, org=false, self=false, status=needsAction, email=cab#google.com}]" str
)
select --get map values
m['org'] as org ,
m['name'] as name ,
m['self'] as self ,
m['status'] as status ,
m['email'] as email
from
(--remove spaces after commas, convert to map
select str_to_map(regexp_replace(a.s,', +',','),',','=') m --map
from your_table t --replace w your table
lateral view explode(split(regexp_replace(str,'\\[|\\{|\\}\\]',''),'}, *')) a as s --remove extra characters: '[' or '{' or the closing '}]', split and explode
)s;
Result:
OK
true abc true accepted abc#gmail.com
false cab abc false needsAction cab#google.com
Time taken: 1.001 seconds, Fetched: 2 row(s)
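The same three steps (strip the brackets, split into records, build a key/value map) can be checked outside Hive. This is a rough Python sketch of the idea, not a translation of the query; the function name is made up, and it assumes that values never contain a comma or a closing brace.

```python
import re

# The sample value from the question, copied verbatim.
PEOPLE = (
    "[{name=abc, org=true, self=true, status=accepted, email=abc#gmail.com}, "
    "{name=cab abc, org=false, self=false, status=needsAction, email=cab#google.com}]"
)


def parse_people(raw):
    """Parse the pseudo-JSON string into a list of dicts (one per person)."""
    # Step 1: drop the leading "[{" and the trailing "}]".
    body = re.sub(r"^\[\{|\}\]$", "", raw.strip())
    # Step 2: split on the "}, {" that separates two records.
    records = re.split(r"\},\s*\{", body)
    # Step 3: split each record into key=value pairs and build a dict.
    return [
        dict(pair.split("=", 1) for pair in re.split(r",\s*", record))
        for record in records
    ]


for person in parse_people(PEOPLE):
    print(person["name"], person["org"], person["status"], sep=" | ")
```

Splitting each pair on the first '=' only keeps values that themselves contain '=' intact, which str_to_map with a single-character delimiter may not do.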

How to get JSON value from varchar field

(Note: outdated Oracle version.)
I have a table for receipt data.
I want to get some data from the field EXT_ATTR, such as PAYMENT_RECEIPT_NO.
The field EXT_ATTR is a varchar(4000) that stores a JSON value:
SerialId | EXT_ATTR
1 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000001",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
2 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000055",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
How can I extract only the value from the varchar field?
SerialId | PAYMENT_RECEIPT_NO
1 | PS00000000000000001
2 | PS00000000000000055
Thank you very much.
To work with JSON documents you can use PL/JSON.
If you want to parse it without JSON tools, then you can use the substr and instr functions in Oracle.
Depending on what your string looks like, you have to adjust the string positions.
create table tab (json varchar2(1000));
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000001","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000055","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
select substr(json,instr(json,': ',1,1)+3,instr(json,',',1,1)-instr(json,': ',1,1)-4)
from tab;
| SUBSTR(JSON,INSTR(JSON,':',1,1)+3,INSTR(JSON,',',1,1)-INSTR(JSON,':',1,1)-4) |
| :--------------------------------------------------------------------------- |
| PS00000000000000001 |
| PS00000000000000055 |
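The position arithmetic above only works while PAYMENT_RECEIPT_NO is the first key. As an illustration of the same find-and-slice idea anchored on the key name instead, here is a small Python sketch. The function name and the shortened sample document are made up for this example, and it assumes every key is written as "KEY": "value" with exactly one space after the colon and no quotes inside the value.

```python
# Abbreviated version of the sample document from the INSERT statements above.
SAMPLE = (
    '{"PAYMENT_RECEIPT_NO": "PS00000000000000001",'
    '"IS_CORPOR": "1","LOCATIONID": "02140"}'
)


def extract_value(doc, key):
    """Return the string value stored under key, or None if it cannot be found."""
    marker = '"' + key + '": "'
    start = doc.find(marker)          # like INSTR: locate the key
    if start == -1:
        return None
    start += len(marker)              # skip past the key and the opening quote
    end = doc.find('"', start)        # the closing quote of the value
    if end == -1:
        return None
    return doc[start:end]             # like SUBSTR: slice out the value


print(extract_value(SAMPLE, "PAYMENT_RECEIPT_NO"))
print(extract_value(SAMPLE, "LOCATIONID"))
```

In Oracle the same idea would mean passing the quoted key name to instr rather than searching for the first ': ', so that reordering the keys does not change the result.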
JSON functions are available in Oracle Database 12c and later. For earlier releases, the APEX_JSON package (release 5.0+) has to be installed. Once the installation is complete, the following code can be used to convert the JSON to an XML data type through the APEX_JSON.TO_XMLTYPE() function in order to extract the desired values:
WITH t AS
(
SELECT SerialId, APEX_JSON.TO_XMLTYPE(EXT_ATTR) AS xml_data -- EXT_ATTR is the JSON column from the question
FROM tab
)
SELECT SerialId, Payment_Receipt_No
FROM t
CROSS JOIN
XMLTABLE('/json'
PASSING xml_data
COLUMNS
Payment_Receipt_No VARCHAR2(100) PATH 'PAYMENT_RECEIPT_NO'
)

creating external table from compressed (gz format) files without selecting all fields

I have gz files in a folder. I need only 3 columns from these files, but each line has over 100 of them. At the moment I create a view this way.
drop table MAK_CHARGE_RCR;
create external table MAK_CHARGE_RCR
(LINE string)
STORED as SEQUENCEFILE
LOCATION '/apps/hive/warehouse/mydb.db/file_rcr';
drop view VW_MAK_CHARGE_RCR;
create view VW_MAK_CHARGE_RCR as
Select LINE[57] as CREATE_DATE, LINE[64] as SUBS_KEY, LINE[63] as RC_TERM_NAME
from
(Select split(LINE, '\\|') as LINE
from MAK_CHARGE_RCR) a;
The view has the fields I need. Now I have to do the same, but without CTAS and I am not sure how to go about it. What can I do?
I was told the table must look like this
create external table MAK_CHARGE_RCR
(CREATE_DATE string, SUBS_KEY string, RC_TERM_NAME etc)
I could split the line like this
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\\|'
but then I'll need to list every column. I have another group of files with over 1000 columns, and I would have to list all of them. This seems a bit excessive, so I wondered if it is possible to do
create external table arstel.MAK_CHARGE_RCR
(split(LINE, '\\|')[57] string,
split(LINE, '\\|')[64] string
etc)
This doesn't work, obviously, but maybe there are workarounds?
RegexSerDe
For educational purposes
P.S.
I intend to create an enhanced version of the CSV SerDe that accepts an additional parameter with the positions of the requested columns.
Demo
bash
echo {a..c}{1..100} | xargs -n 100 | tr ' ' '|' | \
hdfs dfs -put - /user/hive/warehouse/mytable/data.txt
hive
create external table mytable
(
col58 string
,col64 string
,col65 string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "^(?:([^|]*)\\|){58}(?:([^|]*)\\|){6}([^|]*)\\|.*$")
stored as textfile
location '/user/hive/warehouse/mytable'
;
select * from mytable
;
+---------------+---------------+---------------+
| mytable.col58 | mytable.col64 | mytable.col65 |
+---------------+---------------+---------------+
| a58 | a64 | a65 |
| b58 | b64 | b65 |
| c58 | c64 | c65 |
+---------------+---------------+---------------+
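The regex relies on a detail that is easy to miss: when a capturing group sits inside a repeated group, only the last repetition is kept, so (?:([^|]*)\|){58} captures field 58. The sketch below reproduces the demo in Python as a quick way to check the pattern before creating the table. Python's re module is used here as a stand-in for the Java regex engine behind RegexSerDe, on the assumption that both treat repeated captures the same way for this pattern.

```python
import re

# Same pattern as input.regex above, with Hive's doubled backslashes reduced to one.
PATTERN = re.compile(r"^(?:([^|]*)\|){58}(?:([^|]*)\|){6}([^|]*)\|.*$")

# Rebuild the demo rows: a1|a2|...|a100, b1|...|b100, c1|...|c100.
lines = ["|".join(f"{letter}{n}" for n in range(1, 101)) for letter in "abc"]

# Each match yields three groups: fields 58, 64 and 65 of the line.
extracted = [PATTERN.match(line).groups() for line in lines]

for row in extracted:
    print(row)
```

To pick other columns, adjust the repetition counts: the first count is the position of the first wanted field, and each later count is the distance to the next wanted field.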

How to select a row from any hstore values?

I have a table contents in a PostgreSQL (9.5) database, which has a title column. The title column is an hstore, because the title is translated into different languages. For example:
example=# SELECT * FROM contents;
id | title | content | created_at | updated_at
----+---------------------------------------------+------------------------------------------------+----------------------------+----------------------------
1 | "de"=>"Beispielseite", "en"=>"Example page" | "de"=>"Beispielinhalt", "en"=>"Example conten" | 2016-07-17 09:20:23.159248 | 2016-07-17 09:20:23.159248
(1 row)
My question is, how can I select the content whose title contains Example page?
SELECT * FROM contents WHERE title = 'Example page';
This query unfortunately doesn't work.
example=# SELECT * FROM contents WHERE title = 'Example page';
ERROR: Syntax error near 'p' at position 8
LINE 1: SELECT * FROM contents WHERE title = 'Example page';
The avals() function returns an array of all values in a hstore column. You can then match your value using any against that array:
select *
from contents
where 'Example page' = any(avals(title))
You can also use LIKE in the WHERE clause, after casting the hstore to text (note that this matches keys as well as values):
SELECT * FROM contents WHERE title::text LIKE '%Example page%';
Hope it helps you.