Folks, I'm trying to extract the value of 'status' from the string below (column name: people) in Hive. The problem is that the column is neither valid JSON nor stored as an array.
I tried to make it look like JSON by replacing '=' with ':', which didn't help.
[{name=abc, org=true, self=true, status=accepted, email=abc@gmail.com}, {name=cab abc, org=false, self=false, status=needsAction, email=cab@google.com}]
Below is the query I used:
SELECT
str.name,
str.org,
str.status
FROM table
LATERAL VIEW EXPLODE (TRANSLATE(people,'=',':')) exploded as str;
but I'm getting the error below:
FAILED: UDFArgumentException explode() takes an array or a map as a parameter
I need output something like this:
name    | org   | status
--------+-------+------------
abc     | true  | accepted
cab abc | false | needsAction
Note: The table already exists, the column's datatype is string, and I can't change the table schema.
A solution for Hive follows. It can possibly be optimized; read the comments in the code:
with your_table as ( -- your data example; select from your table instead
select "[{name=abc, org=true, self=true, status=accepted, email=abc@gmail.com}, {name=cab abc, org=false, self=false, status=needsAction, email=cab@google.com}]" str
)
select -- get map values
       m['org']    as org,
       m['name']   as name,
       m['self']   as self,
       m['status'] as status,
       m['email']  as email
from
( -- remove spaces after commas, then convert to map
  select str_to_map(regexp_replace(a.s, ', +', ','), ',', '=') m -- map
    from your_table t -- replace with your table
         lateral view explode(split(regexp_replace(str, '\\[|\\{|]', ''), '}, *')) a as s -- remove extra characters '[', '{' and ']', split on '}, ' and explode
) s;
Result:
OK
true    abc       true    accepted      abc@gmail.com
false   cab abc   false   needsAction   cab@google.com
Time taken: 1.001 seconds, Fetched: 2 row(s)
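If only status is needed, the same approach shrinks to a single projection. A minimal sketch, reusing the your_table CTE from above:

select str_to_map(regexp_replace(a.s, ', +', ','), ',', '=')['status'] as status
  from your_table t
       lateral view explode(split(regexp_replace(str, '\\[|\\{|]', ''), '}, *')) a as s;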
Related
In Vertica DB we have an attribute column whose values are either comma-separated or enclosed in quotes (both double and single occur). When we run an S3 export query on Vertica we get the CSV file, but when we validate it through an online CSV validator or a formatted S3 Select query we get an error.
SELECT S3EXPORT(* USING PARAMETERS url='xxxxxxxxxxxxxxxxxxxx.csv', delimiter=',', enclosed_by='\"', prepend_hash=false, header=true, chunksize='10485760'....
Any suggestions on how to resolve this issue?
PS: Manually reading every row and checking the columns is not an option.
Example attributes:
select uid, cid, att1 from table_name where uid in (16, 17, 15);
uid | cid   | att1
----+-------+----------------------
 16 | 78940 | yel,k
 17 | 78940 | master#$;#
 15 | 78940 | "hello , how are you"
S3EXPORT() is deprecated as of Version 11; we are currently at Version 12.
Now, you would export like so:
EXPORT TO DELIMITED(
directory='s3://mybucket/mydir'
, filename='indata'
, addHeader='true'
, delimiter=','
, enclosedBy='"'
) OVER(PARTITION BEST) AS
SELECT * FROM indata;
With your three lines, this would generate the below:
dbadmin@gessnerm-HP-ZBook-15-G3:~$ cat /tmp/export/indata.csv
uid,cid,att1
15,78940,"\"hello \, how are you\""
16,78940,"yel\,k"
17,78940,"master#$;#"
Do you need a different format?
Then try this ...
EXPORT TO DELIMITED(
directory='/tmp/csv'
, filename='indata'
, addHeader='true'
, delimiter=','
, enclosedBy=''
) OVER(PARTITION BEST) AS
SELECT
uid
, cid
, QUOTE_IDENT(att1) AS att1
FROM indata;
... to get this:
dbadmin@gessnerm-HP-ZBook-15-G3:~$ cat /tmp/csv/indata.csv
uid,cid,att1
15,78940,"""hello \, how are you"""
16,78940,"yel\,k"
17,78940,"master#$;#"
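A note on the design choice: QUOTE_IDENT() returns its argument quoted as an SQL identifier, wrapping it in double quotes when necessary and doubling any embedded double quotes, which is what produces the tripled quotes above. A minimal sketch of its effect:

SELECT QUOTE_IDENT('yel,k')                 AS q1  -- "yel,k"
     , QUOTE_IDENT('"hello , how are you"') AS q2; -- """hello , how are you"""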
I have the below JSON, from which I need to fetch the value of issuedIdentValue where issuedIdentType = PANCARD.
{
"issuedIdent": [
{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},
{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"},
{"issuedIdentType":"PANCARD","issuedIdentValue":"078-01-8877"}
]
}
I cannot hard-code the index value [2] in my query below, as the order of these records can change, so I want to get rid of any hardcoded index.
select json_value(
'{"issuedIdent": [{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"}, {"issuedIdentType":"PANCARDSctyNb","issuedIdentValue":"078-01-8877"}]}',
'$.issuedIdent[2].issuedIdentValue'
) as output
from d1entzendev.ExternalEventLog
where
eventname = 'CustomerDetailsInqSVC'
and applname = 'digitalBANKING'
and requid = '4fe1fa1b-abd4-47cf-834b-858332c31618';
What changes do I need to apply in the json_value function to achieve the expected result?
In Oracle 12c or higher, you can use JSON_TABLE() for this:
select value
from json_table(
'{"issuedIdent": [{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"}, {"issuedIdentType":"PANCARD","issuedIdentValue":"078-01-8877"}]}',
'$.issuedIdent[*]' columns
type varchar(50) path '$.issuedIdentType',
value varchar(50) path '$.issuedIdentValue'
) t
where type = 'PANCARD'
This returns:
| VALUE |
| :---------- |
| 078-01-8877 |
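To run this against the actual table instead of a literal, pass the JSON column to json_table(). A sketch, assuming (hypothetically) that the document is stored in a column named payload of d1entzendev.ExternalEventLog:

select t.value
from d1entzendev.ExternalEventLog e,
     json_table(
       e.payload,  -- 'payload' is an assumed column name
       '$.issuedIdent[*]' columns
         type  varchar(50) path '$.issuedIdentType',
         value varchar(50) path '$.issuedIdentValue'
     ) t
where e.eventname = 'CustomerDetailsInqSVC'
  and e.applname = 'digitalBANKING'
  and e.requid = '4fe1fa1b-abd4-47cf-834b-858332c31618'
  and t.type = 'PANCARD';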
I have this string (character varying) as a row value in PostgreSQL:
{'img_0': 'https://random.com/xxxxxx.jpg', 'img_1': 'https://random.com/yyyyyy.jpg', 'img_2': 'https://random.com/zzzzzz.jpg'}
I am trying to json_each_text() it but can't figure out how.
I have tried to_jsonb() on it (which works) and then jsonb_each(), but I get this error:
ERROR: cannot call jsonb_each on a non-object
My query:
WITH
test AS (
SELECT to_jsonb(value) as value FROM attribute_value WHERE id = 43918
)
SELECT jsonb_each(value) FROM test
Your text value is not valid JSON. JSON requires double quotes (") to delimit strings.
This will work by doctoring your text, provided that your data is consistently 'wrong' in the same way:
with t (sometext) as (
values ($${'img_0': 'https://random.com/xxxxxx.jpg', 'img_1': 'https://random.com/yyyyyy.jpg', 'img_2': 'https://random.com/zzzzzz.jpg'}$$)
)
select jsonb_each_text(replace(sometext, '''', '"')::jsonb)
from t;
jsonb_each_text
---------------------------------------
(img_0,https://random.com/xxxxxx.jpg)
(img_1,https://random.com/yyyyyy.jpg)
(img_2,https://random.com/zzzzzz.jpg)
(3 rows)
To break this out into columns:
with t (sometext) as (
values ($${'img_0': 'https://random.com/xxxxxx.jpg', 'img_1': 'https://random.com/yyyyyy.jpg', 'img_2': 'https://random.com/zzzzzz.jpg'}$$)
)
select j.*
from t
cross join lateral jsonb_each_text(replace(sometext, '''', '"')::jsonb) as j;
key | value
-------+-------------------------------
img_0 | https://random.com/xxxxxx.jpg
img_1 | https://random.com/yyyyyy.jpg
img_2 | https://random.com/zzzzzz.jpg
(3 rows)
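As to why the to_jsonb() attempt failed: applied to a text value, to_jsonb() produces a JSON string scalar rather than an object, and jsonb_each() / jsonb_each_text() only accept objects. A minimal sketch demonstrating the difference:

select jsonb_typeof(to_jsonb('{"a": 1}'::text));  -- 'string'
select jsonb_typeof('{"a": 1}'::jsonb);           -- 'object'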
I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
Inconsistent schema
An inconsistent schema is when values in some rows are of a different data type than the table declares. Let's assume that we have two JSON files:
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file within the WHERE clause:
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file         | name     | age
-----------------------+----------+----
s3://path/to/good.json | 2Patrick | 35
s3://path/to/good.json | 2Carlos  | 11
s3://path/to/good.json | 2Fabiana | 22
Of course, this is the best-case scenario, when we already know which files are bad. However, you can employ this approach to somewhat manually infer which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, then those files are good.
The obvious drawback of such an approach is the cost and time of executing queries, especially if it is done file by file.
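An alternative to excluding bad files is to relax the schema so that every file parses. A sketch under the assumption that you are free to re-declare age as a string and cast at query time (TRY_CAST returns NULL instead of failing):

CREATE EXTERNAL TABLE some_table (
  `name` string,
  `age` string  -- declared as string so "eleven" still parses
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://path/to/';

-- cast at query time; non-numeric values such as "eleven" become NULL
SELECT name, TRY_CAST(age AS integer) AS age
FROM some_table;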
Malformed records
In the eyes of AWS Athena, good records are those formatted as a single JSON object per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports the OpenX JSON SerDe library, which can be set to evaluate malformed records as NULL by specifying the following when you create the table:
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
Thus, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: you can use just a single col_1 IS NULL if you are 100% sure that the column doesn't contain empty fields other than in corrupted rows.
In general, malformed records are not that big of a deal, provided that 'ignore.malformed.json' = 'true'.
For example, if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
the following query will still succeed
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
source_s3_file | name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35 | North Street
2 s3://path/to/malformed.json| | |
3 s3://path/to/malformed.json| | |
4 s3://path/to/malformed.json| | |
5 s3://path/to/malformed.json| | |
6 s3://path/to/malformed.json| | |
7 s3://path/to/malformed.json| 2Fabiana | 22 | Main Street
With 'ignore.malformed.json' = 'false' (the default behaviour), however, exactly the same query will throw an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
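To quantify how many malformed (all-NULL) rows each file contributes, a grouped variant of the earlier NULL check can help. A sketch using the example columns above:

SELECT
  "$PATH" AS source_s3_file,
  COUNT(*) AS malformed_rows
FROM some_table
WHERE name IS NULL AND age IS NULL AND address IS NULL
GROUP BY "$PATH"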
<wbi:appData>
<wbi:content wbi:name="1st_status">
<wbi:value xsi:type="xsd:string">Success</wbi:value>
</wbi:content>
</wbi:appData>
</wbi:event>
I need to get the value "1st_status" (which is in the 2nd line) and "Success" (which is in the 3rd line) using SQL Developer.
This XML is in a table column of CLOB type.
You do it like this (i.e. table yourdata contains a CLOB column c):
SQL> select extractvalue(xmltype(c), '/wbi:event/wbi:appData/wbi:content/@wbi:name','xmlns:wbi="http://foo"') name,
2 extractvalue(xmltype(c), '/wbi:event/wbi:appData/wbi:content/wbi:value','xmlns:wbi="http://foo"') status
3 from yourdata
4 /
NAME STATUS
--------------- ---------------
1st_status Success
This assumes your wbi namespace is xmlns:wbi="http://foo".
If //content is a repeating tag, then you do this instead:
SQL> select extractvalue(value(t), '/wbi:content/@wbi:name','xmlns:wbi="http://foo"') name,
2 extractvalue(value(t), '/wbi:content/wbi:value','xmlns:wbi="http://foo"') status
3 from yourdata,
4 table(xmlsequence(extract(xmltype(c), '/wbi:event/wbi:appData/wbi:content', 'xmlns:wbi="http://foo"'))) t
5
SQL> /
NAME STATUS
--------------- ---------------
1st_status Success
2nd_status Failure
You can try:
SELECT extractvalue(xmltype(xmlcol), 'xpath') FROM yourtable
But you need to check whether xmltype() works for the CLOB data type.
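Note that extractvalue() is deprecated in recent Oracle releases. An XMLTABLE equivalent (a sketch, assuming the same yourdata table, CLOB column c, and namespace http://foo as above) would be:

SELECT x.name, x.status
  FROM yourdata d,
       XMLTABLE(
         XMLNAMESPACES('http://foo' AS "wbi"),
         '/wbi:event/wbi:appData/wbi:content'
         PASSING XMLTYPE(d.c)
         COLUMNS
           name   VARCHAR2(50) PATH '@wbi:name',
           status VARCHAR2(50) PATH 'wbi:value'
       ) x;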