How do I identify problematic documents in S3 when querying data in Athena? - amazon-s3

I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');

Inconsistent schema
An inconsistent schema means that values in some rows have a different data type than the one the column declares. Let's assume that we have two JSON files:
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file in the WHERE clause:
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file | name | age
---------------------------------------
s3://path/to/good.json | 2Patrick | 35
s3://path/to/good.json | 2Carlos | 11
s3://path/to/good.json | 2Fabiana | 22
Of course, this is the best case scenario when we know which files are bad. However, you can employ this approach to somewhat manually infer which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, then those files are good.
The obvious drawback of this approach is the cost and time of running the queries, especially if it is done file by file.
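If the files are small enough to download, the same file-by-file check can also be done outside Athena. The sketch below is not an Athena feature; it is a plain Python pass over JSON-lines text that reports which file and line holds a value that does not fit the declared column type. The `EXPECTED_TYPES` mapping, the `find_bad_records` helper, and the in-memory `files` dict (standing in for objects already fetched from S3) are assumptions made for illustration.

```python
import json

# Hypothetical schema: column name -> Python type the table declares.
EXPECTED_TYPES = {"name": str, "age": int}


def find_bad_records(files, expected_types=EXPECTED_TYPES):
    """Return (path, line_number, column, value) for every type mismatch.

    `files` maps a path to the text of a JSON-lines document; in practice
    the text would be downloaded from S3 first.
    """
    problems = []
    for path, text in files.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for column, expected in expected_types.items():
                value = record.get(column)
                if value is None:
                    continue  # missing or null values are not type errors
                # bool is a subclass of int in Python, so an int column
                # must reject True/False explicitly.
                is_bool_in_int = expected is int and isinstance(value, bool)
                if is_bool_in_int or not isinstance(value, expected):
                    problems.append((path, line_number, column, value))
    return problems


files = {
    "s3://path/to/bad.json": (
        '{"name":"1Patrick", "age":35}\n'
        '{"name":"1Carlos", "age":"eleven"}\n'
        '{"name":"1Fabiana", "age":22}\n'
    ),
    "s3://path/to/good.json": (
        '{"name":"2Patrick", "age":35}\n'
        '{"name":"2Carlos", "age":11}\n'
        '{"name":"2Fabiana", "age":22}\n'
    ),
}

for problem in find_bad_records(files):
    print(problem)
```

A float such as 32700.000000000004 in an int column is reported the same way, because a Python float is not an instance of `int`.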
Malformed records
In the eyes of AWS Athena, good records are those formatted as a single JSON object per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports the OpenX JSON SerDe library, which can be set to evaluate malformed records as NULL by specifying the following properties:
-- When you create the table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
Thus, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: you can use only a single col_1 IS NULL if you are 100% sure that the column contains no empty fields other than in corrupted rows.
In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'. For example, if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
the following query will still succeed
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
source_s3_file | name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35 | North Street
2 s3://path/to/malformed.json| | |
3 s3://path/to/malformed.json| | |
4 s3://path/to/malformed.json| | |
5 s3://path/to/malformed.json| | |
6 s3://path/to/malformed.json| | |
7 s3://path/to/malformed.json| 2Fabiana | 22 | Main Street
With 'ignore.malformed.json' = 'false' (which is the default behaviour), exactly the same query will throw an error:
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
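To see why a pretty-printed record turns into several empty rows, it can help to parse the same content line by line. This is only a rough Python analogy of a line-oriented reader, not the SerDe's actual implementation; the `parse_json_lines` helper is a hypothetical name introduced here.

```python
import json


def parse_json_lines(text):
    """Parse one JSON object per line; return None for lines that fail.

    Every physical line is parsed on its own, so a record that spans
    several lines produces one failed row per line.
    """
    rows = []
    for line in text.splitlines():
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            value = None
        # Only a JSON object counts as a valid row.
        rows.append(value if isinstance(value, dict) else None)
    return rows


malformed = """{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}"""

rows = parse_json_lines(malformed)
bad_line_numbers = [i for i, row in enumerate(rows, start=1) if row is None]
print(bad_line_numbers)  # [2, 3, 4, 5, 6]
```

The five failed lines correspond to the five empty rows in the result table above.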

Related

Change null to empty array in databricks SQL?

I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to see if the key is possible to use, json_get() gets a key from the column or returns a default) and want to do the following:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM
JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.
Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?
Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.
Example table that causes error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
Example table that does not cause error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
This question seems like it might hold the answer, but I was not able to get anything working from it.
You can filter the table before calling json_get and explode, so that you only explode when json_get returns a non-null value:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM (
SELECT *
FROM JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
)

HIVE-SQL_SERVER: HadoopExecutionException: Not enough columns in this line

I have a hive table with the following structure and data:
Table structure:
CREATE EXTERNAL TABLE IF NOT EXISTS db_crprcdtl.shcar_dtls (
ID string,
CSK string,
BRND string,
MKTCP string,
AMTCMP string,
AMTSP string,
RLBRND string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/on/hadoop/dir/'
-------------------------------------------------------------------------------
ID | CSK | BRND | MKTCP | AMTCMP
-------------------------------------------------------------------------------
782 flatn,grpl,mrtn hnd,mrc,nsn 34555,56566,66455 38900,59484,71450
1231 jikl,bngr su,mrc,frd 56566,32333,45000 59872,35673,48933
123 unsrvl tyt,frd,vlv 25000,34789,33443 29892,38922,36781
I am trying to push this data into SQL Server, but while doing so I get the following error message:
SQL Error [107090] [S0001]: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Not enough columns in this line.
What I tried:
There's an online article where the author has documented similar issues. I tried one of the suggestions from it ("Looked in Excel and found two columns that had carriage returns"), but that did not help either.
Any suggestion/help would be really appreciated. Thanks
If I understand your issue correctly, it seems that your comma-separated data is being split into several columns rather than one column on SQL Server, something like:
------------------------------
ID |CSK |BRND |MKTCP |AMTCMP
------------------------------
782 flatn grpl mrtn hnd mrc nsn 34555 56566 66455 38900 59484 71450
1231 jikl bngr su mrc frd 56566 32333 45000 59872 35673 48933
123 unsrvl tyt frd vlv 25000 34789 33443 29892 38922 36781
So, looking at Hive, there are only 5 columns, and I presume the same is true on SQL Server, since you haven't shared that schema. If that's the case, more than 5 values are being passed for each row while the schema defines only 5 columns, so the error is raised.
Refer to the Microsoft documentation for CREATE EXTERNAL FILE FORMAT and try to create a file format with FIELD_TERMINATOR = '\t', like:
CREATE EXTERNAL FILE FORMAT <name>
WITH (   
FORMAT_TYPE = DELIMITEDTEXT,   
FORMAT_OPTIONS (        
FIELD_TERMINATOR ='\t',
| STRING_DELIMITER = string_delimiter
| First_Row = integer -- ONLY AVAILABLE SQL DW
| DATE_FORMAT = datetime_format
| USE_TYPE_DEFAULT = { TRUE | FALSE }
| Encoding = {'UTF8' | 'UTF16'} )
);
Hope that helps to resolve to your issue :)
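Before changing the file format, it may be worth checking how many fields each line of the underlying text file really has. The following is a minimal Python sketch, not part of Hive or SQL Server; `find_column_mismatches` is a hypothetical helper, and the sample rows are the three from the question re-typed with tab separators (an assumption, since the original whitespace is not recoverable).

```python
def find_column_mismatches(text, expected_columns, delimiter="\t"):
    """Return (line_number, field_count) for lines with the wrong field count."""
    mismatches = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue  # ignore the empty string after a trailing newline
        # Drop a stray carriage return before counting fields.
        field_count = len(line.rstrip("\r").split(delimiter))
        if field_count != expected_columns:
            mismatches.append((line_number, field_count))
    return mismatches


sample = (
    "782\tflatn,grpl,mrtn\thnd,mrc,nsn\t34555,56566,66455\t38900,59484,71450\n"
    "1231\tjikl,bngr\tsu,mrc,frd\t56566,32333,45000\t59872,35673,48933\n"
    "123\tunsrvl\ttyt,frd,vlv\t25000,34789,33443\t29892,38922,36781\n"
)

# The table definition in the question declares 7 columns.
print(find_column_mismatches(sample, expected_columns=7))
```

With these sample rows every line has 5 fields, so all three are reported against a 7-column definition and none against a 5-column one.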

Hive Explode and extract a value from a String

Folks, I'm trying to extract the value of 'status' from the string below (column name: people) in Hive. The problem is that the column is neither valid JSON nor stored as an array.
I tried to make it look like JSON by replacing '=' with ':', which didn't help.
[{name=abc, org=true, self=true, status=accepted, email=abc#gmail.com}, {name=cab abc, org=false, self=false, status=needsAction, email=cab#google.com}]
Below is the query I used:
SELECT
str.name,
str.org,
str.status
FROM table
LATERAL VIEW EXPLODE (TRANSLATE(people,'=',':')) exploded as str;
but I'm getting below error:
FAILED: UDFArgumentException explode() takes an array or a map as a parameter
Need output something like this:
name | org | status
-------- ------- ------------
abc | true | accepted
cab abc | false | needsAction
Note: There is a table already, the datatype is string, and I
can't change the table schema.
Solution for Hive. It possibly can be optimized. Read comments in the code:
with your_table as ( --your data example, you select from your table instead
select "[{name=abc, org=true, self=true, status=accepted, email=abc#gmail.com}, {name=cab abc, org=false, self=false, status=needsAction, email=cab#google.com}]" str
)
select --get map values
m['org'] as org ,
m['name'] as name ,
m['self'] as self ,
m['status'] as status ,
m['email'] as email
from
(--remove spaces after commas, convert to map
select str_to_map(regexp_replace(a.s,', +',','),',','=') m --map
from your_table t --replace w your table
lateral view explode(split(regexp_replace(str,'\\[|\\{|]',''),'}, *')) a as s --remove extra characters: '[' or '{' or ']', split and explode
)s;
Result:
OK
true abc true accepted abc#gmail.com
false cab abc false needsAction cab#google.com
Time taken: 1.001 seconds, Fetched: 2 row(s)
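For comparison, the same split-then-map idea can be sketched outside Hive. This Python version is only an illustration of the parsing steps, not a replacement for the Hive query; `parse_people` is a hypothetical helper, and it assumes that values never contain commas, equals signs, or braces.

```python
import re


def parse_people(raw):
    """Turn '[{k=v, k=v}, {k=v}]' into a list of dicts of strings."""
    # Each record is the text between one '{' and the matching '}'.
    records = re.findall(r"\{([^{}]*)\}", raw)
    people = []
    for record in records:
        person = {}
        for item in record.split(","):
            if "=" not in item:
                continue  # skip empty or malformed fragments
            key, value = item.split("=", 1)
            person[key.strip()] = value.strip()
        people.append(person)
    return people


raw = (
    "[{name=abc, org=true, self=true, status=accepted, email=abc#gmail.com}, "
    "{name=cab abc, org=false, self=false, status=needsAction, email=cab#google.com}]"
)

for person in parse_people(raw):
    print(person["name"], person["org"], person["status"])
```

All values stay strings here, which matches what the map lookups return in the Hive query.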

How to get JSON value from varchar field

*outdated Oracle version
I have a table for receipt data.
I want to get some data from the field EXT_ATTR, such as PAYMENT_RECEIPT_NO.
The field "EXT_ATTR" is a varchar(4000) that stores a JSON value.
SerialId | EXT_ATTR
1 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000001",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
2 |
{
"PAYMENT_RECEIPT_NO": "PS00000000000000055",
"IS_CORPOR": "1",
"POSTCODE1": "51000",
"POSTCODE2": "51000",
"BILLADDR1PART1": "BILLADDR1PART1_DATA",
"BILLADDR1PART2": "BILLADDR1PART2_DATA",
"NEED_PRINT_WHT": "1",
"WHT_AMT": "0",
"TRXAMT": "2340600",
"LOCATIONID": "02140",
"PAYMENT_METHOD_NAME": "Cash",
"WITH_TAX": "1"
}
How can I extract only the value from the varchar field?
SerialId | PAYMENT_RECEIPT_NO
1 | PS00000000000000001
2 | PS00000000000000055
Thank you very much.
To work with JSON documents you can use PL/JSON.
If you want to parse it without JSON tools, then you can use the substr and instr functions in Oracle.
Depending on what your string looks like, you have to adjust the string positions.
create table tab (json varchar2(1000));
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000001","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
insert into tab values('{"PAYMENT_RECEIPT_NO": "PS00000000000000055","IS_CORPOR": "1","POSTCODE1": "51000","POSTCODE2": "51000","BILLADDR1PART1": "BILLADDR1PART1_DATA","BILLADDR1PART2": "BILLADDR1PART2_DATA","NEED_PRINT_WHT": "1","WHT_AMT": "0","TRXAMT": "2340600","LOCATIONID": "02140","PAYMENT_METHOD_NAME": "Cash","WITH_TAX": "1"}');
select substr(json,instr(json,': ',1,1)+3,instr(json,',',1,1)-instr(json,': ',1,1)-4)
from tab;
| SUBSTR(JSON,INSTR(JSON,':',1,1)+3,INSTR(JSON,',',1,1)-INSTR(JSON,':',1,1)-4) |
| :--------------------------------------------------------------------------- |
| PS00000000000000001 |
| PS00000000000000055 |
db<>fiddle here
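The position arithmetic in that query is easy to get wrong by one character, so here is the same calculation spelled out as a small Python sketch. It is only an illustration of the offsets, not Oracle code; `first_json_value` is a hypothetical helper, the sample documents are shortened versions of the rows above, and Python positions are 0-based while instr is 1-based.

```python
import json


def first_json_value(doc):
    """Return the first string value, using position arithmetic only."""
    colon = doc.find(": ")  # like instr(json, ': ', 1, 1), but 0-based
    comma = doc.find(",")   # like instr(json, ',', 1, 1), but 0-based
    # Skip the colon, the space and the opening quote (3 characters),
    # and stop just before the closing quote that precedes the comma.
    return doc[colon + 3:comma - 1]


docs = [
    '{"PAYMENT_RECEIPT_NO": "PS00000000000000001","IS_CORPOR": "1","POSTCODE1": "51000"}',
    '{"PAYMENT_RECEIPT_NO": "PS00000000000000055","IS_CORPOR": "1","POSTCODE1": "51000"}',
]

for doc in docs:
    value = first_json_value(doc)
    # A real JSON parser gives the same answer without relying on key order.
    assert value == json.loads(doc)["PAYMENT_RECEIPT_NO"]
    print(value)
```

The slice only works while PAYMENT_RECEIPT_NO is the first key and the spacing after the colon stays exactly as shown, which is the same limitation the substr/instr query has.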
JSON functions are available in Oracle Database 12c and later. For earlier releases, the APEX_JSON package (APEX release 5.0+) should be installed. Once the installation is complete, the following code might be used to convert the JSON to an XML data type through the APEX_JSON.TO_XMLTYPE() function and extract the desired values:
WITH t AS
(
SELECT SerialId, APEX_JSON.TO_XMLTYPE(EXT_ATTR) AS xml_data -- EXT_ATTR is the column that holds the JSON text
FROM tab
)
SELECT SerialId, Payment_Receipt_No
FROM t
CROSS JOIN
XMLTABLE('/json'
PASSING xml_data
COLUMNS
Payment_Receipt_No VARCHAR2(100) PATH 'PAYMENT_RECEIPT_NO'
)

Postgres Function to Insert Arrays

I am trying to INSERT data via a postgres function, and I can't quite get it working. I am getting an error stating
ERROR: function unnest(integer) does not exist
SQL state: 42883
Hint: No function matches the given name and argument types. You might need to add explicit type casts.
I am using Postgres 9.5, and my function is as follows:
CREATE FUNCTION insert_multiple_arrays(
some_infoid INTEGER[],
other_infoid INTEGER[],
some_user_info VARCHAR,
OUT new_user_id INTEGER
)
RETURNS INTEGER AS $$
BEGIN
INSERT INTO user_table (user_info) VALUES ($3) RETURNING user_id INTO new_user_id;
INSERT INTO some_info_mapper (user_id, some_info_id) SELECT new_user_id, unnest($1);
INSERT INTO other_info_mapper (user_id, other_info_id) SELECT new_user_id,unnest($2);
END;
$$ LANGUAGE plpgsql;
I will be calling the stored procedure from my backend via a SELECT statement. An example is like so:
createUser(user, callback){
let client = this.getDb();
client.query("SELECT insert_multiple_arrays($1, $2, $3)",
[user.some_info_ids, user.other_info_ids, user.info], function(err, results){
if(err){
return callback(err); // stop here so the callback is not invoked twice
}
callback(null, results);
});
};
The output that I am expecting would be as follows:
user_table
user_id | user_info |
----------------------+-----------------+
1 | someInfo |
some_info_mapper
user_id | some_info_id |
----------------------+-----------------+
1 | 33 |
1 | 5 |
other_info_mapper
user_id | other_info_id |
----------------------+-----------------+
1 | 8 |
1 | 9 |
1 | 22 |
1 | 66 |
1 | 99 |
How do I handle this error? Do I need to do some sort of processing to my data to put it into a format that postgres accepts?
You're calling insert_multiple_arrays with three parameters, but show the definition with four. Perhaps you have an old, buggy 3-parameter version still lurking there, and you are trying to find the bug in the 4-parameter version that is not actually in use?
After exploring @cachique's comments, it appears that the data was not being sent correctly after all. As it turns out, the data being passed to the back end was an array of objects that needed to be parsed further than I realized. Once parsed, the SQL worked fine. Here is the code I used to parse it on the server side before sending it to the SQL query:
user.other_info_ids = req.body.other_info.map( function(obj) { return obj.info_id; } );