Hive regexp_replace Unable to execute method public - sql

Getting the following error when running the Hive query in an Airflow job. However, the query itself works in Hive when executed in the interactive environment.
The data is the 'scans' field from the VirusTotal URL report (https://developers.virustotal.com/v2.0/reference/url-report), and it was stored as a JSON string when saved to the table.
Here is my query, which extracts the scans data from the table, parses the JSON string, and splits each AV engine into a separate row:
SELECT
av_engine, output_map['detected'], output_map['result']
FROM (
SELECT
av_engine,
str_to_map(regexp_replace(engine_result, '\\}', ''),',', ':') AS output_map
FROM (
SELECT
str_to_map(regexp_replace(regexp_replace(scans, '\"', ''), '\\{',''),'\\},', ':') AS key_val_map
FROM virus_total_scan_results
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result
)
And it errors out with the following message:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public
org.apache.hadoop.io.Text org.apache.hadoop.hive.ql.udf.UDFRegExpReplace.evaluate(
org.apache.hadoop.io.Text,
org.apache.hadoop.io.Text,
org.apache.hadoop.io.Text
) with arguments
{
Bkav: {detected: false, version: 1.3.0.9899, result: null, update: 20210218},
DrWeb: {detected: false, version: 7.0.49.9080, result: null, update: 20210218},
McAfee: {detected: false, version: 6.0.6.653, result: null, update: 20210218},
...
Qihoo-360: {detected: false, version: 1.0.0.1120, result: null, update: 20210218}
},
{,}:Illegal repetition
Any idea on what the issue could be?

Related

Parse Json - CTE & filtering

I need to remove a few records (those containing 't') in order to parse/flatten the data column. The query in the CTE that creates 'tab' works on its own, but when I use the CTE I get the same JSON-parsing error as if I had never tried to filter out the culprit rows.
with tab as (
select * from table
where data like '%t%')
select b.value::string, a.* from tab a,
lateral flatten( input => PARSE_JSON( a.data) ) b;
error:
Error parsing JSON: unknown keyword "test123", pos 8
example data:
Date Data
1-12-12 {id: 13-43}
1-12-14 {id: 43-43}
1-11-14 {test12}
1-11-14 {test2}
1-02-14 {id: 44-43}
It is possible to replace PARSE_JSON(a.data) with TRY_PARSE_JSON(a.data) which will produce NULL instead of error for invalid input.
More at: TRY_PARSE_JSON
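Applied to the query above (a minimal sketch, reusing the same hypothetical table and column names), the fix is a one-function change, and the LIKE filter is no longer needed:

```sql
-- TRY_PARSE_JSON returns NULL for invalid JSON instead of raising an error;
-- LATERAL FLATTEN then simply emits no rows for those records.
select b.value::string, a.*
from tab a,
lateral flatten( input => TRY_PARSE_JSON( a.data) ) b;
```

If you still need to see which rows failed to parse, `TRY_PARSE_JSON(a.data) IS NULL` can be used as a filter in a separate query.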

Extract complex json with random key field

I am trying to extract the following JSON into its own rows, like the table below, in a Presto query. The issue is that the key (the AV engine name) is different for each row, and I am stuck on how to extract and iterate over the keys without knowing them in advance.
The JSON is the value of a table row:
{
    "Bkav":
    {
        "detected": false,
        "result": null
    },
    "Lionic":
    {
        "detected": true,
        "result": "Trojan.Generic.3611249"
    },
...
AV Engine Name | Detected Virus | Result
Bkav           | false          | null
Lionic         | true           | Trojan.Generic.3611249
I have tried json_extract, following the documentation here (https://teradata.github.io/presto/docs/141t/functions/json.html), but there is no mention of extraction when the key is unknown. I am trying to find a solution that works in both Presto and Hive; is there a common query that is applicable to both?
You can cast your JSON to map(varchar, json) and flatten it with unnest:
-- sample data
WITH dataset (json_str) AS (
    VALUES (
        '{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
    )
)
-- query
select k "AV Engine Name",
       json_extract_scalar(v, '$.detected') "Detected Virus",
       json_extract_scalar(v, '$.result') "Result"
from (
    select cast(json_parse(json_str) as map(varchar, json)) as m
    from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:
AV Engine Name | Detected Virus | Result
Bkav           | false          |
Lionic         | true           | Trojan.Generic.3611249
The Presto query suggested by #Guru works, but for Hive there is no easy way. I had to:
1. Extract the JSON with get_json_object.
2. Parse it with regexp_replace to strip the quotes and brackets.
3. Convert it to a map, and repeat once more to get the nested values out.
SELECT
av_engine,
str_to_map(regexp_replace(engine_result, '\\}', ''),',', ':') AS output_map
FROM (
SELECT
str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{',''),'\\},', ':') AS key_val_map
FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result

PostgreSQL - Query nested json in text column

My situation is the following:
-> Table A has a column named informations whose type is text.
-> The informations column stores a JSON string (but it is still text), like this:
{
"key": "value",
"meta": {
"inner_key": "inner_value"
}
}
I'm trying to query this table by searching on the informations.meta.inner_key path with the following query:
SELECT * FROM A WHERE (informations::json#>>'{meta, inner_key}' = 'inner_value')
But I'm getting the following error:
ERROR: invalid input syntax for type json
DETAIL: The input string ended unexpectedly.
CONTEXT: JSON data, line 1:
SQL state: 22P02
I built the query following this link: DevHints - PostgreSQL.
Does anyone know how to build the query properly?
EDIT 1:
I solved it with this workaround, but I think there are better solutions to the problem:
WITH temporary_table as (SELECT A.informations::json#>>'{meta, inner_key}' as inner_key FROM A)
SELECT * FROM temporary_table WHERE inner_key = 'inner_value'
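The error detail ("The input string ended unexpectedly") suggests that some rows hold an empty string rather than JSON, so the cast fails before any filtering happens. A minimal sketch, assuming that is the case, turns empty strings into NULL before casting:

```sql
-- NULLIF maps '' to NULL; casting NULL to json yields NULL, so empty rows
-- drop out of the comparison instead of raising a parse error.
SELECT *
FROM A
WHERE NULLIF(informations, '')::json #>> '{meta, inner_key}' = 'inner_value';
```

Rows containing non-empty but still invalid JSON would still error; those would have to be excluded explicitly (or the column migrated to a proper json/jsonb type).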

PG::InvalidParameterValue: ERROR: cannot extract element from a scalar

I'm fetching data from the JSON column by using the following query.
SELECT id FROM (
    SELECT id, JSON_ARRAY_ELEMENTS(shipment_lot::json) AS js2
    FROM file_table
    WHERE ('shipment_lot') = ('shipment_lot')
) q
WHERE js2 ->> 'invoice_number' LIKE ('%" abc1123"%')
My Postgresql version is 9.3
Saved data in JSON column:
[{ "id"=>2981, "lot_number"=>1, "activate"=>true, "invoice_number"=>"abc1123", "price"=>378.0}]
However, I'm getting this error:
ActiveRecord::StatementInvalid (PG::InvalidParameterValue: ERROR: cannot extract element from a scalar:
SELECT id FROM
( SELECT id,JSON_ARRAY_ELEMENTS(shipment_lot::json)
AS js2 FROM file_heaps
WHERE ('shipment_lot') = ('shipment_lot') ) q
WHERE js2->> 'invoice_number' LIKE ('%abc1123%'))
How can I solve this issue?
Your issue is that you have improper JSON stored: the => separators are Ruby hash syntax, not JSON. If you try running your example data on Postgres, it will not parse:
SELECT ('[{ "id"=>2981, "lot_number"=>1, "activate"=>true, "invoice_number"=>"abc1123", "price"=>378.0}]')::json
This is the JSON formatted correctly:
SELECT ('[{ "id":2981, "lot_number":1, "activate":true, "invoice_number":"abc1123", "price":378.0}]')::json
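Once the stored value is valid JSON, the original query can also be tidied up (a sketch only: the always-true filter WHERE ('shipment_lot') = ('shipment_lot') and the stray quotes in the LIKE pattern are dropped, assuming the intent is simply to match the invoice number):

```sql
-- json_array_elements expands the array into one row per element;
-- ->> extracts the field as text so LIKE can match it.
-- Both are available in PostgreSQL 9.3.
SELECT id
FROM (
    SELECT id,
           json_array_elements(shipment_lot::json) AS js2
    FROM file_table
) q
WHERE js2 ->> 'invoice_number' LIKE '%abc1123%';
```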

Error extracting data using TPT script for Fixed width table

I am trying to export data from a fixed-width table in Teradata.
The following is the error log
Found CheckPoint file: /path
This is a restart job; it restarts at step MAIN_STEP.
Teradata Parallel Transporter DataConnector Version 13.10.00.05
FILE_WRITER Instance 1 directing private log report to 'dataconnector_log-1'.
Teradata Parallel Transporter SQL Selector Operator Version 13.10.00.05
SQL_SELECTOR: private log specified: selector_log
FILE_WRITER: TPT19007 DataConnector Consumer operator Instances: 1
FILE_WRITER: TPT19003 ECI operator ID: FILE_WRITER-31608
FILE_WRITER: TPT19222 Operator instance 1 processing file 'path/out.dat'.
SQL_SELECTOR: connecting sessions
SQL_SELECTOR: TPT15105: Error 13 in finalizing the table schema definition
SQL_SELECTOR: disconnecting sessions
SQL_SELECTOR: Total processor time used = '0.02 Second(s)'
SQL_SELECTOR: Start : Sat Aug 9 12:37:48 2014
SQL_SELECTOR: End : Sat Aug 9 12:37:48 2014
FILE_WRITER: TPT19221 Total files processed: 0.
Job step MAIN_STEP terminated (status 12)
Job edwaegcp terminated (status 12)
TPT script used :
USING CHARACTER SET UTF8
DEFINE JOB EXPORT_DELIMITED_FILE
DESCRIPTION 'Export rows from a Teradata table to a file'
(
DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA
(
id char(20)
);
DEFINE OPERATOR SQL_SELECTOR
TYPE SELECTOR
SCHEMA PRODUCT_SOURCE_SCHEMA
ATTRIBUTES
(
VARCHAR PrivateLogName = 'selector_log',
VARCHAR TdpId= '****',
VARCHAR UserName= '****',
VARCHAR UserPassword='*****',
VARCHAR SelectStmt= 'LOCKING ROW FOR ACCESS SELECT
CAST(id AS CHAR(20)),
FROM sample_db.sample_table'
);
DEFINE OPERATOR FILE_WRITER
TYPE DATACONNECTOR CONSUMER
SCHEMA *
ATTRIBUTES
(
VARCHAR PrivateLogName = 'dataconnector_log',
VARCHAR DirectoryPath = '/path/',
VARCHAR Format = 'Text',
VARCHAR FileName= 'out.dat',
VARCHAR OpenMode= 'Write'
);
APPLY TO OPERATOR (FILE_WRITER)
SELECT * FROM OPERATOR (SQL_SELECTOR);
);
Could you point out the error in the TPT script that's leading to this failure?
Or: how do we extract a fixed-width table using TPT?