Syncing Qubole HIve table to Snowflake with Struct field - hive

I have a table like following Qubole:
use dm;
CREATE EXTERNAL TABLE IF NOT EXISTS fact (
id string,
fact_attr struct<
attr1 : String,
attr2 : String
>
)
STORED AS PARQUET
LOCATION 's3://my-bucket/DM/fact'
I have created parallel table in Snowflake like following:
CREATE TABLE IF NOT EXISTS dm.fact (
id string,
fact_attr variant
)
My ETL process loads the data into qubole table like:
+------------+--------------------------------+
| id | fact_attr |
+------------+--------------------------------+
| 1 | {"attr1": "a1", "attr2": "a2"} |
| 2 | {"attr1": "a3", "attr2": null} |
+------------+--------------------------------+
I am trying to sync this data to snowflake using Merge command, like
MERGE INTO DM.FACT dst USING %s src
ON dst.id = src.id
WHEN MATCHED THEN UPDATE SET
fact_attr = parse_json(src.fact_attr)
WHEN NOT MATCHED THEN INSERT (
id,
fact_attr
) VALUES (
src.id,
parse_json(src.fact_attr)
);
I am using PySpark to sync the data:
df.write \
.option("sfWarehouse", sf_warehouse) \
.option("sfDatabase", sf_database) \
.option("sfSchema", sf_schema) \
.option("postactions", query) \
.mode("overwrite") \
.snowflake("snowflake", sf_warehouse, sf_temp_table)
With above command I am getting following error:
pyspark.sql.utils.IllegalArgumentException: u"Don't know how to save StructField(fact_attr,StructType(StructField(attr1,StringType,true), StructField(attr2,StringType,true)),true) of type attributes to Snowflake"
I have read through following links but no success:
Semi-structured Data Types
Querying Semi-structured Data
Question:
How can I insert/sync data from Qubole Hive table which has STRUCT field to snowflake?

The version of your Spark Connector for Snowflake in use at the time of trying this lacked support for variant data types.
Support was introduced in their connector version 2.4.4 (released July 2018) onwards, where the StructType fields are now auto-mapped to a VARIANT data type that will work with your MERGE command.

Related

Cannot define a BigQuery column as ARRAY<STRUCT<INT64, INT64>>

I am trying to define a table that has a column that is an arrays of structs using standard sql. The docs here suggest this should work:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<INT64,INT64>>
)
but I get an error:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat xxxx.ddl.temp.sql | awk 'ORS=" "')"
Waiting on bqjob_r6735048b_00000173ed2d9645_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxxx-10843454-yyyyy-
dev:bqjob_r6735048b_00000173ed2d9645_1': Illegal field name:
Changing the field (edit: column!) name does not fix it. What I am doing wrong?
The fields within the struct need to be named so this works:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<x INT64,y INT64>>
)

How do I identify problematic documents in S3 when querying data in Athena?

I have a basic Athena query like this:
SELECT *
FROM my.dataset LIMIT 10
When I try to run it I get an error message like this:
Your query has the following error(s):
HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"
How do I identify the S3 document that has the invalid field?
My documents are JSON.
My table looks like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
Inconsistent schema
Inconsistent schema is when values in some rows are of different data type. Let's assume that we have two json files
// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos", "age":"eleven"}
{"name":"1Fabiana", "age":22}
// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos", "age":11}
{"name":"2Fabiana", "age":22}
Then a simple query SELECT * FROM some_table will fail with
HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"
However, we can exclude that file within WHERE clause
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
WHERE "$PATH" != 's3://path/to/bad.json'
Result:
source_s3_file | name | age
---------------------------------------
s3://path/to/good.json | 1Patrick | 35
s3://path/to/good.json | 1Carlos | 11
s3://path/to/good.json | 1Fabiana | 22
Of course, this is the best case scenario when we know which files are bad. However, you can employ this approach to somewhat manually infer which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.
SELECT
COUNT(*)
FROM some_table
WHERE regexp_like("$PATH", 's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, that those files are good.
The obvious drawback of such approach is cost to execute query and time spent, especially if it is done file by file.
Malformed records
In the eyes of AWS Athena, good records are those which are formatted as a single JSON per line:
{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }
AWS Athena supports OpenX JSON SerDe library which can be set to evaluate malformed records as NULL by specifying
-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
when you create table. Thus, the following query will reveal files with malformed records:
SELECT
DISTINCT("$PATH")
FROM "some_database"."some_table"
WHERE(
col_1 IS NULL AND
col_2 IS NULL AND
col_3 IS NULL
-- etc
)
Note: you can use only a single col_1 IS NULL if you are 100% sure that it doesn't contain empty fields other then in corrupted rows.
In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'. For example the following query will still succeed
For example if a file contains:
{"name": "2Patrick","age": 35,"address": "North Street"}
{
"name": "2Carlos",
"age": 11,
"address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}
the following query will still succeed
SELECT
"$PATH" AS "source_s3_file",
*
FROM some_table
Result:
source_s3_file | name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35 | North Street
2 s3://path/to/malformed.json| | |
3 s3://path/to/malformed.json| | |
4 s3://path/to/malformed.json| | |
5 s3://path/to/malformed.json| | |
6 s3://path/to/malformed.json| | |
7 s3://path/to/malformed.json| 2Fabiana | 22 | Main Street
While with 'ignore.malformed.json' = 'false' (which is the default behaviour) exactly the same query will throw an error
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

creating external table from compressed (gz format) files without selecting all fields

I have gz files in a folder. I need only 3 columns from these files, but each line has over 100 of them. At the moment I create a view this way.
drop table MAK_CHARGE_RCR;
create external table MAK_CHARGE_RCR
(LINE string)
STORED as SEQUENCEFILE
LOCATION '/apps/hive/warehouse/mydb.db/file_rcr';
drop view VW_MAK_CHARGE_RCR;
create view VW_MAK_CHARGE_RCR as
Select LINE[57] as CREATE_DATE, LINE[64] as SUBS_KEY, LINE[63] as RC_TERM_NAME
from
(Select split(LINE, '\\|') as LINE
from MAK_CHARGE_RCR) a;
The view has the fields I need. Now I have to do the same, but without CTAS and I am not sure how to go about it. What can I do?
I was told the table must look like this
create external table MAK_CHARGE_RCR
(CREATE_DATE string, SUBS_KEY string, RC_TERM_NAME etc)
I could split the line like this
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\\|'
but I'll need to list every column. I have another group of files with over 1000 columns. All of them I'll need to list. This just seems a bit excessive, so I wondered if it is possible to do
create external table arstel.MAK_CHARGE_RCR
(split(LINE, '\\|')[57] string,
split(LINE, '\\|')[64] string
etc)
This doesn't work obviously, but maybe there are work arounds?
RegexSerDe
For educational purposes
P.s.
I intend to create an enhanced version of the CSV SerDe that excepts an additional parameter with the positions of the requested columns.
Demo
bash
echo {a..c}{1..100} | xargs -n 100 | tr ' ' '|' | \
hdfs dfs -put - /user/hive/warehouse/mytable/data.txt
hive
create external table mytable
(
col58 string
,col64 string
,col65 string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "^(?:([^|]*)\\|){58}(?:([^|]*)\\|){6}([^|]*)\\|.*$")
stored as textfile
location '/user/hive/warehouse/mytable'
;
select * from mytable
;
+---------------+---------------+---------------+
| mytable.col58 | mytable.col64 | mytable.col65 |
+---------------+---------------+---------------+
| a58 | a64 | a65 |
| b58 | b64 | b65 |
| c58 | c64 | c65 |
+---------------+---------------+---------------+

BigQuery select alias using regex_extract_all in standard mode

I'm unable to reference a SELECT alias in BigQuery (standard mode).
Trying to do this query:
SELECT
REGEXP_EXTRACT_ALL(text,
r"(<div \w+>)") AS matches
FROM
regex.test
WHERE
matches IS NOT NULL
Here are steps to reproduce.
bq mk regex
bq mk -t regex.test id:integer,text:string
echo '{"id":1, "text":"<div a>"}' | bq insert regex.test
echo '{"id":2, "text":"<div b>"}' | bq insert regex.test
echo '{"id":3, "text":"<div>"}' | bq insert regex.test
bq query --use_legacy_sql=false "select REGEXP_EXTRACT_ALL(text, r\"(<div \w+>)\") AS matches FROM regex.test WHERE id IS NOT NULL"
+--------------+
| matches |
+--------------+
| [u'<div b>'] |
| [] |
| [u'<div a>'] |
+--------------+
When I try to reference the matches alias, I see an error:
bq query --use_legacy_sql=false "select REGEXP_EXTRACT_ALL(text, r\"(<div \w+>)\") AS matches FROM regex.test WHERE matches IS NOT NULL"
Error in query string: Error processing job 'myname': Unrecognized name:
matches
I am unable to reference the alias matches, and am unable to filter those results WHERE matches IS NOT NULL.
Does anyone know what I'm doing incorrectly here?
Thanks!
Even in BQ, you can't use a column alias in the where clause. Just use a subquery:
SELECT t.*
FROM (SELECT REGEXP_EXTRACT_ALL(text, r"(<div \w+>)") AS matches
FROM regex.test
) t
WHERE ARRAY_LENGTH(matches) > 0
Check out SELECT list aliases visibility
The reason why comparing with NULL does't work for REGEXP_EXTRACT_ALL is because
it returns array so checking with length is the way. Comparing with NULL still will work for REGEXP_EXTRACT
In addition, ideally you should be able use REGEX_MATCH to filter out records w/o matches, but looks like there is an issue with this function in standard mode

Hive JSON Serde MetaStore Issue

I have an external table with JSON data and I am using JsonSerde to populate data into the table. I am properly getting the data populated and when I query the data I am able to see the results correctly.
But,when I use desc command on that table I am getting from deserializer text for all the column comments.
Below is the table creation ddl.
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
field1 string COMMENT 'This is a field1',
field2 int COMMENT 'This is a field2',
field3 string COMMENT 'This is a field3',
field4 double COMMENT 'This is a field4'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
Location '/user/uszszb6/json_test/data';
Entries in the data file.
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
When I use use the command desc my_table, I get the below output.
+-----------+------------+--------------------+--+
| col_name | data_type | comment |
+-----------+------------+--------------------+--+
| field1 | string | from deserializer |
| field2 | int | from deserializer |
| field3 | string | from deserializer |
| field4 | double | from deserializer |
+-----------+------------+--------------------+--+
JsonSerde is not able to capture the comments properly. I have also tried with other JSONSerde like
org.openx.data.jsonserde.JsonSerDe
org.apache.hive.hcatalog.data.JsonSerDe
com.amazon.elasticmapreduce.JsonSerde
But desc command output is same. There is a JIRA ticket for this bug [https://issues.apache.org/jira/browse/HIVE-6681][1]
According to ticket it's resolved in version 0.13, I am using hive 1.2.1 but still I am facing this issue.
Could anyone share your thoughts on resolving this issue.
Yeah, it looks like it's an hive bug that affects all the Json SerDes, but have you tried using DESCRIBE EXTENDED ?
DESCRIBE EXTENDED my_table;
hive> describe extended json_serde_test;
OK
browser string from deserializer
device_uuid string from deserializer
custom struct<customer_id:string> from deserializer
Detailed Table Information
Table(tableName:json_serde_test,dbName:default, owner:rcongiu,
createTime:1448477902, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:browser, type:string,
comment:hello), FieldSchema(name:device_uuid, type:string, comment:my
name is elder price), FieldSchema(name:custom,
type:struct<customer_id:string>, comment:null)],
location:hdfs://localhost:9000/user/hive/warehouse/json_serde_test,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.openx.data.jsonserde.JsonSerDe, parameters:
{serialization.format=1, mapping.customer_id=Customer ID}),
bucketCols:[], sortCols:[], parameters:{},
skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{numFiles=1,
transient_lastDdlTime=1448477903, COLUMN_STATS_ACCURATE=true,
totalSize=128, numRows=0, rawDataSize=0}, viewOriginalText:null,
viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.073 seconds, Fetched: 5 row(s)
Will output a json-ish detailed description that includes comments..kind of hard to read but it is showing me the comments and may be enough for your purposes..or not.