Convert timestamp with milliseconds in Parquet - apache-spark-sql

I have an Athena column with the value 2021-03-02 00:00:00.000, stored with the timestamp data type.
I am trying to write the data frame as Parquet, and the Spark job fails with:
Caused by: java.lang.RuntimeException: Unable to create Parquet converter for data type "timestamp" whose Parquet type is optional binary column_name_1 (UTF8)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:409)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.$anonfun$fieldConverters$1(ParquetRowConverter.scala:213)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
I tried the following, which worked in Athena, but I still get the same error when writing Parquet:
cast(date_format(column_name_1,'%Y-%m-%d %H:%i:%s') as timestamp) as column_name_1
Then I tried setting the Spark config, still with no luck:
"spark.sql.parquet.outputTimestampType": "TIMESTAMP_MILLIS"
Note: I don't need millisecond precision for my case; precision to the second is enough.
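The error text says the Parquet column is physically a UTF8 string while the schema being applied expects a timestamp, so one possible route is to read the column as a string and convert it explicitly. The sketch below shows only that conversion logic in plain Python, assuming every value looks like 2021-03-02 00:00:00.000. The function name is hypothetical, and the same parse-then-truncate step would have to be expressed with whatever functions your Spark version provides.

```python
from datetime import datetime


def parse_and_truncate(value: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS.fff' and drop the fractional seconds."""
    # %f accepts one to six fractional digits, so '.000' parses fine.
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
    # Second-level precision is enough here, so zero out the fraction.
    return parsed.replace(microsecond=0)


print(parse_and_truncate("2021-03-02 00:00:00.000"))  # 2021-03-02 00:00:00
```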

Related

Google BigQuery: Importing DATETIME fields using Avro format

I have a script that downloads data from an Oracle database and uploads it to Google BigQuery. This is done by writing to an Avro file, which is then uploaded directly using BQ's Python framework. The BigQuery tables I'm uploading the data to have predefined schemas, some of which contain DATETIME fields.
As BigQuery now has support for Avro Logical fields, import of timestamp data is no longer a problem. However, I'm still not able to import datetime fields. I tried using string, but then I got the following error:
Field CHANGED has incompatible types. Configured schema: datetime; Avro file: string.
I also tried to convert the field data to timestamps on export, but that produced an internal error in BigQuery:
An internal error occurred and the request could not be completed. Error: 3144498
Is it even possible to import datetime fields using Avro?
In Avro, a logical type must include the logicalType attribute; it is possible that this attribute is missing from your schema definition.
There are a couple of examples here, like the following one. As far as I know, the date logical type annotates an int, and the logicalType attribute belongs inside the type object:
{
'name': 'DateField',
'type': {'type': 'int', 'logicalType': 'date'}
}
Once the logical data type is set, try again. The documentation does indicate it should work:
Avro logical type --> date
Converted BigQuery data type --> DATE
If you get an error, it would be helpful to check the schema of your Avro file. You can use this command to obtain its details:
java -jar avro-tools-1.9.2.jar getschema my-avro-file.avro
UPDATE
For cases where DATE alone doesn't work, consider that TIMESTAMP stores the date and time as a number of milliseconds or microseconds from the Unix epoch, 1 January 1970 00:00:00.000000 UTC (UTC seems to be the default for Avro). Additionally, the values stored in an Avro file (of type DATE or TIMESTAMP) are independent of any particular time zone; in this sense, they are very similar to the BigQuery TIMESTAMP data type.
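To make the epoch arithmetic concrete, here is a minimal sketch of how values for the Avro date and timestamp-micros logical types could be computed before writing the file. The helper names are hypothetical, and the schema fragment assumes the logicalType attribute is nested inside the type object, as the Avro specification describes.

```python
import json
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DT = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_avro_date(d: date) -> int:
    """Days since the Unix epoch, the value a 'date' logical type holds."""
    return (d - EPOCH_DATE).days


def to_avro_timestamp_micros(dt: datetime) -> int:
    """Microseconds since the Unix epoch; dt must be timezone-aware."""
    delta = dt - EPOCH_DT
    # Integer arithmetic avoids floating-point rounding on large values.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


# Illustrative field definition; 'DateField' is the name used in the answer.
field = {"name": "DateField", "type": {"type": "int", "logicalType": "date"}}
print(json.dumps(field))

print(to_avro_date(date(2021, 3, 2)))  # 18688
print(to_avro_timestamp_micros(datetime(2021, 3, 2, tzinfo=timezone.utc)))  # 1614643200000000
```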

Timezone issue in Hive

We are ingesting data into a Hive Parquet table from an Oracle database using an ETL tool. The database stores timestamps in UTC, but when we look at the Hive table, the timestamp values are shown in Eastern time (the cluster time zone is EST).
I understand that we could use Hive functions in our select queries to convert the values to the desired time zone, but my question is: can we tell Hive not to convert to the cluster time zone when writing Parquet data, so that the source value is displayed as is?
The goal is to keep the values the same as in the source and not allow any implicit conversions. Another option is to treat these timestamp values as strings, but we don't want to take that approach. I would appreciate advice on the right solution for this.
Thanks
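The symptom described is consistent with one stored instant being rendered in two different zones. The sketch below only illustrates that idea in plain Python and is not a description of how Hive is implemented; the sample value is hypothetical, and EST is modeled as a fixed UTC-5 offset that ignores daylight saving.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EST = timezone(timedelta(hours=-5), "EST")  # fixed offset, no daylight saving

# Hypothetical source value, stored in UTC.
source_value = datetime(2021, 3, 2, 12, 0, 0, tzinfo=UTC)

# The same instant as a reader in the cluster zone would see it.
as_eastern = source_value.astimezone(EST)

print(source_value.isoformat())    # 2021-03-02T12:00:00+00:00
print(as_eastern.isoformat())      # 2021-03-02T07:00:00-05:00
print(source_value == as_eastern)  # True: only the rendering differs
```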

How to load Avro File to BigQuery tables with columns having 'Timestamp' type

By default, Avro doesn't support timestamps, but I can have epoch time values of type long in the file. What I want is to load those values as TIMESTAMP when loading the Avro file into a BigQuery table using the command line tool.
For example: I have a column with the value 1511253927, and I want this value to be loaded as 2017-11-21 00:00:00 using the command line tool.
Any leads will be appreciated.
You can try running a query with your file as a federated data source and using the TIMESTAMP_SECONDS standard SQL function to convert the values.
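As a sanity check on what a seconds-to-timestamp conversion should yield for the sample value, here is a small Python sketch. It mirrors the conversion conceptually rather than calling BigQuery, and the function name is borrowed only for readability. Note that 1511253927 falls at 08:45:27 UTC on 2017-11-21, so getting 2017-11-21 00:00:00 would additionally require truncating to the date.

```python
from datetime import datetime, timezone


def timestamp_seconds(epoch_seconds: int) -> datetime:
    """Interpret an integer as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


ts = timestamp_seconds(1511253927)
print(ts)         # 2017-11-21 08:45:27+00:00
print(ts.date())  # 2017-11-21
```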

TIMESTAMP on HIVE table

I'm trying to load data from Oracle to Hive as Parquet. Every time I load a table with a date/timestamp column into Hive, it automatically converts these columns to BIGINT. Is it possible to load timestamp/date formats into Hive using Sqoop as a Parquet file?
I already tried creating the table first in Hive and then using Impala to LOAD DATA INPATH the Parquet file.
It still failed with errors:
"file XX has an incompatible Parquet schema for column XX column:
TIMESTAMP"
BTW, I'm using cloudera quickstart vm. Thanks
From the Cloudera documentation:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
Alternatively, you can write your Hive query like this to get the result in your desired TIMESTAMP format:
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
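The divide-by-1000 rule and the SUBSTR trick can be compared in a short Python sketch. The sample value and helper names are hypothetical. Taking the first ten digits matches integer division by 1000 only while the millisecond value has exactly 13 digits, which holds for dates from late 2001 onward for the next couple of centuries, so the division is the safer general form.

```python
from datetime import datetime, timezone


def millis_to_timestamp(epoch_millis: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime, dropping the millis."""
    return datetime.fromtimestamp(epoch_millis // 1000, tz=timezone.utc)


def first_ten_digits(epoch_millis: int) -> int:
    """Mimic SUBSTR(value, 1, 10) followed by a cast to an integer."""
    return int(str(epoch_millis)[:10])


sample = 1511253927000  # hypothetical BIGINT value, in milliseconds
print(millis_to_timestamp(sample))                 # 2017-11-21 08:45:27+00:00
print(first_ten_digits(sample) == sample // 1000)  # True for 13-digit values
```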
Try using this Sqoop option:
--map-column-hive <cols_name>=TIMESTAMP

Invalid parquet hive schema: repeated group array

Most datasets on our production Hadoop cluster are currently stored in Avro + Snappy format. I have heard lots of good things about Parquet and want to give it a try.
I followed this web page to change one of our ETL jobs to generate Parquet files, instead of Avro, as the output of our reducer. I used the Parquet + Avro schema to produce the final output data, plus the Snappy codec. Everything works fine, so the final output Parquet files should have the same schema as our original Avro files.
Now I am trying to create a Hive table for these Parquet files. IBM BigInsights 3.0, which we use, contains Hive 12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
create table xxx (col1 bigint, col2 string,.................field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,field2 array<struct<..............>>) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' location 'xxxx'
The table was created successfully in Hive 12, and I can run "desc table" without any problem. But when I tried to query the table, for example with "select * from table limit 2", I got the following error:
Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array { required binary sub1 (UTF8); optional binary sub2 (UTF8); optional int64 date_value;}
at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
... 14 more
I noticed that the error comes from the first nested array-of-struct column. My questions are the following:
Does Parquet support nested arrays of structs?
Is this only related to Parquet 1.3.2? Is there any solution on Parquet 1.3.2?
If I have to use a later version of Parquet to fix the above problem, and Parquet 1.3.2 is still available at runtime, will that cause any issues?
Can I use all kinds of Hive features, like "explode" on nested structures, with the Parquet data?
What we want to know is whether Parquet can be used the same way we currently use Avro, while giving us the columnar storage benefits that Avro lacks.
It looks like Hive 12 cannot support the nested structure of the Parquet file, as shown in this Jira ticket:
https://issues.apache.org/jira/browse/HIVE-8909