Timezone issue in Hive

We are ingesting data into a Hive Parquet table from an Oracle database using an ETL tool. The database stores timestamps in UTC, but when we look at the Hive table the timestamp values are shown in Eastern time (the cluster time zone is EST).
I understand we could use Hive functions to convert to the desired time zone in the SELECT queries run against the table, but my question is: can we ask Hive/Parquet not to convert to the cluster time zone while writing the data, so that the source value is displayed as is?
The goal is to keep the values the same as in the source and not allow any implicit conversions. Another option is to treat these timestamp values as strings, but we don't want to go that route. I'd appreciate advice on the right solution for this.
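For context, this is the kind of per-query conversion we would like to avoid having to add everywhere (just a sketch; events and event_ts are made-up names, and America/New_York stands in for the cluster zone):
-- to_utc_timestamp treats the stored value as a wall-clock time in the given zone
-- and converts it back to UTC, i.e. back to what the source system holds.
SELECT event_ts,
       to_utc_timestamp(event_ts, 'America/New_York') AS event_ts_utc
FROM events;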
Thanks

Related

How to convert the spanner TIMESTAMP to BigQuery TIMESTAMP?

I am trying to copy some tables from Spanner to BigQuery.
I dumped the Spanner database to a CSV file, and when I try to upload that CSV to BigQuery it throws an error about the timestamp format.
The documentation mentions a limitation of the BigQuery TIMESTAMP format.
How do I convert a Spanner TIMESTAMP to a BigQuery TIMESTAMP?
There may be two ways to go about this.
Keep the timestamp field as a string as exported by Cloud Spanner and load it into BigQuery as a string. It should still be sortable and usable in predicates (see the sketch below).
Use a user-defined function to do the string conversion required to load the timestamp natively in BigQuery, via the TextToBigQuery Dataflow template.
You may also write a script to convert the Timestamp to the BigQuery format.
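A minimal sketch of the first option, assuming the exported strings are RFC 3339 timestamps (the dataset, table, and column names are made up):
-- created_at was loaded as STRING; parse it on the fly wherever a real TIMESTAMP is needed.
SELECT PARSE_TIMESTAMP("%Y-%m-%dT%H:%M:%E*SZ", created_at) AS created_at_ts
FROM `my_dataset.my_table`
-- lexicographic comparison on RFC 3339 strings still orders and filters by time
WHERE created_at >= "2019-08-01";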
In addition to what @Biswa-nag wrote -
We export our Spanner tables to Avro files and then import them into BigQuery.
Unfortunately, the timestamps turned out to be STRINGs in BigQuery.
Our workaround for ad-hoc queries is to use a user-defined function to convert the timestamps in the queries (it took some time to find the correct format...)
An example:
CREATE TEMP FUNCTION ConvertTimestamp(dt STRING) AS (PARSE_DATETIME("%Y-%m-%dT%H:%M:%E*SZ", dt));
SELECT COUNT(*) FROM `[db].Games` WHERE ConvertTimestamp(StartTime) >= DATETIME(2019, 8, 1, 0, 0, 0)
I converted the timestamp to epoch time like this:
SELECT myTime , FORMAT_TIMESTAMP("%s", myTime, "America/Los_Angeles") FROM MyTable
and it worked.

How to export AVRO files from a BigQuery table with a DATE column and load it again to BigQuery

For moving data from a BigQuery (BQ) table that resides in the US, I want to export the table to a Cloud Storage (GCS) bucket in the US, copy it to an EU bucket, and from there import it again.
The problem is that AVRO does not support DATE types, but that is crucial to us because we use the new partitioning feature that relies not on ingestion time but on a column in the table itself.
The AVRO files contain the DATE column as a STRING, and therefore a "Field date has changed type from DATE to STRING" error is thrown when trying to load the files via bq load.
There has been a similar question, but it is about timestamps - in my case it absolutely needs to be a DATE as dates don't carry timezone information and timestamps are always interpreted in UTC by BQ.
It works when using NEWLINE_DELIMITED_JSON, but is it possible to make this work with AVRO files?
As @ElliottBrossard pointed out in the comments, there's a public feature request regarding this where it's possible to sign up for the whitelist.
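Until that is available, one possible workaround (a sketch only; the dataset, table, and column names are made up) is to load the Avro into a staging table and cast the string back to DATE while copying into the partitioned target:
-- The Avro load lands the date column as STRING in the staging table.
-- Assuming the strings are in YYYY-MM-DD form, CAST(... AS DATE) restores the type.
INSERT INTO `my_dataset.target_table` (id, d)
SELECT id, CAST(d AS DATE) AS d
FROM `my_dataset.staging_table`;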

In PostgreSQL, what data type do you pass to a CREATE TABLE call when dealing with timestamp values?

When creating a table, how do you deal with a timestamp in a CSV file that has the following format: MM/DD/YY HH:MI? Here's an example: 1/1/16 19:00
I have tried the following script in PostgreSQL:
create table timetable (
time timestamp
);
copy timetable from '<path>' delimiter ',' CSV;
But, I receive an error message saying:
ERROR: invalid input syntax for type timestamp: "visit_datetime"
Where: COPY air_reserve, line 16, column visit_datetime: "visit_datetime"
One solution I have considered is to first create the column as char and then run a separate query that converts it to the appropriate timestamp data type using to_timestamp(time, 'MM/DD/YY HH24:MI'). But I'm looking for a solution that would load the data with the correct data type in a single query.
You may find a datestyle that enables you to load the data you have, but sooner or later someone will deliver to you something that doesn't fit.
The solution you have considered is probably the best.
We use this as a standard pattern for loading data warehouses. We take today's data and load it into a staging table, using varchar columns for any data that will not load directly into its target data type (see the sketch below). We then run whatever scripts we need to get the data into a good state, raising warnings for anything that is broken in a way we haven't seen before. Then we add the cleaned version of today's data to the table containing cleaned data for all previous days.
We don't mind if this takes several steps; we put them all in a script and run it as an automated job.
I'm working on documenting the techniques we use. You can see the beginnings of this at http://www.thedatastudio.net.
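A minimal sketch of that staging pattern for the example in the question (assuming the CSV has a header row, which is what the COPY error above tripped over; HH24 because 19:00 is 24-hour time):
-- Stage the raw CSV as text first, then convert while copying into the real table.
create table timetable_staging (
    visit_datetime varchar
);

copy timetable_staging from '<path>' delimiter ',' CSV HEADER;

-- to_timestamp returns timestamptz, so cast it for the plain timestamp column.
insert into timetable (time)
select to_timestamp(visit_datetime, 'MM/DD/YY HH24:MI')::timestamp
from timetable_staging;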

AWS Redshift: How do I convert data in varchar(7) YYYY/MM to a Date type in Redshift efficiently?

I have financial data that unfortunately has no day component in the date. I've already uploaded over 100GB of data into Redshift, and now I want to convert the YYYY/MM varchar(7) into a Date in Redshift; how should I do it efficiently?
My first thought is to create a Ruby script that connects to Redshift and transforms the data from one Redshift database to another DB on EC2. Is there a better way of doing this in SQL or something else?
Use to_date to transform strings to dates:
UPDATE original_table SET new_date_column = TO_DATE(varchar_date_column, 'YYYY/MM');
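For completeness, the target column has to exist before running that UPDATE; a minimal sketch (the table and column names follow the statement above):
-- Add the DATE column, then backfill it from the YYYY/MM strings.
ALTER TABLE original_table ADD COLUMN new_date_column DATE;

-- With no day in the format string, TO_DATE defaults the day to the 1st of the month.
UPDATE original_table
SET new_date_column = TO_DATE(varchar_date_column, 'YYYY/MM');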

TIMESTAMP on HIVE table

I'm trying to load data from Oracle to Hive as Parquet. Every time I load a table with a date/timestamp column into Hive, it automatically converts these columns to BIGINT. Is it possible to load timestamp/date formats into Hive using Sqoop and as a Parquet file?
I already tried creating the table first in Hive and then using Impala to LOAD DATA INPATH the Parquet file.
It still failed with the error "file XX has an incompatible Parquet schema for column XX column: TIMESTAMP".
BTW, I'm using the Cloudera QuickStart VM. Thanks
From the Cloudera documentation:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
Or you can use an expression like this in your Hive query to get the result in your desired TIMESTAMP format:
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
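For example, applied to a full query (my_parquet_table is an illustrative name; dividing by 1000 is another way of dropping the millisecond digits):
-- The imported BIGINT holds milliseconds since the epoch, while FROM_UNIXTIME expects seconds.
SELECT FROM_UNIXTIME(CAST(timestamp_column / 1000 AS BIGINT)) AS timestamp_column
FROM my_parquet_table;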
Try using the Sqoop option
--map-column-hive <cols_name>=TIMESTAMP