I'm trying to load data from Oracle to Hive as Parquet. Every time I load a table with a date/timestamp column into Hive, it automatically converts these columns to BIGINT. Is it possible to load timestamp/date formats into Hive using Sqoop and as a Parquet file?
Already tried creating the table first in hive then using impala to LOAD DATA INPATH the parquet file.
Still failed with errors
"file XX has an incompatible Parquet schema for column XX column:
TIMESTAMP"
BTW, I'm using cloudera quickstart vm. Thanks
From the Cloudera documentation:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
Alternatively, you can convert the value in your Hive query like this to get the result in your desired TIMESTAMP format:
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
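As a rough illustration of the arithmetic in the quoted documentation, here is a minimal Python sketch. The function name millis_to_utc and the sample value are hypothetical and not part of Sqoop, Hive, or Impala; the sketch assumes the BIGINT holds milliseconds since the Unix epoch in UTC. It also shows that keeping the first 10 characters, as the SUBSTR query above does, matches dividing by 1000 only while the value has 13 digits.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def millis_to_utc(ms: int) -> datetime:
    """Interpret an INT64 value as milliseconds since the Unix epoch (assumed UTC)."""
    return EPOCH + timedelta(milliseconds=ms)

# Hypothetical value, as it might appear in the imported BIGINT column.
sample_ms = 1511253927000
print(millis_to_utc(sample_ms).isoformat())

# Keeping the first 10 digits equals integer division by 1000
# only for 13-digit millisecond values.
print(int(str(sample_ms)[:10]) == sample_ms // 1000)
```

For values with fewer or more than 13 digits (dates before September 2001 or far in the future), the substring approach would give a different result than the division, so dividing by 1000 is the safer of the two.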
Try Sqoop's --map-column-hive option:
--map-column-hive <cols_name>=TIMESTAMP
When I load Parquet files into a BigQuery table, the stored values are weird. It seems to be the encoding of BYTES fields or something else.
Here's the format of the create fields
So when I read the table with casted fields, I get the readable values.
I found the solution here
My question is: why is BigQuery behaving like this?
According to this GCP documentation, some Parquet data types can be converted into multiple BigQuery data types. A workaround is to specify the appropriate data type in the Parquet file so that BigQuery converts the column the way you want.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
And another way is to add the schema to the bq load command:
bq load --source_format=PARQUET --noreplace --noautodetect --parquet_enum_as_string=true --decimal_target_types=STRING [project]:[dataset].[tables] gs://[bucket]/[file].parquet Column_name:Data_type
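One possible explanation for the odd-looking values, offered as an assumption rather than a diagnosis: BigQuery renders BYTES values as base64 text, so a string column that landed as BYTES may show up as the base64 encoding of the original text. The Python sketch below decodes such a displayed value; the function name decode_bytes_cell and the sample value are made up for illustration.

```python
import base64

def decode_bytes_cell(displayed: str, encoding: str = "utf-8") -> str:
    """Decode a base64-rendered BYTES value back into text.

    Assumes the underlying bytes are text in the given encoding.
    """
    raw = base64.b64decode(displayed, validate=True)
    return raw.decode(encoding)

# Hypothetical cell value as it might be displayed for a BYTES column.
print(decode_bytes_cell("aGVsbG8="))
```

This is consistent with the question's observation that reading the table with cast fields yields readable values.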
I am trying to copy some tables from Spanner to BigQuery.
I dumped the Spanner database to a CSV file, and when I try to upload that CSV to BigQuery it throws an error about the timestamp format.
Here they mentioned limitation of BigQuery TIMESTAMP.
How do I convert spanner TIMESTAMP to BigQuery TIMESTAMP?
There may be two ways to go about this.
Keep the timestamp field as a string as exported by Cloud Spanner and load it into BigQuery as a string. It should still be sortable and used in predicates.
Use a user-defined function to do the string conversion required to load the timestamp natively in BigQuery, via the TextToBigQuery Dataflow template.
You may also write a script to convert the Timestamp to the BigQuery format.
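A conversion script along those lines might look like the following minimal Python sketch. It assumes the exported values are RFC 3339 strings in UTC such as 2019-08-01T12:34:56.123456789Z, and that the load fails because of the T/Z markers or the nanosecond precision; neither assumption is confirmed by the question. The function name spanner_to_bigquery_ts is hypothetical.

```python
import re

# Date, time, optional fractional seconds, and a trailing Z (UTC).
_RFC3339_UTC = re.compile(r"^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")

def spanner_to_bigquery_ts(value: str) -> str:
    """Rewrite an RFC 3339 UTC timestamp as 'YYYY-MM-DD HH:MM:SS[.ffffff]'.

    Fractional seconds are truncated (not rounded) to microseconds.
    """
    match = _RFC3339_UTC.match(value)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {value!r}")
    date_part, time_part, fraction = match.groups()
    if fraction:
        micros = fraction[:6].ljust(6, "0")
        return f"{date_part} {time_part}.{micros}"
    return f"{date_part} {time_part}"

print(spanner_to_bigquery_ts("2019-08-01T12:34:56.123456789Z"))
```

The output carries no time zone suffix, so it relies on the value being read as UTC; applying the function to the timestamp column of each CSV row before loading is left to the reader.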
In addition to what #Biswa-nag wrote -
We export our Spanner tables to avro files then import to BigQuery.
Unfortunately, the timestamps turned out to be Strings in BigQuery.
Our workaround for ad-hoc queries is to use a user-defined function to convert the timestamp in the queries (it took some time to find the correct format...).
An example:
CREATE TEMP FUNCTION ConvertTimestamp(dt STRING) AS (PARSE_DATETIME("%Y-%m-%dT%H:%M:%E*SZ", dt));
select count(*) from `[db].Games` where ConvertTimestamp(StartTime) >= DateTime(2019,8,1,0,0,0)
I converted timestamp to epoch time like this
SELECT myTime , FORMAT_TIMESTAMP("%s", myTime, "America/Los_Angeles") FROM MyTable
and it worked.
I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in the Parquet file format. The data type of one column is STRING. I am uploading the file behind the Hive table to Google Cloud Storage and creating a BigQuery internal table from it with the GUI. The data type of that column in the imported table is converted to BYTES.
But when I imported a CHAR or VARCHAR column, the resulting data type was STRING.
Could someone please help explain why this is happening?
This does not answer the original question, as I do not know exactly what happened, but I have experienced similar odd behavior.
I faced a similar issue when trying to move a table between Cloudera and BigQuery.
First I created the table as external in Impala like this:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with STRING datatype
Then I transferred that to GS and imported it into BigQuery from the console GUI. There are not many options: just select the Parquet format and point to GS.
To my surprise, the columns were now of type BYTES. The column names were preserved fine, but the content was scrambled.
Trying different codecs, or pre-creating the table and then inserting into it, still in Impala, led to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating an external table in Hive like this:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
Then I repeated the same dance of copying from S3 to GS and importing into BQ, this time without any issue. The columns are now recognized in BQ as STRING and the data is as it should be.
By default, Avro doesn't support timestamp but I can have 'Epoch' time values having 'Long' type in the file. What I want is to load those values in 'Timestamp' format while loading the Avro file data to Bigquery table using command line tool.
For example : I have a column having value 1511253927 and I want this value to be loaded as 2017-11-21 00:00:00 using command line tool.
Any leads will be appreciated.
You can try running a query with your file as a federated data source and use the TIMESTAMP_SECONDS standard SQL function to convert the values.
Using cloudera 8.1. In Hive, loaded a table in ORC format with a CSV file. Getting this error on attempting to query the loaded table:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.IntWritable
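This is a common mistake I see lots of people make. Loading a file into a table only moves the file into place without converting it, so a CSV file loaded into an ORC table cannot be read correctly.
You can create a Hive external table over the CSV data and then run
"INSERT INTO TABLE FINAL SELECT * FROM TEMP_TABLE", which will copy the CSV data into the ORC table.
With this method, Hive converts the CSV data into ORC using its built-in libraries.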