Migrating data from a Hive PARQUET table to BigQuery: Hive STRING data type is converted to BYTES in BigQuery

I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in PARQUET file format, and one column has the data type STRING. I upload the file behind the Hive table to Google Cloud Storage and create a BigQuery internal table from it using the GUI. In the imported table, the data type of that column is converted to BYTES.
However, when I imported a CHAR or VARCHAR column, the resulting data type was STRING.
Could someone please help explain why this is happening?

This does not answer the original question, as I do not know exactly what happened, but I have had experience with similar odd behavior.
I faced a similar issue when trying to move a table from Cloudera to BigQuery.
I first created the table as an external table in Impala:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with the STRING data type.
I then transferred the files to GS and imported them into BigQuery from the console GUI. There are not many options: just select the Parquet format and point to GS.
To my surprise, the columns were now of type BYTES. The column names were preserved, but the content was scrambled.
Trying different codecs, or pre-creating the table and inserting into it, still in Impala, changed nothing.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating an external table in Hive:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
I then repeated the same dance of copying from S3 to GS and importing into BQ, this time without any issue. The columns are now recognized in BQ as STRING and the data is as it should be.
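One plausible explanation for the Impala/Hive difference, not confirmed anywhere in this thread: Parquet stores strings as BYTE_ARRAY and marks them as text with a UTF8 (STRING) annotation. By default Impala writes STRING columns as unannotated BYTE_ARRAY (its PARQUET_ANNOTATE_STRINGS_UTF8 query option changes that), and BigQuery loads an unannotated BYTE_ARRAY as BYTES. The sketch below is a simplified model of just that one rule. The function name and the two example schemas are hypothetical, and nothing here reads a real Parquet file.

```python
from typing import Dict, Optional


def bigquery_type_for_byte_array(annotation: Optional[str]) -> str:
    """Return the BigQuery type for a Parquet BYTE_ARRAY column.

    Simplified model covering only the two cases discussed here:
    an unannotated BYTE_ARRAY and one annotated as UTF8/STRING.
    """
    if annotation in ("UTF8", "STRING"):
        return "STRING"
    if annotation is None:
        return "BYTES"
    raise ValueError(f"annotation not covered by this sketch: {annotation}")


# Hypothetical column annotations for the two files described above.
impala_file: Dict[str, Optional[str]] = {"col1": None, "col2": None}
hive_file: Dict[str, Optional[str]] = {"col1": "UTF8", "col2": "UTF8"}

for writer, columns in (("impala", impala_file), ("hive", hive_file)):
    mapped = {name: bigquery_type_for_byte_array(a) for name, a in columns.items()}
    print(writer, mapped)
```

If this is what happened, inspecting the Parquet footer of the Impala-written file should show the string columns without a UTF8 annotation.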

Related

Why does loading Parquet files into BigQuery give me gibberish values in the table?

When I load Parquet files into a BigQuery table, the stored values are weird. It seems to be related to the encoding of BYTES fields or something similar.
Here's the format of the created fields
So when I read the table with the fields cast, I get readable values.
I found the solution here
My question is: why is BigQuery behaving like this?
According to the GCP documentation, some Parquet data types can be converted into more than one BigQuery data type. A workaround is to annotate the Parquet column with the logical type you want BigQuery to use.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
Another way is to pass the schema to the bq load command:
bq load --source_format=PARQUET --noreplace --noautodetect --parquet_enum_as_string=true --decimal_target_types=STRING [project]:[dataset].[tables] gs://[bucket]/[file].parquet Column_name:Data_type
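As for the "gibberish" in the question: it is consistent with how BYTES values are displayed. BigQuery renders BYTES as base64 text in the console and in JSON output, so a text column loaded as BYTES looks scrambled until it is cast. A minimal sketch, assuming the column really holds UTF-8 text; the helper name and the sample value are made up for illustration.

```python
import base64


def decode_bytes_cell(cell: str) -> str:
    """Decode one base64-rendered BYTES value back into UTF-8 text."""
    return base64.b64decode(cell).decode("utf-8")


# Build a sample the way a BYTES cell would be rendered, then decode it.
sample = base64.b64encode("hello".encode("utf-8")).decode("ascii")
print(sample)                     # the "gibberish" form of the value
print(decode_bytes_cell(sample))  # the readable text
```

Inside BigQuery itself the equivalent is a cast such as CAST(col AS STRING) or SAFE_CONVERT_BYTES_TO_STRING(col), which matches the "cast fields" approach the question describes.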

Creating external hive table in databricks

I am using Databricks Community Edition.
I am using a Hive query to create an external table. The query runs without any error, but the table is not populated from the file specified in the query.
Any help would be appreciated.
From the official docs: make sure your S3/storage location path and schema (with respect to the file format [TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM]) are correct.
DROP TABLE IF EXISTS <example-table> // deletes the metadata
dbutils.fs.rm("<your-s3-path>", true) // deletes the data
CREATE TABLE <example-table>
USING org.apache.spark.sql.parquet
OPTIONS (PATH "<your-s3-path>")
AS SELECT <your-sql-query-here>
// alternative
CREATE TABLE <table-name> (id long, date string) USING PARQUET LOCATION "<storage-location>"

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns (id, name, position),
then I stored the data in S3 in ORC format using Spark.
When I query select * from person, it returns everything.
But when I query from Presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I found the answer to the problem: when I stored the data in S3, the data inside the file had one more column that was not defined in the Hive metastore table.
So when Presto tried to query the data, it found a varchar where it expected an integer.
This might also happen if a record has a type different from what is defined in the metastore.
I had to delete my data and import it again without that extra, unneeded column.
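To see why one extra column can surface as a type error: as far as I know, Presto's Hive connector matches ORC columns by position by default rather than by name, so an extra column in the file shifts every column after it. Below is a rough sketch of that positional comparison. The column types and the name of the extra column are assumptions (the thread only gives the names id, name, position), and the function is hypothetical, not part of Presto or Hive.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (column name, type name)


def positional_mismatches(table_cols: List[Column], file_cols: List[Column]) -> List[str]:
    """Compare metastore columns with file columns position by position."""
    problems = []
    for i, (file_name, file_type) in enumerate(file_cols):
        if i >= len(table_cols):
            problems.append(f"position {i}: file column '{file_name}' has no table column")
            continue
        table_name, table_type = table_cols[i]
        if table_type != file_type:
            problems.append(
                f"position {i}: table '{table_name}' is {table_type}, "
                f"file '{file_name}' is {file_type}"
            )
    return problems


# Assumed types; 'extra' stands in for the undeclared column in the file.
table_cols = [("id", "int"), ("name", "varchar"), ("position", "int")]
file_cols = [("id", "int"), ("extra", "varchar"), ("name", "varchar"), ("position", "int")]

for problem in positional_mismatches(table_cols, file_cols):
    print(problem)
```

With these assumed schemas, the table's integer column lines up with a varchar column in the file, which is the kind of mismatch the error message points at.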

Is it possible to load only selected columns from Avro file to Hive?

I have a requirement to load an Avro file into Hive. I am using the following to create the table:
create external table tblName stored as avro location 'hdfs://host/pathToData' tblproperties ('avro.schema.url'='/hdfsPathTo/schema.avsc');
I am getting the error FOUND NULL, EXPECTED STRING when doing a select on the table. Is it possible to load only a few columns and find which column's data is causing this error?
You first need to create a Hive external table pointing to the location of your AVRO files, using the AvroSerDe format.
At this stage, nothing is loaded. The external table is just a mask over the files.
Then you can create an internal Hive table and load the data (only the expected columns) from the external one.
If you already have the AVRO file, load it into an HDFS directory of your choice. Next, create an external table on top of that directory.
CREATE EXTERNAL TABLE external_table_name(col1 string, col2 string, col3 string ) STORED AS AVRO LOCATION '<HDFS location>';
Next, create an internal Hive table from the external table to load the data. A CTAS statement takes its column names and types from the SELECT, so the target table does not declare a column list:
CREATE TABLE internal_table_name AS SELECT col2, col3 FROM external_table_name;
You can schedule the internal table load with a batch script in any scripting language or tool.
Hope this helps :)
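On the second part of the question (finding which column triggers FOUND NULL, EXPECTED STRING): the error usually means a field is declared as a plain "string" in the schema while some record holds null for it. A minimal sketch of how to look for it, assuming the records have already been decoded into dictionaries by some other tool. The schema, the records, and the function names below are made up for illustration.

```python
from typing import Any, Dict, List


def is_nullable(field_type: Any) -> bool:
    """An Avro field accepts null if its type is "null" or a union containing "null"."""
    if isinstance(field_type, list):
        return "null" in field_type
    return field_type == "null"


def null_violations(schema: Dict[str, Any], records: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count, per non-nullable field, the records where the value is None or missing."""
    counts: Dict[str, int] = {}
    for field in schema["fields"]:
        if is_nullable(field["type"]):
            continue
        hits = sum(1 for record in records if record.get(field["name"]) is None)
        if hits:
            counts[field["name"]] = hits
    return counts


# Hypothetical schema: col3 is a nullable union, col1 and col2 are not.
schema = {
    "type": "record",
    "name": "example",
    "fields": [
        {"name": "col1", "type": "string"},
        {"name": "col2", "type": "string"},
        {"name": "col3", "type": ["null", "string"]},
    ],
}
records = [
    {"col1": "a", "col2": "b", "col3": "c"},
    {"col1": "d", "col2": None, "col3": None},
]

print(null_violations(schema, records))
```

Any field this reports would need to be declared as a union such as ["null", "string"] in the .avsc file before Hive can read the nulls.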

Getting ClassCastException in Hive ORC table

Using Cloudera 8.1. In Hive, I loaded a CSV file into a table stored in ORC format. I get this error when attempting to query the loaded table:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.IntWritable
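This is a common mistake I see lots of people make. Loading a CSV file directly into an ORC table (for example with LOAD DATA) only moves the file into the table's directory; it does not convert CSV to ORC.
Instead, create a Hive external table in CSV format over the file and then run
"INSERT INTO TABLE FINAL SELECT * FROM TEMP_TABLE", which copies the CSV data into the ORC table.
With this method, Hive converts the CSV data into ORC using its built-in libraries.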