Why does loading Parquet files into BigQuery give me gibberish values in the table? - sql

When I load Parquet files into a BigQuery table, the stored values are weird. It seems to be related to the encoding of BYTES fields, or something similar.
Here's the format of the created fields
So when I read the table with the fields cast, I get the readable values.
I found the solution here
My question is: why is BigQuery behaving like this?

According to this GCP documentation, some Parquet data types can be converted into multiple BigQuery data types. A workaround is to annotate the Parquet column with the data type you want BigQuery to use.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
Another way is to pass the schema to the bq load command:
bq load --source_format=PARQUET --noreplace --noautodetect --parquet_enum_as_string=true --decimal_target_types=STRING [project]:[dataset].[tables] gs://[bucket]/[file].parquet Column_name:Data_type
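As for the "why": as far as I understand the BigQuery Parquet documentation, a BYTE_ARRAY column without a STRING (UTF8) annotation is loaded as BYTES, and BYTES values are typically displayed base64-encoded, which looks like gibberish. Here is a minimal Python sketch of that effect. The value "hello world" and both helper names are made up for illustration; this is not BigQuery code.

```python
import base64


def as_shown_for_bytes(raw: bytes) -> str:
    """Render raw bytes as base64 text, the way a BYTES value is typically displayed."""
    return base64.b64encode(raw).decode("ascii")


def cast_to_string(raw: bytes) -> str:
    """Interpret the same bytes as UTF-8 text, analogous to CAST(col AS STRING)."""
    return raw.decode("utf-8")


# Illustrative value: the bytes a Parquet file would hold for this text.
stored = "hello world".encode("utf-8")
displayed = as_shown_for_bytes(stored)
recovered = cast_to_string(stored)

print(displayed)
print(recovered)
```

The data itself is intact in both cases. Only the declared type, and therefore the way the value is rendered, differs, which is why casting the fields gives back the readable values.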

Related

bq load command to load a Parquet file from GCS to BigQuery when column names start with a number

I am loading a Parquet file into BigQuery using the bq load command. My Parquet file contains column names that start with a number (e.g. 00_abc, 01_xyz). Since BigQuery doesn't support column names that start with a number, I have created the columns in BigQuery as _00_abc, _01_xyz.
But I am unable to load the Parquet file into BigQuery using the bq load command.
Is there any way to tell bq load that the source column 00_abc (from the Parquet file) should be loaded into the target column _00_abc (in BigQuery)?
Thanks in advance.
Regards,
Gouranga Basak
It's general best practice to not start a Parquet column name with a number. You will experience compatibility issues with more than just bq load. For example, many Parquet readers use the parquet-avro library, and Avro's documentation says:
The name portion of a fullname, record field names, and enum symbols must:
start with [A-Za-z_]
subsequently contain only [A-Za-z0-9_]
The solution here is to rename the column in the Parquet file. Depending on how much control you have over the Parquet file's creation, you may need to write a Cloud Function to rename the columns (Pandas Dataframes won't complain about your column names).
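A rough sketch of the renaming step in plain Python, assuming you can get the list of column names before the file is written. The helper names `safe_column_name` and `rename_map` are hypothetical, and the sketch does not handle two columns collapsing to the same name.

```python
import re

# Avro name rule quoted above: start with [A-Za-z_], then only [A-Za-z0-9_].
AVRO_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def safe_column_name(name: str) -> str:
    """Return a name that satisfies the Avro rule (hypothetical helper)."""
    # Replace every character outside the allowed set with an underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # Prefix an underscore when the name is empty or starts with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def rename_map(columns):
    """Map each original column name to its safe replacement."""
    return {col: safe_column_name(col) for col in columns}


mapping = rename_map(["00_abc", "01_xyz", "name"])
print(mapping)
```

With pandas, the resulting dictionary could be passed as `df.rename(columns=rename_map(df.columns))` before the Parquet file is written again.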

Migrating data from Hive PARQUET table to BigQuery, Hive String data type is getting converted in BQ - BYTES datatype

I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in Parquet file format. The data type of one column is STRING. I upload the file behind the Hive table to Google Cloud Storage and create a BigQuery internal table from it with the GUI. The data type of the column in the imported table gets converted to BYTES.
But when I imported a CHAR or VARCHAR column, the resulting data type was STRING.
Could someone please explain why this is happening?
This does not answer the original question, since I do not know exactly what happened, but I have had experience with similar odd behavior.
I was facing a similar issue when trying to move a table between Cloudera and BigQuery.
First I created the table as external in Impala, like this:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with STRING datatype
Then I transferred that to GS and imported it into BigQuery from the console GUI. There are not many options: just select the Parquet format and point to GS.
To my surprise, the columns were now of type BYTES. The column names were preserved fine, but the content was scrambled.
Trying different codecs, or pre-creating the table and inserting into it, still in Impala, led to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating an external table in Hive like this:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
And repeated the same dance with copying from S3 to GS and importing in BQ - this time without any issue. Columns are now recognized in BQ as STRING and data is as it should be.

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning in order to test the table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when the Avro schema type is converted to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and the schema matches the one you're copying from you should be able to use CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() of the original table. Although I suggest using BigQuery's TableCopy command instead.
Q2: That's expected. BigQuery has only one floating-point type, FLOAT (a 64-bit double), so there is no separate Double type and both Avro types map to it. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Also, Logical Types will soon be supported as well: https://issuetracker.google.com/issues/35905894.
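To illustrate the Q2 point, here is a small Python sketch. The mapping is my reading of the primitive-type rows in the conversion table linked above, so treat it as an assumption rather than an authoritative list. The `widen_float32` helper is hypothetical and only shows that a 32-bit value fits in a 64-bit float without further loss, while the original width is no longer recorded.

```python
import struct

# Avro primitive type -> BigQuery type (assumed from the linked conversion table).
AVRO_TO_BIGQUERY = {
    "boolean": "BOOLEAN",
    "int": "INTEGER",
    "long": "INTEGER",
    "float": "FLOAT",
    "double": "FLOAT",
    "bytes": "BYTES",
    "string": "STRING",
}


def widen_float32(value: float) -> float:
    """Round-trip a value through 32-bit storage, then hold it as a 64-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Both Avro floating-point types land on the same BigQuery type.
same_target = AVRO_TO_BIGQUERY["float"] == AVRO_TO_BIGQUERY["double"]
print(same_target)

# 0.5 is exact in 32 bits; 0.1 is not, and the 32-bit rounding stays visible.
print(widen_float32(0.5), widen_float32(0.1))
```

So the values survive the conversion, but once loaded you can no longer tell from the BigQuery schema whether a column started out as an Avro float or double.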

ORC file format

I am new to Hive. Could you please answer the questions below?
Why do we need base table while loading the data in ORC?
Can't we directly create table as ORC and load data in it?
1. Why do we need base table while loading the data in ORC?
We need the base table because most of the time the data arrives as a text file, i.e. CSV, TXT, DAT, or any other delimited format that we can open to see the content. The ORC format, by contrast, stores data in its own optimized row and column layout.
Hence we need a base table. What happens in that case is this: we create a table with the TEXTFILE format, select the data from it, and write it into the ORC table.
2. Can't we directly create table as ORC and load data in it?
Yes, you can load the data into ORC file directly.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
Usually, if you don't define a file format, Hive uses TEXTFILE by default.
The need for a base table arises because, when you create a Hive table in ORC format and then try to load data using the command:
load data inpath '' ..
it simply moves the data from one location to another.
A Hive ORC table won't understand a text file. That's where the SerDe comes into the picture. You define the SerDe when creating the table.
So for an operation like:
1. select * (read)
2. insert into (write)
the SerDe will serialize and deserialize the data, converting it to and from ORC and mapping it to Hive columns.
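The staging pattern described in both answers can be sketched as a small Python helper that builds the HiveQL statements. The table name, columns, delimiter, and path are made-up examples, and `staging_to_orc_statements` is a hypothetical helper, not part of Hive.

```python
def staging_to_orc_statements(table, columns, delimiter, input_path):
    """Build HiveQL for the text-staging-then-ORC pattern (hypothetical helper)."""
    staging = f"{table}_staging"
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return [
        # 1. Text-format base table whose layout matches the incoming file.
        f"CREATE TABLE {staging} ({column_list}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"STORED AS TEXTFILE",
        # 2. LOAD DATA only moves the file; no format conversion happens here.
        f"LOAD DATA INPATH '{input_path}' INTO TABLE {staging}",
        # 3. ORC table with the same columns.
        f"CREATE TABLE {table} ({column_list}) STORED AS ORC",
        # 4. INSERT ... SELECT reads the text rows and writes them as ORC.
        f"INSERT INTO TABLE {table} SELECT * FROM {staging}",
    ]


statements = staging_to_orc_statements(
    "sales", [("id", "INT"), ("city", "STRING")], ",", "/tmp/sales.csv"
)
for statement in statements:
    print(statement + ";")
```

The conversion to ORC happens only in the last statement, because that is the step where Hive reads rows through the text table and writes them through the ORC table.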

Hive ORC File Format

When we create an ORC table in hive we can see that the data is compressed and not exactly readable in HDFS. So how is Hive able to convert that compressed data into readable format which is shown to us when we fire a simple select * query to that table?
Thanks for suggestions!!
By using the ORC SerDe when creating the table. You have to provide the package name of the SerDe class.
ROW FORMAT ''.
What a SerDe does is deserialize data in a particular format into objects that Hive can process, and serialize those objects to store them back in HDFS.
Hive uses a “SerDe” (Serializer/Deserializer) to do that. When you create a table you specify the file format; in your case it’s ORC (“STORED AS ORC”). Hive uses the ORC library (JAR file) internally to convert the data into a readable format. To learn more about Hive internals, search for “Hive SerDe” and you will see how the data is converted to objects and vice versa.