Apache Hudi schema evolution

Can anyone share the right approach for handling schema changes in Apache Hudi? Example: renaming a column from col1 to col2, or changing a data type from long to int.
(PySpark)

Hudi follows Avro schema compatibility rules, so it does not support renaming a column or narrowing a type from long to int (widening from int to long is supported). There are ongoing discussions in the community about supporting full schema evolution.
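Not from the original answer, but here is a minimal PySpark sketch of what that means in practice. Table name, base path, record key, and precombine field are all assumptions, and format("hudi") assumes a recent Hudi Spark bundle is on the classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("hudi-schema-evolution-demo").getOrCreate()

    # Assumed table name, base path, and key/precombine fields.
    base_path = "/tmp/hudi/demo_tbl"
    hudi_options = {
        "hoodie.table.name": "demo_tbl",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
    }

    # Initial write: col1 stored as an int.
    df1 = (
        spark.createDataFrame([(1, 100, 10)], ["id", "ts", "col1"])
        .withColumn("col1", F.col("col1").cast("int"))
    )
    df1.write.format("hudi").options(**hudi_options).mode("overwrite").save(base_path)

    # Widening int -> long is backward compatible under Avro rules, so this append works.
    df2 = spark.createDataFrame([(2, 200, 2**40)], ["id", "ts", "col1"])  # col1 is long here
    df2.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

    # Renaming col1 -> col2, or narrowing long -> int, is not Avro compatible and is
    # expected to fail the writer's schema validation on append.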

Related

BQ STRUCT versus RECORD types?

Maybe a very trivial question.
What is the actual difference between the STRUCT and RECORD types in GCP BigQuery? Can I use them interchangeably? If I have a table created with a column defined as STRUCT, will it show a "schema" mismatch if I try to re-run a Terraform script with the field type changed to RECORD?
I believe they are mostly the same thing, or you may view them as the same concept in different components of BigQuery.
For historical reasons, the Legacy SQL and storage documentation talks mostly about RECORD, while the Standard SQL dialect uses STRUCT.
A column created with Standard SQL DDL as STRUCT will appear as RECORD in the storage UI, and a Terraform script using RECORD should be compatible.
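As an illustration (not from the original answer), a short sketch with the google-cloud-bigquery Python client, using hypothetical project, dataset, table, and field names; the API itself declares the nested column as RECORD, which is what a STRUCT column in Standard SQL DDL maps to:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table; the nested column is declared as RECORD.
    schema = [
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField(
            "address",
            "RECORD",
            fields=[
                bigquery.SchemaField("city", "STRING"),
                bigquery.SchemaField("zip", "STRING"),
            ],
        ),
    ]
    table = bigquery.Table("my-project.my_dataset.my_table", schema=schema)
    client.create_table(table)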

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS into BigQuery, with the schema auto-generated from the file in GCS. I'm using Apache Airflow to do this. The problem I'm having is that when I use schema auto-detection, BigQuery infers the schema from roughly the first 100 values.
For example, in my case there is a column, say X, whose values are mostly of integer type, but some values are strings, so bq load fails with a schema mismatch; in such a scenario we need to change the data type to STRING.
What I could do is manually create a new table by generating the schema on my own, or set max_bad_records to something like 50, but that doesn't seem like a good solution. An ideal solution would be like this:
Try to load the file from GCS into BigQuery; if the table is created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise, I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BQ (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds.
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script.
Use one line that best suits your case, with the optimized field types.
This will create the table with the correct schema and one row, and from there you can load the rest of the data.
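Not part of the original answer, but since the question mentions Airflow, here is a minimal sketch of the same idea with the google-cloud-bigquery Python client: create the table from one carefully chosen row via auto-detection, then append the rest. Bucket, file, and table names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"  # hypothetical

    # Step 1: load a single representative row (with the problematic column
    # holding a string value) using schema auto-detection, so the table is
    # created with the desired types.
    seed_job = client.load_table_from_uri(
        "gs://my-bucket/seed_one_row.csv",  # hypothetical one-row seed file
        table_id,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            autodetect=True,
        ),
    )
    seed_job.result()

    # Step 2: append the rest of the data against the now-fixed schema.
    rest_job = client.load_table_from_uri(
        "gs://my-bucket/full_data.csv",  # hypothetical full file
        table_id,
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        ),
    )
    rest_job.result()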

what are the data types allowed with the sqoop option "--map-column-java"?

I want to use sqoop import to import data from SQL Server. However, I am facing some data type conversion issues, and I want to use "--map-column-java" to solve them.
Just in case anybody wants to suggest "--map-column-hive": I can't use it because I am importing with "--as-parquetfile", so I have to cast the column data types before they are inserted into the file.
So, what are the data types allowed with the sqoop option "--map-column-java"?
P.S.
In particular, I want to know which "datetime" data type works with "--map-column-java".
It's pretty tough to load from a database into Parquet through Sqoop while keeping the source data types. For example, you can't load timestamp because it's not supported.
I suggest the following workaround (a sketch of the cast step follows after the example below):
Load with Sqoop with all the data types as string;
Insert from table 1 (with all the data types as string) into table 2, using CAST (AS TIMESTAMP, AS DECIMAL, etc.);
Example:
--map-column-java "ID=String,NR_CARD=String,TIP_CARD_ID=String,CONT_CURENT_ID=String,AUTORIZ_CONTURI_ID=String,TIP_STARE_ID=String,DATA_STARE=String,COMIS=String,BUGETARI_ID=String,DATA_SOLICITARII=String,DATA_EMITERII=String,DATA_VALABILITATII=String,TIP_DESCOPERIT_ID=String,BRANCH_CODE_EMIT=String,ORG_ID=String,DATA_REGEN=String,FIRMA_ID=String,VOUCHER_BLOC=String,CANAL_CERERE=String,CODE_BUG_OPER=String,CREATED_BY=String,CREATION_DATE=String,LAST_UPDATED_BY=String,LAST_UPDATE_DATE=String,LAST_UPDATE_LOGIN=String,IDPAN=String,MOTIV_STARE_ID=String,DATA_ACTIVARII=String" \
In this way you will have all the data types correctly loaded from the source.
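The second step in the answer is an INSERT ... SELECT with casts (typically done in Hive); as an illustration, a hedged Spark SQL variant of that cast step, with hypothetical table and column names and target types, would look like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # table1 holds the Sqoop import with every column as string (hypothetical names);
    # table2 is the typed target table, assumed to exist with matching columns.
    spark.sql("""
        INSERT INTO table2
        SELECT
            CAST(id            AS BIGINT),
            CAST(comis         AS DECIMAL(18, 2)),
            CAST(data_stare    AS TIMESTAMP),
            CAST(creation_date AS TIMESTAMP)
        FROM table1
    """)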

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning in order to test the table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and the schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. Although I suggest using BigQuery's table copy command instead.
Q2: That's expected; BigQuery does not have a Double type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Also, logical types will soon be supported as well: https://issuetracker.google.com/issues/35905894.
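Not part of the original answer's code, but as an illustration of the table-copy suggestion, a minimal sketch with the google-cloud-bigquery Python client; the table IDs are hypothetical, and the destination is assumed to have been created beforehand with the partitioning you want to test:

    from google.cloud import bigquery

    client = bigquery.Client()

    source = "my-project.my_dataset.source_table"      # hypothetical
    dest = "my-project.my_dataset.partitioned_copy"    # hypothetical, pre-created

    copy_job = client.copy_table(
        source,
        dest,
        job_config=bigquery.CopyJobConfig(
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        ),
    )
    copy_job.result()  # block until the copy finishes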

Autodetect BigQuery schema within Dataflow?

Is it possible to use the equivalent of --autodetect in Dataflow?
I.e., can we load data into a BQ table without specifying a schema, equivalent to how we can load data from a CSV with --autodetect?
(potentially related question)
If you are using protocol buffers as objects in your PCollections (which should perform very well on the Dataflow back-end), you might be able to use a util I wrote in the past. It will parse the protobuf schema into a BigQuery schema at runtime, based on inspection of the protobuf descriptor.
I quickly uploaded it to GitHub; it's WIP, but you might be able to use it or be inspired to write something similar using Java reflection (I might do it myself at some point).
You can use the util as follows:
TableSchema schema = ProtobufUtils.makeTableSchema(ProtobufClass.getDescriptor());
enhanced_events.apply(BigQueryIO.Write.to(tableToWrite).withSchema(schema)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
where the create disposition will create the table with the specified schema, and ProtobufClass is the class generated from your Protobuf schema by the proto compiler.
I'm not sure about reading from BQ, but for writes I think something like this will work on the latest Java SDK.
.apply("WriteBigQuery", BigQueryIO.Write
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
.to(outputTableName));
Note: BigQuery Table must be of the form: <project_name>:<dataset_name>.<table_name>.