Spark createDataFrame cannot infer schema - default data types? - apache-spark-sql

I am creating a Spark DataFrame in Databricks using createDataFrame and getting the error:
'Some of types cannot be determined after inferring'
I know I can specify the schema, but that does not help because I create the DataFrame each time from source data pulled from an API, and the API owners may restructure it.
Instead I would like to tell Spark to use 'string' for any column where a data type cannot be inferred.
Is this possible?

This can be handled easily with schema evolution in the Delta format. Quick ref: https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
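A minimal sketch of both pieces, assuming PySpark on Databricks and a purely illustrative Delta table path: an explicit all-string schema avoids the inference error, and Delta's mergeSchema option absorbs new columns when the API payload is restructured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical API payload; all-None columns are what normally trigger
# "Some of types cannot be determined after inferring".
rows = [("1", "widget", None), ("2", "gadget", None)]

# Explicit all-string schema (DDL string form), so nothing needs to be inferred.
df = spark.createDataFrame(rows, schema="id string, name string, extra string")

# Delta schema evolution: mergeSchema lets appends add columns that the
# existing table does not have yet, instead of failing on a mismatch.
(df.write
   .format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save("/mnt/delta/api_events"))   # path is hypothetical
```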

Related

How can I load data into Snowflake from S3 whilst specifying data types

I'm aware that it's possible to load data from files in S3 (e.g. csv, parquet or json) into Snowflake by creating an external stage with a csv file format type and then loading it into a table with 1 column of type VARIANT. But this requires a manual step to cast the data into the correct types and create a view that can be used for analysis.
Is there a way to automate this loading process from S3 so that the table column data types are either inferred from the CSV file or specified by some other means? (similar to how a table can be created in Google BigQuery from csv files in GCS with an inferred table schema)
As of today, the single VARIANT column solution you are adopting is the closest you can get with Snowflake's out-of-the-box tools to achieve your goal which, as I understand your question, is to let the loading process infer the source file structure.
In fact, the COPY command needs to know the structure of the file it is going to load data from, and it gets that through FILE_FORMAT.
More details: https://docs.snowflake.com/en/user-guide/data-load-s3-copy.html#loading-your-data
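For illustration, a sketch of that flow with the Snowflake Python connector: load the staged files into a single VARIANT column, then do the "manual" typing step as a view over casts. The stage, table, field names, and connection parameters are all hypothetical.

```python
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="me", password="...",
    warehouse="wh", database="db", schema="public",
)
cur = conn.cursor()

# One VARIANT column holds the semi-structured rows as loaded from the S3 stage.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (v VARIANT)")

# COPY needs FILE_FORMAT so it knows how to parse the staged files.
cur.execute("""
    COPY INTO raw_events
    FROM @my_s3_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# The "manual" typing step: cast VARIANT paths into columns via a view.
cur.execute("""
    CREATE OR REPLACE VIEW events AS
    SELECT v:id::NUMBER            AS id,
           v:created_at::TIMESTAMP_NTZ AS created_at,
           v:payload:amount::FLOAT AS amount
    FROM raw_events
""")
```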

BQ STRUCT versus RECORD types?

This may be a very trivial question.
What is the actual difference between the STRUCT and RECORD types in GCP BigQuery? Can I use them interchangeably? If I have a table created with a column defined as STRUCT, will it show a "schema" mismatch if I try to re-run a Terraform script with the field type changed to RECORD?
I believe they are mostly the same thing, or you may view them as the same concept in different components of BigQuery.
For historical reasons the Legacy SQL and storage documentation talk mostly about RECORD, while the Standard SQL dialect uses STRUCT.
A column created with Standard SQL DDL as STRUCT will appear as RECORD in the storage UI, and a Terraform script using RECORD should be compatible.
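A small sketch with the BigQuery Python client illustrating the point: the DDL spells the nested type STRUCT, while reading the table metadata back reports it as RECORD. The dataset and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Standard SQL DDL spells the nested type STRUCT ...
client.query("""
    CREATE TABLE IF NOT EXISTS my_dataset.people (
        name STRING,
        address STRUCT<city STRING, zip STRING>
    )
""").result()

# ... while the table metadata reports the same column as RECORD.
table = client.get_table("my_dataset.people")
for field in table.schema:
    print(field.name, field.field_type)   # address -> RECORD
```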

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS to BigQuery with the schema auto-detected from the file in GCS. I'm using Apache Airflow to do this; the problem I'm having is that when I use schema auto-detection, BigQuery builds the schema from only the first ~100 values.
For example, in my case there is a column, say X, whose values are mostly of Integer type, but some values are of String type, so bq load fails with a schema mismatch; in such a scenario the column's data type needs to be changed to STRING.
What I could do is manually create a new table by generating the schema on my own, or set the max_bad_records value to some 50, but neither seems like a good solution. An ideal solution would be like this:
Try to load the file from GCS to BigQuery; if the table is created successfully in BQ without any data mismatch, I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BQ (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script
Use a single line that best suits your case, with the desired field types.
This will create the table with the correct schema and one row, and from there you can load the rest of the data.
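A sketch of the same idea with the BigQuery Python client, shown here as an explicit schema rather than a one-row seed load: declare the problematic column X as STRING, turn autodetect off, and append the data. The project, bucket, and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"

# Explicit schema: column X is STRING so mixed integer/string values load cleanly.
schema = [
    bigquery.SchemaField("X", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
]

job_config = bigquery.LoadJobConfig(
    schema=schema,
    autodetect=False,
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events_*.csv", table_id, job_config=job_config
)
load_job.result()  # wait for the load to finish
```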

What data types are allowed with the sqoop option "--map-column-java"?

I want to use sqoop import to import data from SQL Server; however, I am facing some data type conversion issues and want to use "--map-column-java" to solve them.
Just in case anybody wants to suggest "--map-column-hive": I can't use it because I am importing with "--as-parquetfile", so I have to cast the column data types before they are inserted into the file.
So, what data types are allowed with the sqoop option "--map-column-java"?
P.S.
In particular, I want to know which "datetime" data type works with "--map-column-java".
It's pretty tough to load from a database into Parquet through Sqoop while keeping the source data types. For example, you can't load timestamp because it's not supported.
I suggest the following workaround:
Load with Sqoop with all the data types as string;
Insert from table 1 (with all the data types as string) into table 2, using casts (as timestamp, as decimal, etc.);
Example:
--map-column-java "ID=String,NR_CARD=String,TIP_CARD_ID=String,CONT_CURENT_ID=String,AUTORIZ_CONTURI_ID=String,TIP_STARE_ID=String,DATA_STARE=String,COMIS=String,BUGETARI_ID=String,DATA_SOLICITARII=String,DATA_EMITERII=String,DATA_VALABILITATII=String,TIP_DESCOPERIT_ID=String,BRANCH_CODE_EMIT=String,ORG_ID=String,DATA_REGEN=String,FIRMA_ID=String,VOUCHER_BLOC=String,CANAL_CERERE=String,CODE_BUG_OPER=String,CREATED_BY=String,CREATION_DATE=String,LAST_UPDATED_BY=String,LAST_UPDATE_DATE=String,LAST_UPDATE_LOGIN=String,IDPAN=String,MOTIV_STARE_ID=String,DATA_ACTIVARII=String" \
In this way you will have all the data types loaded correctly from the source.
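For step 2, a sketch of the cast from the all-string table into the typed one; the answer implies a Hive-style INSERT ... SELECT with casts, and this version does the equivalent in PySpark under the assumption that the Sqoop output is a Parquet directory. Paths and target types are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Table 1: everything imported by Sqoop as string (path is hypothetical).
raw = spark.read.parquet("/data/staging/cards_all_strings")

# Table 2: cast the string columns to their real types before writing.
typed = raw.select(
    col("ID").cast("decimal(38,0)").alias("ID"),
    col("NR_CARD"),                                        # stays a string
    col("DATA_STARE").cast("timestamp").alias("DATA_STARE"),
    col("COMIS").cast("decimal(18,2)").alias("COMIS"),
    # ... repeat for the remaining columns
)

typed.write.mode("overwrite").parquet("/data/final/cards_typed")
```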

Writing Avro to BigQuery using Beam

Q1: Say I load Avro-encoded data using the BigQuery load tool. Now I need to write this data to a different table, still in Avro format. I am trying out different partitioning in order to test the table performance. How do I write SchemaAndRecord back to BigQuery using Beam? Also, would schema detection work in this case?
Q2: It looks like schema information is lost when converting from the Avro schema type to the BigQuery schema type. For example, both the double and float Avro types are converted to the FLOAT type in BigQuery. Is this expected?
Q1: If the table already exists and the schema matches the one you're copying from, you should be able to use the CREATE_NEVER CreateDisposition (https://cloud.google.com/dataflow/model/bigquery-io#writing-to-bigquery) and just write the TableRows directly from the output of readTableRows() on the original table. That said, I suggest using BigQuery's table copy command instead.
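For reference, a sketch of the CREATE_NEVER route with the Beam Python SDK (the question uses the Java SDK, where BigQueryIO's readTableRows()/writeTableRows() play the same roles); the table names are hypothetical and the destination is assumed to already exist with a matching schema.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical source and destination tables; the destination is assumed
# to exist already with a matching schema, so CREATE_NEVER is safe.
SRC = "my-project:my_dataset.source_table"
DST = "my-project:my_dataset.partitioned_copy"

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromBigQuery(table=SRC)
        | "Write" >> beam.io.WriteToBigQuery(
            DST,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```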
Q2: That's expected; BigQuery does not have a Double type. You can find more information on the type mapping here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions. Logical types will also be supported soon: https://issuetracker.google.com/issues/35905894.