Getting a Databricks drop schema error for delta table - sql

I have a Delta table whose schema needs new columns and changed data types. (I usually do this on non-Delta tables, and that works fine.)
I have already dropped the existing Delta table, but when I try to drop the schema I get a 'v1 session catalog' error.
I am currently using SQL on a 10.4 LTS cluster with Spark 3.2.1 and Scala 2.12 (I can't change the compute); the driver and workers are Standard E_v4.
What I already did, which worked as usual:
drop table if exists dbname.tablename;
What I wanted to do next:
drop schema if exists dbname.tablename;
The error I got instead:
Error in SQL statement: AnalysisException: Nested databases are not supported by v1 session catalog: dbname.tablename
When I try recreating the schema in the same location I get the error:
AnalysisException: The specified schema does not match the existing schema at dbfs:locationOfMy/table
... Differences
-Specified schema has additional fields newColNameIAdded, anotherNewColIAdded
-Specified type for myOldCol is different from existing schema ...
If your intention is to keep the existing schema, you can omit the
schema from the create table command. Otherwise please ensure that
the schema matches.
How can I drop the schema and re-register it at the same location, under the same name, with the new definitions?

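Answering a month later, since I didn't get any replies and found the right solution.
The 'nested databases' error appears because DROP SCHEMA expects a database name, and dbname.tablename is read as a database nested inside another one. The real problem was that the Delta files leave behind partitions and logs at the table location, and the drop commands do not remove them. I had to delete them manually, depending on where my location was.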

Try this:
dbutils.fs.rm(path, True)
Use the path of your schema.
Then create your table again.
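A minimal sketch of the whole sequence in a Databricks Python notebook, assuming the table was created with an explicit location so that dropping it leaves the files behind. The function name recreate_delta_table and its parameters are hypothetical, and spark and dbutils are passed in explicitly instead of relying on the notebook globals. Note that dbutils.fs.rm(path, True) deletes recursively and permanently, so double-check the path before running it.

```python
def recreate_delta_table(spark, dbutils, table_name, location, columns_ddl):
    """Drop a Delta table, wipe its storage location, and re-create it.

    All three string arguments are hypothetical placeholders, e.g.
    "dbname.tablename", the table's storage path, "id INT, name STRING".
    """
    # 1. Remove the metastore entry (the files at `location` stay on disk).
    spark.sql(f"DROP TABLE IF EXISTS {table_name}")
    # 2. Recursively delete the leftover data files and _delta_log directory.
    dbutils.fs.rm(location, True)
    # 3. Re-register the table at the same location with the new definition.
    spark.sql(
        f"CREATE TABLE {table_name} ({columns_ddl}) "
        f"USING DELTA LOCATION '{location}'"
    )
```

In a notebook this would be called as recreate_delta_table(spark, dbutils, "dbname.tablename", path, new_columns), where path and new_columns are your own values.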

Related

ACID table error in Impala? Hive upgraded to Hive 3

I am very new to Hive and Impala.
I was trying to query an existing table in Impala, but I got the following error:
AnalysisException: Table dev_test.customer not supported. Transactional (ACID) tables are only supported when they are configured as insert_only.
The version is Hive 3. I have no idea what to do. I have read some documentation and articles online, but still could not solve the issue. I have attached a screenshot of the error screen. Let me know if you need more information.
Any help is greatly appreciated. Thanks!
Unfortunately, you can't see the data through Impala; you have to use Hive.
As a workaround, you can change the table properties to insert_only to see this data:
alter TABLE tmp2 set
TBLPROPERTIES (
'transactional'='true', 'transactional_properties'='insert_only'
);
When you set a table to full ACID, or Hive upgrades it to full ACID, the table file format changes to ORC, and Impala does not support this, so you cannot access the table from Impala. You need to use Hive to access these tables.
If you choose the workaround and change the table properties, you will lose all ACID benefits such as UPDATE and DELETE.

Run truncate in bigquery with Apache NiFi

I have a process that uses the PutBigQueryBatch processor, in which I would like it to truncate the table before inserting the data. I defined an AVRO schema, and previously created the table in BigQuery specifying how I wanted the fields. I am aware that if I change the "Write Disposition" property to the value "WRITE_TRUNCATE", it will truncate the table. However, when I use this option, the schema of the table in BigQuery ends up being deleted, which I would not like to happen, and a new schema is created to record the data. I understand that the "Create Disposition" property exists, and that if the "CREATE_NEVER" option is selected, the schema should be respected and not deleted.
When I run this processor with the "Write Disposition" property set to "WRITE_APPEND", the schema I created in BigQuery is respected, but with "WRITE_TRUNCATE" it is not.
Is there any way to use the "WRITE_TRUNCATE" option without the table schema being deleted?
Am I doing something wrong?
Below is the configuration that I am using in the PutBigQueryBatch processor:
[image: PutBigQueryBatch processor configuration]
It sounds like what you want is to run a TRUNCATE TABLE query before starting your process: https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#truncate_table_statement
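As a rough sketch of that suggestion outside NiFi, the statement could be issued from a small Python script before the flow runs, with the processor left on "WRITE_APPEND". The function name truncate_table is hypothetical, and client is assumed to behave like a google.cloud.bigquery.Client (only its query(...).result() calls are used); how you trigger this step from your flow is up to your setup.

```python
def truncate_table(client, table_id):
    """Delete all rows of `table_id` while keeping the table and its schema.

    `client` is assumed to behave like google.cloud.bigquery.Client;
    `table_id` is a hypothetical "project.dataset.table" string.
    """
    sql = f"TRUNCATE TABLE `{table_id}`"
    # query() starts the job; result() blocks until it has finished.
    client.query(sql).result()
    return sql
```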

How to rename a database in azure databricks?

I am trying to rename a database in azure databricks but I am getting the following error:
no viable alternative at input 'ALTER DATABASE inventory
Below is the code:
%sql
use inventory;
ALTER DATABASE inventory MODIFY NAME = new_inventory;
Please explain what the error "no viable alternative at input 'ALTER DATABASE inventory" means and how I can solve it.
It's not possible to rename a database on Databricks. If you go to the documentation, you will see that you can only set DBPROPERTIES.
If you really need to rename a database, then you have 2 choices:
If you have unmanaged tables (not created via saveAsTable, etc.), you can produce the SQL using SHOW CREATE TABLE, drop your database (be careful anyway), and recreate all tables from the saved SQL.
If you have managed tables, the solution is to create a new database, and either use CLONE (only for Delta tables) or CREATE TABLE ... AS SELECT for other file types, and after that drop your database.
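For Delta tables, the second option could be sketched roughly as below. The function name clone_database is hypothetical, spark is passed in explicitly, and the sketch assumes every table in the old database is a Delta table (CLONE does not apply to other formats). It deliberately does not drop the old database, so you can verify the copies first.

```python
def clone_database(spark, old_db, new_db):
    """Deep-clone every table of `old_db` into `new_db`; return the table names.

    Assumes all tables in `old_db` are Delta tables.
    """
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {new_db}")
    rows = spark.sql(f"SHOW TABLES IN {old_db}").collect()
    # SHOW TABLES also lists temporary views; skip them.
    names = [row.tableName for row in rows if not row.isTemporary]
    for name in names:
        spark.sql(
            f"CREATE TABLE IF NOT EXISTS {new_db}.{name} "
            f"DEEP CLONE {old_db}.{name}"
        )
    return names
```

Once the new tables look right, the old database can be dropped separately.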
Alex Ott's answer, to use Clone, is OK if you do not need to maintain the versioning history of your database when you rename it.
However if you wish to time travel on the database of Delta tables after the renaming, this solution works:
Create your new database, specifying its location
Move the file system from the old location to the new location
For each table on the old database, create a table on the new database, based on the location (my code relies on the standard file structure of {database name}/{table name} being observed). No need to specify schema as it's just taken from the files in place
Drop old database
You will then be left with a database with your new name, that has all of the data and all of the history of your old database, i.e. a renamed database of Delta tables.
PySpark method (on Databricks, with "spark" and "dbutils" already defined by default):
def rename_db(original_db_name, original_db_location, new_db_name, new_db_location):
    # Create the new database at its own location
    spark.sql(f"create database if not exists {new_db_name} location '{new_db_location}'")
    # Move all files, including the Delta logs, to the new location
    dbutils.fs.mv(original_db_location, new_db_location, True)
    # Register each table in the new database from the files already in place
    tables = [row.tableName for row in spark.sql(f"SHOW TABLES FROM {original_db_name}").select("tableName").collect()]
    for table in tables:
        spark.sql(f"create table {new_db_name}.{table} location '{new_db_location}/{table}'")
    # Drop the old database, then list the tables of the new one
    spark.sql(f"drop database {original_db_name} cascade")
    return spark.sql(f"SHOW TABLES FROM {new_db_name}")

Appending data to a table created from an Avro file in BigQuery

Every morning, an automatic job creates a new table from an Avro file. In the afternoon, I would need to append some data to this table from a Query.
When trying to do so, I get the following error:
Error: Invalid schema update. Field chn has changed mode from REQUIRED to NULLABLE
I noticed that I can change the mode of the field chn from REQUIRED to NULLABLE in the BigQuery Web UI, and then it works fine, but I would have to do it manually every day, which is not what I am looking for.
Is there a way to "cast" the field as REQUIRED during the append query?
Or, during the first import from the Avro file, to force the field to be NULLABLE and not REQUIRED?
Thanks!
The feature that allows relaxing a field as part of a query or a load job will be available in production shortly. I will update this answer when it goes live (likely within a week).
Update: 08/25/2016
You can supply schemaUpdateOptions in load or query job configuration.
Multiple options can be provided.
It allows the schema of the destination table to be updated as a side effect of the load or query job. Schema update options are supported in two cases:
When writeDisposition is WRITE_APPEND
When writeDisposition is WRITE_TRUNCATE and the destination table is a partition of a table, specified by partition decorators
For non-partitioned tables, WRITE_TRUNCATE will always overwrite the schema.
The following values are supported:
ALLOW_FIELD_ADDITION: allow adding a nullable field to the schema
ALLOW_FIELD_RELAXATION: allow relaxing a required field in the original schema to nullable
NOTE: This doesn't currently work with schema auto-detection. We plan to support that soon.
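To make the options concrete, here is a minimal sketch of a query job body for the BigQuery REST API (jobs.insert) that appends to an existing table and allows both kinds of schema update. The helper name build_append_query_job is hypothetical, and the project, dataset, and table values are placeholders you supply; only the configuration keys come from the job configuration described above.

```python
def build_append_query_job(sql, project_id, dataset_id, table_id):
    """Return a jobs.insert request body for a query that appends its result."""
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # Schema update options require WRITE_APPEND here
                # (or WRITE_TRUNCATE on a partition decorator).
                "writeDisposition": "WRITE_APPEND",
                "schemaUpdateOptions": [
                    "ALLOW_FIELD_ADDITION",
                    "ALLOW_FIELD_RELAXATION",
                ],
            }
        }
    }
```

With ALLOW_FIELD_RELAXATION set, an append whose result has chn as NULLABLE should relax the destination field instead of failing with the "changed mode from REQUIRED to NULLABLE" error.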

hive doesn't change parquet schema

I have a problem with ALTER TABLE: it changes the table schema but not the Parquet schema.
For example, I have a Parquet table with these columns:
column1(string) column2(string)
column3(string) column4(string)
column5(bigint)
Now, I try to change the table's schema with
ALTER TABLE name_table DROP COLUMN column3;
With DESCRIBE TABLE I can see that column3 is no longer there.
Now I try to execute select * from table, but I receive an error like this:
"data.0.parq' has an incompatible type with the table schema for column column4. Expected type: INT64. Actual type: BYTE_ARRAY"
The values of the deleted column are still present in the Parquet file, which has 5 columns and not 4 (as in the table schema).
Is this a bug? How can I change the Parquet file's schema using Hive?
This is not a bug. When you drop a column, that only updates the definition in the Hive Metastore, which holds just the information about the table. The underlying files on HDFS remain unchanged. Since the Parquet metadata is embedded in the files, they are unaffected by the metadata change.
Hence you see this issue.
The solution is described here. If you want to add a column(s) to a parquet table and be compatible with both impala and hive, you need to add the column(s) at the end.
If you alter the table and change column names or drop a column, that table will no longer be compatible with impala.
I had the same error after adding a column to a Hive table.
The solution is to set the following query option in each session:
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
If you're using Cloudera distribution, set it permanently in Cloudera Manager => Impala configuration => Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve)
set config value as PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
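To see why resolving by name helps, here is a toy Python model of the example table from the question. It is not Impala's implementation: it only compares the table schema with the file's columns either by position (Impala's default) or by name, using the Parquet physical types BYTE_ARRAY for string and INT64 for bigint. The function name mismatches is hypothetical.

```python
FILE_COLUMNS = [  # columns physically stored in the Parquet file
    ("column1", "BYTE_ARRAY"),
    ("column2", "BYTE_ARRAY"),
    ("column3", "BYTE_ARRAY"),
    ("column4", "BYTE_ARRAY"),
    ("column5", "INT64"),
]
TABLE_COLUMNS = [  # table schema after dropping column3
    ("column1", "BYTE_ARRAY"),
    ("column2", "BYTE_ARRAY"),
    ("column4", "BYTE_ARRAY"),
    ("column5", "INT64"),
]

def mismatches(table_cols, file_cols, by="position"):
    """Return (table column, expected type, actual type) for each conflict."""
    file_types_by_name = dict(file_cols)
    problems = []
    for index, (name, expected) in enumerate(table_cols):
        if by == "name":
            actual = file_types_by_name.get(name)
        else:
            actual = file_cols[index][1] if index < len(file_cols) else None
        if actual is not None and actual != expected:
            problems.append((name, expected, actual))
    return problems

print(mismatches(TABLE_COLUMNS, FILE_COLUMNS, by="position"))
print(mismatches(TABLE_COLUMNS, FILE_COLUMNS, by="name"))
```

By position, the fourth table column (a bigint) lands on the file's fourth column (a string), which gives the same "Expected type: INT64. Actual type: BYTE_ARRAY" conflict as in the question; by name, every table column finds its own data and no conflict remains.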