I have recently started discovering Databricks and faced a situation where I need to drop a certain column of a delta table. When I worked with PostgreSQL it was as easy as
ALTER TABLE main.metrics_table
DROP COLUMN metric_1;
I was looking through Databricks documentation on DELETE but it covers only DELETE the rows that match a predicate.
I've also found docs on DROP database, DROP function and DROP table but absolutely nothing on how to delete a column from a delta table. What am I missing here? Is there a standard way to drop a column from a delta table?
There is no drop column option on Databricks tables: https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#delta-schema-constructs
Remember that unlike a relational database there are physical parquet files in your storage, your "table" is just a schema that has been applied to them.
In the relational world you can update the table metadata to remove a column easily, in a big data world you have to re-write the underlying files.
Technically parquet can handle schema evolution (see Schema evolution in parquet format). But the Databricks implementation of Delta does not. It probably just too complicated to be worth it.
Therefore the solution in this case is to create a new table and insert the columns you want to keep from the old table.
use below code :
df = spark.sql("Select * from <DB Name>.<Table Name>")
df1 = df.drop("<Column Name>")
spark.sql("DROP TABLE if exists <DB Name>.<TableName>_OLD")
spark.sql("ALTER TABLE <DB Name>.<TableName> RENAME TO <DB Name>.<Table Name>_OLD ")
df1.write.format("delta").mode("OVERWRITE").option("overwriteSchema", "true").saveAsTable("<DB Name>.<Table Name>")
One way that I figured out to make that work is to first drop the table and then recreate the table from the dataframe using the overwriteSchema option to true. You also need to use the option of mode = overwrite so that it recreate the physical files using new schema that the dataframe contains.
Break down of the steps :
Read the table in the dataframe.
Drop the columns that you don't want in your final table
Drop the actual table from which you have read the data.
now save the newly created dataframe after dropping the columns as the same table name.
but make sure you use two options at the time of saving the dataframe as table.. (.mode("overwrite").option("overwriteSchema", "true") )
Above steps would help you recreate the same table with the extra column/s removed.
Hope it helps someone facing the similar issue.
Databricks Runtime 10.2+ supports dropping columns if you enable Column Mapping mode
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5',
'delta.columnMapping.mode' = 'name'
)
And then drops will work --
ALTER TABLE table_name DROP COLUMN col_name
ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2, ...)
You can overwrite the table without the column if the table isn't too large.
df = spark.read.table('table')
df = df.drop('col')
df.write.format('delta')\
.option("overwriteSchema", "true")\
.mode('overwrite')\
.saveAsTable('table')
As of Delta Lake 1.2, you can drop columns, see the latest ALTER TABLE docs.
Here's a fully working example if you're interested in a snippet you can run locally:
# create a Delta Lake
columns = ["language","speakers"]
data = [("English", "1.5"), ("Mandarin", "1.1"), ("Hindi", "0.6")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.write.format("delta").saveAsTable("default.my_cool_table")
spark.sql("select * from `my_cool_table`").show()
+--------+--------+
|language|speakers|
+--------+--------+
|Mandarin| 1.1|
| English| 1.5|
| Hindi| 0.6|
+--------+--------+
Here's how to drop the language column:
spark.sql("""ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')""")
spark.sql("alter table `my_cool_table` drop column language")
Verify that the language column isn't included in the table anymore:
spark.sql("select * from `my_cool_table`").show()
+--------+
|speakers|
+--------+
| 1.1|
| 1.5|
| 0.6|
+--------+
It works only if you added your column after creating the table.
If it is so, and if it is possible for you to recover the data inserted after altering your table, you may consider using the table history to restore the table to a previous version.
With
DESCRIBE HISTORY <TABLE_NAME>
you can check all the available versions of your table (operation 'ADD COLUMN' will create a new table version).
Afterwards, with RESTORE it is possible to transform the table to any available state.
RESTORE <TALBE_NAME> VERSION AS OF <VERSION_NUMBER>
Here you have more information about TIME TRAVEL
Related
Given a Delta table:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY part_a, part_b, part_c
LOCATION '/some/path/'
This table already has tons of data. However, the desired schema is:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY part_a, part_b -- <<-- ONLY part_a and part_b for partitions, ie, removing part_c
LOCATION '/some/path/'
How this schema change can be achieved?
I eventually took the following approach:
Backup the original table to be on the safe-side
spark.read.table into memory
df.write.option(“overwriteSchema”, “true”) to the original location
I chose this approach so I don’t need to change the original data location.
In more details:
1. Backup the original table to be on the safe-side
Since this was in on Databricks I could use their proprietary deep clone feature:
create table mydb.mytable_backup_before_schema_migration_v1
deep clone mydb.mytable
location 'dbfs:/mnt/defaultDatalake/backups/zones/mydb/mytable_backup_before_schema_migration_v1'
If you are not in Databricks and don't have access to its deep clone, you still can backup the table by reading and writing a copy to another place.
2. read and 3. overwrite with new schema
val df = spark.read.format("delta").table("mydb.mytable")
df
.write
.format("delta")
.mode("overwrite")
.partitionBy("part_a", "part_b")
.option("overwriteSchema", "true")
.saveAsTable("mydb.mytable") // same table, same location, but different data physical organization because partition changes
From BQ table if we drop a column, the output of "select * from table" query won't show the dropped column.
but if we generate the DDL Script from information schema, we still can see the dropped column available in output script.
try recreate table without this column instead
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_table_statement
https://popsql.com/learn-sql/bigquery/how-to-drop-a-column-in-bigquery
I have created table in delta format and not ingested any data .
just an empty table created and when I try using
DESCRIBE HISTORY table_name
it's showing:
DESCRIBE HISTORY is only supported for Delta tables
even though my table is delta table
but if in ingest any data its work perfectly
Use the below syntax:
DESCRIBE HISTORY table_identifier
table_identifier
[database_name.] table_name: A table name, optionally qualified with a database name.
delta.<path-to-table> : The location of an existing Delta table.
Refer: https://docs.databricks.com/delta/delta-utility.html#delta-history, https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-describe-history.html
I have a table in AWS Athena which contains 2 records. Is there a SQL query using which a new column can be inserted in to the table?
You can find more information about adding columns to table in Athena documentation
Or you can use CTAS
For example, you have a table with
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
and you can create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
You can use any available query after AS
This is mainly for future readers like me, who was struggling to get this working for Hive table with AVRO data and if you don't want to create new table i.e updating schema of the existing table. It works for csv using 'add columns', but not for Hive + AVRO. For Hive + AVRO, to append columns at the end, before partition columns, the solution is available at this link. However, there are couple of things to note that, we need to pass full schema to the literal attribute and not just the changes; and (not sure why but) we had to alter hive table for all 3 things in the same order - 1. add columns using add columns 2. set tblproperties and 3. set serdeproperties. Hopefully it helps someone.
I'm processing a big hive's table (more than 500 billion records).
The processing is too slow and I would like to make it faster.
I think that by adding partitions, the process could be more efficient.
Can anybody tell me how I can do that?
Note that my table already exists.
My table :
create table T(
nom string,
prenom string,
...
date string)
Partitioning on date field.
Thx
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE table_name PARTITION(Date) select date from table_name;
Note :
In the insert statement for a partitioned table make sure that you are specifying the partition columns at the last in select clause.
You have to restructure the table. Here are the steps:
Make sure no other process is writing to the table.
Create new external table using partitioning
Insert into new table by selecting from the old table
Drop the new table (external), only table will be dropped but data will be there
Drop the old table
Create the table with original name by pointing to the location under step 2
You can run repair command to fix all the metadata.
Alternative 4, 5, 6 and 7
Create the table with original name by running show create table on new table and replace with original table name
Run LOAD DATA INPATH command to move files under partitions to new partitions of new table
Drop the external table created
Both the approaches will achieve restructuring with one insert/map reduce job.