I have created a table in Delta format but have not ingested any data; it is just an empty table. When I run
DESCRIBE HISTORY table_name
it shows:
DESCRIBE HISTORY is only supported for Delta tables
even though my table is a Delta table.
If I ingest any data, it works perfectly.
Use the following syntax:
DESCRIBE HISTORY table_identifier
table_identifier
[database_name.] table_name: A table name, optionally qualified with a database name.
delta.`<path-to-table>`: The location of an existing Delta table.
Refer: https://docs.databricks.com/delta/delta-utility.html#delta-history, https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-describe-history.html
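The two identifier forms can be sketched as a small Python helper that only builds the statement text. The function name describe_history_sql is a hypothetical name for illustration, not part of any Databricks API; the resulting string would typically be passed to spark.sql on a cluster.

```python
def describe_history_sql(table_name=None, path=None, database=None):
    """Build a DESCRIBE HISTORY statement for a table name or a Delta path.

    Hypothetical helper: it only formats text and does not talk to Spark.
    """
    if (table_name is None) == (path is None):
        raise ValueError("pass exactly one of table_name or path")
    if path is not None:
        # Path form: delta.`<path-to-table>` (the path goes inside backticks).
        return f"DESCRIBE HISTORY delta.`{path}`"
    # Name form: [database_name.]table_name
    qualified = f"{database}.{table_name}" if database else table_name
    return f"DESCRIBE HISTORY {qualified}"
```

For example, describe_history_sql(table_name="mytable", database="mydb") gives the name form, while describe_history_sql(path="/some/path/") gives the path form.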
Given a Delta table:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY (part_a, part_b, part_c)
LOCATION '/some/path/'
This table already has tons of data. However, the desired schema is:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY (part_a, part_b) -- <<-- ONLY part_a and part_b for partitions, i.e., removing part_c
LOCATION '/some/path/'
How can this schema change be achieved?
I eventually took the following approach:
1. Back up the original table to be on the safe side.
2. Read the table into a DataFrame with spark.read.table.
3. Write it back to the original location with df.write.option("overwriteSchema", "true").
I chose this approach so that I don't need to change the original data location.
In more detail:
1. Back up the original table to be on the safe side
Since this was on Databricks, I could use its proprietary deep clone feature:
create table mydb.mytable_backup_before_schema_migration_v1
deep clone mydb.mytable
location 'dbfs:/mnt/defaultDatalake/backups/zones/mydb/mytable_backup_before_schema_migration_v1'
If you are not on Databricks and don't have access to its deep clone, you can still back up the table by reading it and writing a copy to another location.
2. Read and 3. overwrite with the new schema
val df = spark.read.format("delta").table("mydb.mytable")
df
.write
.format("delta")
.mode("overwrite")
.partitionBy("part_a", "part_b")
.option("overwriteSchema", "true")
.saveAsTable("mydb.mytable") // same table, same location, but different data physical organization because partition changes
I have an existing external table, called YYYYYY for example, that contains n columns. It is loaded daily and is partitioned by extract_date.
We got a request from the business to add a few more columns to the existing table. To implement this, we did the following:
Dropped the existing partitions from Hive
alter table xxxx.yyyyyy add columns (
`c10` string COMMENT '',
`b` string COMMENT '',
`c11` string COMMENT '',
`c12` string COMMENT '',
`c13` string COMMENT '',
`c14` string COMMENT '',
`c15` string COMMENT '') CASCADE;
alter table xxxx.yyyyyyy change `c8` `c8` string COMMENT '' after `c7` CASCADE;
After the above two steps, I ran MSCK REPAIR TABLE xxxx.yyyyyy; in Hive.
The partitions were added (there are partitions going back to 2018), along with my new fields.
Before the changes I was able to query the data from both Impala and Hive, but after executing the ALTER commands, I get the error below.
> select * from xxxx.yyyyyyy where extract_date like '2019%';
Query: select * from XXXXX.YYYYYYY where extract_date like '2019%'
Query submitted at: 2020-05-09 11:57:10 (Coordinator: ' xxxx.yyyyyyy .c9'. Column type: STRING, Parquet schema:
optional fixed_len_byte_array a_auth [i:12 d:1 r:0]
In Hive, however, I can browse the data with no issues, so the problem occurs only in Impala.
Troubleshooting steps:
Created a new table without the additional columns, pointed it at a new external path, and copied the previously created partitions to the new path.
MSCK REPAIR TABLE TABLE NAME;
A SELECT query works in both Impala and Hive.
Added the additional fields to the newly created table with ALTER commands, then ran the following:
MSCK REPAIR TABLE TABLE NAME;
In Impala:
REFRESH TABLE TABLE NAME;
INVALIDATE METADATA TABLE NAME;
This time the SELECT query worked in Hive, but in Impala I got the error mentioned above.
Can someone explain why this is happening and how to fix it?
Impala Shell v2.12.0-cdh5.16.2
I have a table in AWS Athena which contains 2 records. Is there a SQL query that can add a new column to the table?
You can find more information about adding columns to a table in the Athena documentation.
Alternatively, you can use CTAS (CREATE TABLE AS SELECT).
For example, suppose you have a table defined as
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
You can then create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
You can use any valid SELECT query after AS.
This is mainly for future readers like me who are struggling to get this working for a Hive table with Avro data and who want to update the schema of the existing table rather than create a new one. Using 'add columns' works for CSV, but not for Hive + Avro. For Hive + Avro, a solution for appending columns at the end, before the partition columns, is available at this link. There are a couple of things to note, however. First, we need to pass the full schema to the literal attribute, not just the changes. Second (not sure why), we had to alter the Hive table in all three ways, in this order: 1. add the columns using add columns, 2. set tblproperties, and 3. set serdeproperties. Hopefully this helps someone.
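As a rough illustration of the "full schema, not just the changes" point, the Python sketch below appends new fields to an existing Avro record schema and emits the complete schema as JSON, which is the value that would go into avro.schema.literal. The record name, field names, and table name are made up for the example, and the nullable-string shape of the new fields is an assumption about what the added columns should look like.

```python
import json


def append_avro_fields(schema, new_field_names):
    """Return a copy of an Avro record schema with nullable string fields appended."""
    existing = {field["name"] for field in schema["fields"]}
    added = [
        # A ["null", "string"] union with a null default lets files written
        # with the old schema still be read under the new one.
        {"name": name, "type": ["null", "string"], "default": None}
        for name in new_field_names
        if name not in existing
    ]
    return {**schema, "fields": list(schema["fields"]) + added}


def schema_literal_statement(table, schema):
    """Build the SET TBLPROPERTIES statement carrying the FULL schema."""
    literal = json.dumps(schema, separators=(",", ":"))
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('avro.schema.literal'='{literal}')"


# Hypothetical starting schema with a single column.
old_schema = {
    "type": "record",
    "name": "example_record",
    "fields": [{"name": "id", "type": "string"}],
}
new_schema = append_avro_fields(old_schema, ["c10", "c11"])
statement = schema_literal_statement("mydb.example_table", new_schema)
```

The generated statement contains every field, old and new. The sketch does not escape single quotes inside the schema, so it assumes none of the names or doc strings contain one.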
I have recently started exploring Databricks and ran into a situation where I need to drop a certain column from a Delta table. When I worked with PostgreSQL, it was as easy as
ALTER TABLE main.metrics_table
DROP COLUMN metric_1;
I was looking through the Databricks documentation on DELETE, but it only covers deleting the rows that match a predicate.
I've also found docs on DROP database, DROP function and DROP table but absolutely nothing on how to delete a column from a delta table. What am I missing here? Is there a standard way to drop a column from a delta table?
There is no drop column option on Databricks tables: https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#delta-schema-constructs
Remember that unlike a relational database there are physical parquet files in your storage, your "table" is just a schema that has been applied to them.
In the relational world you can easily update the table metadata to remove a column; in the big data world you have to rewrite the underlying files.
Technically, Parquet can handle schema evolution (see Schema evolution in parquet format), but the Databricks implementation of Delta does not. It is probably just too complicated to be worth it.
Therefore the solution in this case is to create a new table and insert the columns you want to keep from the old table.
Use the code below:
# Read the table and drop the unwanted column
df = spark.sql("SELECT * FROM <DB Name>.<Table Name>")
df1 = df.drop("<Column Name>")
# Keep the original table under an _OLD name
spark.sql("DROP TABLE IF EXISTS <DB Name>.<Table Name>_OLD")
spark.sql("ALTER TABLE <DB Name>.<Table Name> RENAME TO <DB Name>.<Table Name>_OLD")
# Recreate the table under the original name with the new schema
df1.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("<DB Name>.<Table Name>")
One way I found to make this work is to drop the table first and then recreate it from the DataFrame with the overwriteSchema option set to true. You also need to set the mode to overwrite so that the physical files are rewritten using the new schema that the DataFrame contains.
Breakdown of the steps:
1. Read the table into a DataFrame.
2. Drop the columns that you don't want in your final table.
3. Drop the actual table from which you read the data.
4. Save the DataFrame, with the columns dropped, under the same table name. Make sure you use these two options when saving the DataFrame as a table: .mode("overwrite").option("overwriteSchema", "true")
The steps above recreate the same table with the extra columns removed.
Hope it helps someone facing a similar issue.
Databricks Runtime 10.2+ supports dropping columns if you enable column mapping mode:
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5',
'delta.columnMapping.mode' = 'name'
)
After that, drops will work:
ALTER TABLE table_name DROP COLUMN col_name
ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2, ...)
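The order of the two steps matters, so here is a minimal Python sketch that produces the statements in sequence. The helper name drop_column_statements is hypothetical; it only builds text, and each statement would then be run with spark.sql.

```python
def drop_column_statements(table, columns):
    """Return the SQL statements, in order, to drop columns from a Delta table.

    Hypothetical helper: it only builds text and does not execute anything.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # Step 1: enable column mapping, which DROP COLUMN depends on.
    enable_mapping = (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.minReaderVersion' = '2', "
        "'delta.minWriterVersion' = '5', "
        "'delta.columnMapping.mode' = 'name')"
    )
    # Step 2: drop one column or several, using the matching syntax.
    if len(columns) == 1:
        drop = f"ALTER TABLE {table} DROP COLUMN {columns[0]}"
    else:
        drop = f"ALTER TABLE {table} DROP COLUMNS ({', '.join(columns)})"
    return [enable_mapping, drop]
```

With the table from the question, drop_column_statements("main.metrics_table", ["metric_1"]) returns the property change followed by the single-column drop.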
You can overwrite the table without the column if the table isn't too large.
df = spark.read.table('table')
df = df.drop('col')
df.write.format('delta')\
.option("overwriteSchema", "true")\
.mode('overwrite')\
.saveAsTable('table')
As of Delta Lake 1.2, you can drop columns, see the latest ALTER TABLE docs.
Here's a fully working example if you're interested in a snippet you can run locally:
# create a Delta Lake
columns = ["language","speakers"]
data = [("English", "1.5"), ("Mandarin", "1.1"), ("Hindi", "0.6")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.write.format("delta").saveAsTable("default.my_cool_table")
spark.sql("select * from `my_cool_table`").show()
+--------+--------+
|language|speakers|
+--------+--------+
|Mandarin|     1.1|
| English|     1.5|
|   Hindi|     0.6|
+--------+--------+
Here's how to drop the language column:
spark.sql("""ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')""")
spark.sql("alter table `my_cool_table` drop column language")
Verify that the language column isn't included in the table anymore:
spark.sql("select * from `my_cool_table`").show()
+--------+
|speakers|
+--------+
|     1.1|
|     1.5|
|     0.6|
+--------+
This works only if you added the column after creating the table.
If so, and if you can recover the data inserted after altering the table, consider using the table history to restore the table to a previous version.
With
DESCRIBE HISTORY <TABLE_NAME>
you can check all the available versions of your table (operation 'ADD COLUMN' will create a new table version).
Afterwards, you can use RESTORE to return the table to any available version.
RESTORE <TABLE_NAME> VERSION AS OF <VERSION_NUMBER>
More information is available in the documentation on time travel.
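Choosing the version to restore can be sketched in plain Python. The history rows below are made up and only mimic the shape of DESCRIBE HISTORY output; the operation string "ADD COLUMNS" is an assumption, so check it against what your own history actually shows. Both function names are hypothetical.

```python
def version_before(history, operation):
    """Return the version just before the most recent occurrence of `operation`.

    `history` is a list of dicts with "version" and "operation" keys.
    Returns None if the operation never happened or happened at version 0.
    """
    matches = [row["version"] for row in history if row["operation"] == operation]
    if not matches:
        return None
    target = max(matches) - 1
    return target if target >= 0 else None


def restore_sql(table, version):
    """Build the RESTORE statement for the chosen version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


# Made-up history rows, newest first, for illustration only.
sample_history = [
    {"version": 2, "operation": "WRITE"},
    {"version": 1, "operation": "ADD COLUMNS"},
    {"version": 0, "operation": "CREATE TABLE"},
]
version = version_before(sample_history, "ADD COLUMNS")
```

Note that restoring to this version also discards the write at version 2, which is why the data inserted after the column was added has to be recoverable.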
I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table, it returns nothing. The table schema matches the structure of the data.
SELECT * FROM tb LIMIT 3;
Could this be a permission issue with Hive tables? Do only specific users have permission to query some tables?
Do you know of any solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore automatically. You need to run an ALTER statement to register the partition.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401)
Hope it helps...!!!
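To make the directory convention concrete, here is a small Python sketch that derives the partition directory and the matching ALTER statement from the same partition spec. Both function names are hypothetical, and values are inserted unquoted, which suits a numeric partition column such as datehour.

```python
import posixpath


def partition_dir(table_location, partition_spec):
    """Return the default directory for a partition under the table location.

    `partition_spec` maps partition column names to values, in partition order.
    Values are kept as strings so the directory name matches exactly.
    """
    parts = [f"{column}={value}" for column, value in partition_spec.items()]
    return posixpath.join(table_location, *parts)


def add_partition_sql(table, partition_spec):
    """Build the ALTER TABLE ... ADD PARTITION statement for the same spec."""
    spec = ", ".join(f"{column}={value}" for column, value in partition_spec.items())
    return f"ALTER TABLE {table} ADD PARTITION ({spec})"
```

For the table in the question, partition_dir("/user/cloudera/data", {"datehour": "0909201401"}) points one level below the folder where the data currently sits.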
When we create an EXTERNAL TABLE with a PARTITION clause, we have to ALTER the table to register the data location for each partition. However, that location need not be under the path we specified when creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
hive> LOCATION '/user/cloudera/data/somedatafor_datehour'
hive> ;
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when a process such as ETL copies files into that directory, we can sync the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure that Hive would create for the partition, we can simply place the data file in that location, such as '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync the partitions into the Hive metastore for the table "tb".
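As a simplified model of what the repair looks for (not Hive's actual implementation), the Python sketch below derives partition specs from key=value directory names under the table location. The function name discover_partitions is hypothetical.

```python
import posixpath


def discover_partitions(table_location, file_paths):
    """Return the distinct partition specs implied by key=value directories.

    Only path segments below `table_location` of the form key=value count;
    files sitting directly in the table location yield no partition.
    """
    specs = []
    for path in file_paths:
        relative = posixpath.relpath(posixpath.dirname(path), table_location)
        if relative.startswith(".."):
            continue  # the file is outside the table location
        spec = {}
        for segment in relative.split("/"):
            key, sep, value = segment.partition("=")
            if sep and key and value:
                spec[key] = value
        if spec and spec not in specs:
            specs.append(spec)
    return specs
```

A file at '/user/cloudera/data/datehour=0909201401/data.txt' yields the spec {"datehour": "0909201401"}, while a file placed directly in '/user/cloudera/data' yields nothing, which matches the empty result described in the question.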