How to delete data physically with Presto/Trino? - hive

In my installation of Presto (358) I've got two working hive connectors:
S3
Azure blob (ABFS)
Everything works fine, but when I call DROP (TABLE/SCHEMA) or DELETE FROM, the deletion happens only in the metastore and no data is physically deleted. This applies to both S3 and ABFS.
This becomes quite problematic when replacing data:
> DROP TABLE hive.abc;
-- ok
> CREATE TABLE hive.abc AS (...)
-- ERROR: Target directory 'abc' already exists.
The same applies to deleting partitions, etc.
Is there any way to really delete the data?

I found the solution. The main difference was in specifying external_location vs location for the schema and its tables.
CREATE SCHEMA hive.xyz WITH (location = 'abfs://...');
CREATE TABLE hive.xyz.test AS SELECT (...);
DELETE FROM hive.xyz.test WHERE TRUE;
-- Data ARE physically deleted
CREATE SCHEMA hive.xyz;
CREATE TABLE hive.xyz.test
WITH (external_location = 'abfs://...')
AS SELECT (...);
DELETE FROM hive.xyz.test WHERE TRUE;
-- Data ARE NOT physically deleted.
Conclusion: external_location for a table will prevent data deletion.

Related

Old records appear in the Hadoop table after drop and creating new table with the same old name

I have a question regarding creating tables in Hadoop.
I create external table the following way:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
  ID BIGINT NOT NULL,
  CODE INTEGER,
  "VALUE" DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
  'bucketing_version'='2',
  'orc.compress'='ZLIB',
  'orc.create.index'='true')
After creating this table I ran a Jenkins job (with a Sqoop process) which loaded 70,000,000 records into it.
Then I needed to remove this table, so I ran:
DROP TABLE SCHEMA.TABLENAME
Later on I want to create a table with the same name as the previous one, but I need it to be empty. I run the same query as earlier:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
  ID BIGINT NOT NULL,
  CODE INTEGER,
  "VALUE" DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
  'bucketing_version'='2',
  'orc.compress'='ZLIB',
  'orc.create.index'='true')
But when I create the table this way, it contains the 70,000,000 records again, although I didn't run any job to populate it.
This is why I have two questions:
When I drop and recreate a table with the old name, is it recovering the records from the old table?
How can I drop (or truncate) a table in Big SQL/Hive so that I get an empty table with the old name?
I am using Big SQL and Hive.
Dropping an external table doesn't remove the stored data, only the metadata from the Hive Metastore.
Refer to Managed vs External Tables.
Key points...
Use external tables when files are already present or in remote locations
the files should remain even if the table is dropped
Create a managed table (remove EXTERNAL from your query) if you want to be able to DROP and/or TRUNCATE (a sketch of the managed DDL follows below).
Or have your Jenkins job run hadoop fs -rm -skipTrash before the import.
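For illustration, a minimal sketch of the managed-table variant, assuming Big SQL accepts the same column list and ORC properties once the EXTERNAL keyword is removed:
CREATE HADOOP TABLE SCHEMA.TABLENAME (
  ID BIGINT NOT NULL,
  CODE INTEGER,
  "VALUE" DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB',
               'orc.create.index'='true');
-- Because the table is now managed, DROP TABLE SCHEMA.TABLENAME also removes the warehouse files.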

Databricks - is not empty but it's not a Delta table

I run a query on Databricks:
DROP TABLE IF EXISTS dublicates_hotels;
CREATE TABLE IF NOT EXISTS dublicates_hotels
...
I'm trying to understand why I receive the following error:
Error in SQL statement: AnalysisException: Cannot create table ('default.dublicates_hotels'). The associated location ('dbfs:/user/hive/warehouse/dublicates_hotels') is not empty but it's not a Delta table
I already found a way to solve it (by removing the directory manually):
dbutils.fs.rm('.../dublicates_hotels',recurse=True)
But I can't understand why it's still keeping the table data.
Even though I created a new cluster (terminated the previous one) and I'm running this query with the new cluster attached.
Can anyone help me understand this?
I also faced a similar problem, then tried the command CREATE OR REPLACE TABLE and it solved my problem.
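Roughly like this (a sketch only; the table name comes from the question and the SELECT source is a placeholder):
CREATE OR REPLACE TABLE dublicates_hotels
AS SELECT * FROM some_source_table;  -- placeholder source; replaces both the table definition and its contents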
DROP TABLE & CREATE TABLE work with entries in the Metastore, which is a kind of database that keeps the metadata about databases and tables. There could be a situation where entries in the metastore don't exist, so DROP TABLE IF EXISTS doesn't do anything. But when CREATE TABLE is executed, it additionally checks for the location on DBFS and fails if the directory exists (maybe with data). This directory could be left over from some previous experiments, when data was written without using the metastore.
If the table was created with LOCATION specified, it is an EXTERNAL table, so when you drop it you drop only the Hive metadata for that table; the directory contents remain as they are. You can restore the table with CREATE TABLE if you specify the same LOCATION (Delta keeps the table structure along with its data in the directory).
If LOCATION wasn't specified at table creation, it's a MANAGED table, and DROP will destroy both the metadata and the directory contents.
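A short sketch of the two cases (table names and the dbfs:/ path are hypothetical):
-- EXTERNAL: LOCATION is specified, so DROP removes only the metadata
CREATE TABLE hotels_external
USING DELTA
LOCATION 'dbfs:/mnt/demo/hotels_external'
AS SELECT 1 AS id;
DROP TABLE hotels_external;   -- the directory and its Delta files stay
-- MANAGED: no LOCATION, so DROP also deletes the directory under the warehouse
CREATE TABLE hotels_managed
USING DELTA
AS SELECT 1 AS id;
DROP TABLE hotels_managed;    -- metadata and files are both gone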

External Table data not getting Purged in Hive

I created 2 external tables in Hive. For the first table I specified the data location in the create statement. Into the second table I loaded data after creating it.
I can see the data file created for the second table in the /hive/warehouse/ directory. Then I set "external.table.purge"="true" for both tables and DROP both tables. But the data files of both tables remain as they are.
What is the behaviour of 'external.table.purge'='true'? Shouldn't it delete the data files as well when the DROP command is issued?
If Hive does not take any ownership over the data files of an external table, why is there even an option such as 'external.table.purge'='true'?
I read in one of the threads that it is possible to delete the data of external tables as well via ALTER TABLE ... SET TBLPROPERTIES('external.table.purge'='true'), but I'm unable to find that post again.
You cannot drop the data of an external table, but you can do it for internal (managed) tables. So convert the table to internal and then drop it.
First change the external property to false.
hive> ALTER TABLE nyse_external SET TBLPROPERTIES('EXTERNAL'='False');
and then you can easily drop it.
hive> drop table nyse_external;
TBLPROPERTIES ("external.table.purge"="true") should work for hive version 4.x+.
Answer to point 1:
Table property "external.table.purge", which if true (and if the table is an external table), will let Hive know to delete the table data when the table is dropped. This feature is introduced in this apache jira.
https://issues.apache.org/jira/browse/HIVE-19981 .
For reference on how to set the property take a look at this example,
https://docs.cloudera.com/runtime/7.2.7/using-hiveql/topics/hive_drop_external_table_data.html
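A minimal sketch of setting the property before dropping (the table name is hypothetical):
ALTER TABLE my_external_table SET TBLPROPERTIES ('external.table.purge'='true');
DROP TABLE my_external_table;  -- with the purge property set, Hive deletes the data files too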

How to rename/recreate a table without disrupting service?

I have a table I need to purge without disrupting the service. About 99.99% of the data should be deleted, so I'm trying to recreate the table and move the 0.01% of useful data into the new table as follows (I will truncate the old table later):
BEGIN ISOLATION LEVEL SERIALIZABLE;
LOCK TABLE table1 IN ACCESS EXCLUSIVE MODE;
/* I rename the old table */
ALTER TABLE table1 RENAME TO table1_to_be_deleted;
/* And I recreate the table */
CREATE TABLE table1 (
...
);
/* Restore useful data from old table to new one */
INSERT INTO table1 SELECT * FROM table1_to_be_deleted WHERE toBeKept = 1;
COMMIT;
But when I run my transaction I get some client errors due to rows not found in the new table but present in the old one. These rows are tagged as to-be-kept, so they should be copied from the old table to the new one inside the transaction and be found by the clients' requests...
When other requests are waiting for a lock acquired on a table, do they keep a pointer to the targeted object? It's the only explanation I have for the old table still being hit after I commit my transaction...
PS: I'm using Postgres 9.1
To do that I'd rather:
create an auxiliary table
create rules to redirect DML from the original table to the auxiliary one
create a rule to select from both (unioned) instead of the original
move the good data from ONLY the original to the auxiliary table
truncate the original
either move the data back (no need to rebuild references) or rename
drop the obsolete rules and objects
But really, I'd just DELETE FROM ... WHERE the 99% condition, rather than reinventing the wheel (see the sketch below).
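A minimal sketch of that simpler approach, assuming the toBeKept flag from the question marks the rows to keep:
-- delete the ~99.99% of rows that are no longer needed, without any rename
DELETE FROM table1 WHERE toBeKept <> 1;
-- plain VACUUM (no exclusive lock) then marks the dead space for reuse
VACUUM ANALYZE table1;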

Hive: DROP TABLE IF EXISTS <Table Name> does not free memory

When I use DROP TABLE IF EXISTS <Table Name> in Hive, it does not free up the space. The files were created as 0000_n.bz2 and they are still on disk.
I have two questions here:
1) Will these files keep growing with each and every insert?
2) Is there any DROP equivalent to remove the files from the disk as well?
A couple of things you can do:
Check if the table is an external table; in that case you need to delete the files manually on HDFS, as dropping the table won't drop the files:
hadoop fs -rm /HDFS_location/filename
Secondly, check if you are in the right database. You need to issue a USE database command before dropping the tables; the database should be the same as the one in which the tables were created.
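For example (database and table names are placeholders):
USE my_database;                -- switch to the database that owns the table
DROP TABLE IF EXISTS my_table;  -- now the drop targets the intended table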
There are two types of tables in Hive.
Hive managed table: if you drop a Hive managed table, the data in HDFS is automatically deleted.
External table: if you drop an external table, Hive doesn't delete the underlying data.
I believe yours is an external table.
DROP TABLE IF EXISTS table_name PURGE;
With PURGE, the data files bypass the trash folder, so they cannot be recovered after the table is dropped.