Is there any way to insert values into a table if the table is snapshottable? - hive

I need to perform an INSERT operation into a table which is created upon a snapshottable location, is that possible?
(I've been working with Hive 1.1)
CREATE TABLE tablename (x string, y string)
LOCATION '/tmp/snapshots_test/';
INSERT INTO TABLE tablename VALUES('x','y');
where /tmp/snapshots_test/ is set as snapshottable:
hdfs dfsadmin -allowSnapshot /tmp/snapshots_test

I've found that if the table is partitioned, it is possible to perform the insert even when the location is a snapshottable directory, although this may depend on the Hive version.
In any case, it is always possible to delete only the files inside the snapshottable directory, so that a Hive command never has to remove the directory itself.
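A minimal sketch of the partitioned-table variant (the partition column dt and its value are hypothetical; whether the insert succeeds may still depend on the Hive version):
-- Partitioned table on the snapshottable location (hypothetical partition column).
CREATE TABLE tablename_part (x STRING, y STRING)
PARTITIONED BY (dt STRING)
LOCATION '/tmp/snapshots_test/';
-- The insert writes into a partition subdirectory rather than the snapshottable root.
INSERT INTO TABLE tablename_part PARTITION (dt='2020-01-01') VALUES ('x', 'y');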

Related

Old records appear in the Hadoop table after dropping and creating a new table with the same old name

I have a question regarding creating tables in Hadoop.
I create an external table the following way:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
'bucketing_version'='2',
'orc.compress'='ZLIB',
'orc.create.index'='true')
After I created this table, I ran a Jenkins job (with a Sqoop process) which loaded 70,000,000 records into it.
Then I needed to remove this table, so I ran:
DROP TABLE SCHEMA.TABLENAME
Later on I want to create a table with the same name as the previous one, but I need it to be empty. I run the same query as earlier:
CREATE EXTERNAL HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
STORED AS ORC
TBLPROPERTIES ('bigsql.table.io.doAs'='false',
'bucketing_version'='2',
'orc.compress'='ZLIB',
'orc.create.index'='true')
But when I create the table this way, it again contains 70,000,000 records, although I didn't run any job to populate it.
This is why I have two questions:
When I drop and recreate a table with the old name, does it recover the records from the old table?
How can I drop (or truncate) a table in bigsql/hive so that I end up with an empty table under the old name?
I am using bigsql and hive.
Dropping an external table doesn't remove the stored data, only the metadata from the Hive Metastore.
Refer to Managed vs External Tables.
Key points:
Use external tables when files are already present or in remote locations.
The files should remain even if the table is dropped.
Create a managed table (remove EXTERNAL from your query) if you want to be able to DROP and/or TRUNCATE it; see the sketch below.
Or have your Jenkins job run hadoop fs -rm -skipTrash on the table directory before the import.
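A minimal sketch of the managed-table variant, reusing the column list from the question (whether TRUNCATE is available depends on your Big SQL/Hive version):
CREATE HADOOP TABLE SCHEMA.TABLENAME (
ID BIGINT NOT NULL,
CODE INTEGER,
"VALUE" DOUBLE
)
STORED AS ORC;
-- Managed table: either empty it in place (data files are removed too)
TRUNCATE TABLE SCHEMA.TABLENAME;
-- or remove it entirely, data files included:
-- DROP TABLE SCHEMA.TABLENAME;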

Databricks - is not empty but it's not a Delta table

I run a query on Databricks:
DROP TABLE IF EXISTS dublicates_hotels;
CREATE TABLE IF NOT EXISTS dublicates_hotels
...
I'm trying to understand why I receive the following error:
Error in SQL statement: AnalysisException: Cannot create table ('default.dublicates_hotels'). The associated location ('dbfs:/user/hive/warehouse/dublicates_hotels') is not empty but it's not a Delta table
I already found a way to solve it (by removing the directory manually):
dbutils.fs.rm('.../dublicates_hotels',recurse=True)
But I can't understand why the table data is still being kept, even though I created a new cluster (terminated the previous one) and I'm running this query with the new cluster attached.
Can anyone help me understand this?
I also faced a similar problem, then tried the CREATE OR REPLACE TABLE command and it solved my problem.
DROP TABLE & CREATE TABLE work with entries in the Metastore that is some kind of database that keeps the metadata about databases and tables. There could be the situation when entries in metastore don't exist so DROP TABLE IF EXISTS doesn't do anything. But when CREATE TABLE is executed, then it additionally check for location on DBFS, and fails if directory exists (maybe with data). This directory could be left from some previous experiments, when data were written without using the metastore.
If the table was created with a LOCATION specified, it is EXTERNAL, so when you drop it you drop only the Hive metadata for that table; the directory contents remain as they are. You can restore the table with CREATE TABLE if you specify the same LOCATION (Delta keeps the table structure along with its data in the directory).
If no LOCATION was specified at creation time, it is a MANAGED table, and DROP destroys both the metadata and the directory contents.
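A minimal sketch of the replacement approach reported in the first answer above; the column list is hypothetical and CREATE OR REPLACE TABLE requires a recent Databricks runtime:
-- Recreate the table definition and its contents in one statement.
CREATE OR REPLACE TABLE dublicates_hotels (
hotel_id BIGINT,    -- hypothetical columns; use the real schema
hotel_name STRING
) USING DELTA;
-- Alternatively, clear the leftover non-Delta directory first with the
-- dbutils.fs.rm(..., recurse=True) call shown in the question, then CREATE TABLE.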

Best way to replace a table

I have a web app running off a database table that is generated from a csv file.
This table needs to be updated frequently from the csv. The table needs to match the csv exactly, i.e. if a record no longer exists in the csv it should no longer exist in the table, or it should be soft deleted.
What is the proper way to do this?
It seems the easiest way would be:
create temp table
import csv to temp table
drop live table
rename temp table to live table name
This will be scripted inside the app, so I don't think the downtime will be much, since dropping and renaming a table shouldn't take long; but it doesn't seem like the safest way to do things, as there is a moment when no table exists.
I tried to instead do:
create temp table
import csv to temp table
update records in live table with data from temp table
delete records in live table that don't exist in temp table
In theory that sounded better, but it is extremely slow. The first method takes just a few seconds; with the second method, the update takes a really long time. I let it run for 10 minutes before cancelling it, as it hadn't finished.
I did the update like this:
update table_name as t
set
column1 = tmp.column1,
column2 = tmp.column2
-- etc..
from table_name_temp as tmp
where t.id = tmp.id  -- assumes a shared key column such as id
What is the proper way to handle this situation?
What you want to do is wrap your simple solution within a transaction. This will ensure that your steps are executed atomically. See: https://www.tutorialspoint.com/sql/sql-transactions.htm for more info.
Postgres supports ALTER TABLE ... RENAME.
http://www.postgresqltutorial.com/postgresql-rename-table/
https://dba.stackexchange.com/questions/100779/how-to-atomically-replace-table-data-in-postgresql
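A minimal sketch of the transactional swap, using the table names from the question (DDL is transactional in Postgres, so readers see either the old table or the new one):
BEGIN;
-- drop the old live table and promote the freshly imported one
DROP TABLE table_name;
ALTER TABLE table_name_temp RENAME TO table_name;
COMMIT;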
The rename table method only works if there are no constraints or triggers involved.
In most cases the new table's contents will not differ too much from the old version; the trick is to suppress updates that don't change anything.
In steps (sketched in SQL after the list):
create temp table
import csv to temp table
delete records from live table that don't exist in temp table # deletes
delete records from temp table that are EXACTLY THE SAME in live table # idempotent updates
update records in live table with data from temp table # actual updates
insert records into live table from temp table that don't yet exist # inserts
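A sketch of steps 3-6, assuming hypothetical tables live and live_temp with a key column id and data columns col1, col2:
-- 3) deletes: rows that are gone from the csv
DELETE FROM live WHERE id NOT IN (SELECT id FROM live_temp);
-- 4) idempotent updates: drop temp rows that are already identical in live
DELETE FROM live_temp t USING live l
WHERE t.id = l.id AND t.col1 = l.col1 AND t.col2 = l.col2;
-- 5) actual updates
UPDATE live l SET col1 = t.col1, col2 = t.col2
FROM live_temp t WHERE l.id = t.id;
-- 6) inserts: rows that are new in the csv
INSERT INTO live (id, col1, col2)
SELECT id, col1, col2 FROM live_temp t
WHERE NOT EXISTS (SELECT 1 FROM live l WHERE l.id = t.id);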

HIVE - INSERT OVERWRITE vs DROP TABLE + CREATE TABLE + INSERT INTO

I'm writing an automated script with a few queries in Hive, and we found that from time to time we need to clear the data from a table and insert new data. We are wondering which would be faster:
INSERT OVERWRITE TABLE SOME_TABLE
SELECT * FROM OTHER_TABLE;
or is it faster to do it like this:
DROP TABLE SOME_TABLE;
CREATE TABLE SOME_TABLE (STUFFS);
INSERT INTO TABLE SOME_TABLE
SELECT * FROM OTHER_TABLE;
The overhead of running the queries is not an issue, since we have the creation script as well. The question is: with billions of rows, is INSERT OVERWRITE faster than DROP + CREATE + INSERT INTO?
For maximum speed I would suggest: 1) issue hadoop fs -rm -r -skipTrash table_dir/* first to remove the old data quickly without putting the files into the trash, because INSERT OVERWRITE will move all files into the Trash, and for a very big table this takes a lot of time; then 2) run the INSERT OVERWRITE command. This is also faster because you do not need to drop and recreate the table.
UPDATE:
As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true"), the previous data of the table is not moved to the Trash when an INSERT OVERWRITE query is run against it. This functionality is applicable only to managed tables. So INSERT OVERWRITE with auto purge will work faster than rm -skipTrash + INSERT OVERWRITE or DROP + CREATE + INSERT, because it is a single Hive-only command.
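A minimal sketch of that variant, reusing the table names from the question (managed table, Hive 2.3.0+):
-- Skip the Trash move on overwrite.
ALTER TABLE SOME_TABLE SET TBLPROPERTIES ('auto.purge'='true');
INSERT OVERWRITE TABLE SOME_TABLE
SELECT * FROM OTHER_TABLE;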
One edge consideration is that if your schema changes, INSERT OVERWRITE will fail, while DROP+CREATE+INSERT will not. While this is unlikely to apply in most scenarios, if you're prototyping workflow/table schemas then it might be worth considering.

Dumping a table's content in sqlite3 to be imported into a new database

Is there an easy way of dumping a SQLite database table into a text string with insert statements to be imported into the same table of a different database?
In my specific example, I have a table called log_entries with various columns. At the end of every day, I'd like to create a string which can then be dumped into another database with a table of the same structure called archive (and empty the table log_entries).
I know about the ATTACH command for creating new databases; I actually wish to add to an existing one rather than create a new one every day.
Thanks!
ATTACH "%backup_file%" AS Backup;
INSERT INTO Backup.Archive SELECT * FROM log_entries;
DELETE FROM log_entries;
DETACH Backup;
All you need to do is replace %backup_file% with the path to your backup database. This approach assumes that your Archive table is already defined and that you are using the same database file to accumulate your archive.
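A sketch of the same sequence wrapped in a transaction, so the copy into Archive and the cleanup of log_entries succeed or fail together (atomicity across attached databases assumes the default rollback-journal mode, not WAL):
ATTACH "%backup_file%" AS Backup;
BEGIN;
INSERT INTO Backup.Archive SELECT * FROM log_entries;
DELETE FROM log_entries;
COMMIT;
DETACH Backup;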
$ sqlite3 exclusion.sqlite '.dump exclusion'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE exclusion (word string);
INSERT INTO "exclusion" VALUES('books');
INSERT INTO "exclusion" VALUES('rendezvousing');
INSERT INTO "exclusion" VALUES('motherlands');
INSERT INTO "exclusion" VALUES('excerpt');
...