I need to remove a few rows from a Parquet table (table_a) in Hive. If I create a new table (table_b) and insert into it:
Insert Overwrite table table_b
select * from table_a
Where (my conditions to exclude the right fields here)
Are both tables now using the same HDFS files? If I drop table_a with PURGE, will both tables' data disappear?
You can run DESCRIBE FORMATTED <table name> to check the HDFS path of a table.
To answer your question: if you did not specify a location when creating the tables, the HDFS path of table_a and the HDFS path of table_b will be different.
So if you drop table_a after loading the data into table_b, you will not lose the data in table_b.
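As a rough sketch of the whole flow, assuming both tables are unpartitioned managed tables created without an explicit LOCATION (the filter column status and its value are hypothetical placeholders for your own conditions):

```sql
-- Create an empty table with the same schema and storage format as table_a.
CREATE TABLE table_b LIKE table_a;

-- Copy only the rows to keep; this WHERE clause is a hypothetical example.
INSERT OVERWRITE TABLE table_b
SELECT * FROM table_a
WHERE status <> 'obsolete';

-- Compare the "Location:" field in each output; the two paths should differ.
DESCRIBE FORMATTED table_a;
DESCRIBE FORMATTED table_b;

-- Only after confirming the locations differ, drop the old table.
DROP TABLE table_a PURGE;
```

Checking the locations before the DROP is the safety step here: if someone had pointed both tables at the same LOCATION, the PURGE would remove the shared files.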
I have table A, which is truncated and reloaded from each month's file, and table B, which is append-only.
So table A is loaded from a file into a Hive table.
Table B is populated by inserting (appending) the data from table A.
The issue is that table B is loaded with a straight SELECT from table A, so the same data can be inserted more than once.
How should I write the SELECT query that inserts data from table A?
Both tables have file_date as a column.
A left join of A and B gives wrong counts in the insert.
And NOT EXISTS is not working for me in Hive.
The issue is:
Append table script (the table is partitioned by yearmonth):
Insert into table dist.t2
Select
Person_sk,
Np_id,
Yearmonth,
Insert_date,
File_date
From raw.ma
Data in table raw.ma (this table is truncated and reloaded):
File 1 data: 201902
File 2 data: 201903
File 3 data: 201904
File 4 data: if 201902 data gets loaded into the table again, it should not duplicate the file 1 data. It should either not be inserted or should overwrite that partition.
Here I need a filter or WHERE condition for appending data into dist.t2.
Can you please help with this?
I tried ALTER TABLE ... DROP PARTITION in Hive, but it fails in the Spark framework.
Please help me avoid inserting duplicate entries.
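One possible approach, sketched here under assumptions rather than tested against this setup, is to replace INSERT INTO with a dynamic-partition INSERT OVERWRITE, so that a reloaded month overwrites its own partition instead of being appended again. The sketch assumes yearmonth is the only partition column of dist.t2 and that the table's non-partition columns are declared in the order shown:

```sql
-- Allow Hive to derive the target partition from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column (yearmonth) must come last in the SELECT list.
-- Only partitions that appear in the result set are overwritten;
-- partitions for other months are left as they are.
INSERT OVERWRITE TABLE dist.t2 PARTITION (yearmonth)
SELECT
  person_sk,
  np_id,
  insert_date,
  file_date,
  yearmonth
FROM raw.ma;
```

With this form, if raw.ma is reloaded with 201902 data, only the yearmonth=201902 partition of dist.t2 would be replaced, and the 201903 and 201904 partitions would stay untouched.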
I have created table using this statement:
CREATE TABLE tablename STORED AS PARQUET AS (SELECT ...)
How can I recompute it without going through the DROP TABLE / CREATE TABLE flow?
In Impala, the INSERT INTO syntax appends data to a table. The existing data files are left as-is, and the inserted data is put into one or more new data files.
The INSERT OVERWRITE syntax replaces the data in a table. Currently, the overwritten data files are deleted immediately; they do not go through the HDFS trash mechanism.
So if you want to replace the data in the table tablename without dropping and recreating it, you can run a query like this:
INSERT OVERWRITE TABLE tablename SELECT * from <source_tablename>;
I need to create an external table in HiveQL from the output of a SELECT clause. Every time the HiveQL script is run, the table should be dropped and recreated. When we drop an external table, only the table structure is dropped, not the data files in the HDFS location. How can I achieve this?
Create Table As Select (CTAS) has restrictions. One of them is that the target table cannot be external.
You have these options:
Create the external table once, then use INSERT OVERWRITE:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Use a managed table; then you can DROP TABLE and CREATE TABLE ... AS SELECT
See also answer about skipTrash and auto.purge property.
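A minimal sketch of the first option, for illustration only. The table name report_ext, its columns, the location, and source_table are all hypothetical:

```sql
-- One-time setup: hypothetical table name, columns, and location.
CREATE EXTERNAL TABLE IF NOT EXISTS report_ext (
  id INT,
  name STRING
)
STORED AS PARQUET
LOCATION '/data/report_ext';

-- Run on every refresh: replaces the data files under the table location
-- with the result of the SELECT, without dropping the table.
INSERT OVERWRITE TABLE report_ext
SELECT id, name
FROM source_table;
```

Because the table definition is created only once, each run needs just the INSERT OVERWRITE, which avoids the problem of stale files being left behind by DROP TABLE on an external table.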
I'm trying to delete data from external and partitioned table in hive. I can delete partitions with:
ALTER TABLE myTable DROP PARTITION(field > 'xxxx')
or
TRUNCATE TABLE myTable PARTITION(field)
But related files in Blob storage are not deleted. How do I delete those files?
On the other hand, I'd like to delete data using any field as a filter (not only the partition field). Can this be done in my case (an external, partitioned table)? I've tried to achieve this using:
INSERT OVERWRITE TABLE myTable PARTITION(field)
SELECT * FROM myTable WHERE machine = 'xxxxx'
But the data returned by the SELECT doesn't replace the data in myTable.
Data in an external table remains in place if you drop the table or a partition. Only if the table is managed is the data deleted automatically when the table or partition is dropped.
An INSERT OVERWRITE TABLE myTable PARTITION(field) SELECT ... statement replaces the data only in partitions that are present in the returned dataset. If the returned dataset is empty, the existing data remains untouched.
To delete data in an external table, you need to delete the files on the filesystem.
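To delete rows by a non-partition field, one option is to rewrite the partitions with only the rows you want to keep. Note that the filter is the inverse of the one in the question, which selected the rows to be removed. This is a sketch, assuming field is the table's only partition column:

```sql
-- Allow Hive to derive the target partition from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- SELECT * on a partitioned table returns the partition column last,
-- which is the position dynamic partitioning expects.
-- Keep every row EXCEPT those for the machine being deleted;
-- the IS NULL check keeps rows where machine is NULL.
INSERT OVERWRITE TABLE myTable PARTITION (field)
SELECT *
FROM myTable
WHERE machine IS NULL OR machine <> 'xxxxx';
```

A partition in which every row belongs to the deleted machine produces no rows in the result, so it is not overwritten. Such partitions still need to be dropped, and their files deleted, separately.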
I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder but when I query the table, it returns nothing. The table is structured in a way that it fits the data structure.
SELECT * FROM tb LIMIT 3;
Could this be a permission issue with Hive tables? Do only specific users have permission to query certain tables?
Do you know of any solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data directly in /user/cloudera/data. Hive will look for data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore automatically. You need to run an ALTER statement to register the partition.
So here are the steps for external tables with partitions:
1.) In your external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE tb PARTITION (datehour=0909201401)
2.) After creating your table, run an ALTER statement:
ALTER TABLE tb ADD PARTITION (datehour=0909201401)
Hope it helps!
When we create an EXTERNAL TABLE with PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
hive> LOCATION '/user/cloudera/data/somedatafor_datehour'
hive> ;
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. So when some process such as ETL copies files into that directory, we can sync the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure of the partition that Hive would create, we can simply place the data file in that location, like '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync the partitions into the Hive metastore for the table "tb".
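To confirm that the repair picked up the new directories, a short check along these lines may help. It assumes the data files already sit under datehour=... subdirectories of the table location:

```sql
-- Register partition directories found under the table location.
MSCK REPAIR TABLE tb;

-- List the partitions the metastore now knows about.
SHOW PARTITIONS tb;

-- Re-run the query from the question; it can only return rows
-- for partitions that appear in the SHOW PARTITIONS output.
SELECT * FROM tb LIMIT 3;
```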