Does DROP PARTITION delete data from external table in HIVE? - hive

An external table in HIVE is partitioned on year, month and day.
So does the following query delete data from the external table for the specific partition referenced in it?
ALTER TABLE MyTable DROP IF EXISTS PARTITION(year=2016,month=7,day=11);

A partitioning scheme is not data. The partitioning scheme is part of the table DDL stored in metadata (simply put: the partition key values plus the location where the data files are stored).
The data itself is stored in files in the partition location (a folder). If you drop a partition of an external table, the location remains untouched, but it is unmounted as a partition (the metadata about this partition is deleted). You can keep a few unmounted versions of a partition location (for example, previous versions).
You can drop a partition and mount another location as the partition (ALTER TABLE ... ADD PARTITION), or change an existing partition's location. Dropping an external table also does not delete the table/partition folders and the files in them, and you can later create a table on top of that location.
Have a look at this answer for a better understanding of the external table/partition concept: it is possible to create many tables (both managed and external at the same time) on top of the same location in HDFS.
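For illustration, a minimal sketch of dropping and then re-mounting the partition from the question (the HDFS paths below are hypothetical):
ALTER TABLE MyTable DROP IF EXISTS PARTITION (year=2016, month=7, day=11);
-- the data files are still on HDFS, so the partition can be re-attached,
-- either at its original directory or at any other directory:
ALTER TABLE MyTable ADD PARTITION (year=2016, month=7, day=11)
LOCATION '/data/mytable/year=2016/month=7/day=11';
-- an existing partition can also be re-pointed at a different directory:
ALTER TABLE MyTable PARTITION (year=2016, month=7, day=11)
SET LOCATION '/archive/mytable/year=2016/month=7/day=11';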

No. An external table holds only references; those references are what get deleted, while the actual files still persist at the location.
External table data files are not owned by the table, nor are they moved to the Hive warehouse directory.
Only the partition metadata will be deleted from the Hive metastore tables.
Difference between internal and external tables:
For External Tables -
An external table stores its files on HDFS, but the table is not tightly coupled to the source files.
If you delete an external table, the files still remain on HDFS.
As an example, if you create an external table called “table_test” in Hive using HiveQL and link the table to file “file”, then deleting “table_test” from Hive will not delete “file” from HDFS.
External table files are accessible to anyone who has access to HDFS file structure and therefore security needs to be managed at the HDFS file/folder level.
Metadata is maintained on the master node, and deleting an external table from Hive only deletes the metadata, not the data/files.
For Internal Tables-
Stored in a directory based on the hive.metastore.warehouse.dir setting; by default internal tables are stored in “/user/hive/warehouse”. You can change this by updating the location in the configuration.
Deleting the table deletes the metadata & data from master-node and HDFS respectively.
Internal table file security is controlled solely via Hive. Security needs to be managed within Hive, probably at the schema level (this varies from organisation to organisation).
Hive can have internal or external tables; this is a choice that affects how data is loaded, controlled, and managed (a minimal contrast is sketched after the lists below).
Use EXTERNAL tables when:
The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
Hive should not own data and control settings, dirs, etc., you may have another program or process that will do those things.
You are not creating a table based on an existing table (AS SELECT).
Use INTERNAL tables when:
The data is temporary.
You want Hive to completely manage the life-cycle of the table and data.
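As a minimal sketch of the difference (table names, columns and the HDFS path are hypothetical):
CREATE EXTERNAL TABLE ext_events (id INT, payload STRING)
LOCATION '/data/events';
-- DROP TABLE ext_events removes only the metadata; the files stay in /data/events

CREATE TABLE managed_events (id INT, payload STRING);
-- stored under hive.metastore.warehouse.dir (by default /user/hive/warehouse);
-- DROP TABLE managed_events removes both the metadata and the data files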
Note: these are the metastore tables you will see if you look into the configured metastore database; a sample query against them is sketched after the list.
|BUCKETING_COLS |
| COLUMNS |
| DBS |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SORT_COLS |
| TABLE_PARAMS |
| TBLS |
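For example, a read-only query against the metastore RDBMS along these lines shows the partition entries that DROP PARTITION removes (a sketch against the standard schema; exact table and column names can vary between Hive versions, and 'mytable' is a placeholder):
SELECT d.NAME AS db_name, t.TBL_NAME, p.PART_NAME
FROM PARTITIONS p
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.TBL_NAME = 'mytable';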

Related

Hive - Is it mandatory to have '=' in the directory name for an external table to consider it a partition?

I am new to Hive and have a below basic question:
I am trying to create an external table on an HDFS directory at the location
/projects/score/output/scores_2020-06-30.gzip
but it is not being considered as a partition.
Should the developer change the directory name to "scores=yyyy-mm-dd" in place of "scores_yyyy-mm-dd.gzip",
like "/projects/score/output/scores=2020-06-30",
so that only then would it be considered partitioned?
i.e. is it mandatory to have '=' in the directory name for an external table to treat it as a partition?
Or can I change something in the table below while creating it? Trying as below:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ (
...
)
PARTITIONED BY (scores STRING)
LOCATION '/projects/score/output/';
Thanks in advance!
You can define a partition on top of any location, even outside the table directory, using ALTER TABLE ADD PARTITION. A partition in HDFS is a directory, usually inside the table location but not necessarily. If it is inside the table directory, then you can use MSCK REPAIR TABLE to attach existing subdirectories inside the table directory as partitions; it will scan the table location and add the partition metadata.
In your example the partition directory is missing; you have only the table directory with a file inside. The file name does not matter in this case.
It is not absolutely mandatory to have partition directories in the key=value format. Even if MSCK REPAIR TABLE does not work in your Hive distribution, you can still add partitions using the ALTER TABLE ADD PARTITION ... LOCATION command on top of any directory.
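For the layout in the question, a minimal sketch could look like the following; it assumes the daily files are placed in per-date directories such as /projects/score/output/2020-06-30/ (which is not how they are laid out right now), and it reuses the table and partition column names from the question:
ALTER TABLE XYZ ADD IF NOT EXISTS PARTITION (scores='2020-06-30')
LOCATION '/projects/score/output/2020-06-30/';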
It may depend on the vendor. For example, on Qubole, ALTER TABLE RECOVER PARTITIONS (the EMR alternative to MSCK REPAIR TABLE) works fine with directories like '2020-06-30'.
By default, when inserting data using dynamic partitioning, Hive creates partition folders in the key=value format, but if you are creating partition directories with some other tool, 'value' alone as the partition folder name is okay. Just check whether MSCK REPAIR works in your case; if it does not, create key=value directories.
The name of the file(s) and the number of files inside the partition folder do not matter in this context.
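And when the data is written by Hive itself, a dynamic-partition insert creates the key=value directories automatically; a sketch with hypothetical source table and column names:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- the last SELECT column feeds the partition key
INSERT OVERWRITE TABLE XYZ PARTITION (scores)
SELECT id, score, score_date AS scores
FROM staging_scores;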

AWS Athena Table Data Update

I have started testing out AWS Athena, and it so far looks good. One problem I am having is about the updating of data in a table.
Here is the scenario: In order to update the data for a given date in the table, I am basically emptying out the S3 bucket that contains the CSV files, and uploading the new files to become the updated data source. However, the period of time during which the bucket is empty (i.e. when the old source is deleted and new source is being uploaded) actually is a bottleneck, because during this interval, anyone querying the table will get no result.
Is there a way around this?
Thanks.
Athena is a web service that allows you to query data which resides on AWS S3. In order to run queries, Athena needs to know the table schema and where to look for the data on S3. All of this information is stored in the AWS Glue Data Catalog. This essentially means that each time you get new data you simply need to upload a new CSV file to S3.
Let's assume that you get new data every day at midnight and you store it in an S3 bucket:
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
└── data-file-2019-01-03.csv
and each of these files looks like:
| date | volume | product | price |
|------------|---------|---------|-------|
| 2019-01-01 | 100 | apple | 10 |
| 2019-01-01 | 200 | orange | 50 |
| 2019-01-01 | 50 | cherry | 100 |
Then, after you have uploaded them to AWS S3, you can use the following DDL statement in order to define the table:
CREATE EXTERNAL TABLE `my_table`(
  `date` timestamp,
  `volume` int,
  `product` string,
  `price` double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION
  's3://my-data-bucket/'
-- Additional table properties
Now when you get a new file data-file-2019-01-04.csv and you upload it to the same location as other files, Athena would be able to query new data as well.
my-data-bucket
├── data-file-2019-01-01.csv
├── data-file-2019-01-02.csv
├── data-file-2019-01-03.csv
└── data-file-2019-01-04.csv
Update 2019-09-19
If your scenario is that you need to update data in the S3 bucket, then you can try to combine views, tables, and keeping different versions of the data.
Let's say you have table_v1 that queries data in s3://my-data-bucket/v1/ location. You create a view for table_v1 which can be seen as a wrapper of some sort:
CREATE VIEW `my_table_view` AS
SELECT *
FROM `table_v1`
Now your users could use my_table_view to query the data in s3://my-data-bucket/v1/ instead of table_v1. When you want to update the data, you can simply upload it to s3://my-data-bucket/v2/ and define a table table_v2. Next, you need to update your my_table_view view since all queries are run against it:
CREATE OR REPLACE VIEW `my_table_view` AS
SELECT *
FROM `table_v2`
After this is done, you can drop table_v1 and delete the files from s3://my-data-bucket/v1/. Provided that the data schema hasn't changed, all queries that ran against the my_table_view view while it was based on table_v1 should still be valid and succeed after my_table_view got replaced.
I don't know what the downtime of replacing a view would be, but I'd expect it to be less than a second, which is definitely less than the time it takes to upload new files.
What most people probably want to do is MSCK REPAIR TABLE <table_name>.
This updates the metadata if you have added more files in the location, but it is only available if your table has partitions.
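For instance, a partitioned variant of the earlier table (the table name, partition column and key=value prefix layout are assumptions) would let MSCK pick up new prefixes as they arrive:
CREATE EXTERNAL TABLE my_table_partitioned (
  `volume` int,
  `product` string,
  `price` double)
PARTITIONED BY (`dt` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data-bucket/';

-- after uploading new files under s3://my-data-bucket/dt=2019-01-04/ :
MSCK REPAIR TABLE my_table_partitioned;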
You might also want to do this with a Glue Crawler which can be scheduled to refresh the table with new data.
Relevant documentation.

Impala External Table Location/URI

I am troubleshooting an application issue on an external (unmanaged) table that was created using the CREATE TABLE X LIKE PARQUET syntax via Cloudera Impala. I am trying to determine the location of the files comprising the partitions of the external table, but I am having difficulty determining how to do this or finding documentation that describes it.
If I do a:
show create table T1;
I see the hive-managed location such as:
LOCATION 'hdfs://nameservice1/user/hive/warehouse/databaseName'
If I do a:
describe formatted T1;
I see that the table is in fact external, but it doesn't give any insight into the unmanaged location.
| Table Type: | EXTERNAL_TABLE
| Location: | hdfs://nameservice1/user/hive/warehouse/databaseName/T1
Question:
How do I determine the Location/URI/Parent Directory of the actual external files that comprise this External Table?
When you create an external table with Impala or Hive and you want to control where its files live, you should specify the HDFS location explicitly, for example:
CREATE EXTERNAL TABLE my_db.table_name
(column string) LOCATION 'hdfs_path';
If you don't provide this, the likely location of these files is under the default directory for the user that executed the CREATE TABLE command.
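To see where individual partitions actually live (their locations can differ from the table location), something along these lines usually works; the partition spec below is hypothetical since the question does not show T1's partition columns:
-- Hive: prints a Location line for the given partition
DESCRIBE FORMATTED T1 PARTITION (year=2016, month=7);
-- Impala: SHOW PARTITIONS lists a location per partition
SHOW PARTITIONS T1;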
For more detail you can see this link:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html
I hope this helps!

How does Hive create a table from a file present in HDFS?

I am new to HDFS and Hive. I got an introduction to both after reading some books and documentation. I have a question regarding the creation of a table in Hive for a file that is present in HDFS.
I have a file with 300 fields in HDFS. I want to create a table accessing this file in HDFS, but I want to make use of, say, 30 fields from this file.
My questions are
1. Does hive create a separate file directory?
2. Do I have to create hive table first and import data from HDFS?
3. Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
My questions are
Does hive create a separate file directory?
YES, if you create a Hive table (managed or external) and load the data using the LOAD command.
NO, if you create an external table and point it at the existing file.
Do I have to create hive table first and import data from HDFS?
Not necessarily; you can create a Hive external table and point it at this existing file.
Since I want to create a table with 30 columns out of 300 columns, Does hive create a file with only those 30 columns?
You can do it easily using HiveQL. Follow the steps below (note: this is not the only approach; a sketch of both steps follows below):
1. Create an external table with 300 columns and point it at the existing file.
2. Create another Hive table with the desired 30 columns and insert data into this new table from the 300-column table using "INSERT INTO table30col SELECT ... FROM table300col". Note: Hive will create the files with 30 columns during this insert operation.
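A minimal sketch of those two steps, with hypothetical paths and only three columns written out (the CSV row format is also an assumption):
-- step 1: external staging table pointing at the directory that already holds the file
CREATE EXTERNAL TABLE table300col (
  col1 STRING,
  col2 STRING,
  col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/existing_dir';

-- step 2: managed table with just the columns you need, then copy the data
CREATE TABLE table30col (
  col1 STRING,
  col3 STRING)
STORED AS ORC;

INSERT INTO TABLE table30col
SELECT col1, col3
FROM table300col;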
Do I have to create a separate file with 30 columns and import into HDFS and then create hive table pointing to HDFS directory?
Yes this can be an alternative.
I personally like the solution mentioned in question 3, as I don't have to recreate the file and I can do all of it in Hadoop without depending on some other system.
You have several options. One is to have Hive simply point to the existing file, i.e. create an external HIVE table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, match your existing file exactly. You must declare all 300 columns. There will be no data duplication; there is only one file, and Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This would copy the data into a Hive table and let Hive control the location. But it is important to understand that neither IMPORT nor LOAD transforms the data, so the resulting table will have exactly the same structure, layout and storage as your original table.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or going through an intermediate staging table created by one of the methods above (preferably an external reference, to avoid an extra copy). Create the desired table, create the external staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (i.e. have only the needed 30 columns) but also, importantly, the storage. Hive has a highly performant columnar storage format, namely ORC, and you should strive to use this storage format because it will give you a tremendous query performance boost.

What will be the dataset size in Hive?

I have 1 TB of data in my HDFS in .csv format. When I load it into my Hive table, what will be the total size of the data? I mean, will there be 2 copies of the same data, i.e. one copy in HDFS and the other in the Hive table? Please clarify. Thanks in advance.
If you create a Hive external table, you provide an HDFS location for the table and the data is stored in that particular location.
When you create a Hive internal table, Hive creates a directory inside the /apps/hive/warehouse/ directory.
Say your table name is table1; then your directory will be /apps/hive/warehouse/table1.
This directory is also an HDFS directory, and when you load data into an internal table it goes into that directory.
Hive creates a mapping between tables and their corresponding HDFS locations, and hence when you read the data it is read from the corresponding mapped directory.
Hence there won't be a duplicate copy of the data for the table and its HDFS location.
But if the data replication factor in your Hadoop cluster is set to 3 (the default), then the data will take 3 TB of cluster disk space (as you have 1 TB of data); that is HDFS replication, though, and has nothing to do with your Hive table.
Please see below link to know more about Data replication.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
It depends on whether you are creating an internal or an external table in Hive.
If you create an external table in Hive, it will create a mapping to where your data is stored in HDFS and there won't be any duplication at all. Hive will automatically pick up the data wherever it is stored in HDFS.
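For example, a sketch (hypothetical path and columns) of pointing an external table at the existing 1 TB of CSVs without copying anything:
CREATE EXTERNAL TABLE my_csv_table (
  id INT,
  name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/csv_input/';
-- no data is moved or copied; Hive only records this mapping in the metastore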
Read more about external tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables