How to delete customer information from hdfs - hive

Suppose, I have several customers today so I am storing their information like customer_id, customer_name, customer_emailid etc. If my customer is leaving and he wants that his personal information should be removed from my hdfs.
So I have below two approaches to achieve the same.
Approach 1:
1.Create Internal Table on top of HDFS
2.Create external table from first table using filter logic
3.While Creating 2nd Table apply udfs on specific columns for more column filtering
Approach 2:
Spark=> Read, filter, write
Is there any other solution?

Approach 2 is possible in Hive - select, filter, write
Create a table on top of directory in hdfs (external or managed, does not matter in this context, better external if you are going to drop table later and keep the data as is). Insert overwrite table or partition from select with filter.
insert overwrite mytable
select *
from mytable --the same table
where customer_id not in (...) --filter rows

Related

Understanding Hive table creation notation

I have come across Hive tables which I need to convert to Redshift/MySql equivalent.
I am having trouble understanding Hive query structure and would appreciate some help:
CREATE TABLE IF NOT EXISTS table_1 (
id BIGINT,
price DOUBLE,
asset string
)
PARTITIONED BY (
pt STRING
);
ALTER TABLE table_1 DROP IF EXISTS PARTITION (pt== '${yyyymmdd}');
INSERT OVERWRITE TABLE table_1 PARTITION (pt= '${yyyymmdd}')
select aa.id,aa.price,aa.symbol from
...
...
from
table_2 table
I am having trouble understanding the PARTITIONED BY clause. This, if I am understanding correctly, is different from MySQL table partitions, and is a Hive specific dynamic partition.
The partition does not define a column or a key, and partitions by the current date.
Does this mean that table_1 is partitioned by the date? Each day has a separate partition?
Then later on in the code there are notations similar to
inner join table_new table on table.pt = '${yyyymmdd}' and ...
In this context, does it mean only rows inserted on yyyymmdd are selected for the join?
Thank you.
Partition in Hive is a folder in HDFS by default with name key=value + metadata in the Hive metastore. You can alter partition location and create partition on top of any folder.
This PARTITIONED BY (pt STRING) defines partition column pt of type string, not date. Partition values are stored in the metadata. The pt column is not present in the table data files, it is only defined in PARTITIONED BY, all partition values are stored in the metadata. If you load partition dynamically, partition folder is being created with name pt='value'.
This sentence creates partition dynamically:
INSERT OVERWRITE TABLE table_1 PARTITION (pt)
select id, price, symbol
coln as pt --partition column should be the last one
from ...
And this sentence loads single STATIC partition:
INSERT OVERWRITE TABLE table_1 PARTITION (pt= '${yyyymmdd}')
select aa.id,aa.price,aa.symbol
from
No partition column is selected, partition value specified in the
PARTITION (pt= '${yyyymmdd}')
'${yyyymmdd}' here is a parameter with name yyyymmdd which is passed to the script using --hivevar like this:
hive --hivevar yyyymmdd=20200604 -f myscript.sql
You can pass ANY string as partition value in this case, though parameter name yyyymmdd suggests it's format.
BTW date format in hive is 'yyyy-MM-dd' Strings in 'yyyy-MM-dd' format can be implicitly converted to DATE.
I will try in one shot explain what is partitioning in Hive. First of all would be
WHEN TO USE TABLE PARTITIONING
Table partitioninig is good when:
Reading the entire dataset takes too long
Queries almost always filter on the partition columns
There are a reasonable number of different values for partition columns
Data generation of ETL process splits data by file or directory names
Partition column values are not in the data itself
Don't partition on columns with many unique values
Example: Partitioning customers by first name
CREATING PARTITIONED TABLES
To create a partitioned table, use the PARTITIONED BY clause in the CREATE TABLE statement.
The names and types of the partition columns must be specified
in the PARTITIONED BY clause, and only in the PARTITIONED BY clause.
They must not also appear in the list of all the other columns.
CREATE TABLE customers_by_country
(cust_id STRING, name STRING)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
The example CREATE TABLE statement shown above creates the table customers_by_country,
which is partitioned by the STRING column named country.
Notice that the country column appears only in the PARTITIONED BY clause,
and not in the column list above it.
This example specifies only one partition column, but you can specify more than one by using
a comma-separated column list in the PARTITIONED BY clause.
Aside from these specific differences, this CREATE TABLE statement is the same
as the statement used to create an equivalent non-partitioned table.
Table partitioning is implemented in a way that is mostly transparent
to a user issuing queries with Hive.
A partition column is what’s known as a virtual column, because its values are not stored within the data files.
Following is the result of the DESCRIBE command on customers_by_country;
it displays the partition column country just as if it were a normal column within the table.
You can refer to partition columns in any of the usual clauses of a SELECT statement.
name type comment
cust_id string
name string
country string
You can load data in partitioned tables dynamically or statically
LOADING DATA WITH DYNAMIC PARTITION
One way to load data into a partitioned table is to use dynamic partitioning,
which automatically defines partitions when you load the data, using the values in the partition column.
(The other way is to manually define the partitions with Static Partitioning)
To use dynamic partitioning, you must load data using an INSERT statement.
In the INSERT statement, you must use the PARTITION clause to list the partition columns.
The data you are inserting must include values for the partition columns.
The partition columns must be the rightmost columns in the data you are inserting,
and they must be in the same order as they appear in the PARTITION clause.
INSERT OVERWRITE TABLE customers_by_country
PARTITION(country)
SELECT cust_id, name, country FROM customers;
The example shown above uses an INSERT … SELECT statement
to load data into the customers_by_country table with dynamic partitioning.
Notice that the partition column, country, is included
in the PARTITION clause and is specified last in the SELECT list.
When Hive executes this statement, it automatically creates partitions
for the country column and loads the data into these partitions based on the values in the country column.
The resulting data files in the partition subdirectories do not include values for the country column.
Since the country is known based on which subdirectory a data file is in,
it would be redundant to include country values in the data files as well.
Look at the contents of the customers_by_country directory.
It should now have one subdirectory for each value in the country column.
Look at the file in one of those directories.
Notice that the file contains the row for the customer from that country,
and no others; notice also that the country value is not included.
Note: Hive includes a safety feature that prevents users
from accidentally creating or overwriting a large number of partitions.
(See “Risks of Using Partitioning” for more about this.)
By default, Hive sets the property hive.exec.dynamic.partition.mode to strict.
This prevents you from using dynamic partitioning, though you can still use static partitions.
You can disable this safety feature in Hive by setting
the property hive.exec.dynamic.partition.mode to nonstrict:
SET hive.exec.dynamic.partition.mode=nonstrict;
Then you can use the INSERT statement to load the data dynamically.
Hive properties set in Beeline are for the current session only,
so the next time you start a Hive session this property will be set back to strict.
But you or your system administrator can configure properties permanently, if necessary.
When you run some SELECT queries on the partitioned table, if the table is big enough you can note significant difference in the time it takes to run.
Notice that you will not query the table any differently than you would query the customers table.
LOADING DATA WITH STATIC PARTITIONING
One way to load data into a partitioned table is to use static partitioning,
in which you manually define the different partitions.
With static partitioning, you create a partition manually, using an ALTER TABLE … ADD PARTITION statement,
and then load the data into the partition.
For example, this ALTER TABLE statement creates the partition for Pakistan (pk):
ALTER TABLE customers_by_country
ADD PARTITION (country='pk');
Notice how the partition column name, which is country, and the specific value that defines this partition,
which is pk, are both specified in the ADD PARTITION clause.
This creates a partition directory named country=pk inside the customers_by_country table directory.
After the partition for Pakistan is created, you can add data into the partition using an INSERT … SELECT statement:
INSERT OVERWRITE TABLE customers_by_country
PARTITION(country='pk')
SELECT cust_id, name FROM customers WHERE country='pk'
Notice how in the PARTITION clause, the partition column name, which is country,
and the specific value, which is pk, are both specified, just like in the ADD PARTITION command used to create the partition.
Also notice that in the SELECT statement, the partition column is not included in the SELECT list.
Finally, notice that the WHERE clause in the SELECT statement selects only customers from Pakistan.
With static partitioning, you need to repeat these two steps for each partition:
first create the partition, then add data.
You can actually use any method to load the data; you need not use an INSERT statement.
You could instead use hdfs dfs commands or a LOAD DATA INPATH command.
But however you load the data, it’s your responsibility to ensure that data is stored in the correct partition subdirectories.
For example, data for customers in Pakistan must be stored in the Pakistan partition subdirectory,
and data for customers in other countries must be stored in those countries’ partition subdirectories.
Static partitioning is most useful when the data being loaded
into the table is already divided into files based on the partition column,
or when the data grows in a manner that coincides with the partition column:
For example, suppose your company opens a new store in a different country,
like New Zealand ('nz'), and you're given a file of data for new customers, all from that country.
You could easily add a new partition and load that file into it.
RISKS OF USING PARTITIONING
A major risk when using partitioning is creating partitions that lead you into the small files problem.
When this happens, partitioning a table will actually worsen query performance
(the opposite of the goal when using partitioning) because it causes too many small files to be created.
This is more likely when using dynamic partitioning, but it could still
happen with static partitioning—for example if you added a new partition to a sales table
on a daily basis containing the sales from the previous day,
and each day’s data is not particularly big.
When choosing your partitions, you want to strike a happy balance between too many partitions
(causing the small files problem) and too few partitions (providing performance little benefit).
The partition column or columns should have a reasonable number of values
for the partitions—but what you should consider reasonable is difficult to quantify.
Using dynamic partitioning is particularly dangerous because if you're not careful,
it's easy to partition on a column with too many distinct values.
Imagine a use case where you are often looking for data that falls within
a time frame that you would specify in your query.
You might think that it's a good idea to partition on a column that pertains to time.
But a TIMESTAMP column could have the time to the nanosecond, so every row could have a unique value;
that would be a terrible choice for a partition column! Even to the minute or hour could create
far too many partitions, depending on the nature of your data;
partitioning by larger time units like day, month, or even year might be a better choice.
As another example, consider an employees table.
This has five columns: empl_id, first_name, last_name, salary, and office_id.
Before reading on, think for a moment, which of these might be reasonable for partitioning
The column empl_id is a unique identifier.
If that were your partition column, you would have a separate partition for each employee,
and each would have exactly one row.
In addition, it's not likely you'll be doing a lot of queries looking for a particular value,
or even a particular range of values. This is a poor choice.
The column first_name will not have one per employee, but there will likely be many columns that have only one row.
This is also true for last_name.
Also, like empl_id, it's not likely you'll need filter queries based on these columns. These are also poor choices.
The column salary also will have many divisions
(and even more so if your salaries go to the cent rather than to the dollar as our sample table does).
While it may be that you'll sometimes want to query on salary ranges,
it's not likely you'll want to use individual salaries.
So salary is a poor choice.
A more limited salary_grades specification, like the ones in the salary_grades table,
might be reasonable if your use case involves looking at the data by salary grade frequently.
The office_id column identifies the office where an employee works.
This will have a much smaller number of unique values, even if you have a large company with offices in many cities.
It's imaginable that your use case might be to frequently filter
your employee data based on office location, too. So this would be a good choice.
You also can use multiple columns and create nested partitions.
For example, a dataset of customers might include country and state_or_province columns.
You can partition by country and then partition those further by state_or_province, so customers from Ontario,
Canada would be in the country=ca/state_or_province=on/ partition directory.
This can be extremely helpful for large amounts of data that you want to access either by country or by state or province.
However, using multiple columns increases the danger of creating too many partitions, so you must take extra care when doing so.
The risk of creating too many partitions is why Hive includes the property
hive.exec.dynamic.partition.mode, set to strict by default, which must be reset to nonstrict before you can create a partition.
Rather than automatically and mechanically resetting that property when you're about to load data dynamically,
take it as an opportunity to think about the partitioning columns
and maybe check the number of unique values you would get when you load the data.
And that's all.

Using ingestion-time based pseudo-field (_PARTITIONTIME) as partition while clustering

I'd like to cluster our ingestion-time partitioned tables without having to change the ETL scripts we use to update them. All of our tables are partitioned on the pseudo-field _PARTITIONTIME, now when I try cluster a table with DML I get the following error:
Invalid field name "_PARTITIONTIME". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, TABLE, FILE and _ROW_TIMESTAMP
Here's what the DML-script looks like:
CREATE TABLE `table_target`
PARTITION BY DATE(_PARTITIONTIME)
CLUSTER BY a, b, c
AS
SELECT
*, _PARTITIONTIME
FROM
`table_source`
How should I go about this? Is there a way to keep the same pseudo-field as the partition field, should I re-work the partition field, or am I missing something here?
It is Known limitation that:
It is not possible to create an ingestion-time partitioned table from the result of a query. Instead, use a CREATE TABLE DDL statement to create the table, and then use an INSERT DML statement to insert data into it.
In your case, you need to use CREATE TABLE to create target_table with CLUSTER BY first, then migrate data over.

Hive - Huge 10TB table repartitioning (Adding new partition columns)

Techies,
Background -
We have 10TB existing hive table which has been range partitioned on column A. Business case has changes which now require adding of partition column B in addition to Column A.
Problem statement -
Since data on HDFS is too huge and needs to be restructured to inherit the new partition column B, we are facing difficulty to copy over table onto backup and re-ingest using simple IMPALA INSERT OVERWRITE into main table.
We want to explore if there is/ are efficient way to handle adding over partition columns to such huge table
Alright!
If I understand your situation correctly, you have a table backed by 10 TB of data in HDFS with partition on Column A and you want to add the partition also on Column B.
So, if Column B is going to be the sub partition, the HDFS directory would look like user/hive/warehouse/database/table/colA/colB or /colB/colA otherwise (considering it as an managed table).
Restructuring the HDFS directory manually won't be a great idea because it will become a nightmare to scan the data on all files and organize it accordingly in its corresponding folder.
Below is one way of doing it,
1. Create a new table with new structure - i.e., with partitions on Col A and Col B.
CREATE TABLE NEWTABLE ( COLUMNS ... ) PARTITON ON ( COL_A INT, COL_B INT )
2.a. Insert data from the old table to the new table (created in Step #1) like below,
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
But Yes, this step is going to consume a lot of resources during execution if not handled properly, space in HDFS for storing the results as data for NEWTABLE and of-course the time.
OR
2.b. If you think that HDFS will not have enough space to hold all the data or resource crunch, I would suggest you to this INSERT in batches with removal of old data after each INSERT operations.
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='abc'
DELETE FROM OLDTABLE
WHERE COL_A='abc'
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='def'
DELETE FROM OLDTABLE
WHERE COL_A='def'
.
.
.
so on.
This way, you can unload HDFS with already handled data and balancing the space.
If you follow step 2.b. then you can write a Script to automate this process by passing the partition names (derived from SHOW PARTITIONS) dynamically for each run. But, try the first two attempts manually before going with automation to make sure things go as expected.
Let me know if it helps!

Multiple Parquet files while writing to Hive Table(Incremental)

Having a Hive table that's partitioned
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first LOAD is done from ORACLE to HIVE via PYSPARK using
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
Which works fine and creates partition dynamically during the run. Now coming to data loading incrementally everyday creates individual files for a single record under the partition.
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is there a possibility to have the value appended to the existing parquet file under the partition until it reaches it block size, without having smaller files created for each insert.
Rewriting the whole partition is one option but I would prefer not to do this
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for the Hive
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
Which still doesn't help with daily inserts. Any alternate approach that can be followed would be really helpful.
As Per my knowledge we cant store the single file for daily partition data since data will be stored by different part files for each day partition.
Since you mention that you are importing the data from Oracle DB so you can import the entire data each time from oracle DB and overwrite into HDFS. By this way you can maintain the single part file.
Also HDFS is not recommended for small amount data.
I could think of the following approaches for this case:
Approach1:
Recreating the Hive Table, i.e after loading incremental data into CUSTOMER_PART table.
Create a temp_CUSTOMER_PART table with entire snapshot of CUSTOMER_PART table data.
Run overwrite the final table CUSTOMER_PART selecting from temp_CUSTOMER_PART table
In this case you are going to have final table without small files in it.
NOTE you need to make sure there is no new data is being inserted into CUSTOMER_PART table after temp table has been created.
Approach2:
Using input_file_name() function by making use of it:
check how many distinct filenames are there in each partition then select only the partitions that have more than 10..etc files in each partition.
Create an temporary table with these partitions and overwrite the final table only the selected partitions.
NOTE you need to make sure there is no new data is being inserted into CUSTOMER_PART table after temp table has been created because we are going to overwrite the final table.
Approach3:
Hive(not spark) offers overwriting and select same table .i.e
insert overwrite table default.t1 partition(partiton_column)
select * from default.t1; //overwrite and select from same t1 table
If you are following this way then there needs to be hive job triggered once your spark job finishes.
Hive will acquire lock while running overwrite/select the same table so if any job which is writing to table will wait.
In Addition: Orc format will offer concatenate which will merge small ORC files to create a new larger file.
alter table <db_name>.<orc_table_name> [partition_column="val"] concatenate;

Oracle SQL merge tables without specifying columns

I have a table people with less than 100,000 records and I have taken a backup of this table using the following:
create table people_backup as select * from people
I add some new records to my people table over time, but eventually I want to merge the records from my backup table into people. Unfortunately I cannot simply DROP my table as my new records will be lost!
So I want to update the records in my people table using the records from people_backup, based on their primary key id and I have found 2 ways to do this:
MERGE the tables together
use some sort of fancy correlated update
Great! However, both of these methods use SET and make me specify what columns I want to update. Unfortunately I am lazy and the structure of people may change over time and while my CTAS statement doesn't need to be updated, my update/merge script will need changes, which feels like unnecessary work for me.
Is there a way merge entire rows without having to specify columns? I see here that not specifying columns during an INSERT will direct SQL to insert values by order, can the same methodology be applied here, is this safe?
NB: The structure of the table will not change between backups
Given that your table is small, you could simply
DELETE FROM table t
WHERE EXISTS( SELECT 1
FROM backup b
WHERE t.key = b.key );
INSERT INTO table
SELECT *
FROM backup;
That is slow and not particularly elegant (particularly if most of the data from the backup hasn't changed) but assuming the columns in the two tables match, it does allow you to not list out the columns. Personally, I'd much prefer writing out the column names (presumably those don't change all that often) so that I could do an update.