How to exchange partitions within the same Hive table

I need help exchanging a partition within the same table. Let us assume that I have one table with the below definition.
create table test (ID STRING) partitioned by (data_processed string,date1 string);
id data_processed date1
1 0 2018-07-17
1 1 2018-07-16
Now, I want to move the data for partition (date1='2018-07-17') under data_processed partition '1'.
Desired result:
id data_processed date1
1 1 2018-07-17
1 1 2018-07-16
How can I achieve this? Does Hive EXCHANGE PARTITION support multi-level partitions?

You can use the Hive RENAME PARTITION command. Here you can run:
alter table test partition (data_processed='0',date1='2018-07-17')
rename to partition(data_processed='1',date1='2018-07-17');
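To verify, you can list the partitions before and after the rename (a quick check against the test table above):
show partitions test;
Before the rename this should list data_processed=0/date1=2018-07-17 and data_processed=1/date1=2018-07-16; afterwards both partitions should sit under data_processed=1.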


Adding a column to an SQLite database and distributing rows based on primary key

I have some data elements containing a timestamp and information about Item X sales related to this timestamp.
e.g.
timestamp | items X sold
------------------------
1 | 10
4 | 40
7 | 20
I store this data in an SQLite table. Now I want to add more data to this table, especially when I get data about another item Y.
The item Y data might or might not have different timestamps but I want to insert this data into the existing table so that it looks like this:
timestamp | items X sold | items Y sold
------------------------------------------
1 | 10 | 5
2 | NULL | 10
4 | 40 | NULL
5 | NULL | 3
7 | 20 | NULL
Later on additional sales data (columns) must be added with the same scheme.
Is there an easy way to accomplish this with SQLite?
In the end I want to fetch data by timestamp and get an overview of which items were sold at that time. Most examples consider the use case of adding a complete row (one record) or a complete column that perfectly matches the other columns.
Or is SQLite the wrong tool for this altogether, and should I rather use CSV or Excel?
(I am using Python's sqlite3 package to create and manipulate the DB.)
Thanks!
Dynamically adding columns is not a good design. You could add them using
ALTER TABLE your_table ADD COLUMN the_column_name TEXT
For existing rows, the new column would be populated with NULLs, although you could specify a DEFAULT value, in which case the existing rows would be populated with that value instead.
e.g. the following demonstrates the above :-
DROP TABLE IF EXISTS soldv1;
CREATE TABLE IF NOT EXISTS soldv1 (timestamp INTEGER PRIMARY KEY, items_sold_x INTEGER);
INSERT INTO soldv1 VALUES(1,10),(4,40),(7,20);
SELECT * FROM soldv1 ORDER BY timestamp;
ALTER TABLE soldv1 ADD COLUMN items_sold_y INTEGER;
UPDATE soldv1 SET items_sold_y = 5 WHERE timestamp = 1;
INSERT INTO soldv1 VALUES(2,null,10),(5,null,3);
SELECT * FROM soldv1 ORDER BY timestamp;
resulting in the first query returning :-
timestamp | items_sold_x
1         | 10
4         | 40
7         | 20
and the second query returning :-
timestamp | items_sold_x | items_sold_y
1         | 10           | 5
2         | NULL         | 10
4         | 40           | NULL
5         | NULL         | 3
7         | 20           | NULL
However, as stated, the above is not considered a good design as the schema is dynamic.
You could alternatively manage an equivalent of the above by adding either a new column (which would also be part of the primary key) or by prefixing/suffixing the timestamp with a type.
Consider, as an example, the following :-
DROP TABLE IF EXISTS soldv2;
CREATE TABLE IF NOT EXISTS soldv2 (type TEXT, timestamp INTEGER, items_sold INTEGER, PRIMARY KEY(timestamp,type));
INSERT INTO soldv2 VALUES('x',1,10),('x',4,40),('x',7,20);
INSERT INTO soldv2 VALUES('y',1,5),('y',2,10),('y',5,3);
INSERT INTO soldv2 VALUES('z',1,15),('z',2,5),('z',9,25);
SELECT * FROM soldv2 ORDER BY timestamp;
This has replicated your original data and additionally added another type, 'z' (what would otherwise have been an items_sold_z column), without having to change the table's schema, and without the added complication of needing an UPDATE rather than an INSERT (as was needed above to set items_sold_y = 5 for timestamp 1).
The result from the query being :-
type | timestamp | items_sold
x    | 1         | 10
y    | 1         | 5
z    | 1         | 15
y    | 2         | 10
z    | 2         | 5
x    | 4         | 40
y    | 5         | 3
x    | 7         | 20
z    | 9         | 25
Or is SQLite the wrong tool for this altogether, and should I rather use CSV or Excel?
SQLite is a valid tool. What you then do with the data can probably be done as easily as in Excel (perhaps more simply), and probably much more simply than trying to process the data in CSV format.
For example, say you wanted the total items sold per timestamp and how many types were sold then :-
SELECT timestamp, count(items_sold) AS number_of_item_types_sold, sum(items_sold) AS total_sold
FROM soldv2 GROUP BY timestamp ORDER BY timestamp;
would result in :-
timestamp | number_of_item_types_sold | total_sold
1         | 3                         | 30
2         | 2                         | 15
4         | 1                         | 40
5         | 1                         | 3
7         | 1                         | 20
9         | 1                         | 25
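If you still want the one-column-per-item overview from the question when fetching by timestamp, one sketch is to pivot soldv2 with conditional aggregation instead of altering the schema (the items_sold_* aliases are just illustrative):
SELECT timestamp,
       MAX(CASE WHEN type = 'x' THEN items_sold END) AS items_sold_x,
       MAX(CASE WHEN type = 'y' THEN items_sold END) AS items_sold_y,
       MAX(CASE WHEN type = 'z' THEN items_sold END) AS items_sold_z
FROM soldv2
GROUP BY timestamp
ORDER BY timestamp;
Each MAX(CASE ...) picks out the single items_sold value recorded for that type at that timestamp and returns NULL where nothing was sold.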

Unable to delete Hive table partition containing the special character equal sign (=)

I inserted data into a Hive table with the partition column (CL) value 'CL=18', which was stored as /db/tbname/CL=CL%3D18 (the partition path contains the URL-encoded equal sign).
As per the Hortonworks community, Hive stores special characters URL-escaped.
I tried using escape sequences for the equal sign such as \x3D (hex) and \u0030 (unicode), but they did not work.
Ex: alter table tb drop partition (CL='CL\x3D18'); <-- did not work
Can someone help me; am I doing something wrong with the equal (=) sign?
Try alter table tb drop partition(cl="cl=18"); or enclose the partition value in single quotes (') instead.
I have recreated the scenario on my end and was able to drop the partitions with special characters without using any hex or unicode escape sequences.
Example:
I have created a partitioned table with cl as a string-type partition column.
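The table definition was along these lines (a sketch; the id column is only an assumed example column):
hive> create table t1 (id int) partitioned by (cl string);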
hive> alter table t1 add partition(cl="cl=18"); --add the partition to the table
hive> show partitions t1; --list the partitions in the table
+-------------+--+
| partition |
+-------------+--+
| cl=cl%3D18 |
+-------------+--+
hive> alter table t1 drop partition(cl='cl=18'); --drop the partition from the table.
hive> show partitions t1;
+------------+--+
| partition |
+------------+--+
+------------+--+

Add the last modified date of file to Hive external table

I have a requirement where I need to add the time the file was dropped into the HDFS folder as a column in the Hive external table.
Example: I have 2 files dropped on
2017-07-13 15:22
2017-12-13 18:31
So, my last_modified column in the Hive table should reflect 2017-07-13 15:22 for all rows from file 1 and 2017-12-13 18:31 for all rows from file 2.
Is there a way to achieve this in the external table CREATE statement?
Thanks in Advance!
I haven't come across any such feature to solve your problem. However, you can try out the steps below to maintain the last modified time per file in a separate column:
Create a table partitioned on a last_modified column.
CREATE EXTERNAL TABLE test (record string) PARTITIONED BY
(last_modified string) location '<warehouse_location>/test.db/test'
For each file, either add a new partition to your table or load the data into a partition using an INSERT statement.
ALTER TABLE test ADD PARTITION (last_modified='2017-07-13 15:22')
location '<data-location>/newfile1/';
Or, create a separate temp table over the new file and then insert its data into the partitioned table:
CREATE EXTERNAL TABLE tmp (record string) location '<new data location>';
INSERT INTO TABLE test PARTITION (
last_modified = '2017-07-13 15:22') SELECT record FROM tmp;
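Once the partitions are loaded this way, the file's drop time can be queried per row like any other column (a small sketch against the test table above):
SELECT record, last_modified FROM test WHERE last_modified = '2017-07-13 15:22';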

Table partitioning with procedure input parameter

I'm trying to partition my table on an ID which I get from a procedure parameter.
For example my table ddl:
CREATE TABLE bigtable
( ID number )
As the procedure input parameter I get e.g. the number 130, so I'm trying to create a partition:
Alter table bigtable
add partition part_random_number values(random number);
Of course, by random number I mean e.g. 120, 56, etc. :)
But I got an error that the object is not partitioned. So I tried to first define the partition clause in the CREATE TABLE statement:
CREATE TABLE bigtable
( ID number )
PARTITION BY list (ID)
But it doesn't work. It works only when I define some partition, e.g.
CREATE TABLE bigtable
( ID number )
PARTITION BY list (ID)
( partition type values(130)
);
But I would like to avoid that... Is there any other solution?
As a result I would like to have the table partitioned by the procedure input parameters.
A partitioned table has to have at least one partition. Just create it with a dummy partition and add the ones you actually need using your procedure.
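As a rough sketch of that approach (the dummy partition value, the add_id_partition procedure name, and the part_<id> naming are all assumptions for illustration, and whole-number IDs are assumed):
CREATE TABLE bigtable
( ID NUMBER )
PARTITION BY LIST (ID)
( PARTITION part_dummy VALUES (-1) );

CREATE OR REPLACE PROCEDURE add_id_partition(p_id IN NUMBER) IS
BEGIN
  -- partition DDL cannot use bind variables, so build the statement as a string
  EXECUTE IMMEDIATE 'ALTER TABLE bigtable ADD PARTITION part_' || p_id ||
                    ' VALUES (' || p_id || ')';
END;
/

-- usage: add the partition for the value passed in by the caller
EXEC add_id_partition(130);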

On exchanging partitions, do the exchanged records remain in the original partition?

Suppose I have the following table:
create table SRC_TABLE (
ID NUMBER(2),
NAME VARCHAR(20)
) PARTITION BY LIST (ID)
(
PARTITION "PART_1" VALUES(1),
PARTITION "PART_2" VALUES(2)
)
Following are the records in SRC_TABLE
ID NAME
----- -------
1 src1
1 src11
1 src111
2 src2
2 src22
and another staging table:
create table STAGE_TABLE (
ID NUMBER(2),
NAME VARCHAR(20)
)
Following are the records in STAGE_TABLE:
ID NAME
----- -------
2 2src22
On running the following query,
ALTER TABLE "SRC_TABLE" EXCHANGE PARTITION "PART_1" WITH TABLE "STAGE_TABLE" WITHOUT VALIDATION
the data of SRC_TABLE becomes:
ID NAME
----- -------
2 2src22
2 src2
2 src22
So, does the record with name = '2src22' (which came from the stage table as a result of the exchange) now remain in PART_1, or is it in PART_2, since based on its ID it should be in PART_2?
When you use the WITHOUT VALIDATION clause, you're telling Oracle: "don't check the new records if they satisfy the partition clause, I have made sure that they all satisfy the partitioning scheme".
Basically you have introduced corrupted data in your database and you've told Oracle not to perform any check. You've intentionally deactivated the protection, so naturally the records will end up in the wrong partition:
SQL> select * from src_table partition (part_1);
ID NAME
--- ------------------------------------------------------------
2 2src22
I'm sure you'll run into fun bugs if you leave your data in the wrong partition. Some SELECTs may return inconsistent/wrong results. You may also experience unusual error messages.
For instance, simple partition pruning will give the wrong result (thanks @Alex Poole):
SQL> SELECT * FROM src_table WHERE ID = 1;
ID NAME
--- ------------------------------------------------------------
2 2src22
What happens if you actually use validation:
SQL> ALTER TABLE "SRC_TABLE" EXCHANGE PARTITION "PART_1" WITH TABLE STAGE_TABLE;
ORA-14099: all rows in table do not qualify for specified partition
You get a nice error message explaining that you're trying to do something wrong. Don't try to work around error messages by deactivating protections. Correct your data instead.
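As a sketch of that correction using the tables above: check the staging table for rows that do not qualify for the target partition, fix or remove them, and only then exchange, letting the default validation run:
-- rows that do not belong in PART_1 (which only accepts ID = 1)
SELECT * FROM stage_table WHERE id <> 1;
-- fix or remove the offending rows ...
DELETE FROM stage_table WHERE id <> 1;
-- ... then exchange with validation (the default)
ALTER TABLE src_table EXCHANGE PARTITION part_1 WITH TABLE stage_table;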