I am trying to run insert overwrite over a partitioned table.
The SELECT query of the INSERT OVERWRITE omits one partition completely. Is this the expected behavior?
Table definition
CREATE TABLE `cities_red`(
`cityid` int,
`city` string)
PARTITIONED BY (
`state` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'auto.purge'='true',
'last_modified_time'='1555591782',
'transient_lastDdlTime'='1555591782');
Table Data
+--------------------+------------------+-------------------+--+
| cities_red.cityid | cities_red.city | cities_red.state |
+--------------------+------------------+-------------------+--+
| 13 | KARNAL | HARYANA |
| 13 | KARNAL | HARYANA |
| 1 | Nagpur | MH |
| 22 | Mumbai | MH |
| 22 | Mumbai | MH |
| 755 | BPL | MP |
| 755 | BPL | MP |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 10 | BANGLORE | TN |
| 12 | NOIDA | UP |
| 12 | NOIDA | UP |
+--------------------+------------------+-------------------+--+
Queries
insert overwrite table cities_red partition (state) select * from cities_red where city !='NOIDA';
It does not delete any data from the table
insert overwrite table cities_red partition (state) select * from cities_red where city !='Mumbai';
It removes the expected 2 rows from the table.
Is this an expected behavior from Hive in case of partitioned tables?
Yes, this is expected behavior.
INSERT OVERWRITE TABLE ... PARTITION ... SELECT ... overwrites only those partitions that are present in the dataset returned by the SELECT.
In your example, the partition state=UP contains records with city='NOIDA' only. The filter where city != 'NOIDA' removes the entire state=UP partition from the returned dataset, which is why that partition is not rewritten.
The filter city != 'Mumbai' does not remove an entire partition; the state=MH partition is still partially returned, which is why it is overwritten with the filtered data.
It works as designed. Consider the scenario where you need to overwrite only selected partitions, which is quite normal for incremental partition loads: you do not want to touch the other partitions, and overwriting unchanged partitions can be very expensive to recover from.
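For illustration, a minimal sketch of such an incremental load; staging_cities is a hypothetical source table, while the target table and columns are those from the example above:

-- Only the partitions present in the SELECT result (here state='MH') are overwritten;
-- every other partition of cities_red stays untouched.
INSERT OVERWRITE TABLE cities_red PARTITION (state)
SELECT cityid, city, state
FROM staging_cities
WHERE state = 'MH';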
If you still want to drop partitions and modify data in existing partitions, you can drop and re-create the table (you may need an intermediate table for this) and then load the partitions into it.
Alternatively, calculate which partitions need to be dropped separately and execute ALTER TABLE ... DROP PARTITION.
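For example, a minimal sketch of explicitly dropping the partition that the first SELECT filtered out entirely (state='UP' in the data above):

-- Remove the partition that the INSERT OVERWRITE left untouched.
ALTER TABLE cities_red DROP IF EXISTS PARTITION (state = 'UP');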
Suppose I have the following table:
|---------------------|
| id |
|---------------------|
| 12 |
|---------------------|
| 390 |
|---------------------|
| 13 |
|---------------------|
And I want to create another column based on a map of the id column, for example:
12 -> qwert
13 -> asd
390 -> iop
So I basically want a query that creates a column based on that map; my final table would be:
|---------------------|---------------------|
| id | col |
|---------------------|---------------------|
| 12 | qwert |
|---------------------|---------------------|
| 390 | iop |
|---------------------|---------------------|
| 13 | asd |
|---------------------|---------------------|
I have this map in a python dictionary.
Is this possible?
(It is basically pandas.map)
It appears that you wish to "fix" some data that is already in your PostgreSQL database.
You could include the data using this technique:
WITH foo AS (
    VALUES (12, 'qwert'), (13, 'asd'), (390, 'iop')
)
SELECT t.id, foo.column2   -- column1/column2 are the default names given to a VALUES list
FROM your_table t          -- "your_table" is a placeholder for your actual table name
JOIN foo ON (foo.column1 = t.id)
You could do it as an UPDATE statement, but it gets tricky. It would probably be easier to craft a SELECT statement that has everything you want, then use CREATE TABLE new_table AS SELECT...
See: CREATE TABLE AS - Amazon Redshift
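A minimal sketch of that CREATE TABLE ... AS SELECT approach, assuming the source table is called your_table and using a LEFT JOIN so ids without a mapping are kept:

-- Materialize the mapped column into a new table; col is NULL where the id has no mapping.
CREATE TABLE new_table AS
WITH foo AS (
    VALUES (12, 'qwert'), (13, 'asd'), (390, 'iop')
)
SELECT t.id,
       foo.column2 AS col
FROM your_table t
LEFT JOIN foo ON foo.column1 = t.id;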
I'm looking for a mechanism to control the accuracy of the data that I import daily into multiple BigQuery tables. Each table has a similar format, with a DATE and an ID column. The table format looks like this:
Table_1
| DATE | ID |
| 2018-10-01 | A |
| 2018-10-01 | B |
| 2018-10-02 | A |
| 2018-10-02 | B |
| 2018-10-02 | C |
What I want to control is the evolution of the number of IDs, through an output table like this:
CONTROL_TABLE
| DATE | COUNT(Table1.ID) | COUNT(Table2.ID) | COUNT(Table3.ID) |
| 2018-10-01 | 2 | 487654 | 675386 |
| 2018-10-02 | 3 | 488756 | 675447 |
I'm trying to do this through a single SQL query, but I face several limits with the DML, such as:
-> One single SELECT joining all the tables is out of the question for performance reasons (20+ tables with millions of lines)
-> I was thinking of going through temporary tables, but it seems I cannot run multiple DELETE + INSERT statements on several tables with DML
-> I cannot use a wildcard table as the output of the query
Would anyone have an idea how to get such a result in an optimized way, ideally through a single query?
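For illustration only, a hedged sketch of one way this could look in BigQuery standard SQL: aggregate each table down to one row per DATE first, then combine the small per-table results (the dataset and table names are assumptions based on the example above):

-- Count IDs per DATE in each source table separately (one scan per table),
-- then join the already-aggregated results by DATE to build the control table.
WITH t1 AS (SELECT DATE, COUNT(ID) AS cnt_table1 FROM `my_dataset.Table_1` GROUP BY DATE),
     t2 AS (SELECT DATE, COUNT(ID) AS cnt_table2 FROM `my_dataset.Table_2` GROUP BY DATE),
     t3 AS (SELECT DATE, COUNT(ID) AS cnt_table3 FROM `my_dataset.Table_3` GROUP BY DATE)
SELECT DATE, cnt_table1, cnt_table2, cnt_table3
FROM t1
FULL OUTER JOIN t2 USING (DATE)
FULL OUTER JOIN t3 USING (DATE)
ORDER BY DATE;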
I have a database in which there are two tables, tableA and tableB. Now for each primary key in tableA there may be multiple rows in tableB.
Table A primary key (ServiceOrderId)
+----------------+-------+-------+-------------+
| ServiceOrderId | Tax | Total | OrderNumber |
+----------------+-------+-------+-------------+
| 12 | 45.00 | 347 | 1011 |
+----------------+-------+-------+-------------+
Table B foreign key (ServiceOrderId)
+----+-------------+---------------------+----------+-------+------+----------------+
| Id | ServiceName | ServiceDescription | Quantity | Price | Cost | ServiceOrderId |
+----+-------------+---------------------+----------+-------+------+----------------+
| 39 | MIN-C | Commercial Pretreat | NULL | 225 | 23 | 12 |
+----+-------------+---------------------+----------+-------+------+----------------+
| 40 | MIN-C | Commercial Pretreat | NULL | 225 | 25 | 12 |
+----+-------------+---------------------+----------+-------+------+----------------+
Is there a way in which I can fetch the multiple rows of tableB as an array together with the single row of tableA? (When saving to the database I use a temp table to store multiple rows of tableB with a single row of tableA.)
Query I am using
SELECT
ordr.*,
info.*
FROM
tblServiceOrder as ordr
JOIN
tblServiceOrderInfo as info ON ordr.ServiceOrderId = info.ServiceOrderId
But the above query gives two rows for each ServiceOrderId. I am using a Node API to fetch the data. I want something like:
Object: {
    objectA: { id:12, tax:45.00, total:347, ordernumber:1011 },
    objectB: [
        { id:39, servicename:'MIN-C', description:'Commercial Pretreat', Quantity:NULL, Price:225, Cost:23, ServiceOrderId:12 },
        { id:40, servicename:'MIN-C', description:'Commercial Pretreat', Quantity:NULL, Price:225, Cost:25, ServiceOrderId:12 }
    ]
}
There are several solutions. The first one is to use your SELECT, but add ORDER BY ordr.ServiceOrderId; then, when converting the data to objects, take the order fields from the first row of each ServiceOrderId (the ordr columns) and append every row's info columns to the array.
Another possibility is to select data from the ordr table only and, for every row, make another select on the info table by ServiceOrderId. This solution should not be used for huge tables.
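A minimal sketch of the first approach, using the same tables as in the question; grouping into objectA/objectB then happens in the application code:

-- Order the joined rows so that all info rows for one ServiceOrderId arrive together;
-- the application loop can then build one objectA per order and push each row into objectB.
SELECT
    ordr.*,
    info.*
FROM
    tblServiceOrder as ordr
JOIN
    tblServiceOrderInfo as info ON ordr.ServiceOrderId = info.ServiceOrderId
ORDER BY
    ordr.ServiceOrderId;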
I have a table that contains the history of Customer IDs that have been merged in our CRM system. The data in the historical reporting Oracle schema exists as it was when the interaction records were created. I need a way to find the current ID associated with a customer, starting from a potentially old ID. To make this a bit more interesting, I do not have permission to create PL/SQL for this; I can only run SELECT statements against this data.
Sample Data in customer ID_MERGE_HIST table
| OLD_ID | NEW_ID |
+----------+----------+
| 44678368 | 47306920 |
| 47306920 | 48352231 |
| 48352231 | 48780326 |
| 48780326 | 50044190 |
Sample Interaction table
| INTERACTION_ID | CUST_ID |
+----------------+----------+
| 1 | 44678368 |
| 2 | 48352231 |
| 3 | 80044190 |
I would like a query with a recursive sub-query to provide a result set that looks like this:
| INTERACTION_ID | CUST_ID | CUR_CUST_ID |
+----------------+----------+-------------+
| 1 | 44678368 | 50044190 |
| 2 | 48352231 | 50044190 |
| 3 | 80044190 | 80044190 |
Note: Cust_ID 80044190 has never been merged, so does not appear in the ID_MERGE_HIST table.
Any help would be greatly appreciated.
You can look at the CONNECT BY construct.
Also, you might want to play with recursive WITH (one of the descriptions: http://gennick.com/database/understanding-the-with-clause). CONNECT BY is better, but Oracle-specific.
If this is a frequent request, you may want to store the first/last cust_id for all related records:
The first cust_id will be static, but will require two hops to get to the current one.
The last cust_id will give you the result immediately, but requires an update of the whole tree with every new record.
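For illustration, a hedged sketch of the recursive WITH approach; ID_MERGE_HIST comes from the question, while INTERACTION is an assumed name for the interaction table:

-- Map every historical id to the end of its merge chain, then join to the interactions;
-- ids that were never merged keep their own value via NVL.
WITH chain (old_id, cur_id) AS (
    SELECT m.OLD_ID, m.NEW_ID
      FROM ID_MERGE_HIST m
     WHERE m.NEW_ID NOT IN (SELECT OLD_ID FROM ID_MERGE_HIST)  -- chain endpoints
    UNION ALL
    SELECT m.OLD_ID, c.cur_id
      FROM ID_MERGE_HIST m
      JOIN chain c ON c.old_id = m.NEW_ID
)
SELECT i.INTERACTION_ID,
       i.CUST_ID,
       NVL(c.cur_id, i.CUST_ID) AS CUR_CUST_ID
  FROM INTERACTION i
  LEFT JOIN chain c ON c.old_id = i.CUST_ID;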
After the data import to my Cassandra Test-Cluster I found out that I need secondary indexes for some of the columns. Since the data is already inside the cluster, I want to achieve this by updating the ColumnFamilyDefinitions.
Now, the problem is: those columns are dynamic columns, so they are invisible to the getColumnMetaData() call.
How can I check via Hector if a secondary index has already been created and create one if this is not the case?
(I think the part about how to create one can be found in http://comments.gmane.org/gmane.comp.db.hector.user/3151 )
If this is not possible, do I have to copy all data from this dynamic column family into a static one?
There is no need to copy all data from the dynamic column family into a static one.
Then how? Let me explain with an example. Suppose you have the CF schema below:
CREATE TABLE sample (
KEY text PRIMARY KEY,
flag boolean,
name text
)
NOTE: I have created secondary indexes on flag and name.
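For reference, a minimal sketch of how those two indexes could be created (the index names are arbitrary):

-- Secondary indexes on the statically defined columns flag and name.
CREATE INDEX sample_flag_idx ON sample (flag);
CREATE INDEX sample_name_idx ON sample (name);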
Now here is some data in the CF:
KEY,1 | address,Kolkata | flag,True | id,1 | name,Abhijit
KEY,2 | address,Kolkata | flag,True | id,2 | name,abc
KEY,3 | address,Delhi | flag,True | id,3 | name,xyz
KEY,4 | address,Delhi | flag,True | id,4 | name,pqr
KEY,5 | address,Delhi | col1,Hi | flag,True | id,4 | name,pqr
From the data you can see that address, id & col1 are all dynamically created columns.
Now if I run a query like this:
SELECT * FROM sample WHERE flag =TRUE AND col1='Hi';
Note: col1 is not indexed, but I can still filter on that field.
Output:
KEY | address | col1 | flag | id | name
-----+---------+------+------+----+------
5 | Delhi | Hi | True | 4 | pqr
Another Query
SELECT * FROM sample WHERE flag =TRUE AND id>=1 AND id <5 AND address='Delhi';
Note: here neither id nor address is indexed, yet I still get the output.
Output:
KEY,3 | address,Delhi | flag,True | id,3 | name,xyz
KEY,4 | address,Delhi | flag,True | id,4 | name,pqr
KEY,5 | address,Delhi | col1,Hi | flag,True | id,4 | name,pqr
So basically, if you have an indexed column whose value is always something you know, you can easily filter on the rest of the dynamic columns by combining them with that always-matching indexed column.