Creating a partitioned Hive table from a non-partitioned table - hive

I have a Hive table which was created by joining data from multiple tables. The data for this resides in a folder which has multiple files ("0001_1", "0001_2", ... and so on). I need to create a partitioned table based on a date field in this table called pt_dt (either by altering this table or creating a new one). Is there a way to do this?
I've tried creating a new table and inserting into it (below), which did not work:
create external table table2 (acct_id bigint, eval_dt string)
partitioned by (pt_dt string);
insert into table2
partition (pt_dt)
select acct_id, eval_dt, pt_dt
from jmx948_variable_summary;
This throws the error
"FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 189 Cumulative CPU: 401.68 sec HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 6 minutes 41 seconds 680 msec"

Was able to figure it out after some trial & error.
Enable dynamic partitioning in Hive:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Create schema for partitioned table:
CREATE TABLE table1 (id STRING, info STRING)
PARTITIONED BY ( tdate STRING);
Insert into the partitioned table:
FROM table2 t2
INSERT OVERWRITE TABLE table1 PARTITION(tdate)
SELECT t2.id, t2.info, t2.tdate
DISTRIBUTE BY tdate;
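Once the insert completes, you can check that the partitions were actually created:
SHOW PARTITIONS table1;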

In the version I am working with (Hive 0.14.0.2.2.4.2-2), the following works:
INSERT INTO TABLE table1 PARTITION(tdate) SELECT t2.id, t2.info, t2.tdate
From the source table, select the column to be partitioned by last; in the example above, tdate is selected as the last column in the SELECT. Similarly, if you need the table to be partitioned by the column "info", then
INSERT INTO TABLE table1 PARTITION(info) SELECT t2.id, t2.tdate, t2.info
If you want to create the table with multiple partitions, the SELECT query needs to list the partition columns last, in that order. If you want to partition the above table by "tdate" and then "info":
INSERT INTO TABLE table1 PARTITION(tdate, info) SELECT t2.id, t2.tdate, t2.info
With "info", then "tdate":
INSERT INTO TABLE table1 PARTITION(info, tdate) SELECT t2.id, t2.info, t2.tdate
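Note that for these multi-partition inserts to work, the target table itself has to be declared with both partition columns, in the same order as the PARTITION clause. A sketch for the last example, reusing the column names from above:
CREATE TABLE table1 (id STRING)
PARTITIONED BY (info STRING, tdate STRING);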

Related

Combining Columns into One Table in SQL

I have two tables (Table 1 and Table 2) that include information on a company's insurance policies. There are thousands of rows and around 30 columns in each table. I need to create a table that combines certain columns from each table into a new table.
From Table 1 I need:
InvestmentCode, IndexType, Amount, FundID, PolicyNumber
From Table 2 I need:
PolicyNumber, FundValue, DepositDate, FundID
I want to merge the tables by FundID and PolicyNumber.
Actually, creating one more table would introduce data redundancy (the data is already present and you would just be copying it).
You can always create a view for this; for your query, the view would be something like below:
CREATE OR REPLACE VIEW <view_name> AS
SELECT T1.InvestmentCode, T1.IndexType, T1.Amount, T1.FundID, T1.PolicyNumber,
T2.FundValue, T2.DepositDate
FROM Table1 T1, Table2 T2
WHERE T1.FundID = T2.FundID
AND T1.PolicyNumber = T2.PolicyNumber
WITH READ ONLY;
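Once created, the view can be queried like any table; for example (the FundID value here is just a placeholder):
SELECT * FROM <view_name> WHERE FundID = 101;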

1 billion rows table filtering and join

So here is the deal: I have two tables, each with ~1B rows. I am trying to query the tables so I can process the data and insert it into other tables. But I don't need to process all ~1B rows. I only need tens or hundreds of rows to work with each time.
However, the join and the filtering take a long time to return the data.
The tables don't have non-clustered indexes on them but they do have a clustered index on the primary key.
An example of the query:
select Col1, Col2, Col3
from 1b_table_1 t1
inner join (
select *
from 1b_table_2
where expression=condition
) t2
on t1.join_col = t2.join_col
where CAST(t2.timestamp as date) >= date_var1
and CAST(t2.timestamp as date) <= date_var2
UPDATE
I tried adding a non-clustered index on 1b_table_1, but the issue now is that a script running somewhere else is continuously inserting data into these two tables. I can't create a new index, because building it would lock the table, the data writes would begin to fail, and that would cause data loss.
ANOTHER ONE
SELECT count(*) from 1b_table_1
~1.2B
SELECT count(*) from 1b_table_2
~22M
SELECT count(*) from 1b_table_2 where col like condition_string
BEEN RUNNING FOR OVER 5 minutes and no results.
The column here is nvarchar(max)!!
Also, I can't change the table structures or indexes on the table.
First, the subquery is unnecessary. You can write the query as:
select Col1, Col2, Col3
from 1b_table_1 t1 join
1b_table_2 t2
on t1.join_col = t2.join_col
where t2.expression = condition and
t2.timestamp >= date_var1 and
t2.timestamp < dateadd(day, 1, date_var2);
Note that the date filter no longer wraps the column in CAST(); comparing the raw timestamp column keeps the predicate sargable, so an index on it can actually be used. Then for this query, you want indexes on:
1b_table_2(columns in the "expression" filter, timestamp, join_col)
1b_table_1(join_col).
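As for the update about not being able to index the live tables: editions that support online index builds (e.g. Enterprise) can create the index without blocking concurrent writes. A sketch, using the columns from the rewritten query above:
CREATE NONCLUSTERED INDEX IX_1b_table_2_ts
ON [1b_table_2] ([timestamp], join_col)
WITH (ONLINE = ON);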

Create daily log table using triggers

I have a query, the results of which are stored in a table.
select id, name, category, date1, count1, count2, count3
into stage
from table1 t1 join table2 t2 on t1.id = t2.id join table3 t3 on t2.id = t3.id
The results of this query must be stored daily in a new log table with an additional date field added that captures the datetime it was logged.
How do I create this?
You can do it via a trigger, but you cannot keep recreating the table stage, because every time you recreate it (with SELECT ... INTO) you lose the trigger. Try this pattern:
create table t21 (i1 int)   -- source table
create table t21s (i1 int)  -- stage table
create table t2log (i1 int, when1 datetime)  -- log table
go
create trigger t_t21s on t21s after insert
as
set nocount on;
insert into t2log (i1, when1)
select i1, getdate()
from inserted;
go
insert into t21 values (5)  -- put a row in the source table
-- every day (or whatever period you choose), refill the staging table;
-- INSERT leaves the trigger in place, unlike SELECT ... INTO
truncate table t21s
insert into t21s (i1)
select i1 from t21
select * from t21s   -- see what is in stage
select * from t2log  -- see what is in log
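Applied to your query, the idea is to create the stage table once, attach the trigger, and then refill it each day with INSERT ... SELECT instead of SELECT ... INTO. A sketch, assuming stage already exists with matching columns:
truncate table stage
insert into stage (id, name, category, date1, count1, count2, count3)
select id, name, category, date1, count1, count2, count3
from table1 t1 join table2 t2 on t1.id = t2.id join table3 t3 on t2.id = t3.id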

SQL Server 2008 R2: Multiple UNION on different databases table's

I have two databases, namely db1 and db2.
Database db1 contains one table, test1, and database db2 contains two tables, test2 and test3.
Here are the tables with some demo records.
Database: db1
Table: test1
Create table test1
(
ID int,
hrs int,
dates date,
st_date date,
ed_date date
);
insert into test1 values(1,10,'2000-01-01','1900-01-01','2016-01-01');
insert into test1 values(2,20,'2000-02-01','1900-01-01','2016-01-01');
insert into test1 values(3,30,'2000-03-01','1900-01-01','2016-01-01');
insert into test1 values(4,40,'2000-04-01','1900-01-01','2016-01-01');
Database: db2
Table: test2
create table test2
(
ID int,
ID2 int
);
insert into test2 values(1,11);
insert into test2 values(2,22);
insert into test2 values(3,33);
insert into test2 values(4,44);
Database: db2
Table: test3
create table test3
(
ID int,
date date
);
insert into test3 values(1,'2000-01-01');
insert into test3 values(2,'2000-02-02');
insert into test3 values(3,'2000-03-03');
insert into test3 values(4,'2000-04-04');
Note: I am now executing the following query, which unions all three tables, but I am getting a performance issue. The tables test2 and test3 are also present in db1.
select nm, sum(x.avghrs)
from
(
select t1.nm, sum(hrs) as avghrs
from db2.dbo.test3 t3,
db2.dbo.test2 t2,
db1.dbo.test1 t1
where t1.id = t2.id
and t3.id = t2.id
group by t1.nm
union
select t1.nm, sum(hrs) as avghrs
from db1.dbo.test3 t3,
db1.dbo.test2 t2,
db1.dbo.test1 t1
where t1.id = t2.id
and t3.id = t2.id
group by t1.nm
union
select nm, 0 as avghrs
from test1
where dates between st_date and ed_date
) x
group by nm;
Please tell me if any modification is needed.
I think the problem is related to JOINs between columns from tables residing in different databases. You can try the following:
1) mirror the tables in a single database (replicate schema and data)
2) apply the appropriate indexes (at least to contain ids used in JOINs)
3) Change your query to SELECT only from mirrored tables
Also, do you need to de-duplicate your unioned results? If not, UNION should be replaced with UNION ALL to avoid the implicit DISTINCT.
UNION ALL will perform better than UNION whenever you don't need to eliminate duplicate records, because it avoids an expensive distinct sort operation.
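For example, with db2's tables mirrored into db1 under the hypothetical names db2_test2 and db2_test3, the query could be written with explicit JOINs and UNION ALL (a sketch; like your original query, it assumes test1 carries the nm column):
select nm, sum(x.avghrs)
from
(
select t1.nm, sum(t1.hrs) as avghrs
from test1 t1
join db2_test2 t2 on t1.id = t2.id
join db2_test3 t3 on t3.id = t2.id
group by t1.nm
union all
select t1.nm, sum(t1.hrs) as avghrs
from test1 t1
join test2 t2 on t1.id = t2.id
join test3 t3 on t3.id = t2.id
group by t1.nm
union all
select nm, 0 as avghrs
from test1
where dates between st_date and ed_date
) x
group by nm;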

SQL Import Data - Insert Only if it doesn't exists

I am using SQL Server Management Studio 2008 to import data into the web host database. I have tables with primary keys (an Id for each row). I can import data normally. But when I am importing data for the second time, I need to make sure that only the rows that don't currently exist get inserted. Is there a way to do this using the wizard? If not, what's the best practice?
Insert the data into a temp table, then use a LEFT JOIN with the main table to identify which records to insert:
CREATE TABLE T1 (col1 int)
go
CREATE TABLE Temp (col1 int)
go
INSERT INTO T1
SELECT 1
UNION
SELECT 2
INSERT INTO Temp
SELECT 1
UNION
SELECT 2
UNION
SELECT 3
UNION
SELECT 4
-- insert only the rows that do not already exist in T1
INSERT INTO T1
SELECT Temp.col1
FROM Temp
LEFT JOIN T1
ON Temp.col1 = T1.col1
WHERE T1.col1 IS NULL
I used this some time ago; maybe it can help:
insert into timestudio.dbo.seccoes (Seccao,Descricao,IdRegiao,IdEmpresa)
select distinct CCT_COD_CENTRO_CUSTO, CCT_DESIGNACAO, '9', '10' from rhxxi.dbo.RH_CCTT0
where CCT_COD_CENTRO_CUSTO not in (select Seccao from TimeStudio.dbo.Seccoes where idempresa = '10')
Or, just use a simple IF statement.
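For a single row, that IF-based approach would look something like this (a sketch, reusing the T1/col1 example from the first answer; the value 5 is arbitrary):
IF NOT EXISTS (SELECT 1 FROM T1 WHERE col1 = 5)
INSERT INTO T1 (col1) VALUES (5);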