Hive | Create partition on a date

I need to create an external Hive table on top of a CSV file. The CSV has col1, col2, col3 and col4.
My external Hive table should be partitioned by month, but the CSV file doesn't have any month field; col1 is a date field.
How can I do this?

You need to reload the data into a partitioned table.
Create a non-partitioned table (mytable) on top of the folder with the CSV, for example as sketched below.
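A minimal sketch of that staging table (the column types and the HDFS location are placeholders, not from the question):
create external table mytable(
col1 string, --the date field in the CSV
col2 string,
col3 string,
col4 string
)
row format delimited fields terminated by ','
stored as textfile
location 'hdfs://somepath/csv_folder'; --placeholder path of the folder containing the CSV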
Then create a partitioned table (mytable_part):
create table mytable_part(
--columns specification here for col1, col2, col3, col4
)
partitioned by (part_month string) ...
stored as textfile --you can choose any format you need
Load data into the partitioned table using dynamic partitioning, calculating the partition column in the query:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table mytable_part partition (part_month)
select
col1, col2, col3, col4,
substr(col1, 1, 7) as part_month --partition column in yyyy-MM format
from mytable
distribute by substr(col1, 1, 7) --to reduce the number of files
;

Try this way:
Copy the CSV data into a folder at HDFS location hdfs://somepath/5 and add that path to your external table as a partition.
create external table ext1(
col1 string
,col2 string
,col3 string
,col4 string
)
partitioned by (mm int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE; --CSV data should be stored as TEXTFILE, not ORC
alter table ext1 add partition(mm = 5) location 'hdfs://yourpath/5';

Related

How to create a temp table in redshift to load csv data with varying number of columns in csv?

I am trying to push a data frame with a varying number of columns to AWS Redshift.
This is the data frame header:
col1 col2 col3
I have created a temp table using something like this:
DROP TABLE temp;
CREATE TABLE temp (
col1 int,
col2 int,
col3 int
);
But now the data frame has two new columns, and the number of columns keeps changing.
How do I drop and re-create this temp table based on the changing data frame columns?
col1 col2 col3 col4 col5
Is there any way to tackle this in one shot, or do I keep editing the DDL every time data is read?
Assuming that you are loading the data via a COPY command from S3, maybe you can try creating a table with the maximum number of columns you expect to receive in the CSV files, and then use the FILLRECORD flag (https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-fillrecord).
This way, if the file contains fewer columns, the remaining columns will have NULL values.
For example, if your file has 3 columns but the temp table has 5:
col1  col2  col3  col4  col5
2     5     6     NULL  NULL
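A minimal sketch of such a COPY, assuming the file sits in S3 (the bucket path and IAM role are placeholders):
COPY temp
FROM 's3://your-bucket/path/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
CSV
FILLRECORD;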

Is it possible to convert a hive table format to ORC and make it bucketed

I have a set of Hive tables that are not in ORC format and are not bucketed. I want to change their format to ORC as well as make them bucketed. I couldn't find a concrete answer anywhere on the net; any answer or guidance is appreciated.
Hive version is 2.3.5
Or is it possible to do it in Spark (PySpark or Scala)?
The simplest solution would be to create a new table which is bucketed and in ORC format and then insert into it from the old table, but I am looking for an in-place solution.
Hive:
Use a staging table to read the un-bucketed data (assuming TEXTFILE format) using these commands:
CREATE TABLE staging_table(
col1 colType,
col2 colType, ...
coln colType
)
STORED AS TEXTFILE
LOCATION '/path/of/input/data';
CREATE TABLE target_table(
col1 colType,
col2 colType, ...
coln colType
)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;
INSERT OVERWRITE TABLE target_table
SELECT
col1, col2, ..., coln
FROM
staging_table;
The same can be done with the Spark DataFrame API (assuming CSV input) like this:
df = spark.read.format("csv") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .option("path", "/path/of/input/data/") \
    .load()
df.write.format("orc") \
    .option("path", "/path/of/output/data/") \
    .save()
Create bucketed table and load data into it using INSERT OVERWRITE:
CREATE TABLE table_bucketed(col1 string, col2 string)
CLUSTERED BY(col1) INTO 10 BUCKETS
STORED AS ORC;
INSERT OVERWRITE TABLE table_bucketed
select ...
from table_not_bucketed;
See also Sorted Bucketed Table.
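A sorted bucketed table only differs in the DDL; a rough sketch (the sort column here is an assumption):
CREATE TABLE table_bucketed_sorted(col1 string, col2 string)
CLUSTERED BY(col1) SORTED BY(col2 ASC) INTO 10 BUCKETS
STORED AS ORC;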

Creating temp table from another table including partition column in hive

I am creating a temp table from another table using the AS clause, where I include the partition column of the other table as part of the temp table. Below is the CREATE statement, where col4 is the partition column of table xyz.
While running the CREATE statement I get the error below; when I remove col4 from the statement, it runs fine.
Error:
Error while compiling statement: FAILED: NumberFormatException For
input string: "HIVE_DEFAULT_PARTITION" (state=42000,code=40000)
Please help.
Example:
CREATE TEMPORARY TABLE abc STORED AS PARQUET AS SELECT
col1 AS col1,
col2 AS col2,
col3 AS col3,
col4 AS col4
FROM xyz;
This is a problem with the source table xyz because it contains the partition __HIVE_DEFAULT_PARTITION__.
Hive creates a partition with the value __HIVE_DEFAULT_PARTITION__ when, in dynamic partition mode, the inserted partition value is NULL.
The partition __HIVE_DEFAULT_PARTITION__ is not compatible with a numeric type, and this causes the error because it cannot be cast to a numeric type.
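For illustration, a dynamic-partition insert like this sketch (the source table name is an assumption) would route every row with a NULL col4 into that partition:
insert overwrite table xyz partition (col4)
select col1, col2, col3, cast(null as int) as col4 --NULL partition value becomes __HIVE_DEFAULT_PARTITION__
from some_source_table;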
To remove or query this partition, you need to change the column type to string first:
ALTER TABLE xyz PARTITION COLUMN (col4 string);
Of course, you may want to back up the table and check the data before removing the partition, and decide what to do with this data.
To remove partition:
ALTER TABLE xyz DROP PARTITION (col4 = '__HIVE_DEFAULT_PARTITION__');
After removing partition you can change the type of partition column back to numeric type.

Hive insert statement taking too long

I have 200 INSERT statements in a single file (test.hql) to insert into an ORC-format Hive table. Each insert takes significant time (40 secs), making the complete process take close to 2 hours. Is there a way to speed things up?
I could have created a tmp (text format) table and then done a simple insert overwrite, but that is not allowed; I cannot create new DDLs.
One option is to split test.hql in the shell and execute the pieces in parallel processes.
Is there any other way I can make these inserts fast in Hive itself?
Many insert statements are slower than a single one. Transform your 200 inserts into a single one using UNION ALL:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select value1 as col1, value2 as col2... coln from default.dual union all
select value1 as col1, value2 as col2... coln from default.dual union all
...
select value1 as col1, value2 as col2... coln from default.dual;
Better yet, you can create an input file and load it into the table at once.
Create a table with a particular row format (with delimiters):
Create table test (a string, b string) row format delimited fields terminated by ',' stored as textfile;
And then load data into it:
LOAD DATA INPATH '/path' INTO TABLE test;

Hive insert query like SQL

I am new to Hive and want to know if there is any way to insert data into a Hive table like we do in SQL. I want to insert my data into Hive like
INSERT INTO tablename VALUES (value1,value2..)
I have read that you can load the data from a file into a Hive table, or you can import data from one table into a Hive table, but is there any way to append the data as in SQL?
Some of the answers here are out of date as of Hive 0.14
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingvaluesintotablesfromSQL
It is now possible to insert using syntax such as:
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
You can use the table generating function stack to insert literal values into a table.
First you need a dummy table which contains only one line. You can generate it with the help of limit.
CREATE TABLE one AS
SELECT 1 AS one
FROM any_table_in_your_database
LIMIT 1;
Now you can create a new table with literal values like this:
CREATE TABLE my_table AS
SELECT stack(3
, "row1", 1
, "row2", 2
, "row3", 3
) AS (column1, column2)
FROM one
;
The first argument of stack is the number of rows you are generating.
You can also add values to an existing table:
INSERT INTO TABLE my_table
SELECT stack(2
, "row4", 1
, "row5", 2
) AS (column1, column2)
FROM one
;
A slightly better version of unique2's suggestion is below:
insert overwrite table target_table
select * from
(
select stack(
3, -- generating a new table with 3 records
'John', 80, -- record_1
'Bill', 61, -- record_2
'Martha', 101 -- record_3
)
) s;
This does not require the hack of using an already existing table.
You can use the approach below. With this, you don't need to create a temp table or a txt/csv file to select from and load, respectively.
INSERT INTO TABLE tablename SELECT value1, value2 FROM tempTable_with_atleast_one_records LIMIT 1;
where tempTable_with_atleast_one_records is any table with at least one record.
But the problem with this approach is when you have an INSERT statement which inserts multiple rows, like the one below.
INSERT INTO yourTable values (1 , 'value1') , (2 , 'value2') , (3 , 'value3') ;
Then you need a separate Hive INSERT statement for each row. See below.
INSERT INTO TABLE yourTable SELECT 1 , 'value1' FROM tempTable_with_atleast_one_records LIMIT 1;
INSERT INTO TABLE yourTable SELECT 2 , 'value2' FROM tempTable_with_atleast_one_records LIMIT 1;
INSERT INTO TABLE yourTable SELECT 3 , 'value3' FROM tempTable_with_atleast_one_records LIMIT 1;
No. This INSERT INTO tablename VALUES (x,y,z) syntax is currently not supported in Hive.
You could definitely append data into an existing table (though it is actually not an append at the HDFS level). It's just that whenever you do a LOAD or INSERT operation on an existing Hive table without the OVERWRITE clause, the new data will be added without replacing the old data. A new file will be created for this newly inserted data inside the directory corresponding to that table. For example:
I have a file named demo.txt which has 2 lines:
ABC
XYZ
Create a table and load this file into it
hive> create table demo(foo string);
hive> load data inpath '/demo.txt' into table demo;
Now, if I do a SELECT on this table, it'll give me:
hive> select * from demo;
OK
ABC
XYZ
Suppose I have one more file named demo2.txt which has:
PQR
And I do a LOAD again on this table without using overwrite,
hive> load data inpath '/demo2.txt' into table demo;
Now, if I do a SELECT, it'll give me:
hive> select * from demo;
OK
ABC
XYZ
PQR
HTH
Ways to insert data into a Hive table:
For demonstration, I am using the table names table1 and table2.
create table table2 as select * from table1 where 1=1;
or
create table table2 as select * from table1;
insert overwrite table table2 select * from table1;
--it will insert data from one to another. Note: It will refresh the target.
insert into table table2 select * from table1;
--it will insert data from one to another. Note: It will append into the target.
load data local inpath 'local_path' overwrite into table table1;
--it will load data from local into the target table and also refresh the target table.
load data inpath 'hdfs_path' overwrite into table table1;
--it will load data from the hdfs location and also refresh the target table.
or
create table table2(
col1 string,
col2 string,
col3 string)
row format delimited fields terminated by ','
location 'hdfs_location';
load data local inpath 'local_path' into table table1;
--it will load data from local and also append into the target table.
load data inpath 'hdfs_path' into table table1;
--it will load data from hdfs location and also append into the target table.
insert into table2 values('aa','bb','cc');
--Let's say table2 has 3 columns only.
Multiple insertion into hive table
Yes, you can insert, but not in exactly the same way as in SQL.
In SQL you can insert row-level data, but here you insert by fields (columns).
While doing this, you have to make sure the target table and the query have the same datatypes and the same number of columns.
e.g.:
CREATE TABLE test(stu_name STRING,stu_id INT,stu_marks INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE test SELECT lang_name, lang_id, lang_legacy_id FROM export_table;
To insert the entire data of table2 into table1, below is a query:
INSERT INTO TABLE table1 SELECT * FROM table2;
You can't do INSERT INTO to insert a single record; it's not supported by Hive. You may place all the new records that you want to insert in a file and load that file into a temp table in Hive. Then, using an insert overwrite..select command, insert those rows into a new partition of your main Hive table. The constraint here is that your main table will have to be pre-partitioned; if you don't use a partition, then your whole table will be replaced with these new records.
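A rough sketch of that flow (all table, column and partition names here are assumptions, not from the question):
load data inpath '/tmp/new_records.txt' into table tmp_staging;
insert overwrite table main_table partition (load_dt='2020-01-01')
select col1, col2, col3
from tmp_staging;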
Enter the following command to insert data into the testlog table with some condition:
INSERT INTO TABLE testlog SELECT * FROM table1 WHERE some condition;
I think in such scenarios you should be using HBase, which facilitates this kind of insertion, but it does not provide any SQL-like query language. You need to use the Java API of HBase, like the put method, to do this kind of insertion. Moreover, HBase is a column-oriented NoSQL database.
You can still insert into a complex type in Hive - it works.
(id is int, colleagues array)
insert into emp (id, colleagues) select 11, array('Alex','Jian') from (select '1') t;
You can add values to specific columns as well; just specify the column names to which you'd like to add the corresponding values:
Insert into Table (Col1, Col2, Col4,col5,Col7) Values ('Va11','Va2','Val4','Val5','Val7');
Make sure the columns you skip are not declared NOT NULL.
There are a few properties to set to make a Hive table support ACID properties and to insert values into tables as in SQL.
Conditions to create an ACID table in Hive:
The table should be stored as an ORC file. Only the ORC format can support ACID properties for now.
The table must be bucketed.
Properties to set to create an ACID table:
set hive.support.concurrency =true;
set hive.enforce.bucketing =true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads= 1;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
Set the property hive.in.test to true in hive-site.xml.
After setting all these properties, the table should be created with the table property 'transactional'='true'. The table should be bucketed and saved as ORC:
CREATE TABLE table_name (col1 int, col2 string, col3 int) CLUSTERED BY (col1) INTO 4
BUCKETS STORED AS orc tblproperties('transactional'='true');
Now it's possible to insert values into the table like a SQL query:
INSERT INTO TABLE table_name VALUES (1,'a',100),(2,'b',200),(3,'c',300);
Yes, we can use the INSERT query in Hive.
hive> create table test (id int, name string);
INSERT: INSERT...VALUES is available starting in version 0.14.
hive> insert into table test values (1,'mytest');
This is going to work for insert; we have to use the VALUES keyword.
Note: User cannot insert data into a complex datatype column (array, map, struct, union) using the INSERT INTO...VALUES clause.