Skipping header in Hive is removing the first line of my data

I have the following query in Hive:
CREATE EXTERNAL TABLE shop.id_store (
person_id INT,
shop_category STRING
)
row format delimited fields terminated by ',' stored as textfile
LOCATION "user/schema/table"
tblproperties('skip.header.line.count'='1', 'external.table.purge'='true');
LOAD DATA INPATH 'tmp/ids.csv' OVERWRITE INTO TABLE shop.id_store;
INSERT OVERWRITE TABLE shop.id_store
SELECT
*
FROM
shop.id_store
My CSV file ids.csv does contain headers; however, I have noticed that the above code actually removes the first row of my actual data. What is going on?

Related

How to create an Impala table with a complex data type, and how can I specify the delimiter for an array type column

I am trying to create an Impala table with an array column type, and I have to use a custom delimiter for the array column.
I tried the query below, but it's throwing an error.
Create table array_demo( arra_col ARRAY<string>) row format delimited fields terminated by ','
collection items terminated by '|' stored as parquet
You should omit the ROW FORMAT clause and the subclauses specifying the terminators, and include a STORED AS clause (Parquet is the only format Impala supports with complex data).
The data files used to load the table have to be in Parquet format too.
If you don't have the data file in Parquet format, you can create the table in Hive, then create a copy using CREATE TABLE … AS SELECT (a CTAS statement) with STORED AS PARQUET. You can then query the table in Impala.
As an example:
-- Create the table in Hive
CREATE TABLE array_demo( arra_col ARRAY<STRING>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
STORED AS TEXTFILE;
-- Copy the table in Parquet format (in a CTAS, STORED AS comes before AS SELECT)
CREATE TABLE array_demo_impala
STORED AS PARQUET
AS SELECT *
FROM array_demo;
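Once the Parquet copy exists, a hedged sketch of how the array column could be read from Impala, using Impala's item and pos pseudo-columns and the table/column names from the example above (an INVALIDATE METADATA may be needed first so Impala sees the Hive-created table):
-- Each array element becomes its own row, together with its position in the array
SELECT a.item, a.pos
FROM array_demo_impala t, t.arra_col a;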

Parquet file generation with Hive

I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files, through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet file; instead I find file names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The insert statement doesn't create files with a .parquet extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
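For a Parquet table, the part of that output worth checking is the storage information; a hedged, abridged illustration of what it should contain (exact layout varies by Hive version):
hive> DESCRIBE FORMATTED parquet_region;
...
InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
...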
Additional note: you can also create a new table from a source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It will create the new table in Parquet format and copy the structure as well as the data.
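If you want to see the files themselves without leaving the Hive CLI, a hedged sketch (the path assumes the default /user/hive/warehouse location and the default database; adjust for your setup):
hive> dfs -ls /user/hive/warehouse/parquet_region;
The 0000_0-style files listed there are the Parquet data files, even without a .parquet extension.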

Dynamic partition cannot be the parent of a static partition

I'm trying to aggregate data from one table (whose data is re-calculated monthly) into another table (holding the same data but for all time) in Hive. However, whenever I try to combine the data, I get the following error:
FAILED: SemanticException [Error 10094]: Line 3:74 Dynamic partition cannot be the parent of a static partition 'category'
The code I'm using to create the tables is below:
create table my_data_by_category (views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
location '${hiveconf:OUTPUT}/${hiveconf:DATE_DIR}/my_data_by_category';
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
The code I'm using to populate the tables is below:
insert overwrite table my_data_by_category partition(category)
select mdcc.col1, mdcc2.col2, pcc.category
from my_data_col1_counts_by_category mdcc
left outer join my_data_col2_counts_by_category mdcc2 where mdcc.category = mdcc2.category
group by mdcc.category, mdcc.col1, mdcc2.col2;
insert overwrite table my_data_lifetime_total_by_category partition(category)
select mdltc.col1 + mdc.col1 as col1, mdltc.col2 + mdc.col2, mdc.category
from my_data_lifetime_total_by_category mdltc
full outer join my_data_by_category mdc on mdltc.category = mdc.category
where mdltc.col1 is not null and mdltc.col2 is not null;
The frustrating part is that I have this data partitioned on another column, and repeating this same process with that partition works without a problem. I've tried Googling the "Dynamic partition cannot be the parent of a static partition" error message, but I can't find any guidance on what causes this or how it can be fixed. I'm pretty sure that there's an issue with the way that one or more of my tables is set up, but I can't see what. What's causing this error and what can I do to resolve it?
There is no PARTITIONED BY clause in this create statement. Since you are trying to insert into a non-partitioned table while using a partition clause in the insert statement, it is failing.
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
No, you don't need to add a partition clause.
You are doing group by mdcc.category in insert overwrite table my_data_by_category partition(category)..... but you are not using any UDAF.
Are you sure you can do this?
I think that if you change your second create statement to:
create table if not exists my_data_lifetime_total_by_category (views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
you should then be free of errors
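One more hedged reminder that is not from this thread but usually goes with dynamic-partition inserts: strict mode rejects statements where every partition key is dynamic, so the session typically needs these standard Hive settings before running the INSERT OVERWRITE ... PARTITION(category) statements above:
-- Allow dynamic partitions, and allow them without any static partition key
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;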

Inserting the system timestamp into a timestamp field in a Hive table

I'm using Hive version 0.8.0. I want to insert the system timestamp into a timestamp field while loading data into a Hive table.
In Detail:
I have a file with 2 fields like below:
id name
1 John
2 Merry
3 Sam
Now I want to load this file into a Hive table along with an extra column "created_date". So I have created the Hive table with the extra field, like below:
CREATE table mytable(id int,name string, created_date timestamp) row format delimited fields terminated by ',' stored as textfile;
To load the data file I used the query below:
LOAD DATA INPATH '/user/user/data/' INTO TABLE mytable;
If I run the above query, the "created_date" field will be NULL. But I want that field to be filled with the system timestamp instead of NULL while loading the data into the Hive table. Is this possible in Hive? How can I do it?
You can do this in two steps. First load data from the file into a temporary table without the timestamp. Then insert from the temp table into the actual table, and generate the timestamp with the unix_timestamp() UDF:
create table temptable(id int, name string)
row format delimited fields terminated by ','
stored as textfile;
create table mytable(id int, name string, created_date timestamp)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/user/user/data/' into table temptable;
insert into table mytable
select id, name, unix_timestamp()
from temptable;
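A hedged variant of the final insert, in case your Hive version does not accept the implicit BIGINT-to-TIMESTAMP conversion from unix_timestamp(): from_unixtime() renders the epoch value as a 'yyyy-MM-dd HH:mm:ss' string, which loads cleanly into the TIMESTAMP column.
-- Same two-step load, but with the timestamp rendered as a string first
insert into table mytable
select id, name, from_unixtime(unix_timestamp())
from temptable;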

Exporting a Hive table to an S3 bucket

I've created a Hive Table through an Elastic MapReduce interactive session and populated it from a CSV file like this:
CREATE TABLE csvimport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport;
I now want to store the Hive table in an S3 bucket so the table is preserved once I terminate the MapReduce instance.
Does anyone know how to do this?
Yes, you have to export and import your data at the start and end of your Hive session.
To do this you need to create a table that is mapped onto an S3 bucket and directory:
CREATE TABLE csvexport (
id BIGINT, time STRING, log STRING
)
row format delimited fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
LOCATION 's3n://bucket/directory/';
Insert data into the S3 table; when the insert is complete, the directory will have a CSV file:
INSERT OVERWRITE TABLE csvexport
select id, time, log
from csvimport;
Your table is now preserved, and when you create a new Hive instance you can reimport your data.
Your table can be stored in a few different formats depending on where you want to use it.
The query above needs to use the EXTERNAL keyword, i.e.:
CREATE EXTERNAL TABLE csvexport ( id BIGINT, time STRING, log STRING )
row format delimited fields terminated by ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 's3n://bucket/directory/';
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;
Another alternative is to use the query:
INSERT OVERWRITE DIRECTORY 's3n://bucket/directory/' select id, time, log from csvimport;
The table is then stored in the S3 directory with Hive's default delimiters.
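To read that exported directory back in a later session, a hedged sketch: declare an external table over the same S3 location (csvreimport is just an illustrative name; since INSERT OVERWRITE DIRECTORY wrote the files with Hive's default delimiters, no ROW FORMAT clause is needed here):
CREATE EXTERNAL TABLE csvreimport (id BIGINT, time STRING, log STRING)
STORED AS TEXTFILE
LOCATION 's3n://bucket/directory/';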
If you can access the AWS console and have the "Access Key Id" and "Secret Access Key" for your account, you can try this too:
CREATE TABLE csvexport(id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3n://"access id":"secret key"@bucket/folder/path';
Now insert the data as others stated above:
INSERT OVERWRITE TABLE csvexport select id, time, log from csvimport;