Delete records from Hive table using filename - hive

I have a use case where I build a Hive table from a bunch of CSV files. While writing CSV data into the Hive table, I assign part of the INPUT__FILE__NAME to one of the columns. When I want to update the records for the same filename, I need to delete the records of that CSV file before writing it again.
I used the query below, but it failed:
CREATE EXTERNAL TABLE T_TEMP_CSV(
F_FRAME_RANK BIGINT,
F_FRAME_RATE BIGINT,
F_SOURCE STRING,
F_PARAMETER STRING,
F_RECORDEDVALUE STRING,
F_VALIDITY INT,
F_VALIDITY_INTERPRETATION STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
location '/user/baamarna5617/HUMS/csv'
TBLPROPERTIES ("skip.header.line.count"="2");
DELETE FROM T_RECORD
WHERE T_RECORD.F_SESSION = split(reverse(split(reverse(T_TEMP_CSV.INPUT__FILE__NAME),"/")[0]), "[.]")[0]
from T_TEMP_CSV;
The T_RECORD table has a column called F_SESSION, which was assigned part of the INPUT__FILE__NAME using the split expression shown above. I want to use the same expression while removing those records. Can someone please point out where I am going wrong in this query?
I could successfully delete the records using the below syntax
DELETE FROM T_RECORD
WHERE F_SESSION = 68;
I need to get that 68 from the INPUT__FILE__NAME.
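One approach that may work (a sketch, not a tested fix): Hive's DELETE cannot reference a second table the way the query above tries to, but if T_RECORD is a transactional (ACID) table and your Hive version supports subqueries in the WHERE clause of a DELETE, the filename expression can be pushed into an IN subquery:
DELETE FROM T_RECORD
WHERE F_SESSION IN (
    SELECT split(reverse(split(reverse(INPUT__FILE__NAME), "/")[0]), "[.]")[0]
    FROM T_TEMP_CSV
);
Since split returns strings, a CAST around the subquery expression may be needed if F_SESSION is a numeric column.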

Related

Hive Table name starts with underscore select statement issue

In the process of executing my HQL script, I have to store data in a temporary table before inserting it into the main table.
In that scenario, I tried to create a temporary table whose name starts with an underscore.
Note: the table name with a leading underscore does not work unless it is quoted with backticks.
Working Create Statement:
create table dbo.`_temp_table` (
emp_id int,
emp_name string)
stored as ORC
tblproperties ('orc.compress' = 'ZLIB');
Working Insert Statement:
insert into table dbo.`_temp_table` values (123, 'ABC');
However, the select statement on the temp table is not working: it returns no rows even though we inserted a record with the insert statement above.
select * from dbo.`_temp_table`;
Everything else works fine, but the select statement to view the rows does not.
I'm still not sure whether we can create a temp table in the above way.
Hadoop uses filenames starting with an underscore for hidden files and ignores them when reading. An example is the "_$folder$" file, which is created when you execute mkdir to create an empty folder in an S3 bucket.
See HIVE-6431 - Hive table name start with underscore
By default, FileInputFormat(which is the super class of various
formats) in hadoop ignores file name starts with "_" or ".", and hard
to walk around this in hive codebase.
You can try creating an external table whose location does not start with an underscore while keeping the underscore in the table name. Also consider using TEMPORARY tables.
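For illustration, a sketch of that workaround (the location path below is hypothetical):
create external table dbo.`_temp_table` (
emp_id int,
emp_name string)
stored as ORC
location '/user/hive/warehouse/dbo.db/temp_table_data';  -- path itself has no leading underscore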

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive, but I cannot see any data.
Since it's an external table, data insertion should be done automatically, right?
Your file should be in this sequence:
int,string
Here your file contents are in the sequence below:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a CSV file that matches the structure of the table, the data will be loaded automatically and you can already use select queries to see it.
Solved!
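As an aside (not part of the original answer): if the partition subfolders under the table location already follow the date_string=... naming convention, Hive can discover them all at once instead of adding each partition by hand:
MSCK REPAIR TABLE google_analytics;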

create table and load data in same command

To create a table and load data into it from a .tbl file, do we need to first create the schema and then load the data?
Is it not possible to do both operations in just one command, like the command below?
create external table customer (
C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DOUBLE,
C_MKTSEGMENT STRING,
C_COMMENT STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/user/hadoopadmin/tables/customer.tb.';
I tried the above command first; the schema was created, but when I ran "select count(*) from customer" I got 0 rows.
However, creating the schema first and then loading the data with the "LOAD DATA INPATH" command works: select count(*) then returns the number of rows in the table.
So is it necessary to execute two commands? Is it not possible with my first example using the "LOCATION" option, or does that first statement have some issue that kept it from working?
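No answer is shown here, but one likely cause (an assumption on my part) is that LOCATION must point to a directory containing the data files, not to a single file. A sketch of that fix, using a hypothetical directory that holds the .tbl file:
create external table customer (
C_CUSTKEY INT,
C_NAME STRING,
C_ADDRESS STRING,
C_NATIONKEY INT,
C_PHONE STRING,
C_ACCTBAL DOUBLE,
C_MKTSEGMENT STRING,
C_COMMENT STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE
LOCATION '/user/hadoopadmin/tables/customer/';  -- directory containing customer.tbl, not the file itself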

Dynamic partition cannot be the parent of a static partition

I'm trying to aggregate data from one table (whose data is re-calculated monthly) into another table (holding the same data but for all time) in Hive. However, whenever I try to combine the data, I get the following error:
FAILED: SemanticException [Error 10094]: Line 3:74 Dynamic partition cannot be the parent of a static partition 'category'
The code I'm using to create the tables is below:
create table my_data_by_category (views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
location '${hiveconf:OUTPUT}/${hiveconf:DATE_DIR}/my_data_by_category';
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
The code I'm using to populate the tables is below:
insert overwrite table my_data_by_category partition(category)
select mdcc.col1, mdcc2.col2, pcc.category
from my_data_col1_counts_by_category mdcc
left outer join my_data_col2_counts_by_category mdcc2 where mdcc.category = mdcc2.category
group by mdcc.category, mdcc.col1, mdcc2.col2;
insert overwrite table my_data_lifetime_total_by_category partition(category)
select mdltc.col1 + mdc.col1 as col1, mdltc.col2 + mdc.col2, mdc.category
from my_data_lifetime_total_by_category mdltc
full outer join my_data_by_category mdc on mdltc.category = mdc.category
where mdltc.col1 is not null and mdltc.col2 is not null;
The frustrating part is that I have this data partitioned on another column, and repeating this same process with that partition works without a problem. I've tried Googling the "Dynamic partition cannot be the parent of a static partition" error message, but I can't find any guidance on what causes it or how to fix it. I'm pretty sure there's an issue with the way one or more of my tables is set up, but I can't see what. What's causing this error and what can I do to resolve it?
There is no partitioned by clause in this create statement. Because you are trying to insert into a non-partitioned table while using a partition clause in the insert statement, it fails:
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
No. You don't need to add a partition clause.
You are doing group by mdcc.category in insert overwrite table my_data_by_category partition(category)..., but you are not using any UDAF.
Are you sure you can do this?
I think that if you change your second create statement to:
create table if not exists my_data_lifetime_total_by_category
(views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
you should then be free of errors
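As a side note (an assumption, since the question doesn't show the session settings): dynamic-partition inserts like these also require dynamic partitioning to be enabled, typically via:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;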

Bucket is not creating on hadoop-hive

I'm trying to create a bucketed table in Hive using the following commands:
hive> create table emp( id int, name string, country string)
clustered by( country)
row format delimited
fields terminated by ','
stored as textfile ;
The command executes successfully: when I load data into this table, the load succeeds and all the data is shown by select * from emp.
However, on HDFS only one file is created under the table, containing all the data. That is, there is no folder for each specific country's records.
First of all, in the DDL statement you have to explicitly mention how many buckets you want.
create table emp( id int, name string, country string)
clustered by( country)
INTO 2 BUCKETS
row format delimited
fields terminated by ','
stored as textfile ;
In the above statement I have mentioned 2 buckets; similarly, you can mention any number you want.
You are still not done!
After that, while loading data into the table, you also have to give Hive the hint below:
set hive.enforce.bucketing = true;
That should do it.
After this you should see that the number of files created under the table directory is the same as the number of buckets mentioned in the DDL statement.
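One caveat worth adding (not in the original answer): LOAD DATA only moves files into place and does not redistribute rows into buckets; to get properly bucketed files, populate the table with an INSERT ... SELECT, for example from a staging table (emp_staging below is hypothetical):
set hive.enforce.bucketing = true;
insert overwrite table emp
select id, name, country from emp_staging;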
Bucketing doesn't create HDFS folders; if you want a separate folder to be created for each country, you should PARTITION instead.
Please go through hive partitioning and bucketing in detail.
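For illustration, a sketch of the partitioned alternative (the table name emp_partitioned is hypothetical):
create table emp_partitioned (id int, name string)
partitioned by (country string)
row format delimited
fields terminated by ','
stored as textfile;
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table emp_partitioned partition(country)
select id, name, country from emp;
This creates one HDFS subfolder per country under the table directory, e.g. .../emp_partitioned/country=US/.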