INSERT OVERWRITE LOCAL DIRECTORY - why works for some queries - hive

This query works fine and stores the result in a file:
INSERT OVERWRITE LOCAL DIRECTORY '/export/home/devtmpl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from cincdr where eventdatetime > '2015-02-15' and sliceEventCostVat is not null;
But this one creates an empty file:
INSERT OVERWRITE LOCAL DIRECTORY '/export/home/devtmpl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from cincdr where sliceEventCostVat is not null;
As you can see, the second query differs only in the 'where' clause.
If I run the queries without INSERT OVERWRITE ... both give non-empty results...
Do you have any idea why INSERT OVERWRITE gives a different result than a simple query?
Regards
Pawel

Related

Skipping header in hive is removing first line of my data

I have the following query in hive:
CREATE EXTERNAL TABLE shop.id_store (
person_id INT,
shop_category STRING
)
row format delimited fields terminated by ',' stored as textfile
LOCATION "user/schema/table"
tblproperties('skip.header.line.count'='1', 'external.table.purge'='true');
LOAD DATA INPATH 'tmp/ids.csv' OVERWRITE INTO TABLE shop.id_store;
INSERT OVERWRITE TABLE shop.id_store
SELECT
*
FROM
shop.id_store
My csv, ids.csv, does contain headers; however, I have noticed that the above code actually removes the first row of my actual data. What is going on?

Outputting hive table to HDFS as a single file

I'm trying to output the contents of a table I have in Hive to HDFS as a single csv file; however, when I run the code below, it splits the output into 5 separate files of ~500 MB each. Am I missing something in terms of outputting the results as one single csv file?
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
INSERT OVERWRITE DIRECTORY "/dl/folder_name"
row format delimited fields terminated by ','
select * from schema.mytable;
Add an ORDER BY clause to your select query; Hive will then be forced to run a single reducer, which will create only one file in the HDFS directory.
INSERT OVERWRITE DIRECTORY "/dl/folder_name"
row format delimited fields terminated by ','
select * from schema.mytable order by <col_name>;
Note:
If the number of rows in the output is too large, the single reducer could take a very long time to finish.

Hive - Create Table statement with 'select query' and 'fields terminated by' commands

I want to create a table in Hive using a select statement which takes a subset of data from another table. I used the following query to do so:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked at the HDFS location of this table, there were no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example, I am trying to do something like:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This is not working, though. I know the alternative is to create the table structure with field names and the "FIELDS TERMINATED BY '|'" clause and then load the data.
But is there any other way to combine the two into a single query that enables me to create a table with filtered data from another table and also with a field separator?
Put row format delimited ... in front of AS select.
Do it like this (change the query to match yours):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *,count(1) from t1 group by id ,name ;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
Here is the result:
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.
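Applied to the question's own tables, that rule would give something like the following (a sketch reusing the table names and filter from the question; STORED AS TEXTFILE is optional since it is the default):
create table sample_db.out_table
row format delimited fields terminated by '|'
stored as textfile
as
select * from sample_db.in_table
where country = 'Canada';
The files under the new table's HDFS location should then use '|' as the field separator.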

How to use 'COPY FROM VERTICA' on same database to copy data from one table to another

I want to copy data from one table to another in Vertica using the COPY FROM VERTICA command. I have a table with a large amount of data in it, and I want to select some of it (where field1 = 'some val', etc.) and copy it to another table.
The source table has columns of type long varchar, and I want to copy these values into another table with different column types such as varchar, date, boolean, etc. What I want is that only valid values are copied into the destination table; bad rows should be rejected.
I tried to move the data using an INSERT command like the one below, but the problem is that even a single row of invalid data terminates the whole process (and nothing is copied into the destination table).
INSERT INTO cb.destTable(field1, field2, field3)
Select cast(field1 as varchar), cast(field2 as varchar), cast(field3 as int)
FROM sourceTable Where Id = 2;
How can this be done?
COPY FROM VERTICA and EXPORT TO VERTICA are intended to copy data between clusters. Even if you looped the connection back to the same database, you would not be able to use rejects, as they are not supported by COPY FROM VERTICA. The mappings are strict, so if a value cannot be coerced the statement will fail.
You'll have to:
INSERT ... SELECT ... WHERE <conditions to filter out data that won't coerce>
INSERT ... SELECT <expressions that massage data that won't coerce>
Export the data to a file using vsql (you can turn off headers/footers, turn off padding, set the delimiter to something that doesn't exist in your data, etc.), then use a COPY to load it back in.
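For example, the first option above might look roughly like this (a hedged sketch reusing the question's table names; the REGEXP_LIKE filter is an illustrative assumption about which field3 values will coerce to INT):
INSERT INTO cb.destTable (field1, field2, field3)
SELECT field1::VARCHAR,
       field2::VARCHAR,
       field3::INT
FROM sourceTable
WHERE Id = 2
  AND REGEXP_LIKE(field3, '^\d+$');  -- keep only rows whose field3 looks like an integer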
Try exporting it into a csv file:
=> \o output.csv
=> SELECT cast(field1 as varchar), cast(field2 as varchar), cast(field3 as int) FROM sourceTable WHERE Id = 2;
=> \o
Then use the COPY command to load it back into the desired table:
COPY cb.destTable FROM '(csv_directory)' DELIMITER '(comma or your configured delimiter)' NO ESCAPE NULL '(NULL indicator)' SKIP 1;
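Put together, the round trip might look like this (a sketch; the /tmp paths are hypothetical, \t suppresses headers/footers, \a turns off aligned padding, and \f sets the output field separator):
=> \t
=> \a
=> \f ','
=> \o /tmp/source_extract.csv
=> SELECT field1, field2, field3 FROM sourceTable WHERE Id = 2;
=> \o
=> COPY cb.destTable FROM LOCAL '/tmp/source_extract.csv' DELIMITER ',' NULL '' REJECTED DATA '/tmp/source_extract.rejects';
REJECTED DATA is optional, but it captures the rows that fail to coerce instead of aborting the load.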
Are they both in the same Vertica database? If so, an alternative is:
DROP TABLE IF EXISTS cb.destTable;
CREATE TABLE cb.destTable AS
SELECT field1::VARCHAR, field2::VARCHAR, field3::VARCHAR
FROM sourceTable WHERE Id = 2;

Dynamic partition cannot be the parent of a static partition

I'm trying to aggregate data from one table (whose data is recalculated monthly) into another table (holding the same data but for all time) in Hive. However, whenever I try to combine the data, I get the following error:
FAILED: SemanticException [Error 10094]: Line 3:74 Dynamic partition cannot be the parent of a static partition 'category'
The code I'm using to create the tables is below:
create table my_data_by_category (views int, submissions int)
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
location '${hiveconf:OUTPUT}/${hiveconf:DATE_DIR}/my_data_by_category';
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
The code I'm using to populate the tables is below:
insert overwrite table my_data_by_category partition(category)
select mdcc.col1, mdcc2.col2, pcc.category
from my_data_col1_counts_by_category mdcc
left outer join my_data_col2_counts_by_category mdcc2 where mdcc.category = mdcc2.category
group by mdcc.category, mdcc.col1, mdcc2.col2;
insert overwrite table my_data_lifetime_total_by_category partition(category)
select mdltc.col1 + mdc.col1 as col1, mdltc.col2 + mdc.col2, mdc.category
from my_data_lifetime_total_by_category mdltc
full outer join my_data_by_category mdc on mdltc.category = mdc.category
where mdltc.col1 is not null and mdltc.col2 is not null;
The frustrating part is that I have this data partitioned on another column, and repeating this same process with that partition works without a problem. I've tried Googling the "Dynamic partition cannot be the parent of a static partition" error message, but I can't find any guidance on what causes this or how it can be fixed. I'm pretty sure that there's an issue with the way one or more of my tables is set up, but I can't see what. What's causing this error, and what can I do to resolve it?
There is no partitioned by clause in this create statement. Since you are trying to insert into a non-partitioned table while using a partition in the insert statement, it fails:
create table if not exists my_data_lifetime_total_by_category
like my_data_by_category
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
No, you don't need to add a partition clause.
You are doing group by mdcc.category in insert overwrite table my_data_by_category partition(category) ... but you are not using any UDAF.
Are you sure you can do this?
I think that if you change your second create statement to:
create table if not exists my_data_lifetime_total_by_category
partitioned by (category string)
row format delimited
fields terminated by ','
escaped by '\\'
stored as textfile
location '${hiveconf:OUTPUT}/lifetime-totals/my_data_by_category';
you should then be free of errors.