Outputting hive table to HDFS as a single file - hive

I'm trying to output the contents of a table I have in Hive to HDFS as a single CSV file; however, when I run the code below it splits the output into 5 separate files of ~500 MB each. Am I missing something in terms of outputting the results as one single CSV file?
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
INSERT OVERWRITE DIRECTORY "/dl/folder_name"
row format delimited fields terminated by ','
select * from schema.mytable;

Add an ORDER BY clause to your SELECT query; Hive will then be forced to run a single reducer, which will create only one file in the HDFS directory.
INSERT OVERWRITE DIRECTORY "/dl/folder_name"
row format delimited fields terminated by ','
select * from schema.mytable order by <col_name>;
Note:
If the number of rows in the output is too large, the single reducer could take a very long time to finish.
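If a global sort just to produce one file is too expensive, another option is to make Hive's post-job merge step aggressive enough to combine the ~500 MB outputs. This is only a sketch: hive.merge.smallfiles.avgsize and hive.merge.size.per.task are standard Hive settings, but whether the merge step kicks in for INSERT OVERWRITE DIRECTORY depends on your Hive version, so verify on your cluster.
set hive.execution.engine=tez;
set hive.merge.tezfiles=true;
-- treat output files under ~1 GB as small enough to merge
set hive.merge.smallfiles.avgsize=1073741824;
-- target size of each merged file, here roughly 4 GB
set hive.merge.size.per.task=4294967296;
INSERT OVERWRITE DIRECTORY "/dl/folder_name"
row format delimited fields terminated by ','
select * from schema.mytable;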

Related

Skipping header in hive is removing first line of my data

I have the following query in hive:
CREATE EXTERNAL TABLE shop.id_store (
person_id INT,
shop_category STRING
)
row format delimited fields terminated by ',' stored as textfile
LOCATION "user/schema/table"
tblproperties('skip.header.line.count'='1', 'external.table.purge'='true');
LOAD DATA INPATH 'tmp/ids.csv' OVERWRITE INTO TABLE shop.id_store;
INSERT OVERWRITE TABLE shop.id_store
SELECT
*
FROM
shop.id_store
My CSV, ids.csv, does contain headers; however, I have noticed that the above code actually removes the first row of my actual data. What is going on?
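What is probably happening (a sketch of the mechanism, not a verified answer for your exact Hive version): skip.header.line.count='1' skips the first line of every file whenever the table is read. The LOAD DATA step leaves the header in the file, so the skip is correct at first; but the INSERT OVERWRITE rewrites the table's files without a header while the property stays on the table, so the first real data row gets skipped afterwards. One way out is to stop skipping after the rewrite; loading into a separate staging table that skips the header and inserting into a header-free final table avoids the problem entirely.
-- after the files have been rewritten without a header, stop skipping the first line
ALTER TABLE shop.id_store SET TBLPROPERTIES('skip.header.line.count'='0');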

INSERT OVERWRITE LOCAL DIRECTORY - why does it work for some queries but not others?

This query works fine - stores result in a file:
INSERT OVERWRITE LOCAL DIRECTORY '/export/home/devtmpl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from cincdr where eventdatetime > '2015-02-15' and sliceEventCostVat is not null;
But this one creates an empty file:
INSERT OVERWRITE LOCAL DIRECTORY '/export/home/devtmpl'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
select * from cincdr where sliceEventCostVat is not null;
As you can see, the second query differs only in the WHERE clause.
If I run the queries without INSERT OVERWRITE ..., both give non-empty results.
Do you have any idea why INSERT OVERWRITE gives a different result than a plain query?
Regards
Pawel

automatically partition Hive tables based on S3 directory names

I have data stored in S3 like:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:
CREATE EXTERNAL TABLE search_input(
col 1 STRING,
col 2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However, Hive doesn't recognize any data: any queries I run return 0 results. If I instead just grab one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col 1 STRING,
col 2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query data just fine.
Why doesn't Hive recognize the nested directories with the "date=date_str" partition?
Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?
In order to get this to work I had to do 2 things:
Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
For some reason it would still not recognize my partitions so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
You can use:
SHOW PARTITIONS table;
to check and see that they've been recovered.
I faced the same issue and realized that Hive does not yet have the partition metadata, so we need to add it using an ALTER TABLE ... ADD PARTITION query. This becomes tedious if you have a few hundred partitions and have to create the same query with different values.
ALTER TABLE <table name> ADD PARTITION(<partitioned column name>=<partition value>);
Once you run the above query for all available partitions, you should see the results in your Hive queries.
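Outside of EMR (where RECOVER PARTITIONS is available), stock Hive has an equivalent that scans the table location and registers any partition directories it finds. A minimal sketch against the search_input table above:
-- discover date=YYYYMMDD directories under the table location and add them to the metastore
MSCK REPAIR TABLE search_input;
-- or register a single partition explicitly, pointing at its S3 path
-- (backticks because date is a reserved word in newer Hive versions)
ALTER TABLE search_input ADD IF NOT EXISTS PARTITION (`date`='20140701')
LOCATION 's3n://bucket/date=20140701';
-- confirm the partitions are now visible
SHOW PARTITIONS search_input;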

import csv file into table using SQL Loader [but large no. of Columns]

I want to import data in the form of a CSV file into a table (using Oracle SQL Developer). I have a hundred such files and each has about 50 columns.
From the wiki of SQL*Loader (http://www.orafaq.com/wiki/SQL*Loader_FAQ)
load data
infile 'c:\data\mydata.csv'
into table emp
fields terminated by "," optionally enclosed by '"'
( empno, empname, sal, deptno ) -- these are the column headers
What I don't want to do is list all the column headers. I just want all the entries in the CSV file to be assigned to the table's columns in the order in which they appear.
Moreover, after all this, I want to automate it for all 100 files.
You should write down the columns (and optionally their types) so that the values in your CSV file are assigned to each column. You need to do this because the order of the columns in the table in your Oracle database is not known to the script.
After you write the columns in the order they appear in your CSV files, you can automate this script for all of your files by typing:
infile *.csv
You can try the Oracle CSV loader. It automatically creates the table and the control file based on the CSV content and loads the CSV into an Oracle table using SQL*Loader.
An alternative to sqlldr that does what you are looking for is the LOAD command in SQLcl. It simply matches the header row in the CSV to the table and loads it. However, it is not as performant and does not offer as much control as sqlldr.
LOAD [schema.]table_name[#db_link] file_name
Here's the full help for it.
sql klrice/klrice
...
KLRICE#xe>help load
LOAD
-----
Loads a comma separated value (csv) file into a table.
The first row of the file must be a header row. The columns in the header row must match the columns defined on the table.
The columns must be delimited by a comma and may optionally be enclosed in double quotes.
Lines can be terminated with standard line terminators for windows, unix or mac.
File must be encoded UTF8.
The load is processed with 50 rows per batch.
If AUTOCOMMIT is set in SQLCL, a commit is done every 10 batches.
The load is terminated if more than 50 errors are found.
LOAD [schema.]table_name[#db_link] file_name
KLRICE#xe>
Example from a git repo I have at https://github.com/krisrice/maxmind-oracledb
SQL> drop table geo_lite_asn;
Table GEO_LITE_ASN dropped.
SQL> create table geo_lite_asn (
2 "network" varchar2(32),
3 "autonomous_system_number" number,
4 "autonomous_system_organization" varchar2(200))
5 /
Table GEO_LITE_ASN created.
SQL> load geo_lite_asn GeoLite2-ASN-CSV_20180130/GeoLite2-ASN-Blocks-IPv4.csv
--Number of rows processed: 397,040
--Number of rows in error: 0
0 - SUCCESS: Load processed without errors
SQL>

Inserting Data into Hive Table

I am new to Hive. I have successfully set up a single-node Hadoop cluster for development purposes and, on top of it, installed Hive and Pig.
I created a dummy table in hive:
create table foo (id int, name string);
Now, I want to insert data into this table. Can I add data as in SQL, one record at a time? Kindly help me with a command analogous to:
insert into foo (id, name) VALUES (12, "xyz");
Also, I have a csv file which contains data in the format:
1,name1
2,name2
..
..
..
1000,name1000
How can I load this data into the dummy table?
I think the best way is:
a) Copy data into HDFS (if it is not already there)
b) Create an external table over your CSV like this:
CREATE EXTERNAL TABLE TableName (id int, name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'place in HDFS';
c) You can start using TableName right away by issuing queries against it.
d) If you want to insert the data into another Hive table:
insert overwrite table finalTable select * from TableName;
There's no direct way to insert one record at a time from the terminal; however, here's an easy, straightforward workaround that I usually use when I want to test something:
Assume that t is a table with at least one record. It doesn't matter what its columns' types are or how many there are.
INSERT INTO TABLE foo
SELECT '12', 'xyz'
FROM t
LIMIT 1;
Hive apparently supports INSERT...VALUES starting in Hive 0.14.
Please see the section 'Inserting into tables from SQL' at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
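For completeness, a minimal sketch of that syntax against the asker's foo table (requires Hive 0.14 or later):
-- single row
INSERT INTO TABLE foo VALUES (12, 'xyz');
-- several rows in one statement
INSERT INTO TABLE foo VALUES (1, 'name1'), (2, 'name2');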
Whatever data you have in a text file or log file can be put on a path in HDFS; then write a query as follows in Hive:
hive> load data inpath <<specify input path>> into table <<table name>>;
EXAMPLE:
hive> create table foo (id int, name string)
row format delimited
fields terminated by '\t'   -- or '|' or ',' depending on your file
stored as textfile;
table created..
DATA INSERTION:
hive>load data inpath '/home/hive/foodata.log' into table foo;
To insert an ad-hoc value like (12, "xyz"), do this:
insert into table foo select * from (select 12, "xyz") a;
This is supported from Hive 0.14.
INSERT INTO TABLE pd_temp(dept,make,cost,id,asmb_city,asmb_ct,retail) VALUES('production','thailand',10,99202,'northcarolina','usa',20)
These are limitations of Hive (at least in older versions):
1. You cannot update data after it is inserted.
2. There is no "insert into table values ..." statement.
3. You can only load data using bulk loads.
4. There is no "delete from" command.
5. You can only do bulk deletes.
But if you still want to insert a record from the Hive console, you can select constants from an existing table, as in the workaround shown above.
You may try this: I have developed a tool to generate Hive scripts from CSV files. The following are a few examples of how the files are generated.
Tool -- https://sourceforge.net/projects/csvtohive/?source=directory
Select a CSV file using Browse and set the Hadoop root directory, e.g. /user/bigdataproject/
The tool generates a Hadoop script for all the CSV files; the following is a sample of the
generated Hadoop script that loads the CSVs into Hadoop:
#!/bin/bash -v
hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
hive -f ./AllstarFull.hive
hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
hive -f ./Appearances.hive
hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
hive -f ./AwardsManagers.hive
Sample of generated Hive scripts
CREATE DATABASE IF NOT EXISTS lahman;
USE lahman;
CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;
SELECT * FROM AllstarFull;
Thanks
Vijay
You can use the following lines of code to insert values into an already existing table. Here the table is db_name.table_name, which has two columns, and I am inserting 'ALL', 'Done' as a row into the table.
insert into table db_name.table_name
select 'ALL','Done';
Hope this was helpful.
The Hadoop file system does not support appending data to existing files. However, you can load your CSV file into HDFS and tell Hive to treat it as an external table.
Use this -
create table dummy_table_name as select * from source_table_name;
This will create a new table containing the existing data from source_table_name.
LOAD DATA [LOCAL] INPATH '' [OVERWRITE] INTO TABLE <table_name>;
Use this command; it will load the data in one go. Just specify the file path.
If the file is in the local file system, use LOCAL; if the file is in HDFS, there is no need to use LOCAL.
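A minimal sketch for the CSV in the question, with hypothetical paths (and assuming foo was created with fields terminated by ','):
-- file on the local filesystem of the machine running the Hive client
LOAD DATA LOCAL INPATH '/home/user/ids.csv' INTO TABLE foo;
-- file already in HDFS; note that LOAD DATA moves it into the table's location
LOAD DATA INPATH '/user/hive/staging/ids.csv' OVERWRITE INTO TABLE foo;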