Populating Hive table from file yields far too many rows

I am creating a Hive table from a file with 8k rows, but the table created has 78k rows. The command line is the following:
bin/hive_executable < my_script.hql
my_script.hql:
create table my_table(k1 t1, k2 t2....);
load data local inpath 'path/to/table_file.txt' INTO TABLE my_table;
table_file.txt:
v1 v2 v3...
I've tried both space and tab delimited fields, and explicitly declaring the structure in the create table statement. When I use example code to create a table from $HIVE_HOME/example/file/kv1.txt, the table and file both have 500 lines / rows.
Any ideas?
Thanks

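Strip all newline characters from the text fields. Hive's default text format treats every newline as the end of a row, so a field with embedded line breaks is split across several rows, which inflates the row count.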
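One way to do that stripping before the load is a small preprocessing script. This is only a sketch under assumptions not stated in the question: the input is tab-delimited, fields that contain line breaks are wrapped in double quotes, and no field contains the delimiter itself. The function name strip_newlines and the sample data are made up for illustration.

```python
import csv
import io

def strip_newlines(src, dst, delimiter="\t"):
    """Copy delimited rows from src to dst, replacing line breaks inside fields with a space."""
    reader = csv.reader(src, delimiter=delimiter)
    for row in reader:
        # splitlines() cuts the field at every line break; join the pieces with a space.
        cleaned = [" ".join(field.splitlines()) for field in row]
        dst.write(delimiter.join(cleaned) + "\n")

# Hypothetical sample: the second field of the first record contains a newline.
sample = io.StringIO('1\t"first line\nsecond line"\n2\tplain\n', newline="")
out = io.StringIO()
strip_newlines(sample, out)
result = out.getvalue()
print(result)
```

After this step the file has exactly one physical line per record, so the row count in Hive should match the record count of the source.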

Related

Hive - Create Table statement with 'select query' and 'fields terminated by' commands

I want to create a table in Hive using a select statement that takes a subset of the data from another table. I used the following query to do so:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked into the HDFS location of this table, there are no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example, I am trying to do something like:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This is not working, though. I know the alternative is to create the table structure with field names and the "FIELDS TERMINATED BY '|'" clause first, and then load the data.
But is there a way to combine the two into a single query that creates a table with filtered data from another table and also sets a field separator?
Put the ROW FORMAT DELIMITED clause in front of AS SELECT, like this (replace the query with your own):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *,count(1) from t1 group by id ,name ;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
Here is the result:
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.

Load into Hive table imported entire data into first column only

I am trying to copy Hive data from one server to another. To do this, I export the Hive data to a CSV file on server1 and try to import that CSV file into Hive on server2.
My table contains the following data types:
bigint
string
array
Here are my commands:
export:
hive -e 'select * from sample' > /home/hadoop/sample.csv
import:
load data local inpath '/home/hadoop/sample.csv' into table sample;
After importing into the Hive table, the entire row is inserted into the first column only.
How can I overcome this, or else is there a better way to copy data from one server to another server?
When creating the table, add the line below at the end of the create statement:
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
Like this:
hive>CREATE TABLE sample(id int,
name String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Then load the data:
hive>load data local inpath '/home/hadoop/sample.csv' into table sample;
For your example:
sample.csv
123,Raju,Hello|How Are You
154,Nishant,Hi|How Are You
In the sample data above, the first column is a bigint, the second is a string, and the third is an array whose items are separated by |.
hive> CREATE TABLE sample(id BIGINT,
name STRING,
messages ARRAY<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|';
hive> LOAD DATA LOCAL INPATH '/home/hadoop/sample.csv' INTO TABLE sample;
Most important point: define a delimiter for the collection items, and don't write the array the way you would in a normal programming language.
Also, make the field delimiter different from the collection items delimiter to avoid confusion and unexpected results.
You really should not be using CSV as your data transfer format
DistCp copies data between Hadoop clusters as-is
Hive supports Export, Import
Circus Train allows Hive table replication
Why not use a hadoop command to transfer the data from one cluster to the other, such as:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
Then load the data into your new table:
load data inpath '/bar/foo/*' into table wyp;
Your problem may be caused by the delimiter.
The default field delimiter is '\001' if you did not set one when creating the Hive table.
If you export with hive -e 'select * from sample' > /home/hadoop/sample.csv, the file is not delimited by '\001', so on load every column ends up in a single column.
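The delimiter mismatch can be illustrated with a short script. This is a rough sketch, assuming the exported file is tab-separated (which is what the Hive CLI normally prints) and contains only scalar columns; array columns would need separate handling. The function name tsv_to_ctrl_a and the two sample lines are hypothetical.

```python
def tsv_to_ctrl_a(line):
    # Replace the tab separators of one exported line with Hive's
    # default field delimiter, Ctrl-A (written \x01 in Python).
    return line.rstrip("\n").replace("\t", "\x01")

# Hypothetical exported lines with two scalar columns each.
exported = ["123\tRaju\n", "154\tNishant\n"]
converted = [tsv_to_ctrl_a(line) for line in exported]

for line in converted:
    # Splitting on Ctrl-A now yields one value per column.
    print(line.split("\x01"))
```

The other direction works too: declare the target table with a delimiter that matches the export instead of rewriting the file.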

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
The columns in your file should be in this order:
int,string
but the contents of your file are in this order:
string, int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, using keywords (such as date) as column names is not recommended.
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a csv file corresponding to the structure of the table, data will be automatically loaded and you can already use select queries to see the data.
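The directory that the partition is expected to live in can be sketched as follows. This only illustrates the key=value naming described above; the helper partition_dir is hypothetical and not part of Hive.

```python
import posixpath

def partition_dir(table_location, **partition_values):
    # Build the <key>=<value> subfolder path under the table location,
    # one path component per partition column.
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return posixpath.join(table_location, *parts)

path = partition_dir("/flumania/google_analytics/", date_string="2016-09-06")
print(path)
```

The CSV files for that day would then go inside the printed directory rather than directly under the table location.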
Solved!

Strip first whitespace importing csv data

I would like to import data into my PostgreSQL table.
I have a .csv file that is formatted like this:
1; John Blake
2; Roberto Young
3;Mark Palmer
Is there a way to strip the leading whitespace where it exists?
I used the following code:
\copy users from 'users.csv' using delimiters E';'
But it keeps the whitespace.
COPY to a temporary staging table and INSERT into the target table from there, trimming the text column.
CREATE TEMP TABLE tmp_x AS
SELECT * FROM users LIMIT 0; -- empty temp table with structure of target
\copy tmp_x FROM '/absolute/path/to/file' delimiters E';'; -- psql command (!)
INSERT INTO users
(usr_id, usr, ...) -- list columns
SELECT usr_id, ltrim(usr), ...
FROM tmp_x;
DROP TABLE tmp_x; -- optional; is destroyed at end of session automatically
ltrim() only trims space from the left of the string.
This sequence of actions performs better than updating rows in the table after COPY, which takes longer and produces dead rows. Also, only newly imported rows are manipulated this way.
Related answer:
Delete rows of a table specified in a text file in Postgres
You won't be able to use COPY alone to do that.
You can use an UPDATE coupled with trim:
UPDATE table SET column = trim(from column)
Or use a script to clean the data before bulk inserting the data to the DB.
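A cleaning script of that kind could look like the sketch below. It assumes the semicolon never appears inside a field, and the function name clean_line is made up; the three sample lines are the ones from the question.

```python
def clean_line(line, delimiter=";"):
    # Split the line into fields, drop leading whitespace from each field,
    # and join the fields back with the same delimiter.
    fields = line.rstrip("\n").split(delimiter)
    return delimiter.join(field.lstrip() for field in fields)

raw = ["1; John Blake\n", "2; Roberto Young\n", "3;Mark Palmer\n"]
cleaned = [clean_line(line) for line in raw]

for line in cleaned:
    print(line)
```

The cleaned lines can be written to a new file and passed to \copy unchanged.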

import csv file into table using SQL Loader [but large no. of Columns]

I want to import data from a CSV file into a table (using Oracle SQL Developer). I have a hundred such files, and each has about 50 columns.
From the wiki of SQL*Loader (http://www.orafaq.com/wiki/SQL*Loader_FAQ)
load data
infile 'c:\data\mydata.csv'
into table emp
fields terminated by "," optionally enclosed by '"'
( empno, empname, sal, deptno ) -- these are the column headers
What I don't want to do is list all the column headers. I just want the entries in the CSV file to be assigned to the columns of the table in the order in which they appear.
Moreover, after all this I want to automate it for all 100 files.
You should write down the columns (and their type optionally) so as to assign the values of your csv file to each column. You should do this because the order of the columns in the table in your Oracle Database is not known in the script.
After you write the columns in the order they appear in your csv files, you can automate this script for all of your files by typing:
infile *.csv
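If typing 50 column names per file is the main obstacle, the control file can be generated instead. The sketch below assumes each CSV starts with a header row whose names match the table's columns, which the question does not state; the function control_file_text is hypothetical. The skip=1 option tells SQL*Loader to skip that header line.

```python
import csv
import io

def control_file_text(csv_name, table, header_line):
    # Parse the header row with the csv module so quoted names are handled.
    columns = next(csv.reader(io.StringIO(header_line)))
    column_list = ", ".join(name.strip() for name in columns)
    return (
        "options (skip=1)\n"
        "load data\n"
        f"infile '{csv_name}'\n"
        f"into table {table}\n"
        "fields terminated by \",\" optionally enclosed by '\"'\n"
        f"( {column_list} )\n"
    )

# Hypothetical header line matching the emp example.
text = control_file_text("mydata.csv", "emp", "empno,empname,sal,deptno\n")
print(text)
```

Running this once per CSV file, reading the first line of each as header_line, would give one control file per table without listing the columns by hand.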
You can try oracle csv loader. It automatically creates the table and the controlfile based on the csv content and loads the csv into an oracle table using sql loader.
An alternative to sqlldr that does what you are looking for is the LOAD command in SQLcl. It simply matches the header row in the CSV to the table's columns and loads the file. However, it is not as fast as sqlldr and gives you less control.
LOAD [schema.]table_name[@db_link] file_name
Here's the full help for it.
sql klrice/klrice
...
KLRICE@xe>help load
LOAD
-----
Loads a comma separated value (csv) file into a table.
The first row of the file must be a header row. The columns in the header row must match the columns defined on the table.
The columns must be delimited by a comma and may optionally be enclosed in double quotes.
Lines can be terminated with standard line terminators for windows, unix or mac.
File must be encoded UTF8.
The load is processed with 50 rows per batch.
If AUTOCOMMIT is set in SQLCL, a commit is done every 10 batches.
The load is terminated if more than 50 errors are found.
LOAD [schema.]table_name[@db_link] file_name
KLRICE@xe>
Example from a git repo I have at https://github.com/krisrice/maxmind-oracledb
SQL> drop table geo_lite_asn;
Table GEO_LITE_ASN dropped.
SQL> create table geo_lite_asn (
2 "network" varchar2(32),
3 "autonomous_system_number" number,
4 "autonomous_system_organization" varchar2(200))
5 /
Table GEO_LITE_ASN created.
SQL> load geo_lite_asn GeoLite2-ASN-CSV_20180130/GeoLite2-ASN-Blocks-IPv4.csv
--Number of rows processed: 397,040
--Number of rows in error: 0
0 - SUCCESS: Load processed without errors
SQL>