I have a CSV file stored on a remote machine. I need to load this data into my Hive database, which is installed on a different machine. Is there any way to do this?
note: I am using Hive 0.12.
Since Hive basically applies a schema to data that resides in HDFS, you'll want to create a location in HDFS, move your data there, and then create a Hive table that points to that location. If you're using a commercial distribution, this may be possible from Hue (the Hadoop User Environment web UI).
Here's an example from the command line.
Create a CSV file on the local machine:
$ vi famous_dictators.csv
... and this is what the file looks like:
$ cat famous_dictators.csv
1,Mao Zedong,63000000
2,Jozef Stalin,23000000
3,Adolf Hitler,17000000
4,Leopold II of Belgium,8000000
5,Hideki Tojo,5000000
6,Ismail Enver Pasha,2500000
7,Pol Pot,1700000
8,Kim Il Sung,1600000
9,Mengistu Haile Mariam,950000
10,Yakubu Gowon,1100000
Then scp the csv file to a cluster node:
$ scp famous_dictators.csv hadoop01:/tmp/
ssh into the node:
$ ssh hadoop01
Create a folder in HDFS:
[awoolford@hadoop01 ~]$ hdfs dfs -mkdir /tmp/famous_dictators/
Copy the csv file from the local filesystem into the HDFS folder:
[awoolford@hadoop01 ~]$ hdfs dfs -copyFromLocal /tmp/famous_dictators.csv /tmp/famous_dictators/
Then log in to Hive and create the table:
[awoolford@hadoop01 ~]$ hive
hive> CREATE TABLE `famous_dictators`(
> `rank` int,
> `name` string,
> `deaths` int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> LOCATION
> 'hdfs:///tmp/famous_dictators';
You should now be able to query your data in Hive:
hive> select * from famous_dictators;
OK
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
3 Adolf Hitler 17000000
4 Leopold II of Belgium 8000000
5 Hideki Tojo 5000000
6 Ismail Enver Pasha 2500000
7 Pol Pot 1700000
8 Kim Il Sung 1600000
9 Mengistu Haile Mariam 950000
10 Yakubu Gowon 1100000
Time taken: 0.789 seconds, Fetched: 10 row(s)
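As a variation not in the walkthrough above: instead of creating an HDFS directory and copying the file into it yourself, you could create the table without a LOCATION and let Hive move the file from the node's local filesystem into its warehouse directory with LOAD DATA LOCAL INPATH (the statements below reuse the same file and table names):

hive> CREATE TABLE `famous_dictators`(
> `rank` int,
> `name` string,
> `deaths` int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n';
hive> LOAD DATA LOCAL INPATH '/tmp/famous_dictators.csv' INTO TABLE famous_dictators;

With LOCAL, the file is read from the local filesystem of the machine running the Hive CLI, so you still need to scp the csv to the cluster node first, but not run the hdfs dfs commands.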
I have a folder with multiple CSV files; they all have the same column attributes.
My goal is to turn every CSV file into a distinct PostgreSQL table named after the file, but as there are 1k+ of them it would be a pretty long process to do manually.
I've been searching for a solution the whole day, but the closest I've come to solving the problem is this code:
for filename in select pg_ls_dir2 ('/directory_name/') loop
if (filename ~ '.csv$') THEN create table filename as fn
copy '/fullpath/' || filename to table fn
end if;
END loop;
The logic behind this code is to select every filename inside the folder, create a table named after the file, and import the content into that table.
The issue is that I have no idea how to actually put that into practice; for instance, where should I execute this code, since both for and pg_ls_dir2 are not SQL instructions?
If you use DBeaver, there is a recently added feature in the software that addresses this exact issue. (On Windows) Right-click the "Tables" section inside your schema (not your target table!), then select "Import data"; you can select all the .csv files you want at the same time, creating a new table for each file, as you mentioned.
Normally, I don't like giving the answer directly, but I think you will need to change a few things at least.
Based on the example from here, I prepared a small example using a bash script. Let's assume you are in the directory where your files are kept.
postgres@213b483d0f5c:/home$ ls -ltr
total 8
-rwxrwxrwx 1 root root 146 Jul 25 13:58 file1.csv
-rwxrwxrwx 1 root root 146 Jul 25 14:16 file2.csv
On the same directory you can run:
for i in `ls | grep csv`
do
  export table_name=`echo $i | cut -d "." -f 1`;
  psql -d test -c "CREATE TABLE $table_name(emp_id SERIAL, first_name VARCHAR(50), last_name VARCHAR(50), dob DATE, city VARCHAR(40), PRIMARY KEY(emp_id));";
  psql -d test -c "\COPY $table_name(emp_id,first_name,last_name,dob,city) FROM './$i' DELIMITER ',' CSV HEADER;";
done
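If you would rather stay inside PostgreSQL, the pseudocode from the question can be turned into an anonymous DO block. This is only a sketch: it uses the built-in pg_ls_dir (not the pg_ls_dir2 from the question), it assumes the CSV files live on the database server under /directory_name/, it needs superuser privileges (or the pg_read_server_files role) because pg_ls_dir and server-side COPY are restricted, and the column list is a placeholder you would replace with your real columns:

DO $$
DECLARE
    fname text;
    tname text;
BEGIN
    FOR fname IN SELECT pg_ls_dir('/directory_name/') LOOP
        IF fname ~ '\.csv$' THEN
            -- table name = file name without the .csv extension
            tname := regexp_replace(fname, '\.csv$', '');
            EXECUTE format('CREATE TABLE %I (col1 text, col2 text, col3 text)', tname);
            EXECUTE format('COPY %I FROM %L WITH (FORMAT csv, HEADER)',
                           tname, '/directory_name/' || fname);
        END IF;
    END LOOP;
END $$;

Because the block runs server-side, the paths are resolved on the machine where PostgreSQL itself runs, not on the client; the bash/\COPY approach above is the one to use when the files are only on your client machine.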
I am trying to load data into a Netezza database using the "nzload" utility. The control file below works without any issues.
Is there a way to provide multiple data files as the input in a single control file?
DATAFILE C:\Karthick\data.txt
{
Database test1
TableName test
Delimiter '%'
maxErrors 20
Logfile C:\Karthick\importload.log
Badfile C:\Karthick\inventory.bad
}
$ cat my_control_file
datafile my_file1 {}
datafile my_file2 {}
datafile my_file3 {}
datafile my_file4 {}
# Below, I specify many of the options
# on the command line itself ... so I don't have
# to repeat them in the control file.
$ nzload -db system -t my_table -delim "|" -maxerrors 10 -cf my_control_file
Load session of table 'MY_TABLE' completed successfully
Load session of table 'MY_TABLE' completed successfully
Load session of table 'MY_TABLE' completed successfully
Load session of table 'MY_TABLE' completed successfully
Yes, you can specify multiple data files in a single control file. Those data files can be loaded into the same table or into different tables. See an example at https://www.ibm.com/docs/en/psfa/7.2.1?topic=command-nzload-control-file
The following two data files, "/tmp/try1.dat" and "/tmp/try2.dat", will be loaded into a table "test" in the "system" database:
[nz@nps ]$ cat /tmp/try1.dat
1
2
[nz@nps ]$ cat /tmp/try2.dat
3
4
The following control file defines two "DATAFILE" blocks, one for each data file.
[nz@nps ]$ cat /tmp/try.cf
DATAFILE /tmp/try1.dat
{
Database system
TableName test
Delimiter '|'
Logfile /tmp/try1.log
Badfile /tmp/try1.bad
}
DATAFILE /tmp/try2.dat
{
Database system
TableName test
Delimiter '|'
Logfile /tmp/try2.log
Badfile /tmp/try2.bad
}
Load the data using the "nzload -cf" option and verify that the data is loaded.
[nz@nps ]$ nzload -cf /tmp/try.cf
Load session of table 'TEST' completed successfully
Load session of table 'TEST' completed successfully
[nz@nps ]$ nzsql -c "select * from test"
A1
----
2
3
4
1
(4 rows)
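To load each file into a different table instead of the same one, each DATAFILE block simply names its own TableName. A sketch following the same control-file syntax (the table names test_a and test_b are made up for illustration):

DATAFILE /tmp/try1.dat
{
Database system
TableName test_a
Delimiter '|'
Logfile /tmp/try1.log
Badfile /tmp/try1.bad
}
DATAFILE /tmp/try2.dat
{
Database system
TableName test_b
Delimiter '|'
Logfile /tmp/try2.log
Badfile /tmp/try2.bad
}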
I am familiar with storing the output/results of a Hive query to a file, but what command do I use in the script to display the results of the HQL on the terminal?
Normally Hive prints results to stdout; if not redirected, they are displayed on the console. You do not need any special command for this.
If you want to display results on the console screen and at the same time store them in a file, use the tee command:
hive -e "use mydb; select * from test_t" | tee ./results.txt
OK
123 {"value(B)":"Bye"}
123 {"value(G)":"Jet"}
Time taken: 1.322 seconds, Fetched: 2 row(s)
Check that the file contains the results:
cat ./results.txt
123 {"value(B)":"Bye"}
123 {"value(G)":"Jet"}
See here: https://ru.wikipedia.org/wiki/Tee
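The same approach works when the query lives in a script file rather than on the command line, for example (the script name is just a placeholder):

hive -f myscript.hql | tee ./results.txt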
As for my own result: at first there was no output, because I had not yet used the LOAD DATA INPATH command to load my data into HDFS. After loading, I received output from the SELECT statement in the script.
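For illustration only (the HDFS path and table name below are placeholders, not the ones from my environment), the load step looks like:

hive> LOAD DATA INPATH '/user/me/mydata.csv' INTO TABLE my_table;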
I can see the file is on HDFS.
$ hadoop fs -cat /user/root/1.txt
1
2
3
but Hive does not recognize the file:
hive> create table test4 (numm INT);
OK
Time taken: 0.187 seconds
hive> load data inpath '/user/root/1.txt' into table test4;
FAILED: SemanticException Line 1:17 Invalid path ''/user/root/1.txt'': No files matching path file:/user/root/1.txt
Loading the file from the local file system works fine, though.
Please put the complete path for the file.
E.g. load data inpath 'Namenode:' into table .
Hope this helps. Please let me know if you still face any difficulties.
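For illustration, using the file from this question (the namenode host and port are placeholders you would replace with your cluster's values), a fully qualified load would look something like:

hive> load data inpath 'hdfs://<namenode_host>:<port>/user/root/1.txt' into table test4;

The file:/user/root/1.txt prefix in the error message suggests Hive resolved the path against the local filesystem, which is why spelling out the hdfs:// scheme (or checking the default filesystem setting, fs.defaultFS) helps.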
I'm dealing with compressed (gzip) fixed-length flat files, which I then need to turn into delimited flat files so I can feed them to gpload. I was told it is possible to skip delimiting the file and decompressing it, and feed it directly to gpload, since it can handle compressed files.
Does anybody know of a way to delimit the file while it is in .gz format?
There is no way to delimit the gzip-compressed data without decompressing it. But you don't need to delimit it: you can just load it as fixed-width data, and it will be decompressed on the fly by gpfdist. Refer to the "Importing and Exporting Fixed Width Data" chapter in the admin guide here: http://gpdb.docs.pivotal.io/4330/admin_guide/load.html
Here's an example:
[gpadmin@localhost ~]$ gunzip -c testfile.txt.gz
Bob Jones 27
Steve Balmer 50
[gpadmin@localhost ~]$ gpfdist -d ~ -p 8080 &
[1] 41525
Serving HTTP on port 8080, directory /home/gpadmin
[gpadmin@localhost ~]$ psql -c "
> CREATE READABLE EXTERNAL TABLE students (
> name varchar(20),
> surname varchar(30),
> age int)
> LOCATION ('gpfdist://127.0.0.1:8080/testfile.txt.gz')
> FORMAT 'CUSTOM' (formatter=fixedwidth_in,
> name='20', surname='30', age='4');
> "
CREATE EXTERNAL TABLE
[gpadmin@localhost ~]$ psql -c "select * from students;"
name | surname | age
-------+---------+-----
Bob | Jones | 27
Steve | Balmer | 50
(2 rows)