How to search through directory and copy into database - sql

For a single file, I would use (I'm using PostgreSQL):
COPY mytable FROM '/usr/info.csv' NULL '' HEADER CSV;
However, now I have a folder with possibly 60-100 files in the same format, whose names end in PROCESSED.csv (e.g. infoPROCESSED.csv), and I was wondering if there is a method in PostgreSQL to go through the directory and copy the files into mytable, or would I have to write a script to do this?

Here's a (hackish) way of loading the csv files:
for x in *.ext; do psql -d yourdb -qtAc "copy mytable from '/path/to/files/$x' csv header null ''"; done
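If the csv files live on the client machine rather than on the database server (server-side COPY can only read paths the server sees), the same loop should work with psql's \copy instead. A minimal sketch, assuming the files sit in /path/to/files and follow the PROCESSED.csv naming from the question:
for x in /path/to/files/*PROCESSED.csv; do
  psql -d yourdb -qtAc "\copy mytable from '$x' csv header null ''"
done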

Related

Create a csv file of a view in hive and put it in s3 with headers excluding the table names

I have a view in hive named prod_schoool_kolkata. I used to get the csv as:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata' | sed 's/[\t]/,/g' > /home/data/prod_schoool_kolkata.csv
That was on the EC2 instance. I want the path to be in S3.
I tried giving the path like:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata' | sed 's/[\t]/,/g' > s3://data/prod_schoool_kolkata.csv
But the csv is not getting stored.
I also have the problem that the csv file that does get generated has every column header in the pattern tablename.columnname, for example prod_schoool_kolkata.id. Is there any way to remove the table names from the csv being formed?
You have to first install the AWS Command Line Interface.
Refer to the link Installing the AWS Command Line Interface and follow the relevant installation instructions, or go to the sections at the bottom to get the installation links for your operating system (Linux/Mac/Windows etc.).
After verifying that it is installed properly, you can run normal commands like cp, ls etc. against the AWS file system. So you could do:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata'|
sed 's/[\t]/,/g' > /home/data/prod_schoool_kolkata.csv
aws s3 cp /home/data/prod_schoool_kolkata.csv s3://data/prod_schoool_kolkata.csv
Also see How to use the S3 command-line tool
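If you would rather not keep the intermediate file on the instance, the query output can also be piped straight into the CLI. This is only a sketch, assuming an AWS CLI version that supports streaming from stdin (the - source) and that credentials for the s3://data bucket are already configured:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata' |
sed 's/[\t]/,/g' |
aws s3 cp - s3://data/prod_schoool_kolkata.csv
As for the tablename.columnname headers, one option is to strip the prefix from the first line with sed as well, e.g. sed '1s/prod_schoool_kolkata\.//g', before uploading.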

Export PSQL table to CSV file using cmd line args

I am attempting to export my PSQL table to a CSV file. I have read many tips online, such as Save PL/pgSQL output from PostgreSQL to a CSV file
From these posts I have been able to figure out
\copy (Select * From foo) To '/tmp/test.csv' With CSV
However, I do not want to hard-code the path; I would like the user to be able to enter it via the export shell script. Is it possible to pass args to a .sql script from a .sh script?
I was able to figure out the solution.
In my bash script I used something similar to the following
psql "$1" -c "\copy (SELECT * FROM users) TO '$2' DELIMITER ',' CSV HEADER"
This code copies the users table from the database specified by the first argument and exports it to a csv file located at the second argument.
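For completeness, a hypothetical export.sh wrapping that command could look like the following (the users table and the file names are placeholders for illustration):
#!/bin/bash
# usage: ./export.sh mydatabase /tmp/users.csv
DB="$1"
OUTFILE="$2"
# \copy runs on the client, so OUTFILE is written on the machine running this script
psql "$DB" -c "\copy (SELECT * FROM users) TO '$OUTFILE' DELIMITER ',' CSV HEADER"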

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging, is there a way to remove the input files? I'd like to check that the output file has been written and is not empty (i.e. 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so ideally I'd remove only the ones in the Files variable.
With Pig alone it is not possible, I guess. Instead, what you can do is use -tagsource with the LOAD statement to get the filename and store it somewhere. Then use the HDFS FileSystem API to read from the stored file and remove the files that were merged by Pig.
A = LOAD '/path/' USING PigStorage(',', '-tagsource'); -- the first argument is the field delimiter
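The clean-up step could then be done from the shell with hadoop fs instead of the Java FileSystem API. A rough sketch, assuming the Pig job also STOREs the distinct tagged filenames to a (hypothetical) location such as hdfs/output/merged_filenames:
# read the list of merged file names and delete each one from the input directory
hadoop fs -cat hdfs/output/merged_filenames/part-* | sort -u | while read f; do
  hadoop fs -rm "hdfs/input/path/$f"
done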
You should be able to use hadoop commands in your Pig script
Move input files to a new folder
Merge input files to output folder
Remove input files from the new folder
fs -mv hdfs/input/path hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' using PigStorage();
STORE Files into 'hdfs/output/path' using PigStorage();
fs -rm -r hdfs/input/new_path

Hive output to xlsx

I am not able to open an .xlsx file. Is this the correct way to output the result to an .xlsx file?
hive -f hiveScript.hql > output.xlsx
This will work:
hive -S -f hiveScript.hql > output.xls
There is no easy way to create an Excel (.xlsx) file directly from hive. You could output your query's content to an older Excel format (.xls) as in the answers given above and it would open in Excel properly (with an initial warning in the latest versions of Office), but in essence it is just a text file with a .xls extension. If you open this file with any text editor you will see the contents of the query output.
Take any .xlsx file on your system and open it with a text editor and see what you get. It will be all junk characters since that is not a simple text file.
Having said that, there are many programming languages that allow you to read a text file and create an xlsx. Since no information is provided/requested on this, I will not go into details. However, you may use pandas in Python to create Excel files.
I output a csv or tsv file and used Python (the pandas library) to do the conversion.
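For example, a one-liner along those lines, assuming pandas and openpyxl are installed and the query output was saved as tab-separated output.tsv (both file names are placeholders):
python -c "import pandas as pd; pd.read_csv('output.tsv', sep='\t').to_excel('output.xlsx', index=False)"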
I am away from my setup right now so really cannot test this. But you can give this a try in your hive shell:
hive -f hiveScript.hql >> output.xls

SQLLDR control file: Loading multiple files

I am trying to load several data files into a single table. Now the files themselves have the following format:
file_uniqueidentifier.dat_date
My control file looks like this
LOAD DATA
INFILE '/home/user/file*.dat_*'
into TABLE NEWFILES
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(
FIRSTNAME CHAR NULLIF (FIRSTNAME=BLANKS)
,LASTNAME CHAR NULLIF (LASTNAME=BLANKS)
)
My SQLLDR on the other hand looks like this
sqlldr control=loader.ctl, userid=user/pass#oracle, errors=99999,direct=true
The error produced is: SQL*Loader-500: unable to open file (/home/user/file*.dat_*), SQL*Loader-553: file not found.
Does anyone have an idea as to how I can deal with this issue?
SQLLDR does not recognize the wildcard. The only way to have it use multiple files is to list them explicitly. You could probably do this using a shell script.
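One common workaround, sketched below, is to call sqlldr once per file and pass the file name through the DATA= command-line parameter, which takes the place of the first INFILE in the control file. Paths and credentials are placeholders, and the control file should say APPEND INTO TABLE so that loads after the first one do not fail on a non-empty table:
for f in /home/user/file*.dat_*; do
  sqlldr userid=user/pass@oracle control=loader.ctl data="$f" errors=99999
done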
Your file naming convention looks like you could combine those files into one, and have that single file be used by the sqlldr control file. I don't know how you would combine those files into one file in Unix, but in Windows I can issue this command:
copy file*.dat* file.dat
This command reads the contents of all files whose names start with file and have a dat extension and puts them into file.dat.
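On Unix the equivalent would presumably just be cat, using the naming pattern from the question:
cat file*.dat_* > file.dat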
I have used this option and it works fine for uploading multiple files into a single table.
-- SQL-Loader Basic Control File
options ( skip=1 )
load data
infile 'F:\oracle\dbHome\BIN\sqlloader\multi_file_insert\dept1.csv'
infile 'F:\oracle\dbHome\BIN\sqlloader\multi_file_insert\dept2.csv'
truncate into table scott.dept2
fields terminated by ","
optionally enclosed by '"'
( DEPTNO
, DNAME
, LOC
, entdate
)
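A control file like this is then run with a single invocation, for example (control and log file names are placeholders):
sqlldr userid=scott/tiger control=multi_file_insert.ctl log=multi_file_insert.log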