Apache Pig: how to load the files listed in a filenames.txt

I have a list of file names stored in a filenames.txt. Is it possible to load them all together using a single LOAD command?
They are not in the same directory, nor do they share a naming pattern, so it is not like using /201308* to load 20130801.gz through 20130831.gz.
Plus, there are too many files in the list, which prevents me from doing something like this:
shell: pig -f script.pig -param input=/user/training/test/{20100810..20100812}
pig: temp = LOAD '$input' USING SomeLoader() AS (...);
Thanks in advance for insights!

If the number of files is reasonably small (i.e. the assembled command line stays under ARG_MAX), you may try to concatenate the lines of the file into one comma-separated string:
pig -param input="`paste -sd, filenames.txt`" -f script.pig
(paste -sd, joins the lines with commas; tr "\n" "," would also work but leaves a trailing comma that LOAD may trip over.)
script.pig:
A = LOAD '$input' ....
It would probably be better to list the directories rather than the individual files, if that is an option for you.
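For example, Pig's LOAD accepts a comma-separated list of paths, and each entry can be a directory; a minimal sketch (the directory names here are made up, and SomeLoader is the placeholder loader from the question):
A = LOAD '/logs/201308,/archive/201309' USING SomeLoader() AS (...);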

Related

How to execute pig script and save the result in another file?

I have a "solution.pig" file which contain all load, join and dump queries. I need to run them by typing "solution.pig" in grunt> and save all the result in other file. How can I do that?
You can run the file directly with pig -f solution.pig; don't open the grunt REPL.
And in the file, you can use as many STORE commands as you want to save results into files, rather than DUMP.
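For instance, a minimal sketch of what solution.pig might look like (the paths and relation names here are made up):
-- solution.pig
A = LOAD '/user/training/input.txt' USING PigStorage(',');
B = FILTER A BY $0 IS NOT NULL;
-- STORE writes the result to an HDFS directory instead of printing it like DUMP
STORE B INTO '/user/training/output' USING PigStorage(',');
Then run it from the shell: pig -f solution.pig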

Storing output to CSV file in Pig

While trying to store output to a CSV file in Pig, the command runs successfully, but a new folder is created in the destination location instead of a file with the given name.
Can you please help me?
This is the command I used:
STORE A into '/home/cloudera/Downloads/res.csv';
The STORE command writes its output to HDFS as a directory, and the number of part files inside it equals the number of reducers used. If you want the results in a single CSV file, you have to merge the part files, write the merged file to the local file system, and then copy it back to the location of your choice.
You can run the Hadoop fs commands from within your Pig script:
fs -getmerge /home/cloudera/Downloads/res.csv /your/local/dir/res.csv
fs -copyFromLocal /your/local/dir/res.csv /home/cloudera/Downloads/res_merged.csv
(res_merged.csv is just a fresh name so the upload does not collide with the res.csv output directory.)
Or, since piping does not work inside Pig's fs command, merge and write back in one step from the shell:
hadoop fs -cat /home/cloudera/Downloads/res.csv/part-* | hadoop fs -put - /home/cloudera/Downloads/res_merged.csv
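If a single file is all that is needed, another option is to force a single reducer so that STORE itself writes exactly one part file; a minimal sketch (note this only helps for jobs that have a reduce phase, and PigStorage(',') is what actually makes the output comma-separated):
SET default_parallel 1;
STORE A INTO '/home/cloudera/Downloads/res' USING PigStorage(',');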

Apache pig load multiple files

I have the following folder structure containing my content, all adhering to the same schema:
/project/20160101/part-v121
/project/20160105/part-v121
/project/20160102/part-v121
/project/20170104/part-v121
I have implemented a Pig script which uses JsonLoader to load and process individual files. However, I need to make it generic so it reads all the files under the dated folders.
Right now I have managed to extract the file paths using the following -
hdfs dfs -ls hdfs://local:8080/project/20* > /tmp/ei.txt
cat /tmp/ei.txt | awk '{print $NF}' | grep part > /tmp/res.txt
Now I need to know how to pass this list to the Pig script so that my program runs on all the files.
You can use glob patterns (not full regex) in the LOAD path, and they expand across directories.
In your case the statement below should help; let me know if you face any issues.
A = LOAD 'hdfs://local:8080/project/20*/part*' USING JsonLoader();
This assumes a .pig_schema file (produced by JsonStorage) is present in each input directory.
Ref : https://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
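Alternatively, since the paths have already been extracted to /tmp/res.txt, the list can be passed in as a parameter, along the same lines as the filenames.txt answer above (a sketch; script.pig is a made-up name):
pig -param input="`paste -sd, /tmp/res.txt`" -f script.pig
and inside script.pig:
A = LOAD '$input' USING JsonLoader();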

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' USING PigStorage();
STORE Files INTO 'hdfs/output/path' USING PigStorage();
Once Pig is done with the merging, is there a way to remove the input files? I'd like to check that the output file has been written and is not empty (i.e. not 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so ideally I'd remove only the files in the Files relation.
With Pig alone it is not possible, I guess. Instead, what you can do is load with PigStorage's -tagsource option to capture each record's source file name and store those names somewhere. Then use the HDFS FileSystem API to read the stored list and remove exactly the files that were merged by Pig.
A = LOAD '/path/' USING PigStorage(',', '-tagsource'); -- the first argument is your field delimiter
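A minimal sketch of the filename-collection step (the paths are made up; with -tagsource, the source file name is prepended as the first field of every record):
A = LOAD 'hdfs/input/path' USING PigStorage(',', '-tagsource');
-- keep only the injected file name column and de-duplicate it
names = FOREACH A GENERATE $0 AS filename;
merged_names = DISTINCT names;
STORE merged_names INTO '/tmp/merged_filenames' USING PigStorage();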
You should be able to run Hadoop fs commands from within your Pig script:
1. Move the input files to a new folder
2. Merge the input files into the output folder
3. Remove the new folder
fs -mv hdfs/input/path hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' USING PigStorage();
STORE Files INTO 'hdfs/output/path' USING PigStorage();
fs -rm -r hdfs/input/new_path
(fs -mv moves the files in a single metadata operation; the originally suggested distcp would copy them instead, and rmdir only removes empty directories, hence fs -rm -r.)
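To also honor the "written and not empty" check before deleting, hadoop fs -test can be used from the shell; a sketch (the part file name is an assumption; a map-only merge like this one typically writes part-m-00000):
hadoop fs -test -s hdfs/output/path/part-m-00000 && hadoop fs -rm -r hdfs/input/new_path
-test -s exits with status 0 only if the path exists and is not zero length, so the && guards the delete.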

Parameter substitution in Pig

students = load '/home/vm4learning/Desktop/students-db.txt' using PigStorage('|') as (rnum, sname, name, age, gender, class, subject, marks);
I am facing a syntax error while using parameter substitution for /home/vm4learning/Desktop/students-db.txt.
What is the correct command, with proper syntax, to use here?
Thanks
You need to specify an HDFS path in your Pig LOAD statement.
First copy your input file into HDFS; then you can use that HDFS path in your Pig script.
You can copy the input file into HDFS with the hadoop fs -put command:
hadoop fs -put /home/vm4learning/Desktop/students-db.txt /user/input
then you can use that path in your pig script
students = load '/user/input/students-db.txt' using PigStorage('|') as (.....);
UPDATE:
Save your Pig script in a file with the .pig extension.
process.pig:
students = load '$inputPath' using PigStorage('|') as (.....);
Now from the terminal you can issue the following command to execute your Pig file, passing the input path as an argument:
pig -p inputPath=/user/input/students-db.txt process.pig
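Inside the script, %default can also give the parameter a fallback value, so the script still runs when no -p/-param is passed; a minimal sketch:
%default inputPath '/user/input/students-db.txt'
students = load '$inputPath' using PigStorage('|') as (.....);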
To check how the parameters will be substituted without actually running the script, use Pig's -dryrun option: pig -dryrun -param key=value -param key2=value2 script.pig (it writes the fully substituted script to script.pig.substituted).