How to execute pig script and save the result in another file? - apache-pig

I have a "solution.pig" file which contains all my load, join and dump statements. I need to run them by typing "solution.pig" in grunt> and save all the results in another file. How can I do that?

You can run the file directly with pig -f solution.pig; there is no need to open the grunt REPL.
In the file, use STORE rather than DUMP. You can use as many STORE commands as you want to save results into files.
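
As a rough sketch of the above, the shell snippet below writes a tiny solution.pig that ends in STORE rather than DUMP and then runs it in batch mode. The input path, schema, and output directory are made-up placeholders, not taken from the question, and the pig call is skipped when pig is not installed.

```shell
#!/bin/sh
# Write a minimal Pig script; paths and schema are hypothetical placeholders.
cat > solution.pig <<'EOF'
A = LOAD 'input/data.csv' USING PigStorage(',') AS (id:int, name:chararray);
-- STORE writes the relation to an output directory instead of printing it like DUMP
STORE A INTO 'output/result' USING PigStorage(',');
EOF

# Run the script without entering grunt; skipped if pig is not on PATH.
if command -v pig >/dev/null 2>&1; then
  pig -f solution.pig
fi
```

If you are already at the grunt> prompt, exec solution.pig (or run solution.pig) executes the same file.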

Related

Behave framework .ini file usage

In my .ini file I have
[behave]
format=rerun
outfiles=rerun_failing.features
So I want to use the "rerun_failing.features" file for storing scenarios that fail.
However, when I run the '--steps-catalog' command, it also stores that catalog in the same file. Why is that?
How can I set up two separate files for the '--rerun' and '--steps-catalog' commands?
Thanks!
Use behave --dry-run -f steps.catalog ... instead. The output of the steps.catalog formatter is written to stdout, not the "rerun-outputfile".

Apache pig load multiple files

I have the following folder structure containing my content adhering to the same schema -
/project/20160101/part-v121
/project/20160105/part-v121
/project/20160102/part-v121
/project/20170104/part-v121
I have implemented a Pig script which uses JsonLoader to load and process individual files. However, I need to make it generic so that it reads all the files under the dated folders.
Right now I have managed to extract the file paths using the following -
hdfs dfs -ls hdfs://local:8080/project/20* > /tmp/ei.txt
cat /tmp/ei.txt | awk '{print $NF}' | grep part > /tmp/res.txt
Now I need to know how to pass this list to the Pig script so that my program runs on all the files.
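You can use a glob pattern (not a full regex) in the path of the LOAD statement.
In your case the statement below should help; let me know if you face any issues.
A = LOAD 'hdfs://local:8080/project/20160102/*' USING JsonLoader();
To pick up every dated folder in one statement, widen the glob, for example 'hdfs://local:8080/project/20*/part-*'.
This assumes a .pig_schema file (produced by JsonStorage) is present in the input directory.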
Ref : https://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging is there a way to remove the input files? I'd like to check that the file has been written and is not empty (i.e. 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so that ideally I'd remove only the ones in the Files variable.
I don't think this is possible with Pig alone. Instead, you can use -tagsource with the LOAD statement to get each record's source file name and store those names somewhere. Then use the HDFS FileSystem API to read the stored names and remove the files that Pig merged.
A = LOAD '/path/' using PigStorage('delimiter','-tagsource');
You should be able to use Hadoop file system commands in your Pig script, through Pig's fs command:
Move input files to a new folder
Merge input files to output folder
Remove input files from the new folder
fs -mv hdfs/input/path hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' using PigStorage();
STORE Files into 'hdfs/output/path' using PigStorage();
fs -rm -r hdfs/input/new_path

apache pig how to load files in a filenames.txt

I have a list of file names stored in a filenames.txt. Is it possible to load them all together using a single LOAD command?
They are not in the same directory, nor with similar format, so it is not like using /201308 to load 20130801.gz through 20130831.gz.
Plus, there are too many files in the list for me to do something like this:
shell: pig -f script.pig -param input=/user/training/test/{20100810..20100812}
pig: temp = LOAD '$input' USING SomeLoader() AS (...);
Thanks in advance for insights!
If the number of files is reasonably small (e.g., the resulting command line fits into ARG_MAX), you can try concatenating the lines of the file into one string:
pig -param input=`cat filenames.txt | tr "\n" ","` -f script.pig
script.pig:
A = LOAD '$input' ....
Probably it would be better to list the directories rather than the individual files if it is an option for you.
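
One detail worth checking in the command above: tr "\n" "," turns the final newline into a trailing comma, and the unquoted backticks let the shell split the value on whitespace. A minimal sketch that avoids both, assuming a POSIX shell with paste available; the paths written to filenames.txt here are invented sample data:

```shell
#!/bin/sh
# Sample list of inputs (hypothetical paths, one per line).
printf '%s\n' /data/a/part-0001.gz /logs/b/20130801.gz /tmp/c.txt > filenames.txt

# paste -s joins all lines into one; -d, separates them with commas
# and adds no trailing comma.
input=$(paste -sd, filenames.txt)
echo "$input"

# Pass the list as one quoted parameter; skipped if pig is not on PATH.
if command -v pig >/dev/null 2>&1; then
  pig -param input="$input" -f script.pig
fi
```

The ARG_MAX limit mentioned above still applies, since the whole list ends up on the pig command line.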

Using Pentaho Kettle, how can I convert a csv using commas to a csv with pipe delimiters?

I have a CSV input file with commas. I need to change the delimiter to pipe. Which step should I use in Pentaho kettle? Please do suggest.
Thanks!
Don't use a big gun to shoot a small target. You can use sed or awk. If you want to integrate this with Kettle, use a step that runs a shell script and call sed inside that script, for example.
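
For instance, a sed one-liner along these lines would do the conversion outside Kettle. This is only a sketch: input.csv and output.csv are placeholder names with made-up data, and a plain substitution like this also rewrites commas inside quoted fields, so it suits only simple CSV files.

```shell
#!/bin/sh
# Create a small sample file (made-up data) to demonstrate the conversion.
printf '%s\n' 'id,name,city' '1,Ann,Oslo' > input.csv

# Replace every comma with a pipe; this does not understand quoted fields.
sed 's/,/|/g' input.csv > output.csv
cat output.csv
```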
If your goal is to output a pipe separated CSV file from data within a transform and you're already running Kettle, just use a Text File output step.
If the goal is to do something unusual with CSV data within the transform itself, you might look into the Concat Fields step.
If the goal is simply to take a CSV file and write out another CSV with different separators, use the solution @martinnovoty suggests.
You can achieve this easily:
Load your CSV into a variable "foo", add a JavaScript step after the load step, and put this code in the JS step:
var newFoo = replace(foo,",", "|");
Now your CSV data is available in the newFoo variable with pipes.