I have the following folder structure, with all content adhering to the same schema:
/project/20160101/part-v121
/project/20160105/part-v121
/project/20160102/part-v121
/project/20170104/part-v121
I have implemented a Pig script which uses JsonLoader to load and process individual files. However, I need to make it generic so that it reads all the files under the dated folders.
Right now I have managed to extract the file paths using the following:
hdfs dfs -ls hdfs://local:8080/project/20* > /tmp/ei.txt
cat /tmp/ei.txt | awk '{print $NF}' | grep part > /tmp/res.txt
Now I need to know how to pass this list to the Pig script so that my program runs on all the files.
You can use a glob pattern in the LOAD statement's path.
In your case, the statement below should help; let me know if you face any issues.
A = LOAD 'hdfs://local:8080/project/20160102/*' USING JsonLoader();
This assumes a .pig_schema file (produced by JsonStorage) is present in the input directory.
Ref: https://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
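To cover every dated folder from the question rather than a single date, a glob over the date directories should also work, since Pig hands the path to Hadoop's glob expansion. A sketch based on the layout above:
A = LOAD 'hdfs://local:8080/project/20*/part-v121' USING JsonLoader();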
Related
I have a "solution.pig" file which contain all load, join and dump queries. I need to run them by typing "solution.pig" in grunt> and save all the result in other file. How can I do that?
You can run the file directly with pig -f solution.pig; don't open the grunt REPL.
And in the file, you can use as many STORE statements as you want to save the results into files, rather than DUMP.
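A minimal sketch, assuming one of the relations in solution.pig is called result (a hypothetical name) and swapping its DUMP for a STORE with a hypothetical output path:
STORE result INTO '/user/training/output/result' USING PigStorage(',');
Then run the whole file non-interactively from the shell:
pig -f solution.pig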
I have multiple files starting with DUMP_*.
Each file has data for a particular dump.
I want to print the filename as well as the contents of each file to stdout.
The expected output should be:
FILENAME
ALL CONTENTS OF FILE
and so on
The closest thing I have tried is:
cat $(ll DUMP_* | awk -F ' ' '{print $9}' ) | less
With this I am not able to figure out which content belongs to which file.
Also, I am reluctant to use a shell script; an ad hoc command is preferred.
This answer is not fully in line with your expected output, but it shows the link between a filename and its content even more clearly:
Situation:
Prompt>cat DUMP_1
Info
More Info
Prompt>cat DUMP_2
Info
Solution:
Prompt>grep "" DUMP_*
DUMP_1:Info
DUMP_1:More Info
DUMP_2:Info
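If you specifically want the FILENAME-then-contents layout from the question, a one-line loop (still an ad hoc command typed at the prompt, not a script file) should produce it, roughly:
for f in DUMP_*; do echo "$f"; cat "$f"; echo; done | less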
I have a view in Hive named prod_schoool_kolkata. I used to get the CSV as:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata' | sed 's/[\t]/,/g' > /home/data/prod_schoool_kolkata.csv
That was on an EC2 instance. I want the path to be in S3.
I tried giving the path like:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata' | sed 's/[\t]/,/g' > s3://data/prod_schoool_kolkata.csv
But the CSV is not getting stored.
I also have a problem where the CSV file is generated, but every column header follows the pattern tablename.columnname, for example prod_schoool_kolkata.id. Is there any way to remove the table names from the CSV that gets formed?
You have to first install the AWS Command Line Interface.
Refer to the link Installing the AWS Command Line Interface and follow the relevant installation instructions, or go to the sections at the bottom to get the installation links for your operating system (Linux/Mac/Windows, etc.).
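A quick sanity check that the CLI is installed and has credentials configured might look like this (the bucket is just the one named in the question):
aws --version
aws configure   # prompts for the access key, secret key, and default region
aws s3 ls s3://data/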
After verifying that it's installed properly, you may run normal commands like cp, ls, etc. against the S3 file system. So, you could do:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata'|
sed 's/[\t]/,/g' > /home/data/prod_schoool_kolkata.csv
aws s3 cp /home/data/prod_schoool_kolkata.csv s3://data/prod_schoool_kolkata.csv
Also see How to use the S3 command-line tool
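For the tablename.columnname headers mentioned in the question, disabling unique column names in the same session should drop the table prefix (a sketch; whether this property is available depends on your Hive version):
hive -e 'set hive.cli.print.header=true; set hive.resultset.use.unique.column.names=false; select * from prod_schoool_kolkata' | sed 's/[\t]/,/g' > /home/data/prod_schoool_kolkata.csv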
I have a number of files created by a program on our selling system that are produced in a format like the following:
CRY_SKI_14_EDI.LIS
CRY_SUM_14_EDI.LIS
THO_SKI_14_EDI.LIS
THO_LAK_14_EDI.LIS
CRY_SKI_IE_14_EDI.LIS
The number of these files differs depending on the split of our product across different brandings. Is it possible to rename them all so that they read like the following:
CRY_SKI_14_EDI_DEMO.LIS
CRY_SUM_14_EDI_DEMO.LIS
THO_SKI_14_EDI_DEMO.LIS
THO_LAK_14_EDI_DEMO.LIS
CRY_SKI_IE_14_EDI_DEMO.LIS
I need the files to be correctly named prior to their FTP transfer, because a hardcoded filename could potentially not exist (if that brand is not on sale), which would terminate the FTP and prevent the files following it from being transmitted to our FTP server.
The OpenVMS RENAME command is handier (imho) than the Windows or Unix variants, because it can bulk-change chunks of the full file name, such as the 'name', 'type', or (sub)directory.
For example:
$ rename *.dat *.old
That's great, but it will not change text within those chunks (components), like the name part requested here.
For that, the classic DCL approach is a quick loop, either parsing directory output (Yuck!) or using F$SEARCH. For example:
$loop:
$ file = f$search("*EDI.LIS")
$ if file .eqs. "" then exit
$ name = f$parse(file,,,"name","syntax_only") ! grab name component from full name
$ rename/log 'file' 'name'_demo ! rename 'fills in the blanks'
$ goto loop
Personally, I use Perl one-liners for this kind of work.
(and I test with -le using 'print' instead of 'rename' first. :-)
$ perl -e "for (<*edi.lis>) { $old = $_; s/_edi/_edi_demo/; rename $old,$_}"
Enjoy!
Hein
I have a list of file names stored in a file, filenames.txt. Is it possible to load them all together using a single LOAD command?
They are not in the same directory, nor do they follow a similar naming format, so it is not like using /201308 to load 20130801.gz through 20130831.gz.
Plus, there are too many files in the list, which prevents me from doing something like this:
shell: pig -f script.pig -param input=/user/training/test/{20100810..20100812}
pig: temp = LOAD '$input' USING SomeLoader() AS (...);
Thanks in advance for insights!
If the number of files is reasonably small (e.g. the resulting command line fits within ARG_MAX), you may try concatenating the lines in the file into one comma-separated string:
pig -param input=`cat filenames.txt | tr "\n" ","` -f script.pig
script.pig:
A = LOAD '$input' ....
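A slightly expanded sketch of the same idea, guarding against the trailing comma that the final newline in filenames.txt would otherwise leave behind (shell side only; script.pig is unchanged):
input=$(tr '\n' ',' < filenames.txt | sed 's/,$//')
pig -param input="$input" -f script.pig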
It would probably be better to list the directories rather than the individual files, if that is an option for you.