Recursively listing directory with Apache pig - apache-pig

Is there a way to recursively list all files in a specific directory, using only Pig's built-in functions? An equivalent would be ls -R in bash.
There is an ls command, but it doesn't take parameters.
I'm aware this could easily be implemented in Java, but I would rather avoid that if possible.

To recursively list a directory in HDFS:
fs -lsr
On the local filesystem, you can use sh to run any shell command.
see http://pig.apache.org/docs/r0.12.0/cmds.html#fs
and http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
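For example, from the Grunt shell (a minimal sketch; the HDFS and local paths are placeholders):
grunt> fs -lsr /user/hive/data
grunt> sh ls -R /home/hive/data
The fs command runs a Hadoop FileSystem shell command against HDFS, while sh hands the rest of the line to the local shell.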

Related

gsutil rsync only files matching a pattern

I need to rsync files from a bucket to a local machine every day, and the bucket contains 20k files. I need to download only the changed files that end with *some_naming_convention.csv.
What's the best way to do that? Using a wildcard in the download source gave me an error.
I don't think you can do that with rsync. As Christopher told you, you can skip files by using the "-x" flag, but not just sync those [1]. I created a public Feature Request on your behalf [2] for you to follow updates there.
As I say in the FR, IMHO this doesn't fit the purpose of rsync, which is to keep folders/buckets synchronised; synchronising only some of their files doesn't fall within that purpose.
There is a possible "workaround" by using gsutil cp to copy files and -n to skip the ones that already exist. The whole command for your case should be:
gsutil -m cp -n <bucket>/*some_naming_convention.csv <directory>
Another option, maybe a little more far-fetched, is to copy/move those files to a folder and then rsync that folder.
I hope this works for you ;)
Original Answer
From here, you can do something like gsutil rsync -r -x '^(?!.*\.json$).*' gs://mybucket mydir to rsync all json files. The key is the ?! prefix to the pattern you actually want.
Edit
The -x flag excludes a pattern. The pattern ^(?!.*\.json$).* uses a negative look-ahead to match every name that does not end in .json. It follows that the gsutil rsync call will copy only the files which end in .json.
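For the asker's case, the same negative look-ahead trick should work with the CSV suffix (a sketch; gs://mybucket and mydir are placeholders):
gsutil rsync -r -x '^(?!.*some_naming_convention\.csv$).*' gs://mybucket mydir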
Rsync lets you include and exclude files matching patterns.
For each file rsync applies the first pattern that matches, so if you want to sync only selected files you need to include those, and then exclude everything else.
Add the following to your rsync options:
--include='*some_naming_convention.csv' --exclude='*'
That's enough if all your files are in one directory. If you also want to search sub folders then you need a little bit more:
--include='*/' --include='*some_naming_convention.csv' --exclude='*'
This will replicate the whole directory tree, but only copy the files you want. If that leaves empty directories you don't want, then add --prune-empty-dirs.
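Put together, a full command might look like this (a sketch; the source and destination paths are placeholders):
rsync -av --prune-empty-dirs --include='*/' --include='*some_naming_convention.csv' --exclude='*' /path/to/source/ /path/to/dest/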

Copying all files from a directory using a pig command

Hey I need to copy all files from a local directory to the HDFS using pig.
In the pig script I am using the copyFromLocal command with a wildcard in the source-path
i.e copyFromLocal /home/hive/Sample/* /user
It says the source path doesn't exist.
When I use copyFromLocal /home/hive/Sample/ /user , it makes another directory in the HDFS by the name of 'Sample', which I don't need.
But when I include the file name i.e /home/hive/Sample/sample_1.txt it works.
I don't need a single file. I need to copy all the files in the directory without making a directory in the HDFS.
PS: I've also tried *.txt, ?, ?.txt
No wildcards work.
Pig's copyFromLocal/copyToLocal commands work only on a single file or a directory; they will never take a series of files or a wildcard. Moreover, Pig concentrates on processing data from/to HDFS. To my knowledge you can't even loop over the files in a directory with ls, because it lists files in HDFS. So, for this scenario I would suggest you write a shell script/action (i.e. an fs command) to copy the files from local to HDFS.
check this link below for info:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#copyFromLocal
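One such option (a sketch; it assumes the paths from the question and that the hadoop client is on the PATH) is to let the shell expand the wildcard and push the matching files with hadoop fs -put:
# the shell expands the glob, so every matching file is passed to -put
hadoop fs -put /home/hive/Sample/*.txt /user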

Backup file folder in correct way

My situation: I only have permission to execute commands from a certain folder.
Let's say I would like to back up an entire folder while excluding some folders and files listed in exclude.txt.
Here is the path I would like to back up:
/pdf/data/pdfnew/2014
And I only have permission to execute from this folder (main):
/pdf/data/pdfnew/2014/public/main
I put exclude.txt in the same folder from which I can execute the command (main).
I execute this command in the main folder:
tar -cjvf -X exclude.txt 2014.tar.bz2 /pdf/data/pdfnew/2014
The result is that it still includes folders that I don't want to back up.
Is there a correct way of doing this?
Do you have a user/home directory on that server? You should, so just place exclude.txt in your user/home directory on that server & run it like this from that directory:
tar -cjv -X ~/exclude.txt -f ~/2014.tar.bz2 /pdf/data/pdfnew/2014
The ~/ is shorthand for your user/home directory, so in this case it is explicitly stating: “Read exclude.txt from the user/home directory & write 2014.tar.bz2 to the user/home directory.”
But you also ask this:
Is there a correct way of doing this?
There is never one canonical best way of doing something like this. It is all based on your final/end goal. Nothing more. Nothing less. That said, if I were you I would do it like this instead using the -C option:
tar -cjv -X ~/exclude.txt -f ~/2014.tar.bz2 -C /pdf/data/pdfnew/ 2014
The uppercase -C option tells tar to internally change its working directory to /pdf/data/pdfnew/, so you can then create an archive of 2014 without retaining the whole directory tree in the backup. I find this easier to work with because many times I want to back up the contents of a directory but have no use for retaining the parent structure. That way the archive is more like a traditional ZIP archive, which I find easier to understand & work with.
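As a concrete sketch, exclude.txt is just a list of patterns, one per line, that tar matches against the names it is about to archive (the patterns below are hypothetical examples):
2014/public/cache
*.log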

Problem with multiple listings of the same file in RPM spec

I have some problems with an RPM spec file that lists the same file multiple times. For this spec we do some normal compilation and then we have a script that copies everything to the buildroot. Within this buildroot we have a lot of generic scripts that need to be installed on the final system, so we just list this directory.
However, one of the scripts might be changed and configuration options might be changed within it, so we list this script again with different attributes, as %config. This means the script is defined multiple times with conflicting attributes, so rpmbuild complains and does not include the script at all in the installation package.
Is there a good way to handle this problem and tell rpmbuild to only use the second definition, or do we have to separate the script into two parts, one containing the configuration and one containing the actual logic?
Instead of specifying the directory, you can create a file list and then prune duplicate files from that.
So where you have something like
%files
%dir foo
%config foo/scriptname
You modify those parts to
find $RPM_BUILD_ROOT -type f | sed -e "s|^$RPM_BUILD_ROOT||" > filelist
sed -i "\|^foo/scriptname$|d" filelist
%files -f filelist
%config foo/scriptname
You can also use %{buildroot} in place of $RPM_BUILD_ROOT.
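Putting it together, the relevant parts of the spec might look roughly like this (a sketch; the %install steps and the /opt/foo paths are hypothetical stand-ins for your own layout):
%install
# ... your normal install / copy-to-buildroot steps ...
# build a list of every installed file, with the buildroot prefix stripped
find $RPM_BUILD_ROOT -type f | sed -e "s|^$RPM_BUILD_ROOT||" > filelist
# drop the script that is listed explicitly as %config below
sed -i "\|^/opt/foo/scriptname$|d" filelist

%files -f filelist
%dir /opt/foo
%config /opt/foo/scriptname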

How to view files stored under root?

I have been able to successfully create a root folder and store documents there. How can I view what's there using SSH? What commands do I have to use? Thanks!
ls is the command, / refers to the system root. So go with ls /
ls lists files if the remote is a Unix host. man ls explains how it works (in a rather terse way).
Go with ls.
From any folder: ls /path/to/folder
From within the folder: ls
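For a bit more detail, the standard flags help (the path is a placeholder):
# long listing that also shows hidden (dot) files
ls -la /path/to/folder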