Copying all files from a directory using a pig command - apache-pig

Hey, I need to copy all files from a local directory to HDFS using Pig.
In the Pig script I am using the copyFromLocal command with a wildcard in the source path,
i.e. copyFromLocal /home/hive/Sample/* /user
It says the source path doesn't exist.
When I use copyFromLocal /home/hive/Sample/ /user , it makes another directory in HDFS named 'Sample', which I don't need.
But when I include the file name, i.e. /home/hive/Sample/sample_1.txt, it works.
I don't need a single file. I need to copy all the files in the directory without creating a directory in HDFS.
PS: I've also tried *.txt, ?, ?.txt
No wildcards work.

Pig's copyFromLocal/copyToLocal commands work only on a single file or a directory; they will not take a series of files or a wildcard. Moreover, Pig concentrates on processing data from/to HDFS. As far as I know, you can't even loop over the files in a local directory with ls, because it lists files in HDFS. So for this scenario I would suggest writing a shell script/action (i.e., an fs command) to copy the files from local to HDFS.
Check the link below for more info:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#copyFromLocal
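For example, a minimal shell sketch (assuming the Hadoop client is on the PATH and /user already exists in HDFS) - here the shell, not Pig, expands the wildcard, so each matching file lands directly under /user without an extra Sample directory:
hadoop fs -put /home/hive/Sample/*.txt /user
or, to copy every file in the directory one at a time:
for f in /home/hive/Sample/*; do hadoop fs -put "$f" /user; done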

Related

Inspect Parquet in S3 from Command Line

I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to work from just the prefix: "s3://bucket/my-data.parquet"
Also what if the schema is different in different row groups across different partition files?
Following instructions here I downloaded a jar file and ran
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in the error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?
I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
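On the "No FileSystem for scheme" error above: a rough sketch of the usual workaround, assuming the hadoop-aws connector and its bundled AWS SDK are on the Hadoop classpath and credentials are configured (the jar paths below are placeholders, not verified against this setup):
export HADOOP_CLASSPATH=/path/to/hadoop-aws.jar:/path/to/aws-java-sdk-bundle.jar
hadoop jar parquet-tools-1.9.0.jar schema s3a://bucket/my-data.parquet/my-data-0000.snappy.parquet
Note this still targets a single partition file rather than the whole prefix.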

Is there a way to have a file path in T-SQL and PostgreSQL that stays within the project directory?

I am importing data from a CSV file using COPY FROM in PostgreSQL. It works flawlessly on my machine; however, were I to clone the repository onto a new machine, the code would cease to function due to the file path being hardcoded starting at the Users directory of the computer.
In other languages, I would be able to use something like ./ or ~/ to start somewhere other than the absolute root of the file system, but I haven't found T-SQL or Postgres to have that functionality available.
What I have
COPY persons(name,address,email,phone)
FROM '/Users/admin/Development/practice/data/persons.csv';
How can I make that file path function on any machine the project gets cloned to?
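A sketch of one common approach, assuming psql is the client and is run from the project root: psql's client-side \copy meta-command takes the same options as COPY but reads the file on the client, so a relative path resolves against the current working directory (the relative path below is an assumption about the project layout):
\copy persons(name,address,email,phone) FROM 'data/persons.csv' WITH (FORMAT csv)
Server-side COPY, by contrast, always resolves paths on the database server, which is why an absolute path is otherwise required.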

S3DistCp copies some files from the manifest and doesn't copy the rest

We are using S3DistCp to copy files from S3 to HDFS using a manifest file - i.e., we use the --copyFromManifest argument in the S3DistCp command. At the S3DistCp step, however, only some of the files listed in the manifest are copied. I am not sure where we should start looking for problems - i.e., why are some files being copied and others are not?
Thanks
Maybe the problem is that you have files with the same name but in different directories. In that case you will need to change the way you construct the baseName and srcDir fields. Please describe how you build your manifest file.
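For reference, a sketch of what manifest entries might look like - the exact field set varies by S3DistCp version, and the path and size fields here are assumptions; the point, per the note above, is that each entry needs a distinct baseName or same-named files will collide at the destination:
{"path":"s3://bucket/dir1/file.txt","baseName":"dir1/file.txt","srcDir":"s3://bucket","size":1024}
{"path":"s3://bucket/dir2/file.txt","baseName":"dir2/file.txt","srcDir":"s3://bucket","size":2048}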

Flowgear access to files on the local file system

I am creating a Flowgear workflow that needs to process a raft of XML data.
I have the XML data contained in a set of .xml files (approximately 400 files) in a folder on my local machine's hard drive, and I want to read them into a workflow, run an XSLT transform, and then write out the resultant XML to another folder on the same local hard drive.
How do I get the flowgear workflow to read these files?
It depends on the use case: the File Enumerator works exceptionally well to loop (as in a for-each) through each file. Sometimes, though, you want to get a list of files in a particular folder and check whether a file has been found or not. For that, I would recommend a C# script that gets the list of files with code like:
// requires using System.IO; {FilePath} and {extension} are assumed to be Flowgear variable placeholders
Directory.GetFiles(@"{FilePath}", "*.{extension}", SearchOption.TopDirectoryOnly);
Further on, use the File node to read, write, or delete files from a file directory.
NB! You will need to install a DropPoint on the PC/server to allow access to the files. For more information, see the Flowgear documentation on DropPoints.
You can use a File Enumerator or File Watcher to read the files in. The difference is that a File Enumerator enumerates all files in a folder once, while a File Watcher watches a folder indefinitely and provides new files to the workflow as they are copied into it.
You can then use the File node to write the files back to the file system.

Empty files on S3 prevent from downloading using s3cmd and s3sync

I am trying to set up backup/restore using S3. The upload sync worked well using s3sync. However, next to each folder there is an empty file with a matching name. I read somewhere that this is created to define the folder structure, but I am not sure about that, as it doesn't happen if I create a folder using a different method (s3fox, etc.).
These empty files prevent me from restoring the directories/files. When I do s3cmd sync, I get the error "can not make directory: File exists", because it first creates that empty file and then fails when trying to create the directory. Any ideas how I can solve this problem?
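A rough sketch of one way around it, assuming the zero-byte marker objects are the only empty keys in the backup (the bucket and prefix names are placeholders - verify the list before deleting anything):
s3cmd ls -r s3://mybucket/backup/ | awk '$3 == 0 {print $4}' > markers.txt
while read -r key; do s3cmd del "$key"; done < markers.txt
s3cmd sync s3://mybucket/backup/ ./restore/
With the markers removed, the sync no longer tries to create a file and a directory with the same name.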