Inspect Parquet in S3 from Command Line

I can download a single snappy.parquet partition file with:
aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet
And then use:
parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet
But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to point the tools at the prefix "s3://bucket/my-data.parquet".
Also, what happens if the schema differs between row groups in different partition files?
Following the instructions here, I downloaded a jar file and ran:
hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/
But this resulted in the error: No FileSystem for scheme "s3".
This answer seems promising, but only for reading from HDFS. Any solution for S3?

I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.
You should be able to do:
pip install "clidb[extras]"
clidb s3://bucket/
and then click to load parquet files as views to inspect and run SQL against.
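If you have a Hadoop installation with the S3A connector available, another option is to keep using parquet-tools but point it at the s3a:// scheme instead of s3://, which sidesteps the "No FileSystem for scheme" error. This is only a sketch: it assumes the hadoop-aws and AWS SDK jars are installed under a stock Apache Hadoop layout, that AWS credentials are available via the usual provider chain, and it still targets a single part file (it just avoids downloading it).
# Make the S3A connector (hadoop-aws + AWS SDK) visible to the job; adjust the path for your distribution
export HADOOP_CLASSPATH="$HADOOP_HOME/share/hadoop/tools/lib/*"
# Use the s3a:// scheme, which hadoop-aws registers, rather than the bare s3:// scheme
hadoop jar parquet-tools-1.9.0.jar schema s3a://bucket/my-data.parquet/my-data-0000.snappy.parquet
hadoop jar parquet-tools-1.9.0.jar meta s3a://bucket/my-data.parquet/my-data-0000.snappy.parquet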

Related

How to run pg_restore against an S3 folder?

Here is an example of restoring Postgres from a single file on S3. It streams the file to stdout and redirects that stream into the pg_restore tool. But what if there are several gz files on S3? Is there a way to make pg_restore read them without downloading them into a temp folder?
About restoring in a loop:
First of all, it is unclear how Postgres handles this situation. I have 50-100 gz files of different sizes. Yes, they have names and can be sorted. But will Postgres perform a correct restore when it is fed only a single file at a time?
Also, a loop means downloading all the files into some folder first. The files can be big, so it would be better to restore them from S3 directly.
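One hedged sketch for the loop, assuming the gz objects under the prefix are ordered pieces of a single custom-format dump (bucket, prefix and database names below are placeholders): stream each part to stdout in name order, concatenate, and pipe the whole stream through gunzip into pg_restore, so nothing touches a temp folder. If each object is instead an independent dump, move the gunzip | pg_restore stage inside the loop.
# List the objects, sort by name, and stream them in order without saving locally
aws s3 ls s3://my-bucket/dumps/ | awk '{print $4}' | sort |
while read -r key; do
  aws s3 cp "s3://my-bucket/dumps/${key}" -    # "-" streams the object to stdout
done | gunzip | pg_restore -d mydb --no-owner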

Can make figure out file dependencies in AWS S3 buckets?

The source directory contains numerous large image and video files.
These files need to be uploaded to an AWS S3 bucket with the aws s3 cp command. For example, as part of this build process, I copy my image file my_image.jpg to the S3 bucket like this: aws s3 cp my_image.jpg s3://mybucket.mydomain.com/
I have no problem doing this copy to AWS manually. And I can script it too. But I want to use the makefile to upload my image file my_image.jpg iff the same-named file in my S3 bucket is older than the one in my source directory.
Generally make is very good at this kind of dependency checking based on file dates. However, is there a way I can tell make to get the file dates from files in S3 buckets and use that to determine if dependencies need to be rebuilt or not?
The AWS CLI has an s3 sync command that can take care of a fair amount of this for you. From the documentation:
A s3 object will require copying if:
the sizes of the two s3 objects differ,
the last modified time of the source is newer than the last modified time of the destination,
or the s3 object does not exist under the specified bucket and prefix destination.
I think you'll need to make S3 look like a file system to make this work. On Linux it is common to use FUSE to build adapters like that. Here are some projects that present S3 as a local filesystem. I haven't tried any of them, but it seems like the way to go.
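If you'd rather keep the make-style "only upload when newer" check without a FUSE mount, a shell sketch along these lines may be enough. It compares the local file's mtime with the object's LastModified via the s3api; the bucket and file names are placeholders and GNU date is assumed.
local_file=my_image.jpg
bucket=mybucket.mydomain.com
# LastModified of the object, or the epoch if the object does not exist yet
remote_ts=$(aws s3api head-object --bucket "$bucket" --key "$local_file" \
  --query LastModified --output text 2>/dev/null || echo "1970-01-01T00:00:00Z")
local_epoch=$(date -r "$local_file" +%s)   # mtime of the local file
remote_epoch=$(date -d "$remote_ts" +%s)   # LastModified as epoch seconds
# Upload only when the local copy is newer, mimicking make's timestamp check
if [ "$local_epoch" -gt "$remote_epoch" ]; then
  aws s3 cp "$local_file" "s3://$bucket/$local_file"
fi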

Add jsonserde.jar to EMR Hive permanently

We know that
add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
is only effective during the current session.
Is there a way to add jar to Hive permanently and globally so that the jar will be available during the lifecycle of the cluster?
UPDATE:
I figured out a way: download the jar using the AWS CLI, aws s3 cp s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar ., then copy the jar to /usr/lib/hive/lib on all nodes of the EMR cluster.
Is there a better way to do this?
Insert your ADD JAR commands in your .hiverc file and start Hive:
add jar yourjarName.jar
What is the .hiverc file?
It is a file that is executed when you launch the Hive shell, making it an ideal place to add any Hive configuration/customization you want applied at the start of the shell. This could be:
Setting column headers to be visible in query results
Making the current database name part of the hive prompt
Adding any jars or files
Registering UDFs
.hiverc file location
The file is loaded from the hive conf directory.
I have the CDH4.2 distribution and the location is:
/etc/hive/conf.cloudera.hive1
If the file does not exist, you can create it. It needs to be
deployed to every node from where you might launch the Hive shell.
Ref: http://hadooped.blogspot.in/2013/08/hive-hiverc-file.html
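To make this permanent across an EMR cluster, you could combine the two approaches above: put the jar on every node and register it in a cluster-wide .hiverc. A hedged sketch of a per-node script (run over ssh or as a step after Hive is installed; the paths are common EMR defaults and may differ on your AMI):
HIVE_LIB=/usr/lib/hive/lib
HIVE_CONF=/etc/hive/conf
# Put the serde jar where Hive can see it on this node
sudo aws s3 cp s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar "$HIVE_LIB/"
# Register it for every new Hive shell via the cluster-wide .hiverc
echo "add jar $HIVE_LIB/jsonserde.jar;" | sudo tee -a "$HIVE_CONF/.hiverc"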

Copying all files from a directory using a pig command

Hey, I need to copy all files from a local directory to HDFS using Pig.
In the Pig script I am using the copyFromLocal command with a wildcard in the source path,
i.e. copyFromLocal /home/hive/Sample/* /user
It says the source path doesn't exist.
When I use copyFromLocal /home/hive/Sample/ /user, it makes another directory in HDFS named 'Sample', which I don't need.
But when I include the file name, i.e. /home/hive/Sample/sample_1.txt, it works.
I don't need a single file. I need to copy all the files in the directory without making a directory in HDFS.
PS: I've also tried *.txt, ?, ?.txt
No wildcards work.
Pig's copyFromLocal/copyToLocal commands work only on a single file or a directory. They will never take a series of files or a wildcard. Moreover, Pig concentrates on processing data from/to HDFS. To my knowledge you can't even loop over the files in a local directory with ls, because it lists files in HDFS. So, for this scenario I would suggest you write a shell script/action (i.e. an fs command) to copy the files from the local machine to HDFS.
check this link below for info:
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#copyFromLocal
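As a concrete sketch of the shell approach (paths taken from the question): let the local shell expand the wildcard and pass the resulting file list to hadoop fs -put, which copies the files themselves into /user without creating a Sample subdirectory (add -f to overwrite files that already exist there).
# The local shell expands the wildcard, so -put receives the individual files
hadoop fs -put /home/hive/Sample/* /user/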

Is it possible to sync a single file to s3?

I'd like to sync a single file from my filesystem to s3.
Is this possible, or can only directories be synced?
Use include/exclude options for the sync-directory command:
e.g. To sync just /var/local/path/filename.xyz to S3 use:
aws s3 sync /var/local/path s3://bucket/path --exclude='*' --include='*/filename.xyz'
cp can be used to copy a single file to S3. If the filename already exists in the destination, this will replace it:
aws s3 cp local/path/to/file.js s3://bucket/path/to/file.js
Keep in mind that per the docs, sync will only make updates to the target if there have been file changes to the source file since the last run: s3 sync updates any files that have a size or modified time that are different from files with the same name at the destination. However, cp will always make updates to the target regardless of whether the source file has been modified.
Reference: AWS CLI Command Reference: cp
Just to comment on pythonjsgeo's answer. That seems to be the right solution, but make sure to execute the command without the = symbol after the include and exclude options. I was including the = symbol and getting weird behavior with the sync command.
aws s3 sync /var/local/path s3://bucket/path --exclude '*' --include '*/filename.xyz'
You can mount an S3 bucket as a local folder (using RioFS, for example) and then use your favorite tool to synchronize files or directories.
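For the mount-based route, here is a hedged sketch using s3fs-fuse (an alternative to RioFS); the bucket name, mount point and credential option are placeholders you would adjust for your setup.
mkdir -p ~/s3-bucket
# Mount the bucket; use -o iam_role=auto instead of a passwd file when running on EC2 with an instance profile
s3fs bucket ~/s3-bucket -o passwd_file=~/.passwd-s3fs
# Then copy or sync the single file with ordinary tools, e.g.:
cp /var/local/path/filename.xyz ~/s3-bucket/path/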