S3DistCp filename after merging files - amazon-s3

I am having an issue with the command s3-dist-cp on Amazon EMR. What I want to achieve is being able to define the name of the output file when merging all the small files in my S3 folder. Example:
s3://bucket/event/v1/2017/11/06/10/event-1234.json
s3://bucket/event/v1/2017/11/06/10/event-4567.json
s3://bucket/event/v1/2017/11/06/10/event-7890.json
.... so on
and the result is the following:
s3://test/test/event
I am able to merge all the files above but the result filename is wrong.
The command is:
s3-dist-cp --src s3://bucket/event/v1/2017/11/06/10/ --dest s3://test/test/ --groupBy='.*(event).*' --targetSize=2048
and the result I want to achieve is:
s3://test/test/events.hourly.json
How can I change the destination file name?
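As far as I understand s3-dist-cp, the merged file's name is not something you set directly: files whose names match the --groupBy regex are merged together, and the concatenation of the regex's capture groups becomes the output file name. So with `.*(event).*` the result is named event, and there is no flag to assign an arbitrary name like events.hourly.json that doesn't appear in the source names. The usual workaround is to rename afterwards with aws s3 mv. A minimal local sketch of the capture-group behaviour (sed stands in for s3-dist-cp's grouping regex; no AWS access needed):

```shell
# The capture group, not the full source name, decides the merged file's name.
echo "event-1234.json" | sed -E 's/.*(event).*/\1/'   # prints: event

# Hypothetical follow-up rename (needs AWS credentials, shown for illustration only):
# aws s3 mv s3://test/test/event s3://test/test/events.hourly.json
```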

Related

How do I save csv file to AWS S3 with specified name from AWS Glue DF?

I am trying to generate a file from a DataFrame that I have created in AWS Glue, and I want to give it a specific name. Most answers on Stack Overflow use filesystem modules, but this particular CSV file is generated in S3. I also want to name the file while generating it, not rename it after it is generated. Is there any way to do that?
I have tried using df.save('s3://PATH/filename.csv'), which actually generates a new directory in S3 named filename.csv and then writes part-*.csv files inside that directory. I also tried:
df.repartition(1).write.mode('append').format('csv').option("header", "true").save('s3://PATH')
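Spark itself always writes part-* files into a directory, so there is no built-in way to pick the object name at write time; the usual workaround (not a Glue feature) is to write to a temporary prefix and then copy the single part file to the desired key. The sketch below simulates that locally with mv; against S3 you would swap in aws s3 cp / aws s3 rm (or boto3's copy_object / delete_object) with the same logic. All paths and the part-file name are illustrative.

```shell
# Simulate a Spark output directory containing a single part file.
rm -rf /tmp/glue_tmp_out /tmp/filename.csv
mkdir -p /tmp/glue_tmp_out
touch /tmp/glue_tmp_out/part-00000-1a2b3c.csv

# Find the part file and move it to the final name.
part=$(ls /tmp/glue_tmp_out/part-*.csv | head -n 1)
mv "$part" /tmp/filename.csv

ls /tmp/filename.csv
```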

EMR spark step and merge output into one file

I am trying an EMR Spark step. I have an input S3 directory, which contains multiple files: f1, f2, f3.
I am adding spark step like this:
aws emr --region us-west-2 add-steps --cluster-id foo --steps '[{"Args":["spark-submit","--deploy-mode","cluster","--class","JsonToDataToParquetJob","s3://foo/My.assembly.jar","s3://inputDir/","output/"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Spark application"}]'
Which has following code:
delimitedData.write.mode(SaveMode.Append).parquet(output)
The problem I am facing is that I get multiple output files, but what I am looking for is a single output file in the directory. How can I achieve that?
By default, one output file is generated per partition. You should be able to achieve what you want with repartition(1), like this:
delimitedData.repartition(1).write.mode(SaveMode.Append).parquet(output)
(If you only need to reduce the partition count, coalesce(1) does the same without a full shuffle.)

How to use pentaho kettle to load multiple files from s3 bucket

I want to use the step S3 CSV Input to load multiple files from an S3 bucket, then transform them and load them back into S3. But I can see this step supports only one file at a time, and I need to supply the file names. Is there any way to load all files at once by supplying only the bucket name, i.e. <s3-bucket-name>/*?
S3-CSV-Input is inspired by CSV-Input and doesn't support multi-file-processing like Text-File-Input does, for example. You'll have to retrieve the filenames first, so you can loop over the filename list as you would do with CSV-Input.
Two options:
AWS CLI method
Write a simple shell script that calls the AWS CLI, put it in your PATH, and call it s3.sh:
aws s3 ls s3://bucket.name/path | cut -c32-
In PDI:
Generate Rows: Limit 1, Fields: Name: process, Type: String, Value s3.sh
Execute a Process: Process field: process, Output Line Delimiter |
Split Field to Rows: Field to split: Result output. Delimiter | New field name: filename
S3 CSV Input: The filename field: filename
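The cut -c32- in s3.sh relies on the fixed-width layout of aws s3 ls output: the date, time, and a right-aligned size column occupy the first 31 characters, so the object key starts at column 32. A local sketch with a fabricated listing line (no AWS access needed):

```shell
# Fabricated `aws s3 ls` output line; the object key starts at column 32.
line="2017-11-06 10:15:04       1024 event-1234.json"
echo "$line" | cut -c32-   # prints: event-1234.json
```

Note this is positional and would break if the listing format changes; it is just the mechanism the s3.sh one-liner depends on.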
S3 Local Sync
Mount the S3 directory to a local directory using s3fs:
$ s3fs my-bucket.example.com/path/ ~/my-s3-files -o use_path_request_style -o url=https://s3.us-west-2.amazonaws.com
If you have many large files in that bucket directory it won't be fast, though it may be acceptable if your PDI instance runs on an Amazon machine.
Then use the standard file reading tools.

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging is there a way to remove the input files? I'd like to check that the file has been written and is not empty (i.e. 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so that ideally I'd remove only the ones in the Files variable.
I don't think this is possible with Pig alone. Instead, you can pass -tagsource to PigStorage in the LOAD statement to get each record's source filename and store the distinct filenames somewhere. Then use the HDFS FileSystem API to read that stored list and remove the files that were merged by Pig:
A = LOAD '/path/' USING PigStorage(',', '-tagsource');
(The first argument to PigStorage is your field delimiter; ',' here is just an example.)
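One way to act on that stored filename list, sketched locally: with HDFS you would swap in hadoop fs -cat for reading the list and hadoop fs -rm for the deletes. The paths and filenames below are illustrative.

```shell
# Stand-ins for the HDFS paths: a directory of merged input files and the
# -tagsource filename list that Pig stored (one line per record, so duplicates).
rm -rf /tmp/pig_input /tmp/merged_names.txt
mkdir -p /tmp/pig_input
touch /tmp/pig_input/f1 /tmp/pig_input/f2
printf 'f1\nf1\nf2\n' > /tmp/merged_names.txt

# Deduplicate the list and delete each input file that was merged.
sort -u /tmp/merged_names.txt | while read -r f; do
  rm "/tmp/pig_input/$f"
done
ls /tmp/pig_input | wc -l   # count of remaining files (expect 0)
```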
You should be able to use hadoop commands in your Pig script
Move input files to a new folder
Merge input files to output folder
Remove input files from the new folder
fs -mv hdfs/input/path hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' using PigStorage();
STORE Files into 'hdfs/output/path' using PigStorage();
fs -rm -r hdfs/input/new_path
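The three steps above can be sketched locally like this, with mv, cat, and rm standing in for the hadoop fs equivalents and cat playing the role of the Pig LOAD/STORE merge; all paths are illustrative:

```shell
rm -rf /tmp/hdfs
mkdir -p /tmp/hdfs/input/path /tmp/hdfs/output
printf 'a\n' > /tmp/hdfs/input/path/f1
printf 'b\n' > /tmp/hdfs/input/path/f2

# 1. Move the input files to a staging folder so late arrivals are untouched.
mv /tmp/hdfs/input/path /tmp/hdfs/input/new_path
# 2. Merge the staged files into the output folder.
cat /tmp/hdfs/input/new_path/* > /tmp/hdfs/output/merged
# 3. Remove the staged input files.
rm -r /tmp/hdfs/input/new_path
```

Because only the staged copies are merged and deleted, files that land in the original input path mid-run are safe.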

Need help on Pig

I am executing a Pig script, which reads files from a directory, performs some operation and stores to some output directory. In output directory I'm getting one or more "part" files, one _SUCCESS file and one _logs directory. My questions are:
Is there any way to control the names of the files generated (by the STORE command) in the output directory? To be specific, I don't want the names to be "part-*"; I want Pig to generate files according to a file name pattern I specify.
Is there any way to suppress the _SUCCESS file and the _log directory? Basically I don't want the _SUCCESS and _logs to be generated in the output directory.
Regards
Biswajit
See this post.
To remove _SUCCESS, use SET mapreduce.fileoutputcommitter.marksuccessfuljobs false;. I'm not 100% sure how to remove _logs, but you could try SET pig.streaming.log.persist false;.