I am trying out an EMR Spark step. I have an input S3 directory which has multiple files: f1, f2, f3.
I am adding the Spark step like this:
aws emr --region us-west-2 add-steps --cluster-id foo --steps '[{"Args":["spark-submit","--deploy-mode","cluster","--class","JsonToDataToParquetJob","s3://foo/My.assembly.jar","s3://inputDir/","output/"],"Type":"CUSTOM_JAR","ActionOnFailure":"CONTINUE","Jar":"command-runner.jar","Properties":"","Name":"Spark application"}]'
The jar has the following code:
delimitedData.write.mode(SaveMode.Append).parquet(output)
The problem I am facing is that I get multiple output files, but what I am looking for is a single output file in the directory. How can I achieve that?
By default, an output file is generated per partition.
You should be able to achieve what you want by doing a repartition(1).
Like this:
delimitedData.repartition(1).write.mode(SaveMode.Append).parquet(output)
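For context, a minimal sketch of how that write might sit inside the job that the add-steps command above submits, assuming Spark 2.x's SparkSession (only the class name, the two arguments, and the write line come from the question; the JSON read is an assumption):

import org.apache.spark.sql.{SaveMode, SparkSession}

object JsonToDataToParquetJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JsonToDataToParquetJob").getOrCreate()

    val input  = args(0)  // s3://inputDir/ in the add-steps call above
    val output = args(1)  // output/ in the add-steps call above

    // The read side is assumed; only the write below is from the question.
    val delimitedData = spark.read.json(input)

    // A single partition means Spark writes a single part-*.parquet file
    // under the output directory.
    delimitedData
      .repartition(1)
      .write
      .mode(SaveMode.Append)
      .parquet(output)

    spark.stop()
  }
}

Note that even with one partition, Spark still writes a directory containing one part-* file (plus a _SUCCESS marker) rather than a single bare file.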
Related
I am trying to generate a file from a DataFrame that I have created in AWS Glue, and I am trying to give it a specific name. I see most answers on Stack Overflow use filesystem modules, but here this particular CSV file is generated in S3. I also want to give the file a name while generating it, not rename it after it is generated. Is there any way to do that?
I have tried using df.save(s3://PATH/filename.csv), which actually generates a new directory in S3 named filename.csv and then generates part-*.csv inside that directory.
df.repartition(1).write.mode('append').format('csv').option('header', 'true').save('s3://PATH')
I have a view in Hive named prod_schoool_kolkata. I used to get the CSV as:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata' | sed 's/[\t]/,/g' > /home/data/prod_schoool_kolkata.csv
That was on an EC2 instance. I want the path to be in S3.
I tried giving the path like:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata' | sed 's/[\t]/,/g' > s3://data/prod_schoool_kolkata.csv
But the CSV is not getting stored.
I also have the problem that the CSV file is getting generated, but every column header has a pattern like tablename.columnname, for example prod_schoool_kolkata.id. Is there any way to remove the table names from the CSV being formed?
You have to first install the AWS Command Line Interface.
Refer to the link Installing the AWS Command Line Interface and follow the relevant installation instructions, or go to the sections at the bottom to get the installation links relevant to your operating system (Linux/Mac/Windows etc.).
After verifying that it's installed properly, you can run normal commands like cp, ls, etc. over the S3 file system. So you could do:
hive -e 'set hive.cli.print.header=true; select * from prod_schoool_kolkata'|
sed 's/[\t]/,/g' > /home/data/prod_schoool_kolkata.csv
aws s3 cp /home/data/prod_schoool_kolkata.csv s3://data/prod_schoool_kolkata.csv
Also see How to use the S3 command-line tool
I am having an issue with the s3-dist-cp command on Amazon EMR. What I want to achieve is to be able to define the name of the output file when merging all the small files in my S3 folder. Example:
s3://bucket/event/v1/2017/11/06/10/event-1234.json
s3://bucket/event/v1/2017/11/06/10/event-4567.json
s3://bucket/event/v1/2017/11/06/10/event-7890.json
... and so on
and the result is the following:
s3://test/test/event
I am able to merge all the files above, but the resulting file name is wrong.
The command is:
s3-dist-cp --src s3://bucket/event/v1/2017/11/06/10/ --dest s3://test/test/ --groupBy='.*(event).*' --targetSize=2048
and the result I want to achieve is:
s3://test/test/events.hourly.json
How can I change the destination file name?
I want to use the S3 CSV Input step to load multiple files from an S3 bucket, then transform them and load them back into S3. But I can see this step supports only one file at a time, and I need to supply the file names. Is there any way to load all files at once by supplying only the bucket name, i.e. <s3-bucket-name>/*?
S3-CSV-Input is inspired by CSV-Input and doesn't support multi-file processing like Text-File-Input does, for example. You'll have to retrieve the filenames first so you can loop over the filename list as you would with CSV-Input.
Two options:
AWS CLI method
Write a simple shell script that calls the AWS CLI. Put it on your path. Call it s3.sh:
aws s3 ls s3://bucket.name/path | cut -c32-
In PDI:
Generate Rows: Limit: 1; Fields: Name: process, Type: String, Value: s3.sh
Execute a Process: Process field: process; Output Line Delimiter: |
Split Field to Rows: Field to split: Result output; Delimiter: |; New field name: filename
S3 CSV Input: The filename field: filename
S3 Local Sync
Mount the S3 directory to a local directory using s3fs:
$ s3fs my-bucket.example.com/path/ ~/my-s3-files -o use_path_request_style -o url=https://s3.us-west-2.amazonaws.com
If you have many large files in that bucket directory, it won't be very fast... though it might be okay if your PDI runs on an Amazon machine.
Then use the standard file-reading tools.
I register a UDF in Hive through Beeline using the following:
CREATE FUNCTION udfTest AS 'my.udf.SimpleUDF' USING JAR 'hdfs://hostname/pathToMyJar.jar'
Then I can use it in beeline as follows:
SELECT udfTest(name) from myTable;
Which returns the expected result.
I then launch a spark-shell and run the following:
sqlContext.sql("SELECT udfTest(name) from myTable")
Which fails. The stack trace is several hundred lines long (which I can't paste here), but the key parts are:
org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to java.net.URLClassLoader
Unable to load resources for default.udftest:java.lang.IllegalArgumentException: Unable to register [/tmp/blarg/pathToMyJar.jar]
I can provide more detail if anything stands out.
Is it possible to use UDFs registered through Hive in Spark?
Spark version: 1.3.0
When using a custom UDF, make sure that the jar file for your UDF is included with your application, or use the --jars command-line option to specify the UDF jar as a parameter when launching spark-shell, as shown below:
./bin/spark-shell --jars <path-to-your-hive-udf>.jar
For more details, refer to Calling Hive User-Defined Functions from Spark.
We had the same issue recently. What we noticed was that if the jar path is available locally then everything goes through fine, but if the jar path is on HDFS it doesn't work. So what we ended up doing was copying the jar locally using FileSystem.copyToLocalFile and then adding the copied file. This worked for us in both cluster and client mode.
P.S. This is Spark 2.0 I'm talking about.
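As a minimal sketch of how that workaround might look from a Spark 2.x spark-shell with Hive support (the HDFS path, local path, class name, and function name are just the placeholders from the question, and the exact ADD JAR / CREATE FUNCTION registration may differ in your setup):

import org.apache.hadoop.fs.{FileSystem, Path}

// Copy the UDF jar from HDFS down to the local filesystem first,
// since registering it straight from an hdfs:// path is what failed.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.copyToLocalFile(new Path("hdfs://hostname/pathToMyJar.jar"),
                   new Path("/tmp/pathToMyJar.jar"))

// Add the local copy to the session, then register and use the function.
spark.sql("ADD JAR file:///tmp/pathToMyJar.jar")
spark.sql("CREATE TEMPORARY FUNCTION udfTest AS 'my.udf.SimpleUDF'")
spark.sql("SELECT udfTest(name) FROM myTable").show()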