Apache Spark to S3 upload performance Issue - amazon-s3

I'm seeing a major performance issue when Apache Spark uploads its results to S3. As I understand it, the process goes through these steps:
The output of the final stage is written to a _temp/ folder in HDFS, and the same data is then moved into the "_temporary" folder inside the target S3 folder.
Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and the files inside the "_temporary" folder in S3 are moved into the main folder. This move is what takes a long time (approximately 1 minute per file, average size 600 MB BZ2), and this part is not logged in the usual stderr log.
I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.
Has anyone encountered this issue?
Update 1
How can I increase the number of threads that perform this move process?
Any suggestion is highly appreciated...
Thanks

This was fixed by SPARK-3595 (https://issues.apache.org/jira/browse/SPARK-3595), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark).

I use the code below. It uploads files to S3, handling around 60 GB of gz files in 4-6 minutes.
ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class, TextOutputFormat.class);
Make sure that you create more output files; a larger number of smaller files will make the upload faster.
API details
saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit
Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.
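If you are on PySpark rather than the Java API shown above, a rough equivalent (paths and the partition count are hypothetical, not from the original post) is to repartition before saving so that many smaller part files are written:

# Rough PySpark sketch: raise the partition count so saveAsTextFile emits
# many smaller part files, which tends to upload to S3 faster than a few
# large ones. Paths and the partition count (200) are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="s3-output-example")

counts = sc.textFile("s3://my-input-bucket/input/")  # placeholder input RDD

counts.repartition(200).saveAsTextFile(
    "s3://my-output-bucket/output/",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)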

Related

Redshift Unload command with CSV extension

I'm using the following Unload command -
unload ('select * from '') to 's3://summary.csv'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file created in S3 is summary.csv000
If I remove the file extension from the command, like below:
unload ('select * from '') to 's3://summary'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file created in S3 is summary000
Is there a way to get summary.csv, so I don't have to change the file extension before importing it into Excel?
Thanks.
Actually, a lot of folks have asked a similar question; right now it's not possible to add an extension to the unloaded files (Parquet files can have one, though).
The reason is that Redshift exports in parallel by default, which is a good thing: each slice exports its own data. From the docs:
PARALLEL
By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files.
So it has to start a new file after roughly 6 GB, which is why a number is appended as a suffix.
How do we solve this?
There is no native option in Redshift, but we can work around it with Lambda:
Create a new S3 bucket and a folder inside it specifically for this process (e.g. s3://unloadbucket/redshift-files/).
Your unload files should go to this folder.
A Lambda function should be triggered by the S3 put-object event.
Then the Lambda function should (see the handler sketch below):
Download the file (if it is large, use EFS)
Rename it with .csv
Upload it to the same bucket (or a different bucket) under a different path (e.g. s3://unloadbucket/csvfiles/)
Or, even simpler, use a shell/PowerShell script to do the following:
Download the file
Rename it with .csv
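For the Lambda route, here is a minimal handler sketch, assuming boto3 and the hypothetical bucket/prefix names above (unloadbucket, redshift-files/, csvfiles/). It uses a server-side copy_object instead of an explicit download/rename/upload, which avoids pulling large files into the Lambda:

# Minimal sketch of the Lambda handler: for each unloaded object, copy it
# to a new key ending in .csv and delete the original. Bucket and prefixes
# are the hypothetical names from the example above.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:              # S3 put-object event records
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]      # e.g. redshift-files/summary000

        filename = key.split("/")[-1]
        new_key = "csvfiles/" + filename + ".csv"

        # S3 has no rename: copy to the new key, then remove the original.
        s3.copy_object(Bucket=bucket,
                       CopySource={"Bucket": bucket, "Key": key},
                       Key=new_key)
        s3.delete_object(Bucket=bucket, Key=key)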
As per the AWS documentation for the UNLOAD command, it's possible to save the data as CSV.
In your case, this is what your code would look like:
unload ('select * from '')
to 's3://summary/'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key='''
CSV          <<< note the CSV option
parallel off
allowoverwrite
HEADER;

How to use spark toLocalIterator to write a single file in local file system from cluster

I have a PySpark job which writes my resulting dataframe to the local filesystem. Currently it is running in local mode, so I am doing coalesce(1) to get a single file, as below:
file_format = 'avro' # will be dynamic and so it will be like avro, json, csv, etc
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I see a lot of memory issues (OOM), and it takes a long time as well. So I want to run this job with master as yarn and deploy mode as client. To write the resulting df into a single file on the local filesystem, I need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (json/avro/csv/parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass
You get the OOM error because you try to retrieve all the data into a single partition with coalesce(1).
I don't recommend using toLocalIterator, because you would have to rewrite a custom writer for every format and you would lose parallel writing.
Your first solution is a good one:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can merge all the data into one file on the local filesystem this way (it works for CSV; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>
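Putting the two steps together, a rough PySpark sketch could look like this (the paths and session setup are placeholders, not from the original job):

# Rough sketch: write in parallel with Spark, then merge the part files
# into a single local file with hadoop fs -getmerge. Paths are placeholders.
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-file-output").getOrCreate()

df = spark.read.json("hdfs:///pyspark_data/input")  # placeholder input

file_format = "csv"  # getmerge suits line-oriented formats like csv/json
output_dir = "/pyspark_data/output"

# Parallel write to HDFS: no coalesce(1), so no single-partition OOM.
df.write.format(file_format).mode("overwrite").save(output_dir)

# Concatenate the part files into one file on the local filesystem.
subprocess.run(["hadoop", "fs", "-getmerge", output_dir, "/tmp/output.csv"],
               check=True)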

Merge small files from S3 to create a 10 Mb file

I am new to MapReduce. I have an S3 bucket that gets 3000 files every minute. I am trying to use MapReduce to merge these files into files of between 10 and 100 MB. The Python code will use mrjob and will run on AWS EMR. mrjob's documentation says mapper_raw can be used to pass entire files to the mapper:
class MRCrawler(MRJob):
    def mapper_raw(self, wet_path, wet_uri):
        from warcio.archiveiterator import ArchiveIterator
        with open(wet_path, 'rb') as f:
            for record in ArchiveIterator(f):
                ...
Is there a way to limit it to reading only 5000 files in one run, and to delete those files after the reducer saves the results to S3, so that the same files are not picked up in the next run?
You can do as follows:
Configure SQS notifications on the S3 bucket.
Have a Lambda function triggered by cron that reads the events from SQS and copies the relevant files into a staging folder; you can configure this Lambda to read only 5000 messages at a given time (see the sketch below).
Do all your processing on top of the staging folder, and once you're done with your Spark job in EMR, clean the staging folder.
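A rough sketch of that cron-triggered Lambda follows; the queue URL, bucket and prefix names are placeholders. SQS returns at most 10 messages per call, so the function loops until it has copied 5000 files or the queue is empty:

# Hypothetical sketch of the cron-triggered Lambda: drain up to 5000 S3
# event messages from SQS and copy each object into a staging prefix.
# Queue URL, bucket and prefix names are placeholders.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-files"
STAGING_BUCKET = "my-staging-bucket"
STAGING_PREFIX = "staging/"
MAX_FILES = 5000

def handler(event, context):
    copied = 0
    while copied < MAX_FILES:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break                                   # queue drained
        for msg in messages:
            body = json.loads(msg["Body"])          # S3 event notification
            for record in body.get("Records", []):
                src_bucket = record["s3"]["bucket"]["name"]
                src_key = record["s3"]["object"]["key"]
                s3.copy_object(
                    Bucket=STAGING_BUCKET,
                    CopySource={"Bucket": src_bucket, "Key": src_key},
                    Key=STAGING_PREFIX + src_key.split("/")[-1],
                )
                copied += 1
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
    return {"copied": copied}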

How to get the first 100 lines of a file on S3?

I have a huge (~6 GB) file on Amazon S3 and want to get the first 100 lines of it without having to download the whole thing. Is this possible?
Here's what I'm doing now:
aws s3 cp s3://foo/bar - | head -n 100
But this takes a while to execute. I'm confused: shouldn't head close the pipe once it has read enough lines, causing aws s3 cp to fail with a BrokenPipeError before it has time to download the entire file?
Using the Range HTTP header in a GET request, you can retrieve a specific range of bytes from an object stored in Amazon S3 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html).
If you use the AWS CLI, you can use aws s3api get-object --range bytes=0-xxx; see http://docs.aws.amazon.com/cli/latest/reference/s3api/get-object.html
It is not exactly a number of lines, but it should let you retrieve the file in parts and avoid downloading the full object.
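If you prefer to do it from Python, here is a rough boto3 sketch; the bucket, key and the 1 MB range are placeholders, so enlarge the range if 100 lines need more bytes:

# Rough boto3 sketch: fetch only the first ~1 MB of the object with a
# byte-range GET, then keep the first 100 lines. Bucket/key are placeholders.
import boto3

s3 = boto3.client("s3")

resp = s3.get_object(Bucket="foo", Key="bar", Range="bytes=0-1048575")
chunk = resp["Body"].read().decode("utf-8", errors="replace")

print("\n".join(chunk.splitlines()[:100]))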

Internal error while loading to Bigquery table

I ran this command to load 11 files into a BigQuery table:
bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part* /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
I got this error:
Waiting on bqjob_r46f38146351d545_00000147ef890755_1 ... (11s) Current status: DONE
BigQuery error in load operation: Error processing job 'ardent-course-601:bqjob_r46f38146351d545_00000147ef890755_1': Too many errors encountered. Limit is: 0.
Failure details:
- File: 5: Unexpected. Please try again.
I tried many times after that and still got the same error.
To debug what went wrong, I instead loaded each file one by one into the BigQuery table. For example:
/usr/local/bin/bq load --project_id=ardent-course-601 --source_format=NEWLINE_DELIMITED_JSON dw_test.rome_defaults_20140819_test gs://sm-uk-hadoop/queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/part-m-00011.gz /opt/sm-analytics/projects/logsTobqMR/jsonschema/rome_defaultsSchema.txt
There are 11 files in total, and each one loaded fine.
Could someone please help? Is this a bug on BigQuery's side?
Thank you.
There was an error reading one of the files: gs://...part-m-00005.gz
Looking at the import logs, it appears that the gzip reader encountered an error decompressing the file.
It looks like that file may not actually be compressed. BigQuery samples the header of the first file in the list to determine whether it is dealing with compressed or uncompressed files and to determine the compression type. When you import all of the files at once, it only samples the first file.
When you load the files individually, BigQuery reads the header of each file and determines that it isn't actually compressed (despite having the '.gz' suffix), so it imports it as a normal flat file.
If you run a load that doesn't mix compressed and uncompressed files, it should work successfully.
Please let me know if you think this is not the case and I'll dig in some more.
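If you want to confirm which part file is the odd one out before re-running the load, one option is to read the first two bytes of each object and compare them with the gzip magic number. This is only a rough sketch assuming the google-cloud-storage Python client and default credentials; the bucket and prefix come from the original bq command:

# Rough sketch: read the first two bytes of each part file and compare
# them with the gzip magic number (0x1f 0x8b) to spot files that carry
# a .gz suffix but are not actually compressed.
from google.cloud import storage

client = storage.Client()  # assumes default credentials/project
prefix = "queries/logsToBq_transformLogs/rome_defaults/20140819/23af7218-617d-42e8-884e-f213a583094a/"

for blob in client.list_blobs("sm-uk-hadoop", prefix=prefix):
    if not blob.name.endswith(".gz"):
        continue
    header = blob.download_as_bytes(start=0, end=1)  # end is inclusive: 2 bytes
    status = "gzip" if header == b"\x1f\x8b" else "NOT gzip"
    print(blob.name, status)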