Spark to process many tar.gz files from s3

I have many files in the format log-*.tar.gz in S3. I would like to process them (extract a field from each line) and store the result in a new file.
There are many ways to do this. One simple and convenient method is to access the files using the textFile method.
# Read files from S3
rdd = sc.textFile("s3://bucket/project_name/date_folder/logfile1.*.gz")
I am concerned about the memory limit of the cluster; my worry is that this way the master node will be overloaded. Is there any rough estimate of the data size that can be processed for a given type of cluster?
I am wondering if there is a way to parallelize the process of getting the *.gz files from s3 as they are already grouped by date.

With the exception of parallelize / makeRDD, all methods that create RDDs / DataFrames require the data to be accessible from all workers, and they are executed in parallel without loading the data onto the driver.
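For example, textFile with a glob pattern turns each matching (non-splittable) .gz file into its own partition, which an executor reads and decompresses; nothing is collected on the driver. A minimal PySpark sketch, with a hypothetical bucket layout and assuming tab-delimited log lines:

# Minimal PySpark sketch; bucket/prefix names are hypothetical.
# Each .gz file is not splittable, so it becomes one partition that an
# executor reads and decompresses; the driver never loads the file contents.
from pyspark import SparkContext

sc = SparkContext(appName="s3-gz-example")

# Glob over a whole date folder on S3.
rdd = sc.textFile("s3://bucket/project_name/date_folder/*.gz")

# Extract a field from each line (assuming tab-delimited logs) and write back to S3.
fields = rdd.map(lambda line: line.split("\t")[0])
fields.saveAsTextFile("s3://bucket/project_name/output/date_folder/")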

Related

Which is better: multiple requests for multiple files, or a single request for a single file in S3?

I have a 10 GB CSV file. I can put the file in S3 in 2 ways:
1) Upload the entire file as a single CSV object.
2) Divide the file into multiple chunks (say 200 MB each) and upload them.
Now I need to get all the data in the object into a pandas DataFrame running on an EC2 instance.
1) One way is to make a single request to get the file, if it is one big object, and put the data into the DataFrame.
2) The other way is to make multiple requests, one per object, and keep appending the data to the DataFrame.
Which is the better way of doing it?
With multiple files, you have the possibility of downloading them simultaneously in parallel threads. But this has 2 drawbacks:
These operations are IO-heavy (mostly network), so depending on your instance type you might have worse performance overall.
Multithreaded apps involve some overhead in handling errors, aggregating results, and so on.
Depending on what you do, you might also want to look at AWS Athena, which can query data in S3 for you and produce results in seconds, so you don't have to download it at all.
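If you do go with multiple chunk objects, the parallel-download option might look roughly like this (a sketch only; boto3 and pandas assumed, bucket/prefix names hypothetical, and each chunk assumed to carry its own header row):

# Rough sketch: download chunked CSV objects in parallel and build one DataFrame.
# Bucket and prefix names are hypothetical; each chunk is assumed to have a header.
import io
from concurrent.futures import ThreadPoolExecutor

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "my-bucket"
PREFIX = "csv-chunks/"

def read_chunk(key):
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body))

keys = [obj["Key"]
        for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])]

# The work is IO-bound, so threads are enough; any download error surfaces here.
with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(read_chunk, keys))

df = pd.concat(frames, ignore_index=True)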

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible interval (15 minutes or 128 MB), and therefore I have 96 data files daily. But I want to aggregate all the data into a single daily data file for the fastest performance when reading the data later in Spark (EMR).
I created a solution where a Lambda function gets invoked when Firehose streams a new file into S3. The function reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file from the destination bucket (if it already exists with previous daily data, otherwise it creates a new one), decodes both response bodies to strings, concatenates them, and writes the result to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
The problem is that when the aggregated file reaches 150+ MB, the Lambda function hits its ~1500 MB memory limit while reading the two files and then fails.
Currently I have a minimal amount of data, a few hundred MB per day, but this amount will grow exponentially in the future. It seems strange to me that Lambda has such low limits and that they are already reached with such small files.
What are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or by some kind of scheduled job, for example scheduled daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (and this will happen in a distributed system), you have to shuffle around a huge file, maybe slice it if injection is not atomic, and upload it again, all of which could take very long for a lot of data. At this point you may find that the time to recover is so long that you have to postpone the next injection…
Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
The only downside is a tiny latency increase from establishing TCP connections and authenticating.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP (a minimal sketch of this follows the list).
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
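For the event-listener item above, a minimal sketch of an S3-triggered Lambda handler; the submit_to_injector call is a hypothetical placeholder for however you feed the injector/transformer stage:

import urllib.parse

def submit_to_injector(bucket, key):
    # Hypothetical placeholder: hand the new object off to the injector stage
    # (e.g. push an SQS message or trigger a Spark job).
    print(f"new Firehose object: s3://{bucket}/{key}")

def handler(event, context):
    # Standard S3 "object created" event shape: one or more Records, each
    # carrying the bucket name and the (URL-encoded) object key.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        submit_to_injector(bucket, key)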
Of course, if it is not possible to keep Spark data ready (but not yet queryable) for a reasonable amount of money, this may not be an option. It may also be that injecting small chunks of data is extremely time-consuming, but that seems unlikely for a production-ready system.
If you really need to chunk the data into daily dumps, you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
You may create a Lambda function that is invoked only once a day using Scheduled Events. In your Lambda function you can use Upload Part - Copy, which does not need to download your files to the Lambda function. There is already an example of this in this thread.
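A rough sketch of that approach with boto3 (bucket, prefix and key names are hypothetical; note that every part except the last must be at least 5 MB, so the Firehose objects either need to be at least that large or be pre-grouped to reach that size):

import boto3
from datetime import datetime, timedelta

s3 = boto3.client("s3")
SRC_BUCKET = "firehose-output"   # hypothetical bucket names
DST_BUCKET = "daily-aggregates"

def concatenate(prefix, dest_key):
    # Server-side concatenation via multipart Upload Part - Copy; nothing is
    # downloaded into the Lambda function. (list_objects_v2 returns at most
    # 1000 keys per call, which is plenty for 96 files a day.)
    keys = [o["Key"] for o in
            s3.list_objects_v2(Bucket=SRC_BUCKET, Prefix=prefix).get("Contents", [])]
    upload = s3.create_multipart_upload(Bucket=DST_BUCKET, Key=dest_key)
    parts = []
    for i, key in enumerate(keys, start=1):
        resp = s3.upload_part_copy(
            Bucket=DST_BUCKET, Key=dest_key,
            UploadId=upload["UploadId"], PartNumber=i,
            CopySource={"Bucket": SRC_BUCKET, "Key": key})
        parts.append({"PartNumber": i, "ETag": resp["CopyPartResult"]["ETag"]})
    s3.complete_multipart_upload(
        Bucket=DST_BUCKET, Key=dest_key,
        UploadId=upload["UploadId"], MultipartUpload={"Parts": parts})

def handler(event, context):
    # Invoked once a day by a Scheduled Events rule; aggregates yesterday's files,
    # assuming Firehose writes under a YYYY/MM/DD/... prefix.
    day = (datetime.utcnow() - timedelta(days=1)).strftime("%Y/%m/%d")
    concatenate(prefix=day + "/", dest_key="daily/" + day.replace("/", "-") + ".log")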

How map-reduce works on HDFS vs S3?

I have been trying to understand how differently a map-reduce job is executed on HDFS vs S3. Can someone please address my questions:
Typically HDFS clusters are not only storage-oriented, but also contain the horsepower to execute MR jobs; that is why the jobs are mapped onto several data nodes and reduced on a few. To be exact, the mapping (filtering etc.) is done on the data locally, whereas the reducing (aggregation) is done on a common node.
Does this approach work as-is on S3? As far as I understand, S3 is just a data store. Does Hadoop have to copy the whole data set from S3 and then run the map (filter) and reduce (aggregation) locally, or does it follow exactly the same approach as with HDFS? If the former is true, running jobs on S3 could be slower than running jobs on HDFS (due to the copying overhead).
Please share your thoughts.
Performance of S3 is slower than HDFS, but it provides other features like bucket versioning, elasticity, and other data recovery schemes (Netflix uses Hadoop clusters backed by S3).
Theoretically, before the split computation, the sizes of the input files need to be determined, so Hadoop itself has a filesystem implementation on top of S3 which allows higher layers to be agnostic of the source of the data. Map-Reduce calls the generic file-listing API against each input directory to get the sizes of all files in the directory.
Amazon's EMR has a special version of the S3 filesystem that can stream data directly to S3 instead of buffering to intermediate local files; this can make it faster on EMR.
If you have a Hadoop cluster in EC2 and you run a MapReduce job over S3 data, then yes, the data will be streamed into the cluster in order to run the job. As you say, S3 is just a data store, so you cannot bring the computation to the data. These non-local reads could cause a bottleneck when processing large jobs, depending on the size of the data and the size of the cluster.
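To illustrate the filesystem abstraction: from the job's point of view only the input URI scheme changes, while under the hood the HDFS reads are (mostly) data-local and the S3 reads are streamed over the network. A PySpark sketch of the same job against both stores (paths are hypothetical):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="hdfs-vs-s3")

def word_count(path):
    # Identical logic for both inputs; Hadoop's filesystem layer picks the
    # right implementation from the URI scheme (hdfs:// vs s3a://, or s3n://
    # on older stacks).
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())
              .map(lambda w: (w, 1))
              .reduceByKey(add))

local_reads = word_count("hdfs:///logs/day1/")           # mostly data-local reads
remote_reads = word_count("s3a://my-bucket/logs/day1/")  # streamed in from S3

remote_reads.saveAsTextFile("s3a://my-bucket/output/day1/")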

Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like this
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
// Read the JSON with a small sampling ratio for schema inference
val data = sqlContext.jsonFile("s3n://...", 10e-6)
data.saveAsParquetFile("s3n://...")
I run into problems if I have more than about 2000 partitions or if there is a partition larger than 5 GB.
This puts an upper bound on the maximum size of SchemaRDD I can process this way.
The practical limit is closer to 1 TB, since partition sizes vary widely and it only takes one 5 GB partition for the process to fail.
Questions dealing with the specific problems I have encountered are
Multipart uploads to Amazon S3 from Apache Spark
Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
Spark SQL unable to complete writing Parquet data with a large number of shards
This question is to see if there are any solutions to the main goal that do not necessarily involve solving one of the above problems directly.
To distill things down, there are 2 problems:
Writing a single shard larger than 5 GB to S3 fails. AFAIK this is a built-in limit of s3n:// buckets. It should be possible for s3:// buckets, but that does not seem to work from Spark, and hadoop distcp from local HDFS cannot do it either.
Writing the summary file tends to fail once there are 1000s of shards. There seem to be multiple issues with this. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error even on an r3.8xlarge (244 GB RAM) once there are about 5000 shards. This seems to be independent of the actual data volume. The summary file seems essential for efficient querying.
Taken together, these problems limit Parquet tables on S3 to 25 TB. In practice it is actually significantly less, since shard sizes can vary widely within an RDD and the 5 GB limit applies to the largest shard.
How can I write a >>25T RDD as Parquet to S3?
I am using Spark-1.1.0.
From AWS S3 documentation:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
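For a single object, the multipart route the quote mentions is straightforward from client code; a minimal boto3 sketch (file, bucket and key names are hypothetical) that switches to multipart uploads above 100 MB. This does not make Spark's s3n:// writer use multipart, it only shows the capability the documentation refers to:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Upload in 100 MB parts so a single object can exceed the 5 GB single-PUT limit.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=100 * 1024 * 1024)

s3.upload_file("/data/big-shard.parquet", "my-bucket",
               "parquet/big-shard.parquet", Config=config)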
One way to work around this limit:
Attach an EBS volume to your instance and format it.
Copy the files to the "local" EBS volume.
Snapshot the volume; the snapshot is stored in S3 automatically.
This also puts a smaller load on your instance.
To access that data, you need to attach the snapshot as an EBS volume to an instance.

Writing single Hadoop map reduce output into multiple S3 objects

I am implementing a Hadoop MapReduce job that needs to create output in multiple S3 objects.
Hadoop itself creates only a single output file (an S3 object), but I need to partition the output into multiple files.
How do I achieve this?
I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.
In general, you want Hadoop to handle your input and output as much as possible, for cleaner mappers and reducers; and, of course, you want to write to S3 at the very end of your pipeline, to let Hadoop's data-moving code do its job over HDFS.
In any case, I recommend doing all of your data partitioning, and writing entire output sets to S3 in a final reduce task, one set per S3 file. This puts as little writer logic in your code as possible. This paid off for me because I ended up with a minimal Hadoop S3 toolkit which I used for several task flows.
I needed to write to S3 in my reducer code because the S3/S3n filesystems weren't mature; they might work better now.
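The "S3 toolkit" in question isn't specified; as a rough illustration of the same idea using a Hadoop Streaming reducer in Python with boto3 (rather than a Java reducer, and with hypothetical bucket/prefix names), writing each key's group of values to its own S3 object:

#!/usr/bin/env python
# Rough sketch: a Hadoop Streaming reducer that writes each key's values to its
# own S3 object instead of emitting them to the normal job output.
import sys
from itertools import groupby

import boto3

s3 = boto3.client("s3")
BUCKET = "my-output-bucket"   # hypothetical
PREFIX = "job-output/"

def parse(line):
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

# Streaming delivers the mapper output sorted by key, so groupby yields one
# group per key.
for key, group in groupby(map(parse, sys.stdin), key=lambda kv: kv[0]):
    body = "\n".join(value for _, value in group)
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}{key}", Body=body.encode("utf-8"))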
Are you also aware of MultipleOutputFormat?
It's not related to S3, but in general it allows you to write output to multiple files, implementing whatever logic you need.