Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like this:
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://...", 10e-6)
data.saveAsParquetFile("s3n://...")
I run into problems if I have more than about 2000 partitions or if any partition is larger than 5G.
This puts an upper bound on the maximum size SchemaRDD I can process this way.
The practical limit is closer to 1T, since partition sizes vary widely and a single 5G partition is enough to make the process fail.
Questions dealing with the specific problems I have encountered are
Multipart uploads to Amazon S3 from Apache Spark
Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
Spark SQL unable to complete writing Parquet data with a large number of shards
This question is to see if there are any solutions to the main goal that do not necessarily involve solving one of the above problems directly.
To distill things down, there are 2 problems:
Writing a single shard larger than 5G to S3 fails. AFAIK this is a built-in limit of s3n:// buckets. It should be possible for s3:// buckets, but it does not seem to work from Spark, and hadoop distcp from local HDFS cannot do it either.
Writing the summary file tends to fail once there are 1000s of shards. There seem to be multiple issues with this. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error even on an r3.8xlarge (244G RAM) once there are about 5000 shards. This seems to be independent of the actual data volume. The summary file seems essential for efficient querying.
Taken together these problems limit Parquet tables on S3 to 25T. In practice it is actually significantly less since shard sizes can vary widely within an RDD and the 5G limit applies to the largest shard.
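The 25T ceiling above is simple arithmetic over the two limits. A minimal sketch, assuming decimal units; the `skew` parameter (largest shard size divided by mean shard size) is a hypothetical illustration and not a number from the question:

```python
# Back-of-the-envelope bound on table size from the two limits in the question.
GB = 10**9
TB = 10**12

MAX_SHARD_BYTES = 5 * GB   # largest shard that can be written (from the question)
MAX_SHARDS = 5000          # shard count at which the summary file starts to fail


def max_table_bytes(max_shards=MAX_SHARDS, max_shard_bytes=MAX_SHARD_BYTES, skew=1.0):
    """Upper bound on total size when the largest shard is `skew` times the mean.

    skew=1.0 means perfectly even shards (hypothetical illustration parameter).
    """
    if skew < 1.0:
        raise ValueError("skew must be >= 1.0")
    mean_shard_bytes = max_shard_bytes / skew
    return max_shards * mean_shard_bytes


print(max_table_bytes() / TB)          # perfectly even shards
print(max_table_bytes(skew=5.0) / TB)  # largest shard is 5x the mean
```

With even shards the bound is 25T; any skew lowers it proportionally, which is why the usable limit is well below the theoretical one.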
How can I write a >>25T RDD as Parquet to S3?
I am using Spark-1.1.0.

From AWS S3 documentation:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
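To make the limits in that quote concrete, here is a rough sketch of how a multipart upload could be planned. The 5 GB single-PUT and 5 TB object limits come from the quote above (treated here as binary units); the 10,000-part cap and the 5 MiB minimum part size are my understanding of the S3 multipart limits and are not stated in the quote. `plan_parts` and `preferred_part` are hypothetical names.

```python
MiB = 1024**2
GiB = 1024**3
TiB = 1024**4

MIN_PART = 5 * MiB     # assumed minimum size of every part except the last
MAX_PART = 5 * GiB     # a part is a single PUT, so the 5 GB PUT limit applies
MAX_PARTS = 10_000     # assumed maximum number of parts per upload
MAX_OBJECT = 5 * TiB   # maximum object size from the quoted documentation


def plan_parts(object_size, preferred_part=128 * MiB):
    """Return (part_size, part_count) for uploading object_size bytes in parts."""
    if not 0 < object_size <= MAX_OBJECT:
        raise ValueError("object size outside the supported range")
    # Smallest part size that keeps the upload within MAX_PARTS (ceiling division).
    needed = -(-object_size // MAX_PARTS)
    part_size = max(preferred_part, MIN_PART, needed)
    if part_size > MAX_PART:
        raise ValueError("part size exceeds the single-PUT limit")
    part_count = -(-object_size // part_size)
    return part_size, part_count


# A 6 GiB shard is too large for one PUT but fits easily as a multipart upload.
print(plan_parts(6 * GiB))
```

Under these assumptions a shard above 5G is not a problem for S3 itself; the open question in the thread is whether the Hadoop filesystem layer in use issues multipart uploads.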
One way to get around this:
Attach an EBS volume to your system and format it.
Copy the files to the "local" EBS volume.
Snapshot the volume; the snapshot is stored in S3 automatically.
It also puts less load on your instance.
To access that data, you need to attach the snapshot as an EBS volume to an instance.

Related

Writing Pyspark DF to S3 faster

I am pulling data from a MySQL DB using PySpark and trying to upload the same data to S3 using PySpark.
While doing so, it takes around 5-7 minutes to upload a chunk of 100K records.
At that rate the data pull will take months, as there are around 3,108,700,000 records in the source.
Is there a better way to speed up the S3 upload?
NOTE: The data pull for a single fetch of 100K records takes only 20-30 seconds; it's just the S3 upload causing the issue.
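The "months" estimate can be checked from the figures in the question. A small sketch, assuming chunks are uploaded strictly one after another (the function name is illustrative):

```python
TOTAL_RECORDS = 3_108_700_000   # records in the source (from the question)
CHUNK_RECORDS = 100_000         # records per chunk (from the question)


def days_to_upload(minutes_per_chunk):
    """Days needed if chunks are uploaded sequentially at the given rate."""
    chunks = -(-TOTAL_RECORDS // CHUNK_RECORDS)  # ceiling division
    return chunks * minutes_per_chunk / (60 * 24)


for minutes in (5, 7):
    print(minutes, "min/chunk ->", round(days_to_upload(minutes)), "days")
```

That is roughly 31,000 chunks, so even small per-chunk savings add up to weeks.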
Here is how I am writing the DF to S3.
# Wrap the chained calls in parentheses so they can span several lines.
df = (spark.read.format("jdbc")
      .option('url', jdbcURL)
      .option('driver', driver)
      .option('user', user_name)
      .option('password', password)
      .option('query', data_query)
      .load())
output_df = df.persist()
output_df.repartition(1).write.mode("overwrite").parquet(target_directory)
Repartition is a good move, as writing large files to S3 is better than writing small files.
Persist will slow you down, as you're writing all the files to S3 with that. So you are writing the data to S3 twice.
S3 is made for large, slow, inexpensive storage. It's not made to move data quickly. If you want to migrate the database, AWS has tools for that and it's worth looking into them, even if it's only so you can then move the files into S3.
S3 writes to buckets and determines the bucket by file path. It uses variation at the tail of the path to assign and auto-split buckets (/heres/some/variation/at/the/tail1, /heres/some/variation/at/the/tail2). Buckets are your bottleneck here. To get multiple buckets, vary the file path at its head (/head1/variation/isfaster/, /head2/variation/isfaster/).
Try removing the persist. (At least consider cache() as a cheaper alternative.)
Keep the repartition.
Vary the head of the file path to get assigned more buckets.
Consider a redesign that pushes the data into S3 with the REST API multipart upload.
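One way to vary the head of the path, as suggested above, is to prepend a short hash of the key. This is only a sketch: `spread_key` and `prefix_len` are hypothetical names, and whether it actually helps depends on how S3 partitions keys for your workload.

```python
import hashlib


def spread_key(original_key, prefix_len=2):
    """Prefix a key with a few hex characters derived from its own hash.

    The same key always maps to the same prefix, so objects stay locatable.
    """
    digest = hashlib.sha256(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key.lstrip('/')}"


for key in ("exports/chunk-0001.parquet", "exports/chunk-0002.parquet"):
    print(spread_key(key))
```

With two hex characters there are 256 possible heads; the trade-off is that listing all chunks of one export now requires scanning every prefix.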

AWS : How do Athena GET requests on S3 work?

How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and are large enough Athena will spin up workers that will read partial files. This improves performance because of parallelization. Splittable files are for example Parquet files.
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object level logging in CloudTrail for the bucket. You should be able to see all the request parameters like what byte ranges are read. If you assume a role and pass a unique session name and make only a single query with the credentials you get you should be able to isolate all the S3 operations made by Athena for that query.
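To illustrate how partial reads multiply requests, the sketch below divides one object into the byte ranges that parallel workers might each fetch with their own GET. The split size is an arbitrary assumption, since the size Athena actually uses is not known here, and `byte_ranges` is a hypothetical helper.

```python
def byte_ranges(object_size, split_size):
    """Return HTTP Range header values covering object_size bytes."""
    if object_size < 0 or split_size <= 0:
        raise ValueError("object_size must be >= 0 and split_size > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + split_size, object_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


# One 1 GiB file read in 64 MiB splits already costs 16 GET requests.
print(len(byte_ranges(1024**3, 64 * 1024**2)))
```

Comparing ranges like these with the byte ranges recorded in the CloudTrail logs would show whether split reads account for the extra requests.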

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible interval (15 minutes or 128 MB) and therefore I have 96 data files daily, but I want to aggregate all the data into a single daily data file for the fastest performance when reading the data later in Spark (EMR).
I created a solution where a Lambda function gets invoked when Firehose streams a new file into S3. The function then reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file from the destination bucket (if it already exists with previous daily data; otherwise it creates a new one), decodes both response bodies to strings, adds them together, and writes the result to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
The problem is that when the aggregated file reaches 150+ MB, the Lambda function reaches its ~1500 MB memory limit when reading the two files and then fails.
Currently I have a minimal amount of data, a few hundred MB per day, but this amount will grow exponentially in the future. It is weird to me that Lambda has such low limits and that they are already reached with such small files.
Or what are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or by a scheduled job, for example one scheduled daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…
Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
There's a tiny amount of latency increase from establishing TCP connections and authentication.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
Of course, if it is not possible to keep Spark data ready (but not queriable ("queryable"? I don't know)) for a reasonable amount of money, this may not be an option. It may also be possible that it's extremely time consuming to inject small chunks of data, but that seems unlikely for a production-ready system.
If you really need to chunk the data into daily dumps you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
You may create a Lambda function that is invoked only once a day using Scheduled Events, and in that function use Upload Part - Copy, which does not need to download your files into the Lambda function. There is already an example of this in this thread.
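A minimal sketch of the Upload Part - Copy approach with boto3 follows. It assumes every source object except the last is at least 5 MiB (my understanding of the multipart minimum part size) and that all objects live in one bucket; `concat_objects` and the bucket and key names in the usage comment are hypothetical.

```python
def concat_objects(s3, bucket, source_keys, dest_key):
    """Concatenate source_keys, in order, into dest_key using server-side copies.

    `s3` is a boto3 S3 client. No object data passes through the caller.
    """
    upload = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
    upload_id = upload["UploadId"]
    parts = []
    try:
        for number, key in enumerate(source_keys, start=1):
            result = s3.upload_part_copy(
                Bucket=bucket,
                Key=dest_key,
                UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append({"PartNumber": number,
                          "ETag": result["CopyPartResult"]["ETag"]})
        s3.complete_multipart_upload(
            Bucket=bucket,
            Key=dest_key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so that the incomplete upload does not keep accruing storage.
        s3.abort_multipart_upload(Bucket=bucket, Key=dest_key, UploadId=upload_id)
        raise
    return len(parts)


# Usage (hypothetical names):
#   import boto3
#   concat_objects(boto3.client("s3"), "my-bucket",
#                  ["firehose/part-01", "firehose/part-02"], "daily/2024-01-01")
```

Because the copies happen inside S3, the Lambda memory limit no longer depends on the size of the aggregated file. Source files smaller than the minimum part size would need a different approach, such as downloading and combining several of them into one part.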

Running Spark application using HDFS or S3

In my spark application, I just want to access a big file, and distribute the computation across many nodes on EC2.
Initially, my file is stored on S3.
It's very convenient for me to load the file with sc.textFile() function from S3.
However, I could put in some effort to load the data into HDFS and then read it from there.
My question is, will the performance be better with HDFS?
My code works on Spark partitions (the mapPartitions transformation), so does it really matter what my initial file system is?
Obviously when using S3 the latency is higher and the data throughput is lower compared to HDFS on local disk.
But it depends on what you do with your data. It seems most programs are limited more by CPU power than by network throughput, so you should be fine with the 1Gbps throughput that you get from S3.
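To get a feel for whether that throughput matters, you can estimate the raw read time. A rough sketch, assuming the 1Gbps figure applies per node and that reads scale perfectly across nodes (both are assumptions; `transfer_seconds` is an illustrative helper):

```python
def transfer_seconds(size_bytes, gbps=1.0, nodes=1):
    """Seconds to read size_bytes at gbps gigabits per second per node."""
    if gbps <= 0 or nodes < 1:
        raise ValueError("gbps must be > 0 and nodes >= 1")
    bits = size_bytes * 8
    return bits / (gbps * 1e9 * nodes)


GB = 10**9
print(transfer_seconds(100 * GB))            # one node
print(transfer_seconds(100 * GB, nodes=10))  # ten nodes reading in parallel
```

If the per-partition computation takes longer than this, the job is CPU-bound and the choice of file system matters less.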
In any case, you can check the slides from Aaron Davidson's talk at Spark Summit 2015, where this topic is discussed:
http://www.slideshare.net/databricks/spark-summit-eu-2015-lessons-from-300-production-users/16

How map-reduce works on HDFS vs S3?

I have been trying to understand how differently a map-reduce job is executed on HDFS vs S3. Can someone please address my questions:
Typically HDFS clusters are not only storage oriented, but also contain horsepower to execute MR jobs; and that is why the jobs are mapped on several data nodes and reduced on few. To be exact, the mapping (filter etc) is done on data locally, whereas the reducing (aggregation) is done on common node.
Does this approach work as is on S3? As far as I understand, S3 is just a data store. Does Hadoop have to COPY the WHOLE data set from S3 and then run map (filter) and reduce (aggregation) locally? Or does it follow exactly the same approach as with HDFS? If the former is true, running jobs on S3 could be slower than running jobs on HDFS (due to the copying overhead).
Please share your thoughts.
S3 performance is slower than HDFS, but it provides other features such as bucket versioning, elasticity, and other data recovery schemes (Netflix uses a Hadoop cluster on top of S3).
Theoretically, the sizes of the input files need to be determined before the split computation, so Hadoop itself has a filesystem implementation on top of S3 which allows higher layers to be agnostic of the source of the data. Map-Reduce calls the generic file listing API against each input directory to get the size of all files in the directory.
Amazon's EMR has a special version of the S3 file system that can stream data directly to S3 instead of buffering to intermediate local files; this can make it faster on EMR.
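The split computation mentioned above only needs file names and sizes, which is why a listing call is enough regardless of where the bytes live. A simplified sketch, not Hadoop's actual implementation (which also takes block size and minimum/maximum split settings into account); `compute_splits` is a hypothetical helper:

```python
def compute_splits(file_sizes, split_size):
    """Turn a {path: size_in_bytes} listing into (path, offset, length) splits."""
    if split_size <= 0:
        raise ValueError("split_size must be > 0")
    splits = []
    for path, size in file_sizes.items():
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            splits.append((path, offset, length))
            offset += length
    return splits


# Listing result for a hypothetical input directory (sizes in bytes).
listing = {"s3://bucket/in/a": 10, "s3://bucket/in/b": 3}
for split in compute_splits(listing, split_size=4):
    print(split)
```

Each split becomes one map task; on S3 the task then fetches its byte range over the network instead of reading a local block.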
If you have a Hadoop cluster in EC2 and you run a MapReduce job over S3 data, yes the data will be streamed into the cluster in order to run the job. As you say, S3 is just a data store, so you can not bring the computation to the data. These non-local reads could cause a bottleneck on processing large jobs, depending on the size of the data and the size of the cluster.