Writing PySpark DF to S3 faster

I am pulling data from a MySQL DB with PySpark and trying to upload the same data to S3, also with PySpark.
While doing so, it takes around 5-7 minutes to upload a chunk of 100K records.
At this rate the full pull will take months, as there are around 3,108,700,000 records in the source.
Is there a better way to speed up the S3 upload?
NOTE: a single fetch of 100K records takes only 20-30 seconds; it's just the S3 upload causing the issue.
Here is how I am writing the DF to S3:
df = (spark.read.format("jdbc")
      .option('url', jdbcURL)
      .option('driver', driver)
      .option('user', user_name)
      .option('password', password)
      .option('query', data_query)
      .load())
output_df = df.persist()
output_df.repartition(1).write.mode("overwrite").parquet(target_directory)

Repartitioning is a good move, as writing large files to S3 is better than writing many small files.
Persist will slow you down, as you're writing all the data out with it as well, so you end up writing the data twice.
S3 is made for large, slow, inexpensive storage. It's not made to move data quickly. If you want to migrate the database, AWS has tools for that and it's worth looking into them, even if it's only so you can then move the files into S3.
Under the hood S3 partitions writes by key path, and it auto-splits those partitions based on variation in the keys. Variation only at the tail of the path (/heres/some/variation/at/the/tail1, /heres/some/variation/at/the/tail2) keeps all the writes on the same partition, and that partition is your bottleneck here. To get spread across multiple partitions, put the variation at the head of the key path (/head1/variation/isfaster/, /head2/variation/isfaster/).
Try removing the persist (or at least consider cache() as a cheaper alternative).
Keep the repartition.
Vary the head of the file path to get assigned more partitions.
Consider a redesign that pushes the data into S3 with a REST API multipart upload.
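To make that concrete, here is a minimal sketch of one way the write loop might look with the persist removed and the variation moved to the head of the key. The chunk_queries list, bucket name and chunk= prefix layout are hypothetical; spark, jdbcURL, driver, user_name and password are the same variables as in the question.

# Hypothetical loop over 100K-record chunks; only the prefix layout matters here.
for chunk_id, data_query in enumerate(chunk_queries):
    df = (spark.read.format("jdbc")
          .option('url', jdbcURL)
          .option('driver', driver)
          .option('user', user_name)
          .option('password', password)
          .option('query', data_query)
          .load())

    # No persist(): the data is only materialized once, by the write itself.
    # The varying chunk id sits at the head of the key so chunks spread across
    # different S3 partitions instead of piling onto one.
    target_directory = "s3a://my-bucket/chunk={:06d}/extract/".format(chunk_id)
    df.repartition(1).write.mode("overwrite").parquet(target_directory)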

Related

AWS: How do Athena GET requests on S3 work?

How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and large enough, Athena will spin up workers that read parts of those files. This improves performance through parallelization. Parquet files, for example, are splittable.
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object-level logging in CloudTrail for the bucket. You should be able to see all the request parameters, like what byte ranges are read. If you assume a role, pass a unique session name, and make only a single query with the credentials you get, you should be able to isolate all the S3 operations made by Athena for that query.
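A rough sketch of that isolation trick with boto3; the role ARN, database, query and results bucket are all hypothetical:

import uuid
import boto3

# Assume a role with a unique session name so this query's S3 calls can be
# filtered out of the CloudTrail data events for the bucket.
session_name = "athena-audit-" + uuid.uuid4().hex
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::123456789012:role/athena-query-role",  # hypothetical
    RoleSessionName=session_name,
)["Credentials"]

athena = boto3.client(
    "athena",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Run exactly one query with the isolated credentials, then inspect CloudTrail
# for GetObject events (including byte ranges) tied to this session name.
athena.start_query_execution(
    QueryString="SELECT count(*) FROM my_table",         # hypothetical query
    QueryExecutionContext={"Database": "my_database"},   # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)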

Which is better in S3: multiple requests for multiple files, or a single request for a single file?

I have a 10 GB CSV file. I can put the file in S3 in two ways:
1) Upload the entire file as a single CSV object.
2) Divide the file into multiple chunks (say 200 MB each) and upload those.
Now I need to get all the data in the object into a pandas DataFrame running on an EC2 instance.
1) One way is to make a single request for the one big file and load the data into the DataFrame.
2) The other way is to make a request per object and keep appending the data to the DataFrame.
Which is the better way of doing it?
With multiple files, you have the option of downloading them simultaneously in parallel threads. But this has two drawbacks:
These operations are IO-heavy (mostly network), so depending on your instance type you might see worse performance overall.
Multithreaded apps include some overhead in handling errors, aggregating results and such.
Depending on what you do, you might also want to look at AWS Athena, which can query data in S3 for you and produce results in seconds, so you don't have to download it at all.
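For illustration, a minimal sketch of the parallel-download approach, assuming boto3 and pandas and hypothetical bucket/key names (it concatenates once at the end rather than appending chunk by chunk):

from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "my-bucket"                                             # hypothetical
keys = ["chunks/part-{:03d}.csv".format(i) for i in range(50)]   # the 200 MB chunks

def fetch(key):
    # Download one chunk and parse it into a DataFrame.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(BytesIO(body))

# Network-bound work, so threads are enough; then concatenate once at the end.
with ThreadPoolExecutor(max_workers=8) as pool:
    frames = list(pool.map(fetch, keys))

df = pd.concat(frames, ignore_index=True)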

Concatenate files in S3 using AWS Lambda

Is there a way to use Lambda for S3 file concatenation?
I have Firehose streaming data into S3 with the longest possible interval (15 minutes or 128 MB), so I get 96 data files daily, but I want to aggregate all the data into a single daily file for the fastest read performance later in Spark (EMR).
I created a solution where a Lambda function is invoked when Firehose streams a new file into S3. The function reads (s3.GetObject) the new file from the source bucket and the concatenated daily data file from the destination bucket (if it already exists with previous daily data; otherwise it creates a new one), decodes both response bodies to strings, concatenates them, and writes the result back to the destination bucket with s3.PutObject (which overwrites the previous aggregated file).
The problem is that once the aggregated file reaches 150+ MB, the Lambda function hits its ~1500 MB memory limit while reading the two files, and then fails.
Currently I have a minimal amount of data, a few hundred MB per day, but this amount will grow exponentially in the future. It seems strange to me that Lambda has such low limits and that they are already reached with such small files.
What are the alternatives for concatenating S3 data, ideally invoked by an S3 object-created event or some kind of scheduled job, for example daily?
I would reconsider whether you actually want to do this:
The S3 costs will go up.
The pipeline complexity will go up.
The latency from Firehose input to Spark input will go up.
If a single file injection into Spark fails (this will happen in a distributed system) you have to shuffle around a huge file, maybe slice it if injection is not atomic, upload it again, all of which could take very long for lots of data. At this point you may find that the time to recover is so long that you'll have to postpone the next injection…
Instead, unless it's impossible in the situation, if you make the Firehose files as small as possible and send them to Spark immediately:
You can archive S3 objects almost immediately, lowering costs.
Data is available in Spark as soon as possible.
If a single file injection into Spark fails there's less data to shuffle around, and if you have automated recovery this shouldn't even be noticeable unless some system is running full tilt at all times (at which point bulk injections would be even worse).
There's a tiny amount of latency increase from establishing TCP connections and authentication.
I'm not familiar with Spark specifically, but in general such a "piped" solution would involve:
A periodic trigger or (even better) an event listener on the Firehose output bucket to process input ASAP.
An injector/transformer to move data efficiently from S3 to Spark. It sounds like Parquet could help with this.
A live Spark/EMR/underlying data service instance ready to receive the data.
In case of an underlying data service, some way of creating a new Spark cluster to query the data on demand.
Of course, if it is not possible to keep Spark data ready (but not yet queryable) for a reasonable amount of money, this may not be an option. It may also be that injecting small chunks of data is extremely time consuming, but that seems unlikely for a production-ready system.
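As a sketch of the event-listener piece above, this is roughly what a Lambda handler subscribed to ObjectCreated notifications on the Firehose output bucket could look like; process_object is a hypothetical hook for whatever moves the data into Spark/EMR:

import urllib.parse

def lambda_handler(event, context):
    # One invocation can carry several S3 ObjectCreated records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        process_object(bucket, key)

def process_object(bucket, key):
    # Hypothetical: hand the new Firehose object to the injector/transformer,
    # e.g. submit an EMR step or push the key onto a queue.
    print("new Firehose object: s3://{}/{}".format(bucket, key))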
If you really need to chunk the data into daily dumps you can use multipart uploads. As a comparison, we're doing light processing of several files per minute (many GB per day) from Firehose with no appreciable overhead.
You can create a Lambda function that is invoked only once a day using Scheduled Events, and in that function use Upload Part - Copy, which does not need to download your files to the Lambda function. There is already an example of this in this thread.
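A hedged sketch of that server-side concatenation with boto3 (bucket names, keys and the list of daily source objects are hypothetical). Note that every part except the last must be at least 5 MB, which is fine for ~128 MB Firehose chunks:

import boto3

s3 = boto3.client("s3")
dest_bucket = "my-aggregates"                                         # hypothetical
dest_key = "daily/day-1.data"                                         # hypothetical
source_keys = ["firehose/day-1/part-000", "firehose/day-1/part-001"]  # hypothetical

# The parts are copied inside S3; nothing is downloaded to the Lambda function.
upload = s3.create_multipart_upload(Bucket=dest_bucket, Key=dest_key)
parts = []
for number, key in enumerate(source_keys, start=1):
    result = s3.upload_part_copy(
        Bucket=dest_bucket,
        Key=dest_key,
        UploadId=upload["UploadId"],
        PartNumber=number,
        CopySource={"Bucket": "my-firehose-output", "Key": key},  # hypothetical source bucket
    )
    parts.append({"ETag": result["CopyPartResult"]["ETag"], "PartNumber": number})

s3.complete_multipart_upload(
    Bucket=dest_bucket,
    Key=dest_key,
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)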

Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code looks something like this:
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
val data = sqlContext.jsonFile("s3n://...", 10e-6)
data.saveAsParquetFile("s3n://...")
I run into problems if I have more than about 2000 partitions or if there is a partition larger than 5 GB.
This puts an upper bound on the maximum size of SchemaRDD I can process this way.
The practical limit is closer to 1 TB, since partition sizes vary widely and you only need one 5 GB partition for the process to fail.
Questions dealing with the specific problems I have encountered are
Multipart uploads to Amazon S3 from Apache Spark
Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
Spark SQL unable to complete writing Parquet data with a large number of shards
This question is to see if there are any solutions to the main goal that do not necessarily involve solving one of the above problems directly.
To distill things down, there are two problems:
Writing a single shard larger than 5 GB to S3 fails. AFAIK this is a built-in limit of s3n:// buckets. It should be possible with s3:// buckets, but that does not seem to work from Spark, and hadoop distcp from local HDFS cannot do it either.
Writing the summary file tends to fail once there are thousands of shards. There seem to be multiple issues with this. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error, even on an r3.8xlarge (244 GB RAM), once there are about 5000 shards. This seems to be independent of the actual data volume. The summary file seems essential for efficient querying.
Taken together these problems limit Parquet tables on S3 to 25 TB. In practice it is actually significantly less, since shard sizes can vary widely within an RDD and the 5 GB limit applies to the largest shard.
How can I write a >>25 TB RDD as Parquet to S3?
I am using Spark-1.1.0.
From AWS S3 documentation:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
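For reference, this is roughly what the multipart path recommended in that quote looks like outside Spark, using boto3's managed transfer (the file path, bucket and key are hypothetical):

import boto3
from boto3.s3.transfer import TransferConfig

# Multipart transfers are what get an object past the 5 GB single-PUT limit.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,   # switch to multipart above 100 MB
    multipart_chunksize=256 * 1024 * 1024,   # 256 MB parts
    max_concurrency=8,                        # parts uploaded in parallel
)

s3 = boto3.client("s3")
s3.upload_file(
    "/mnt/data/huge-shard.parquet",    # hypothetical local file
    "my-bucket",                        # hypothetical bucket
    "parquet/huge-shard.parquet",       # hypothetical key
    Config=config,
)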
One way to work around this:
Attach an EBS volume to your instance and format it.
Copy the files to the "local" EBS volume.
Snapshot the volume; it goes to your S3 automatically.
This also puts less load on your instance.
To access that data later, you need to attach the snapshot as an EBS volume to an instance.
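A rough sketch of those steps with boto3 (the instance id, availability zone and volume size are made up; the format/mount and the actual file copy happen on the instance itself):

import boto3

ec2 = boto3.client("ec2")
instance_id = "i-0123456789abcdef0"   # hypothetical

# 1. Create and attach a volume, then format/mount it and copy the Parquet output onto it.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=500, VolumeType="gp2")
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(VolumeId=volume["VolumeId"], InstanceId=instance_id, Device="/dev/xvdf")

# 2. Snapshot the volume; the snapshot is persisted to S3 by AWS automatically.
snapshot = ec2.create_snapshot(VolumeId=volume["VolumeId"], Description="parquet table dump")

# 3. To read the data later, create a new volume from the snapshot and attach it.
restored = ec2.create_volume(AvailabilityZone="us-east-1a", SnapshotId=snapshot["SnapshotId"])
ec2.get_waiter("volume_available").wait(VolumeIds=[restored["VolumeId"]])
ec2.attach_volume(VolumeId=restored["VolumeId"], InstanceId=instance_id, Device="/dev/xvdg")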

Optimize data upload on GoogleBigQuery

I'm currently using the Google BigQuery platform to upload a lot of data (roughly > 6 GB) and work with it as a data source in Tableau Desktop.
At present it takes me an average of one hour to upload 12 tables in CSV format (a total of 6 GB), uncompressed, with a Python script using the Google API.
The google docs specify that "If loading speed is important to your app and you have a lot of bandwidth to load your data, leave files uncompressed.".
How can I optimize this process? Would compressing my CSV files improve the upload speed?
I have also thought about using Google Cloud Storage, but I expect the problem would be the same?
I need to reduce the time it takes to upload my data files, but I haven't found a good solution.
Thanks in advance.
Compressing your input data will reduce the time to upload the data, but will increase the time for the load job to execute once your data has been uploaded (compression restricts our ability to process your data in parallel). Since it sounds like you'd prefer to optimize for upload speed, I'd recommend compressing your data.
Note that if you're willing to split your data into several chunks and compress them each individually, you can get the best of both worlds--fast uploads and parallel load jobs.
Uploading to Google Cloud Storage should have the same trade-offs, except for one advantage: you can specify multiple source files in a single load job. This is handy if you pre-shard your data as suggested above, because then you can run a single load job that specifies several compressed input files as source files.
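To make the pre-shard-and-compress suggestion concrete, here is a minimal sketch using the current google-cloud-bigquery client rather than the raw API script mentioned in the question, assuming the gzipped shards have already been copied to Cloud Storage; the project, dataset, table and staging bucket names are invented:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)

# One load job can list several gzipped CSV shards: the upload is fast because the
# shards are compressed, and BigQuery can still load the shards in parallel.
load_job = client.load_table_from_uri(
    ["gs://my-staging-bucket/table1/part-*.csv.gz"],   # hypothetical staging location
    "my_project.my_dataset.table1",                    # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish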