Which is better, multiple requests for multiple files or single request for a single file S3? - pandas

I have a 10gb CSV file. I can put the file in S3 in 2 ways.
1) Upload the entire file into single csv object.
2) Divide the file into multiple chunks(say 200mb) and upload.
Now I need to get all the data in the object into a pandas data frame which is running on a EC2 instance.
1) One way is to make a single request and get the file, if it is to be a one big file and put the data in dataframe.
2) Other way is to make multiple requests for each object and keep appending data to dataframe.
Which is the better way of doing it?

With multiple files, you will have possibility to download them simultaneously in parallel threads. But this has 2 drawbacks:
These operations are IO heavy (network mostly), so depending on your instance type you might have worse performance overall
Multithreaded apps include some overhead in handling errors, aggregating results and such.
Depending on what you do, you might also want to look at AWS Athena, which can query data in S3 for you and produce results in seconds, so you don't have to download it at all.

Related

Writing Pyspark DF to S3 faster

I am pulling data from mysql DB using pyspark and trying to upload the same data using Pyspark.
While doing so, it takes around 5-7 mins to upload a chunk of 100K records.
This process will take months for the data pull as there are around 3,108,700,000 recs in source.
Is there any better way by which the S3 upload process can be improved.
NOTE : Data pull for a single fetch of 100K recs take only 20-30 seconds, its just the S3 upload causing the issue.
Here is how I am writing the DF to S3.
df = spark.read.format("jdbc").
option('url', jdbcURL).
option('driver', driver).
option('user', user_name).
option('password', password).
option('query', data_query).load()
output_df = df.persist()
output_df.repartition(1).write.mode("overwrite").parquet(target_directory)
Reparation is a good move as writing large files to S3 is better than writing small files.
Persist will slow you down as your writing all the files to S3 with that. So you are writing the data to S3 twice.
S3 is made for large, slow, inexpensive storage. It's not made to move data quickly. If you want to migrate the database AWS has tools for that and it's worth looking into them. Even if its so you can then move the files into S3.
S3 writes to buckets and it determines the buckets by file path, It uses tail variation to assign & auto split buckets. (/heres/some/variation/at/the/tail1,/heres/some/variation/at/the/tail2) Buckets are your bottleneck here. To get multiple buckets, keep the vary the file at the head of the file path.(/head1/variation/isfaster/,/head2/variation/isfaster/)
Try and remove the persist. (At least consider cache() as a cheaper alternative.
Keep the repartition
vary the head of the file path to get assigned more buckets.
consider a redesign that pushes the data into S3 with rest api multi-part upload.

AWS : How do Athena GET requests on S3 work?

How do Athena GET requests on S3 work? I had the impression that one S3 GET request = getting one single file from a bucket. But that doesn't seem to be the case since a single query that uses 4 files is costing me around 400 GET requests.
What's happening exactly?
If you run queries against files that are splittable and are large enough Athena will spin up workers that will read partial files. This improves performance because of parallelization. Splittable files are for example Parquet files.
A 100x amplification sounds very high though. I don't know what size Athena aims for when it comes to splits, and I don't know the sizes for your files. There could also be other explanations for the additional GET operations, both inside of Athena and from other sources – how sure are you that these requests are from Athena?
One way you could investigate further is to turn on object level logging in CloudTrail for the bucket. You should be able to see all the request parameters like what byte ranges are read. If you assume a role and pass a unique session name and make only a single query with the credentials you get you should be able to isolate all the S3 operations made by Athena for that query.

copy multiple objects into one object in amazon S3

I stuck with the following problem: I need to upload objects in small parts (512KB), so I can not use multipart upload (since the minimum 5MB restriction). On the grounds of that, I have to put my parts in a "partitions" bucket and run a Cron task to download partitions and upload a single concatenated object into a "completed" bucket.
I would like to clarify, however, that there is no more elegant way to do this except direct download and concatenation. AWS CLI suggests one can copy objects as a whole, but I see no way to copy and concatenate several objects into one. Is there a way to do this via AWS S3 means?
UPD: I am not guaranteed 512KB chunk size (in fact, it is 512KB to 16MB), but it is usually 512KB and this limit takes origin from vendor of my IP cameras so I can not really change that. And I know the result size beforehead, the camera tells me "I am going to upload 33MB" with a separate call to my backend, but I have no control over number of chunks or their size except the guaranteed boundaries above.

Loading files from GCS to BigQuery - what's the best approach?

I need to load around 1 million rows into bigquery table. My approach will be to write data into cloud storage, and then use load api to load multiple files at once.
What's the most efficient way to do this? I can parallelize the writing into gcs part. When I call load api, I pass in all the uris so I only need to call it once. I'm not sure how this loading is conducted in the backend. If I pass in multiple file names, will this loading run in multiple processes? How do I decide the size of each file to get the best performance?
Thanks
Put all the million rows in one file. If the file is not compressed, BigQuery can read it in parallel with many workers.
From https://cloud.google.com/bigquery/quota-policy
BigQuery can read compressed files (.gz) of up to 4GB.
BigQuery can read uncompressed files (.csv, .json, ...) of up to 5000GB. BigQuery figures out how to read it in parallel - you don't need to worry.

Spark to process many tar.gz files from s3

I have many files in the format log-.tar.gz in s3. I would like to process them, process them (extract a field from each line) and store it in a new file.
There are many ways we can do this. One simple and convenient method is to access the files using textFile method.
//Read file from s3
rdd = sc.textFile("s3://bucket/project_name/date_folder/logfile1.*.gz")
I am concerned about the memory limit of the cluster. This way, the master node will be overloaded. Is there any rough estimate for the size of the files that can be processed by the type of clusters?
I am wondering if there is a way to parallelize the process of getting the *.gz files from s3 as they are already grouped by date.
With an exception of parallelize / makeRDD all methods creating RDDs / DataFrames require data to be accessible from all workers and are executed in parallel without loading on a driver.