Merging dask partitions into one file when writing to s3 Bucket in AWS - amazon-s3

I've managed to write an Oracle database table to an S3 bucket in AWS in Parquet format using Dask. However, I was hoping to have a single file written out, as in Pandas. I know Dask partitions the data, which creates separate files inside a folder. I've tried setting append to true and the number of partitions to false, but it doesn't make a difference. Is there a way to merge/append the partitions while writing to an S3 bucket, so that I end up with a single Parquet file and no folder?
Thanks

No, this functionality does not currently exist within Dask. It probably would not be too hard to leverage pyarrow or fastparquet to do the work yourself, though: take the partitions and stream them into whatever new chunking scheme you like.
I am not sure, but it may also be possible to use S3's copy functionality to selectively pull byte ranges out of the data files and paste them into the single master file you want to create... but this would be far more involved.
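For what it's worth, a common workaround (a sketch only, not a Dask feature: it assumes the whole table fits into memory on the client, and the bucket/key names below are placeholders) is to collect the result to pandas and write a single Parquet object with pyarrow via s3fs:
import dask.dataframe as dd
ddf = dd.read_parquet("s3://my-bucket/oracle_export/")            # placeholder: the folder Dask wrote
# Collecting to pandas gives one in-memory frame, which pandas/pyarrow can then
# write as a single bare .parquet object (s3fs handles the s3:// URL).
ddf.compute().to_parquet("s3://my-bucket/oracle_export.parquet",  # placeholder target key
                         engine="pyarrow", index=False)
# Note: ddf.repartition(npartitions=1).to_parquet(...) gets you a single part file,
# but still inside a directory, so it does not remove the folder.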

Related

File Copy from one S3 bucket to other S3 bucket using Lambda - timing constraint?

I need to copy large files (maybe even greater than 50 GB) from one S3 bucket to another S3 bucket (event based). I am planning to use s3.Object.copy_from to do this inside Lambda (using boto3).
I wanted to see if anyone has tried this. Will it have performance issues for larger files (100 GB etc.) that cause a Lambda timeout?
If yes, is there any alternative option? (I am trying to do this in code since I might need some additional logic, like renaming the file, moving the source file to an archive, etc.)
Note: I am also exploring the AWS S3 Replication options, but I am looking for other solutions in parallel.
You can use the AWS S3 Replication feature.
It supports key-prefix and API filtering as well.
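If you do stay with boto3 inside Lambda, one option worth noting (a sketch only, not tested against 100 GB objects; the bucket and key names are placeholders) is boto3's managed copy, which performs a multipart copy server-side in S3 rather than streaming the bytes through the Lambda:
import boto3
from boto3.s3.transfer import TransferConfig
s3 = boto3.resource("s3")
# Above multipart_threshold, boto3 switches to a multipart copy; the parts are
# copied server-side by S3 (UploadPartCopy), so the Lambda never downloads the body.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=512 * 1024 * 1024,
                        max_concurrency=10)
copy_source = {"Bucket": "source-bucket", "Key": "big/file.dat"}    # placeholder names
s3.meta.client.copy(copy_source, "dest-bucket", "archive/file.dat", Config=config)
Note that the 15-minute Lambda timeout still applies, so for very large objects it is worth measuring whether the copy reliably finishes within that window.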

Writing Pyspark DF to S3 faster

I am pulling data from a MySQL DB using PySpark and trying to upload the same data to S3, also with PySpark.
While doing so, it takes around 5-7 minutes to upload a chunk of 100K records.
This process will take months for the full data pull, as there are around 3,108,700,000 records in the source.
Is there any better way by which the S3 upload process can be improved?
NOTE: The data pull for a single fetch of 100K records takes only 20-30 seconds; it's just the S3 upload causing the issue.
Here is how I am writing the DF to S3.
df = (spark.read.format("jdbc")
      .option('url', jdbcURL)
      .option('driver', driver)
      .option('user', user_name)
      .option('password', password)
      .option('query', data_query)
      .load())
output_df = df.persist()
output_df.repartition(1).write.mode("overwrite").parquet(target_directory)
Repartitioning is a good move, as writing a few large files to S3 is better than writing many small ones.
Persist will slow you down: you are materializing all of the data before the write, so you effectively handle the data twice.
S3 is made for large, slow, inexpensive storage. It's not made to move data quickly. If you want to migrate the database, AWS has tools for that, and it's worth looking into them, even if it's only so you can then move the resulting files into S3.
S3 shards data internally by object key (the file path), and those internal partitions are assigned and auto-split based on the key prefix. If your keys only vary at the tail (/heres/some/variation/at/the/tail1, /heres/some/variation/at/the/tail2), they all land on the same partition, and that partition becomes your bottleneck. To be spread across multiple partitions, vary the key at the head of the path instead (/head1/variation/isfaster/, /head2/variation/isfaster/).
Try to remove the persist (or at least consider cache() as a cheaper alternative).
Keep the repartition.
Vary the head of the file path to get assigned more internal partitions.
Consider a redesign that pushes the data into S3 with the REST API's multipart upload.
A rough sketch of the first few points follows below.
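Purely as an illustration of those points (df is the JDBC DataFrame from the question; the s3a path is a made-up placeholder, and the right partition count depends on your chunk size), the write might look something like this:
(df.repartition(1)                                      # a few large files beat many tiny ones
   .write.mode("overwrite")
   .parquet("s3a://my-bucket/chunk-0042/output/"))      # placeholder; vary the head of the key per chunk
# Note: no persist()/cache() here -- the DataFrame is consumed exactly once, by this write.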

Spark to process many tar.gz files from s3

I have many files in the format log-.tar.gz in S3. I would like to process them (extract a field from each line) and store the result in a new file.
There are many ways we can do this. One simple and convenient method is to access the files using the textFile method.
# Read file from S3
rdd = sc.textFile("s3://bucket/project_name/date_folder/logfile1.*.gz")
I am concerned about the memory limit of the cluster; done this way, the master node will be overloaded. Is there any rough estimate of the size of the files that can be processed by a given type of cluster?
I am wondering if there is a way to parallelize the process of fetching the *.gz files from S3, as they are already grouped by date.
With the exception of parallelize / makeRDD, all methods that create RDDs / DataFrames require the data to be accessible from all of the workers; they execute in parallel, without loading the data onto the driver.
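As a hedged sketch of that point (the glob, field position, and output path below are assumptions, not from the question): a wildcard path lets every worker pull and decompress its own objects straight from S3, with nothing funnelled through the driver.
# A glob over the date folders matches all the log objects at once; each worker
# reads and gunzips its own files directly from S3, not via the driver.
# (Gzip is not splittable, so each object becomes a single partition.)
logs = sc.textFile("s3://bucket/project_name/*/logfile*.gz")
# Hypothetical transformation: keep the third whitespace-separated field of each line.
fields = logs.map(lambda line: line.split()[2])
fields.saveAsTextFile("s3://bucket/project_name/extracted_fields/")
# Caveat: textFile only strips the gzip layer. If the objects are genuine tar
# archives (several members inside), the tar framing will appear in the lines,
# and sc.binaryFiles(...) plus Python's tarfile module is the safer route.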

HBase backed by S3

I just read about being able to use HBase that is backed by S3 as the filesystem.
I also read elsewhere that S3 is blob storage and lacks the ability to append to a file. Without any append functionality, I am unable to understand how HBase can use S3 as the underlying filesystem. For example, what happens at the S3 layer when I add a single new column to HBase?
Please help with my confusion!
Thanks,
Vivek
If you add a small column value, my understanding is that HBase will not immediately modify the underlying storage.
Instead, HBase will (1) persist the addition of the column/cell to its write-ahead log (WAL) and then (2) also apply it to the in-memory MemStore.
Only when the MemStore gets flushed to disk will HBase write out the underlying data, in relatively large chunks, which suits storage implementations such as S3 and HDFS well.
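To make that write path concrete, here is a deliberately simplified toy model in Python (illustrative only, not HBase's actual code): edits go to an append-only log and an in-memory buffer, and the blob store only ever receives whole, immutable flush files, so S3 never needs to support append.
import json
class ToyRegionStore:
    """Toy model of HBase's WAL + MemStore + flush write path."""
    def __init__(self, wal, blob_store, flush_threshold=1000):
        self.wal = wal                  # append-only log standing in for the WAL
        self.blob_store = blob_store    # dict standing in for S3: key -> immutable blob
        self.memstore = {}              # in-memory map of recent edits
        self.flush_threshold = flush_threshold
        self.flush_count = 0
    def put(self, row, column, value):
        # 1) Durably record the single-cell edit in the WAL...
        self.wal.append(json.dumps({"row": row, "col": column, "val": value}))
        # 2) ...and apply it to the MemStore. Nothing is written to the blob store yet.
        self.memstore[(row, column)] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()
    def flush(self):
        # Write the entire MemStore as one new immutable object ("store file").
        # This is a plain PUT of a complete blob -- no append needed from S3.
        key = "storefile-%05d" % self.flush_count
        self.blob_store[key] = json.dumps(
            [{"row": r, "col": c, "val": v} for (r, c), v in sorted(self.memstore.items())]
        )
        self.flush_count += 1
        self.memstore.clear()
So adding one column value touches only the log and memory at first; the blob store only sees whole new objects when a flush (or a later compaction that rewrites store files) happens.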

Writing single Hadoop map reduce output into multiple S3 objects

I am implementing a Hadoop Map reduce job that needs to create output in multiple S3 objects.
Hadoop itself creates only a single output file (an S3 object) but I need to partition the output into multiple files.
How do I achieve this?
I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.
In general, you want Hadoop to handle your input and output as much as possible, for cleaner mappers and reducers; and, of course, you want to write to S3 at the very end of your pipeline, letting Hadoop's own data movement do its job over HDFS.
In any case, I recommend doing all of your data partitioning, and writing entire output sets to S3 in a final reduce task, one set per S3 file. This puts as little writer logic in your code as possible. This paid off for me because I ended up with a minimal Hadoop S3 toolkit which I used for several task flows.
I needed to write to S3 in my reducer code because the S3/S3n filesystems weren't mature; they might work better now.
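The original answer predates today's SDKs, but as a rough modern analogue of "write each output set from the final reducer" (a sketch only: the bucket name, key layout, and tab-separated key/value input format are assumptions), a Hadoop Streaming reducer in Python could do something like this:
#!/usr/bin/env python
# Streaming reducer: input lines arrive sorted by key, so we can buffer one key's
# records and upload each completed group as its own S3 object.
import sys
import boto3
s3 = boto3.client("s3")
BUCKET = "my-output-bucket"                       # placeholder
def upload(key, records):
    s3.put_object(Bucket=BUCKET,
                  Key="job-output/%s.txt" % key,  # one S3 object per output set
                  Body="\n".join(records).encode("utf-8"))
current_key, buffered = None, []
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if current_key is not None and key != current_key:
        upload(current_key, buffered)
        buffered = []
    current_key = key
    buffered.append(value)
if current_key is not None:
    upload(current_key, buffered)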
Are you also aware of MultipleOutputFormat?
It's not related to S3, but in general it allows you to write output to multiple files, implementing whatever logic you need.