Optimal maximum Parquet file size in S3

I'm trying to work out what the optimal file size is when partitioning Parquet data on S3. AWS recommends avoiding files smaller than 128 MB. But is there also a recommended maximum file size?
Databricks recommends files of around 1 GB, but it's not clear to me whether this applies only to HDFS. I know that the optimal file size depends on the HDFS block size; however, S3 doesn't have any concept of block size.
Any thoughts?

You should probably consider two things:
1) With a pure object store such as S3, your block size doesn't matter on the S3 side - you don't need to align to anything.
2) What matters more is how, and with what, you are going to read the data.
Consider partitioning, pruning, row groups and predicate pushdown - and also how you're going to join this data.
E.g. Presto (Athena) prefers files that are over 128 MB, but files that are too big hurt parallelisation - I usually aim for 1-2 GB files (see the sizing sketch after the links below).
Redshift prefers to be massively parallel, so e.g. 4 nodes with 160 files will do better than 4 nodes with 4 files :)
suggested read:
https://www.upsolver.com/blog/aws-athena-performance-best-practices-performance-tuning-tips
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
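To make the file-size target concrete, here is a rough PySpark sketch (my own illustration, not from the linked posts; the paths, the s3a scheme and the total-size estimate are assumptions) of repartitioning before the write so each output Parquet file lands near 1 GB:

    # Rough sketch: pick a partition count so each write task emits ~1 GB of Parquet.
    # The input/output paths and the estimated output size are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sizing").getOrCreate()

    df = spark.read.parquet("s3a://my-bucket/input/")      # hypothetical input

    TARGET_FILE_BYTES = 1 * 1024 ** 3                      # aim for ~1 GB per file
    estimated_output_bytes = 200 * 1024 ** 3               # assumed ~200 GB of Parquet output

    num_files = max(1, estimated_output_bytes // TARGET_FILE_BYTES)

    (df.repartition(num_files)                             # one task -> roughly one output file
       .write
       .mode("overwrite")
       .parquet("s3a://my-bucket/output/"))                # hypothetical output

The compressed output size is hard to know up front, so in practice you'd measure a sample run first and adjust the estimate.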

Related

Copying files from S3 to Redshift is taking too long

I'm using the Redshift COPY command to load log files from my S3 bucket into a table inside my Redshift cluster. Each file is approximately 100 MB and I haven't gzipped them yet. I have 600 of these files now, and the number is still growing. My cluster has 2 dc1.large compute nodes and one leader node.
The problem is that the COPY operation takes too long, at least 40 minutes. What is the best approach to speed it up?
1) Should I get more nodes, or better machines for the nodes?
2) If I gzip the files, will it really matter in terms of COPY time?
3) Is there some design pattern that helps here?
Rodrigo,
Here are the answers:
1 - There is probably some optimization you can do before you change your hardware setup. You would have to test for sure, but after making sure all optimizations are done, if you still need better performance, I would suggest using more nodes.
2 - Gzipped files are likely to give you a performance boost. But I suspect that there are other optimizations that you need to do first. See this recommendation on the Redshift documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-compress-data-files.html
3 -- Here are the things you should look at in order of importance:
Distribution key -- Does your distribution key provide nice distribution across multiple slices? If you have a "bad" distribution key, that would explain the problem you are seeing.
Encoding -- Make sure the encoding is optimal. Use the ANALYZE COMPRESSION command.
Sort key -- Did you choose a sort key that is appropriate for this table? Having a good sort key can have a dramatic impact on compression, which in turn impacts read and write times.
Vacuum -- If you have been performing multiple tests on this table, did you vacuum between the tests? Redshift does not remove the data after a delete or update (an update is processed as a delete plus an insert, not as an in-place update).
Multiple files -- You should have a large number of files. You already do that, but this may be good advice in general for someone trying to load data into Redshift.
Manifest file -- Use a manifest file to allow Redshift to parallelize your load (a minimal sketch of such a load follows below).
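For the last two items, here is a minimal sketch of what the load could look like (my own illustration, not from this thread; the table, bucket, manifest path, IAM role and cluster endpoint are all placeholders), assuming the files have been gzipped and listed in a manifest:

    # Hypothetical COPY that loads gzipped files listed in a manifest so Redshift
    # can spread the work across all slices. All names and credentials are placeholders.
    import os
    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        dbname="dev",
        user="awsuser",
        password=os.environ["REDSHIFT_PASSWORD"],
        port=5439,
    )

    copy_sql = """
        COPY my_log_table
        FROM 's3://my-bucket/manifests/logs.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        GZIP
        MANIFEST;
    """

    with conn, conn.cursor() as cur:   # the connection context manager commits on success
        cur.execute(copy_sql)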
I would expect a load of 60GB to go faster than what you have seen, even in a 2-node cluster. Check these 6 items and let us know.
Thanks
#BigDataKid

Storing large objects in Couchbase - best practice?

In my system, a user can upload very large files, which I need to store in Couchbase. I don't need these very large objects to persist in memory, but I want them to always be read from and written to disk. These files are read-only (never modified): the user can upload them, delete them and download them, but never update them. Due to some technical constraints, my system cannot store those files in the file system, so they have to be stored in the database.
I've done some research and found an article[1] saying that storing large objects in a database is generally a bad idea, especially with Couchbase, but it also gives some advice: create a secondary bucket with a low RAM quota and tune the value/full eviction policy. My concern is the 20 MB limit mentioned by the author; my files would be much larger than that.
What's the best approach for storing large files in Couchbase without having them persist in memory? Is it possible to raise the 20 MB limit if necessary? Should I create a secondary bucket with a very low RAM quota and a full eviction policy?
[1]http://blog.couchbase.com/2016/january/large-objects-in-a-database
Generally, Couchbase engineers recommend that you not store large files in Couchbase. Instead, store the files in some object store (like AWS S3 or Azure Blob Storage) and store the metadata about the files in Couchbase.
There's a Couchbase blog post that gives a pretty detailed breakdown of how to do what you want to do in Couchbase.
It's Java API specific, but the general approach can work with any of the Couchbase SDKs; I'm actually in the midst of doing something pretty similar right now with the Node SDK.
I can't speak for what Couchbase engineers recommend, but they've posted this blog entry detailing how to do it.
For large files, you'll certainly want to split them into chunks. Do not attempt to store a big file all in one document. The approach I'm looking at is to chunk the data and insert each chunk under the file's SHA-1 hash. So the file "Foo.docx" would get split into, say, 4 chunks stored as "sha1|0", "sha1|1" and so on, where sha1 is the hash of the document. This also enables a setup where you can store the same file under many different names. (A small sketch of this scheme follows below.)
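Here is a minimal sketch of that chunk-key scheme (my own, not from the blog post; a plain dict stands in for the bucket, since the exact upsert/get calls vary by Couchbase SDK and version):

    # Split a file into fixed-size chunks keyed by "<sha1>|<index>", plus a small
    # metadata document so the chunks can be found and reassembled later.
    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024   # 4 MB: an arbitrary choice well under the 20 MB limit

    def store_file(path, bucket):
        data = open(path, "rb").read()
        sha1 = hashlib.sha1(data).hexdigest()
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        for i, chunk in enumerate(chunks):
            bucket[f"{sha1}|{i}"] = chunk                  # would be an SDK upsert()
        # one metadata doc per filename; many names can point at the same sha1
        bucket[f"meta|{path}"] = {"sha1": sha1, "chunks": len(chunks), "size": len(data)}
        return sha1

    def load_file(path, bucket):
        meta = bucket[f"meta|{path}"]
        return b"".join(bucket[f"{meta['sha1']}|{i}"] for i in range(meta["chunks"]))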
Tradeoffs -- if integration with Amazon S3 is an option for you, you might be better off with that. In general, chunking data in a DB like I describe is going to be more complicated to implement, and much slower, than using something like Amazon S3. But that has to be traded off against other requirements, like whether or not you can keep sensitive files in S3, or whether you want to deal with maintaining a filesystem and its associated scaling.
So it depends on what your requirements are. If you want speed/performance, don't put your files in Couchbase -- but can you do it? Sure. I've done it myself, and the blog post above describes a separate way to do it.
There are all kinds of interesting extensions you might wish to implement, depending on your needs. For example, if you commonly store many different files with similar content, you might implement a deduplication strategy that stores common segments only once, to save space. Other solutions like S3 will happily store copies of copies of copies of copies, and gleefully charge you huge amounts of money to do so.
EDIT: as a follow-up, there's another Couchbase post talking about why storing files in the DB might not be a good idea. These are reasonable things to consider, but again it depends on your application-specific requirements. "Use S3" is generally good advice, I think, but it won't work for everyone.
MongoDB has an option for this sort of thing, and it's supported in almost all drivers: GridFS. You could do something like GridFS in Couchbase, which is to make a metadata collection (bucket) and a chunk collection holding fixed-size blobs. GridFS allows you to change the blob size per file, but within a file all blobs must be the same size. The file size is stored in the metadata. A typical chunk size is 2048 bytes, and sizes are restricted to powers of 2.
You don't need a memory cache for the files; you can queue up the chunks for download in your app server. You may want to try GridFS on MongoDB first and then see if you can adapt it to Couchbase, but there is also this: https://github.com/couchbaselabs/cbfs
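If you want to try the GridFS pattern on MongoDB first, a small pymongo sketch (the database name and file are placeholders) looks like this; chunking and reassembly happen inside the driver:

    # GridFS stores the file as fixed-size chunk documents plus a metadata document.
    from pymongo import MongoClient
    import gridfs

    db = MongoClient("mongodb://localhost:27017")["filestore"]   # hypothetical database
    fs = gridfs.GridFS(db)

    file_id = fs.put(open("report.pdf", "rb"), filename="report.pdf")  # chunked on write
    data = fs.get(file_id).read()                                      # reassembled on read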
My suggested best practice: do not treat Couchbase as your main database here; consider it a sync database. No matter how you chunk the data into small pieces, you will run into the 20 MB size limit sooner or later, which will hit you in the long run. Having a solid database like MySQL in the middle to store the large data, and using Couchbase for real-time access and sync only, will help.

Amazon EMR NativeS3FileSystem internals query

Does anybody have insights into the internal workings of NativeS3FileSystem with different InputFormats in the Amazon EMR case, as compared to normal Hadoop HDFS, i.e. input split calculation and the actual data flow? What are the best practices and points to consider when using Amazon EMR with S3?
Thanks,
What's important is that if you're planning to use S3N instead of HDFS, you should know that you will lose the benefits of data locality, which can have a significant impact on your jobs.
In general, when using S3N you have two choices for your job flows:
Stream data from S3 as a replacement for HDFS: this is useful if you need constant access to your whole dataset, but as explained above there can be some performance constraints.
Copy your data from S3 to HDFS: if you only need access to a small sample of your data at some point in time, you should just copy it to HDFS to get the benefit of data locality (a small sketch follows after this list).
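As a rough illustration of the second option (the paths are placeholders, and on EMR you could equally use an s3-dist-cp step for the same job), the copy can be wrapped like this:

    # Copy the input from S3 to HDFS before the job so tasks read local blocks
    # instead of streaming from S3N. Source and destination paths are hypothetical.
    import subprocess

    subprocess.run(
        ["hadoop", "distcp",
         "s3n://my-bucket/input/",   # hypothetical source
         "hdfs:///data/input/"],     # hypothetical destination on the cluster
        check=True,
    )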
From my experience I've also noticed that for large jobs, split calculation can become quite heavy, and I've even seen cases where the CPU was at 100% just calculating input splits. I think the reason is that the Hadoop FileSystem layer tries to get the size of each file separately, which for files stored in S3N means sending an API call per file, so a big job with many input files can spend a lot of time there.
For more information, I would advise taking a look at the following article, where someone asked a similar question on the Amazon forums.

Write multiple streams to a single file without knowing the length of the streams?

For performance when reading and writing a large dataset, we have multiple threads compressing and writing separate files to a SAN. I'm making a new file spec that will instead have all these files appended together into a single file. I will refer to each of these smaller blocks of data as a subset.
Since each subset will be an unknown size after compression, there is no way to know what byte offset to write to. Without compression, each writer can write to a predictable address.
Is there a way to append files together on the file-system level without requiring a file copy?
I'll write an example here of how I would expect the result to look on disk, although I'm not sure how helpful it is to write it this way.
single-dataset.raw
[header 512B][data1-45MB][data2-123MB][data3-4MB][data5-44MB]
I expect the SAN to be NTFS for now in case there are any special features of certain file-systems.
If I make the subsets small enough to fit into ram, I will know the size after compression, but keeping them smaller has other performance drawbacks.
Use sparse files. Just position each subset at some offset "guaranteed" to be beyond the last subset. Your header can then contain the offset of each subset and the filesystem handles the big "empty" chunks for you.
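A minimal sketch of that idea (my own; the slot size is an assumed upper bound per compressed subset, and on NTFS you'd also want to mark the file sparse, e.g. with fsutil, so the untouched ranges aren't physically allocated):

    # Each writer seeks to its own pre-assigned slot offset and writes its
    # compressed subset there; the header later records (offset, length) per subset.
    import os

    HEADER_SIZE = 512
    SLOT_SIZE = 256 * 1024 * 1024   # assumed upper bound for one compressed subset

    def write_subset(path, slot_index, data):
        offset = HEADER_SIZE + slot_index * SLOT_SIZE
        flags = os.O_CREAT | os.O_RDWR | getattr(os, "O_BINARY", 0)  # O_BINARY only on Windows
        with os.fdopen(os.open(path, flags), "r+b") as f:
            f.seek(offset)
            f.write(data)           # bytes before `offset` that were never written stay as holes
        return offset, len(data)    # record these in the 512-byte header afterwards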
The cooler solution is to write out each subset as a separate file and then use low-level filesystem functions to join the files by chaining the first block of the next file to the last block of the previous file (along with deleting the directory entries for all but the first file).

Spark RDD.saveAsTextFile writing empty files to S3

I'm trying to execute a map-reduce job using Spark 1.6 (spark-1.6.0-bin-hadoop2.4.tgz) that reads input from and writes output to S3.
The reads are working just fine with: sc.textFile("s3n://bucket/path/to/file/file.gz")
However, I'm having a bunch of trouble getting the writes to work. I'm using the same bucket for the output files: outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")
When my input is extremely small (< 100 records), this seems to work fine. I'm seeing a part-NNNNN file written per partition with some of those files having 0 bytes and the rest being under 1 KB. Spot checking the non-empty files shows the correctly formatted map-reduce output. When I move to a slightly bigger input (~500 records), I'm seeing the same number of part-NNNNN files (my number of partitions are constant for these experiments), but each one is empty.
When I was experimenting with much bigger data sets (millions of records), my thought was that I was exceeding some S3 limits which was causing this problem. However, 500 records (which amounts to ~65 KB zipped) is still a trivially small amount of data that I would think Spark and S3 should handle easily.
I've tried using the S3 Block FileSystem instead of the S3 Native FileSystem, as outlined here, but I get the same results. I've turned on logging for my S3 bucket, but I can't seem to find a smoking gun there.
Has anyone else experienced this? Or can otherwise give me a clue as to what might be going wrong?
Turns out I was working on this too late at night. This morning, I took a step back and found a bug in my map-reduce which was effectively filtering out all the results.
You should use coalesce before saveAsTextFile.
From the Spark programming guide:
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
e.g.:
outputRDD.coalesce(100).saveAsTextFile("s3n://bucket/path/to/output/")