How does Parquet encryption work in AWS EMR (Hive)?

I'm looking at the AWS documentation for enabling encryption on EMR, but I can't find any information on how this impacts the performance of Parquet files. Can EMR still take advantage of Parquet when optimizing queries?
Examples:
select count(1) from my_table
Would only scan the metadata in the parquet file and wouldn't require downloading the entire file.
select column from my_table
Would only fetch data for that particular column.
How is this possible when the files are encrypted?
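(For reference, both access patterns can be reproduced locally with pyarrow; the file path below is hypothetical.)

import pyarrow.parquet as pq

# count(1): the row count lives in the footer metadata, so no column data
# needs to be decoded at all.
pf = pq.ParquetFile("my_table/part-00000.parquet")   # hypothetical path
print(pf.metadata.num_rows)

# select column: only the byte ranges (column chunks) for that column are read.
table = pq.read_table("my_table/part-00000.parquet", columns=["column"])
print(table.num_rows)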

Related

Apache Flink: Reading parquet files from S3 in Data Stream APIs

We have several external jobs producing small (500MiB) parquet objects on S3 partitioned by time. The goal is to create an application that would read those files, join them on a specific key and dump the result into a Kinesis stream or another S3 bucket.
Can this be achieved using Flink alone? Can it monitor for new S3 objects being created and load them into the application?
The newer FileSource class (available in recent Flink versions) supports monitoring a directory for new/modified files; see FileSource.forBulkFileFormat() in particular for reading Parquet files.
You use the FileSourceBuilder returned by that method call, and then call .monitorContinuously(Duration.ofHours(1)) (or whatever interval makes sense).
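A rough PyFlink sketch of the same setup (the Python counterpart of the Java calls above); the ParquetColumnarRowInputFormat arguments, the schema, and the S3 path are assumptions and may need adjusting for your Flink version:

from pyflink.common import Configuration, Duration, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.file_system import FileSource
from pyflink.datastream.formats.parquet import ParquetColumnarRowInputFormat
from pyflink.table import DataTypes

env = StreamExecutionEnvironment.get_execution_environment()

# Schema of the Parquet files (hypothetical columns).
row_type = DataTypes.ROW([
    DataTypes.FIELD("join_key", DataTypes.STRING()),
    DataTypes.FIELD("value", DataTypes.BIGINT()),
])

# Bulk format for Parquet, wrapped in a FileSource that keeps re-scanning
# the S3 prefix for newly created objects (here: once an hour).
source = (FileSource.for_bulk_file_format(
              ParquetColumnarRowInputFormat(
                  row_type=row_type,
                  hadoop_config=Configuration(),
                  batch_size=2048,
                  is_utc_timestamp=False,
                  is_case_sensitive=True,
              ),
              "s3://my-bucket/events/")            # hypothetical path
          .monitor_continuously(Duration.of_seconds(3600))
          .build())

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), "parquet-s3-source")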

Merging dask partitions into one file when writing to s3 Bucket in AWS

I've managed to write an Oracle database table to an S3 bucket in AWS in Parquet format using Dask. However, I was hoping to have a single file written out, as with Pandas. I know Dask partitions the data, which creates separate files and a folder. I've tried setting append to true and the number of partitions to false, but it doesn't make a difference. Is there a way to merge/append the partitions while writing to an S3 bucket, so as to create a single Parquet file without the folder?
Thanks
No, this functionality does not exist currently within Dask. It probably is not too hard to leverage pyarrow or fastparquet to do the work, though, taking the partitions and streaming them into whatever new chunking scheme you like.
I am not sure, but it may be possible to use S3 copy functionality to selectively chop out byte chunks from the data files and paste them into the master file you want to make... This would be far more involved.
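A minimal pyarrow sketch of the first approach (assuming s3fs is installed; the bucket and paths are hypothetical, and note that the whole dataset is read into memory):

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Folder of part files written by dask.dataframe.to_parquet (hypothetical path).
table = pq.ParquetDataset("my-bucket/output/table.parquet/", filesystem=fs).read()

# Rewrite all partitions as one Parquet object, outside the folder.
with fs.open("my-bucket/output/table_single.parquet", "wb") as f:
    pq.write_table(table, f)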

AWS Spectrum giving blank result for parquet files generated by AWS Glue

We are building an ETL with AWS Glue, and to optimise query performance we are storing the data in Apache Parquet. Once the data is saved on S3 in Parquet format, we use AWS Spectrum to query it.
We successfully tested the entire stack on our development AWS account, but when we moved to our production AWS account we hit a weird problem: when we query, the rows are returned but the data is blank.
The count query does return a good number, though.
On further investigation we found that the Apache Parquet files in the development AWS account are RLE encoded and the files in the production AWS account are BITPACKED encoded. To make this case stronger, I want to convert BITPACKED to RLE and see if I am then able to query the data.
I am pretty new to Parquet files and couldn't find much help on converting the encodings. Can anybody point me to a way of doing it?
Currently our prime suspect is the different encoding, but if you can guess any other issue, I will be happy to explore the possibilities.
We found our configuration mistake: the column names of our external tables and those specified in AWS Glue were inconsistent. We fixed that and are now able to view the data. A bit of a shortfall on the AWS Spectrum side is that it doesn't give an appropriate error message.
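For anyone who still wants to inspect or change the encodings, a minimal pyarrow sketch (the file names are hypothetical; round-tripping through pyarrow simply re-encodes the file with pyarrow's defaults rather than giving direct control over RLE vs BITPACKED):

import pyarrow.parquet as pq

md = pq.ParquetFile("part-00000.parquet").metadata   # hypothetical file name
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        # Prints the column name and the encodings used for its data pages.
        print(chunk.path_in_schema, chunk.encodings)

# Rewriting the file with pyarrow re-encodes it with pyarrow's own defaults,
# which is one blunt way to change how a file is encoded.
pq.write_table(pq.read_table("part-00000.parquet"), "part-00000-rewritten.parquet")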

Easiest way to migrate data from Aurora to S3 in Apache ORC or Apache Parquet

Athena looks nice.
To use it, at our scale, we need to make it cheaper and more performant, which would mean saving our data in ORC or Parquet formats.
What is the absolute easiest way to migrate an entire Aurora database to S3, transforming it into one of those formats?
DMS and Data Pipeline seem to get you there minus the transformation step...
The transform step can be done with Python; here is a sample: https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
See this article on Athena partitions: http://docs.aws.amazon.com/athena/latest/ug/partitions.html
I would try DMS to initially create the data in S3 and then use the Python above.
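A minimal PySpark sketch of that transform step, assuming DMS has landed full-load CSV files in S3 (the bucket names and paths are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dms-csv-to-parquet").getOrCreate()

# DMS full-load output for one table (CSV by default); path is hypothetical.
df = (spark.read
      .option("inferSchema", "true")
      .csv("s3://my-dms-bucket/mydb/my_table/"))

# Rewrite as Parquet so Athena scans far less data per query.
(df.write
   .mode("overwrite")
   .parquet("s3://my-athena-bucket/parquet/my_table/"))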

Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like this
import org.apache.spark._
val sqlContext = new sql.SQLContext(sc)
// second argument is the sampling ratio used for JSON schema inference
val data = sqlContext.jsonFile("s3n://...", 10e-6)
data.saveAsParquetFile("s3n://...")
I run into problems if I have more than about 2000 partitions or if there is a partition larger than 5G.
This puts an upper bound on the maximum size SchemaRDD I can process this way.
The practical limit is closer to 1T, since partition sizes vary widely and it only takes one 5G partition for the process to fail.
Questions dealing with the specific problems I have encountered are
Multipart uploads to Amazon S3 from Apache Spark
Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
Spark SQL unable to complete writing Parquet data with a large number of shards
This question is to see if there are any solutions to the main goal that do not necessarily involve solving one of the above problems directly.
To distill things down, there are two problems:
Writing a single shard larger than 5G to S3 fails. AFAIK this is a built-in limit of s3n:// buckets. It should be possible for s3:// buckets, but that does not seem to work from Spark, and hadoop distcp from local HDFS cannot do it either.
Writing the summary file tends to fail once there are thousands of shards. There seem to be multiple issues with this. Writing directly to S3 produces the error in the linked question above. Writing directly to local HDFS produces an OOM error, even on an r3.8xlarge (244G RAM), once there are about 5000 shards. This seems to be independent of the actual data volume. The summary file seems essential for efficient querying.
Taken together, these problems limit Parquet tables on S3 to 25T. In practice it is actually significantly less, since shard sizes can vary widely within an RDD and the 5G limit applies to the largest shard.
How can I write a >>25T RDD as Parquet to S3?
I am using Spark-1.1.0.
From the AWS S3 documentation:
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
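For what it's worth, the Multipart Upload capability mentioned there is handled automatically by boto3's high-level transfer API; a minimal sketch (the bucket, key, and file names are hypothetical):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above multipart_threshold are uploaded in multipart_chunksize parts,
# so a single object can exceed the 5 GB single-PUT limit (up to 5 TB total).
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=256 * 1024 * 1024)

s3.upload_file("part-00000.parquet", "my-bucket",
               "warehouse/part-00000.parquet", Config=config)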
Another way to go around this:
Attach an EBS volume to your system and format it.
Copy the files to the "local" EBS volume.
Snapshot the volume; it goes to S3 automatically.
This also puts less load on your instance.
To access that data, you need to attach the snapshot as an EBS volume to an instance.