Every 2 hours, a Spark job runs to convert some tgz files to Parquet.
The job appends the new data to an existing Parquet dataset in S3:
In the spark-submit output I can see that significant time is spent reading old Parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key
seeking to position '149195444'
It looks like this operation takes less than 1 second per file, but the number of files increases over time (each append adds new files), which makes me think that my code will not scale.
Any ideas how to avoid reading old parquet files from s3 if I just need to append new data?
I use EMR 4.8.2 and DirectParquetOutputCommitter:
sc._jsc.hadoopConfiguration().set('spark.sql.parquet.output.committer.class', 'org.apache.spark.sql.parquet.DirectParquetOutputCommitter')

I resolved this issue by writing the DataFrame to HDFS on the EMR cluster and then uploading the Parquet files to S3 with s3-dist-cp.
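A minimal sketch of the upload step, assuming the Spark job has already written only the new data to a staging directory on HDFS. The function names and the example paths are hypothetical; `--src` and `--dest` are the s3-dist-cp options for the source and destination locations.

```python
import subprocess


def build_s3_dist_cp_command(hdfs_dir, s3_dir):
    """Return the argument list that copies a directory from HDFS to S3."""
    return ["s3-dist-cp", "--src", hdfs_dir, "--dest", s3_dir]


def upload_staged_output(hdfs_dir, s3_dir, run=subprocess.run):
    """Run s3-dist-cp and raise CalledProcessError on a non-zero exit code."""
    command = build_s3_dist_cp_command(hdfs_dir, s3_dir)
    return run(command, check=True)


# Hypothetical usage on the EMR master node, after the Spark job has finished:
# upload_staged_output("hdfs:///staging/foo.parquet", "s3://myBucket/foo.parquet")
```

The `run` parameter exists only so the command can be inspected without launching a process; on a real cluster the default `subprocess.run` is used.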

Switch to dynamic partition overwrite mode by setting:
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
Also, avoid the DirectParquetOutputCommitter. Leave the committer setting at its default instead; you will get better write speed with the EMRFS File Committer.


Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second.
I want to use AWS Athena to query the files by using an AWS Glue Datasource and crawler.
For better query performance AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 events to put a message in AWS SQS for each file created. A Lambda function will then be invoked with a batch of x SQS messages, read the data in those files, combine them, and save the result to the destination bucket. bucket2 will then be the source for the AWS Glue crawler.
Will this be the best approach or am I missing something?
Instead of receiving a 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup:
When a file is uploaded to S3, trigger an AWS Lambda function
The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
Kinesis Firehose then batches the data by size or time
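The Lambda step above could look roughly like the following sketch. It assumes the function is triggered directly by S3 event notifications and that a Firehose delivery stream already exists; the stream name `json-merge-stream` and the helper name `forward_records` are hypothetical. The boto3 calls used are `get_object` on the S3 client and `put_record` on the Firehose client.

```python
import urllib.parse

STREAM_NAME = "json-merge-stream"  # hypothetical delivery stream name


def forward_records(event, s3_client, firehose_client, stream_name=STREAM_NAME):
    """Send the content of every S3 object named in the event to Firehose."""
    forwarded = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Terminate each document with a newline so the batched output
        # contains one JSON document per line.
        if not body.endswith(b"\n"):
            body += b"\n"
        firehose_client.put_record(
            DeliveryStreamName=stream_name,
            Record={"Data": body},
        )
        forwarded += 1
    return forwarded


def lambda_handler(event, context):
    import boto3  # provided by the AWS Lambda Python runtime

    return forward_records(event, boto3.client("s3"), boto3.client("firehose"))
```

Passing the clients in as arguments keeps the forwarding logic testable without AWS credentials.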
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output them into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second so it is difficult to query the incoming files in batches (so you know which files have been loaded and which ones have not been loaded). A kludge could be a script that does the following:
Create an external table in Athena that points to a batching directory (eg batch/)
Create an external table in Athena that points to the final data (eg final/)
Have incoming files come into incoming/
At regular intervals, trigger a Lambda function that will list the objects in incoming/, copy them to batch/ and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch)
In Athena, run INSERT INTO final SELECT * FROM batch
Delete the contents of the batch/ directory
This will append the data into the final table in Athena, in a format that is good for querying.
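A rough sketch of the scheduled Lambda for this batching approach, assuming a single bucket with the incoming/ and batch/ prefixes and the `final` and `batch` tables described above. The function names, the database name, and the query result location are hypothetical parameters. The boto3 calls used are the `list_objects_v2` paginator, `copy_object`, `delete_object`, and Athena's `start_query_execution`.

```python
def move_incoming_to_batch(s3_client, bucket, src_prefix="incoming/", dst_prefix="batch/"):
    """Copy every object under src_prefix to dst_prefix, then delete the source."""
    paginator = s3_client.get_paginator("list_objects_v2")
    # Snapshot the keys first; objects that arrive later wait for the next run.
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix)
        for obj in page.get("Contents", [])
    ]
    moved = []
    for key in keys:
        target = dst_prefix + key[len(src_prefix):]
        s3_client.copy_object(
            Bucket=bucket,
            Key=target,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved.append(target)
    return moved


def start_insert(athena_client, database, output_location):
    """Start the INSERT INTO query and return its query execution id."""
    response = athena_client.start_query_execution(
        QueryString="INSERT INTO final SELECT * FROM batch",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

`start_query_execution` returns before the query finishes, so the batch/ contents should only be deleted once the query has been confirmed as succeeded, for example by polling `get_query_execution` with the returned id.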
However, the Kinesis Firehose option is simpler, even if you need to trigger Lambda to send the files to the Firehose.
You can probably achieve that using Glue itself. Have a look here
This is what I think would be simpler:
Have an input folder input/ where the 5KB/1KB files land, and a folder data/ that holds JSON files with a maximum size of 200MB.
Have a Lambda that runs every minute, reads a set of files from input/ and appends them to the last file in data/, using Go or Java.
The Lambda (with max concurrency set to 1) copies a set of 5KB files from input/ and the current X MB file from data/ into its /tmp folder, merges them, uploads the merged file to data/, and deletes the processed files from input/.
Whenever the file size crosses 200MB, create a new file in the data/ folder.
The advantage here is that at any instant, if somebody wants the data, it is the union of the input/ and data/ folders.
With a few tweaks here and there, you can expose a view on top of the input/ and data/ folders that presents a de-duplicated snapshot of the final data.
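The merge-and-roll-over logic can be sketched in isolation, without any S3 access. The answer suggests Go or Java; Python is used here only for illustration, and the function name is hypothetical. One deliberate difference from the description: this sketch starts a new file just before the 200MB limit would be crossed, so no file exceeds the cap.

```python
MAX_BYTES = 200 * 1024 * 1024  # 200MB cap taken from the description above


def merge_into_data_files(data_files, small_files, max_bytes=MAX_BYTES):
    """Append small_files to the last entry of data_files, rolling over at max_bytes.

    data_files is a list of bytes objects, oldest first. A new entry is started
    whenever adding the next small file would push the last entry past
    max_bytes. Returns a new list; the inputs are not modified.
    """
    merged = list(data_files) or [b""]
    for blob in small_files:
        # Keep one JSON document per line in the merged output.
        if not blob.endswith(b"\n"):
            blob += b"\n"
        if merged[-1] and len(merged[-1]) + len(blob) > max_bytes:
            merged.append(b"")
        merged[-1] += blob
    return merged
```

In the Lambda, only the last file in data/ would need to be downloaded to /tmp, since the older files are already at their final size.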

Which file format should I use that supports appending?

Currently we use the ORC file format to store the incoming traffic in S3 for fraud detection analysis.
We chose ORC for several reasons, including the ability to query the data using Athena.
Problem :
ORC files are read-only once written, but we want to update the file contents every 20 minutes,
which means we
need to download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by ~2GB every day. Downloading 10GB files, reading them, writing to them, and uploading them again is a very costly process.
Question :
Is there another file format that also supports appends/inserts and can be queried with Athena?
According to this article, Avro is such a file format, but I am not sure:
whether Athena can be used to query it
whether there are any other issues
Note: my skills in big data technologies are at a beginner level.
If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table, and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy new files to the paths corresponding to your specific partitions. After copying new files to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, then you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
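If this runs every 20 minutes, the statement can be generated rather than written by hand. The following is a small sketch that fills in the YYYYMMDD placeholders of the query above; the function name is hypothetical, and the table name and S3 location are the placeholder values from the example.

```python
import datetime


def add_partition_sql(table, table_location, day):
    """Build the ALTER TABLE statement that registers one date partition."""
    value = day.strftime("%Y%m%d")
    location = f"{table_location.rstrip('/')}/date={value}/"
    return (
        f"alter table {table} add if not exists "
        f"partition (date = {value}) "
        f"location '{location}'"
    )


sql = add_partition_sql(
    "dataset.tablename",
    "s3://your-bucket/path_to_table/",
    datetime.date(2020, 1, 27),
)
print(sql)
```

The resulting string can be run in the Athena console or submitted from the same job that copies the files.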

creating a single parquet file in s3 pyspark job

I have written a PySpark program that reads data from Cassandra and writes it to AWS S3. Before writing to S3 I have to call repartition(1) or coalesce(1) to create one single file; otherwise it creates multiple Parquet files in S3.
Using repartition(1) or coalesce(1) has a performance cost, and I feel that creating one big partition is not a good option with huge data.
What are the ways to create one single file in S3 without compromising on performance?
coalesce(1) or repartition(1) will put all your data in 1 partition (repartition adds a shuffle step, whereas coalesce does not). In that case, only 1 worker has to write all your data, which is why you have performance issues, as you already figured out.
That is the only way to use Spark to write 1 file on S3. Currently, there is no other way using just Spark.
Using Python (or Scala), you can do some other things. For example, you can write all your files with Spark without changing the number of partitions, and then:
you fetch your files with Python
you concatenate your files into one
you upload that one file to S3.
It works well for CSV, but not as well for non-sequential file types.
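For the CSV case, the concatenation step could be sketched as follows, assuming the part files have already been downloaded to a local directory and each one starts with the same header line. The function name and the `part-*.csv` pattern are assumptions for illustration.

```python
import glob
import os
import shutil


def concatenate_csv_parts(parts_dir, output_path, has_header=True):
    """Concatenate part-*.csv files into one file, keeping a single header."""
    part_paths = sorted(glob.glob(os.path.join(parts_dir, "part-*.csv")))
    with open(output_path, "wb") as out:
        for index, path in enumerate(part_paths):
            with open(path, "rb") as part:
                if has_header and index > 0:
                    part.readline()  # skip the repeated header line
                shutil.copyfileobj(part, out)
    return len(part_paths)
```

The same byte-level trick does not work for Parquet, because each Parquet file carries its own footer metadata.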

UPSERT in parquet Pyspark

I have parquet files in s3 with the following partitions:
year / month / date / some_id
Using Spark (PySpark), each day I would like to UPSERT the last 14 days: replace the existing data in S3 (one Parquet file for each partition), without deleting the days that are older than 14 days.
I tried two save modes:
append - wasn't good because it just adds another file.
overwrite - deletes the past data and the data for other partitions.
Is there any way or best practice to overcome that? Should I read all the data from S3 in each run and write it back again? Or maybe rename the files so that append will replace the current file in S3?
Thanks a lot!
I usually do something similar. In my case I do an ETL and append one day of data to a parquet file:
The key is to work with the data you want to write (in my case the actual date), make sure to partition by the date column and overwrite all data for the current date.
This will preserve all old data. As an example:
.option("replaceWhere", "2020-01-27")
Also you could take a look at which is an extension of the parquet format that gives some interesting features like ACID transactions.
To my knowledge, S3 doesn't have an update operation. Once an object is added to S3, it cannot be modified; you either have to replace it with another object or add a new file.
As for your concern that you have to read all the data: you can specify the time range you want to read, and partition pruning ensures that only the partitions within that range are read.
Thanks all for the useful solutions.
I ended up using a configuration that served my use case: overwrite mode when writing the Parquet files, together with this setting:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
With this configuration, Spark will only overwrite the partitions for which it has data to write. All the other (past) partitions remain intact.
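Put together, the write could look like the sketch below. It assumes Spark 2.3 or later (where this setting is available), an existing SparkSession `spark`, and a DataFrame `df` that holds only the last 14 days and contains the four partition columns from the question. The function name is hypothetical.

```python
PARTITION_COLUMNS = ["year", "month", "date", "some_id"]


def overwrite_touched_partitions(spark, df, path, partition_columns=PARTITION_COLUMNS):
    """Overwrite only the partitions that are present in df."""
    # Without this setting, mode("overwrite") replaces everything under path.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    (
        df.write
        .mode("overwrite")
        .partitionBy(*partition_columns)
        .parquet(path)
    )


# Hypothetical usage:
# overwrite_touched_partitions(spark, last_14_days_df, "s3://my-bucket/table/")
```

Partitions older than 14 days are untouched because `df` contains no rows for them.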

"not a Parquet file (too small)" from Presto during Spark structured streaming run

I have a pipeline set up that reads data from Kafka, processes it using Spark Structured Streaming, and then writes Parquet files to HDFS. Downstream clients query the data using Presto, configured to read the data as Hive tables.
Kafka --> Spark --> Parquet on HDFS --> Presto
In general this works. The problem arises when a query happens while the Spark job is running a batch. The Spark job creates a zero-length Parquet file on HDFS. If Presto attempts to open this file in the course of processing a query, then it throws an error:
Query 20171116_170937_07282_489cc failed: Error opening Hive split hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet (offset=0, length=0): hdfs://namenode:50071/hive/warehouse/table/part-00000-5a7c242a-3e53-46d0-9ee4-5d004ef4b1e4-c000.snappy.parquet is not a Parquet file (too small)
The file is indeed zero bytes at this time, so the error is strictly correct, but this is not the behavior I want for the pipeline. I would like to be able to write continuously into the appropriate HDFS folders without disturbing the Presto queries.
The Spark scala code for the job looks like this:
val FilesOnDisk = 1
.trigger(ProcessingTime("1 hours"))
.option("path", job.outputFilesPath)
The job starts at the top of the hour, :00. The file is first visible on HDFS as a zero-length file at :05. It is not updated until it is written completely at :21, just before the job finishes. This makes the table effectively unusable from Presto 25% of the time.
Each file is only a little over 500kB, so I wouldn't expect the physical writing of the file to take very long. From my understanding, Parquet files have their metadata at the end of the file so someone writing bigger files would have even more trouble.
What strategies have people used to integrate Spark structured streaming and Presto while working around this Presto error?
You could try to persuade Presto (or the Presto team) to ignore empty files, but that wouldn't help: the program writing the file (here: Spark) will eventually flush partial data, and the file would then appear partial, non-empty, and not well formed, leading to an error as well.
The way to prevent Presto (or any other program reading the table data, for that matter) from seeing a partial file is to assemble the file in a different location and then atomically move it into the correct location.
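The pattern itself is easy to show on a local filesystem, used here only as a stand-in for HDFS, where a rename within the same filesystem is likewise atomic. The function name and directory layout are hypothetical; this is a sketch of the write-then-rename idea, not of how Spark's streaming sink works internally.

```python
import os
import tempfile


def publish_atomically(data, final_path, staging_dir):
    """Write data to a staging file, then rename it into place in one step."""
    os.makedirs(staging_dir, exist_ok=True)
    fd, staging_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as staged:
        staged.write(data)
        staged.flush()
        os.fsync(staged.fileno())
    # os.replace is atomic when both paths are on the same filesystem, so a
    # reader sees either no file or the complete file, never a partial one.
    os.replace(staging_path, final_path)
    return final_path
```

The staging directory must live outside the table's directory so that readers never list the half-written file.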