pyspark write overwrite is partitioned but is still overwriting the previous load

I am running a PySpark script that saves some data to an S3 bucket each time it runs, and I have this code:
data.repartition(1).write.mode("overwrite").format("parquet").partitionBy("time_key").save("s3://path/to/directory")
It is partitioned by time_key, but on each run the latest data dump overwrites the previous data instead of adding a new partition. The time_key is unique to each run.
Is this the correct code if I want to write the data to S3, partitioned by time_key, on each run?

If you are on Spark 2.3+, this issue has been fixed via https://issues.apache.org/jira/browse/SPARK-20236
You have to set the spark.sql.sources.partitionOverwriteMode configuration to "dynamic" so that overwrite only replaces the specific partitions for which new data is written.
Also, since you state that time_key is unique for each run, you could probably just use append mode instead.
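A minimal sketch of both options, reusing the column name and path from the question (an existing SparkSession named spark and the DataFrame data are assumed):

# Option 1 (Spark 2.3+): dynamic partition overwrite; only the time_key
# partitions present in `data` are replaced, older partitions are kept.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(
    data.repartition(1)
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("time_key")
    .save("s3://path/to/directory")
)

# Option 2: since time_key is unique per run, append simply adds the new
# partition without touching the existing ones.
(
    data.repartition(1)
    .write
    .mode("append")
    .format("parquet")
    .partitionBy("time_key")
    .save("s3://path/to/directory")
)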

Related

How can we update existing partition data in aws glue table without running crawler?

When we update data in an existing partition by manually uploading it to the S3 bucket, the data shows up under the existing partition in the Athena/Glue table.
But when the data is uploaded to the S3 bucket via the API, it lands in the existing partition in S3, yet in the Glue table it is stored under a different partition corresponding to the current date (last modified: August 2, 2022, 17:52:15 (UTC+05:30)),
while the partition date in my S3 bucket is different (s3://aiq-grey-s3-sink-created-at-partition/topics/core.Test.s3/2022/07/19/), i.e. 2022/07/19.
So when I check the same object in the Glue table, I want it partitioned by that date, 2022/07/19,
but it shows up under the current-date partition unless I run the crawler.
When I run the crawler, the data is written to the correct partition,
but I don't want to run the crawler every single time.
How can I update data in an existing partition of the Glue table by using the API?
Am I missing some configuration needed to achieve the required result for this process?
Please suggest if anybody has an idea on this.
Here are two solutions I would propose:
Use boto3 to run an Athena query that alters the partition: ALTER TABLE ADD PARTITION
import boto3

athena = boto3.client('athena')
response = athena.start_query_execution(
    QueryString='ALTER TABLE table ADD PARTITION ... LOCATION ...',  # compose the query as you need
    QueryExecutionContext={
        'Database': database   # your Glue/Athena database name
    },
    ResultConfiguration={
        'OutputLocation': output,  # S3 location for Athena query results
    }
)
Use boto3 to create the partition via the Glue Data Catalog: glue.Client.create_partition(**kwargs)
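A minimal sketch of the second option (the database and table names are placeholders, and the parquet input/output formats in the storage descriptor are assumptions; adjust them to match your table definition):

import boto3

glue = boto3.client('glue')
glue.create_partition(
    DatabaseName='my_database',           # placeholder: your Glue database
    TableName='my_table',                 # placeholder: your Glue table
    PartitionInput={
        'Values': ['2022', '07', '19'],   # partition key values, in key order
        'StorageDescriptor': {
            'Location': 's3://aiq-grey-s3-sink-created-at-partition/topics/core.Test.s3/2022/07/19/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
        },
    },
)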

UPSERT in parquet Pyspark

I have parquet files in s3 with the following partitions:
year / month / date / some_id
Using Spark (PySpark), each day I would like to kind of UPSERT the last 14 days: replace the existing data in S3 (one parquet file for each partition), but not delete the days that are older than 14 days.
I tried two save modes:
append - wasn't good because it just adds another file.
overwrite - deletes the past data and the data for other partitions.
Is there any way or best practice to overcome this? Should I read all the data from S3 on each run and write it back again? Or maybe rename the files so that append replaces the current file in S3?
Thanks a lot!
I usually do something similar. In my case I run an ETL and add one day of data to a parquet dataset:
The key is to work with only the data you want to write (in my case, the current date), partition by the date column, and overwrite all data for that date.
This will preserve all the old data. As an example:
(
    sdf
    .write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("date")
    .option("replaceWhere", "2020-01-27")
    .save(uri)
)
Also, you could take a look at delta.io, which is an extension of the parquet format that adds some interesting features like ACID transactions.
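For comparison, a minimal sketch of the same write as a Delta table (this assumes the delta-spark package is configured and the data is stored in the Delta format; there, replaceWhere rewrites only the rows matching the predicate and leaves everything else untouched):

(
    sdf
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("date")
    # Only the data matching this predicate is replaced.
    .option("replaceWhere", "date >= '2020-01-14' AND date <= '2020-01-27'")
    .save(uri)
)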
To my knowledge, S3 doesn't have an update operation. Once an object is added to S3, it cannot be modified in place; you either replace it with another object or write an additional file.
As for your concern about having to read all the data: you can specify the time range you want to read, and partition pruning ensures that only the partitions within that range are actually read.
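As an illustration, a sketch of a pruned read (assuming the data was written with partitionBy, so the directories follow the key=value layout that Spark discovers, and an active SparkSession named spark; the bucket path and cutoff values are placeholders):

from pyspark.sql import functions as F

last_14_days = (
    spark.read.parquet("s3://my-bucket/events/")
    # Filters on the partition columns are pushed down, so Spark only lists
    # and reads the matching year/month/date directories.
    .filter((F.col("year") == 2020) & (F.col("month") == 1) & (F.col("date") >= 14))
)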
Thanks all for the useful solutions.
I ended up using a configuration that served my use case: overwrite mode when writing the parquet, together with this config:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
With this configuration, Spark will only overwrite the partitions for which it has data to write. All the other (past) partitions remain intact - see here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html

Hive Add partition to external table slow

So I need to create an external table for some data stored on S3 and add the partitions explicitly (unfortunately, the directory hierarchy does not fit the dynamic partition functionality due to the name mismatch),
for example:
add a partition for region: euwest1, year: 2018, month: 01, day: 18, hour: 18 at s3://mybucket/mydata/euwest1/YYYY=2018/MM=01/dd=18/HH=18/
I ran this on an EMR cluster with Hive 2.3.2 and instance type r4.2xlarge, which has 8 vCores and 61 GB of RAM.
It takes about 4 seconds to add one partition, which is not too bad, but if we need to process multiple days of data then adding partitions takes a long time.
Is there any way to make this process faster?
Thanks

AWS Athena fails when there are empty files in S3

I have a data pipeline that copies data partitioned by date. Sometimes there is no data for a day, and the data pipeline creates a 0-byte CSV file. When I run an Athena query for that date, it fails instead of returning 0 results. The error I get is:
HIVE_CURSOR_ERROR: Unexpected end of input stream
How can I avoid this? I understand one way is to never create files with empty data, but I could never figure out how to do that in a data pipeline. Is there anything I can tweak in Athena so that it does not fail this way?
Try running the command below after your data has been copied by the data pipeline.
MSCK REPAIR TABLE table_name
This will recover/update the partitions in the Athena catalog.
It can be the last step in your data pipeline. Before you actually make it part of your pipeline, try executing it in the Athena query console and verify that it resolves the issue.
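A minimal sketch of running this as a pipeline step with boto3 (the database, table, and results location are placeholders):

import boto3

athena = boto3.client('athena')
athena.start_query_execution(
    QueryString='MSCK REPAIR TABLE my_table',                   # placeholder table name
    QueryExecutionContext={'Database': 'my_database'},          # placeholder database
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},
)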

Best practice to add time partitions to a table

I have an events table, partitioned by time (year, month, day, hour).
I want to join a few events in a Hive script that gets the year, month, day, and hour as variables.
How can I also add, for example, events from all 6 hours prior to my time
without 'recover all...'?
Thanks
So basically what I needed was a way to use a date that the Hive script receives as a parameter
and add all partitions from 3 hours before to 3 hours after that date, without recovering all partitions and adding the specific hours in every WHERE clause.
I didn't find a way to do it inside the Hive script, so I wrote a quick Python script that takes a date and a table name, along with how many hours to add before/after.
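For illustration, a minimal sketch of what such a helper script could look like (the partition key names, the date format, and the fact that it only prints ALTER TABLE statements, e.g. to be piped into hive -e, are assumptions rather than the original script):

import sys
from datetime import datetime, timedelta

# Usage: python script.py <table> <date, e.g. 2018-01-18-18> <hours before/after>
table, date_str, hours = sys.argv[1], sys.argv[2], int(sys.argv[3])
base = datetime.strptime(date_str, "%Y-%m-%d-%H")

for offset in range(-hours, hours + 1):
    t = base + timedelta(hours=offset)
    # One ADD PARTITION statement per hour in the window around the given date.
    print(
        "ALTER TABLE {tbl} ADD IF NOT EXISTS PARTITION "
        "(year={y}, month={m:02d}, day={d:02d}, hour={h:02d});".format(
            tbl=table, y=t.year, m=t.month, d=t.day, h=t.hour
        )
    )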
When trying to run it inside the Hive script with:
!python script.py tablename ${hiveconf:my.date} 3
I was surprised that the variable substitution does not take place in a line that starts with !
My workaround was to get the date that the Hive script received from the log file on the machine, using something like:
cat /mnt/var/log/hadoop/steps/`ls /mnt/var/log/hadoop/steps/ | sort -r | head -n 1`/stdout
and from there you can parse each Hive parameter in the Python code without passing it via Hive.