Which file format should I use that supports appending? - amazon-s3

We currently use the ORC file format to store incoming traffic in S3 for fraud detection analysis.
We chose ORC for the following reasons:
compression
the ability to query the data using Athena
Problem:
ORC files are read-only once written, and we want to update the file contents every 20 minutes,
which implies we need to
download the ORC files from S3,
read the file,
write to the end of the file,
and finally upload it back to S3.
This was not a problem at first, but the data grows significantly, by about 2 GB every day. Downloading 10 GB files, reading them, appending to them, and uploading them again is a very costly process.
Question:
Is there another file format that supports appends/inserts and can also be queried by Athena?
According to this article, Avro is such a file format, but I am not sure:
Can Athena be used to query it?
Are there any other issues?
Note: I am a beginner with big data technologies.

If your table is not partitioned, you can simply copy (aws s3 cp) your new ORC files to the target S3 path for the table, and they will be available instantly for querying via Athena.
If your table is partitioned, you can copy the new files to the paths corresponding to your specific partitions. Once the new files have been copied to a partition, you need to add or update that partition in Athena's metastore.
For example, if your table is partitioned by date, you need to run this query to ensure your partition gets added/updated:
alter table dataset.tablename add if not exists
partition (date = YYYYMMDD)
location 's3://your-bucket/path_to_table/date=YYYYMMDD/'
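The "add a new file instead of rewriting the old one" approach above can be sketched in Python. This is only a minimal sketch under assumptions: the bucket, prefix, and table names are hypothetical placeholders, the partition column is assumed to be `date` as in the example, and the upload helper assumes boto3 is installed and AWS credentials are configured. The backticks around `date` are a precaution against clashing with the DATE keyword.

```python
import uuid
from datetime import date


def new_object_key(table_prefix: str, day: date, suffix: str = ".orc") -> str:
    """Build a unique key inside the day's partition so no existing file is rewritten."""
    partition = f"date={day:%Y%m%d}"
    return f"{table_prefix.strip('/')}/{partition}/part-{uuid.uuid4().hex}{suffix}"


def add_partition_sql(table: str, bucket: str, table_prefix: str, day: date) -> str:
    """Return the ALTER TABLE statement that registers the day's partition."""
    value = f"{day:%Y%m%d}"
    location = f"s3://{bucket}/{table_prefix.strip('/')}/date={value}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (`date` = {value}) LOCATION '{location}'"
    )


def upload_batch(local_path: str, bucket: str, key: str) -> None:
    """Upload one new ORC file (assumes boto3 and AWS credentials are available)."""
    import boto3  # imported lazily so the helpers above work without boto3

    boto3.client("s3").upload_file(local_path, bucket, key)
```

Every 20 minutes you would write only the new rows to a small local ORC file, call `upload_batch` with a key from `new_object_key`, and run the statement from `add_partition_sql` once per new day. The existing files are never downloaded.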

Related

Querying Glue Partitions through Athena while being overwritten?

I have a Glue table on S3 where partitions are populated through Spark save mode overwrite (script executed through Glue job).
What is the expected behavior from Athena if we query such partitions while they are being overwritten?
If you rewrite files while queries are running you may run into errors like "HIVE_FILESYSTEM_ERROR: Incorrect fileSize 1234567 for file".
The reason is that during query planning all the files are listed on S3, and among other things the file sizes are used to divide up the work between the worker nodes. If a file is splittable, which includes file formats like ORC and Parquet, as well as uncompressed text formats (e.g. JSON, CSV), parts of it (called splits) may be processed by different nodes.
If the file changes between query planning and query execution the plan is no longer valid and the query execution fails.
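To make the failure mode concrete, here is a toy model of size-based split planning. It is not Athena's actual implementation, only an illustration under the assumption that the planner records each file's size at listing time and cuts it into fixed-size byte ranges; the function names and the split size are made up for the example.

```python
from typing import Dict, List, Tuple

Split = Tuple[str, int, int]  # (file name, start offset, length)


def plan_splits(listing: Dict[str, int], split_size: int) -> List[Split]:
    """Divide each listed file into byte ranges, using the size seen at planning time."""
    splits = []
    for name, size in sorted(listing.items()):
        for start in range(0, size, split_size):
            splits.append((name, start, min(split_size, size - start)))
    return splits


def invalidated_files(planned: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Return files whose size changed (or that vanished) since the plan was made."""
    return sorted(name for name, size in planned.items() if current.get(name) != size)
```

If `invalidated_files` returns anything between planning and execution, the byte ranges handed to the workers no longer describe the file, which is the situation the "Incorrect fileSize" error reports.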
New partitions are picked up by Athena as long as you set enableUpdateCatalog = True when writing. If you just overwrite the content of existing partitions, Athena will still be able to query the data, as long as there is no schema mismatch.

UPSERT in parquet Pyspark

I have parquet files in s3 with the following partitions:
year / month / date / some_id
Using Spark (PySpark), each day I would like to do a kind of UPSERT of the last 14 days: I want to replace the existing data in S3 (one parquet file per partition) without deleting the days older than 14 days.
I tried two save modes:
append - not suitable, because it just adds another file.
overwrite - deletes the past data and the data for other partitions.
Is there a way or best practice to overcome that? Should I read all the data from S3 in each run and write it back again? Or maybe rename the files so that append replaces the current file in S3?
Thanks a lot!
I usually do something similar. In my case I do an ETL and append one day of data to a parquet file:
The key is to work with the data you want to write (in my case the actual date), make sure to partition by the date column and overwrite all data for the current date.
This will preserve all old data. As an example:
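(
    sdf
    .write
    .format("parquet")
    .mode("overwrite")
    .partitionBy("date")
    # Replace only the partitions present in sdf (here, the current date).
    # Note: "replaceWhere" is honored only by format("delta") and expects a
    # predicate such as "date = '2020-01-27'"; plain parquet ignores it.
    .option("partitionOverwriteMode", "dynamic")
    .save(uri)
)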
Also you could take a look at delta.io which is an extension of the parquet format that gives some interesting features like ACID transactions.
To my knowledge, S3 doesn't have an update operation. Once an object is added to S3, it cannot be modified (you either have to replace it with another object or add a new file).
As for your concern about having to read all the data: you can specify the time range you want to read, and partition pruning ensures that only the partitions within that range are read.
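As a rough illustration of limiting the read to the 14-day window, the helper below lists the partition prefixes to touch. It is a sketch under assumptions: the Hive-style `year=/month=/date=` directory naming and the date format are guesses based on the partition columns in the question, and `recent_partitions` is a made-up name.

```python
from datetime import date, timedelta
from typing import List


def recent_partitions(today: date, days: int = 14) -> List[str]:
    """List year/month/date partition prefixes for the last `days` days, oldest first."""
    window = [today - timedelta(days=offset) for offset in range(days - 1, -1, -1)]
    return [f"year={d.year}/month={d.month:02d}/date={d:%Y-%m-%d}" for d in window]
```

Filtering the DataFrame on the partition columns achieves the same thing through partition pruning; the explicit list is mainly useful when you want to pass concrete paths to the reader or check which prefixes a run will replace.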
Thanks all for the useful solutions.
I ended up using a configuration that served my use case: overwrite mode when writing parquet, along with this config:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
With this configuration, Spark only overwrites the partitions for which it has data to write. All the other (past) partitions remain intact - see here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html
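The difference between the two overwrite behaviors can be modeled in a few lines of plain Python. This is a toy model of the semantics, not Spark code: a table is represented as a dict from partition value to rows, and the `overwrite` function and its `mode` argument are invented for the illustration.

```python
from typing import Dict, List

Table = Dict[str, List[str]]  # partition value -> rows stored in that partition


def overwrite(existing: Table, incoming: Table, mode: str = "static") -> Table:
    """Model what survives an overwrite-mode write under each partitionOverwriteMode."""
    if mode == "static":
        # Static mode clears the whole target before writing the incoming data.
        return {part: list(rows) for part, rows in incoming.items()}
    if mode == "dynamic":
        # Dynamic mode replaces only the partitions present in the incoming data.
        result = {part: list(rows) for part, rows in existing.items()}
        result.update({part: list(rows) for part, rows in incoming.items()})
        return result
    raise ValueError(f"unknown mode: {mode}")
```

With `existing = {"01-26": ["old"], "01-27": ["stale"]}` and `incoming = {"01-27": ["fresh"]}`, static mode leaves only `01-27`, while dynamic mode keeps `01-26` and replaces `01-27`.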

DataBricks - save changes back to DataLake (ADLS Gen2)

I have legacy data stored as CSV in an Azure DataLake Gen2 storage account. I'm able to connect to this and interrogate it using DataBricks. I have a requirement to remove certain records once their retention period expires, or if a GDPR "right to be forgotten" needs applying to the data.
Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet format files to the DataLake but it would be nice to keep it in the original CSV format.
Any advice appreciated.
I'd be careful here. Right to be forgotten means you need to delete the data. Delta doesn't actually delete it from the original file (initially at least) - this will only happen once the data is vacuumed.
The safest way to delete data is to read all the data into a dataframe, filter out the records you do not want, and then write it back using overwrite. This ensures the data is removed and the same structure is rewritten.
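The read, filter, overwrite pattern looks like this for a single local CSV file. It is a minimal sketch using only the standard library rather than Spark or the ADLS APIs, it assumes the file has a header row, and `purge_rows` and its parameters are hypothetical names.

```python
import csv
import os
import tempfile
from typing import Iterable


def purge_rows(csv_path: str, key_column: str, keys_to_remove: Iterable[str]) -> int:
    """Rewrite the CSV without rows whose key is in keys_to_remove; return rows removed."""
    doomed = set(keys_to_remove)
    removed = 0
    directory = os.path.dirname(os.path.abspath(csv_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as dst, open(csv_path, newline="") as src:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                if row[key_column] in doomed:
                    removed += 1
                else:
                    writer.writerow(row)
        os.replace(tmp_path, csv_path)  # swap the filtered copy in for the original
    except BaseException:
        os.remove(tmp_path)  # do not leave a partial copy behind
        raise
    return removed
```

The filtered copy is written next to the original and swapped in at the end, so the original is only replaced once the rewrite has succeeded. On a data lake, old versions or snapshots of the file may still need to be purged separately.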
Convert Parquet to CSV in ADF
The versioned parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.
So, you could read the CSV data into a Delta table (with its location pointing to a Data Lake folder), perform the required changes using SQL, and then convert the parquet files to CSV format using ADF.
I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file to Delta. You could read it to a dataframe and create a Delta table from it.
If you run the delete operations periodically, saving the file as CSV is costly: every time, you read the file, transform the dataframe to Delta, query it, and finally, after filtering the records, save it as CSV again and delete the Delta table.
So my suggestion would be to transform the CSV to Delta once, perform the deletes periodically, and generate a CSV only when it is needed.
The advantage is that Delta stores data internally in the parquet format, a binary format that allows better compression and more efficient encoding/decoding of the data.

Performance improvement for GZ to ORC File

Please let me know if there is a faster way to move *.gz files directly into an ORC table.
1) Another thought: for loading *.gz files into a non-partitioned table, is there a quicker approach than creating an external table and dumping the gz file data into it? We are considering two other approaches, for example: can we use ADF with a custom .exe to uncompress the *.gz file and upload it to Azure Blob?
For example: if the *.gz file is 10 GB and the uncompressed file is 120 GB, uncompressing takes 40 minutes. How do we upload this uncompressed 120 GB data file to Azure Blob? Do we need the Azure Blob SDK for uploading, or will ADF execute the .exe where the data is located, i.e. on the cluster that holds the Blob data? (If ADF executes the .exe on the Azure Blob Storage data center's cluster, there will be no network cost or network latency, and the time to upload the uncompressed data will be very small.) So is this possible with ADF? Would it be the right approach?
If the above approach doesn't work, and we create an MR solution in which the mapper uncompresses the gz file and uploads it to Azure Blob Storage, will there be any performance improvement, given that I then just need to create an external table pointing to the uncompressed file? MR would execute at the Azure Blob Storage location.
We see ORC and ORC with partitions performing the same (sometimes we see a minimal difference between partitioned and non-partitioned ORC). Will ORC with partitions perform better than plain ORC? Will ORC with partitions and bucketing perform better than ORC with partitions only? Each partitioned ORC file is close to 50-100 MB, and each non-partitioned ORC file is 30-50 MB.
Note: 120 GB of uncompressed data compresses to 17 GB in the ORC file format.
The only way I know of to move from gz to the ORC file format is by writing a Hive query. Using a compressed format will always be slower, since the data needs to be decompressed before conversion. You may want to play around with these parameters, as shown here, to see if they speed up the move from gz to ORC.
For question #1 above, you may want to follow up with Azure Data Factory team.
For question #3, I have not tried it but computing on uncompressed data should be faster than using compressed data.
For #4, it depends on the field you are partitioning on. Make sure your key is not under-partitioned (i.e. does not result in too few partitions). Also make sure you add SORTED BY to provide a secondary partitioning key. Refer to this link for more details.
Hive has native support for compressed formats, including GZIP, BZIP2 and deflate. So you can upload the .gz files to Azure Blob and create an external table over those files directly. Then you can create an ORC table and load the data into it. Normally Hive runs faster with compressed files; please refer to Compression in Hadoop by MSIT for details.
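If you do decide to decompress before uploading, as the question considers, the decompression can at least be streamed in chunks so memory use stays flat regardless of file size. This is a generic standard-library sketch, not an ADF or Hive feature; `gunzip_stream` and the chunk size are illustrative choices.

```python
import gzip


def gunzip_stream(src_path: str, dst_path: str, chunk_size: int = 1024 * 1024) -> int:
    """Decompress src_path to dst_path in fixed-size chunks; return bytes written."""
    written = 0
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
    return written
```

Since Hive can read .gz files directly, this step is optional; it only matters if the uncompressed file is what you want to land in Blob storage.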

Where does hive stores its table?

I am new to Hadoop and have just started working with Hive. In my understanding, it provides a query language to process data in HDFS. With HiveQL we can create tables and load data into them from HDFS.
So my question is: where are those tables stored? Specifically, if we have a 100 GB file in HDFS and we want to make a Hive table out of that data, what will be the size of that table and where is it stored?
If my understanding of this concept is wrong, please correct me.
If the table is 100 GB you should consider a Hive external table (as opposed to a "managed table"; for the difference, see this).
With an external table the data itself is still stored on HDFS at the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), and Hive creates a mapping to it in the metastore, whereas with a managed table the data is stored "in Hive".
When you drop a managed table, the underlying data is dropped as well, whereas dropping a Hive external table only removes the metadata referencing that data from the metastore.
Either way you are using only 100 GB as viewed by the user, and you are taking advantage of HDFS's robustness through replication of the data.
Hive will create a directory on HDFS. If you didn't specify a location, it will create the directory under /user/hive/warehouse on HDFS. After the load command, the files are moved into the table's directory under the warehouse path (e.g. /user/hive/warehouse/tablename). You can also point to an HDFS directory if it contains partitions (if the files are partitioned), or use the external table concept.
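The managed versus external distinction can be illustrated with a toy metastore built on local directories. This is not how Hive is implemented, only a sketch of the behavior described in the answers; `ToyMetastore`, `TableEntry`, and the method names are invented for the example.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass
class TableEntry:
    location: Path
    external: bool


class ToyMetastore:
    """A stand-in for the Hive metastore: maps table names to data directories."""

    def __init__(self, warehouse: Path) -> None:
        self.warehouse = warehouse
        self.tables: Dict[str, TableEntry] = {}

    def create_managed(self, name: str) -> Path:
        location = self.warehouse / name  # default location under the warehouse dir
        location.mkdir(parents=True, exist_ok=True)
        self.tables[name] = TableEntry(location, external=False)
        return location

    def create_external(self, name: str, location: Path) -> None:
        self.tables[name] = TableEntry(location, external=True)  # data stays where it is

    def drop(self, name: str) -> None:
        entry = self.tables.pop(name)
        if not entry.external:
            shutil.rmtree(entry.location)  # managed: data is deleted with the table
```

In both cases the table is just metadata pointing at a directory of files, so a 100 GB file stays 100 GB; what differs is who owns the directory and what happens to it on drop.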