Incrementally add data to Parquet tables in S3

I would like to keep a copy of my log data in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark, and that only seems to offer operations to read and write whole tables via SQLContext.parquetFile() and SQLContext.saveAsParquetFile().
Is there any way to add data to an existing Parquet table without writing a whole new copy of it, particularly when it is stored in S3?
I know I can create separate tables for the updates, and in Spark I can form the union of the corresponding DataFrames at query time, but I have my doubts about the scalability of that.
I can use something other than Spark if needed.

The way to append to a Parquet table is to write with SaveMode.Append:
`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`
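For reference, a minimal PySpark sketch of the same append (the bucket and paths are placeholders, not from the question):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("append-logs").getOrCreate()

# Read the new batch of log data (hypothetical input path).
new_logs = spark.read.json("s3a://my-bucket/incoming/2015-06-01/")

# Append the batch to the existing Parquet table instead of rewriting it.
new_logs.write.mode("append").parquet("s3a://my-bucket/logs-parquet/")
```

Each append adds new part files under the same directory, so the existing data is never rewritten.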

You don't need to union DataFrames after creating them separately; just supply all the paths relevant to your query to parquetFile(paths) and get one DataFrame, as the signature of the read method, sqlContext.parquetFile(paths: String*), suggests.
Under the hood, in newParquetRelation2, all the .parquet files from all the folders you supply, as well as all the _common_metadata and _metadata files, are collected into a single list and treated equally.
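To illustrate, a sketch of reading several Parquet folders (the original table plus incremental batches) as one DataFrame; the paths are placeholders, and in newer Spark versions the equivalent of parquetFile(paths: String*) is spark.read.parquet(*paths):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths: the base table plus any incremental batches.
paths = [
    "s3a://my-bucket/logs-parquet/",
    "s3a://my-bucket/logs-batch-2015-06-01/",
    "s3a://my-bucket/logs-batch-2015-06-02/",
]

# All part files from every folder are read into a single DataFrame.
logs = spark.read.parquet(*paths)
logs.createOrReplaceTempView("logs")
```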

Related

Creating a single Parquet file in S3 from a PySpark job

I have written a PySpark program that reads data from Cassandra and writes to AWS S3. Before writing to S3 I have to do repartition(1) or coalesce(1), as this creates one single file; otherwise it creates multiple Parquet files in S3.
Using repartition(1) or coalesce(1) has a performance cost, and I feel creating one big partition is not a good option with huge data.
What are the ways to create one single file in S3 without compromising on performance?
coalesce(1) or repartition(1) will put all your data on 1 partition (with a shuffle step when you use repartition, compared to coalesce). In that case, only 1 worker will have to write all your data, which is the reason why you have performance issues - you already figured that out.
That is the only way you can use Spark to write 1 file on S3. Currently, there is no other way using just Spark.
Using Python (or Scala), you can do some other things. For example, you write all your files with Spark without changing the number of partitions, and then:
- you acquire your files with Python
- you concatenate your files as one
- you upload that one file to S3.
It works well for CSV, but not so well for file formats like Parquet that cannot simply be concatenated.
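As an illustration, a rough sketch of that post-processing step with boto3 (the bucket, prefixes, and file names are made up, and the part files are assumed to have been written without header rows):

```
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"            # placeholder bucket
prefix = "output/my-table/"     # prefix Spark wrote the part files to

# List the CSV part files Spark produced (skips _SUCCESS and other markers).
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
part_keys = sorted(o["Key"] for o in resp.get("Contents", [])
                   if o["Key"].endswith(".csv"))

# Concatenate the parts locally into a single file.
with open("merged.csv", "wb") as out:
    for key in part_keys:
        out.write(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

# Upload the single merged file back to S3.
s3.upload_file("merged.csv", bucket, "output/my-table-single/merged.csv")
```

Note that list_objects_v2 returns at most 1000 keys per call, so a paginator would be needed for larger outputs.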

DataBricks - save changes back to DataLake (ADLS Gen2)

I have legacy data stored as CSV in an Azure DataLake Gen2 storage account. I'm able to connect to this and interrogate it using DataBricks. I have a requirement to remove certain records once their retention period expires, or if a GDPR "right to be forgotten" needs applying to the data.
Using Delta I can load a CSV into a Delta table and use SQL to locate and delete the required rows, but what is the best way to save these changes? Ideally back to the original file, so that the data is removed from the original. I've used the LOCATION option when creating the Delta table to persist the generated Parquet format files to the DataLake but it would be nice to keep it in the original CSV format.
Any advice appreciated.
I'd be careful here. Right to be forgotten means you need to delete the data. Delta doesn't actually delete it from the original file (initially at least) - this will only happen once the data is vacuumed.
The safest way to delete data is to read all the data into a DataFrame, filter off the records you do not want, and then write it back using overwrite. This will ensure the data is removed and the same structure is re-written.
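A minimal PySpark sketch of that approach, assuming the CSV lives at a hypothetical ADLS path and the rows to forget are identified by a customer_id column:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src = "abfss://container@account.dfs.core.windows.net/legacy/data.csv"      # placeholder
staging = "abfss://container@account.dfs.core.windows.net/legacy/data_tmp"  # placeholder

# Read the legacy CSV and drop the records that must be forgotten
# (the column name and value are hypothetical).
df = spark.read.csv(src, header=True, inferSchema=True)
kept = df.filter(df["customer_id"] != "12345")

# Spark reads lazily, so write to a staging location first and then swap it
# into place, rather than overwriting the path that is still being read.
kept.write.mode("overwrite").csv(staging, header=True)
```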
Convert Parquet to CSV in ADF
The versioned Parquet files created in the ADLS Gen2 location can be converted to CSV using the Copy Data task in an Azure Data Factory pipeline.
So, you could read the CSV data into a Delta table (with the location pointing to a Data Lake folder), perform the required changes using SQL, and then convert the Parquet files back to CSV format using ADF.
I have tried this and it works. The only hurdle might be detecting the column headers while reading the CSV file into Delta. You could read it into a DataFrame and create a Delta table from it.
If you are running the delete operations periodically, then it is costly to keep the data in CSV, as every time you have to read the file, transform the DataFrame to Delta, query it, and finally, after filtering the records, save it back to CSV and delete the Delta table.
So my suggestion here would be: transform the CSV to Delta once, perform the deletes periodically, and generate CSV only when it's needed.
The advantage here is that Delta internally stores data in Parquet, a binary format that allows better compression and encoding/decoding of the data.
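A sketch of that flow using the Delta Lake Python API (the paths and the retention predicate are assumptions, and the cluster is assumed to have Delta Lake available, as on Databricks):

```
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
delta_path = "abfss://container@account.dfs.core.windows.net/delta/customers"  # placeholder

# One-off: convert the legacy CSV into a Delta table.
(spark.read.csv("abfss://container@account.dfs.core.windows.net/legacy/", header=True)
      .write.format("delta").save(delta_path))

# Periodic: delete expired or "to be forgotten" records in place.
table = DeltaTable.forPath(spark, delta_path)
table.delete("retention_expiry_date < current_date()")  # hypothetical predicate

# On demand: export the current state back to CSV only when it is needed.
(spark.read.format("delta").load(delta_path)
      .write.mode("overwrite")
      .csv("abfss://container@account.dfs.core.windows.net/export/customers", header=True))
```

Remember that the deleted rows remain in older Parquet versions until VACUUM is run, as noted above.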

Grok classifier for parquet

Is it possible to create a grok classifier for Parquet files? If so, where can I find examples?
I'm using the AWS Glue Catalog and I'm trying to create external tables on top of Parquet files. I'd like the classifier to split the files according to one of the columns of the files.
All my files have the column "table", and all records in a given file have the same table value.
My S3 structure is like this:
- s3://my-bucket/my-prefix/table1/...
- s3://my-bucket/my-prefix/table2/...
No, a classifier is not used for conditional parsing of data and routing it to different tables.
You may write a Lambda function, ECS task, or Glue job (depending on processing time) that takes these files and moves them into table-wise folders in an S3 bucket, e.g. s3-data-lake/ingestion/table1, s3-data-lake/ingestion/table2, and so on. Then you can run a crawler over s3-data-lake/ingestion/, which will create all the Glue tables.
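One hedged way to do that routing with a Glue or Spark job is to let partitionBy split the rows by the "table" column (paths follow the examples above; note that this produces folders named table=table1 rather than table1):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the mixed Parquet files from the landing prefix.
df = spark.read.parquet("s3a://my-bucket/my-prefix/")

# Each input file carries a single value in its "table" column, so
# partitioning by it writes the rows into table-wise folders such as
# s3a://s3-data-lake/ingestion/table=table1/, table=table2/, and so on.
df.write.mode("append").partitionBy("table").parquet("s3a://s3-data-lake/ingestion/")
```

Depending on crawler settings, this layout may be catalogued as one partitioned table rather than one Glue table per folder.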

Spark HDFS Direct Read vs Hive External table read

We have a couple of HDFS directories in which data is stored in delimited format. These directories are created as one directory per ingestion date and are added as partitions to a Hive external table.
Directory structure:
/data/table1/INGEST_DATE=20180101
/data/table1/INGEST_DATE=20180102
/data/table1/INGEST_DATE=20180103 etc.
Now we want to process this data in a Spark job. From the program I can directly read these HDFS directories by giving the exact directory path (Option 1), or I can read from Hive into a DataFrame and process it (Option 2).
I would like to know if there is any significant difference between following Option 1 or Option 2. Please let me know if you need any other details.
Thanks in advance.
If you want to select a subset of the columns, that is only possible via spark.sql. In your use case I don't think there will be a significant difference.
With Spark SQL you get partition pruning automatically.
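For instance, a sketch comparing the two options (the table name and paths are from the question; the delimiter and column names are hypothetical):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Option 1: read one ingestion date's HDFS directory directly.
direct_df = spark.read.csv("/data/table1/INGEST_DATE=20180101", sep="|")

# Option 2: query the Hive external table; the filter on the partition
# column lets Spark prune down to just that directory.
hive_df = spark.sql(
    "SELECT col_a, col_b FROM table1 WHERE INGEST_DATE = 20180101"
)
```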

How to load multiple huge csv (with different columns) into AWS S3

I have around 50 CSV files, each with a different structure. Each CSV file has close to 1000 columns. I am using DictReader to merge the CSV files locally, but it is taking too much time. The approach was to merge 1.csv and 2.csv to create 12.csv, then merge 12.csv with 3.csv. This is not the right approach.
```
import csv

for filename in inputs:
    with open(filename, "r", newline="") as f_in:
        reader = csv.DictReader(f_in)  # Uses the field names in this file
```
Since I finally have to upload this huge single CSV to AWS, I was thinking about a better AWS-based solution. Any suggestions on how I can import these multiple CSVs with different structures and merge them in AWS?
Launch an EMR cluster and merge the files with Apache Spark. This gives you complete control over the schema; this answer, for example, might help.
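As an illustration of that Spark approach, a sketch that reads each CSV with its own header and merges the differing schemas with unionByName (allowMissingColumns requires Spark 3.1 or later; the paths are placeholders):

```
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input locations on S3.
paths = ["s3a://my-bucket/raw/1.csv", "s3a://my-bucket/raw/2.csv"]

# Read each file with its own header, then union by column name,
# filling columns missing from a file with nulls.
frames = [spark.read.csv(p, header=True) for p in paths]
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)

# Write the merged result back to S3 as a single CSV file.
merged.coalesce(1).write.mode("overwrite").csv("s3a://my-bucket/merged/", header=True)
```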
Alternatively, you can also try your luck and see how AWS Glue handles the multiple schemas when you create a crawler.
You should copy your data to S3 in both cases.