AWS Glue PySpark job deletes S3 folder unexpectedly - amazon-s3

My Glue workflow is DDB -> Glue table (via Crawler) -> S3 (via Glue job)
I create the S3 folder manually before the workflow runs.
For a DDB table of about 500 MB it always works fine (the run takes 7-10 minutes to finish), and the S3 path has the correct result, e.g. s3://glue_example/ddb_500MB/ (I know the data is correct by checking it in Athena after connecting it to S3).
For a DDB table of about 50 GB, the folder is deleted by the Glue job (the run takes 2 hours to finish, with no error), e.g. s3://glue_example/ddb_50GB is removed. (I enabled S3 access logging, and the log shows GlueJobRunnerSession issuing DeleteObject on this folder path.)
This folder-deletion behavior is not consistent; it happens most of the time, but if I find the folder deleted and recreate it manually, the next run writes the correct data into that S3 folder.
The Glue job code (Glue 3.0 - supports Spark 3.1, Scala 2, Python 3) is very simple. The only line that writes to S3 is: ApplyMapping_node2.toDF().write.mode("overwrite").format("parquet").save('s3://glue_example/ddb_50GB')
The workflow/job concurrency is 1, so the problem is not caused by competing runs.
I use overwrite so the folder only holds the latest data, but I don't know why it keeps deleting the folder when the DDB data source is large. Any idea?

The issue was due to the whole table being read into a single partition, which is the default behaviour. Increasing dynamodb.splits while reading from the DDB table should help, as it reads the data in parallel into multiple partitions. Below is an example in PySpark.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100"
    }
)
Refer to the link below for more information:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
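For completeness, a hedged sketch of the full read-and-write flow, combining the parallel read above with the overwrite write from the question (the table name, bucket path, and split count are placeholders taken from the question, not a tested configuration):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# read the DynamoDB table into ~100 partitions instead of one
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "test_source",
        "dynamodb.throughput.read.percent": "1.0",
        "dynamodb.splits": "100"
    }
)

# same overwrite write as in the question, now fed by a parallel read
dyf.toDF().write.mode("overwrite").format("parquet").save("s3://glue_example/ddb_50GB")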

Related

Reading data from GCS with BigQuery fails with "Not Found", but the data (files) exist

I have a service that is constantly updating files in a GCS bucket using a Hive-style layout:
bucket
    device_id=aaaa
        month=01
            part-0.parquet
        month=02
            part-0.parquet
        ...
    device_id=bbbb
        month=01
            part-0.parquet
        month=02
            part-0.parquet
        ...
If today we are in month=02 and I run the following in BigQuery:
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '02';
I get the error: Not found: Files /bigstore/bucket_name/device_id=aaaa/month=02/part-0.parquet
I checked, and the file was there when the query ran.
If I run
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '01';
I get results without any errors. I guess the error is related to the fact that I'm modifying the data while querying it. But as I understand it, this should not be an issue with GCS; this is from their docs:
Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.
I saw some posts suggesting this could be related to my bucket being multi-region.
Any other insights?
There are a few reasons why you could be getting this error.
When you load data from Cloud Storage into a BigQuery table, the dataset that contains the table must be in the same regional or multi-regional location as the Cloud Storage bucket.
Regarding consistency: while metadata updates are strongly consistent for read-after-metadata-update operations, the process can take time to finish the changes.
Using a multi-region bucket is not recommended.
In this case, it could be a consistency issue: you are updating the files in GCS at the same time you are executing the query, so on one run the Parquet file was available to read and you didn't get the error, but on the next run the file wasn't available because the service was updating it, and you got the error.
Unfortunately, there is not a simple way to solve this problem, but here are some options:
You can add a Pub/Sub routine to the bucket and/or file and kick off your query after the service has finished updating the files.
Make a workflow that blocks the updating of the files in their buckets until the query finishes.
If the query fails with "not found" for file ABCD and you have verified that ABCD exists in GCS, retry the query X times (see the sketch below).
Back up your data to another location where you won't update these files constantly, just once a day.
Move the data into managed storage where you won't have this problem, because you can do snapshotting.
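As a rough illustration of the retry option above, a hedged sketch using the google-cloud-bigquery client; the query text, retry count, and back-off are placeholders, not a tested recipe:

import time
from google.cloud import bigquery
from google.api_core.exceptions import GoogleAPICallError

client = bigquery.Client()
query = """
    SELECT DISTINCT event_id
    FROM `project_id.dataset.table`
    WHERE month = '02'
"""

for attempt in range(5):                      # retry the query X times
    try:
        rows = list(client.query(query).result())
        break
    except GoogleAPICallError as err:
        if "Not found" not in str(err):       # only retry the "Not found: Files ..." case
            raise
        time.sleep(2 ** attempt)              # back off before retrying
else:
    raise RuntimeError("query kept failing with 'Not found'")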

DynamoDB data to S3 in Kinesis Firehose output format

Kinesis Data Firehose has a default format for adding files into separate partitions in an S3 bucket, which looks like: s3://bucket/prefix/yyyy/MM/dd/HH/file.extension
I have created event streams to dump data from DynamoDB to S3 using Firehose. There is a transformation Lambda in between which converts the DDB records into TSV (tab-separated) format.
All of this was added to an existing table that already contains a huge amount of data. I need to backfill the existing data from DynamoDB to the S3 bucket while maintaining parity with the existing Firehose output format.
Solution I tried:
Step 1: Export the table to S3 using the DDB export feature. Use a Glue crawler to create a Data Catalog table.
Step 2: Use Athena's CREATE TABLE AS SELECT query to imitate the transformation done by the intermediate Lambda and store the output in an S3 location.
Step 3: However, Athena CTAS applies a default compression that cannot be turned off. So I wrote a Glue job that reads from the previous table and writes to another S3 location. This job also adds the partitions based on year/month/day/hour, as in the Firehose format, and writes decompressed, tab-separated files to S3.
However, the problem is that Glue creates Hive-style partitions, which look like:
s3://bucket/prefix/year=2021/month=02/day=02/. I need to match the Firehose block-style S3 partitions instead.
I am looking for an approach to achieve this. I couldn't find a way to write block-style partitions using Glue. Another approach I have is to use the AWS CLI s3 mv command to move all this data into separate folders with the correct file names, which is neither clean nor optimised.
Leaving the solution I ended up implementing here in case it helps anyone.
I created a Lambda and added S3 event trigger on this bucket. The Lambda did the job of moving the file from Hive-style partitioned S3 folder to correctly structured block-style S3 folder.
The Lambda used the copy and delete functions from the boto3 S3 client to implement this.
It worked like a charm even though I had more than 10^6 output files split across different partitions.
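A hedged sketch of what that Lambda might look like; the key layout (including the hour level), prefix, and bucket are assumptions based on the question, and the move is done with boto3 copy_object/delete_object:

import re
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")
HIVE_PATTERN = re.compile(r"year=(\d{4})/month=(\d{2})/day=(\d{2})/hour=(\d{2})/(.+)$")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        match = HIVE_PATTERN.search(key)
        if not match:
            continue  # not a Hive-partitioned key, leave it alone
        year, month, day, hour, filename = match.groups()
        # rewrite year=2021/month=02/day=02/hour=05/... to the Firehose-style 2021/02/02/05/...
        new_key = f"prefix/{year}/{month}/{day}/{hour}/{filename}"
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)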

Copy and Merge files to another S3 bucket

I have a source bucket where small 5KB JSON files will be inserted every second.
I want to use AWS Athena to query the files by using an AWS Glue Datasource and crawler.
For better query performance AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 events to put a message in AWS SQS for each file created; then a Lambda will be invoked with a batch of x SQS messages, read the data in those files, combine it, and save it to the destination bucket. bucket2 will then be the source for the AWS Glue crawler.
Will this be the best approach or am I missing something?
Instead of receiving 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup:
When a file is uploaded to S3, trigger an AWS Lambda function
The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
Kinesis Firehose then batches the data by size or time
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output them into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second so it is difficult to query the incoming files in batches (so you know which files have been loaded and which ones have not been loaded). A kludge could be a script that does the following:
Create an external table in Athena that points to a batching directory (eg batch/)
Create an external table in Athena that points to the final data (eg final/)
Have incoming files come into incoming/
At regular intervals, trigger a Lambda function (sketched below) that will list the objects in incoming/, copy them to batch/, and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch)
In Athena, run INSERT INTO final SELECT * FROM batch
Delete the contents of the batch/ directory
This will append the data into the final table in Athena, in a format that is good for querying.
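A hedged sketch of the batching Lambda from that step; the bucket name and prefixes are placeholders:

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"

def handler(event, context):
    # move everything currently in incoming/ to batch/ so the
    # INSERT INTO final SELECT * FROM batch step sees a stable snapshot
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="incoming/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            new_key = "batch/" + key[len("incoming/"):]
            s3.copy_object(Bucket=BUCKET, Key=new_key,
                           CopySource={"Bucket": BUCKET, "Key": key})
            s3.delete_object(Bucket=BUCKET, Key=key)
    # objects that arrive while this runs stay in incoming/ for the next batch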
However, the Kinesis Firehose option is simpler, even if you need to trigger Lambda to send the files to the Firehose.
You can probably achieve that using Glue itself. Have a look here: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
This is what I think would be simpler:
Have an input folder input/ where the 1-5 KB files land, and a data/ folder that holds JSON files with a maximum size of 200 MB.
Have a Lambda that runs every minute, reads a set of files from input/, and appends them to the last file in data/ (using Golang/Java).
The Lambda (with max concurrency set to 1) copies a set of 5 KB files from input/ and the current file from data/ into its /tmp folder, merges them, uploads the merged file back to data/, and deletes the processed files from input/.
Whenever the file size crosses 200 MB, create a new file in the data/ folder.
The advantage here is that at any instant, if somebody wants the data, it is the union of the input/ and data/ folders.
In other words, with a few tweaks here and there you can expose a view on top of the input/ and data/ folders that presents a de-duplicated snapshot of the final data.
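A hedged sketch of that merge Lambda; the bucket name, 200 MB cut-off, and file naming are placeholders, and it assumes the small files are newline-delimited JSON so plain concatenation is safe:

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"
MAX_SIZE = 200 * 1024 * 1024  # start a new data/ file past 200 MB

def handler(event, context):
    small = s3.list_objects_v2(Bucket=BUCKET, Prefix="input/").get("Contents", [])
    if not small:
        return
    # concatenate the small newline-delimited JSON files
    merged = b"".join(
        s3.get_object(Bucket=BUCKET, Key=o["Key"])["Body"].read() for o in small
    )
    existing = s3.list_objects_v2(Bucket=BUCKET, Prefix="data/").get("Contents", [])
    latest = max(existing, key=lambda o: o["LastModified"]) if existing else None
    if latest and latest["Size"] + len(merged) < MAX_SIZE:
        # append onto the newest data/ file
        body = s3.get_object(Bucket=BUCKET, Key=latest["Key"])["Body"].read() + merged
        target = latest["Key"]
    else:
        # cut-off reached (or no file yet): start a new data/ file
        body, target = merged, f"data/part-{context.aws_request_id}.json"
    s3.put_object(Bucket=BUCKET, Key=target, Body=body)
    for o in small:
        s3.delete_object(Bucket=BUCKET, Key=o["Key"])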

How to avoid reading old files from S3 when appending new data?

Once every 2 hours, a Spark job runs to convert some tgz files to Parquet.
The job appends the new data into an existing parquet in s3:
df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")
In spark-submit output I can see significant time is being spent on reading old parquet files, for example:
16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet' for reading
16/11/27 14:06:15 INFO S3NativeFileSystem: Stream for key 'foo.parquet/id=123/day=2016-11-26/part-r-00003-e80419de-7019-4859-bbe7-dcd392f6fcd3.snappy.parquet' seeking to position '149195444'
It looks like this operation takes less than 1 second per file, but the number of files grows over time (each append adds new files), which makes me think my code will not be able to scale.
Any ideas how to avoid reading old parquet files from s3 if I just need to append new data?
I use EMR 4.8.2 and DirectParquetOutputCommitter:
sc._jsc.hadoopConfiguration().set('spark.sql.parquet.output.committer.class', 'org.apache.spark.sql.parquet.DirectParquetOutputCommitter')
I resolved this issue by writing the dataframe to EMR HDFS and then using s3-dist-cp to upload the Parquet files to S3, for example:
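A rough sketch of that workaround; the paths are placeholders, df is the dataframe from the question, and s3-dist-cp can also be run as a separate EMR step instead of via subprocess:

import subprocess

# write to HDFS first so S3 is not touched during the Spark job
df.write.mode("append").partitionBy("id", "day").parquet("hdfs:///tmp/foo.parquet")

# then push the finished files to S3 in one pass
subprocess.run(
    ["s3-dist-cp", "--src", "hdfs:///tmp/foo.parquet", "--dest", "s3://myBucket/foo.parquet"],
    check=True,
)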
Switch this over to using Dynamic Partition Overwrite Mode using:
.config("spark.sql.sources.partitionOverwriteMode", "dynamic")
Also, avoid the DirectParquetOutputCommitter; leave the committer setting unmodified instead - you will get better speed using the EMRFS file committer.
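A hedged sketch of dynamic partition overwrite end to end; the source path is a placeholder, and id and day are the partition columns from the question:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

# placeholder source for the new batch of data
df = spark.read.json("s3://myBucket/incoming/")

# with dynamic mode, overwrite only replaces the id/day partitions
# present in this batch; untouched partitions are left as-is
(df.write
   .mode("overwrite")
   .partitionBy("id", "day")
   .parquet("s3://myBucket/foo.parquet"))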

How to see output in Amazon EMR/S3?

I am new to Amazon Web Services and tried to run an application in Amazon EMR.
For that, I followed these steps:
1) Created the Hive script, which contains a CREATE TABLE statement, a LOAD DATA statement for a file, and a SELECT * FROM command.
2) Created the S3 bucket and loaded the objects into it: the Hive script and the file to load into the table.
3) Then created the job flow (using the sample Hive program), giving it the input, output, and script paths (like s3n://bucketname/script.q, s3n://bucketname/input.txt, s3n://bucketname/out/). I didn't create the out directory; I assumed it would be created automatically.
4) The job flow then started to run, and after some time I saw the states STARTING, BOOTSTRAPPING, RUNNING, and SHUT DOWN.
5) During the SHUT DOWN state, it terminated automatically, showing a FAILED status.
Then on S3, I didn't see the out directory. How do I see the output? I saw directories like daemons, nodes, etc.
And how do I see the data from HDFS in Amazon EMR?
The output path that you specified in step 3 should contain your results (From your description, it is s3n://bucketname/out/)
If it doesn't, something went wrong with your Hive script. If your Hive job failed, you will find information about the failure/exception in the jobtracker log. The jobtracker log exists under <s3 log location>/daemons/<master instance name>/hadoop-hadoop-jobtracker-<some Amazon internal IP>.log
Only one file in your logs directory will have its S3 key in the above format. This file will contain any exceptions that may have happened. You probably want to concentrate on the bottom of the file.