We use UNLOAD commands to run some transformations on S3-based external tables and publish the data to a different S3 bucket in PARQUET format.
I use the ALLOWOVERWRITE option in the UNLOAD operation to replace the files if they already exist. This works in most cases, but at times it writes duplicate files for the same data, which causes the external table to show duplicated numbers.
For example, suppose the Parquet file in the partition is 0000_part_00.parquet and contains the complete data. In the next run, UNLOAD is expected to overwrite this file, but instead it writes a new file, 0000_part_01.parquet, which doubles the total output.
The problem does not recur if I clean up the entire partition and rerun. This inconsistency is making our system unreliable.
unload (<simple select statement>)
to 's3://<s3 bucket>/<prefix>/'
iam_role '<iam-role>' allowoverwrite
PARQUET
PARTITION BY (partition_col1, partition_col2);
Thank you.
To prevent redundant data, you must use Redshift's CLEANPATH option in your UNLOAD statement. Note the difference, from the documentation (Perhaps AWS could clear this up a bit more):
ALLOWOVERWRITE
By default, UNLOAD fails if it finds files that it would possibly overwrite. If ALLOWOVERWRITE is specified, UNLOAD overwrites existing files, including the manifest file.
CLEANPATH
The CLEANPATH option removes existing files located in the Amazon S3 path specified in the TO clause before unloading files to the specified location.
If you include the PARTITION BY clause, existing files are removed only from the partition folders to receive new files generated by the UNLOAD operation.
You must have the s3:DeleteObject permission on the Amazon S3 bucket. For information, see Policies and Permissions in Amazon S3 in the Amazon Simple Storage Service Console User Guide. Files that you remove by using the `CLEANPATH` option are permanently deleted and can't be recovered.
You can't specify the `CLEANPATH` option if you specify the `ALLOWOVERWRITE` option.
Therefore, as @Vzzarr says, ALLOWOVERWRITE only overwrites files that have the same names as the incoming files. For recurring unload operations that do not need the previously unloaded data to remain intact, you must use CLEANPATH.
And note that you cannot use both ALLOWOVERWRITE and CLEANPATH in the same UNLOAD statement.
Here's an example:
f"""
UNLOAD ('{your_query}')
TO 's3://{destination_prefix}/'
iam_role '{IAM_ROLE_ARN}'
PARQUET
MAXFILESIZE 4 GB
MANIFEST verbose
CLEANPATH
"""
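A minimal sketch of how the statement could be assembled in Python so that CLEANPATH and ALLOWOVERWRITE never end up in the same UNLOAD. The helper name `build_unload_sql` and its parameters are hypothetical and not part of any Redshift client library; it only builds the SQL text, and it assumes single quotes inside the query are escaped by doubling them.

```python
def build_unload_sql(query, s3_prefix, iam_role_arn,
                     partition_by=(), cleanpath=True, allowoverwrite=False):
    """Build an UNLOAD statement (hypothetical helper, text only)."""
    # Redshift rejects statements that specify both options.
    if cleanpath and allowoverwrite:
        raise ValueError("CLEANPATH and ALLOWOVERWRITE cannot be combined")

    # The query is passed to UNLOAD as a quoted string, so inner single
    # quotes are doubled.
    escaped_query = query.replace("'", "''")

    lines = [
        f"UNLOAD ('{escaped_query}')",
        f"TO '{s3_prefix}'",
        f"IAM_ROLE '{iam_role_arn}'",
        "PARQUET",
    ]
    if partition_by:
        lines.append("PARTITION BY (" + ", ".join(partition_by) + ")")
    if cleanpath:
        lines.append("CLEANPATH")
    elif allowoverwrite:
        lines.append("ALLOWOVERWRITE")
    return "\n".join(lines) + ";"
```

With PARTITION BY, CLEANPATH only clears the partition folders that receive new files, so this fits a recurring job that rewrites a few partitions per run.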
From my experience the ALLOWOVERWRITE parameter is only based on the generated file names: so a result is overwritten only if 2 files have the same name.
This parameter works in most cases, but in this domain "most cases" is not good enough. I have stopped using it since (and I was quite disappointed). What I do instead is manually delete the files from the S3 console (or actually move them to a staging folder) and then unload the data without relying on the ALLOWOVERWRITE parameter.
This is also mentioned in the comments of this answer: https://stackoverflow.com/a/61594603/4725074
I have a source bucket where small 5KB JSON files will be inserted every second.
I want to use AWS Athena to query the files by using an AWS Glue Datasource and crawler.
For better query performance AWS Athena recommends larger file sizes.
So I want to copy the files from the source bucket to bucket2 and merge them.
I am planning to use S3 events to put a message in AWS SQS for each file created. A Lambda will then be invoked with a batch of x SQS messages, read the data in those files, combine it and save it to the destination bucket. bucket2 will then be the source for the AWS Glue crawler.
Is this the best approach, or am I missing something?
Instead of receiving a 5KB JSON file every second in Amazon S3, the best situation would be to receive this data via Amazon Kinesis Data Firehose, which can automatically combine data based on either size or time period. It would output fewer, larger files.
You could also achieve this with a slight change to your current setup:
When a file is uploaded to S3, trigger an AWS Lambda function
The Lambda function reads the file and sends it to Amazon Kinesis Data Firehose
Kinesis Firehose then batches the data by size or time
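The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested deployment: the stream name is hypothetical, each uploaded file is assumed to hold one compact JSON document, and the S3 and Firehose clients are passed in (in a real Lambda they would come from `boto3.client("s3")` and `boto3.client("firehose")`).

```python
import urllib.parse

STREAM_NAME = "example-delivery-stream"  # hypothetical delivery stream name


def records_from_s3_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        yield bucket, key


def forward_to_firehose(event, s3_client, firehose_client,
                        stream_name=STREAM_NAME):
    """Read each uploaded object and send its content to Firehose."""
    forwarded = 0
    for bucket, key in records_from_s3_event(event):
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Firehose concatenates records as-is, so end each one with a
        # newline to keep one JSON document per line in the output.
        if not body.endswith(b"\n"):
            body += b"\n"
        firehose_client.put_record(
            DeliveryStreamName=stream_name,
            Record={"Data": body},
        )
        forwarded += 1
    return forwarded
```

Passing the clients in keeps the function easy to exercise with stand-in objects before wiring it to a real Lambda handler.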
Alternatively, you could use Amazon Athena to read data from multiple S3 objects and output them into a new table that uses Snappy-compressed Parquet files. This file format is very efficient for querying. However, your issue is that the files are arriving every second so it is difficult to query the incoming files in batches (so you know which files have been loaded and which ones have not been loaded). A kludge could be a script that does the following:
Create an external table in Athena that points to a batching directory (eg batch/)
Create an external table in Athena that points to the final data (eg final/)
Have incoming files come into incoming/
At regular intervals, trigger a Lambda function that will list the objects in incoming/, copy them to batch/ and delete those source objects from incoming/ (any objects that arrive during this copy process will be left for the next batch)
In Athena, run INSERT INTO final SELECT * FROM batch
Delete the contents of the batch/ directory
This will append the data into the final table in Athena, in a format that is good for querying.
However, the Kinesis Firehose option is simpler, even if you need to trigger Lambda to send the files to the Firehose.
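For the batching kludge, the step that moves objects from incoming/ to batch/ could look something like the sketch below. The function name and default prefixes are illustrative, and the S3 client is passed in (for example `boto3.client("s3")`). The keys are listed once up front, so objects that arrive while the copy is running are left for the next batch.

```python
def move_incoming_to_batch(s3_client, bucket,
                           incoming_prefix="incoming/",
                           batch_prefix="batch/"):
    """Copy every object under incoming/ to batch/ and delete the source."""
    # Take a snapshot of the keys first; later arrivals wait for the next run.
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": incoming_prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]

    moved = []
    for key in keys:
        target = batch_prefix + key[len(incoming_prefix):]
        s3_client.copy_object(
            Bucket=bucket,
            Key=target,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved.append(target)
    return moved
```

After this returns, the INSERT INTO final SELECT * FROM batch query would run in Athena, followed by clearing batch/.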
You can probably achieve that using Glue itself. Have a look here: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/join_and_relationalize.md
This is what I think would be simpler:
Have an input folder, input/, where the 5KB/1KB files land, and a data/ folder that holds JSON files with a maximum size of 200MB.
Have a Lambda that runs every minute, reads a set of files from input/ and appends them to the last file in the data/ folder, using Golang or Java.
The Lambda (with max concurrency set to 1) copies a set of 5KB files from input/ and the X MB file from the data/ folder into its /tmp folder, merges them, uploads the merged file to data/ and deletes the files from the input/ folder.
Whenever the file size crosses 200MB, create a new file in the data/ folder.
The advantage here is that at any instant, if somebody wants the data, it is the union of the input/ and data/ folders.
With a few tweaks here and there, you can expose a view on top of the input/ and data/ folders that presents a de-duplicated snapshot of the final data.
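The merge-and-roll-over logic described above might be sketched like this. The answer suggests Golang or Java; Python is used here purely for illustration, and the function name is hypothetical. It works on bytes already downloaded to the Lambda and assumes each small file is newline-delimited JSON (or a single JSON document).

```python
MAX_BYTES = 200 * 1024 * 1024  # the 200MB cap from the description


def merge_into_chunks(last_chunk, small_files, max_bytes=MAX_BYTES):
    """Append small payloads to the last data file, rolling over at the cap.

    The first returned chunk replaces the current last file in data/;
    any further chunks are new files.
    """
    chunks = []
    parts = [last_chunk] if last_chunk else []
    size = len(last_chunk)
    for payload in small_files:
        if not payload.endswith(b"\n"):
            payload += b"\n"
        # Start a new file once adding this payload would exceed the cap.
        if size and size + len(payload) > max_bytes:
            chunks.append(b"".join(parts))
            parts, size = [], 0
        parts.append(payload)
        size += len(payload)
    if parts:
        chunks.append(b"".join(parts))
    return chunks
```

A single payload larger than the cap still gets a file of its own, since the sketch never splits an input file.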
I ended up manually deleting some Delta Lake entries (hosted on S3).
Now my Spark job is failing because the Delta transaction logs point to files that do not exist in the file system.
I came across this: https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-fsck.html
but I am not sure how I should run this utility in my case.
You could easily do that by following the document you linked.
I did it as below, for the case where you have a Hive table on top of your S3 data:
%sql
FSCK REPAIR TABLE schema.testtable DRY RUN
Using DRY RUN will list the file entries that would be removed from the transaction log. You can first run the above command and verify which entries actually need to be removed.
Once you have verified that, you can run the same command without DRY RUN and it should do what you need.
%sql
FSCK REPAIR TABLE schema.testtable
If you have not created a Hive table and instead have a path (Delta table) where the files are, then you can do it as below:
%sql
FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN
I am doing this from Databricks and have mounted my S3 bucket path in Databricks.
You need to make sure that you have the ` symbol after delta. and before the actual path, otherwise it won't work.
Here too, to perform the actual repair operation, you can remove DRY RUN from the above command and it should do what you want.
I am doing an INSERT OVERWRITE operation through a Hive external table onto AWS S3. Hive creates an output file 000000_0 on S3. However, at times I notice that it creates files with other names, like 0000003_0. I always need to overwrite the existing file, but with inconsistent file names I am unable to do so. How do I force Hive to always create a consistent file name like 000000_0? Below is an example of what my code looks like, where tab_content is a Hive external table.
INSERT OVERWRITE TABLE tab_content
PARTITION(datekey)
select * from source
Better not to do this; instead, modify your program to accept any number of files in the directory.
Each reducer (or mapper, if the job is map-only) creates its own file. The reducers know nothing about each other; the files are named during creation. Files are named 000001_0, 000002_0 and so on. But a file can also be named 000001_1 if attempt number 0 failed and attempt number 1 succeeded. Also, if the table is partitioned and there is no DISTRIBUTE BY on the partition key at the end, each reducer will create its own file in each partition.
You can force it to run on a single final reducer (for example, by adding an ORDER BY clause or by setting set mapred.reduce.tasks = 1;). But bear in mind that this solution is not scalable, because too much data will cause performance problems on a single reducer. Also, what happens if attempt 0 fails, the task is restarted and attempt 1 succeeds? It will create 000001_1 instead of 000001_0.
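As an illustration of "accept any number of files in the directory", a consumer could read every data file in a partition folder regardless of its name. This is a sketch that assumes the partition has been synced to a local directory and holds plain text files; the function name is hypothetical. Names starting with "_" or "." are skipped because Hadoop tooling treats them as hidden or marker files.

```python
from pathlib import Path


def read_partition_rows(partition_dir):
    """Return all lines from every data file in a partition directory."""
    rows = []
    for path in sorted(Path(partition_dir).iterdir()):
        # Skip subdirectories and hidden or marker files such as _SUCCESS.
        if not path.is_file() or path.name.startswith((".", "_")):
            continue
        with path.open("r", encoding="utf-8") as handle:
            rows.extend(line.rstrip("\n") for line in handle)
    return rows
```

Whether Hive writes 000000_0, 000003_0 or 000001_1 then makes no difference to the reader.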
I'm working on a Spring project that needs to export Redshift table data into a single local CSV file. The current approach is to:
Execute Redshift UNLOAD to write data across multiple files to S3 via JDBC
Download said files from S3 to local
Join them together into one single CSV file
UNLOAD (
'SELECT DISTINCT #{#TYPE_ID}
FROM target_audience
WHERE #{#TYPE_ID} is not null
AND #{#TYPE_ID} != \'\'
GROUP BY #{#TYPE_ID}'
)
TO '#{#s3basepath}#{#s3jobpath}target_audience#{#unique}_'
credentials 'aws_access_key_id=#{#accesskey};aws_secret_access_key=#{#secretkey}'
DELIMITER AS ',' ESCAPE GZIP ;
The above approach has worked fine. But I think the overall performance can be improved by, for example, skipping the S3 part and getting the data directly from Redshift to local.
After searching through online resources, I found that you can export data from Redshift directly through psql, or perform SELECT queries and move the result data yourself. But neither option can top the performance of Redshift UNLOAD with parallel writing.
So is there any way I can mimic UNLOAD's parallel writing to achieve the same performance without having to go through S3?
You can avoid the need to join files together by using UNLOAD with the PARALLEL OFF parameter. It will output only one file.
This will, however, create multiple files if the filesize exceeds 6.2GB.
See: UNLOAD - Amazon Redshift
It is doubtful that you would get better performance by running psql, but if performance is important for you then you can certainly test the various methods.
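If the output still ends up in several gzip parts (parallel UNLOAD, or a result above the size limit), the join step can be done in one streaming pass. This is a sketch under the assumption that the parts were unloaded with GZIP and without a header row, so plain concatenation in name order yields a valid CSV; the function name is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def join_gzip_parts(part_paths, output_csv):
    """Decompress each gzip part and append it to a single CSV file."""
    with open(output_csv, "wb") as out:
        for part in sorted(part_paths):
            with gzip.open(part, "rb") as src:
                # Stream in blocks so large parts never sit fully in memory.
                shutil.copyfileobj(src, out)
    return Path(output_csv)
```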
We do exactly what you're trying to do here. In our performance comparison, it was found to be almost the same, or even better in some cases, for our use case. It is also easier in terms of programming and debugging, as there is practically only one step.
# replace user, password, host, region and dbname appropriately in the command below
psql "postgresql://user:password@xxx1.xxxx.us-region-1.redshift.amazonaws.com:5439/dbname?sslmode=require" -c "select C1,C2 from sch1.tab1" > ABC.csv
This enables us to avoid 3 steps:
Unload using JDBC
Download the exported data from S3
Decompress the gzip file (we used gzip to save network I/O).
On the other hand, it also saves some cost (S3 storage, though it's negligible).
By the way, from pgsql (9.0+) onward, sslcompression is on by default.
I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should be run every day with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into the raw folder. A Cloud Function is then run to move 1 file from the raw folder into the queue folder if the queue folder is empty, and do nothing otherwise.
Dataprep scans the queue folder as per the scheduler. If a csv file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function is run whenever a new file is added to the wrangled folder. This one creates a new BigQuery table named according to the timestamp column in the csv file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and then moves 1 file from the raw folder to the queue folder, if there is any.
Repeat until all tables are created.
This seems overly complicated to me and I'm not sure how to handle cases where the Cloud functions fail to do their job.
Any other suggestion for my use-case is appreciated.