How to check Redshift COPY command performance from AWS S3?

I'm working on an application in which I'll be loading data into Redshift.
I want to upload the files to S3 and use the COPY command to load the data into multiple tables.
For every such iteration, I need to load the data into around 20 tables.
Right now I'm creating 20 CSV files per iteration, one for each of the 20 tables, and loading them; for the next iteration, 20 new CSV files are created and loaded into Redshift.
With the current system, each CSV file may contain a maximum of 1,000 rows, so at most 20,000 rows are loaded per iteration across the 20 tables.
I want to improve the performance even more. I've gone through https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
At this point, I'm not sure how long it will take for one file to load into one Redshift table. Is it really worth splitting every file into multiple files and loading them in parallel?
Is there any source or calculator that gives approximate performance metrics for loading data into Redshift tables based on the number of columns and rows, so that I can decide whether to go ahead with splitting files even before moving to Redshift?

You should also read through the recommendations in the Load Data - Best Practices guide: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
Regarding the number of files and loading data in parallel, the recommendations are:
Loading data from a single file forces Redshift to perform a serialized load, which is much slower than a parallel load.
Load data files should be split so that the files are about equal size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB and 125 MB after compression.
The number of files should be a multiple of the number of slices in your cluster.
That last point is significant for achieving maximum throughput: if your cluster has 8 slices then you want n*8 files, e.g. 16, 32, 64, and so on, so that all slices are doing maximum work in parallel.
That said, 20,000 rows is such a small amount of data in Redshift terms that I'm not sure any further optimisation would make a significant difference to the speed of your process as it stands.
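For reference, a single COPY pointed at a common S3 key prefix loads all matching split files in parallel. A minimal sketch, assuming hypothetical bucket, prefix, table and IAM role names, and gzipped CSV parts:

-- Hypothetical names; one COPY loads every gzipped CSV part under the prefix in parallel.
COPY my_table
FROM 's3://my-bucket/iteration-001/my_table_part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV
GZIP;

Redshift treats the FROM value as a key prefix, so my_table_part_00.csv.gz, my_table_part_01.csv.gz, and so on would all be picked up by that one command.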

Related

Hive ALTER TABLE CONCATENATE command risks

I have been using the Tez engine to run MapReduce jobs. I have an MR job which takes ages to run, because I noticed I have over 20k files with 1 stripe each, and Tez does not distribute mappers evenly based on the number of files but rather on the number of stripes. So I can have some mappers with 1 file but a lot of stripes, and some mappers processing 15k files but with the same number of stripes as the others.
As a workaround test, I used ALTER TABLE table PARTITION (...) CONCATENATE in order to bring down the number of files to process into more evenly distributed stripes per file, and now the map job runs perfectly fine.
My concern is that I didn't find anything in the documentation about whether there is any risk of running this command and losing data, since it works on the same files.
I'm trying to assess whether it's better to use CONCATENATE to bring down the number of files before the MR job, versus using bucketing, which reads the files and drops the bucketed output into a separate location, so that in case of failure I don't lose the source data.
CONCATENATE takes 1 minute per partition, versus bucketing taking more time but not risking the source data.
My question: is there any risk of data loss when running the CONCATENATE command?
Thanks!
It should be as safe as rewriting the table from a query. It uses the same mechanism: the result is prepared in staging first, and after that the staging output is moved to the table or partition location.
Concatenation runs as a separate MR job, prepares the concatenated files in a staging directory, and only if everything finished without errors moves them to the table location. You should see something like this in the logs:
INFO : Loading data to table dbname.tblName partition (bla bla) from /apps/hive/warehouse/dbname.db/tblName/bla bla partition path/.hive-staging_hive_2018-08-16_21-28-01_294_168641035365555493-149145/-ext-10000
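For completeness, the statement in question, with the table name taken from that log line and a placeholder partition spec, looks like this:

-- Partition key and value here are placeholders; the concatenation runs as its own
-- MR/Tez job and the staged output is only swapped in if that job succeeds.
ALTER TABLE dbname.tblName PARTITION (dt = '2018-08-16') CONCATENATE;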

Is partitioning helpful in Amazon Athena if query doesn't filter based on partition?

I have a large amount of data, but there is no particular column I would like to filter based on (that is, my 'where clause' can be any column). In this scenario, does partitioning provide any benefit (maybe helps with read-parallelism?) when the queries end up scanning all the data?
If there is no column that all, or most, queries would filter on, then partitions will only hurt performance. Instead, aim for files of around 100 MB, as few as possible, in Parquet if possible, and put all of them directly under the table's LOCATION.
The reason why partitions would hurt performance is that when Athena starts executing your query it will list all files, and the way it does it is as if S3 was a file system. It starts by listing the table's LOCATION, and if it finds anything that looks like a directory it will list it separately, and so on, recursively. If you have a deep directory structure this can end up taking a lot of time. You want to help Athena by having all your files in a flat structure, but also fewer than 1000 of them, because that's the page size for S3's list operation. With more than 1000 files you want to have directories so that Athena can parallelize the listing (but as few as possible still, because there's a limit to how many listings it will do in parallel).
You want to keep file sizes to around 100 MB because that's a good size that trades off how long it takes to process a file against the overhead of getting it from S3. The exact recommendation is 128 MB.
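Putting that together, an unpartitioned layout just means the table DDL points at one flat prefix. A minimal sketch, with hypothetical bucket and column names, where all the ~100 MB Parquet files sit directly under LOCATION:

-- Hypothetical columns and bucket; no PARTITIONED BY clause, flat key prefix.
CREATE EXTERNAL TABLE events (
  id string,
  created_at timestamp,
  payload string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/events/';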

S3 partition (file size) for efficient Athena query

I have a pipeline that loads daily records into S3. I then use an AWS Glue Crawler to create partitions to facilitate AWS Athena queries. However, one partition's data is much larger than the others'.
S3 folders/files are displayed as follows:
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/00/00/2019-00-00.parquet.gzip') 7.8 MB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/11/2019-01-11.parquet.gzip') 29.8 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/12/2019-01-12.parquet.gzip') 28.5 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/13/2019-01-13.parquet.gzip') 29.0 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/14/2019-01-14.parquet.gzip') 43.3 KB
s3.ObjectSummary(bucket_name='bucket', key='database/table/2019/01/15/2019-01-15.parquet.gzip') 139.9 KB
with the file size displayed at the end of each line. Note that 2019-00-00.parquet.gzip contains all records before 2019-01-11, and therefore its size is large. I have read this and it says that "If your data is heavily skewed to one partition value, and most queries use that value, then the overhead may wipe out the initial benefit."
So I wonder whether I should split 2019-00-00.parquet.gzip into smaller Parquet files with different partitions. For example,
key='database/table/2019/00/00/2019-00-01.parquet.gzip',
key='database/table/2019/00/00/2019-00-02.parquet.gzip',
key='database/table/2019/00/00/2019-00-03.parquet.gzip', ......
However, I suppose this partitioning is not so useful, as it does not reflect when the old records were stored. I am open to all workarounds. Thank you.
If the full size of your data is less than a couple of gigabytes in total, you don't need to partition your table at all. Partitioning small datasets hurts performance much more than it helps. Keep all the files in the same directory; deep directory structures in unpartitioned tables also hurt performance.
For small datasets you'll be better off without partitioning as long as there aren't too many files (try to keep it below a hundred). If you for some reason must have lots of small files you might get benefits from partitioning, but benchmark it in that case.
When the size of the data is small, like in your case, the overhead of finding the files on S3, opening, and reading them will be higher than actually processing them.
If your data grows to hundreds of megabytes you can start thinking about partitioning, and aim for a partitioning scheme where partitions are around a hundred megabytes to a gigabyte in size. If there is a time component to your data, which there seems to be in your case, time is the best thing to partition on. Start by looking at using year as partition key, then month, and so on. Exactly how to partition your data depends on the query patterns, of course.
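If you do reach that size, one hedged way to split the skewed file along a time-based scheme is an Athena CTAS that rewrites the data into year/month partitions. The table, column, and bucket names below are assumptions, not taken from your setup:

-- Assumed source table "events" with a created_at timestamp column;
-- CTAS rewrites it as Parquet partitioned by year and month.
CREATE TABLE events_by_month
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/events-by-month/',
  partitioned_by = ARRAY['year', 'month']
) AS
SELECT
  id,
  created_at,
  payload,
  date_format(created_at, '%Y') AS year,
  date_format(created_at, '%m') AS month
FROM events;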

Select Query output in parallel streams

I need to spool over 20 million records into a flat file. A direct select query would be too time-consuming. I feel the need to generate the output in parallel based on portions of the data, i.e. having 10 select queries over 10% of the data each, running in parallel, and then sorting and merging on UNIX.
I can use ROWNUM ranges to do this; however, this would be tedious and static, and would need to be updated every time the row counts change.
Is there a better alternative available?
If the data in SQL is well spread out over multiple spindles and not all on one disk, and the IO and network channels are not saturated currently, splitting into separate streams may reduce your elapsed time. It may also introduce random access on one or more source hard drives which will cripple your throughput. Reading in anything other than cluster sequence will induce disk contention.
The optimal scenario here would be for your source table to be partitioned, that each partition is on separate storage (or very well striped), and each reader process is aligned with a partition boundary.
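As a sketch of the splitting itself, a hash on the key column avoids the ROWNUM bookkeeping: each of the N extraction sessions runs the same query with a different stream number. The table and column names here are assumptions:

-- One of 10 extraction streams; :stream_id is bound to 0..9, one value per session.
-- ORA_HASH gives a roughly even split without maintaining ROWNUM ranges.
SELECT *
FROM   big_table
WHERE  MOD(ORA_HASH(id), 10) = :stream_id;

Each stream still scans the whole table unless it is partitioned, so this only pays off if the I/O and partition-alignment caveats above hold.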

SSIS DataFlowTask DefaultBufferSize and DefaultBufferMaxRows

I have a task which pulls records from an Oracle DB into our SQL Server using a Data Flow Task. This package runs every day for around 45 minutes and refreshes about 15 tables. Except for one, they are all incremental updates, so almost every task runs in 2 to 10 minutes.
The one table that is fully replaced takes up to 25 minutes. I want to tune this Data Flow Task to run faster.
There are just 400k rows in the table. I have read some articles about DefaultBufferSize and DefaultBufferMaxRows, and I have the doubts below.
I can set DefaultBufferSize up to 100 MB. Is there any place to look, or a way to analyse, how much I should provide?
DefaultBufferMaxRows is set to 10k. If I set it to 50k but provide 10 MB for DefaultBufferSize, which can only hold up to some 20k rows, what will SSIS do? Just ignore the other 30k rows, or will it still pull all 50k rows (spooling)?
Can I use the logging options to set proper limits?
As a general practice (and if you have enough memory), a smaller number of large buffers is better than a larger number of small buffers, but not to the point where you have paging to disk (which is bad for obvious reasons).
To test it, you can log the event BufferSizeTuning, which will show you how many rows are in each buffer.
Also, before you begin adjusting the sizing of the buffers, the most important improvement that you can make is to reduce the size of each row of data by removing unneeded columns and by configuring data types appropriately.
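On that last point, the trimming usually happens in the source query itself rather than in the buffer settings. A hedged example of a narrowed OLE DB source query against the Oracle side, with placeholder table and column names:

-- Select only the columns the destination needs and cast to the narrowest types
-- that fit, so each row is smaller and more rows fit into every buffer.
SELECT order_id,
       CAST(order_date AS DATE)          AS order_date,
       CAST(amount     AS NUMBER(12, 2)) AS amount
FROM   source_orders;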