I'm using AWS COPY to load log files from my S3 bucket into a table in my Redshift cluster. Each file is approximately 100 MB and I haven't gzipped them yet. I have 600 of these files now, and the number is still growing. My cluster has 2 dc1.large compute nodes and one leader node.
The problem is that the COPY operation takes too long, at least 40 minutes. What is the best approach to speed it up?
1) Should I get more nodes or a better machine for the nodes?
2) If I gzip the files, will it really matter in terms of COPY time?
3) Is there some design pattern that helps here?
Rodrigo,
Here are the answers:
1 - There is probably some optimization you can do before changing your hardware setup. You would have to test to be sure, but once all optimizations are done, if you still need better performance, I would suggest adding more nodes.
2 - Gzipped files are likely to give you a performance boost, but I suspect there are other optimizations you need to make first. See this recommendation in the Redshift documentation: http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-compress-data-files.html
3 - Here are the things you should look at, in order of importance (a rough sketch tying them together follows at the end of this answer):
Distribution key -- Does your distribution key provide nice distribution across multiple slices? If you have a "bad" distribution key, that would explain the problem you are seeing.
Encoding -- Make sure the encoding is optimal. Use the ANALYZE COMPRESSION command.
Sort Key -- Did you choose a sort key that is appropriate for this table? Having a good sort key can have a dramatic impact on compression, which in turn impacts read and write times.
Vacuum -- If you have been performing multiple tests on this table, did you vacuum between the tests? Redshift does not remove the data after a delete or update (an update is processed as a delete plus an insert, rather than an in-place update).
Multiple files -- You should have a large number of files. You already do that, but this may be good advice in general for someone trying to load data into Redshift.
Manifest file -- Use a manifest file to allow Redshift to parallelize your load.
I would expect a load of 60GB to go faster than what you have seen, even in a 2-node cluster. Check these 6 items and let us know.
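To make the checklist concrete, here is a rough sketch of how the pieces might fit together using psycopg2 and boto3: a table with an explicit distribution and sort key, a manifest covering the gzipped files, a manifest-driven COPY, and the ANALYZE COMPRESSION / VACUUM maintenance steps. The bucket, IAM role, table, and column names are all placeholders, and it assumes the log files are tab-delimited and already gzipped before upload (the GZIP option tells COPY to decompress them).

```python
# Rough sketch only -- bucket, IAM role, table, and columns are placeholders.
import json

import boto3
import psycopg2

BUCKET = "my-log-bucket"                                   # placeholder
ROLE = "arn:aws:iam::123456789012:role/RedshiftCopyRole"   # placeholder

# 1. Build a manifest listing the gzipped log files so Redshift can
#    parallelize the load across all slices.
s3 = boto3.client("s3")
keys = [o["Key"] for o in s3.list_objects_v2(Bucket=BUCKET, Prefix="logs/")["Contents"]]
manifest = {"entries": [{"url": f"s3://{BUCKET}/{k}", "mandatory": True} for k in keys]}
s3.put_object(Bucket=BUCKET, Key="manifests/logs.manifest",
              Body=json.dumps(manifest).encode())

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="logs", user="admin", password="...")
conn.autocommit = True   # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# 2. Pick a distribution key that spreads rows evenly across slices and a
#    sort key that matches your query pattern (columns here are invented).
cur.execute("""
    CREATE TABLE IF NOT EXISTS access_log (
        request_time TIMESTAMP,
        user_id      BIGINT,
        url          VARCHAR(2048),
        status       SMALLINT
    )
    DISTKEY (user_id)
    SORTKEY (request_time);
""")

# 3. Load via the manifest, telling COPY the files are gzipped.
cur.execute(f"""
    COPY access_log
    FROM 's3://{BUCKET}/manifests/logs.manifest'
    IAM_ROLE '{ROLE}'
    MANIFEST
    GZIP
    DELIMITER '\\t';
""")

# 4. Let Redshift suggest column encodings, then reclaim space and re-sort.
cur.execute("ANALYZE COMPRESSION access_log;")
print(cur.fetchall())
cur.execute("VACUUM access_log;")
cur.execute("ANALYZE access_log;")
```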
Thanks
#BigDataKid
I will be ingesting about 20 years of data, in files with millions of rows and about 500 columns. Reading through the Snowflake (SF) documentation, I saw that I should load the files in an order that would allow SF to create micro-partitions (MP) with metadata optimized for pruning. However, I am concerned because I will be updating previously loaded records, which could ruin the integrity of the MPs. Is there a best practice for handling updates? Might I at some point need to reorganize the table data to regain its performance structure? Are cluster keys adequate for handling this, or should I consider a combination of the two? I am planning on splitting the load files into logical combinations that would also support the proper metadata definitions, but I am also wondering whether there is a preferred limit on the number of columns. If there is a known best-practice document, please let me know. Thanks. hs
You don't need to worry about any of that with regard to how the data is stored. Snowflake doesn't do in-place updates; it only writes new micro-partitions.
Performance with updates will obviously be slower than with pure inserts, but that's a different issue.
I am trying to load a series of CSV files, ranging from 100 MB to 20 GB in size (about 3 TB in total), so I need every performance enhancement I can get. I am aiming to use filegroups and partitioning as a means to that end, and I performed a series of tests to find the optimal approach.
First, I tried various filegroup combinations; the best result I got was when loading into a table on a single filegroup with multiple files assigned to it, all sitting on one disk. This combination outperformed the cases where I used multiple filegroups.
The next step was naturally to add partitioning. Oddly, every partitioning combination I examined performed worse. I tried defining various partition functions/schemes and various filegroup combinations, but all of them showed lower loading speed.
I am wondering what I am missing here!?
So far, I have managed to load (using BULK INSERT) a 1 GB CSV file in 3 minutes. Any ideas are much appreciated.
To get optimal data loading speed, you first need to understand the SQL Server data load process, which means understanding how SQL Server achieves the optimizations below:
Minimal Logging.
Parallel Loading.
Locking Optimization.
These two articles explain in detail how you can achieve all of the above optimizations: Fastest Data Loading using Bulk Load and Minimal Logging, and Bulk Loading data into HEAP versus CLUSTERED Table.
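As a rough illustration (a sketch under assumptions, not a drop-in solution), the snippet below shows the kind of load those articles describe: a BULK INSERT into a heap with TABLOCK, with the database in the BULK_LOGGED recovery model so the load can be minimally logged. The driver name, server, database, table, and file path are all placeholders.

```python
# Sketch only -- connection details, table, and file paths are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"   # adjust to whichever driver you have
    "SERVER=localhost;DATABASE=Staging;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Minimal logging requires a non-FULL recovery model...
cur.execute("ALTER DATABASE Staging SET RECOVERY BULK_LOGGED;")

# ...and a bulk load into a heap (no clustered index) taken with TABLOCK.
cur.execute("""
    BULK INSERT dbo.RawEvents
    FROM 'D:\\loads\\events_0001.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\\n',
        TABLOCK,             -- enables minimal logging on a heap
        BATCHSIZE = 500000   -- commit in chunks so one failure does not roll back everything
    );
""")
```

For parallel loading, you would run one such BULK INSERT per file from separate sessions; on a heap loaded with TABLOCK, the bulk loads can proceed concurrently.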
Hope this helps.
I have recently discovered MonetDB and I am evaluating it for an internal project, so my questions probably come from a real newbie's point of view. Maybe someone could point me to a site and/or document where I can find more info (I haven't found much by googling).
Regarding scalability, please correct me if I am wrong, but my understanding is that if I need to scale, I would launch more server instances and discover them from the control node. Is that right?
Is there any limit on the number of servers?
The other point is about storage: is it possible to use Amazon S3 to back MonetDB read-only instances?
Update: we would need to store a massive amount of Call Detail Records from different sources, on a read-only basis. We would aggregate/reduce that data for day-to-day operation, accessing the bigger tables only when full detail is required.
We would store the historical data as well, to perform longer-term analysis. My concern is mostly about memory; disk storage wouldn't be the issue, I think. If the hot dataset involved in a report/analysis eats up the whole memory space (fast response times are needed, and I'm not sure how much memory swapping would hurt), I would like to know whether I can scale somehow instead of re-engineering the report/analysis process (maybe I am biased by the horizontal scaling thing :-) ).
Thanks!
You will easily find the advantages of MonetDB on the net, so let me highlight some disadvantages.
1. In MonetDB, deleting rows does not free up the space.
Solution: copy the data into another table, drop the existing table, and rename the new table (see the sketch after this list).
2. Joins are a little slower.
3. You cannot use a table name as a dynamic variable.
E.g., if you have table names stored in a main table, you can't write a query like "for each (select tablename from mytable) select data from tablename".
You can't write functions that take a table name as a variable argument.
But it is still damn fast and can store large amounts of data.
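For point 1, here is a rough sketch of the copy/drop/rename workaround using pymonetdb; the connection settings and table name are made up, and the ALTER TABLE ... RENAME TO step assumes a reasonably recent MonetDB release.

```python
# Sketch of the "copy, drop, rename" workaround for reclaiming space after deletes.
# Connection settings and table names are placeholders.
import pymonetdb

conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="demo")
cur = conn.cursor()

# Copy the surviving rows (the deleted ones are gone logically, but their
# space has not been released) into a fresh table.
cur.execute("CREATE TABLE cdr_new AS SELECT * FROM cdr WITH DATA;")

# Dropping the old table is what actually releases the space;
# then give the new table the old name.
cur.execute("DROP TABLE cdr;")
cur.execute("ALTER TABLE cdr_new RENAME TO cdr;")
conn.commit()
```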
I'm setting a home server primarily for backup use. I have about 90GB of personal data that must be backed up in the most reliable manner, while still preserving disk space. I want to have full file history so I can go back to any file at any particular date.
Full weekly backups are not an option because of the size of the data. Instead, I'm looking along the lines of an incremental backup solution. However, I'm aware that a single corruption in a set of incremental backups makes the entire series (beyond a point) unrecoverable. Thus simple incremental backups are not an option.
I've researched a number of solutions to the problem. First, I would use reverse-incremental backups so that the latest version of the files would have the least chance of loss (older files are not as important). Second, I want to protect both the increments and backup with some sort of redundancy. Par2 parity data seems perfect for the job. In short, I'm looking for a backup solution with the following requirements:
Reverse incremental (to save on disk space and prioritize the most recent backup)
File history (kind of a broader category including reverse incremental)
Par2 parity data on increments and backup data
Preserve metadata
Efficient with bandwidth (no copying of the entire directory over for each increment). Most incremental backup solutions should work this way.
This would (I believe) ensure file integrity and relatively small backup sizes. I've looked at a number of backup solutions already but they have a number of problems:
Bacula - Simple normal incremental backups
bup - incremental and implements par2 but isn't reverse incremental and doesn't preserve metadata
duplicity - incremental, compressed, and encrypted but isn't reverse incremental
dar - incremental and par2 is easy to add, but isn't reverse incremental and has no file history?
rdiff-backup - almost perfect for what I need but it doesn't have par2 support
So far I think that rdiff-backup seems like the best compromise but it doesn't support par2. I think I can add par2 support to backup increments easily enough since they aren't modified each backup but what about the rest of the files? I could generate par2 files recursively for all files in the backup but this would be slow and inefficient, and I'd have to worry about corruption during a backup and old par2 files. In particular, I couldn't tell the difference between a changed file and a corrupt file, and I don't know how to check for such errors or how they would affect the backup history. Does anyone know of any better solution? Is there a better approach to the issue?
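For concreteness, here is a rough sketch of the kind of workflow I have in mind, driving rdiff-backup and par2 from a small script. The paths, the one-year retention, and the 10% redundancy are placeholder choices, and the *.diff.gz pattern only covers one kind of increment file.

```python
# Sketch of the workflow I'm considering: reverse-incremental backup with
# rdiff-backup, then par2 parity over the increments. Paths and the
# redundancy level are placeholders.
import subprocess
from pathlib import Path

SOURCE = Path("/home/me/data")
DEST = Path("/backup/data")

# rdiff-backup keeps the newest files as a plain mirror and stores reverse
# increments under rdiff-backup-data/increments.
subprocess.run(["rdiff-backup", str(SOURCE), str(DEST)], check=True)

# Drop increments older than a year to bound disk usage.
subprocess.run(["rdiff-backup", "--remove-older-than", "1Y", "--force", str(DEST)],
               check=True)

# Generate par2 parity for each new increment file so a single corrupted
# increment doesn't break the whole history.
increments = DEST / "rdiff-backup-data" / "increments"
for inc in increments.rglob("*.diff.gz"):
    par2_file = Path(str(inc) + ".par2")
    if not par2_file.exists():
        subprocess.run(["par2", "create", "-r10", str(par2_file), str(inc)],
                       check=True)
```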
Thanks for reading through my difficulties and for any input you can give me. Any help would be greatly appreciated.
http://www.timedicer.co.uk/index
It uses rdiff-backup as the engine. I've been looking at it, but it requires me to set up a "server" using Linux or a virtual machine.
Personally, I use WinRAR to make pseudo-incremental backups (it actually makes a full backup of recent files), run daily by a scheduled task. It is similarly a "push" backup.
It's not a true incremental (or reverse-incremental) backup, but it saves different versions of files based on when each file was last updated. I mean, it saves the version for today, yesterday, and the previous days, even if the file is identical. You can use the archive bit to save space, but I don't bother anymore, as all I back up are small spreadsheets and documents.
RAR has its own parity or recovery record that you can set by size or percentage. I use 1% (one percent).
It can preserve metadata; I personally skip the high-resolution timestamps.
It can be efficient since it compresses the files.
Then all I have to do is send the file to my backup location. I have it copied to a different drive and to another computer on the network. There's no need for a true server, just a share. You can't do this for too many computers, though, as Windows workstations have a 10-connection limit.
So my setup, which may fit your purpose, backs up files that have been updated in the last 7 days, daily. Then I have another scheduled backup, run once a month (every 30 days), that backs up files updated in the last 90 days.
But I use Windows, so if you're actually setting up a Linux server, you might check out the Time Dicer.
Since nobody was able to answer my question, I'll write up a few possible solutions I found while researching the topic. In short, I believe the best solution is rdiff-backup to a ZFS filesystem (a rough setup sketch follows the list below). Here's why:
ZFS checksums all blocks stored and can easily detect errors.
If you have ZFS set to mirror your data, it can repair errors by copying from the good copy.
This takes up less space than full backups, even though the data is copied twice.
The odds of an error in both the original and the mirror are tiny.
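Here is a rough sketch of what I mean, driving the setup from a script. The device names, pool name, and paths are placeholders, and the commands need root.

```python
# Sketch: a mirrored ZFS pool as the rdiff-backup target, plus a periodic
# scrub so silent corruption is detected and repaired from the good copy.
# Device names, pool name, and paths are placeholders; run as root.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# One-time setup: two disks in a mirror, with compression enabled.
run(["zpool", "create", "backuppool", "mirror", "/dev/sdb", "/dev/sdc"])
run(["zfs", "set", "compression=on", "backuppool"])
run(["zfs", "create", "backuppool/rdiff"])

# Regular backup job: rdiff-backup onto the mirrored filesystem.
run(["rdiff-backup", "/home/me/data", "/backuppool/rdiff"])

# Periodic (e.g. weekly) scrub: ZFS verifies every block against its checksum
# and rewrites damaged blocks from the other half of the mirror.
run(["zpool", "scrub", "backuppool"])
run(["zpool", "status", "backuppool"])
```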
Personally, I am not using this solution, as ZFS is a little tricky to get working on Linux. Btrfs looks promising but hasn't yet been proven stable by years of use. Instead, I'm going with the cheaper option of simply monitoring hard drive SMART data. Hard drives do some error checking/correcting themselves, and by monitoring this data I can see whether that process is working properly. It's not as good as additional filesystem parity, but better than nothing.
A few more notes that might be interesting to people looking into reliable backup development:
par2 seems to be dated and buggy software. zfec seems like a much faster, more modern alternative. There was a discussion about this on the bup list a while ago: https://groups.google.com/group/bup-list/browse_thread/thread/a61748557087ca07
It's safer to calculate parity data before even writing to disk; i.e., rather than writing to disk, reading it back, and then calculating parity, do it from RAM and check against the original for additional reliability. This might only be possible with zfec, since par2 is too slow.
Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(They can't modify a file that has already been written; only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written, but they say its performance will be very poor. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation would have to find each replicated section across the network and update it, which would heavily increase the time of the operation. Updating a file could also push it over the block size, requiring the file to be split into two blocks and the second block to be replicated. I don't know the internals and when/how it would split a block... but it's a potential complication.
What if a job that has already performed an update fails or gets killed and is re-run? It could apply the update multiple times.
The advantage of not updating files in a distributed system is that when you update a file, you don't know who else is using it and you don't know where all the pieces are stored. There are potential timeouts (the node holding a block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop, and an update with a node down might be handled; it's just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on HDFS. None of them are insurmountable, but they would require a performance hit to check for and account for.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move data around; instead, you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking for something more like a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.
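For example, HBase gives you row-level random reads and writes on top of HDFS (it does this with in-memory stores and log-structured files rather than by rewriting HDFS blocks). A minimal sketch using the happybase client, where the Thrift server host, table name, and 'cf' column family are made-up assumptions and the table is assumed to already exist:

```python
# Sketch: row-level random writes and reads via HBase, which HDFS alone
# does not offer. Host, table name, and column family are placeholders,
# and the HBase Thrift server is assumed to be running.
import happybase

conn = happybase.Connection("hbase-thrift-host")
table = conn.table("web_logs")

# Random write: update a single cell in place (HBase versions the cell
# internally rather than rewriting an HDFS file).
table.put(b"user42#2014-01-01", {b"cf:status": b"200"})

# Random read: fetch just that row back by key.
row = table.row(b"user42#2014-01-01")
print(row.get(b"cf:status"))

conn.close()
```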