Amazon S3 Different Download Speeds for File Sizes

I want to know if my theory is true. I have the following files hosted on S3:
Single 83.9 MB ZIP File
The Single ZIP File separated into 12 files
The Single ZIP File separated into 24 files
I was assuming the single ZIP file would have the best results but this doesn't appear to be the case.
Latest Result
Single: 31 minutes
12 Files: 2.8 minutes
24 Files: 6 minutes
The single file download in particular varies in speeds, I've had results ranging from 15 minutes to 35 minutes for this file.
Question: Does Amazon S3 have different download methods/speeds for different file sizes?

No, Amazon S3 does not use different download methods for different file sizes or file counts. It would help if you told us which tool you are using to download your data, because if your tool downloads in parallel threads (processes), then the multi-part operation will take less time than processing the single file.
Second, the download time may also vary because of your internet speed.
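
To illustrate why parallelism matters here, below is a minimal sketch (assuming Python with boto3; the bucket and key names are placeholders) that fetches one large object over several concurrent connections using boto3's TransferConfig, which often brings a single 80+ MB download close to the speed you are seeing with the 12-file split.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Fetch the object in 8 MiB parts over up to 10 concurrent connections.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,   # anything above 8 MiB is downloaded in parts
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=10,
)
# "my-bucket" and "archive.zip" are placeholder names.
s3.download_file("my-bucket", "archive.zip", "archive.zip", Config=config)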

Related

Uploading a huge file to s3 (larger than my hard drive)

I'm trying to upload a file to an S3 bucket, but the problem is that the file I want to upload (and also create) is bigger than my hard drive can hold (I want to store a 500 TB file in the bucket).
Is there any way to do so?
The file is generated, so I thought about generating the file as I go while it uploads, but I can't quite figure out how to do it.
Any help is appreciated :)
Thanks in advance
The Multipart Upload API allows you to upload a file in chunks, including on-the-fly content generation... but the maximum size of an object in S3 is 5 TiB.
Also, it costs a minimum of $11,500 to store 500 TiB in S3 for 1 month, not to mention the amount of time it takes to upload it... but if this is a justifiable use case, you might consider using some Snowball Edge devices, each of which has its own built-in 100 TiB of storage.
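
As a rough illustration of the multipart approach (a sketch, not a drop-in solution), here is what streaming generated content looks like with boto3; the bucket, key, and the generate_chunks() helper are placeholders for however the content is produced on the fly.

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "generated/huge-object.bin"   # placeholder names

# Start a multipart upload and stream generated parts without staging them on disk.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
# generate_chunks() is a placeholder generator yielding bytes;
# every part except the last must be at least 5 MiB.
for part_number, chunk in enumerate(generate_chunks(), start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                          UploadId=mpu["UploadId"], Body=chunk)
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})

Keep in mind that a multipart upload allows at most 10,000 parts, so the part size has to be chosen to cover the full object, and the 5 TiB per-object limit still applies.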

Download large file from FTP server in chunks

I need to download a large file from an FTP server. A new file is uploaded once a week, and I need to be among the first to download it. I've written a check that detects when the file has been uploaded and, if it is there, starts the download. The problem is that this is a big file (3 GB). I can download about 10% of it within the first few minutes, but as more and more people discover the file, the average download speed drops and drops, to the point where it takes about 3-4 hours to download the remaining 80-90%.
The time isn't a huge problem, but it sure would be nice if I could finish the download quicker. The bigger problem is that my download never finishes, and I think that's because the connection gets timed out.
One solution would be to extend the download timeout, but I have another suggestion: download the file in chunks. Right now I download from the beginning to the end in one go. It starts off with a good download speed, but as more and more people begin their downloads, it slows all of us down. I would like to split the download into smaller chunks and have all the separate downloads start at the same time. I've made an illustration:
Here I have 8 starting points, which means I'll end up with 8 parts of the ZIP file that I then need to recombine into one file once the download has ended. Is this even possible, and how would I approach this solution? If I could do this, I would be able to complete the entire download in about 10-15 minutes and wouldn't have to wait the extra 3-4 hours for the download to fail and then have to restart it.
Currently I use a WebClient to download the FTP file, since all other approaches couldn't finish the download because the file is larger than 2.4 GB.
Private wc As New WebClient()   ' requires Imports System.Net
' "@" (not "#") separates the credentials from the host in the FTP URI
wc.DownloadFileAsync(New Uri("ftp://user:password@ip/FOLDER/" & FILENAME), downloadPath & FILENAME)
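
The ranged-download idea itself is workable. As an illustration only (in Python's ftplib rather than the poster's WebClient, with generic host, credentials, and paths), each chunk can be fetched on its own connection using the FTP REST offset, and the parts concatenated afterwards:

from ftplib import FTP

def download_chunk(host, user, password, remote_path, offset, length, out_path):
    # One connection per chunk: REST sets the starting byte, then we read `length` bytes.
    ftp = FTP(host)
    ftp.login(user, password)
    ftp.voidcmd("TYPE I")                                   # binary mode, so REST counts bytes
    conn = ftp.transfercmd(f"RETR {remote_path}", rest=offset)
    remaining = length
    with open(out_path, "wb") as f:
        while remaining > 0:
            data = conn.recv(min(65536, remaining))
            if not data:
                break
            f.write(data)
            remaining -= len(data)
    conn.close()
    ftp.close()   # the server may log the cut-off transfer; that is expected here

Run eight of these in parallel (threads or processes) with offsets spaced file_size / 8 apart, then concatenate the part files in order. Note this only helps if the server allows multiple simultaneous connections per client and supports the REST command.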

SpamAssassin creating bayes.toks.expire text files

I have a shared hosting account at HostGator and have been using spamassassin for several months with no problem. About 10 days ago, I logged in to cPanel > File Manager > .spamassassin folder, and there were 10-12 text files created like these:
bayes.toks.expire40422 2.99 MB
bayes.toks.expire42356 5.07 MB
bayes.toks.expire44593 5.07 MB
On average, about 2 new files like these are being created each day. I can't open them in File Manager because the file size is too large, and if I download one to my PC and open it with Notepad, there is a lot of content, but it is unreadable.
So far, there are also 3 other odd files being created about every 3-4 days like the following:
bayes.locl.gator.hostgator.com.15247 180 bytes
bayes.locl.gator.hostgator.com.28605 210 bytes
bayes.locl.gator.hostgator.com.78666 180 bytes
I searched Google and most posts are pretty old from 2006-2009 and none seem to have a clear answer other than stating these files can be manually deleted. Naturally, I don't want to login every week to manually delete these files so I am trying to find out the cause and a resolution.
I submitted a support ticket to HostGator and their only reply was: 'This is caused due to spamassassin configuration', which does not help.
Also in the .spamassassin folder are these 3 related files:
bayes_journal 28 KB
bayes_seen 324 KB
bayes_toks 5.07 MB
I have the user_prefs file configured and working. Does anyone know the cause of these files or how to prevent them in a shared hosting environment where I do not have direct access to the server?

Simple way to load new files only into Redshift from S3?

The documentation for the Redshift COPY command specifies two ways to choose files to load from S3, you either provide a base path and it loads all the files under that path, or you specify a manifest file with specific files to load.
However in our case, which I imagine is pretty common, the S3 bucket periodically receives new files with more recent data. We'd like to be able to load only the files that haven't already been loaded.
Given that there is a table stl_file_scan that logs all the files that have been loaded from S3, it would be nice to somehow exclude those that have successfully been loaded. This seems like a fairly obvious feature, but I can't find anything in the docs or online about how to do this.
Even the Redshift S3 loading template in AWS Data Pipeline appears to manage this scenario by loading all the data -- new and old -- to a staging table, and then comparing/upserting to the target table. This seems like an insane amount of overhead when we can tell up front from the filenames that a file has already been loaded.
I know we could probably move the files that have already been loaded out of the bucket, but we can't do that: this bucket is the final storage place for another process that is not our own.
The only alternative I can think of is to have some other process running that tracks files that have been successfully loaded into Redshift, periodically compares that list to the S3 bucket to determine the differences, and then writes the manifest file somewhere before triggering the COPY process. But what a pain! We'd need a separate EC2 instance to run the process, which would have its own management and operational overhead.
There must be a better way!
This is how I solved the problem:
S3 -- (Lambda trigger on newly created logs) -- Lambda -- Firehose -- Redshift
It works at any scale: with more load there are more calls to Lambda and more data flowing to Firehose, and everything is taken care of automatically.
If there are issues with the format of a file, you can configure a dead-letter queue; the events will be sent there, and you can reprocess them once you fix the Lambda function.
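
A minimal sketch of the Lambda step in that pipeline, assuming Python with boto3; the delivery stream name is a placeholder, and the Firehose stream itself is configured separately to COPY into Redshift.

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")
STREAM = "redshift-delivery-stream"   # assumed delivery stream name

def handler(event, context):
    # Triggered by S3 "object created" events: read each new file and
    # forward its lines to the Firehose stream feeding Redshift.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        batch = []
        for line in body.iter_lines():
            batch.append({"Data": line + b"\n"})
            if len(batch) == 500:                # Firehose accepts at most 500 records per batch
                firehose.put_record_batch(DeliveryStreamName=STREAM, Records=batch)
                batch = []
        if batch:
            firehose.put_record_batch(DeliveryStreamName=STREAM, Records=batch)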
Here are some general steps for loading data into Redshift (a manifest-based COPY sketch follows these steps):
Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences during export).
Split the files into 10-15 MB each to get optimal performance during upload and the final data load.
Compress the files to *.gz format so you don't end up with a $1000 surprise bill :) .. in my case the text files were compressed 10-20 times.
List all file names in a manifest file so that when you issue the COPY command to Redshift it is treated as one unit of load.
Upload the manifest file to the Amazon S3 bucket.
Upload the local *.gz files to the Amazon S3 bucket.
Issue the Redshift COPY command with the appropriate options.
Schedule file archiving from on-premises and the S3 staging area on AWS.
Capture errors and set up restartability if something fails.
For an easier way, you can follow this link.
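
Here is the manifest-plus-COPY sketch referenced above, assuming Python with boto3 and psycopg2; the file list, bucket, table, cluster endpoint, and IAM role are all placeholders.

import json
import boto3
import psycopg2

# Placeholder list of the *.gz files to load in this batch.
files = ["s3://my-bucket/staging/part-000.gz", "s3://my-bucket/staging/part-001.gz"]
manifest = {"entries": [{"url": u, "mandatory": True} for u in files]}

# Upload the manifest so COPY treats the listed files as one unit of load.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="staging/load.manifest",
              Body=json.dumps(manifest).encode())

# Placeholder cluster endpoint, database, credentials, table, and role ARN.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="loader", password="...")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_schema.my_table
        FROM 's3://my-bucket/staging/load.manifest'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        GZIP MANIFEST;
    """)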
In general, comparing the loaded files against what currently exists on S3 is a workable but poor practice. The common "industrial" practice is to use a message queue between the data producer and the data consumer that actually loads the data; take a look at RabbitMQ vs. Amazon SQS, etc.

How do services like Dropbox implement delta encoding if their files are stored in the cloud?

Dropbox claims that during syncing only the portion of a file that changed is transmitted back to the main server, which is obviously great functionality, but how do they apply those changes to files stored in Amazon S3? For example, let's say a 30-page document on a user's desktop has changes only on page 4. Dropbox syncs the blocks representing the changes, but what happens on the backend if the files they store are in the cloud? Does that mean they have to download the 30-page document from S3 to their server, replace the blocks representing page 4, and then upload it back to the cloud? I doubt this is the case, because it would be rather inefficient. The other option I can think of is that Amazon S3 supports updating a stored file by byte range, so that, for example, you could make a PUT request to file X for bytes 100-200 that replaces those bytes with the body of the PUT request. So I'm curious how companies that use cloud services such as Amazon implement this type of syncing.
Thanks
Since S3 and similar storage services don't offer filesystem capabilities, anything that pretends to store files and directories needs to emulate a file system. When doing this, files are often split into pages of a certain size, where each page is stored as a separate object in the storage. This way a changed block requires uploading only one page (for example) and not the whole file. I should note that with files like office documents this approach can break down if the file size changes; for example, if you insert a page at the beginning or delete a page, then the whole file shifts and the complete file would need to be re-uploaded. We didn't analyze how Dropbox in particular does its job; I've just described the common scenario. There also exist various "patch algorithms", where a patch can be created locally (if Dropbox has an older local copy in the cache) and then applied to one or more blocks on the server.
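
A sketch of that page-based emulation (the bucket and prefix names are placeholders, and the stored-hash lookup is left abstract): the client splits the file into fixed-size blocks, hashes each one, and uploads only the blocks whose hash changed since the last sync.

import hashlib
import boto3

BLOCK_SIZE = 4 * 1024 * 1024          # 4 MiB pages
s3 = boto3.client("s3")

def sync_blocks(path, bucket, prefix, previous_hashes):
    """previous_hashes: {block_index: sha256 hex} from the last sync (placeholder lookup)."""
    new_hashes = {}
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            new_hashes[index] = digest
            if previous_hashes.get(index) != digest:          # only changed pages are uploaded
                s3.put_object(Bucket=bucket, Key=f"{prefix}/block-{index:08d}", Body=block)
            index += 1
    return new_hashes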
There are several synchronizing tools that transfer deltas over the wire, like rsync, rdiff, and rdiff-backup. For bi-directional synchronizing with S3 there are paid services such as s3rsync. For purely client-side synchronizing, tools like zsync can be considered (which is what many people employ to roll out app updates).
An alternative approach would be to tarball the directory, generate a delta file (using rdiff or xdelta3), and upload the delta file using a timestamp as part of the key. To sync, all you need to do is perform these two checks client-side:
You have all the delta files from S3. If not, pull them and apply them to regenerate the latest backup state.
Your last backup state corresponds to your current directory. If not, generate a new delta file and push it to S3.
The main concern here is the at least 100% additional space utilization on the client side, but this approach also lets you revert changes if needed.
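
A sketch of that tarball-plus-delta flow, assuming the xdelta3 command-line tool is installed and using Python with boto3; the directory, file paths, and bucket name are placeholders.

import subprocess
import time
import boto3

BUCKET = "my-sync-bucket"        # placeholder bucket name
prev_tar = "backup_prev.tar"     # last synced state, kept locally
curr_tar = "backup_curr.tar"

# Tarball the current directory state.
subprocess.run(["tar", "-cf", curr_tar, "my_directory/"], check=True)

# Encode a delta from the previous tarball to the current one (-e encode, -s source).
delta = f"delta_{int(time.time())}.xd3"
subprocess.run(["xdelta3", "-e", "-s", prev_tar, curr_tar, delta], check=True)

# Upload the delta under a timestamped key.
boto3.client("s3").upload_file(delta, BUCKET, f"deltas/{delta}")

# To restore elsewhere: download the deltas in timestamp order and apply each with
#   xdelta3 -d -s <previous_state.tar> <delta> <next_state.tar>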