Read / Write large amounts of files daily - distributed-filesystem

We receive around 10 million images per day ranging in size from 3kb to 200kb. At peak times it is around 400 images per second. It is an average of around 30kb per image.
At the moment all these images come into a single server with a 1TB NVMe SSD for storage.
At night we move the days images to an archive server.
At peak times users want to read the latest images as they are being written but there are delays as it appears the server is attempting read and write at the same time.
What is the best way to be able to succeed in read / write to the same volume and be able to scale easily going forward?
I've started looking into distributed file systems like SeaweedFS. Is this the right way to go?
Are there better options than SeaweedFS?
Thank you

Related

Optimizing Neptune Bulk Load Jobs?

Currently we have an automation engine running to queue up billions of nodes/edges for our Neptune historical load.
The data pulls off Kafka and writes bulk CSVs into S3 to initiate the load. Currently I'm uploading files after each batch pulls a couple million records off the queue.
I'm using oversubscribe param and looked at the high-level docs for bulk optimizations. I'm seeing I can get about 36M records an hour, but looking to go faster. Do I want the output files to be larger? I can only run one job at a time and my queue is constantly filled up to the 65 cap limit.
In general, larger files should give better performance than smaller ones as the worker threads running the load will divide the file up amongst themselves. Larger instances also help the loads go faster. If possible, a db.r5.12xlarge is a good choice when you have a lot of data to load. You can scale it back down again once the volume of writes you need to achieve slows down and a smaller instance will suffice.

Imageresizer cache azure storage quota limits

I'm using ImageResizer as an Azure webapp with a service plan with 50 Gb file storage. My settings for DiskCache are:
<diskcache dir="~/imagecache" autoclean="true" hashModifiedDate="true" subfolders="1024" asyncWrites="true" asyncBufferSize="10485760" cacheAccessTimeout="15000" logging="true"/>
But that doesn't seem to stop the imagecache folder to get to the 50 Gb limit quite quickly. I have around 100 Gb of images in blob storage (original size), not all will be used on the same day, however the same image could be cached with different parameters multiple times. The images cached are around 200Kb average?.
Is there a way to stop the storage filling up so quick? Is there maybe a better way of using DiskCache? or use something else? The Premium Plans with 250Gb and decent CPU/RAM are far too expensive to justify the cost for this.
Thanks
You can't limit the cache by files size, only by a (very) rough count. Deleting the cache and setting subfolders="256" should keep you under 50GB, assuming that 200kb average holds true.
... However, if your cache fills up "quickly" (as in 1-3 days), then you're probably going to experience serious cache churn and poor performance as your disk write queue skyrockets.
You might consider using a CDN if you can't get storage space for, say, 10 days worth of cached files.

Loading from Google cloud storage to Big Query seems slow

I'm running a test using Big Query. Basically I have 50,000 files, each of which are 27MB in size, on average. Some larger, some smaller.
Timing each file upload reveals:
real 0m49.868s
user 0m0.297s
sys 0m0.173s
Using something similar to:
time bq load --encoding="UTF-8" --field_delimiter="~" dataset gs://project/b_20130630_0003_1/20130630_0003_4565900000.tsv schema.json
Running command: "bq ls -j" and subsequently running "bq show -j " reveals that I have the following errors:
Job Type State Start Time Duration Bytes Processed
load FAILURE 01 Jul 22:21:18 0:00:00
Errors encountered during job execution. Exceeded quota: too many imports per table for this table
After checking the database, the rows seems to of loaded fine which is puzzling since, given the error, I would of expected nothing to of gotten loaded. The problem is that I really don't understand how I reached my quota limit since I've only just started
uploading files recently and thought the limit was 200,000 requests.
All the data is currently on Google Cloud Storage so I would expect the data loading to happen fairly quickly since the interaction is between cloud storage and Big Query both of which are in the cloud.
By my calculations the entire load is going to take: (50,000 * 49 seconds) 28 days.
Kinda hoping these numbers are wrong.
Thanks.
The quota limit per table is 1000 loads per day. This is to encourage people to batch their loads, since we can generate a more efficient representation of the table if we can see more of the data at once.
BigQuery can perform load jobs in parallel. Depending on the size of your load, a number of workers will be assigned to your job. If your files are large, those files will be split among workers; alternately if you pass multiple files, each worker may process a different file. So the time that it takes for one file is not indicative of the time that it takes to run a load job with multiple files.

Will doing fork multiple times affect performance?

I need to read log files (.CSV) using fastercsv and save the contents of it in a db (each cell value is a record). The thing is there are around 20-25 log files which has to be read daily and those log files are really large (each CSV file is more then 7Mb). I had forked the reading process so that user need not have to wait a long time but still reading 20-25 files of that size is taking time (more then 2hrs). Now I want to fork reading of each file i.e there will be around 20-25 child process getting created, my question is can I do that? If yes will it affect the performance and is fastercsv able to handle this?
ex:
for report in #reports
pid = fork {
.
.
.
}
Process.dispatch(pid)
end
PS:I'm using rails 3.0.7 and Its going to happen in server which is running in amazon's large instance(7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform)
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc via several processes isn't going to speed that up at once, though I suppose if the disc had a big cache it might help a bit.
Also, 7MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function, where you can load in formatted data directly, or you could turn each row into an INSERT and file that straight into the database. I don't know how you're doing it at the moment so these are just guesses.
Of course, having said all that, the only way to be sure is to try it!

Affordability of Amazon Simple Storage Service (S3)

I have a website that attracts about 30,000 visitors per month. It has a lot of photos and PDF files which eat up a good deal of bandwidth. It's hosted by site5.com, which offers unlimited bandwidth & storage for ~$5 per month. According to site5's statistics, my site has about 20 GB of downloads per day, but I've seen it as high as 116 GB. Uploads range from 5-15 GB daily. (Though, I don't really upload things everyday, so I don't know where they get those numbers from.)
In anticipation of growing my site even more, perhaps by hosting videos, high-res photos, etc., I was looking into other storage options, even though site5 has been pretty good. Specifically, amazon.com's Simple Storage Service (S3) looks pretty good and is supposed to be a "highly scalable, reliable, fast, inexpensive data storage infrastructure."
Using Amazon's Simple Monthly Calculator, I multiplied out my worst-case scenario numbers:
Storage: 2 GB
Data Transfer-in: 15 GB/day * 31 days = 465 GB/month
Data Transfer-out: 116 GB/day * 31 days = 3596 GB/month
With those numbers alone, the calculator estimates my monthly bill to be a whopping $658.27!!! That's insane! Is anyone here using S3? Are your bills outrageous?
Wow, are you sure about those stats? I suppose that's possible, but you're lucky that your host hasn't given you the boot. Leasing a dedicated server will typically get you somewhere in the neighborhood of 1.5TB/month for at least 20 times what you are paying now. If you're doing 3.5TB for $5 per month and your host isn't complaining, don't even think about moving.
(note: most unlimited plans are indeed limited by the company's terms of service, which usually allows them to give anyone the boot for using "too many" resources.)
I would try to find some way to verify your stats before you continue.
$5/3500GB is $0.0014 per gig. That's insane.
3.6TB/month is kind of a lot. Just as a sanity-check, my internet connection seems to deliver somewhere around 100kB/sec reception if I'm lucky (I assume the send/receive rat are about the same). At that bandwidth limit it would take my computer 417 days sending continuously to deliver that amount of data.
10c per gigabyte seems pretty reasonable to me. NearlyFreeSpeech.net charges $1/gigabyte delivered but that decreases to 20c/gigabyte at high volumes. Mosso charges 22c/GB delivered.
If you are paying $5 for unlimited transfer and storage I would stick with your current provider as they are offering something that no-one else is going to be able to offer you for that price.
S3 is also a content distribution network, it has certain uptime guarantees, data storage guarantees, your host probably does not. When Amazon says they can deliver your 116 GB a day they really mean it, whereas your host is probably overselling their capacity and hoping people don't really use their unlimited transfer.
You are getting a steal in terms of what you use. Good luck finding that elsewhere.