I'm building an application for an artist and suggested that they host the MP3 files on S3, as it's free up to a certain amount of bandwidth for the first year.
The app will stream several record albums of 160 kbit MP3 files from S3, along with some 1280 x 720 .jpg images which the application downloads on launch.
The app needs to get these images (about 250 KB each) as fast as possible, and it is not able to cache them locally after it is terminated; each time it is launched, it will re-download the images.
Given an expectation of 10k to 100k users, possibly more, mostly in the USA, would adding CloudFront to the picture be of value, and is it more likely to bankrupt someone without a high income than using S3 alone?
Have a look at the Amazon Web Services Simple Monthly Calculator.
You can enter your assumptions about storage and data transfer into it and see the expected costs. You clearly already know that CloudFront will have dramatically lower latency for your clients.
I'm making a basic assumption of 1 hour of daily streaming per user, and your range of 10k to 100k users.
This source gives 160 kbit MP3 audio as 72 MB/hour, so we will use the formula below to calculate total monthly transfers:
72 MB/hour * 1 hour/user/day * 30 day/month * 10k user
This gives:
20 terabytes / month for 10k users
200 terabytes / month for 100k users
Using the above numbers, here are your costs:
S3-only is between $2,000 and $16,000 per month
CloudFront adds between $2,000 and $13,000 per month on top of that
TL;DR: In your scenario, using CloudFront roughly doubles the cost, regardless of the number of users.
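If you want to play with the assumptions yourself, here is a rough back-of-the-envelope sketch. The per-GB rates in it are my own assumptions for illustration, not quoted prices, and they ignore volume tiers, which is why the flat-rate numbers run a bit high at the 100k-user end; the calculator is the authoritative source.

```typescript
// Rough monthly transfer and cost estimate for the streaming scenario above.
// The per-GB prices are assumptions for illustration; check the AWS
// calculator / pricing pages for current, tiered rates.

const MB_PER_HOUR = 72;    // 160 kbit/s MP3 is roughly 72 MB per hour
const HOURS_PER_DAY = 1;   // assumed listening time per user per day
const DAYS_PER_MONTH = 30;

const S3_PRICE_PER_GB = 0.09;          // assumed S3 data-transfer-out rate, USD/GB
const CLOUDFRONT_PRICE_PER_GB = 0.085; // assumed CloudFront US rate, USD/GB

function monthlyCost(users: number) {
  const gbPerMonth = (MB_PER_HOUR * HOURS_PER_DAY * DAYS_PER_MONTH * users) / 1024;
  return {
    terabytesPerMonth: gbPerMonth / 1024,
    s3Only: gbPerMonth * S3_PRICE_PER_GB,
    // Mirrors the worst-case framing above: you pay for the S3 leg
    // plus CloudFront delivery on every gigabyte.
    withCloudFront: gbPerMonth * (S3_PRICE_PER_GB + CLOUDFRONT_PRICE_PER_GB),
  };
}

console.log(monthlyCost(10_000));  // ~21 TB/month, ~$1,900 S3-only
console.log(monthlyCost(100_000)); // ~206 TB/month at flat (untiered) rates
```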
You should also consider that the actual cost might be lower if you contact Amazon to buy reserved capacity on CloudFront in exchange for a better rate for transfer costs. From their CloudFront pricing page:
Reserved Capacity gives you the option to commit to a minimum monthly usage level for 12 months or longer and in turn receive a significant discount. Reserved Capacity agreements begin at a minimum of 10 TB of data transfer per month from a single region. Customers who commit to higher usage receive additional discounts.
Apparently in some cases reserved capacity might save you as much as 50% of CloudFront costs, which would mean CloudFront would only account for about a third of your total transfer costs instead of half.
S3 is meant for static data like images, and using CloudFront will be of great help.
We receive around 10 million images per day, ranging in size from 3 KB to 200 KB. At peak times that is around 400 images per second, with an average of around 30 KB per image.
At the moment all these images come into a single server with a 1 TB NVMe SSD for storage.
At night we move the day's images to an archive server.
At peak times users want to read the latest images as they are being written, but there are delays, as it appears the server is attempting to read and write at the same time.
What is the best way to be able to succeed in read / write to the same volume and be able to scale easily going forward?
I've started looking into distributed file systems like SeaweedFS. Is this the right way to go?
Are there better options than SeaweedFS?
Thank you
I am new to Stack Overflow. I use Google BigQuery to connect data from multiple sources together. I have made a connection to Google Ads (using a BigQuery data transfer) and this works well. But when I run a backfill of older data, it takes more than 3 days to get 180 days of data into BigQuery. Google advises 180 days as the maximum, but it still takes that long. I want to do this for the past 2 years and for multiple clients (we are an agency), so I need to do it in chunks of 180 days.
Does anybody have a solution for this taking so long?
Thanks in advance.
According to the documentation, BigQuery Data Transfer Service supports a maximum of 180 days (as you said) per backfill request and simultaneous backfill requests are not supported [1].
BigQuery Data Transfer Service limits the maximum rate of incoming requests and enforces appropriate quotas on a per-project basis [2] and other BigQuery tasks in the project may be limiting the amount of resources used by the Transfer. Load jobs created by transfers are included in BigQuery's quotas on load jobs. It's important to consider how many transfers you enable in each project to prevent transfers and other load jobs from producing quotaExceeded errors.
If you need to increase the number of transfers, you can create other projects.
If you want to speed up the transfers for all your clients, you could split them across several projects, because it sounds like you are going to run a significant number of transfers.
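Since each backfill request is capped at 180 days and simultaneous backfills aren't supported, one practical approach is to precompute the date windows and request them one after another (per client, per project). A minimal sketch of just the chunking logic, independent of whichever client library you use to actually trigger the runs:

```typescript
// Split a long history (e.g. 2 years) into backfill windows of at most
// 180 days each, to respect the BigQuery Data Transfer Service limit.
function backfillWindows(start: Date, end: Date, maxDays = 180): Array<{ from: Date; to: Date }> {
  const windows: Array<{ from: Date; to: Date }> = [];
  const msPerDay = 24 * 60 * 60 * 1000;
  let from = new Date(start);
  while (from < end) {
    const to = new Date(Math.min(from.getTime() + maxDays * msPerDay, end.getTime()));
    windows.push({ from: new Date(from), to });
    from = to;
  }
  return windows;
}

// Two years back from today -> five windows of <= 180 days,
// to be requested sequentially (simultaneous backfills are not supported).
const now = new Date();
const twoYearsAgo = new Date(now.getTime() - 730 * 24 * 60 * 60 * 1000);
console.log(backfillWindows(twoYearsAgo, now));
```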
I'm hitting my S3 bucket via its website endpoint with various paths/keys. I'm able to get OK (200) responses when I'm hitting it at 1,000 requests per second over the course of 5 minutes. I'm using a popular tool, https://github.com/tsenart/vegeta, so I have confidence in these stats.
This is surprising considering the documentation says that anything above 800 requests per second is problematic.
Is using a website endpoint different than an API call in terms of throttling? Is 800 a real rate limit or a crude threshold?
It's a soft limit, and not really a limit from the bucket-level perspective. Read carefully: the documentation warns that rapidly increasing the request rate beyond 800 requests per second may result in temporary limits on your request rate.
S3 increases available capacity by keyspace partition splitting and it takes some time for this to happen... but buckets scale up with workload.
If you are requesting the same object(s) repeatedly, you are also not likely to be imposing as much load on the available resources as you would be if you were hitting 800 unique objects per second and reading between the lines, that is the threshold under discussion -- the time to look up keys in the bucket index. Recent hits are probably already more accessible than cold spots in the index.
The problem that document highlights is that if your object keys are lexically sequential, then S3 will be unable to split the partitions meaningfully, because you will always be creating new objects on only one side of the split or the other, and thus working against the scaling algorithm of S3.
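Under that older guidance, the usual workaround was to add some entropy to the front of the key so that sequential names spread across the index rather than piling onto one partition. A hypothetical sketch (the hash length and key layout here are arbitrary choices, not anything S3 requires):

```typescript
import { createHash } from "crypto";

// Prepend a short hash of the natural key so lexically sequential names
// (timestamps, incrementing IDs) spread across the bucket's key space
// instead of concentrating on a single index partition.
function partitionedKey(naturalKey: string): string {
  const prefix = createHash("md5").update(naturalKey).digest("hex").slice(0, 4);
  return `${prefix}/${naturalKey}`;
}

console.log(partitionedKey("2016-08-01T12:00:00Z/photo-000123.jpg"));
// e.g. "9c3f/2016-08-01T12:00:00Z/photo-000123.jpg" (actual hash will differ)
```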
The documentation has been updated in the meantime and the limits have been increased. Now the limits are per bucket prefix, and 1,000 req/s isn't a problem any more. For more, see the doc mentioned above.
The new Google Sheets API v4 currently has an unlimited read/write quota per day (which is fantastic), but restricted to 500 reads/writes per account per 100 seconds, and 100 read/writes per key per 100 seconds (or, I have found, multiple keys coming from the same IP). This is probably plenty for most use cases, but I have an edge case that requires bringing a frequently-updated Google Sheet with 70 tabs down to a node.js server that distributes these to user's clients every ~30-60 seconds or so (users are data annotators who are student research assistants). This wasn't so bad early in the project when there were only 20-30 tabs, but now that the data is large the server is blowing through the 100 quota and returning errors every 10-15 minutes.
The problem is such that:
Frequent data updates: Only data on 1-5 of the 70 tabs is likely to be updated on any given minute, but which tabs have new data is random (so I am pulling down the whole sheet of 70 = 70 reads).
Update interval: The need for updates happens randomly at about 30 second to 5-minute intervals (so some within the quota, some about 3-5x the quota).
Throttling: I have tried throttling the updates to stay within the 100 calls/100 seconds (my previous solution), but this introduces large delays, significantly decreasing usability, productivity, and work quality.
Quota increase: The sheets API does not currently appear to include a way to pay to increase the quota. It does allow filling out a form to request an increase in the quota, but I'm not sure what the mean response time is on this (my request is only a few days old).
Multiple service accounts: I have tried using multiple service accounts to get the full 500 requests/100 seconds quota (rather than the per-user quota), since this is a server, but Google Sheets appears to rate-limit to 100 requests/100 seconds from a given IP.
Alternatives: I have considered that this project may have just grown beyond the size that Sheets is easily able to handle, but there do not appear to be any good, usable, self-hosted, collaborative spreadsheets with easy-to-interface-to APIs out there.
Are there settings/methods suggested to achieve the full 500 calls/100 seconds for a server?
You can request a quota update in Google Cloud Platform, and it will be increased to 2,500 per account and 500 per user (regarding your #4, the quota increase).
You can use spreadsheets.get to read the entire spreadsheet in a single call, rather than one call per tab. Alternatively, you can use spreadsheets.values.batchGet to read multiple different ranges in a single call, if all you need are the values.
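For example, with the googleapis Node client, a single batchGet can replace the 70 per-tab reads. The spreadsheet ID, tab names, and the A1:Z range below are placeholders:

```typescript
import { google } from "googleapis";

// Assumes a service account (or other credential) with read access to the sheet.
const auth = new google.auth.GoogleAuth({
  scopes: ["https://www.googleapis.com/auth/spreadsheets.readonly"],
});
const sheets = google.sheets({ version: "v4", auth });

async function readAllTabs(spreadsheetId: string, tabNames: string[]) {
  // One API call for all tabs instead of one call per tab.
  const res = await sheets.spreadsheets.values.batchGet({
    spreadsheetId,
    ranges: tabNames.map((name) => `${name}!A1:Z`), // placeholder range per tab
  });
  return res.data.valueRanges ?? [];
}

// readAllTabs("YOUR_SPREADSHEET_ID", ["Tab1", "Tab2" /* , ... */]);
```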
The Drive API offers "push notifications", so you can get notified when changes occur and react to those, instead of polling for them. The latency of the notifications is a little on the slow side, but it gets the job done.
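Here is a sketch of setting up such a notification channel with the Drive v3 API via the googleapis Node client. The webhook address is a placeholder and must be an HTTPS endpoint you control, and this assumes the authenticated account can see the file:

```typescript
import { google } from "googleapis";
import { randomUUID } from "crypto";

const auth = new google.auth.GoogleAuth({
  scopes: ["https://www.googleapis.com/auth/drive.readonly"],
});
const drive = google.drive({ version: "v3", auth });

// Watch the spreadsheet file; Drive will POST to the webhook when it changes,
// and the server can then do a single batchGet instead of polling on a timer.
async function watchSpreadsheet(fileId: string, webhookUrl: string) {
  const res = await drive.files.watch({
    fileId,
    requestBody: {
      id: randomUUID(),    // unique channel id
      type: "web_hook",
      address: webhookUrl, // e.g. "https://example.com/sheets-changed" (placeholder)
    },
  });
  return res.data; // contains the channel and resource ids needed to stop the watch later
}
```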
I have a website that attracts about 30,000 visitors per month. It has a lot of photos and PDF files which eat up a good deal of bandwidth. It's hosted by site5.com, which offers unlimited bandwidth & storage for ~$5 per month. According to site5's statistics, my site has about 20 GB of downloads per day, but I've seen it as high as 116 GB. Uploads range from 5-15 GB daily. (Though I don't really upload things every day, so I don't know where they get those numbers from.)
In anticipation of growing my site even more, perhaps by hosting videos, high-res photos, etc., I was looking into other storage options, even though site5 has been pretty good. Specifically, amazon.com's Simple Storage Service (S3) looks pretty good and is supposed to be a "highly scalable, reliable, fast, inexpensive data storage infrastructure."
Using Amazon's Simple Monthly Calculator, I multiplied out my worst-case scenario numbers:
Storage: 2 GB
Data Transfer-in: 15 GB/day * 31 days = 465 GB/month
Data Transfer-out: 116 GB/day * 31 days = 3596 GB/month
With those numbers alone, the calculator estimates my monthly bill to be a whopping $658.27!!! That's insane! Is anyone here using S3? Are your bills outrageous?
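For what it's worth, almost all of that estimate is the outbound transfer. A quick reconstruction using per-GB rates S3 charged around that time (assumed figures, for illustration only) lands in the same ballpark:

```typescript
// Rough reconstruction of the calculator's estimate. The rates are
// assumptions based on S3's pricing at the time, not current prices.
const storageGB = 2;
const transferInGB = 15 * 31;   // 465 GB/month
const transferOutGB = 116 * 31; // 3596 GB/month

const storageCost = storageGB * 0.15;        // assumed USD per GB-month
const transferInCost = transferInGB * 0.10;  // inbound transfer was billed then
const transferOutCost = transferOutGB * 0.17;

console.log((storageCost + transferInCost + transferOutCost).toFixed(2));
// ~658, dominated by the 3.6 TB of outbound transfer
```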
Wow, are you sure about those stats? I suppose that's possible, but you're lucky that your host hasn't given you the boot. Leasing a dedicated server will typically get you somewhere in the neighborhood of 1.5TB/month for at least 20 times what you are paying now. If you're doing 3.5TB for $5 per month and your host isn't complaining, don't even think about moving.
(note: most unlimited plans are indeed limited by the company's terms of service, which usually allows them to give anyone the boot for using "too many" resources.)
I would try to find some way to verify your stats before you continue.
$5/3500GB is $0.0014 per gig. That's insane.
3.6 TB/month is kind of a lot. Just as a sanity check, my internet connection seems to deliver somewhere around 100 kB/sec reception if I'm lucky (I assume the send/receive rates are about the same). At that bandwidth it would take my computer 417 days of sending continuously to deliver that amount of data.
10c per gigabyte seems pretty reasonable to me. NearlyFreeSpeech.net charges $1/gigabyte delivered but that decreases to 20c/gigabyte at high volumes. Mosso charges 22c/GB delivered.
If you are paying $5 for unlimited transfer and storage, I would stick with your current provider, as they are offering something that no one else is going to be able to offer you for that price.
S3 is also built for large-scale content delivery, and it comes with uptime and data durability guarantees that your host probably does not have. When Amazon says they can deliver your 116 GB a day, they really mean it, whereas your host is probably overselling their capacity and hoping people don't really use their unlimited transfer.
You are getting a steal in terms of what you use. Good luck finding that elsewhere.