How are S3 (Amazon Simple Storage Service) storage prices calculated? - amazon-s3

I'm not quite sure if this is the correct Stack Exchange site for this question, but I've found no site which fits better.
I'm planning to use S3 for my next project, but I'm not sure how the price for storage is actually billed. I would have no problem if I used S3 just for throwing gigabytes of data in and almost never deleted anything. But that's not the case.
What if I store a 1 megabyte file in S3, delete it after 1 hour and put another 1 megabyte file onto S3? Will I be billed for 1 megabyte of storage for that month, or 2 megabytes?
Amazon states:
First 1 TB / month of Storage Used
I don't think they simply look at what's stored in my S3 account at the end of the month and bill that. The other way around - billing me for every store request as "storage used" - won't work either, because a stored file might stay there for a long time, across multiple billing months.
I hope someone has the answer to that, I couldn't find anything :-)

Storage is billed as an average of all data stored per month. From the Amazon docs:
The volume of storage billed in a month is based on the average storage used throughout the month. This includes all object data and metadata stored in buckets that you created under your AWS account. We measure your storage usage in “TimedStorage-ByteHrs,” which are added up at the end of the month to generate your monthly charges.
Storage Example: Assume you store 100GB (107,374,182,400 bytes) of standard Amazon S3 storage data in your bucket for 15 days in March, and 100TB (109,951,162,777,600 bytes) of standard Amazon S3 storage data for the final 16 days in March.
At the end of March, you would have the following usage in Byte-Hours:
Total Byte-Hour usage = [107,374,182,400 bytes x 15 days x (24 hours / day)] + [109,951,162,777,600 bytes x 16 days x (24 hours / day)] = 42,259,901,212,262,400 Byte-Hours.
Let’s convert this to GB-Months:
42,259,901,212,262,400 Byte-Hours x (1 GB / 1,073,741,824 bytes) x (1 month / 744 hours) = 52,900 GB-Months
So in your example (assuming the 2nd megabyte is stored for the remainder of the month) you will be charged for roughly 1MB-month of storage - the deleted file only contributes 1MB x 1 hour of Byte-Hours.
Remember, though, that there are other charges to consider too, like data transfer in/out, total requests, etc.
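To make the averaging concrete, here is a minimal sketch of the Byte-Hours arithmetic from the quoted example in plain Python (no AWS API involved; the 1,073,741,824 bytes/GB and 744 hours/month factors are the ones Amazon uses above):

# Reproduce Amazon's "TimedStorage-ByteHrs" example quoted above.
GB = 1_073_741_824           # bytes per GB, as used in the Amazon example
HOURS_PER_MONTH = 744        # 31 days x 24 hours

def gb_months(usage):
    """usage: list of (bytes_stored, hours_stored) pairs for the month."""
    byte_hours = sum(b * h for b, h in usage)
    return byte_hours / GB / HOURS_PER_MONTH

# 100GB for 15 days, then 100TB for the final 16 days of March:
march = [(107_374_182_400, 15 * 24), (109_951_162_777_600, 16 * 24)]
print(gb_months(march))      # ~52,900 GB-Months, matching the example

# The scenario from the question: 1MB for 1 hour, then another 1MB for the rest of the month:
question = [(1_048_576, 1), (1_048_576, 743)]
print(gb_months(question))   # ~0.001 GB-Months, i.e. about 1MB-month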

Related

How do databases store live second data?

So what I mean by live second data is something like the stock market, where every second new data is recorded against each specific stock.
How would the data look in the database? Does it have a timestamp for each second? If so, wouldn't that cause the database to fill up quickly? Are there specific databases that manage this type of thing?
Thank you!
Given the sheer amount of money that gets thrown around in fintech, I'd be surprised if trading platforms even use traditional RDBMS databases to store their trading data, but I digress...
How would the data look in the database?
(Again, assuming they're even using a relational model in the first place) then something like this in SQL:
CREATE TABLE SymbolPrices (
    Symbol char(4)  NOT NULL, -- 4 bytes, or even 3 bytes given an uppercase symbol char only needs 5 bits per char.
    Utc    datetime NOT NULL, -- 8 byte timestamp (nanosecond precision)
    Price  int      NOT NULL  -- Assuming integer cents (not 4 decimal places), that's 4 bytes
);
...which has a fixed row length of 16 bytes.
Does it have a timestamp of each second?
It can do, but not per second - you'd need far greater granularity than that: I wouldn't be surprised if they were using at least 100-nanosecond resolution, which is a common unit for computer system clock "ticks" (e.g. .NET's DateTime.Ticks is a 64-bit integer value of 100-nanosecond units). Java and JavaScript both use milliseconds, though this resolution might be too coarse.
Storage space requirements for changing numeric values can always be significantly optimized if you store deltas instead of absolute values: I reckon it could come down to 8 bytes per record:
I reason that 3 bytes is sufficient to store trade timestamp deltas at ~1.5ms resolution, assuming 100,000 trades per day per stock: that's 16.7m values to represent a 7 hour (25,200s) trading window.
Price deltas could also likely be reduced to a 2 byte value (-$327.68 to +$327.67).
And assuming symbols never exceed 4 uppercase Latin characters (A-Z), then that can be represented in 3 bytes.
Giving an improved fixed row length of 8 bytes (3 + 3 + 2).
Though you would now need to store "keyframe" data every few thousand rows to prevent needing to re-play every trade from the very beginning to get the current price.
If data is physically partitioned by symbol (i.e. using a separate file on disk for each symbol) then you don't need to include the symbol in the record at all, bringing the row length down to merely 5 bytes.
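As a quick illustration of that hypothetical 8-byte layout (my own sketch, not anything a real exchange necessarily does; the field widths are the assumed 3-byte timestamp delta, 3-byte symbol and 2-byte price delta from above):

# Pack one trade into the hypothetical 8-byte delta record described above.
def encode_symbol(symbol: str) -> int:
    """Pack up to 4 uppercase A-Z characters into 5 bits each (20 bits, fits in 3 bytes)."""
    value = 0
    for ch in symbol.ljust(4, 'A'):                            # pad short symbols; purely illustrative
        value = (value << 5) | (ord(ch) - ord('A'))
    return value

def pack_trade(ts_delta_units: int, symbol: str, price_delta_cents: int) -> bytes:
    ts = ts_delta_units.to_bytes(3, 'big')                     # time since previous trade, ~1.5ms units
    sym = encode_symbol(symbol).to_bytes(3, 'big')             # dropped entirely if partitioned by symbol
    price = price_delta_cents.to_bytes(2, 'big', signed=True)  # -32,768 to +32,767 cents
    return ts + sym + price

record = pack_trade(ts_delta_units=170, symbol='MSFT', price_delta_cents=-12)
print(len(record))  # 8 bytes per trade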
If so, wouldn't that cause the database to quickly fill up?
No, not really (at least assuming you're using HDDs made since the early 2000s); consider that:
Major stock-exchanges really don't have that many stocks, e.g. NASDAQ only has a few thousand stocks (5,015 apparently).
While high-profile stocks (AAPL, AMD, MSFT, etc.) typically have 30-day sales volumes on the order of 20-130m, that's only the most popular ~50 stocks; most stocks have 30-day volumes far below that.
Let's just assume all 5,000 stocks all have a 30-day volume of 3m.
That's ~100,000 trades per day, per stock on average.
That would require 100,000 * 16 bytes per day per stock.
That's 1,600,000 bytes per day per stock.
Or 1.5MiB per day per stock.
556MiB per year per stock.
For the entire exchange (of 5,000 stocks) that's 7.5GiB/day.
Or 2.7TB/year.
When using deltas instead of absolute values, then the storage space requirements are halved to ~278MiB/year per stock, or 1.39TB/year for the entire exchange.
In practice, historical information would likely be archived and compressed (probably using a column-major layout to make it more amenable to general-purpose compression schemes, and if data is grouped by symbol then that shaves off another 4 bytes).
Even without compression, partitioning by symbol and using deltas means needing only around 870GB/year for the entire exchange.
That's small enough to fit on a $40 HDD from Amazon.
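Here's the same back-of-envelope arithmetic written out (a sketch using the assumptions above: 5,000 symbols, ~100,000 trades/day each, 365 days, and the 16/8/5-byte row layouts):

# Back-of-envelope storage estimate under the assumptions stated above.
SYMBOLS = 5_000
TRADES_PER_DAY = 100_000       # ~3m 30-day volume per symbol
DAYS_PER_YEAR = 365

def yearly_storage_tib(row_bytes: int) -> float:
    """TiB/year for the whole exchange at a given fixed row size."""
    bytes_per_year = SYMBOLS * TRADES_PER_DAY * DAYS_PER_YEAR * row_bytes
    return bytes_per_year / 2**40

print(yearly_storage_tib(16))  # ~2.7 TiB/year with absolute values
print(yearly_storage_tib(8))   # ~1.3 TiB/year with deltas
print(yearly_storage_tib(5))   # ~0.8 TiB/year with deltas + per-symbol partitioning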
Are there specific Databases that manage this type of stuff?
Undoubtedly, but I don't think they'd need to optimize for storage-space specifically - more likely write-performance and security.
They use big data architectures like Kappa and Lambda, where data is processed in both near-real-time and batch pipelines. In this case, live per-second data is "stored" in a messaging engine like Apache Kafka and then retrieved, processed and ingested into databases with stream-processing engines like Apache Spark Streaming.
They often don't use RDBMS databases like MySQL, SQL Server and so forth to store the data; instead they use NoSQL data stores, or formats like Apache Avro or Apache Parquet stored in buckets such as AWS S3 or Google Cloud Storage, properly partitioned to improve performance.
A full example can be found here: Streaming Architecture with Apache Spark and Kafka
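A minimal sketch of that kind of pipeline in PySpark Structured Streaming (the broker, topic, schema and bucket paths below are made-up placeholders, not taken from the linked article, and the Kafka source needs the spark-sql-kafka connector on the classpath):

# Sketch: read a live trade stream from Kafka and continuously write Parquet to S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("trade-ingest").getOrCreate()

schema = (StructType()
          .add("symbol", StringType())
          .add("ts", TimestampType())
          .add("price", DoubleType()))

trades = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")    # placeholder broker
          .option("subscribe", "trades")                       # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("t"))
          .select("t.*"))

query = (trades.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/trades/")            # placeholder bucket
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/trades/")
         .partitionBy("symbol")
         .start())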

Databricks delta live tables stuck when ingest file from S3

I'm new to Databricks and just created a Delta Live Tables pipeline to ingest 60 million JSON files from S3. However, the input rate (the number of files it reads from S3) is stuck at around 8 records/s, which is very low IMO. I have increased the number of workers/cores in my Delta Live Tables pipeline, but the input rate stays the same.
Is there any config that I have to add to increase the input rate of my pipeline?

Why LogDNA backlogs to Amazon S3 only part of the data?

I compared the logs in LogDNA with the files archived to S3 for the same one-hour period (8 hours earlier), and only a portion of the lines appear in S3. Why?

Load Limit for BigQuery Table

I have tons of Avro-format files saved in GCS.
I would like to use BigQuery REST API to load them back as BigQuery tables.
Is there a limit for the total amount of data (such as 10 TB) I can load per day?
Thanks,
Yefu
You are limited by several quotas [1]:
Maximum size per load job — 15 TB across all input files for CSV, JSON, Avro, Parquet, and ORC
Maximum number of source URIs in job configuration — 10,000 URIs
Maximum number of files per load job — 10 Million total files including all files matching all wildcard URIs
From the above, you can load up to 15TB per load job.
Load jobs per table per day — 1,000 (including failures)
Load jobs per project per day — 100,000 (including failures)
With these 2 limits, you can upload up to 1000 * 15TB = 15PB per table per day.
[1] https://cloud.google.com/bigquery/quotas#load_jobs
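For reference, a load job from GCS might look roughly like this with the google-cloud-bigquery Python client (the bucket, dataset and table names are placeholders; each such job counts against the quotas above):

# Sketch: load Avro files from GCS into a BigQuery table (names are placeholders).
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
)

# A wildcard URI counts its matched files against the 10M-files-per-job limit.
load_job = client.load_table_from_uri(
    "gs://my-bucket/avro/*.avro",            # placeholder bucket/path
    "my_project.my_dataset.my_table",        # placeholder table
    job_config=job_config,
)
load_job.result()  # wait for the job to finish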

CloudFront: Cost estimate

I have to come up with a proposal to use Amazon S3 with CloudFront as a CDN.
One of the important things is to do a cost estimate. I read the AWS website and forums and used their calculator, but couldn't come to a conclusion with a final (approximate) number that I am confident of. Honestly, I got confused between terms like "Data Transfer Out" and "GET and Other Requests", and whether I need to fill in the details for both Amazon S3 and Amazon CloudFront and then do a sum total.
So need help here to estimate my monthly bill.
I will be using S3 to store files (mostly images)
I will be configuring cloud front with my S3 bucket to deliver the content.
Most of the client base (almost 95%) is in US.
Average file size: 500KB
Average number of files stored on S3 monthly: 80000 (80K)
Approx number of total users requesting the files monthly, or approx number of total requests to fetch the files from CloudFront: 30 million monthly
There will be some invalidation requests per month (let's say 1,000)
It would be great if I could get more understanding of how my monthly bill will be calculated and approximately what it will be.
Also, with the above data and estimates, any approximation of the monthly bill if I use Akamai or Rackspace instead.
I'll throw another number into the ring.
Using http://calculator.s3.amazonaws.com/calc5.html
CloudFront
Data transfer out: 0.5MB x 30 million = ~15,000GB
Average object size: 500KB
1,000 invalidation requests
95% US
S3
Storage: 80K x 0.5MB = 40GB
Requests: 30 million
My initial result is $1,413. As #user2240751 noted, a factor of safety of 2 isn't unreasonable, so that's in the $1,500 - $3,000/month range.
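For what it's worth, here's roughly how those inputs combine (a sketch only - the per-GB and per-request rates below are illustrative assumptions, so plug in the current prices from the AWS pricing pages):

# Rough CloudFront + S3 monthly estimate; all rates below are assumed/illustrative.
GB = 1_000_000_000

requests_per_month = 30_000_000
avg_object_bytes = 500_000
data_out_gb = requests_per_month * avg_object_bytes / GB       # ~15,000 GB

cf_per_gb_us = 0.085           # assumed CloudFront US data-transfer-out rate, $/GB
cf_per_10k_requests = 0.0075   # assumed CloudFront HTTP request rate, $/10,000 requests
s3_storage_per_gb = 0.023      # assumed S3 standard storage rate, $/GB-month

transfer_cost = data_out_gb * cf_per_gb_us
request_cost = requests_per_month / 10_000 * cf_per_10k_requests
storage_cost = 80_000 * avg_object_bytes / GB * s3_storage_per_gb

print(round(transfer_cost + request_cost + storage_cost))      # ~$1,300 - the same ballpark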
I'm used to working with smaller numbers, but the final amount is always more than you might expect because of extra requests and data transfer.
Corrections or suggestions for improvements welcome!
Good luck
The S3 put and get request fields (in your case) should be restricted to the number of times you are likely to call / update the files in S3 from your application only.
To calculate the CloudFront service costs, you should work out the rough outbound bandwidth of a page load (the number of objects served from CloudFront per page - then double it, to give yourself some headroom), and fill in the rest of the fields.
Rough calc.
500GB data out (guess)
500k average object size
1000 invalidation requests
95% to US based edge location
5% to Europe based edge location
Comes in at $60.80 + your S3 costs.
I think the maths here is wrong: 0.5MB * 30,000,000 is ~14,503GB, NOT 1500GB - that's a factor of 10 out, unless I'm missing something.
Which means your monthly costs are going to be around $2,000, not $200.