CloudFront: Cost estimate - amazon-s3

I have to come up with a proposal to use Amazon S3 with CloudFront as a CDN.
One of the important things is to do a cost estimate. I read through the AWS website and forums and used their calculator, but couldn't arrive at an approximate final number that I'm confident in. Honestly, I got confused by terms like "Data Transfer Out" and "GET and Other Requests", and by whether I need to fill in the details for both Amazon S3 and Amazon CloudFront and then add the two together.
So I need help estimating my monthly bill.
I will be using S3 to store files (mostly images).
I will be configuring CloudFront with my S3 bucket to deliver the content.
Most of the client base (almost 95%) is in the US.
Average file size: 500KB
Average number of files stored on S3 monthly: 80,000 (80K)
Approximate number of requests to fetch files from CloudFront: 30 million per month
There will be some invalidation requests per month (let's say 1,000)
It would be great to get a better understanding of how my monthly bill will be calculated and approximately what it will be.
Also, given the above data and estimates, roughly how much would the monthly bill be if I used Akamai or Rackspace instead?

I'll throw another number into the ring.
Data transfer out: 0.5MB x 30 million = ~15,000GB
Average object size: 500KB
1,000 invalidation requests
95% US
Storage: 80K x 0.5MB = ~40GB
My initial result is $1,413. As #user2240751 noted, a factor of safety of 2 isn't unreasonable, so that's in the $1,500 - $3,000/month range.
I'm used to working with smaller numbers, but the final amount is always more than you might expect because of extra requests and data transfer.
Corrections or suggestions for improvements welcome!
Good luck

The S3 PUT and GET request fields (in your case) should be restricted to the number of times your application itself is likely to read or update the files in S3.
To calculate the CloudFront service costs, you should work out the rough outbound bandwidth of your page load (number of objects served from CloudFront per page, then double it to give yourself some headroom), and fill in the rest of the fields.
Rough calc.
500GB data out (guess)
500KB average object size
1000 invalidation requests
95% to US based edge location
5% to Europe based edge location
Comes in at $60.80 + your S3 costs.

I think the maths here is wrong: 0.5MB * 30,000,000 is about 14,500GB, NOT 1500GB. That's a factor of 10 out, unless I'm missing something.
Which means your monthly costs are going to be around $2000, not $200.
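To make the volume arithmetic in this thread easy to re-check, here is a minimal Python sketch. It only computes data volumes from the figures in the question (500KB objects, 30 million requests, 80,000 stored files); it assumes every request returns one average-sized object and deliberately leaves out prices, which change over time and by region. The function and variable names are made up for illustration.

```python
KB = 1000    # decimal kilobyte
KIB = 1024   # binary kibibyte

def transfer_volume(avg_object_bytes, requests):
    """Total bytes served if every request returns one average-sized object."""
    return avg_object_bytes * requests

monthly_requests = 30_000_000
stored_objects = 80_000

# The same "500KB x 30 million" expressed in decimal and in binary units.
transfer_gb = transfer_volume(500 * KB, monthly_requests) / 1000**3
transfer_gib = transfer_volume(500 * KIB, monthly_requests) / 1024**3

# Data held in S3 at any one time.
storage_gb = stored_objects * 500 * KB / 1000**3

print(f"Data transfer out: {transfer_gb:,.0f} GB ({transfer_gib:,.0f} GiB)")
print(f"Storage: {storage_gb:,.0f} GB")
```

Whichever unit convention is used, the monthly transfer comes out in the region of 14,000 to 15,000GB, so an estimate based on 1,500GB is low by about a factor of ten.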


How do databases store live second data?

By live second data I mean something like the stock market, where new data is recorded every second for each specific stock.
How would the data look in the database? Does it have a timestamp for each second? If so, wouldn't that cause the database to fill up quickly? Are there specific databases that manage this type of data?
Thank you!
Given the sheer amount of money that gets thrown around in fintech, I'd be surprised if trading platforms even use traditional RDBMS databases to store their trading data, but I digress...
How would the data look in the database?
(Again, assuming they're even using a relation-based model in the first place) then something like this in SQL:
CREATE TABLE SymbolPrices (
    Symbol char(4) NOT NULL, -- 4 bytes, or even 3 bytes given that an A-Z character only needs 5 bits (32 values).
    Utc datetime NOT NULL, -- 8-byte timestamp (sub-second precision; the exact resolution depends on the RDBMS)
    Price int NOT NULL -- Assuming integer cents (not 4 decimal digits), that's 4 bytes
);
...which has a fixed row length of 16 bytes.
Does it have a timestamp of each second?
It can do, but not per second - you'd need far greater granularity than that: I wouldn't be surprised if they were using at least 100-nanosecond resolution, which is a common unit for computer system clock "ticks" (e.g. .NET's DateTime.Ticks is a 64-bit integer value of 100-nanosecond units). Java and JavaScript both use milliseconds, though this resolution might be too coarse.
Storage space requirements for changing numeric values can always be significantly optimized if you store deltas instead of absolute values: I reckon it could come down to 8 bytes per record:
I reason that 3 bytes is sufficient to store trade timestamp deltas at ~1.5ms resolution assuming 100,000 trades per day per stock: that's 16.7m values to represent a 7 hour (25,200s) trading window.
Price deltas can also likely be reduced to a 2 byte value (-$327.68 to +$327.67).
And assuming symbols never exceed 4 uppercase Latin characters (A-Z), then that can be represented in 3 bytes.
Giving an improved fixed row length of 8 bytes (3 + 3 + 2).
Though you would now need to store "keyframe" data every few thousand rows to prevent needing to re-play every trade from the very beginning to get the current price.
If data is physically partitioned by symbol (i.e. using a separate file on disk for each symbol) then you don't need to include the symbol in the record at all, bringing the row length down to merely 5 bytes.
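As a rough illustration of the 5-byte delta record described above (3-byte timestamp delta plus 2-byte price delta, with the symbol implied by the file), here is a minimal Python sketch. The function names, the big-endian layout and the idea of replaying from a keyframe price are assumptions made for the example, not a description of any real exchange format.

```python
import struct

# ~1.5 ms per tick when 2**24 ticks cover a 7-hour (25,200 s) trading window.
TICK_SECONDS = 25_200 / 2**24

def encode_delta(ticks_since_prev, price_delta_cents):
    """Pack one trade as a 3-byte unsigned time delta + 2-byte signed price delta."""
    return ticks_since_prev.to_bytes(3, "big") + struct.pack(">h", price_delta_cents)

def decode_delta(record):
    """Inverse of encode_delta: returns (ticks_since_prev, price_delta_cents)."""
    ticks = int.from_bytes(record[:3], "big")
    (price_delta,) = struct.unpack(">h", record[3:5])
    return ticks, price_delta

def replay(keyframe_price_cents, records):
    """Rebuild the current price from a keyframe plus the deltas that follow it."""
    price = keyframe_price_cents
    for record in records:
        _, delta = decode_delta(record)
        price += delta
    return price

trades = [encode_delta(1000, 5), encode_delta(20, -3)]
print(len(trades[0]), replay(10_000, trades))
```

Values outside the 3-byte or 2-byte range raise an error instead of being silently truncated, which is where a real format would have to emit a fresh keyframe or an escape record.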
If so, wouldn't that cause the database to quickly fill up?
No, not really (at least assuming you're using HDDs made since the early 2000s); consider that:
Major stock-exchanges really don't have that many stocks, e.g. NASDAQ only has a few thousand stocks (5,015 apparently).
While high-profile stocks (AAPL, AMD, MSFT, etc) typically have 30-day sales volumes on the order of 20-130m, that's only the most popular ~50 stocks; most stocks have 30-day volumes far below that.
Let's just assume all 5,000 stocks have a 30-day volume of 3m.
That's ~100,000 trades per day, per stock on average.
That would require 100,000 * 16 bytes per day per stock.
That's 1,600,000 bytes per day per stock.
Or 1.5MiB per day per stock.
556MiB per year per stock.
For the entire exchange (of 5,000 stocks) that's 7.5GiB/day.
Or 2.7TB/year.
When using deltas instead of absolute values, then the storage space requirements are halved to ~278MiB/year per stock, or 1.39TB/year for the entire exchange.
In practice, historical information would likely be archived and compressed (likely using a column-major approach to make it more amenable to good compression with general purpose compression schemes, and if data is grouped by symbol then that shaves off another 4 bytes).
Even without compression, partitioning by symbol and using deltas means needing around only 870GB/year for the entire exchange.
That's small enough to fit into a $40 HDD drive from Amazon.
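The storage estimates above can be re-derived with a few lines of Python. This is only a sketch of the same back-of-the-envelope arithmetic, using the answer's assumptions (100,000 trades per day per stock, 5,000 stocks, 365 days); the function name is made up, and the results differ slightly from the rounded figures in the text depending on whether binary or decimal units are used.

```python
MIB = 1024**2
GIB = 1024**3

def yearly_bytes(row_bytes, trades_per_day=100_000, stocks=5_000, days=365):
    """Bytes needed per year for the whole exchange at a fixed row length."""
    return row_bytes * trades_per_day * stocks * days

# 16-byte rows: per-stock and whole-exchange daily volume.
per_stock_day_mib = 16 * 100_000 / MIB
exchange_day_gib = 16 * 100_000 * 5_000 / GIB

print(f"Per stock per day: {per_stock_day_mib:.2f} MiB")
print(f"Exchange per day:  {exchange_day_gib:.2f} GiB")

# 16 = absolute values, 8 = deltas, 5 = deltas partitioned by symbol.
for row_bytes in (16, 8, 5):
    total = yearly_bytes(row_bytes)
    print(f"{row_bytes:>2}-byte rows: {total / 1000**4:.2f} TB/year")
```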
Are there specific Databases that manage this type of stuff?
Undoubtedly, but I don't think they'd need to optimize for storage-space specifically - more likely write-performance and security.
They use big data architectures such as Kappa and Lambda, where data is processed in both near-real-time and batch pipelines. In this case the live second data is "stored" in a messaging engine like Apache Kafka, and is then retrieved, processed and ingested into databases by stream-processing engines like Apache Spark Streaming.
They often don't use RDBMS databases like MySQL, SQL Server and so forth to store the data; instead they use NoSQL data stores, or formats like Apache Avro or Apache Parquet stored in buckets like AWS S3 or Google Cloud Storage, properly partitioned to improve performance.
A full example can be found here: Streaming Architecture with Apache Spark and Kafka

AWS Cloudwatch ELB monitoring active connections

I would like to monitor the maximum number of active connections that my ApplicationELB is managing over a 5-minute period.
The ApplicationELB publishes a metric called ActiveConnectionCount. The documentation describes this in part as:
The total number of concurrent TCP connections active from clients to the load balancer and from the load balancer to targets.
And further states:
The most useful statistic is Sum.
I believe that Sum would total all the active connections reported within the time frame. E.g. Let's say the ELB is maintaining 10 connections and it reports this number every second, then the Sum would be 3000 over a 5-minute period. This is not what I want. Furthermore, when I use SUM over a 5-minute period I'm getting 20k or so -- far more than the number of real concurrent connections which are at most a few hundred.
If I aggregate using Maximum then the number reported by AWS is zero (!?).
If I aggregate using Average then the number appears to be reasonable (ranging from 80 - 200), but also wildly inaccurate. That is, it almost inversely correlates with new connections and response time: during times of the day when response time is low and new connections are low, average active connections is higher.
So, I guess, here are my questions:
(1) How can I achieve seeing maximum number of concurrent connections between ELB and clients/app server? (Ideally, I could separate these two, but it doesn't look like the ELB does that).
Less important, but I'm curious:
(2) Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
(3) Why does the documentation say that SUM should be used?
Thanks for any help / insight!
How can I achieve seeing maximum number of concurrent connections between ELB and clients/app server? (Ideally, I could separate these two, but it doesn't look like the ELB does that).
Why does MAXIMUM yield zero, while AVERAGE yields 80-200?
As you said, the ELB does not do that. From the metrics you can also see something called "SampleCount" which is the number of samples taken during a period of time, by default 1 minute. If we could somehow access the counts in these samples, we could get a min and max sample. For whatever reason, it's currently broken or not implemented and min/max show as 0. Therefore, the most useful metric, in my opinion at least, is the average which takes the sum (of counts) and divides it by the SampleCount.
Why does the documentation say that SUM should be used?
Good question, because if you think about it, it doesn't make much sense and doesn't give you much information, since it's just the sum of the counts in all samples.
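The Sum / SampleCount average described above can be computed from the datapoints yourself. The sketch below is a minimal Python illustration; it assumes each datapoint is a dict with "Timestamp", "Sum" and "SampleCount" keys, mirroring the shape CloudWatch returns when those two statistics are requested, and the sample values are invented for the example.

```python
def average_connections(datapoints):
    """Return (timestamp, Sum / SampleCount) per datapoint, skipping empty ones."""
    result = []
    for dp in datapoints:
        if dp["SampleCount"] > 0:
            result.append((dp["Timestamp"], dp["Sum"] / dp["SampleCount"]))
    return result

def peak_average(datapoints):
    """Highest per-period average, or None if there is no usable data."""
    averages = [avg for _, avg in average_connections(datapoints)]
    return max(averages) if averages else None

# Hypothetical datapoints: 10 connections reported once per second for
# 5 minutes gives Sum = 3000 and SampleCount = 300.
sample = [
    {"Timestamp": "12:00", "Sum": 3000.0, "SampleCount": 300.0},
    {"Timestamp": "12:05", "Sum": 6000.0, "SampleCount": 300.0},
    {"Timestamp": "12:10", "Sum": 0.0, "SampleCount": 0.0},
]
print(average_connections(sample), peak_average(sample))
```

This still yields an average per period, not a true maximum; requesting a shorter period should bring the highest per-period average closer to the real peak.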

Cloud DataFlow performance - are our times to be expected?

Looking for some advice on how best to architect/design and build our pipeline.
After some initial testing, we're not getting the results that we were expecting. Maybe we're just doing something stupid, or our expectations are too high.
Our data/workflow:
Google DFP writes our adserver logs (CSV compressed) directly to GCS (hourly).
A day's worth of these logs has in the region of 30-70 million records, and about 1.5-2 billion for the month.
Perform transformation on 2 of the fields, and write the row to BigQuery.
The transformation involves performing 3 REGEX operations (due to increase to 50 operations) on 2 of the fields, which produces new fields/columns.
What we've got running so far:
Built a pipeline that reads the files from GCS for a day (31.3m), and uses a ParDo to perform the transformation (we thought we'd start with just a day, but our requirements are to process months & years too).
DoFn input is a String, and its output is a BigQuery TableRow.
The pipeline is executed in the cloud with instance type "n1-standard-1" (1vCPU), as we think 1 vCPU per worker is adequate given that the transformation is not overly complex, nor CPU intensive i.e. just a mapping of Strings to Strings.
We've run the job using a few different worker configurations to see how it performs:
5 workers (5 vCPUs) took ~17 mins
5 workers (10 vCPUs) took ~16 mins (in this run we bumped up the instance to "n1-standard-2" to get double the cores to see if it improved performance)
50 min and 100 max workers with autoscale set to "BASIC" (50-100 vCPUs) took ~13 mins
100 min and 150 max workers with autoscale set to "BASIC" (100-150 vCPUs) took ~14 mins
Would those times be in line with what you would expect for our use case and pipeline?
You can also write the output to files and then load it into BigQuery using the command line/console. You'd probably save some money on instance uptime. This is what I've been doing after running into issues with the Dataflow/BigQuery interface. Also, in my experience there is some overhead in bringing instances up and tearing them down (could be 3-5 minutes). Do you include this time in your measurements as well?
BigQuery has a write limit of 100,000 rows per second per table, or 6M per minute. At 31M rows of input that would take ~5 minutes of just flat-out writes. When you add back the discrete processing time per element and then the synchronization time (read from GCS->dispatch->...) of the graph, this looks about right.
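The lower bound implied by that write limit is simple to compute. The Python sketch below uses the 100,000 rows per second figure quoted in this answer (it may not match current BigQuery quotas) and the 31.3m rows from the question; the function name is made up for illustration.

```python
def min_write_minutes(rows, rows_per_second=100_000):
    """Minimum time to write `rows` to one table at a fixed per-table rate limit."""
    return rows / rows_per_second / 60

one_day = min_write_minutes(31_300_000)
print(f"One day of logs: at least {one_day:.1f} minutes of writes")

# Adding workers does not change this floor, because the limit is per table.
one_month = min_write_minutes(2_000_000_000)
print(f"One month (2 billion rows): at least {one_month / 60:.1f} hours of writes")
```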
We are working on a table sharding model so you can write across a set of tables and then use table wildcards within BigQuery to aggregate across the tables (common model for typical BigQuery streaming use case). I know the BigQuery folks are also looking at increased table streaming limits, but nothing official to share.
Net-net increasing instances is not going to get you much more throughput right now.
Another approach, in the meantime while we work on improving the BigQuery sync, would be to shard your reads using pattern matching via TextIO and then run X separate pipelines targeting X tables. Might be a fun experiment. :-)
Make sense?

How are S3 (Amazon Simple Storage Service) storage prices calculated?

I'm not quite sure if this is the correct Stack Exchange site for this question, but I've found no site that fits better.
I'm planning to use S3 for my next project, but I'm not sure how storage is actually billed. I would have no problem if I used S3 just for throwing gigabytes of data in and almost never deleting any. But that's not the case.
What if I store a 1 megabyte file in S3, delete it after 1 hour and put another 1 megabyte file onto S3? Will I be billed for 1 megabyte of storage for that month, or 2 megabytes?
Amazon states:
First 1 TB / month of Storage Used
I don't think they will just bill for what's stored in my S3 account at the end of the month. The other way around, billing every store request as "storage used", would not work either, because a file might be stored for a long time, across multiple billing months.
I hope someone has the answer to that; I couldn't find anything :-)
Storage is billed as an average of all data stored per month. From the Amazon docs:
The volume of storage billed in a month is based on the average storage used throughout the month. This includes all object data and metadata stored in buckets that you created under your AWS account. We measure your storage usage in “TimedStorage-ByteHrs,” which are added up at the end of the month to generate your monthly charges.
Storage Example: Assume you store 100GB (107,374,182,400 bytes) of standard Amazon S3 storage data in your bucket for 15 days in March, and 100TB (109,951,162,777,600 bytes) of standard Amazon S3 storage data for the final 16 days in March.
At the end of March, you would have the following usage in Byte-Hours:
Total Byte-Hour usage = [107,374,182,400 bytes x 15 days x (24 hours / day)] + [109,951,162,777,600 bytes x 16 days x (24 hours / day)] = 42,259,901,212,262,400 Byte-Hours.
Let’s convert this to GB-Months:
42,259,901,212,262,400 Byte-Hours x (1 GB / 1,073,741,824 bytes) x (1 month / 744 hours) = 52,900 GB-Months
So in your example (assuming the 2nd megabyte is stored for the remainder of the month) you will be charged for 1MB.
Remember though, that there are other charges to consider too, like data transfer in/out and total requests etc.
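The byte-hour arithmetic from the quoted Amazon example, and the 1MB swap scenario from the question, can be reproduced with a short Python sketch. The function names are made up for illustration, and the swap scenario assumes a 744-hour month with the second file kept until the end of it.

```python
GB = 1024**3
MB = 1024**2
HOURS_IN_MARCH = 31 * 24  # 744

def byte_hours(periods):
    """periods: iterable of (bytes_stored, hours_stored) pairs."""
    return sum(size * hours for size, hours in periods)

def gb_months(total_byte_hours, hours_in_month):
    """Convert Byte-Hours to GB-Months for a month of the given length."""
    return total_byte_hours / GB / hours_in_month

# Amazon's example: 100GB for 15 days, then 100TB for 16 days.
march = byte_hours([(100 * GB, 15 * 24), (100 * 1024 * GB, 16 * 24)])
print(march, gb_months(march, HOURS_IN_MARCH))

# The question's scenario: 1MB for 1 hour, then another 1MB for the other 743 hours.
swap = byte_hours([(MB, 1), (MB, 743)])
print(swap / MB / HOURS_IN_MARCH, "MB-months")
```

Because billing follows the average over the month, replacing one file with another of the same size is charged the same as keeping a single file for the whole month.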

Measure upload and download speed on iPhone

I would like to measure the upload and download speed of data on iPhone. Is there any API available for this? Is it correct to measure it by dividing the total bytes received by the time taken for the response?
Yes, it is correct to measure total bytes / time taken; that is exactly what speed is. If you want to show the download speed continuously, you might want to take a running average, e.g. take the last 500 bytes and the time it took to download those particular ones.
To do this you could use an NSMutableData as a buffer, which you empty every 2 seconds or so. Then you compute [bufferMutableData length]/2 and you know how many bytes per second you averaged over those 2 seconds. When you empty the buffer, of course, append its contents to the data you are downloading.
There is no direct API to know the speed.
Total data received/sent divided by total time will only give you the average speed. There is usually a lot of variation in speed over time, so if you want a more accurate value, calculate the speed by sampling.
(Data transferred in 1 minute) / (60 seconds) ---> use this approach only if you need greater accuracy in the speed calculation. The sampling duration can be changed based on the level of accuracy required.
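The sampling idea in these answers is independent of the platform, so here is a minimal sketch of it in Python rather than Objective-C. The class and method names are hypothetical; timestamps are passed in explicitly (in an app they would come from a clock at the moment each chunk of data arrives), and the 2-second window is just the value suggested above.

```python
class SpeedSampler:
    """Tracks per-window and overall transfer speed in bytes per second."""

    def __init__(self, start_time, window_seconds=2.0):
        self.start_time = start_time
        self.window_seconds = window_seconds
        self._window_start = start_time
        self._bytes_in_window = 0
        self.total_bytes = 0
        self.samples = []  # bytes/second for each completed window

    def record(self, num_bytes, now):
        """Call whenever a chunk of data arrives; `now` is the current time in seconds."""
        self._bytes_in_window += num_bytes
        self.total_bytes += num_bytes
        elapsed = now - self._window_start
        if elapsed >= self.window_seconds:
            # Close the window: store its speed and start a new one.
            self.samples.append(self._bytes_in_window / elapsed)
            self._window_start = now
            self._bytes_in_window = 0

    def overall_average(self, now):
        """Total bytes divided by total elapsed time."""
        elapsed = now - self.start_time
        return self.total_bytes / elapsed if elapsed > 0 else 0.0

sampler = SpeedSampler(start_time=0.0)
sampler.record(1000, now=1.0)
sampler.record(3000, now=2.0)
sampler.record(500, now=4.0)
print(sampler.samples, sampler.overall_average(now=4.0))
```

The per-window samples show the variation that the overall average hides; a shorter window reacts faster but gives noisier readings.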