What is the storage cost of an empty key on S3

Suppose I upload a zero length object to S3 and then walk away.
Will I be charged (monthly) for storing it, besides the initial put? Or is the metadata for that object "free".

You will likely be charged for one PUT/POST request, which is about $0.000005.

The Amazon documentation is not clear on this, but there are a few things that lead me to assume that zero-length objects accrue storage costs.
The S3 Billing FAQs mention that metadata is included in storage costs.
The volume of storage billed in a month is based on the average storage used throughout the month. This includes all object data and metadata stored in buckets that you created under your AWS account.
https://aws.amazon.com/s3/faqs/#Billing
Metadata includes mandatory system-defined fields, such as ETag and the LastModified timestamp.
There are two kinds of metadata in Amazon S3: system-defined metadata and user-defined metadata.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html
Even delete markers, which are comparable to zero-length objects, are charged for the storage of their key name.
Delete markers accrue a nominal charge for storage in Amazon S3. The storage size of a delete marker is equal to the size of the key name of the delete marker. A key name is a sequence of Unicode characters. The UTF-8 encoding adds 1–4 bytes of storage to your bucket for each character in the name.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeleteMarker.html
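As a quick sanity check of what actually gets stored, here is a minimal sketch using boto3 (bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Upload a zero-length object; this call is the one-time PUT request charge
    s3.put_object(Bucket="my-example-bucket", Key="empty-object", Body=b"")

    # The object still carries system metadata (ETag, LastModified, ContentLength=0),
    # which is what any ongoing storage charge, however tiny, would be based on
    head = s3.head_object(Bucket="my-example-bucket", Key="empty-object")
    print(head["ContentLength"], head["ETag"], head["LastModified"])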

Related

Partitioning with key in kafka connect s3 sink

Can we partition our output in s3 sink connector with key?
How can we set in connector config to just hold latest 10 record of each key or just hold data of 10 minutes ago? or partitioning with key and time period.
You'd need to set store.kafka.keys=true for the S3 sink to store keys at all (it doesn't by default), but those keys will be written to separate files from the values, within whatever partitioner you've configured (a config sketch follows at the end of this answer).
Otherwise, the FieldPartitioner only uses the value of the record, so you'd need an SMT to move the record key into the value in order to partition on it.
Last I checked, there is still an open PR on Github for a Field and Time partitioner.
The S3 sink doesn't window or compact any data; it'll dump and store everything. You'll need an external process such as a Lambda function to clean up data over time.
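For reference, a hedged sketch of what such a connector config could look like when submitted to the Connect REST API; the connector name, topic, bucket, field name, and Connect URL are placeholders, and the exact property names should be checked against your S3 sink version:

    import requests

    # Hypothetical S3 sink config that also stores record keys (store.kafka.keys=true)
    # and partitions output by a field taken from the record value
    config = {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "topics": "events",
        "s3.bucket.name": "my-sink-bucket",
        "s3.region": "us-east-1",
        "store.kafka.keys": "true",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "keys.format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
        "partition.field.name": "customer_id",
        "flush.size": "1000",
    }

    # Create or update the connector via the Connect REST API
    resp = requests.put("http://localhost:8083/connectors/s3-sink/config", json=config)
    resp.raise_for_status()
    print(resp.json())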

BigQuery: Pricing for Querying parquet files, as external data sources, from the Coldline Cloud Storage class

BigQuery allows for querying of external tables in various storage classes, including Coldline.
Accessing data from Coldline has a data retrieval fee.
Parquet format files provide columnar storage. When accessing Parquet format files from Coldline GCS via BigQuery, is the data retrieval cost based on the columns of data queried or for the entire Parquet file?
To address the easy part of your question first, BigQuery charges based on the logical (uncompressed) size of just the columns read for all files that need to be read. If you read an integer field "foo" in a file that has 1M rows, you'll get charged 8MB (8 bytes per int * # of rows).
If a file can be skipped either due to Hive partition pruning or because the Parquet header contains information that says the file is not necessary for the query, then there are no charges for scanning that file.
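As a rough illustration of that arithmetic (8 bytes is BigQuery's logical size for an INT64 value; the row count is made up):

    # Estimate of BigQuery bytes billed for reading one integer column
    rows = 1_000_000
    bytes_per_int64 = 8                      # logical (uncompressed) size of an INT64
    billed_bytes = rows * bytes_per_int64
    print(billed_bytes / 1_000_000, "MB")    # -> 8.0 MB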
The other part of your question is about billing of reads from Coldline. If you read from Coldline in BigQuery, you will not be billed for Coldline reads. That said, please do not count on this remaining the case for the long term; there is discussion going on internally within Google about how to close this hole.
In the future, when Coldline reads are charged, most likely it will be as follows: the total amount of physical bytes necessary to run the query will be billed.
Parquet files have headers containing file metadata, then blocks with their own metadata, and columns. To read a Parquet file you need to read the file header, the block headers, and the columns. Depending on the filter, some blocks may be skippable, in which case you won't get charged for them. On the other hand, some queries may require reading the same file multiple times (e.g. a self-join). The physical read size would then be the sum of all the bytes read for each time the file was read.

Is DynamoDB suitable as an S3 Metadata index?

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an architecture that is talked about and recommended in many places:
https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
https://www.youtube.com/watch?v=7Px5g6wLW2A
https://s3.amazonaws.com/big-data-ipc/AWS_Data-Lake_eBook.pdf
However, I am struggling to understand how to use DynamoDB for the purposes of querying the event data in S3. In the link to the AWS blog above, they use the example of storing customer events produced by multiple different servers:
S3 path format: [4-digit hash]/[server id]/[year]-[month]-[day]-[hour]-[minute]/[customer id]-[epoch timestamp].data
Eg: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data
And the schema to record this event in DynamoDB looks like:
Customer ID (Partition Key), Timestamp-Server (Sort Key), S3-Key, Size
87423, 1436055953839-i-31cc02, a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data, 1234
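(For concreteness, here is a small sketch of how a key and index item in that format could be built; the helper name is hypothetical and the 4-digit hash prefix is simply passed in rather than derived.)

    from datetime import datetime, timezone

    def make_records(hash_prefix, server_id, customer_id, epoch_ms, size):
        # Build the S3 key in [hash]/[server id]/[YYYY-MM-DD-HH-MM]/[customer id]-[epoch].data form
        dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
        s3_key = f"{hash_prefix}/{server_id}/{dt:%Y-%m-%d-%H-%M}/{customer_id}-{epoch_ms}.data"
        # And the matching DynamoDB item from the schema above
        item = {
            "Customer ID": customer_id,                     # partition key
            "Timestamp-Server": f"{epoch_ms}-{server_id}",  # sort key
            "S3-Key": s3_key,
            "Size": size,
        }
        return s3_key, item

    print(make_records("a5b2", "i-31cc02", 87423, 1436055953839, 1234))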
I would like to perform a query such as: "Get me all the customer events produced by all servers in the last 24 hours" but as far as I understand, it's impossible to efficiently query DynamoDB without using the partition key. I cannot specify the partition key for this kind of query.
Given this requirement, should I use a database other than DynamoDB to record where my events are in S3? Or do I simply need to use a different type of DynamoDB schema?
The architecture looks fine and is feasible using DynamoDB. The DynamoDBMapper class (present in the AWS SDK for Java) can be used to create the model; it has useful methods to get the data from S3.
DynamoDBMapper: getS3ClientCache() returns the underlying S3ClientCache for accessing S3.
A DynamoDB table can't be queried without the partition key; you have to scan the whole table if the partition key is not available. However, you can create a Global Secondary Index (GSI) on a date/time field and query the data for your use case (a boto3 sketch of creating such an index follows at the end of this answer).
In simple terms, a GSI is similar to an index in an RDBMS. The difference is that you query the GSI directly rather than the main table. Normally, a GSI is required when you would like to query DynamoDB for a use case where the table's partition key is not available. There are options to project ALL or only selected fields from the main table into the GSI.
Global Secondary Index (GSI)
Difference between Scan and Query in DynamoDB
Yes, in this use case it looks like a GSI can't help, as the use case requires a range query on the partition key, and DynamoDB supports only the equality operator on partition keys. DynamoDB supports range queries only on sort keys (or as filters on non-key attributes) when the partition key is available. You may have to scan the table to fulfill this use case, which is a costly operation.
You either have to think about an alternate data model where you can query by partition key, or use some other database.
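For what it's worth, adding such a GSI to an existing table would look roughly like this with boto3; the table, index, and attribute names are hypothetical:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Add a GSI keyed on a date attribute (partition) and timestamp (sort) to an existing table
    dynamodb.update_table(
        TableName="s3-metadata-index",
        AttributeDefinitions=[
            {"AttributeName": "EventDate", "AttributeType": "S"},
            {"AttributeName": "Timestamp", "AttributeType": "N"},
        ],
        GlobalSecondaryIndexUpdates=[
            {
                "Create": {
                    "IndexName": "EventDate-Timestamp-index",
                    "KeySchema": [
                        {"AttributeName": "EventDate", "KeyType": "HASH"},
                        {"AttributeName": "Timestamp", "KeyType": "RANGE"},
                    ],
                    "Projection": {"ProjectionType": "ALL"},
                    # Omit ProvisionedThroughput if the table uses on-demand billing
                    "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
                }
            }
        ],
    )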
First, I've read that same AWS blog page too: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
The only way you can make this work with DynamoDB is:
add another attribute called "foo" and put the same value 1 in it for all items
add another attribute called "timestamp" and put the epoch timestamp there
create a GSI with partition key "foo" and range key "timestamp", and project all other attributes
Looks a bit dirty, huh? Then you can query items for the last 24 hours with partition key 1 (all items have 1) and a range condition on that timestamp key (a sketch follows below). Now, the problems:
A GSI with all items under the same partition key? Performance will suck if the data grows large
Costs more with a GSI
You should think about the costs as well. Think about your data ingestion rate: putting 1000 objects per second in a bucket would cost you about $600 per month, and $600 more with the GSI. Just because of that query need (last 24 hrs), you have to spend $600 more.
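A minimal sketch of that query, assuming boto3 and hypothetical table, index, and attribute names:

    import time
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("s3-metadata-index")

    # "foo" is the constant partition key (always 1); "timestamp" is the epoch-millis sort key
    now_ms = int(time.time() * 1000)
    day_ago_ms = now_ms - 24 * 60 * 60 * 1000

    resp = table.query(
        IndexName="foo-timestamp-index",
        KeyConditionExpression=Key("foo").eq(1) & Key("timestamp").between(day_ago_ms, now_ms),
    )
    for item in resp["Items"]:
        print(item["S3-Key"])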
I'm encountering the same problems designing this metadata index. DynamoDB just doesn't look right; this is always what you get when you try to use DynamoDB the way you would use an RDBMS. I have a few querying needs like yours. I thought about Elasticsearch and the s3 listing river plugin, and it doesn't look good either, since I have to manage ES clusters and storage. What about CloudSearch? Looking at its limits, CloudSearch doesn't feel right either.
My requirements:
be able to access the most recent object with a given prefix
be able to access objects within a specific time range
get maximum performance out of S3 by using hash strings in the key space for AWS EMR, Athena, or Redshift Spectrum
I am all lost here. I even thought about the S3 versioning feature, since I can get the most recent object naturally. Nothing seems quite right, and the AWS documents and blog articles are full of confusion.
This is where I've been stuck for the whole week :(
People at AWS just love drawing diagrams. When they introduce some new architecture scheme or concept, they just put a bunch of AWS product icons there and say it's beautifully integrated.

Is there total size limit for S3 bucket?

I am wondering if there is a size limit for an S3 bucket, or can I store an unlimited number of objects?
I need this information to decide whether or not I need to write a cleaner tool.
You can store unlimited objects in your S3 bucket. However, there are limits on individual objects stored -
An object can be 0 bytes to 5TB.
The largest object that can be uploaded in a single PUT is 5 gigabytes
For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
If you have too many objects, then consider adding a randomized prefix to your object names.
When your workload is a mix of request types, introduce some randomness to key names by adding a hash string as a prefix to the key name. By introducing randomness to your key names the I/O load will be distributed across multiple index partitions. For example, you can compute an MD5 hash of the character sequence that you plan to assign as the key and add 3 or 4 characters from the hash as a prefix to the key name.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
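A simple sketch of that hashing idea in Python (the exact prefix length and key layout are up to you):

    import hashlib

    def prefixed_key(key: str, prefix_len: int = 4) -> str:
        # Prepend a few hex characters of the key's MD5 hash to spread keys across index partitions
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return f"{digest[:prefix_len]}/{key}"

    # Prints the original key with a 4-character hex prefix in front of it
    print(prefixed_key("2015-07-05/i-31cc02/87423-1436055953839.data"))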
Yes, you can store an unlimited number of objects.
BTW, if your consideration is performance, see
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html

S3 limit to objects in a bucket

Does anyone know if there is a limit to the number of objects I can put in an S3 bucket? Can I put a million, 10 million, etc., all in a single bucket?
According to Amazon:
Write, read, and delete objects containing from 0 bytes to 5 terabytes of data each. The number of objects you can store is unlimited.
Source: http://aws.amazon.com/s3/details/ as of Sep 3, 2015.
It looks like the limit has changed. A single object can now be up to 5 TB.
The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
http://aws.amazon.com/s3/faqs/#How_much_data_can_I_store
There is no limit on objects per bucket.
There is a limit of 100 buckets per account (you need to request amazon if you need more).
There is no performance drop even if you store millions of objects in a single bucket.
From docs,
There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether you use many buckets or just a few. You can store all of your objects in a single bucket, or you can organize them across several buckets.
as of Aug 2016
While you can store an unlimited number of files/objects in a single bucket, when you go to list a "directory" in a bucket, it will only give you the first 1000 files/objects in that bucket by default. To access all the files in a large "directory" like this, you need to make multiple calls to their API.
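For example, with boto3 you would page through the listing rather than rely on a single call (bucket name and prefix are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # list_objects_v2 returns at most 1000 keys per call; a paginator follows the continuation tokens
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-big-bucket", Prefix="some/directory/"):
        for obj in page.get("Contents", []):
            print(obj["Key"])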
There is no limit to the number of objects you can store in your S3 bucket. AWS claims it has unlimited storage. However, there are some limitations -
By default, customers can provision up to 100 buckets per AWS account. However, you can increase your Amazon S3 bucket limit by visiting AWS Service Limits.
An object can be 0 bytes to 5TB.
The largest object that can be uploaded in a single PUT is 5 gigabytes
For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
That being said, if you really have a lot of objects to store in an S3 bucket, consider randomizing your object name prefixes to improve performance.
When your workload is a mix of request types, introduce some randomness to key names by adding a hash string as a prefix to the key name. By introducing randomness to your key names the I/O load will be distributed across multiple index partitions. For example, you can compute an MD5 hash of the character sequence that you plan to assign as the key and add 3 or 4 characters from the hash as a prefix to the key name.
More details - https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
-- As of June 2018
"You can store as many objects as you want within a bucket, and write,
read, and delete objects in your bucket. Objects can be up to 5
terabytes in size."
from http://aws.amazon.com/s3/details/ (as of Mar 4th 2015)
@Acyra: performance of object delivery from a single bucket would depend greatly on the names of the objects in it.
If the file names were distanced by random characters, their physical locations would be spread further across the AWS hardware, but if you named everything 'common-x.jpg', 'common-y.jpg', then those objects would be stored together.
This may slow delivery of the files if you request them simultaneously, but not by enough to worry you; the greater risk is from data loss or an outage, since objects stored together will be lost or become unavailable together.