Is DynamoDB suitable as an S3 Metadata index?

I would like to store and query a large quantity of raw event data. The architecture I would like to use is the 'data lake' architecture where S3 holds the actual event data, and DynamoDB is used to index it and provide metadata. This is an architecture that is talked about and recommended in many places:
https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
https://www.youtube.com/watch?v=7Px5g6wLW2A
https://s3.amazonaws.com/big-data-ipc/AWS_Data-Lake_eBook.pdf
However, I am struggling to understand how to use DynamoDB for the purposes of querying the event data in S3. In the link to the AWS blog above, they use the example of storing customer events produced by multiple different servers:
S3 path format: [4-digit hash]/[server id]/[year]-[month]-[day]-[hour]-[minute]/[customer id]-[epoch timestamp].data
Eg: a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data
And the schema to record this event in DynamoDB looks like:
Customer ID (Partition Key), Timestamp-Server (Sort Key), S3-Key, Size
87423, 1436055953839-i-31cc02, a5b2/i-31cc02/2015-07-05-00-25/87423-1436055953839.data, 1234
I would like to perform a query such as: "Get me all the customer events produced by all servers in the last 24 hours" but as far as I understand, it's impossible to efficiently query DynamoDB without using the partition key. I cannot specify the partition key for this kind of query.
Given this requirement, should I use a database other than DynamoDB to record where my events are in S3? Or do I simply need to use a different type of DynamoDB schema?

The architecture looks fine and is feasible with DynamoDB. The DynamoDBMapper class (part of the AWS SDK for Java) can be used to create the model, and it has useful methods for getting the data from S3.
DynamoDBMapper.getS3ClientCache(): Returns the underlying S3ClientCache for accessing S3.
A DynamoDB table can't be queried without the partition key; you would have to scan the whole table if the partition key is not available. However, you can create a Global Secondary Index (GSI) on a date/time field and query the data for your use case.
In simple terms, a GSI is similar to an index in an RDBMS. The difference is that you query the GSI directly rather than the main table. A GSI is normally needed when you want to query DynamoDB for a use case where the table's partition key is not available. There are options to project ALL of the main table's attributes into the GSI, or only selected ones.
Global Secondary Index (GSI)
Difference between Scan and Query in DynamoDB
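If you did go down the GSI route, a minimal boto3 sketch of adding such an index looks roughly like the following (the table, index, and attribute names here are my assumptions, not something from the question):

import boto3

dynamodb = boto3.client("dynamodb")

# Add a GSI keyed on a date attribute, with the epoch timestamp as its sort key.
dynamodb.update_table(
    TableName="s3-metadata-index",  # assumed table name
    AttributeDefinitions=[
        {"AttributeName": "EventDate", "AttributeType": "S"},      # e.g. "2015-07-05"
        {"AttributeName": "EpochTimestamp", "AttributeType": "N"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "EventDate-EpochTimestamp-index",
                "KeySchema": [
                    {"AttributeName": "EventDate", "KeyType": "HASH"},
                    {"AttributeName": "EpochTimestamp", "KeyType": "RANGE"},
                ],
                # ALL projects every attribute; KEYS_ONLY / INCLUDE are the
                # "selective fields" options mentioned above.
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }
    ],
)

Note that querying this index still requires an equality condition on EventDate, which is exactly the limitation discussed next.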
Yes, in this use case it looks like a GSI can't help, because the use case requires a range query on the partition key, and DynamoDB supports only the equality operator on partition keys. DynamoDB supports range queries on sort keys (or filters on other non-key attributes) only when the partition key is available. You may have to scan the table to fulfill this use case, which is a costly operation.
Either you have to think about an alternate data model where you can query by partition key, or use some other database.

First, I've read that same AWS blog page too: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
The only way you can make this work with DynamoDB is:
add another attribute called "foo" and put the same value, 1, in it for all items
add another attribute called "timestamp" and put the epoch timestamp there
create a GSI with partition key "foo" and range key "timestamp", and project all the other attributes
Looks a bit dirty, huh? Then you can query items for the last 24 hours with partition key value 1 (all items have 1) and a range condition on that timestamp key (see the sketch after the list below). Now, the problems:
A GSI with all items under the same partition key? Performance will suck if the data grows large.
Costs more with a GSI
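For what it's worth, here is a rough boto3 sketch of that "last 24 hours" query against such a GSI (the table and index names are assumptions; "foo" and "timestamp" follow the naming above):

import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("s3-metadata-index")  # assumed name

now_ms = int(time.time() * 1000)
day_ago_ms = now_ms - 24 * 60 * 60 * 1000
condition = Key("foo").eq(1) & Key("timestamp").between(day_ago_ms, now_ms)

resp = table.query(IndexName="foo-timestamp-index", KeyConditionExpression=condition)
items = resp["Items"]
# A 24-hour window can easily exceed the 1 MB page limit, so keep paginating.
while "LastEvaluatedKey" in resp:
    resp = table.query(
        IndexName="foo-timestamp-index",
        KeyConditionExpression=condition,
        ExclusiveStartKey=resp["LastEvaluatedKey"],
    )
    items.extend(resp["Items"])

It works, but every read funnels through that single "foo" = 1 partition, which is the performance problem described above.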
You should think about the costs as well. Think about your data ingestion rate: putting 1000 objects per second into a bucket would cost you about $600 per month, and $600 more with the GSI. Just because of that one query need (last 24 hrs), you have to spend $600 more.
I'm encountering the same problems designing this metadata index. DynamoDB just doesn't look right; this is what you always get when you try to use DynamoDB the way you would use an RDBMS. I have a few querying needs like yours. I thought about Elasticsearch and the S3 listing river plugin, but that doesn't look good either, since I'd have to manage ES clusters and storage. What about CloudSearch? Looking at its limits, CloudSearch doesn't feel right either.
My requirements:
be able to access the most recent object with a given prefix
be able to access objects within a specific time range
get maximum performance out of S3 by using hashed strings in the key space, for AWS EMR, Athena, or Redshift Spectrum
I am lost here. I even thought about the S3 versioning feature, since it gives me the most recent object naturally. Nothing seems quite right, and the AWS documents and blog articles are full of confusion.
This is where I've been stuck for the whole week :(
People at AWS just love drawing diagrams. When they introduce some new architecture scheme or concept, they just put a bunch of AWS product icons there and say it's beautifully integrated.

Related

Table without date and Primary Key

I have 9M records. We currently do the following:
Daily, we receive the entire file of 9M records, about 150 GB in size.
It is a truncate-and-load into Snowflake: every day we delete the entire 9M records and reload them.
We would like to send only an incremental file to Snowflake. Meaning that:
For example, out of 9 million records, only 0.5 million would have changed (0.1 M inserts, 0.3 M deletes, and 0.2 M updates). How can we compare the files, extract only the delta file, and load it into Snowflake? How can we do this cost-effectively and quickly with AWS-native tools, landing the output in S3?
P.S. The data doesn't have any date column. It is a pretty old design, written in 2012, and we need to optimize it. The file format is fixed width. Attaching sample raw data.
Sample Data:
https://paste.ubuntu.com/p/dPpDx7VZ5g/
In a nutshell, I want to extract only the inserts, updates, and deletes into a file. What is the best and most cost-efficient way to classify them?
Your tags and the question content do not match, but I am guessing that you are trying to load data from Oracle to Snowflake. You want to do an incremental load from Oracle, but you do not have an incremental key in the table to identify the changed rows. You have two options:
Work with your data owners and put in the effort to identify an incremental key. There needs to be one; people are sometimes too lazy to put in this effort. This is the most optimal option.
If you cannot, then look for a CDC (change data capture) solution like Oracle GoldenGate.
A CDC stage comes by default in DataStage.
Using the CDC stage in combination with a Transformer stage is the best approach to identify new rows, changed rows, and rows to delete.
You need to identify the column(s) that make a row unique; doing CDC with all columns is not recommended, and a DataStage job with a CDC stage consumes more resources the more change columns you add to the CDC stage.
Work with your BA to identify the column(s) that make a row unique in the data.
I had a similar problem to yours. In my case, there was no primary key and there was no date column to identify the difference. So what I actually did was use AWS Athena (managed Presto) to calculate the difference between the source and the destination. Below is the process:
Copy the source data to S3.
Create a source table in Athena pointing at the data copied from the source.
Create a destination table in Athena pointing at the destination data.
Now use SQL in Athena to find the difference. As I had neither a primary key nor a date column, I used the script below:
select * from table_destination
except
select * from table_source;
If you have a primary key, you can use it to find the difference as well, and create a result table with a column that says "insert/update/delete" (see the sketch at the end of this answer).
This option is AWS-native, and it is cheap as well, as Athena costs $5 per TB scanned. Also, with this method, do not forget to write file rotation scripts to cut down your S3 costs.
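If you do have a primary key, a hedged sketch of that classification, run through the Athena API with boto3, could look like this (the database, table, column, and bucket names are all hypothetical):

import boto3

athena = boto3.client("athena")

# "id" is the assumed primary key; "payload" stands in for the non-key columns.
classify_sql = """
SELECT COALESCE(s.id, d.id) AS id,
       CASE
           WHEN d.id IS NULL THEN 'insert'   -- only in the new source file
           WHEN s.id IS NULL THEN 'delete'   -- only in the previous load
           ELSE 'update'                     -- in both, but the row changed
       END AS change_type
FROM table_source s
FULL OUTER JOIN table_destination d ON s.id = d.id
WHERE s.id IS NULL OR d.id IS NULL OR s.payload <> d.payload
"""

athena.start_query_execution(
    QueryString=classify_sql,
    QueryExecutionContext={"Database": "delta_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/delta/"},
)

The result lands in S3 as a delta file that can then be loaded into Snowflake.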

BigQuery: Best way to handle frequent schema changes?

Our BigQuery schema is heavily nested/repeated and constantly changes. For example, a new page, form, or user-info field on the website would correspond to new columns in BigQuery. Also, if we stop using a certain form, the corresponding deprecated columns will be there forever, because you can't delete columns in BigQuery.
So we're eventually going to end up with tables that have hundreds of columns, many of which are deprecated, which doesn't seem like a good solution.
The primary alternative I'm looking into is to store everything as JSON (for example, each BigQuery table would just have two columns, one for the timestamp and another for the JSON data). Then the batch jobs we have running every 10 minutes would perform the joins/queries and write to aggregated tables. But with this method, I'm concerned about increasing query-job costs.
Some background info:
Our data comes in as protobuf, and we update our BigQuery schema based on the protobuf schema updates.
I know one obvious solution is to not use BigQuery and just use document storage instead, but we use BigQuery both as a data lake and as a data warehouse for BI and for building Tableau reports. So we have jobs that aggregate the raw data into tables that serve Tableau.
The top answer here doesn't work that well for us because the data we get can be heavily nested with repeats: BigQuery: Create column of JSON datatype
You are already well prepared; you lay out several options in your question.
You could go with the JSON table, and to keep costs low:
you can use a partitioned table
you can cluster your table
So instead of having just the two timestamp+JSON columns, I would add 1 partitioning column and up to 4 clustering columns as well. Eventually you could even use yearly suffixed tables. This way you have several dimensions to scan only a limited number of rows for rematerialization.
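A minimal sketch of such a partitioned and clustered JSON table with the google-cloud-bigquery client (the project, dataset, and field names are assumptions for illustration):

from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.analytics.raw_events",
    schema=[
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("event_type", "STRING"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("page", "STRING"),
        bigquery.SchemaField("form_id", "STRING"),
        bigquery.SchemaField("payload", "STRING"),  # the raw JSON blob
    ],
)
# Partition on the timestamp, cluster on the columns you filter by most often.
table.time_partitioning = bigquery.TimePartitioning(field="event_ts")
table.clustering_fields = ["event_type", "user_id", "page", "form_id"]

client.create_table(table)

Rematerialization queries can then restrict the partition range and filter on the clustering columns, so only a small slice of the table is scanned.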
The other option would be to change your model and add an event-processing middle layer. You could first wire all your events either to Dataflow or Pub/Sub, then process them there and write to BigQuery with a new schema. This pipeline would be able to create tables on the fly with the schema you code in your engine.
BTW, you can remove columns; that's rematerialization: you can rewrite the same table with a query. You can rematerialize to remove duplicate rows as well.
I think this use case can be implemented using Dataflow (or Apache Beam) with its Dynamic Destinations feature. The steps of the Dataflow pipeline would be:
read the event/JSON from Pub/Sub
flatten the events and filter down to the columns you want to insert into the BQ table
with Dynamic Destinations you will be able to insert the data into the respective tables (if you have events of various types); in Dynamic Destinations you can specify the schema on the fly based on the fields in your JSON
get the failed insert records from the Dynamic Destinations write and write them to a file per event type, with some windowing based on your use case (how frequently you observe such issues)
read the file, update the schema once, and load the file into that BQ table
I have implemented this logic in my use case and it is working perfectly fine.
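For reference, a rough Python-SDK equivalent of this approach (failed-insert handling omitted; the project, topic, and schema below are assumptions) passes a callable as the table argument of WriteToBigQuery, so each element is routed to a per-event-type table:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SCHEMA = "event_type:STRING,user_id:STRING,ts:TIMESTAMP,payload:STRING"

def parse(message):
    # Flatten the incoming JSON into the columns we want to load.
    event = json.loads(message.decode("utf-8"))
    return {
        "event_type": event["type"],
        "user_id": event.get("user_id"),
        "ts": event["ts"],
        "payload": json.dumps(event),
    }

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(parse)
        | "WriteDynamic" >> beam.io.WriteToBigQuery(
            # Evaluated per element: each event type lands in its own table.
            table=lambda row: "my-project:events.raw_%s" % row["event_type"],
            schema=SCHEMA,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )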

Structuring a large DynamoDB table with many searchable attributes?

I've been struggling with the best way to structure my table. It's intended to hold many, many GBs of data (I haven't been given a more detailed estimate). The table will hold claims data (example here), with the partition key being the resourceType and the sort key being the id (although these could potentially be changed). The end user should be able to search by a number of attributes (institution, provider, payee, etc., totaling ~15).
I've been toying with combining global and local indices in order to achieve this functionality on the backend. What would be the best way to structure the table to allow a user to search the data according to 1 or more of these attributes in essentially any combination?
If you use resourceType as a partition key you are essentially throwing away the horizontal scaling features that DynamoDB provides out of the box.
The reason to partition your data is such that you distribute it across many nodes in order to be able to scale without incurring a performance penalty.
It sounds like you're looking to put all claim documents into a single partition so you can do "searches" by arbitrary attributes.
You might be better off combining your DynamoDB table with something like ElasticSearch for quick, arbitrary search capabilities.
Keep in mind that DynamoDB can only accommodate approximately 10GB of data in a single partition and that a single partition is limited to up to 3000 reads per second, and up to 1000 writes per second (reads + 3 * writes <= 3000).
Finally, you might consider storing your claim documents directly into ElasticSearch.
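A hedged sketch of what that "any combination of attributes" search could look like against Elasticsearch with the Python client (the index name and field names are assumptions based on the question):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_claims(**attrs):
    # attrs e.g. institution="INST-42", provider="PRV-7", payee=None
    filters = [{"term": {field: value}} for field, value in attrs.items() if value is not None]
    return es.search(
        index="claims",
        body={"query": {"bool": {"filter": filters}}, "size": 50},
    )

hits = search_claims(institution="INST-42", provider="PRV-7")["hits"]["hits"]
# Index only the searchable attributes plus the claim id, then fetch the full
# document from DynamoDB (or S3) by its primary key.

That keeps DynamoDB keyed for cheap point lookups while Elasticsearch handles the arbitrary filter combinations.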

Dynamic partitioning in google cloud dataflow?

I'm using Dataflow to process files stored in GCS and write to BigQuery tables. Below are my requirements:
input files contain event records; each record pertains to one eventType;
records need to be partitioned by eventType;
for each eventType, output/write the records to a corresponding BigQuery table, one table per eventType;
the event types in each batch of input files vary;
I'm thinking of applying transforms such as GroupByKey and Partition; however, it seems that I have to know the number (and types) of events at development time, which is needed to determine the partitions.
Do you have a good way to do the partitioning dynamically, meaning the partitions can be determined at run time?
Why not load everything into a single "raw" BigQuery table and then use the BigQuery API to determine the distinct event types and export each event type to its own table (e.g., via https://cloud.google.com/bigquery/bq-command-line-tool#createtablequery or an API call)?
If your input format is simple, you can do that without using Dataflow at all, and it will probably be more cost-efficient.
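A sketch of that "single raw table, then split by event type" idea with the google-cloud-bigquery client (the project, dataset, and table names are assumptions):

from google.cloud import bigquery

client = bigquery.Client()
RAW = "my-project.events.raw"

# 1) Discover which event types are present in the raw table.
types = [row.event_type for row in client.query(
    f"SELECT DISTINCT event_type FROM `{RAW}`").result()]

# 2) Write each event type into its own table via a destination-table query job.
for event_type in types:
    job_config = bigquery.QueryJobConfig(
        destination=f"my-project.events.by_type_{event_type}",
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        query_parameters=[bigquery.ScalarQueryParameter("t", "STRING", event_type)],
    )
    client.query(
        f"SELECT * FROM `{RAW}` WHERE event_type = @t",
        job_config=job_config,
    ).result()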

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash key and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain timerange regardless of their UUIDs?
DynamoDB has since introduced Global Secondary Indexes, which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time-series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written with excessive frequency. This can create a serious bottleneck, as the total throughput for a particular DynamoDB partition is equal to the total provisioned throughput divided by the number of partitions - that means if all your writes are to a single key (today's key) and you have a throughput of 20 writes per second, then with 20 partitions, your total throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic MapReduce jobs, but I have not tried that myself yet, so I cannot say how easy or effective it is to work with.
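A boto3 sketch of that query loop (the table name, bucket count, and attribute names are assumptions; the hash key here is "date#bucket" and the range key is the UUID):

from datetime import date, timedelta

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("messages-by-day")  # assumed name
NUM_BUCKETS = 200  # the "n" discussed above

def messages_between(start_day, end_day):
    day = start_day
    while day <= end_day:
        for bucket in range(NUM_BUCKETS):
            resp = table.query(
                KeyConditionExpression=Key("day_bucket").eq(f"{day.isoformat()}#{bucket}")
            )
            for item in resp["Items"]:
                yield item
            # (pagination via LastEvaluatedKey omitted for brevity)
        day += timedelta(days=1)

# e.g. everything from yesterday and today:
for msg in messages_between(date.today() - timedelta(days=1), date.today()):
    print(msg)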
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. All DynamoDB queries MUST specify the hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index, and you are also not able to issue a query with only the range key (you can test this easily in the AWS Console).
A (costly) workaround I can think of is to issue a scan of the table, adding a filter on the timestamp value to find the items to delete. Note that the filter does not reduce the consumed capacity, as the scan still reads the whole table.
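A boto3 sketch of that scan-and-delete cleanup (the table and attribute names are assumptions based on the question's UUID hash key and messageTimestamp range key):

import time

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("messages")  # assumed name
cutoff = int(time.time() * 1000) - 7 * 24 * 60 * 60 * 1000  # e.g. older than 7 days

scan_kwargs = {
    "FilterExpression": Attr("messageTimestamp").lt(cutoff),
    "ProjectionExpression": "#u, messageTimestamp",
    "ExpressionAttributeNames": {"#u": "uuid"},  # assumed hash key name
}
while True:
    resp = table.scan(**scan_kwargs)
    with table.batch_writer() as batch:
        for item in resp["Items"]:
            # Deletes need the full primary key: hash (uuid) + range (messageTimestamp).
            batch.delete_item(
                Key={"uuid": item["uuid"], "messageTimestamp": item["messageTimestamp"]}
            )
    if "LastEvaluatedKey" not in resp:
        break
    scan_kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

As noted above, the scan still consumes read capacity for every item it touches, filtered or not.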