Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id - google-bigquery

I'm looking for an appropriate google data/storage option to use as a location to stream raw, JSON events into.
The events are generated by users in response to very large email broadcasts so throughput could be very low one moment and up to ~25,000 events per-second for short periods of time. The JSON representation for these events will probably only be around 1kb each
I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes, from here I can stream directly into a BigQuery table or Pub/Sub topic.
BigQuery seems more than capable of handling the streaming inserts, however it seems to have no concept of auto-incrementing id columns and also suggests that its query model is best-suited for aggregate queries rather than narrow-result sets. My requirement to query for the next highest row would clearly go against this.
The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up.
My desire for an auto-identifier and possibly an datestamp column makes this feel like a 'tabular' use-case and therefore I'm feeling the NoSQL options might also be inappropriate
If anybody has a better suggestion I would love to get some input.

We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose the appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option?
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
As far as Cloud Bigtable, as noted by Les in the comments:
Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.
See:
You can consult this https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind
You could again use a timestamp as a key here although you would want to do some work to e.g. add a hash or other unique-fier in order to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number all your writes wouldb be going to the same server).
At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery:
https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

Related

API which provide data from Elastic Search and not SQL

I have a system where there are large dataset(s) where I want to have quick searches, and elastic search is suitable for it. So the data resides in SQL, and is synced to ES. There is an obvious small delay in this sync.
There are consumers of this data which could work with slightly stale data. So if there's an API for UI which end users use to see the dataset. A delay of 3-4 seconds is acceptable. So API handler which deals with ES is perfect here.
Then there are consumers of this data (bots) who want to work with real time data. So for the almost same requirements, should I create another API just like that in UI consumer, which gets data from SQL?
What is the usual best practice which is followed, and I'm assuming this is a very common usecase.
You probably should stick to creating just a sinlge API and use a query string parameter to decide which of the two data sources to use. This will result in less code to maintain.

Custom Dataflow Template - BigQuery to CloudStorage - documentation? general solution advice?

I am consuming a BigQuery table datasource. It is 'unbounded' as it is updated via a batch process. It contains session keyed reporting data from server logs where each row captures a request. I do not have access to the original log data and must consume the BigQuery table.
I would like to develop a custom Java based google Dataflow template using beam api with the goals of :
collating keyed session objects
deriving session level metrics
deriving filterable window level metrics based on session metrics, e.g., percentage of sessions with errors during previous window and percentage of errors per filtered property, e.g., error percentage per device type
writing the result as a formatted/compressed report to cloud storage.
This seems like a fairly standard use case? In my research thus far, I have not yet found a perfect example and still have not been able to determine the best practice approach for certain basic requirements. I would very much appreciate any pointers. Keywords to research? Documentation, tutorials. Is my current thinking right or do I need to consider other approaches?
Questions :
beam windowing and BigQuery I/O Connector - I see that I can specify a window type and size via beam api. My BQ table has a timestamp field per row. Am I supposed to somehow pass this via configuration or is it supposed to be automagic? Do I need to do this manually via a SQL query somehow? This is not clear to me.
fixed time windowing vs. session windowing functions - examples are basic and do not address any edge cases. My sessions can last hours. There are potentially 100ks plus session keys per window. Would session windowing support this?
BigQuery vs. BigQueryClientStorage - The difference is not clear to me. I understand that BQCS provides a performance benefit, but do I have to store BQ data in a preliminary step to use this? Or can I simply query my table directly via BQCS and it takes care of that for me?
For number 1 you can simply use a withTimestamps function before applying windowing, this assigns the timestamp to your items. Here are some python examples.
For number 2 the documentation states:
Session windowing applies on a per-key basis and is useful for data that is irregularly distributed with respect to time. [...] If data arrives after the minimum specified gap duration time, this initiates the start of a new window.
Also in the java documentation, you can only specify a minimum gap duration, but not a maximum. This means that session windowing can easily support hour-lasting sessions. After all, the only thing it does is putting a watermark on your data and keeping it alive.
For number 3, the differences between the BigQuery IO Connector and the BigQuery storage APIs is that the latter (an experimental feature as of 01/2020) access directly data stored, without the logical passage through BigQuery (BigQuery data isn't stored in BigQuery). This means that with storage APIs, the documentation states:
you can't use it to read data sources such as federated tables and logical views
Also, there are different limits and quotas between the two methods, that you can find in the documentation link above.

De-duplicating BigQuery in an Asynchronous Real Time ETL Pipeline

Our Data Warehouse team is evaluating BigQuery as a Data Warehouse column store solution and had some questions regarding its features and best use. Our existing etl pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us to on occasion replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real time streaming insert api with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in/near real time to avoid data discrepancies.
We had a couple questions and would appreciate answers to any of them. Any additional advice on using BigQuery in ETL architecture is also appreciated.
Is there a common implementation for de-duplication of real time
streaming beyond the use of the tableId?
If we attempt a delsert (via an delete followed by an insert using
the BigQuery API) will the delete always precede the insert, or do
the operations arrive asynchronously?
Is it possible to implement real time streaming into a staging
environment, followed by a scheduled merge into the destination
table? This is a common solution for other column store etl
technologies but we have seen no documentation suggesting its use in
BigQuery.
We let duplication happen, and write our logic and queries in a such way that every entity is a streamed data. Eg: a user profile is a streamed data, so there are many rows placed in time and when we need to pick the last data, we use the most recent row.
Delsert is not suitable in my opinion as you are limited to 96 DML statements per day per table. So this means you need to temp store in a table batches, for later to issue a single DML statement that deals with a batch of rows, and updates a live table from the temp table.
If you consider delsert, maybe it's easier to consider writing a query to only read most recent row.
Streaming followed by scheduled merge is possible. Actually you can rewrite some data in the same table, eg: removing dups. Or scheduled query batch content from temp table and write to live table. This is somehow the same as let duplicate happening and later deal within a query with it, also called re-materialization if you write to the same table.

BigQuery table design for immutable data

Background
We're probably going to use BigQuery to store our immutable business events so that we can replay them later to other services. I'm thinking that one approach would be to essentially just store each event as a blob (with some metadata). In order to replay them easily it would of course be nice to maintain a global order of our events and just persist each event to the same table in BigQuery. We probably have something like 10 events per second (which is nowhere near the limit of 100000 messages per second).
Question
Would it be ok to simply persist all events in the same table?
Would it perhaps be better to shard messages in different tables (perhaps based on event type, topic or date)?
If (2), is it possible to join/scan through multiple tables sorted by time so that it's possible to replay events in the same order?
If you primary usage scenario to store events and then reply them - there is no reason to split different event types into different tables. Especially since each event is an opaque blob. Keeping them all in the same table will have small benefit of you being able to do analysis by types of events and other metadata.
Sharding by days makes sense, especially if you will be looking at the most recent data - this will help you to keep the BigQuery query costs down.
But I was worried about your requirement of replying events in order. There is no clustered index in BigQuery, so every time you will need to reply your events, you will have to use "ORDER BY timestamp" in your query, and it can scale only to relatively small amount of data (tens of megabytes). So you will want to replay a lot of events - this design won't work for you.
i prefer create table based on event type and store the time in event table,you can join tables using relationship(use primary,foreign key).Since its storedon time basis you can replay as well.
Points you must remember:
Immutable business events will give you concurrency,Once an event
has been accepted and committed, it becomes an unalterable,it can be
copied everywhere.
The only way to “undo” an event is to add a compensating event on
top like a negative transaction in accounting.
Hope its useful to you.

What is a recommended scalable DB platform to use in AWS for large amounts of volatile data sets - elasticsearch, Redis or DynamoDB?

Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free text searching for certain fields of that data using elasticsearch may be a better option to use from the get-go.
I want this to be auto-scaling but not sure which database would be best to use for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not take Elastic-search alone because it does not provide auto-scaling for writing capacity. In fact, it's not trivial to augment the number of shard of an index. Secondly it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast, everything is done in RAM, and it provides keys with a limited time-to-live which could be interesting for you. Unfortunately, if your data size exceeds the capacity in RAM of your amazon instance you will have to shard your Redis database. And Redis does not support it, you will have to deal it on your application code. Moreover, as far as I know Redis does not handle complex queries. You will also need to save your data in a Redis data structure which could be an issue for you
DynamoDB handles auto-scaling really well but on the other hand it is a key/value database so it does not allow you to make queries like "get all data for user X between a date range". DynamoDB also allows you to save your data in any format.
The solution will be to use either DynamoDB or either Redis depending of the size of your datas, and to use ElasticSearch in order to index your key with only the meta-data (user and dates). Like that your index will be small, and if you lost the ability to index because of ElasticSearch get too buzy, you keep the ability to save user's datas.