BigQuery table design for immutable data - google-bigquery

Background
We're probably going to use BigQuery to store our immutable business events so that we can replay them later to other services. I'm thinking that one approach would be to essentially just store each event as a blob (with some metadata). In order to replay them easily it would of course be nice to maintain a global order of our events and just persist each event to the same table in BigQuery. We probably have something like 10 events per second (which is nowhere near the limit of 100000 messages per second).
Question
1. Would it be ok to simply persist all events in the same table?
2. Would it perhaps be better to shard messages in different tables (perhaps based on event type, topic or date)?
3. If (2), is it possible to join/scan through multiple tables sorted by time so that it's possible to replay events in the same order?

If your primary usage scenario is to store events and then replay them, there is no reason to split different event types into different tables, especially since each event is an opaque blob. Keeping them all in the same table has the small benefit of letting you analyze events by type and other metadata.
Sharding by days makes sense, especially if you will be looking at the most recent data - this will help you to keep the BigQuery query costs down.
But I am worried about your requirement of replaying events in order. There is no clustered index in BigQuery, so every time you need to replay your events you will have to use "ORDER BY timestamp" in your query, and that scales only to a relatively small amount of data (tens of megabytes). So if you want to replay a lot of events, this design won't work for you.
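One way to reconcile daily sharding with the ordering concern above is to replay one shard at a time, so that only a single day of events is ever sorted at once. The following is a minimal in-memory sketch of that idea, not something the answer itself proposes; the event shape `(timestamp, payload)` and the `YYYYMMDD` shard suffixes are assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical event shape: (timestamp, opaque payload).
Event = Tuple[float, str]


def replay_by_day(shards: Dict[str, List[Event]]) -> Iterator[Event]:
    """Yield events in global time order, one daily shard at a time."""
    # Suffixes such as "20240101" sort chronologically as plain strings.
    for suffix in sorted(shards):
        # Only one day's worth of events is sorted at any moment.
        for event in sorted(shards[suffix], key=lambda e: e[0]):
            yield event
```

This only helps if each daily shard is small enough to sort on its own; a single very large day would hit the same limit.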

I prefer creating a table per event type and storing the time in each event table; you can join the tables using relationships (primary and foreign keys). Since the events are stored on a time basis, you can replay them as well.
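If events do live in one table per type, replaying them in a single global order amounts to a k-way merge of per-table streams that are each already sorted by time. A rough sketch using the standard library follows; the event tuple layout is hypothetical, and reading the rows out of each table in time order is assumed to happen elsewhere.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp, event_type, payload).
Event = Tuple[float, str, str]


def merge_event_tables(*tables: Iterable[Event]) -> Iterator[Event]:
    """Merge per-type event streams, each already sorted by timestamp."""
    # heapq.merge is lazy: it holds only one pending event per input stream.
    return heapq.merge(*tables, key=lambda e: e[0])
```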
Points you must remember:
Immutable business events give you concurrency: once an event has been accepted and committed, it becomes unalterable and can be copied everywhere.
The only way to “undo” an event is to add a compensating event on top, like a negative transaction in accounting.
Hope it's useful to you.

Related

De-duplicating BigQuery in an Asynchronous Real Time ETL Pipeline

Our Data Warehouse team is evaluating BigQuery as a Data Warehouse column store solution and had some questions regarding its features and best use. Our existing etl pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us to on occasion replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real time streaming insert api with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in/near real time to avoid data discrepancies.
We had a couple questions and would appreciate answers to any of them. Any additional advice on using BigQuery in ETL architecture is also appreciated.
Is there a common implementation for de-duplication of real time streaming beyond the use of the tableId?
If we attempt a delsert (via a delete followed by an insert using the BigQuery API) will the delete always precede the insert, or do the operations arrive asynchronously?
Is it possible to implement real time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column store etl technologies but we have seen no documentation suggesting its use in BigQuery.
We let duplication happen, and write our logic and queries in such a way that every entity is treated as streamed data. For example, a user profile is streamed data, so there are many rows for it over time, and when we need the latest state we use the most recent row.
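The "use the most recent row" rule can be illustrated in plain Python. This is only a sketch of the logic; in practice it would be expressed in the query itself, and the field names `id` and `ts` are assumptions.

```python
from typing import Dict, Iterable, List


def latest_per_entity(rows: Iterable[dict], key: str = "id", ts: str = "ts") -> List[dict]:
    """Keep only the most recent row for each entity key."""
    latest: Dict[object, dict] = {}
    for row in rows:
        current = latest.get(row[key])
        # On equal timestamps the later-seen row wins (an arbitrary tie-break).
        if current is None or row[ts] >= current[ts]:
            latest[row[key]] = row
    return list(latest.values())
```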
Delsert is not suitable in my opinion, as you are limited to 96 DML statements per day per table. This means you need to store batches of rows in a temporary table and later issue a single DML statement that handles the whole batch and updates the live table from the temp table.
If you are considering delsert, it may be easier to write a query that reads only the most recent row.
Streaming followed by a scheduled merge is possible. You can actually rewrite some data in the same table, e.g. to remove dups. Or a scheduled query can batch content from the temp table and write it to the live table. This is much the same as letting duplicates happen and dealing with them later in a query; it is also called re-materialization if you write to the same table.
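As a rough illustration of such a de-duplicating query, the helper below builds a BigQuery standard SQL statement that keeps one row per key, the most recent one. The table name and the `event_id` and `ts` column names are hypothetical; the result would be written to the destination table by whatever scheduling mechanism you use.

```python
def build_dedup_query(table: str, key: str = "event_id", ts: str = "ts") -> str:
    """Return SQL that selects the most recent row per key from `table`."""
    # Identifiers are interpolated directly, so pass only trusted names.
    return (
        "SELECT * EXCEPT(rn)\n"
        "FROM (\n"
        f"  SELECT *, ROW_NUMBER() OVER (PARTITION BY {key} ORDER BY {ts} DESC) AS rn\n"
        f"  FROM `{table}`\n"
        ")\n"
        "WHERE rn = 1"
    )


# Hypothetical staging table name, for illustration only.
print(build_dedup_query("my_project.my_dataset.events_staging"))
```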

Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id

I'm looking for an appropriate google data/storage option to use as a location to stream raw, JSON events into.
The events are generated by users in response to very large email broadcasts so throughput could be very low one moment and up to ~25,000 events per second for short periods of time. The JSON representation for these events will probably only be around 1 KB each.
I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes, from here I can stream directly into a BigQuery table or Pub/Sub topic.
BigQuery seems more than capable of handling the streaming inserts, however it seems to have no concept of auto-incrementing id columns and also suggests that its query model is best-suited for aggregate queries rather than narrow-result sets. My requirement to query for the next highest row would clearly go against this.
The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up.
My desire for an auto-identifier and possibly a datestamp column makes this feel like a 'tabular' use-case and therefore I'm feeling the NoSQL options might also be inappropriate.
If anybody has a better suggestion I would love to get some input.
We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose the appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option?
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
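To make the "timestamp as identifier" suggestion concrete, here is a small sketch of how a consumer could track its position, in the spirit of the Kafka-style offset mentioned in the question. It assumes each event carries a timestamp `ts` plus a unique `id` used to break ties; both field names are hypothetical, and the in-memory list stands in for a query result.

```python
from typing import List, Optional, Tuple

# Cursor = (timestamp, unique event id); the id breaks ties between equal timestamps.
Cursor = Tuple[int, str]


def next_batch(
    events: List[dict], after: Optional[Cursor], limit: int = 100
) -> Tuple[List[dict], Optional[Cursor]]:
    """Return up to `limit` events strictly after the cursor, plus the new cursor."""
    ordered = sorted(events, key=lambda e: (e["ts"], e["id"]))
    if after is not None:
        ordered = [e for e in ordered if (e["ts"], e["id"]) > after]
    batch = ordered[:limit]
    # If nothing new was found, the cursor stays where it was.
    new_cursor = (batch[-1]["ts"], batch[-1]["id"]) if batch else after
    return batch, new_cursor
```

A consumer would persist the returned cursor and pass it back on the next call; replaying from an earlier point just means starting from an older cursor.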
As far as Cloud Bigtable goes, as noted by Les in the comments:
Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.
See: https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind
You could again use a timestamp as a key here, although you would want to do some work, e.g. add a hash or other uniquifier, to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number, all your writes would be going to the same server).
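One possible shape for such a key is a hash-derived bucket prefix followed by the timestamp and a unique event id. The sketch below is only an illustration of that idea: the bucket count, the `#` separator and the field widths are all assumptions, not a recommended Bigtable schema.

```python
import hashlib


def salted_row_key(event_id: str, timestamp_ms: int, buckets: int = 16) -> str:
    """Build a row key whose prefix spreads writes across `buckets` key ranges."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Derive a stable bucket from the event id so the same event always maps
    # to the same key (which keeps retried writes idempotent).
    bucket = int.from_bytes(digest[:4], "big") % buckets
    # Zero-padding keeps keys within a bucket in chronological order.
    return f"{bucket:02d}#{timestamp_ms:013d}#{event_id}"
```

The trade-off is on the read side: a time-range scan now needs one scan per bucket, with the results merged by timestamp.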
At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery:
https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

Stream data into rotating log tables in BigQuery

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
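For illustration, a streaming request that uses this field might be assembled as below. The `templateSuffix`, `rows`, `insertId` and `json` names follow the insertAll request format; the daily `_YYYYMMDD` suffix and the `event_id` column are assumptions, and the sketch only builds the request body without sending it.

```python
import datetime as dt
from typing import Dict, List


def build_insert_all_body(rows: List[Dict], day: dt.date) -> Dict:
    """Build an insertAll request body targeting a per-day template table."""
    return {
        # Assumed convention: one generated table per day, e.g. base_20240115.
        "templateSuffix": day.strftime("_%Y%m%d"),
        "rows": [
            # insertId is used for best-effort de-duplication of retried rows.
            {"insertId": row["event_id"], "json": row}
            for row in rows
        ],
    }
```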
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables in advance for 1 year, or on some projects for 5 years. You may wonder why we do so, and when.
We do this when the schema is changed by us, the developers. We do a deploy and run a script that takes care of the schema changes for old/existing tables; the script also deletes all the empty future tables and simply recreates them. We didn't complicate our life with a cron, as we know the exact moment the schema changes (the deploy), and there is no disadvantage to creating tables in advance for such a long period. On SaaS-based systems we also do this per tenant, when a user is created or closes their account.
This way we don't need a cron; we just need to know that the deploy must do this additional step when the schema changes.
As for not losing streaming inserts while you do maintenance on your tables, you need to address that in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd, that queues all the rows into a tube, with a worker later pushing them to BigQuery. You may already have this to cover the case where the BigQuery API responds with an error and you need to retry. It's easy to do with a simple message queue. So you would rely on this retry phase when you stop or rename a table for a while. The streaming insert will fail, most probably because the table is not ready for streaming inserts, e.g. it has been temporarily renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not only helps with retrying failed BigQuery calls but also allows you to have a maintenance window.
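The retry phase described above could look roughly like this. It is a sketch with an in-process `deque` standing in for a real message queue such as Beanstalkd; the `insert` callable is a hypothetical wrapper around the streaming insert that returns `True` on success.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

# Each queue item is (row payload, number of failed attempts so far).
QueueItem = Tuple[Dict, int]


def drain(
    queue: Deque[QueueItem],
    insert: Callable[[Dict], bool],
    max_attempts: int = 3,
) -> List[Dict]:
    """Push queued rows via `insert`, re-queueing failures up to max_attempts."""
    dead: List[Dict] = []
    while queue:
        payload, attempts = queue.popleft()
        if insert(payload):
            continue
        if attempts + 1 < max_attempts:
            # Put the row back at the end of the queue for another try.
            queue.append((payload, attempts + 1))
        else:
            # Give up on this row; the caller decides what to do with it.
            dead.append(payload)
    return dead
```

A real worker would also wait between attempts (for example with a growing delay), so that rows queued during a maintenance window are not exhausted immediately.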
You've already solved it by partitioning. If table creation is an issue, have an hourly cron in App Engine that verifies that today's and tomorrow's tables are always created.
Very likely App Engine won't go over the free quotas, and it has a 99.95% SLO for uptime. The cron will never go down.
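The check such a cron performs is small. A sketch, assuming daily tables named with a hypothetical `events_` prefix and a `YYYYMMDD` suffix, and leaving the listing and creation of tables to the caller:

```python
import datetime as dt
from typing import List, Set


def missing_daily_tables(
    existing: Set[str], today: dt.date, prefix: str = "events_"
) -> List[str]:
    """Return the names of today's and tomorrow's tables that do not exist yet."""
    wanted = [
        prefix + (today + dt.timedelta(days=offset)).strftime("%Y%m%d")
        for offset in (0, 1)
    ]
    return [name for name in wanted if name not in existing]
```

Because the job runs hourly and is idempotent, a single misfire leaves most of a day to recover before tomorrow's table is needed.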

Multiple sorted sets in Redis

There are many accounts, which get events (data points with timestamps) stored in realtime. I discovered that it is a good idea to store events using a sorted set. I tried to store events for multiple accounts in one sorted set, but couldn't figure out how to filter events by account id.
Is it a good idea to create multiple sorted sets for each account (> 1000 accounts)?
Questions:
How long will you keep these events in memory?
Your number of accounts won't grow?
Are you sure you will have enough memory?
... but yes, you should definitely create a sorted set for each account; that's the state of the art when using Redis.
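A sketch of the one-sorted-set-per-account layout follows, with the timestamp as the score. The functions expect a client exposing `zadd(key, mapping)` and `zrangebyscore(key, min, max)` in the style of redis-py; the `events:<account_id>` key scheme and the JSON encoding of members are assumptions.

```python
import json
from typing import Any, Dict, List


def account_key(account_id: Any) -> str:
    """One sorted set per account, e.g. 'events:42' (hypothetical key scheme)."""
    return f"events:{account_id}"


def add_event(client: Any, account_id: Any, timestamp: float, event: Dict) -> None:
    # The timestamp is the score, so the set stays ordered by time.
    member = json.dumps(event, sort_keys=True)
    client.zadd(account_key(account_id), {member: timestamp})


def events_between(client: Any, account_id: Any, start: float, end: float) -> List[Dict]:
    """Return the account's events with start <= timestamp <= end, oldest first."""
    raw = client.zrangebyscore(account_key(account_id), start, end)
    return [json.loads(member) for member in raw]
```

Sorted-set members are unique, so two identical payloads would collapse into one entry; including a unique id in each event avoids that.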
However, if it's all about real-time events (storing and retrieval) you may want to give a try to a database like InfluxDB that provides a powerful SQL-like query system. It seems a better answer to your problem.

Postgres SQL: Best way to check for new data in a database I don't control

For an application I am writing, I need to be able to identify when new data is inserted into several tables of a database.
The problem is twofold: this data will be inserted many times per minute into sometimes very large databases (and I need to be sensitive to demand / database polling issues), and I have no control over the application creating this data (so as far as I know, I can't use the notify / listen functionality available within postgres for exactly this kind of task*).
Any suggestion regarding a good strategy would be much appreciated.
*I believe the application controlling this data is using the notify / listen functionality itself, but I haven't a clue how (if it is possible at all) to find out externally which "channel" it uses, or whether I would ever be able to latch on to it.
Generally, you need something in the table that you can use to determine newness, and there are a few approaches.
A timestamp column would let you use the date but you'd still have the application issue of storing a date outside of your database, and data that isn't in the database means another realm of data to manage. Yuck.
A tracking table that stored last update/insert timestamps on a per-table basis could give you what you want. You'd want to use a trigger to maintain the last-DML timestamp.
A solution you don't want to use is a serial (integer) id that comes from nextval, for any purpose other than uniqueness. The standard/common mistake is to presume serial keys will be contiguous (they're not) or monotonic (they're not).
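The timestamp approach can be sketched as a poll that remembers the newest timestamp it has seen, which is exactly the piece of state the answer warns you end up keeping outside the database. The example uses `sqlite3` purely as a runnable stand-in for Postgres; the `readings` table and `created_at` column are hypothetical.

```python
import sqlite3
from typing import List, Optional, Tuple


def fetch_new_rows(
    conn: sqlite3.Connection, last_seen: Optional[str]
) -> Tuple[List[tuple], Optional[str]]:
    """Return rows newer than `last_seen` and the updated high-water mark."""
    rows = conn.execute(
        "SELECT id, created_at FROM readings "
        "WHERE created_at > ? ORDER BY created_at",
        (last_seen or "",),  # an empty string sorts before any ISO timestamp
    ).fetchall()
    # Keep the old mark when nothing new arrived.
    new_mark = rows[-1][1] if rows else last_seen
    return rows, new_mark
```

Note that a row committed later with a timestamp equal to or older than the current mark would be missed, so polls may need a small overlap window plus de-duplication on the reader's side.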