In Google BigQuery, do NULL fields take up space?

I'm working on designing an event schema to be put into Google BigQuery. The current design is such that many of the fields will often be NULL, e.g. an event from a mobile application won't have URL or browser information, while an event from a website won't have hardware specs. Additionally, a lot of the information in the current schema is fairly static and wouldn't need to be included with every event.
If fields in events are left as NULL, will they still take up space within the table? I'm wondering if it's just better to break up the events somehow. Are there best practices on storing what would otherwise be duplicate information?

From BigQuery's pricing page:
Null values for any data type are calculated as 0 bytes.
So no, they do not take up space from a byte-count or pricing perspective.
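As a rough worked example (the table and column names below are made up), the logical size of a row is just the sum of its non-NULL values, using the per-type sizes from that same pricing page (8 bytes for an INT64 or TIMESTAMP, 2 bytes plus the UTF-8 length for a STRING, 0 bytes for any NULL):

    -- Hypothetical event table; every column is NULLABLE by default.
    CREATE TABLE mydataset.events (
      event_id      STRING,     -- 2 bytes + UTF-8 length when set, 0 bytes when NULL
      event_ts      TIMESTAMP,  -- 8 bytes when set, 0 bytes when NULL
      url           STRING,     -- NULL for mobile events -> contributes 0 bytes
      browser       STRING,     -- NULL for mobile events -> contributes 0 bytes
      device_ram_mb INT64       -- NULL for web events    -> contributes 0 bytes
    );
    -- A mobile event ('abc123', 2016-01-01 00:00:00, NULL, NULL, 2048)
    -- costs (2+6) + 8 + 0 + 0 + 8 = 24 bytes, not the full width of the row.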

NULLs will not take up any space within the table.

Related

Suitable Google Cloud data storage option for raw JSON events with auto-incrementing id

I'm looking for an appropriate Google data/storage option to use as a location to stream raw JSON events into.
The events are generated by users in response to very large email broadcasts, so throughput could be very low one moment and up to ~25,000 events per second for short periods of time. The JSON representation of each event will probably be only around 1 KB.
I want to simply store these events as raw and unprocessed JSON strings, append-only, with a separate sequential numeric identifier for each record inserted. I'm planning to use this identifier as a way for consuming apps to be able to work through the stream sequentially (in a similar manner to the way Kafka consumers track their offset through the stream) - this will allow me to replay the event stream from points of my choosing.
I am taking advantage of Google Cloud Logging to aggregate the event stream from Compute Engine nodes, from here I can stream directly into a BigQuery table or Pub/Sub topic.
BigQuery seems more than capable of handling the streaming inserts; however, it has no concept of auto-incrementing id columns, and its documentation suggests that its query model is best suited for aggregate queries rather than narrow result sets. My requirement to query for the next highest row would clearly go against this.
The best idea I currently have is to push into Pub/Sub and have it write each event into a Cloud SQL database. That way Pub/Sub could buffer the events if Cloud SQL is unable to keep up.
My desire for an auto-generated identifier and possibly a datestamp column makes this feel like a 'tabular' use case, so I suspect the NoSQL options might also be inappropriate.
If anybody has a better suggestion I would love to get some input.
We know that many customers have had success using BigQuery for this purpose, but it requires some work to choose appropriate identifiers if you want to supply your own. It's not clear to me from your example why you couldn't just use a timestamp as the identifier and use the ingestion-time partitioned table streaming ingestion option:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#streaming_into_ingestion-time_partitioned_tables
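For example (table, column names, and timestamps below are purely illustrative), a consumer could replay a slice of the stream in order with something like:

    -- Assumes events are streamed into an ingestion-time partitioned table
    -- mydataset.events with columns (event_ts TIMESTAMP, payload STRING).
    SELECT event_ts, payload
    FROM mydataset.events
    WHERE _PARTITIONTIME = TIMESTAMP('2016-06-01')       -- prune to one day's partition
      AND event_ts > TIMESTAMP('2016-06-01 08:30:00')    -- resume from the consumer's last offset
    ORDER BY event_ts;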
As for Cloud Bigtable, as Les noted in the comments:
Cloud Bigtable could definitely keep up, but isn't really designed for sequential adds with a sequential key as that creates hotspotting.
See the time-series schema design guide: https://cloud.google.com/bigtable/docs/schema-design-time-series#design_your_row_key_with_your_queries_in_mind
You could again use a timestamp as a key here, although you would want to do some work to, e.g., add a hash or other uniquifier to ensure that at your 25k writes/second peak you don't overwhelm a single node (we can generally handle about 10k row modifications per second per node, and if you just use lexicographically sequential IDs like an incrementing number, all your writes will go to the same server).
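Purely as an illustration of that key shape (written here as a BigQuery SQL expression; the salt size and field names are made up), a hash-prefixed row key might look like:

    -- A small salt bucket (0-7) derived from a hash of the event id spreads
    -- otherwise-sequential writes across tablets; the timestamp keeps keys
    -- roughly time-ordered within each bucket.
    SELECT CONCAT(
             CAST(MOD(ABS(FARM_FINGERPRINT(event_id)), 8) AS STRING), '#',
             FORMAT_TIMESTAMP('%Y%m%d%H%M%S', event_ts), '#',
             event_id) AS row_key
    FROM mydataset.events;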
At any rate it does seem like BigQuery is probably what you want to use. You could also refer to this blog post for an example of event tracking via BigQuery:
https://medium.com/streak-developer-blog/using-google-bigquery-for-event-tracking-23316e187cbd

Table with multiple foreign keys -- only one not null

I'm trying to design a system where an administrator will have to approve changes to the data and other various administrative tasks -- add a user, add an admin etc.
My idea is to have a notification table that contains these notifications, but the problem is that a notification can be any of the previously mentioned types, i.e. its data is stored in one of many tables. Here is a picture describing my current plan -- note that I'm sure it's not a proper ER diagram.
Also, the data goes into a pending table that reflects the table it will eventually wind up in, provided the data is approved -- it's a staging ground of sorts. So a pending_user is a user that is not yet in the user table. And as you can see, the user table, amongst others, is not shown here, but one can use their imagination.
I'm concerned that the multiple null values in the pending table will have adverse effects that I'm not totally aware of, such as increased space usage and possibly increased query time. Also, I'm not sure how I'll implement the retrieval of these notifications. My naive approach is to select the first X notifications, analyze the rows to find the non-null column, retrieve the appropriate data, and then load it all into a response.
Is there a more straightforward pattern for this type of problem?
Thanks in advance for any help.
I think the traditional way is to provide various levels of access/read/write rights to users. These access rights define which actions a user can and can't perform. In this traditional approach, if a user has access to a certain function, they can perform it without further approval.
Also, traditionally there is some kind of audit log that contains a trace of all important changes to the data. With such a log it is possible to know who made a change (and when).
If you need to build a two-stage system, where a change has to go through an approval, I'd add a flag column to each important table that would indicate that values in the given row are not final and have to be approved. The table would store all historical changes to the data and with the help of this flag the system would know which variant is the latest approved version and which variant is pending and waiting for approval.
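A minimal sketch of that idea, in Postgres-flavoured SQL with hypothetical table and column names:

    -- Every change to a "user" is a new row; nothing is overwritten.
    CREATE TABLE app_user (
      id          serial PRIMARY KEY,
      user_id     integer     NOT NULL,                   -- the logical user this row is a version of
      name        text        NOT NULL,
      email       text        NOT NULL,
      status      text        NOT NULL DEFAULT 'pending', -- 'pending' or 'approved'
      created_at  timestamptz NOT NULL DEFAULT now()
    );

    -- Notifications for the administrator: everything still pending.
    SELECT * FROM app_user WHERE status = 'pending' ORDER BY created_at;

    -- Latest approved version of each user.
    SELECT DISTINCT ON (user_id) *
    FROM app_user
    WHERE status = 'approved'
    ORDER BY user_id, created_at DESC;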
I would not try to make a single universal table that holds data related to changes in many different tables. Each table is different, and the approval process for each is likely to be different too. I doubt that you'll have more than a dozen entities that are important enough to go through this approval process.

BigQuery table design for immutable data

Background
We're probably going to use BigQuery to store our immutable business events so that we can replay them later to other services. I'm thinking that one approach would be to essentially just store each event as a blob (with some metadata). In order to replay them easily it would of course be nice to maintain a global order of our events and just persist each event to the same table in BigQuery. We probably have something like 10 events per second (which is nowhere near the limit of 100000 messages per second).
Question
1. Would it be ok to simply persist all events in the same table?
2. Would it perhaps be better to shard messages into different tables (perhaps based on event type, topic or date)?
3. If (2), is it possible to join/scan through multiple tables sorted by time so that it's possible to replay events in the same order?
If your primary usage scenario is to store events and then replay them, there is no reason to split different event types into different tables, especially since each event is an opaque blob. Keeping them all in the same table will have the small benefit of letting you analyze by event type and other metadata.
Sharding by days makes sense, especially if you will be looking at the most recent data - this will help you to keep the BigQuery query costs down.
But I am worried about your requirement of replaying events in order. There is no clustered index in BigQuery, so every time you need to replay your events you will have to use "ORDER BY timestamp" in your query, and that can scale only to a relatively small amount of data (tens of megabytes). So if you want to replay a lot of events, this design won't work for you.
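One way to keep each sort small is to replay in bounded time windows; a hedged sketch against a daily shard (the table name and window are illustrative):

    -- Replay one hour of a daily shard in order, then move to the next
    -- window/shard, so the ORDER BY never has to cover the whole table.
    SELECT event_ts, payload
    FROM mydataset.events_20160601
    WHERE event_ts >= TIMESTAMP('2016-06-01 08:00:00')
      AND event_ts <  TIMESTAMP('2016-06-01 09:00:00')
    ORDER BY event_ts;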
I prefer to create a table per event type and store the time in each event table; you can join the tables using relationships (primary and foreign keys). Since the data is stored on a time basis, you can replay it as well.
Points you must remember:
- Immutable business events give you concurrency. Once an event has been accepted and committed, it becomes unalterable and can be copied everywhere.
- The only way to "undo" an event is to add a compensating event on top, like a negative transaction in accounting (see the sketch below).
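Purely illustrative (the table and column names are made up): a compensating event is just another append, for example:

    -- "Undo" a payment by appending a reversing event, never by
    -- updating or deleting the original row.
    INSERT INTO payment_events (event_id, order_id, event_type, amount, event_ts)
    VALUES ('evt-124', 'ord-42', 'payment_reversed', -19.99, CURRENT_TIMESTAMP);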
Hope it's useful to you.

Postgres SQL: Best way to check for new data in a database I don't control

For an application I am writing, I need to be able to identify when new data is inserted into several tables of a database.
The problem is twofold: this data will be inserted many times per minute into sometimes very large databases (and I need to be sensitive to demand / database polling issues), and I have no control over the application creating this data (so, as far as I know, I can't use the notify/listen functionality available within Postgres for exactly this kind of task*).
Any suggestion regarding a good strategy would be much appreciated.
*I believe the application controlling this data uses the notify/listen functionality itself, but I haven't a clue how (if it's possible at all) to find out which "channel" it uses externally, or whether I could ever latch on to it.
Generally, you need something in the table that you can use to determine newness, and there are a few approaches.
A timestamp column would let you use the date but you'd still have the application issue of storing a date outside of your database, and data that isn't in the database means another realm of data to manage. Yuck.
A tracking table that stored last update/insert timestamps on a per-table basis could give you what you want. You'd want to use a trigger to maintain the last-DML timestamp.
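A rough sketch of that trigger idea in Postgres (table names are hypothetical; the upsert needs Postgres 9.5+ for ON CONFLICT):

    -- Tracking table: one row per watched table, holding its last insert/update time.
    CREATE TABLE table_activity (
      table_name  text PRIMARY KEY,
      last_dml_at timestamptz NOT NULL
    );

    -- Trigger function: upsert the timestamp for whichever table fired the trigger.
    CREATE OR REPLACE FUNCTION record_table_activity() RETURNS trigger AS $$
    BEGIN
      INSERT INTO table_activity (table_name, last_dml_at)
      VALUES (TG_TABLE_NAME, now())
      ON CONFLICT (table_name) DO UPDATE SET last_dml_at = now();
      RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    -- Attach to each table of interest (a hypothetical "orders" table here).
    CREATE TRIGGER orders_activity
    AFTER INSERT OR UPDATE ON orders
    FOR EACH STATEMENT EXECUTE PROCEDURE record_table_activity();

Your application then only has to poll the small table_activity table, which stays cheap even when the source tables are very large.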
A solution you don't want to use is a serial (integer) id that comes from nextval, for any purpose other than uniqueness. The standard/common mistake is to presume serial keys will be contiguous (they're not) or monotonic (they're not).

Observing social web behavior: to log or populate databases?

When considering social web app architecture, is it a better approach to document user social patterns in a database or in logs? I thought for sure that behavior, actions, events would be strictly database stored but I noticed that some of the larger social sites out there also track a lot by logging what happens.
Is it good practice to store prominent data about users in a database, and, since thousands of user actions can be spawned easily, should those actions simply be logged?
Remember that Facebook, for example, doesn't update users' information per se; they just insert your new information and use the most recent version, keeping the old one. If you plan to take this approach, it is HIGHLY recommended, if not mandatory, to use a NoSQL DB like Cassandra; you'll need speed over integrity.
Information = money. Update = lose information = lose money.
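To make the insert-only idea concrete, here is a minimal sketch in Postgres-flavoured SQL (table and column names are made up; the same append-only pattern carries over to a store like Cassandra):

    -- Every profile change is a new row; reads pick the most recent one.
    CREATE TABLE user_profile_history (
      user_id     integer     NOT NULL,
      email       text        NOT NULL,
      city        text,
      recorded_at timestamptz NOT NULL DEFAULT now()
    );

    -- "Current" profile = latest row per user; old rows are kept as history.
    SELECT DISTINCT ON (user_id) user_id, email, city, recorded_at
    FROM user_profile_history
    ORDER BY user_id, recorded_at DESC;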
Obviously, it depends on what you want to do with it (and what you mean by "logging").
I'd recommend a flexible database storage. That way you can query it reasonably easily, and also make it flexible to changes later on.
Also, from a privacy point of view, it's appropriate to be able to easily associate items with certain entities so they can be removed, if so requested.
You're making an artificial distinction between "logging" and "database".
Whenever practical, I log to a database, even though this data will effectively be static and never updated. This is because the data analysis is much easier if you can cross-reference the log table with other, non-static data.
Of course, if you have a high volume of things to track, logging to a SQL data table may not be practical, but in that case you should probably be considering some other kind of database for the application.
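For example (hypothetical table and column names), this is the kind of cross-reference that becomes trivial once the log lives in the database:

    -- Join the event log against regular, non-static data -- something a
    -- flat log file can't do without an export/import step.
    SELECT u.country, e.event_type, count(*) AS event_count
    FROM event_log e
    JOIN users u ON u.id = e.user_id
    GROUP BY u.country, e.event_type
    ORDER BY event_count DESC;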