Most efficient way to clean data in BigQuery - google-bigquery

I need some help cleaning my data...
I have a BigQuery table that receives new entries from my back-end; the data is recorded in BigQuery and I'm using Google Data Studio to present it.
My problem is that I have a field named session that is sometimes duplicated. I can't solve that directly in my back-end, because a user can send different data from the same session, so I can't simply stop recording duplicates.
I've worked around this by creating a view that selects the newest of the duplicate records, and I'm using this view as the data source for my report. The problem with this approach is that I lose the "real-time report" aspect, which is important in this case. Another problem is that I also lose "Accelerated by BigQuery BI Engine", and I would like to keep that feature too.
Is this the best solution to my problem, so that I'll have to accept this outcome, or is there another way?
Many thanks in advance, kind regards.

Using the view should work for BI Engine acceleration. Can you please share more details on BI Engine? It should show you the reason the query wasn't accelerated, likely mentioning one of the limitations. If you hover over the "not accelerated" sign it should give you more details on why your query wasn't supported. Feel free to share it here and I will be happy to help.
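For reference, a dedup view of the kind you describe might look something like this (a sketch in standard SQL; project.dataset.sessions, session_id and created_at are hypothetical names to be replaced with your real table and columns):

    -- Hypothetical names: swap in your real table and columns.
    CREATE OR REPLACE VIEW `project.dataset.sessions_dedup` AS
    SELECT *
    FROM `project.dataset.sessions`
    WHERE TRUE  -- QUALIFY needs a WHERE/GROUP BY/HAVING alongside it
    -- keep only the newest row per session
    QUALIFY ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY created_at DESC) = 1;

A view like this is still evaluated against the live table on every query, so the report stays real-time; whether BI Engine accelerates it depends on the limitations mentioned above.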
Another way you can clean up the data: have a scheduled job preprocess it. That means the data may not be the most recent, but it gives you the ability to clean up and aggregate the data.
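Such a scheduled job could, for example, maintain a pre-cleaned table that the report reads instead of the raw table. A rough sketch (again with hypothetical table and column names, assuming session_id, created_at and a payload column):

    -- Hypothetical scheduled query: upsert the newest row per session
    -- into a pre-cleaned table.
    MERGE `project.dataset.sessions_clean` t
    USING (
      SELECT session_id, created_at, payload
      FROM `project.dataset.sessions`
      WHERE TRUE
      QUALIFY ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY created_at DESC) = 1
    ) s
    ON t.session_id = s.session_id
    WHEN MATCHED AND s.created_at > t.created_at THEN
      UPDATE SET created_at = s.created_at, payload = s.payload
    WHEN NOT MATCHED THEN
      INSERT (session_id, created_at, payload)
      VALUES (s.session_id, s.created_at, s.payload);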

Related

BigQuery handling dependencies and retries in query scheduling

I'm looking for some best (simplest;)) practices here.
I have Google Analytics data that is sent to BigQuery on a daily basis. I have a query running daily that uses the data from the previous day's table.
However, I can't be sure this table and its data are there at the time the query runs, and I'd like to check whether they are. If they aren't, I want to retry later.
Ideally I have some monitoring/alerting around this as well.
Of course this can be done within Google Cloud in many ways; I'm looking for best practices on how others do this.
I'm used to working with Airflow, but using Composer just for this seems a bit over the top. Cloud Run would be an option and I'm sure there are others. I've also seen this question discussing how to handle a dependency in SQL; I'm just not sure whether I could have it retry using just SQL as well.
EDIT:
I've got the check for the table working in SQL. I guess I just have to see if BigQuery has a way to build in a delay, like 'WAITFOR'.
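As far as I know, BigQuery scripting has no WAITFOR-style delay, so the usual workaround is to make the scheduled query fail loudly when the table is missing and rely on the scheduler's retries and alerting. One way such a check could look (a sketch with a hypothetical project/dataset, assuming the standard ga_sessions_YYYYMMDD export tables):

    -- Fail fast if yesterday's Google Analytics export has not arrived yet.
    DECLARE shard STRING DEFAULT FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));

    IF NOT EXISTS (
      SELECT 1
      FROM `my-project.analytics_dataset.INFORMATION_SCHEMA.TABLES`
      WHERE table_name = CONCAT('ga_sessions_', shard)
    ) THEN
      SELECT ERROR(CONCAT('ga_sessions_', shard, ' is not available yet'));
    END IF;

    -- ...the daily query over the previous day's table would follow here...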

How to pre-process BigQuery data coming from Stackdriver

I am currently exporting logs from Stackdriver to BigQuery using sinks, but I am only interested in the jsonPayload. I would like to ignore pretty much everything else.
But since the table creation and data insertion happen automatically, I could not do this.
Is there a way to preprocess data coming from sink to store only what matters?
If the answer is no, is there a way to run a cron job each day to copy yesterday's data into a separate table and then remove it? (Knowing that the tables are named using timestamps, which makes it possible to query them by day.)
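To be concrete, the daily copy I have in mind would be something like this scheduled query (hypothetical dataset and table names, and I'm assuming a destination table that already exists with a compatible jsonPayload column):

    -- Hypothetical names: project.logs_raw.my_log_* are the sink's day-sharded
    -- tables, project.logs_clean.payloads is the slimmed-down destination.
    INSERT INTO `project.logs_clean.payloads` (log_time, jsonPayload)
    SELECT timestamp, jsonPayload
    FROM `project.logs_raw.my_log_*`
    WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));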
As far as I know, both options mentioned are currently not possible on the GCP platform. On my end I've also tried to create an internal reproduction of your request and noticed that there isn't a way to filter on just the jsonPayload.
I would therefore suggest creating a feature request for this on the public issue tracker. Note that feature requests do not have an ETA as to when they'll be processed, or whether they'll be implemented at all.

BigQuery web UI is unresponsive & eventually crashes

When I click on "Details" to see a preview of the data in the table, the web UI locks up and I see the following errors. I've tried refreshing and restarting my browser, but it doesn't help.
Yes, we (the BQ team) are aware of performance issues when viewing very repeated rows in the BigQuery web UI. The public genomics tables are known to tickle these performance issues since individual rows of their table are highly repeated.
We're considering a few methods of fixing this, but the simplest would probably be to default to the JSON display of rows for problematic tables, and allow switching to the tabular view with a "View it at your own risk!"-style warning message.
It took a little time for me too, but it eventually (1 min 40 sec) loaded in the UI.
I think it is because of how table data is presented in the native BQ UI in Preview mode.
As you may have noticed, it is shown in a sort of hierarchical way.
I noticed this slowness for heavy tables (in terms of row size and/or hierarchical structure) when this was introduced. And by the way, only one row is shown for this particular table because of this.
Of course this is just my guess - it would be great to hear from the Google team!
Meanwhile, when I use an internal application that calls the same APIs to preview table data, I don't see any slowness at all (10 rows in 3 sec), which supports my guess above.

Best approach to follow in SSIS package

I am working on a simple transform SSIS package to import data from one server and load it into another. Only one table is used on each side.
I wanted to know: since this is just a refresh of the data, does the old data in the table need to be deleted before loading? I need expert advice on what I should do. Should I truncate the old table or use DELETE? What other concerns should I keep in mind?
Please give the justification for your answers; it will help me make the technical case to my lead.
It depends on what the requirements are.
Do you need to keep track of any changes to the data? If so, truncating the data each time will not allow you to track the history of your data. A good option in this case is to stage the source data in a separate table/database and load the data you need into another structure (with possible history tracking, e.g., a fact table with slowly changing dimensions).
Truncating is the best option to remove data, as it's a minimally logged operation.
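As a rough sketch of that truncate-and-load pattern in T-SQL (hypothetical table and column names; the actual reload of the staging table would be your SSIS data flow):

    -- Truncate staging, let the SSIS data flow reload it, then refresh the
    -- destination table from staging in one transaction.
    TRUNCATE TABLE dbo.StagingCustomers;   -- minimally logged, resets the table fast

    -- (SSIS data flow loads dbo.StagingCustomers here)

    BEGIN TRANSACTION;
        -- TRUNCATE is not allowed on a table referenced by foreign keys;
        -- fall back to DELETE in that case.
        TRUNCATE TABLE dbo.Customers;
        INSERT INTO dbo.Customers (CustomerId, Name, UpdatedAt)
        SELECT CustomerId, Name, UpdatedAt
        FROM dbo.StagingCustomers;
    COMMIT TRANSACTION;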

Versioning data in SQL Server so user can take a certain cut of the data

I have a requirement that in a SQL Server-backed website, which is essentially a large CRUD application, the user should be able to 'go back in time' and export the data as it was at a given point in time.
My question is what is the best strategy for this problem? Is there a systematic approach I can take and apply it across all tables?
Depending on what exactly you need, this can be relatively easy or hell.
Easy: make a history table for every table and copy data there pre-update or post-insert/update (i.e., new stuff is there too). Never delete from the original table; use logical deletes.
Hard: there is a database-wide version number counting up on every change, and every data item is correlated to a start and end version. This requires very fancy primary key mangling.
Just to add a little comment to the previous answers: if you need to go back in time for all users, you can use database snapshots.
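A database snapshot gives every user a read-only view of the whole database as of the moment it was created. For example (hypothetical database and file names):

    -- NAME must match the logical data-file name of the source database.
    CREATE DATABASE CrudApp_Snapshot ON
        ( NAME = CrudApp_Data,
          FILENAME = 'D:\Snapshots\CrudApp_Snapshot.ss' )
    AS SNAPSHOT OF CrudApp;

    -- Query it like any other database:
    SELECT * FROM CrudApp_Snapshot.dbo.Orders;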
The simplest solution is to save a copy of each row whenever it changes. This can be done most easily with a trigger. Then your UI must provide search abilities to go back and find the data.
This does produce an explosion of data, which gets worse when tables are updated frequently, so the next step is usually some kind of date-based purge of older data.
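A minimal sketch of that trigger approach, assuming a hypothetical dbo.Orders table:

    -- History table holds the pre-change image of every updated/deleted row.
    CREATE TABLE dbo.Orders_History (
        OrderId    int       NOT NULL,
        CustomerId int       NOT NULL,
        Amount     money     NOT NULL,
        ValidTo    datetime2 NOT NULL DEFAULT SYSUTCDATETIME()  -- when this version was superseded
    );
    GO

    CREATE TRIGGER trg_Orders_History
    ON dbo.Orders
    AFTER UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- "deleted" contains the rows as they looked before the change
        INSERT INTO dbo.Orders_History (OrderId, CustomerId, Amount)
        SELECT OrderId, CustomerId, Amount
        FROM deleted;
    END;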
An implementation you could look at is Team Foundation Server. It has the ability to perform historical queries (using the WIQL keyword ASOF). The backend is SQL Server, so there might be some clues there.