Stream-insert and then periodically merge into BigQuery within a Dataflow pipeline [closed] - google-bigquery

When building a Dataflow pipeline that aims to store the newest data per key in BigQuery, is it a valid approach to
stream-insert the events into a partitioned staging table, and
periodically merge (update/insert) them into the target table, so that only the newest data per key is stored there?
It is a requirement that the merge happens every 2-5 minutes and considers all rows in the staging table.
The idea for this approach is taken from the Google project https://github.com/GoogleCloudPlatform/DataflowTemplates (com.google.cloud.teleport.v2.templates.DataStreamToBigQuery).
So far it works well in our tests; the question arises because Google states in its documentation:
"Rows that were written to a table recently by using streaming (the tabledata.insertall method or the Storage Write API) cannot be modified with UPDATE, DELETE, or MERGE statements."
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language#limitations
Has anyone gone down this road in a production Dataflow pipeline with stable, positive results?

After a few hours and some thinking, I think I can answer my own question: the quoted limitation applies to modifying recently streamed rows. Since I only stream into the staging table (which the MERGE reads from but never modifies) and the MERGE only modifies the target table (which is not streamed into), the approach is perfectly fine.
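For reference, a minimal sketch of such a periodic MERGE; the table names (mydataset.staging, mydataset.target) and columns (id, updated_at, payload) are assumptions for illustration, not taken from the template:

    -- Keep only the newest staging row per key, then upsert it into the target.
    -- All object names here are placeholders.
    MERGE mydataset.target AS t
    USING (
      SELECT id, updated_at, payload
      FROM mydataset.staging
      QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1
    ) AS s
    ON t.id = s.id
    WHEN MATCHED AND s.updated_at > t.updated_at THEN
      UPDATE SET updated_at = s.updated_at, payload = s.payload
    WHEN NOT MATCHED THEN
      INSERT (id, updated_at, payload) VALUES (s.id, s.updated_at, s.payload);

Note that only the target table is modified; the staging table is merely read.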

I did this yesterday, and the time lag is around 15-45 minutes. If you have an ingestion-time column/field, you can use it to restrict which rows you are UPDATE-ing.
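For example (the table name, columns, and 90-minute window below are assumptions), the statement can be limited to rows old enough to have left the streaming buffer:

    -- Only touch rows whose ingestion timestamp is comfortably older than the
    -- streaming buffer window; all names are placeholders.
    UPDATE mydataset.target
    SET status = 'processed'
    WHERE ingestion_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 MINUTE)
      AND status = 'new';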

Related

Dropping SQL tables [closed]

I started to learn MySQL, and a thought came to mind: I always see memes about databases and dropping tables, and how much of a problem such an event can cause. My question is: why would someone working in a software development environment ever decide to drop a table, or even an entire schema for that matter?
There can be various reasons; the main ones that come to mind:
As part of a rollback: you migrated something to the production environment which had bugs or shouldn't have been deployed yet. In order to get back to the previous state, you'd need to drop the new table.
As part of cleanup: legacy parts of the database which you no longer need, old table partitions with already-archived data, user schemas of people no longer working for the company (a rough sketch of this case follows).
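For the cleanup case, a minimal MySQL sketch, with all object names invented for illustration:

    -- Remove a legacy table and an old partition whose data is already archived
    -- (assumes the orders table is partitioned).
    DROP TABLE IF EXISTS legacy_orders_2015;
    ALTER TABLE orders DROP PARTITION p2015;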

Issues while implementing Google Big Query [closed]

Our company is going to implement BigQuery.
We saw many drawbacks in BigQuery, such as:
1. Only 1000 requests per day allowed.
2. No update or delete allowed.
and so on...
Can you highlight some more drawbacks and also discuss the two above?
Please share any issues that came up during and after implementing BigQuery.
Thanks in advance.
"Only 1000 requests per day allowed"
Not true, fortunately! There is a limit on how many batch loads you can do to a table per day (1,000, so roughly one every 90 seconds), but this applies to loading data, not querying it. And if you need to load data more frequently, you can use the streaming API for up to 100,000 rows per second per table.
"No update delete allowed"
BigQuery is an analytical database which are not optimized for updates and deletes of individual rows. The analytical databases that support these operations usually do with caveats and performance costs. You can achieve the equivalent update and deletes with BigQuery by re-materializing your tables in just a couple minutes: https://stackoverflow.com/a/31663889/132438
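As a rough sketch of that re-materialization approach (the dataset, table, and column names are made up), a "delete" can be expressed by rewriting the table without the unwanted rows:

    -- Rewrite the table, keeping everything except the rows to "delete".
    -- mydataset.events and user_id are placeholders.
    CREATE OR REPLACE TABLE mydataset.events AS
    SELECT *
    FROM mydataset.events
    WHERE user_id != 'user-to-remove';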

Is column order in a table relevant for version control? [closed]

A version control system compares the scripted definition of a table to the checked-in state, so I guess many VCSs will see reordering the columns of a table as a change.
Since T-SQL does not support adding a new column in the middle of a table, and since in a relational database the ordering should not matter, what are good practices for version control of table definitions when the column order could change?
Sometimes you may need to recreate a column that was dropped from the middle of a table.
You should be storing scripts to set up your database in source control, not trying to have something reverse-engineer those scripts from the state of the database. Column order then becomes a non-issue.
Specifically, I've seen two schemes that work well. In the first, each database schema update script is given a sequential number, and the database tracks which sequence number is the last applied. In the second, each database schema update script is given a UUID, and the database tracks all UUIDs that have been applied.
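A minimal T-SQL sketch of the sequential-number scheme (the table, column, and script names are invented for illustration):

    -- The database records which numbered update script was applied last.
    CREATE TABLE SchemaVersion (
        Version   int       NOT NULL,
        AppliedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Contents of an update script such as 0002_add_customer_email.sql:
    ALTER TABLE Customer ADD Email nvarchar(255) NULL;
    INSERT INTO SchemaVersion (Version) VALUES (2);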
I would check out the book Refactoring Databases for more details and examples of how to manage database changes.

Databases: What is a HANA delta table? [closed]

What is a delta table in SAP HANA databases?
From initial googling, I understood that these tables are some kind of intermediate tables to help data go from one state to another. But when exactly are they used? How are they different from "normal" tables in HANA databases?
Delta tables are an SAP HANA-specific technique to speed up write operations in the database.
Tables in SAP HANA usually use the column store, which is read-optimized. When data is written to a column-store table, it is first stored in the delta storage for that table; this delta storage is periodically merged into the main column store.
See e.g. https://cookbook.experiencesaphana.com/bw/operating-bw-on-hana/hana-database-administration/system-configuration/delta-merge/column-store/ for more details.
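HANA performs the delta merge automatically, but it can also be triggered manually in SQL; the table name below is a placeholder:

    -- Move the contents of MY_TABLE's delta storage into its read-optimized
    -- main (column) store.
    MERGE DELTA OF MY_TABLE;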
"Delta" is commonly used to mean "difference". A delta table would show only the differences between two tables, the records that were added/deleted/changed during the new process. It's a way to test new code to see what changes it caused.

powershell multi valued variables or sql table [closed]

I want to write data into memory for a temporary time only. The format is essentially the same as an SQL table with, say, 5 columns and 1,000 rows, give or take. I simply want to store this data and run queries against it to make calculations, sort it, and query it to then produce chart reports and Excel data.
I looked at custom PSObjects and then SQL, and I can't see why I'd use custom PSObjects over SQL. What do you think?
I also couldn't see that adding multiple rows using PSObjects was as straightforward as adding another row in SQL.
Thanks,
Steve
I guess it depends on what you're more comfortable with, but if you're going to do it in PowerShell, then using PS custom objects seems like the logical choice, since the cmdlets were designed to work with those.