I am looking for the cleanest/fastest way to delete rows from a BigTable cluster that match a regex. What are the best options to do so? Thanks!
There are multiple ways to delete a row in BigTable, but I'm not aware of any specific way to delete rows matching a regex. You can delete rows that share a common prefix by using this method.
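For illustration, here is a minimal sketch of the prefix approach with the Python client; the project, instance, table, and prefix names below are made up, so substitute your own:

from google.cloud import bigtable

# Hypothetical project/instance/table names; admin=True is required for drop_by_prefix.
client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("my-table")

# Server-side bulk delete of every row whose key starts with the given prefix.
table.drop_by_prefix(b"events#2023-01-")

Note that this matches on a key prefix only, not an arbitrary regex, so it helps only if your pattern can be reduced to one or more prefixes.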
Another way to delete data from BigTable could be using this Dataflow template.
I think that if you share an example of what you are trying to achieve and what you have tried so far, we can provide a more accurate answer.
This is more of a theoretical question:
I have a scenario where I wish to do a delete + insert from a source table to a target table in DBT (match by PK, delete the existing records, then insert).
DBT doesn't seem to support this incremental strategy for BigQuery (it does for Snowflake).
Instead it offers insert_overwrite, which deletes and re-inserts a given partition; that doesn't solve my specific need.
Is there a reasoning behind this?
I think you want merge, which should accomplish the same as a delete and insert, but with better performance. BQ docs on merge statements.
delete+insert is only supported on Snowflake, to handle an edge case where the PK of the table is not actually unique (source). On Snowflake, merge is also the preferred and more performant option.
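As a rough sketch of what such a merge looks like when run directly against BigQuery (the dataset, table, and column names below are made up), a single statement replaces the delete-by-PK + insert:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# Hypothetical names: upsert source rows into the target table, matching on the PK.
merge_sql = """
MERGE `mydataset.target` T
USING `mydataset.source` S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET payload = S.payload, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, payload, updated_at) VALUES (S.id, S.payload, S.updated_at)
"""
client.query(merge_sql).result()  # same end state as delete-by-PK followed by insert

In dbt, configuring incremental_strategy='merge' together with a unique_key should generate a statement of this shape for you.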
I'm playing with BQ: I created a table and inserted some data. Then I re-inserted it and that created duplicates. I'm sure I'm missing something, but is there a way to skip the insert if the data already exists in the table?
My use case is that I get a stream of data from various clients, and sometimes it includes data they have previously already sent (I have no control over what they submit).
Is there a way to prevent duplicates when certain conditions are met? The easy case is when the entire row is identical, but I would also like to match on certain columns only.
It's difficult to answer your question without a clear idea of the table structure, but it feels like you could be interested in the MERGE statement: ref here.
With this DML statement you can combine INSERT, UPDATE, and DELETE operations, and hence do exactly what you are describing.
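As a minimal sketch, assuming the incoming batch is first loaded into a staging table and that all table and column names below are hypothetical, a MERGE that only inserts keys not already present would look like this:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# Hypothetical names: insert incoming rows only when the key is not already in the target.
dedup_sql = """
MERGE `mydataset.events` T
USING `mydataset.incoming_batch` S
ON T.client_id = S.client_id AND T.event_id = S.event_id
WHEN NOT MATCHED THEN
  INSERT (client_id, event_id, payload) VALUES (S.client_id, S.event_id, S.payload)
"""
client.query(dedup_sql).result()  # rows whose key already exists in the target are skipped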
Thanks for looking at my query. I have 20k+ unique identification ids provided by a client, and I want to look up all of these ids in MongoDB using a single query. I tried using $in, but it does not seem feasible to put all 20k ids into $in and search. Is there a better way of achieving this?
If the id field is indexed, an $in query should be very fast, but I don't think it is a good idea to run a query with 20k ids at once, as it may consume quite a lot of resources such as memory. You can split the ids into groups of a reasonable size and run the queries separately, and you can still run those queries in parallel at the application level.
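A minimal sketch of the batching idea with pymongo; the connection string, database, collection, and field names are assumptions:

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["mycollection"]  # hypothetical names

# Placeholder for the real 20k+ client-provided ids.
ids = ["id-%05d" % n for n in range(20000)]

BATCH = 1000  # a reasonable chunk size; tune to your document size and memory budget
results = []
for i in range(0, len(ids), BATCH):
    chunk = ids[i:i + BATCH]
    # Each query stays small; an index on "identification_id" keeps the lookups fast.
    results.extend(coll.find({"identification_id": {"$in": chunk}}))

The chunks are independent, so they can also be dispatched to a thread pool if you want them to run in parallel.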
Consider importing your 20k+ ids into a collection (say, using mongoimport etc.). Then perform a $lookup from your root collection to the search collection. Depending on whether the $lookup result is an empty array or not, you can proceed with your original operation that required $in.
Here is a Mongo playground for your reference.
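A minimal sketch of that approach with pymongo, assuming the ids were already imported into a helper collection named search_ids with one document per id (all names here are hypothetical):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]  # hypothetical connection and database

pipeline = [
    # Join each root document against the imported ids on its identification field.
    {"$lookup": {
        "from": "search_ids",
        "localField": "identification_id",  # hypothetical field name
        "foreignField": "_id",
        "as": "matched",
    }},
    # Keep only the documents whose id was present in the imported collection.
    {"$match": {"matched": {"$ne": []}}},
]
for doc in db["mycollection"].aggregate(pipeline):
    pass  # proceed here with the original operation that needed $in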
Is there a convenient way (Python, web UI, or CLI) of inserting a new column into an existing BigQuery table (that already has 100 columns or so) and updating the schema accordingly?
Say I want to insert it after column 49. If I do this via a query, I will have to type every single column name, will I not?
Update: the suggested answer does not make it clear how this applies to BigQuery. Furthermore, the documentation does not seem to cover the
ALTER TABLE `tablename` ADD `column_name1` TEXT NOT NULL AFTER `column_name2`;
syntax. A test confirmed that the AFTER identifier does not work in BigQuery.
I think it is not possible to perform this action in a simple way, but I thought of some workarounds to achieve it, such as:
Creating a view after adding your column.
Creating a table from a query result after adding your column.
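For the second workaround, a minimal sketch with the Python client; the dataset, table, and column names are made up, and every column does have to be listed explicitly to control where the new one lands:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# Hypothetical names: rebuild the table with the new column placed after column 49.
ddl = """
CREATE OR REPLACE TABLE `mydataset.mytable_v2` AS
SELECT
  col_01, col_02,                      -- ... columns 1 through 49, listed explicitly
  CAST(NULL AS STRING) AS new_column,  -- the column being inserted
  col_50, col_51                       -- ... and the remaining columns
FROM `mydataset.mytable`
"""
client.query(ddl).result()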
On the other hand, I can't see how this is useful; the only scenario I can think of for this requirement is if you are using SELECT *, which is not recommended when using BigQuery according to the BigQuery best practices. If that is not the case, please share your use case so we can get a better understanding of it.
Since this is not a current feature of BigQuery, you can file a feature request asking for it.
Our use case for BigQuery is a little unique. I want to start using date-partitioned tables, but our data is very much eventual: it doesn't get inserted when it occurs, but eventually, whenever it is provided to the server. At times this can be days or even months after the fact. Thus, the _PARTITIONTIME pseudo column is useless to us.
My question: is there a way I can specify the column that would act like _PARTITIONTIME and still have the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using date-partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
The only thing you need to do is insert/load each data batch into the respective partition by referencing not just the table name but the table with a partition decorator, like yourtable$20160718.
This way you can load the data into the partition it belongs to.
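A minimal sketch of such a batch load with the Python client; the project, dataset, table, and GCS path below are made up, and the table is assumed to already exist as an ingestion-time partitioned table:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# The partition decorator routes this batch into the 2016-07-18 partition.
table_id = "my-project.mydataset.yourtable$20160718"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events_20160718.json",  # hypothetical source file
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish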