Pricing of data import into BigQuery - google-bigquery

I'm looking for the price of importing data from Cloud Storage into BigQuery (through "bq import").
There is no UPDATE statement in BigQuery, so I want to drop my table and recreate it from scratch.
Thanks,
Romain.

As stated in the documentation, importing data is free. Only storing or querying it is charged.

https://cloud.google.com/bigquery/docs/updating-data
There is an UPDATE statement in BigQuery now.
But the quota is low, so in practice we still sometimes drop and recreate the table instead of using UPDATE.
https://cloud.google.com/bigquery/quotas
Data Manipulation Language statements
The following limits apply to Data Manipulation Language (DML).
Maximum UPDATE/DELETE statements per day per table: 96
Maximum UPDATE/DELETE statements per day per project: 10,000
Maximum INSERT statements per day per table: 1,000
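Given those limits, one way to stay under the per-table UPDATE quota is to batch many row changes into a single DML statement. A minimal sketch, with made-up table and column names:

```python
# Sketch only; the table layout and the `processed`/`id` columns are assumptions.
def batched_update(table: str, ids: list[int]) -> str:
    """Build one UPDATE covering many rows; this counts as a single
    DML statement against the daily per-table quota."""
    id_list = ", ".join(str(i) for i in ids)
    return (
        f"UPDATE `{table}` "
        "SET processed = TRUE "
        f"WHERE id IN ({id_list})"
    )

# With google-cloud-bigquery installed, it could be run as:
#   from google.cloud import bigquery
#   bigquery.Client().query(batched_update("mydataset.mytable", [1, 2, 3])).result()
```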

Related

Table creation limit per day for a dataset in BigQuery

I need to test something in BigQuery, so I want to create more than 500k tables in a single dataset in BigQuery in one day. Is there any hard daily limit on table creation in BigQuery?
Yes.
If you are creating your tables using load jobs, you only have 100k load jobs per day per project (including failures).
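For completeness, a hedged sketch of what such a load job looks like with the google-cloud-bigquery Python client (bucket, file, and table names are placeholders; the job itself is free to run but counts toward the daily load-job limit):

```python
# Placeholder names throughout; load jobs themselves are free to run.
def gcs_uri(bucket: str, path: str) -> str:
    """Build the gs:// URI a load job reads from."""
    return f"gs://{bucket}/{path}"

# from google.cloud import bigquery
# client = bigquery.Client()
# job = client.load_table_from_uri(
#     gcs_uri("my-bucket", "exports/data.csv"),
#     "mydataset.mytable",
#     job_config=bigquery.LoadJobConfig(
#         source_format=bigquery.SourceFormat.CSV,
#         autodetect=True,
#     ),
# )
# job.result()  # each job, success or failure, counts toward the daily limit
```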

The cost of saving the query result to a table in BigQuery?

As the title states.
I'm curious: there's a Python API (QueryJobConfig) where I can set a destination table to save the query result. For this kind of saving, how much will it cost in GCP?
There are two ways I can load or insert external data into BigQuery: streaming and batch uploading.
However, streaming inserts might be costly.
To make it clear: saving the query result may count as an insertion, so if I use this method to update a table frequently, will it be as costly as doing streaming inserts?
BigQuery saves query results to a table, which can be either permanent or temporary.
A temporary table has a lifetime of 24 hours, and you are not charged for storing temporary tables.
When you write query results to a permanent table, you can create/append/overwrite the table. When you specify a destination table for large query results, you are charged for storing the data. Storage pricing is based on the amount of data stored in your tables. Please, refer to official documentation describing BigQuery pricing.
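A minimal sketch of the QueryJobConfig flow the question asks about (project, dataset, and table names are assumptions): saving results this way is billed as a normal query plus storage of the destination table, not as a streaming insert.

```python
# Placeholder names; the destination table accrues normal storage charges.
def table_ref(project: str, dataset: str, table: str) -> str:
    """Fully-qualified table id for the destination."""
    return f"{project}.{dataset}.{table}"

# from google.cloud import bigquery
# client = bigquery.Client()
# config = bigquery.QueryJobConfig(
#     destination=table_ref("myproject", "mydataset", "results"),
#     write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite
# )
# client.query("SELECT * FROM mydataset.source", job_config=config).result()
```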
Streaming inserts can be done either via Dataflow jobs or by calling the BigQuery streaming insert API directly. You need to pay for the streaming insert process ($0.01 per 200 MB) and for storage of the new data.
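To put the streaming rate in perspective, a small worked example of the $0.01 per 200 MB charge quoted above:

```python
def streaming_cost_usd(mb_inserted: float) -> float:
    """Streaming-insert charge at the rate quoted above: $0.01 per 200 MB."""
    return (mb_inserted / 200) * 0.01

# Streaming 10 GB (10,000 MB) comes to roughly $0.50:
assert abs(streaming_cost_usd(10_000) - 0.50) < 1e-9
```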
If you wish to estimate storage and query cost, please refer to official documentation.
I hope you find the above pieces of information useful.

What's the cost of a scan including a column that was added recently?

Assume I have an old table with a lot of data. It has two columns: user_id, which has existed from the very beginning, and data, added very recently, say a week ago. My goal is to join this table on user_id but retrieve only the newly created data column. Could it be that, because the data column didn't exist until recently, there is no point in scanning the whole user_id range, and the query would therefore be cheaper? How is the price calculated for such an operation?
According to the documentation there are 2 pricing models for queries:
On-demand pricing
Flat-rate pricing
Since you use on-demand pricing, you will only be billed by the number of bytes processed; you can check how data size is calculated here. In that sense the answer is: yes, scanning user_id partially would be cheaper. But reading through the documentation you'll find this sentence:
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
So probably the best solution would be to create another table containing only the data that has to be processed, and run the query against that.
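To illustrate column-level billing, a rough sketch (table and column names are assumptions); you can also ask BigQuery for an exact estimate with a dry run, which costs nothing:

```python
# INT64 values have a documented fixed size of 8 bytes each, so a
# one-million-row INT64 column contributes about 8 MB to the bytes scanned,
# regardless of what other columns the table has.
def int64_column_bytes(row_count: int) -> int:
    return row_count * 8

# A dry run reports the exact bytes-processed estimate without running
# (or billing) the query:
# from google.cloud import bigquery
# job = bigquery.Client().query(
#     "SELECT user_id, data FROM mydataset.mytable",
#     job_config=bigquery.QueryJobConfig(dry_run=True),
# )
# print(job.total_bytes_processed)
```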

Does BigQuery charge if a table gets deleted via a retention period?

I have around 150 GB of data and I want to store it in BigQuery using DML statements.
Here is the pricing model for that.
https://cloud.google.com/bigquery/pricing#dml
According to that page, they will charge for deleting from the table via DML.
If I create a table with a retention period, will I be charged for that? Consider that I will always be inserting data; I am not worried about the cost of inserting data.
Based on the DML specifications, Google will charge for the deletion of rows if it is done using a DML statement (i.e., a DELETE command in SQL, like DELETE FROM mydataset.mytable WHERE id=xxx;). The reason: BigQuery has to scan rows to find the ones to delete, so you pay for the number of bytes scanned before the matching rows are deleted.
You can always delete your entire table from your dataset for free by either using BigQuery UI or bq command line utility.
Also, you will be charged for the storage costs in BigQuery (irrespective of usage). Meaning: you will pay for the number of bytes your data is occupying on Google disks.
BigQuery charges for deleting from a table, not deleting a table. Executing a DROP TABLE statement is free.
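A sketch of that contrast (names are placeholders): dropping the whole table is free, while a DML DELETE is billed for the bytes it scans. The $5/TiB on-demand rate below is an assumption; check current pricing.

```python
def on_demand_query_cost_usd(bytes_scanned: int, usd_per_tib: float = 5.0) -> float:
    """Cost of the bytes a DML DELETE scans (the rate is an assumption)."""
    return bytes_scanned / 2**40 * usd_per_tib

# Free: remove the table itself.
#   from google.cloud import bigquery
#   bigquery.Client().delete_table("mydataset.mytable", not_found_ok=True)
# Billed: scans the `id` column to find matching rows first.
#   bigquery.Client().query("DELETE FROM mydataset.mytable WHERE id = 42").result()
```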
Creating tables is free, unless you create a table from the result of a table, in which case see query pricing to calculate your costs.
The cost of storage is based on the number of bytes stored and how long you keep the data. See storage pricing for more details.
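A hedged sketch of the "retention period" in the question, i.e. a table expiration (names are placeholders): when BigQuery drops the table at expiry you are not billed for the deletion itself; you only pay storage while the table exists.

```python
import datetime

def expiry(days: int) -> datetime.datetime:
    """Timestamp at which the table should expire."""
    return datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(days=days)

# from google.cloud import bigquery
# client = bigquery.Client()
# table = client.get_table("mydataset.mytable")
# table.expires = expiry(30)               # auto-drop after 30 days
# client.update_table(table, ["expires"])
```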

Google BigQuery: there are no primary key or unique constraints, so how do you prevent duplicate records from being inserted?

Google BigQuery has no primary key or unique constraints.
We cannot use traditional SQL options such as insert ignore or insert on duplicate key update so how do you prevent duplicate records being inserted into Google BigQuery?
If I have to call delete first (based on a unique key in my own system) and then insert to prevent duplicate records being inserted into BigQuery, wouldn't that be too inefficient? I would assume that insert is the cheapest operation: no query, just appending data. If I have to call delete for each insert, it will be too inefficient and cost us extra money.
What is your advice and suggestions based on your experience?
It would be nice if BigQuery had a primary key, but perhaps that conflicts with the algorithms/data structures BigQuery is based on?
So let's clear up some facts first.
BigQuery is a managed data warehouse suitable for large datasets, and it's complementary to a traditional database rather than a replacement for one.
Up until early 2020 there was a maximum of only 96 DML (UPDATE/DELETE) operations on a table per day. That low limit forced you to think of BQ as a data lake. The limit has since been removed, but it demonstrates that the early design of the system was oriented around "append-only".
So on BigQuery, you actually let all data in and favor an append-only design. That means that, by design, you have a database that holds a new row for every update. Hence, if you want to use the latest data, you need to pick the last row per key and use that.
We actually leverage insights from every new update we add to the same row. For example, we can detect how long it took an end-user to choose his/her country in the signup flow: because we have a dropdown of countries, it took some time to scroll to the right country, and our metrics show this, because we ended up with two rows in BQ, one before the country was selected and one after. Based on the selection time, we were able to optimize the process: our country dropdown now lists the 5 most recent/frequent countries first, so those users no longer need to scroll and pick a country; it's faster.
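The "pick the last row per key" pattern described above can be sketched as a query builder (table and column names are assumptions):

```python
def latest_rows_sql(table: str, key: str, ts: str) -> str:
    """Keep only the newest row per key from an append-only table."""
    return (
        "SELECT * EXCEPT (rn) FROM ("
        f" SELECT *, ROW_NUMBER() OVER (PARTITION BY {key} ORDER BY {ts} DESC) AS rn"
        f" FROM `{table}`"
        ") WHERE rn = 1"
    )

# e.g. latest_rows_sql("mydataset.users", "user_id", "updated_at")
```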
"Bulk delete and insert" is the approach I am using to avoid duplicated records, and Google's own "YouTube BigQuery Transfer Service" uses "bulk delete and insert" too.
The "YouTube BigQuery Transfer Service" pushes daily reports to the same set of report tables every day. Each record has a "date" column.
When we run a YouTube BigQuery Transfer backfill (asking the transfer service to push the reports for certain dates again), the service will first delete the full dataset for that date from the report tables, then insert the full dataset for that date back into the report tables.
Another approach is to drop the results table first (if it already exists), then re-create it and load the results into it again. I use this approach a lot. Every day, my process saves its results into results tables in the daily dataset. If I rerun the process for a given day, my script checks whether the results tables for that day exist. If a table exists, it deletes it, re-creates a fresh table, and loads the process results into the newly created table.
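The drop-and-recreate flow described above, as a rough sketch (dataset, bucket, and the per-day naming scheme are assumptions):

```python
def daily_table_id(dataset: str, day: str) -> str:
    """One results table per day, e.g. mydataset.results_20240131."""
    return f"{dataset}.results_{day.replace('-', '')}"

# from google.cloud import bigquery
# client = bigquery.Client()
# table_id = daily_table_id("mydataset", "2024-01-31")
# client.delete_table(table_id, not_found_ok=True)  # drop if present (free)
# client.load_table_from_uri("gs://my-bucket/results.csv", table_id).result()
```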
BigQuery now doesn't have DML limits.
https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery