I'm wondering what kind of insert saving the results of a large query (with multiple joins and unions) to a destination (day partitioned) table is.
Currently on a GC VM I execute these queries, save the results in a local temporary CSV and upload these CSVs to their respective tables.
This is fairly ineffective (not as fast as it can be and uses quite a lot of VM resources). However, it is cheap, since CSV loading jobs are free. If I were to save the query results into a destination table (appending to old data which already consists of 100M+ rows), would those incur insertion streaming costs? This is what I'd like to avoid since $0.02/MB can rack up quite a bit since we're adding a lot of data on a daily basis.
Thanks for your help.
Inside Bigquery running a query and saving to destination table results you
query price (anyway you pay it)
storage price (new data gets accumulated to the table - choose partitioned)
no streaming costs
If you have data outside of BQ and you end up adding the data to BQ
if you use load (that's free)
streaming insert (has a cost component)
storage of the new data, the table you added
I'm wondering what kind of insert saving the results of a large query (with multiple joins and unions) to a destination (day partitioned) table is.
... if I were to save the query results into a destination table (appending to old data which already consists of 100M+ rows), would those incur insertion streaming costs?
Setting the destination table for query job is the most effective way of getting result of that query being added to the existing table. It DOES NOT incur any extra cost related to streaming as there is no streaming happening here at all
Related
My first thought is to first load data from S3 to a temporary table, apply the necessary transformations and then INSERT INTO target, final table. All the tables would have the same columns and are in Redshift.
However, how big of a performance hit would there be because of using multiple UPDATEs? Would it be better to split UPDATEs and filtering between multiple temporary tables for daily batch processing.
Instead of S3 -> TEMP -> FINAL, the flow would look like S3 -> TEMP1 -> ... -> TEMPN - > FINAL, where "->" would be "INSERT INTO".
Also, is it better to create temporary tables (CREATE TEMP TABLE) on the spot and dropping them every day, or use persisting temporary tables that would be truncated every day. I think using persisting temp tables would be the better choice as it allows me to check how the data looked as it was loaded and transformed that day.
As you are seeing there are lots of ways to run an update process and which is better will depend on factors that are not presented here. First off let's clarify what a TEMP table is and differentiate it from a staging table. A temp table only lives as long as the current session (connection) is active. If the connection drops then so does the TEMP table. A staging table is a permanent table used for staging data which more closely matches what you are describing you parts of your question. I'll use these two terms to be clear about which is being made (TEMP or staging).
Your question revolves around how big of a performance hit it would be to have a series of tables in the ETL (ELT?) process to improve, I expect, diagnosability / debug-ability. This is a fine goal but there are some downsides as with all tradeoffs in the real world.
If this is correct these tables will need to be staging tables as TEMP tables will disappear when the ETL session ends.
Saving a bunch of staging tables when one could be used has some downsides but how big these are depends on you situation. If your cluster is fairly idle and the ETL data payload isn't huge then the impact to the ETL process of the extra tables will be real but not large (a couple of seconds or less). These impacts are mostly around setting up (or truncating) the staging or TEMP tables. But if your cluster is running other workloads when ETL runs then the impact can be much larger.
You see there are many "resources" in a Redshift cluster that all need to be shared by everything running on the database. Some like memory allocation can be (somewhat) controlled through the WLM. Others cannot. The two biggies are network bandwidth and disk bandwidth. There is a fixed capacity to these bandwidths in Redshift and even though they are high, they are finite. There are other limits to Redshift's ability to execute a total workload but these in my experience are the big two.
Every time you create a table, TEMP or permanent, the data is stored to disk. This means a write to disk as well as distributing the data per the distribution settings of the table. Then when the table is accessed the data needs to be read from disk. All this unneeded data movement will have some impact, how large will depend on how big it is and what else is going on at the time. So you see the impact will be moderately small up to very large depending on a lot of factors, not the least of which is how many tables are you creating. The cost of doing this will need to be offset by the benefit of having these extra tables which is a business decision.
A common pattern is to load (COPY) data to a temp or staging table and then extract the DELETE patterns to one staging table and the INSERT data to another. Once the deletes and inserts are applied to then save these tables with a date stamp in the name and possibly unloaded to S3. After a while these sets of data are deleted, 1 month is common. This way you can figure out 'what happened' if things go sideways. This plus good database backups can be used to recover from code bugs.
Your secondary question is about whether it is better to drop and recreate or truncate. There have been a number of performance improvements to both of these statements. With a grain of salt, I'll offer my slightly dated experience comparing these. Both are fast but I saw drop and recreate as slightly faster (fewer dependencies to manage). That said the main difference is in how they interoperate with other aspects of the database. DROP will fail if there are dependent views (unless cascaded) and table permissions will be lost. DROP cannot be run in a transaction block and since it needs an exclusive lock on the table can be held off my another session reading the table. TRUNCATE can run in a transaction block but will force a commit so transaction changes will become visible to all. It is usually these differences that made the decision about TRUNCATE vs. DROP and there are other options such as DELETE and ALTER TABLE APPEND that have their own set of plusses and minuses.
So I'd generally advise against creating more tables than are actually needed in the ETL process when all needs are weighed (including performance and business needs). You may have excess capacity now but usually Redshift clusters get busier over time. The guiding principal here is don't move large amounts of data more times than is necessary.
As title stated.
I'm curious about there's a python AIP (QueryJobConfig) that I can set the destination table where I can save the query result, so for this kind of saving how much will it cost in GCP?
There are two types of way that I can load or insert external data to BigQuery. (Streaming and Batch uploading)
However, using streaming insert might be costly.
To make it clear, saving the query result may consider as an insertion, so if I use this method to update a table frequently, will it be costly as doing the streaming insert.
BigQuery saves query results to a table, which can be either permanent or temporary.
Temporary table has a lifetime of 24 hours and you are not charged for storing temporary tables.
When you write query results to a permanent table, you can create/append/overwrite the table. When you specify a destination table for large query results, you are charged for storing the data. Storage pricing is based on the amount of data stored in your tables. Please, refer to official documentation describing BigQuery pricing.
Streaming insert can be done either using DataFlow jobs or by calling the BigQuery streaming insert API directly. You need to pay for streaming insert process ($0.01per 200 MB) and storage of the new data.
If you wish to estimate storage and query cost, please refer to official documentation.
I hope you find the above pieces of information useful.
I need to move BigQuery datasets with many tables (both partitioned and unpartitioned) from the US to the EU.
If the source table is unpartitioned, the documented way of bq extracting the data to GCS and bq loading it in another region works fine, so far so good.
If however the source table is partitioned, during the load step the mapping between data and partition is lost and I'll end up having all data within one partition.
Is there a good (automated) way of exporting and importing partitioned tables in BQ? Any pointers would be greatly appreciated!
There's a few ways to do this, but I would personally use Cloud Dataflow to solve it. You'll have to pay a little bit more for Dataflow, but you'll save a lot of time and scripting in the long run.
High level:
Spin up a Dataflow pipeline
Read partitioned table in US (possibly aliasing the _PARTITIONTIME to make it easier later)
Write results back to BigQuery using same partition.
It's basically the same as what was talked about here.
Another solution is to use DML to load the data, instead of load, https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables. Since you have a timestamp column in the table to infer the partition, you can use
INSERT INTO PROJECT_ID.DATASET.mytable (_PARTITIONTIME, field1, field2) AS SELECT timestamp_column, 1, “one” FROM PROJECT_ID.DATASET.federated_table
You can define a permanent federated table, or a temporary one, https://cloud.google.com/bigquery/external-data-cloud-storage#permanent-tables. You'll need to pay for DML though, while load is free.
I am trying to copy a BigQuery table using the API from one table to the other in the same dataset.
While copying big tables seems to work just fine, copying small tables with a limited number of rows (1-10) I noticed that the destination table comes out empty (created but 0 rows).
I get the same results using the API and the BigQuery management console.
The issue is replicated for any table in any dataset I have. Looks like a bug or a designed behavior.
Could not find any "minimum lines" directive in the docs.. am I missing something?
EDIT:
Screenshots
Original table: video_content_events with 2 rows
Copy table: copy111 with 0 rows
How are you populating the small tables? Are you perchance using streaming insert (bq insert from the command line tool, tabledata.insertAll method)? If so, per the documentation, data can take up to 90 minutes to be copyable/exportable:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
I won't get super detailed, but the reason is that our copy and export operations are optimized to work on materialized files. Data within our streaming buffers are stored in a completely different system, and thus aren't picked up until the buffers are flushed into the traditional storage mechanism. That said, we are working on removing the copy/export delay.
If you aren't using streaming insert to populate the table, then definitely contact support/file a bug here.
There is no minimum records limit to copy the table within the same dataset or over a different dataset. This applies both for the API and the BigQuery UI. I just replicated your scenario of creating a new table with just 2 records and I was able to successfully copy the table to another table using UI.
Attaching screenshot
I tried to copy to a timestamp partitioned table. I messed up the timestamp, and 1000 x current timestamp. Guess it is beyond BigQuery's max partition range. Despite copy job success, no data is actually loaded to the destination table.
I'm using Google's Cloud Storage & BigQuery. I am not a DBA, I am a programmer. I hope this question is generic enough to help others too.
We've been collecting data from a lot of sources and will soon start collecting data real-time. Currently, each source goes to an independent table. As new data comes in we append it into the corresponding existing table.
Our data analysis requires each record to have a a timestamp. However our source data files are too big to edit before we add them to cloud storage (4+ GB of textual data/file). As far as I know there is no way to append a timestamp column to each row before bringing them in BigQuery, right?
We are thus toying with the idea of creating daily tables for each source. But don't know how this will work when we have real time data coming in.
Any tips/suggestions?
Currently, there is no way to automatically add timestamps to a table, although that is a feature that we're considering.
You say your source files are too big to edit before putting in cloud storage... does that mean that the entire source file should have the same timestamp? If so, you could import to a new BigQuery table without a timestamp, then run a query that basically copies the table but adds a timestamp. For example, SELECT all,fields, CURRENT_TIMESTAMP() FROM my.temp_table (you will likely want to use allow_large_results and set a destination table for that query). If you want to get a little bit trickier, you could use the dataset.DATASET pseudo-table to get the modified time of the table, and then add it as a column to your table either in a separate query or in a JOIN. Here is how you'd use the DATASET pseudo-table to get the last modified time:
SELECT MSEC_TO_TIMESTAMP(last_modified_time) AS time
FROM [publicdata:samples.__DATASET__]
WHERE table_id = 'wikipedia'
Another alternative to consider is the BigQuery streaming API (More info here). This lets you insert single rows or groups of rows into a table just by posting them directly to bigquery. This may save you a couple of steps.
Creating daily tables is a reasonable option, depending on how you plan to query the data and how many input sources you have. If this is going to make your queries span hundreds of tables, you're likely going to see poor performance. Note that if you need timestamps because you want to limit your queries to certain dates and those dates are within the last 7 days, you can use the time range decorators (documented here).