I am creating a Hive table by joining multiple source tables. The join takes approximately 3 hours because of the huge data volume. The table is loaded as truncate-and-load and is consumed further by downstream systems.
We plan to refresh this Hive table 4 times a day because the data in the source tables keeps getting updated. Since the load is truncate-and-load, the table will be empty for roughly 3 hours on each refresh while the join query runs, and during that window no data is available to the downstream consumers.
Can someone suggest how we can keep truncating and loading the table while the old data stays available to the downstream systems during each fresh load?
One option to ensure the downstream gets data during the ~3 hr load window is to maintain a read copy of the table for the downstream systems. For example, create a tableB that is populated with a select * from tableA_with_joins. Downstream then reads from tableB even while a truncate-and-load is running on tableA.
The downside of this approach is the extra time spent syncing the data from tableA to tableB, but it ensures your downstream keeps receiving data during the reload window.
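A minimal Hive sketch of that refresh step, assuming the table names above (run it only after tableA_with_joins has finished its reload, so tableB is replaced in one quick step):

-- refresh the read copy once tableA_with_joins has finished loading
INSERT OVERWRITE TABLE tableB
SELECT * FROM tableA_with_joins;

Downstream queries always point at tableB, so the window without data shrinks from the ~3 hr join to the much shorter copy from tableA.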
I am using the following query to populate my fact table:
SELECT sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
FROM Book AS b
INNER JOIN Sales AS sh
  ON b.isbn = sh.isbn_l
The main thing is that I want to load the table incrementally, from a specific time to a specific time. So if I load today, I should get all the records from the last load up to now.
And if I load again the day after tomorrow, I should get only the data that arrived after today's load, up to the day after tomorrow.
In other words, NO DUPLICATED ROWS or DATA. What should I do?
Any ideas please?
Thank you in advance.
Streams (and maybe Tasks) are your friends here.
A Snowflake Stream records the delta of change data capture (CDC) information for a table (such as a staging table), including inserts and other DML changes. A stream allows querying and consuming a set of changes to a table, at the row level, between two transactional points of time.
In a continuous data pipeline, table streams record when staging tables and any downstream tables are populated with data from business applications using continuous data loading and are ready for further processing using SQL statements.
Snowflake Tasks may optionally use table streams to provide a convenient way to continuously process new or changed data. A task can transform new or changed rows that a stream surfaces. Each time a task is scheduled to run, it can verify whether a stream contains change data for a table (using SYSTEM$STREAM_HAS_DATA) and either consume the change data or skip the current run if no change data exists.
Users can define a simple tree-like structure of tasks that executes consecutive SQL statements to process data and move it to various destination tables.
https://docs.snowflake.com/en/user-guide/data-pipelines-intro.html
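A minimal sketch of the pattern, assuming the Book and Sales tables from the question, plus a target fact table FACT_SALES and a warehouse MY_WH, which are made-up names here:

-- record row-level changes (inserts, updates, deletes) on the staging table
CREATE OR REPLACE STREAM sales_stream ON TABLE Sales;

-- a scheduled task that only does work when the stream has captured new rows
CREATE OR REPLACE TASK load_fact_sales
  WAREHOUSE = MY_WH
  SCHEDULE = '60 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('SALES_STREAM')
AS
  INSERT INTO fact_sales (isbn_l, id_c, id_s, data, quantity, price)
  SELECT sh.isbn_l, sh.id_c, sh.id_s, sh.data, sh.quantity, b.price
  FROM sales_stream AS sh
  INNER JOIN Book AS b
    ON b.isbn = sh.isbn_l
  WHERE sh.METADATA$ACTION = 'INSERT';  -- only newly inserted staging rows

-- tasks are created suspended; resume to start the schedule
ALTER TASK load_fact_sales RESUME;

Because the task selects from the stream inside a DML statement, each run consumes only the rows that arrived since the previous run, which is the no-duplicates behaviour asked for.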
I'm wondering what kind of insert it counts as when I save the results of a large query (with multiple joins and unions) to a destination (day-partitioned) table.
Currently, on a GC VM, I execute these queries, save the results to local temporary CSVs, and upload those CSVs to their respective tables.
This is fairly inefficient (not as fast as it could be, and it uses quite a lot of VM resources). However, it is cheap, since CSV load jobs are free. If I were to save the query results into a destination table (appending to old data that already consists of 100M+ rows), would that incur streaming insert costs? That is what I'd like to avoid, since $0.02/MB can add up quickly given how much data we add daily.
Thanks for your help.
Inside BigQuery, running a query and saving the results to a destination table costs you:
- the query price (which you pay anyway)
- the storage price for the new data accumulating in the table (choose a partitioned table)
- no streaming costs

If you have data outside of BigQuery and you add it to BigQuery, you pay:
- nothing for a load job (loading is free)
- streaming insert costs if you use streaming inserts
- storage for the new data added to the table
I'm wondering what kind of insert it counts as when I save the results of a large query (with multiple joins and unions) to a destination (day-partitioned) table.
... if I were to save the query results into a destination table (appending to old data that already consists of 100M+ rows), would that incur streaming insert costs?
Setting the destination table on the query job is the most effective way of getting the result of that query appended to the existing table. It DOES NOT incur any extra streaming cost, because no streaming is happening here at all.
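The same append can also be written as a standard SQL DML statement, which likewise runs as an ordinary query job (query pricing plus storage, no streaming). A sketch with made-up project, dataset, and table names:

-- appends the query result to the existing day-partitioned table;
-- billed as a query job, streaming inserts are not involved
INSERT INTO `my_project.my_dataset.fact_table` (id, amount, price, event_date)
SELECT a.id, a.amount, b.price, a.event_date
FROM `my_project.my_dataset.staging_a` AS a
JOIN `my_project.my_dataset.staging_b` AS b
  ON b.id = a.id;

Equivalently, the API/CLI route is to set destinationTable with writeDisposition=WRITE_APPEND on the query job configuration.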
To keep one column unique, several of my clients stream data into a staging table in BQ (retrying twice, at 10-minute intervals, if the row is not yet present in the main table), and a separate cron job MERGEs the staging table into a column-partitioned main table every few minutes.
I need to truncate the staging table once it has been merged into the main table, but my clients seem to be streaming data into it all the time. Is there any recommendation here?
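For reference, the staging-to-main merge described above would look roughly like this; the table names and the id key column are placeholders:

-- merge freshly streamed rows from the staging table into the partitioned main table,
-- inserting only ids that are not there yet (deduplicate the staging rows first if needed)
MERGE `my_project.my_dataset.main_table` AS m
USING `my_project.my_dataset.staging_table` AS s
  ON m.id = s.id
WHEN NOT MATCHED THEN
  INSERT (id, payload, event_date)
  VALUES (s.id, s.payload, s.event_date);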
ALERT: Do not truncate a table that is receiving streaming results.
See https://cloud.google.com/bigquery/troubleshooting-errors#streaming:
Deleting and/or recreating a table may create a period of time where streaming inserts are effectively delivered to the old table and will not be present in the newly created table.
Truncating a table's data (e.g. via a query job that uses writeDisposition of WRITE_TRUNCATE) may similarly cause subsequent inserts during the consistency period to be dropped.
For alternative best practices when streaming into BigQuery, see:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
In this case - why not have your several clients write to Pub/Sub instead? Then you can use Dataflow to move this data into permanent tables.
Bonus: Pub/sub + Dataflow + BigQuery can guarantee "Exactly Once" delivery.
Here's the scenario I need help with:
I have a large Oracle table. This Oracle table is queried by mobile app users 1-10 times per second, which allows for very little downtime of the table.
There is a backend process that refreshes all the data in the table, approximately 1 million rows. The process deletes everything from the table and then inserts the values from the source table. That's it.
The problem: this leaves the table unavailable for too long (about 15 minutes).
I read about partition exchange, but all the examples I find deal with a certain partition range which doesn't apply to my problem.
My question: can I somehow refresh the data in a temp offline table and then just make that table my online/live table? That would just be a synonym/name swap, wouldn't it? Are there better methods?
Thanks!!!
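To illustrate the synonym-swap idea from the question, here is a rough Oracle sketch; ACCOUNTS is the synonym the app queries, and ACCOUNTS_A, ACCOUNTS_B and SOURCE_TABLE are made-up names:

-- the app always queries the synonym, never a physical table directly
CREATE OR REPLACE SYNONYM accounts FOR accounts_a;

-- refresh the offline copy while the app keeps reading accounts_a
TRUNCATE TABLE accounts_b;
INSERT /*+ APPEND */ INTO accounts_b SELECT * FROM source_table;
COMMIT;

-- near-instant switch: repoint the synonym at the freshly loaded table
CREATE OR REPLACE SYNONYM accounts FOR accounts_b;

The swap itself is a quick metadata change, so readers see either the old data or the new data, never an empty table.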
I have an SSIS package that runs repeatedly, once an hour. The package first truncates a table and then populates it with new data, and this process takes 15-20 minutes. While the package runs, the data is not available to the users, so they have to wait until the package completes. Is there any way to handle this situation so users don't have to wait?
Do not truncate the table. Instead, add an audit column with a date data type, partition the table into hourly partitions on this audit column, and drop the old partition once the new partition has been loaded with the new data.
Make sure the users' queries are directed to the proper partition with the help of the audit column.
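A rough T-SQL sketch of that layout (SQL Server 2016+ assumed for partition-level TRUNCATE; all object names and boundary values are illustrative):

-- hourly partitions driven by the audit column
CREATE PARTITION FUNCTION pf_hourly (datetime2)
  AS RANGE RIGHT FOR VALUES ('2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 02:00');

CREATE PARTITION SCHEME ps_hourly
  AS PARTITION pf_hourly ALL TO ([PRIMARY]);

CREATE TABLE dbo.ReportData
(
    Id      int           NOT NULL,
    Payload nvarchar(200) NULL,
    AuditTs datetime2     NOT NULL  -- audit column the partitioning and user queries key on
) ON ps_hourly (AuditTs);

-- once the new hour's partition is loaded, clear and remove the old one
TRUNCATE TABLE dbo.ReportData WITH (PARTITIONS (2));
ALTER PARTITION FUNCTION pf_hourly() MERGE RANGE ('2024-01-01 00:00');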
You can do an 'A-B flip'.
Instead of truncating the client-facing table and reloading it, you could use two tables to do the job.
For example, if the table in question is called ACCOUNT:
1. Load the data to a table called STG_ACCOUNT
2. Rename ACCOUNT to ACCOUNT_OLD
3. Rename STG_ACCOUNT to ACCOUNT
4. Rename ACCOUNT_OLD to STG_ACCOUNT
By doing this, you minimize the amount of time the users have an empty table.
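In SQL Server the flip itself is just a few quick metadata operations, for example (a sketch using the names from the steps above):

-- run after STG_ACCOUNT has been loaded with the fresh data
BEGIN TRANSACTION;
EXEC sp_rename 'dbo.ACCOUNT', 'ACCOUNT_OLD';
EXEC sp_rename 'dbo.STG_ACCOUNT', 'ACCOUNT';
EXEC sp_rename 'dbo.ACCOUNT_OLD', 'STG_ACCOUNT';
COMMIT TRANSACTION;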
It's a very dangerous practice, but you can change the isolation level of your transactions (I mean the users' queries) from READ COMMITTED/SERIALIZABLE to READ UNCOMMITTED. However, the behavior of such queries is very hard to predict: if the table is being modified by the SSIS package (insert/delete/update) and end users do uncommitted reads (like SELECT * FROM Table1 WITH (NOLOCK)), some rows can be counted several times or missed.
If users want to read only the new hour's data, you can try switching to these 'dirty reads', but be careful!
If they can work with the previous hour's data, the best solution is the partitioning approach described by Arnab, but partitioning is available only in Enterprise edition. In other SQL Server editions, use the rename approach as Zak said.
[Updated] If the main lag (tens of minutes, as you said) is caused by complex calculations (and NOT by the amount of loaded rows!), you can use another table as a buffer. Store several rows there (hundreds, thousands, etc.) and then reload them into the main table. That way new data becomes available in portions, without any 'dirty read' tricks.
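A tiny T-SQL sketch of that buffered hand-off; dbo.MainTable and dbo.BufferTable are made-up names:

-- the loader writes small batches into the buffer; a short transaction then
-- moves each finished batch into the user-facing table
BEGIN TRANSACTION;
INSERT INTO dbo.MainTable (Id, Payload, AuditTs)
SELECT Id, Payload, AuditTs
FROM dbo.BufferTable;

DELETE FROM dbo.BufferTable;
COMMIT TRANSACTION;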