Enable sync between BigQuery and Snowflake - google-bigquery

We are using BigQuery and Snowflake (Azure-hosted), and we often export data from BigQuery and import it into Snowflake and vice versa. Is there any easy way to integrate both systems, e.g. automatically syncing a BigQuery table to Snowflake, rather than exporting to a file and importing it?

You should have a look at Change Data Capture (CDC) solutions to automate the sync.
Some of them have native BigQuery and Snowflake connectors.
Some examples:
HVR
Qlik Replicate
Striim
...

There are many ways to implement this, and the best one will depend on the nature of your data.
For example, if every day you have new data in BigQuery, then all you need to do is set up a daily export of the new data from BigQuery to GCS. Then it's easy to set up Snowflake with Snowpipe to ingest new data from GCS whenever it shows up:
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-gcs.html
But then how often do you want to sync this data? Is it append only, or does it need to account for past data changing? How do you solve conflicts when the same row changes in different ways on both sides? Etc.
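For the append-only daily case, a rough sketch could look like this; the bucket, dataset, integration and table names below are all made up:
-- BigQuery: a scheduled query that exports yesterday's rows to GCS
EXPORT DATA OPTIONS (
  uri = 'gs://my-sync-bucket/events/*.parquet',
  format = 'PARQUET',
  overwrite = true
) AS
SELECT *
FROM `my_project.my_dataset.events`
WHERE DATE(created_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);

-- Snowflake: a stage over the same bucket plus a pipe that auto-ingests new files
CREATE STAGE gcs_events_stage
  URL = 'gcs://my-sync-bucket/events/'
  STORAGE_INTEGRATION = gcs_int;  -- storage integration created beforehand

CREATE PIPE events_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = 'GCS_NOTIFICATION_INT'  -- notification integration for the GCS Pub/Sub subscription
AS
COPY INTO analytics.events
FROM @gcs_events_stage
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;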

I have the same scenario. I've built this template in Jupyter Notebook. I did a gap analysis after a few days and, at least in our case, it seems that Firebase/Google Analytics adds more rows to the already compiled daily tables even a few days later. We have about 10% more rows in an older BigQuery daily table than what was captured in Snowflake, so be mindful of the gap. To this date, the template I've created is not able to handle the missing rows. For us it works because we look at aggregated values (daily active users, retention, etc.) and the gap there is minimal.
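If you want to quantify that gap yourself, a simple check is to compare daily row counts on both sides; the project, dataset and table names below are placeholders for our setup:
-- In BigQuery, against the GA4 daily export tables
SELECT event_date, COUNT(*) AS bq_rows
FROM `my_project.analytics_123456789.events_2*`
GROUP BY event_date;

-- In Snowflake, against the table the notebook loads into
SELECT event_date, COUNT(*) AS sf_rows
FROM analytics.ga4_events
GROUP BY event_date;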

You could use Sling, which I worked on. It is a tool that lets you copy data between databases (including a BigQuery source and a Snowflake destination) using bulk loading methodologies. There is a free CLI version and a Cloud (hosted) version. I actually wrote a blog entry about this in detail (albeit with an AWS destination; the logic is similar), but essentially, if you use the CLI version, you can run one command after setting up your credentials:
$ sling run --src-conn BIGQUERY --src-stream segment.team_activity --tgt-conn SNOWFLAKE --tgt-object public.activity_teams --mode full-refresh
11:37AM INF connecting to source database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from source database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_teams
11:38AM INF created table public.activity_teams
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded

Related

Bulk load into Snowflake with Pentaho Data Integration over JDBC is slow

We have several on-premise databases and, so far, our data warehouse has also been on-premise. We are now moving to the cloud, and the data warehouse will be in Snowflake. But we still have more on-premise source systems than cloud ones, so we would like to stick with our on-premise ETL solution. We are using Pentaho Data Integration (PDI) as our ETL tool.
The issue is that the PDI Table output step, which uses the Snowflake JDBC driver, is horribly slow for bulk loads into Snowflake. A year ago it was even worse, as it simply did an INSERT INTO and a COMMIT after every row. It has improved a lot since then; looking at the Snowflake history/logs, it now seems to do some kind of PUT to a temporary Snowflake stage, but from there it still does some kind of INSERT into the target table, and this is slow (in our test case it took an hour to load 1,000,000 records).
As a workaround we use SnowSQL (Snowflake's command line tool) scripts, orchestrated by PDI, to do the bulk load into Snowflake. In our example case it then takes less than a minute to get the same 1,000,000 records into Snowflake.
Everything that is done inside the Snowflake database is done via PDI SQL steps sent to Snowflake over JDBC, and all our source system queries run fine with PDI. So the issue is only with the bulk load into Snowflake, where we need the following odd workaround:
Instead of:
PDI.Table input(get source data) >> PDI.Table output(write to Snowflake table)
we now have:
PDI.Table input(get source data) >> PDI.Write to local file >> Snowsql.PUT local file to Snowflake stage >> Snowsql.COPY data from Snowflake stage to Snowflake table >> PDI clears the local file and then the Snowflake stage.
It works, but it is much more complex than it needs to be (compared to a previous on-premise database load, for example).
I don't even know whether this issue is on the Snowflake side (the JDBC driver not working optimally) or on the PDI side (not utilizing the JDBC driver correctly), but I would like it to work better.
To bulk load into Snowflake, you need to do the PUT and COPY.
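Roughly, the SnowSQL workaround boils down to something like this (the file, database, schema and table names are made up):
-- upload the local extract to the table's stage
PUT file:///tmp/source_extract.csv @mydb.public.%target_table;

-- load it into the target table and purge the staged file after a successful load
COPY INTO mydb.public.target_table
FROM @mydb.public.%target_table
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
PURGE = TRUE;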

AWS Glue: Is it possible to pull only specific data from a database?

I need to transform a fairly big database table with AWS Glue to CSV. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of the row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some Built-in Transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
You haven't mentioned the type of database that you are trying to connect to. Anyway, for JDBC connections Spark has the query option, in which you can issue the usual SQL query to get only the rows you need.
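For example, a filter along these lines could go into that query option (the table and column names are assumptions, and the exact date arithmetic depends on your source database's SQL dialect):
SELECT *
FROM my_schema.my_table
WHERE creation_date >= CURRENT_TIMESTAMP - INTERVAL '24' HOUR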

How to handle Hive locking across Hive and Presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts on Presto. Naturally, we're running into windows of time where users hit an incomplete data set because Presto ignores locks.
The options I can think of:
Fork the presto-hive connector to support Hive S and X locks appropriately. This isn't too bad, but time-consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once a new partition is written, alter the Hive table to see it. Then we can just have views on top of the data that properly reconcile the latest version of each row (sketched below).
Stop doing insert-overwrite on S3, which has a long window of copying from the Hive staging location to the target table. If we move to HDFS for all insert-overwrites, we still have the issue, but it's over the span of time it takes to do an hdfs mv, which is significantly faster. (Probably bad: there's still a window where we can get incomplete data.)
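For reference, a rough sketch of option 3 (table, view and column names are made up):
-- each refresh lands in a brand-new partition instead of overwriting in place
INSERT OVERWRITE TABLE events PARTITION (load_id = '20240115_0200')
SELECT user_id, event_type, event_time
FROM staging_events;

-- readers go through a view that is re-pointed only after the write finishes
CREATE OR REPLACE VIEW events_current AS
SELECT user_id, event_type, event_time
FROM events
WHERE load_id = '20240115_0200';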
My question is how do people generally handle that? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third party tool that can query the hive metastore and interact with the hdfs/s3 directly while not respecting hive locks.

Get the Last Modified date for all BigQuery tables in a BigQuery Project

I have several datasets within a BigQuery project which are populated by various job engines and applications. I would like to maintain a dashboard of the Last Modified dates for every table within our project to monitor job failures.
Are there any command line or SQL commands which could provide this list of Last Modified dates?
For a SQL command you could try this one:
#standardSQL
SELECT *, TIMESTAMP_MILLIS(last_modified_time)
FROM `dataset.__TABLES__` WHERE table_id = 'table_id'
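To list every table in a dataset rather than a single one, you can drop the table_id filter and repeat (or UNION ALL) the query per dataset; the dataset name below is a placeholder:
#standardSQL
SELECT dataset_id, table_id, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM `my_dataset.__TABLES__`
ORDER BY last_modified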
I recommend, though, that you see if you can log these errors at the application level. By doing so you can also understand why something didn't work as expected.
If you are already using GCP you can make use of Stackdriver (it works on AWS as well); we started using it in our projects and I recommend giving it a try (we tested it with Python applications, though, so I'm not sure how the tool performs with other clients, but it should be quite similar).
I've just queried stacked GA4 data using the following code:
SELECT *, TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM analytics_#########.__TABLES__
WHERE table_id LIKE 'events_2%'
I have kept the 2 in events_2% to make sure my intraday tables (events_intraday_*) are not pulled through as well.

PDI or mysqldump to extract data without blocking the database or getting inconsistent data?

I have an ETL process that will run periodically. I was using Kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with table input and table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know whether the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction allows me to get the data in a consistent way and doesn't block the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think "Table Input" step doesn't take into account any modifications that are happening when you are reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. While in the middle of data load, insert few records in the source database. You will find that those records are not read into the target table. (note i tried with postgresql db and the number of rows read is : 1000000)
Now for your question, i suggest you using PDI since it gives you more control on the data in terms of versioning, sequences, SCDs and all the DWBI related activities. PDI makes it easier to load to the stage env. rather than simply dumping the entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation then at least they all start at the same time, but whilst the result is likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk inserts or extracts it's nearly always better to use the vendor-provided tools.
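For reference, the consistency part of mysqldump --single-transaction is just a repeatable-read snapshot on a single connection, which you could in principle reproduce yourself (the table and column names below are made up, and note that all extracts would have to share that one connection/transaction for this to help):
-- what --single-transaction does under the hood (InnoDB only)
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION WITH CONSISTENT SNAPSHOT;

-- every SELECT in this transaction sees the same point-in-time snapshot,
-- and writers are not blocked
SELECT id, name, updated_at FROM customers WHERE updated_at >= '2024-01-01';
SELECT id, customer_id, total, updated_at FROM orders WHERE updated_at >= '2024-01-01';

COMMIT;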