Bulk load into Snowflake with Pentaho Data Integration over JDBC is slow

We have several on-premise databases, and until now our data warehouse was on-premise as well. We are now moving to the cloud, and the data warehouse will be in Snowflake. But we still have more on-premise source systems than cloud ones, so we would like to stick with our on-premise ETL solution. We are using Pentaho Data Integration (PDI) as our ETL tool.
The issue is that the PDI Table output step, which uses the Snowflake JDBC driver, is horribly slow for bulk loads into Snowflake. A year ago it was even worse, as it did an INSERT INTO and COMMIT after every row. It has improved a lot since then: judging by the Snowflake history/logs, it now seems to do some kind of PUT to a temporary Snowflake stage, but from there it still does some kind of INSERT into the target table, and this is slow (in our test case it took an hour to load 1 000 000 records).
As a workaround, we use SnowSQL (Snowflake's command-line tool) scripts, orchestrated by PDI, to do the bulk load into Snowflake. In our example case it then takes less than a minute to get the same 1 000 000 records into Snowflake.
Everything that is done inside the Snowflake database is done via PDI SQL steps sent to Snowflake over JDBC, and all our source system queries run fine with PDI. So the issue is only with the bulk load into Snowflake, where we need a weird workaround:
Instead of:
PDI.Table input(get source data) >> PDI.Table output(write to Snowflake table)
we have then:
PDI.Table input(get source data) >> PDI.Write to local file >> Snowsql.PUT local file to Snowflake Stage >> Snowsql.COPY data from Snowflake Stage to Snowflake table >> PDI clear local file, also then clear Snowflake stage.
It works, but it is much more complex than it needs to be (compared to our previous on-premise database load, for example).
I don't even know whether the issue is on the Snowflake side (if the JDBC driver does not work optimally) or on the PDI side (if it just does not use the JDBC driver correctly), but I would like to have it working better.

To bulk load into Snowflake, you need to do the PUT and the COPY.
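As a rough illustration of that PUT-then-COPY sequence, here is a minimal Python sketch that only builds the two statements as strings. The function name, the use of the table stage (@%table), and the CSV file format options are assumptions made for the example, not something taken from the question; the statements would then be run by a client such as SnowSQL.

```python
def build_bulk_load_statements(local_path, table, compress=True):
    """Build the PUT and COPY INTO statements for one local CSV file.

    Hypothetical helper: it stages the file in the table stage of the
    target table (@%table) and then copies it into that table.
    """
    auto_compress = "TRUE" if compress else "FALSE"
    # PUT uploads the local file to an internal stage.
    put_sql = (
        f"PUT file://{local_path} @%{table} "
        f"AUTO_COMPRESS = {auto_compress} OVERWRITE = TRUE"
    )
    # COPY INTO loads the staged file; PURGE removes it from the stage
    # after a successful load, so no separate stage cleanup is needed.
    copy_sql = (
        f"COPY INTO {table} FROM @%{table} "
        "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1) "
        "PURGE = TRUE"
    )
    return [put_sql, copy_sql]


if __name__ == "__main__":
    for statement in build_bulk_load_statements("/tmp/orders.csv", "ORDERS"):
        print(statement + ";")
```

With PURGE = TRUE the "clear Snowflake stage" step of the workaround is handled by the COPY itself; only the local file still has to be removed.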

Related

Enable sync between BigQuery and Snowflake

We are using BigQuery and Snowflake (Azure hosted), and we often export data from BigQuery and import it into Snowflake, and vice versa. Is there an easy way to integrate the two systems, such as automatically syncing a BigQuery table to Snowflake, rather than exporting to a file and importing?
You should have a look at Change Data Capture solutions to automate the sync.
Some of them have native BigQuery and Snowflake connectors.
Some examples:
HVR
Qlik Replicate
Striim
...
There are many ways to implement this, and the best one will depend on the nature of your data.
For example, if every day you have new data in BigQuery, then all you need to do is set up a daily export of the new data from BigQuery to GCS. Then it's easy to set up Snowflake to read new data in GCS whenever it shows up with Snowpipe:
https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-gcs.html
But then how often do you want to sync this data? Is it append only, or does it need to account for past data changing? How do you solve conflicts when the same row changes in different ways on both sides? Etc.
I have the same scenario. I've built this template in Jupyter Notebook. I did a gap analysis after a few days and, at least in our case, it seems that Firebase/Google Analytics adds more rows to the already compiled daily tables even a few days later. We have about 10% more rows in an older BigQuery daily table than what was captured in Snowflake, so be mindful of the gap. To date, the template I've created is not able to handle the missing rows. For us it works because we look at aggregated values (daily active users, retention, etc.) and the gap there is minimal.
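One possible way to narrow the gap described above is to re-export a trailing window of daily tables on every run instead of only the newest day, so that rows added late are picked up. The sketch below is not part of the template mentioned in that answer; it assumes daily tables are identified by a YYYYMMDD suffix and that a lookback of a few days is enough, and the function name is hypothetical.

```python
from datetime import date, timedelta


def daily_suffixes_to_resync(run_date, lookback_days=3):
    """Return the YYYYMMDD suffixes of the daily tables to reload.

    Covers the `lookback_days` complete days before `run_date`,
    oldest first. Each listed day would be deleted from the target
    and loaded again, so late-arriving rows replace the earlier copy.
    """
    days = [
        run_date - timedelta(days=offset)
        for offset in range(lookback_days, 0, -1)
    ]
    return [day.strftime("%Y%m%d") for day in days]


if __name__ == "__main__":
    print(daily_suffixes_to_resync(date(2023, 3, 2)))
```

How long the lookback needs to be depends on how late rows keep arriving, which a gap analysis like the one described can show.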
You could use Sling, which I worked on. It is a tool that allows you to copy data between databases (including a BQ source and an SF destination) using bulk loading methodologies. There is a free CLI version and a Cloud (hosted) version. I actually wrote a blog entry about this in detail (albeit with an AWS destination, the logic is similar), but essentially, if you use the CLI version, you can run one command after setting up your credentials:
$ sling run --src-conn BIGQUERY --src-stream segment.team_activity --tgt-conn SNOWFLAKE --tgt-object public.activity_teams --mode full-refresh
11:37AM INF connecting to source database (bigquery)
11:37AM INF connecting to target database (snowflake)
11:37AM INF reading from source database
11:37AM INF writing to target database [mode: full-refresh]
11:37AM INF streaming data
11:37AM INF dropped table public.activity_teams
11:38AM INF created table public.activity_teams
11:38AM INF inserted 77668 rows
11:38AM INF execution succeeded

How to handle hive locking across hive and presto

I have a few Hive tables that are written with insert-overwrite from Spark and Hive. Those tables are also accessed by analysts on Presto. Naturally, we're running into windows of time in which users hit an incomplete data set because Presto ignores the locks.
The options I can think of:
Fork the presto-hive connector to support hive S and X locks appropriately. This isn't too bad, but time consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed, then once a new partition is written, alter the hive table to see it. Then we can just have views on top of the data that will properly reconcile the latest version of each row.
Stop doing insert-overwrite on s3 which has a long window of copy from hive staging to the target table. If we move to hdfs for all insert-overwrite, we still have the issue, but it's over the span of time that it takes to do a hdfs mv which is significantly faster. (probably bad: there's still a window that we can get incomplete data)
My question is how do people generally handle that? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. This can be asked in general for any third party tool that can query the hive metastore and interact with the hdfs/s3 directly while not respecting hive locks.
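To make the third option more concrete, the reconciliation that such a view would perform can be sketched in plain Python. This is only an illustration of the "latest version of each row" logic; the column names id and written_at are hypothetical, and in Hive or Presto the same result would normally be expressed in SQL, for example with a ROW_NUMBER() window function over the key.

```python
def latest_version_per_key(rows, key="id", version="written_at"):
    """Keep only the newest version of each row.

    `rows` is an iterable of dicts coming from all partitions,
    including older copies of rows that were rewritten later.
    """
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Replace the stored row only when this one is strictly newer.
        if current is None or row[version] > current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


if __name__ == "__main__":
    rows = [
        {"id": 1, "written_at": "2024-01-01", "amount": 10},
        {"id": 2, "written_at": "2024-01-01", "amount": 20},
        {"id": 1, "written_at": "2024-01-02", "amount": 15},
    ]
    for row in latest_version_per_key(rows):
        print(row)
```

Because readers only ever see complete partitions, a query running during a write sees the previous versions instead of a half-written table.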

PDI or mysqldump to extract data without blocking the database or getting inconsistent data?

I have an ETL process that will run periodically. I was using kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with table input and table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know if the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will take them). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction allows me to get the data in a consistent way without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
"However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data"
I think the "Table Input" step does not pick up modifications that happen while it is reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. In the middle of the data load, insert a few records into the source database. You will find that those records are not read into the target table. (Note: I tried this with a PostgreSQL database, and the number of rows read was 1000000.)
Now for your question: I suggest using PDI, since it gives you more control over the data in terms of versioning, sequences, SCDs and all the DWBI-related activities. PDI makes it easier to load to the stage environment than simply dumping entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation, then at least they all start at the same time, but whilst the result is likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk insert or extract it's nearly always better to use the vendor-provided tools.
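As a sketch of what PDI could orchestrate here, the following Python helper assembles a mysqldump command line for specific tables. The helper name and the example database, tables, and filter are hypothetical; the flags --single-transaction, --no-create-info, and --where are standard mysqldump options. Note that --where is applied to every table in the dump, so the filter column has to exist in each of them.

```python
def build_mysqldump_command(database, tables, where=None):
    """Build an argument list for a consistent, data-only dump.

    --single-transaction reads InnoDB tables from one snapshot
    without locking them; --no-create-info skips CREATE TABLE
    statements (assuming the stage tables already exist).
    """
    command = ["mysqldump", "--single-transaction", "--no-create-info"]
    if where:
        # Restrict the rows, e.g. to those changed since the last run.
        command.append(f"--where={where}")
    command.append(database)
    command.extend(tables)
    return command


if __name__ == "__main__":
    print(build_mysqldump_command(
        "shop", ["orders", "customers"],
        where="updated_at >= '2024-01-01'",
    ))
```

The resulting list can be passed to subprocess.run, or the equivalent command line can be placed in a shell job entry. Column selection is still not possible this way, so unneeded columns would have to be dropped after loading into the stage database.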

SQL, moving millions of records from one database to another

I am a C# developer and I am not really good with SQL. I have a simple question here. I need to move more than 50 million records from one database to another. I tried to use the import function in MS SQL, but it got stuck because the log was full (I got the error message The transaction log for database 'mydatabase' is full due to 'LOG_BACKUP'). The database recovery model was set to simple. My friend said that importing millions of records using task->import data will cause the log to become massive, and told me to use a loop to transfer the data instead. Does anyone know how, and why? Thanks in advance.
If you are moving the entire database, use backup and restore, it will be the quickest and easiest.
http://technet.microsoft.com/en-us/library/ms187048.aspx
If you are just moving a single table, read about and use the BCP command line tools for this many records:
The bcp utility bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files. Except when used with the queryout option, the utility requires no knowledge of Transact-SQL. To import data into a table, you must either use a format file created for that table or understand the structure of the table and the types of data that are valid for its columns.
http://technet.microsoft.com/en-us/library/ms162802.aspx
The fastest and probably most reliable way is to bulk copy the data out via SQL Server's bcp.exe utility. If the schema on the destination database is exactly identical to that on the source database, including nullability of columns, export it in "native format":
http://technet.microsoft.com/en-us/library/ms191232.aspx
http://technet.microsoft.com/en-us/library/ms189941.aspx
If the schema differs between source and target, you will encounter...interesting (yes, interesting is a good word for it) problems.
If the schemas differ or you need to perform any transforms on the data, consider using text format. Or another format (BCP lets you create and use a format file to specify the format of the data for export/import).
You might consider exporting data in chunks: if you encounter problems it gives you an easier time of restarting without losing all the work done so far.
You might also consider zipping the exported data files up to minimize time on the wire.
Then FTP the files over to the destination server.
bcp them in. You can use the bcp utility on the destination server or the BULK INSERT statement in SQL Server to do the work. Makes no real difference.
The nice thing about using BCP to load the data is that the load is what is described as a 'non-logged' transaction, though it's really more like a 'minimally logged' transaction.
If the tables on the destination server have IDENTITY columns, you'll need to use the SET IDENTITY_INSERT statement to disable the identity column on the table(s) involved for the nonce (don't forget to reenable it). After your data is imported, you'll need to run DBCC CHECKIDENT to get things back in sync.
And depending on what you're doing, it can sometimes be helpful to put the database in single-user mode or dbo-only mode for the duration of the surgery: http://msdn.microsoft.com/en-us/library/bb522682.aspx
Another approach I've used to great effect is to use Perl's DBI/DBD modules (which provide access to the bulk copy interface) and write a perl script to suck out the data from the source server, transform it and bulk load it directly into the destination server, without having to save it to disk and move it. Also means you can trap errors and design things for recovery and restart right at the point of failure.
Use BCP to migrate data.
Another approach I have used in the past is to take a backup of the transaction log and shrink the log prior to the migration. Split the migration script into parts and run the log backup, shrink, migrate iteration a few times.
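The suggestion to export in chunks can be sketched as follows. This Python helper only generates bcp command lines, one per key range; it assumes the table has an integer key that can be used to slice it, and the helper name, the table, and the server are hypothetical. The bcp arguments used (queryout, -n for native format, -S for the server, -T for a trusted connection) are standard bcp options.

```python
def build_bcp_chunk_exports(table, key, min_id, max_id, chunk_size,
                            server, out_dir="."):
    """Return one bcp 'queryout' command per key range.

    `table` should be fully qualified (database.schema.table) because
    the query is run outside any default database context.
    """
    commands = []
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        query = f"SELECT * FROM {table} WHERE {key} BETWEEN {start} AND {end}"
        out_file = f"{out_dir}/chunk_{start}_{end}.dat"
        commands.append(
            ["bcp", query, "queryout", out_file, "-n", "-S", server, "-T"]
        )
        start = end + 1
    return commands


if __name__ == "__main__":
    for command in build_bcp_chunk_exports(
        "mydatabase.dbo.Orders", "OrderId", 1, 250, 100, "SOURCESRV"
    ):
        print(command)
```

If one chunk fails, only that range has to be exported and loaded again, and a log backup or shrink can be run between chunks on the destination.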

How do I handle large SQL SERVER batch inserts?

I'm looking to execute a series of queries as part of a migration project. The scripts are generated by a tool which analyses the legacy database and then produces a script to map each of the old entities to an appropriate new record. The scripts run well for small entities, but some have records in the hundreds of thousands, which produce script files of around 80 MB.
What is the best way to run these scripts?
Is there some SQLCMD from the prompt which deals with larger scripts?
I could also break the scripts down into further smaller scripts but I don't want to have to execute hundreds of scripts to perform the migration.
If possible, have the export tool modified to export a BULK INSERT compatible file.
Barring that, you can write a program that will parse the insert statements into something that BULK INSERT will accept.
BULK INSERT uses BCP format files, which come in traditional (non-XML) or XML form. Does it have to get a new identity and use it in a child, and you can't get away with using SET IDENTITY_INSERT ON because the database design has changed so much? If so, I think you might be better off using SSIS or similar and doing a Merge Join once the identities are assigned. You could also load the data into staging tables in SQL using SSIS or BCP and then use regular SQL (potentially within SSIS in a SQL task) with the OUTPUT INTO feature to capture the identities and use them in the children.
Just execute the script. We regularly run backup / restore scripts that are hundreds of MB in size. It only takes 30 seconds or so.
If it is critical not to block your server for this amount of time, you'll have to split it up a bit.
Also look into the --tab option of mysqldump, which outputs the data using SELECT ... INTO OUTFILE, which is more efficient and faster to load.
It sounds like this is generating a single INSERT for each row, which is really going to be pretty slow. If they are all wrapped in a transaction, too, that can be kind of slow (although the number of rows doesn't sound that big that it would cause a transaction to be nearly impossible - like if you were holding a multi-million row insert in a transaction).
You might be better off looking at ETL (DTS, SSIS, BCP or BULK INSERT FROM, or some other tool) to migrate the data instead of scripting each insert.
You could break up the script and execute it in parts (especially if currently it makes it all one big transaction), just automate the execution of the individual scripts using PowerShell or similar.
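A minimal sketch of breaking the script into parts is shown below, written in Python here rather than PowerShell. It assumes the generated script contains one INSERT statement per line with no statement spanning lines, which may not hold for every export tool; the function name is hypothetical.

```python
def split_into_batches(statements, batch_size):
    """Group statements into scripts of at most `batch_size` statements.

    Each batch is wrapped in its own transaction, so a failure only
    rolls back that batch and the log does not have to hold one huge
    transaction.
    """
    batches = []
    for i in range(0, len(statements), batch_size):
        chunk = statements[i:i + batch_size]
        batches.append(
            "BEGIN TRANSACTION;\n" + "\n".join(chunk) + "\nCOMMIT TRANSACTION;\n"
        )
    return batches


if __name__ == "__main__":
    inserts = [f"INSERT INTO dbo.Customer (Id) VALUES ({n});" for n in range(1, 6)]
    for number, batch in enumerate(split_into_batches(inserts, 2), start=1):
        print(f"-- batch {number}")
        print(batch)
```

Each batch could be written to its own file and executed in order, for example with sqlcmd -i, stopping at the first failure so the run can be resumed from that batch.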
I've been looking into the "BULK INSERT" from file option but cannot see any examples of the file format. Can the file mix the row formats or does it have to always be consistent in a CSV fashion? The reason I ask is that I've got identities involved across various parent / child tables which is why inserts per row are currently being used.