Memory and Running Time issues while copying from Excel to SQL using Talend

I have a simple task of copying Excel data to SQL tables.
I am executing one stored procedure initially to delete the tables' entries. Then I have an Excel input from which I copy data to the SQL tables using tMap.
I have 20 tables to copy data to, with a relatively small number of entries (10-100) per table.
Still, when I execute the job it takes a very long time (5-10 minutes), and after copying entries for 12 tables it runs out of memory.
My workflow is:
(stored procedure ->(on subjob ok) -> excel input -> tmap -> tMSSqlOutput -> (on component ok) -> excel input -> tmap -> tMSSqlOutput (on component ok) - > ...... -> excel input -> tmap -> tMSSqlOutput)
My Excel sheet is on my local machine, whereas I am copying data to SQL tables on a server.
I have set my Run/Debug JVM arguments to -Xms1024M and -Xmx8192M, but it still doesn't work.
What can I do to solve this issue?
I am running Talend on a VM (virtual machine).
I have attached a screenshot of my job.

You should be running all of these separate steps in separate subjobs, using "on subjob ok" to link them, so that the Java garbage collector can better reallocate memory between steps.
If this still doesn't work, you could separate them into completely separate jobs, link them all using tRunJob components, and make sure to tick "Use an independent process to run subjob".
This will spawn a completely new JVM instance for the process and thus not be constrained by the parent job's JVM memory. That said, you should be careful not to spawn too many JVM instances, as there is some overhead in JVM start-up and obviously you are still limited by any physical memory constraints.
It really belongs in a separate question, but you may also find some benefit in using parallelisation in your job to improve performance.

Use OnSubjobOk on the Excel input to connect to the next Excel input. This changes the whole code generation.
The generated code contains a function for every subjob. The difference in code generation between OnSubjobOk and OnComponentOk is that OnComponentOk calls the next function from inside the current one, while OnSubjobOk waits for the current subjob/function to finish before the next one starts. The latter lets the garbage collector work better.
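To make the difference concrete, here is a heavily simplified sketch (illustration only, not actual Talend-generated code; the method names are made up):

// Simplified illustration of the two trigger styles (not real Talend output).
public class GeneratedJob {

    public static void main(String[] args) {
        // OnSubjobOk: the master method calls each subjob in turn.
        // Each subjob's buffers become unreachable as soon as it returns,
        // so the garbage collector can reclaim them before the next one runs.
        subjob1ExcelToTable1();
        subjob2ExcelToTable2();
    }

    static void subjob1ExcelToTable1() {
        // read Excel, run tMap, write to table 1 ...
        // With OnComponentOk, the next subjob would instead be called from
        // *inside* this method, keeping this frame (and its row buffers)
        // alive for the whole chain:
        // subjob2ExcelToTable2();
    }

    static void subjob2ExcelToTable2() {
        // read Excel, run tMap, write to table 2 ...
    }
}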
If that doesn't solve the problem, create separate jobs that each contain one Excel input -> DB output flow, then link these jobs with OnSubjobOk in a master job.

To avoid the job consuming too much memory (OutOfMemory errors), you can configure tMap to store large transformed data in a temporary directory on disk.
This screenshot shows how to do that.

Related

How to force Nextflow process to recalculate and ignore cache in resumed workflow

I have a series of processes in a Nextflow pipeline, involving multiple heavy computing steps and database (SQL) inserts/fetches. I need to insert certain (intermediate) process results into the DB and fetch them later for further processing (within the same pipeline). In the most simplified form it will be something like:
process1 (fetch data from DB)
process2 (analyze process1.out)
process3 (inserts process2.out to DB)
The problem is that when any values are changed in the DB, the output from process1 is still cached (when using the -resume flag), so changes in the DB are not reflected here at all.
Is there any way to force process1 to be reprocessed while using -resume, ignoring the cache?
So far I have been manually deleting the respective work folder or adding a dummy line to process1, but that is an extremely inefficient solution.
Thanks for any help here.
Result caching is enabled by default, but this feature can be disabled using the cache directive by setting the value to false. For example:
process process1 {
    cache false
    ...
}
Not sure if we have the full picture here, but updating a database with some set of process results just to fetch them again later on seems wasteful. Or maybe I've just misunderstood. I would instead try to separate the heavy computational work (hours) from the database transactions (minutes) if at all possible. Note that if you need to make per-process database transactions, you might be able to achieve this using the beforeScript and afterScript directives (which can be enabled/disabled using a nextflow.config profile, for example). A beforeScript could be used, for instance, to create a database object that is updated (using an afterScript) once the process has completed. Since both of these scripts are run from inside the workDir, you could use the basename of the current/working directory (i.e. the task UUID) as a key.

Periodically save SQLite Memory-DB to file

I have a program that runs continuously and saves data to a SQLite database every second.
I want to change this and use a memory database that saves to disk every 15 minutes instead.
My program is written in Java and I use the library SQLite4Java. It works well and I have a 'Job Queue' that handles the inserts with an ExecutorService in Java.
I have found out that for the first write to file I can use SQLiteConnection.initializeBackup(File), then DROP & CREATE the tables in memory and maybe run VACUUM to free memory. This works well, but after that it gets tricky: how do I append/merge the new data from the memory DB into the existing file after another 15 minutes?
Is there a standard approach for this? Would it be good practice to simply fetch all data from each table in the memory DB (locking it to prevent further inserts), loop through it, and insert it into the file DB?
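One option worth considering (a sketch only, shown here with the plain sqlite-jdbc driver rather than SQLite4Java, and with a hypothetical table name and file path) is to ATTACH the on-disk database to the in-memory connection and append the new rows with a single INSERT ... SELECT, then clear the in-memory table. Since this is plain SQL, the same statements should be runnable through SQLite4Java's exec as well:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MemoryToDiskFlush {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a "samples" table; on-disk file "data.db".
        try (Connection mem = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = mem.createStatement()) {

            st.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, value TEXT)");
            st.execute("INSERT INTO samples (value) VALUES ('a'), ('b')");

            // Flush: attach the on-disk database to the in-memory connection ...
            st.execute("ATTACH DATABASE 'data.db' AS disk");
            st.execute("CREATE TABLE IF NOT EXISTS disk.samples (id INTEGER PRIMARY KEY, value TEXT)");

            // ... append everything collected since the last flush (pause the
            // job queue while this runs, as you already planned) ...
            st.execute("INSERT INTO disk.samples (value) SELECT value FROM main.samples");

            // ... then clear the in-memory table and detach.
            st.execute("DELETE FROM main.samples");
            st.execute("DETACH DATABASE disk");
        }
    }
}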

How to create a temporary table using (SELECT * INTO ##temp FROM table) syntax (for MS SQL) using Pentaho Data Integration

When I use the above syntax in the "Execute row script" step, it shows success but the temporary table does not get created. Please help me out with this.
Yes, the behavior you're seeing is exactly what I would expect. It works fine from the TSQL prompt, throws no error in the transform, but the table is not there after the transform completes.
The problem here is the execution model of PDI transforms. When a transform is run, each step gets its own thread of execution. At startup, any step that needs a DB connection is given its own unique connection. After processing finishes, all steps disconnect from the DB. This includes the connection that defined the temp table. Once that happens (the defining connection goes out of scope), the temp table vanishes.
Note that this means in a transform (as opposed to a Job), you cannot assume a specific order of completion of anything (without Blocking Steps).
We still don't have many specifics about what you're trying to do with this temp table and how you're using its data, but I suspect you want its contents to persist outside your transform. In that case, you have some options, but a global temp table like this simply won't work.
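To see the connection-scope behaviour outside of PDI, here is a small JDBC sketch (illustration only; the connection string, database and temp table name are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TempTableScopeDemo {
    // Placeholder connection string; adjust server, database and credentials.
    static final String URL =
            "jdbc:sqlserver://localhost;databaseName=mydb;integratedSecurity=true";

    public static void main(String[] args) throws Exception {
        // Session A creates the global temp table.
        try (Connection a = DriverManager.getConnection(URL);
             Statement sa = a.createStatement()) {
            sa.execute("SELECT 1 AS x INTO ##my_temp");
        }
        // Session A is closed here, so ##my_temp is dropped by the server.

        // Session B (a different connection, just like each PDI step gets)
        // fails with "Invalid object name '##my_temp'".
        try (Connection b = DriverManager.getConnection(URL);
             Statement sb = b.createStatement()) {
            sb.executeQuery("SELECT * FROM ##my_temp");
        }
    }
}

PDI's per-step connections behave like sessions A and B above, which is why the table is gone by the time anything else looks for it.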
Options that come to mind:
1. Convert the temp table to a permanent table. This is the simplest solution; you're basically making a staging table, loading it with a Table Output step (or whatever), and then reading it with Table Input steps in other transforms.
2. Write the table contents to a temp file with something like a Text File Output or Serialize to File step, then read them back in from the other transforms.
3. Store rows in memory. This involves wrapping your transforms in a Job and using the Copy Rows to Results and Get Rows from Results steps.
Each of these approaches has its own pros and cons. For example, storing rows in memory will be faster than writing to disk or network, but memory may be limited.
Another step it sounds like you might need, depending on what you're doing, is the ETL Metadata Injection step. This step allows you in many cases to dynamically move metadata from one transform to another. See the docs for descriptions of how each of these works.
If you'd like further assistance here, or I've made a wrong assumption, please edit your question and add as much detail as you can.

Importing Massive text file into a sql server database

I am currently trying to import a text file with 180+ million records and 300+ columns into my SQL Server database. Needless to say, the file is roughly 70 GB. I have been at it for days, and every time I get close something happens and it craps out on me. I need the quickest and most efficient way to do this import. I have tried the wizard, which should have been the easiest, and then tried just saving it as an SSIS package. I haven't been able to figure out how to do a bulk import with the settings I think would work. The error I keep getting is 'not enough virtual memory'. I changed my virtual memory to 36 GB; my system has 24 GB of physical memory. Please help me.
If you are using BCP (and you should be for files this large), use a batch size. Otherwise, BCP will attempt to load all records in one transaction.
By command line: bcp -b 1000
By C#:
using (System.Data.SqlClient.SqlBulkCopy bulkCopy =
    new System.Data.SqlClient.SqlBulkCopy(sqlConnection))
{
    bulkCopy.DestinationTableName = destinationTableName;
    bulkCopy.BatchSize = 1000; // 1000 rows per batch
    bulkCopy.WriteToServer(dataTable); // May also pass in DataRow[]
}
Here are the highlights from this MSDN article:
Importing a large data file as a single batch can be problematic, so bcp and BULK INSERT let you import data in a series of batches, each of which is smaller than the data file. Each batch is imported and logged in a separate transaction...
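If you end up driving the load from your own code instead of bcp, the same batch-it-up idea applies. Here is a rough JDBC sketch (illustrative only: the connection string, table, columns and file path are made up) that commits every 1,000 rows instead of running one giant transaction:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedTextFileLoad {
    public static void main(String[] args) throws Exception {
        final int batchSize = 1000; // rows per transaction, tune to taste

        // Placeholder connection string, table and file path.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=mydb;integratedSecurity=true");
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO dbo.staging_table (col1, col2) VALUES (?, ?)");
             BufferedReader in = new BufferedReader(new FileReader("bigfile.txt"))) {

            con.setAutoCommit(false);
            String line;
            int count = 0;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\\t", -1); // assume tab-delimited
                ps.setString(1, fields[0]);
                ps.setString(2, fields[1]);
                ps.addBatch();

                if (++count % batchSize == 0) {
                    ps.executeBatch();
                    con.commit(); // keep each transaction (and the log) small
                }
            }
            ps.executeBatch(); // flush the remainder
            con.commit();
        }
    }
}

The commit-per-batch is the same idea as bcp's -b switch: each batch becomes its own transaction rather than one enormous one.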
Try reducing the maximum server memory for SQL Server to as low as you can get away with. (Right click the SQL instance in Mgmt Studio -> properties -> memory).
This may free up enough memory for the OS & SSIS to process such a big text file.
I'm assuming the whole process is happening locally on the server.
I had a similar problem with SQL 2012 when trying to import (as a test) around 7 million records into a database. I too ran out of memory and had to cut the bulk import into smaller pieces. The one thing to note is that the import process (no matter which method you use) consumes a ton of memory and won't release it until the server is rebooted. I'm not sure if this is intended behavior by SQL Server, but it's something to note for your project.
Because I was using the SEQUENCE command with this process, I had to save the T-SQL code as SQL scripts and then run them with SQLCMD in small pieces to lessen the memory overhead.
You'll have to play around with what works for you, but I highly recommend not running the script all at once.
It's going to be a pain in the ass to break it down into smaller pieces and import them, but in the long run you'll be happier.

SSIS data import with resume

I need to push a large SQL table from my local instance to SQL Azure. The transfer is a simple, 'clean' upload - simply push the data into a new, empty table.
The table is extremely large (~100 million rows) and consists only of GUIDs and other simple types (no timestamps or anything).
I created an SSIS package using the Data Import / Export Wizard in SSMS. The package works great.
The problem is when the package is run over a slow or intermittent connection. If the internet connection goes down halfway through, then there is no way to 'resume' the transfer.
What is the best approach to engineering an SSIS package to upload this data, in a resumable fashion? i.e. in case of connection failure, or to allow the job to be run only between specific time windows.
Normally, in a situation like that, I'd design the package to enumerate through batches of size N (1k rows, 10M rows, whatever) and log to a processing table the last batch successfully transmitted. However, with GUIDs you can't quite partition them out into buckets.
In this particular case, I would modify your data flow to look like Source -> Lookup -> Destination. In your lookup transformation, query the Azure side and only retrieve the keys (SELECT myGuid FROM myTable). Here, we're only going to be interested in rows that don't have a match in the lookup recordset as those are the ones pending transmission.
A full cache is going to cost about 1.5 GB (100M * 16 bytes) of memory, assuming the Azure side is fully populated, plus the associated data transfer costs. That cost will be less than truncating and re-transferring all the data, but I just want to make sure I call it out.
Just order by your GUID when uploading. And make sure you use the max(guid) from Azure as your starting point when recovering from a failure or restart.
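As a rough illustration of that restart logic outside SSIS (a sketch only; the connection strings, table and column names are made up, and it assumes rows are uploaded in GUID order):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class ResumableGuidUpload {
    public static void main(String[] args) throws Exception {
        // Placeholder connection strings for the local source and Azure target.
        try (Connection local = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=srcdb;integratedSecurity=true");
             Connection azure = DriverManager.getConnection(
                     "jdbc:sqlserver://myserver.database.windows.net;databaseName=dstdb;user=me;password=secret")) {

            // 1. Find out how far the previous run got.
            String lastGuid = null;
            try (Statement st = azure.createStatement();
                 ResultSet rs = st.executeQuery("SELECT MAX(id) FROM dbo.big_table")) {
                if (rs.next()) {
                    lastGuid = rs.getString(1); // null if the target is still empty
                }
            }

            // 2. Re-read the source from that point on, in GUID order,
            //    and push only the remaining rows across.
            String select = (lastGuid == null)
                    ? "SELECT id, payload FROM dbo.big_table ORDER BY id"
                    : "SELECT id, payload FROM dbo.big_table WHERE id > ? ORDER BY id";
            try (PreparedStatement src = local.prepareStatement(select);
                 PreparedStatement dst = azure.prepareStatement(
                         "INSERT INTO dbo.big_table (id, payload) VALUES (?, ?)")) {
                if (lastGuid != null) {
                    src.setString(1, lastGuid);
                }
                try (ResultSet rows = src.executeQuery()) {
                    int n = 0;
                    while (rows.next()) {
                        dst.setString(1, rows.getString(1));
                        dst.setString(2, rows.getString(2));
                        dst.addBatch();
                        if (++n % 1000 == 0) {
                            dst.executeBatch(); // small batches keep restarts cheap
                        }
                    }
                    dst.executeBatch();
                }
            }
        }
    }
}

Because both the ORDER BY and the WHERE id > ? comparison run on SQL Server, the GUID ordering is consistent between source and target, so a restart simply picks up after the highest GUID already on the Azure side.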