I have a series of processes in a Nextflow pipeline involving multiple heavy computing steps and database (SQL) insertion/fetch. I need to insert certain (intermediate) process results into the DB and fetch them later for further processing (within the same pipeline). In its most simplified form it looks something like this:
process1 (fetch data from DB)
process2 (analyze process1.out)
process3 (inserts process2.out to DB)
The problem is that when any values are changed in the DB, the output from process1 is still cached (when using the -resume flag), so changes in the DB are not reflected here at all.
Is there any way to force process1 to be reprocessed while using -resume, ignoring its cache?
So far I have been manually deleting the respective work folder, or adding a dummy line to process1, but that is an extremely inefficient solution.
Thanks for any help here.
Result caching is enabled by default, but this feature can be disabled using the cache directive by setting its value to false. For example:
process process1 {
    cache false
    ...
}
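If you would rather keep the process definition unchanged, the same setting can also be applied from nextflow.config using a process selector - a minimal sketch, assuming the process is named process1 as above:
process {
    withName: 'process1' {
        cache = false
    }
}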
Not sure if we have the full picture here, but updating a database with some set of process results just to fetch them again later on seems wasteful. Or maybe I've just misunderstood. I would instead try to separate the heavy computational work (hours) from the database transactions (minutes) if at all possible. Note that if you need to make per-process database transactions, you might be able to achieve this using the beforeScript and afterScript directives (which can be enabled/disabled using a nextflow.config profile, for example). For instance, a beforeScript could be used to create a database object that is updated (using an afterScript) once the process has completed. Since both of these scripts are run from inside the workDir, you could use the basename of the current/working directory (i.e. the task UUID) as a key.
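As a rough sketch of that idea - the directives are real Nextflow, but register_task.sh, record_result.sh and heavy_analysis.sh are hypothetical placeholders for your own DB client calls and analysis step:
process process2 {
    // hypothetical helper scripts standing in for real DB client calls;
    // both run inside the task workDir, so its basename can serve as a key
    beforeScript 'register_task.sh --key $(basename $PWD)'
    afterScript  'record_result.sh --key $(basename $PWD)'

    input:
    path results

    output:
    path 'analysis.txt'

    """
    heavy_analysis.sh ${results} > analysis.txt
    """
}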
I have a question about the Snowflake COPY INTO command; I searched but did not find my answer.
Suppose I want to push data from Snowflake to an S3 bucket using the COPY INTO command in my code. How will I know when the file is ready or the command has completed, so that I can read the file from the S3 location?
You can do the following things to check whether your COPY INTO was successful, or at least to retrieve some useful information about your command:
Set DETAILED_OUTPUT = TRUE and check the result (this means you get information about every single unloaded file as output; if set to FALSE you only receive information about the whole unload process) - see the example below.
Query your stage using the syntax described here: https://docs.snowflake.com/en/user-guide/querying-stage.html
Query the metadata of your staged data using metadata$filename and metadata$file_row_number: https://docs.snowflake.com/en/user-guide/querying-metadata.html
Keep in mind that even a failed COPY command can leave some unloaded files on your stage.
More information can also be found at https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html#validating-data-to-be-unloaded-from-a-query
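For example, a minimal sketch with DETAILED_OUTPUT - my_s3_stage and my_table are hypothetical names, and the per-file details (file name, size, row count) are what the command returns when the option is TRUE:
-- hypothetical stage and table names
COPY INTO @my_s3_stage/exports/
  FROM my_table
  FILE_FORMAT = (TYPE = CSV)
  DETAILED_OUTPUT = TRUE;
-- the statement returns one row per unloaded file (name, size, row count),
-- which you can inspect before reading the files from S3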
It depends on how you're actually running this.
Any Snowflake interface will run the command synchronously, so the query will just spin until it's complete.
Any async call would need extra checks - the easiest being the web interface (it will show the status of the query, and when it completes the unload is complete).
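If you do submit the COPY INTO asynchronously, one programmatic option (a sketch, not the only approach) is to poll the INFORMATION_SCHEMA.QUERY_HISTORY table function for the execution status of the query ID you captured when submitting it:
SELECT query_id, execution_status
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_id = '<query_id>';  -- placeholder for the ID captured when the COPY INTO was submitted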
I have a program that runs continuously and saves data to a SQLite database every second.
I want to change this and use a memory database that saves to disk every 15 minutes instead.
My program is written in Java and I use the library SQLite4Java. It works well and I have a 'Job Queue' that handles the inserts with an ExecutorService in Java.
I have found out that the first time I write to a file I can use SQLiteConnection.initializeBackup(File), and after that DROP and CREATE the tables in memory (and maybe use VACUUM to reclaim memory). This works well, but later it gets tricky: how do I append/merge the new data from the memory DB to the existing file after another 15 minutes?
Is there a standard approach for this? Would it be good practice to simply fetch all data from each table in the memory DB (locking it to prevent further inserts), loop through it, and insert it into the file DB?
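For what it's worth, here is a minimal sketch of that merge step expressed in plain SQL, assuming a hypothetical table named samples that already exists in both databases; ATTACH lets the in-memory connection write straight into the file database:
-- run on the in-memory connection; 'backup.db' and samples are placeholders
ATTACH DATABASE 'backup.db' AS disk;
BEGIN;
INSERT INTO disk.samples SELECT * FROM main.samples;  -- append the new rows to the file DB
DELETE FROM main.samples;                             -- clear the in-memory copy
COMMIT;
DETACH DATABASE disk;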
I have a simple task of copying Excel data to SQL tables.
I am executing one stored procedure initially to delete the table entries. Then I have an Excel input from which I copy data to the SQL tables using tMap.
I have 20 tables to copy data into, each with a relatively small number of entries (10-100).
Still, when I execute my task, it takes a very long time (5-10 minutes), and after copying 12 tables it runs out of memory.
My workflow is:
(stored procedure ->(on subjob ok) -> excel input -> tmap -> tMSSqlOutput -> (on component ok) -> excel input -> tmap -> tMSSqlOutput (on component ok) - > ...... -> excel input -> tmap -> tMSSqlOutput)
My Excel sheet is on my local machine, whereas I am copying data to SQL tables on a server.
I have set my run/debug settings to Xms 1024M, Xmx 8192M, but it still isn't working.
What can I do to solve this issue?
I am running Talend on a VM (virtual machine).
I have attached the screenshot of my job.
You should run all of these separate steps in separate subjobs, linked using "on subjob ok", so that the Java garbage collector can better reclaim memory between steps.
If this still doesn't work, you could split them into completely separate jobs, link them all using tRunJob components, and make sure to tick "Use an independent process to run subjob".
This will spawn a completely new JVM instance for each process, so it is not tied to the parent job's JVM memory. That said, you should be careful not to spawn too many JVM instances, as there is some overhead in JVM start-up and you are obviously still limited by the physical memory available.
It really belongs in a separate question, but you may also find some benefit from using parallelisation in your job to improve performance.
Use OnSubjobOK on the Excel input to connect it to the next Excel input. This changes the whole code generation.
The generated code is one function per subjob. The difference in code generation between OnSubjobOK and OnComponentOK is that OnComponentOK calls the next function directly, while OnSubjobOK waits for the current subjob/function to finish. The latter lets the garbage collector work better.
If that doesn't solve the problem, create subjobs that each contain one Excel input and DB output, then link these jobs with OnSubjobOK in a master job.
To avoid the job consuming too much memory (OutOfMemory errors), you can configure your tMap to store large transformed data in a temporary directory on disk.
This screenshot shows how to do that.
For unit testing purposes I need to completely reset/clear SQLite3 databases. All databases are created in memory rather than on the file system when running the test suite so I can't delete any files. Additionally, several instances of a class will be referencing the database simultaneously, so I can't just create a new database in memory and assign it to a variable.
Currently my workaround for clearing a database is to read all the table names from sqlite_master and drop them. This is not the same as completely clearing the database though, since metadata and other things I don't understand will probably remain.
Is there a clean and simple way, like a single query, to clear a SQLite3 database? If not, what would have to be done to an existing database to make it identical to a completely new database?
In case it's relevant, I'm using Ruby 2.0.0 with sqlite3-ruby version 1.3.7 and SQLite3 version 3.8.2.
This works without deleting the file and without closing the db connection:
PRAGMA writable_schema = 1;
DELETE FROM sqlite_master;
PRAGMA writable_schema = 0;
VACUUM;
PRAGMA integrity_check;
Another option, if you are able to call the C API directly, is to use SQLITE_DBCONFIG_RESET_DATABASE:
sqlite3_db_config(db, SQLITE_DBCONFIG_RESET_DATABASE, 1, 0);
sqlite3_exec(db, "VACUUM", 0, 0, 0);
sqlite3_db_config(db, SQLITE_DBCONFIG_RESET_DATABASE, 0, 0);
Here is the reference
The simple and quick way
If you use an in-memory database, the fastest and most reliable way is to close and re-establish the SQLite connection. This flushes all database data and also resets the per-connection settings.
If you want to have some kind of "reset" function, you must assume that no other threads can interrupt that function - otherwise any method will fail. So even if you have multiple threads working on that database, there needs to be a "stop the world" mutex (or something like it) so the reset can be performed. And while you have exclusive access to the database connection - why not close and re-open it?
The hard way
If there are other limitations and you cannot do it the way described above, then you were already pretty close to a complete solution. If your threads don't touch pragmas explicitly, then only the "schema_version" pragma can change silently; but if your threads can change pragmas, then you have to go through the list at http://sqlite.org/pragma.html#toc and write a "reset" function that sets each and every pragma back to its initial value (you need to read the default values at the beginning).
Note that pragmas in SQLite can be divided into 3 groups:
defined initially, immutable, or very limited mutability
defined dynamically, per connection, mutable
defined dynamically, per database, mutable
Group 1 includes, for example, page_size, page_count, encoding, etc. These are defined at database creation time and usually cannot be modified later, with some exceptions. For example, page_size can be changed prior to a "VACUUM", so the new page size takes effect then. page_count cannot be changed by the user, but it changes automatically when adding data (obviously). The encoding is defined at creation time and cannot be modified later.
You should not need to reset pragmas from group 1.
Group 2 includes, for example, cache_size, recursive_triggers, journal_mode, foreign_keys, busy_timeout, etc. These pragmas are always set to their defaults when opening a new connection to the database. If you don't disconnect, you will need to reset them to defaults manually.
Group 3 includes, for example, schema_version, user_version, and maybe some others; you need to look them up. These will also need a manual reset. If you disconnect from an in-memory database, the database gets destroyed, so in that case you don't need to reset them.
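For illustration only, a sketch of what such a manual reset might look like for a few group 2 and group 3 pragmas; the values below are common compile-time defaults and may differ in your build, so read the real defaults from a fresh connection first:
-- assumed defaults; verify against a fresh connection in your build
PRAGMA cache_size = 2000;
PRAGMA recursive_triggers = OFF;
PRAGMA foreign_keys = OFF;
PRAGMA busy_timeout = 0;
PRAGMA user_version = 0;   -- group 3: stored in the database itself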
Create an empty memory database.
Use the backup API to copy that database over the actual database.
In the case of sqlite3-ruby, see test/test_backup.rb for an example.
List the tables with
SELECT * FROM dbname.sqlite_master WHERE type='table';
and then issue a DROP TABLE statement for each of them.