Using H2 1.4 database can I write new rows if reading other rows - sql

Using H2 1.4 database can I write new rows if reading other rows?
I.e., if a table has 1000 rows and a running SELECT query is fetching primary keys 1-10, can an INSERT query add new rows at the same time, or does it have to wait for every SELECT query on that table to finish?
What about an UPDATE of rows in the same table that are not being retrieved by any SELECT query?
I ask because with H2 1.3 I noticed that my application threads that accessed the database seemed to spend a lot of time blocking; it seems better now that I have upgraded to 1.4. My application is multithreaded and the threads always deal with different rows, so it is important for me to better understand how locking works in H2 (with the MV store; I was previously using the PAGE store with 1.3), and whether H2 can lock just individual rows when updating or has to lock the whole table.

It depends on the storage engine that you choose. All information below applies to the most recent version (1.4.199); older versions have some differences.
With the default MVStore engine, data modification operations and SELECT … FOR UPDATE lock the modified (or selected) rows. Other transactions can't modify locked rows in parallel, but can read their values. Note that the read committed isolation level is used by default and other isolation levels are not really supported by this engine. With read committed, other transactions do not see concurrently modified values; they see the old ones. New values become visible only when the modifying transaction commits its work. With this engine the database runs in multi-threaded mode by default, so a long-running command will not block other sessions.
With the legacy PageStore engine (add ;MV_STORE=FALSE to the connection URL if you want to create a database with this engine), whole tables are locked for writing. This means that you really need to lock the tables in the same order (alphabetical or some other) in all your transactions, otherwise a deadlock is possible. With this engine the database runs in single-threaded mode by default; you can enable multi-threaded mode explicitly, but it is not safe with this engine. Different sessions can't do their work concurrently, and a long-running command will block all other sessions.
Databases are not converted from the old (PageStore) format to the new (MVStore) format when you open them with a new version of H2; you have to do it yourself. Old databases may also have serious problems with new versions, so it's recommended to export them to SQL with the old version of H2 using the SCRIPT TO 'filename.sql' command and load this script into a new database with the new version of H2 using the RUNSCRIPT FROM 'filename.sql' command. You need to do this even if you choose to use the old engine. If you have persistent databases, don't forget to create regular backup copies (with the BACKUP TO 'filename.zip' command, for example).
You can find more details in the documentation:
https://h2database.com/html/advanced.html#mvcc
https://h2database.com/html/features.html#multiple_connections

Related

Update Hive metadata location for many tables

I would like to change the bucket name in the location of many Hive tables. Is it possible for us to connect to the MySQL database and update it? I think it is possible. But I would like to know if it is safe to do it in a production database.
Yes, it is possible, and I have seen it done; but
(a) the Metastore schema is not documented, and each Hive version brings some minor changes, so you have to do your own exploration to find where/how the StorageDescriptor objects are persisted -- then some unit tests / non-regression tests on a Dev system -- plus, don't forget to run a full DB backup before tinkering with your Prod system (and to rehearse an emergency restoration on your Dev system, too!)
(b) you have to update the StorageDescriptor for tables, but also for partitions -- remember that for partitioned tables, the table-level LOCATION is just used as default root dir for future partitions; once created, a partition retains its location until it is ALTERed explicitly.
For the record, the preferred method for bulk updates is (in theory) the Hive MetaTool, but unfortunately it does not support the kind of updates that you need. Right now it's only good for changing the NameNode alias in all HDFS paths, because that was a real pain point...
A valid alternative to brutal SQL Updates would be to develop a custom Java program, using the Hive MetaStore API, to scan all tables & partitions then read their StorageDescriptor then run RegEx changes on their Location then write back the changes (which is exactly what the MetaTool does, only at a lower level). But that would be overkill.
Finally, a possible compromise would be a SQL Select on the appropriate MySQL table, to generate (with regexp_replace()) a chain of ALTER Table/Partition LOCATION commands to run later in the Hive CLI. Plus a chain of ALTER commands to revert to the original locations, in case you have to do an emergency rollback :-/
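To illustrate the compromise described above, here is a minimal sketch in Python rather than in MySQL itself. It assumes you have already pulled (table, partition spec, location) rows out of the Metastore by some query of your own (that query is not shown, since the Metastore schema varies by Hive version). The function name build_location_statements, the input layout, and the s3/s3a/s3n URI schemes are all assumptions for illustration.

```python
import re


def build_location_statements(entries, old_bucket, new_bucket):
    """Return (forward, rollback) lists of Hive ALTER ... SET LOCATION statements.

    entries: iterable of (qualified_table, partition_spec_or_None, current_location).
    This input layout is hypothetical; adapt it to whatever your Metastore query returns.
    """
    # Match the bucket only when it is the whole authority part of the URI,
    # so "old-bucket" does not accidentally match "old-bucket-2".
    pattern = re.compile(r"^(s3[an]?://)" + re.escape(old_bucket) + r"(?=/|$)")
    forward, rollback = [], []
    for table, partition, location in entries:
        new_location = pattern.sub(lambda m: m.group(1) + new_bucket, location)
        if new_location == location:
            continue  # not stored in the old bucket; leave it alone
        target = f"ALTER TABLE {table}"
        if partition:
            target += f" PARTITION ({partition})"
        forward.append(f"{target} SET LOCATION '{new_location}';")
        rollback.append(f"{target} SET LOCATION '{location}';")
    return forward, rollback
```

Writing both lists to files gives you one script to run in the Hive CLI and a second one to revert with. Remember that partitions need their own entries, as noted in (b).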

Backing up portion of data in SQL

I have a huge schema containing billions of records, I want to purge data older than 13 months from it and maintain it as a backup in such a way that it can be recovered again whenever required.
Which is the best way to do it in SQL - can we create a separate copy of this schema and add a delete trigger on all tables so that when trigger fires, purged data gets inserted to this new schema?
Will there be only one record per delete statement if we use triggers? Or all records will be inserted?
Can we somehow use bulk copy?
I would suggest this is a perfect use case for the Stretch Database feature in SQL Server 2016.
More info: https://msdn.microsoft.com/en-gb/library/dn935011.aspx
The cold data can be moved to the cloud with your given date criteria without any applications or users being aware of it when querying the database. No backups are required and it is very easy to set up.
There is no need for triggers; you can use a job running every day that puts outdated data into archive tables.
The best way, I guess, is to create a copy of the current schema. In the main part, delete everything older than 13 months; in the archive part, delete everything from the last 13 months.
Then create an SP (or several SPs) that collects the data, puts it into the archive, and deletes it from the main table. Put this into a daily running job.
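The copy-then-delete step of such a job could look roughly like the sketch below. It uses Python's built-in sqlite3 module purely as a stand-in so the example runs anywhere; on SQL Server the same two statements would live in the stored procedure. The table names main_events and archive_events and their columns are hypothetical.

```python
import sqlite3


def archive_old_rows(conn, cutoff):
    """Move rows older than cutoff from main_events to archive_events.

    Both statements run in one transaction, so a failure cannot leave
    rows copied but not deleted (or deleted but not copied).
    Returns the number of rows removed from the main table.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO archive_events (id, created_at, payload) "
            "SELECT id, created_at, payload FROM main_events "
            "WHERE created_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM main_events WHERE created_at < ?", (cutoff,)
        ).rowcount
    return moved
```

The cutoff is passed in as an ISO date string (for example the date 13 months ago) so the caller decides the retention window. With billions of rows you would likely want to run this in smaller date slices rather than one huge transaction.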
The cleanest and fastest way to do this (with billions of rows) is to create a partitioned table probably based on a date column by month. Moving data in a given partition is a meta operation and is extremely fast (if the partition setup and its function is set up properly.) I have managed 300GB tables using partitioning and it has been very effective. Be careful with the partition function so dates at each edge are handled correctly.
Some of the other proposed solutions involve deleting millions of rows which could take a long, long time to execute. Model the different solutions using profiler and/or extended events to see which is the most efficient.
I agree with the above to not create a trigger. Triggers fire with every insert/update/delete making them very slow.
You may be best served with a data archive stored procedure.
Consider using multiple databases: the current database that holds your current data, plus one or more archive databases into which a nightly or monthly stored procedure moves records out of the current database.
You can use the exact same schema as your production system.
If the data is already in the database no need for a Bulk Copy. From there you can backup your archive database so it is off the sql server. Restore the database if needed to make the data available again. This is much faster and more manageable than bulk copy.
According to Microsoft's documentation on Stretch DB (found here - https://learn.microsoft.com/en-us/azure/sql-server-stretch-database/), you can't update or delete rows that have been migrated to cold storage or rows that are eligible for migration.
So while Stretch DB does look like a capable technology for archive, the implementation in SQL 2016 does not appear to support archive and purge.

How do I completely clear a SQLite3 database without deleting the database file?

For unit testing purposes I need to completely reset/clear SQLite3 databases. All databases are created in memory rather than on the file system when running the test suite so I can't delete any files. Additionally, several instances of a class will be referencing the database simultaneously, so I can't just create a new database in memory and assign it to a variable.
Currently my workaround for clearing a database is to read all the table names from sqlite_master and drop them. This is not the same as completely clearing the database though, since meta data and other things I don't understand will probably remain.
Is there a clean and simple way, like a single query, to clear a SQLite3 database? If not, what would have to be done to an existing database to make it identical to a completely new database?
In case it's relevant, I'm using Ruby 2.0.0 with sqlite3-ruby version 1.3.7 and SQLite3 version 3.8.2.
This works without deleting the file and without closing the db connection:
PRAGMA writable_schema = 1;
DELETE FROM sqlite_master;
PRAGMA writable_schema = 0;
VACUUM;
PRAGMA integrity_check;
Another option, if you can call the C API directly, is to use SQLITE_DBCONFIG_RESET_DATABASE:
sqlite3_db_config(db, SQLITE_DBCONFIG_RESET_DATABASE, 1, 0);
sqlite3_exec(db, "VACUUM", 0, 0, 0);
sqlite3_db_config(db, SQLITE_DBCONFIG_RESET_DATABASE, 0, 0);
Here is the reference
The simple and quick way
If you use in-memory database, the fastest and most reliable way is to close and re-establish sqlite connection. It flushes any database data and also per-connection settings.
If you want to have some kind of "reset" function, you must assume that no other threads can interrupt that function, otherwise any method will fail. Therefore, even if you have multiple threads working on that database, there needs to be a "stop the world" mutex (or something like that) so the reset can be performed. While you have exclusive access to the database connection, why not close and re-open it?
The hard way
If there are some other limitations and you cannot do it the way above, then you were already pretty close to having a complete solution. If your threads don't touch pragmas explicitly, then only the "schema_version" pragma can be changed silently. If your threads can change pragmas, then you have to go through the list on http://sqlite.org/pragma.html#toc and write a "reset" function which sets each and every pragma value to its initial value (you need to read the default values at the beginning).
Note that pragmas in SQLite can be divided into 3 groups:
1. defined initially, immutable, or very limited mutability
2. defined dynamically, per connection, mutable
3. defined dynamically, per database, mutable
Group 1 includes, for example, page_size, page_count, encoding, etc. Those are defined at database creation time and usually cannot be modified later, with some exceptions. For example, page_size can be changed prior to "VACUUM", so the new page size will be set then. The page_count cannot be changed by the user, but it changes automatically when adding data (obviously). The encoding is defined at creation time and cannot be modified later.
You should not need to reset pragmas from group 1.
Group 2 includes, for example, cache_size, recursive_triggers, journal_mode, foreign_keys, busy_timeout, etc. These pragmas are always set to defaults when opening a new connection to the database. If you don't disconnect, you will need to reset those to defaults manually.
Group 3 are for example schema_version, user_version, maybe some others, you need to look it up. Those will also need manual reset. If you disconnect from in-memory database, the database gets destroyed, so then you don't need to reset those.
Create an empty memory database.
Use the backup API to copy that database over the actual database.
In the case of sqlite3-ruby, see test/test_backup.rb for an example.
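For comparison, the same two steps can be sketched in Python, whose standard sqlite3 module exposes the backup API as Connection.backup (Python 3.7+). This is only an illustration of the idea, not the sqlite3-ruby code; the function name reset_database is made up here.

```python
import sqlite3


def reset_database(conn):
    """Overwrite conn's main database with a brand-new, empty in-memory one.

    The connection object itself stays open, so every holder of a
    reference to it keeps working against the now-empty database.
    """
    conn.commit()  # the backup target should not be inside an open transaction
    empty = sqlite3.connect(":memory:")
    try:
        empty.backup(conn)  # copy the empty database over the live one
    finally:
        empty.close()
```

Note that this replaces the database content only. Per-connection pragmas (the "group 2" settings mentioned in another answer) are properties of the connection and are presumably left as they were.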
SELECT * FROM dbname.sqlite_master WHERE type='table';
and
DROP TABLE
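Spelled out, the list-and-drop approach might look like the following Python sketch (standard sqlite3 module; the function name drop_all_objects is hypothetical). It also drops views, which the query above would miss, and it skips SQLite's internal sqlite_* tables, which cannot be dropped. As the question suspects, this is not the same as a brand-new database: internal tables such as sqlite_sequence and values like user_version remain.

```python
import sqlite3


def drop_all_objects(conn):
    """Drop every user-created table and view in the main database.

    Indexes and triggers are removed automatically along with the
    tables they belong to. Internal sqlite_* tables are left alone.
    """
    rows = conn.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE type IN ('table', 'view') "
        "AND substr(name, 1, 7) <> 'sqlite_'"
    ).fetchall()
    for obj_type, name in rows:
        # Quote the identifier so unusual names cannot break the statement.
        quoted = '"' + name.replace('"', '""') + '"'
        conn.execute(f"DROP {obj_type.upper()} IF EXISTS {quoted}")
    conn.commit()
```

If foreign key enforcement is switched on, the order in which tables are dropped can matter; this sketch does not attempt to sort them.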

Doctrine schema changes while keeping data?

We're developing a Doctrine backed website using YAML to define our schema. Our schema changes regularly (including fk relations) so we need to do a lot of:
Doctrine::generateModelsFromYaml(APPPATH . 'models/yaml', APPPATH . 'models', array('generateTableClasses' => true));
Doctrine::dropDatabases();
Doctrine::createDatabases();
Doctrine::createTablesFromModels();
We would like to keep existing data and store it back in the re-created database. So I copy the data into a temporary database before the main db is dropped.
How do I get the data from the "old-scheme DB copy" to the "new-scheme DB"? (the new scheme only contains NEW columns, NO COLUMNS ARE REMOVED)
NOTE:
This obviously doesn't work because the column count doesn't match.
SELECT * FROM copy.Table INTO newscheme.Table
This obviously does work, however this is consuming too much time to write for every table:
SELECT old.col, old.col2, old.col3,'somenewdefaultvalue' FROM copy.Table as old INTO newscheme.Table
Have you looked into Migrations? They allow you to alter your database schema programmatically, without losing data (unless you remove columns, of course).
How about writing a script (using the Doctrine classes for example) which parses the yaml schema files (both the previous version and the "next" version) and generates the sql scripts to run? It would be a one-time job and not require that much work. The benefit of generating manual migration scripts is that you can easily store them in the version control system and replay version steps later on. If that's not something you need, you can just gather up changes in the code and do it directly through the database driver.
Of course, the fancier your schema changes become (column name changes, null to not null, etc.), the harder the maintenance will get.
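As a rough idea of what such a generator script could produce for the "new columns only" case from the question, here is a small Python sketch. It assumes you already have the column lists for the old and new version of a table (for example from parsing the two YAML schema files, which is not shown). The function name build_copy_statement and the MySQL-style backtick quoting are assumptions; the database names copy and newscheme are taken from the question.

```python
def build_copy_statement(table, old_columns, new_columns,
                         old_db="copy", new_db="newscheme"):
    """Build an INSERT ... SELECT that copies only the columns both schemas share.

    Columns that exist only in the new schema are omitted from the
    statement, so they receive their column defaults.
    """
    new_set = set(new_columns)
    shared = [c for c in old_columns if c in new_set]
    if not shared:
        raise ValueError(f"no shared columns for table {table!r}")
    cols = ", ".join(f"`{c}`" for c in shared)
    return (
        f"INSERT INTO `{new_db}`.`{table}` ({cols}) "
        f"SELECT {cols} FROM `{old_db}`.`{table}`;"
    )
```

Naming the columns explicitly on both sides avoids the column-count mismatch, and looping over all tables removes the need to write each statement by hand.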

How do I handle large SQL SERVER batch inserts?

I'm looking to execute a series of queries as part of a migration project. The scripts are produced by a tool which analyses the legacy database and then generates a script to map each of the old entities to an appropriate new record. The scripts run well for small entities, but some have records in the hundreds of thousands, which produces script files of around 80 MB.
What is the best way to run these scripts?
Is there some SQLCMD from the prompt which deals with larger scripts?
I could also break the scripts down into further smaller scripts but I don't want to have to execute hundreds of scripts to perform the migration.
If possible have the export tool modified to export a BULK INSERT compatible file.
Barring that, you can write a program that will parse the insert statements into something that BULK INSERT will accept.
BULK INSERT uses BCP format files, which come in traditional (non-XML) or XML form. Does each row have to get a new identity that is then used in a child table, so that you can't get away with SET IDENTITY_INSERT <table> ON because the database design has changed so much? If so, I think you might be better off using SSIS or similar and doing a Merge Join once the identities are assigned. You could also load the data into staging tables in SQL using SSIS or BCP and then use regular SQL (potentially within SSIS in a SQL task) with the OUTPUT INTO feature to capture the identities and use them in the children.
Just execute the script. We regularly run backup / restore scripts that are hundreds of MB in size. It only takes 30 seconds or so.
If it is critical not to block your server for this amount of time, you'll have to split it up a bit.
Also look into the --tab option of mysqldump, which outputs the data using SELECT ... INTO OUTFILE; this is more efficient and faster to load.
It sounds like this is generating a single INSERT for each row, which is really going to be pretty slow. If they are all wrapped in a transaction, too, that can be kind of slow (although the number of rows doesn't sound that big that it would cause a transaction to be nearly impossible - like if you were holding a multi-million row insert in a transaction).
You might be better off looking at ETL (DTS, SSIS, BCP or BULK INSERT FROM, or some other tool) to migrate the data instead of scripting each insert.
You could break up the script and execute it in parts (especially if currently it makes it all one big transaction), just automate the execution of the individual scripts using PowerShell or similar.
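One way to do the splitting, sketched here in Python instead of PowerShell: group the generated INSERT statements into fixed-size batches and wrap each batch in its own transaction. How you get individual statements out of the 80 MB file depends on what the export tool writes and is not shown; batch_statements and the default batch size of 1000 are made up for illustration.

```python
def _wrap(batch):
    """Wrap one list of statements in a transaction followed by a GO separator."""
    return "\n".join(["BEGIN TRANSACTION;", *batch, "COMMIT TRANSACTION;", "GO"])


def batch_statements(statements, batch_size=1000):
    """Yield T-SQL batches, each holding up to batch_size statements.

    Blank entries are skipped and a trailing semicolon is added
    where one is missing.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for stmt in statements:
        stmt = stmt.strip()
        if not stmt:
            continue
        batch.append(stmt if stmt.endswith(";") else stmt + ";")
        if len(batch) == batch_size:
            yield _wrap(batch)
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield _wrap(batch)
```

Each yielded string can be written to its own file or appended to a single file for sqlcmd, which treats GO as a batch separator. Smaller transactions keep the log and lock footprint down compared with one giant transaction.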
I've been looking into the "BULK INSERT" from file option but cannot see any examples of the file format. Can the file mix the row formats or does it have to always be consistent in a CSV fashion? The reason I ask is that I've got identities involved across various parent / child tables which is why inserts per row are currently being used.