I'm writing a PHP-based web application that should work with multiple database systems. The most important are MySQL and SQLite but PostgreSQL and others would also be nice. For this I try to use as portable SQL as possible. Where this isn't possible, I have defined some meta words in my queries that are handled by my DB layer and converted to platform-specific SQL commands.
I'm now trying to add sequences support. Every DBMS handles sequences differently, there is no common way to write them in SQL. I have read and understood the way PostgreSQL does it. I have found an interesting solution for MySQL that uses MyISAM tables to escape the isolation constraints of a transaction. After all, sequences are not rolled back with the transaction they're used in and that's exactly what I want. Sequences are supposed to be multi-user safe.
Now I haven't found a solution for SQLite. It lacks built-in sequence support. It doesn't provide ways to store data outside a running transaction. My current implementation is to lock the table far enough to do a SELECT MAX(...) and use that value. But I want to get rid of that entirely. In SQLite, this approach requires locking the whole database!
Does anybody know a solution for this with SQLite?
Just create a regular counter table. On creation of a sequence foo, do
create table foo(value int);
insert into foo(value) values(0);
Then, when asking for the next counter value, do
update foo set value=value+1;
While this gets rolled back when the transaction is aborted, it is multi-user safe: no two users will commit the same number. sqlite implements concurrency with a database lock, so the second writer will block anyway (not just because of the sequence update, but also because of the other changes it wants to make).
I would use lastInsertRowID. This returns the rowid of the last inserted data (which equals the INTEGER PRIMARY KEY value of that row).
You won't need any sequence then.
Related
Trying to develop something which should be portable between the bigger RDBMS'es.
The issue is around generating and using auto-increment numbers as the primary key for a table.
There are two topics here
The mechanism used to generate the auto-increment numbers.
How to specify that you want to use this as the primary key on a
table.
I'm looking for verification for what I think is the current state of affairs:
Unfortunately standardization came late to this area and in some respect is still not implemented (as a mandatory standard). This means it is still in 2013 impossible to write a CREATE TABLE statement in a portable way ... if you want it with a auto-generated primary key.
Can this really be so?
Re (1). This is standardized because it came in SQL:2003. As far as I understand the way to go is SEQUENCEs. I believe these are a mandatory part of SQL:2003, right? The other possibility is the IDENTITY keyword which is also defined in SQL:2003 but that one is - as far as I can tell - an optional part of the standard ... which means a key player like Oracle doesn't implement it... and can still claim compliance. Ok, so SEQUENCEs is the designated portable method for this, right ?
Re (2). Database vendors implement this in different ways. In PostgreSQL you can link the CREATE TABLE statement directly with the sequence, in Oracle you would have to create a trigger to ensure the SEQUENCE is used with the table.
So my conclusion is that without a standardized solution to (2) it really doesn't help much that all the major players now support SEQUENCEs. I would still have to write db-specific code for something as simple as a CREATE TABLE statement.
Is this right?
Standards and their implementation aside I would also be interested if anyone has a portable solution to the problem, no matter if it is a hack from a RDBMS best practice perspective. For such a solution to work it would have to be independent from any application, i.e. it must the database that solves the issue, not the application layer. Perhaps if both the concept of TRIGGERs and SEQUENCEs can be said to be standardized then a solution that combines the two of them would be portable?
As for "portable create table statements": It starts with the data types: Whether boolean, int or long data types are part of any SQL standard or not, I really appreciate these types. PostgreSql supports these data types, Oracle does not. Ironically Oracle supports boolean in PL/SQL, but not as a data type in a table. Even the length of table/column names etc. are restricted in Oracle to 30 characters. So not even the most simple "create table" is always portable.
As for auto-generated primary keys: I am not aware of a syntax which is portable, so I do not define this in the "create table". Of couse this only delays the problem, and leaves it to the insert statements. This topic is connected with another problem: Getting the generated key after an insert using JDBC in the most efficient way. This differs substantially between Oracle and PostgreSql, and if you have ever dared to use case sensitive table/column names in Oracle, it won't be funny.
As for constraints, I prefer to add them in separate statements after "create table". The set of constraints may differ, if you implement a boolean data type in Oracle using char(1) together with a check constraint whereas PostgreSql supports this data type directly.
As for "standards": One example
SQL99 standard: for SELECT DISTINCT, ORDER BY expressions must appear in select list
This message is from PostgreSql, Oracle 11g does not complain. After 14 years, will they change it?
Generally speaking, you still have to write database specific code.
As for your conclusion: In our scenario we implemented a portable database application using a model driven approach. This logical meta data is used by the application, and there are different back ends for different database types. We do not use any ORM, just "direct SQL", because this simplifies tuning of SQL statements, and it gives full access to all SQL features. We wrote our own library, and later we found out that the key ideas match these of "Anorm".
The good news is that while there are tons of small annoyances, it works pretty well, even with complex queries. For example, window aggregate functions are quite portable (row_number(), partition by). You have to use listagg on Oracle, whereas you need string_agg on PostgreSql. Recursive commen table expressions require "with recursive" in PostgreSql, Oracle does not like it. PostgreSql supports "limit" and "offset" in queries, you need to wrap this in Oracle. It drives you crazy, if you use SQL arrays both in Oracle and PostgreSql (arrays as columns in tables). There are materialized views on Oracle, but they do not exist in PostgreSql. Surprisingly enough, it is possible to write database stored procedures not only in Java, but in Scala, and this works amazingly well in both Oracle and PostgreSql. This list is not complete. But so far we managed to find an acceptable (= fast) solution for any "portability problem".
Does it pay off? In our scenario, there is a central Oracle installation (RAC, read/write), but there are distributed PostgreSql installations as localhost databases on each application server (only readonly). This gives a big performance and scalability boost, without the cost penalty.
If you really want to have it solved in the database only, there is one possibility: Put anything in stored procedures, write these in Java/Scala, and restrict yourself in the application to call these procedures, and to read the result sets. This of course just moves the complexity from the application layer into the database, but you accepted hacks :-)
Triggers are quite standardized, if you use Java stored procedures. And if it is supported by your databases, by your management, your data center people, and your colleagues. The non-technical/social aspects are to be considered as well. I have even heard of database tuning people which do not accept the general "left outer join" syntax; they insisted on the Oracle way of using "(+)".
So even if triggers (PL/SQL) and sequences were standardized, there would be so many other things to consider.
Update
As for returning the generated primary keys I can only judge the situation from JDBC's perspective.
PostgreSql returns it, if you use Statement.getGeneratedKeys (I consider this the normal way).
Oracle requires you to specify the (primary key) column(s) whose values you want to get back explicitly when you create the prepared statement. This works, but only if you are not using case sensitive table names. In that case all you receive is a misleading ORA-00942: table or view does not exist thrown in Oracle's JDBC driver: There was/is a bug in Oracle's JDBC driver, and I have not found a way to get the value using a portable JDBC method. So at the cost of an additional proprietary "select sequence.currVal from dual" within the same transaction right after the insert, you can get back the primary key. The additional time was acceptable in our case, we compared the times to insert 100000 rows: PostgreSql is faster until the 10000th row, after that Oracle performs better.
See a stackoverflow question regarding the ways to get the primary key and
the bug report with case sensitive table names from 2008
This example shows pretty well the problems. Normally PostgreSql follows the way you expect it to work, but you may have to find a special way for Oracle.
I'm thinking about implementing a queue in an Oracle database basically using a table as the queue and am trying to decide how to prevent multiple sessions from inserting into the table at the same time and messing up the ordering.
Is doing a LOCK TABLE mytable IN EXCLUSIVE MODE the only way to prevent multiple sessions from inserting into a table?
Edit:
"Queue" is probably not the right term here. More of a list, in which the relative order between the elements is what's important. I'm thinking of something on the order of Java's ArrayList.
Are you sure you need to implement your own queue? Oracle provides built-in queuing (called Oracle Advanced Queuing). That would certainly tend to be a better option than implementing the same thing yourself.
If you do implement a queue yourself, why would you need to prevent concurrent inserts? What order are you concerned with being "messed up"? Assuming you are using an Oracle sequence object to populate the primary key, I don't see any reason that you'd have to prevent concurrent updates.
It's completely unclear to me what problem you're trying to avoid.
Explicit locking provides serialization, but doesn't seem to provide any meaningful, deterministic order of insertion. If two sessions tried to insert "at the same time", which one gets the lock first is essentially random. You'd get the same effect without explicit locking if the table has a primary key, since multiple inserts would have to serialize on that to enforce uniqueness. But even without that, one session would end up inserting (or more to the point, committing its insert) before the other.
It sounds like you have some sense of an order the items need to be in, independently of the order in which they are inserted. Without more detail, my reaction to that is that the table should have a column that indicates the ordering, and readers of the table should make use of that to properly sort the records. Recall that database tables have no inherent ordering; just because one row is inserted before another does not mean that they are stored in that order or returned to queries in that order.
A similar question has been asked, but since it always depends, I'm asking for my specific situation separately.
I have a web-site page that shows some data that comes from a database, and to generate the data from that database I have to do some fairly complex multiple joins queries.
The data is being updated once a day (nightly).
I would like to pre-generate the data for the said view to speed up the page access.
For that I am creating a table that contains exact data I need.
Question: for my situation, is it reasonable to do complete table wipe followed by insert? or should I do update,insert?
SQL wise seems like DELETE + INSERT will be easier (INSERT part is a single SQL expression).
EDIT: RDBMS: MS SQL Server 2008 Ent
TRUNCATE will be faster than delete, so if you need to empty a table do that instead
You didn't specify your RDBMS vendor but some of them also have MERGE/UPSERT commands This enables you do update the table if the data exists and insert if it doesn't
It partly depends on how the data is accessed. If you have a period of time with no (or very few) users accessing it, then there won't be much impact on the data disappearing (between the DELETE and the completion of the INSERT) for a short while.
Have you considered using a materialized view (MSSQL calls them indexed views) instead of doing it manually? This could also have other performance benefits as an indexed view gives the query optimizer more choices when its constructing execution plans for other queries that reference the table(s) in the view.
It depends on the size of the table and the recovery model on the database. If you are deleting many hundreds of thousands of records and reinstating them vs updating a small batch of a few hundred and inserting tens of rows, it will add an unnecessary size to your transaction logs. However you could use TRUNCATE to get around this as it won't affect the transaction log.
Do you have the option of a MERGE/UPSERT? If you're using MS-SQL you can use CROSS APPLY to do something similar if you don't.
One approach to handling this type of problem is to insert into new table, then do a table Rename. This will insure that all new data is present at the same time.
What if some data that was present yesterdays is not anymore? Delete may be safer or you could end up deleting some records anyway.
And in the end it doesnt really matter which way you go.
Unless on the case #kevinw mentioned
Although I fully agree with SQLMenace's answer I do would like to point out that MERGE does NOT remove unneeded records ! If you're sure that your new data will be a super-set of the existing data, then MERGE is great, otherwise you'll either need to make sure that you delete any superfluous records later on, or use the TRUNCATE + INSERT method ...
(Personally I'm still a fan of the latter as it usually is quite fast, just make sure to drop all indexes/unique constraints upfront and rebuild them one by one. This has the benefit of the INSERT transaction being smaller and the index-adding being done in (smaller) transactions again later on). (**)
(**: yes, this might be tricky on live system, but then again he already mentioned this was done during some kind of overnight anyway, I'm extrapolating there is no user-access at that time)
I'm using a database as a kind of cache - the purpose is to store currently known "facts" (the details aren't that interesting). Crucially, however, I'd like to avoid inserting duplicate facts. To do so, the current data access code uses insert or ignore in many places.
I'd like to use the entity framework to benefit from it's manageability; I've had enough of writing boilerplate code for yet another on subtly different query, and so far the Entity Framework's support for queries looks great.
However, I can't really seem to find good support for adding new facts - the obvious way seems to be to simply construct new Entities in .NET, add them to the model, and save the changes to the database. However, this means that I need to check in advance whether a particular row already exists since if it does insertion is not permitted. In the best of times, that means extra roundtrips, but it's usually much worse: previously this code was highly parallel - the occasional duplicate insert wouldn't matter since the database would just ignore it - but now not only would I need to check whether a row exists beforehand, I'd need to do so in far smaller batches (which is typically many times slower due to transaction overhead) since just one already-existing row from a parallel insert will cause a rollback of the entire transaction.
So, does anyone know how to get similar behavior using the entity framework?
To be specific, I'd like to be able to perform several inserts and have the database simply ignore those inserts that are impossible - or more generally, to be able to express server-side conflict resolution code to avoid unnecessary roundtrips and transactions.
If it matters, I'm using SQLite.
This functionality can be implemented via stored procedures (which the entity framework can use) on DB's that support it (MSSQL, for instance). On SQLite, these aren't supported. However, a simple solution exists for sqlite as well, namely to use triggers such as the following:
CREATE TRIGGER IF NOT EXISTS MyIgnoreTrigger BEFORE INSERT ON TheTable
FOR EACH ROW BEGIN
INSERT OR IGNORE
INTO TheTable (col1, col2, col3)
VALUES (NEW.col1, NEW.col2, NEW.col3);
select RAISE(IGNORE);
END;
Alternatively, the trigger could be set up to directly verify the constraint and only raise "ignore" when the constraint is violated.
What is the best, DBMS-independent way of generating an ID number that will be used immediately in an INSERT statement, keeping the IDs roughly in sequence?
DBMS independent? That's a problem. The two most common methods are auto incrementing columns, and sequences, and most DBMSes do one or the other but not both. So the database independent way is to have another table with one column with one value that you lock, select, update, and unlock.
Usually I say "to hell with DBMS independence" and do it with sequences in PostgreSQL, or autoincrement columns in MySQL. For my purposes, supporting both is better than trying to find out one way that works everywhere.
If you can create a Globally Unique Identifier (GUID) in your chosen programming language - consider that as your id.
They are harder to work with when troubleshooting (it is much easier to type in a where condition that is an INT) but there are also some advantages. By assigning the GUID as your key locally, you can easily build parent-child record relationships without first having to save the parent to the database and retrieve the id. And since the GUID, by definition, is unique, you don't have to worry about incrementing your key on the server.
There is auto increment or sequence
What is the point of this, that is the least of your worries?
How will you handle SQL itself?
MySQL has Limit,
SQL Server has Top,
Oracle has Rank
Then there are a million other things like triggers, alter table syntax etc etc
Yep, the obvious ways in raw SQL (and in my order of preference) are a) sequences b) auto-increment fields. The better, more modern, more DBMS-independent way is to not touch SQL at all, but to use a (good) ORM.
There's no universal way to do this. If there were, everyone would use it. SQL by definition abhors the idea - it's an antipattern for set-based logic (although a useful one, in many real-world cases).
The biggest problem you'd have trying to interpose an identity value from elsewhere is when a SQL statement involves several records, and several values must be generated simultaneously.
If you need it, then make it part of your selection requirements for a database to use with your application. Any serious DBMS product will provide its own mechanism to use, and it's easy enough to code around the differences in DML. The variations are pretty much all in the DDL.
I'd always go for the DB specific solution, but if you really have to the usual way of doing this is to implement your own sequence. Your RDBMS has to support transactions.
You create a sequence table which contains an int column and seed this with the first number, your transaction logic then looks something like this
begin transaction
update tblSeq set intID = intID + 1
select #myID = intID from tblSeq
inset into tblData (intID, ...) values (#myID, ...)
end transaction
The transaction forces a write lock such that the then next queued insert cannot update the tblSeq value before the record has been inserted into tblData. So long as all inserts go though this transaction then your generated ID is in sequence.
Use an auto-incrementing id column.
Is there really a reason that they have to be in sequence? If you're just using it as an ID, then you should just be able to use part of a UUID or the first couple digits of md5(now()).
You could take the time and massage it. It'd be the equivalent of something like
DateTime.Now.Ticks
So it be something like YYYYMMDDHHMMSSSS
It may be of a bit lateral approach, but a good ORM-type library will probably be able to at least hide the differences. For example, in Ruby there is ActiveRecord (commonly used in but not exclusively tied to the Ruby the Rails web framework) which has Migrations. Within a table definition, which is declared in platform-agnostic code, implementation details such as datatypes, sequential id generation and index creation are pushed down below your vision.
I have transparently developed a schema on SQLite, then implemented it on MS SQL Server and later ported to Oracle. Without ever changing the code that generates my schema definition.
As I say, it may not be what you're looking for, but the easiest way to encapsulate what varies is to use a library that has already done the encapsulation for you.
With only SQL, following could be one to the approaches:
Create a table to contain the starting id for your needs
When the application is deployed for the first time, the application should read the value in its context.
Thereafter, increment id (in thread-safe fashion) as required
3.1 Write the id to the database (in thread-safe fashion) which always keeps updated value
3.2 Don't write it to the database, just keep incrementing in the memory (thread-safe manner)
If for any reason server is going down, write the current id value to the database
When the server is up again it will pick from where it left, the last time.