Portable SQL: unique primary keys

I'm trying to develop something which should be portable between the bigger RDBMSes.
The issue is around generating and using auto-increment numbers as the primary key for a table.
There are two topics here:
(1) The mechanism used to generate the auto-increment numbers.
(2) How to specify that you want to use this as the primary key on a table.
I'm looking for verification of what I think is the current state of affairs:
Unfortunately, standardization came late to this area and in some respects it is still not implemented (as a mandatory part of the standard). This means that even in 2013 it is impossible to write a CREATE TABLE statement in a portable way ... if you want it with an auto-generated primary key.
Can this really be so?
Re (1). This part is standardized: it came with SQL:2003. As far as I understand, the way to go is SEQUENCEs. I believe these are a mandatory part of SQL:2003, right? The other possibility is the IDENTITY keyword, which is also defined in SQL:2003, but that one is - as far as I can tell - an optional part of the standard ... which means a key player like Oracle doesn't implement it ... and can still claim compliance. Ok, so SEQUENCEs are the designated portable method for this, right?
Re (2). Database vendors implement this in different ways. In PostgreSQL you can link the CREATE TABLE statement directly to the sequence; in Oracle you have to create a trigger to ensure the SEQUENCE is used with the table.
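To make the difference concrete, here is a rough sketch of what the two approaches look like (table, sequence and column names are invented for the example, and the exact trigger syntax may vary slightly between Oracle versions):

-- PostgreSQL: the column default can reference the sequence directly
CREATE SEQUENCE t_seq;
CREATE TABLE t (
  id integer PRIMARY KEY DEFAULT nextval('t_seq'),
  payload varchar(100)
);

-- Oracle: the sequence is separate, a trigger fills the column on insert
CREATE SEQUENCE t_seq;
CREATE TABLE t (
  id NUMBER PRIMARY KEY,
  payload VARCHAR2(100)
);
CREATE OR REPLACE TRIGGER t_id_trg
BEFORE INSERT ON t
FOR EACH ROW
BEGIN
  SELECT t_seq.NEXTVAL INTO :NEW.id FROM dual;
END;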
So my conclusion is that without a standardized solution to (2) it really doesn't help much that all the major players now support SEQUENCEs. I would still have to write db-specific code for something as simple as a CREATE TABLE statement.
Is this right?
Standards and their implementation aside, I would also be interested if anyone has a portable solution to the problem, no matter if it is a hack from an RDBMS best-practice perspective. For such a solution to work it would have to be independent of any application, i.e. it must be the database that solves the issue, not the application layer. Perhaps if both the concept of TRIGGERs and SEQUENCEs can be said to be standardized, then a solution that combines the two of them would be portable?

As for "portable create table statements": It starts with the data types: Whether boolean, int or long data types are part of any SQL standard or not, I really appreciate these types. PostgreSql supports these data types, Oracle does not. Ironically Oracle supports boolean in PL/SQL, but not as a data type in a table. Even the length of table/column names etc. are restricted in Oracle to 30 characters. So not even the most simple "create table" is always portable.
As for auto-generated primary keys: I am not aware of a syntax which is portable, so I do not define this in the "create table". Of course this only delays the problem and leaves it to the insert statements. This topic is connected with another problem: getting the generated key after an insert using JDBC in the most efficient way. This differs substantially between Oracle and PostgreSql, and if you have ever dared to use case sensitive table/column names in Oracle, it won't be funny.
As for constraints, I prefer to add them in separate statements after "create table". The set of constraints may differ if you implement a boolean data type in Oracle using char(1) together with a check constraint, whereas PostgreSql supports this data type directly.
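For example, the boolean emulation might look roughly like this (table, column and constraint names are just illustrative):

-- PostgreSql: native boolean
CREATE TABLE account (
  id integer PRIMARY KEY,
  is_active boolean NOT NULL DEFAULT true
);

-- Oracle: char(1) plus a check constraint added in a separate statement
CREATE TABLE account (
  id NUMBER PRIMARY KEY,
  is_active CHAR(1) DEFAULT 'Y' NOT NULL
);
ALTER TABLE account
  ADD CONSTRAINT account_is_active_ck CHECK (is_active IN ('Y','N'));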
As for "standards": One example
SQL99 standard: for SELECT DISTINCT, ORDER BY expressions must appear in select list
This message is from PostgreSql, Oracle 11g does not complain. After 14 years, will they change it?
Generally speaking, you still have to write database specific code.
As for your conclusion: In our scenario we implemented a portable database application using a model driven approach. This logical meta data is used by the application, and there are different back ends for different database types. We do not use any ORM, just "direct SQL", because this simplifies tuning of SQL statements, and it gives full access to all SQL features. We wrote our own library, and later we found out that the key ideas match these of "Anorm".
The good news is that while there are tons of small annoyances, it works pretty well, even with complex queries. For example, window aggregate functions are quite portable (row_number(), partition by). You have to use listagg on Oracle, whereas you need string_agg on PostgreSql. Recursive common table expressions require "with recursive" in PostgreSql, Oracle does not like it. PostgreSql supports "limit" and "offset" in queries; you need to wrap this in Oracle. It drives you crazy if you use SQL arrays both in Oracle and PostgreSql (arrays as columns in tables). There are materialized views in Oracle, but they do not exist in PostgreSql. Surprisingly enough, it is possible to write database stored procedures not only in Java, but in Scala, and this works amazingly well in both Oracle and PostgreSql. This list is not complete. But so far we managed to find an acceptable (= fast) solution for any "portability problem".
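Two concrete examples of the kind of rewriting involved (table and column names invented for illustration):

-- String aggregation: Oracle vs PostgreSql
SELECT dept, LISTAGG(name, ',') WITHIN GROUP (ORDER BY name) FROM emp GROUP BY dept;  -- Oracle
SELECT dept, STRING_AGG(name, ',' ORDER BY name) FROM emp GROUP BY dept;              -- PostgreSql

-- Paging: PostgreSql vs Oracle (wrapped with a window function)
SELECT * FROM emp ORDER BY name LIMIT 20 OFFSET 40;                                   -- PostgreSql
SELECT * FROM (
  SELECT e.*, ROW_NUMBER() OVER (ORDER BY name) AS rn FROM emp e
) WHERE rn BETWEEN 41 AND 60;                                                          -- Oracle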
Does it pay off? In our scenario, there is a central Oracle installation (RAC, read/write), but there are distributed PostgreSql installations as localhost databases on each application server (only readonly). This gives a big performance and scalability boost, without the cost penalty.
If you really want to have it solved in the database only, there is one possibility: Put anything in stored procedures, write these in Java/Scala, and restrict yourself in the application to call these procedures, and to read the result sets. This of course just moves the complexity from the application layer into the database, but you accepted hacks :-)
Triggers are quite standardized, if you use Java stored procedures. And if it is supported by your databases, by your management, your data center people, and your colleagues. The non-technical/social aspects are to be considered as well. I have even heard of database tuning people who do not accept the general "left outer join" syntax; they insisted on the Oracle way of using "(+)".
So even if triggers (PL/SQL) and sequences were standardized, there would be so many other things to consider.
Update
As for returning the generated primary keys I can only judge the situation from JDBC's perspective.
PostgreSql returns it, if you use Statement.getGeneratedKeys (I consider this the normal way).
Oracle requires you to specify explicitly, when you create the prepared statement, the (primary key) column(s) whose values you want to get back. This works, but only if you are not using case sensitive table names; in that case all you receive is a misleading ORA-00942: table or view does not exist thrown in Oracle's JDBC driver. There was/is a bug in Oracle's JDBC driver, and I have not found a way to get the value using a portable JDBC method. So at the cost of an additional proprietary "select sequence.currVal from dual" within the same transaction right after the insert, you can get back the primary key. The additional time was acceptable in our case; we compared the times to insert 100000 rows: PostgreSql is faster until the 10000th row, after that Oracle performs better.
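In SQL terms the Oracle workaround boils down to something like this (sequence, table and column names are made up; currval is only valid in the session that just called nextval):

-- in the same transaction/session, right after the insert
INSERT INTO orders (id, customer) VALUES (orders_seq.NEXTVAL, 'ACME');
SELECT orders_seq.CURRVAL FROM dual;  -- returns the id that was just generated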
See a Stack Overflow question regarding the ways to get the primary key, and the bug report about case sensitive table names from 2008.
This example shows pretty well the problems. Normally PostgreSql follows the way you expect it to work, but you may have to find a special way for Oracle.

Related

"x FOR y" in column definition of DB2 create table statement

This is the DB2 create table statement:
CREATE TABLE SCHEMA_F.TABLE_A (
LEAFDESCRIPTION FOR F7UY2 CHAR(10) CCSID 1141 NOT NULL WITH DEFAULT
...
What does LEAFDESCRIPTION FOR F7UY2 mean exactly?
tl;dr
It allows you to specify the short name usable from (older) RPG programs on the i and some system utilities. In this case, LeafDescription is the long name usable from SQL, and F7UY2 is the short name usable inside RPG, etc. (and from SQL too, but long names are usually preferred for obvious reasons).
It's part of the system-specific behavior available on the i.
See, the i (iSeries, System i, AS/400) started out as a very different machine. Several decades ago, before SQL was really a thing. And at the time, field and table names were limited to around 6 characters, and programs were written with this in mind.
Now, flash forward to when SQL comes around. On the i, DB2 SQL got retrofitted (of a sort) on top of the existing file system. This meant that, among other things, you could query over your existing files with SQL without recreating or repopulating them.
Unfortunately, it also meant they were stuck with the original, semi-cryptic names, in part because you still had all the original programs too. If you created (or re-created, as the case may be) a table via SQL, RPG programs couldn't reference it. So this allowed you to specify a name it could reference - which meant you could switch to SQL-based tables (which had some additional, new, advantages) without needing to change your programs. Even with the name-length restriction being lifted in recent versions of RPG, the ability to avoid renaming everything all at once is a boon.
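As a rough sketch of the same idea when you create a table yourself (schema, table and column names invented), the FOR clause pairs the long SQL name with the short system name:

-- DB2 for i: long SQL names plus short system names usable from RPG and system utilities
CREATE TABLE MYSCHEMA.CUSTOMER_MASTER (
  CUSTOMER_NUMBER FOR CUSTNO INTEGER  NOT NULL,
  CUSTOMER_NAME   FOR CUSTNM CHAR(30) NOT NULL WITH DEFAULT
);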

Cassandra CQL - NoSQL or SQL

I am pretty new to Cassandra; I just started learning it a week ago.
I first read that it was NoSQL, but when I started using CQL,
I started to wonder whether Cassandra is a NoSQL or SQL DB.
Can someone explain why CQL is more or less like SQL?
CQL is declarative like SQL and the very basic structure of the query component of the language (select things where condition) is the same. But there are enough differences that one should not approach using it in the same way as conventional SQL.
The obvious items: 1. There are no joins or subqueries. 2. No transactions
Less obvious but equally important to note:
Except for the primary key, you can only apply a WHERE condition on a column if you have created an index on that column. In SQL you don't have to index a column to filter on it, but in CQL the SELECT statement will fail outright (see the sketch after this list).
There are no OR or NOT logical operators, only AND. It is very important to model your data so you won't need these two; it is very easy to forget this when modelling.
Date handling is profoundly different. CQL permits ONLY the equality operator for timestamps, so extremely common and useful expressions like this do not work: where dateField > TO_TIMESTAMP('2013-01-01','YYYY-MM-DD'). Also, CQL does not permit string insert of dates accurate to millis (seconds only) -- but it does permit entry of millis since epoch as a long int -- which most other DB engines do NOT permit. Lastly, timezone (as GMT offset) is invisibly captured for both long millis and string formats without a timezone. This can lead to confusion for systems that deliberately do not conflate local time + GMT offset.
You can ONLY update a table based on the primary key (or an IN list of primary keys). You cannot update based on other column data, nor can you do a mass update like this: update table set field = value; CQL demands a WHERE clause with the primary key.
Grammar for AND does not permit parens. To be fair, it's not necessary because of the lack of the OR operator, but this means traditional SQL rewriters that add "protective" parens around expressions will not work with CQL, e.g.: select * from www where (str1 = 'foo2') and (dat1 = 12312442);
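A small sketch of the first and fourth points above (keyspace, table and column names are invented; CQL 3 syntax):

-- filtering on a non-key column requires an index first
CREATE INDEX ON users (email);
SELECT * FROM users WHERE email = 'bob@example.com';

-- updates must name the primary key in the WHERE clause
UPDATE users SET status = 'active' WHERE user_id = 42;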
In general, it is best to use Cassandra as a big, resilient permastore of data for which a small number of very high level, very high performance queries can be applied to drag out a subset of data to work with at the application layer. That subset might be 1 million rows, yes. CQL and the Cassandra model is not designed for 2 page long SELECT statements with embedded cases, aggregations, etc. etc.
For all intents and purposes, CQL is SQL, so in the strictest sense Cassandra is an SQL database. However, most people closely associate SQL with the relational databases it is usually applied to. Under this (mis)interpretation, Cassandra should not be considered an "SQL database" since it is not relational, and does not support ACID properties.
Docs for CQLV3.0
CQL DESCRIBE to get schema of keyspace, column family, cluster
CQL doesn't support some things you may know from SQL, such as joins, GROUP BY, triggers, cursors, transactions, and stored procedures
CQL3.0 Supports ORDER BY
CQL Supports all DML and DDL functionalities
CQL Supports BATCH
BATCH is not an analogue for SQL ACID transactions.
The docs mentioned above are the best reference :)

Is there standard SQL that works in all databases?

As seen below, the syntax is different for different databases. Isn't there a standard way that works in all databases?
Is there any tool to convert SQL from one dialect to another?
SqlServer2005:
CREATE TABLE Table01 (
Field01 int primary key identity(1,1)
)
Sqlite:
CREATE TABLE Table01 (
Field01 integer PRIMARY KEY AUTOINCREMENT NOT NULL UNIQUE
);
The "query" part of SQL (commonly called DML - Data Manipulation Language) is reasonably standardized, and queries written on one database will often run on another database if no "vendor enhancement" features are used. The "create database bits" parts of SQL (often known as DDL - Data Definition Language) is not standardized, and every database out there does it a little differently. So, as you've discovered, statements such as CREATE TABLE written for one database will not work without some tweaking on another database.
<soapbox>
In my opinion this is a Good Thing. DDL effectively defines what can be created in the database. If all vendors used exactly the same DDL, then all products would be exactly the same in terms of the features they support. For example, in order to allow for table inheritance in PostgreSQL there must be some way to define how one table inherits from another, and this definition must be part of how a table is defined, either as part of the CREATE TABLE statement or in some other fashion, but it has to be a DDL statement. Thus, for purely functional reasons (because PostgreSQL supports a feature which most databases do not) DDL in PostgreSQL must be different from that in other databases. A similar situation arises in MySQL because it allows the use of different storage engines with different capabilities. Similar situations arise in Oracle... and in SQL Server... and <your favorite database here>.
Standards are a double-edged sword. On the one hand they're a great thing, because they provide a "standard" way of doing something. On the other hand they're a terrible thing, because the very purpose of a standard is to provide a "standard" way of doing something, which is another way of saying "create stagnation and inhibit innovation".
</soapbox>
Share and enjoy.
There are several SQL standards out there, and SQL 1999 is probably the closest you will get, as each DB adopted a different standard (if any).
No, there's no standard SQL for this. As a shameless plug, I can recommend trying Wizardby if you're into anything related to database schema migration/continuous integration.

What is the best way to generate an ID for a SQL Insert?

What is the best, DBMS-independent way of generating an ID number that will be used immediately in an INSERT statement, keeping the IDs roughly in sequence?
DBMS independent? That's a problem. The two most common methods are auto-incrementing columns and sequences, and most DBMSes do one or the other but not both. So the database-independent way is to have another table with one column with one value that you lock, select, update, and unlock.
Usually I say "to hell with DBMS independence" and do it with sequences in PostgreSQL, or autoincrement columns in MySQL. For my purposes, supporting both is better than trying to find one way that works everywhere.
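For reference, the two vendor-specific idioms look roughly like this (table and column names invented):

-- PostgreSQL: a sequence behind the column (serial is shorthand for this)
CREATE TABLE item (
  id serial PRIMARY KEY,
  name text
);

-- MySQL: an auto-increment column
CREATE TABLE item (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100)
);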
If you can create a Globally Unique Identifier (GUID) in your chosen programming language - consider that as your id.
They are harder to work with when troubleshooting (it is much easier to type in a where condition that is an INT) but there are also some advantages. By assigning the GUID as your key locally, you can easily build parent-child record relationships without first having to save the parent to the database and retrieve the id. And since the GUID, by definition, is unique, you don't have to worry about incrementing your key on the server.
There is auto increment, or there are sequences.
What is the point of this? It is the least of your worries.
How will you handle SQL itself?
MySQL has Limit,
SQL Server has Top,
Oracle has Rank
Then there are a million other things like triggers, alter table syntax etc etc
Yep, the obvious ways in raw SQL (and in my order of preference) are a) sequences b) auto-increment fields. The better, more modern, more DBMS-independent way is to not touch SQL at all, but to use a (good) ORM.
There's no universal way to do this. If there were, everyone would use it. SQL by definition abhors the idea - it's an antipattern for set-based logic (although a useful one, in many real-world cases).
The biggest problem you'd have trying to interpose an identity value from elsewhere is when a SQL statement involves several records, and several values must be generated simultaneously.
If you need it, then make it part of your selection requirements for a database to use with your application. Any serious DBMS product will provide its own mechanism to use, and it's easy enough to code around the differences in DML. The variations are pretty much all in the DDL.
I'd always go for the DB-specific solution, but if you really have to, the usual way of doing this is to implement your own sequence. Your RDBMS has to support transactions.
You create a sequence table which contains an int column and seed this with the first number; your transaction logic then looks something like this:
begin transaction
declare @myID int
-- bump the counter and read the new value while the row is still locked
update tblSeq set intID = intID + 1
select @myID = intID from tblSeq
insert into tblData (intID, ...) values (@myID, ...)
commit transaction
The transaction forces a write lock such that the next queued insert cannot update the tblSeq value before the record has been inserted into tblData. So long as all inserts go through this transaction, your generated ID is in sequence.
Use an auto-incrementing id column.
Is there really a reason that they have to be in sequence? If you're just using it as an ID, then you should just be able to use part of a UUID or the first couple of digits of md5(now()).
You could take the time and massage it. It'd be the equivalent of something like
DateTime.Now.Ticks
So it would be something like YYYYMMDDHHMMSSSS
It may be a bit of a lateral approach, but a good ORM-type library will probably be able to at least hide the differences. For example, in Ruby there is ActiveRecord (commonly used in, but not exclusively tied to, the Ruby on Rails web framework) which has Migrations. Within a table definition, which is declared in platform-agnostic code, implementation details such as datatypes, sequential id generation and index creation are pushed down below your vision.
I have transparently developed a schema on SQLite, then implemented it on MS SQL Server and later ported it to Oracle, without ever changing the code that generates my schema definition.
As I say, it may not be what you're looking for, but the easiest way to encapsulate what varies is to use a library that has already done the encapsulation for you.
With only SQL, the following could be one of the approaches:
1. Create a table to contain the starting id for your needs.
2. When the application is deployed for the first time, it should read the value into its context.
3. Thereafter, increment the id (in a thread-safe fashion) as required, and either:
3.1 Write the id to the database (in a thread-safe fashion) so it always holds the updated value, or
3.2 Don't write it to the database; just keep incrementing it in memory (in a thread-safe manner).
4. If for any reason the server is going down, write the current id value to the database.
5. When the server is up again, it will pick up from where it left off the last time.

Why use "Y"/"N" instead of a bit field in Microsoft SQL Server?

I'm working on an application developed by another mob and am confounded by the use of a char field instead of bit for all the boolean columns in the database. It uses "Y" for true and "N" for false (these have to be uppercase). The type name itself is then aliased with some obscure name like ybln.
This is very annoying to work with for a lot of reasons, not the least of which is that it just looks downright aesthetically unpleasing.
But maybe it's me that's stupid - why would anyone do this? Is it a database compatibility issue or some design pattern that I am not aware of?
Can anyone enlighten me?
I've seen this practice in older database schemas quite often. One advantage I've seen is that using CHAR(1) fields provides support for more than Y/N options, like "Yes", "No", "Maybe".
Other posters have mentioned that Oracle might have been used. The schema I referred to was in fact deployed on Oracle and SQL Server. It limited the usage of data types to a common subset available on both platforms.
They did diverge in a few places between Oracle and SQL Server but for the most part they used a common schema between the databases to minimize the development work needed to support both DBs.
Welcome to brownfield. You've inherited an app designed by old-schoolers. It's not a design pattern (at least not a design pattern with something good going for it), it's a vestige of coders who cut their teeth on databases with limited data types. Short of refactoring the DB and lots of code, grit your teeth and gut your way through it (and watch your case)!
Other platforms (e.g. Oracle) do not have a bit SQL type. In which case, it's a choice between NUMBER(1) and a single character field. Maybe they started on a different platform or wanted cross platform compatibility.
I don't like the Y/N char(1) field as a replacement for a bit column either, but there is one major down-side to a bit field in a table: you can't create an index for a bit column or include it in a compound index (at least not in SQL Server 2000).
Sure, you could discuss if you'll ever need such an index. See this request on a SQL Server forum.
They may have started development back with Microsoft SQL Server 6.5.
Back then, adding a bit field to an existing table with data in place was a royal pain in the rear. Bit fields couldn't be null, so the only way to add one to an existing table was to create a temp table with all the existing fields of the target table plus the bit field, and then copy the data over, populating the bit field with a default value. Then you had to delete the original table and rename the temp table to the original name. Throw in some foreign key relationships and you've got a long script to write.
Having said that, there were always 3rd party tools to help with the process. If the previous developer chose to use char fields in lieu of bit fields, the reason, in a nutshell, was probably laziness.
The reasons are as follows (btw, they are not good reasons):
1) Y/N can quickly become "X" (for unknown), "L" (for likely), etc. What I mean by this is that I have personally worked with programmers who were so used to not collecting requirements correctly that they just started with Y/N as a sort of 'flag', with the superstition that it might need to expand (in which case they should use an int as a status ID).
2) "Performance" - but as was mentioned above, SQL indexes are ruled out if they are not 'selective' enough... a field that only has 2 possible values will never use that index.
3) Laziness. Sometimes developers want to output directly to some visual display with the letter "Y" or "N" for human readability, and they don't want to convert it themselves :)
Those are all 3 bad reasons that I've heard/seen before.
I can't imagine any disadvantage in not being able to index a "BIT" column, as it would be unlikely to have enough different values to help the execution of a query at all.
I also imagine that in most cases the storage difference between BIT and CHAR(1) is negligible (is that CHAR a NCHAR? does it store a 16bit, 24bit or 32bit unicode char? Do we really care?)
This is terribly common in mainframe files, COBOL, etc.
If you only have one such column in a table, it's not that terrible in practice (no real bit-wasting); after all, SQL Server will not let you say the natural WHERE BooleanColumn, you have to say WHERE BitColumn = 1 and IF @BitFlag = 1 instead of the far more natural IF @BooleanFlag. When you have multiple bit columns, SQL Server will pack them. The case of the Y/N should only be an issue if a case-sensitive collation is used, and to stop invalid data, there is always the option of a constraint.
Having said all that, my personal preference is for bits and only allowing NULLs after careful consideration.
Apparently, bit columns aren't a good idea in MySQL.
They probably were used to using Oracle and didn't properly read up on the available datatypes for SQL Server. I'm in exactly that situation myself (and the Y/N field is driving me nuts).
I've seen worse ...
One O/R mapper I had occasion to work with used 'true' and 'false' as they could be cleanly cast into Java booleans.
Also, on a reporting database such as a data warehouse, the DB is the user interface (metadata-based reporting tools notwithstanding). You might want to do this sort of thing as an aid to people developing reports. Also, an index with two values will still get used by index intersection operations on a star schema.
Sometimes such quirks are more associated with the application than the database. For example, handling booleans between PHP and MySQL is a bit hit-and-miss and makes for non-intuitive code. Using CHAR(1) fields and 'Y' and 'N' makes for much more maintainable code.
I don't have any strong feelings either way. I can't see any great benefit to doing it one way over another. I know philosophically the bit fields are better for storage. My reality is that I have very few databases that contain a lot of logical fields in a single record. If I had a lot then I would definitely want bit fields. If you only have a few I don't think it matters. I currently work with Oracle and SQL server DB's and I started with Cullinet's IDMS database (1980) where we packed all kinds of data into records and worried about bits and bytes. While I do still worry about the size of data, I long ago stopped worrying about a few bits.