Is there an implementation-agnostic way of having a SQL DB provide UUIDs?

For development I'm using the H2 database; in production it will most likely be Postgres. Is there a way to instruct the database, in an implementation-agnostic fashion, to automatically provide UUIDs for a table's rows?

A user-defined function could be used.
Related (I know this isn't your question): note that if you have a lot of rows in the table (millions) and an index on this UUID, you should avoid randomly distributed UUIDs for performance reasons. This applies to all databases, except when the index fits completely in memory. Because of that, I would personally avoid UUIDs and use sequences instead wherever possible.

Well apparently, it's as simple as that:
CREATE TABLE items (
    uuid SERIAL,
    PRIMARY KEY (uuid)
);
I didn't find SERIAL documented for H2; here's the doc for PostgreSQL. I don't know to what extent this is db-agnostic, but it works on both H2 and Postgres, so it's good enough at the moment.
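Note that SERIAL gives you an auto-incrementing integer rather than a UUID. If you want actual UUID values, a sketch of what both databases accept as a column default (RANDOM_UUID() is H2's built-in; gen_random_uuid() is built into PostgreSQL 13+ and comes from the pgcrypto extension in earlier versions):

-- H2
CREATE TABLE items (
    uuid UUID DEFAULT RANDOM_UUID() PRIMARY KEY
);

-- PostgreSQL (run CREATE EXTENSION pgcrypto first on versions before 13)
CREATE TABLE items (
    uuid UUID DEFAULT gen_random_uuid() PRIMARY KEY
);

The function names differ, so this is still not fully implementation-agnostic, but the table definition pattern is the same.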

Related

Bigquery - create surrogate keys on migrated data

We are doing a migration from AWS Redshift to GCP BigQuery.
Problem statement:
We have a Redshift table that uses the IDENTITY column functionality to issue an internal EDW surrogate key (PK) for natural/business keys. These natural keys are from at least 20 different source systems for customers. We need a method to identify them in case natural keys are somehow duplicated (because we have so many source systems). In BigQuery, the functionality of the Redshift IDENTITY column does not exist. How can I replicate this in BQ?
We can't use GENERATE_UUID() because all our downstream clients have been using a BIGINT for the last 4 years. All history is based on BIGINT, and too much would need to change for a VARCHAR.
Does anyone have any ideas, recommendations or suggestions?
Some considerations I have made:
1. Load the data into Spark, keep it in memory, and use Scala or Python functions to issue the surrogate key.
2. Use a NoSQL data store (but this does not seem likely as a use case).
Any ideas are welcome!
In these cases, the idea is generally to identify an injective/bijective function that can map into some unique space.
How about trying something like SELECT UNIX_MICROS(current_timestamp()) + x as identity, where x is a number that you can somehow manage (using case statements or if conditions) based on the business name or something similar?
You can also eliminate x from this formula if you intend to process things linearly in some order, like one business entity at a time.
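A sketch of that idea in BigQuery SQL, with hypothetical table and column names (staging_customers, natural_key, source_system); here ROW_NUMBER() plays the role of x so that rows processed in the same batch do not collide:

SELECT
    UNIX_MICROS(CURRENT_TIMESTAMP())
        + ROW_NUMBER() OVER (ORDER BY natural_key) AS surrogate_key,  -- stays a BIGINT (INT64)
    natural_key,
    source_system
FROM staging_customers;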
Hope it helps.

Portable SQL : unique primary keys

Trying to develop something which should be portable between the bigger RDBMSes.
The issue is around generating and using auto-increment numbers as the primary key for a table.
There are two topics here:
1. The mechanism used to generate the auto-increment numbers.
2. How to specify that you want to use this as the primary key on a table.
I'm looking for verification for what I think is the current state of affairs:
Unfortunately, standardization came late to this area and in some respects is still not implemented (as a mandatory standard). This means that even in 2013 it is impossible to write a CREATE TABLE statement in a portable way if you want it to have an auto-generated primary key.
Can this really be so?
Re (1). This is standardized because it came in SQL:2003. As far as I understand, the way to go is SEQUENCEs. I believe these are a mandatory part of SQL:2003, right? The other possibility is the IDENTITY keyword, which is also defined in SQL:2003, but that one is, as far as I can tell, an optional part of the standard, which means a key player like Oracle doesn't implement it and can still claim compliance. Ok, so SEQUENCEs are the designated portable method for this, right?
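For illustration, the standard syntax looks like this (a sketch with a hypothetical orders table; note that even the standard NEXT VALUE FOR expression is not supported everywhere: Oracle wants order_seq.NEXTVAL, PostgreSQL wants nextval('order_seq')):

CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1;
INSERT INTO orders (id, item)
    VALUES (NEXT VALUE FOR order_seq, 'widget');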
Re (2). Database vendors implement this in different ways. In PostgreSQL you can link the CREATE TABLE statement directly with the sequence, in Oracle you would have to create a trigger to ensure the SEQUENCE is used with the table.
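A sketch of that divergence, using a hypothetical items table:

-- PostgreSQL: the column default can reference the sequence directly
CREATE SEQUENCE items_seq;
CREATE TABLE items (
    id integer DEFAULT nextval('items_seq') PRIMARY KEY
);

-- Oracle (before 12c identity columns): a trigger fills the column
CREATE SEQUENCE items_seq;
CREATE TABLE items (id NUMBER PRIMARY KEY);
CREATE OR REPLACE TRIGGER items_bi
BEFORE INSERT ON items
FOR EACH ROW
BEGIN
    -- 11g syntax; older versions need SELECT items_seq.NEXTVAL INTO :NEW.id FROM dual
    :NEW.id := items_seq.NEXTVAL;
END;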
So my conclusion is that without a standardized solution to (2) it really doesn't help much that all the major players now support SEQUENCEs. I would still have to write db-specific code for something as simple as a CREATE TABLE statement.
Is this right?
Standards and their implementation aside, I would also be interested if anyone has a portable solution to the problem, even if it is a hack from an RDBMS best-practice perspective. For such a solution to work, it would have to be independent of any application, i.e. it must be the database that solves the issue, not the application layer. Perhaps, if both TRIGGERs and SEQUENCEs can be said to be standardized, a solution that combines the two would be portable?
As for "portable create table statements": It starts with the data types: Whether boolean, int or long data types are part of any SQL standard or not, I really appreciate these types. PostgreSql supports these data types, Oracle does not. Ironically Oracle supports boolean in PL/SQL, but not as a data type in a table. Even the length of table/column names etc. are restricted in Oracle to 30 characters. So not even the most simple "create table" is always portable.
As for auto-generated primary keys: I am not aware of a portable syntax, so I do not define this in the "create table". Of course this only delays the problem and leaves it to the insert statements. This topic is connected with another problem: getting the generated key after an insert using JDBC in the most efficient way. This differs substantially between Oracle and PostgreSQL, and if you have ever dared to use case-sensitive table/column names in Oracle, it won't be funny.
As for constraints, I prefer to add them in separate statements after "create table". The set of constraints may differ if you implement a boolean data type in Oracle using char(1) together with a check constraint, whereas PostgreSQL supports this data type directly.
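That boolean emulation might look like this on Oracle (a sketch; the table and constraint names are made up):

CREATE TABLE account (
    id     NUMBER  PRIMARY KEY,
    active CHAR(1) NOT NULL
);
ALTER TABLE account ADD CONSTRAINT account_active_ck
    CHECK (active IN ('Y', 'N'));
-- On PostgreSQL the same column is simply: active boolean NOT NULL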
As for "standards": One example
SQL99 standard: for SELECT DISTINCT, ORDER BY expressions must appear in select list
This message is from PostgreSQL; Oracle 11g does not complain. After 14 years, will they change it?
Generally speaking, you still have to write database specific code.
As for your conclusion: In our scenario we implemented a portable database application using a model driven approach. This logical meta data is used by the application, and there are different back ends for different database types. We do not use any ORM, just "direct SQL", because this simplifies tuning of SQL statements, and it gives full access to all SQL features. We wrote our own library, and later we found out that the key ideas match these of "Anorm".
The good news is that while there are tons of small annoyances, it works pretty well, even with complex queries. For example, window aggregate functions are quite portable (row_number(), partition by). You have to use listagg on Oracle, whereas you need string_agg on PostgreSQL. Recursive common table expressions require "with recursive" in PostgreSQL; Oracle does not like it. PostgreSQL supports "limit" and "offset" in queries; you need to wrap this in Oracle. It drives you crazy if you use SQL arrays in both Oracle and PostgreSQL (arrays as columns in tables). There are materialized views in Oracle, but they do not exist in PostgreSQL. Surprisingly enough, it is possible to write database stored procedures not only in Java but also in Scala, and this works amazingly well in both Oracle and PostgreSQL. This list is not complete, but so far we have managed to find an acceptable (= fast) solution for every "portability problem".
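Two of those differences side by side (a sketch over a hypothetical emp table):

-- String aggregation
SELECT dept, LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name)   -- Oracle
FROM emp GROUP BY dept;
SELECT dept, STRING_AGG(name, ', ' ORDER BY name)               -- PostgreSQL
FROM emp GROUP BY dept;

-- Pagination
SELECT * FROM emp ORDER BY name LIMIT 10 OFFSET 20;             -- PostgreSQL
SELECT * FROM (                                                 -- Oracle before 12c
    SELECT t.*, ROWNUM rn
    FROM (SELECT * FROM emp ORDER BY name) t
    WHERE ROWNUM <= 30
) WHERE rn > 20;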
Does it pay off? In our scenario, there is a central Oracle installation (RAC, read/write), but there are distributed PostgreSQL installations as localhost databases on each application server (read-only). This gives a big performance and scalability boost, without the cost penalty.
If you really want to have it solved in the database only, there is one possibility: put everything in stored procedures, write these in Java/Scala, and restrict the application to calling these procedures and reading the result sets. This of course just moves the complexity from the application layer into the database, but you accepted hacks :-)
Triggers are quite standardized if you use Java stored procedures, and if they are supported by your databases, your management, your data center people, and your colleagues. The non-technical/social aspects have to be considered as well. I have even heard of database tuning people who did not accept the general "left outer join" syntax; they insisted on the Oracle way of using "(+)".
So even if triggers (PL/SQL) and sequences were standardized, there would be so many other things to consider.
Update
As for returning the generated primary keys, I can only judge the situation from JDBC's perspective.
PostgreSQL returns it if you use Statement.getGeneratedKeys (I consider this the normal way).
Oracle requires you to explicitly specify the (primary key) columns whose values you want to get back when you create the prepared statement. This works, but only if you are not using case-sensitive table names; otherwise all you receive is a misleading ORA-00942: table or view does not exist thrown in Oracle's JDBC driver. There was/is a bug in Oracle's JDBC driver, and I have not found a way to get the value using a portable JDBC method. So at the cost of an additional proprietary "select sequence.currVal from dual" within the same transaction, right after the insert, you can get back the primary key. The additional time was acceptable in our case; we compared the times to insert 100,000 rows: PostgreSQL is faster until roughly the 10,000th row, after which Oracle performs better.
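The workaround in SQL form, with a hypothetical sequence and table name (CURRVAL is session-scoped, so this is safe as long as both statements run on the same connection):

INSERT INTO items (id, name) VALUES (items_seq.NEXTVAL, 'widget');
SELECT items_seq.CURRVAL FROM dual;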
See a Stack Overflow question regarding the ways to get the primary key, and
the bug report about case-sensitive table names from 2008.
This example shows the problems pretty well. Normally PostgreSQL works the way you expect it to, but you may have to find a special way for Oracle.

Database design: why use an autoincremental field as primary key?

Here is my question: why should I use auto-incrementing fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over the other? What are their strengths and weaknesses?
Simple numbers consume less space; UUID values consume 128 bits each. Working with numbers is also simpler. For most practical purposes, 32-bit or 64-bit integers can serve well as the primary key: 2^64 is a very large number.
Consuming less space doesn't just save hard disk space. It means faster backups, better performance in joins, and having more real stuff cached in the database server's memory.
You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using ints, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table which is at 200M on its id field at the moment. It'll bust the 2 billion limit in a year if I leave it as is.
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.
I guess by UUID you mean something like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, they can each generate unique GUIDs for row identifiers. Later the data can be combined into a single database and the IDs shouldn't conflict. With an autoincrement in this case, you would have to use a composite key, where the other half of the key identifies the originating location, or you would have to modify the IDs as you imported data into the master database.

Efficient way to store content translations?

Suppose you have quite a large number (100k+) of objects available and can provide this data (e.g. the name) in 20+ languages. What is an efficient way to store/handle this data in a SQL database?
The obvious way to do that looks like this - however, are there other ways which make more sense? I'm a bit worried about performance.
CREATE TABLE "object" (
"id" serial NOT NULL PRIMARY KEY
);
CREATE TABLE "object_name" (
"object_id" integer NOT NULL REFERENCES "object" ("id")
"lang" varchar(5) NOT NULL,
"name" varchar(50) NOT NULL
);
As for usage, the user will only ever select one language, and that will result in potentially large joins over the object_name table.
Premature optimization or not, I'm interested in other approaches, if only to gain some peace of mind that the obvious solution isn't a very stupid one.
To clarify: the actual model is way more complicated. That's just the pattern identified so far.
If you have a combined key on (object_id, lang), there shouldn't be any expensive joins, just a single indexed lookup, right? (Try EXPLAIN SELECT to be sure.)
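A sketch of that combined key, plus the single-probe lookup it enables:

ALTER TABLE "object_name"
    ADD PRIMARY KEY ("object_id", "lang");

-- One object in one language is then a single index lookup:
SELECT "name"
FROM "object_name"
WHERE "object_id" = 42 AND "lang" = 'en';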
In my own projects, I don't translate at the DB level. I let the user (or the OS) give me a lang code and then I load all the texts in one go into a hash. The DB then sends me IDs for that hash and I translate the texts the moment I display them somewhere.
Note that my IDs are strings, too. That way, you can see which text you're using (compare "USER" with "136" -- who knows what "136" might mean in the UI without looking into the DB?).
[EDIT] If you can't translate at the UI level, then your DB design is the best you can aim for. It's as small as possible, easy to index and joins don't take a lot.
If you want to take it one step further, and you can generate the SQL queries at the app level, you can consider creating views (one per language) and then using the views in the joins, which would give you a way to avoid the two-column join. But I doubt that such a complex approach will have a positive ROI. A sketch follows below.
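The per-language view idea might look like this (the view name is made up):

CREATE VIEW object_name_en AS
    SELECT "object_id", "name"
    FROM "object_name"
    WHERE "lang" = 'en';

-- Joins then need only the one-column condition:
SELECT o."id", n."name"
FROM "object" o
JOIN object_name_en n ON n."object_id" = o."id";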
Have you considered using multiple tables, one for each language?
It will cost a bit more in terms of coding complexity, but you will be loading/accessing only one table per language, whose metadata will be smaller and therefore more time-efficient (possibly also space-wise, as you won't have a "lang" column on each row).
Also, if you really want one-table-to-rule-them-all, you can create a view and join them :)
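That might look like this (a sketch; only two languages shown):

CREATE TABLE object_name_en (
    object_id integer NOT NULL PRIMARY KEY REFERENCES "object" ("id"),
    name      varchar(50) NOT NULL
);
CREATE TABLE object_name_de (
    object_id integer NOT NULL PRIMARY KEY REFERENCES "object" ("id"),
    name      varchar(50) NOT NULL
);

-- The one view to rule them all:
CREATE VIEW object_name_all AS
    SELECT 'en' AS lang, object_id, name FROM object_name_en
    UNION ALL
    SELECT 'de', object_id, name FROM object_name_de;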
In addition to what Wim wrote, the OBJECT table in your case is useless. There's no need for such a table, since it does not store any information not contained in the OBJECT_NAME table.

Oracle 9i: How can I determine, using metadata, whether or not an index is clustered?

The question pretty much sums this up, but I'll provide some more details.
I can almost safely assume that any primary key index in an Oracle database is clustered. But I'm not one to assume. Besides, a user might have created a clustered index that wasn't the primary key. If that's the case, I'd really like to know.
So, in the interests of being really, really thorough, I'd like to remember (not that I forgot or anything) how to determine, from the Oracle metadata, whether or not an index is clustered.
(And, as usual, Google was like rooting through a landfill looking for the vintage Action Comics #1 that your mom threw out because she thought it was useless at the time.)
Thanks!
Oracle does not have the concept of "clustered indexes" the way SQL Server does. In general, Oracle tables are "heaps", with the data stored in no particular order. There is a special type of table called an INDEX ORGANIZED table, which is (as its name suggests) a table that is organized like an index. However, in Oracle most tables are not index organized, whereas my understanding is that most tables in SQL Server do have a clustered index.
Do not be tempted to declare all your Oracle tables as index organized in an attempt to emulate SQL Server; what is right for one DBMS is not necessarily right for another. I suggest you read the Oracle Database Concepts guide to get to know how Oracle works.
Index organized tables are identified by IOT_TYPE = 'IOT' in ALL_TABLES and USER_TABLES.
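For example, to list the index-organized tables in your own schema:

SELECT table_name, iot_type
FROM   user_tables
WHERE  iot_type = 'IOT';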
What is known as a clustered index in SQL Server world is in Oracle world called an Index Organized Table. Table metadata is available in the all_tables or user_tables system views described here. My guess after skimming that link is that you can determine that a table is index-organized by checking whether the IOT_TYPE column is non-null.