GUID in databases other than SQL Server - sql

Question: I'm planning the database for one of my programs at the moment.
I intend to use ASP.NET MVC for the user backend, with the database running on Linux and/or Windows.
Now, even if I only made it for Windows, I would have to take into account that different customers use different database systems. I figured that if I use NHibernate, I can put everything in the code and it works on all major databases, such as Oracle/Sybase/MS/PostGre/MySQL/Firebird.
My problem now is GUIDs. SQL Server uses GUIDs, while the rest use integer auto-increment as primary keys. While auto-increment is better in theory, it creates problems keeping multiple databases in sync, or problems when manually changing things, which requires CSV import/export...
Now, because of the inherent problems with auto-IDs in practice, I like the GUID system better. And since a GUID is a 36-character string, I could use varchar(36) as a primary key, but a varchar as a GUID might just not be an ideal solution...
How would you solve this problem? What do you use as a primary key?
Or how do you avoid the auto-increment problems, say, inserting a CSV file without changing the auto-IDs...

A Guid key using the guid.comb generator is usable in any database, even one that doesn't have Guid as a native type.
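For illustration, a minimal sketch of what the same guid.comb-generated key could map to on each side; the table and column names are just examples:

    -- SQL Server: native GUID type
    CREATE TABLE Customer (
        CustomerId UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
        Name NVARCHAR(200) NOT NULL
    );

    -- A database without a native GUID type (e.g. MySQL or Firebird):
    -- store the same value as a fixed-length string (or BINARY(16) to save space)
    CREATE TABLE Customer (
        CustomerId CHAR(36) NOT NULL PRIMARY KEY,
        Name VARCHAR(200) NOT NULL
    );

NHibernate generates the value in the application, so the database column only has to be able to hold it.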

You could also consider generating a primary key which is a combination of an auto-increment (i.e. setting up a sequence) and a unique identifier of the machine it was generated on, maybe using the MAC address.
See this for a discussion.
This way you have a locally unique (thanks to the sequence) ID which is also globally unique (thanks to the MAC address part).
I know, I know, you can spoof a MAC address but it's up to you to decide if this is really a risk in your domain. Also, the ability to spoof it could be handy when you test your code.
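As a rough illustration (PostgreSQL syntax, with invented table names): pair a per-installation ID with a local sequence, and the combination stays unique across all databases.

    CREATE SEQUENCE order_seq;

    CREATE TABLE orders (
        db_id    INTEGER NOT NULL,                      -- assigned per installation (or derived from the MAC address)
        local_id BIGINT  NOT NULL DEFAULT nextval('order_seq'),
        payload  TEXT,
        PRIMARY KEY (db_id, local_id)                   -- globally unique as a pair
    );

    -- Every installation inserts with its own db_id, so rows from different
    -- databases can later be merged without key collisions.
    INSERT INTO orders (db_id, payload) VALUES (42, 'example row');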
Please explain in more detail what happens when a new customer DB is born. Will it be registered on the server? If yes, you can assign a DB ID on the server and use it in lieu of the MAC address: just assign a number to each new DB and use it along with the sequence.
Basically, if you want an "unique DB instance ID" to avoid "table id" collisions, you have only two choices:
1) Server assigns the DB ID whenever a new DB is added
2) Client auto-generates a unique ID, which usually means using the MAC address, either "raw" or processed somehow.
I honestly can't see alternatives given your current description of your problem.

Oracle and PostgreSQL support GUIDs as well, so there is no need to use sequences there (and of course Diego is right: if you use your own algorithm to create GUIDs, you can always store your own generated GUID in a varchar column).
Note that it's spelled PostgreSQL, never PostGre.
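As a rough example in PostgreSQL (gen_random_uuid() is built in from version 13 onwards; older versions need the pgcrypto or uuid-ossp extension):

    CREATE TABLE customer (
        customer_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        name        text NOT NULL
    );

    -- Or supply the GUID yourself, e.g. a comb GUID generated in application code
    INSERT INTO customer (customer_id, name)
    VALUES ('3f2504e0-4f89-41d3-9a0c-0305e82c3301', 'Example Corp');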

I have never had any trouble using Guids. We used Guid.Comb in a system with many records (millions) and had no trouble because of the Guids themselves. The selling point for me is that I can generate the IDs before I persist something to the database. Even on the client, which is very helpful in CQRS scenarios.
The only thing that I think you should also consider is human readability. It's hard to look at the database and match records in, let's say, a master/detail scenario.
And a note on Firebird... a UUID is stored as octets, and most clients that I've used to manage the database can't represent those in a decent format, so it's usually just displayed as a couple of characters (probably by decoding the byte array as a string). I don't know about other providers, though. SQL Server Management Studio, for example, shows them just fine.

Related

Database optimisation

I'm starting a web application that will be used by a lot of companies (over 20K), and, most importantly, a lot of information will be recorded daily. I would like your advice on the following idea: create a database for each company and do SQL queries like this:
select * from enterprisedb1.tablename;
select * from enterprisedb2.tablename2 where enterprisedb2.tablename2.col='foo'
Please, I need your advice; I can't find anything on Google.
If you are selling this to multiple clients then it might come down to separation of their data.
On the one hand everything for the app is in the one database for each client, and provided you get the connection string right you probably don't need to ever specify the company name again for the rest of the app. No more "where customer=123" on every single query.
Also means a client could be deleted, backed up, moved, audited, whatever in a completely independent manner.
It also means there is no risk of a developer or a query accidentally doing cross-client things. So you can even open up generic query access that still can't accidentally cross a client-to-client border. And security setup will be simpler.
But if you have a million clients you do end up with a lot of databases. How well this works will depend on all sorts of things, including your database of choice.
You also end up having multiple copies of reference data unless you create an additional database "common" or something like that.
It's going to be very much a "depends" answer, but those are a few things to consider.
I suggest using common tables shared by all companies. It will be easier to manage and easier to understand.
Create one table for company data and reference its integer key in the other metadata tables. For good performance, indexes and queries must be well formed.
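A minimal sketch of that shared-schema approach (the table and column names are made up):

    CREATE TABLE company (
        company_id INT PRIMARY KEY,
        name       VARCHAR(200) NOT NULL
    );

    CREATE TABLE invoice (
        invoice_id INT PRIMARY KEY,
        company_id INT NOT NULL REFERENCES company (company_id),
        created_on DATE NOT NULL,
        total      DECIMAL(12,2) NOT NULL
    );

    -- Every query filters on company_id, so keep it indexed
    CREATE INDEX ix_invoice_company ON invoice (company_id);

    SELECT * FROM invoice WHERE company_id = 123;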

Degenerate a single SQL table into multiple domain tables

In short: I have a client who wishes to be able to add domain tables without adding SQL tables.
I am working with an application in which data is organized and made available with a PostgreSQL catalogue. What I mean by catalogue is that the database holds the path to the actual data file(s) as well as some metadata.
Adding a new table means that the (Java class of the) client application has to be updated. This is a costly process for the client, who wants us to find a way to let him add new kinds of data to the catalogue without having to change the schema.
I don't have many more specifics about the DB itself and its configuration, as I'm usually mostly a client of the said DB.
My idea to solve this was to have a generic table with the most often used columns (like date, comment, etc.) and a column containing a domain key. The domain key would be used by the client application to request the kind of generic data that is needed (and would have no meaning whatsoever to the DB provider). Adding metadata could be done with a companion file within the catalogue, and further filtering would have to be done on the client side.
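Roughly, I am imagining something like this (the column names are just placeholders):

    CREATE TABLE catalogue_entry (
        entry_id   BIGSERIAL PRIMARY KEY,
        domain_key VARCHAR(50) NOT NULL,   -- meaningful only to the client application
        entry_date TIMESTAMP   NOT NULL DEFAULT now(),
        file_path  TEXT        NOT NULL,   -- path to the actual data file
        comment    TEXT
    );

    -- The client requests one kind of generic data by its domain key
    CREATE INDEX ix_catalogue_domain ON catalogue_entry (domain_key);

    SELECT * FROM catalogue_entry WHERE domain_key = 'sea-surface-temperature';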
Question: as I am by no means an SQL expert, I would like to know if this is an acceptable solution and what limitations I could be facing. I'm thinking of performance, data volume, etc. Or maybe a different approach is advisable?
Regarding expected volume, for a single domain data type it could be around 30 new entries per day.

Storing database version in the database itself

I'm writing a program that uses an H2 database to store data.
The database will be evolving as we add features to our software, but we still want users to be able to use an older version of the database with a newer version of the program. This way the program could automatically upgrade the database to the newer version (maybe asking first for confirmation from user).
To write this "database upgrader" we need to store the database version inside the database itself, so that it is possible to just move the database file (we're using the file mode of the H2 database engine).
We tried doing something like this:
CREATE TABLE configuration (databaseVersion INT NOT NULL);
but this would mean having a table where only a single row is ever used, and nothing enforces that without explicitly checking the row count.
Is there any better way to do this?
Thanks in advance for your help.
I think this is a good solution, if you just need to persist the database version.
Sometimes you need to persist more than one such 'global' setting, for example if your application consists of multiple modules and each module has its own version. Or other things, like the location of the last backup. What I usually use is a settings table with key/value pairs, where both the key (the primary key of that table) and the value are of type varchar.
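A minimal sketch of such a settings table, with the database version stored as one of its entries (the other keys are just examples):

    CREATE TABLE settings (
        name  VARCHAR(100) NOT NULL PRIMARY KEY,
        value VARCHAR(255) NOT NULL
    );

    INSERT INTO settings (name, value) VALUES ('databaseVersion', '3');
    INSERT INTO settings (name, value) VALUES ('lastBackupPath', '/backups/2011-08-01.zip');

    -- On startup, the upgrader reads the version and runs any missing migration scripts
    SELECT value FROM settings WHERE name = 'databaseVersion';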

What are the Best Practices to follow while creating a data-dictionary?

I have a large and complex SQL Server 2005 DB used by multiple applications. I want to create a data dictionary for maintaining not only my DB objects but also cross-referencing them against the applications that use a specific object.
For example, if a stored procedure is used by 15 different applications, I want to record that additional data too.
What are the key elements to keep in mind so that I get an efficient and scalable data dictionary?
So, I recently helped to build a data dictionary for a very large product. We were dealing with documenting more than one-thousand tables using a change request process. I can send you a scrubbed version of the spreadsheet we used if you want. Basically, we captured the following:
Column Name
Data Type
Length
Scale (for decimals)
Whether the column is custom for the application(s) or a default column
Which application(s)/component(s) the column is used in
Release the column was introduced in
Business definition
We also captured information about who requested the addition, their contact information, etc. Our primary focus was on business definition, and clearly identifying why a column was being used or created.
We didn't have stored procedures in our solution, but bear in mind that these would be pretty easy to add to the system.
We used Access for our front-end, even though SQL Server was on the back end. It made it pretty easy for us to build out a rich user interface without much work, using the schema we had already built out.
Hope this helps you get started--feel free to ask if you have additional questions.
I've always been a fan of using the 'extended properties' within SQL Server for storing this kind of meta data. In this way the description of each object lives alongside the object and is accessible by anyone with access to the database itself. I'm sure there are also tools out there that can read these extended properties and turn them into a nicely formatted document.
As far as being "scalable", I don't know of any issues related to adding large amounts of data as extended properties; or I should say I've never had any issues with this.
You can set these extended properties using the SQL Server Management Studio 'Properties' dialog for each table/proc/function/etc., and you can also use 'sp_addextendedproperty'.
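For example (the object names are just placeholders):

    -- Attach a description to a column of dbo.Customer
    EXEC sys.sp_addextendedproperty
        @name = N'MS_Description',
        @value = N'Customer''s billing address, used by the Invoicing and CRM apps',
        @level0type = N'SCHEMA', @level0name = N'dbo',
        @level1type = N'TABLE',  @level1name = N'Customer',
        @level2type = N'COLUMN', @level2name = N'BillingAddress';

    -- Read the descriptions back
    SELECT objname, value
    FROM fn_listextendedproperty(N'MS_Description', N'SCHEMA', N'dbo',
                                 N'TABLE', N'Customer', N'COLUMN', NULL);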

Ideas for Combining Thousand Databases into One Database

We have a SQL server that has a database for each client, and we have hundreds of clients. So imagine the following: database001, database002, database003, ..., database999. We want to combine all of these databases into one database.
Our thoughts are to add a siteId column, 001, 002, 003, ..., 999.
We are exploring options to make this transition as smoothly as possible. And we would LOVE to hear any ideas you have. It's proving to be a VERY challenging problem.
I've heard of a technique that would create a view that would match and then filter.
Any ideas guys?
Create a client database id for each of the client databases. You will use this id to keep the data logically separated. This is the "site id" concept, but you can use a derived key (identity field) instead of manually creating these numbers. Create a table that has database name and id, with any other metadata you need.
The next step would be to create an SSIS package that gets the ID for the database in question and adds it to the tables that have to have their data separated out logically. You then can run that same package over each database with the lookup for ID for the database in question.
After you have a unique ID for the data and have imported it, you will have to alter your apps to fit the new schema (actually before, or you are pretty much screwed).
If you want to do this in steps, you can create views or functions in the different "databases" so the old client can still hit the client's data, even though it has been moved. This step may not be necessary if you deploy with some downtime.
The method I propose is fairly flexible and can be applied to one client at a time, depending on your client application deployment methodology.
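A rough sketch of the metadata table and the per-table change (the names are only illustrative):

    -- One row per client database; the surrogate id becomes the "site id"
    CREATE TABLE ClientDatabase (
        ClientDatabaseId INT IDENTITY(1,1) PRIMARY KEY,
        DatabaseName     SYSNAME NOT NULL UNIQUE,
        MigratedOn       DATETIME NULL
    );

    -- Each consolidated table gets that id added to it
    ALTER TABLE dbo.Orders ADD SiteId INT NOT NULL
        CONSTRAINT DF_Orders_SiteId DEFAULT (0);

    -- The SSIS package (or a plain INSERT ... SELECT) stamps the rows as they are copied
    INSERT INTO Consolidated.dbo.Orders (SiteId, OrderId, CustomerId, OrderDate)
    SELECT 42, OrderId, CustomerId, OrderDate
    FROM   database042.dbo.Orders;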
Why do you want to do that?
You can read about Multi-Tenant Data Architecture and also listen to SO podcast #19 (around the 40-50 minute mark) for a discussion of this design.
The "site-id" solution is what's done.
Another possibility that may not work out as well (but is still appealing) is multiple schemas within a single database. You can pull common tables into a "common" schema and leave the customer-specific stuff in customer-specific schemas. In some database products, however, each schema is -- effectively -- a separate database. In other products (Oracle and DB2, for example) you can easily write queries that work across multiple schemas.
Also note that -- as an optimization -- you may not need to add siteId column to EVERY table.
Sometimes you have a "contains" relationship. It's a master-detail FK, often defined with a cascade delete so that detail cannot exist without the parent. In this case, the children don't need siteId because they don't have an independent existence.
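For instance, with hypothetical tables, only the parent carries SiteId because the detail rows are reachable only through their parent:

    CREATE TABLE OrderHeader (
        OrderId   INT PRIMARY KEY,
        SiteId    INT NOT NULL,         -- identifies the client
        OrderDate DATE NOT NULL
    );

    CREATE TABLE OrderLine (
        OrderId INT NOT NULL,           -- no SiteId needed: a line only exists under its order
        LineNo  INT NOT NULL,
        Sku     VARCHAR(40) NOT NULL,
        Qty     INT NOT NULL,
        PRIMARY KEY (OrderId, LineNo),
        FOREIGN KEY (OrderId) REFERENCES OrderHeader (OrderId) ON DELETE CASCADE
    );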
Your first step will be to determine if these databases even have the same structure. Even if you think they do, you need to compare them to make sure they do. Chances are there will be some that are customized or missed an upgrade cycle or two.
Now, depending on the number of clients and the number of records per client, your tables may get huge. Are you sure this will not create a performance problem? At any rate you may need to take a fresh look at indexing. You may need a much more powerful set of servers, and you may also need to partition by client anyway for performance.
Next, yes each table will need a site id of some sort. Further, depending on your design, you may have primary keys that are now no longer unique. You may need to redefine all primary keys to include the siteid. Always index this field when you add it.
Now all your queries, stored procs, views, and UDFs will need to be rewritten to ensure that the siteid is part of them. Pay particular attention to any dynamic SQL. Otherwise you could be showing client A's information to client B. Clients don't tend to like that. We brought a client from a separate database into the main application one time (when they decided they didn't want to keep paying for a separate server). The developer missed just one place where client_id had to be added. Unfortunately, that sent emails to every client concerning this client's proprietary information, and to make matters worse, it was a nightly process that ran in the middle of the night, so it wasn't known about until the next day. (The developer was very lucky not to get fired.) The point is: be very, very careful when you do this and test, test, test, and test some more. Make sure to test all the automated behind-the-scenes stuff as well as the UI stuff.
What I was explaining in Florence towards the end of last year applies if you have to keep the database names and the logical layer of the database the same for the application. In that case you'd do the following:
Collapse all the data into consolidated tables in one master, consolidated database (hereafter referred to as the consolidated DB).
Those tables would have to have an identifier like SiteID.
Create the new databases with the existing names.
Create views with the old table names which use row-level security to query the tables in the consolidated DB, but using the SiteID to filter.
Set up the databases for cross-database ownership chaining so that the service accounts can't "accidentally" query the base tables in the consolidated DB. Access must happen through the views or through stored procedures and other constructs that will enforce row-level security. Now, if it's the same service account for all sites, you can avoid the cross DB ownership chaining and assign the rights on the objects in the consolidated DB.
Rewrite the stored procedures to either handle the change (since they are now referring to views and don't know to hit the base tables and include SiteID) or use INSTEAD OF triggers on the views to intercept update requests and put the appropriate site-specific information into the base tables.
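A stripped-down sketch of one such view and trigger (database, table, and column names are invented):

    -- Inside the now-empty per-client database, a view stands in for the old table
    CREATE VIEW dbo.Orders
    AS
    SELECT o.OrderId, o.CustomerId, o.OrderDate
    FROM   Consolidated.dbo.Orders AS o
    WHERE  o.SiteId = 42;   -- this client's SiteID
    GO

    -- An INSTEAD OF trigger fills in SiteId on writes that come through the view
    CREATE TRIGGER dbo.Orders_Insert ON dbo.Orders
    INSTEAD OF INSERT
    AS
    INSERT INTO Consolidated.dbo.Orders (SiteId, OrderId, CustomerId, OrderDate)
    SELECT 42, i.OrderId, i.CustomerId, i.OrderDate
    FROM   inserted AS i;
    GO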
If the data is large you could look at using a partitioned view. This would simplify your access code, as all you'd have to maintain is the view; however, if the data is not large, just add a column to identify the customer.
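A partitioned view along those lines might look roughly like this (assuming each member table carries a CHECK constraint on SiteId; the names are made up):

    -- Member tables, each holding one slice of the data
    CREATE TABLE dbo.Orders_Site1 (
        SiteId  INT NOT NULL CHECK (SiteId = 1),
        OrderId INT NOT NULL,
        Total   DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (SiteId, OrderId)
    );
    CREATE TABLE dbo.Orders_Site2 (
        SiteId  INT NOT NULL CHECK (SiteId = 2),
        OrderId INT NOT NULL,
        Total   DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (SiteId, OrderId)
    );

    -- The view unions them; queries that filter on SiteId touch only the relevant member
    CREATE VIEW dbo.Orders AS
    SELECT SiteId, OrderId, Total FROM dbo.Orders_Site1
    UNION ALL
    SELECT SiteId, OrderId, Total FROM dbo.Orders_Site2;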
Depending on what the data is and your security requirements the threat of cross contamination may be a show stopper.
Assuming you have considered this and deem it "safe enough", you may need/want to create VIEWs or impose some other access control to prevent customers from seeing each other's data.
IIRC a product called "Trusted Oracle" had the ability to partition data based on such a key (about the time Oracle 7 or 8 was out). The idea was that any given query would automagically have "and sourceKey = #userSecurityKey" (or some such) appended. The feature may have been rolled into later versions of the popular commercial product.
To expand on Gregory's answer, you can also make a parent SSIS package that calls the package doing the actual moving, inside a Foreach Loop container.
The parent package queries a config table and puts the result in an object variable. The Foreach Loop then uses this recordset to pass variables to the child package, such as the database name and any other details the package might need.
Your table could list all of your client databases and have a flag to mark when you are ready to move them. This way you are not sitting around running the SSIS package on 32,767 databases. I'm hooked on the Foreach Loop in SSIS.