When is SQLite's manifest typing useful?

SQLite uses something that its authors call "manifest typing", which essentially means that SQLite is dynamically typed: you can store a varchar value in an "int" column if you want to.
This is an interesting design decision, but whenever I've used SQLite, I've used it like a standard RDBMS and treated the types as if they were static. Indeed, I've never even wished for dynamically typed columns when designing databases in other systems.
So, when is this feature useful? Has anybody found a good use for it in practice that could not have been done just as easily with statically typed columns?

It really just makes types easier to use. You don't need to worry about how big a field needs to be at the database level anymore, or how many digits your integers can have. More or less, it is a 'why not?' thing.
On the other hand, static typing in SQL Server allows the system to search and index better, in some cases much better, but for half of all applications I doubt that database performance improvements would matter, or their performance is 'poor' for other reasons (temp tables being created on every select, exponential selects, etc.).
I use SQLite all the time for my .NET projects as a client cache because it is just so easy to use. Now if they can only get it to handle GUIDs the same way SQL Server does, I would be a happy camper.

Dynamic typing is useful for storing things like configuration settings. Take the Windows Registry, for example. Each key is a lot like having an SQLite table of the form:
CREATE TABLE Settings (Name TEXT PRIMARY KEY, Value);
where Value can be NULL (REG_NONE) or an INTEGER (REG_DWORD/REG_QWORD), TEXT (REG_SZ), or BLOB (REG_BINARY).
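Since the Value column was declared with no type at all, SQLite will store whatever each row happens to need. A quick sketch (the setting names are invented for illustration):
INSERT INTO Settings (Name, Value) VALUES ('WindowCount', 3);            -- stored as INTEGER
INSERT INTO Settings (Name, Value) VALUES ('ProductName', 'MyApp');      -- stored as TEXT
INSERT INTO Settings (Name, Value) VALUES ('LicenseKey', x'DEADBEEF');   -- stored as BLOB
INSERT INTO Settings (Name, Value) VALUES ('Placeholder', NULL);         -- no value at all
SELECT Name, typeof(Value) FROM Settings;                                -- integer, text, blob, null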
Also, I'll have to agree with the Jasons about the usefulness of not enforcing a maximum size for strings. Much of the time those limits are purely arbitrary, and you can count on someday finding a 32-byte string that needs to be stored in your VARCHAR(30).


SQL Compatibility Chart (esp data types)

So, as it happens, I'm working on some code which will end up being used on different SQL database servers at the same time.
Although the SQL code is different depending on the server, the data types and columns are not.
Therefore, I need to know which data types are common to (at least most) SQL database servers.
As a starting point, I have the following types:
byte, char, float, int, text, varchar, blob
Please note that spelling is quite important, since the data type name will end up in the query as-is (e.g.: although both int and integer are supported, I need the common one).
So, the question is, does anyone know of a chart comparing data type compatibility between SQL servers? Or perhaps someone who has done some research in the field?
As far as bias goes, I'm obviously biased to a particular RDBMS, so no need for answers on which RDBMS happens to be better. Let's keep this focused and on topic, ok?
I think you will end up writing specific, case-by-case SQL statements for each type of database server. Certainly I did.
I've been in your situation, including having the intention to write database-agnostic code, but in the long run it just does not work. One database will not, for example, handle multi-byte strings while another will demand them (e.g., SQL Server CE), which forces you to choose between VARCHAR and NVARCHAR columns. Some databases will support multi-byte strings, but with awful performance. One will use VARCHAR2 (Oracle), and everyone else will use VARCHAR. One will handle BLOBs one way while another will do so differently. Don't get me started on date data types, either.
Rather than find the magic subset of the SQL language and data types that works in all databases, you would be wiser to look for a data access method/library that can hide the differences for you (maybe some ORM library that lets you create DB objects as well as access them?)
Like I said, I have been (and still am) in your situation of having to support multiple databases, and the best solution for me is to write optimal code for each database, rather than trying to find SQL data types and code that work in all of them (I wasn't able to, at least not to a satisfactory level).
Also, you will be able to squeeze more performance out of each DB if you create separate SQL text for each database (e.g., the performance-related parameters you can specify when creating an Oracle table do not apply at all when creating a table in any other database).
I say, do not fight the syntax differences in the different databases; you will not win. It's a better idea to accept those differences and use them to your advantage as much as possible.
I'd look into the ANSI SQL standard specification and use the data types specified there. A book like this may help you.
The vendors all have good documentation, so I would just read up on their data types; that would probably have all the info you need. The only other information I could find is pretty old.
Hope that helps.
Edit: Just another thought... you could use the strategy pattern for your SQL; that way it wouldn't matter if it differed between servers, and you could use each database's more advanced features. Though this way you'd have more work to do and more to maintain :/

Which database can I safely use a GUID as a primary key in, besides SQL Server?

The reason I want to use a GUID is that, in the event I have to split the database in two, I won't have primary keys that overlap across the databases. So if I use a GUID, there won't be any overlap. I also want to use the GUID in the URL, so it will need to be indexed.
I will be using ASP.NET with C# on my web server.
Postgres has a UUID type. MySQL has a UUID function. Oracle has a SYS_GUID function.
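Roughly, the sketches below show what that looks like in each engine (table and column names are invented; exact generation functions depend on version and installed extensions):
-- PostgreSQL: a native uuid column type
CREATE TABLE item (id UUID PRIMARY KEY, name TEXT);
-- MySQL: no native type here, so store the 36-character text form and fill it with UUID()
CREATE TABLE item (id CHAR(36) PRIMARY KEY, name VARCHAR(255));
INSERT INTO item (id, name) VALUES (UUID(), 'example');
-- Oracle: SYS_GUID() returns RAW(16), which works as a compact key
CREATE TABLE item (id RAW(16) DEFAULT SYS_GUID() PRIMARY KEY, name VARCHAR2(255));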
As others have said, you can use GUIDs/UUIDs in pretty much any modern DB. The algorithm for generating a GUID is pretty straightforward, and you can be reasonably sure that you won't get dupes; however, there are some considerations.
+) Although GUIDs are generally representations of 128-bit values, the actual format used differs from implementation to implementation - you may want to consider normalizing them by removing non-significant characters (usually dashes or spaces).
+) To absolutely ensure uniqueness you can also append a value to the GUID. For example, if you're worried about MS and Oracle GUIDs colliding, add "MS" to the former and "Or" to the latter - now even if the GUIDs themselves do collide, the keys won't.
As others have mentioned, however, there is a potentially severe price to pay here: your keys will be large (128 bits) and won't index very well (although this is somewhat dependent on the implementation).
The technique works very well for small databases (especially those where the entire dataset can fit in memory), but as DBs grow you'll definitely have to accept a performance trade-off.
One thing you might consider is a hybrid approach. Without more information it's hard to really know what you're trying to do so these might not help:
1) Remember that primary keys don't have to be a single column - you can have a simple numeric key to identify your rows and another column, containing a single value, that identifies the database that hosts the data or created the key. Making the primary key an aggregate of both columns lets the index work with less complex values and should be significantly faster (see the sketch after this list).
2) You can "fake it" by constructing the key as a concatenated field (as in the above idea of appending a DB identifier to the key). So your key would be a simple number followed by some DB identifier (perhaps a GUID for each DB).
Indexing such a value (since the values would still be sequential) should be much faster.
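A rough sketch of both ideas (all names are invented for illustration; auto-increment syntax varies by engine, so it is omitted here):
-- 1) Composite primary key: a small per-database identifier plus a simple sequential number
CREATE TABLE document (
    source_db  SMALLINT NOT NULL,      -- which database created/hosts this row
    doc_id     INT      NOT NULL,      -- plain sequential key within that database
    title      VARCHAR(200),
    PRIMARY KEY (source_db, doc_id)
);
-- 2) "Faking it" with a concatenated key: a number followed by a DB identifier in one column,
--    e.g. '000123-DB01'; still cheap to index because values stay largely sequential
CREATE TABLE document2 (
    doc_key    VARCHAR(40) PRIMARY KEY,
    title      VARCHAR(200)
);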
In both cases you'll have some manual work to do if you ever do split the DB(s) - you'll have to update some keys with a new DB ID, but this would be a one-time, infrequent event. In exchange, you can tune your DB much better.
There are definitely other ways to ensure data integrity across multiple databases. Many enterprise DBMSs have built-in tools for clustering data across multiple servers or databases, some have special tools or design patterns that make it easier, etc.
In short, I would say that GUIDs are nice and simple and do what you want, but you should only consider them if either a) the dataset is small or b) the DBMS has specific features to optimize their use as keys (for example, sequential GUIDs). If the datasets are going to be very large, or if you're trying to limit DBMS-specific dependencies, I would play around more with optimizing a "key + identifier" strategy.
Most any RDBMS you will use can take any number and type of columns as a PK. So, if you're storing the GUID as a CHAR(n) for some length n, you should be fine. Now, I'm not sure if this is advisable, as I'm guessing indexing on CHARs is not as efficient as on integers.
Hope that helps.
I suppose you could store a GUID as an int128 as well.
Both MySQL and Postgres are known to support GUID datatypes (I believe it's called UUID, but it's the same thing).
Unless I have completely lost my memory, a properly designed 3rd+ normal form database schema does not rely on unique ints, or by extension GUIDs or UUIDs, for primary keys. Nor does it use intermediate lookup tables of ints/GUIDs/UUIDs to relate the tables containing the data.
You should grind your schema until it expresses the relations amongst tables of data in terms of the data in the tables, not auto-generated identifiers that have no intrinsic relationship to the data.
I freely grant that you may just possibly be doing something that really really requires GUIDs (or auto-increment integers) for primary keys. But I seriously doubt that is the case - it almost never is.
You can implement your own membership provider based on whatever database schema you choose to design. It's nowhere near as tricky as it may look at first.
google "roll your own membership provider" for plenty of pointers.
In my theoretical little world, you'd be able to do this with SQLite. You'd generate the GUID from .NET and write it to the SQLite database as a string. You could also index that field.
You do lose some of the index benefits because it'd be stored as a string, but it should be fully backwards compatible so that you could import/export to/from SQL Server.
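A minimal sketch of that idea (names and the sample GUID are invented); in SQLite a TEXT primary key gets its own automatic unique index, so lookups by the GUID string still use an index:
CREATE TABLE client_cache (
    id   TEXT PRIMARY KEY,   -- GUID written from .NET as a string, e.g. '3f2504e0-4f89-11d3-9a0c-0305e82c3301'
    data TEXT
);
-- Lookup by GUID uses the implicit unique index on the primary key
SELECT data FROM client_cache WHERE id = '3f2504e0-4f89-11d3-9a0c-0305e82c3301';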
From looking through the comments, it looks like you are trying to use a database other than MS SQL with the ASP.NET membership provider. As others have mentioned, you could roll your own provider to use a different DB; however, a quick Google search turned up a few ready-made options:
MySQL Provider
MySQL Provider 2
SQLite Provider
Hope these help
If you are using other MS technologies already, you should consider SQL Server Express.
http://www.microsoft.com/express/sql/default.aspx
It is a real implementation of MS SQL Server and it is free. It does have significant limitations, as you might imagine, but if your product can fit within those you get the support, developer community, and stability of SQL Server, plus a clear upgrade path if you need to grow.

Many-to-many relationship: use associative table or delimited values in a column?

Update 2009.04.24
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
I've seen delimited data used in commercial product databases (Ektron lol).
SQL Server even has an XML datatype, so that could be used for the same purpose as delimited fields.
/end Update
The application I'm designing has some many-to-many relationships. In the past, I've often used associative tables to represent these in the database. This has caused some confusion to the developers.
Here's an example DB structure:
Document
---------------
ID (PK)
Title
CategoryIDs (varchar(4000))
Category
------------
ID (PK)
Title
There is a many-to-many relationship between Document and Category.
In this implementation, Document.CategoryIDs is a big pipe-delimited list of CategoryIDs.
To me, this is bad because it requires use of substring matching in queries -- which cannot make use of indexes. I think this will be slow and will not scale.
With that model, to get all Documents for a Category, you would need something like the following:
select * from documents where categoryids like '%|' + #targetCategoryId + '|%'
My solution is to create an associative table as follows:
Document_Category
-------------------------------
DocumentID (PK)
CategoryID (PK)
This is confusing to the developers. Is there some elegant alternate solution that I'm missing?
I'm assuming there will be thousands of rows in Document. Category may be like 40 rows or so. The primary concern is query performance. Am I over-engineering this?
Is there a case where it's preferred to store lists of IDs in database columns rather than pushing the data out to an associative table?
Consider also that we may need to create many-to-many relationships among documents. This would suggest an associative table Document_Document. Is that the preferred design or is it better to store the associated Document IDs in a single column?
Thanks.
This is confusing to the developers.
Get better developers. That is the right approach.
Your suggestion IS the elegant, powerful, best practice solution.
Since I don't think the other answers said the following strongly enough, I'm going to do it.
If your developers 1) can't understand how to model a many-to-many relationship in a relational database, and 2) strongly insist on storing your CategoryIDs as delimited character data, then they ought to immediately lose all database design privileges. At the very least, they need an actual experienced professional to join their team who has the authority to stop them from doing something this unwise and who can give them the database design training they are completely lacking.
Last, you should not refer to them as "database developers" again until they are properly up to speed, as this is a slight to those of us who actually are competent developers & designers.
I hope this answer is very helpful to you.
Update
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
Delimited values are the wrong solution except in extremely rare cases. If individual values will ever be queried, inserted, deleted, or updated, that alone proves it was the wrong decision, because you have to parse and touch all the other values just to work with the desired one. By doing this you're violating first (!!!) normal form (this phrase should sound to you like an unbelievably vile expletive). Using XML to do the same thing is wrong, too. Storing delimited values or multi-value XML in a column could make sense when it is treated as an indivisible and opaque "property bag" that is NOT queried on by the database but is always sent whole to another consumer (perhaps a web server or an EDI recipient).
This takes me back to my initial comment. Developers who think violating first normal form is a good idea are very inexperienced developers in my book.
I will grant there are some pretty sophisticated non-relational data storage implementations out there using text property bags (such as Facebook(?) and other multi-million user sites running on thousands of servers). Well, when your database, user base, and transactions per second are big enough to need that, you'll have the money to develop it. In the meantime, stick with best practice.
It's almost always a big mistake to use comma separated IDs.
RDBMS are designed to store relationships.
"My solution is to create an associative table as follows ... This is confusing to the developers"
Really? This is database 101; if this is confusing to them, then maybe they need to step away from their wizard-generated code and learn some basic DB normalization.
What you propose is the right solution!!
The Document_Category table in your design is certainly the correct way to approach the problem. If it's possible, I would suggest that you educate the developers instead of coming up with a suboptimal solution (and taking a performance hit, and not having referential integrity).
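With that table in place, the query from the question becomes a plain, indexable join; a sketch (parameter syntax varies by database):
SELECT d.*
FROM Document d
JOIN Document_Category dc ON dc.DocumentID = d.ID
WHERE dc.CategoryID = @targetCategoryId;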
Your other options may depend on the database you're using. For example, in SQL Server you can have an XML column that would allow you to store your array in a pre-defined schema and then do joins based on the contents of that field. Other database systems may have something similar.
The many-to-many mapping you are doing is fine and normalized. It also allows for other data to be added later if needed. For example, say you wanted to add a time that the category was added to the document.
I would suggest having a surrogate primary key on the Document_Category table as well, and a UNIQUE(DocumentID, CategoryID) constraint if that makes sense to do so.
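A sketch of that suggestion in SQL Server syntax (the DateAdded column is the "time the category was added" example from the previous paragraph; auto-increment and default-date syntax differ in other engines):
CREATE TABLE Document_Category (
    ID          INT IDENTITY(1,1) PRIMARY KEY,                 -- surrogate key
    DocumentID  INT NOT NULL REFERENCES Document(ID),
    CategoryID  INT NOT NULL REFERENCES Category(ID),
    DateAdded   DATETIME NOT NULL DEFAULT GETDATE(),           -- when the category was attached
    CONSTRAINT UQ_Document_Category UNIQUE (DocumentID, CategoryID)
);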
Why are the developers confused?
The fact that this design is 'confusing to the developers' means you have under-educated developers. The associative table is the better relational database design - you should use it if at all possible.
If you really want to use the list structure, then use a DBMS that understands them. Examples of such databases would be the U2 (Unidata, Universe) DBMS, which are (or were, once upon a long time ago) based on the Pick DBMS. There are likely to be other similar DBMS providers.
This is the classic object-relational mapping problem. The developers are probably not stupid, just inexperienced or unaccustomed to doing things the right way. Shouting "3NF!" over and over again won't convince them of the right way.
I suggest you ask your developers to explain to you how they would get a count of documents by category using the pipe-delimited approach. It would be a nightmare, whereas the link table makes it quite simple.
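For comparison, the count with the link table is just a join and a GROUP BY, for example:
SELECT c.Title, COUNT(dc.DocumentID) AS DocumentCount
FROM Category c
JOIN Document_Category dc ON dc.CategoryID = c.ID
GROUP BY c.Title;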
The number one reason that my developers try this "comma-delimited values in a database column" approach is that they have a perception that adding a new table to address the need for multiple values will take too long to add to the data model and the database.
Most of them know that their workaround is bad for all kinds of reasons, but they choose this suboptimal method because they just can. They can do this and maybe never get caught, or they will get caught much later in the project when it is too expensive and risky to fix it. Why do they do this? Because their performance is measured solely on speed and not on quality or compliance.
It could also be, as on one of my projects, that the developers had a table to put the multi values in but were under the impression that duplicating that data in the parent table would speed up performance. They were wrong and they were called out on it.
So while you do need an answer to how to handle these costly, risky, and business-confidence damaging tricks, you should also try to find the reason why the developers believe that taking this course of action is better in the short and the long run for the project and company. Then fix both the perception and the data structures.
Yes, it could just be laziness, malicious intent, or cluelessness, but I'm betting most of the time developers do this stuff because they are constantly being told "just get it done". We on the data model and database design sides need to ensure that we aren't sending the wrong message about how responsive we can be to requests to fulfill a business requirement for a new entity/table/piece of information.
We should also see that data people need to be constantly monitoring the "as-built" part of our data architectures.
Personally, I never authorize the use of comma-delimited values in a relational database, because it is actually faster to build a new table than it is to build a parsing routine to create, update, and manage multiple values in a column and to deal with all the anomalies introduced when that data has embedded commas, too.
Bottom line: don't do comma-delimited values, but find out why the developers want to do it and fix that problem.

What should one map strings to in a database ORM?

Strings are unbounded, but it seems every normal relational database requires that a column declare its maximum length. This seems to be a rather significant discrepancy and I'm curious how typical ORMs handle this.
Using a 'text' column type would theoretically give you much more string-like storage, but as I understand it text columns are not queryable, or at least not efficiently (non-indexed?).
I'm thinking of using something like NHibernate perhaps, but my ORM needs are relatively simple, so if I can just write it myself it might save some bloat.
SQL Server, for instance, stores only the actual size of the data for variable-length types. For me, defining a large enough size for strings is sufficient that the user never notices the limits.
Examples:
name of a product, person etc: 500
Paths, Urls etc: 1000
Comments, free text: 2000 or even more
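As a rough sketch of what those sizes look like in a table definition (column names invented; NVARCHAR is SQL Server syntax, other engines differ):
CREATE TABLE Product (
    ID          INT PRIMARY KEY,
    Name        NVARCHAR(500),     -- names of products, people, etc.
    HomepageUrl NVARCHAR(1000),    -- paths, URLs
    Comments    NVARCHAR(2000)     -- free text; use more if needed
);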
NHibernate does not do anything with the size at runtime. You need to use some kind of validator, or let the database either truncate the value or throw an exception.
Quote: "my ORM needs are relatively simple". It's hard to say if NHibernate is overkill or not. Data access isn't generally that simple.
As a simple guide off the top of my head, take NHibernate if:
You have a fine-grained or complex domain model. You need to map inheritance.
You want your domain model somewhat independent from the database model.
You need some lazy loading features
You want to be database independent, eg. run it on SqlServer or Oracle
If you think that a class per table is what you need, you don't actually need an ORM.
To my knowledge, they don't really handle it. Either you let the ORM define the schema, in which case:
the ORM will decide the size, either from a default or as defined by your config;
or the schema is not defined by the ORM, in which case it just has to obey the existing rules: if you insert strings that are too large, you'll get errors from the DB.
I would stick to varchar-ish types, e.g. VARCHAR2 for Oracle or NVARCHAR for SQL Server, unless you're dealing with CLOBs.

Migrating from MySQL to arbitrary standards-compliant SQL2003 server

Is there an incantation of mysqldump or a similar tool that will produce a piece of SQL2003 code to create and fill the same databases in an arbitrary SQL2003 compliant RDBMS?
(The one I'm trying right now is MonetDB)
DDL statements are inherently database-vendor specific. Although they have the same basic structure, each vendor has their own take on how to define types, indexes, constraints, etc.
DML statements on the other hand are fairly portable. Therefore I suggest:
Dump the database without any data (mysqldump --no-data) to get the schema
Make necessary changes to get the schema loaded on the other DB - these need to be done by hand (but some search/replace may be possible)
Dump the data with extended inserts off and no create table (--extended-insert=0 --no-create-info)
Run the resulting script against the other DB.
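With extended inserts turned off, the data file comes out as one plain INSERT per row, roughly like the sketch below (table and values invented; MySQL's default backtick identifier quoting may still need a search/replace to standard double quotes before other databases accept it):
INSERT INTO "customer" VALUES (1, 'Alice', '2009-04-24 10:00:00');
INSERT INTO "customer" VALUES (2, 'Bob', '2009-04-23 16:30:00');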
This should do what you want.
However, when porting an application to a different database vendor, many other things will be required; moving the schema and data is the easy bit. Checking for bugs introduced, different behaviour and performance testing is the hard bit.
At the very least test every single query in your application for validity on the new database. Ideally do a lot more.
This one is kind of tough. Unless you've got a very simple DB structure with vanilla types (varchar, integer, etc.), you're probably going to get the best results by writing a migration tool. In a language like Perl (via the DBI), this is pretty straightforward. The program is basically an echo loop that reads from one database and inserts into the other. There are examples of this sort of code that Google knows about.
Aside from the obvious problem of moving the data, there is the more subtle problem of how some datatypes are represented. For instance, MS SQL's datetime field is not in the same format as MySQL's. Other datatypes like BLOBs may have a different capacity in one RDBMS than in another. You should make sure that you understand the datatype definitions of the target DB system very well before porting.
The last problem, of course, is getting application-level SQL statements to work against the new system. In my work, that's by far the hardest part. Date math seems especially DB-specific, while annoying things like quoting rules are a constant source of irritation.
Good luck with your project.
From SQL Server 2000 or 2005 you can have it generate scripts for your objects, but I am not sure how well they will transfer to other RDBMSs.
The generate script option is probably the easiest way to go. You'll undoubtedly have to do some search/replace on a few data types though.