Database Abstraction - supporting multiple syntaxes - sql

In a PHP project I'm working on, we need to create some DAL extensions to support multiple database platforms. The main pitfall is that different platforms have different syntaxes; notably, MySQL and MSSQL are quite different.
What would be the best solution to this?
Here are a couple we've discussed:
Class-based SQL building
This would involve creating a class that allows you to build SQL queries bit by bit. For example:
$stmt = new SQL_Stmt('mysql');
$stmt->set_type('select');
$stmt->set_columns('*');
$stmt->set_where(array('id' => 4));
$stmt->set_order('id', 'desc');
$stmt->set_limit(0, 30);
$stmt->exec();
It does involve quite a lot of lines for a single query though.
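A minimal sketch of the class-based idea, in Python rather than PHP to keep it short. The `SelectBuilder` class, its method names and the two dialect names are hypothetical, and the `?` placeholder style is an assumption about the driver. Only the paging clause differs between dialects here (`LIMIT` for MySQL, `OFFSET ... FETCH` for SQL Server 2012 and later); a real builder would also have to deal with quoting, functions and types.

```python
class SelectBuilder:
    """Hypothetical fluent builder that renders one SELECT for several dialects."""

    def __init__(self, dialect, table):
        if dialect not in ("mysql", "mssql"):
            raise ValueError("unsupported dialect: " + dialect)
        self.dialect = dialect
        self.table = table
        self.columns = "*"
        self.conditions = []  # "column = ?" fragments
        self.params = []      # values travel separately as bound parameters
        self.order_by = None
        self.paging = None    # (offset, count)

    def where(self, **equals):
        for column, value in equals.items():
            self.conditions.append(column + " = ?")
            self.params.append(value)
        return self

    def order(self, column, direction="asc"):
        if direction.lower() not in ("asc", "desc"):
            raise ValueError("direction must be asc or desc")
        self.order_by = column + " " + direction.upper()
        return self

    def limit(self, offset, count):
        self.paging = (int(offset), int(count))
        return self

    def to_sql(self):
        sql = "SELECT " + self.columns + " FROM " + self.table
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        if self.order_by:
            sql += " ORDER BY " + self.order_by
        if self.paging:
            offset, count = self.paging
            if self.dialect == "mysql":
                sql += " LIMIT %d, %d" % (offset, count)
            else:
                # SQL Server only accepts OFFSET/FETCH after an ORDER BY clause.
                if not self.order_by:
                    raise ValueError("mssql paging needs an ORDER BY")
                sql += " OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, count)
        return sql, list(self.params)


def build(dialect):
    # Same logical query as the PHP example, against an assumed "users" table.
    return (SelectBuilder(dialect, "users")
            .where(id=4).order("id", "desc").limit(0, 30).to_sql())


mysql_sql, mysql_params = build("mysql")
mssql_sql, mssql_params = build("mssql")
```

Chaining the calls keeps the line count closer to that of a hand-written query, which goes some way towards the verbosity concern above.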
SQL syntax reformatting
This option is much cleaner - it would read SQL code and reformat it based on the input and output languages. I can see this being a much slower solution as far as parsing goes however.

I'd recommend class-based SQL building, using Doctrine, Zend_Db or MDB2. Yes, it takes more lines to write simple selects, but at least you get to rely on a parser and don't need to re-invent the wheel.
Using any DBAL is a trade-off in speed, and not just in database execution: the first time you use any of those it will be more painful than when you are really familiar with it. Also, I'm almost 100% sure that the generated code is not the fastest possible SQL query, but that's the trade-off I meant earlier.
In the end it's up to you. Writing your own DBAL is certainly not impossible, though I wouldn't do it; the question remains whether you can actually save time and resources (in the long run) by implementing it yourself.

A solution could be to have different sets of queries for different platforms, keyed by IDs, something like:
MySql: GET_USERS = "SELECT * FROM users"
MsSql: GET_USERS = ...
PgSql: GET_USERS = ...
Then on startup you load the needed set of queries and refer to them by ID:
Db::loadQueries($platform);
$users = $db->query(GET_USERS);
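A rough sketch of the same idea in Python, assuming the query sets live in a plain dictionary. The `QUERIES` table, the `QueryCatalogue` class and the `GET_RECENT_USERS` ID are all made up for illustration. The `missing_ids` helper shows one way to catch a query that was added for one platform but forgotten for another.

```python
# Hypothetical per-platform query catalogues keyed by the same IDs.
QUERIES = {
    "mysql": {"GET_RECENT_USERS": "SELECT * FROM users ORDER BY id DESC LIMIT 10"},
    "mssql": {"GET_RECENT_USERS": "SELECT TOP 10 * FROM users ORDER BY id DESC"},
    "pgsql": {"GET_RECENT_USERS": "SELECT * FROM users ORDER BY id DESC LIMIT 10"},
}


class QueryCatalogue:
    """Loads the query set for one platform at startup."""

    def __init__(self, platform):
        try:
            self._queries = QUERIES[platform]
        except KeyError:
            raise ValueError("no query set for platform: " + platform)

    def get(self, query_id):
        return self._queries[query_id]


def missing_ids():
    """Return, per platform, the IDs that other platforms define but it lacks."""
    all_ids = set().union(*(q.keys() for q in QUERIES.values()))
    return {p: sorted(all_ids - set(q)) for p, q in QUERIES.items() if all_ids - set(q)}


catalogue = QueryCatalogue("mssql")
recent_users_sql = catalogue.get("GET_RECENT_USERS")
```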

Such a scheme would not take account of all the richness which SQL offers, so you would be better off with code-generated stored procs for all your tables for each DB.
Even if you use parametrized stored procs which are more database model-aware (i.e. they do joins or are user-aware and so are optimized for each vendor), that's still a great approach. I always view the database interface layer as providing more than just simple tables to the application, because that approach can be bandwidth-intensive and roundtrip wasteful.

If you have a set of backends that support it, I would agree that generating stored procedures to form a contract is the best approach. This approach, however, doesn't work if you have a backend with limited stored-procedure capabilities, in which case you build an abstraction layer to implement SQL, or generate target-specific SQL from an abstract/limited SQL syntax.

Related

Confused about the role of a query language

So, I haven't had any luck finding any articles or forum posts that explain how exactly a query language works in conjunction with a general-purpose programming language like C++ or VB. So I guess it won't hurt to ask >.<
Basically, I've been having a hard time understanding what the role of the query language is (we'll use SQL as the example query language and VB6 as the general-purpose language) if I'm creating a simple database query that fills a table with normal information (first name, last name, address, etc.). I somewhat know the steps in setting up a program like this using ADO objects for the connection and whatnot, but how do we decide which of the two languages gets used for certain things? Does VB6 specifically handle the basics like loops, if/else, and variable declarations, while SQL specifically handles things like connecting to the database and doing the searching, filtering and sorting? Is it possible to do certain general-purpose VB6 actions (loops or conditionals) in SQL syntax instead? Any help would be GREATLY appreciated.
SQL is a language for querying a database. SQL is an ISO standard; relational database vendors implement the standard and then add their own customizations. For example, SQL Server's dialect is called T-SQL and Oracle's is called PL/SQL. Both implement the ISO standard, so each will accept an identical query for a simple select like
select columnname from tablename where columnname=1
However, each has different syntax for string functions, date functions, etc.
The ISO SQL standard is, by design, not a full procedural language with looping, subroutines, etc., the way VB is.
However, each vendor has added capabilities to its own version to provide some of this functionality.
For example, both T-SQL and PL/SQL can "loop" through records using various constructs in their language.
There is also a distinction in working with data that many developers are not well attuned to: set-based operations vs. procedural ones.
Databases can work with procedural constructs but are often more performant with set-based ones. A developer who is not versed in this concept may end up creating a very inefficient query. Here's an example of this discussion.
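To make the set-based vs. procedural distinction concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `products` table and both function names are invented for the example, and SQLite stands in for whichever server you actually use. Both functions end with the same data, but the first issues one statement per row while the second issues a single statement for the whole set.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO products (category, price) VALUES (?, ?)",
        [("book", 10.0), ("book", 20.0), ("toy", 5.0)],
    )
    return conn


def raise_prices_row_by_row(conn):
    # Procedural: fetch every row, decide in client code, one UPDATE per row.
    rows = conn.execute("SELECT id, category, price FROM products").fetchall()
    for row_id, category, price in rows:
        if category == "book":
            conn.execute("UPDATE products SET price = ? WHERE id = ?", (price * 2, row_id))


def raise_prices_set_based(conn):
    # Set-based: one statement describes the whole change.
    conn.execute("UPDATE products SET price = price * 2 WHERE category = 'book'")


def prices(conn):
    return [p for (p,) in conn.execute("SELECT price FROM products ORDER BY id")]


loop_db, set_db = make_db(), make_db()
raise_prices_row_by_row(loop_db)
raise_prices_set_based(set_db)
```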
In any situation you have to weigh the pros and cons of where it is best to do this work.
I tend to favor using procedural constructs such as loops in the language I am using over SQL. I find it easier to maintain and the language I am using offers more powerful syntax for me to get the job done.
However, I keep both options as a tool in the toolbox. For example, I have written data conversion scripts in SQL and in this case I have used the looping constructs in SQL.
Usually programming languages are executed on the client side (or the app server), and query languages are executed on the DB server, so in the end it depends on where you want to put the work. Sometimes you can put a lot of the work on the client side by doing all the calculations in the programming language; other times you want to lean more on the DB server, and you end up using the query language or, even better, T-SQL, PL/SQL or whatever.
Relational databases are designed to manage data. In particular, they provide an efficient mechanism for managing memory, disk, and processors for large quantities of data. In addition, relational databases can handle multiple clients, guarantee transactional integrity, security, backups, persistence, and numerous other functions.
In general, if you are using an RDBMS with another language, you want to design the data structure first and then think about the API (applications programming interface) between the two. This is particularly true when you have an app/server relationship.
For a "simple" type of application, which uses a lot of data but with minimal or batch changes to it, you want to move as much of the processing into the database as is reasonable. Here are things you do not want to do:
Use queries to load things into arrays, and then do array manipulations at the language level. SQL provides joins for this.
Load data into an array and do manipulations and summaries on the array. SQL provides aggregations for this.
Save data into a file to have a backup. Databases provide backup mechanisms.
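As an illustration of the first two points, a small sketch with Python's built-in `sqlite3` module; the `customers` and `orders` tables are invented for the example. The join and the per-customer summary are both done by the database, so the client only receives the finished rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers (id, name) VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders (customer_id, total) VALUES (1, 10), (1, 15), (2, 7);
""")

# The JOIN replaces matching two arrays by hand; GROUP BY with COUNT/SUM
# replaces looping over an array to build the summary.
totals_by_customer = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.name
""").fetchall()
```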
If your data fits into an array or an Excel spreadsheet, it is often sufficient to get started with the data stored there. Only when you start to expand the needs (multiple clients, security, integration with other data) do the advantages of a database become more apparent.
These are just for guidance and to give you some ideas.
In terms of doing what where, do as much as is sensible in SQL (given it runs on a server) as you can.
So for instance, don't do stuff like this (pseudocode):
foreach (row in "Select * from Orders")
    if (row[CustomerID] == 876)
        Display(row)
Do this instead:
foreach (row in "Select * from Orders where CustomerId = 876")
    Display(row)
First, it's likely Orders is indexed by CustomerID, so it will find all of customer 876's orders way quicker.
Second, to do the first one you just sucked every record in that table into the client's memory space, probably across your network.
What language is used is essentially irrelevant; you could invent your own DBMS with its own language.
It's where you do what processing that matters. It's a rule with exceptions, but the essential idea is to let your backend do as much as it can.

Micro ORM - maintaining your SQL query strings

I will not go into the details of why I am exploring the use of Micro ORMs at this stage - except to say that I feel powerless when I use a full-blown ORM. There are too many things going on in the background that happen automatically, and not all of them are the best possible choices. I was quite ready to go back to raw database access, but I found out about the three new guys on the block: Dapper, PetaPoco and Massive. So I decided to give the low-level approach a go with a pet project. It is not relevant, but so far I am using PetaPoco.
In any case, I am having trouble deciding how to go about maintaining the SQL strings that I will use from the higher levels. There are three main solutions that I can think of:
Sprinkle the SQL queries wherever I need them. This is the least infrastructure heavy method. However, it suffers in both maintainability and testability areas.
Limit the query usage to some service classes. This helps maintainability, and is still low on infrastructure I need to implement. It may also be possible to build these service classes such that they would be easy to mock for testing purposes.
Prepare some classes to make the system somewhat flexible. I have started on this path. I implemented a Repository interface, and a database-dependent Repository class. I have also built some tiny interfaces to capture SQL queries that can be passed to my Repository's GetMany() method. All the queries are implemented as individual classes right now, and I will probably need a little more interface around this to add some level of database independence - and maybe some flexibility in decorating queries into paged and sorted queries (again, this would also make them a little more flexible in handling different databases).
What I am mainly worried about right now is that I have entered the slippery slope of writing all the functions needed for a full blown ORM, but badly. For example, it feels sensible right now that I write or find a library to convert linq calls into SQL statements so that I can massage my queries easily or write extenders that can decorate any query I pass to it, etc. But that is a large task, and is already done by the big guys, so I am resisting the urge to go there. I also want to retain control over what queries I send to the database - by explicitly writing them.
So what is the suggestion? Should I go #2 option, or try to stumble along on option #3? I am certain I cannot show any code written in the first option to anyone without blushing. Is there any other approach you can recommend?
EDIT: After I asked the question, I realized there is another option, somewhat orthogonal to these three: stored procedures. There seem to be a few advantages to putting all your queries inside the database as stored procedures. They are kept in a central location, and not spread through the code (though maintenance is an issue - the parameters may get out of sync). The reliance on database dialect is solved automatically: if you move databases, you port all your stored procedures, and you are done. And there are also the security benefits.
With the stored procedure option, the alternatives 1 and 2 seem a little bit more suitable. There seems to be not enough entities to warrant option 3 - but it is still possible to separate the procedure call commands from database accessing code.
I've implemented option 3 without stored procedures, and option 2 with stored procedures, and it seems like the latter is more suitable for me (in case anyone is interested with the outcome of the question).
I would say put the sql where you would have put the equivalent LINQ query, or the sql for DataContext.ExecuteQuery. As for where that is... well, that is up to you and depends on how much separation you want. - Marc Gravell, creator of Dapper
See Marc's opinion on the matter
I think the key point is, you shouldn't really be re-using the SQL. If your logic is re-used then it should be wrapped in a method that can then be called from multiple places.
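A small sketch of that idea, in Python with the built-in `sqlite3` module rather than C# and a micro ORM, so the shape carries over but none of the names do. `UserQueries`, `active_users` and the `users` table are invented for the example. The SQL string lives in exactly one method, and callers reuse the method rather than the string.

```python
import sqlite3


class UserQueries:
    """Hypothetical service class: the only place that knows this SQL."""

    def __init__(self, conn):
        self._conn = conn

    def active_users(self):
        return self._conn.execute(
            "SELECT id, name FROM users WHERE active = 1 ORDER BY id"
        ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO users (name, active) VALUES (?, ?)",
    [("ann", 1), ("bob", 0), ("cy", 1)],
)

users = UserQueries(conn)
active = users.active_users()
```

Because callers depend only on `active_users()`, a test can swap in a fake object with the same method, which is roughly the mocking benefit mentioned for option 2.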
I know you've accepted your answer already but I still wanted to show you a nice alternative that may be helpful in your case as well. Now or in the future.
When using stored procedures it's wise to use T4
I tend to use stored procedures on my project even though it's not using PetaPoco, Dapper or Massive (project started before these were here). It uses BLToolkit instead. Anyway. Instead of writing my methods to run stored procedures and write code to provide stored procedure parameters, I've written a T4 template that generates the code for me.
Whenever stored procedures change (some may be added/removed, parameters added/removed/renamed/retyped), my code will break on compilation because method calls will not match their signature any more.
I keep my stored procedures in a file (so they get version controlled). If you work in a multi-developer team it may be sensible to have stored procedures each in its own file. It makes updates much less painful. I've experienced that on some project and it worked ok as long as number of SPs is not huge. You can restructure them into folders based on the entity they're related to.
Anyway. Maintenance is related to stored procedures, code change is just a simple click of a button in Visual Studio that converts all T4s at once. You don't have to search your methods that use those procedures. You'll be reported errors while compiling. One thing less to worry about.
So instead of writing
using (var db = new DbManager())
{
    return db
        .SetSpCommand(
            "Person_SaveWithRelations",
            db.Parameter("@Name", name),
            db.Parameter("@Email", email),
            db.Parameter("@Birth", birth),
            db.Parameter("@ExternalID", exId))
        .ExecuteObject<Person>();
}
and having a bunch of magic strings I can just simply write:
using (var db = new DataManager())
{
    return db
        .Person
        .SaveWithRelations(name, email, birth, exId)
        .ExecuteObject<Person>();
}
This is nicer and cleaner, breaks at compile time, and provides IntelliSense, so it's also faster while developing.
The good thing is that stored procedures may become very complex and may do many things. In my example above I check some data, insert a person record and some related ones as well, and in the end return the newly inserted Person record. Inserts and updates should usually return the data that was added/changed to reflect the actual state.

When to use an ORM (Sequel, Datamapper, AR, etc.) vs. pure SQL for querying

A colleague of mine is currently designing SQL queries like the one below to produce reports, which are displayed in excel files through an external data query.
At present, only reporting processes on the DB are required (no CRUD operations).
I am trying to convince him that it would be better to use a ruby ORM in order to be able to display the data in a rails/sinatra app.
Despite the obvious advantages in displaying the data, what advantages are there for him in learning to use an ORM like Sequel or Datamapper?
The SQL queries he is writing are clearly quite complex, and being relatively new to SQL, he often complains that it is very time-consuming and confusing.
Is it possible to write extremely complex queries with an ORM? If so, which is the most suitable (I have heard Sequel is good for legacy DBs)? And what are the advantages of learning Ruby and using an ORM versus sticking with plain SQL for making complex database queries?
I'm the DataMapper maintainer, and I think for complex reporting you should use SQL.
While I do think someday we'll have a DSL that provides the power and conciseness of SQL, everything I've seen so far requires you to write more Ruby code than SQL for complex queries. I would much rather maintain a 5 line SQL query than 10-15 lines of Ruby code to describe the same complex operation.
Please note I say complex: if you have something simple, use the ORM's built-in finders. However, I do believe there is a line you can cross where SQL becomes simpler. Now, most apps aren't just reporting. You may have a lot of CRUD-type operations, for which an ORM is perfectly suited and far better than doing those things by hand.
One thing that an ORM will usually provide is some sort of organization to your application logic. You can group code based around each model in the same file. It's usually there that I'll put the complex SQL query, rather than embedding it in the controller, eg:
class User
  include DataMapper::Resource

  property :id,   Serial
  property :name, String,  :length => 1..100, :required => true
  property :age,  Integer, :min => 1, :max => 130

  def self.some_complex_query
    repository.adapter.select <<-SQL
      SELECT ...
        FROM ...
       WHERE ...
         ... more complex stuff here ...
    SQL
  end
end
Then I can just generate the report using User.some_complex_query. You could also push the SQL query into a view if you wanted to further cleanup this code.
EDIT: By "view" in the above sentence I meant RDBMS view, rather than view in the MVC context. Just wanted to clear up any potential confusion.
If you are writing your queries by hand, you have the chance to optimize them. When I look at that query I see some potential for optimizations (E.ICGROUPNAME LIKE '%san-fransisco%' or E.ICGROUPNAME LIKE '%bordeaux%' won't use an index, which means a table scan).
When using an OR Mapper (the native Objects/Tables) for reporting, you have little or no control over the resulting SQL query.
But: you could put that query in a View or Stored Procedure and map that View/Proc with an OR Mapper. That way you can optimize your queries and still use all the features of your application framework.
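A minimal sketch of the view approach, using Python's built-in `sqlite3` module in place of a Ruby ORM; the `sales` table and the `sales_by_region` view are invented for the example. The hand-tuned reporting SQL lives in the database as a view, and the application reads it as if it were an ordinary table, which is also how a mapper would see it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('north', 10), ('north', 5), ('south', 7);

    -- The hand-written reporting query is stored once, in the database.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region;
""")

# Rows come back addressable by column name, much like mapped objects.
conn.row_factory = sqlite3.Row
report = [
    dict(row)
    for row in conn.execute("SELECT region, total FROM sales_by_region ORDER BY region")
]
```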
Unless you're dealing with objects, an ORM is not necessary. It sounds like your friend simply needs to generate reports, in which case pure SQL is just fine so long as he knows what he's doing (e.g. avoiding SQL injection issues).
ORM stands for "Object-Relational Mapping". If you don't have the "O" (objects), then it's probably not a good fit for your app. Where ORMs really shine is in persisting objects to the database and loading them from a database.
ORM stands for Object Relational Mapping - but looking at the query your friend seems to be wanting a pretty specific table of sums and other items... I've not used Ruby's Sequel, but I've used Hibernate, and Python's SQLAlchemy (for Django/Turbogears) and while you can do these sorts of queries, I don't believe that is their strength.
The power of an ORM comes from being able to find Foo->Bar object relationships: say you want all the Bar objects for Foo's field greater than X... That sort of thing. Therefore I would not classify an ORM as a "good" solution, though moving to a real programming language like Ruby and doing the SQL through it instead of Excel... that in itself is a win.
Just my 2 cents.
In a situation like that, I'd probably write them by hand or use a View (if the DB you're using supports views)
ORMs are used when you have objects (Business Objects). I am therefore assuming that you have an application with which you are creating and managing the Business Objects that are ultimately saved into the database. If you do, then you almost definitely have some representation of the relationships, and probably many of the calculations, that you are going to use in reports. The problem with using SQL to directly access your database for reports is simply maintainability.
You typically put a lot of effort into ensuring that your Business Objects hide any details of their database. You implement business rules and do common calculations in your Business Objects, build a common language for all members of the team, etc. You then use an ORM such as Habanero or NHibernate to map to the database. We do all this in the name of maintainability, and it is great: you can migrate your application, change your design, etc.
You now go and write SQL to run reports, and over time you have hundreds of reports. Firstly, they often duplicate logic you already have in your Business Objects (usually without any tests). Even worse, maintainability is now stuffed: forget about moving that field from one table to another, splitting that table into two, or changing that relationship; you have a number of reports that are going to break unexpectedly.
The problem with querying through your Domain Objects/Business Objects is simply one of performance.
In summary, if you are using Domain-Driven Design or Business Object concepts, try to use these for reports. (You will probably run some reports directly from the DB using SQL or stored procs for performance reasons, but try to limit these: use your Business Objects first and then use SQL.)
The other option, of course, is using a separate reporting database (like some of the BI concepts). The mapping from your transactional DB to your reporting DB is then in one place and easily changeable when you want to change your design.
Domain Objects (Business Objects) and ORMs have all the knowledge needed to start building high-performing queries that run directly on the database while using the domain terminology. Let's hope that these continue to evolve to a point where this is a reality.
Until then, if you are using Business Objects in your application, try to use them for reporting; when performance is an issue, resort to SQL.

Is this a valid benefit of using embedded SQL over stored procedures?

Here's an argument for SPs that I haven't heard. Flamers, be gentle with the down tick,
Since there is overhead associated with each trip to the database server, I would suggest that a POSSIBLE reason for placing your SQL in SPs over embedded code is that you are more insulated to change without taking a performance hit.
For example. Let's say you need to perform Query A that returns a scalar integer.
Then, later, the requirements change and you decide that if the scalar result is > x then, and only then, you need to perform another query. If you performed the first query in an SP, you could easily check the result of the first query and conditionally execute the second SQL statement in the same SP.
How would you do this efficiently in embedded SQL without performing a separate query or an unnecessary query?
Here's an example:
--This SP may run one or two queries.
DECLARE @CustCount INT
SELECT @CustCount = COUNT(*) FROM CUSTOMER
IF @CustCount > 10
    SELECT * FROM PRODUCT
Can this be done in embedded SQL, and if so, what is the best way to do it?
A very persuasive article
SQL and stored procedures will be there for the duration of your data.
Client languages come and go, and you'll have to re-implement your embedded SQL every time.
In the example you provide, the time saved is sending a single scalar value and a single follow-up query over the wire. This is insignificant in any reasonable scenario. That's not to say there might not be other valid performance reasons to use SPs; just that this isn't such a reason.
I would generally never put business logic in SPs; I like it to be in my native language of choice, outside the database. The only time I agree SPs are better is when there is a lot of data movement that doesn't need to come out of the DB.
So to answer your question, I'd rather have two queries in my code than embed that in an SP; in my view I am trading a small performance hit for something a lot more clear.
"How would you do this efficiently in embedded SQL w/o perform a separate query or an unnecessary query?"
Depends on the database you are using. In SQL Server, this is a simple CASE statement.
Perhaps include the WHERE clause in that sproc:
WHERE (all your regular conditions)
AND myScalar > myThreshold
Lately I prefer not to use SPs (except when uber complexity arises where a proc would just be better... or CLR would be better). I have been using the Repository pattern with LINQ to SQL, where my query is written in my data layer as a strongly typed LINQ expression. The key here is that the query is strongly typed, which means that when I refactor I am refactoring properties of a class that is directly generated from the database table (which makes carrying changes from the DB all the way forward super easy and accurate). While my SQL is generated for me and sent to the server, I still have the option of sticking to DRY principles, as the repository pattern allows me to break things down into their smallest components. I do have the issue that I might make a trip to the server and, based on the results of the query, find that I need to make another trip to the server. I don't worry about this up front. If I find later that it becomes an issue then I may refactor that code into something more performant. The overall key here is that there is no one magic bullet. I tend to work on greenfield applications, which allows this method of development to be most efficient for me.
Benefits of SPs:
Performance (are precompiled)
Easy to change (without compiling the application)
SQL set-based features make really difficult data tasks very easy
Drawbacks:
Depend heavily on the database engine used
Makes deployment of upgrades a little harder (you have to deploy the App + the scripts)
My 2 cents...
About your example, it can be done like this:
select * from products where (select count(*) from customers) > 10
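A quick way to check that the single-statement form behaves like the two-step logic is to run both against a scratch database. This is only a sketch with Python's built-in `sqlite3` module; the tables, function names and threshold are invented, and other engines may plan the scalar subquery differently.

```python
import sqlite3


def make_db(customer_count):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE products (name TEXT)")
    conn.executemany(
        "INSERT INTO customers (id) VALUES (?)",
        [(i,) for i in range(1, customer_count + 1)],
    )
    conn.executemany("INSERT INTO products (name) VALUES (?)", [("a",), ("b",)])
    return conn


def products_one_statement(conn, threshold):
    # One round trip: the scalar subquery gates the whole result set.
    return conn.execute(
        "SELECT name FROM products "
        "WHERE (SELECT COUNT(*) FROM customers) > ? ORDER BY name",
        (threshold,),
    ).fetchall()


def products_two_queries(conn, threshold):
    # Two round trips: the condition is checked in client code.
    (count,) = conn.execute("SELECT COUNT(*) FROM customers").fetchone()
    if count <= threshold:
        return []
    return conn.execute("SELECT name FROM products ORDER BY name").fetchall()


small, large = make_db(3), make_db(11)
```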

Migrating from MySQL to arbitrary standards-compliant SQL2003 server

Is there an incantation of mysqldump or a similar tool that will produce a piece of SQL2003 code to create and fill the same databases in an arbitrary SQL2003 compliant RDBMS?
(The one I'm trying right now is MonetDB)
DDL statements are inherently database-vendor specific. Although they have the same basic structure, each vendor has their own take on how to define types, indexes, constraints, etc.
DML statements on the other hand are fairly portable. Therefore I suggest:
Dump the database without any data (mysqldump --no-data) to get the schema
Make necessary changes to get the schema loaded on the other DB - these need to be done by hand (but some search/replace may be possible)
Dump the data with extended inserts off and no create table (--extended-insert=0 --no-create-info)
Run the resulting script against the other DB.
This should do what you want.
However, when porting an application to a different database vendor, many other things will be required; moving the schema and data is the easy bit. Checking for bugs introduced, different behaviour and performance testing is the hard bit.
At the very least test every single query in your application for validity on the new database. Ideally do a lot more.
This one is kind of tough. Unless you've got a very simple DB structure with vanilla types (varchar, integer, etc), you're probably going to get the best results writing a migration tool. In a language like Perl (via the DBI), this is pretty straight-forward. The program is basically an echo loop that reads from one database and inserts into the other. There are examples of this sort of code that Google knows about.
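The echo loop could look roughly like this in Python instead of Perl. This is a sketch under several assumptions: both ends are `sqlite3` connections purely so the example runs anywhere, the target schema has already been created by hand, the table and column names come from a trusted source (they are interpolated into the SQL), and the `?` placeholder style is what the driver expects. Other DB-API drivers use different placeholder styles, and no datatype conversion is attempted here.

```python
import sqlite3


def copy_table(source, target, table, columns, batch_size=500):
    """Read rows from `source` and insert them into `target` in batches."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    select_sql = "SELECT %s FROM %s" % (column_list, table)
    insert_sql = "INSERT INTO %s (%s) VALUES (%s)" % (table, column_list, placeholders)

    cursor = source.execute(select_sql)
    copied = 0
    while True:
        rows = cursor.fetchmany(batch_size)  # avoids loading the whole table at once
        if not rows:
            break
        target.executemany(insert_sql, rows)
        copied += len(rows)
    target.commit()
    return copied


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER, name TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob"), (3, "cy")])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE users (id INTEGER, name TEXT)")  # schema ported by hand

copied = copy_table(source, target, "users", ["id", "name"], batch_size=2)
```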
Aside from the obvious problem of moving the data, there is the more subtle problem of how some datatypes are represented. For instance, MS SQL's datetime field is not in the same format as MySQL's. Other datatypes like BLOBs may have a different capacity in one RDBMS than in another. You should make sure that you understand the datatype definitions of the target DB system very well before porting.
The last problem, of course, is getting application-level SQL statements to work against the new system. In my work, that's by far the hardest part. Date math seems especially DB-specific, while annoying things like quoting rules are a constant source of irritation.
Good luck with your project.
From SQL Server 2000 or 2005 you can have it generate scripts for your objects, but I am not sure how well they will transfer to other RDBMS.
The generate script option is probably the easiest way to go. You'll undoubtedly have to do some search/replace on a few data types though.