Stored Procedure VS. F# - sql

For most SP-taught developers, there are no option between Linq and Stored-Procedures/Functions. That's may be true.
However, there are a road junctions nowadays. Before I spending too much time into syntax of F#, i would like more inputs about where the power (and opposite) of F# lies.
How will F# perform on this topic (against SP)?
F# have to communicate with a database on some way. Through Linq2Sql/Entity-app-layer or directly though AnyDbConnection. Nothing new there. But F# have the power of parallellism and less overhead in thier work (Functional Programming with C#/F#). Also F# has it's effeciency as a layer for data and machine. Pretty much like C# power of being a layer between human and machine.
Would I really still let the DB Server handle a request of recurring nodes, or just fetch plain data to F# and handle it there? Encapsulated nice and smoothly as a object method call from C#?
Would a stored procedure still be the best option for scanning 50 millions of records for finding orphans or a criteria that matching 0,5% of the result?
Would a SP or function still be best for a simple task as finding next parent node?
Would a SP still being best to collect a million records of data and return calculated sums and/or periods?
Wouldn't a single f# dll library fully built on the Single responsibility principle being of more use then stored procedures hooked up inside a sql server? There are pros and cons, of course. But what are they?

Stored procedures are not magically super-fast. Often, they're actually rather slow.
Many people will downvote this answer providing anecdotal evidence that a stored procedure once made an application faster overall. However, all of those examples that I've actually seen code for indicate that they totally rethought some bad SQL to package it as an SP. I submit that the discipline of repackaging bad SQL into a procedure helped more than the SP itself.
Most of your points can't be evaluated without a measured benchmark.
I suggest that you do the following.
Write it in F#.
Measure it.
If it's too slow for your production application, then try some stored procedures to see if it's faster. If it's fast enough for your production application, then you have your answer, F# worked for you. For your application. For your data. For your architecture.
There's no "general" answer. Although my benchmarks for some particular kinds of queries indicate that the SP engine is pretty slow compared with Java. F# will probably be faster than the SP engine also.
The important thing is to make sure that the database -- if it's going to be "pure" data -- is already optimized so that queries like your "scanning 50 millions of records for finding orphans or a criteria that matching 0,5% of the result?" would retrieve the rows as quickly as possible. This often involves tweaking buffers and array sizes and other elements of the database-to-F# connection. This usually means that you want a more direct connection so that you can adjust the sizes.

Databases are efficient for certain tasks (e.g. when they can uses index to search for a specified row), but probably won't be any faster than F# if you need to process all rows and ubdate them (in database) or calculate some new result based on all the data.
As S. Lott suggests, the best option is to try implementing what you need in F# and you'll find out. Parallelism can give you some performance benefits, especially if you're doing some computationally heavy calculations. However, you may still want to store the data in databases, load it and process it in F# (I believe this is how F# was used by adCenter at Microsoft).
Possibly the most important note is that databases give you various guarantees about the consistency of the data - no matter what happens, you'll still end up with consistent state. Implementing this yourself may be tricky (e.g. when updating data), but you need to consider whether you need it or not.

You ask this:
Would a stored procedure still be the best option for scanning 50 millions of records for finding orphans or a criteria that matching 0,5% of the result?
I take your question to mean 'I have this data in sql server. Should i query it in sql or in client code (F# in this case). Queries like this should absolutely be performed in sql if at all possible. If you're doing it in F#, you're transferring those 50 million rows to the client just to do some aggregation/lookups.
I hope I understood your question correctly.

As I understand an SP just means you call some precompiled execution plan, and you can call it through an API, instead of pushing a string to the server. These two save in the order of millseconds, nowhere near a second. For larger queries that difference is negligible. They're good for highfrequency/ throughput stuff (and of course encapsulating complex logic, but that doens't seem to apply here).
Because an SP uses a procompiled plan, it can indeed be slower than a normal query because it no longer checks the statitsics of the underlying data(becuase the execution plan is already compiled.) Since you mention a condition that applies to 0.5% of the rows, this could be important.
In the discussion of SP vs F# I would reword that to 'on the server' vs 'on the client'. If you're talking higher data volumes (50M rows qualifies) my first choice would always be to 'put the mill where the wood is', that means execute on the server if possible. Only if there is some very complicated logic involved you might want to consider F#, but I don't think that applies. Then still I'd prefer to execute that on the server than first drag all those rows over the network (potentially slow).
GJ

Related

Benefits of stored procedures vs. other forms of grabbing data from a database [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What are the pros and cons to keeping SQL in Stored Procs versus Code
Just curious on the advantages and disadvantages of using a stored procedure vs. other forms of getting data from a database. What is the preferred method to ensure speed, accuracy, and security (we don't want sql injections!).
(should I post this question to another stack exchange site?)
As per the answer to all database questions 'it depends'. However, stored procedures definitely help in terms of speed because of plan caching (although properly parameterized SQL will benefit from that too). Accuracy is no different - an incorrect query is incorrect whether it's in a stored procedure or not. And in terms of security, they can offer a useful way of limiting access for users - seeing as you don't need to give them direct access to the underlying tables - you can just allow them to execute the stored procedures that you want. There are, however, many many questions on this topic and I'd advise you to search a bit and find out some more.
There are several questions on Stackoverflow about this problem. I really don't think you'll get a "right" answer here, both can work out very well, and both can work horribly. I think if you are using Java then the general pattern is to use an ORM framework like Hibernate/JPA. This can be completely safe from SQL injection attacks as long as you use the framework correctly. My experience with .Net developers is that they are more likely to use stored procedure backed persistence, but that seems to be more open than it was before. Both NHibernate and other MS technologies seem to be gaining popularity.
My personal view is that in general an ORM will save you some time from lots of verbose coding since it can automatically generate much of the SQL you use in a typical CRUD type system. To gain this you will likely give up a little performance and some flexibility. If your system is low to medium volume (10's of thousands of requests per day) then an ORM will be just fine for you. If you start getting in to the millions of requests per day then you may need something a little more bare metal like straight SQL or stored procedures. Note than an ORM doesn't prevent you from going more direct to the DB, it's just not normally what you would use.
One final note, is that I think ORM persistence makes an application much more testable. If you use stored procedures for much of your persistence then you are almost bound to start getting a bunch of business logic in these. To test them you have to actually persist data and interact with the DB, this makes testing slow and brittle. Using an ORM framework you can either avoid most of this testing or use an in memory DB when you really want to test persistence.
See:
Stored Procedures and ORM's
Manual DAL & BLL vs. ORM
This may be better on the Programmers SE, but I'll answer here.
CRUD stored procedures used to be, and sometimes still are, the best practice for data persistence and retrieval on a SQL DBMS. Every such DBMS has stored procedures, so you're practically guaranteed to be able to use this solution regardless of the coding language and DBMS, and code which uses the solution can be pointed to any DB that has the proper stored procs and it'll work with minimal code changes (there are some syntax changes required when calling SPs in different DBMSes; often these are integrated into a language's library support for accessing SPs on a particular DBMS). Perhaps the biggest advantage is centralized access to the table data; you can lock the tables themselves down like Fort Knox, and dispense access rights for the SPs as necessary to more limited user accounts.
However, they have some drawbacks. First off, SPs are difficult to TDD, because the tools don't really exist within database IDEs; you have to create tests in other code that exercise the SPs (and so the test must set up the DB with the test data that is expected). From a technical standpoint, such a test is not and cannot be a "unit test", which is a small, narrow test of a small, narrow area of functionality, which has no side effects (such as reading/writing to the file system). Also, SPs are one more layer that has to be changed when making a needed change to functionality. Adding a new field to a query result requires changing the table, the retrieval source code, and the SP. Adding a new way to search for records of a particular type requires the statement to be created and tested, then encapsulated in a SP, and the corresponding method created on the DAO.
The new best practice where available, IMO, is a library called an object-relational mapper or ORM. An ORM abstracts the actual data layer, so what you're asking for becomes the code objects themselves, and you query for them based on properties of those objects, not based on table data. These queries are almost always code-configurable, and are translated into the DBMS's flavor of SQL based on one or more "mappings" that you define between the object model and the data model (objects of type A are persisted as records in table B, where this property C is written to field D).
The advantages are more flexibility within the code actually looking for data in the form of these code objects. The criteria of a query is usually able to be customized in-code; if a new query is needed that has a different WHERE clause, you just write the query, and the ORM will translate it into the new SQL statement. Because the ORM is the only place where SQL is actually used (and most ORMs use system stored procs to execute parameterized query strings where available) injection attacks are virtually impossible. Lastly, depending on the language and the ORM, queries can be compiler-checked; in .NET, a library called Linq is available that provides a SQL-ish keyword syntax, that is then converted into method calls that are given to a "query provider" that can translate those method calls into the data store's native query language. This also allows queries to be tested in-code; you can verify that the query used will produce the desired results given an in-memory collection of objects that stands in for the actual DBMS.
The disadvantages of an ORM is that the ORM library is usually language-specific; Hibernate is available in Java, NHibernate (and L2E and L2SQL) in .NET, and a few similar libraries like Pork in PHP, but if you're coding in an older or more esoteric language there's simply nothing of the sort available. Another one is that security becomes a little trickier; most ORMs require direct access to the tables in order to query and update them. A few will tolerate being pointed to a view for retrieval and SPs for updating (allowing segregation of view/SP and table security and the ability to restrict the retrievable fields), but now you're mixing the worst of both worlds; you still have to define mappings, but now you also have code in the data layer. The easiest way to overcome this is to implement your security elsewhere; force applications to get data using a web service, which provides the data using the ORM and has specific, limited "front doors". Also, many ORMs have some performance problems when used in certain ways; most are designed to "lazy-load" data, where data is retrieved the moment it's actually needed and not before, which increases up-front performance when you don't need every record you asked for. However, when you DO need every record you asked for, this creates extra round trips. You have to structure queries in specific ways to get around this expected use-case behavior.
Which is better? You have to decide. I can tell you now that using an ORM is MUCH easier to set up and get working correctly than SPs, and it's much easier to make (and limit the scope of) changes to the schema and to queries. In the modern development house, where the priority is to make it work first, and then make it perform well and/or be secure against intrusion, that's a HUGE plus. In most cases where you think security is an issue, it really isn't, and when security really is an issue, putting the solution in the DB layer is usually the wrong place, because the DBMS is the very last line of defense against intrusion; if the DBMS itself has to be counted on to stop something unwanted from happening, you have failed to do so (or even encouraged it to happen) in many layers of software and firmware above it.

Query design practices in SQL

I am building queries for a database in MS Access 2007 and I am wondering if my current design practices are up to par. Basically, the database was configured before I came, but I have been given the responsibility of building efficient queries to extract the data.
My current queries are small and simple, each accomplishing 2-3 tasks (sometimes only 1) at a time. The reason I am taking this approach is because I am completely new to SQL, and I find it easier to work with many, simple queries and use reports to consolidate the data, as opposed to building extremely complex queries which are 1) hard to build (for me, anyways) and 2) hard to maintain.
I was just curious if anyone had any best practices for query design, and if you could give me some specific feed back for the approach listed above, and whether or not I should start making complex queries, or just stick to simple queries and reports to consolidate the relevant data.
Thanks.
The people answering this question are not coming to it from an Access point of view, so I'll offer some observations as somebody who has been creating Access applications professionally full-time since 1996.
First off, there are several places where you'll have SQL in an Access application:
stored queries.
stored properties of forms, reports, combo boxes and list boxes.
in VBA code where you are writing SQL on the fly.
Managing all of these SQL statements in an organized fashion is difficult, if not impossible. But I'm not sure it's worth it!
First off, consider just stored queries. If you follow the advice of saving a query for every individual task so that each SQL statement is used in only one place, you'll soon have a mess in the list of queries, and you'll be forced into some kind of naming convention to keep track of what's what. Because of this, I generally don't save queries EXCEPT where they MUST be saved, or where the optimization that comes with a saved query is going to be helpful (i.e., large dataset or complex joins/filtering).
For example, when I first started programming in Access, I'd save all the rowsources of my combo boxes as saved queries. I developed a naming convention so they wouldn't be mixed in with the other queries in the list of queries, so it wasn't to hard to manage. At first, I thought I'd be re-using the saved queries, but it quickly became clear that I needed to make changes for individual circumstances, and changing a query that was used elsewhere might alter its results in other contexts, so really, there was no "shared code" benefit to the saved queries (as I thought there would be). The only place where it was helpful was where I had the same combo box on multiple forms, and then I could save the rowsource for that as a saved query and if I needed to alter it, I could it in just one place. However, that was really only an advantage for a relatively complex rowsource -- a simple SELECT on a couple of fields doesn't really benefit from that kind of sharing, particularly when it's used in only a couple of different places.
In short, I quickly concluded that it was just easier to save the SQL statements where they were used -- since there was very little re-use in the first place (once I gained enough experience to realize the pitfalls of trying to re-use them), this worked much better, and it kept the SQL close to where it was being used.
For forms and reports, I do some of the same things, but in general, use saved queries for the purpose of avoiding having to write too many complex subselects for use as derived tables. Where I needed those it was always easier to write it and save it and then use it with a JOIN in another SQL statement than it was to try to use the subselect inline as a derived table (which just makes for complicated SQL that's hard to read -- particularly when you can't comment or format your SQL, as is the case with saved Access queries).
In general, I don't save the recordsources of forms or reports except where there is real re-use going on (a report will often use the same recordsource as a form, so in that case, it's useful to save it, so that when you change the SQL of the form, the report that goes with it inherits the alteration).
That all leaves dynamic SQL assembled in VBA code. I use lots of this, from dynamically setting the rowsources of combo/listboxes, to setting the recordsources of subforms for filtering purposes. This is harder to manage, and sometimes I use string constants in the module to make that easier. For instance, in a case where you're writing dynamic SQL where everything remains the same except the WHERE clause, a constant with the SELECT and a second constant with the ORDER BY makes it a lot easier to assemble the complete SQL statement.
I don't know if this really answers your questions, but I have learned over the years that the benefits of re-using SQL statements are vastly outweighed by the uncertainty that comes from the inability to track easily where that SQL statement may be used. I find that storing the SQL statment as close to where it is used as possible is the best practice, as that is a form of "self-documentation" (though not a great one!).
I do make many exceptions and save queries when there is a real and demonstrable benefit in terms of performance or managing what would otherwise become much more comples SQL. However, I would also note that one should also not go too far in the other direction, using tons of nested saved queries, because then you run into other problems (i.e., the "too many databases" problem, which is actually caused by using up the 2048 table handles available at one time -- it's done more easily than you might think).
My humble opinion, it doesn't matter if DB engine is big and monstrous as MSSQL or Oracle, or tiny and simple as SQLite, every query (or stored procedure or any other unit of data processing) should be responsible only for 1 function. I use this principle anywhere (not only in DB development) and I can say it works.
If you are not sure, try to read books about refactoring, Fawler for example. I suppose his principles are applicable to any area of development.
If you are storing your data in MSAccess then your database cannot be too large and any optimization you do is limited by the constraints MSAccess imposes. If better (more optimized) queries is a goal, then perhaps migrating the data out of Access and into SQL Server may allow you to have better flexibility in development going forward. You can leverage cached execution plans, stored procedure, and views.
This may mean that you will need to enhance your T-SQL skills to accomplish this.
So weigh out the options you propose in your question:
1. Keep code simple (comfortable at your current skill level)
2. Meet the responsibility to create efficient queries for data extraction.
SQL Server Express could be a good starting point (it's free).

Pros and cons of putting logic in SQL? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 3 years ago.
Improve this question
At a new job, I've just been exposed to the concept of putting logic into SQL statements.
In MySQL, a dumb example would be like this:
SELECT
P.LastName, IF(P.LastName='Baldwin','Michael','Bruce') AS FirstName
FROM
University.PhilosophyProfessors P
// This is like a ternary operator; if the condition is true, it returns
// the first value; else the second value. So if a professor's last name
// is 'Baldwin', we will get their first name as "Michael"; otherwise, "Bruce"**
For a more realistic example, maybe you're deciding whether a salesperson qualifies for a bonus. You could grab various sales numbers and do some calculations in your SQL query, and return true / false as a column value called "qualifies."
Previously, I would have gotten all the sales data back from the query, then done the calculation in my application code.
To me, this seems better, because if necessary, I can walk through the application logic step-by-step with a debugger, but whatever the database is doing is a black box to me. But I'm a junior developer, so I don't know what's normal.
What are the pros and cons of having the database server do some of your calculations / logic?
**Code example based on Monty Python sketch.
This way SQL becomes part of your domain model. It's one more (and not necessarily obvious) place where domain knowledge is implemented. Such leaks result in tighter coupling between business logic / application code and database, what usually is a bad idea.
One exception is views, report queries etc. But these usually are so isolated that it's obvious what role they play.
One of the most persuasive reasons to push logic out to the database is to minimise traffic. In the example given, there is little gain, since you are fetching the same amount of data whether the logic is in the query or in your app.
If you want to fetch only users with a first name of Michael, then it makes more sense to implement the logic on the server. Actually, in this simple example, it doesn't make much difference, since you could specify users who's lastname is Baldwin. But consider a more interesting problem, whereby you give each user a "popularity" score based on how common their first and last names are, and you want to fetch the 10 most "popular" users. Calculating "popularity" in the app would mean that you have to fetch every single user before ranking, sorting and choosing them locally. Calculating it on the server means you can fetch just 10 rows across the wire.
There aren't a lot of absolute pros and cons to this argument, so the answer is 'it depends.' Some scenarios with different conditions that affect this decision might be:
Client-server app
One example of a place where it might be appropriate to do this is an older 4GL or rich client application where all database operations were done through stored procedure based update, insert, delete sprocs. In this case the gist of the architecture was to have the sprocs act as the main interface for the database and all business logic relating to particular entities lived in the one place.
This type of architecture is somewhat unfashionable these days but at one point it was considered to be the best way to do it. Many VB, Oracle Forms, Informix 4GL and other client-server apps of the era were done like this and it actually works fairly well.
It's not without its drawbacks, however - SQL is not particularly good at abstraction, so it's quite easy to wind up with fairly obtuse SQL code that presents a maintenance issue through being hard to understand and not as modular as one might like.
Is it still relevant today? Quite often a rich client is the right platform for an application and there's certainly plenty of new development going on with Winforms and Swing. We do have good open-source ORMs today where a 1995 vintage Oracle Forms app might not have had the option of using this type of technology. However, the decision to use an ORM is certainly not a black and white one - Fowler's Patterns of Enterprise Application Architecture does quite a good job of running through a range of data access strategies and discussing their relative merits.
Three tier app with rich object model
This type of app takes the opposite approach, and places all of the business logic in the middle tier model object layer with a relatively thin database layer (or perhaps an off-the-shelf mechanism like an ORM). In this case you are attempting to place all the application logic in the middle-tier. The data access layer has relatively little intelligence, except perhaps for a handful of stored procedured needed to get around limits of an ORM.
In this case, SQL based business logic is kept to a minimum as the main repository of application logic is the middle-tier.
Overhight batch processes
If you have to do a periodic run to pick out records that match some complex criteria and do something with them it may be appropriate to implement this as a stored procedure. For something that may have to go over a significant portion of a decent sized database a sproc based approch is probably going to be the only reasonably performant way to do this sort of thing.
In this case SQL may well be the appropriate way to do this, although traditional 3GLs (particularly COBOL) were designed specifically for this type of processing. In really high volume environments (particularly mainframes) doing this type of processing with flat or VSAM files outside a database may be the fastest way to do it. In addition, some jobs may be inherently record-oriented and procedural, or may be much more transparent and maintanable if implemented in this way.
To paraphrase Ed Post, 'you can write COBOL in any language' - although you might not want to. If you want to keep it in the database, use SQL, but it's certainly not the only game in town.
Reporting
The nature of reporting tools tends to dictate the means of encoding business logic. Most are designed to work with SQL based data sources so the nature of the tool forces the choice on you.
Other domains
Some applications like ETL processing may be a good fit for SQL. ETL tools start to get unwiedly if the transformation gets too complex, so you may want to go for a stored procedure based architecture. Mixing Queries and transformations across extraction, ETL processing and stored-proc based processing can lead to a transformation process that is hard to test and troubleshoot.
Where you have a significant portion of your logic in sprocs it may be better to put all of the logic in this as it gives you a relatively homogeneous and modular code base. In fact I have it on fairly good authority that around half of all data warehouse projects in the banking and insurance sectors are done this way as an explicit design decision - for precisely this reason.
Many times the answer to this type of question is going to depend a great deal on deployment approach. Where it makes the most sense to place your logic depends on what you'll need to be able to get access to when making changes.
In the case of web applications that aren't compiled, it can be easier to deal with changes to a page or file than it is to work with queries (depending on query complexity, programming backgrounds / expertise, etc). In these kinds of situations, logic in the scripting language is typically ok and make make it easier to revise later.
In the case of desktop applications that require more effort to modify, placing this kind of logic in the database where it can be adjusted without requiring a recompilation of the application may benefit you. If there was a decision made that people used to qualify for bonuses at 20k, but now must make 25k, it'd be much easier to adjust that on the SQL Server than to recompile your accounting application for all of your users, for example.
I'm a strong advocate of putting as much logic as possible directly into the database. That means incorporating it in views and stored procedures. I believe that most follows the DRY principle.
For example, consider a table with FirstName and LastName columns, and an application that frequently makes use of a FullName field. You have three choices:
Query first and last name and compute the full name in application code.
Query first, last, and (first || last) in your application's SQL whenever you query the table.
Define a view CustomerExt that includes the first and last columns, and a computed full name column and then query against that view, rather than the customer table.
I believe option 3 is clearly correct. Consider the addition of a MiddleInitial field to the table and the full name computation. Using option 3, you simply need to replace the view and every application across your company will instantly use the new format for FullName. The view still makes the base columns available for those instances in which you need to do some special formatting, but for the standard instance everything works "automatically".
That's a simple case, but the principle is the same for more complex situations. Perform application- or company-wide data logic directly in the database and you do not need to concern yourself with keeping different applications up to date.
The answer depends on your expertise and your familiarity with the technologies involved. Also, if you're a technical manager, it depends on your analysis of the skills of the people working on your team and whom you intend on hiring / keeping on staff to support, extend and maintain the application in future.
If you are not literate and proficient in the database , (as you are not) then stick with doing it in code. If otoh, you are literate and proficient in database coding (as you should be), then there is nothing wrong (and a lot right) abput doing it in the database.
Two other considerations that might influence your decision are whether the logic is of such a complex nature that doing it in database code would be inordinately more complex or more abstract than in code, and second, if the process involved requires data from outside the database (from some other source) In either of these scenarios I would consider moving the logic to a code module.
The fact that you can step through the code in your IDE more easily is really the only advantage to your post-processing solution. Doing the logic in the database server reduces the sizes of result sets, often drastically, which leads to less network traffic. It also allows the query optimizer to get a much better picture of what you really want done, again often allowing better performance.
Therefore I would nearly always recommend SQL logic. If you treat a database as a mere dumb store, it will return the favor by behaving dumb, and depending on the situation, that can absolutely kill your performance - if not today, possibly next year when things have taken off...
That particular first example is a bad idea. Per-row functions do not scale well as the table gets bigger. In fact, a (likely) better way to do it would be to index LastName and use something like:
SELECT P.LastName, 'Michael' AS FirstName
FROM University.PhilosophyProfessors P
WHERE P.LastName = 'Baldwin'
UNION ALL SELECT P.LastName, 'Bruce' AS FirstName
FROM University.PhilosophyProfessors P
WHERE P.LastName <> 'Baldwin'
On databases where data are read more often than written (and that's most of them), these sorts of calculations should be done at write time such as using an insert/update trigger to populate a real FirstName field.
Databases should be used for storing and retrieving data, not doing massive non-databasey calculations that will slow down everything.
One big pro: a query may be all you can work with. Reports have been mentioned: many reporting tools or reporting plugins to existing programs only allow users to make their own queries (the results of which they will display).
If you cannot alter the code (because it isn't yours), you may yet be able to alter a query. And in some cases (data migration), you'll be writing queries to do migration as well.
I like to distinguish data vs business rules, and push the data rules into the stored procs as much as possible. There is not always a hard and fast distinction between the two, but in your example of calculating sales bonuses, the formula itself might be a business rule but the work of gathering and aggregating the various figures used in the formula is a data rule.
Sometimes, though, it depends on the deployment model and change control procedures. If the sales formula changes frequently and deployment of the business layer code is cumbersome, then tweaking just one function/stored proc in the database would be a great solution.
I'm a big fan of elegant database queries because the code is closer to the data and SQL works very well. But such queries, whether they're text in you app, generated by an OR mapper or stored in the database are harder to test, especially in the cloud, because you need a database to run against.
Database is exactly what it's called. DATABASE.
You should not mix the business logic with data layer.
Keep it separate as any close coupling between data and business makes impossible to follow best standards in programming.
I was working recently on a project where all logic was in MS SQL. Horrible idea, that back-fired after few years (energy company), no easy way to scale-out, no easy way to follow up CI/CD, Agile or code repos. Very difficult to co-work, very slow and very inefficient.
Company basically was reaching hardware limits in order to make it work (they've spent £100k on SSD SAN), while you could reach the same performance with C# for business and keep the database for data, with perhaps 3-4 cheap servers, that could easily scale-out.
Horrible, horrible idea. Guess what ? Company went under, as one time SQL server has reached it's potential (sometimes some queries were running for hours (very well written, but SQL is not for business logic. End of story)) when one time failed to bill all DD customers and basically didn't took the monthly payment that they needed to survive till next month (millions of pounds).

Database System Architecture discussion

I'd like to start a discussion about the implementation of a database system.
I'm working for a company having a database system grown over ca. the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do do data calculations and prepare the data for the reporting team. This is done via multiple levels. Let me try to explain the process a little bit more in detail: The Dataflow Team uses the data from the Entry database copied to another server and another database via Transactional-Replication (this data contains information from all clients). Then once per hour a self-written application is checking for changed rows in the input tables (using a ChangedDate column) and then calling a stored procedure for each output table calculating new data using 1-N of the input tables. After that the data is copied to another database on another server using again Transaction-Replication. Here another stored procedure is called to calclulate additional new output tables. This stored procedure is started using a SQL job. From there the data is split to different databases, each database being client specific. This copying is done using another self-written application using the .NET bulkcopy command (filtering on the client). These client specific databases are copied to different client specific reporting databases on other servers via another self-written application which compares the reporting database with the client specific database to calculate the data difference. Just the data differences are copied (because the reporting database run in former times on the client servers).
This whole process is orchestrated by another self-written application to control e.g. if the Transactional-Replications are finished before starting the job to call the Stored procedure etc... Futhermore also the synchronisation between the different clients is orchestrated here. The process can be graphically displayed by a self-written monitoring tool which looks pretty complex as you can imagine...
The status of all this components is logged and can be viewed by another self-written application.
If new columns or tables are added all this components have to be manually changed.
For deployment installation instructions are written using MS Word. (ca. 10 people working in this team)
Reporting:
The Reporting Team created it's own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the Web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people realize this scenario, I can't imagine that every company writes it's own custom applications.
What are actually the possibilities to allow fast calculations on databases (next to using T-SQL). I'm somehow missing the link here to the object oriented programming I'm used to from my old company, but we never dealt with so much data and maybe for fast calculations this is the way to do it...Or is it possible using e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long Stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to start by learning first how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it.)
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQl SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand, otherwise, they would be directly working on SQL Server, and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking, you are building a customized system to report on the data at hand. The only item that might be worth considering on these side, is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it. WIthout knowing more of your organization and the exact nature of operations.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might as is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and or the column names. If so, you can probably use .NET combined with SQL-SMO(or DMO if using SQL Server 2000) to code generate it.
Here's an example that I often see to load a datawarehouse
Assuming some row tables are loaded with the data from the source
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....

Why is parameterized SQL generated by NHibernate just as fast as a stored procedure?

One of my co-workers claims that even though the execution path is cached, there is no way parameterized SQL generated from an ORM is as quick as a stored procedure. Any help with this stubborn developer?
I would start by reading this article:
http://decipherinfosys.wordpress.com/2007/03/27/using-stored-procedures-vs-dynamic-sql-generated-by-orm/
Here is a speed test between the two:
http://www.blackwasp.co.uk/SpeedTestSqlSproc.aspx
Round 1 - You can start a profiler trace and compare the execution times.
For most people, the best way to convince them is to "show them the proof." In this case, I would create a couple basic test cases to retrieve the same set of data, and then time how long it takes using stored procedures versus NHibernate. Once you have the results, hand it over to them and most skeptical people should yield to the evidence.
I would only add a couple things to Rob's answer:
First, Make sure the amount of data involved in the test cases is similiar to production values. In other words if your queries are normally against tables with hundreds of thousands or rows, then create such a test environment.
Second, make everything else equal except for the use of an nHibernate generated query and a s'proc call. Hopefully you can execute the test by simply swapping out a provider.
Finally, realize that there is usually a lot more at stake than just stored procedures vs. ORM. With that in mind the test should look at all of the factors: execution time, memory consumption, scalability, debugging ability, etc.
The problem here is that you've accepted the burden of proof. You're unlikely to change someone's mind like that. Like it or not, people--even programmers-- are just too emotional to be easily swayed by logic. You need to put the burden of proof back on him- get him to convince you otherwise- and that will force him to do the research and discover the answer for himself.
A better argument to use stored procedures is security. If you use only stored procedures, with no dynamic sql, you can disable SELECT, INSERT, UPDATE, DELETE, ALTER, and CREATE permissions for the application database user. This will protect you against most 2nd order SQL Injection, whereas parameterized queries are only effective against first order injection.
Measure it, but in a non-micro-benchmark, i.e. something that represents real operations in your system. Even if there would be a tiny performance benefit for a stored procedure it will be insignificant against the other costs your code is incurring: actually retrieving data, converting it, displaying it, etc. Not to mention that using stored procedures amounts to spreading your logic out over your app and your database with no significant version control, unit tests or refactoring support in the latter.
Benchmark it yourself. Write a testbed class that executes a sampled stored procedure a few hundred times, and run the NHibernate code the same amount of times. Compare the average and median execution time of each method.
It is just as fast if the query is the same each time. Sql Server 2005 caches query plans at the level of each statement in a batch, regardless of where the SQL comes from.
The long-term difference might be that stored procedures are many, many times easier for a DBA to manage and tune, whereas hundreds of different queries that have to be gleaned from profiler traces are a nightmare.
I've had this argument many times over.
Almost always I end up grabbing a really good dba, and running a proc and a piece of code with the profiler running, and get the dba to show that the results are so close its negligible.
Measure it.
Really, any discussion on this topic is probably futile until you've measured it.
He may be correct for the specific use case he is thinking of. A stored procedure will probably execute faster for some complex set of SQL, that can be arbitrarily tuned. However, something you get from things like hibernate is caching. This may prove much faster for the lifetime of your actual application.
The additional layer of abstraction will cause it to be slower than a pure call to a sproc. Just by the fact that you have additional allocations on the managed heap, and additional pushes and pops off the callstack, the truth of the matter is that it is more efficient to call a sproc over having an ORM build the query, regardless how good the ORM is.
How slow, if its even measurable, is debatable. This is also helped by the fact that most ORM's have a caching mechanism to avoid doing the query at all.
Even if the stored procedure is 10% faster (it probably isn't), you may want to ask yourself how much it really matters. What really matters in the end, is how easy it is to write and maintain code for your system. If you are coding a web app, and your pages all return in 0.25 seconds, then the extra time saved by using stored procedures is negligible. However, there can be many added advantages of using an ORM like NHibernate, which would be extremely hard to duplicate using only stored procedures.