I want to implement unit-conversion functionality in a database for units of mass, volume, temperature, etc. My background is in software development, so I naturally lean towards creating a bunch of User-Defined Functions for this task. But I have a sneaking suspicion there is a more "SQLish" way of accomplishing this.
The table-based method, such as the one described here or here, would be excellent, but I see no simple way of extending this approach to units which are not related by a single multiplier, such as temperature (F-to-C, C-to-F) or atomic mass (kgmol-to-kg). I have seen multi-step approaches (such as the one described here), but these seem far too convoluted to be useful.
Am I missing something obvious here, or are functions really the only way to go?
To handle temperature conversions, your conversion table should have a multiplier and an offset, with the offset applied first: converted = (value + offset) * multiplier. For F --> C, for instance, the offset would be -32 and the multiplier 5/9.
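As a minimal sketch of that table, using Python's built-in sqlite3 module (the table and unit names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE unit_conversion (
        from_unit  TEXT,
        to_unit    TEXT,
        offset     REAL,   -- applied first
        multiplier REAL    -- applied second
    )
""")
# Conversion rule: converted = (value + offset) * multiplier.
# F -> C: (F - 32) * 5/9.  C -> F: (C + 160/9) * 9/5 = C * 9/5 + 32.
conn.executemany(
    "INSERT INTO unit_conversion VALUES (?, ?, ?, ?)",
    [("F", "C", -32.0, 5.0 / 9.0),
     ("C", "F", 160.0 / 9.0, 9.0 / 5.0)],
)

def convert(value, from_unit, to_unit):
    row = conn.execute(
        "SELECT (? + offset) * multiplier FROM unit_conversion "
        "WHERE from_unit = ? AND to_unit = ?",
        (value, from_unit, to_unit),
    ).fetchone()
    return row[0]

print(convert(212.0, "F", "C"))  # -> 100.0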
If you know all the possible units in advance, then a table-based approach works fine. However, if you want a fully flexible system, such as meters^5*liters to inches^5*gallons, then you'll want a base units table and a user-defined function to do the conversion. This function would probably use a recursive CTE to parse the units expression. All this would be rather complicated, so hopefully you have a complete list of units.
In the end, I decided to stick with UDFs. The primary motivation for that decision was that the complexity of the table-based approach didn't seem to be justified, given the infrequency of this task. Also, several of the functions (e.g. ConvertMass()) are dependent on subqueries (to determine the atomic weight of a given component, for example). So UDFs seemed like the most prudent way to go.
This is a dangerous question, so let me try to phrase it correctly. Premature optimization is the root of all evil, but if you know you need it, there is a basic set of rules that should be considered. This set is what I'm wondering about.
For instance, imagine you have a list of a few thousand items. How do you look up an item with a specific, unique ID? Of course, you simply use a Dictionary to map the ID to the item.
And if you know that there is a setting stored in a database that is required all the time, you simply cache it instead of issuing a database request a hundred times a second.
Or even something as simple as using a release instead of a debug build in prod.
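To make the first two examples concrete, here is a small Python sketch; the item list and the load_setting_from_db call are invented for illustration:

import time

# 1. Index items by their unique ID once; each lookup is then O(1)
#    instead of a scan over the whole list.
items = [{"id": i, "name": "item %d" % i} for i in range(5000)]
by_id = {item["id"]: item for item in items}
print(by_id[4321]["name"])   # constant-time lookup

# 2. Cache a frequently read, rarely changed setting instead of
#    hitting the database a hundred times a second.
_cache = {}

def get_setting(key, ttl_seconds=60):
    entry = _cache.get(key)
    if entry and time.time() - entry[1] < ttl_seconds:
        return entry[0]
    value = load_setting_from_db(key)   # hypothetical DB call
    _cache[key] = (value, time.time())
    return value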
I guess there are a few even more basic ideas.
I am specifically not looking for "don't do it, for experts: don't do it yet" or "use a profiler" answers, but for really simple, general hints. If you feel this is an argumentative question, you probably misunderstood my intention.
I am also not looking for concrete advice in any of my projects nor any sophisticated low level tricks. Think of it as an overview of how to avoid the most important performance mistakes you made as a very beginner.
Edit: This might be a good description of what I am looking for: Create a presentation (not a practical example) of common optimization rules for people who have a basic technical understanding (let's say they got a CS degree) but for some reason never wrote a single line of code. Point out the most important aspects. Pseudocode is fine. Do not assume specific languages or even architectures.
Two rules:
Use the right data structures.
Use the right algorithms.
I think that covers it.
Minimize the number of network roundtrips
Minimize the number of hard-disk seeks
These are several orders of magnitude slower than anything else your program is likely to do, so avoiding them can be very important indeed. Typical methods to achieve this are:
Caching
Increasing the granularity of network and HD accesses
For example, B-Trees are absolutely ubiquitous in DB systems because they increase the granularity of HD access for on-disk index lookups.
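To illustrate the granularity point, compare an N+1 access pattern with a single batched query. A sketch using Python's sqlite3 with an invented schema (the same reasoning applies to a networked database, where every query is a roundtrip):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, "user%d" % i) for i in range(1000)])
wanted = list(range(100))

# Slow: one query (roundtrip) per ID -- the "N+1" pattern.
names = [conn.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()[0]
         for i in wanted]

# Better: one query for the whole batch.
placeholders = ",".join("?" * len(wanted))
rows = conn.execute(
    "SELECT id, name FROM users WHERE id IN (%s)" % placeholders,
    wanted).fetchall()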
I think something extremely important is to be very careful with all code that is frequently executed. This is normally the code in critical inner loops.
Rule 1: Know this code
For this code, avoid all overhead. Small differences in runtime can make a big impact on the overall performance. E.g., if you implement an image filter, a difference of 0.001 ms per pixel adds a full second to the filter's runtime on a 1000x1000 image (which is not big).
Things to avoid/do in inner loops (a small sketch follows the list):
don't go through interfaces (e.g. DB queries, RPC calls, etc.)
don't jump around in the RAM, try to access it linearly
if you have to read from disk then read large chunks outside the inner loop (paging)
avoid virtual function calls
avoid function calls / use inline functions
use float instead of double if possible
avoid numerical casts if possible
use ++a instead of a++ in C++
iterate directly on pointers if possible
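The points above are mostly C/C++-flavoured, but the underlying principle -- hoist everything you can out of the hot loop -- applies in any language. A Python-flavoured sketch with an invented filter:

import math

pixels = [0.5] * (1000 * 1000)   # a 1000x1000 image, flattened

# Naive: loop-invariant work is redone for every pixel.
def filter_slow(pixels, gain):
    out = []
    for p in pixels:
        out.append(math.sqrt(gain) * p)   # sqrt recomputed per pixel
    return out

# Hoisted: compute invariants once, keep the inner loop tight.
def filter_fast(pixels, gain):
    k = math.sqrt(gain)                   # computed once, outside the loop
    return [k * p for p in pixels]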
The second piece of general advice: each layer/interface has a cost. Try to avoid large stacks of different technologies; otherwise the system will spend more time on data transformation than on the actual job. Keep things simple.
And as the others said, use the right algorithm: try to optimize the algorithmic complexity first, before you optimize the implementation.
I know you're looking for specific coding hints, but those are easy to find: caching, loop unrolling, code hoisting, data & code locality, blah, blah...
The biggest hint of all is don't use them.
Would it help to make this point if I said "This is the secret that the almighty Powers That Be don't want you to know!!"? Pick your Powers: Microsoft, Google, Sun, etc. etc.
Don't Use Them
Until you know, with dead certainty, what the problems are, and then the coding hints are obvious.
Here's an example where many coding tricks were used, but the heart and soul of the exercise is not the coding techniques, but the diagnostic technique.
Are your algorithms correct for the situation or are there better ones available?
My experience of using Adobe ColdFusion, even if still somewhat limited, was absolutely joyful and pleasant.
Of all the good things I could say about ColdFusion, one feature completely blew me away. It might be neither very efficient nor useful in production, but anyway: I am talking about the so-called "query of queries" feature, or the dbtype="query" attribute of cfquery. It allows you to run SQL statements against arbitrary datasets, not just a database connection. You can, for example, join a resultset that you've just retrieved from the database with an in-memory structure (subject, of course, to certain limitations). It provides a quick-and-dirty way to kind of "post-process" the data, which can sometimes be much more readable (and flexible, too!) than, say, iterating through the dataset in a loop.
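To illustrate what query-of-queries feels like, here is a rough Python analogue (a sketch with invented data, not an equivalent of cfquery): the standard library's sqlite3 can host an in-memory table that you fill from any dataset and then query with plain SQL.

import sqlite3

# Any in-memory dataset...
orders = [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)

# ...can now be "post-processed" with plain SQL.
for row in conn.execute(
        "SELECT customer, SUM(total) FROM orders GROUP BY customer"):
    print(row)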
However, ColdFusion is not a very popular product, and I am not going to go over the reasons why that is. What I am asking is: is there any support for this technique in other languages (like a library that does more or less the same)? Python? Perl? Ruby? PHP? Anything? Because it seems to me that the potential of this feature is huge: maybe not in production code, but it is an absolute life-saver if you need to test something quickly. Needless to say, the SQL ColdFusion uses for this is somewhat limited, to my knowledge, but the idea is still great.
If you don't find anything that handles data as well as ColdFusion then remember it plays very well with other programming languages. You can always do the heavy query processing in CF then just wrap your processing logic in remote CFCs and expose them as web services serving up JSON.
That will let you benefit from what you find great about ColdFusion while trying out some other languages.
If you need to get away from CF, try SQLAlchemy in Python, or, as other posters said, Rails and LINQ are worth playing with.
I can't speak for Python, Ruby, Perl, or PHP; however, .NET has something called LINQ, which is essentially QoQ on steroids.
Lots of frameworks use object-relational mapping (ORM), which will convert your database tables to objects.
For example, using Rails you fetch data from a model instead of directly talking to the database. Queries, or finds, are returned as array objects, which can in turn be queried themselves.
You can also accomplish this in .NET by using LINQ. LINQ will let you query objects as well as databases.
In doing performance analysis of Query of Queries, I was surprised by their execution time: I could not get them to return in less than 10ms in my tests, where simple queries to the actual database would return in 1ms or less. My understanding (at least in CF MX 7) is that while this is a useful function, it is not a highly optimized one. I found it to be much faster to loop over the query manually, doing conditional logic to replace what I was attempting to do with my query of queries.
That being said, it is faster than going to the database if the initial query is slow. Just don't use it thinking that it is going to always be faster than doing a more creative sort or initial query as each QofQ is far from instantaneous.
For Java, there are three projects worth taking a look at, each with its own positives and negatives, some more SQL-like than others: JoSQL, JXPath, and MetaModel.
Maybe one of these days I'll figure out how to call a QoQ directly from the Java underneath CF. ;)
This technique (ColdFusion's query-of-queries) is one of the worst ideas out there. Not only does it keep business logic in the database, but it takes what little business logic you have left in your code and shoves it out to the database just for spite.
What you need is a good language, not bad techniques to make up for deficiencies.
Python and Ruby, as well as other languages not on your list such as C# and Haskell, have exceptional support for writing arbitrary and powerful queries against in-memory objects. This is in fact the technique that you want, not ColdFusion's query-of-queries. The technique of writing queries against in-memory objects is an aspect of a general style of programming called functional programming.
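For instance, in Python a "query" over in-memory objects reads as comprehensions and built-ins (names invented for illustration):

from collections import namedtuple

Order = namedtuple("Order", "id customer total")
orders = [Order(1, "alice", 30.0), Order(2, "bob", 12.5), Order(3, "alice", 7.0)]

# Roughly: SELECT * WHERE total > 10 ORDER BY total DESC
big_orders = sorted((o for o in orders if o.total > 10),
                    key=lambda o: o.total, reverse=True)

# Roughly: SELECT customer, SUM(total) GROUP BY customer
totals = {}
for o in orders:
    totals[o.customer] = totals.get(o.customer, 0.0) + o.total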
I like ORM tools, but I have often thought that for large updates (thousands of rows), it seems inefficient to load, update and save when something like
UPDATE [table] set [column] = [value] WHERE [predicate]
would give much better performance.
However, assuming one wanted to go down this route for performance reasons, how would you then make sure that any objects cached in memory were updated correctly?
Say you're using LINQ to SQL, and you've been working on a DataContext, how do you make sure that your high-performance UPDATE is reflected in the DataContext's object graph?
This might be a "you don't" or "use triggers on the DB to call .NET code that drops the cache" etc etc, but I'm interested to hear common solutions to this sort of problem.
You're right, in this instance using an ORM to load, change and then persist records is not efficient. My process goes something like this:
1) Early in implementation, use the ORM (in my case NHibernate) exclusively
2) As development matures, identify performance issues, which will include large updates
3) Refactor those out to SQL or a stored-procedure approach
4) Use the Refresh(object) command to update cached objects
My big problem has been informing other clients that the update has occurred. In most instances we have accepted that some clients will be stale, which is the case with standard ORM usage anyway, and then check a timestamp on update/insert.
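For readers on other stacks, the same refresh-after-bulk-update idea exists elsewhere. For example, SQLAlchemy's query-level update can issue one SQL UPDATE and also patch the matching objects already loaded in the session; a hedged sketch with an invented model:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import sessionmaker, declarative_base

Base = declarative_base()

class Product(Base):
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)
    status = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# One SQL UPDATE for many rows; synchronize_session="fetch" makes the
# session update any matching objects it already holds in memory.
session.query(Product).filter(Product.status == "pending").update(
    {Product.status: "archived"}, synchronize_session="fetch")
session.commit()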
Most ORMs also have facilities for performing large or "bulk" updates efficiently. The Stateless Session is one such mechanism available in Hibernate for Java which apparently will be available in NHibernate 2.x:
http://ayende.com/Blog/archive/2007/11/13/What-is-going-on-with-NHibernate-2.0.aspx
ORMs are great for rapid development, but you're right -- they're not efficient. They're great in that you don't need to think about the underlying mechanisms which convert your objects in memory to rows in tables and back again. However, many times the ORM doesn't pick the most efficient process to do that. If you really care about the performance of your app, it's best to work with a DBA to help you design the database and tune your queries appropriately (or at least understand the basic concepts of SQL yourself).
Bulk updates are a questionable design. Sometimes they seem necessary; in many cases, however, a better application design can remove the need for bulk updates.
Often, some other part of the application already touched each object one at a time; the "bulk" update should have been done in the other part of the application.
In other cases, the update is a prelude to processing elsewhere. In this case, the update should be part of the later processing.
My general design strategy is to refactor applications to eliminate bulk updates.
ORMs just won't be as efficient as hand-crafted SQL. Period. Just like hand-crafted assembler will be faster than C#. Whether or not that performance difference matters depends on lots of things. In some cases the higher level of abstraction ORMs give you might be worth more than the potentially higher performance, in other cases not.
Traversing relationships with object code can be quite nice but as you rightly point out there are potential problems.
That being said, I personally view ORMs as largely a false economy. I won't repeat myself here but just point to Using an ORM or plain SQL?
For some of the apps I've developed (then proceeded to forget about), I've been writing plain SQL, primarily for MySQL. Though I have used ORMs in python like SQLAlchemy, I didn't stick with them for long. Usually it was either the documentation or complexity (from my point of view) holding me back.
I see it like this: use an ORM for portability, plain SQL if it's just going to be using one type of database. I'm really looking for advice on when to use an ORM or SQL when developing an app that needs database support.
Thinking about it, it would be far better to just use a lightweight wrapper to handle database inconsistencies vs. using an ORM.
Speaking as someone who spent quite a bit of time working with JPA (Java Persistence API, basically the standardized ORM API for Java/J2EE/EJB), which includes Hibernate, EclipseLink, Toplink, OpenJPA and others, I'll share some of my observations.
ORMs are not fast. They can be adequate and most of the time adequate is OK but in a high-volume low-latency environment they're a no-no;
In general-purpose programming languages like Java and C# you need an awful lot of magic to make them work (e.g. load-time weaving in Java, instrumentation, etc);
When using an ORM, rather than getting further from SQL (which seems to be the intent), you'll be amazed how much time you spend tweaking XML and/or annotations/attributes to get your ORM to generate performant SQL;
For complex queries, there really is no substitute. In JPA, for example, some queries that are possible in raw SQL simply aren't possible at all, and when you do have to use raw SQL in JPA it's not pretty (C#/.Net at least has dynamic types--var--which is a lot nicer than an Object array);
There are an awful lot of "gotchas" when using ORMs. This includes unintended or unexpected behavior, the fact that you have to build in the capability to do SQL updates to your database (by using refresh() in JPA or similar methods because JPA by default caches everything so it won't catch a direct database update--running direct SQL updates is a common production support activity);
The object-relational mismatch is always going to cause problems. With any such problem there is a tradeoff between complexity and completeness of the abstraction. At times I felt JPA went too far and hit a real law of diminishing returns where the complexity hit wasn't justified by the abstraction.
There's another problem which takes a bit more explanation.
The traditional model for a Web application is to have a persistence layer and a presentation layer (possibly with a service layer or other layers in between, but these are the important two for this discussion). ORMs force a rigid view from your persistence layer up to the presentation layer (i.e. your entities).
One of the criticisms of more raw SQL methods is that you end up with all these VOs (value objects) or DTOs (data transfer objects) that are used by just one query. This is touted as an advantage of ORMs because you get rid of that.
Thing is those problems don't go away with ORMs, they simply move up to the presentation layer. Instead of creating VOs/DTOs for queries, you create custom presentation objects, typically one for every view. How is this better? IMHO it isn't.
I've written about this in ORM or SQL: Are we there yet?.
My persistence technology of choice (in Java) these days is ibatis. It's a pretty thin wrapper around SQL that does 90%+ of what JPA can do (it can even do lazy-loading of relationships, although it's not well-documented) but with far less overhead (in terms of complexity and actual code).
This came up last year in a GWT application I was writing. Lots of translation from EclipseLink to presentation objects in the service implementation. If we were using ibatis it would've been far simpler to create the appropriate objects with ibatis and then pass them all the way up and down the stack. Some purists might argue this is Bad™. Maybe so (in theory) but I tell you what: it would've led to simpler code, a simpler stack and more productivity.
ORMs have some nice features. They can handle much of the dog-work of copying database columns to object fields. They usually handle converting the language's date and time types to the appropriate database type. They generally handle one-to-many relationships pretty elegantly as well by instantiating nested objects. I've found if you design your database with the strengths and weaknesses of the ORM in mind, it saves a lot of work in getting data in and out of the database. (You'll want to know how it handles polymorphism and many-to-many relationships if you need to map those. It's these two domains that provide most of the 'impedance mismatch' that makes some call ORM the 'Vietnam of computer science'.)
For applications that are transactional, i.e. you make a request, get some objects, traverse them to get some data and render it on a Web page, the performance tax is small, and in many cases ORM can be faster because it caches objects it has seen before, which otherwise would have required querying the database multiple times.
For applications that are reporting-heavy, or deal with a large number of database rows per request, the ORM tax is much heavier, and the caching that they do turns into a big, useless memory-hogging burden. In that case, simple SQL mapping (LinQ or iBatis) or hand-coded SQL queries in a thin DAL is the way to go.
I've found for any large-scale application you'll find yourself using both approaches. (ORM for straightforward CRUD and SQL/thin DAL for reporting).
I say plain SQL for Reads, ORM for CUD.
Performance is something I'm always concerned about, especially in web applications, but also code maintainability and readability. To address these issues I wrote SqlBuilder.
ORM is not just about portability (which is kinda hard to achieve even with ORMs, for that matter). What it gives you is basically a layer of abstraction over a persistent store, where an ORM tool frees you from writing boilerplate SQL queries (selects by PK or by predicates, inserts, updates and deletes) and lets you concentrate on the problem domain.
Any respectable design will need some abstraction for the database, just to handle the impedance mismatch. But the simplest first step (and adequate for most cases) I would expect would be a DAL, not a heavyweight ORM. Your only options aren't those at the ends of the spectrum.
EDIT in response to a comment requesting me to describe how I distinguish DAL from ORM:
A DAL is what you write yourself, maybe starting from a class that simply encapsulates a table and maps its fields to properties. An ORM is code you don't write: its abstraction mechanisms are inferred from other properties of your DBMS schema, mostly PKs and FKs. (This is where you find out whether the automatic abstractions start getting leaky or not. I prefer to introduce them intentionally, but that may just be my personal preference.)
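To make the distinction concrete, a hand-rolled DAL can be as small as a class like this Python sketch (the table and columns are invented for illustration):

import sqlite3

class UserTable:
    """Hand-written DAL: encapsulates one table, maps rows to dicts."""

    def __init__(self, conn):
        self.conn = conn

    def find(self, user_id):
        row = self.conn.execute(
            "SELECT id, name, email FROM users WHERE id = ?",
            (user_id,)).fetchone()
        return None if row is None else dict(zip(("id", "name", "email"), row))

    def insert(self, name, email):
        cur = self.conn.execute(
            "INSERT INTO users (name, email) VALUES (?, ?)", (name, email))
        self.conn.commit()
        return cur.lastrowid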
The key that made my ORM use really fly was code generation. I agree that the ORM route isn't the fastest, in code performance terms. But when you have a medium to large team and the DB is changing rapidly, the ability to regenerate classes and mappings from the DB as part of the build process is something brilliant to behold, especially when you use CI. So your code may not be the fastest, but your coding will be - I know which I'd take in most projects.
My recommendation is to develop using an ORM while the Schema is still fluid, use profiling to find bottlenecks, then tune those areas which need it using raw Sql.
Another thought, the caching built into Hibernate can often make massive performance improvements if used in the right way. No more going back to the DB to read reference data.
The dilemma of whether to use a framework or not is quite common in modern-day software development.
What is important to understand is that every framework or approach has its pros and cons - for example, in our experience we have found that ORM is useful when dealing with transactions, i.e. insert/update/delete operations - but when it comes to fetching data with complex results it becomes important to evaluate the performance and effectiveness of the ORM tool.
Also, it is important to understand that it is not compulsory to select one framework or approach and implement everything in it. What we mean is that you can have a mix of ORM and native query language. Many ORM frameworks give extension points to plug in native SQL. We should try not to overuse a framework or an approach. We can combine certain frameworks or approaches and come up with an appropriate solution.
You can use the ORM for insertion, updating, deletion and versioning with a high level of concurrency, and you can use native SQL for report generation and long listings.
There's no 'one-tool-fits-all' solution, and this is also true for the question 'should I use an OR/M or not?'.
I would say: if you have to write an application/tool which is very 'data' focused, without much other logic, then I'd use plain SQL, since SQL is the domain-specific language for this kind of application.
On the other hand, if I was to write a business/enterprise application which contains a lot of 'domain' logic, then I'd write a rich class model which could express this domain in code. In such a case, an OR/M mapper might be very helpful to successfully do so, as it takes a lot of plumbing code out of your hands.
One of the apps I've developed was an IRC bot written in python. The modules it uses run in separate threads, but I haven't figured out a way to handle threading when using sqlite. Though, that might be better for a separate question.
I really should have just reworded both the title and the actual question. I've never actually used a DAL before, in any language.
Use an ORM that works like SQL, but provides compile-time checks and type safety. Like my favorite: Data Knowledge Objects (disclosure: I wrote it)
For example:
for (Bug bug : Bug.ALL.limit(100)) {   // streams the first 100 rows
    int id = bug.getId();               // typed getters generated from the schema
    String title = bug.getTitle();
    System.out.println(id + " " + title);
}
Fully streaming. Easy to set up (no mappings to define - reads your existing schemas). Supports joins, transactions, inner queries, aggregation, etc. Pretty much anything you can do in SQL. And has been proven from giant datasets (financial time series) all the way down to trivial (Android).
I know this question is very old, but I thought that I would post an answer in case anyone comes across it like me. ORMs have come a long way. Some of them actually give you the best of both worlds: making development more productive and maintaining performance.
Take a look at SQL Data (http://sqldata.codeplex.com). It is a very lightweight ORM for C# that covers all the bases.
FYI, I am the author of SQL Data.
I'd like to add my voice to the chorus of replies that say "There's a middle ground!".
To an application programmer, SQL is a mixture of things you might want to control and things you almost certainly don't want to be bothered controlling.
What I've always wanted is a layer (call it DAL, ORM, or micro-ORM, I don't mind which) that will take charge of the completely predictable decisions (how to spell SQL keywords, where the parentheses go, when to invent column aliases, what columns to create for a class that holds two floats and an int ...), while leaving me in charge of the higher-level aspects of the SQL, i.e. how to arrange JOINs, server-side computations, DISTINCTs, GROUP BYs, scalar subqueries, etc.
So I wrote something that does this: http://quince-lib.com/
It's for C++: I don't know whether that's the language you're using, but all the same it might be interesting to see this take on what a "middle ground" could look like.