I want to implement unit-conversion functionality in a database for units of mass, volume, temperature, etc. My background is in software development, so I naturally lean towards creating a bunch of User-Defined Functions for this task. But I have sneaking suspicion there is a more "SQLish" way of accomplishing this.
The table-based method, such as the one described here or here would be excellent, but I see no simple way of extending this approach for units which are not related by a single multiplier, such as temperature (F-to-C, C-to-F) or atomic mass (kgmol-to-kg). I have seen multi-step approaches (such as the one described here), but these seem far too convoluted to be useful.
Am I missing something obvious here, or are functions really the only way to go?
To handle temperature conversions, your conversion table should have a multiplier and an offset. For F --> C, for instance, the offset would be -32 and the multiplier 5/9.
If you know all the possible units in advance, then a table-based message works fine. However, if you want a fully flexible system such as meters^5*liters to inches^5*gallons, then you'll want a basse units table and a user defined function to do the conversion. This function would probably use a recursive cte to parse the units expression. All this would be rather complicated, so hopefully you have a complete list of units.
In the end, I decided to stick with UDFs. The primary motivation for that decision was that the complexity of the table-based approach didn't seem to be justified, given the infrequency of this task. Also, several of the functions (e.g. ConvertMass()) are dependent on subqueries (to determine the atomic weight of a given component, for example). So UDFs seemed like the most prudent way to go.
Related
Just starting to grok monads. I think in Clojure, so purity isn't terribly important to me.
I have a series of business operations (composable transforms) which may fail. This can be abstracted nicely with error-monad.
Some of the business operations involve database IO, and I need to perform the operations in bulk for speed. each bulk operation acts on a set of independent items, so one failure must not fail the whole set.
should I just think of my bulk transforms as a series of functions on one object (map) done inside something like error monad but acting on independent items in a seq? does seq-monad help me here? how should I be thinking about this? any other ideas?
I don't see any particular benefit in combining this with IO-monad for my database side effects in Clojure, thoughts on this?
It's hard to say what you need exactly, because your business case seems a little involved, but I think you may get some mileage from using the clojure.algo.monads library.
You can create your own variation on a error-monad or maybe-monad that internally deals with batches.
The idea is somewhat similar to what Apple has done in the OpenGL stack. I want to have that a bit more general.
Basically, I want to have specialised and optimised variants of some code for some specific cases.
In other words: I have given an algorithm/code for a function (let B = {0,1})
f : B^n -> B^m
Now, I special a specific case by a function (which predefines part of the input of f)
preset : {1..n} -> {0,1,unset}
The amount of predefinitions (∈ {0..n}) is then given by
pn := |preset⁻¹({0,1})|
Canonically, we now get a specialised function
f_preset : B^(n-pn) -> B^m
Also canonically, we get the code/algorithm for this specialised function. Naturally, the code for f_preset will be somewhat more fast than f with pn > 0. Then, you also can optimise this code further (there might be some dead code now, some loops can be unpacked now, some calculations can be precalculated, etc). In some cases, it can have noteable improvements.
Apple does roughly this for their OpenGL stack (from what I have read / know): They try to find a good preset at runtime after everything is setup for variables which will not change anymore, then make an optimised version of the specialised function and only use that one instead of the original function.
Initially, I thought about a way to optimise the physics simulation of some own game. There I have a lot of particle objects and a set of particle types (which is unknown at compile time). A particle type is a set of attributes. The particle types are fixed and constant once they are loaded. Each particle object is of one of theye particle types. The physic simulation for a particle object is some very heavy peace of code with many many branches and very heavily depends on the particle type. My idea was now to have an optimised physics simulation function for each particle type.
After thinking a bit about this, I wanted to go a bit further:
I want to automatically calculate a set of such presets at runtime and maintain the optimised code for each. And I want to automatically add or remove presets when the circumstances change.
There are several questions now:
Is there an easy way to calculate a good preset? How do I know what variables are constant for a given situation?
Is there an easy way to check how good a preset is? 'Good' refers to the performance of the resulting optimised code.
How to compare two algorithms/codes for performance? Via some heuristic? Or by testing with random input?
How many presets (and optimised code variants) should there be for a function? A fixed limit for all functions? Or is this different for every function? Is it maybe even depending on the current computer state?
How to maintain the different optimised code variants? A wrapper function around f which chooses automatically the best optimised variant doesn't seem to be very nice as this maybe not so easy check would be needed for every single call. A solution to this problem might also be deeply related to the question about how to find the set/amount of good presets. (In the particle type case, the optimised code would be attached to / saved together with the particle type. The amount of particle types also define the amount of presets.)
For my initial case, most of these questions are kind of obsolete but am really interested now in how to do this in a more general way. Of course, most/all of these questions are also uncalculateable but I wonder to what degree you may still get good results.
This whole topic is also very important for optimisations in JIT compilers. Are they doing these kind of optimisations already? To what degree?
Are there good recent research works which answers some of my questions? Or maybe also some results which say that it is just too hard to do this in such a general way?
It seems to me you are asking about partial evaluation.
I actually have a bit of a problem with that concept, because it is usually couched in terms that are over-academic and over-difficult.
The way it is usually expressed is that you have some general function F(Islow, Ifast) having arguments that can take different values at different times. The Islow arguments change seldom, and the Ifast arguments can be different every time it is called.
Then the problem is to write some kind of partial-evaluator function G(F, Islow) -> F1(Ifast) that takes function F and the Islow arguments, and generates a new (simpler) function F1 that only takes the Ifast arguments.
The problem with this is 1) somebody has to write the general function F, and 2) somebody has to write the general partial evaluator G.
What makes more sense to me is to write from scratch a function H(Islow) -> F1(Ifast), that is, write a code-generator specifically for F1, rather than writing two functions F and G, especially where G is very difficult to write.
H is usually much easier to write than F, and G need not be written at all! The result function F1 usually is smaller and has much higher performance than F, so it's a win-win situation.
When people write code generators, that is what they are doing, and it is a very effective programming technique.
This is a dangerous question, so let me try to phrase it correctly. Premature optimization is the root of all evil, but if you know you need it, there is a basic set of rules that should be considered. This set is what I'm wondering about.
For instance, imagine you got a list of a few thousand items. How do you look up an item with a specific, unique ID? Of course, you simply use a Dictionary to map the ID to the item.
And if you know that there is a setting stored in a database that is required all the time, you simply cache it instead of issuing a database request hundred times a second.
Or even something as simple as using a release instead of a debug build in prod.
I guess there are a few even more basic ideas.
I am specifically not looking for "don't do it, for experts: don't do it yet" or "use a profiler" answers, but for really simple, general hints. If you feel this is an argumentative question, you probably misunderstood my intention.
I am also not looking for concrete advice in any of my projects nor any sophisticated low level tricks. Think of it as an overview of how to avoid the most important performance mistakes you made as a very beginner.
Edit: This might be a good description of what I am looking for: Create a presentation (not a practical example) of common optimization rules for people who have a basic technical understanding (let's say they got a CS degree) but for some reason never wrote a single line of code. Point out the most important aspects. Pseudocode is fine. Do not assume specific languages or even architectures.
Two rules:
Use the right data structures.
Use the right algorithms.
I think that covers it.
Minimize the number of network roundtrips
Minimize the number of harddisk seeks
These are several orders of magnitude slower than anything else your program is likely to do, so avoiding them can be very important indeed. Typical methods to achieve this are:
Caching
Increasing the granularity of network and HD accesses
For example, B-Trees are absolutely ubiquitous in DB systems because the reduce the granularity of HD access for on-disk index lookups.
I think something extremely important is to be very carefully on all code that is frequently executed. This is normally the code in critical inner loops.
Rule 1: Know this code
For this code avoid all overhead. Small differences in runtime can make a big impact on the overall performance. E.g. if you implement an image filter a difference of 0.001ms per pixel will make a difference in 1s in the filter runtime on a image with size 1000x1000 (which is not big).
Things to avoid/do in inner loops are:
don't go through interfaces (e.g DB queries, RPC calls etc)
don't jump around in the RAM, try to access it linearly
if you have to read from disk then read large chunks outside the inner loop (paging)
avoid virtual function calls
avoid function calls / use inline functions
use float instead of double if possible
avoid numerical casts if possible
use ++a instead of a++ in C++
iterate directly on pointers if possible
The second general advice: Each layer/interface costs, try to avoid large stacks of different technologies, the system will spend more time in data transformation then in doing the actual job, keep things simple.
And as the others said, use the right algorithm, try to optimize the algorithm complexity first before you optimize the algorithm implementation.
I know you're looking for specific coding hints, but those are easy to find: cacheing, loop unrolling, code hoisting, data & code locality, blah, blah...
The biggest hint of all is don't use them.
Would it help to make this point if I said "This is the secret that the almighty Powers That Be don't want you to know!!"? Pick your Powers: Microsoft, Google, Sun, etc. etc.
Don't Use Them
Until you know, with dead certainty, what the problems are, and then the coding hints are obvious.
Here's an example where many coding tricks were used, but the heart and soul of the exercise is not the coding techniques, but the diagnostic technique.
Are your algorithms correct for the situation or are there better ones available?
Often I deal with aggregate or parent entities which have attributes derived from their constituent or children members. For example:
The byte_count and packet_count of a TcpConnection object is computed from
the same attributes of its two constituent TcpStream objects, which in turn are
computed from their constituent TcpPacket objects.
An Invoices object might have a total which is basically the SUM() of its
constituent InvoiceLineItems' prices, with a little freight, discount and tax
logic thrown in.
When dealing with millions of packets or millions of invoiced line items (I wish!), on-demand computation of these derived attributes -- either in a VIEW or more commonly in presentation logic like reports or web interfaces -- is often unacceptably slow.
How do you decide, before performance concerns force your hand, whether to "promote" derived attributes to precomputed fields?
I personally wouldn't denormalize until performance trade-offs force my hand (because the downside of denormalizations are too drastic IMHO), but you might also consider:
Convenience: e.g. if two different client apps want to calculate the same derived attributes, they both have to code up the queries to calculate them. Denormalization offers both client apps the derived attribute in a simpler way.
Stability over time: e.g. if the formula for calculating a derived attribute is changeable, denormalization allows you to capture and store the derived value at a point in time so future calculations will never get it wrong
Simpler queries: adding complexity to the DB structure can mean your Select query is simpler at the client end.
Performance: Select queries on denormalized data can be quicker.
Ref: The Database Programmer: The Argument for Denormalization. Be sure to read as well his article on Keeping Denormalized Values Correct - his recommendation is to use triggers. That brings home the kind of trade-off denormalization requires.
Basically, you don't. You left performance concerns force your hand.
That's the best answer because 99% of the time, you should not be pre-optimizing like this, it's better to just calc it on the fly.
However, it is quite common for client-application developers to come to the server-side with mistaken preconceptions like "on-demand computation of ...derived attributes... -- is often unacceptably slow", and this just IS NOT true. The correct wording here would be "is rarely unacceptably slow".
As such, unless you are an expert in this (a DB Development Architect, etc.), you should not be engaging in premature optimization. Wait until it's obvious that is has to be fixed, then look at pre-aggregation.
How current the data must be determines how you implement it, really.
I'll assume 2 simple states: current or not current.
Current: indexed views, triggers, stored procs to maintain aggregate tables etc
Not current: Reporting Service snapshots, log shipping/replication, data warehouse etc
That said, I would develop against the same quantity of data as I have in prod so I have some confidence in response time. You should rarely be surprised by your code performance...
I would like to know if somebody often uses metrics to validate its code/design.
As example, I think I will use:
number of lines per method (< 20)
number of variables per method (< 7)
number of paremeters per method (< 8)
number of methods per class (< 20)
number of field per class (< 20)
inheritance tree depth (< 6).
Lack of Cohesion in Methods
Most of these metrics are very simple.
What is your policy about this kind of mesure ? Do you use a tool to check their (e.g. NDepend) ?
Imposing numerical limits on those values (as you seem to imply with the numbers) is, in my opinion, not very good idea. The number of lines in a method could be very large if there is a significant switch statement, and yet the method is still simple and proper. The number of fields in a class can be appropriately very large if the fields are simple. And five levels of inheritance could be way too many, sometimes.
I think it is better to analyze the class cohesion (more is better) and coupling (less is better), but even then I am doubtful of the utility of such metrics. Experience is usually a better guide (though that is, admittedly, expensive).
A metric I didn't see in your list is McCabe's Cyclomatic Complexity. It measures the complexity of a given function, and has a correlation with bugginess. E.g. high complexity scores for a function indicate: 1) It is likely to be a buggy function and 2) It is likely to be hard to fix properly (e.g. fixes will introduce their own bugs).
Ultimately, metrics are best used at a gross level -- like control charts. You look for points above and below the control limits to identify likely special cases, then you look at the details. For example a function with a high cyclomatic complexity may cause you to look at it, only to discover that it is appropriate because it a dispatcher method with a number of cases.
management by metrics does not work for people or for code; no metrics or absolute values will always work. Please don't let a fascination with metrics distract from truly evaluating the quality of the code. Metrics may appear to tell you important things about the code, but the best they can do is hint at areas to investigate.
That is not to say that metrics are not useful. Metrics are most useful when they are changing, to look for areas that may be changing in unexpected ways. For example, if you suddenly go from 3 levels of inheritance to 15, or 4 parms per method to 12, dig in and figure out why.
example: a stored procedure to update a database table may have as many parameters as the table has columns; an object interface to this procedure may have the same, or it may have one if there is an object to represent the data entity. But the constructor for the data entity may have all of those parameters. So what would the metrics for this tell you? Not much! And if you have enough situations like this in the code base, the target averages will be blown out of the water.
So don't rely on metrics as absolute indicators of anything; there is no substitute for reading/reviewing the code.
Personally I think it's very difficult to adhere to these types of requirements (i.e. sometimes you just really need a method with more than 20 lines), but in the spirit of your question I'll mention some of the guidelines used in an essay called Object Calisthenics (part of the Thoughtworks Anthology if you're interested).
Levels of indentation per method (<2)
Number of 'dots' per line (<2)
Number of lines per class (<50)
Number of classes per package (<10)
Number of instance variances per class (<3)
He also advocates not using the 'else' keyword nor any getters or setters, but I think that's a bit overboard.
Hard numbers don't work for every solution. Some solutions are more complex than others. I would start with these as your guidelines and see where your project(s) end up.
But, regarding these number specifically, these numbers seem pretty high. I usually find in my particular coding style that I usually have:
no more than 3 parameters per method
signature about 5-10 lines per method
no more than 3 levels of inheritance
That isn't to say I never go over these generalities, but I usually think more about the code when I do because most of the time I can break things down.
As others have said, keeping to a strict standard is going to be tough. I think one of the most valuable uses of these metrics is to watch how they change as the application evolves. This helps to give you an idea how good a job you're doing on getting the necessary refactoring done as functionality is added, and helps prevent making a big mess :)
OO Metrics are a bit of a pet project for me (It was the subject of my master thesis). So yes I'm using these and I use a tool of my own.
For years the book "Object Oriented Software Metrics" by Mark Lorenz was the best resource for OO metrics. But recently I have seen more resources.
Unfortunately I have other deadlines so no time to work on the tool. But eventually I will be adding new metrics (and new language constructs).
Update
We are using the tool now to detect possible problems in the source. Several metrics we added (not all pure OO):
use of assert
use of magic constants
use of comments, in relation to the compelxity of methods
statement nesting level
class dependency
number of public fields in a class
relative number of overridden methods
use of goto statements
There are still more. We keep the ones that give a good image of the pain spots in the code. So we have direct feedback if these are corrected.