Use of cached hash value to speed up equality tests - oop

My target language is C++, but this is a question over object oriented programming in general.
Suppose I have a class for which testing equality takes a non-trivial amount of time, but I also have a hash value that I have computed over it. I can rely on the data to stay the same for the life of the object.
Is it common practice to cache the hash value, and use that to test for inequality?
To make this example more concrete, I have a class that contains a potentially long list of 2D locations, and I expect to make many many equality comparisons over it. I create the hash value upon construction by mixing the hashes of all of the locations.
When testing for equality, I check the hash values first. If the hashes are equal, I do the exhaustive, point-by-point equality test. Otherwise I call them unequal.

Sounds good to me, as long as your hash algorithm is guaranteed to produce the same hash for the same input data. Then your check should speed things up considerably, at the small cost of the extra memory for storing the hash value (and the one-time cost of computing the hash).

Just to restate the logic: only when the hashes are equal do you need the exhaustive, point-by-point test; if the hashes are unequal, the objects are definitely unequal and you can stop right there.
In any case -- yes, you've got a good idea, this is a perfectly reasonable design. You're trading a bit of cheap storage for a chunk of expensive CPU and/or wall time. A design like this can give enormous performance benefits. For example, consider the case where each individual point test requires a database query.
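For illustration, here is a minimal C++ sketch of that design, assuming the data really is immutable for the object's lifetime; the class name, the point representation, and the hash-mixing scheme are placeholders, not a recommendation of a particular hash.

#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Sketch: a list of 2D locations with a hash cached at construction.
class PointList {
public:
    using Point = std::pair<double, double>;

    explicit PointList(std::vector<Point> points)
        : points_(std::move(points)), hash_(computeHash(points_)) {}

    friend bool operator==(const PointList& a, const PointList& b) {
        // Cheap negative filter: different hashes mean definitely unequal.
        if (a.hash_ != b.hash_) return false;
        // Equal hashes may still be a collision, so do the exhaustive test.
        return a.points_ == b.points_;
    }

    std::size_t hash() const { return hash_; }

private:
    static std::size_t computeHash(const std::vector<Point>& pts) {
        std::size_t seed = 0;
        for (const auto& p : pts) {
            // Simple deterministic mixing of the per-coordinate hashes.
            seed ^= std::hash<double>{}(p.first) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
            seed ^= std::hash<double>{}(p.second) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }

    std::vector<Point> points_;  // assumed to stay the same for the object's life
    std::size_t hash_;           // computed once, reused for every comparison
};

The same cached value can also feed a custom hasher if you later want to keep these objects in an unordered container.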

Related

Compound "OR" evaluation in DB2

I've searched the forums and found a few related threads but no definitive answer.
(case
when field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%'...
In the above statement, if field1 is "like" t001, will the others even be evaluated?
(case
when (field1 LIKE '%T001%' OR field1 LIKE '%T201%' OR field1 LIKE '%T301%')...
Does adding parenthesis as shown above change the evaluation?
In general, databases short-circuit boolean operations. That is, they stop at the first value that defines the result -- the first "true" for OR, the first "false" for AND.
That said, there are no guarantees, nor are there guarantees about the order of evaluation; DB2 could decide to test the middle, then the last, and then the first. These predicates are pretty much equivalent, though, so I would expect the ordering to be either first-to-last or last-to-first.
Remember: SQL is a descriptive language, not a procedural language. A SQL query describes the result set, but not the steps used to generate it.
You don't know.
SQL is a declarative language, not an imperative one. You describe what you want, the engine provides it. The database engine will decide in which sequence it will evaluate those predicates, and you don't have control over that.
If you get the execution plan today it may show one sequence of steps, but if you get it tomorrow it may show something different.
Not strictly answering your question, but if you have many of these matches, a simpler, possibly faster, and easier to maintain solution would be to use REGEXP_LIKE. The example you've posted could be written like this:
CASE WHEN REGEXP_LIKE(field1, '.*T(0|2|3)01.*') ...
Just as an indication of how it really works in this simple case:
select
field1
, case
when field1 LIKE '%T001%' OR RAISE_ERROR('75000', '%T201%') LIKE '%T201%' then 1 else 0
end as expr
from
(
values
'abcT001xyz'
--, '***'
) t (field1);
The query returns the expected result for the statement above as-is.
But if you uncomment the commented-out line, you get SQLCODE=-438.
This means that the second OR operand is not evaluated if the first returns true.
Note that this is only a demonstration; there is no guarantee that it will always work this way.
Just to add to some points made about the difference between so-called procedural languages on the one hand, and SQL (which is variously described as a declarative or descriptive language) on the other.
SQL defines a set of relatively high-level operators for working with arrays of data. They are "high-level" in the sense that they work with arrays in a concise fashion that has not been typical of general-purpose or procedural languages. Like any operator (whether array-based or not), each one typically has more than one algorithm capable of implementing it.
In contrast to "general purpose" programming languages to which the vast majority of programmers are exposed, the existence of these array operators - in particular, the ability to combine them algebraically into an expression which defines a composite operation (or a query), and the absence of any explicit algorithms for iteration - was once a defining feature of SQL.
The distinction is less sharp nowadays, with a resurgent interest in functional languages and features, but most still view SQL as a beast of its own kind amongst commercially-popular tooling.
It is often said that in SQL you define what results you want, not how to get them. But this is true for every language. It is true even for machine-code operators, if you account for how the implementation in circuitry can vary - and does vary - between CPU designs. It is certainly true for all compiled languages, where compilers employ many different machine-code algorithms to actually implement operations specified in the source code - loop unrolling, for one example.
The feature which continues to distinguish SQL (and the relational databases which implement it), is that the algorithm which implements an operation is determined at the time of each execution, and it does this not just by algebraic manipulation of queries (which is not dissimilar to what compilers do), but also by continuously generating and analysing statistics about the data being operated upon and the consequences of previous executions.
Put another way, the database execution engine is engaged in a constant search for the best practical algorithms (and combinations thereof) to implement its overall workload. It is capable not just of accommodating past experience, but of reacting to changes (such as in data volumes, in degree of concurrency and transactional conflict, or in systemic constraints like available memory or overall workload).
The upshot of all this is that there is a specific order of evaluation in SQL, just like any other language. It is this order which defines a correct result. But unless written in so-called RBAR style (and even then, but to a more limited extent...), the database engine has tremendous latitude to implement shortcuts and performance optimisations, provided these do not change the final result.
Many operators fall into a class where it is possible to determine the result in many cases without evaluating all operands. I'm not sure what the formal word is to describe this property - partial evaluativity, maybe - but casually it is called short-circuiting. The OR operator has this property.
Another property of the OR operation is that it is associative and commutative. That is, a series of them can be grouped and reordered without affecting the result - it behaves like the addition operator, where you can add numbers in any order.
With a series of OR conditions, these are capable of being reordered and partially evaluated, provided the evaluation of any particular operand does not cause side-effects or depend on hidden variables. It is therefore quite likely that the database engine may reorder or partially evaluate them.
If the operands do cause side-effects or depend on hidden variables (functions which get the current date or time being a prime example of the latter), these often cause problems in queries - either because the database engine does not realise they have side-effects or hidden variables, or because the database does realise it but doesn't handle the case in the way the programmer expects. In such cases, a query may have to be completely rewritten (typically, cracked into multiple statements) to force a specific evaluation order or guarantee full evaluation.
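For contrast with a procedural language, here is a small C++ sketch (names purely illustrative): for the built-in && and || operators the language guarantees left-to-right evaluation and short-circuiting, which is exactly the guarantee SQL does not give you for OR-ed predicates.

#include <iostream>

// Reports whether an operand was evaluated at all.
bool probe(const char* label, bool value) {
    std::cout << "evaluated " << label << '\n';
    return value;
}

int main() {
    // The built-in || is guaranteed to evaluate the left operand first and
    // to skip the right operand entirely when the left one is already true.
    if (probe("lhs", true) || probe("rhs", true)) {
        std::cout << "matched\n";
    }
    // Prints "evaluated lhs" then "matched"; "evaluated rhs" never appears.
    return 0;
}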

Relationship between code optimization and data compression

I searched the Internet about this question and found that some researchers have used data compression algorithms, like Huffman coding, for compiler optimization.
My question is more general:
Can we consider code optimization as lossy type of compression?
At a concrete level, it's apples and oranges. But at an abstract level it's an interesting question.
Data compression deals with redundancy, which is the difference between data and information.
It seeks to reduce needless redundancy by revising the coding of the information.
Often this coding works by taking a common substring and making a code that refers to it, rather than repeating the substring.
Compiler optimization (for speed) seeks to reduce needless cycles.
One way: if the result of some computation is needed twice or more, make sure it is saved in some address or register (memoization) so it can be re-used with fewer cycles.
Another form of encoding numbers is so-called "unary notation" where there is only one digit, and numbers are represented by repeating it. For example, the numbers "three" and "four" are "111" and "1111", which takes N digits.
This code is optimized by switching to binary, as in "011" and "100", which takes log(N) digits (base 2, of course).
A programming analogy to this is the difference between linear and binary search.
Linear search takes O(N) comparisons.
Each comparison can yield either a lot of information or very little - on average much less than a bit.
Binary search takes O(log(N)) comparisons, with each comparison yielding one bit.
With some thinking, it should be possible to find other parallels.
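To make the linear-versus-binary analogy concrete, here is a small C++ sketch (purely illustrative) that counts the comparisons each strategy needs to find the same value in a sorted array of N = 1024 elements.

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t n = 1024;                 // N = 1024, so log2(N) = 10
    std::vector<int> data(n);
    for (std::size_t i = 0; i < n; ++i) data[i] = static_cast<int>(i);
    const int target = 700;

    // Linear search: one low-information comparison per element visited.
    std::size_t linearComparisons = 0;
    for (std::size_t i = 0; i < n; ++i) {
        ++linearComparisons;
        if (data[i] == target) break;
    }

    // Binary search: each probe halves the remaining range (~1 bit each).
    std::size_t binaryComparisons = 0;
    std::size_t lo = 0, hi = n;                 // half-open range [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        ++binaryComparisons;
        if (data[mid] == target) break;
        if (data[mid] < target) lo = mid + 1; else hi = mid;
    }

    // Prints "linear: 701, binary: 8" for this target; the binary count is
    // always bounded by roughly log2(N), the linear count by N.
    std::cout << "linear: " << linearComparisons
              << ", binary: " << binaryComparisons << '\n';
    return 0;
}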

Postgresql -- All else equal, is querying for (small) integer or float values faster than querying for (small) string values?

I'm about to mark maybe 100,000 records retroactively/post hoc with category-indicating string or integer values, and there are more to come. The categories this column marks reflect a scalar continuum of category types, going essentially from "looser" to "tighter". I was thinking about using string values, though, instead of integers, in case I come back to it one day and don't know what means what.
So that's the reasoning for using strings: readability.
But I'll be relying on these columns pretty significantly, selecting swaths of records based on this criterion.
Obviously I'm going to put an index on the column either way, but even with an index I'm not sure how much faster querying on integers is than querying on strings. I've noticed how speedy querying on booleans is, and from that I can reasonably assume that small integers can be queried more quickly than strings.
I've been pondering this trade-off for some time now, so I thought I'd fire off a question. Thanks.
If it's really a string representing some ordered level between "looser" and "tighter", consider using an enum:
http://www.postgresql.org/docs/current/static/datatype-enum.html
That way, you'll get the best of both worlds.
One tiny note, though: ideally, make sure you nail down all possible values in advance. Changing an enum is of course possible, but doing so adds an extra internal lookup against a 32-bit float sort-order field when the order of the enum's numeric representation (its oid, which is a 32-bit integer) no longer matches its declared order. (The performance difference is minor, but one to keep in mind should your data ever grow to billions of rows. And again: it only applies when you alter the order of an existing enum.)
Regarding the second part of your question: sorting small integers (16-bit) is, in my own admittedly limited testing from a few years back, a bit slower than normal integers (32-bit); I imagine it's because they're manipulated as 32-bit integers anyway. And sorting or querying integers, as in the case of enums, is faster than sorting arbitrary strings. Ergo, use enums if you don't need the flexibility of adding arbitrary values down the road: they'll give you the best of both worlds.

Sql Order By - Varchar vs Integer

Speaking with a friend of mine about DB structure, he said that for telephone numbers he usually creates integer attributes and casts them in the code that extracts the data from the DB (adding the leading zero back). Leaving aside that this method is questionable in itself, I suggested he use a varchar field instead.
He says that using a varchar is less efficient because:
It takes more memory to store the information (and this is true)
It takes more "time" to order by the field
I'm pretty confused, as I would guess that an RDBMS, with all its optimizations, will do this sort in O(n log(n)) or thereabouts regardless of the data type. Mining the internet for information, unfortunately, turned out to be useless.
Could someone help me understand whether what I'm saying here makes sense?
Thank you.
An RDBMS uses indexes to optimize ordering (sorting). Indexes are stored as B-trees, and they grow larger as the indexed field(s) grow larger, so disk I/O increases because there is more data to read. On the other hand, the constant behind the O(n log(n)) differs for different types of data: the semantics of comparing strings and numerics (integers) are different, and comparing strings is more complicated than comparing integers.

Performance difference between MOVE and = assignment in ABAP

Is there any kind of performance difference between MOVE ... TO and x = y? I have a really old program I am optimizing and would like to know whether it's worth pulling out all the MOVE TO statements. Any other general tips on ABAP optimization would be great as well.
No, that is just the same operation expressed in two different ways. Nothing to gain there. If you're out for generic hints, there's a good book available that I'd recommend studying in detail. If you have to optimize a specific program, use the tracing tools (transaction SAT in sufficiently current releases).
The two statements are equivalent:
"
To assign the value of a data object source to a variable destination, use the following statement:
MOVE source TO destination.
or the equivalent statement
destination = source.
"
No, they're the same.
Here are a couple of quick hints from my years of performance enhancement:
1) If you use MOVE-CORRESPONDING where possible, your code can be a lot more concise, modular, and extendable (in the distant past this was frowned upon, but the technical reasons for that are generally no longer applicable).
2) Use SAT at every opportunity, and be sure to turn on internal table tracking. This is like turning on the lights versus stumbling over furniture in the dark.
3) Make the database layer do as much work as possible for you. Try to combine queries wherever possible, especially when combining result sets. Two queries linked by a join is usually much better than select > itab > select FOR ALL ENTRIES.
4) This is a bit advanced, but FOR ALL ENTRIES often has much slower performance than the equivalent select-options IN phrase. This seems to be because the latter is built as one big query to the database layer while the former requires multiple trips to the database layer. The caveat, of course, is that if you have too many records in your select-options the generated query at the database layer will exceed the allowable size on your system, but large performance gains are possible within that limitation. In general, SAP just loves select-options.
5) Index, index, index!
First of all, MOVE does not really affect performance much.
What did affect performance quite a lot in the projects I worked on is the following:
Nested loops (very evil). For example, looping through all documents and, for each document, doing a SELECT SINGLE to check whether its company code is allowed to be displayed.
Instead, build a list of the company codes, query them from the DB once, and look them up in that result table.
Use hashed or sorted tables where possible. Where that is not possible, use a standard table, but sort it by its keys and use BINARY SEARCH (the general idea is sketched in C++ after this list).
Select from the DB by all key fields. If that is not possible, consider creating indexes.
For small and simple selects, use joins. For bigger selects, joins will still be faster, but the code can be harder to follow.
A minor thing: use field symbols to read table lines; this makes it much faster.
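The sorted-table-plus-binary-search tip above is not ABAP-specific; as a rough C++ analogue (illustrative names only, with READ TABLE ... WITH KEY ... BINARY SEARCH being the ABAP counterpart), sort the lookup data once and then every membership test costs O(log n) instead of a full scan.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Analogue of a sorted internal table of allowed company codes.
    std::vector<std::string> allowedCompanyCodes = {"3000", "1000", "2000"};
    std::sort(allowedCompanyCodes.begin(), allowedCompanyCodes.end());   // sort once

    const std::vector<std::string> documentCompanyCodes = {"1000", "4000", "2000"};
    for (const auto& code : documentCompanyCodes) {
        // O(log n) lookup instead of scanning the whole table per document.
        const bool allowed = std::binary_search(allowedCompanyCodes.begin(),
                                                allowedCompanyCodes.end(), code);
        std::cout << code << (allowed ? " allowed\n" : " filtered\n");
    }
    return 0;
}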
1) You should be careful when using SELECT statements in ABAP. Unnecessary database round trips significantly decrease the performance of an ABAP program.
2) When passing internal tables to subroutines or function modules, pass them by reference to reduce memory usage.
Call by reference passes a pointer to the memory location: changes made to the variable within the subroutine affect the variable outside the subroutine.
3) Avoid copying internal table lines into a work area when a field symbol will do.
4) When using nested loops, sort the inner table so it can be searched efficiently.
They are the same, as is the ADD keyword and + operator.
If you do want to optimize your ABAP, I have found the largest culprits to be:
Not using binary lookups and/or (internal) table keys properly. The syntax of ABAP is brain-dead when it comes to table use; know how to work with tables efficiently. Basically, write better/optimal/elegant high-level code. This is always a winner!
Fewer instructions == less time. The fewer instructions you hit, the faster the program will run. This is important in tight loops... I know this sounds obvious, but ABAP is so slow that if you are really trying to optimize critical programs, this will make a difference. (We have processes that run for days... and shaving off an hour or so makes a difference!)
Don't mix types. There is a little bit of overhead to some implicit conversions... for instance, if you are initializing a string data type, then use the correct literal string with (backtick) quotes: `literal`. This also counts for looking up entries in tables using keys... use exact-match data types.
Function calls... I cannot stress the overhead of function calls enough... the fewer you have, the better. This goes against anything a real computer programmer believes, but there you have it... ABAP is a special case.
Loop using ASSIGNING (or REF TO - slightly slower on certain types); avoid INTO like the plague.
PS: Also keep in mind that SWITCH statements are just glorified IF conditionals... thus move the most common conditions to the top!
You can create CDS views with ADT in Eclipse, or classic views (SE11); both give good performance for selects.
"MOVE a TO" b and "a = b" are just same in ABAP. There is no performance difference "MOVE" is just a more visible, noticeable version.
But if you talk about "MOVE-CORRESPONDING", yes, there is a performance difference. It's more practical to code, but actually runs slower then direct movement.