SQL set-based range - sql

How can I have SQL repeat some set-based operation an arbitrary number of times without looping? How can I have SQL perform an operation against a range of numbers? I'm basically looking for a way to do a set-based for loop.
I know I can just create a small table with integers in it, say from 1 to 1000 and then use it for range operations that are within that range.
For example, if I had that table I could make a select to find the sum of numbers 100-200 like this:
select sum(n) from numbers where n between 100 and 200
Any ideas? I'm kinda looking for something that works for T-SQL but any platform would be okay.
[Edit] I have my own solution for this using SQL CLR which works great for MS SQL 2005 or 2008. See below.

I think the very short answer to your question is to use WITH clauses to generate your own.
Unfortunately, the big names in databases don't have built-in queryable number-range pseudo-tables. Or, more generally, easy pure-SQL data generation features. Personally, I think this is a huge failing, because if they did it would be possible to move a lot of code that is currently locked up in procedural scripts (T-SQL, PL/SQL, etc.) into pure-SQL, which has a number of benefits to performance and code complexity.
So anyway, it sounds like what you need in a general sense is the ability to generate data on the fly.
Oracle and T-SQL both support a WITH clause that can be used to do this. They work a little differently in the different DBMS's, and MS calls them "common table expressions", but they are very similar in form. Using these with recursion, you can generate a sequence of numbers or text values fairly easily. Here is what it might look like...
In Oracle SQL:
WITH
digits AS -- Limit recursion by just using it for digits.
(SELECT
LEVEL - 1 AS num
FROM
DUAL
WHERE
LEVEL < 10
CONNECT BY
num = (PRIOR num) + 1),
numrange AS
(SELECT
ones.num
+ (tens.num * 10)
+ (hundreds.num * 100)
AS num
FROM
digits ones
CROSS JOIN
digits tens
CROSS JOIN
digits hundreds
WHERE
hundreds.num in (1, 2)) -- Use the WHERE clause to restrict each digit as needed.
SELECT
-- Some columns and operations
FROM
numrange
-- Join to other data if needed
This is admittedly quite verbose. Oracle's recursion functionality is limited. The syntax is clunky, it's not performant, and it is limited to 500 (I think) nested levels. This is why I chose to use recursion only for the first 10 digits, and then cross (cartesian) joins to combine them into actual numbers.
I haven't used SQL Server's Common Table Expressions myself, but since they allow self-reference, recursion is MUCH simpler than it is in Oracle. Whether performance is comparable, and what the nesting limits are, I don't know.
At any rate, recursion and the WITH clause are very useful tools in creating queries that require on-the-fly generated data sets. Then by querying this data set, doing operations on the values, you can get all sorts of different types of generated data. Aggregations, duplications, combinations, permutations, and so on. You can even use such generated data to aid in rolling up or drilling down into other data.
UPDATE: I just want to add that, once you start working with data in this way, it opens your mind to new ways of thinking about SQL. It's not just a scripting language. It's a fairly robust data-driven declarative language. Sometimes it's a pain to use because for years it has suffered a dearth of enhancements to aid in reducing the redundancy needed for complex operations. But nonetheless it is very powerful, and a fairly intuitive way to work with data sets as both the target and the driver of your algorithms.

I created a SQL CLR table valued function that works great for this purpose.
SELECT n FROM dbo.Range(1, 11, 2) -- returns odd integers 1 to 11
SELECT n FROM dbo.RangeF(3.1, 3.5, 0.1) -- returns 3.1, 3.2, 3.3 and 3.4, but not 3.5 because of float inprecision. !fault(this)
Here's the code:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Collections;
[assembly: CLSCompliant(true)]
namespace Range {
public static partial class UserDefinedFunctions {
[Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true, FillRowMethodName = "FillRow", TableDefinition = "n bigint")]
public static IEnumerable Range(SqlInt64 start, SqlInt64 end, SqlInt64 incr) {
return new Ranger(start.Value, end.Value, incr.Value);
}
[Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, IsDeterministic = true, SystemDataAccess = SystemDataAccessKind.None, IsPrecise = true, FillRowMethodName = "FillRowF", TableDefinition = "n float")]
public static IEnumerable RangeF(SqlDouble start, SqlDouble end, SqlDouble incr) {
return new RangerF(start.Value, end.Value, incr.Value);
}
public static void FillRow(object row, out SqlInt64 n) {
n = new SqlInt64((long)row);
}
public static void FillRowF(object row, out SqlDouble n) {
n = new SqlDouble((double)row);
}
}
internal class Ranger : IEnumerable {
Int64 _start, _end, _incr;
public Ranger(Int64 start, Int64 end, Int64 incr) {
_start = start; _end = end; _incr = incr;
}
public IEnumerator GetEnumerator() {
return new RangerEnum(_start, _end, _incr);
}
}
internal class RangerF : IEnumerable {
double _start, _end, _incr;
public RangerF(double start, double end, double incr) {
_start = start; _end = end; _incr = incr;
}
public IEnumerator GetEnumerator() {
return new RangerFEnum(_start, _end, _incr);
}
}
internal class RangerEnum : IEnumerator {
Int64 _cur, _start, _end, _incr;
bool hasFetched = false;
public RangerEnum(Int64 start, Int64 end, Int64 incr) {
_start = _cur = start; _end = end; _incr = incr;
if ((_start < _end ^ _incr > 0) || _incr == 0)
throw new ArgumentException("Will never reach end!");
}
public long Current {
get { hasFetched = true; return _cur; }
}
object IEnumerator.Current {
get { hasFetched = true; return _cur; }
}
public bool MoveNext() {
if (hasFetched) _cur += _incr;
return (_cur > _end ^ _incr > 0);
}
public void Reset() {
_cur = _start; hasFetched = false;
}
}
internal class RangerFEnum : IEnumerator {
double _cur, _start, _end, _incr;
bool hasFetched = false;
public RangerFEnum(double start, double end, double incr) {
_start = _cur = start; _end = end; _incr = incr;
if ((_start < _end ^ _incr > 0) || _incr == 0)
throw new ArgumentException("Will never reach end!");
}
public double Current {
get { hasFetched = true; return _cur; }
}
object IEnumerator.Current {
get { hasFetched = true; return _cur; }
}
public bool MoveNext() {
if (hasFetched) _cur += _incr;
return (_cur > _end ^ _incr > 0);
}
public void Reset() {
_cur = _start; hasFetched = false;
}
}
}
and I deployed it like this:
create assembly Range from 'Range.dll' with permission_set=safe -- mod path to point to actual dll location on disk.
go
create function dbo.Range(#start bigint, #end bigint, #incr bigint)
returns table(n bigint)
as external name [Range].[Range.UserDefinedFunctions].[Range]
go
create function dbo.RangeF(#start float, #end float, #incr float)
returns table(n float)
as external name [Range].[Range.UserDefinedFunctions].[RangeF]
go

This is basically one of those things that reveal SQL to be less than ideal. I'm thinking maybe the right way to do this is to build a function that creates the range. (Or a generator.)
I believe the correct answer to your question is basically, "you can't".
(Sorry.)

You can use a common table expression to do this in SQL2005+.
WITH CTE AS
(
SELECT 100 AS n
UNION ALL
SELECT n + 1 AS n FROM CTE WHERE n + 1 <= 200
)
SELECT n FROM CTE

If using SQL Server 2000 or greater, you could use the table datatype to avoid creating a normal or temporary table. Then use the normal table operations on it.
With this solution you have essentially a table structure in memory that you can use almost like a real table, but much more performant.
I found a good discussion here: Temporary tables vs the table data type

Here's a hack you should never use:
select sum(numberGenerator.rank)
from
(
select
rank = ( select count(*)
from reallyLargeTable t1
where t1.uniqueValue > t2.uniqueValue ),
t2.uniqueValue id1,
t2.uniqueValue id2
from reallyLargeTable t2
) numberGenerator
where rank between 1 and 10
You can simplify this using the Rank() or Row_Number functions in SQL 2005

Related

Creating single use intermediate variables

I've read somewhere that a variable should be entered into the code if it is reused. But when I write my code for logic transparency, I sometimes create intermediate variables (with names reflecting what they contain) which are used only once.
How incorrect is this concept?
PS:
I want to do it right.
It is important to note that most of the time clarity takes precedence over re-usability or brevity. This is one of the basic principles of clean code. Most modern compilers optimize code anyway so creating new variables need not be a concern at all.
It is perfectly fine to create a new variable if it would add clarity to your code. Make sure to give it a meaningful name. Consider the following function:
public static boolean isLeapYear(final int yyyy) {
if ((yyyy % 4) != 0) {
return false;
}
else if ((yyyy % 400) == 0) {
return true;
}
else if ((yyyy % 100) == 0) {
return false;
}
else {
return true;
}
}
Even though the boolean expressions are used only once, they may confuse the reader of the code. We can rewrite it as follows
public static boolean isLeapYear(int year) {
boolean fourth = year % 4 == 0;
boolean hundredth = year % 100 == 0;
boolean fourHundredth = year % 400 == 0;
return fourth && (!hundredth || fourHundredth);
}
These boolean variables add much more clarity to the code.
This example is from the Clean Code book by Robert C. Martin.

Get rid of switch statement

recently working on a task and I encountered a problem which I am stuck on, the matter is with switch statement as follows:
private static final Double FIRST_HOUR_COST = 1.0;
private static final Double SECOND_HOUR_COST = 2.0;
private static final Double CONVERSION_RATE = 1.5
#Override
public double calculateReservationCost(Reservation reservation)
{
int hours = DateUtils.hoursDifference(reservation.getStartTime(), reservation.getStopTime()) + 1;
switch(hours)
{
case 1:
return FIRST_HOUR_COST;
case 2:
return FIRST_HOUR_COST + SECOND_HOUR_COST;
default:
return FIRST_HOUR_COST +
SECOND_HOUR_COST +
countEachNextHour(SECOND_HOUR_COST, 2, hours);
}
}
I tried probably every solution I could find on stack but wasn't able to adjust it to my needs (map with functions as values, single equation etc). Hoping there is any other way to replace it in an efficient way without breaking any OOP rules.
I would replace the Switch Statement with the Strategy pattern. This will allow you to add new strategies in the future without having a long switch statement.

Use multiple return statement

Is there a considerable difference of optimization between these two codes (in Java and/or C++, currently, even if I guess it's the same in every languages) ? Or is it just a question of code readability ?
int foo(...) {
if (cond) {
if (otherCondA)
return 1;
if (otherCondB)
return 2;
return 3;
}
int temp = /* context and/or param-dependent */;
if (otherCondA)
return 4 * temp;
if (otherCondB)
return 4 / temp;
return 4 % temp;
}
and
int foo(...) {
int value = 0;
if (cond) {
if (otherCondA)
value = 1;
else if (otherCondB)
value = 2;
else value = 3;
}
else {
int temp = /* context and/or param-dependent */;
if (otherCondA)
value = 4 * temp;
else if (otherCondB)
value = 4 / temp;
else
value = 4 % temp;
}
return value;
}
The first one is shorter, avoid multiple imbrications of else statement and economize one variable (or at least seems to do so), but I'm not sure that it really changes something...
After looking deeper into the different assembly codes generated by GCC, here's the results :
The multiple return statement is more efficient during "normal" compilation, but with the -O_ flag, the balance change :
The more you optimise the code, the less the first approach worths. It makes the code harder to optimise, so, use it carefully. As said in comments, it's very powerful when used at the front of the function when testing preconditions, but in the middle of the function, it's a nightmare for the compiler.
Of course the multiple return is acceptable.
Because you can halt the program as soon as the function is finished

What is the fastest way to compare two byte arrays?

I am trying to compare two long bytearrays in VB.NET and have run into a snag. Comparing two 50 megabyte files takes almost two minutes, so I'm clearly doing something wrong. I'm on an x64 machine with tons of memory so there are no issues there. Here is the code that I'm using at the moment and would like to change.
_Bytes and item.Bytes are the two different arrays to compare and are already the same length.
For Each B In item.Bytes
If B <> _Bytes(I) Then
Mismatch = True
Exit For
End If
I += 1
Next
I need to be able to compare as fast as possible files that are potentially hundreds of megabytes and even possibly a gigabyte or two. Any suggests or algorithms that would be able to do this faster?
Item.bytes is an object taken from the database/filesystem that is returned to compare, because its byte length matches the item that the user wants to add. By comparing the two arrays I can then determine if the user has added something new to the DB and if not then I can just map them to the other file and not waste hard disk drive space.
[Update]
I converted the arrays to local variables of Byte() and then did the same comparison, same code and it ran in like one second (I have to benchmark it still and compare it to others), but if you do the same thing with local variables and use a generic array it becomes massively slower. I’m not sure why, but it raises a lot more questions for me about the use of arrays.
What is the _Bytes(I) call doing? It's not loading the file each time, is it? Even with buffering, that would be bad news!
There will be plenty of ways to micro-optimise this in terms of looking at longs at a time, potentially using unsafe code etc - but I'd just concentrate on getting reasonable performance first. Clearly there's something very odd going on.
I suggest you extract the comparison code into a separate function which takes two byte arrays. That way you know you won't be doing anything odd. I'd also use a simple For loop rather than For Each in this case - it'll be simpler. Oh, and check whether the lengths are correct first :)
EDIT: Here's the code (untested, but simple enough) that I'd use. It's in C# for the minute - I'll convert it in a sec:
public static bool Equals(byte[] first, byte[] second)
{
if (first == second)
{
return true;
}
if (first == null || second == null)
{
return false;
}
if (first.Length != second.Length)
{
return false;
}
for (int i=0; i < first.Length; i++)
{
if (first[i] != second[i])
{
return false;
}
}
return true;
}
EDIT: And here's the VB:
Public Shared Function ArraysEqual(ByVal first As Byte(), _
ByVal second As Byte()) As Boolean
If (first Is second) Then
Return True
End If
If (first Is Nothing OrElse second Is Nothing) Then
Return False
End If
If (first.Length <> second.Length) Then
Return False
End If
For i as Integer = 0 To first.Length - 1
If (first(i) <> second(i)) Then
Return False
End If
Next i
Return True
End Function
The fastest way to compare two byte arrays of equal size is to use interop. Run the following code on a console application:
using System;
using System.Runtime.InteropServices;
using System.Security;
namespace CompareByteArray
{
class Program
{
static void Main(string[] args)
{
const int SIZE = 100000;
const int TEST_COUNT = 100;
byte[] arrayA = new byte[SIZE];
byte[] arrayB = new byte[SIZE];
for (int i = 0; i < SIZE; i++)
{
arrayA[i] = 0x22;
arrayB[i] = 0x22;
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Safe(arrayA, arrayB, (UIntPtr)SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Safe: {0}", after - before);
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Unsafe(arrayA, arrayB, (UIntPtr)SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Unsafe: {0}", after - before);
}
{
DateTime before = DateTime.Now;
for (int i = 0; i < TEST_COUNT; i++)
{
int result = MemCmp_Pure(arrayA, arrayB, SIZE);
if (result != 0) throw new Exception();
}
DateTime after = DateTime.Now;
Console.WriteLine("MemCmp_Pure: {0}", after - before);
}
return;
}
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint="memcmp", ExactSpelling=true)]
[SuppressUnmanagedCodeSecurity]
static extern int memcmp_1(byte[] b1, byte[] b2, UIntPtr count);
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint = "memcmp", ExactSpelling = true)]
[SuppressUnmanagedCodeSecurity]
static extern unsafe int memcmp_2(byte* b1, byte* b2, UIntPtr count);
public static int MemCmp_Safe(byte[] a, byte[] b, UIntPtr count)
{
return memcmp_1(a, b, count);
}
public unsafe static int MemCmp_Unsafe(byte[] a, byte[] b, UIntPtr count)
{
fixed(byte* p_a = a)
{
fixed (byte* p_b = b)
{
return memcmp_2(p_a, p_b, count);
}
}
}
public static int MemCmp_Pure(byte[] a, byte[] b, int count)
{
int result = 0;
for (int i = 0; i < count && result == 0; i += 1)
{
result = a[0] - b[0];
}
return result;
}
}
}
If you don't need to know the byte, use 64-bit ints that gives you 8 at once. Actually, you can figure out the wrong byte, once you've isolated it to a set of 8.
Use BinaryReader:
saveTime = binReader.ReadInt32()
Or for arrays of ints:
Dim count As Integer = binReader.Read(testArray, 0, 3)
Better approach... If you are just trying to see if the two are different then save some time by not having to go through the entire byte array and generate a hash of each byte array as strings and compare the strings. MD5 should work fine and is pretty efficient.
I see two things that might help:
First, rather than always accessing the second array as item.Bytes, use a local variable to point directly at the array. That is, before starting the loop, do something like this:
array2 = item.Bytes
That will save the overhead of dereferencing from the object each time you want a byte. That could be expensive in Visual Basic, especially if there's a Getter method on that property.
Also, use a "definite loop" instead of "for each". You already know the length of the arrays, so just code the loop using that value. This will avoid the overhead of treating the array as a collection. The loop would look something like this:
For i = 1 to max Step 1
If (array1(i) <> array2(i))
Exit For
EndIf
Next
Not strictly related to the comparison algorithm:
Are you sure your bottleneck is not related to the memory available and the time used to load the byte arrays? Loading two 2 GB byte arrays just to compare them could bring most machines to their knees. If the program design allows, try using streams to read smaller chunks instead.

How do I write a generic memoize function?

I'm writing a function to find triangle numbers and the natural way to write it is recursively:
function triangle (x)
if x == 0 then return 0 end
return x+triangle(x-1)
end
But attempting to calculate the first 100,000 triangle numbers fails with a stack overflow after a while. This is an ideal function to memoize, but I want a solution that will memoize any function I pass to it.
Mathematica has a particularly slick way to do memoization, relying on the fact that hashes and function calls use the same syntax:
triangle[0] = 0;
triangle[x_] := triangle[x] = x + triangle[x-1]
That's it. It works because the rules for pattern-matching function calls are such that it always uses a more specific definition before a more general definition.
Of course, as has been pointed out, this example has a closed-form solution: triangle[x_] := x*(x+1)/2. Fibonacci numbers are the classic example of how adding memoization gives a drastic speedup:
fib[0] = 1;
fib[1] = 1;
fib[n_] := fib[n] = fib[n-1] + fib[n-2]
Although that too has a closed-form equivalent, albeit messier: http://mathworld.wolfram.com/FibonacciNumber.html
I disagree with the person who suggested this was inappropriate for memoization because you could "just use a loop". The point of memoization is that any repeat function calls are O(1) time. That's a lot better than O(n). In fact, you could even concoct a scenario where the memoized implementation has better performance than the closed-form implementation!
You're also asking the wrong question for your original problem ;)
This is a better way for that case:
triangle(n) = n * (n - 1) / 2
Furthermore, supposing the formula didn't have such a neat solution, memoisation would still be a poor approach here. You'd be better off just writing a simple loop in this case. See this answer for a fuller discussion.
I bet something like this should work with variable argument lists in Lua:
local function varg_tostring(...)
local s = select(1, ...)
for n = 2, select('#', ...) do
s = s..","..select(n,...)
end
return s
end
local function memoize(f)
local cache = {}
return function (...)
local al = varg_tostring(...)
if cache[al] then
return cache[al]
else
local y = f(...)
cache[al] = y
return y
end
end
end
You could probably also do something clever with a metatables with __tostring so that the argument list could just be converted with a tostring(). Oh the possibilities.
In C# 3.0 - for recursive functions, you can do something like:
public static class Helpers
{
public static Func<A, R> Memoize<A, R>(this Func<A, Func<A,R>, R> f)
{
var map = new Dictionary<A, R>();
Func<A, R> self = null;
self = (a) =>
{
R value;
if (map.TryGetValue(a, out value))
return value;
value = f(a, self);
map.Add(a, value);
return value;
};
return self;
}
}
Then you can create a memoized Fibonacci function like this:
var memoized_fib = Helpers.Memoize<int, int>((n,fib) => n > 1 ? fib(n - 1) + fib(n - 2) : n);
Console.WriteLine(memoized_fib(40));
In Scala (untested):
def memoize[A, B](f: (A)=>B) = {
var cache = Map[A, B]()
{ x: A =>
if (cache contains x) cache(x) else {
val back = f(x)
cache += (x -> back)
back
}
}
}
Note that this only works for functions of arity 1, but with currying you could make it work. The more subtle problem is that memoize(f) != memoize(f) for any function f. One very sneaky way to fix this would be something like the following:
val correctMem = memoize(memoize _)
I don't think that this will compile, but it does illustrate the idea.
Update: Commenters have pointed out that memoization is a good way to optimize recursion. Admittedly, I hadn't considered this before, since I generally work in a language (C#) where generalized memoization isn't so trivial to build. Take the post below with that grain of salt in mind.
I think Luke likely has the most appropriate solution to this problem, but memoization is not generally the solution to any issue of stack overflow.
Stack overflow usually is caused by recursion going deeper than the platform can handle. Languages sometimes support "tail recursion", which re-uses the context of the current call, rather than creating a new context for the recursive call. But a lot of mainstream languages/platforms don't support this. C# has no inherent support for tail-recursion, for example. The 64-bit version of the .NET JITter can apply it as an optimization at the IL level, which is all but useless if you need to support 32-bit platforms.
If your language doesn't support tail recursion, your best option for avoiding stack overflows is either to convert to an explicit loop (much less elegant, but sometimes necessary), or find a non-iterative algorithm such as Luke provided for this problem.
function memoize (f)
local cache = {}
return function (x)
if cache[x] then
return cache[x]
else
local y = f(x)
cache[x] = y
return y
end
end
end
triangle = memoize(triangle);
Note that to avoid a stack overflow, triangle would still need to be seeded.
Here's something that works without converting the arguments to strings.
The only caveat is that it can't handle a nil argument. But the accepted solution can't distinguish the value nil from the string "nil", so that's probably OK.
local function m(f)
local t = { }
local function mf(x, ...) -- memoized f
assert(x ~= nil, 'nil passed to memoized function')
if select('#', ...) > 0 then
t[x] = t[x] or m(function(...) return f(x, ...) end)
return t[x](...)
else
t[x] = t[x] or f(x)
assert(t[x] ~= nil, 'memoized function returns nil')
return t[x]
end
end
return mf
end
I've been inspired by this question to implement (yet another) flexible memoize function in Lua.
https://github.com/kikito/memoize.lua
Main advantages:
Accepts a variable number of arguments
Doesn't use tostring; instead, it organizes the cache in a tree structure, using the parameters to traverse it.
Works just fine with functions that return multiple values.
Pasting the code here as reference:
local globalCache = {}
local function getFromCache(cache, args)
local node = cache
for i=1, #args do
if not node.children then return {} end
node = node.children[args[i]]
if not node then return {} end
end
return node.results
end
local function insertInCache(cache, args, results)
local arg
local node = cache
for i=1, #args do
arg = args[i]
node.children = node.children or {}
node.children[arg] = node.children[arg] or {}
node = node.children[arg]
end
node.results = results
end
-- public function
local function memoize(f)
globalCache[f] = { results = {} }
return function (...)
local results = getFromCache( globalCache[f], {...} )
if #results == 0 then
results = { f(...) }
insertInCache(globalCache[f], {...}, results)
end
return unpack(results)
end
end
return memoize
Here is a generic C# 3.0 implementation, if it could help :
public static class Memoization
{
public static Func<T, TResult> Memoize<T, TResult>(this Func<T, TResult> function)
{
var cache = new Dictionary<T, TResult>();
var nullCache = default(TResult);
var isNullCacheSet = false;
return parameter =>
{
TResult value;
if (parameter == null && isNullCacheSet)
{
return nullCache;
}
if (parameter == null)
{
nullCache = function(parameter);
isNullCacheSet = true;
return nullCache;
}
if (cache.TryGetValue(parameter, out value))
{
return value;
}
value = function(parameter);
cache.Add(parameter, value);
return value;
};
}
}
(Quoted from a french blog article)
In the vein of posting memoization in different languages, i'd like to respond to #onebyone.livejournal.com with a non-language-changing C++ example.
First, a memoizer for single arg functions:
template <class Result, class Arg, class ResultStore = std::map<Arg, Result> >
class memoizer1{
public:
template <class F>
const Result& operator()(F f, const Arg& a){
typename ResultStore::const_iterator it = memo_.find(a);
if(it == memo_.end()) {
it = memo_.insert(make_pair(a, f(a))).first;
}
return it->second;
}
private:
ResultStore memo_;
};
Just create an instance of the memoizer, feed it your function and argument. Just make sure not to share the same memo between two different functions (but you can share it between different implementations of the same function).
Next, a driver functon, and an implementation. only the driver function need be public
int fib(int); // driver
int fib_(int); // implementation
Implemented:
int fib_(int n){
++total_ops;
if(n == 0 || n == 1)
return 1;
else
return fib(n-1) + fib(n-2);
}
And the driver, to memoize
int fib(int n) {
static memoizer1<int,int> memo;
return memo(fib_, n);
}
Permalink showing output on codepad.org. Number of calls is measured to verify correctness. (insert unit test here...)
This only memoizes one input functions. Generalizing for multiple args or varying arguments left as an exercise for the reader.
In Perl generic memoization is easy to get. The Memoize module is part of the perl core and is highly reliable, flexible, and easy-to-use.
The example from it's manpage:
# This is the documentation for Memoize 1.01
use Memoize;
memoize('slow_function');
slow_function(arguments); # Is faster than it was before
You can add, remove, and customize memoization of functions at run time! You can provide callbacks for custom memento computation.
Memoize.pm even has facilities for making the memento cache persistent, so it does not need to be re-filled on each invocation of your program!
Here's the documentation: http://perldoc.perl.org/5.8.8/Memoize.html
Extending the idea, it's also possible to memoize functions with two input parameters:
function memoize2 (f)
local cache = {}
return function (x, y)
if cache[x..','..y] then
return cache[x..','..y]
else
local z = f(x,y)
cache[x..','..y] = z
return z
end
end
end
Notice that parameter order matters in the caching algorithm, so if parameter order doesn't matter in the functions to be memoized the odds of getting a cache hit would be increased by sorting the parameters before checking the cache.
But it's important to note that some functions can't be profitably memoized. I wrote memoize2 to see if the recursive Euclidean algorithm for finding the greatest common divisor could be sped up.
function gcd (a, b)
if b == 0 then return a end
return gcd(b, a%b)
end
As it turns out, gcd doesn't respond well to memoization. The calculation it does is far less expensive than the caching algorithm. Ever for large numbers, it terminates fairly quickly. After a while, the cache grows very large. This algorithm is probably as fast as it can be.
Recursion isn't necessary. The nth triangle number is n(n-1)/2, so...
public int triangle(final int n){
return n * (n - 1) / 2;
}
Please don't recurse this. Either use the x*(x+1)/2 formula or simply iterate the values and memoize as you go.
int[] memo = new int[n+1];
int sum = 0;
for(int i = 0; i <= n; ++i)
{
sum+=i;
memo[i] = sum;
}
return memo[n];