DXT1 compression is designed to be fast to decompress in hardware, where it's used in texture samplers. The Wikipedia article says that under certain circumstances you can work out the interpolated colours as:
c2 = (2/3)*c0+(1/3)*c1
or rearranging that:
c2 = (1/3)*(2*c0+c1)
However you rearrange the above equation, you always end up having to multiply something by 1/3 (or divide by 3, which is the same deal but even more expensive). It seems weird to me that a texture format designed to be fast to decompress in hardware would require a multiplication or division. The FPGA I'm implementing my GPU on has only limited resources for multiplications, and I want to save those for where they're really required.
So am I missing something? Is there an efficient way of avoiding the multiplications of the colour channels by a 1/3? Or should I just eat the cost of that multiplication?
This might be a bad way of imagining it, but could you implement it via the use of addition/subtraction of successive halves (shifts)?
As you have 16 bits this gives you the ability to get quite accurate with successive additions and subtractions.
A third could be represented as
a(n+1) = a(n) +/- (A >> 1), where the list [0, 0, 1, 0, 1, etc.] shows whether to add or subtract the shifted result.
I believe this is called fractional maths.
However, in FPGAs, it is difficult to know whether this is actually more power efficient than the native DSP blocks (e.g. DSP48E1) provided.
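One way to read the suggestion above is as the alternating series x/2 - x/4 + x/8 - ..., which converges to x/3. Here is a minimal Python sketch of that reading, as a software model rather than HDL. The function name `third_alternating` and the `terms` parameter are made up for illustration, and each `>>` stands in for a fixed wiring shift that costs no logic on an FPGA.

```python
def third_alternating(x: int, terms: int = 6) -> int:
    """Approximate x/3 as x/2 - x/4 + x/8 - ... using truncating shifts."""
    acc = 0
    for n in range(1, terms + 1):
        term = x >> n               # x / 2^n, truncated
        if n % 2 == 1:
            acc += term             # odd terms are added
        else:
            acc -= term             # even terms are subtracted
    return acc


if __name__ == "__main__":
    for x in (0, 6, 189):
        print(x, third_alternating(x), x / 3)
```

With six terms this gives 62 for x = 189, where the exact answer is 63. Whether it beats a DSP block on power or area is something only synthesis results can tell you.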
The best answer I can come up with is to use the identity:
x/3 = sum(n=1 to infinity) (x/2^(2n))
and then take the first n terms. Using 4 terms I get:
(x/4)+(x/16)+(x/64)+(x/256)
which equals
x*0.33203125
which is probably good enough.
This relies on multiplication by a fixed power of 2 being free in hardware, then 3 additions of which I can run 2 in parallel.
Any better answer is appreciated though.
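As a quick sanity check of the four-term version, here is a small Python model of it. This is only a sketch: the name `third_shift_add` is hypothetical, and it assumes each division by a power of 2 is a truncating right shift, as it would be with plain wiring in hardware.

```python
def third_shift_add(x: int) -> int:
    """Approximate x/3 as x/4 + x/16 + x/64 + x/256 using shifts and adds."""
    # The two inner sums are independent, so hardware could evaluate
    # them in parallel before the final add (3 additions in total).
    low = (x >> 2) + (x >> 4)
    high = (x >> 6) + (x >> 8)
    return low + high


if __name__ == "__main__":
    for x in (0, 93, 189, 256):
        print(x, third_shift_add(x), x / 3)
```

Because every term is truncated and the partial sum of the series is slightly below 1/3, this version can only undershoot. For example, it gives 60 for x = 189, where the exact answer is 63.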
** EDIT **: Using a combination of this and @dyslexicgruffalo's answer, I made a simple C++ program which iterated over the various sequences, tried them all, and recorded the average and max errors.
I did this for 0 <= x <= 189 (189 is the value of 2*c0.g + c1.g when g, which is 6 bits, maxes out).
The shortest good sequence (max error of 2, average error of 0.62), which takes 4 ops, was:
1 + x/4 + x/16 + x/64.
The best sequence, which had a max error of 1 and an average error of 0.32 but takes 6 ops, was:
x/2 - x/4 + x/8 - x/16 + x/32 - x/64.
For the 5-bit values (red and blue) the maximum value is 31*3, and the above sequences are still good but not the best. The best ones for that range are:
x/4 + x/8 - x/16 + x/32 [max error of 1, average 0.38]
and
1 + x/4 + x/16 [max error of 2, average of 0.68]
(And, luckily, none of the above sequences ever guesses an answer which is too big so no clamping is needed even though they're not perfect)
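For anyone who wants to check sequences like these, here is a rough Python sketch of an error measurement. It is not the original C++ program. It assumes every x/2^k is a truncating shift and that the error is measured against x/3 rounded to the nearest integer, so the averages it reports may differ from the figures above if the original measured differently. The names `shift_add` and `error_stats` and the (sign, shift) encoding are made up for illustration.

```python
def shift_add(x, const, terms):
    """Evaluate const + sum(sign * (x >> shift)) for the given (sign, shift) pairs."""
    return const + sum(sign * (x >> shift) for sign, shift in terms)


def error_stats(const, terms, max_x):
    """Return (max abs error, mean abs error) over 0 <= x <= max_x."""
    errors = []
    for x in range(max_x + 1):
        nearest = (x + 1) // 3      # x/3 rounded to the nearest integer
        errors.append(abs(nearest - shift_add(x, const, terms)))
    return max(errors), sum(errors) / len(errors)


# 1 + x/4 + x/16 + x/64 (the 4-op sequence for the 6-bit green channel)
SHORT_G = (1, [(+1, 2), (+1, 4), (+1, 6)])
# x/2 - x/4 + x/8 - x/16 + x/32 - x/64 (the 6-op sequence)
BEST_G = (0, [(+1, 1), (-1, 2), (+1, 3), (-1, 4), (+1, 5), (-1, 6)])

if __name__ == "__main__":
    for name, (const, terms) in (("short", SHORT_G), ("best", BEST_G)):
        print(name, error_stats(const, terms, 189))
```

Under these assumptions the 4-op sequence has a maximum error of 2 over 0 <= x <= 189, and its largest output is 61, which stays inside the 6-bit range.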
I have the following code, and it does exactly what I want it to do, except that it is ridiculously slow. I would not be so bothered, except that when I process the code "manually", i.e., I break it into parts and do them individually, it's near instantaneous.
Here is my code:
Coefficient[Product[Sum[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
{i, 1, PrimePi[q]}], x, q]
Picture added for clarity:
I think it is trying to optimize the sum, but am not sure. Is there a way to stop that?
In addition, since all my coefficients are positive, and I only want the x^qth one, is there a way to get Mathematica to discard all exponents that are larger than that and not do all the multiplication with those?
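To make the "discard all exponents larger than q" idea concrete, here is a minimal sketch in Python rather than Mathematica. It keeps only the coefficients of x^0 to x^q while multiplying in one factor per prime. The function names are hypothetical, and this is an illustration of the idea, not a claim about what Mathematica does internally.

```python
def primes_up_to(n):
    """Return all primes <= n by simple trial division."""
    return [p for p in range(2, n + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]


def prime_partition_count(q):
    """Coefficient of x^q in prod over primes p <= q of (1 + x^p + x^(2p) + ...)."""
    coeffs = [1] + [0] * q          # the polynomial "1", truncated at x^q
    for p in primes_up_to(q):
        # Multiply by 1 + x^p + x^(2p) + ... in place. Terms above x^q
        # are never stored, so no work is spent on them.
        for k in range(p, q + 1):
            coeffs[k] += coeffs[k - p]
    return coeffs[q]


if __name__ == "__main__":
    print([prime_partition_count(q) for q in range(1, 11)])
```

Each factor costs O(q) additions, so the whole product is cheap even for values of q where expanding the full polynomial would produce thousands of terms.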
I may be misunderstanding what you want but, as the coefficient will depend on q, I assume you want it evaluated for a specific q. Since I suspected (like you) that the time is spent optimising the product and sum, I rewrote it. You had something like:
With[{q = 80},
 Coefficient[Product[Sum[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
   {i, 1, PrimePi[q]}], x, q]] // Timing
(*
-> {8.36181, 10003}
*)
which I rewrote with purely structural operations as
With[{q = 80},
 Coefficient[Times @@
   Table[Plus @@ Table[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
    {i, 1, PrimePi[q]}], x, q]] // Timing
(*
-> {8.36357, 10003}
*)
(this just builds up a list of the terms and then multiplies them, so no symbolic analysis is performed).
Just building up the polynomial is instantaneous, but it has a few thousand terms, so what is probably happening is that Coefficient spends a lot of time to make sure it has the right coefficient. Actually you can solve this by Expanding the polynomial. Thus:
With[{q = 80},
 Coefficient[Expand[Product[Sum[x^(j*Prime[i]), {j, 0, Floor[q/Prime[i]]}],
    {i, 1, PrimePi[q]}]], x, q]] // Timing
(*
-> {0.240862, 10003}
*)
and it also works for my method.
So to summarise, just wrap the expression in Expand before you take the coefficient.
I think that the reason that the original code is slow is because Coefficient is made to work even with very large expressions - ones that would not fit into the memory if naively expanded.
Here's the original polynomial:
poly[q_, x_] := Product[Sum[ x^(j*Prime[i]),
{j, 0, Floor[q/Prime[i]]}], {i, 1, PrimePi[q]}]
See how for not too large q, expanding the polynomial takes up a lot more memory and becomes fairly slow:
In[2]:= Through[{LeafCount, ByteCount}[poly[300, x]]] // Timing
Through[{LeafCount, ByteCount}[Expand@poly[300, x]]] // Timing
Out[2]= { 0.01, { 1859, 55864}}
Out[3]= {25.27, {77368, 3175840}}
Now let's define the coefficient in 3 different ways and time them
coeff[q_] := Module[{x}, Coefficient[poly[q, x], x, q]]
exCoeff[q_] := Module[{x}, Coefficient[Expand@poly[q, x], x, q]]
serCoeff[q_] := Module[{x}, SeriesCoefficient[poly[q, x], {x, 0, q}]]
In[7]:= Table[ coeff[q],{q,1,30}]//Timing
Table[ exCoeff[q],{q,1,30}]//Timing
Table[serCoeff[q],{q,1,30}]//Timing
Out[7]= {0.37,{0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}}
Out[8]= {0.12,{0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}}
Out[9]= {0.06,{0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}}
In[10]:= coeff[100]//Timing
exCoeff[100]//Timing
serCoeff[100]//Timing
Out[10]= {56.28,40899}
Out[11]= { 0.84,40899}
Out[12]= { 0.06,40899}
So SeriesCoefficient is definitely the way to go. Unless, of course, you're a bit better at combinatorics than me and you know the following prime partition formulae (OEIS):
In[13]:= CoefficientList[Series[1/Product[1-x^Prime[i],{i,1,30}],{x,0,30}],x]
Out[13]= {1,0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}
In[14]:= f[n_]:=Length@IntegerPartitions[n,All,Prime@Range@PrimePi@n]; Array[f,30]
Out[14]= {0,1,1,1,2,2,3,3,4,5,6,7,9,10,12,14,17,19,23,26,30,35,40,46,52,60,67,77,87,98}