What would be a strategy to generate a semilog sequence with just a few asm instructions?

I need to generate the following number sequences. The actual usage is to find the bin in a histogram with increasingly bigger bins.
The first column is just a log sequence such that the histogram's bins increase size in powers of two.
The next column (2) is a further subdivision of the first column such that each larger bin is subdivided in two sub-bins.
I have a reference discrete solution but I am looking to further reduce the number of cycles, perhaps using floating point tricks.
Index: (1) (2) (3) (4) (5) (6) (7)
1: 1 1 1 1 1 1 1
2: 2 2 2 2 2 2 2
3: 2 3 3 3 3 3 3
4: 3 4 4 4 4 4 4
5: 3 4 5 5 5 5 5
6: 3 5 6 6 6 6 6
7: 3 5 7 7 7 7 7
8: 4 6 8 8 8 8 8
9: 4 6 8 9 9 9 9
10: 4 6 9 10 10 10 10
11: 4 6 9 11 11 11 11
12: 4 7 10 12 12 12 12
13: 4 7 10 13 13 13 13
14: 4 7 11 14 14 14 14
15: 4 7 11 15 15 15 15
16: 5 8 12 16 16 16 16
17: 5 8 12 16 17 17 17
18: 5 8 12 17 18 18 18
19: 5 8 12 17 19 19 19
20: 5 8 13 18 20 20 20
21: 5 8 13 18 21 21 21
22: 5 8 13 19 22 22 22
23: 5 8 13 19 23 23 23
24: 5 9 14 20 24 24 24
25: 5 9 14 20 25 25 25
26: 5 9 14 21 26 26 26
27: 5 9 14 21 27 27 27
28: 5 9 15 22 28 28 28
29: 5 9 15 22 29 29 29
30: 5 9 15 23 30 30 30
31: 5 9 15 23 31 31 31
32: 6 10 16 24 32 32 32
33: 6 10 16 24 32 33 33
34: 6 10 16 24 33 34 34
35: 6 10 16 24 33 35 35
36: 6 10 16 25 34 36 36
37: 6 10 16 25 34 37 37
38: 6 10 16 25 35 38 38
39: 6 10 16 25 35 39 39
40: 6 10 17 26 36 40 40
41: 6 10 17 26 36 41 41
42: 6 10 17 26 37 42 42
43: 6 10 17 26 37 43 43
44: 6 10 17 27 38 44 44
45: 6 10 17 27 38 45 45
46: 6 10 17 27 39 46 46
47: 6 10 17 27 39 47 47
48: 6 11 18 28 40 48 48
49: 6 11 18 28 40 49 49
50: 6 11 18 28 41 50 50
51: 6 11 18 28 41 51 51
52: 6 11 18 29 42 52 52
53: 6 11 18 29 42 53 53
54: 6 11 18 29 43 54 54
55: 6 11 18 29 43 55 55
56: 6 11 19 30 44 56 56
57: 6 11 19 30 44 57 57
58: 6 11 19 30 45 58 58
59: 6 11 19 30 45 59 59
60: 6 11 19 31 46 60 60
61: 6 11 19 31 46 61 61
62: 6 11 19 31 47 62 62
63: 6 11 19 31 47 63 63

The columns show a semi-logarithmic mapping of values. This means that while the major switchover points proceed in logarithmic progression, intermediate points proceed in linear fashion. This is strongly reminiscent of IEEE-754 floating-point encodings, where the exponent field expresses the binary logarithm of the represented quantity, while the significand field provides a linear progression between powers of two. While use of IEEE-754 floating-point formats is extremely widespread, it is not universal, so this approach can only be considered semi-portable.
One idea for an efficient implementation on most modern processors, CPUs and GPUs alike, is therefore to convert the input Index into an IEEE-754 binary32 number represented as float at C or C++ source code level. One then extracts the appropriate bits (consisting of the exponent bits and some leading bits of the significand) from the binary representation of that float number, where the number of included significand bits increases by one with each output column, i.e. the column number is a granularity factor of the mapping.
There are various ways of accomplishing the details of the process outlined above. The most straightforward implementation scales the input so that the smallest input lands in the lowest binade of the binary32 format, using either ldexpf() or multiplication with an appropriate power of two produced via exp2f().
The biggest caveat with this approach is that the lowest binade (biased exponent of 0) contains subnormal numbers, which are not available on platforms operating in FTZ (flush to zero) mode. This may be the case for both x86 SIMD as well as for NVIDIA GPUs, for example. Another caveat is that ldexpf() or exp2f() may not be implemented efficiently, that is, via (almost) direct support in hardware. Lastly, the accuracy of these functions may be insufficient. For example, for CUDA 11.8 I found that exp2f() with a negative integer argument does not always deliver the correct power-of-two result (specifically, exp2f(-127) is off by one ulp), making the variant using exp2f() fail.
Alternate approaches convert Index into a floating-point number without scaling, i.e. starting the mapping in a binade near unity. This raises the issue that for column j > 0, the first 2^j entries incorrectly have logarithmic mapping applied to them. This can be solved by manually enforcing a linear mapping for these entries, so that the result equals Index for the first 2^j entries. The IEEE-754 exponent bias for the logarithmic portion of the computed values can be removed prior to or after bit field extraction, with preference depending on specifics of the instruction set architecture.
The four design variants described above are enumerated in the exemplary code below by the macro VARIANT which can take values 0 through 3. From some initial experiments it seems that when compiling with modern compilers at high optimization level (I tried gcc, clang, and icx) coding at the assembly level may not be necessary. On platforms without IEEE-754 floating-point arithmetic, a quick simulation of integer to floating-point conversion based on a CLZ (count leading zeros) instruction may be helpful.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cstring>
#include <cmath>
#define VARIANT (3)
uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}
// provide semi-logarithmic mapping of i to output based on granularity parameter j
int semilogmap (int i, int j)
{
    const int FP32_MANT_BITS = 23;
    const int FP32_EXPO_BIAS = 127;
#if VARIANT == 0 // this requires subnormal support and will break when using FTZ (flush to zero)!
    return float_as_uint32 ((float)i * exp2f (1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
#elif VARIANT == 1 // this requires subnormal support and will break when using FTZ (flush to zero)!
    return float_as_uint32 (ldexpf ((float)i, 1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
#elif VARIANT == 2
    return (i < (1 << j)) ? i :
        ((float_as_uint32 ((float)i) - ((FP32_EXPO_BIAS - 1 + j) << FP32_MANT_BITS)) >> (FP32_MANT_BITS - j));
#elif VARIANT == 3
    return (i < (1 << j)) ? i :
        ((float_as_uint32 ((float)i) >> (FP32_MANT_BITS - j)) - ((FP32_EXPO_BIAS - 1 + j) << j));
#else
#error unsupported VARIANT
#endif // VARIANT
}
int main (void)
{
    int col [64][7];
    for (int i = 1; i <= 63; i++) {
        for (int j = 0; j < 7; j++) {
            col[i][j] = semilogmap (i, j);
        }
    }
    for (int i = 1; i <= 63; i++) {
        printf ("%2d: ", i);
        for (int j = 0; j < 7; j++) {
            printf (" %2d", col[i][j]);
        }
        printf ("\n");
    }
    return EXIT_SUCCESS;
}
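The CLZ-based simulation mentioned above can be sketched without any floating-point arithmetic. The following is only an illustration under stated assumptions, not part of the original answer: it is written in Python, the names float_bits, semilogmap_float and semilogmap_clz are made up for this sketch, and int.bit_length() stands in for a hardware CLZ instruction. The float-based function transcribes variant 3 so that the integer-only version can be cross-checked against it for the range shown in the table.

```python
import struct

def float_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def semilogmap_float(i: int, j: int) -> int:
    """Transcription of variant 3 (bias removed after bit field extraction)."""
    if i < (1 << j):
        return i
    return (float_bits(float(i)) >> (23 - j)) - ((127 - 1 + j) << j)

def semilogmap_clz(i: int, j: int) -> int:
    """Integer-only mapping; bit_length() plays the role of CLZ (assumption)."""
    if i < (1 << j):
        return i                      # linear part: the first 2**j entries
    e = i.bit_length() - 1            # floor(log2(i)), i.e. 31 - clz(i) for 32-bit i
    frac = (i - (1 << e)) >> (e - j)  # leading j bits below the most significant bit
    return ((e + 1 - j) << j) + frac

# Cross-check both versions over the range covered by the table in the question.
for i in range(1, 64):
    for j in range(7):
        assert semilogmap_clz(i, j) == semilogmap_float(i, j)

print([semilogmap_clz(i, 1) for i in range(1, 9)])  # [1, 2, 3, 4, 4, 5, 5, 6]
```

The printed list corresponds to the first eight rows of column (2). The float-based cross-check is only meaningful while i converts to binary32 exactly, that is, for i below 2**24.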
In terms of the number of instructions generated, it might be instructive to look at a CUDA version of variant 0 for execution on NVIDIA GPUs. I had to implement my own version of exp2f() to achieve the necessary accuracy.
/* compute 2**x with a maximum error of 2.055 ulp */
__forceinline__ __device__ float my_exp2f (float x)
{
    float r, t;
    t = x;
    if (x < -126.0f) t = t + 24.0f;
    asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(t));
    if (x < -126.0f) r = r * 5.9604644775390625e-8f; // 0x1.0p-24
    return r;
}
/* semi-logarithmic mapping of i to output based on granularity parameter j */
__device__ int semilogmap (int i, int j)
{
    const int FP32_MANT_BITS = 23;
    const int FP32_EXPO_BIAS = 127;
    return __float_as_int ((float)i * my_exp2f (1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
}
Compiled with nvcc -c -rdc=true -arch=sm_75 -o semilogmap.obj semilogmap.cu using the toolchain from CUDA 11.8, the following code (comprising 11 instructions, including the function return) is generated for semilogmap():
code for sm_75
Function : _Z10semilogmapii
.headerflags @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
/*0000*/ IADD3 R0, -R5.reuse, -0x7e, RZ ;
/*0010*/ I2F R4, R4 ;
/*0020*/ ISETP.GT.AND P0, PT, R5.reuse, RZ, PT ;
/*0030*/ IADD3 R6, -R5, 0x17, RZ ;
/*0040*/ I2F R3, R0 ;
/*0050*/ @P0 FADD R3, R3, 24 ;
/*0060*/ MUFU.EX2 R7, R3 ;
/*0070*/ @P0 FMUL R7, R7, 5.9604644775390625e-08 ;
/*0080*/ FMUL R7, R4, R7 ;
/*0090*/ SHF.R.S32.HI R4, RZ, R6, R7 ;
/*00a0*/ RET.ABS.NODEC R20 0x0 ;


How to create a partially filled column in pandas

I have a df_trg with, say 10 rows numbered 0-9.
I get from various sources values for an additional column foo which contains only a subset of rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to get a data frame with a single new column foo where some rows may remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
    df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with
updating data) and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation from adding foo column, initially filled with
an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
    df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer than yours and
shorter.
Alternative: Fill foo column initially with NaN, but this time
updating values will be converted to float (side effect of using NaN).
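Since the question asks about something closer to join or merge, here is one more possibility, offered only as a sketch and not as part of the original answer: concatenate the partial Series and let column assignment align them on the index. It assumes the index labels across all sources are unique (no row is supplied twice), and it reuses the df, s1, s2 and sources names from the setup above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]

# Assigning a Series to a column aligns on the index; rows that no
# source covers are left as NaN (so the column becomes float).
df["foo"] = pd.concat(sources)

print(df)
```

Here rows 5 and 8 stay NaN because neither s1 nor s2 provides a value for them.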

Right parameters for strip_unused_nodes

Tensorflow Graph Transforms page https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md shows how to use strip_unused_nodes.
But how to know the right values of X and Y in strip_unused_nodes(type=X, shape="y0,y1,y3,3") for my model?
Output of summarize_graph on my MobileNetV2 model :
Found 1 possible inputs: (name=image_tensor, type=uint8(4), shape=[?,?,?,3])
No variables spotted.
Found 4 possible outputs: (name=detection_boxes, op=Identity) (name=detection_scores, op=Identity) (name=detection_classes, op=Identity) (name=num_detections, op=Identity)
Found 3457096 (3.46M) const parameters, 0 (0) variable parameters, and 623 control_edges
Op types used: 1707 Const, 525 Identity, 277 Mul, 194 Add, 170 Reshape, 147 GatherV2, 133 Sub, 117 Minimum, 98 Slice, 92 Maximum, 77 ConcatV2, 77 Cast, 64 Rsqrt, 60 StridedSlice, 59 Relu6, 55 Conv2D, 54 Pack, 52 Greater, 49 Shape, 46 Split, 46 Where, 45 ExpandDims, 40 Fill, 37 Tile, 33 RealDiv, 33 DepthwiseConv2dNative, 30 Range, 29 Switch, 27 Unpack, 26 Enter, 25 Squeeze, 25 ZerosLike, 23 NonMaxSuppressionV2, 14 Merge, 12 BiasAdd, 12 FusedBatchNorm, 11 TensorArrayV3, 8 NextIteration, 6 TensorArrayWriteV3, 6 TensorArraySizeV3, 6 Sqrt, 6 Exit, 6 TensorArrayGatherV3, 5 TensorArrayScatterV3, 5 TensorArrayReadV3, 3 Rank, 3 Equal, 3 Transpose, 3 Assert, 2 Exp, 2 Less, 2 LoopCond, 1 All, 1 TopKV2, 1 Size, 1 Sigmoid, 1 ResizeBilinear, 1 Placeholder
To use with tensorflow/tools/benchmark:benchmark_model try these arguments:
bazel run tensorflow/tools/benchmark:benchmark_model -- --graph=/home/ubuntu/model-optimization/frozen_inference_graph.pb --show_flops --input_layer=image_tensor --input_layer_type=uint8 --input_layer_shape=-1,-1,-1,3 --output_layer=detection_boxes,detection_scores,detection_classes,num_detections
I believe you should copy the input layer dimensions, which you can find in the .ascii file of your model.

Find Pattern in consecutive numbers

What's the most effective method to detect a pattern in consecutive numbers?
The data could be an SQL column or an R vector.
Some R-like pseudocode to illustrate the "problem":
find Pattern in consecutive integers, where
2nd integer < 1st integer,
3rd integer > 2nd integer &
4th integer > 3rd integer.
a <- x
b <- x +1 < a
c <- x +2 > b
d <- x +3 > c
pattern <- c(a, b, c, d)
example: pattern <- c(10, 8, 9, 11) or pattern <- c(2.11, 2.08, 2.09, 2.11)
count(pattern)
find(pattern)
If you take the difference of the vector then the first should be negative and the other two positive, so,
a <- c(10, 8, 9, 11)
all((diff(a) < 0) == c(TRUE, FALSE, FALSE))
#[1] TRUE
To apply that to a bigger vector, you can use rollapply from zoo package, i.e.
library(zoo)
a <- sample(1:100,100,replace=T)
unique(rollapply(a, 4, by = 1, function(i) i[all((diff(i) < 0) == c(TRUE, FALSE, FALSE))]))
which gives,
[,1] [,2] [,3] [,4]
[1,] 85 18 85 92
[2,] 44 27 67 76
[3,] 58 2 39 54
[4,] 85 69 82 84
[5,] 61 4 40 44
[6,] 65 58 73 97
[7,] 19 9 92 96
[8,] 33 24 57 73
[9,] 79 11 37 100
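For readers who want the same sliding-window idea outside R, a minimal pure-Python sketch follows. It is not from the original answer; the function name find_pattern and the sample data are made up for illustration. Unlike the diff-based R test, which accepts a zero difference in the second and third positions, this version uses the strict inequalities stated in the question.

```python
def find_pattern(values):
    """Return start indices of 4-element windows shaped: down, up, up."""
    hits = []
    for k in range(len(values) - 3):
        a, b, c, d = values[k:k + 4]
        # 2nd < 1st, 3rd > 2nd, 4th > 3rd (strict, as in the question)
        if b < a and c > b and d > c:
            hits.append(k)
    return hits

data = [10, 8, 9, 11, 7, 5, 6, 9, 9]   # illustrative values only
starts = find_pattern(data)
print(starts)                           # [0, 4]
print([data[k:k + 4] for k in starts])  # [[10, 8, 9, 11], [7, 5, 6, 9]]
```

The length of the returned list gives the count of matches, and the indices give their positions.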

Inserting an empty line between every two elements of a column (data frame + pandas)

My data frame looks something like this:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
import pandas as pd
df =pd.read_csv('weekone.txt',)
df.columns=['Games']
I'm trying to put a blank line in between every two elements (teams).
So I want it to look like this:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
But when I'm using this loop
for i in df2.index:
    if (df2.index[i])%2 == 1:
        df2.Games[i]=df2.Games[i]+('\n')
    else:
        df2.Games[i] = df2.Games[i]
I'm getting an output like this:
Games
0 CAR 20
1 DEN 21\n
2 TB 31
3 ATL 24\n
4 SD 27
5 KC 33\n
6 CIN 23
7 NYJ 22\n
What am I doing wrong? Thanks.
you can do it this way:
In [172]: x
Out[172]:
Games
0 CAR 20
1 DEN 21
2 TB 31
3 ATL 24
4 SD 27
5 KC 33
6 CIN 23
7 NYJ 22
In [173]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[''])
rslt = x.loc[:1]
g = x.groupby(x.index//2)
for i in range(1, len(g)):
    rslt = pd.concat([rslt, empty_line, g.get_group(i)])
## -- End pasted text --
In [174]: rslt
Out[174]:
Games
0 CAR 20
1 DEN 21

2 TB 31
3 ATL 24

4 SD 27
5 KC 33

6 CIN 23
7 NYJ 22
the index's dtype is object now:
In [178]: rslt.index.dtype
Out[178]: dtype('O')
or having -1 as an index for empty lines:
In [175]: %paste
empty_line = pd.DataFrame([''], columns=x.columns, index=[-1])
rslt = x.loc[:1]
g = x.groupby(x.index//2)
for i in range(1, len(g)):
    rslt = pd.concat([rslt, empty_line, g.get_group(i)])
## -- End pasted text --
In [176]: rslt
Out[176]:
Games
0 CAR 20
1 DEN 21
-1
2 TB 31
3 ATL 24
-1
4 SD 27
5 KC 33
-1
6 CIN 23
7 NYJ 22
index dtype:
In [181]: rslt.index.dtype
Out[181]: dtype('int64')
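The same result can also be built in a single concatenation by collecting the pieces in a list first. This is a sketch rather than the answerer's code: it assumes the default integer index, uses -1 as the index label for the empty rows (as in the second variant above), and the name pieces is made up here.

```python
import pandas as pd

x = pd.DataFrame({"Games": ["CAR 20", "DEN 21", "TB 31", "ATL 24",
                            "SD 27", "KC 33", "CIN 23", "NYJ 22"]})
empty_line = pd.DataFrame({"Games": [""]}, index=[-1])

pieces = []
for start in range(0, len(x), 2):
    if pieces:
        pieces.append(empty_line)          # separator before every pair but the first
    pieces.append(x.iloc[start:start + 2])  # two consecutive rows

rslt = pd.concat(pieces)                   # one concat instead of one per iteration
print(rslt)
```

Collecting the pieces and concatenating once avoids repeatedly copying the growing frame inside the loop.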

Count preceding rows that match criteria

I am working with time series data and I need to count the number of rows preceding the current row that matched a condition. For example, I need to know how many months prior to the row's month the customer had sales (NETSALES > 0). Ideally I would maintain a row counter that resets when the condition fails (e.g. NETSALES = 0).
Another way of solving the problem would be to flag any row that had more than 12 prior periods of NETSALES.
The closest I came was using the following window function:
COUNT(*)
OVER (PARTITION BY cust ORDER BY dt
ROWS 12 PRECEDING) as CtWindow,
http://sqlfiddle.com/#!6/990eb/2
In the example above, 201310 is correctly flagged as 12 but ideally the previous row would have been 11.
The solution can be in R or T-SQL.
Updated with data.table example:
library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)
The goal is to calculate a "run" column like below -- which gets reset to zero when the value is zero
NETSALES cust dt run
1: 36.956464 1 1 1
2: 83.767621 1 2 2
3: 28.585003 1 3 3
4: 10.250524 1 4 4
5: 6.537188 1 5 5
6: 0.000000 1 6 6
7: 95.489944 1 7 7
8: 46.351387 1 8 8
9: 0.000000 1 9 0
10: 0.000000 1 10 0
11: 99.621881 1 11 1
12: 76.755104 1 12 2
13: 64.288721 1 13 3
14: 0.000000 1 14 0
15: 36.504473 1 15 1
16: 43.157142 1 16 2
17: 71.808349 1 17 3
18: 53.039105 1 18 4
19: 0.000000 1 19 0
20: 27.387369 1 20 1
21: 58.308899 2 1 1
22: 65.929296 2 2 2
23: 20.529473 2 3 3
24: 58.970898 2 4 4
25: 13.785201 2 5 5
26: 4.796752 2 6 6
27: 72.758112 2 7 7
28: 7.088647 2 8 8
29: 14.516362 2 9 9
30: 94.470714 2 10 10
31: 51.254178 2 11 11
32: 99.544261 2 12 12
33: 66.475412 2 13 13
34: 8.362936 2 14 14
35: 96.742115 2 15 15
36: 15.677712 2 16 16
37: 0.000000 2 17 0
38: 95.684652 2 18 1
39: 65.639292 2 19 2
40: 95.721081 2 20 3
NETSALES cust dt run
This seems to do it:
library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)
DT[,dir:=ifelse(NETSALES>0,1,0)]
dir.rle <- rle(DT$dir)
DT <- transform(DT, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
DT[,runl:=cumsum(dir),by=indexer]
Credit to the question "Cumulative sums over run lengths. Can this loop be vectorized?"
Edit by Roland:
Here is the same with better performance and also considering different customers:
#no need for ifelse
DT[,dir:= NETSALES>0]
#use a function to avoid storing the rle, which could be huge
runseq <- function(x) {
x.rle <- rle(x)
rep(1:length(x.rle$lengths), x.rle$lengths)
}
#never use transform with data.table
DT[,indexer := runseq(dir)]
#include cust in by
DT[,runl:=cumsum(dir),by=list(indexer,cust)]
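The resetting counter itself is simple to state as a plain loop, which may help when checking the vectorized versions. The sketch below is in pure Python and is not part of either answer; the function name run_lengths and the sample rows are illustrative assumptions, and the input is assumed to be already sorted by customer and date.

```python
def run_lengths(rows):
    """rows: iterable of (cust, netsales) pairs, sorted by customer and date.

    Returns, per row, the length of the current streak of NETSALES > 0,
    resetting to 0 on a zero row and restarting for each new customer.
    """
    runs = []
    prev_cust, count = None, 0
    for cust, netsales in rows:
        if cust != prev_cust:   # new customer: start counting afresh
            count = 0
            prev_cust = cust
        count = count + 1 if netsales > 0 else 0
        runs.append(count)
    return runs

# Made-up sample values, only to show the reset behaviour.
rows = [(1, 36.9), (1, 83.7), (1, 0.0), (1, 99.6),
        (2, 58.3), (2, 0.0), (2, 95.6), (2, 65.6)]
print(run_lengths(rows))  # [1, 2, 0, 1, 1, 0, 1, 2]
```

Flagging rows with more than 12 prior periods of sales then reduces to comparing the resulting counter against 12.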
Edit: joe added SQL solution
http://sqlfiddle.com/#!6/990eb/22
SQL solution is 48 minutes on a machine with 128gig of ram across 22m rows. R solution is about 20 seconds on a workstation with 4 gig of ram. Go R!